Bank Marketing Data Classification

Binary classification using the Bank Marketing Dataset.

In this tutorial, we'll build a neural network model to predict the success of a bank's direct marketing campaigns (phone calls). The dataset includes 41,188 examples and 20 input variables, and the classification goal is to predict whether the client will subscribe to a term deposit (yes/no).

The reference for the dataset is given below.

S. Moro, P. Cortez, and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. https://archive.ics.uci.edu/ml/datasets/bank+marketing
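
If you don't already have the CSV locally, a minimal sketch like the following can download and extract it. The archive URL and the CSV's path inside the zip are assumptions based on the UCI repository's usual layout, so verify them against the dataset page above.

import io
import shutil
import zipfile
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen

# Assumed archive location on the UCI repository; check the dataset page.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"

archive = zipfile.ZipFile(io.BytesIO(urlopen(URL).read()))
# The CSV is assumed to sit at this path inside the zip.
archive.extract("bank-additional/bank-additional-full.csv")
# Move it to the location expected by load_data() below.
shutil.move("bank-additional/bank-additional-full.csv", "./bank-additional-full.csv")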

Required Libraries

In [1]:
#!/usr/bin/env python
# encoding:utf-8

from __future__ import division, print_function
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

import renom as rm
from renom.utility.trainer import Trainer
from renom.utility.distributor import NdarrayDistributor
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active
# If you would like to use a GPU, set this to True; otherwise, leave it False.
set_cuda_active(False)

Load data

Some inputs are numerical, and the others are categorical. The numerical inputs are min-max normalized to the [0, 1] range, and the categorical inputs are converted into one-hot vector representations.

In [2]:
def load_data(filename):
    df = pd.read_csv(filename, sep=";")
    cols_numerical = list(df.columns[(df.dtypes == "int") | (df.dtypes == "float")])
    cols_categorical = list(df.columns[df.dtypes == "object"].drop("y"))

    # Output labels (success/failure)
    y = pd.get_dummies(df["y"])["yes"].values.astype("float32")
    # Numerical inputs
    X = df[cols_numerical]
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Categorical inputs
    for name in cols_categorical:
        X = pd.concat((X, pd.get_dummies(df[name])), axis=1)
    return X.values.astype("float32"), np.reshape(y, (-1,1))
In [3]:
X, y = load_data("./bank-additional-full.csv")
print("X:{}, y:{}".format(X.shape, y.shape))
X:(41188, 63), y:(41188, 1)
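
The 10 numerical columns plus the one-hot expansions of the 10 categorical columns are what turn the 20 raw inputs into the 63 features printed above. To make the one-hot encoding concrete, here is a toy illustration of pd.get_dummies, using two category values in the spirit of the dataset's contact column:

toy = pd.DataFrame({"contact": ["cellular", "telephone", "cellular"]})
print(pd.get_dummies(toy["contact"]))
# One indicator column per category (0/1, or booleans in newer pandas):
#    cellular  telephone
# 0         1          0
# 1         0          1
# 2         1          0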

Define the neural network using the Sequential model

In [4]:
sequential = rm.Sequential([
    rm.Dense(128),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(64),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(32),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(1)
])
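
Note that the final Dense(1) layer outputs a raw score (logit) rather than a probability: the sigmoid is folded into rm.sigmoid_cross_entropy during training and applied explicitly with rm.sigmoid at prediction time. As a quick shape check, the sketch below assumes the Sequential object can be called directly on a NumPy array, which is how ReNom models are typically applied:

# Hypothetical sanity check: forward a dummy batch through the untrained model.
dummy_batch = np.random.rand(4, 63).astype("float32")
out = sequential(dummy_batch)
print(out.shape)  # expected (4, 1): one raw score per example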

Define Trainer and Distributor

In [5]:
trainer = Trainer(sequential,
                  num_epoch=30,
                  batch_size=128,
                  loss_func=rm.sigmoid_cross_entropy,
                  optimizer=Adam())
trainer.model.set_models(inference=False)
In [6]:
dist = NdarrayDistributor(X, y)
train_dist, test_dist = dist.split(0.8)
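
split(0.8) divides the examples into an 80/20 train/test split. Using the data() accessor (also used in the evaluation section below), we can confirm the sizes; with 41,188 examples, this should give roughly 32,950 training rows and 8,238 test rows, the latter matching the support total in the classification report at the end:

print(train_dist.data()[0].shape)  # roughly (32950, 63)
print(test_dist.data()[0].shape)   # roughly (8238, 63)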

Set some callback functions for training

To display the training progress, we define several callback functions as shown below.

In [7]:
loss = 0
learning_curve = []
test_learning_curve = []

# Called when each epoch starts.
@trainer.events.start_epoch
def start_epoch(trainer):
    global loss, start_t
    loss = 0
    start_t = time.time()

# Called after each weight parameter update.
@trainer.events.updated
def updated(trainer):
    global loss
    loss += trainer.losses[0]

# Called at the end of each epoch.
@trainer.events.end_epoch
def end_epoch(trainer):
    train_loss = loss / (trainer.nth + 1)
    learning_curve.append(train_loss)

    test_loss = 0

    trainer.model.set_models(inference=True)
    for i, (x, y) in enumerate(test_dist.batch(trainer.batch_size)):
        test_result = trainer.test(x)
        test_loss += trainer.loss_func(test_result, y)
    trainer.model.set_models(inference=False)

    test_loss /= i + 1
    test_learning_curve.append(test_loss)

    print("epoch %03d train_loss:%f test_loss:%f time:%f" %
          (trainer.epoch, train_loss, test_loss, time.time() - start_t))

Execute training

In [8]:
trainer.train(train_distributor=train_dist, test_distributor=test_dist)
epoch 000 train_loss:0.345062 test_loss:0.262687 time:3.356673
epoch 001 train_loss:0.292452 test_loss:0.249271 time:2.998212
epoch 002 train_loss:0.266095 test_loss:0.220135 time:2.592718
epoch 003 train_loss:0.239894 test_loss:0.201134 time:2.735315
epoch 004 train_loss:0.224812 test_loss:0.190251 time:2.749983
epoch 005 train_loss:0.212756 test_loss:0.188753 time:2.912541
epoch 006 train_loss:0.206352 test_loss:0.184922 time:3.235472
epoch 007 train_loss:0.202500 test_loss:0.187596 time:3.347289
epoch 008 train_loss:0.202192 test_loss:0.188474 time:3.131775
epoch 009 train_loss:0.198750 test_loss:0.184238 time:3.204706
epoch 010 train_loss:0.194940 test_loss:0.183835 time:3.313729
epoch 011 train_loss:0.193502 test_loss:0.181228 time:3.390766
epoch 012 train_loss:0.194440 test_loss:0.184395 time:3.042407
epoch 013 train_loss:0.190342 test_loss:0.181561 time:2.942652
epoch 014 train_loss:0.190609 test_loss:0.181128 time:3.619759
epoch 015 train_loss:0.189735 test_loss:0.180320 time:4.803418
epoch 016 train_loss:0.190195 test_loss:0.179710 time:2.598008
epoch 017 train_loss:0.188654 test_loss:0.181476 time:2.856803
epoch 018 train_loss:0.187877 test_loss:0.182296 time:3.691624
epoch 019 train_loss:0.188469 test_loss:0.182921 time:3.614568
epoch 020 train_loss:0.186209 test_loss:0.180571 time:4.354409
epoch 021 train_loss:0.186263 test_loss:0.181828 time:4.186099
epoch 022 train_loss:0.183541 test_loss:0.181282 time:4.006507
epoch 023 train_loss:0.184907 test_loss:0.187596 time:4.020967
epoch 024 train_loss:0.184686 test_loss:0.185104 time:3.255115
epoch 025 train_loss:0.183653 test_loss:0.181722 time:2.289766
epoch 026 train_loss:0.180959 test_loss:0.180327 time:1.898204
epoch 027 train_loss:0.181882 test_loss:0.180301 time:3.538731
epoch 028 train_loss:0.180532 test_loss:0.185120 time:3.345265
epoch 029 train_loss:0.179896 test_loss:0.184543 time:4.294404

Prediction and Evaluation

In [9]:
threshold = 0.5
X_test, y_test = test_dist.data()
trainer.model.set_models(inference=True)
pred = rm.sigmoid(trainer.test(X_test)) > threshold

# Display the confusion matrix and classification report.
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, target_names=["failure", "success"]))

# Plot learning curves.
plt.plot(learning_curve, linewidth=3, label="train")
plt.plot(test_learning_curve, linewidth=3, label="test")
plt.ylabel("error")
plt.xlabel("epoch")
plt.grid()
plt.legend()
plt.show()
[[7056  293]
 [ 419  470]]
             precision    recall  f1-score   support

    failure       0.94      0.96      0.95      7349
    success       0.62      0.53      0.57       889

avg / total       0.91      0.91      0.91      8238
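
As a sanity check, the success-class figures follow directly from the confusion matrix: precision = 470 / (470 + 293) ≈ 0.62 and recall = 470 / (470 + 419) ≈ 0.53, matching the report. The same arithmetic in code:

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("success precision: %.2f" % (tp / (tp + fp)))  # 470 / 763 -> 0.62
print("success recall:    %.2f" % (tp / (tp + fn)))  # 470 / 889 -> 0.53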

[Figure: learning curves of the train and test error over the 30 training epochs]