Bank Marketing Data Classification

Binary classification using the Bank Marketing Dataset.

In this tutorial, we’ll build a neural network model to predict the success of a bank’s direct marketing campaigns (phone calls). The dataset includes 41188 examples with 20 input variables, and the classification goal is to predict whether the client will subscribe to a term deposit (yes/no).

The reference for the dataset is below.

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. https://archive.ics.uci.edu/ml/datasets/bank+marketing

Required Libraries

In [1]:
#!/usr/bin/env python
# encoding:utf-8

from __future__ import division, print_function
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

import renom as rm
from renom.utility.trainer import Trainer
from renom.utility.distributor import NdarrayDistributor
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active
# Set this to True to run on the GPU; otherwise leave it as False.
set_cuda_active(False)

Load data

Some of the inputs are numerical and the others are categorical. The numerical inputs are normalized to [0, 1] with min-max scaling, and the categorical inputs are converted into a one-hot vector representation; a toy example below makes this concrete.

In [2]:
def load_data(filename):
    df = pd.read_csv(filename, sep=";")
    cols_numerical = list(df.columns[(df.dtypes=="int")|(df.dtypes=="float")])
    cols_categorical = list(df.columns[df.dtypes=="object"].drop("y"))

    # Output labels (success/failure)
    y = pd.get_dummies(df["y"])["yes"].values.astype("float32")
    # Numerical inputs
    X = df[cols_numerical]
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Categorical inputs
    for name in cols_categorical:
        X = pd.concat((X, pd.get_dummies(df[name])), axis=1)
    return X.values.astype("float32"), np.reshape(y, (-1,1))
In [3]:
X, y = load_data("./bank-additional-full.csv")
print("X:{}, y:{}".format(X.shape, y.shape))
X:(41188, 63), y:(41188, 1)
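
To make the preprocessing concrete, here is a toy sketch, on a made-up two-column frame that is not part of the dataset, of what the min-max scaling and pd.get_dummies calls in load_data do:

import pandas as pd

toy = pd.DataFrame({"age": [20, 40, 60],
                    "job": ["admin.", "technician", "admin."]})
# Min-max scaling maps each numerical column onto [0, 1].
num = toy[["age"]]
print(((num - num.min()) / (num.max() - num.min()))["age"].tolist())  # [0.0, 0.5, 1.0]
# One-hot encoding gives each category its own indicator column.
print(pd.get_dummies(toy["job"]))  # columns: "admin.", "technician"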

Define the neural network using the Sequential model

In [4]:
sequential = rm.Sequential([
    rm.Dense(128),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(64),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(32),
    rm.Relu(),
    rm.Dropout(dropout_ratio=0.5),
    rm.Dense(1)
])
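
Each Dense/Relu/Dropout triple above computes an affine transform, a rectifier, and random unit masking. For intuition only, here is a plain NumPy sketch of one such block's forward pass at training time (made-up shapes and initialization; not ReNom's internal implementation, which may differ, e.g. in its dropout scaling convention):

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(4, 63).astype("float32")             # minibatch of 4 examples
W = (rng.randn(63, 128) * 0.01).astype("float32") # weights
b = np.zeros(128, dtype="float32")                # biases

h = x.dot(W) + b                                  # Dense: affine transform
h = np.maximum(h, 0)                              # Relu: zero out negatives
mask = rng.rand(*h.shape) > 0.5                   # Dropout: keep each unit w.p. 0.5
h = h * mask / 0.5                                # "inverted dropout" rescaling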

Define Trainer and Distributor

In [5]:
trainer = Trainer(sequential,
                  num_epoch=30,
                  batch_size=128,
                  loss_func=rm.sigmoid_cross_entropy,
                  optimizer=Adam())
trainer.model.set_models(inference=False)
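
Note that the last layer is a plain Dense(1) with no activation: rm.sigmoid_cross_entropy takes the raw score and applies the sigmoid itself, which is why the prediction step later wraps trainer.test in rm.sigmoid. For reference, here is a NumPy sketch of the quantity being minimized (illustrative only, ignoring the numerically stable formulation a real implementation would use):

import numpy as np

def sigmoid_cross_entropy(z, y):
    # Binary cross entropy of raw scores z against labels y in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-z))
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(sigmoid_cross_entropy(np.array([2.0, -1.0]), np.array([1.0, 0.0])))  # ~0.22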
In [6]:
dist = NdarrayDistributor(X, y)
train_dist, test_dist = dist.split(0.8)
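
dist.split(0.8) keeps 80% of the rows for training and holds out the remaining 20% for testing. If you are more familiar with scikit-learn, a comparable split would look like the following (shown only for comparison; the distributor objects above are what the Trainer consumes):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)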

Set some callback functions for training

To display the training progress, we define the following callback functions.

In [7]:
loss = 0
learning_curve = []
test_learning_curve = []

# Called when each epoch starts.
@trainer.events.start_epoch
def start_epoch(trainer):
    global loss, start_t
    loss = 0
    start_t = time.time()

# Called after each weight parameter update.
@trainer.events.updated
def updated(trainer):
    global loss
    loss += trainer.losses[0]

# Called at the end of each epoch.
@trainer.events.end_epoch
def end_epoch(trainer):
    train_loss = loss / (trainer.nth + 1)
    learning_curve.append(train_loss)

    test_loss = 0

    trainer.model.set_models(inference=True)
    for i, (x, y) in enumerate(test_dist.batch(trainer.batch_size)):
        test_result = trainer.test(x)
        test_loss += trainer.loss_func(test_result, y)
    trainer.model.set_models(inference=False)

    test_loss /= i + 1
    test_learning_curve.append(test_loss)

    print("epoch %03d train_loss:%f test_loss:%f time:%f" %
          (trainer.epoch, train_loss, test_loss, time.time() - start_t))

Execute training

In [8]:
trainer.train(train_distributor=train_dist, test_distributor=test_dist)
epoch 000 train_loss:0.350524 test_loss:0.279391 time:1.344485
epoch 001 train_loss:0.295967 test_loss:0.256460 time:1.363860
epoch 002 train_loss:0.270867 test_loss:0.232115 time:1.353549
epoch 003 train_loss:0.249600 test_loss:0.215804 time:1.406746
epoch 004 train_loss:0.225739 test_loss:0.198004 time:1.330165
epoch 005 train_loss:0.215673 test_loss:0.194343 time:1.362455
epoch 006 train_loss:0.207059 test_loss:0.191972 time:1.423319
epoch 007 train_loss:0.207181 test_loss:0.187434 time:1.500286
epoch 008 train_loss:0.200185 test_loss:0.188905 time:1.338802
epoch 009 train_loss:0.197584 test_loss:0.189372 time:1.344511
epoch 010 train_loss:0.195181 test_loss:0.188537 time:1.347493
epoch 011 train_loss:0.196306 test_loss:0.190898 time:1.371795
epoch 012 train_loss:0.194158 test_loss:0.184670 time:1.310804
epoch 013 train_loss:0.192673 test_loss:0.187855 time:1.308729
epoch 014 train_loss:0.188961 test_loss:0.185264 time:1.308843
epoch 015 train_loss:0.189353 test_loss:0.187251 time:1.358915
epoch 016 train_loss:0.187933 test_loss:0.188169 time:1.487613
epoch 017 train_loss:0.189058 test_loss:0.185382 time:1.429426
epoch 018 train_loss:0.187757 test_loss:0.184907 time:1.302326
epoch 019 train_loss:0.187455 test_loss:0.186570 time:1.295127
epoch 020 train_loss:0.186361 test_loss:0.184171 time:1.295239
epoch 021 train_loss:0.183288 test_loss:0.185257 time:1.296383
epoch 022 train_loss:0.182532 test_loss:0.185194 time:1.289605
epoch 023 train_loss:0.184322 test_loss:0.187435 time:1.336238
epoch 024 train_loss:0.183143 test_loss:0.188176 time:1.518319
epoch 025 train_loss:0.181871 test_loss:0.188092 time:1.503596
epoch 026 train_loss:0.182405 test_loss:0.188630 time:1.316730
epoch 027 train_loss:0.181000 test_loss:0.187239 time:1.607486
epoch 028 train_loss:0.181954 test_loss:0.186246 time:1.416709
epoch 029 train_loss:0.180484 test_loss:0.186436 time:1.443533

Prediction and Evaluation

In [9]:
threshold = 0.5
X_test = test_dist.data()[0]
y_test = test_dist.data()[1]
trainer.model.set_models(inference=True)
# The network outputs raw scores, so apply the sigmoid to get probabilities.
pred_proba = rm.sigmoid(trainer.test(X_test)).as_ndarray()
pred = pred_proba > threshold

# Display confusion matrix and classification report.
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, target_names=["failure", "success"]))

# Plot learning curves.
plt.plot(learning_curve, linewidth=3, label="train")
plt.plot(test_learning_curve, linewidth=3, label="test")
plt.ylabel("error")
plt.xlabel("epoch")
plt.grid()
plt.legend()
plt.show()
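
Subscriptions are a small minority of this dataset, so overall accuracy can look good even for a model that never predicts "yes"; the per-class precision and recall in the classification report are more informative. As an optional extra (not part of the original notebook), a threshold-free summary such as ROC AUC can be computed from pred_proba with scikit-learn:

from sklearn.metrics import roc_auc_score

# Ranking quality of the predicted probabilities, independent of the 0.5 threshold.
print("ROC AUC: {:.3f}".format(roc_auc_score(y_test.ravel(), pred_proba.ravel())))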