Detecting overfitting problem from learning curve

Detecting the overfitting problem and the separation between training data and test data.

In most cases we want to predict future data from the past and current data we already have.
To simulate this, we divide the data we have into training data and test data, for example 9:1 or 8:2. The training data plays the role of past and current data, and the test data plays the role of the future data to be predicted.
After the model has learned from the training data, we evaluate how well it predicts the test data.
Sometimes the model works well on the training data but poorly on the test data. This phenomenon is called overfitting.
It happens because the test data can contain records unlike any seen in the training data, so our goal is to build a more robust model (one that is not affected by subtle differences).
This time we'll observe the overfitting problem through the learning curve.
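
The idea of holding out "future" data can be illustrated on a toy problem before turning to MNIST. The sketch below is not part of the original notebook: the sine-shaped data, the polynomial degrees and the names `x_past`, `x_future` and `mse_for_degree` are all assumptions chosen for illustration. A more flexible polynomial always fits the "past" points at least as well as a simpler one, but that does not guarantee a better fit on the held-out "future" points.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data (an assumption for illustration): a noisy sine curve.
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + 0.3 * rng.randn(30)

# 8:2 split: the first 24 points play the role of past and current data,
# the remaining 6 play the role of future data.
x_past, y_past = x[:24], y[:24]
x_future, y_future = x[24:], y[24:]


def mse_for_degree(degree):
    """Fit a polynomial on the past data and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_past, y_past, degree)
    train_mse = np.mean((np.polyval(coeffs, x_past) - y_past) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_future) - y_future) ** 2)
    return train_mse, test_mse


results = {degree: mse_for_degree(degree) for degree in (1, 3, 9)}
for degree in (1, 3, 9):
    print("degree %d  train MSE %.4f  test MSE %.4f" % ((degree,) + results[degree]))
```

The train MSE can only go down as the degree grows, because each higher-degree family contains the lower-degree ones. Whether the test MSE follows is exactly what the rest of this notebook examines for a neural network.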

Required Libraries

  • matplotlib 2.0.2
    Used for plotting the learning curve
  • numpy 1.12.1
    Used for efficient handling of matrix data
  • scikit-learn(sklearn) 0.18.2
    Used for fetching the data, simple preprocessing, and the classification report
In [1]:
# Enables true division and the print function on Python 2.x
from __future__ import division, print_function
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, classification_report

import renom as rm
from renom.cuda import cuda
from renom.optimizer import Sgd
cuda.set_cuda_active(False)

Dataset

  • MNIST dataset
The MNIST dataset contains 70000 images of handwritten digits from 0 to 9.
Each image is a 28 x 28 pixel square, and each pixel holds a gray-scale value (0 is black, 255 is white).
We'll build a model that classifies each image into its digit class, 0 to 9, based on its pixel values.
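
Each image is stored as one flattened row of 784 values. A minimal sketch of how such a row maps back to the 28 x 28 grid follows; it uses a random stand-in array (`flat_image` is a hypothetical name) rather than real MNIST data, so that it runs without downloading anything.

```python
import numpy as np

# Stand-in for one row of the MNIST data matrix:
# 784 gray-scale values in the range 0 to 255.
rng = np.random.RandomState(0)
flat_image = rng.randint(0, 256, size=784).astype(np.uint8)

# A flattened row holds 28 * 28 = 784 pixels,
# so reshaping recovers the two-dimensional image.
image = flat_image.reshape(28, 28)

print(image.shape)
print(image.min(), image.max())
```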

Prerequisite for learning

  • Loading the data
  • Preprocessing

Loading the data

The MNIST dataset will be downloaded into the data_home directory.

In [2]:
mnist = fetch_mldata('MNIST original', data_home='.')

X = mnist.data
y = mnist.target

Preprocessing

  • Normalization
    When the explanatory variables have different scales, learning is harder than when they share the same scale.
  • Split the data into training data and test data
    Split the data into data for learning and data for prediction.
    In practice, the data to be predicted is often not seen in the learning data.
  • One-hot vectorization for the label data
    We have to convert the categorical labels into one-hot vectors.
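
What normalization and one-hot vectorization do to the arrays can be sketched in plain NumPy. The helper names `scale_by_max` and `one_hot` are hypothetical and only serve as illustration; the notebook itself relies on `X /= X.max()` and scikit-learn's `LabelBinarizer`.

```python
import numpy as np


def scale_by_max(X):
    """Divide by the largest value so that non-negative data falls in [0, 1]."""
    X = X.astype(np.float32)
    return X / X.max()


def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot row vectors."""
    return np.eye(n_classes, dtype=np.float32)[labels.astype(int)]


pixels = np.array([[0, 128, 255], [64, 32, 16]])
scaled = scale_by_max(pixels)

# Labels are given as floats here because the MNIST targets are floats too.
encoded = one_hot(np.array([0.0, 2.0, 1.0]), n_classes=3)

print(scaled)
print(encoded)
```

Each row of `encoded` contains a single 1 in the column of its class, which is the form the softmax cross entropy loss expects for the labels.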
In [3]:
X = X.astype(np.float32)
X /= X.max()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
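# Fit the binarizer on the training labels only and reuse it for the test labels,
# so that both sets share the same class-to-column mapping.
binarizer = LabelBinarizer().fit(y_train)
labels_train = binarizer.transform(y_train).astype(np.float32)
labels_test = binarizer.transform(y_test).astype(np.float32)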

Neural Network Model

  • Functional model
    There are mainly two ways to define a neural network: the sequential model and the functional model. The sequential model is easy to understand and write, while the functional model is more flexible.
In [4]:
class Model(rm.Model):

    def __init__(self):
        super(Model, self).__init__()
        self._layer1 = rm.Dense(250)
        self._layer2 = rm.Dense(500)
        self._layer3 = rm.Dense(10)

    def forward(self, x):
        t1 = self._layer1(x)
        t2 = self._layer2(t1)
        out = self._layer3(t2)
        return out

Minibatch learning with training data

Learning parameters

  • Batch size
    When the data is very large, it may not fit in memory all at once, or the large matrix computations of the neural network may become slow.
    In neural network learning, we therefore generally train on mini-batches.
    That is, we divide the data into small groups whose size is the batch size.
  • Epoch
    An epoch is one pass of learning over all the training data, and the epoch parameter is the number of such passes.
    Too many epochs may cause overfitting, and too few may cause underfitting, in which case the model needs more iterations.
    Thus we have to search for the proper number of epochs.
  • Optimizer
    The optimizer determines the update rule of learning.
    The update rule and the state kept during learning differ depending on the optimizer.
  • Learning rate
    The learning rate is the step size of each update.
    The loss may fail to decrease if the learning rate is too high, and it takes too long to converge if the learning rate is too small.
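
How the four parameters above interact can be seen in a small NumPy-only sketch. It is an illustration under assumptions, not ReNom's implementation: the model is a noise-free linear regression, the update is plain SGD written by hand, and `minibatch_sgd` and `true_w` are hypothetical names.

```python
import numpy as np


def minibatch_sgd(X, y, batch=32, epoch=20, lr=0.05, seed=0):
    """Fit y = X.dot(w) with mini-batch SGD and return w and the per-epoch loss."""
    rng = np.random.RandomState(seed)
    N, D = X.shape
    w = np.zeros(D)
    history = []
    n_batches = N // batch  # samples beyond the last full batch are skipped
    for _ in range(epoch):
        perm = rng.permutation(N)  # reshuffle the data every epoch
        total = 0.0
        for j in range(n_batches):
            idx = perm[j * batch:(j + 1) * batch]
            residual = X[idx].dot(w) - y[idx]
            total += np.mean(residual ** 2)
            # Gradient of the mean squared error with respect to w.
            grad = 2.0 * X[idx].T.dot(residual) / batch
            # SGD update rule: step against the gradient, scaled by the learning rate.
            w -= lr * grad
        history.append(total / n_batches)
    return w, history


rng = np.random.RandomState(1)
X_toy = rng.randn(1000, 3)
true_w = np.array([1.0, -2.0, 0.5])
y_toy = X_toy.dot(true_w)

w, history = minibatch_sgd(X_toy, y_toy)
print("first epoch loss: %f" % history[0])
print("last epoch loss:  %f" % history[-1])
print("learned weights:", np.round(w, 3))
```

With 1000 samples and a batch size of 32, each epoch runs 1000 // 32 = 31 updates and leaves 8 samples unused, which is the same integer-division behavior as `N // batch` in the training loop of this notebook. Because the permutation changes every epoch, different samples are left out each time.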
In [5]:
batch = 256
epoch = 100
N = len(X_train)
optimizer = Sgd(lr = 0.05)
model = Model()

learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch):
        train_batch = X_train[perm[j * batch:(j + 1) * batch]]
        response_batch = labels_train[perm[j * batch:(j + 1) * batch]]

        # The computational graph is only generated for this block:
        with model.train():
            l = rm.softmax_cross_entropy(model(train_batch), response_batch)

        # Back propagation
        grad = l.grad()

        # Update
        grad.update(optimizer)

        # Changing type to ndarray is recommended.
        loss += l.as_ndarray()

    train_loss = loss / (N // batch)

    # Validation
    test_loss = rm.softmax_cross_entropy(model(X_test), labels_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    if i % 10 == 0:
        print("epoch %03d train_loss:%f test_loss:%f"%(i, train_loss, test_loss))
epoch 000 train_loss:0.483988 test_loss:0.337702
epoch 010 train_loss:0.264421 test_loss:0.288495
epoch 020 train_loss:0.253830 test_loss:0.284980
epoch 030 train_loss:0.248383 test_loss:0.280919
epoch 040 train_loss:0.245048 test_loss:0.283287
epoch 050 train_loss:0.242510 test_loss:0.291389
epoch 060 train_loss:0.240432 test_loss:0.290518
epoch 070 train_loss:0.238449 test_loss:0.293869
epoch 080 train_loss:0.237188 test_loss:0.289975
epoch 090 train_loss:0.236099 test_loss:0.285034
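
The printed values can be used to locate the point where the test loss stops improving. A minimal sketch follows, with the logged numbers copied by hand into arrays (the array names are hypothetical). Only every tenth epoch is printed, so the true minimum may lie between two logged epochs.

```python
import numpy as np

# Values copied from the training log printed by the cell above.
logged_epochs = np.arange(0, 100, 10)
logged_train_loss = np.array([0.483988, 0.264421, 0.253830, 0.248383, 0.245048,
                              0.242510, 0.240432, 0.238449, 0.237188, 0.236099])
logged_test_loss = np.array([0.337702, 0.288495, 0.284980, 0.280919, 0.283287,
                             0.291389, 0.290518, 0.293869, 0.289975, 0.285034])

# The epoch with the lowest logged test loss is a candidate stopping point.
best = int(np.argmin(logged_test_loss))
print("lowest logged test loss at epoch %d" % logged_epochs[best])

# The gap between test loss and train loss is the separation
# that the learning curve makes visible.
gap = logged_test_loss - logged_train_loss
print("test - train gap:", np.round(gap, 3))
```

For this log the lowest printed test loss is at epoch 30, while the train loss keeps decreasing up to the last logged epoch, so the gap between the two widens.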

Evaluation

  • confusion matrix, precision, recall and f1 score
    Neural network results are evaluated differently for classification and for regression.
    Precision, recall and f1 score are often used to evaluate the classification ability of a model.
    Additionally, in classification there are two types of incorrect results, false positives and false negatives, and the confusion matrix shows how the model misclassifies the data.
  • learning curve
    Loss is the gap between the neural network output and the correct label.
    The learning curve visualizes how the loss decreases at each step.
    We can get information about the overfitting problem and the underfitting problem from the learning curve.
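
The loss used in this notebook is the softmax cross entropy. A minimal NumPy sketch of that quantity follows, assuming the loss is averaged over the batch; ReNom's `rm.softmax_cross_entropy` may differ in implementation details, and the function names here are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(logits, one_hot_labels):
    """Mean over the batch of -log(probability assigned to the correct class)."""
    probs = softmax(logits)
    # The small constant avoids log(0) when a probability underflows.
    return -np.mean(np.sum(one_hot_labels * np.log(probs + 1e-12), axis=1))


labels = np.array([[1, 0, 0], [0, 1, 0]], dtype=np.float32)
confident_and_right = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
confident_and_wrong = np.array([[0.0, 10.0, 0.0], [10.0, 0.0, 0.0]])

loss_right = cross_entropy(confident_and_right, labels)
loss_wrong = cross_entropy(confident_and_wrong, labels)
print("loss when right: %f" % loss_right)
print("loss when wrong: %f" % loss_wrong)
```

The loss is close to 0 when the output puts almost all probability on the correct class, and grows large when the model is confidently wrong.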
In [6]:
predictions = np.argmax(model(X_test).as_ndarray(), axis=1)

# Confusion matrix and classification report.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Learning curve.
plt.plot(learning_curve, linewidth=3, label="train")
plt.plot(test_learning_curve, linewidth=3, label="test")
plt.title("Learning curve")
plt.ylabel("error")
plt.xlabel("epoch")
plt.legend()
plt.grid()
plt.show()
[[650   0   0   0   1   2   2   0   1   1]
 [  0 740   3   3   1   4   0   2   5   0]
 [  6   7 664   6   7   4  10   7  14   5]
 [  9   3  13 658   0  21   3   7  15  10]
 [  2   1   4   0 643   0   4   3   1  22]
 [  7   2   4  31  10 544  17   2  14  11]
 [  8   3   9   1   6  10 683   1   1   0]
 [  3   3   8   2  10   1   0 680   0  25]
 [ 11  18   2  18   9  14   7   3 571  11]
 [  6   1   0  11  27   4   1  17   3 606]]
             precision    recall  f1-score   support

        0.0       0.93      0.99      0.96       657
        1.0       0.95      0.98      0.96       758
        2.0       0.94      0.91      0.92       730
        3.0       0.90      0.89      0.90       739
        4.0       0.90      0.95      0.92       680
        5.0       0.90      0.85      0.87       642
        6.0       0.94      0.95      0.94       722
        7.0       0.94      0.93      0.94       732
        8.0       0.91      0.86      0.89       664
        9.0       0.88      0.90      0.89       676

avg / total       0.92      0.92      0.92      7000

[Figure: learning curve of the train loss and the test loss (x-axis: epoch, y-axis: error)]
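
The numbers in the classification report can be recomputed from the confusion matrix, which makes the link between the two outputs explicit. In the sketch below the matrix is copied by hand from the output above; rows are the true classes and columns are the predicted classes, as in scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [650,   0,   0,   0,   1,   2,   2,   0,   1,   1],
    [  0, 740,   3,   3,   1,   4,   0,   2,   5,   0],
    [  6,   7, 664,   6,   7,   4,  10,   7,  14,   5],
    [  9,   3,  13, 658,   0,  21,   3,   7,  15,  10],
    [  2,   1,   4,   0, 643,   0,   4,   3,   1,  22],
    [  7,   2,   4,  31,  10, 544,  17,   2,  14,  11],
    [  8,   3,   9,   1,   6,  10, 683,   1,   1,   0],
    [  3,   3,   8,   2,  10,   1,   0, 680,   0,  25],
    [ 11,  18,   2,  18,   9,  14,   7,   3, 571,  11],
    [  6,   1,   0,  11,  27,   4,   1,  17,   3, 606],
]).astype(np.float64)

true_positive = np.diag(cm)
precision = true_positive / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positive / cm.sum(axis=1)     # correct / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positive.sum() / cm.sum()

for digit in range(10):
    print("%d  precision %.2f  recall %.2f  f1 %.2f"
          % (digit, precision[digit], recall[digit], f1[digit]))
print("accuracy %.4f" % accuracy)
```

For digit 0 this gives precision 650 / 702 and recall 650 / 657, which round to the 0.93 and 0.99 shown in the report.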

Conclusion

In neural network learning, our purpose is to minimize the loss, which measures the difference between the label and the neural network output.
The neural network input is a 28x28=784 dimensional vector, and the output is a 10 dimensional vector activated by the softmax function, because there are 10 kinds of labels (0 to 9) in this case.
The loss value is 0 when the output of the neural network model corresponds to the label completely, that is, when the model predicts 100% correctly.
There are two types of loss value: train loss and test loss.
During the learning step, the neural network model tries to reduce the train loss.
But our purpose is a model that predicts the test data well, and therefore works well for future data.
The learning curve above shows the transition of the train loss and the test loss.
From it, we can confirm that the test loss stops decreasing and rises slightly from around epoch 20 to 30, while the train loss keeps going down.
This separation between train loss and test loss is called overfitting, and more severe overfitting causes a bigger separation between them.
We consider the lowest test loss to be the best score of the defined neural network model. To prevent this phenomenon and boost the prediction ability, it might be helpful to use dropout or batch normalization, or to adjust the number of epochs or the batch size.
  • Dropout
    When dropout is used, the neural network model uses only a part of its units rather than all of them during learning.
    Some units of the layer are dropped (not used), depending on the dropout ratio.
  • Batch normalization
    Batch normalization applies normalization to each batch of input data of the layer in which it is defined.
    Normalization as preprocessing usually rescales all the data to an even scale before it is given to the neural network model, whereas batch normalization rescales the batch-size input of a layer inside the model.
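
Both techniques can be sketched in NumPy to show what they do to a batch of layer inputs. This is a simplified illustration under assumptions, not ReNom's implementation: the dropout uses the common "inverted" scaling of the surviving units, and the batch normalization omits the learnable scale and shift as well as the running statistics used at prediction time. The function names are hypothetical.

```python
import numpy as np


def dropout(x, ratio, rng):
    """Zero each unit with probability `ratio` and rescale the surviving units."""
    keep_mask = rng.uniform(size=x.shape) >= ratio
    return x * keep_mask / (1.0 - ratio)


def batch_normalize(x, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.RandomState(0)
# A batch of 256 samples with 4 features, deliberately off-center and wide.
activations = rng.randn(256, 4) * 5.0 + 3.0

dropped = dropout(activations, ratio=0.5, rng=rng)
normalized = batch_normalize(activations)

print("fraction of dropped units: %.3f" % (dropped == 0).mean())
print("mean per feature after batch normalization:", np.round(normalized.mean(axis=0), 3))
print("std per feature after batch normalization: ", np.round(normalized.std(axis=0), 3))
```

Dropout is applied only while learning. At prediction time all units are used, which is why the surviving units are rescaled here to keep the expected activation unchanged.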

Reference