Detecting overfitting problem from learning curve ¶
Detecting the overfitting problem and the separation between training and test performance.
Normally, we want to predict future data from the past and current data we already have.
So we divide the data we already obtained into training data and test data, at a ratio such as 9:1 or 8:2, treating the training data as the past and current data and the test data as the future data to be predicted.
We then evaluate how well the model, after learning from the training data, predicts the test data.
Sometimes a model works well on the training data but poorly on the test data; this phenomenon is called overfitting.
Since the test data can contain records never seen in the training data, our goal is to construct a more robust model, one not affected by subtle differences.
This time we'll observe the overfitting phenomenon in the learning curve.
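Before turning to neural networks, overfitting can be seen in miniature with a plain polynomial fit. This is a toy sketch using only NumPy, with synthetic data unrelated to MNIST: a degree-9 polynomial has enough capacity to pass through all 10 noisy training points, so the training error collapses while the error on held-out test points stays visibly larger.

```python
import numpy as np

np.random.seed(0)
# Synthetic task: the true function is sin(x), observed with noise.
x_train = np.linspace(0, 3, 10)
y_train = np.sin(x_train) + np.random.normal(0, 0.1, 10)
x_test = np.linspace(0.1, 2.9, 10)
y_test = np.sin(x_test)

# 10 points, 10 polynomial coefficients: the fit can interpolate the
# noise exactly, which is precisely what overfitting means.
coef = np.polyfit(x_train, y_train, 9)
train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
# train_err is near zero; test_err is larger because the wiggles that
# fit the noise do not match the true function between the points.
```

The same gap between training error and test error is what the learning curve below makes visible for the neural network.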
Required Libraries ¶
matplotlib 2.0.2: used for plotting the learning curve graph
numpy 1.12.1: used for efficient handling of matrix data
scikit-learn (sklearn) 0.18.2: used for fetching the data, simple preprocessing, and the classification report
# This module is needed to calculate division with python2.x
from __future__ import division, print_function
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, classification_report
import renom as rm
from renom.cuda import cuda
from renom.optimizer import Sgd

cuda.set_cuda_active(False)
- MNIST dataset
The MNIST dataset contains 70000 digit images from 0 to 9.
Each digit image is a 28 x 28 pixel square, and each pixel holds a grayscale value (0 is black, 255 is white).
We'll build a model that classifies each digit image into its class from 0 to 9 based on the image's pixel values.
Prerequisite for learning ¶
- Loading the data
Loading the data ¶
The MNIST dataset will be downloaded into the data_home directory.
mnist = fetch_mldata('MNIST original', data_home='.')
X = mnist.data
y = mnist.target
- Normalization: when the explanatory variables have different scales, learning is harder than when they share the same scale.
- Split the data into training data and test data: the training data plays the role of learning data and the test data the role of prediction data. In practice, the prediction data may contain records not seen in the learning data.
- One-hot vectorization for the label data: categorical labels have to be converted to one-hot vectors.
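For labels in 0 to 9, the one-hot conversion that LabelBinarizer performs below can be sketched with plain NumPy:

```python
import numpy as np

labels = np.array([3, 0, 9])
# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye(10) with the label array performs the conversion.
onehot = np.eye(10, dtype=np.float32)[labels]
# onehot[0] has a 1.0 at index 3 and zeros elsewhere.
```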
X = X.astype(np.float32)
X /= X.max()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
labels_train = LabelBinarizer().fit_transform(y_train).astype(np.float32)
labels_test = LabelBinarizer().fit_transform(y_test).astype(np.float32)
Neural Network Model ¶
- Functional model
There are mainly two styles of neural network definition: the sequential model and the functional model. The sequential model is easy to understand and write, while the functional model is more flexible.
class Model(rm.Model):
    def __init__(self):
        super(Model, self).__init__()
        self._layer1 = rm.Dense(250)
        self._layer2 = rm.Dense(500)
        self._layer3 = rm.Dense(10)

    def forward(self, x):
        t1 = self._layer1(x)
        t2 = self._layer2(t1)
        out = self._layer3(t2)
        return out
Minibatch learning with training data ¶
- Batch size: when the data is very large, it is sometimes too large to load all at once given the memory size, and a large weight matrix slows down computation. Generally, neural networks learn from mini-batches: the data is divided into small groups whose size is defined by the batch size.
- Epoch: an epoch is one learning pass over all the data. Too many epochs may cause overfitting; too few may cause underfitting and leave the model needing more iterations. Thus we have to search for the proper number of epochs.
- Learning rate: the learning rate is the step size of learning. The loss value does not go down if the learning rate is too high, and convergence takes too long if it is too low.
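The effect of the learning rate can be seen on a toy one-dimensional problem, unrelated to the MNIST model itself: minimizing f(x) = x² by gradient descent. The gradient is 2x, so each step multiplies x by (1 - 2·lr); a small learning rate shrinks x toward the minimum, while a learning rate above 1 makes |x| grow without bound.

```python
def gradient_descent(lr, steps=50, x0=5.0):
    # Minimize f(x) = x^2, whose gradient is 2x.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

small = abs(gradient_descent(0.05))  # near 0: converging to the minimum
large = abs(gradient_descent(1.5))   # huge: each step overshoots and diverges
```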
batch = 256
epoch = 100
N = len(X_train)
optimizer = Sgd(lr=0.05)
model = Model()
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch):
        train_batch = X_train[perm[j * batch:(j + 1) * batch]]
        response_batch = labels_train[perm[j * batch:(j + 1) * batch]]
        # The computational graph is only generated within this block:
        with model.train():
            l = rm.softmax_cross_entropy(model(train_batch), response_batch)
        # Back propagation
        grad = l.grad()
        # Update
        grad.update(optimizer)
        # Changing type to ndarray is recommended.
        loss += l.as_ndarray()
    train_loss = loss / (N // batch)

    # Validation
    test_loss = rm.softmax_cross_entropy(model(X_test), labels_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    if i % 10 == 0:
        print("epoch %03d train_loss:%f test_loss:%f" % (i, train_loss, test_loss))
epoch 000 train_loss:0.483988 test_loss:0.337702
epoch 010 train_loss:0.264421 test_loss:0.288495
epoch 020 train_loss:0.253830 test_loss:0.284980
epoch 030 train_loss:0.248383 test_loss:0.280919
epoch 040 train_loss:0.245048 test_loss:0.283287
epoch 050 train_loss:0.242510 test_loss:0.291389
epoch 060 train_loss:0.240432 test_loss:0.290518
epoch 070 train_loss:0.238449 test_loss:0.293869
epoch 080 train_loss:0.237188 test_loss:0.289975
epoch 090 train_loss:0.236099 test_loss:0.285034
- Confusion matrix, precision, recall and f1 score: there are two types of evaluation for neural network results, one for classification and the other for regression. Precision, recall, and f1 score are often used to evaluate a model's classification ability. In addition, a classification can be wrong in two ways, false positives and false negatives, so the confusion matrix shows us how the model misclassifies the data.
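How these metrics come out of a confusion matrix can be sketched with a tiny hand-made example (three classes, six samples; NumPy only, not scikit-learn's implementation):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

# cm[i, j] counts samples whose true class is i and predicted class is j.
cm = np.zeros((3, 3), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Per-class precision: correct predictions / all predictions of that class.
precision = np.diag(cm) / cm.sum(axis=0)
# Per-class recall: correct predictions / all true samples of that class.
recall = np.diag(cm) / cm.sum(axis=1)
# f1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
```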
- Learning curve: loss is the gap between the neural network output and the correct label. The learning curve visualizes how the loss decreases at each step, and from it we can detect both the overfitting problem and the underfitting problem.
predictions = np.argmax(model(X_test).as_ndarray(), axis=1)

# Confusion matrix and classification report.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Learning curve.
plt.plot(learning_curve, linewidth=3, label="train")
plt.plot(test_learning_curve, linewidth=3, label="test")
plt.title("Learning curve")
plt.ylabel("error")
plt.xlabel("epoch")
plt.legend()
plt.grid()
plt.show()
[[650   0   0   0   1   2   2   0   1   1]
 [  0 740   3   3   1   4   0   2   5   0]
 [  6   7 664   6   7   4  10   7  14   5]
 [  9   3  13 658   0  21   3   7  15  10]
 [  2   1   4   0 643   0   4   3   1  22]
 [  7   2   4  31  10 544  17   2  14  11]
 [  8   3   9   1   6  10 683   1   1   0]
 [  3   3   8   2  10   1   0 680   0  25]
 [ 11  18   2  18   9  14   7   3 571  11]
 [  6   1   0  11  27   4   1  17   3 606]]

             precision    recall  f1-score   support

        0.0       0.93      0.99      0.96       657
        1.0       0.95      0.98      0.96       758
        2.0       0.94      0.91      0.92       730
        3.0       0.90      0.89      0.90       739
        4.0       0.90      0.95      0.92       680
        5.0       0.90      0.85      0.87       642
        6.0       0.94      0.95      0.94       722
        7.0       0.94      0.93      0.94       732
        8.0       0.91      0.86      0.89       664
        9.0       0.88      0.90      0.89       676

avg / total       0.92      0.92      0.92      7000
In neural network learning, our purpose is to minimize the loss, that is, the difference between the label and the neural network output.
The input of the network is a 784-dimensional vector (28 x 28 pixels); the output is a 10-dimensional vector activated by the softmax function, because there are 10 kinds of labels (0 to 9) in this case.
The loss value is 0 when the output of the model matches the label completely, that is, when the model predicts with 100% accuracy.
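The relation between a confident, correct output and a near-zero loss can be checked with a NumPy version of softmax cross entropy. This is a sketch of the formula, not ReNom's implementation:

```python
import numpy as np

def softmax_cross_entropy(logits, onehot):
    # Softmax turns raw scores into probabilities that sum to 1
    # (the max is subtracted first for numerical stability).
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # Cross entropy: negative log of the probability of the true class.
    return -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1))

onehot = np.eye(3)[[0, 2]]                                # true classes: 0 and 2
confident = np.array([[20.0, 0.0, 0.0], [0.0, 0.0, 20.0]])
mistaken = np.array([[0.0, 0.0, 20.0], [20.0, 0.0, 0.0]])
low = softmax_cross_entropy(confident, onehot)   # near 0: almost perfect match
high = softmax_cross_entropy(mistaken, onehot)   # large: wrong class predicted
```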
There are two types of loss value, train loss and test loss.
During learning step, the neural network model tries to reduce the train loss.
But our real purpose is a model with good prediction ability on the test data, one that also works well on future data.
We can see the transition of train loss and test loss in the learning curve above.
It shows that the test loss has been rising slightly from around epoch 20.
This separation between train loss and test loss is called overfitting; the more severe the overfitting, the bigger the separation between them.
The lowest test loss marks the best score of the defined neural network model. To prevent this phenomenon and boost prediction ability, it can help to use dropout or batch normalization, or to adjust the number of epochs or the batch size.
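Picking the epoch with the lowest test loss is a one-liner once both curves are recorded. The loss values here are hypothetical, mirroring the shape of the curve above:

```python
import numpy as np

# Hypothetical test losses per epoch: improving at first, then creeping up.
test_curve = [0.34, 0.30, 0.28, 0.281, 0.284, 0.29]
best_epoch = int(np.argmin(test_curve))  # the epoch worth keeping weights from
```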
- Dropout: when dropout is used in the neural network model, the model uses only a part of its units rather than all of them. Some units of the layer are dropped (not used) depending on the dropout ratio.
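What dropout does to a layer's activations can be sketched in NumPy. This is the common "inverted dropout" formulation; it illustrates the idea and is not necessarily what ReNom's dropout layer does internally:

```python
import numpy as np

np.random.seed(0)
ratio = 0.5                     # fraction of units to drop
activations = np.ones((2, 8))   # a toy batch of layer activations

# Keep each unit with probability (1 - ratio); rescale the survivors
# so the expected activation stays the same as without dropout.
mask = (np.random.rand(*activations.shape) >= ratio) / (1 - ratio)
dropped = activations * mask    # entries are either 0.0 or 2.0 here
```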
- Batch normalization: normalization applied to the batch of input data entering a layer for which batch normalization is defined. Ordinary normalization preprocesses all the data to a common scale before it is given to the neural network model; batch normalization instead normalizes each batch-size chunk of a layer's input during learning.
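The core normalization step can be sketched in NumPy. The learnable scale and shift parameters, and the running statistics used at test time, are omitted here:

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    # Normalize each feature across the batch to zero mean and unit
    # variance; eps guards against division by zero.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Two features on very different scales end up on the same scale.
batch = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
normed = batch_normalize(batch)
```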