Dropout

Dropout with a fully connected neural network model on MNIST.

Dropout is a simple and effective way to reduce overfitting.
In this tutorial, we build a fully connected neural network model for classifying digit images. You will learn the following:
  • How to use dropout

Required libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
In [1]:
from __future__ import division, print_function
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, classification_report

import renom as rm
from renom.optimizer import Sgd

Load data

First, we load the raw MNIST data and shape it into training-ready arrays. To accomplish this, we'll use the fetch_mldata function included in the scikit-learn package.

The MNIST dataset consists of 70,000 digit images. Before we do anything else, we have to split the data into a training set and a test set. We'll then do two important pre-processing steps that make for a smoother training process: 1) rescale the image data (originally integer values 0-255) to the range 0 to 1, and 2) "binarize" the labels, mapping each digit (0-9) to a one-hot vector of 0s and 1s.
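As a quick illustration of the binarization step (a minimal sketch that only reuses the LabelBinarizer imported above, fit here on the digits 0-9):

# Fit on all ten digit classes, then binarize a single example label.
lb = LabelBinarizer().fit(list(range(10)))
print(lb.transform([3]))
# [[0 0 0 1 0 0 0 0 0 0]]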

In [2]:
# Datapath must point to the directory containing the mldata folder.
data_path = "."
mnist = fetch_mldata('MNIST original', data_home=data_path)

X = mnist.data
y = mnist.target

# Rescale the image data to 0 ~ 1.
X = X.astype(np.float32)
X /= X.max()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
labels_train = LabelBinarizer().fit_transform(y_train).astype(np.float32)
labels_test = LabelBinarizer().fit_transform(y_test).astype(np.float32)

# Training data size.
N = len(X_train)

Define the neural network and dropout

Dropout helps to avoid overfitting.
The dropout ratio is the fraction of units that is dropped from a layer at each training step.
Because a different subset of units is dropped every time, training behaves like an ensemble of many thinned networks, which gives an ensemble-like regularization effect.
We therefore recommend dropout when the model is prone to overfitting, for example when the training data is small relative to the model size.
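Conceptually, training-time dropout multiplies each unit's activation by an independent Bernoulli mask. The following is a minimal NumPy sketch of this idea (the common "inverted dropout" formulation, which rescales the surviving units); it only illustrates the mechanism and is not ReNom's internal implementation.

def dropout_train(x, ratio=0.5):
    # Keep each unit with probability (1 - ratio), then rescale so the
    # expected activation matches a forward pass without dropout.
    mask = np.random.binomial(1, 1.0 - ratio, size=x.shape)
    return x * mask / (1.0 - ratio)

h = np.ones((1, 8), dtype=np.float32)
print(dropout_train(h))  # roughly half the units are zeroed on each call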
In [3]:
class Mnist(rm.Model):

    def __init__(self):
        super(Mnist, self).__init__()
        self._layer1 = rm.Dense(100)
        self._layer2 = rm.Dense(10)
        self._dropout1 = rm.Dropout(dropout_ratio=0.5)

    def forward(self, x):
        t1 = self._dropout1(self._layer1(x))
        out = self._layer2(t1)
        return out

Instantiation

In [4]:
network = Mnist()

Training loop

Now that the network is built, we can start to do the actual training. Rather than using vanilla “batch” gradient descent, which is computationally expensive, we’ll use mini-batch stochastic gradient descent (SGD). This method trains on a handful of examples per iteration (the “batch-size”), allowing us to make “stochastic” updates to the weights in a short time. The learning curve will appear noisier, but this method tends to converge much faster.
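For reference, each mini-batch update follows the plain SGD rule w ← w − lr · ∂L/∂w. The sketch below uses placeholder names (w, grad_w) purely for illustration; in the loop below, grad.update(optimizer) applies the equivalent step to all of the network's parameters.

# One SGD step on a single (illustrative) weight matrix.
lr = 0.01
w = np.random.randn(784, 100).astype(np.float32)       # example parameter
grad_w = np.random.randn(*w.shape).astype(np.float32)  # example gradient of the loss w.r.t. w
w -= lr * grad_w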

In [5]:
# Hyper parameters
batch = 64
epoch = 50

optimizer = Sgd(lr = 0.01)

learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch):
        train_batch = X_train[perm[j * batch:(j + 1) * batch]]
        response_batch = labels_train[perm[j * batch:(j + 1) * batch]]

        # The computational graph is only generated for this block:
        with network.train():
            l = rm.softmax_cross_entropy(network(train_batch), response_batch)

        # Back propagation
        grad = l.grad()

        # Update
        grad.update(optimizer)

        # Changing type to ndarray is recommended.
        loss += l.as_ndarray()

    train_loss = loss / (N // batch)

    # Validation
    test_loss = rm.softmax_cross_entropy(network(X_test), labels_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    if i % 2 == 0:
        print("epoch %03d train_loss:%f test_loss:%f"%(i, train_loss, test_loss))
epoch 000 train_loss:0.708555 test_loss:0.491806
epoch 002 train_loss:0.400918 test_loss:0.404904
epoch 004 train_loss:0.367667 test_loss:0.379439
epoch 006 train_loss:0.352874 test_loss:0.367108
epoch 008 train_loss:0.340097 test_loss:0.359081
epoch 010 train_loss:0.334252 test_loss:0.351308
epoch 012 train_loss:0.327346 test_loss:0.342508
epoch 014 train_loss:0.322593 test_loss:0.340577
epoch 016 train_loss:0.321203 test_loss:0.344014
epoch 018 train_loss:0.316652 test_loss:0.336845
epoch 020 train_loss:0.311172 test_loss:0.336718
epoch 022 train_loss:0.311073 test_loss:0.336991
epoch 024 train_loss:0.307484 test_loss:0.336896
epoch 026 train_loss:0.305593 test_loss:0.331858
epoch 028 train_loss:0.303051 test_loss:0.326330
epoch 030 train_loss:0.301192 test_loss:0.329767
epoch 032 train_loss:0.299760 test_loss:0.325156
epoch 034 train_loss:0.298049 test_loss:0.330092
epoch 036 train_loss:0.297320 test_loss:0.325677
epoch 038 train_loss:0.295978 test_loss:0.319460
epoch 040 train_loss:0.293667 test_loss:0.326050
epoch 042 train_loss:0.294690 test_loss:0.320485
epoch 044 train_loss:0.292206 test_loss:0.326908
epoch 046 train_loss:0.292942 test_loss:0.329125
epoch 048 train_loss:0.291288 test_loss:0.323323

Inference Mode

Some ReNom functions behave differently during training and inference.
For example, dropout drops units stochastically during training.
During inference, however, dropout should not drop any units: the regularization benefit comes from having trained the weights with stochastic dropping, and at prediction time we want the full, deterministic network.
Until now the model has been in training mode, so we switch it to inference mode with the command below.
In [6]:
network.set_models(inference=True)
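A quick way to confirm the switch: in inference mode the dropout mask is no longer resampled, so repeated forward passes on the same inputs give identical outputs (this check only reuses calls already shown above).

out_a = network(X_test[:5]).as_ndarray()
out_b = network(X_test[:5]).as_ndarray()
print(np.allclose(out_a, out_b))  # True: dropout is disabled in inference mode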

Model evaluation

After training our model, we have to evaluate it. For each class (digit), we'll compute several scoring metrics (precision, recall, F1 score, and support) to get a full sense of how the model performs on our test data.
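For reference, the standard per-class definitions are: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall), where TP, FP, and FN count the true positives, false positives, and false negatives for that class; support is simply the number of test examples belonging to the class.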

In [7]:
predictions = np.argmax(network(X_test).as_ndarray(), axis=1)

# Confusion matrix and classification report.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Learning curve.
plt.plot(learning_curve, linewidth=3, label="train")
plt.plot(test_learning_curve, linewidth=3, label="test")
plt.title("Learning curve")
plt.ylabel("error")
plt.xlabel("epoch")
plt.legend()
plt.grid()
plt.show()
[[662   0   2   2   0   7   2   2   1   0]
 [  0 749   4   1   2   4   1   1  11   1]
 [  7   7 602  10  10   4   8  12  19   2]
 [  1   3  13 644   0  34   2   5  19   2]
 [  0   3   1   1 667   2   8   1   9  19]
 [  3   3   4  16  13 570  11   2  15   5]
 [  3   3   5   0   8   9 664   0   4   0]
 [  2   6  10   3   9   0   0 690   4  25]
 [  3  15   6  15   2  20   4   5 573  14]
 [  6   4   1  10  27   6   1  23   8 603]]
             precision    recall  f1-score   support

        0.0       0.96      0.98      0.97       678
        1.0       0.94      0.97      0.96       774
        2.0       0.93      0.88      0.91       681
        3.0       0.92      0.89      0.90       723
        4.0       0.90      0.94      0.92       711
        5.0       0.87      0.89      0.88       642
        6.0       0.95      0.95      0.95       696
        7.0       0.93      0.92      0.93       749
        8.0       0.86      0.87      0.87       657
        9.0       0.90      0.88      0.89       689

avg / total       0.92      0.92      0.92      7000

[Learning curve plot: training and test error per epoch]