Normalization and Standardization

Learning can become slow in some cases, and we can prevent this by using normalization and standardization.

Stochastic Gradient Descent (SGD) is typically used to minimize the loss function, but when we do not use normalization or standardization, the gradient direction can be inappropriate.
In such a case, the weights are not updated in the direction we would like.
Normalization and standardization are methods that prevent such inappropriate updates and keep the updates pointing in a direction that actually reduces the loss.
In practice, features often have different scales; for example, one might be measured in kilograms [kg] and another in milligrams [mg].
Below, we look at a case where learning becomes slow for this reason and at the effect of normalization and standardization.

In this tutorial, we apply standardization to the input data and see how much faster the loss decreases when it is used.
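As a toy illustration of the scale problem (independent of the dataset used below), consider two features measured in kilograms and milligrams: after standardization both have unit variance, so neither dominates the gradient simply because of its unit. The variable names here are only for illustration.

import numpy as np

# Toy illustration (not part of the dataset used below):
# one feature in kilograms, one in milligrams.
weight_kg = np.array([60.0, 72.5, 55.0, 90.0])                  # scale ~1e1
dose_mg = np.array([250000.0, 310000.0, 180000.0, 420000.0])    # scale ~1e5

X_toy = np.stack([weight_kg, dose_mg], axis=1)
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print("raw std per feature:         ", X_toy.std(axis=0))
print("standardized std per feature:", X_toy_std.std(axis=0))   # both become 1.0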

In [1]:
#!/usr/bin/env python
# encoding:utf-8

from __future__ import division, print_function
import numpy as np
import pandas as pd

import renom as rm
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active
from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# If you would like to use GPU computing, set this to True. Otherwise, False.
set_cuda_active(False)
In [2]:
df = pd.read_csv("magic04.data",header=None)
X = df.drop(10,axis=1).values.astype(float)
y = df.iloc[:,10].replace("g",1).replace("h",0).values.astype(float).reshape(-1,1)
print("X:{} y:{}".format(X.shape, y.shape))
X:(19020, 10) y:(19020, 1)
In [3]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
X_train:(15216, 10) y_train:(15216, 1) X_test:(3804, 10) y_test:(3804, 1)

No standardization of the input data

In [4]:
sequential = rm.Sequential([
    rm.Dense(64),
    rm.Relu(),
    rm.Dense(32),
    rm.Relu(),
    rm.Dense(1)
])
In [5]:
batch_size = 128
epoch = 15
N = len(X_train)
optimizer = Sgd(lr=0.01)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        with sequential.train():
            z = sequential(train_batch)
            l = rm.sigmoid_cross_entropy(z, response_batch)
        grad = l.grad()
        grad.update(optimizer)
        loss += l.as_ndarray()
    train_loss = loss / (N // batch_size)
    z_test = sequential(X_test)
    test_loss = rm.sigmoid_cross_entropy(z_test, y_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:0.7173, test_loss:0.5320
/home/d_onodera/renom2.0.0/ReNom/renom/layers/loss/sigmoid_cross_entropy.py:18: RuntimeWarning: overflow encountered in exp
  z = 1. / (1. + np.exp(to_value(-lhs)))
epoch:001, train_loss:0.5097, test_loss:0.4971
epoch:002, train_loss:0.4922, test_loss:0.4840
epoch:003, train_loss:0.4834, test_loss:0.4841
epoch:004, train_loss:0.4764, test_loss:0.4707
epoch:005, train_loss:0.4685, test_loss:0.4698
epoch:006, train_loss:0.4618, test_loss:0.4714
epoch:007, train_loss:0.4563, test_loss:0.4501
epoch:008, train_loss:0.4519, test_loss:0.4552
epoch:009, train_loss:0.4484, test_loss:0.4448
epoch:010, train_loss:0.4429, test_loss:0.4444
epoch:011, train_loss:0.4402, test_loss:0.4585
epoch:012, train_loss:0.4400, test_loss:0.4476
epoch:013, train_loss:0.4353, test_loss:0.4285
epoch:014, train_loss:0.4332, test_loss:0.4345

Standardization of the input data

We shift the input data to zero mean and scale it to unit variance, as shown below.

\begin{equation} X_{\mathrm{new}} = \frac{X_{\mathrm{old}} - X_{\mathrm{mean}}}{X_{\mathrm{std}}} \end{equation}
In [6]:
X_train_mean = np.mean(X_train,axis=0)
X_train_std = np.std(X_train, axis=0)
X_train = (X_train - X_train_mean) / X_train_std

X_test = (X_test - X_train_mean) / X_train_std
print(X_train)
[[-0.79930181 -0.50039069 -0.65253323 ...,  0.26139757  1.95726617
  -1.68962135]
 [ 2.29112205  2.73758014  1.06699689 ...,  2.34536206  1.93503937
   1.12829706]
 [ 0.53831044  0.57492273  1.16703611 ...,  0.7128446  -0.32139495
  -0.63908639]
 ...,
 [-0.58745154 -0.03303521 -0.12096889 ...,  0.41261994 -0.29773626
   0.69609949]
 [-0.81462223 -0.63824131 -1.31190193 ...,  0.19084111 -0.04924745
   0.16164643]
 [ 1.53152094 -0.06017667 -0.46178048 ...,  0.44287204  1.9893673
   0.6404793 ]]
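Equivalently, scikit-learn's StandardScaler encapsulates the same mean/std transform. The sketch below assumes it is applied to the raw train/test split from above, in place of the manual cell (the names X_train_scaled and X_test_scaled are illustrative):

from sklearn.preprocessing import StandardScaler

# Sketch only: this assumes the raw (unstandardized) X_train / X_test from
# the train_test_split cell, not the arrays already overwritten above.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit statistics on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics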
In [7]:
sequential = rm.Sequential([
    rm.Dense(128),
    rm.Relu(),
    rm.Dense(64),
    rm.Relu(),
    rm.Dense(1)
])
In [8]:
batch_size = 128
epoch = 15
N = len(X_train)
optimizer = Sgd(lr=0.01)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        with sequential.train():
            z = sequential(train_batch)
            l = rm.sigmoid_cross_entropy(z, response_batch)
        grad = l.grad()
        grad.update(optimizer)
        loss += l.as_ndarray()
    train_loss = loss / (N // batch_size)
    z_test = sequential(X_test)
    test_loss = rm.sigmoid_cross_entropy(z_test, y_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:0.5887, test_loss:0.5197
epoch:001, train_loss:0.4879, test_loss:0.4550
epoch:002, train_loss:0.4434, test_loss:0.4235
epoch:003, train_loss:0.4203, test_loss:0.4054
epoch:004, train_loss:0.4068, test_loss:0.3931
epoch:005, train_loss:0.3967, test_loss:0.3845
epoch:006, train_loss:0.3883, test_loss:0.3784
epoch:007, train_loss:0.3820, test_loss:0.3715
epoch:008, train_loss:0.3773, test_loss:0.3670
epoch:009, train_loss:0.3725, test_loss:0.3629
epoch:010, train_loss:0.3690, test_loss:0.3598
epoch:011, train_loss:0.3660, test_loss:0.3562
epoch:012, train_loss:0.3620, test_loss:0.3537
epoch:013, train_loss:0.3584, test_loss:0.3506
epoch:014, train_loss:0.3564, test_loss:0.3474
Comparing the two cases, the loss decreases somewhat faster when we use standardization.
This is likely because standardization changes the shape of the loss surface from an elongated one to something closer to circular, so learning is less likely to slow down. The overflow warning that appeared in the first run also suggests that the unstandardized inputs produced very large pre-activations.
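To see the difference at a glance, the learning curves of the two runs can be plotted with the matplotlib import from the first cell. This is only a sketch: it assumes the per-epoch losses of each run were kept in separate lists, here called learning_curve_raw and learning_curve_std, rather than overwriting learning_curve between runs.

# Hypothetical names: learning_curve_raw / learning_curve_std are assumed to
# hold the per-epoch training losses of the two runs, saved separately.
plt.plot(learning_curve_raw, label="train loss (raw inputs)")
plt.plot(learning_curve_std, label="train loss (standardized inputs)")
plt.xlabel("epoch")
plt.ylabel("sigmoid cross entropy")
plt.legend()
plt.grid(True)
plt.show()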