Normalization and Standardization

Cases where training slows down, and how to prevent them with normalization and standardization

Stochastic gradient descent updates the weights in the direction that minimizes the loss function. Without normalization or standardization, however, the weights can be updated in directions that deviate substantially from that minimizing direction, and training can take a long time.
By normalizing or standardizing the inputs, the error surface shown in the figure becomes closer to circular, so gradient descent can be expected to update the weights in a direction that properly minimizes the loss.
In practice, the features of a dataset often have very different scales; for example, some values might be recorded in kilograms and others in milligrams. In such cases the scale mismatch causes the problem described above, and training slows down.
The figure below illustrates why training slows down in such cases and what effect normalization and standardization have.
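To make this concrete, here is a minimal sketch (a toy example, not the dataset used below) of how a large scale difference between features distorts the gradient and forces a single learning rate to be a bad fit for some weights:

import numpy as np

# Toy illustration: the same quantity stored once in kilograms and once in
# milligrams, so the two columns differ in scale by a factor of 1e6.
rng = np.random.RandomState(0)
x_kg = rng.uniform(50, 100, size=(1000, 1))
X = np.hstack([x_kg, x_kg * 1e6])      # column 0 in kg, column 1 in mg
y = 0.5 * x_kg                          # arbitrary target
w = np.zeros((2, 1))

# Gradient of the mean squared error of a linear model X.dot(w):
# the two components differ by roughly a factor of 1e6, so no single
# learning rate suits both weights and SGD zig-zags along the narrow valley.
grad = 2 * X.T.dot(X.dot(w) - y) / len(X)
print(grad.ravel())

# After standardization, both gradient components have comparable magnitude.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
grad_s = 2 * Xs.T.dot(Xs.dot(w) - y) / len(X)
print(grad_s.ravel())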

Here we will use standardization and see how much faster the loss decreases.

Required Libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
  • pandas 0.20.3
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd

import renom as rm
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active
from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# If you would like to use GPU computing, set this to True. Otherwise, False.
set_cuda_active(False)

Load Data

The reference for the dataset is given below.

MAGIC Gamma Telescope Data Set: R. K. Bock, Major Atmospheric Gamma Imaging Cherenkov Telescope project (MAGIC); P. Savicky, Institute of Computer Science, AS of CR, Czech Republic.
In [2]:
df = pd.read_csv("magic04.data",header=None)
X = df.drop(10,axis=1).values.astype(float)
y = df.iloc[:,10].replace("g",1).replace("h",0).values.astype(float).reshape(-1,1)
print("X:{} y:{}".format(X.shape, y.shape))
X:(19020, 10) y:(19020, 1)
In [3]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
X_train:(15216, 10) y_train:(15216, 1) X_test:(3804, 10) y_test:(3804, 1)

Without standardization

In [4]:
sequential = rm.Sequential([
    rm.Dense(64),
    rm.Relu(),
    rm.Dense(32),
    rm.Relu(),
    rm.Dense(1)
])
In [5]:
batch_size = 128
epoch = 15
N = len(X_train)
optimizer = Sgd(lr=0.01)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        # Draw a shuffled mini-batch of inputs and targets
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        # Forward pass; gradients are recorded inside the train() context
        with sequential.train():
            z = sequential(train_batch)
            l = rm.sigmoid_cross_entropy(z, response_batch)
        # Backward pass and SGD weight update
        grad = l.grad()
        grad.update(optimizer)
        loss += l.as_ndarray()
    # Average training loss over the epoch, then evaluate on the test set
    train_loss = loss / (N // batch_size)
    z_test = sequential(X_test)
    test_loss = rm.sigmoid_cross_entropy(z_test, y_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:0.7173, test_loss:0.5320
/home/d_onodera/renom2.0.0/ReNom/renom/layers/loss/sigmoid_cross_entropy.py:18: RuntimeWarning: overflow encountered in exp
  z = 1. / (1. + np.exp(to_value(-lhs)))
epoch:001, train_loss:0.5097, test_loss:0.4971
epoch:002, train_loss:0.4922, test_loss:0.4840
epoch:003, train_loss:0.4834, test_loss:0.4841
epoch:004, train_loss:0.4764, test_loss:0.4707
epoch:005, train_loss:0.4685, test_loss:0.4698
epoch:006, train_loss:0.4618, test_loss:0.4714
epoch:007, train_loss:0.4563, test_loss:0.4501
epoch:008, train_loss:0.4519, test_loss:0.4552
epoch:009, train_loss:0.4484, test_loss:0.4448
epoch:010, train_loss:0.4429, test_loss:0.4444
epoch:011, train_loss:0.4402, test_loss:0.4585
epoch:012, train_loss:0.4400, test_loss:0.4476
epoch:013, train_loss:0.4353, test_loss:0.4285
epoch:014, train_loss:0.4332, test_loss:0.4345
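As an aside, the RuntimeWarning printed during the first epoch comes from the naive sigmoid shown in the warning message: with raw, unscaled features the network's output can take large negative values, and np.exp then overflows. A minimal sketch that reproduces the same warning:

import numpy as np

# A large negative logit, which can occur when unscaled features feed the network,
# makes exp(-logit) overflow in the naive sigmoid 1 / (1 + exp(-x)).
logit = np.array([-1000.0], dtype=np.float32)
print(1.0 / (1.0 + np.exp(-logit)))   # RuntimeWarning: overflow encountered in exp; result is 0.0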

With standardization

By applying the following transformation, the input data is shifted and scaled so that each feature has mean 0 and variance 1.

\begin{equation} X_{\mathrm{new}} = \frac{X_{\mathrm{old}} - X_{\mathrm{mean}}}{X_{\mathrm{std}}} \end{equation}
In [6]:
# Compute the mean and standard deviation on the training set only
X_train_mean = np.mean(X_train, axis=0)
X_train_std = np.std(X_train, axis=0)
X_train = (X_train - X_train_mean) / X_train_std

# Apply the same training-set statistics to the test set
X_test = (X_test - X_train_mean) / X_train_std
print(X_train)
[[-0.79930181 -0.50039069 -0.65253323 ...,  0.26139757  1.95726617
  -1.68962135]
 [ 2.29112205  2.73758014  1.06699689 ...,  2.34536206  1.93503937
   1.12829706]
 [ 0.53831044  0.57492273  1.16703611 ...,  0.7128446  -0.32139495
  -0.63908639]
 ...,
 [-0.58745154 -0.03303521 -0.12096889 ...,  0.41261994 -0.29773626
   0.69609949]
 [-0.81462223 -0.63824131 -1.31190193 ...,  0.19084111 -0.04924745
   0.16164643]
 [ 1.53152094 -0.06017667 -0.46178048 ...,  0.44287204  1.9893673
   0.6404793 ]]
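For reference, scikit-learn's StandardScaler (scikit-learn is already among the required libraries) performs the same shift-and-scale. The sketch below is an equivalent alternative to the manual computation in cell [6] and is not used in the rest of this tutorial; the variable names are illustrative.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse its statistics for the
# test data, mirroring the manual standardization above.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
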
In [7]:
sequential = rm.Sequential([
    rm.Dense(128),
    rm.Relu(),
    rm.Dense(64),
    rm.Relu(),
    rm.Dense(1)
])
In [8]:
batch_size = 128
epoch = 15
N = len(X_train)
optimizer = Sgd(lr=0.01)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        with sequential.train():
            z = sequential(train_batch)
            l = rm.sigmoid_cross_entropy(z, response_batch)
        grad = l.grad()
        grad.update(optimizer)
        loss += l.as_ndarray()
    train_loss = loss / (N // batch_size)
    z_test = sequential(X_test)
    test_loss = rm.sigmoid_cross_entropy(z_test, y_test).as_ndarray()
    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)
    print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:0.5887, test_loss:0.5197
epoch:001, train_loss:0.4879, test_loss:0.4550
epoch:002, train_loss:0.4434, test_loss:0.4235
epoch:003, train_loss:0.4203, test_loss:0.4054
epoch:004, train_loss:0.4068, test_loss:0.3931
epoch:005, train_loss:0.3967, test_loss:0.3845
epoch:006, train_loss:0.3883, test_loss:0.3784
epoch:007, train_loss:0.3820, test_loss:0.3715
epoch:008, train_loss:0.3773, test_loss:0.3670
epoch:009, train_loss:0.3725, test_loss:0.3629
epoch:010, train_loss:0.3690, test_loss:0.3598
epoch:011, train_loss:0.3660, test_loss:0.3562
epoch:012, train_loss:0.3620, test_loss:0.3537
epoch:013, train_loss:0.3584, test_loss:0.3506
epoch:014, train_loss:0.3564, test_loss:0.3474
Comparing the two runs, the loss decreases somewhat faster in the standardized case.
As illustrated in the figure at the beginning, standardization reshapes the originally elliptical error surface of the loss function into a more circular one, which lets training proceed more smoothly.
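To visualize the comparison, the per-epoch losses collected in learning_curve and test_learning_curve can be plotted with matplotlib. Because both runs reuse the same list names, the sketch below assumes the lists from the unstandardized run were copied to the hypothetical names learning_curve_raw and test_learning_curve_raw before the second run.

import matplotlib.pyplot as plt

# learning_curve_raw / learning_curve are assumed to hold the per-epoch training
# losses of the unstandardized and standardized runs, respectively.
plt.plot(learning_curve_raw, linewidth=2, label="train loss (raw features)")
plt.plot(learning_curve, linewidth=2, label="train loss (standardized features)")
plt.xlabel("epoch")
plt.ylabel("sigmoid cross entropy loss")
plt.title("Effect of standardization on training")
plt.legend()
plt.grid(True)
plt.show()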