Weight Initialization

Weight initialization and randomness in deep learning

Deep Learning is a random method, we can get the different result on every training because of randomness of initial weight state.
There are some kinds of randomized method for weight initialization, we'll introduce initialization method which is prepared in ReNom. We'll perform four random method, Uniform initialization, Gaussian initialization, Glorot Uniform initialization, Glorot Normal initialization in time time.
  • Uniform initialization is the method which generate the initial value from Uniform distribution, max is 1, min is -1.
  • Gaussian initialization is the method which generate the initial value from Gaussian distribution, mean is 0, std is 1.
  • Glorot Uniform initialization is the method which generate the initial value from Uniform distribution.Each layer has different max and min value of Uniform distribution depends on the number of input units and output units.
  • Glorot Normal initialization is the method which generate the initial value from Gaussian distribution.Each layer has different std value of Gaussian distribution depends on the number of input units and output units.

Generally, it is said that we should initialize the weights according to the number of input units and output units, so we'll show the how-to-initialize the weights problem and typical initialization methods in the bellow figure.

As we can see in the above figure, we have to consider the number of input units and output units when we initialize the weight of each layer.
And Glorot Uniform method and Glorot Normal initialization are the method which considers about that.
We'll see how we can use each initialization method in ReNom, and comparison between them.

Required Libraries

  • numpy 1.21.1
  • pandas 0.20.3
  • matplotlib 2.0.2
  • scikit-learn 0.18.1
  • ReNom 2.5.2
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
import renom as rm
from renom.utility.initializer import Uniform, Gaussian, GlorotUniform, GlorotNormal
from renom.cuda.cuda import set_cuda_active
set_cuda_active(False)

Make label data

In this section, we'll construct simple fully-connected neural networks to building energy efficiency analysis. Here we predict heating-load of from each building features, such as wall-area or glazing-area. Heating/Cooling load is defined by how much energy our air conditioners need to maintain indoor temperature (unit: kWh). The more difficult it is to keep indoor temperature, the bigger Heating/CoolingLoad become. To give an example, the size of room or building material pervious to heat (it means the building easily exchange heat with outdoor) can lead to bigger load. Please download the free data from UCI website in advance ( https://archive.ics.uci.edu/ml/datasets/Energy+efficiency ).

In [2]:
columns = ["RelativeCompactness", "SurfaceArea", "WallArea", "RoofArea", "OverallArea",
           "Orientation", "GlazingArea", "GlazingAreaDistribution", "HeatingLoad", "CoolingLoad"]
df = pd.read_excel("./ENB2012_data.xlsx", names=columns)
df.head()

df_s = df.copy()

for col in df.columns:
    v_std = df[col].std()
    v_mean = df[col].mean()
    df_s[col] = (df_s[col] - v_mean) / v_std

df_s.head()
Out[2]:
RelativeCompactness SurfaceArea WallArea RoofArea OverallArea Orientation GlazingArea GlazingAreaDistribution HeatingLoad CoolingLoad
0 2.040447 -1.784712 -0.561586 -1.469119 0.999349 -1.340767 -1.7593 -1.813393 -0.669679 -0.342443
1 2.040447 -1.784712 -0.561586 -1.469119 0.999349 -0.446922 -1.7593 -1.813393 -0.669679 -0.342443
2 2.040447 -1.784712 -0.561586 -1.469119 0.999349 0.446922 -1.7593 -1.813393 -0.669679 -0.342443
3 2.040447 -1.784712 -0.561586 -1.469119 0.999349 1.340767 -1.7593 -1.813393 -0.669679 -0.342443
4 1.284142 -1.228438 0.000000 -1.197897 0.999349 -1.340767 -1.7593 -1.813393 -0.145408 0.388113
In [3]:
X, y = np.array(df_s.iloc[:, :8]), np.array(df_s.iloc[:, 8:])
X_train, X_test, labels_train, labels_test = train_test_split(X, y, test_size=0.1, random_state=42)

Training Loop

In the training loop, we recommend to watch the loss value to confirm whether the overfitting is occured or not, and whether the learning is progressing properly.
We can discriminate between good patterns and bad patterns because the loss means gap between correct answer and prediction.
This time we'll use loss value as a measure, how good the initialization method works.
In [4]:
def train_loop(epoch, N, batch_size, sequential, X_train, labels_train, X_test, labels_test, optimizer):
    learning_curve = []
    test_learning_curve = []

    for i in range(epoch):
        perm = np.random.permutation(N)
        loss = 0
        for j in range(0, N//batch_size):
            train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
            response_batch = labels_train[perm[j*batch_size : (j+1)*batch_size]]
            with sequential.train():
                l = rm.mse(sequential(train_batch), response_batch)
            grad = l.grad()
            grad.update(optimizer)
            loss += l.as_ndarray()
        train_loss = loss / (N//batch_size)
        test_loss = rm.mse(sequential(X_test), labels_test).as_ndarray()
        test_learning_curve.append(test_loss)
        learning_curve.append(train_loss)
    return learning_curve, test_learning_curve

Network definition and initialize parameters

In [5]:
output_size = 1
epoch = 500
batch_size = 128
N = len(X_train)
optimizer = rm.Adam()

network1 = rm.Sequential([
    rm.Dense(8, initializer=Uniform()),
    rm.Relu(),
    rm.Dense(8, initializer=Uniform()),
    rm.Relu(),
    rm.Dense(6, initializer=Uniform()),
    rm.Relu(),
    rm.Dense(1, initializer=Uniform())
])

network2 = rm.Sequential([
    rm.Dense(8, initializer=Gaussian()),
    rm.Relu(),
    rm.Dense(8, initializer=Gaussian()),
    rm.Relu(),
    rm.Dense(6, initializer=Gaussian()),
    rm.Relu(),
    rm.Dense(1, initializer=Gaussian())
])

network3 = rm.Sequential([
    rm.Dense(8, initializer=GlorotUniform()),
    rm.Relu(),
    rm.Dense(8, initializer=GlorotUniform()),
    rm.Relu(),
    rm.Dense(6, initializer=GlorotUniform()),
    rm.Relu(),
    rm.Dense(1, initializer=GlorotUniform())
])

network4 = rm.Sequential([
    rm.Dense(8, initializer=GlorotNormal()),
    rm.Relu(),
    rm.Dense(8, initializer=GlorotNormal()),
    rm.Relu(),
    rm.Dense(6, initializer=GlorotNormal()),
    rm.Relu(),
    rm.Dense(1, initializer=GlorotNormal())
])

learning_curve, test_learning_curve_Gaussian = train_loop(epoch=epoch, N=N, batch_size=batch_size, sequential=network1, X_train=X_train, labels_train=labels_train, X_test=X_test, labels_test=labels_test, optimizer=optimizer)
learning_curve, test_learning_curve_Uniform = train_loop(epoch=epoch, N=N, batch_size=batch_size, sequential=network2, X_train=X_train, labels_train=labels_train, X_test=X_test, labels_test=labels_test, optimizer=optimizer)
learning_curve, test_learning_curve_GlorotUniform = train_loop(epoch=epoch, N=N, batch_size=batch_size, sequential=network3, X_train=X_train, labels_train=labels_train, X_test=X_test, labels_test=labels_test, optimizer=optimizer)
learning_curve, test_learning_curve_GlorotNormal = train_loop(epoch=epoch, N=N, batch_size=batch_size, sequential=network4, X_train=X_train, labels_train=labels_train, X_test=X_test, labels_test=labels_test, optimizer=optimizer)


plt.clf()
plt.plot(test_learning_curve_Gaussian, linewidth=1, label="Gaussian")
plt.plot(test_learning_curve_Uniform, linewidth=1, label="Uniform")
plt.plot(test_learning_curve_GlorotUniform, linewidth=1, label="GlorotUniform")
plt.plot(test_learning_curve_GlorotNormal, linewidth=1, label="GlorotNormal")
plt.title("learning_curve")
plt.ylabel("error")
plt.xlabel("epoch")
plt.ylim(0,0.5)
plt.legend()
plt.grid()
plt.show()
../../../_images/notebooks_basic_algorithm_weight_initialization_notebook_9_0.png
We can see the result from above result, Glorot Normal initialization method was the best case.
Uniform initialization and Glorot Uniform Initialization were the worst cases.
Becase of these randomness, we get different result on every learning when we use the deep learning.
And it is difficult to decide which initialization is the best, so we have to think about which method is the best method for our data.