Adam Optimization

Introduction to Adam Optimization

Adam is one of the most widely used algorithms for optimizing neural network parameters. There are many optimizers, such as Stochastic Gradient Descent (SGD), AdaGrad, and RMSProp; all of them update each parameter from its gradient, but they differ in how the learning rate is handled. The learning rate is especially important when the frequency of each explanatory variable varies over time. For example, we want to update important but infrequent features by a large amount, and frequent but less important features by a small amount.

AdaGrad addressed this weakness of SGD and proposed a new update equation that decays the learning rate by the accumulated squared gradients. However, this method has a problem of its own: learning slows down rapidly as the squared gradients accumulate, so while it works well for parameters affected by sparseness, it does not keep the learning rate at an appropriate level. The next method, RMSProp, updates each parameter based on the recent gradient and its expected value (an exponential moving average). This makes it possible to take a large step when the gradient is larger than the averaged gradient, and a small step toward convergence when it is smaller. RMSProp generally works well, but Adam points out a remaining weakness of AdaGrad and RMSProp: they do not consider the unit (scale) of the gradients. Adam is a method that takes into account both the magnitude of the update for each parameter and the uncertainty of the gradient direction. Adam updates the parameters based on the following.

  • Sparseness of the parameters (the frequency of updates differs from parameter to parameter)
  • Momentum (the previous gradient is related to the current gradient), with more importance given to recent gradients
  • Uncertainty of the gradient update direction, in the spirit of annealing: when the gradient direction changes frequently, the update is uncertain and the parameter needs a larger change, while a gradient direction that stays the same means a less uncertain case.

Many blogs and papers report that Adam gives the best results among these optimizers. We think the reason lies in the three points above: they allow Adam to adapt to the variation of the input features and to the update patterns of many neural network architectures.
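
As background for the training loop later in this notebook, here is a minimal NumPy sketch of the Adam update rule from the original paper. The names adam_update, m, v, lr, beta1, beta2, and eps are illustrative and are not part of the ReNom API used below.

import numpy as np

def adam_update(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # first moment: exponential moving average of the gradient (momentum)
    m = beta1 * m + (1 - beta1) * grad
    # second moment: exponential moving average of the squared gradient (magnitude)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction for the zero-initialized moments (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # per-parameter update whose effective step size is roughly bounded by lr
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v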

Required Libraries

  • numpy 1.21.1
  • scikit-learn 0.18.1
In [1]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
import renom as rm
from renom.cuda.cuda import set_cuda_active
set_cuda_active(False)  # run on the CPU; set to True to use the GPU

Make label data

The reference for the dataset is given below.

ISOLET Data Set, Ron Cole and Mark Fanty. Department of Computer Science and Engineering,
Oregon Graduate Institute, Beaverton, OR 97006.
In [2]:
filename = "./isolet1+2+3+4.data"
labels = []
X = []
y = []

def make_label_idx(filename):
    labels = []
    for line in open(filename, "r"):
        line = line.rstrip()
        label = line.split(",")[-1]
        labels.append(label)
    return list(set(labels))

labels = make_label_idx(filename)
labels = sorted(labels, key=lambda d:int(d.replace(".","").replace(" ","")))
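
As a quick check (assuming the ISOLET data file above is in place), the resulting list should contain one entry per class, i.e. the 26 letters of the alphabet encoded as the numbers 1 to 26:

print(len(labels))   # expected: 26, one class per letter of the alphabet
print(labels[:3])    # the label strings exactly as they appear in the file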

Load data from the file for training and prediction

In [3]:
for line in open(filename,"r"):
    line = line.rstrip()
    label = labels.index(line.split(",")[-1])
    features = list(map(float,line.split(",")[:-1]))
    X.append(features)
    y.append(label)

X = np.array(X)
y = np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("X_train:{}, y_train:{}, X_test:{}, y_test:{}".format(X_train.shape, y_train.shape,
                                                            X_test.shape, y_test.shape))
lb = LabelBinarizer().fit(y)
labels_train = lb.transform(y_train)
labels_test = lb.transform(y_test)
print("labels_train:{}, labels_test:{}".format(labels_train.shape, labels_test.shape))
X_train:(4990, 617), y_train:(4990,), X_test:(1248, 617), y_test:(1248,)
labels_train:(4990, 26), labels_test:(1248, 26)
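
LabelBinarizer turns each integer class index into a one-hot row of length 26, which is the target format used with softmax_cross_entropy below. A quick way to see the correspondence:

print(y_train[0])        # an integer class index between 0 and 25
print(labels_train[0])   # the matching one-hot vector of length 26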

Network definition and parameter initialization

In [4]:
output_size = len(labels)
sequential = rm.Sequential([
    rm.Dense(100),
    rm.Relu(),
    rm.Dense(50),
    rm.Relu(),
    rm.Dense(output_size)
])
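
As a sanity check (this snippet is an illustrative addition, not part of the original notebook), the network should map a batch of 617-dimensional feature vectors to 26 class scores; the input size of each Dense layer is inferred from the data on the first call:

out = sequential(X_train[:2])
print(out.shape)   # expected: (2, 26), one score per class for each sample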

Learning loop

In [5]:
epoch = 20
batch_size = 128
N = len(X_train)
optimizer = rm.Adam(lr=0.01)
for i in range(epoch):
    perm = np.random.permutation(N)  # reshuffle the training data every epoch
    loss = 0
    for j in range(0, N//batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = labels_train[perm[j*batch_size : (j+1)*batch_size]]
        with sequential.train():
            # forward pass inside the train context so the graph is recorded for backpropagation
            l = rm.softmax_cross_entropy(sequential(train_batch), response_batch)
        grad = l.grad()          # back-propagate the loss
        grad.update(optimizer)   # apply the Adam update to all parameters
        loss += l.as_ndarray()
    train_loss = loss / (N//batch_size)
    test_loss = rm.softmax_cross_entropy(sequential(X_test), labels_test).as_ndarray()
    print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:1.3367, test_loss:0.4439
epoch:001, train_loss:0.3355, test_loss:0.2605
epoch:002, train_loss:0.2238, test_loss:0.2537
epoch:003, train_loss:0.1772, test_loss:0.3013
epoch:004, train_loss:0.1396, test_loss:0.1789
epoch:005, train_loss:0.1350, test_loss:0.2375
epoch:006, train_loss:0.0864, test_loss:0.2309
epoch:007, train_loss:0.0829, test_loss:0.1849
epoch:008, train_loss:0.0616, test_loss:0.2061
epoch:009, train_loss:0.0471, test_loss:0.2114
epoch:010, train_loss:0.0792, test_loss:0.1997
epoch:011, train_loss:0.0520, test_loss:0.2532
epoch:012, train_loss:0.0572, test_loss:0.3516
epoch:013, train_loss:0.0803, test_loss:0.3358
epoch:014, train_loss:0.0720, test_loss:0.2725
epoch:015, train_loss:0.0859, test_loss:0.2785
epoch:016, train_loss:0.0384, test_loss:0.2300
epoch:017, train_loss:0.0311, test_loss:0.2861
epoch:018, train_loss:0.0172, test_loss:0.2492
epoch:019, train_loss:0.0461, test_loss:0.3394
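
To see how well the trained network classifies the held-out data, a minimal accuracy check using only the objects defined above might look like this (the predicted class is the argmax of the network outputs):

predictions = np.argmax(sequential(X_test).as_ndarray(), axis=1)
accuracy = np.mean(predictions == y_test)
print("test accuracy: {:.4f}".format(accuracy))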