Recurrent Neural Network (RNN) and LSTM

This tutorial is an introduction to RNN and LSTM, which are widely applied to time-series data analysis.

Recurrent Neural Network (RNN)

An RNN is a neural network that contains a recurrent structure, as shown below.

This recurrent structure enables the network to hold time-series information and is known to work well on time-series tasks such as speech recognition.

The input of the RNN is time-series data \{x_1,\ldots,x_T\} and the output ( y_{T+1} ) is the prediction of the next observation value ( x_{T+1} ).

Propagation is done as follows

\begin{split}\begin{align*} z_t &= f^{(\rm{hidden})}(W_{xh}x_t + \color{red}{W_{hh}z_{t-1}} + b_{xh} + b_{hh})\\ y_{t+1} &= f^{(\rm{out})}(W_{hy}z_t + b_{hy}) \end{align*}\end{split}

Unlike a feed-forward neural network, an RNN has the term \color{red}{W_{hh}z_{t-1}}, which carries time-series information from the previous step.

The parameters above are defined as follows

W_{xh} \in \mathbb{R}^{|\rm{hidden}|\times |\rm{input}|}, b_{xh} \in \mathbb{R}^{|\rm{hidden}|} : Weight and bias from the input layer to the hidden layer

W_{hh} \in \mathbb{R}^{|\rm{hidden}|\times |\rm{hidden}|}, b_{hh} \in \mathbb{R}^{|\rm{hidden}|} : Weight and bias from the hidden layer to the hidden layer at the next time step

W_{hy} \in \mathbb{R}^{|\rm{output}|\times |\rm{hidden}|}, b_{hy} \in \mathbb{R}^{|\rm{output}|} : Weight and bias from the hidden layer to the output layer

f^{(\rm{hidden})}, f^{(\rm{out})} : Activation functions of the hidden and output layers
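
As a concrete illustration, here is a minimal NumPy sketch of the forward propagation above (the tanh hidden activation, identity output activation, and toy dimensions are assumptions made for this example, not part of the tutorial's model):

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_xh, b_hh, b_hy):
    # z_0 is initialized to the zero vector
    z = np.zeros(W_hh.shape[0])
    for x_t in xs:
        # z_t = f_hidden(W_xh x_t + W_hh z_{t-1} + b_xh + b_hh), with f_hidden = tanh
        z = np.tanh(W_xh @ x_t + W_hh @ z + b_xh + b_hh)
    # y_{T+1} = f_out(W_hy z_T + b_hy), with f_out = identity (regression)
    return W_hy @ z + b_hy

# toy sizes: |input| = 1, |hidden| = 3, |output| = 1
np.random.seed(0)
W_xh, W_hh, W_hy = np.random.randn(3, 1), np.random.randn(3, 3), np.random.randn(1, 3)
b_xh, b_hh, b_hy = np.zeros(3), np.zeros(3), np.zeros(1)
xs = [np.array([np.sin(0.1 * t)]) for t in range(10)]
print(rnn_forward(xs, W_xh, W_hh, W_hy, b_xh, b_hh, b_hy))  # prediction of x_{T+1}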

However, RNNs suffer from the so-called vanishing gradient problem. Because time-series data is often long and an RNN becomes deep in proportion to the length of the series, gradients vanish or explode during backpropagation and learning does not go well.
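
To see why, note that the gradient reaching a state k steps back is multiplied by roughly k Jacobians involving W_{hh}, so its norm shrinks or grows geometrically with the sequence length. A purely illustrative NumPy sketch (the weight scale and sequence length are made up for the demonstration):

import numpy as np

np.random.seed(0)
T, hidden = 50, 10
W_hh = 0.1 * np.random.randn(hidden, hidden)  # small weights -> vanishing gradient
grad = np.ones(hidden)
for _ in range(T):
    # each backpropagation step multiplies by W_hh^T
    # (the tanh derivative, being <= 1, would only shrink it further)
    grad = W_hh.T @ grad
print(np.linalg.norm(grad))  # practically 0; with large weights it would explode instead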

LSTM has been proposed to tackle this weak point of RNNs.

Long Short Term Memory (LSTM)

LSTM is a model that can learn from long time-series data to some extent.

The difference from an RNN is that LSTM replaces the units with memory units, as shown in the figure below.

Memory units tune the unit value c_{t,j} and the output z_{t,j} , j\in\{1,2,\ldots,|hidden|\} with the product of gate values over time. Concretely, we can tune

  • how much input to accept from the previous layer
  • how much input to accept from the output at the previous time step
  • how much output to pass on to the next layer

by multiplying by each gate's value ( \in [0,1] ). This is why, compared to an RNN, LSTM is more flexible and can mitigate the vanishing gradient problem.

Propagation is done as follows

\begin{split}\begin{align*} c_{t,j} &= \sigma((W_{in}x_t)_j+(R_{in}z_{t-1})_j)f((W_cx_t)_j+(R_cz_{t-1})_j)+\sigma((W_{for}x_t)_j+(R_{for}z_{t-1})_j)c_{t-1,j}\\ z_{t,j} &= \sigma((W_{out}x_t)_j+(R_{out}z_{t-1})_j)f(c_{t,j}) \end{align*}\end{split}

The parameters above are defined as follows

W_c \in \mathbb{R}^{|hidden|\times |input|} : Weight from the previous layer to the memory unit

W_{in} \in \mathbb{R}^{|hidden|\times |input|} : Weight from the previous layer to the input gate

W_{for} \in \mathbb{R}^{|hidden|\times |input|} : Weight from the previous layer to the forget gate

W_{out} \in \mathbb{R}^{|hidden|\times |input|} : Weight from the previous layer to the output gate

R_c \in \mathbb{R}^{|hidden|\times |hidden|} : Weight from the hidden layer at the previous time step to the memory unit

R_{in} \in \mathbb{R}^{|hidden|\times |hidden|} : Weight from the hidden layer at the previous time step to the input gate

R_{for} \in \mathbb{R}^{|hidden|\times |hidden|} : Weight from the hidden layer at the previous time step to the forget gate

R_{out} \in \mathbb{R}^{|hidden|\times |hidden|} : Weight from the hidden layer at the previous time step to the output gate
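
For reference, one step of the memory-unit update above can be written in NumPy roughly as follows (σ is taken as the logistic sigmoid and f as tanh, and biases are omitted as in the equations; these activation choices are assumptions for the sketch):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, z_prev, c_prev,
              W_c, W_in, W_for, W_out, R_c, R_in, R_for, R_out, f=np.tanh):
    i  = sigmoid(W_in @ x_t + R_in @ z_prev)    # input gate
    g  = f(W_c @ x_t + R_c @ z_prev)            # candidate memory value
    fg = sigmoid(W_for @ x_t + R_for @ z_prev)  # forget gate
    o  = sigmoid(W_out @ x_t + R_out @ z_prev)  # output gate
    c_t = i * g + fg * c_prev                   # c_{t,j}
    z_t = o * f(c_t)                            # z_{t,j}
    return z_t, c_t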

Experiment (learning a Sine curve with LSTM)

In this section, we will show an experiment using LSTM to learn a Sine curve.

Problem setting

We will use LSTM to predict the next observation value from previous data points as shown in the figure below.

Required Libraries

  • numpy 1.13.1
  • matplotlib 2.0.2
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy

import renom as rm
from renom.optimizer import Adam
from renom.cuda import set_cuda_active

# set True when using GPU
set_cuda_active(False)

Data preparation

We will define a function that creates subsequences from a given time series.

In [2]:
# create subsequences of length [look_back] from the [time-series] data ts
# e.g.) ts: {1,2,3,4,5}, lb=3 => sub_seq: {1,2,3},{2,3,4} and nxt: {4},{5}
def create_dataset(ts, look_back=1):
    sub_seq, nxt = [], []
    for i in range(len(ts)-look_back):
        sub_seq.append(ts[i:i+look_back])
        nxt.append([ts[i+look_back]])
    return sub_seq, nxt
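
As a quick check with the toy series from the comment above (not part of the experiment):

sub, nxt = create_dataset([1, 2, 3, 4, 5], look_back=3)
print(sub)  # [[1, 2, 3], [2, 3, 4]]
print(nxt)  # [[4], [5]]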

We sample a Sine curve at 50 evenly spaced points in the interval [-10, 10] and create subsequences of length 5.

In [3]:
# making a Sine curve data
x = np.linspace(-10,10,50)
y = np.sin(x)
look_back = 5

sub_seq, nxt = create_dataset(y, look_back=look_back)
In [4]:
# split data into train and test set
def split_data(X, y, train_ratio=.5):
    train_size = int(len(y)*train_ratio)
    X_train, y_train  = X[:train_size], y[:train_size]
    X_test, y_test    = X[train_size:], y[train_size:]

    X_train = np.array(X_train)
    X_test = np.array(X_test)
    y_train = np.array(y_train)
    y_test = np.array(y_test)
    return X_train, y_train, X_test, y_test
In [5]:
X_train, y_train, X_test, y_test = split_data(sub_seq, nxt)
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print('train size : {}, test size : {}'.format(train_size, test_size))
train size : 22, test size : 23

Drawing prediction curve

We will use the predicted value as the input to predict the observation value of the next step. The equation below represents how the calculation is done.

\begin{split}\begin{align*} \{x_t,\ldots,x_{t+4}\} &\rightarrow \hat{x}_{t+5} \\ \{x_{t+1},\ldots,\hat{x}_{t+5}\} &\rightarrow \hat{x}_{t+6} \\ \{x_{t+2},\ldots,\hat{x}_{t+6}\} &\rightarrow \hat{x}_{t+7} \\ \end{align*}\end{split}

Lastly, we draw \{\hat{x}_{t+5},\ldots,\hat{x}_{t+T}\}

In [6]:
def draw_pred_curve(e_num):
    pred_curve = []
    arr_now = X_test[0]
    for _ in range(test_size):
        # feed the current window one value per time step; keep the last prediction
        for t in range(look_back):
            pred = model(np.array([arr_now[t]]))
        model.truncate()
        pred_curve.append(pred[0])
        # slide the window: drop the oldest value and append the prediction
        arr_now = np.delete(arr_now, 0)
        arr_now = np.append(arr_now, pred)
    plt.plot(x[:train_size+look_back], y[:train_size+look_back], color='blue')
    plt.plot(x[train_size+look_back:], pred_curve, label='epoch:'+str(e_num)+'th')

Model Definition

In [7]:
# model definition
model = rm.Sequential([
   rm.Lstm(2),
   rm.Dense(1)
])
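
The LSTM layer has 2 hidden units and the Dense layer maps them to a single predicted value. As a quick sanity check (using the same calling convention as the training loop below; the dummy batch here is made up for illustration), a batch of shape (batch, 1) should come back as predictions of shape (batch, 1):

# illustrative shape check with a dummy batch
out = model(np.zeros((3, 1)))   # 3 sequences, 1 feature per time step
model.truncate()                # reset the recurrent state
print(out.shape)                # expected: (3, 1)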

Parameter settings

In [8]:
# params
batch_size = 5
max_epoch = 2000
period = 200 # interval (in epochs) between early-stopping checks and prediction-curve plots
optimizer = Adam()

Train Loop

In [9]:
i = 0
loss_prev = np.inf

# learning curves
learning_curve = []
test_curve = []

plt.figure(figsize=(15,10))

# train loop
while(i < max_epoch):
    i += 1
    # perm is for getting batch randomly
    perm = np.random.permutation(train_size)
    train_loss = 0

    for j in range(train_size // batch_size):
        batch_x = X_train[perm[j*batch_size : (j+1)*batch_size]]
        batch_y = y_train[perm[j*batch_size : (j+1)*batch_size]]

        # Forward propagation
        l = 0
        z = 0
        with model.train():
            for t in range(look_back):
                z = model(batch_x[:,t].reshape(len(batch_x),-1))
                l = rm.mse(z, batch_y)
            model.truncate()
        l.grad().update(optimizer)
        train_loss += l.as_ndarray()

    train_loss = train_loss / (train_size // batch_size)
    learning_curve.append(train_loss)

    # test
    l = 0
    z = 0
    for t in range(look_back):
        z = model(X_test[:,t].reshape(test_size, -1))
        l = rm.mse(z, y_test)
    model.truncate()
    test_loss = l.as_ndarray()
    test_curve.append(test_loss)

    # early stopping check:
    # if the test loss has not decreased by at least 1% since the previous check, stop training.
    if i % period == 0:
        print('epoch:{}, train loss:{}, test loss:{}'.format(i, train_loss, test_loss))
        draw_pred_curve(i)
        if test_loss > loss_prev*0.99:
            print('Stop learning')
            break
        else:
            loss_prev = deepcopy(test_loss)
# predicted curve
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='upper left', fontsize=20)
plt.show()
epoch:200, train loss:0.040894584730267525, test loss:0.04207055643200874
epoch:400, train loss:0.002047195433988236, test loss:0.005259166471660137
epoch:600, train loss:0.0006290220735536423, test loss:0.0025398710276931524
epoch:800, train loss:0.00030508961572195403, test loss:0.002104206010699272
epoch:1000, train loss:0.0001259945884157787, test loss:0.0019810707308351994
epoch:1200, train loss:8.218832681450294e-05, test loss:0.002142632845789194
Stop learning
../../../_images/notebooks_basic_algorithm_LSTM_notebook_21_1.png

In the figure above, the blue curve on the left half represents the data used to train the LSTM. The curves on the right half are the prediction curves drawn at each checkpoint epoch. We can see that the LSTM approximates the Sine curve better as learning progresses.

Loss curves

In [10]:
plt.figure(figsize=(15,10))
plt.plot(learning_curve, color='blue', label='learning curve')
plt.plot(test_curve, color='orange', label='test curve')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(fontsize=30)
plt.show()
../../../_images/notebooks_basic_algorithm_LSTM_notebook_24_0.png

Conclusion

In this tutorial, we have explained RNN and LSTM. RNN is a neural network that can be applied to time-series analysis, and LSTM is an improved version of it that can retain temporal information over long spans.

Since it is effective for time-series analysis, LSTM is applied to time-series regression, time-series anomaly detection, and so on.
