Calculation of Back Propagation (1)

A guide to calculating back propagation using a 1-input, 1-output neural network

In this tutorial, we introduce how back propagation works. To make it more intuitive, we have included diagrams that show which variables are used in the back propagation calculation.

Back propagation is a method for calculating how to update the weights. With this calculation, a neural network can learn functions with non-linear properties. In this tutorial, we use a network with a 1-unit input layer, a 2-unit hidden layer, and a 1-unit output layer, together with a regression problem, as an example to understand the process more clearly.

Theory

Let’s consider the target to be a single data point, and focus on the unit in the output layer.

z is the total input to the unit, f(z) is the activation function, and r is the reference label.

The basic rule for updating the weights is shown below.

w_{t+1} \leftarrow w_t - \gamma \frac{\partial E}{\partial w}

t is the learning step, w is the weight, \gamma is the learning rate, and E is the loss function. If we expand the gradient in the second term with the chain rule, we can express it as shown below.

\frac {\partial E}{\partial w}=\frac {\partial E}{\partial y}\frac {\partial y}{\partial z}\frac {\partial z}{\partial w}

In a regression problem, we express the loss function E as (r-y)^2/2 . Its derivative with respect to y is simply -(r-y) . For the derivative of z with respect to w_1 , the other variables can be treated as constants, so we get \partial z/ \partial w_1 = x_1 . \partial y/ \partial z is the derivative of the activation function evaluated at the input z , in other words the slope of its tangent line.

For example, if z=0 and the activation function is the sigmoid function, then \partial y/ \partial z is the slope of the red tangent line, as shown below.
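
As a quick numerical check, here is a short sketch in plain NumPy (independent of ReNom) that evaluates the sigmoid and its derivative at z=0 ; the slope of the tangent line there is 0.25.

    import numpy as np

    def sigmoid(z):
        # Sigmoid activation: f(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_deriv(z):
        # Derivative of the sigmoid: f'(z) = f(z) * (1 - f(z))
        s = sigmoid(z)
        return s * (1.0 - s)

    print(sigmoid_deriv(0.0))  # prints 0.25, the slope of the tangent line at z = 0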

Therefore, we obtain the following equation.

\frac {\partial E}{\partial w_1}=\frac {\partial E}{\partial y}\frac {\partial y}{\partial z}\frac {\partial z}{\partial w_1}=-(r-y) \frac {\partial y}{\partial z} x_1

To understand this more intuitively, let’s look at the diagram below.

From the equation above, we can update the weight using the value marked by the red dot. The same applies to the other weights: using the values directly before and after the weight, namely the input x_1 and the gradient flowing back from the output, we can calculate the update amount.
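
To make this concrete, here is a minimal NumPy sketch, independent of ReNom; the values of x_1 , x_2 , w_1 , w_2 and r are made up, and a sigmoid output activation is assumed purely for illustration. It computes \partial E/ \partial w_1 term by term and applies the update rule.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative values: hidden-unit outputs x1, x2, their weights w1, w2,
    # the reference label r, and the learning rate gamma
    x1, x2 = 0.6, -0.3
    w1, w2 = 0.5, 0.8
    r = 1.0
    gamma = 0.1

    # Forward pass of the output unit
    z = w1 * x1 + w2 * x2          # total input z
    y = sigmoid(z)                 # output y = f(z)

    # Back propagation for w1: dE/dw1 = dE/dy * dy/dz * dz/dw1
    dE_dy = -(r - y)               # derivative of E = (r - y)^2 / 2 with respect to y
    dy_dz = y * (1.0 - y)          # slope of the sigmoid at z
    dz_dw1 = x1                    # other variables are treated as constants
    dE_dw1 = dE_dy * dy_dz * dz_dw1

    # Gradient-descent update of w1
    w1 = w1 - gamma * dE_dw1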

What about w_a and w_b ? For w_a , we can calculate the gradient using the equation below.

\frac {\partial E}{\partial w_a}=\frac {\partial E}{\partial y}\frac {\partial y}{\partial z}\frac {\partial z}{\partial x_1}\frac {\partial x_1}{\partial x_{1,in}}\frac {\partial x_{1,in}}{\partial w_a}=-(r-y) \frac {\partial y}{\partial z}w_1\frac {\partial x_1}{\partial x_{1,in}}u

x_{1,in} is the input signal of the unit that outputs x_1 , and u is the input signal to the network. The premise of the calculation above is that the input unit uses a linear activation function. Checking the diagram, we can easily tell which values are used.

From the diagram above, we can see that to update the weight w_a we multiply together all the derivatives on the path from the output back to the weight concerned, and then multiply by the input.
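
The same chain of multiplications can be written out for w_a . The sketch below again uses plain NumPy with made-up values; the hidden unit is assumed to use a sigmoid activation and the input unit a linear one, as the text assumes.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative values
    u = 0.9                        # input signal
    w_a, w_b = 0.4, -0.2           # weights from the input unit to the two hidden units
    w1, w2 = 0.5, 0.8              # weights from the hidden units to the output unit
    r = 1.0                        # reference label
    gamma = 0.1                    # learning rate

    # Forward pass
    x1_in = w_a * u                # input signal of the unit that outputs x1
    x2_in = w_b * u
    x1, x2 = sigmoid(x1_in), sigmoid(x2_in)
    z = w1 * x1 + w2 * x2
    y = sigmoid(z)

    # Back propagation for w_a, following the equation term by term
    dE_dy = -(r - y)
    dy_dz = y * (1.0 - y)
    dz_dx1 = w1
    dx1_dx1in = x1 * (1.0 - x1)    # slope of the hidden unit's sigmoid
    dx1in_dwa = u
    dE_dwa = dE_dy * dy_dz * dz_dx1 * dx1_dx1in * dx1in_dwa

    # Gradient-descent update of w_a
    w_a = w_a - gamma * dE_dwa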

What if the target consists of multiple data points, as in time-series data? Usually, in this case, we process them with a method called "mini-batch" learning. To perform mini-batch learning, first take D data points from the dataset, take the sum of the loss function values over those D points, divide it by D , and use the gradient of this averaged loss to update the weight, as in the equation below.

\frac {\partial E}{\partial w}=\frac{1}{D}\sum_{i=1}^{D}\frac {\partial E_i}{\partial w}

i represents the i -th data point, and w represents any weight in the network.
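
As a sketch of the mini-batch rule, assuming the same tiny output unit as above and a made-up batch of D=4 data points, the per-sample gradients are simply averaged before the weight is updated.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Made-up mini-batch of D data points: hidden-unit outputs and reference labels
    x1_batch = np.array([0.6, 0.1, -0.4, 0.9])
    x2_batch = np.array([-0.3, 0.7, 0.2, -0.1])
    r_batch = np.array([1.0, 0.0, 1.0, 0.0])
    D = len(r_batch)

    w1, w2 = 0.5, 0.8
    gamma = 0.1

    # Per-sample gradients dE_i/dw1, then their average over the mini-batch
    z = w1 * x1_batch + w2 * x2_batch
    y = sigmoid(z)
    dEi_dw1 = -(r_batch - y) * y * (1.0 - y) * x1_batch
    dE_dw1 = np.sum(dEi_dw1) / D   # (1/D) * sum over i of dE_i/dw1

    w1 = w1 - gamma * dE_dw1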

This is how back propagation is calculated. In ReNom, we write it with the code shown below.

Coding

In [1]:
    # Forward pass inside the training context so that gradients can be computed
    with model.train():
        l = rm.mean_squared_error(model(in1), out1)
    # Back propagation: compute the gradients of the loss with respect to the weights
    grad = l.grad()
    # Apply the gradients with stochastic gradient descent (learning rate 0.001)
    grad.update(Sgd(lr=0.001))

'model' represents the neural network function, and 'in1' and 'out1' represent the input and output data sets. The 'with model.train():' block marks the forward computation whose gradients should be tracked, 'l' holds the resulting loss, the '.grad()' method calculates the gradients by back propagation, and 'update' applies them to the weights of 'model'.

Chain Rule

In this tutorial, we showed how to calculate the derivative of a complicated function by splitting it into a product of simpler derivatives. This technique is known as 'the chain rule'. Some readers may have learned that, when differentiating a function of a function, you "first differentiate the outer function and then the inner function"; that procedure is exactly the chain rule.
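
For example, take y=\sigma (wx) with the sigmoid \sigma . "First the outer function, then the inner function" gives exactly the chain-rule product shown below.

\frac {dy}{dx}=\frac {d\sigma (wx)}{d(wx)}\frac {d(wx)}{dx}=\sigma (wx)\left(1-\sigma (wx)\right)w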

Summary

In this tutorial, we showed how back propagation works with a 3-layer neural network and a regression problem. We also showed which values are used to calculate back propagation.

This tutorial focused on a neural network with a single input and a single output. When a unit has multiple connections, there is a small twist (the gradients arriving from the different connections are summed), but the chain rule is used in exactly the same way.