Calculation of Back Propagation (2)

Calculation of Back Propagation When Layers or Units Increase

In the previous tutorial, we used a 3-layer neural network with a single input unit and a single output unit to explain how the back-propagation calculation works. However, that explanation was based on a single route ("route" will be explained further in this tutorial). If there are 4 layers, the number of routes from a weight near the input layer increases. In this tutorial, we explain how to perform the calculation when there are multiple routes. Furthermore, as the number of layers increases, a problem called the “Vanishing Gradient” problem arises; we will briefly explain what it is.

Case When Hidden Layers Increase

Consider the route between the weight we want to update and the output unit. For example, if we want to update a weight that exists near the input layer, we can draw the route shown below.

To keep things simple, let's unify all activation functions as f(x).

Let’s say that E=(r_1-y_1)^2/2 is our loss function. If we apply the chain rule to w_1, we only need the values at the red dots to update it. We can calculate it as shown below.

\begin{split}\frac{\partial E}{\partial w_1}=\frac{\partial E}{\partial y_1}\frac{\partial y_1}{\partial z_5}\frac{\partial z_5}{\partial x_3}\frac{\partial x_3}{\partial z_3}\frac{\partial z_3}{\partial w_1} \\=-(r_1-y_1)\ \dot{f}(z_5)\ w_3\ \dot{f}(z_3)\ u_1\end{split}
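
To make this concrete, below is a minimal numerical sketch (not part of the tutorial itself) of the single-route case. It assumes a sigmoid for f, a single unit per layer, and hypothetical values for the input u_1, the weights w_1, w_3, and the target r_1; a finite-difference check confirms the chain-rule result.

import numpy as np

# Minimal sketch of the single-route chain rule (3-layer case, one unit per layer).
# u1, w1, w3, r1 are hypothetical values; f is assumed to be the sigmoid.
def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_dot(x):
    return f(x) * (1.0 - f(x))

u1, w1, w3, r1 = 0.5, 0.8, -0.3, 1.0

def forward(w1):
    z3 = w1 * u1      # pre-activation of the hidden unit
    x3 = f(z3)        # hidden activation
    z5 = w3 * x3      # pre-activation of the output unit
    y1 = f(z5)        # network output
    return z3, x3, z5, y1

z3, x3, z5, y1 = forward(w1)

# Chain rule along the single route:
grad_chain = -(r1 - y1) * f_dot(z5) * w3 * f_dot(z3) * u1

# Finite-difference check of dE/dw1 with E = (r1 - y1)^2 / 2
eps = 1e-6
E = lambda w: (r1 - forward(w)[3]) ** 2 / 2
grad_numeric = (E(w1 + eps) - E(w1 - eps)) / (2 * eps)
print(grad_chain, grad_numeric)  # the two values should agree closely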

On the other hand, what happens if we add another hidden layer, as shown below? Let's say the network is now a 4-layer neural network.

As a result, the number of routes for w_1 (and also w_2) increases. So, how do we calculate this?

For example, suppose we want to update w_1. Let's go back to the update equation and calculate the update amount \partial E/\partial w_1.

w_{t+1}\leftarrow w_t - \gamma \frac{\partial E}{\partial w_1}
\frac {\partial E}{\partial w_1}=\frac {\partial E}{\partial y_1}\frac {\partial y_1}{\partial z_5}\frac {\partial z_5}{\partial w_1}

z_5 is calculated from x_7 and x_8. Each of these variables contains w_1 inside it. Hence, we can express \partial z_5/\partial w_1 as shown below.

\frac {\partial z_5}{\partial w_1}=\frac {\partial}{\partial w_1}(w_7x_7+w_8x_8)= w_7\frac {\partial x_7}{\partial w_1}+ w_8\frac {\partial x_8}{\partial w_1}

If we expand each term, we get the equations shown below.

\begin{split} w_7\frac {\partial x_7}{\partial w_1}= w_7\frac {\partial x_7}{\partial z_3} \frac {\partial z_3}{\partial x_3} \frac {\partial x_3}{\partial z_1}\frac {\partial z_1}{\partial w_1} \\ w_8\frac {\partial x_8}{\partial w_1}= w_8\frac {\partial x_8}{\partial z_4} \frac {\partial z_4}{\partial x_4} \frac {\partial x_4}{\partial z_1}\frac {\partial z_1}{\partial w_1}\end{split}

As a result, we get the following equation.

\begin{split}\frac {\partial E}{\partial w_1}=\frac {\partial E}{\partial y_1}\frac {\partial y_1}{\partial z_5} \frac {\partial z_5}{\partial x_7}\frac {\partial x_7}{\partial z_3} \frac {\partial z_3}{\partial x_3} \frac {\partial x_3}{\partial z_1}\frac {\partial z_1}{\partial w_1}\\ +\frac {\partial E}{\partial y_1}\frac {\partial y_1}{\partial z_5}\frac {\partial z_5}{\partial x_8}\frac {\partial x_8}{\partial z_4} \frac {\partial z_4}{\partial x_4} \frac {\partial x_4}{\partial z_1}\frac {\partial z_1}{\partial w_1}\\ =-(r_1-y_1)\ \dot{f}(z_5) \ w_7\ \dot{f}(z_3)\ w_3\ \dot{f}(z_1)\ u\\ -(r_1-y_1)\ \dot{f}(z_5)\ w_8\ \dot{f}(z_4)\ w_4\ \dot{f}(z_1)\ u\end{split}

The equation above is actually the sum of two applications of the chain rule. In other words, if multiple routes exist, we apply the chain rule to each route and take the sum over all routes in order to update the weight. In the figure, red corresponds to the first term and blue to the second term, and the dots represent the values needed to calculate the update amount.
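
As a sanity check, here is a small sketch of the two-route sum (again not the tutorial's own code). It assumes sigmoid activations and hypothetical scalar values, with the first hidden unit's output feeding both units of the second hidden layer, which is what creates the two routes.

import numpy as np

# Sketch of the 4-layer, two-route case. All values are hypothetical; f is a sigmoid.
def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_dot(x):
    return f(x) * (1.0 - f(x))

u, w1, w3, w4, w7, w8, r1 = 0.5, 0.8, -0.3, 0.6, 0.9, -0.7, 1.0

def forward(w1):
    z1 = w1 * u                  # first hidden layer
    h = f(z1)                    # this value plays the role of x3 and x4 in the figure
    z3, z4 = w3 * h, w4 * h      # second hidden layer
    x7, x8 = f(z3), f(z4)
    z5 = w7 * x7 + w8 * x8       # output layer
    y1 = f(z5)
    return z1, z3, z4, z5, y1

z1, z3, z4, z5, y1 = forward(w1)

# Chain rule applied to each route, then summed:
route1 = -(r1 - y1) * f_dot(z5) * w7 * f_dot(z3) * w3 * f_dot(z1) * u
route2 = -(r1 - y1) * f_dot(z5) * w8 * f_dot(z4) * w4 * f_dot(z1) * u
grad_chain = route1 + route2

# Finite-difference check on E = (r1 - y1)^2 / 2
eps = 1e-6
E = lambda w: (r1 - forward(w)[4]) ** 2 / 2
grad_numeric = (E(w1 + eps) - E(w1 - eps)) / (2 * eps)
print(grad_chain, grad_numeric)  # should agree closely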

Case When Output Units Increase

Then what happens when the number of output units increases by one? We can express the total loss E as the sum of the losses E_1, E_2, each of which corresponds to one output.

\frac {\partial E}{\partial w_1}=\frac {\partial E_1}{\partial w_1}+\frac {\partial E_2}{\partial w_1}

In short, if we apply the chain rule to each route to y_1 and y_2, we can update w_1. The routes are shown below.

If the number of input units increases, the same idea applies.
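
Below is a short sketch of the two-output case under the same toy setup (hypothetical weights, input and targets, sigmoid f): each output gets its own chain rule, and the gradients are simply added.

import numpy as np

# Sketch: two output units share one hidden unit; dE/dw1 = dE1/dw1 + dE2/dw1.
def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_dot(x):
    return f(x) * (1.0 - f(x))

u1, w1, w3, w4 = 0.5, 0.8, -0.3, 0.6   # hypothetical weights and input
r1, r2 = 1.0, 0.0                      # hypothetical targets

def forward(w1):
    z3 = w1 * u1
    x3 = f(z3)
    y1, y2 = f(w3 * x3), f(w4 * x3)    # two output units
    return z3, x3, y1, y2

z3, x3, y1, y2 = forward(w1)

# One chain rule per route (one route per output), then summed:
g1 = -(r1 - y1) * f_dot(w3 * x3) * w3 * f_dot(z3) * u1
g2 = -(r2 - y2) * f_dot(w4 * x3) * w4 * f_dot(z3) * u1
grad_chain = g1 + g2

# Finite-difference check on E = E1 + E2
eps = 1e-6
def E(w):
    _, _, a, b = forward(w)
    return (r1 - a) ** 2 / 2 + (r2 - b) ** 2 / 2
print(grad_chain, (E(w1 + eps) - E(w1 - eps)) / (2 * eps))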

The Vanishing Gradient Problem

Until now, we have explained that we update the weights by summing the results of the chain rule applied to each route. What happens if we use a sigmoid function as the activation function in the hidden layers? For this example, let's suppose the input and output layers use a linear activation function. The derivative of the sigmoid function can be expressed as shown below.

\begin{split}f(x)=\frac{1}{1+\exp(-x)} \\ \frac {\partial f(x)}{\partial x}=f(x)(1-f(x))\end{split}

For the sigmoid function, the gradient is largest when the output is 0.5, so the maximum gradient is 0.5(1-0.5)=0.25. The sigmoid function's output always lies between 0 and 1.
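
A quick numerical check of this bound (a sketch, not part of the tutorial):

import numpy as np

# The sigmoid derivative f(x)(1 - f(x)) never exceeds 0.25.
x = np.linspace(-10.0, 10.0, 10001)
s = 1.0 / (1.0 + np.exp(-x))
ds = s * (1.0 - s)
print(ds.max())          # ~0.25
print(s[np.argmax(ds)])  # ~0.5, i.e. the maximum is attained where the output is 0.5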

The calculation result for one route is shown below. This calculation assumes one output unit and follows the route that passes through w_3.

\begin{split}\left|\frac {\partial E_1}{\partial w_1}\right|=\left|\frac {\partial E}{\partial y_1}\frac {\partial y_1}{\partial x_8}\frac {\partial x_8}{\partial z_3} \frac {\partial z_3}{\partial x_1} \frac {\partial x_1}{\partial z_1}\frac {\partial z_1}{\partial w_1}\right| \\ \leq |r_1-y_1|\ |w_5|\ 0.25\ |w_3|\ 0.25\ |u|\\ = 0.0625\ |r_1-y_1|\ |w_5|\ |w_3|\ |u|\end{split}

From the calculation above, we know that the magnitude of the update amount can never exceed 0.0625\ |r_1-y_1|\ |w_5|\ |w_3|\ |u|. This bound is for 4 layers, but each additional hidden layer multiplies the bound by another factor of at most 0.25, which shrinks the gradient further.
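
The following small sketch shows how this upper bound shrinks geometrically as sigmoid hidden layers are added (a factor of at most 0.25 per layer):

# Each additional sigmoid hidden layer multiplies the bound by at most 0.25.
for n_hidden in range(1, 8):
    print(n_hidden, 0.25 ** n_hidden)   # 0.25, 0.0625, 0.015625, ...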

Some of you may have thought that increasing the weights could be a way to increase the update amount near the input layer, but with larger weights the sigmoid will at some point receive large inputs, which drives the gradient of the activation function toward 0. Ultimately, the farther a layer is from the output, the more difficult it is for its weights to update. This is what we call the “Vanishing Gradient” problem. For this reason, neural network research did not progress much for a time.

However, in 2006, Hinton and colleagues published a paper on Deep Belief Networks (DBN) [1], which led to a revival of neural network research. Since then, many studies have worked on improving training, and among them we learned that using ReLU as the activation function helps overcome the “Vanishing Gradient” problem. ReLU is the function y=\max(x,0), where x is the input; its derivative is 1 whenever the input is positive, so it does not shrink the gradient as layers are stacked. With this function, the training process improved considerably.
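
For comparison, here is a minimal sketch of ReLU and its derivative; the derivative is 1 for any positive input, so stacking layers does not keep multiplying the gradient by a factor smaller than 1.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_dot(x):
    # 1 for x > 0, 0 otherwise (the value at exactly 0 is a convention).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))      # [0.   0.   0.   0.5  2. ]
print(relu_dot(x))  # [0. 0. 0. 1. 1.]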

Conclusion

In this tutorial, we learned that when there are 4 or more layers, the number of possible routes for each weight increases, but we can apply the chain rule to each route and sum the results to update the weights. We also learned that as we add layers, the “Vanishing Gradient” problem arises, and that using the ReLU function for the units in the hidden layers helps solve it.

As a side note, in practice these gradients are computed using a computational graph. We will not go into the details here, but if you are curious, you may want to look into it.

[1] Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation, 18(7), 1527–1554.