Inside the Activation Function

Understanding the purpose and inner workings of activation functions

A neural network has many parameters, and activation functions are what allow it to solve non-linear problems.

In this tutorial, we introduce the purpose of activation functions. There are two kinds: activation functions used in the output layer and activation functions used in the hidden layers.

Activation functions for the output layer are mainly used in classification problems. For example, suppose there are classes A, B, and C, and we have to assign each data point to one of them. Since classification means assigning data to specific classes, it is convenient to interpret the output as a probability. In this case, we can apply a sigmoid or softmax function to the output layer and read the result as class probabilities.
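To make the probability interpretation concrete, here is a small NumPy sketch (independent of ReNom) of the two output-layer activations mentioned above:

```python
import numpy as np

def sigmoid(z):
    # Squashes a real value into (0, 1), so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Converts a vector of raw scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs for classes A, B, C
probs = softmax(scores)
print(probs)        # each entry lies in (0, 1)
print(probs.sum())  # sums to 1, so it reads as a distribution over classes
```

Sigmoid is the two-class case; softmax generalizes it to three or more classes.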

Activation functions in the hidden layers are what allow the network to express non-linear functions (including non-linear classification boundaries). ReLU is one of the most widely used hidden-layer activation functions.
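ReLU itself is very simple; a one-line NumPy definition shows its behavior:

```python
import numpy as np

def relu(x):
    # ReLU passes positive values through unchanged and clamps negatives
    # to zero; this kink is the non-linearity that lets stacked layers
    # bend decision boundaries.
    return np.maximum(x, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # negatives become 0
```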

In this tutorial, we tackle the XOR problem, a classic non-linear problem, to deepen understanding and visualize what is happening inside the neural network.

Required Libraries

In [1]:
import numpy as np
import renom as rm
from renom.optimizer import Sgd
import matplotlib.pyplot as plt

Make XOR data

The XOR data is shown below; we cannot separate these points into the two classes with a single straight line.

This problem is often used as a textbook example of non-linear separation.
We will try to separate these points with a neural network that uses the ReLU activation function.
In [2]:
X = np.array([[1, 1],
              [1, 0],
              [0, 1],
              [0, 0]])

y = np.array([[1],
              [0],
              [0],
              [1]])

Neural Network Definition

In [3]:
class Mnist(rm.Model):
    def __init__(self):
        super(Mnist, self).__init__()
        self.layer1 = rm.Dense(output_size=5)  # hidden layer with 5 units
        self.layer2 = rm.Dense(1)              # single output unit

    def forward(self, x):
        t1 = self.layer1(x)
        t2 = rm.relu(t1)
        t3 = self.layer2(t2)
        return t3

In this case, the output is expressed as follows.

\begin{equation*} h1 = w^{i→h}_{11} \times x_1 + w^{i→h}_{21} \times x_2 + b^{i→h}_{1} \end{equation*}
\begin{equation*} h2 = w^{i→h}_{12} \times x_1 + w^{i→h}_{22} \times x_2 + b^{i→h}_{2} \end{equation*}
\begin{equation*} h3 = w^{i→h}_{13} \times x_1 + w^{i→h}_{23} \times x_2 + b^{i→h}_{3} \end{equation*}
\begin{equation*} h4 = w^{i→h}_{14} \times x_1 + w^{i→h}_{24} \times x_2 + b^{i→h}_{4} \end{equation*}
\begin{equation*} h5 = w^{i→h}_{15} \times x_1 + w^{i→h}_{25} \times x_2 + b^{i→h}_{5} \end{equation*}
\begin{equation*} output = w^{h→o}_{1} \times Relu(h1) + w^{h→o}_{2} \times Relu(h2) + w^{h→o}_{3} \times Relu(h3) + w^{h→o}_{4} \times Relu(h4) + w^{h→o}_{5} \times Relu(h5) + b^{h→o} \end{equation*}
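The equations above can be written as two matrix operations with a ReLU in between. The following NumPy sketch mirrors them with hypothetical random weights standing in for the trained parameters (the names are placeholders, not ReNom's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters matching the shapes in the equations:
W_ih = rng.normal(size=(2, 5))  # w^{i→h}: inputs (x1, x2) → 5 hidden units
b_ih = rng.normal(size=(5,))    # b^{i→h}: one bias per hidden unit
W_ho = rng.normal(size=(5, 1))  # w^{h→o}: 5 hidden units → 1 output
b_ho = rng.normal(size=(1,))    # b^{h→o}: output bias

def forward(x):
    h = x @ W_ih + b_ih          # h1..h5 from the equations above
    a = np.maximum(h, 0.0)       # Relu(h1)..Relu(h5)
    return a @ W_ho + b_ho       # weighted sum of activations plus bias

x = np.array([[1.0, 1.0]])
print(forward(x).shape)          # (1, 1): one scalar output per sample
```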


In [4]:
epoch = 50
batch = 1
N = len(X)
optimizer = Sgd(lr=0.1, momentum=0.4)

network = Mnist()
learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch):
        train_batch = X[perm[j*batch : (j+1)*batch]]
        response_batch = y[perm[j*batch : (j+1)*batch]]
        with network.train():
            result = network(train_batch)
            l = rm.sigmoid_cross_entropy(result, response_batch)
        grad = l.grad()
        grad.update(optimizer)  # apply the SGD update; without this nothing is learned
        loss += l.as_ndarray()
    train_loss = loss / (N // batch)
    learning_curve.append(train_loss)
plt.plot(learning_curve, linewidth=3, label="train")
plt.legend()
plt.show()
In [5]:
print("[0, 0]:{}".format(network(np.array([[0, 0]])).as_ndarray()))
print("[1, 1]:{}".format(network(np.array([[1, 1]])).as_ndarray()))
print("[1, 0]:{}".format(network(np.array([[1, 0]])).as_ndarray()))
print("[0, 1]:{}".format(network(np.array([[0, 1]])).as_ndarray()))
[0, 0]:[[1.962966]]
[1, 1]:[[1.962966]]
[1, 0]:[[-2.961516]]
[0, 1]:[[-3.0228367]]
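The raw outputs above are logits, not probabilities. Since the network was trained with sigmoid cross entropy, applying a sigmoid and thresholding at 0.5 (equivalently, thresholding the logit at 0) gives the class decisions. A small NumPy sketch using the values printed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw network outputs (logits) copied from the run above,
# in the order (0,0), (1,1), (1,0), (0,1):
logits = np.array([1.962966, 1.962966, -2.961516, -3.0228367])
probs = sigmoid(logits)
labels = (probs > 0.5).astype(int)
print(labels)  # [1 1 0 0]: (0,0) and (1,1) → class 1, the others → class 0
```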

As we can see, we have obtained a good result: the network separates the XOR classes. The output is a weighted sum of the hidden-layer activations. From here, we visualize the output surface over the range -3 to 3 for the two input dimensions (the XOR data); height is the output value.

\begin{equation*} output = w^{h→o}_{1} \times Relu(h1) + w^{h→o}_{2} \times Relu(h2) + w^{h→o}_{3} \times Relu(h3) + w^{h→o}_{4} \times Relu(h4) + w^{h→o}_{5} \times Relu(h5) + b^{h→o} \end{equation*}

The output is composed of 5 weighted activation outputs.

As said before, a non-linear activation function lets us separate non-linearly separable cases, and we were able to solve the XOR problem using the ReLU function. The output surface is the figure shown above. Next, let's look at the output surfaces of the ReLU outputs from nodes 1 and 2.

The figure above is calculated based on the equations below.

\begin{equation*} output = w^{h→o}_{1} \times Relu(h1) \end{equation*}
\begin{equation*} h1 = w^{i→h}_{11} \times x_1 + w^{i→h}_{21} \times x_2 + b^{i→h}_{1} \end{equation*}

The figure above is calculated based on the equations below.

\begin{equation*} output = w^{h→o}_{2} \times Relu(h2) \end{equation*}
\begin{equation*} h2 = w^{i→h}_{12} \times x_1 + w^{i→h}_{22} \times x_2 + b^{i→h}_{2} \end{equation*}
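A per-node surface like these can be computed directly on a grid. Here is a NumPy sketch for node 1's contribution, using made-up weight values (placeholders, not the trained parameters):

```python
import numpy as np

# Hypothetical parameters for node 1 (illustrative values only):
w11, w21, b1 = 1.5, 1.5, -1.0  # w^{i→h}_{11}, w^{i→h}_{21}, b^{i→h}_{1}
w_out1 = -2.2                  # w^{h→o}_{1}

# Evaluate over the same -3 to 3 range used for the plots above:
xs = np.linspace(-3, 3, 61)
X1, X2 = np.meshgrid(xs, xs)
h1 = w11 * X1 + w21 * X2 + b1            # pre-activation of node 1
surface = w_out1 * np.maximum(h1, 0.0)   # w^{h→o}_{1} × Relu(h1)
print(surface.shape)  # (61, 61): one height value per grid point
```

Wherever h1 is negative the surface is exactly flat at zero, which is what gives each node's contribution its characteristic "folded plane" shape.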

Weight visualization

Here are the trained weights of the neural network with ReLU activation.

Why it works

The output weights are -2.2, 1.1, -1.2, -0.2, -1.8, in node order. The only positive value is the second output weight.

We want the two points (0,0) and (1,1) to be classified as class 1 by pushing the output above 0, and (1,0) and (0,1) to be classified as class 0 by pushing the output below 0. Therefore, for inputs (0,0) and (1,1), we want the contribution to come through the positive-weight path of the output layer rather than the other paths. When the input is (0,0), the products with the first-layer weights are all 0, so the second node's bias must be large while the other contributions stay small. When the input is (1,1), because the output weights have large magnitudes, the first-layer weights must be arranged so that the 2nd node activates and the other nodes do not.

Considering these two cases, we can confirm that the system above is designed to behave the same way for (0,0) and (1,1). For the inputs (1,0) and (0,1), we can appreciate that the nodes other than the 2nd node activate, feeding large negative values to the output node, which yields a negative output and hence class 0.

Sparse Effect of ReLU

By the way, some nodes have a negative effect and some a positive effect on the output, so how does the calculation above actually work? The most important role of ReLU here is that it acts as a sparsity-control tool.

For example, the second output weight contributes a positive value, while the other weights contribute negative values. When we want a positive output, we would prefer not to use the negative-effect nodes at all.

ReLU forces negative pre-activations to zero, so it effectively removes specific nodes from the calculation.
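This gating effect is easy to see numerically. The sketch below uses the output weights listed earlier and hypothetical pre-activations (the h values are made up for illustration) in which only the 2nd node fires:

```python
import numpy as np

relu = lambda h: np.maximum(h, 0.0)

# Hypothetical pre-activations of the 5 hidden nodes for one input:
h = np.array([-1.3, 2.0, -0.4, -2.1, -0.7])
a = relu(h)
print(a)  # only node 2 survives; the rest are zeroed out

# Output weights reported in the text above:
w_out = np.array([-2.2, 1.1, -1.2, -0.2, -1.8])
print(a @ w_out)  # only the positive path (node 2) contributes, giving ~2.2
```

All the negative-effect nodes are switched off by ReLU, so their (negative) output weights never enter the sum.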


A neural network can express non-linear functions using the simple ReLU function and its weighted sums. Without an activation function, we cannot separate non-linear cases like the XOR problem. It is important to think carefully about the activation function when defining a new model.