Intro to Activation Functions

We will introduce basic activation functions and their features.

An activation function decides how much output a neuron produces, depending on the weighted sum of its inputs. Most activation functions are non-linear, which is what allows a network to solve non-linear regression and classification problems.

The basic equation for a single neuron is shown below.

\begin{split}y &= f(z)\\ z &= \sum_{i} w_i x_i + b\end{split}

x_i are the inputs to the neuron, w_i are the weights for each input, b is the bias, and f(z) is the activation function. Our main focus will be on f(z).
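
As a rough sketch of this equation (the input and weight values below are made up for illustration, and a sigmoid is used as f only as a placeholder), a single neuron can be computed directly with NumPy:

import numpy as np

x = np.array([0.5, -1.0, 2.0])    # hypothetical inputs x_i
w = np.array([0.8, 0.3, -0.5])    # hypothetical weights w_i
b = 0.1                           # bias b

z = np.dot(w, x) + b              # z = sum_i w_i * x_i + b
y = 1.0 / (1.0 + np.exp(-z))      # y = f(z), here with a sigmoid as f
print(z, y)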

Many activation functions have been proposed since the birth of neural networks, but in this tutorial we will focus on the Sigmoid, Tanh, and Relu functions. For further understanding, we will also describe how each of them affects back-propagation. (See the tutorial about "back propagation" for further detail.)

Required Libraries

In [2]:
import renom as rm
import numpy as np
import matplotlib.pyplot as plt

Sigmoid Function

f(x) = \frac{1}{1 + \exp(-x)}

This function has an output range from 0 to 1 and is continuous. It was frequently used before Relu became popular. Before the sigmoid, step-like activation functions based on if rules were used; with the sigmoid, activation outputs can take any value between 0 and 1. Over a narrow input range the graph looks like a gentle slope, but over a wider input range it closely resembles an if-rule (step) function.

The following plots the Sigmoid function for inputs from -10 to 10.

In [3]:
x = np.array([i / 100 for i in range(-1000, 1000)])  # inputs from -10 up to 10 in steps of 0.01
y = rm.sigmoid(x)  # in a Sequential model, use rm.Sigmoid()
plt.grid()
plt.plot(x,y)
plt.xlabel('input')
plt.ylabel('output')
plt.show()
[Plot: output of the Sigmoid function for inputs from -10 to 10]

Another interesting feature of this function is its derivative. If f(x) is the sigmoid function, the derivative can be expressed as follows.

\frac {\partial f(x)}{\partial x}=f(x)(1-f(x))

Hence, both the derivative calculation and back-propagation can be done easily.
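
As a quick sanity check (a sketch, not part of the original tutorial), we can compare this identity against a finite-difference approximation of the derivative, reusing the x array defined above:

def np_sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))       # plain NumPy sigmoid used for checking

s = np_sigmoid(x)
analytic = s * (1.0 - s)                  # derivative from the identity f(x)(1 - f(x))
eps = 1e-5
numeric = (np_sigmoid(x + eps) - np_sigmoid(x - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(analytic - numeric))) # should be close to 0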

However, two problems exist within this function.

  1. The center point of the output is 0.5, so even when the input is near 0 the weights receive large updates

\rightarrow Weights are updated too quickly

  2. If the inputs are large enough, the gradient vanishes (a small numerical illustration follows this list)

\rightarrow Updates stop
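
To make problem 2 concrete (a small illustration, not from the original tutorial), we can evaluate the derivative f(x)(1 - f(x)) at a few inputs; it shrinks rapidly toward 0 as the input grows:

for v in [0.0, 2.0, 5.0, 10.0]:
    s = 1.0 / (1.0 + np.exp(-v))
    print(v, s * (1.0 - s))   # the gradient becomes vanishingly small for large inputs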

Recently, other activation functions that can overcome these problems have appeared, so the use of the sigmoid has declined.

Tanh Function

f(x)=\tanh(x)

This function has an output range from -1 to 1. It is also continuous, like the sigmoid function, and its derivative can be computed from the output values. The major difference from the sigmoid function is that the output is centered at 0.

The following plots the Tanh function for inputs from -10 to 10.

In [4]:
y = rm.tanh(x)  # in a Sequential model, use rm.Tanh()
plt.grid()
plt.plot(x,y)
plt.xlabel('input')
plt.ylabel('output')
plt.show()
[Plot: output of the Tanh function for inputs from -10 to 10]

If f(x) is the tanh function, the derivative can be expressed as follows.

\frac {\partial f(x)}{\partial x}=1-f(x)^2

The Tanh function solves problem 1) of the Sigmoid, but cannot solve problem 2).
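
As a small sketch of both points (reusing the x array defined above; not part of the original tutorial), the tanh output is roughly centered at 0, while its derivative 1 - f(x)^2 still vanishes for large inputs:

t = np.tanh(x)                        # tanh computed directly with NumPy
print(np.mean(t))                     # roughly 0: the output is centered around 0 (problem 1 solved)
for v in [0.0, 2.0, 5.0, 10.0]:
    print(v, 1.0 - np.tanh(v) ** 2)   # the gradient still vanishes for large inputs (problem 2 remains)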

Relu Function

f(x)=\max(x, 0)

This function outputs the same value as the input when the input is positive, and outputs 0 when the input is negative. The function itself is continuous, but its derivative is not; because the derivative is simply 0 or 1, the back-propagation calculation is simple.

The following plots the Relu function for inputs from -10 to 10.

In [5]:
y = rm.relu(x)  # in a Sequential model, use rm.Relu()
plt.grid()
plt.plot(x,y)
plt.xlabel('input')
plt.ylabel('output')
plt.show()
[Plot: output of the Relu function for inputs from -10 to 10]
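
Because Relu is piecewise linear, its derivative takes only the constant values 0 and 1. The quick check below (taking the derivative at exactly x = 0 to be 0, a common convention rather than something stated in this tutorial) shows this using the same x array:

relu_grad = np.where(x > 0, 1.0, 0.0)   # derivative of max(x, 0): 1 for positive inputs, 0 otherwise
print(np.unique(relu_grad))             # only the two constant values 0. and 1. appear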

By using the activation function shown above, we can overcome both problems 1) and 2) of the Sigmoid. Since Nair and Hinton demonstrated the effectiveness of Relu, using it as an activation function has become standard practice. However, Relu gradients are 0 when inputs are below 0, so learning stops for those units. To overcome this problem, activation functions such as Leaky Relu, Elu, and Selu have been proposed; a minimal sketch of one of them follows.
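
As one example, a Leaky Relu can be sketched directly in NumPy (the negative-side slope of 0.01 is an assumed, commonly used value and is not taken from this tutorial); it keeps a small non-zero gradient for negative inputs, so learning does not stop there:

leaky = np.where(x > 0, x, 0.01 * x)   # small slope for negative inputs instead of a flat 0
plt.grid()
plt.plot(x, leaky)
plt.xlabel('input')
plt.ylabel('output')
plt.show()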

Conclusion

As shown above, we have introduced and described the features of Sigmoid, Tanh, and Relu and the effects each has on back-propagation. The features discussed above are summarized in the table below.

Function | Output range | Problem 1 (overly large weight updates) | Problem 2 (vanishing gradient)
Sigmoid | 0 to 1 | Not solved | Not solved
Tanh | -1 to 1 | Solved (output centered at 0) | Not solved
Relu | 0 to ∞ | Solved | Solved (but the gradient is 0 for negative inputs)

Furthermore, many new activation functions have been proposed recently.

We hope this tutorial has given you a clear view of which activation function to choose and the reasons behind that choice.