Convolutional Neural Network (CNN)

In this chapter, we will introduce the convolutional neural network (CNN), which is mainly used in computer vision tasks.

A CNN is generally composed of convolutional layers and pooling layers, and this chapter mainly discusses these two kinds of layers. Because the explanation of the convolutional layer involves some mathematical notation, you can skip those parts if you are not interested in or familiar with it.


First, we explain convolution before explaining the convolutional layer. Convolution is a mathematical operation on two functions and can be represented in the form

(f*g)(t)=\int_{-\infty}^{\infty} f(\tau)g(t-\tau)\,d\tau

However, because images and filters are discrete rather than continuous, we need to convert this form to discrete notation.

(f*g)(t)=\sum_{\tau} f(\tau)g(t-\tau)
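
To make the discrete form concrete, the following is a minimal sketch in plain Python. The function name convolve_1d and the sample values are assumptions chosen for illustration; both sequences are treated as zero outside their index range.

```python
def convolve_1d(f, g):
    """Full discrete convolution: (f*g)(t) = sum over tau of f[tau] * g[t - tau]."""
    out = [0] * (len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            # g is treated as zero outside its index range.
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out


signal = [1, 2, 3]
kernel = [0, 1, 0.5]
result = convolve_1d(signal, kernel)
print(result)  # [0, 1, 2.5, 4.0, 1.5]
```

The output has len(f) + len(g) - 1 entries because every overlap between the two sequences contributes one value.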

Convolutional Layer

The convolutional layer has two learnable parameters, the weight and the bias; the weight is called the kernel (or filter). Each kernel is convolved across the input image. Let us assume that we have a 2-D image I and a kernel K; then the kernel is convolved as follows.

(I*K)(i, j) = \sum_{m}\sum_{n}I(m, n)K(i-m, j-n)

Throughout training, the convolutional layer learns kernels that activate when the layer detects specific features in the image. Thus, the layer can compress the given image and extract features from it.

The figure below represents the process of the convolutional layer.

Figure: Convolutional layer
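
The 2-D formula can be sketched in plain Python as follows. This is an illustrative sketch, not ReNom's implementation: the function name convolve_2d and the sample image and kernel values are assumptions, and only positions where the kernel fits entirely inside the image are computed.

```python
def convolve_2d(image, kernel):
    """'Valid' 2-D convolution: the kernel is flipped, then slid over the image."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    # Flipping both axes turns the sliding dot product into a true convolution.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + a][j + b] * flipped[a][b]
                for a in range(kh)
                for b in range(kw)
            )
    return out


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
# A horizontal difference kernel (hypothetical values, chosen for easy checking).
kernel = [
    [0, 0, 0],
    [1, 0, -1],
    [0, 0, 0],
]
feature_map = convolve_2d(image, kernel)
print(feature_map)  # [[2, -4], [2, -6]]
```

Many deep learning libraries skip the flip and compute a cross-correlation instead; because the kernel is learned, the two choices are equivalent in practice.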

Pooling Layer

Like the convolutional layer, the pooling layer has a small window (kernel). Applying this window across the image, the pooling layer computes a summary statistic over each window. The output shape of a pooling layer is computed in the same way as for a convolutional layer.

There are two commonly used pooling layers: the average pooling layer and the max pooling layer. We will introduce both below.

Max Pooling

The max pooling layer takes the maximum value within each window applied to the image.

Figure: Max pooling
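
As a small illustration of max pooling, the sketch below applies a 2 × 2 window with a stride of 2 to a 4 × 4 image. The function name max_pool_2d and the sample values are assumptions for illustration only.

```python
def max_pool_2d(image, size=2, stride=2):
    """Take the maximum of each size x size window, moving by `stride` pixels."""
    oh = (len(image) - size) // stride + 1
    ow = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(ow)
        ]
        for i in range(oh)
    ]


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool_2d(image)
print(pooled)  # [[6, 4], [7, 9]]
```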

Average Pooling

The average pooling layer takes the average of the pixels within each window applied to the image. The following figure represents the process of the average pooling layer.

Figure: Average pooling
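
Average pooling can be sketched the same way, replacing the maximum with the mean of each window. The function name average_pool_2d and the sample values are assumptions for illustration only.

```python
def average_pool_2d(image, size=2, stride=2):
    """Take the mean of each size x size window, moving by `stride` pixels."""
    oh = (len(image) - size) // stride + 1
    ow = (len(image[0]) - size) // stride + 1
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            window = [
                image[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
averaged = average_pool_2d(image)
print(averaged)  # [[3.75, 1.75], [2.5, 6.0]]
```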

Output Shape

The output shape is computed in the same way for the convolutional layer and the pooling layer. It depends on the kernel size, the padding, and the stride. Without padding and with a stride of 1, the output shape after the convolutional layer can be represented in the form

(W-2\lfloor H/2 \rfloor) \times (W-2\lfloor H/2 \rfloor)

where W is the length of an image, H is the length of a kernel, and \lfloor \cdot \rfloor denotes rounding down. This form assumes an odd kernel length; in general the output length is W-H+1.

In other words, if the length of the image is 10 and the length of the kernel is 3, the output shape will be (10-2\lfloor 3/2 \rfloor) \times (10-2\lfloor 3/2 \rfloor), thus 8 × 8. However, we often want the output shape to be the same as the input shape. To achieve this, the padding technique can be used.


Padding is a technique that fills values around the border of the image. Usually zeros are used, which is called zero-padding. The following figure represents the padding technique. By padding \lfloor H/2 \rfloor pixels on each side (for an odd kernel length H and a stride of 1), the output shape becomes the same as the input shape.

Figure: Zero padding
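
Zero-padding itself is a simple operation, as the following sketch shows. The function name zero_pad_2d and the sample image are assumptions for illustration only.

```python
def zero_pad_2d(image, pad=1):
    """Surround a 2-D image (list of rows) with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # left and right borders
    padded.extend([0] * width for _ in range(pad))      # bottom border
    return padded


image = [
    [1, 2],
    [3, 4],
]
padded = zero_pad_2d(image, pad=1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```

A 2 × 2 image padded by 1 pixel becomes 4 × 4, so a 3 × 3 kernel with a stride of 1 would again produce a 2 × 2 output.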


Another frequently used technique is the stride. The kernel usually moves 1 pixel at a time vertically and horizontally across the image; by setting the stride parameter to a value greater than 1, the kernel moves by that many pixels instead. For example, if we set the stride to 2, the kernel moves 2 pixels at a time vertically and horizontally. Thus, the output will be about half the size of the input in each dimension.

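Therefore, with the padding described above and a stride of S, the output shape can be represented in the form

(\lfloor (W-1)/S \rfloor + 1) \times (\lfloor (W-1)/S \rfloor + 1)

More generally, with P pixels of padding on each side, the output length is \lfloor (W+2P-H)/S \rfloor + 1.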
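
The output length can be checked with a short helper based on the standard formula floor((W + 2P - H) / S) + 1, where P is the number of padded pixels on each side. The symbol P and the function name conv_output_length are assumptions introduced here for illustration.

```python
def conv_output_length(w, h, padding=0, stride=1):
    """Output length for input length w and kernel length h (standard formula)."""
    return (w + 2 * padding - h) // stride + 1


# No padding, stride 1: the output shrinks by h - 1.
print(conv_output_length(10, 3))                       # 8
# One pixel of zero padding keeps the length for a 3-wide kernel.
print(conv_output_length(10, 3, padding=1))            # 10
# Adding a stride of 2 roughly halves the length: floor((10 - 1) / 2) + 1.
print(conv_output_length(10, 3, padding=1, stride=2))  # 5
# The same formula applies to pooling: a 2-wide window with stride 2 on 28 pixels.
print(conv_output_length(28, 2, stride=2))             # 14
```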

The Convolutional Layer in ReNom

Now that we have explained the theory of the convolutional layer, we will explain how to use the convolutional and pooling layers in ReNom. In ReNom, the Conv2d class takes the arguments channel, filter, padding, and stride. Likewise, MaxPool2d (and AveragePool2d) takes the arguments filter, padding, and stride. The channel argument determines how many kernels are used. The filter argument sets the size of the kernel. The padding and stride arguments are as explained above. We will demonstrate the usage of the convolutional layer on a digit classification task.

Required Libraries

  • scikit-learn 0.18.2
  • matplotlib 2.0.2
  • numpy 1.12.1
  • tqdm 4.15.0
In [1]:
import renom as rm
from renom.cuda.cuda import set_cuda_active
import numpy as np
from sklearn.datasets import fetch_mldata
from tqdm import tqdm
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

GPU activation

If you want to speed up the training of your model, you need to activate the GPU on your machine. The GPU can be activated by calling the set_cuda_active function implemented in ReNom.

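In [2]:
# Enable GPU computation (this call is missing from the source cell; inferred from the import above).
set_cuda_active(True)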

Fetching Data

We will use the MNIST dataset, which consists of images of handwritten digits. Executing the following program downloads the dataset.

In [3]:
mnist = fetch_mldata('MNIST original', data_home='./')

Model Definition

Here, we will define the convolutional neural network. Because the images in the dataset are not complicated, we do not need to define a complex model. We thus define only two convolutional layers, two pooling layers, and fully connected layers. We can implement the model easily by using the Sequential class in ReNom.

In [4]:
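cnn = rm.Sequential([
    rm.Conv2d(channel=32, filter=3, padding=1),
    rm.Conv2d(channel=64, filter=3, padding=1),
    rm.MaxPool2d(filter=2, stride=2),
    # The rest of this cell is missing from the source; the layers below are an
    # assumed minimal tail (flatten, then a 10-class output layer).
    rm.Flatten(),
    rm.Dense(10),
])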

Data Conversion

We will split the dataset into two groups, a training dataset and a validation dataset, to find an optimal model. Moreover, because the target data has to be one-hot vectors, we need to convert the target data.

In [5]:
data = mnist['data']
targets = mnist['target']
# Use the first 80% of the samples for training and the rest for validation.
train_num = int(0.8 * len(data))
# Reshape flat 784-pixel rows to (N, 1, 28, 28): one channel, 28x28 pixels.
train_data = np.expand_dims(data[:train_num].reshape(train_num, 28, 28), axis=1)
test_data = np.expand_dims(data[train_num:].reshape(len(data) - train_num, 28, 28), axis=1)
# Fit the one-hot encoder once so both splits share the same class order.
binarizer = LabelBinarizer().fit(targets)
train_targets = binarizer.transform(targets[:train_num]).astype(np.float32)
test_targets = binarizer.transform(targets[train_num:]).astype(np.float32)
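
For readers unfamiliar with one-hot vectors, the conversion performed by the label binarizer can be sketched in plain Python. The function name one_hot and the sample labels are assumptions for illustration only.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors with a single 1.0 at the label index."""
    vectors = []
    for label in labels:
        vec = [0.0] * num_classes
        vec[int(label)] = 1.0
        vectors.append(vec)
    return vectors


encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

For the digit data, num_classes would be 10, so each target becomes a vector of length 10.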


In [6]:
batch_size = 64
epochs = 10
optimizer = rm.Sgd(lr=0.001)
N = train_num
for epoch in range(epochs):
    perm = np.random.permutation(N)
    loss = 0
    test_loss = 0
    bar = tqdm(range(N//batch_size))
    for j in range(N//batch_size):
        train_batch = train_data[perm[j*batch_size:(j+1)*batch_size]]
        train_targets_batch = train_targets[perm[j*batch_size:(j+1)*batch_size]]
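        with cnn.train():
            l = rm.softmax_cross_entropy(cnn(train_batch), train_targets_batch)
        # Backpropagate and apply the optimizer step (missing from the source cell; assumed ReNom usage).
        l.grad().update(optimizer)
        bar.set_description("epoch {:03d} train loss:{:6.4f} ".format(epoch, float(l.as_ndarray())))
        bar.update(1)
        loss += l.as_ndarray()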
    for k in range(len(test_data)//batch_size):
        test_batch = test_data[k*batch_size:(k+1)*batch_size]
        test_targets_batch = test_targets[k*batch_size:(k+1)*batch_size]
        test_l = rm.softmax_cross_entropy(cnn(test_batch), test_targets_batch)
        test_loss += test_l.as_ndarray()
    bar.set_description("epoch {:03d} avg loss:{:6.4f} val loss:{:6.4f}".format(epoch, float((loss/(j+1))), float((test_loss/(k+1)))))
epoch 000 avg loss:0.4480 val loss:0.2918: 100%|██████████| 875/875 [00:21<00:00, 41.42it/s]
epoch 001 avg loss:0.1526 val loss:0.2224: 100%|██████████| 875/875 [00:18<00:00, 46.12it/s]
epoch 002 avg loss:0.1146 val loss:0.1713: 100%|██████████| 875/875 [00:19<00:00, 45.21it/s]
epoch 003 avg loss:0.0903 val loss:0.1481: 100%|██████████| 875/875 [00:19<00:00, 44.92it/s]
epoch 004 avg loss:0.0798 val loss:0.1379: 100%|██████████| 875/875 [00:19<00:00, 45.92it/s]
epoch 005 avg loss:0.0728 val loss:0.1431: 100%|██████████| 875/875 [00:19<00:00,  9.73it/s]
epoch 006 avg loss:0.0637 val loss:0.1282: 100%|██████████| 875/875 [00:19<00:00, 44.80it/s]
epoch 007 avg loss:0.0573 val loss:0.1157: 100%|██████████| 875/875 [00:19<00:00, 45.77it/s]
epoch 008 avg loss:0.0531 val loss:0.1165: 100%|██████████| 875/875 [00:19<00:00, 45.72it/s]
epoch 009 avg loss:0.0489 val loss:0.1192: 100%|██████████| 875/875 [00:19<00:00, 44.95it/s]

Kernel Visualization

As explained above, each kernel in a convolutional layer is convolved across the image. Now, let us show the kernels (the weights of the first convolutional layer) below.

In [7]:
W = cnn._layers[0].params.w
nb_filter, nb_channel, h, w = W.shape
for i in range(nb_filter):
    im = W[i, 0]
    scalar = MinMaxScaler(feature_range=(0, 255))
    im = scalar.fit_transform(im)
    plt.subplot(4, 8, i+1)
    plt.imshow(im, cmap='gray')


Moreover, we compare the original image with the feature maps produced by convolving the kernels of the first convolutional layer across it.

In [18]:
print('Original Image')
x = test_data[:1]
t = cnn._layers[0](x)
nb_filter, nb_channel, h, w = t.shape
plt.imshow(x[0][0], cmap='gray')
Original Image
In [19]:
print('Feature maps after the first convolutional layer')
for i in range(nb_channel):
    im = t[0, i, :, :]
    scalar = MinMaxScaler(feature_range=(0, 255))
    im = scalar.fit_transform(im)
    plt.subplot(4, 8, i+1)
    plt.imshow(im, cmap='gray')
Feature maps after the first convolutional layer