Introduction to the Loss Function

Introduction to two basic loss functions and their related activation functions


  1. Introduction
  2. Problem classification and basic loss functions
  3. Combinations of activation and loss functions
  4. How to use basic combinations with ReNom
1. Introduction
A loss function calculates the error between a set of outputs and their labels.
Selecting an appropriate loss function is extremely important, because neural networks differentiate the loss function in order to learn their parameters. In this notebook, we introduce two basic loss functions and their related activation functions.
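To make the role of differentiation concrete, here is a minimal NumPy sketch, independent of ReNom, that fits a single weight by gradient descent on a mean squared error loss. The toy data, the learning rate, and the names `mse_loss` and `mse_grad` are assumptions chosen for illustration.

```python
import numpy as np

# Toy data generated by y = 3x; the model has a single weight w.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def mse_loss(w):
    """Mean squared error between the predictions w*x and the labels y."""
    return np.mean((w * x - y) ** 2)

def mse_grad(w):
    """Derivative of mse_loss with respect to w."""
    return np.mean(2.0 * (w * x - y) * x)

w = 0.0
learning_rate = 0.01
for step in range(200):
    # Move the parameter against the gradient of the loss
    w -= learning_rate * mse_grad(w)

print("w:", w, "loss:", mse_loss(w))
```

The weight moves toward 3 because each step follows the negative derivative of the loss. A network with many parameters is trained on the same principle, with the derivatives computed automatically.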
2. Problem classification and basic loss functions
There are many commonly used loss functions, and sometimes we even define novel loss functions to solve a specific problem.
We’ll now introduce two basic loss functions, cross entropy and mean squared error, and two related activation functions, the sigmoid function and the softmax function.
To choose a loss function, or to find a suitable loss-activation combination, we first have to classify the problem: is it binary classification, multiclass classification, or regression?
Typically, for each of these problem types, we would use a different combination.
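For reference, the four functions named above can be written in a few lines of NumPy. This is a sketch under common textbook definitions, not ReNom's implementation, and the function names are chosen here for illustration. A common pairing is sigmoid with cross entropy for binary classification, softmax with cross entropy for multiclass classification, and a linear output with mean squared error for regression.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map each row of scores to a probability distribution over classes."""
    z = z - np.max(z, axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def binary_cross_entropy(p, y):
    """Cross entropy for predicted probabilities p and 0/1 labels y."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(p, y_onehot):
    """Cross entropy for row-wise probabilities p and one-hot labels."""
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def mean_squared_error(pred, y):
    """Average squared difference between predictions and targets."""
    return np.mean((pred - y) ** 2)

# Binary classification: sigmoid + cross entropy
print(binary_cross_entropy(sigmoid(np.array([2.0, -1.0])), np.array([1.0, 0.0])))
# Multiclass classification: softmax + cross entropy
print(categorical_cross_entropy(softmax(np.array([[1.0, 2.0, 3.0]])),
                                np.array([[0.0, 0.0, 1.0]])))
# Regression: linear output + mean squared error
print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```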
3. Combinations of activation and loss functions
Depending on the problem type, there are several reasons why we might prefer one combination over another.
Cross entropy is usually used for probabilistic output, such as the class probabilities produced by the sigmoid or softmax function.
Mean squared error is usually used for regression.
Using mean squared error together with the softmax (or sigmoid) function is not recommended, as this may lead to very slow learning: the gradient of the loss shrinks toward zero when the activation saturates, even if the prediction is wrong.
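The slow-learning effect can be checked numerically. The sketch below, written in plain NumPy with illustrative names, compares the gradient of each loss with respect to the pre-activation value z for a single sigmoid output. With p = sigmoid(z), the cross entropy gradient is p - y, while the gradient of the squared error (p - y)^2 is 2(p - y)p(1 - p), which carries the extra factor p(1 - p).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(z, y):
    """d/dz of the cross entropy loss applied to sigmoid(z)."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d/dz of the squared error (sigmoid(z) - y)**2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# A confidently wrong prediction: the label is 1 but the score is very negative.
z, y = -8.0, 1.0
g_ce = grad_cross_entropy(z, y)
g_mse = grad_mse(z, y)
print("cross entropy gradient:", g_ce)
print("squared error gradient:", g_mse)
```

For this saturated, wrong prediction the cross entropy gradient stays close to -1, while the squared error gradient is several orders of magnitude smaller, so the parameters barely move.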
4. How to use basic combinations with ReNom
The reference for the example dataset is below:
Lichman, M. (2013). UCI Machine Learning Repository [ ].
Irvine, CA: University of California, School of Information and Computer Science.
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd
import re

import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import renom as rm
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active
# If this is the first time running the example
# and you need to download the data, set this to True
first_time = False

if first_time:
    import os

def load_data(filename):
    df = pd.read_csv(filename, header=None, index_col=None)
    print("the number of {} records:{}".format(filename, len(df.index)))
    df = df.applymap(lambda d:np.nan if d=="?" else d)
    df = df.dropna(axis=0)
    print("the number of {} records after trimming:{}".format(filename, len(df.index)))
    sr_labels = df.iloc[:,-1]
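    # Map the class labels "+" and "-" to 1 and 0; regex=False treats "+" as a literal character
    labels = sr_labels.str.replace("+", "1", regex=False).str.replace("-", "0", regex=False).values.astype(float)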
    data = df.iloc[:,:-1].values.astype(str)
    return data, labels

Identify whether each column is numerical or categorical, and one-hot encode the categorical columns
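As a small illustration of this step, the sketch below one-hot encodes a categorical column with `pd.get_dummies` and joins it with a numerical column. The three toy rows and the names `rows` and `X_toy` are made up for this example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one categorical column and one numerical column, stored as strings
rows = np.array([["a", "30.83"],
                 ["b", "58.67"],
                 ["a", "24.50"]])

# The categorical column becomes one 0/1 column per distinct value
categorical = pd.get_dummies(rows[:, 0]).values.astype(float)

# The numerical column is converted to float and kept as a single column
continuous = rows[:, 1].reshape(-1, 1).astype(float)

X_toy = np.concatenate((categorical, continuous), axis=1)
print(X_toy)
```

The two distinct categories produce two indicator columns, so the two input columns become three feature columns.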

In [2]:
# Raw string so that \d and \Z reach the regex engine unchanged
pattern_continuous = re.compile(r"^\d+\.?\d*\Z")
def onehot_vectorize(data):
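    for i in range(data.shape[1]):
        # Decide the column type from the value in the first row
        is_continuous = bool(pattern_continuous.match(data[0][i]))
        if is_continuous and i==0:
            # Keep X two-dimensional so later columns can be concatenated along axis=1
            X = data[:,i].reshape(-1,1).astype(float)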
        elif not is_continuous and i==0:
            X = pd.get_dummies(data[:,i]).values.astype(float)
        elif is_continuous and i!=0:
            X = np.concatenate((X, data[:,i].reshape(-1,1).astype(float)), axis=1)
        elif not is_continuous and i!=0:
            X = np.concatenate((X, pd.get_dummies(data[:,i]).values.astype(float)), axis=1)
    return X

data, y = load_data("")
X = onehot_vectorize(data)
print("X:{} y:{}".format(X.shape, y.shape))
the number of records:690
the number of records after trimming:653
X:(653, 46) y:(653,)

Data splitting and model definition

In [3]:
indices = np.arange(len(X))
X_train, X_test, y_train, y_test, indices_train, indices_test = \
    train_test_split(X, y, indices, test_size=0.2)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))

sequential = rm.Sequential([
X_train:(522, 46) y_train:(522, 1) X_test:(131, 46) y_test:(131, 1)

Learning loop for sigmoid activation and cross entropy

First, we show how to use the sigmoid-cross entropy combination. The following line applies the sigmoid activation to the network output and computes the cross entropy loss in a single call.

l = rm.sigmoid_cross_entropy(sequential(train_batch), response_batch)
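To show what a combined sigmoid and cross entropy computes, here is a NumPy sketch that works directly on the raw network outputs (logits). It is not ReNom's implementation. The function name is hypothetical, and averaging over the batch is an assumption about the reduction. The formula max(z, 0) - z*y + log(1 + exp(-|z|)) is algebraically equal to the cross entropy of sigmoid(z), but it avoids taking the logarithm of values that round to zero.

```python
import numpy as np

def sigmoid_cross_entropy_from_logits(z, y):
    """Numerically stable cross entropy of sigmoid(z) against 0/1 labels y.

    Assumption: the loss is averaged over all examples in the batch.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    per_example = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_example.mean()

logits = np.array([[2.0], [-1.0], [0.0]])
labels = np.array([[1.0], [0.0], [1.0]])
stable = sigmoid_cross_entropy_from_logits(logits, labels)

# Naive version for comparison: apply the sigmoid first, then the cross entropy
p = 1.0 / (1.0 + np.exp(-logits))
naive = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

print(stable, naive)
```

Both values agree for moderate logits. The stable form also stays finite for very large logits, where the naive version would evaluate log(0).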
In [4]:
batch_size = 32
epoch = 50
N = len(X_train)
optimizer = Sgd(lr=0.001)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        with sequential.train():
            l = rm.sigmoid_cross_entropy(sequential(train_batch), response_batch)

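        grad = l.grad()
        # Apply the gradients with the optimizer; without this step the parameters are never updated
        grad.update(optimizer)
        loss += l.as_ndarray()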
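    train_loss = loss / (N // batch_size)

    test_loss = rm.sigmoid_cross_entropy(sequential(X_test), y_test).as_ndarray()
    # Record both losses so the learning curves can be inspected later
    learning_curve.append(train_loss)
    test_learning_curve.append(test_loss)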
    if i%10 == 0:
        print("epoch :{}, train_loss:{}, test_loss:{}".format(i, train_loss, test_loss))

predictions = rm.sigmoid(sequential(X_test)).as_ndarray()
# Threshold the predicted probabilities at 0.5 to obtain class labels
pred = (predictions > 0.5).astype(int).reshape(-1,1)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, target_names=["-","+"]))
epoch :0, train_loss:3.272980712354183, test_loss:0.7651634812355042
epoch :10, train_loss:0.6081356462091208, test_loss:0.7066062092781067
epoch :20, train_loss:0.6118309292942286, test_loss:0.6797510385513306
epoch :30, train_loss:0.6028882898390293, test_loss:0.6095903515815735
epoch :40, train_loss:0.6093996576964855, test_loss:0.6735255718231201
[[52 10]
 [32 37]]
             precision    recall  f1-score   support

          -       0.62      0.84      0.71        62
          +       0.79      0.54      0.64        69

avg / total       0.71      0.68      0.67       131
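
The per-class figures in the report can be reproduced from the confusion matrix printed above. The sketch below uses plain NumPy and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order "-", "+".

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[52, 10],
               [32, 37]])

true_pos = np.diag(cm)                 # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)  # divide by the number predicted as each class
recall = true_pos / cm.sum(axis=1)     # divide by the number truly in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(accuracy, 2))
```

The rounded values match the precision, recall, and f1-score columns of the report. The low recall for "+" reflects the 32 positive samples that were predicted as negative.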