Application of Entity Embedding Layer

How to apply neural networks with entity embedding layers, and examples of their use.

We introduced how to build an entity embedding layer in ReNom in another tutorial ( http://www.renom.jp/notebooks/basic/embedding/notebook.html ). Entity embedding is one way to make categorical variables easy for a neural network to handle; one may say that an entity embedding layer reduces the dimensionality of the input categorical variables, much like PCA. In practice, the embedding method has been widely applied in natural language processing, most famously in Word2vec. In this tutorial, we describe other applications of entity embedding[1]. When you use categorical variables in a neural network, you have to convert the raw data into a form that is easy to handle in ReNom. We also have tutorials for this preprocessing (" http://www.renom.jp/notebooks/preprocessing/onehot/notebook.html ", " http://www.renom.jp/notebooks/preprocessing/embedding/notebook.html "), so we recommend reading those first.
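To fix the idea, an embedding layer is essentially a trainable lookup table that maps each category index to a low-dimensional dense vector. Here is a minimal sketch (not an executed cell of this notebook), assuming the rm.Embedding and rm.Node API used later in this tutorial, with the default weight initializer:

import numpy as np
import renom as rm

# A toy categorical variable with 10 categories, embedded into 2 dimensions.
embedding = rm.Embedding(2, 10)
indices = rm.Node(np.arange(10)).reshape(-1, 1)  # one row per category index
vectors = embedding(indices)                     # one dense 2-dimensional vector per category
print(vectors.shape)  # expected: (10, 2)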

In this tutorial, we first generate the data and train a neural network with embedding layers, following the modeling tutorial. Then, using the trained model, we introduce some examples of its applications.

Required libraries

  • Python 3.5.2
  • Numpy 1.13.3
  • ReNom 2.3.1
  • Matplotlib 2.1.0
  • Scikit-learn 0.19.1
In [1]:
%matplotlib inline
from __future__ import division, print_function
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA

import renom as rm
from renom.optimizer import Adam
from renom.utility.initializer import Uniform

Data generation

This time, we use artificially generated data to show how to use entity embedding layers. We assume that the target variable \mathbf{Y} is built from several categorical variables; in other words, the target variable is a linear combination of the indicator variables obtained by one-hot encoding each categorical variable. Let X_i denote each categorical variable for i=0,...,K-1 , where K is the number of categorical variables. X_i takes values in \{ 0,...,N_i-1 \} , where N_i is the number of categories of X_i . We assume each X_i is uniformly and independently distributed on \{ 0,...,N_i-1 \} : every value has equal probability and the observations are independent. Note that the numerical value of X_i says nothing about how strongly the corresponding indicator variable affects the target variable. For simplicity, we consider the case of K=3 and N_0 = N_1 = N_2 = 10 . Let \mathbf{H}_{ij} denote the indicator variable of the value j of the i -th categorical variable, and let \beta_{ij} denote the coefficient of \mathbf{H}_{ij} . The generating equation is

\begin{equation*} \mathbf{Y} = \beta_{-1} + \sum_{j=0}^9\beta_{0j}\mathbf{H}_{0j} + \sum_{j=0}^9\beta_{1j}\mathbf{H}_{1j} + \sum_{j=0}^9\beta_{2j}\mathbf{H}_{2j} + \varepsilon, \end{equation*}

where \beta_{-1} is the intercept and \varepsilon is stochastic noise (standard Gaussian in the code below). Since each \mathbf{H}_{ij} is an indicator variable, the double sum selects exactly one coefficient per categorical variable for each sample. First, we define a function to generate the data.

In [2]:
def generate_data(sample_size, categorical_dim, coef=None, intercept=0, random=0):
    np.random.seed(random)
    # Draw each categorical variable uniformly and stack the coefficient vector of its categories.
    for i in range(len(categorical_dim)):
        dim = categorical_dim[i]
        x = np.random.randint(size=(sample_size, 1), low=0, high=dim)
        b = np.random.randint(size=(dim, 1), low=-30, high=30)
        if i == 0:
            X = x
            B = b
        else:
            X = np.concatenate((X, x), axis=1)
            B = np.concatenate((B, b), axis=0)

    # Optionally override the randomly drawn coefficients.
    if coef is not None:
        B = np.asarray(coef).reshape(-1,1)

    # Linear combination of the one-hot encoded variables plus standard Gaussian noise.
    Y = intercept + np.dot(OneHotEncoder().fit_transform(X).toarray(), B) + np.random.randn(sample_size, 1)

    # Rescale the target to [0, 1], the range of the network's sigmoid output.
    max_Y = np.max(Y)
    min_Y = np.min(Y)

    Y = ((Y - min_Y) / (max_Y - min_Y)).astype(np.float32)
    return (X, Y), (max_Y, min_Y), B
In [3]:
sample_size = 10000
categorical_dim = [10,10,10]
reduced_dim = [2,2,2]

# The coefficients are drawn randomly inside generate_data (the coef argument is left
# at its default) and are returned as coef together with the data and the scaling range.
(data_X, data_y), (max_y, min_y), coef = generate_data(sample_size, categorical_dim)

print("Mean of |coefficients| of X1 = {}"
      .format(np.mean(np.abs(coef[0:np.sum(categorical_dim[:1])]))))
print("Mean of |coefficients| of X2 = {}"
      .format(np.mean(np.abs(coef[np.sum(categorical_dim[:1]):np.sum(categorical_dim[:2])]))))
print("Mean of |coefficients| of X3 = {}"
      .format(np.mean(np.abs(coef[np.sum(categorical_dim[:2]):np.sum(categorical_dim[:3])]))))

Mean of |coefficients| of X1 = 12.0
Mean of |coefficients| of X2 = 17.2
Mean of |coefficients| of X3 = 10.2
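As a quick sanity check (an illustrative sketch, not one of the original cells), we can reconstruct the unscaled target of a single sample directly from the returned coefficients. Because each \mathbf{H}_{ij} is an indicator, the sums in the generating equation reduce to one coefficient lookup per categorical variable, and the remaining difference is the Gaussian noise term:

row = 0
# coef stacks the 10 coefficients of each variable in order, so the indicator sums
# reduce to one coefficient lookup per variable.
manual = coef[data_X[row, 0], 0] + coef[10 + data_X[row, 1], 0] + coef[20 + data_X[row, 2], 0]
# Undo the [0, 1] rescaling applied inside generate_data before comparing.
unscaled = data_y[row, 0] * (max_y - min_y) + min_y
print(unscaled - manual)  # approximately a draw from the standard normal noise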

Neural network with entity embedding layers

We split the input data into training data and test data.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.1, random_state=1)

Model definition

In [5]:
class NN_with_EE(rm.Model):

    def __init__(self):
        super(NN_with_EE, self).__init__()

        # Define entity embedding layers
        self._embedding_0 = rm.Embedding(reduced_dim[0], categorical_dim[0], initializer=Uniform(min=-0.05, max=0.05))
        self._embedding_1 = rm.Embedding(reduced_dim[1], categorical_dim[1], initializer=Uniform(min=-0.05, max=0.05))
        self._embedding_2 = rm.Embedding(reduced_dim[2], categorical_dim[2], initializer=Uniform(min=-0.05, max=0.05))

        # Define fully connected layers
        self._layer1 = rm.Dense(50, initializer=Uniform(min=-0.05, max=0.05))
        self._layer2 = rm.Dense(20, initializer=Uniform(min=-0.05, max=0.05))
        self._layer3 = rm.Dense(1)


    # Define forward calculation.
    def forward(self, x):
        _x = rm.Node(x)
        # Embed each categorical column separately.
        v1 = self._embedding_0(_x[:, 0].reshape(-1,1))
        v2 = self._embedding_1(_x[:, 1].reshape(-1,1))
        v3 = self._embedding_2(_x[:, 2].reshape(-1,1))

        # Concatenate the embedded vectors into a single feature vector.
        z = v1

        for v in [v2, v3]:
            z = rm.concat(z, v)

        # Fully connected layers; the sigmoid output matches the target rescaled to [0, 1].
        return rm.sigmoid(self._layer3(rm.relu(self._layer2(rm.relu(self._layer1(z))))))
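Each embedding layer maps a column of category indices to a 2-dimensional vector, so the concatenated input to the first dense layer has 2 + 2 + 2 = 6 features. A quick shape check on an untrained instance (an illustrative sketch, assuming the class defined above):

check_model = NN_with_EE()
out = check_model(X_train[:5])
print(out.shape)  # expected: (5, 1), one sigmoid output per sample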

Training loop

In [6]:
optimiser = Adam()  # Adam optimizer imported from renom.optimizer above

model = NN_with_EE()

epoch = 100
N = len(X_train)
batch_size = 128
train_curve = []
test_curve = []

start = time.time()
for i in range(1, epoch+1):
    perm = np.random.permutation(N)
    total_loss = 0
    # Mini-batch gradient descent over one epoch
    for j in range(N//batch_size):
        index = perm[j*batch_size:(j+1)*batch_size]
        train_batch_X = X_train[index]
        train_batch_y = y_train[index]
        with model.train():
            z = model(train_batch_X)
            loss = rm.mean_squared_error(z, train_batch_y.reshape(-1,1))
        grad = loss.grad()
        grad.update(optimiser)
        total_loss += rm.mean_squared_error(z, train_batch_y).as_ndarray()
    train_curve.append(total_loss/(N//batch_size))

    # Evaluate on the test data after every epoch
    y_test_pred = model(X_test)
    test_loss = rm.mean_squared_error(y_test_pred, y_test).as_ndarray()
    test_curve.append(test_loss)
    elapsed_time = time.time() - start

    if i%10 == 0:
        print("Epoch %02d - Train loss:%f - Test loss:%f - Elapsed time:%f"
              %(i, train_curve[-1], test_curve[-1], elapsed_time))

Epoch 10 - Train loss:0.000104 - Test loss:0.000104 - Elapsed time:2.511724
Epoch 20 - Train loss:0.000062 - Test loss:0.000062 - Elapsed time:5.062957
Epoch 30 - Train loss:0.000055 - Test loss:0.000056 - Elapsed time:7.584701
Epoch 40 - Train loss:0.000044 - Test loss:0.000041 - Elapsed time:10.109916
Epoch 50 - Train loss:0.000036 - Test loss:0.000037 - Elapsed time:12.658092
Epoch 60 - Train loss:0.000035 - Test loss:0.000034 - Elapsed time:15.182339
Epoch 70 - Train loss:0.000036 - Test loss:0.000036 - Elapsed time:17.732062
Epoch 80 - Train loss:0.000034 - Test loss:0.000037 - Elapsed time:20.301953
Epoch 90 - Train loss:0.000035 - Test loss:0.000035 - Elapsed time:22.780132
Epoch 100 - Train loss:0.000035 - Test loss:0.000035 - Elapsed time:25.319993

Usage of entity embedding layers

One advantage of entity embedding layers is that they reduce the dimensionality of each categorical variable, much like PCA. We show two examples of this advantage below.

Example: usage with other machine learning methods

First, we introduce combining embedding layers with other machine learning methods such as k-nearest neighbors and random forests. Neural networks are very useful for prediction, but they have difficulty handling structured data in which a physical or social structure hides in the background, and their results are hard to interpret in terms of the relationship between the target variable and the input variables. Some other machine learning methods, however, are good at exactly what is difficult for a neural network. Therefore, by using the outputs of the embedding layers as the inputs to those methods, we may both improve performance and uncover relationships in the data.
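For example, because an embedding layer is a lookup table from category indices to dense vectors, we can read off the learned representation of every category by feeding the full index range through a trained layer. A minimal sketch (not one of the original cells), using the trained model and the same calling convention as the cell below:

# Learned 2-dimensional representation of every category of the first variable.
all_categories = rm.Node(np.arange(categorical_dim[0])).reshape(-1, 1)
embedding_table = model._embedding_0(all_categories)
print(embedding_table.shape)  # expected: (10, 2)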

Here, we calculate the outputs of the embedding layers for the training data and the test data; these will serve as the inputs to the other methods.

In [7]:
ls_layer = [model._embedding_0, model._embedding_1, model._embedding_2]

train_embed = []
test_embed = []


# Feed each categorical column through its trained embedding layer.
for i in range(len(ls_layer)):
    train_embed.append(ls_layer[i](rm.Node(X_train[:, i]).reshape(-1,1)))
    test_embed.append(ls_layer[i](rm.Node(X_test[:, i]).reshape(-1,1)))

X_train_embed = np.concatenate(train_embed, axis=1)
X_test_embed = np.concatenate(test_embed, axis=1)

We use k-nearest neighbors regression and random forest regression as examples. As is usually the case in analyses that use random forests, we can also compute feature importances with scikit-learn.

In [8]:
# K-nearest neighbors regression on the embedded features
knn_embed = KNeighborsRegressor(n_neighbors=100, weights='distance', p=1)
knn_embed.fit(X_train_embed, y_train)
y_knn_embed_predicted = knn_embed.predict(X_test_embed).reshape(-1,1)
print("MSE of K Nearest Neighbor:{:.6f}".format(float(rm.mean_squared_error(y_knn_embed_predicted, y_test))))


# Random forest regression on the embedded features
rf_embed = RandomForestRegressor(n_estimators=100, max_depth=50, min_samples_split=2, min_samples_leaf=1)
rf_embed.fit(X_train_embed, y_train.ravel())
y_rf_embed_predicted = rf_embed.predict(X_test_embed).reshape(-1,1)
print("MSE of Random Forest:{:.6f}".format(float(rm.mean_squared_error(y_rf_embed_predicted, y_test))))

print("Sum of feature importance of X1 = {:.3f}"
      .format(np.sum(rf_embed.feature_importances_[0:sum(reduced_dim[:1])])))
print("Sum of feature importance of X2 = {:.3f}"
      .format(np.sum(rf_embed.feature_importances_[sum(reduced_dim[:1]):sum(reduced_dim[:2])])))
print("Sum of feature importance of X3 = {:.3f}"
      .format(np.sum(rf_embed.feature_importances_[sum(reduced_dim[:2]):sum(reduced_dim[:3])])))
MSE of K Nearest Neighbor:0.000031
MSE of Random Forest:0.000031
Sum of feature importance of X1 = 0.263
Sum of feature importance of X2 = 0.495
Sum of feature importance of X3 = 0.242

For comparison, the mean of the absolute coefficients of each categorical variable was

In [9]:
print("Mean of |coefficients| of X1 = {}"
      .format(np.mean(np.abs(coef[0:np.sum(categorical_dim[:1])]))))
print("Mean of |coefficients| of X2 = {}"
      .format(np.mean(np.abs(coef[np.sum(categorical_dim[:1]):np.sum(categorical_dim[:2])]))))
print("Mean of |coefficients| of X3 = {}"
      .format(np.mean(np.abs(coef[np.sum(categorical_dim[:2]):np.sum(categorical_dim[:3])]))))
Mean of |coefficients| of X1 = 12.0
Mean of |coefficients| of X2 = 17.2
Mean of |coefficients| of X3 = 10.2

Comparing the two, we find that the relative magnitude of the mean absolute coefficients is partly reflected in the feature importances: X2, which has the largest mean absolute coefficient, also receives the largest summed importance.
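To make this comparison concrete, we can place the share of each variable's mean absolute coefficient next to its summed feature importance. This is an illustrative sketch that uses only the variables already defined above:

# Offsets of each variable's rows in coef and of its columns in the embedded features.
starts_coef = [0, 10, 20]
starts_embed = [0, 2, 4]
mean_abs = np.array([np.mean(np.abs(coef[s:s + d])) for s, d in zip(starts_coef, categorical_dim)])
importance = np.array([np.sum(rf_embed.feature_importances_[s:s + r])
                       for s, r in zip(starts_embed, reduced_dim)])
for name, share, imp in zip(["X1", "X2", "X3"], mean_abs / mean_abs.sum(), importance):
    print("{}: |coef| share = {:.3f}, feature importance = {:.3f}".format(name, share, imp))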

Example: usage in visualization combined with dimensionality reduction methods

Second, we introduce visualization of the embedding space as one way to visualize data with a neural network. We can regard embedding layers as a dimensionality reduction method like PCA, and one advantage of dimensionality reduction is that it lets us visualize the data. However, the dimension of an embedding layer is not always less than four, and we cannot usually visualize a four-dimensional (or higher) space directly, so we may further apply a dimensionality reduction method to the outputs of the embedding layers. Here, we run PCA on the outputs of the embedding layers and on the one-hot encoded data, and we plot the relationship between each first principal component and the coefficients.

In [10]:
# Fit the encoder on the training data and reuse it for the test data.
encoder = OneHotEncoder().fit(X_train)
X_train_one_hot = encoder.transform(X_train).toarray()
X_test_one_hot = encoder.transform(X_test).toarray()
In [11]:
ind = np.random.permutation(len(X_train))[:3000]

ls_name = [range(d) for d in categorical_dim]

ls_title = ["X0", "X1", "X2"]

fig, _figs = plt.subplots(ncols=3, figsize=(15,8))
d = 0
d_ = 0
for i in range(len(categorical_dim)):
    feature_embed = X_train_embed[:, d:(d+reduced_dim[i])]
    coef_ = coef[d_:(d_+categorical_dim[i])]
    dic = {k: v[0] for (k, v) in enumerate(coef_)}
    beta = np.array([dic[c] for c in X_train[ind, i]])
    pca = PCA(n_components=2)
    pca_fit = pca.fit_transform(feature_embed[ind])
    _figs[i].scatter(pca_fit[:, 0], beta, c=X_train[ind, i])
    _figs[i].set_title(ls_title[i])
    _figs[i].set_xlabel('First Principal Component')
    if i == 0:
        _figs[i].set_ylabel('Magnitude of Coefficient')
    for k in range(len(ls_name[i])):
        _figs[i].annotate(ls_name[i][k] ,xy=(pca_fit[X_train[ind, i] == k][0, 0]-0.01,
                                             beta[X_train[ind, i] == k][0]+1), size=15)
    _figs[i].grid(True)
    d += reduced_dim[i]
    d_ += categorical_dim[i]
[Figure: for each categorical variable, a scatter plot of the first principal component of its embedding outputs against the magnitude of the corresponding coefficient]

We see that the points line up roughly along a diagonal in each plot. In other words, the first principal component of the outputs of the embedding layers indicates, to a considerable extent, the magnitude of the coefficient corresponding to each indicator variable. In this simple situation, we can therefore recover from the outputs of the embedding layers how influential each indicator variable's coefficient is in predicting the target variable.

In [12]:
fig, _figs = plt.subplots(ncols=3, figsize=(15,8))
d_ = 0
for i in range(len(categorical_dim)):
    feature_one_hot = X_train_one_hot[:, d_:(d_+categorical_dim[i])]
    coef_ = coef[d_:(d_+categorical_dim[i])]
    dic = {k: v[0] for (k, v) in enumerate(coef_)}
    beta = np.array([dic[c] for c in X_train[ind, i]])
    pca = PCA(n_components=2)
    pca_fit = pca.fit_transform(feature_one_hot[ind])
    _figs[i].scatter(pca_fit[:, 0], beta, c = X_train[ind, i])
    _figs[i].set_title(ls_title[i])
    _figs[i].set_xlabel('First Principal Component')
    if i == 0:
        _figs[i].set_ylabel('Magnitude of Coefficient')
    for k in range(len(ls_name[i])):
        _figs[i].annotate(ls_name[i][k] ,xy=(pca_fit[X_train[ind, i] == k][0, 0],
                                             beta[X_train[ind, i] == k][0]), size=15)
    _figs[i].grid(True)
    d_ += categorical_dim[i]
[Figure: for each categorical variable, a scatter plot of the first principal component of its one-hot encoded data against the magnitude of the corresponding coefficient]

On the other hand, the second set of plots shows that the first principal component of the one-hot encoded data is not related to the magnitude of the coefficients. We note that this result depends on the assumption that the relationship between the target variable and the indicator variables is linear. Nevertheless, the possibility of detecting how influential the categorical variables are by using a neural network is a very useful property of embedding layers.

References

[1] Cheng Guo and Felix Berkhahn. Entity Embeddings of Categorical Variables. CoRR, 2016. https://arxiv.org/abs/1604.06737