Preprocessing for Embedding Layers

We will explain the preprocessing needed to use embedding layers on Renom.

It is well known that neural networks can approximate any continuous function. On the other hand, they are considered less suitable for approximating discontinuous functions because of the difficulty of computing gradients. Neural networks may therefore be less effective than other machine learning methods on structured data, where some of the features are often categorical variables. To deal with categorical variables, we usually use one-hot encoding, which converts a categorical variable into a vector whose components indicate the corresponding category. After one-hot encoding, the dimension of the inputs often becomes large. However, it has recently been shown that entity embedding is effective for training machine learning methods, including neural networks, in such situations [1]. Like PCA, entity embedding reduces the dimension of categorical variables by converting each one-hot vector into a dense, low-dimensional vector. Entity embedding therefore makes it easier for machine learning methods, including neural networks, to handle categorical variables and improves their performance.
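For instance, a country column with 38 unique values becomes a 38-dimensional one-hot vector, whereas an embedding layer consumes a single integer code per row. A minimal sketch of the two representations with pandas (the toy values here are hypothetical, for illustration only):

import pandas as pd

# A toy categorical column (hypothetical values).
s = pd.Series(["UK", "France", "Iceland", "UK", "France"])

# One-hot encoding: one column per category, mostly zeros.
one_hot = pd.get_dummies(s)          # shape (5, 3)

# Integer codes: a single column, suitable as input to an embedding layer.
codes, categories = pd.factorize(s)  # codes = array([0, 1, 2, 0, 1])

print(one_hot.shape, codes.shape)    # (5, 3) (5,)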

In this tutorial, we will introduce how to preprocess data for using embedding layers on Renom. Categorical variables on Renom are defined as non-negative integers, starting from 0 with increments of 1, and these are the values we must input into the embedding layers. However, as is often the case with real data, categorical variables come as strings, datetimes, and so on, rather than integers. Therefore, we have to convert those variables into categorical codes suitable for the embedding layers of Renom.
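Conceptually, an embedding layer is a trainable lookup table: each integer code selects one row of a weight matrix, which is why the codes must run from 0 upward. The NumPy sketch below illustrates that lookup; the shapes and values are illustrative only, not Renom's actual implementation:

import numpy as np

num_categories, embed_dim = 4, 3
np.random.seed(0)

# Trainable weight matrix of an embedding layer: one row per category.
W = np.random.randn(num_categories, embed_dim)

# Inputs must be integer codes 0, 1, ..., num_categories - 1.
codes = np.array([2, 0, 3])

# The forward pass of an embedding layer is a simple row lookup.
embedded = W[codes]
print(embedded.shape)  # (3, 3)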

Requirements

  • Python 3.5
  • Numpy 1.13.3
  • Pandas 0.21.0
  • Stats 0.1.2a0
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd
import stats

Loading & preprocessing data

You can download the data used in this tutorial from “ http://archive.ics.uci.edu/ml/datasets/online+retail ”. This dataset contains all the transactions of an online retail store from December 2010 to December 2011. Each record includes the invoice number, the product code and name, the purchase date and time, the quantity purchased, the unit price, the customer’s ID, and the customer’s country.

Here, we consider the problem of predicting each customer’s daily expenditure on this web store using neural networks with embedding layers. Note that we treat not only the customer’s ID and country but also the year, month, day, and day of week as categorical variables.

In [2]:
excel = pd.ExcelFile("Online Retail.xlsx")
df = excel.parse(excel.sheet_names[0])
df.head()
Out[2]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

We select only the columns we need for the analysis and compute an “expenditure” column (Quantity * UnitPrice).

In [3]:
df = df.iloc[:, 3:8]
df = df.assign(expenditure = df.Quantity * df.UnitPrice)
df.head()
Out[3]:
Quantity InvoiceDate UnitPrice CustomerID Country expenditure
0 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

As the “InvoiceDate” column’s type is datetime, we extract “Day of week”, “Year”, “Month”, and “Day” from it. Moreover, we shift the “Year”, “Month”, and “Day” columns so that each one starts at 0 (day of week is already 0-based).

In [4]:
df = df.assign(dow=df.InvoiceDate.dt.dayofweek,
               year=df.InvoiceDate.dt.year,
               month=df.InvoiceDate.dt.month,
               day=df.InvoiceDate.dt.day)
df = df.assign(year = df.year - np.min(df.year), month = df.month - 1,
                 day = df.day - 1)
#Day of week : Monday=0, Tuesday=1, ..., Friday=4, Saturday=5, Sunday=6
#              (we will re-encode this column later because Saturday is missing)
#Year : 2010=0, 2011=1
#Month : January=0, February=1, ..., November=10, December=11
#Day : 1st=0, 2nd=1, ..., 30th=29, 31st=30
df.head()
Out[4]:
Quantity InvoiceDate UnitPrice CustomerID Country expenditure day dow month year
0 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30 0 2 11 0
1 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 0 2 11 0
2 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00 0 2 11 0
3 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 0 2 11 0
4 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 0 2 11 0

We aggregate the transactions into one record per customer per day. We first convert “CustomerID” to string type because the column contains NaN values.

In [5]:
df["CustomerID"] = df["CustomerID"].astype(str)
df = df.groupby(["CustomerID", "year", "month", "day", "dow"],
                as_index=False).agg({"expenditure": "sum",
                                     "Country": lambda x: stats.mode(x)})
df.head()
Out[5]:
CustomerID year month day dow expenditure Country
0 12346.0 1 0 17 1 0.00 United Kingdom
1 12347.0 0 11 6 1 711.79 Iceland
2 12347.0 1 0 25 2 475.39 Iceland
3 12347.0 1 3 6 3 636.25 Iceland
4 12347.0 1 5 8 3 382.52 Iceland
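The lambda in the aggregation above picks each group’s most common country. We assume this “stats” package’s mode behaves like the mode in Python’s standard statistics module, returning the single most frequent value (and raising an error if there is no unique mode):

# What the aggregation lambda computes for each group's "Country" values:
print(stats.mode(["Iceland", "Iceland", "France"]))  # Iceland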

We convert the “Country” column to non-negative integers. It is convenient to use the “factorize” function in pandas. “factorize” returns an array of integer codes together with an index of the unique country names, where each name’s position corresponds to its code. If you sort the data frame by “Country” before calling “factorize”, the numbering becomes more intuitive (the larger the number, the later the country appears in alphabetical order).

In [6]:
df = df.sort_values("Country")
df["Country"], list_country = pd.factorize(df["Country"])
df.head()
Out[6]:
CustomerID year month day dow expenditure Country
173 12415.0 1 0 5 3 7011.38 0
107 12388.0 1 8 24 6 825.92 0
108 12388.0 1 10 23 3 286.40 0
176 12415.0 1 2 2 3 16558.14 0
221 12424.0 1 5 29 3 1760.96 0
In [7]:
for index, name in enumerate(list_country):
    print(index, name)
0 Australia
1 Austria
2 Bahrain
3 Belgium
4 Brazil
5 Canada
6 Channel Islands
7 Cyprus
8 Czech Republic
9 Denmark
10 EIRE
11 European Community
12 Finland
13 France
14 Germany
15 Greece
16 Hong Kong
17 Iceland
18 Israel
19 Italy
20 Japan
21 Lebanon
22 Lithuania
23 Malta
24 Netherlands
25 Norway
26 Poland
27 Portugal
28 RSA
29 Saudi Arabia
30 Singapore
31 Spain
32 Sweden
33 Switzerland
34 USA
35 United Arab Emirates
36 United Kingdom
37 Unspecified
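If you later need the country names back, the uniques returned by “factorize” serve as a reverse lookup table. A quick sketch:

# Map the integer codes back to the original country names.
codes = df["Country"].values    # e.g. array([0, 0, 0, ...])
print(list_country[codes][:3])  # Index(['Australia', 'Australia', 'Australia'], ...)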
In [8]:
df = df.sort_values("CustomerID")
df["CustomerID"], list_ID = pd.factorize(df["CustomerID"])
df.head()
Out[8]:
CustomerID year month day dow expenditure Country
0 0 1 0 17 1 0.00 36
4 1 1 5 8 3 382.52 17
5 1 1 7 1 1 584.91 17
7 1 1 11 6 2 224.82 17
1 1 0 11 6 1 711.79 17

Because this data contains no Saturday transactions, we use “factorize” to re-encode the “dow” column so that it starts from 0 with increments of 1.

In [9]:
df = df.sort_values("dow")
df["dow"], list_dow = pd.factorize(df["dow"])
#Before : Monday=0, Tuesday=1, ..., Friday=4, Saturday=5, Sunday=6
#After : Monday=0, Tuesday=1, ..., Friday=4, Saturday=Missing, Sunday=5
for index, name in enumerate(list_dow):
    print(index, name)
0 0
1 1
2 2
3 3
4 4
5 6
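As a quick sanity check, we can verify that every categorical column now satisfies Renom’s requirement of codes 0, 1, 2, ... with no gaps (a small assertion sketch of our own):

# Verify each categorical column is coded 0, 1, ..., n-1 with no gaps.
for col in ["CustomerID", "year", "month", "day", "dow", "Country"]:
    codes = np.sort(df[col].unique())
    assert (codes == np.arange(len(codes))).all(), col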
In [10]:
df.head()
Out[10]:
CustomerID year month day dow expenditure Country
15726 3555 1 1 27 0 158.68 36
5662 1241 1 4 15 0 1133.33 10
5654 1237 1 2 6 0 328.80 36
5652 1236 1 4 22 0 -9.68 36
5648 1235 1 0 30 0 304.44 36

Finally, we have obtained a data frame ready for training neural networks with embedding layers on Renom.
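To close, here is a sketch of how these columns could be assembled into network inputs; the array names and the choice of columns are our own, and you should consult Renom’s documentation for the exact embedding-layer API:

# Categorical inputs: integer codes starting at 0, one column per variable.
cat_cols = ["CustomerID", "year", "month", "day", "dow", "Country"]
X_cat = df[cat_cols].values.astype(np.int32)

# Regression target: daily expenditure per customer.
y = df["expenditure"].values.astype(np.float32).reshape(-1, 1)

# Number of categories per column, needed to size each embedding layer.
cardinalities = [df[c].nunique() for c in cat_cols]
print(X_cat.shape, y.shape, cardinalities)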

References

[1] Cheng Guo and Felix Berkhahn. Entity Embeddings of Categorical Variables. CoRR, 2016. https://arxiv.org/abs/1604.06737