One-hot Conversion for Categorical Data

Why one-hot vectors help learning

One-hot encoding is one way of converting categorical data, and it is a good representation of such data for neural network learning (random forests and some other methods can use categorical data directly).
Some classification patterns cannot be solved without one-hot encoding or embedding.

For this reason, one-hot encoding or embedding is needed for neural network classification on categorical data.
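
As a minimal sketch of the idea, the following compares integer codes with a one-hot representation built by `pd.get_dummies`. The toy column `color` and its values are assumptions made for illustration and are not part of the Adult dataset. With integer codes the middle category lies between the other two, so a single linear unit with one threshold cannot isolate it; with one-hot columns each category gets its own input.

```python
import pandas as pd

# Hypothetical toy column; not part of the Adult dataset.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer codes impose an artificial order (blue=0, green=1, red=2).
int_codes = colors.astype("category").cat.codes.values

# One-hot encoding gives each category its own 0/1 column.
onehot = pd.get_dummies(colors).astype(float)

print(int_codes)
print(onehot)
```

Each row of `onehot` contains exactly one 1, so no ordering between categories is implied.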

Required Libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
  • pandas 0.20.3
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd

import renom as rm
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active

from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
# Set to True to use the GPU; otherwise set to False.
set_cuda_active(False)

Import column information

Read the column names from adult.names and return them together with a dictionary that records whether each column is continuous (numeric) or not.

In [2]:
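def make_col_names():
    col_names = []
    continuous_dict = {}
    # The attribute descriptions start after line 96 of adult.names.
    for i,line in enumerate(open("adult.names","r"),1):
        if i > 96:
            line = line.rstrip()
            name = line.split(":")[0]
            col_names.append(name)
            line = line.replace(" ","").replace(".","")
            continuous = line.split(":")[1] == "continuous"
            continuous_dict[name] = continuous
    # The label column is not listed among the attributes, so add it by hand.
    col_names.append("label")
    continuous_dict["label"] = False
    return col_names, continuous_dict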
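
To illustrate the parsing step, here is a sketch that applies the same splitting logic to a few in-memory lines instead of the file. The sample lines imitate the `name: values.` layout that the function expects; both the lines and the helper `parse_attribute_line` are assumptions made for illustration.

```python
def parse_attribute_line(line):
    """Return (column name, is_continuous) for a 'name: values.' line."""
    line = line.rstrip()
    name = line.split(":")[0]
    # Remove spaces and periods so the value part can be compared exactly.
    compact = line.replace(" ", "").replace(".", "")
    return name, compact.split(":")[1] == "continuous"

# Hypothetical sample lines in the assumed format.
sample_lines = [
    "age: continuous.\n",
    "sex: Female, Male.\n",
]

parsed = [parse_attribute_line(l) for l in sample_lines]
print(parsed)
```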

Get the column names and whether each column is continuous

In [3]:
n_id = 0
col_names, continuous_dicts = make_col_names()

Load the data and build an index

The loader prints some information about missing values and applies some preprocessing.

In [4]:
def load_data(filename, col_names, n):
    df = pd.read_csv(filename, header=None, index_col=None)
    # Display the number of records before deleting missing values.
    print("the number of {} records:{}\n".format(filename, len(df.index)))
    df.columns = col_names

    # Replace the missing-value marker " ?" with np.nan.
    df = df.applymap(lambda d: np.nan if d==" ?" else d)

    # Unify the different written forms.
    df = df.applymap(lambda d: " <=50K" if d==" <=50K." else d)
    df = df.applymap(lambda d: " >50K" if d==" >50K." else d)

    # Display the missing-value counts, then drop records that contain missing values.
    print("missing value info:\n{}\n".format(df.isnull().sum(axis=0)))
    df = df.dropna(axis=0)

    # Display the number of records after deleting missing values.
    print("the number of {} records after trimming:{}\n".format(filename, len(df.index)))
    ids = list(np.arange(n, n+len(df.index)))
    df["ID"] = np.array(ids)
    n = n+len(df.index)
    return df,n
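
The cleaning steps can be tried on a small frame without the data files. This is a sketch under assumptions: the toy records are invented, and `DataFrame.replace` and `Series.str.rstrip` are used here in place of the `applymap` calls in the loader.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; " ?" marks a missing value, as in the loader.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" Private", " ?", " State-gov"],
    "label": [" <=50K", " >50K.", " <=50K."],
})

# Replace the placeholder with NaN and strip the trailing period from labels.
toy = toy.replace(" ?", np.nan)
toy["label"] = toy["label"].str.rstrip(".")

# Count missing values per column, then drop incomplete records.
missing_counts = toy.isnull().sum(axis=0)
toy = toy.dropna(axis=0)

print(missing_counts)
print(len(toy.index))
```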

Load the data and preprocess it as pandas data frames

The first file is used for training; adult.test is used for prediction.

In [5]:
df_train,n_id_train = load_data("", col_names, n_id)
df_test,n_id_test = load_data("adult.test", col_names, n_id_train)
the number of records:32561

missing value info:
age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
label                0
dtype: int64

the number of records after trimming:30162

the number of adult.test records:16281

missing value info:
age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
label               0
dtype: int64

the number of adult.test records after trimming:15060

Get the non-continuous columns

In [6]:
def get_not_continuous_columns(continuous_dict):
    categorical_names = [k for k, v in continuous_dict.items() if not v]
    return categorical_names
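
A short usage sketch follows. The dictionary `example_dict` is a hypothetical subset of what `make_col_names` returns; the result is sorted only to make the printed order deterministic.

```python
def get_not_continuous_columns(continuous_dict):
    # Keep the names whose "continuous" flag is False.
    return [k for k, v in continuous_dict.items() if not v]

# Hypothetical subset of the column dictionary.
example_dict = {"age": True, "workclass": False, "label": False}

categorical = sorted(get_not_continuous_columns(example_dict))
print(categorical)
```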

Display label information

In [7]:
def print_labelinfo(labelnames):
    for i in range(len(labelnames)):
        print("label{}:{}".format(i, labelnames[i]))

One-hot encoding

We convert the categorical data to a one-hot vector representation.

In [8]:
def convert_data(df_train, df_test, n_id_train, n_id_test, continuous_dicts):
    categorical_names = get_not_continuous_columns(continuous_dicts)
    df = pd.concat((df_train, df_test), axis=0)

    # Get the dummy for the categorical data.
    for name in categorical_names:
        if name=="label":
            labelnames = list(pd.get_dummies(df[name]).columns)
            print("labelname:{}".format(labelnames))
        dummy_df = pd.get_dummies(df[name])
        df = pd.concat((df, dummy_df), axis=1)
        df = df.drop(name, axis=1)

    # Convert the data type.
    for name in df.columns:
        df[name] = df[name].astype(float)

    # Normalize each column to [0, 1] (min-max scaling); leave the ID column unchanged.
    for name in df.columns:
        if name != "ID":
            df[name] = (df[name] - df[name].min()) / (df[name].max() - df[name].min())

    df_train = df[df["ID"]<n_id_train].drop("ID", axis=1)
    df_test = df[df["ID"]>=n_id_train].drop("ID", axis=1)

    y_train = df_train[labelnames].values
    y_test = df_test[labelnames].values
    X_train = df_train.drop(labelnames, axis=1).values
    X_test = df_test.drop(labelnames, axis=1).values
    print_labelinfo(labelnames)
    return X_train, y_train, X_test, y_test
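
The function concatenates the training and test frames before calling `pd.get_dummies`. The sketch below suggests one reason this is useful: encoding the two sets separately can produce different columns when a category appears in only one of them. The toy frames and their column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frames: each set lacks a category that the other has.
train = pd.DataFrame({"x": [1.0, 3.0], "cat": ["a", "c"]})
test = pd.DataFrame({"x": [5.0], "cat": ["b"]})

# Encoding separately gives different column sets.
separate_train = list(pd.get_dummies(train["cat"]).columns)
separate_test = list(pd.get_dummies(test["cat"]).columns)

# Encoding the concatenated frame gives one shared set of columns.
both = pd.concat((train, test), axis=0, ignore_index=True)
dummies = pd.get_dummies(both["cat"]).astype(float)
both = pd.concat((both.drop("cat", axis=1), dummies), axis=1)

# Min-max scaling maps each column into [0, 1].
scaled = (both - both.min()) / (both.max() - both.min())

print(separate_train, separate_test)
print(scaled)
```

Note that min-max scaling divides by `max - min`, so a column with a single constant value would need separate handling.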

Confirm the shape of the data

In [9]:
X_train, y_train, X_test, y_test = \
convert_data(df_train, df_test, n_id_train, n_id_test, continuous_dicts)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
labelname:[' <=50K', ' >50K']
label0: <=50K
label1: >50K
X_train:(30162, 104) y_train:(30162, 2) X_test:(15060, 104) y_test:(15060, 2)