Merging Numerical and Categorical Data

How to merge numerical and categorical data using the Adult dataset

This tutorial shows how to merge numerical and categorical data, using the Adult dataset.
The task for this dataset is to predict whether a person's annual income exceeds 50K. Details of the data are given below.

The dataset reference is as follows.

Adult Data Set, Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics.

Required Libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
  • pandas 0.20.3
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd

import renom as rm
from renom.optimizer import Sgd, Adam
from renom.cuda import set_cuda_active

from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
# If you would like to use a GPU, set this to True; otherwise set it to False.
set_cuda_active(False)

Load the column information.

Read the column information and build a dictionary recording whether each column holds numerical (continuous) data.

In [2]:
def make_col_names():
    col_names = []
    continuous_dict = {}
    for i,line in enumerate(open("adult.names","r"),1):
        if i > 96:
            line = line.rstrip()
            name = line.split(":")[0]
            col_names.append(name)
            line = line.replace(" ","").replace(".","")
            continuous = line.split(":")[1] == "continuous"
            continuous_dict[name] = continuous
    col_names.append("label")
    continuous_dict["label"] = False
    return col_names, continuous_dict
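The function above can be illustrated on a couple of hand-written lines in the format of `adult.names` (the sample lines below are an assumption for illustration, not the real file contents):

```python
# Minimal sketch of the parsing inside make_col_names, applied to
# two made-up lines in the adult.names format.
sample_lines = [
    "age: continuous.",
    "workclass: Private, Self-emp-not-inc, Federal-gov.",
]

col_names = []
continuous_dict = {}
for line in sample_lines:
    line = line.rstrip()
    # The column name is everything before the first colon.
    name = line.split(":")[0]
    col_names.append(name)
    # Remove spaces and the trailing period before the comparison.
    cleaned = line.replace(" ", "").replace(".", "")
    continuous_dict[name] = cleaned.split(":")[1] == "continuous"

print(col_names)        # ['age', 'workclass']
print(continuous_dict)  # {'age': True, 'workclass': False}
```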

Get the column names, along with whether each column is numerical.

In [3]:
n_id = 0
col_names, continuous_dicts = make_col_names()

Load the data and create indices.

Handle missing values and apply some preprocessing.

In [4]:
def load_data(filename, col_names, n):
    df = pd.read_csv(filename, header=None, index_col=None)
    # Display the number of records before deleting missing values.
    print("the number of {} records:{}\n".format(filename, len(df.index)))
    df.columns = col_names

    # Replace the missing-value marker with np.nan.
    df = df.applymap(lambda d: np.nan if d==" ?" else d)

    # Unify the different written forms.
    df = df.applymap(lambda d: " <=50K" if d==" <=50K." else d)
    df = df.applymap(lambda d: " >50K" if d==" >50K." else d)

    # Display the information about missing values, then drop the rows containing them.
    print("missing value info:\n{}\n".format(df.isnull().sum(axis=0)))
    df = df.dropna(axis=0)

    # Display the number of records after deleting missing values.
    print("the number of {} records after trimming:{}\n".format(filename, len(df.index)))
    ids = list(np.arange(n, n+len(df.index)))
    df["ID"] = np.array(ids)
    n = n+len(df.index)
    return df,n
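The missing-value handling inside `load_data` can be sketched on a toy DataFrame (the column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV; " ?" marks a missing value,
# as in the Adult dataset.
df = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"],
                   "age": [39, 50, 38]})

# Replace the " ?" marker with np.nan, count per-column missing
# values, then drop the affected rows.
df = df.applymap(lambda d: np.nan if d == " ?" else d)
print(df.isnull().sum(axis=0))  # workclass has one missing value
df = df.dropna(axis=0)
print(len(df.index))            # 2 rows remain
```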

Load the data and apply the preprocessing above to the pandas DataFrames.

In [5]:
df_train,n_id_train = load_data("adult.data", col_names, n_id)
df_test,n_id_test = load_data("adult.test", col_names, n_id_train)
the number of adult.data records:32561

missing value info:
age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
label                0
dtype: int64

the number of adult.data records after trimming:30162

the number of adult.test records:16281

missing value info:
age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
label               0
dtype: int64

the number of adult.test records after trimming:15060

Get the names of the non-numerical (categorical) columns.

In [6]:
def get_not_continuous_columns(continuous_dict):
    categorical_names = [k for k, v in continuous_dict.items() if not v]
    return categorical_names
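A quick usage sketch with a hand-written dictionary (not the real one built from `adult.names`):

```python
# Keep only the keys whose value is False, i.e. the categorical columns.
continuous_dict = {"age": True, "workclass": False, "label": False}
categorical = [k for k, v in continuous_dict.items() if not v]
print(sorted(categorical))  # ['label', 'workclass']
```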

Display the label information.

In [7]:
def print_labelinfo(labelnames):
    for i in range(len(labelnames)):
        print("label{}:{}".format(i,labelnames[i]))

Convert the pandas DataFrames into NumPy arrays.

Categorical data is converted to a one-hot vector representation.

In [8]:
def convert_data(df_train, df_test, n_id_train, n_id_test, continuous_dicts):
    categorical_names = get_not_continuous_columns(continuous_dicts)
    df = pd.concat((df_train, df_test), axis=0)

    # Get the dummy variables for the categorical data.
    for name in categorical_names:
        if name=="label":
            labelnames = list(pd.get_dummies(df[name]).columns)
            print("labelname:{}".format(labelnames))
        dummy_df = pd.get_dummies(df[name])
        df = pd.concat((df, dummy_df), axis=1)
        df = df.drop(name, axis=1)

    # Convert the data type.
    for name in df.columns:
        df[name] = df[name].astype(float)

    # Normalize the data with min-max scaling.
    for name in df.columns:
        if name=="ID":
            df[name] = df[name]
        else:
            df[name] = (df[name] - df[name].min()) / (df[name].max() - df[name].min())

    df_train = df[df["ID"]<n_id_train].drop("ID", axis=1)
    df_test = df[df["ID"]>=n_id_train].drop("ID", axis=1)

    y_train = df_train[labelnames].values
    y_test = df_test[labelnames].values
    print_labelinfo(labelnames)
    X_train = df_train.drop(labelnames, axis=1).values
    X_test = df_test.drop(labelnames, axis=1).values
    return X_train, y_train, X_test, y_test
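The two core transformations in `convert_data` — one-hot encoding with `pd.get_dummies` and min-max scaling — can be sketched on a toy frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one categorical and one numerical column.
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "age": [20.0, 40.0, 30.0]})

# One-hot encode the categorical column and drop the original.
dummies = pd.get_dummies(df["color"]).astype(float)
df = pd.concat((df.drop("color", axis=1), dummies), axis=1)

# Min-max scale every column into the [0, 1] range.
for name in df.columns:
    df[name] = (df[name] - df[name].min()) / (df[name].max() - df[name].min())

print(df["age"].tolist())  # [0.0, 1.0, 0.5]
```

The 0/1 dummy columns are unchanged by min-max scaling, so only the numerical columns are effectively rescaled.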

Check the shapes of the data.

In [9]:
X_train, y_train, X_test, y_test = \
convert_data(df_train, df_test, n_id_train, n_id_test, continuous_dicts)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
labelname:[' <=50K', ' >50K']
label0: <=50K
label1: >50K
X_train:(30162, 104) y_train:(30162, 2) X_test:(15060, 104) y_test:(15060, 2)