Completion of numerical and categorical data

Completing numerical and categorical data, using pseudo-missing data

When we receive data to process, it sometimes contains missing values.
In that case, we have to decide how to handle them: delete the affected rows or impute the missing values.
The simplest approach is to delete every row that contains a missing value.
However, deletion can bias the results in some cases, so this tutorial shows how to impute missing values using ReNom.
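
To make the deletion option concrete, here is a minimal sketch of row-wise deletion with pandas. The small table `toy` and its values are made up for illustration and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# Toy table with a few missing entries (illustrative values only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0],
    "c": ["x", "y", "z", None, "x"],
})

# Row-wise deletion: drop every row that has at least one missing value.
deleted = toy.dropna(axis=0, how="any")

rows_before = len(toy)
rows_after = len(deleted)
print(f"rows kept: {rows_after} of {rows_before}")
```

Each of the three missing entries sits in a different row, so only two of the five rows survive. Even a small fraction of missing cells can discard a large share of the rows.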

Required Libraries

In [1]:
import numpy as np
import pandas as pd
from renom.utility import completion

Make the data

In [2]:
data_size = 10000
x1 = np.random.randint(0, 500, data_size).reshape(-1, 1)
x2 = x1 + 10
x3 = np.random.randint(0, 500, data_size).reshape(-1, 1)
x4 = np.random.randint(0, 500, data_size).reshape(-1, 1)
x5 = np.random.choice(["One", "Two", "Three"], data_size, replace=True).reshape(-1, 1)
x6 = np.random.choice(["Four", "Five"], data_size, replace=True).reshape(-1, 1)
x = np.concatenate((x1, x2), axis=1)
x = np.concatenate((x, x3), axis=1)
x = np.concatenate((x, x4), axis=1)
x = np.concatenate((x, x5), axis=1)
x = np.concatenate((x, x6), axis=1)
x
Out[2]:
array([['162', '172', '360', '494', 'Two', 'Five'],
       ['404', '414', '242', '263', 'Three', 'Four'],
       ['253', '263', '349', '239', 'Three', 'Five'],
       ...,
       ['458', '468', '114', '427', 'Two', 'Five'],
       ['418', '428', '470', '165', 'Three', 'Five'],
       ['406', '416', '424', '179', 'Three', 'Five']],
      dtype='<U21')

Make missing values randomly

There are several types of missingness. Here we create a combination of MAR (missing at random) and MCAR (missing completely at random), which are described below.

MCAR is easy to deal with: deleting the affected rows does not bias the data. With MAR and MNAR (missing not at random), however, deleting the rows that contain missing values introduces bias, so it is better to impute the values. Under MAR the missingness can be predicted from other variables, but MNAR is harder, and we have not found a useful way to deal with it.
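
The difference between MCAR and MAR under deletion can be checked numerically. The sketch below reuses the tutorial's rule (x1 is missing when x2 < 50, where x2 = x1 + 10) as the MAR case and compares it with an MCAR case that removes the same fraction of rows at random. The sample size, the seed, and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Same construction as the tutorial data: x2 is x1 shifted by 10.
x1 = rng.integers(0, 500, n).astype(float)
x2 = x1 + 10

full_mean = x1.mean()

# MAR: x1 is missing whenever the observed x2 is below 50.
mar_missing = x2 < 50

# MCAR: the same fraction of rows is missing, chosen independently of the data.
mcar_missing = rng.random(n) < mar_missing.mean()

# Mean of x1 after deleting the rows with a missing x1.
mar_bias = x1[~mar_missing].mean() - full_mean
mcar_bias = x1[~mcar_missing].mean() - full_mean

print(f"bias after deletion, MAR:  {mar_bias:.2f}")
print(f"bias after deletion, MCAR: {mcar_bias:.2f}")
```

Under MAR the deletion removes only the rows with small x1, so the remaining mean shifts upward by roughly 20. Under MCAR the shift stays close to zero.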

In [3]:
# Shuffle the row indices and take three slices as randomly chosen rows
missing_idx = np.random.permutation(data_size)
missing_idx1 = missing_idx[:1000]
missing_idx2 = missing_idx[-1010:]
missing_idx3 = missing_idx[2000:3050]
df = pd.DataFrame(x)
df.columns = ["x1", "x2", "x3", "x4", "x5", "x6"]
df = df.astype({"x1":float, "x2":float, "x3":float, "x4":float, "x5":str, "x6":str})
# Keep the complete data before introducing missing values
X_complete = df.values
# MAR: x1 is missing whenever x2 is below 50
df.loc[df["x2"]<50, "x1"] = np.nan
# MCAR: x3, x5 and x6 are missing at the randomly chosen rows
df.iloc[list(missing_idx1), 2] = np.nan
df.iloc[list(missing_idx2), 4] = np.nan
df.iloc[list(missing_idx3), 5] = np.nan
X_incomplete = df.values
# Impute the missing values with MICE
X_filled = completion(X_incomplete, mode="mice", impute_type="col")
X_filled
[MICE] Completing matrix with shape (10000, 6)
/home/d_onodera/renom2.0.0/ReNom/renom/layers/loss/sigmoid_cross_entropy.py:18: RuntimeWarning: overflow encountered in exp
  z = 1. / (1. + np.exp(to_value(-lhs)))
/home/d_onodera/renom2.0.0/ReNom/renom/core.py:898: RuntimeWarning: overflow encountered in exp
  ret = getattr(ufunc, method)(*new_inputs, **kwargs)
[MICE] Starting imputation round 10/60, elapsed time 50.013
[MICE] Starting imputation round 20/60, elapsed time 106.384
[MICE] Starting imputation round 30/60, elapsed time 161.984
[MICE] Starting imputation round 40/60, elapsed time 217.537
[MICE] Starting imputation round 50/60, elapsed time 272.959
[MICE] Starting imputation round 60/60, elapsed time 328.623
Out[3]:
array([[162.0, 172.0, 360.0, 494.0, 'One', 'Five'],
       [404.0, 414.0, 242.0, 263.0, 'Three', 'Four'],
       [253.0, 263.0, 349.0, 239.0, 'Three', 'Five'],
       ...,
       [458.0, 468.0, 114.0, 427.0, 'One', 'Five'],
       [418.0, 428.0, 326.0309934976139, 165.0, 'Three', 'Five'],
       [406.0, 416.0, 424.0, 179.0, 'Three', 'Five']], dtype=object)
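
Because the missing values were created artificially, the imputed entries can be compared with the original ones. The sketch below shows one possible way to do this: RMSE for numerical columns and accuracy for categorical columns, computed only on the entries that were missing. The function `imputation_scores` is a hypothetical helper written for this illustration and is not part of ReNom. The tiny arrays stand in for `X_complete`, `X_incomplete` and `X_filled`, and the filled values are hand-written rather than actual MICE output.

```python
import numpy as np
import pandas as pd

def imputation_scores(X_complete, X_incomplete, X_filled,
                      numeric_cols, categorical_cols):
    """Score only the imputed entries against the known original values.

    Hypothetical helper: returns {column index: RMSE} for numerical columns
    and {column index: accuracy} for categorical columns.
    """
    scores = {}
    for col in list(numeric_cols) + list(categorical_cols):
        holes = pd.isna(X_incomplete[:, col])   # entries that were missing
        if not holes.any():
            continue
        truth = X_complete[holes, col]
        guess = X_filled[holes, col]
        if col in numeric_cols:
            diff = guess.astype(float) - truth.astype(float)
            scores[col] = float(np.sqrt(np.mean(diff ** 2)))
        else:
            scores[col] = float(np.mean(guess == truth))
    return scores

# Tiny stand-in data: column 0 is numerical, column 1 is categorical.
X_complete_demo = np.array(
    [[1.0, "a"], [2.0, "b"], [3.0, "a"], [4.0, "b"]], dtype=object)

X_incomplete_demo = X_complete_demo.copy()
X_incomplete_demo[1, 0] = np.nan
X_incomplete_demo[2, 1] = np.nan
X_incomplete_demo[3, 1] = np.nan

# Hand-written fills used in place of a real imputation result.
X_filled_demo = X_incomplete_demo.copy()
X_filled_demo[1, 0] = 2.5
X_filled_demo[2, 1] = "a"
X_filled_demo[3, 1] = "a"

scores = imputation_scores(X_complete_demo, X_incomplete_demo, X_filled_demo,
                           numeric_cols=[0], categorical_cols=[1])
print(scores)
```

With the tutorial's arrays, the same call would presumably use `numeric_cols=[0, 2]` and `categorical_cols=[4, 5]`, since those are the columns where missing values were introduced.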

Future work

There are many imputation methods, but at present only the MICE method is available.
We will add other methods so that a suitable one can be chosen depending on the situation.