Simple Data Visualization

Data visualization for data understanding

When we analyze data, understanding it is useful for feature engineering and for choosing a prediction model.
In this tutorial, we inspect the explanatory variables, the missing values, and the correlations between the explanatory variables.

Required libraries

In this tutorial, the following modules are required.

  • numpy 1.12.1
  • matplotlib 2.0.2
  • pandas 0.20.3
  • seaborn 0.8.1
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load dataset

In this chapter, we use the Titanic dataset, which can be downloaded from http://docs.renom.jp/downloads/train.csv .

In [2]:
train = pd.read_csv("train.csv")
train.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • PassengerId - unique ID for each passenger
  • Survived - 0 = died, 1 = survived
  • Pclass - ticket class (1 = upper, 2 = middle, 3 = lower)
  • Name - passenger name
  • Sex - passenger sex
  • Age - passenger age
  • SibSp - number of siblings and spouses aboard
  • Parch - number of parents and children aboard
  • Ticket - ticket number
  • Fare - passenger fare
  • Cabin - cabin number
  • Embarked - port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

By focusing on each variable and considering its characteristics, we can judge which variables are likely to help the model and how to transform the categorical variables. A quick way to get those characteristics, sketched below using only the train DataFrame loaded above, is to print each column's dtype, the number of unique values, and a few examples.
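# Per-column overview: dtype, number of unique values, and sample values.
for col in train.columns:
    print(col, train[col].dtype, train[col].nunique(), train[col].unique()[:3])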

In [3]:
train.describe()
Out[3]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
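Note that describe() summarizes only the numeric columns by default. As a small aside, not part of the original output, pandas can also summarize the string columns via the include argument:

# Summary of the object (string) columns: count, unique, top, freq.
train.describe(include="object")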
In [4]:
sns.countplot("Sex",data=train)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc22f071198>
(Figure: count plot of Sex)
In [5]:
sns.countplot("Embarked", data=train)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc22f087240>
(Figure: count plot of Embarked)

Histogram of each variable

Histograms are important for data preprocessing. For example, when we impute the Embarked column, which has a few missing values, the simplest idea is to fill the missing entries with S, the most frequent value.
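To see why S is a reasonable default, a minimal sketch, using the train DataFrame already loaded, counts the Embarked values and draws a histogram of Age:

# S is by far the most frequent port, so it is a natural fill value.
print(train["Embarked"].value_counts())

# Histogram of Age; its shape guides how we impute Age later.
train["Age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()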

In [6]:
def kesson_table(df):
    # "kesson" is Japanese for "missing": count and rate of missing values.
    null_val = df.isnull().sum()
    percent = 100 * df.isnull().sum() / len(df)
    kesson_table = pd.concat([null_val, percent], axis=1)
    kesson_table_ren_columns = kesson_table.rename(columns={0: "Missing", 1: "%"})
    return kesson_table_ren_columns
kesson_table(train)
Out[6]:
Missing %
PassengerId 0 0.000000
Survived 0 0.000000
Pclass 0 0.000000
Name 0 0.000000
Sex 0 0.000000
Age 177 19.865320
SibSp 0 0.000000
Parch 0 0.000000
Ticket 0 0.000000
Fare 0 0.000000
Cabin 687 77.104377
Embarked 2 0.224467
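As an aside, the same missing rates can be computed in one line with isnull().mean(), which returns the missing fraction per column:

# Missing rate per column as a percentage, sorted descending.
(train.isnull().mean() * 100).sort_values(ascending=False)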

Missing value rate

Information about missing values is important because we have to decide how to handle them. There are several options: one is imputation, another is dropping the column. As discussed in another tutorial, missingness comes in three types, depending on its relationship with the other explanatory variables, and the missing rate can be a signal of whether a variable is worth using at all.
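For example, a minimal sketch of such a decision rule, where the 50% cutoff is an assumed value for illustration rather than part of this tutorial, drops the columns that are mostly missing and keeps the rest for imputation:

# Hypothetical rule: drop columns whose missing rate exceeds the cutoff.
threshold = 0.5  # assumed cutoff; tune for your data
missing_rate = train.isnull().mean()
cols_to_drop = missing_rate[missing_rate > threshold].index.tolist()
print(cols_to_drop)  # only Cabin (about 77% missing) exceeds the cutoff here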

In [7]:
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")
kesson_table(train)
Out[7]:
Missing %
PassengerId 0 0.000000
Survived 0 0.000000
Pclass 0 0.000000
Name 0 0.000000
Sex 0 0.000000
Age 0 0.000000
SibSp 0 0.000000
Parch 0 0.000000
Ticket 0 0.000000
Fare 0 0.000000
Cabin 687 77.104377
Embarked 0 0.000000
In [8]:
train.loc[train["Sex"]=="male","Sex"] = 0
train.loc[train["Sex"]=="female","Sex"] = 1
train.loc[train["Embarked"]=="S", "Embarked"] = 0
train.loc[train["Embarked"]=="C", "Embarked"] = 1
train.loc[train["Embarked"]=="Q", "Embarked"] = 2
train.head(5)
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 1
2 3 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 0
4 5 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN 0
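As an aside, the same encoding can be written more compactly with Series.map; this sketch assumes it replaces the loc-based cell above and starts from the original string values:

# Equivalent, more compact encoding (run instead of the loc assignments).
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})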
In [9]:
plt.rcParams["figure.figsize"] = (13.0, 13.0)
# Keep only the numeric columns for the correlation matrix.
train_numeric = train.loc[:, ["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
feature_names = train_numeric.columns
# np.corrcoef expects one variable per row, hence the transpose.
correlation_matrix = np.corrcoef(train_numeric.values.astype(float).transpose())
sns.heatmap(correlation_matrix, annot=True,
            xticklabels=feature_names,
            yticklabels=feature_names,
            cmap="Reds")
plt.show()
(Figure: correlation heatmap of the numeric variables)

Relationships between explanatory variables

The relationship between the explanatory variables and the target variable is useful for feature engineering. Here we drew a heatmap of the correlations between the explanatory variables, which lets us see how strongly they are related to one another. When one explanatory variable is highly correlated with another, omitting one of the pair can help reduce overfitting. In this tutorial we looked mainly at the explanatory variables; by visualizing more aspects of the data we can learn more about its characteristics.
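To make that inspection programmatic, a minimal sketch, where the 0.5 threshold is an assumed example value, lists the variable pairs whose absolute correlation exceeds the threshold:

# List variable pairs with |correlation| above an assumed cutoff.
corr = train_numeric.astype(float).corr()
threshold = 0.5  # hypothetical cutoff for "highly correlated"
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, b, round(corr.loc[a, b], 3))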