相関係数と決定係数

Boston house-priceデータセットを用いた相関係数と決定係数の簡単な紹介

基本的に相関係数はピアソンの相関係数を指します.
これは2つの変数の線形な関係性についての一つの目安となります.
\begin{equation*} \rho = \frac{E[(X-E[X])(Y-E[Y])]}{(E[X-E[X])^2]E[Y-E[Y])^2)])^\frac{1}{2}} \end{equation*}

1は正の相関を,-1は負の相関を,0は無相関を表しています.

更にr2スコアは線形モデルの評価に対して有用な指標になります.

\begin{equation*} R^2 = 1 - \frac{\sum_{i=0}^{n}(y_{i}-f_{i})^2}{\sum_{i=0}^{n}(y_{i}-\overline{y})^2} \end{equation*}
R2スコアは相関係数と異なり,回帰の当てはまりの良さへの1つの評価指標となります.
つまり,どれだけ単純に回帰曲線として平均を取っただけの直線で回帰を行うよりも,どれだけよく今回の回帰モデルがデータにfitしているかということを表しています.

Required Libaries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import renom as rm
from renom.optimizer import Sgd
from renom.cuda.cuda import set_cuda_active
set_cuda_active(False)

boston = load_boston()
columns = boston.feature_names
X = boston.data.astype(np.float32)
y = boston.target.astype(np.float32)

X = (X - X.min()) / (X.max() - X.min())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
y_train, y_test = y_train.reshape(-1, 1), y_test.reshape(-1, 1)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(\
       X_train.shape, y_train.shape, X_test.shape, y_test.shape))

sequential = rm.Sequential([
    rm.Dense(30),
    rm.Dense(1)
])

batch_size = 32
epoch = 1000
optimizer = Sgd(lr=0.01)
N = len(X_train)

learning_curve = []
test_learning_curve = []
for i in range(epoch):
    perm = np.random.permutation(N)
    train_loss = 0
    test_loss = 0
    for j in range(N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]
        with sequential.train():
            z = sequential(train_batch)
            l = rm.mean_squared_error(z, response_batch)
        grad = l.grad()
        grad.update(optimizer)
        train_loss += l.as_ndarray()
    train_loss = train_loss / (N // batch_size)

    z = sequential(X_test)
    l = rm.mean_squared_error(z, y_test)
    test_loss = l.as_ndarray()

    learning_curve.append(train_loss)
    test_learning_curve.append(test_loss)

    if i%100==0:
        print("epoch %03d train_loss%f test_loss:%f"%(i, train_loss, test_loss))
X_train:(404, 13) y_train:(404, 1) X_test:(102, 13) y_test:(102, 1)
epoch 000 train_loss121.286549 test_loss:36.927589
epoch 100 train_loss32.149979 test_loss:28.684084
epoch 200 train_loss31.406890 test_loss:23.480282
epoch 300 train_loss29.028488 test_loss:27.579151
epoch 400 train_loss27.724438 test_loss:19.224230
epoch 500 train_loss21.688951 test_loss:16.617624
epoch 600 train_loss25.711622 test_loss:37.536972
epoch 700 train_loss22.570347 test_loss:14.833116
epoch 800 train_loss27.341895 test_loss:22.036667
epoch 900 train_loss21.255029 test_loss:17.414614

評価

In [2]:
prediction = sequential(X_test)
plt.plot(y_test, label="original")
plt.plot(prediction, label="prediction")
plt.legend()
plt.show()
../../../_images/notebooks_basic_correlation_determination_notebook_4_0.png

相関係数

1は正の相関を,-1は負の相関を,0は無相関を表しています.

In [3]:
for i in range(X.shape[1]):
    print("{:8s}:{:10f}".format(columns[i],np.corrcoef(X[:,i], y)[0][1]))
CRIM    : -0.385832
ZN      :  0.360445
INDUS   : -0.483725
CHAS    :  0.175260
NOX     : -0.427321
RM      :  0.695360
AGE     : -0.376955
DIS     :  0.249929
RAD     : -0.381626
TAX     : -0.468536
PTRATIO : -0.507787
B       :  0.333461
LSTAT   : -0.737663

決定係数

回帰モデルの評価のためのR2スコア
R2スコアは1に近いほど,平均を取るだけの直線より相対的に誤差が低いと言えます.
In [4]:
print("r2 score:{}".format(r2_score(y_test, prediction)))
r2 score:0.5501949370412647