Correlation Coefficient and Coefficient of Determination

This is a simple introduction to correlation coefficients and coefficients of determination, using the Boston housing price dataset.

“Correlation coefficient” usually refers to the Pearson correlation coefficient.
It is one of the most common measures of the strength of the linear relationship between two variables.
\begin{equation*} \rho = \frac{E[(X-E[X])(Y-E[Y])]}{\left(E[(X-E[X])^2]\,E[(Y-E[Y])^2]\right)^{\frac{1}{2}}} \end{equation*}

A \rho of 1 indicates a perfect positive linear correlation, while -1 indicates a perfect negative one. A \rho of 0 indicates that the variables are linearly uncorrelated.
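
As a rough illustration of the formula above, the sketch below computes the coefficient directly from its definition with NumPy, replacing each expectation with a sample mean. The function name `pearson_r` and the toy data are made up for this example and are not part of the notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation, following the definition term by term."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()                      # X - E[X]
    dy = y - y.mean()                      # Y - E[Y]
    covariance = np.mean(dx * dy)          # numerator of the formula
    return covariance / np.sqrt(np.mean(dx ** 2) * np.mean(dy ** 2))

# Toy data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson_r(x, 2 * x + 1))      # increasing linear relation
print(pearson_r(x, -3 * x + 7))     # decreasing linear relation
print(pearson_r(x, (x - 3) ** 2))   # symmetric, non-linear relation
```

The last case is worth noting: the second variable is fully determined by the first, yet the coefficient is zero, because the coefficient only measures linear association.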

The R2 score, also called the coefficient of determination, is useful for evaluating a linear model.

\begin{equation*} R^2 = 1 - \frac{\sum_{i=1}^{n}(y_{i}-f_{i})^2}{\sum_{i=1}^{n}(y_{i}-\overline{y})^2} \end{equation*}
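Here y_{i} is an observed value, f_{i} is the model's prediction for it, and \overline{y} is the mean of the observed values. Unlike the correlation coefficient, the R2 score is used to evaluate regression performance. An R2 score close to 1 indicates that the model's total squared error is small relative to the total squared deviation of the targets from their mean.
You can think of this as telling you how well the model fits the data, as compared to just “fitting” with a horizontal line at the mean of the targets. A score of 0 means the model does no better than that line, and a negative score means it does worse.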
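
To make the comparison with the mean line concrete, here is a minimal sketch that evaluates the R2 formula by hand on toy values. The helper `r2_from_definition` and the numbers are assumptions made for this example. It is meant to follow the formula above, not to replace `sklearn.metrics.r2_score`.

```python
import numpy as np

def r2_from_definition(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # sum of (y_i - f_i)^2
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # sum of (y_i - mean)^2
    return 1.0 - ss_res / ss_tot

# Toy targets (hypothetical, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
perfect = y_true.copy()                          # predictions equal to the targets
baseline = np.full_like(y_true, y_true.mean())   # always predict the mean
reversed_order = y_true[::-1]                    # predictions worse than the mean line

print(r2_from_definition(y_true, perfect))
print(r2_from_definition(y_true, baseline))
print(r2_from_definition(y_true, reversed_order))
```

The three cases give a score of 1, a score of 0, and a negative score respectively, which is why R2 is read as "improvement over predicting the mean".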
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import renom as rm
from renom.optimizer import Sgd
from renom.cuda.cuda import set_cuda_active
set_cuda_active(False)

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
boston = load_boston()
columns = boston.feature_names
X = boston.data.astype(np.float32)
y = boston.target.astype(np.float32)

# Min-max scaling with the global minimum and maximum over all features (not per feature).
X = (X - X.min()) / (X.max() - X.min())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
y_train, y_test = y_train.reshape(-1, 1), y_test.reshape(-1, 1)
print("X_train:{} y_train:{} X_test:{} y_test:{}".format(
    X_train.shape, y_train.shape, X_test.shape, y_test.shape))

sequential = rm.Sequential([
    rm.Dense(30),
    rm.Dense(1)
])

batch_size = 32
epoch = 1000
optimizer = Sgd(lr=0.01)
N = len(X_train)

learning_curve = []
test_learning_curve = []
for i in range(epoch):
    perm = np.random.permutation(N)
    train_loss = 0
    test_loss = 0
    for j in range(N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]
        with sequential.train():
            z = sequential(train_batch)
            l = rm.mean_squared_error(z, response_batch)
        grad = l.grad()
        grad.update(optimizer)
        train_loss += l.as_ndarray()
    train_loss = train_loss / (N // batch_size)

    z = sequential(X_test)
    l = rm.mean_squared_error(z, y_test)
    test_loss = l.as_ndarray()

    learning_curve.append(train_loss)
    test_learning_curve.append(test_loss)

    if i % 100 == 0:
        print("epoch %03d train_loss:%f test_loss:%f" % (i, train_loss, test_loss))
X_train:(404, 13) y_train:(404, 1) X_test:(102, 13) y_test:(102, 1)
epoch 000 train_loss:121.286549 test_loss:36.927589
epoch 100 train_loss:32.149979 test_loss:28.684084
epoch 200 train_loss:31.406890 test_loss:23.480282
epoch 300 train_loss:29.028488 test_loss:27.579151
epoch 400 train_loss:27.724438 test_loss:19.224230
epoch 500 train_loss:21.688951 test_loss:16.617624
epoch 600 train_loss:25.711622 test_loss:37.536972
epoch 700 train_loss:22.570347 test_loss:14.833116
epoch 800 train_loss:27.341895 test_loss:22.036667
epoch 900 train_loss:21.255029 test_loss:17.414614

Evaluation

In [2]:
prediction = sequential(X_test)
plt.plot(y_test, label="original")
plt.plot(prediction, label="prediction")
plt.legend()
plt.show()
[Figure: line plot of the original test targets ("original") and the model's predictions ("prediction")]

Correlation coefficient

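Here we compute the correlation coefficient between each feature and the target (the housing price). A value close to 1 indicates a strong positive linear correlation, a value close to -1 indicates a strong negative one, and a value close to 0 indicates little or no linear correlation with the target.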

In [3]:
for i in range(X.shape[1]):
    print("{:8s}:{:10f}".format(columns[i],np.corrcoef(X[:,i], y)[0][1]))
CRIM    : -0.385832
ZN      :  0.360445
INDUS   : -0.483725
CHAS    :  0.175260
NOX     : -0.427321
RM      :  0.695360
AGE     : -0.376955
DIS     :  0.249929
RAD     : -0.381626
TAX     : -0.468536
PTRATIO : -0.507787
B       :  0.333461
LSTAT   : -0.737663
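
The coefficients above are computed on the min-max scaled features. That does not affect the result, since the Pearson coefficient is unchanged by shifting a variable or multiplying it by a positive constant. The sketch below checks this on synthetic data; the variable names and the random data are assumptions made for this example, not values from the notebook.

```python
import numpy as np

# Synthetic feature and target (hypothetical, for illustration only)
rng = np.random.default_rng(0)
feature = rng.normal(size=200)
target = 0.5 * feature + rng.normal(size=200)

# Correlation on the raw feature
raw = np.corrcoef(feature, target)[0, 1]

# Correlation after min-max scaling the feature to [0, 1]
scaled = (feature - feature.min()) / (feature.max() - feature.min())
after_scaling = np.corrcoef(scaled, target)[0, 1]

# Multiplying by a negative constant flips the sign but keeps the magnitude
flipped = np.corrcoef(-feature, target)[0, 1]

print(raw, after_scaling, flipped)
```

The first two printed values agree up to floating-point error, and the third has the same magnitude with the opposite sign.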

Coefficient of determination

The R2 score evaluates a regression.
An R2 score close to 1 means that the model's squared error on the test data is much smaller than that of a horizontal line at the mean of the targets.
In [4]:
print("r2 score:{}".format(r2_score(y_test, prediction)))
r2 score:0.5501949370412647
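
One way to read such a score is as a rescaled mean squared error: R2 equals 1 minus the MSE divided by the variance of the targets. The sketch below checks that identity on synthetic data. It assumes the plain mean of squared errors, which may differ by a constant factor from the loss a particular library reports, and the arrays are made up for this example.

```python
import numpy as np

# Synthetic targets and noisy predictions (hypothetical, for illustration only)
rng = np.random.default_rng(0)
y_true = rng.normal(loc=22.0, scale=9.0, size=100)
y_pred = y_true + rng.normal(scale=6.0, size=100)

# R2 expressed through the mean squared error and the target variance
mse = np.mean((y_true - y_pred) ** 2)
variance = np.var(y_true)                 # population variance (ddof=0)
r2_via_mse = 1.0 - mse / variance

# R2 computed directly from the sums of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_direct = 1.0 - ss_res / ss_tot

print(r2_via_mse, r2_direct)
```

Both expressions give the same value, so for a fixed test set a lower MSE always means a higher R2.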