dHSIC

A method for detecting non-linear correlations

Introduction to dHSIC

The Hilbert-Schmidt Independence Criterion, usually abbreviated as HSIC, is a relatively recent method for detecting non-linear correlations. Conventionally, Pearson's correlation coefficient is used to detect linear correlations between two data series, but it has a serious limitation: it cannot detect non-linear correlations.
To overcome this limitation, Arthur Gretton et al. proposed HSIC, which is based on the kernel method. With an appropriate kernel, the data are implicitly mapped into an infinite-dimensional feature space, and by measuring dependence in that space we can detect not only linear but also non-linear correlations.
Many methods have been derived from HSIC, and dHSIC is one of them.
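To give a concrete feel for what the kernel method computes, the following is a minimal NumPy sketch of the (biased) empirical HSIC statistic for two one-dimensional samples. The function names and the fixed kernel bandwidth are our own illustrative choices; this is not the implementation used by the dHSIC_test function introduced below, which additionally performs a statistical test.

import numpy as np

def gaussian_gram(z, sigma=1.0):
    # Gaussian kernel matrix built from pairwise squared distances
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / n^2, with H = I - (1/n) 1 1^T
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

x = np.random.randn(300)
print(hsic_biased(x, x ** 2))                 # dependent data: relatively large value
print(hsic_biased(x, np.random.randn(300)))   # independent data: value close to zero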

Limitation of Pearson's Correlation Coefficient

Let us first look at the limitation of Pearson's correlation coefficient.
We calculate the Pearson's correlation coefficient of two data series, 'x' and 'y', which satisfy y=3x.
In [37]:
import numpy as np
import matplotlib.pyplot as plt
#Generate data
x = np.random.randn(1000)
y = x * 3
#plot data
plt.scatter(x, y)
plt.show()
#Calculate Pearson's correlation coefficient
print(np.corrcoef(x, y))
[Figure: scatter plot of x versus y=3x]
[[ 1.  1.]
 [ 1.  1.]]

Note that the output of 'numpy.corrcoef(x, y)' has the form array([[ xx , xy ], [ yx , yy ]]), where 'xx' is the correlation between 'x' and 'x', 'xy' is the correlation between 'x' and 'y', and similarly for 'yx' and 'yy'.

The result shows that the Pearson's correlation coefficient between 'x' and 'y' is 1, which means that the two data series are perfectly linearly correlated.

Next, we calculate the Pearson's correlation coefficient of different data. This time the data follow y=x^2.

In [38]:
import numpy as np
import matplotlib.pyplot as plt
#Generate data
x = np.random.randn(1000)
y = x ** 2
#plot data
plt.scatter(x, y)
plt.show()
#Calculate Pearson's correlation coefficient
print(np.corrcoef(x, y))
[Figure: scatter plot of x versus y=x^2]
[[ 1.         -0.02372486]
 [-0.02372486  1.        ]]

Here, the Pearson's correlation coefficient is almost zero, which means that the two data series are not linearly correlated. However, it does not mean that they are not correlated at all. Even when Pearson's correlation coefficient finds no linear correlation, the data series may still be correlated in a non-linear way.

It makes no sense to conclude that these data series are unrelated: given the value of x, we can immediately compute y, so x and y are clearly correlated in some way. In practice, only a limited number of real data sets show linear correlations; when real data sets are correlated, the relationship is often non-linear rather than linear. Here lies the big problem: as long as we rely on Pearson's correlation coefficient, we cannot detect non-linear correlations, which, as this example shows, are at least as common as linear ones in real-world data.

Detection of non-linear correlations

Our interest is now in detecting non-linear correlations. Several methods have been proposed for this, such as the maximal information coefficient (MIC) and HSIC. In this tutorial, we use dHSIC, which is derived from HSIC. The original HSIC cannot test dependence among more than two variables; dHSIC extends HSIC to an arbitrary number of variables. For more details, see citation 3.

Now, we use dHSIC to detect non-linear correlations.

In [39]:
from renom_dp.dHSIC import dHSIC_test
x = np.random.randn(1000, 1)
y = x ** 2
plt.scatter(x, y)
print(dHSIC_test([x, y]))
0.0
[Figure: scatter plot of x versus y=x^2]
What does this output mean? The output is the probability that the given data series are not mutually correlated. In other words, the output is a value between 0 and 1, and the smaller the value, the more likely it is that the given data series are mutually correlated.
In this case, the result means that the probability that the two data series ('x' and 'y') are not mutually correlated is 0%.
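As a quick sanity check of this interpretation, we can run the test on two independent data series. This sketch relies only on dHSIC_test as used above; under independence the output is typically well above a conventional threshold such as 0.05, although the exact value varies from run to run.

from renom_dp.dHSIC import dHSIC_test
import numpy as np

# Two independent series: the reported probability is usually large,
# so we have no evidence that they are correlated.
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
p = dHSIC_test([a, b])
print(p)
print("correlated" if p < 0.05 else "no evidence of correlation")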

Of course, other kinds of correlations can also be detected. The following are some examples.

In the case of y=x^3+2

In [40]:
from renom_dp.dHSIC import dHSIC_test
x = np.random.randn(1000, 1)
y = x ** 3 + 2
plt.scatter(x, y)
print(dHSIC_test([x, y]))
0.0
[Figure: scatter plot of x versus y=x^3+2]

In the case of y=e^x

In [41]:
from renom_dp.dHSIC import dHSIC_test
x = np.random.randn(1000, 1)
y = np.exp(x)
plt.scatter(x, y)
plt.show()
print(dHSIC_test([x, y]))
[Figure: scatter plot of x versus y=e^x]
0.0

In all of these examples, the correlation was detected.

So what happens when the data series are not perfectly correlated?
Now, we test with y=x^3+noise, where the noise is Gaussian.
In [42]:
from renom_dp.dHSIC import dHSIC_test
x = np.random.randn(1000, 1)
y = x ** 3 + np.random.randn(1000, 1) * 3
plt.scatter(x, y)
plt.show()
print(dHSIC_test([x, y]))
[Figure: scatter plot of x versus y=x^3+noise]
0.0

The correlation was detected.

Next, we would like to check whether the results are reliable. To this end, we change the intensity of the noise and calculate the dHSIC output for each level.

In [19]:
from renom_dp.dHSIC import dHSIC_test
dHSIC_result = []
for n in range(100):
    temp = []
    for i in range(10):
        x = np.random.randn(1000, 1)
        y = x ** 3 + np.random.randn(1000, 1) * n
        temp.append(dHSIC_test([x, y]))
    dHSIC_result.append(np.average(temp))
plt.plot(dHSIC_result)
plt.xlabel("Noise Intensity")
plt.ylabel("dHSIC_test output")
plt.show()
[Figure: dHSIC_test output versus noise intensity]

To reduce the effect of chance, we repeated the calculation ten times for each noise level and plotted the average. The graph shows that the probability increases as the noise becomes stronger. We now proceed to the next step: applications of dHSIC.

Applications of dHSIC

There are many possible applications for dHSIC.

Sometimes, we need to select explanatory variables from many candidates. This is one possible application of dHSIC.

In [8]:
from renom_dp.dHSIC import dHSIC_test
x1 = np.random.randn(1000, 1)
x2 = np.random.randn(1000, 1)
x3 = np.random.randn(1000, 1)
x4 = np.random.randn(1000, 1)
x5 = np.random.randn(1000, 1)
y = x1 ** 3 + x2 ** 2 + np.random.randn(1000, 1) * 3
print("x1:{}".format(dHSIC_test([x1, y])))
print("x2:{}".format(dHSIC_test([x2, y])))
print("x3:{}".format(dHSIC_test([x3, y])))
print("x4:{}".format(dHSIC_test([x4, y])))
print("x5:{}".format(dHSIC_test([x5, y])))
x1:0.0
x2:3.3306690738754696e-16
x3:0.14150768560369287
x4:0.5367135121365071
x5:0.6508547196457113

The combinations (x1, y) and (x2, y) yield much smaller dHSIC probabilities than the others. This result suggests that x1 and x2 are probably explanatory variables of y; in fact, y was calculated from x1 and x2. In this way, dHSIC can be used to select possible explanatory variables from many candidates, for example as sketched below.
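The following is a small illustrative helper built only on dHSIC_test as used above; the function name select_candidates and the 0.05 threshold are our own choices, not part of the library.

from renom_dp.dHSIC import dHSIC_test

def select_candidates(candidates, y, threshold=0.05):
    # Keep the variables whose probability of "no correlation" with y
    # falls below the threshold.
    selected = []
    for name, x in candidates.items():
        if dHSIC_test([x, y]) < threshold:
            selected.append(name)
    return selected

# Reusing x1, ..., x5 and y from the cell above
print(select_candidates({"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5}, y))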

Here is another example. When y = x1 * x2, it is natural that neither x1 and y nor x2 and y show a correlation on their own: when x1 is large, y can be either large or small depending on the value of x2, and vice versa. However, y must be correlated with the combination of x1 and x2, whether we test the three variables x1, x2 and y jointly or concatenate x1 and x2 into one matrix and test it against y. Let us try!

In [20]:
from renom_dp.dHSIC import dHSIC_test
x1 = np.random.randn(1000, 1)
x2 = np.random.randn(1000, 1)
y = x1 * x2 + np.random.randn(1000, 1) * 3
print("x1:{}".format(dHSIC_test([x1, y])))
print("x2:{}".format(dHSIC_test([x2, y])))
print("x1&x2:{}".format(dHSIC_test([x1, x2, y])))
print("x1x2:{}".format(dHSIC_test([np.concatenate((x1, x2), axis = 1), y])))
x1:0.26645226359592644
x2:0.5707624912362048
x1&x2:0.000586965474501322
x1x2:4.9920888878784986e-05

Usage of dHSIC_test()

Now, let us see how to use the dHSIC_test function.

First, we need to import the function. NumPy is not strictly required, but it is very useful.

In [24]:
import numpy as np
from renom_dp.dHSIC import dHSIC_test
Next, we need to prepare the data we want to test.
First, we create the data or reshape it appropriately.
In [25]:
x1 = np.random.randn(1000, 1)
x2 = np.random.randn(1000, 1)
x3 = x1 * 3 + x2 * 1

Next, we pass it to the dHSIC_test function.

In [26]:
dHSIC_test([x1, x2])
Out[26]:
0.078981160413225782

Note that the input should be a list of 2-dimensional matrices (NumPy arrays of shape (n, d)).
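For example, if a data series is a one-dimensional array, it can be reshaped into a column matrix before being passed in. This is only a minimal sketch of that step:

import numpy as np

x = np.random.randn(1000)   # shape (1000,)
x = x.reshape(-1, 1)        # shape (1000, 1), a 2-dimensional matrix with one column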

We can also pass three arrays as follows.

In [27]:
dHSIC_test([x1, x2, x3])
Out[27]:
0.0

Or we can pass a list of NumPy arrays whose shapes are (n, 2), (n, 3), and so on.

In [28]:
x12 = np.concatenate((x1, x2), axis = 1)
dHSIC_test([x12, x3])
Out[28]:
0.0

Now, let us look at the available options.

Calculation of the probability requires a statistical test. In this implementation, three test methods are available: the Gamma test, the Permutation test and the Bootstrap test. See the table below for the characteristics of each method. The default is "Gamma".

Test          Speed   Input
Gamma         Fast    "Gamma"
Permutation   Slow    "Permutation"
Bootstrap     Slow    "Bootstrap"
In [29]:
print(dHSIC_test([x12, x3], method = "Gamma"))
print(dHSIC_test([x12, x3], method = "Permutation"))
print(dHSIC_test([x12, x3], method = "Bootstrap"))
0.0
0.01
0.01

When we use the Permutation or Bootstrap method, we can specify the number of shuffled data sets used for the test with the "samplesize" argument. The default is 99. As this number increases, the output becomes more reliable, but the calculation also consumes more resources.

In [30]:
print(dHSIC_test([x12, x3], method = "Permutation", samplesize = 100))
print(dHSIC_test([x12, x3], method = "Bootstrap", samplesize = 999))
0.00990099009901
0.001

We can use any kernel. The commonly used Gaussian and Laprasian (Laplacian) kernels are pre-implemented. The default is the Gaussian kernel.

In [31]:
print(dHSIC_test([x12, x3], kernel = "Gaussian"))
print(dHSIC_test([x12, x3], kernel = "Laprasian"))
0.0
0.0
We can also use a kernel that we define ourselves.
First, we define a kernel function.
In [32]:
def my_kernel(x, y):
    # Gaussian-type kernel: exp(-||x - y||^2); takes two arrays, returns a float
    return np.exp(-np.sum(np.abs(x - y) ** 2))
my_kernel(np.array([1,2]), np.array([3,4]))
Out[32]:
0.00033546262790251185
In [33]:
print(dHSIC_test([x12, x3], kernel = my_kernel))
0.0

Note that the inputs of a user-defined kernel function should be arrays, not floats, and that the output should be a float.
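As another illustrative user-defined kernel following the same convention (array inputs, float output), a Laplacian-style kernel could look like the sketch below. This function is our own example, not a pre-implemented option.

import numpy as np
from renom_dp.dHSIC import dHSIC_test

def my_laplacian_kernel(x, y):
    # Takes two sample vectors (arrays) and returns a float
    return float(np.exp(-np.sum(np.abs(x - y))))

# Reusing x12 and x3 from the cells above
print(dHSIC_test([x12, x3], kernel = my_laplacian_kernel))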

We can get detailed information by using the 'detail' argument.

First, we prepare an empty dictionary to receive the detailed information.

In [34]:
detail = {}

Next, just give the dictionary to the function.

In [35]:
dHSIC_test([x12, x3], detail = detail)
print(detail)
{'critical value': 0.52771129667280015, 'statistic': 45.164293934647823, 'dHSIC': 0.045164293934647826}
Here, we get the dHSIC value, the critical value and the test statistic.
The critical value corresponds to a significance level of p=0.05. If we want to change it, we can use the 'alpha' argument. Now, we change it to 0.01.
In [36]:
dHSIC_test([x12, x3], detail = detail, alpha = 0.01)
print(detail)
{'critical value': 0.63138353631016109, 'statistic': 45.164293934647823, 'dHSIC': 0.045164293934647826}
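As a usage sketch of these fields, assuming the usual convention that the test rejects independence when the statistic exceeds the critical value for the chosen alpha, the dictionary can be used to make the decision explicit:

# Reusing the 'detail' dictionary filled above
if detail["statistic"] > detail["critical value"]:
    print("Independence rejected: the data series appear to be correlated.")
else:
    print("Independence not rejected at this alpha level.")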

Citations

  1. Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic learning theory, pp. 63–77. Springer-Verlag.
  2. Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2007), A kernel statistical test of independence. Advances in Neural Information Processing Systems (NIPS 20), pp. 585–592.
  3. Pfister, N. , Bühlmann, P. , Schölkopf, B. and Peters, J. (2018), Kernel‐based tests for joint independence. J. R. Stat. Soc. B, 80: 5-31. doi:10.1111/rssb.12235