Search Data

An introduction to searching topology data.

In this tutorial, we visualize the iris dataset and search its data. You will learn the following points.

  • How to create a topology using the ReNom TDA module.
  • How to search a topology.

Requirements

In [1]:
import numpy as np

from sklearn.datasets import load_iris

from renom_tda.topology import Topology
from renom_tda.lens import PCA

Dataset

Next, we load the iris dataset. To do this, we use the load_iris function included in the scikit-learn package.

The iris dataset consists of 150 samples, and each sample has 4 numeric columns.

In [2]:
iris = load_iris()

data = iris.data
target = iris.target

Create label data & column name data.

We create the text data and the column names for the text and numeric data.

In this tutorial, the text data is the name of each sample's iris species.

In [3]:
setosa = ["setosa"] * 50
versicolor = ["versicolor"] * 50
virginica = ["virginica"] * 50
species = np.array(setosa + versicolor + virginica).reshape(-1, 1)

text_data_columns = ["species"]
number_data_columns = ["sepal length", "sepal width", "petal length", "petal width"]
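
The species column above is typed out by hand, which relies on the samples being ordered 50 per species. As a possible alternative, the sketch below derives the same kind of column by indexing an array of names with the integer labels. The labels array is a stand-in that mimics iris.target, and the name order is an assumption (setosa, versicolor, virginica).

```python
import numpy as np

# Stand-in for iris.target: 50 samples each of classes 0, 1 and 2 (assumed ordering).
labels = np.repeat([0, 1, 2], 50)

# Assumed class names, in the same order as the integer labels.
names = np.array(["setosa", "versicolor", "virginica"])

# Indexing with the label array maps every integer label to its name;
# reshape(-1, 1) turns the result into a single text column.
species_column = names[labels].reshape(-1, 1)

print(species_column.shape)
print(species_column[0, 0], species_column[50, 0], species_column[100, 0])
```

Building the column from the labels keeps the text data aligned with the numeric data even if the sample order changes.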

Define topology instance

Next, we create a Topology instance.

In [4]:
topology = Topology()

Load data

Next, we load the data.

We use the load_data function to load the numeric data, the text data, and their column names into the topology instance.

In [5]:
topology.load_data(data, number_data_columns=number_data_columns, text_data=species, text_data_columns=text_data_columns)

You can check the loaded data as follows.

In [6]:
print(topology.number_data_columns)
['sepal length' 'sepal width' 'petal length' 'petal width']
In [7]:
print(topology.number_data.shape)
(150, 4)
In [8]:
print(topology.text_data_columns)
['species']
In [9]:
print(topology.text_data.shape)
(150, 1)

Create point cloud

Next, we create a point cloud by projecting the data onto a 2- or 3-dimensional space.

We use the fit_transform function to project the data. It takes two parameters, metric and lens.

The metric defines how the distance between data points is measured. The lens defines the axes of the projected space.

This tutorial uses metric None and a PCA lens, which amounts to dimensionality reduction with ordinary PCA.

In [10]:
metric = None
lens = [PCA(components=[0, 1])]
topology.fit_transform(metric=metric, lens=lens)
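
To make "ordinary PCA" concrete, the following sketch projects a data matrix onto its first two principal components with NumPy. It only illustrates the technique and is not ReNom TDA's implementation. The function name pca_project and the random sample data are hypothetical, and any rescaling the library may apply to the projected points is not reproduced.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project rows of `data` onto the first `n_components` principal axes."""
    # PCA works on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for a (150, 4) data matrix such as the iris features.
rng = np.random.default_rng(0)
sample = rng.normal(size=(150, 4))

projected = pca_project(sample, n_components=2)
print(projected.shape)
```

The two output columns play the role of components 0 and 1 selected by PCA(components=[0, 1]).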

Mapping to topological space

Next, we create the topology.

We use the map function to map the point cloud to a topological space.

We set four parameters: resolution, overlap, eps and min_samples.

Resolution is the number of divisions of the projected space. It affects the number of nodes.

Overlap controls how easily nodes connect to each other.

Eps and min_samples are used by the clustering method that is applied to the data in each node.

In [11]:
topology.map(resolution=15, overlap=0.5, eps=0.1, min_samples=3)
created 69 nodes.
created 188 edges.
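
The role of resolution and overlap can be illustrated with a one-dimensional cover, as used in the general Mapper construction. This is a sketch under assumptions: cover_intervals and intervals_containing are hypothetical helpers, and the exact way ReNom TDA widens its intervals may differ.

```python
def cover_intervals(low, high, resolution, overlap):
    """Split [low, high] into `resolution` intervals, each widened on both
    sides by `overlap` times the base interval width."""
    width = (high - low) / resolution
    pad = width * overlap
    return [(low + i * width - pad, low + (i + 1) * width + pad)
            for i in range(resolution)]

def intervals_containing(value, intervals):
    """Return the positions of all intervals that contain `value`."""
    return [i for i, (start, end) in enumerate(intervals) if start <= value <= end]

# With overlap, the point 0.3 falls into two neighbouring intervals.
intervals = cover_intervals(0.0, 1.0, resolution=4, overlap=0.5)
shared = intervals_containing(0.3, intervals)

# Without overlap, the same point belongs to exactly one interval.
no_overlap = intervals_containing(0.3, cover_intervals(0.0, 1.0, 4, 0.0))
print(shared, no_overlap)
```

In the general Mapper construction, a point that lands in two intervals can end up in two nodes, and shared points are what link nodes, so a larger overlap tends to produce more edges.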

Color topology

Next, we color the topology using the color function.

In this tutorial, the topology is colored by the iris label values.

color_method can be "mean" or "mode", and color_type can be "rgb" or "gray".

In [12]:
topology.color(target, color_method="mode", color_type="rgb")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
../../../_images/notebooks_tda-2.1.0_search-data_notebook_21_0.png
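
The difference between "mean" and "mode" shows up in nodes that mix classes. The sketch below computes both for one node using only the standard library. It assumes the usual iris ordering (50 samples per class) and uses the members of node 35 as listed in the topology.hypercubes output later in this tutorial. It illustrates the two statistics and is not the library's coloring code.

```python
from statistics import mean, mode

# Iris labels: samples 0-49 are class 0, 50-99 class 1, 100-149 class 2.
labels = [i // 50 for i in range(150)]

# Members of node 35, copied from the hypercubes output.
node_members = [84, 90, 94, 121]
member_labels = [labels[i] for i in node_members]

node_mean = mean(member_labels)  # average label, can fall between classes
node_mode = mode(member_labels)  # most frequent label, always an actual class
print(member_labels, node_mean, node_mode)
```

With categorical labels such as species, "mode" keeps each node on a real class value, while "mean" blends the classes present in the node.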

Search node data from some values.

Next, we search for nodes by value.

First, we create a list of search condition dictionaries.

Each search dictionary has four keys: "data_type", "operator", "column" and "value".

Here we search for nodes whose "species" column equals "versicolor".

In [13]:
search_dicts = [{
    "data_type": "text",
    "operator": "=",
    "column": 0,
    "value": "versicolor"
}]

We use the search function.

This function has 3 arguments: search_dicts, target and search_type.

search_dicts holds the search conditions. You can pass multiple conditions in this list.

target lets you search on values that are not included in the loaded data. Pass those values with the target argument.

search_type selects whether the conditions are combined with "and" or "or".

In [14]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search(search_dicts=search_dicts)
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
../../../_images/notebooks_tda-2.1.0_search-data_notebook_27_0.png

Another case: search on a numeric column

In [15]:
search_dicts = [{
    "data_type": "number",
    "operator": ">",
    "column": 0,
    "value": 6.0
}]
In [16]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search(search_dicts=search_dicts)
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
../../../_images/notebooks_tda-2.1.0_search-data_notebook_30_0.png

You can combine multiple search conditions.

In [17]:
search_dicts = [{
    "data_type": "number",
    "operator": "=",
    "column": -1, # search with target argument data
    "value": 1
}, {
    "data_type": "number",
    "operator": ">",
    "column": 0,
    "value": 6.0
}]
In [18]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search(search_dicts=search_dicts, target=target, search_type="and")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
../../../_images/notebooks_tda-2.1.0_search-data_notebook_33_0.png
In [19]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=target, search_type="or")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
../../../_images/notebooks_tda-2.1.0_search-data_notebook_34_0.png
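
How "and" and "or" combine several conditions can be sketched in plain Python. This is not the library's code: matching_rows is a hypothetical helper, only the two operators used in this tutorial ("=" and ">") are handled, and the four sample rows are made up, with the label stored in the last column so that column -1 refers to it.

```python
import operator

# Only the operators that appear in this tutorial.
OPERATORS = {"=": operator.eq, ">": operator.gt}

def matching_rows(rows, search_dicts, search_type="and"):
    """Return indices of rows that satisfy all ("and") or any ("or") conditions."""
    combine = all if search_type == "and" else any
    return [
        index
        for index, row in enumerate(rows)
        if combine(
            OPERATORS[cond["operator"]](row[cond["column"]], cond["value"])
            for cond in search_dicts
        )
    ]

# Made-up rows: [sepal length, label]; the label plays the role of target.
rows = [[5.1, 0], [7.0, 1], [5.5, 1], [6.3, 2]]
conditions = [
    {"data_type": "number", "operator": "=", "column": -1, "value": 1},
    {"data_type": "number", "operator": ">", "column": 0, "value": 6.0},
]

and_result = matching_rows(rows, conditions, search_type="and")
or_result = matching_rows(rows, conditions, search_type="or")
print(and_result, or_result)
```

The "and" search keeps only the row that meets both conditions, while the "or" search also keeps rows that meet just one of them.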

Get searched data index

You can get the indices of the matching data with the _get_searched_index function.

In [20]:
data = np.concatenate([topology.number_data, target.reshape(-1,1)], axis=1)
data_index = topology._get_searched_index(data=data, search_dicts=search_dicts, search_type="and")
In [21]:
data_index
Out[21]:
[50,
 51,
 52,
 54,
 56,
 58,
 62,
 63,
 65,
 68,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 83,
 85,
 86,
 87,
 91,
 97]
In [22]:
topology.number_data[data_index]
Out[22]:
array([[ 7. ,  3.2,  4.7,  1.4],
       [ 6.4,  3.2,  4.5,  1.5],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 6.3,  3.3,  4.7,  1.6],
       [ 6.6,  2.9,  4.6,  1.3],
       [ 6. ,  2.2,  4. ,  1. ],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 6.7,  3.1,  4.4,  1.4],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 6.1,  2.8,  4. ,  1.3],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.8,  4.7,  1.2],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.8,  2.8,  4.8,  1.4],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6. ,  2.9,  4.5,  1.5],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 6.2,  2.9,  4.3,  1.3]])
In [23]:
target[data_index]
Out[23]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1])

Get data from node_id

Node data is stored in the topology.hypercubes attribute.

hypercubes is a dictionary that maps each key to a list of data indices.

Each key is a node_id.

In [24]:
topology.hypercubes
Out[24]:
{0: [8, 13, 38],
 1: [3, 8, 13, 38, 42],
 2: [1, 2, 3, 12, 29, 30, 42, 45, 47],
 3: [1, 2, 6, 9, 11, 12, 29, 30, 34, 37, 45, 47],
 4: [6, 7, 9, 11, 22, 34, 35, 37, 49],
 5: [0, 4, 7, 17, 22, 27, 28, 35, 39, 40, 49],
 6: [0, 4, 17, 19, 21, 27, 28, 39, 40, 46],
 7: [19, 21, 36, 46, 48],
 8: [16, 32, 36, 48],
 9: [16, 32, 33],
 10: [3, 8, 38],
 11: [1, 2, 3, 12, 25, 29, 30, 45, 47],
 12: [1, 2, 6, 9, 11, 12, 24, 25, 29, 30, 34, 37, 45, 47],
 13: [6, 7, 9, 11, 23, 24, 26, 34, 35, 37, 43, 49],
 14: [0, 4, 7, 17, 20, 23, 26, 27, 28, 35, 39, 40, 43, 49],
 15: [0, 4, 17, 19, 20, 21, 27, 28, 31, 39, 40, 44, 46],
 16: [5, 10, 19, 21, 31, 36, 44, 46, 48],
 17: [5, 10, 16, 18, 32, 36, 48],
 18: [16, 18, 32, 33],
 19: [23, 24, 26, 43],
 20: [20, 23, 26, 43],
 21: [20, 31, 44],
 22: [5, 10, 31, 44],
 23: [5, 10, 18],
 24: [57, 60, 93],
 25: [53, 59, 80, 81, 89],
 26: [59, 62, 69, 80, 81, 89],
 27: [62, 67, 69, 79, 82, 92],
 28: [64, 67, 79, 82, 88, 92],
 29: [53, 59, 80, 89, 90],
 30: [59, 62, 69, 80, 84, 89, 90, 94],
 31: [55, 62, 66, 67, 69, 82, 84, 92, 94, 99],
 32: [55, 64, 66, 67, 82, 88, 92, 95, 96, 99],
 33: [61, 64, 71, 88, 95, 96, 97],
 34: [61, 71, 74, 97],
 35: [84, 90, 94, 121],
 36: [55, 66, 84, 94, 99, 121],
 37: [55, 63, 66, 73, 78, 87, 95, 96, 99, 138],
 38: [61, 63, 70, 71, 73, 78, 91, 95, 96, 97, 138],
 39: [54, 58, 61, 70, 71, 74, 85, 91, 97],
 40: [51, 54, 56, 58, 74, 75, 85],
 41: [51, 56, 65, 75, 86],
 42: [68, 101, 113, 114, 119, 121, 142],
 43: [68, 72, 83, 87, 101, 114, 121, 142, 146],
 44: [63, 72, 73, 78, 83, 87, 123, 126, 133, 138, 146, 149],
 45: [63, 70, 73, 78, 91, 123, 126, 127, 133, 138, 149],
 46: [54, 58, 70, 85, 91, 127],
 47: [51, 54, 56, 58, 75, 76, 77, 85, 110],
 48: [51, 52, 56, 65, 75, 76, 77, 86, 110],
 49: [50, 52, 65, 86],
 50: [101, 113, 114, 119, 142],
 51: [72, 83, 101, 114, 134, 142, 146],
 52: [72, 83, 103, 111, 123, 126, 128, 133, 146, 149],
 53: [103, 111, 116, 123, 126, 127, 128, 133, 137, 149],
 54: [115, 116, 127, 136, 137, 145, 147, 148],
 55: [76, 77, 110, 115, 136, 139, 145, 147, 148],
 56: [52, 76, 77, 110, 139, 141],
 57: [103, 108, 111, 128, 132],
 58: [100, 103, 104, 108, 111, 116, 128, 132, 137],
 59: [100, 104, 112, 115, 116, 136, 137, 140, 145, 147, 148],
 60: [102, 112, 115, 120, 124, 136, 139, 140, 143, 144, 145, 147, 148],
 61: [102, 120, 124, 125, 129, 139, 141, 143, 144],
 62: [100, 104, 108, 132],
 63: [100, 104, 112, 140],
 64: [102, 107, 112, 120, 124, 130, 140, 143, 144],
 65: [102, 107, 120, 124, 125, 129, 130, 143, 144],
 66: [105, 107, 122, 130],
 67: [105, 122, 135],
 68: [105, 118, 122]}

For example, get the information for node 0.

In [25]:
data_index = topology.hypercubes[0]
In [26]:
topology.number_data[data_index]
Out[26]:
array([[ 4.4,  2.9,  1.4,  0.2],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 4.4,  3. ,  1.3,  0.2]])
In [27]:
target[data_index]
Out[27]:
array([0, 0, 0])
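
The lookup can also go the other way, from a data index to the nodes that contain it. The sketch below does this on a few entries copied from the hypercubes output above. The helper nodes_containing is hypothetical and not part of ReNom TDA.

```python
# A subset of the hypercubes dictionary shown above (node_id -> data indices).
hypercubes = {
    0: [8, 13, 38],
    1: [3, 8, 13, 38, 42],
    2: [1, 2, 3, 12, 29, 30, 42, 45, 47],
    10: [3, 8, 38],
}

def nodes_containing(hypercubes, data_index):
    """Return the sorted node ids whose member list includes `data_index`."""
    return sorted(
        node_id for node_id, members in hypercubes.items() if data_index in members
    )

nodes_for_8 = nodes_containing(hypercubes, 8)
nodes_for_3 = nodes_containing(hypercubes, 3)
print(nodes_for_8, nodes_for_3)
```

Because nodes overlap, a single data index usually belongs to more than one node.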