# Search Data

An introduction to searching topology data.

In this tutorial, we visualize the iris dataset and search its data. You will learn the following points.

• How to create a topology using the ReNom TDA module.
• How to search a topology.

## Requirements

In [1]:
import numpy as np

from renom_tda.topology import Topology
from renom_tda.lens import PCA

## Dataset

Next, we load the iris dataset. To do this, we use the load_iris function included in the scikit-learn package.

The iris dataset consists of 150 samples, and each sample has 4 columns.

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

## Create label data & column name data

We create the text data and the column name data.

In this tutorial, we use the names of the iris species as the text data.

In [3]:
setosa = ["setosa"] * 50
versicolor = ["versicolor"] * 50
virginica = ["virginica"] * 50
species = np.array(setosa + versicolor + virginica).reshape(-1, 1)

text_data_columns = ["species"]
number_data_columns = ["sepal length", "sepal width", "petal length", "petal width"]
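
As an alternative to hard-coding three blocks of 50 names, the species column can be derived from the integer labels. The sketch below is only an illustration: it uses a stand-in `target` array (50 samples each of classes 0, 1 and 2, the same layout as `iris.target`) so that it runs on its own, and the name `species_names` is introduced here for the example.

```python
import numpy as np

# Stand-in for iris.target: 50 samples each of classes 0, 1 and 2.
target = np.repeat([0, 1, 2], 50)

# Class names in the order used by the iris dataset.
species_names = np.array(["setosa", "versicolor", "virginica"])

# Index the names by the integer labels, then reshape to a single-column 2-D array.
species = species_names[target].reshape(-1, 1)

print(species.shape)
```

This keeps the labels correct even if the samples are shuffled, because each name is looked up from its own label.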

## Define topology instance

Next, we define a topology instance.

In [4]:
topology = Topology()

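In [5]:
# Load the numeric and text data with their column names
# (argument names are assumed to match the attributes printed below).
topology.load_data(data, number_data_columns=number_data_columns,
                   text_data=species, text_data_columns=text_data_columns)
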
You can check the input data as follows.

In [6]:
print(topology.number_data_columns)
['sepal length' 'sepal width' 'petal length' 'petal width']
In [7]:
print(topology.number_data.shape)
(150, 4)
In [8]:
print(topology.text_data_columns)
['species']
In [9]:
print(topology.text_data.shape)
(150, 1)

## Create point cloud

Next, we create a point cloud by projecting the data onto a 2- or 3-dimensional space.

We use the fit_transform function to project the data. It takes two parameters, metric and lens.

The metric defines how the distance between data points is measured. The lens defines the axes of the projected space.

This tutorial uses metric None and lens PCA, which means dimensionality reduction with ordinary PCA.

In [10]:
metric = None
lens = [PCA(components=[0, 1])]
topology.fit_transform(metric=metric, lens=lens)
None
projected by PCA.
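
To make the projection step concrete, here is a minimal NumPy sketch of projecting data onto its first two principal components. It is not ReNom TDA's implementation: the helper `pca_project` and the random stand-in data are assumptions made for this example, and any rescaling the library applies afterwards is not reproduced.

```python
import numpy as np


def pca_project(data, components=(0, 1)):
    """Project the rows of `data` onto the selected principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[list(components)].T


# Random stand-in for a (150, 4) dataset such as iris.
rng = np.random.default_rng(0)
data = rng.normal(size=(150, 4))

point_cloud = pca_project(data, components=(0, 1))
print(point_cloud.shape)
```

Selecting components 0 and 1 corresponds to the two directions of largest variance, which is why the first axis of the point cloud spreads at least as widely as the second.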

## Mapping to topological space

Next, we create the topology.

We use the map function to map the point cloud to a topological space.

We set four parameters: resolution, overlap, eps and min_samples.

Resolution is the number of divisions of the projected space. It affects the number of nodes.

Overlap controls how easily nodes are connected to each other.

Eps and min_samples are used by the clustering method applied to the data in each node.

In [11]:
topology.map(resolution=15, overlap=0.5, eps=0.1, min_samples=3)
created 70 nodes.
created 192 edges.
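
The effect of resolution and overlap can be illustrated on a single axis. The sketch below uses one common convention for building an overlapping cover of the range [0, 1]. How ReNom TDA computes its cover is not shown on this page, so the helpers `cover_intervals` and `containing_intervals` are hypothetical and the formula is an assumption.

```python
def cover_intervals(resolution, overlap):
    """Split [0, 1] into `resolution` intervals, each widened by `overlap`."""
    base_width = 1.0 / resolution
    half_width = base_width * (1.0 + overlap) / 2.0
    intervals = []
    for i in range(resolution):
        center = (i + 0.5) * base_width
        intervals.append((center - half_width, center + half_width))
    return intervals


def containing_intervals(value, intervals):
    """Return the indices of the intervals that contain `value`."""
    return [i for i, (lo, hi) in enumerate(intervals) if lo <= value <= hi]


no_overlap = cover_intervals(resolution=4, overlap=0.0)
half_overlap = cover_intervals(resolution=4, overlap=0.5)

# The same point falls into more intervals when the overlap grows.
print(containing_intervals(0.45, no_overlap))
print(containing_intervals(0.45, half_overlap))
```

A point that belongs to two intervals can end up in two nodes, and nodes that share data are connected by an edge. This is why a larger overlap tends to produce more edges.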

## Color topology

Next, we color the topology using the color function.

In this tutorial, the topology is colored by the iris label values.

We can set color_method to "mean" or "mode" and color_type to "rgb" or "gray".

In [12]:
topology.color(target, color_method="mode", color_type="rgb")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
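
The difference between the "mean" and "mode" color methods can be sketched as follows. This only illustrates the two statistics for the label values of one node. The helper `node_color_value` and the small arrays are made up for the example, and the mapping from the resulting value to an actual color is not covered.

```python
import numpy as np


def node_color_value(values, color_method="mean"):
    """Summarize the label values of the data in one node."""
    values = np.asarray(values)
    if color_method == "mean":
        return float(values.mean())
    if color_method == "mode":
        # Most frequent value; ties resolve to the smallest value.
        uniques, counts = np.unique(values, return_counts=True)
        return float(uniques[np.argmax(counts)])
    raise ValueError("color_method must be 'mean' or 'mode'")


target = np.array([0, 0, 1, 1, 1, 2])
node_data_ids = [1, 2, 3]  # label values 0, 1, 1

mean_value = node_color_value(target[node_data_ids], "mean")
mode_value = node_color_value(target[node_data_ids], "mode")
print(mean_value, mode_value)
```

For categorical labels such as the iris classes, "mode" keeps each node on one of the class values, while "mean" can produce in-between values for mixed nodes.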

## Search node data from node id

Next, we search for a node by its node id.

The search_from_id function prints the data of the node and returns the same data.

In [13]:
return_data = topology.search_from_id(0)
node id: 0
coordinate: [ 0.02756928  0.28271102]
data ids: [8, 13, 38]
text data columns:
['species']
text data:
[['setosa']
['setosa']
['setosa']]
number data columns:
['sepal length' 'sepal width' 'petal length' 'petal width']
number data:
[[ 4.4  2.9  1.4  0.2]
[ 4.3  3.   1.1  0.1]
[ 4.4  3.   1.3  0.2]]
In [14]:
return_data
Out[14]:
{'coordinate': array([ 0.02756928,  0.28271102]),
'data_ids': [8, 13, 38],
'id': 0,
'number_data': array([[ 4.4,  2.9,  1.4,  0.2],
[ 4.3,  3. ,  1.1,  0.1],
[ 4.4,  3. ,  1.3,  0.2]]),
'number_data_columns': array(['sepal length', 'sepal width', 'petal length', 'petal width'],
dtype='<U12'),
'text_data': array([['setosa'],
['setosa'],
['setosa']],
dtype='<U10'),
'text_data_columns': array(['species'],
dtype='<U7')}
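
The returned dictionary can be processed like any other Python data. The sketch below copies the values shown in the output above into a literal so that it runs without the library, then computes the per-column mean of the node's numeric data and the set of species it contains. The variable names `column_means` and `species_in_node` are introduced for this example.

```python
import numpy as np

# Literal copy of the dictionary printed above.
return_data = {
    "id": 0,
    "data_ids": [8, 13, 38],
    "number_data": np.array([[4.4, 2.9, 1.4, 0.2],
                             [4.3, 3.0, 1.1, 0.1],
                             [4.4, 3.0, 1.3, 0.2]]),
    "number_data_columns": np.array(
        ["sepal length", "sepal width", "petal length", "petal width"]),
    "text_data": np.array([["setosa"], ["setosa"], ["setosa"]]),
    "text_data_columns": np.array(["species"]),
}

# Mean of each numeric column over the data in this node.
column_means = {
    str(name): float(mean)
    for name, mean in zip(return_data["number_data_columns"],
                          return_data["number_data"].mean(axis=0))
}

# Distinct species found in this node.
species_in_node = sorted({str(s) for s in return_data["text_data"][:, 0]})

print(column_means)
print(species_in_node)
```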

## Search node data from some values

Next, we search for nodes by values.

First, we create a list of search dictionaries.
Each search dictionary has four keys: "data_type", "operator", "column" and "value".

Here we search for nodes in which the "species" column equals "versicolor".

In [15]:
search_dicts = [{
    "data_type": "text",
    "operator": "=",
    "column": "species",
    "value": "versicolor"
}]
We use the search_from_values function.
This function has 3 arguments.

search_dicts passes the search options to the instance. We can set multiple search options in this parameter.

target lets you search by values that are not included in the input data.

search_type selects whether columns are specified by name or by index. With "column", you can use the column names that were passed to the load_data function.
In [16]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=None, search_type="column")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
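
To show what a search dictionary expresses, the sketch below evaluates one against small stand-in arrays and then picks the nodes that contain at least one matching row. It is an illustration only: `match_rows`, the `nodes` mapping and the rule "a node matches if any of its rows match" are assumptions, not the library's documented behavior.

```python
import operator

import numpy as np

# Only the two operators used in this tutorial.
OPERATORS = {"=": operator.eq, ">": operator.gt}


def match_rows(search_dict, number_data, number_columns, text_data, text_columns):
    """Return a boolean mask of the rows that satisfy one search dictionary."""
    if search_dict["data_type"] == "number":
        data, columns = number_data, number_columns
    else:
        data, columns = text_data, text_columns
    column = search_dict["column"]
    # Accept either a column name or a positional index.
    col_index = columns.index(column) if isinstance(column, str) else column
    compare = OPERATORS[search_dict["operator"]]
    return compare(data[:, col_index], search_dict["value"])


number_data = np.array([[5.1, 3.5], [6.4, 3.2], [5.9, 3.0], [6.3, 3.3]])
text_data = np.array([["setosa"], ["versicolor"], ["versicolor"], ["virginica"]])
nodes = {0: [0], 1: [1, 2], 2: [2, 3]}  # hypothetical node id -> data ids

search_dict = {"data_type": "text", "operator": "=",
               "column": "species", "value": "versicolor"}
mask = match_rows(search_dict, number_data, ["sepal length", "sepal width"],
                  text_data, ["species"])

# Keep the nodes that contain at least one matching row.
node_index = [node_id for node_id, ids in nodes.items() if mask[ids].any()]
print(node_index)
```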

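As another example, we search by column index for nodes in which the first column (sepal length) is greater than 6.0.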

In [17]:
search_dicts = [{
    "data_type": "number",
    "operator": ">",
    "column": 0,
    "value": 6.0
}]
In [18]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=None, search_type="index")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)

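You can also combine multiple search options. Here the target argument is passed so that the label values can be searched as the "target" column.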

In [19]:
search_dicts = [{
    "data_type": "number",
    "operator": "=",
    "column": "target",
    "value": 1
}, {
    "data_type": "number",
    "operator": ">",
    "column": "sepal length",
    "value": 6.0
}]
In [20]:
topology.color(target, color_method="mode", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=target, search_type="column")
topology.show(fig_size=(10, 10), node_size=10, edge_width=2)
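
The combined search can be pictured as several row filters applied to a table that includes the target as an extra column. The sketch below assumes the conditions are combined with a logical AND, which this page does not state explicitly. The tiny arrays and the `conditions` tuples are made up for the example.

```python
import numpy as np

number_data = np.array([[5.1], [6.4], [5.9], [6.3]])  # one column: sepal length
target = np.array([0, 1, 1, 2])

# Append the target as an extra numeric column so it can be searched by name.
columns = ["sepal length", "target"]
table = np.column_stack([number_data, target])

# (column, operator, value) triples mirroring the two search dictionaries.
conditions = [("target", "=", 1), ("sepal length", ">", 6.0)]

# Assumption: a row matches only if it satisfies every condition (logical AND).
mask = np.ones(len(table), dtype=bool)
for column, op, value in conditions:
    col = table[:, columns.index(column)]
    mask &= (col == value) if op == "=" else (col > value)

print(mask)
```

Under this assumption only the second row matches, since it is the only sample with label 1 and a sepal length above 6.0.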