MNIST Dataset Mapping

An introduction to MNIST dataset mapping using ReNom TDA.

In this tutorial, we visualize the MNIST dataset. You will learn the following points.

  • How to create a topology using the ReNom TDA module.
  • How to understand the relation between the topology and the MNIST label values.

Requirements

In [1]:
import numpy as np

from sklearn.datasets import fetch_mldata

from renom_tda.topology import Topology
from renom_tda.lens import PCA

Dataset

First, we load the raw, binary MNIST data. To do this, we use the fetch_mldata function included in the scikit-learn package.

The MNIST dataset consists of 70000 digit images. Before doing anything else, we rescale the image data (originally integer values from 0 to 255) to the range 0 to 1.

In [2]:
# data_path must point to the directory containing the mldata folder.
data_path = "../dataset"
mnist = fetch_mldata('MNIST original', data_home=data_path)
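fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22, so the cell above only runs on older releases. The following is a minimal sketch of an alternative for newer versions, assuming fetch_openml with the OpenML dataset name "mnist_784". The helper name load_mnist is hypothetical and not part of ReNom TDA or scikit-learn.

```python
import numpy as np


def load_mnist(data_home="../dataset"):
    """Hypothetical helper: load MNIST from OpenML instead of mldata.org."""
    # Imported lazily so that defining the helper does not require scikit-learn.
    from sklearn.datasets import fetch_openml

    # as_frame=False returns NumPy arrays rather than a pandas DataFrame.
    data, target = fetch_openml(
        "mnist_784", version=1, return_X_y=True,
        as_frame=False, data_home=data_home,
    )
    # OpenML stores the labels as strings ("0" ... "9"); convert them to numbers.
    return data, target.astype(np.float64)
```

Calling `data, target = load_mnist()` downloads the dataset on first use. The returned arrays would take the place of mnist.data and mnist.target in the next cell. The sample order on OpenML may differ from the mldata copy, so slicing with [::10] may select a different subset.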
In [3]:
# Keep every 10th sample because mapping all 70000 samples is too expensive.
data = mnist.data[::10]
target = mnist.target[::10]

# Rescale the image data to 0 ~ 1.
data = data.astype(np.float32)
data /= data.max()
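The subsampling and rescaling steps can be tried without downloading MNIST. The sketch below uses a synthetic array as a stand-in (an assumption made only for illustration) and shows why the cast to float comes before the in-place division.

```python
import numpy as np

# Synthetic stand-in for MNIST: 100 "images" of 784 uint8 pixels (assumption).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 784), dtype=np.uint8)
fake_images[0, 0] = 255  # make sure the maximum pixel value is present

# Keep every 10th row, as in the cell above.
subset = fake_images[::10]

# Cast to float first; in-place true division is not allowed on integer arrays.
subset = subset.astype(np.float32)
subset /= subset.max()

print(subset.shape)  # (10, 784)
print(subset.min(), subset.max())
```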

Define topology instance

Next, we create a Topology instance.

In [4]:
topology = Topology()

Load data

Next, we load the data into the topology instance.

In [5]:
topology.load_data(data)

Create point cloud

Next, we create a point cloud by projecting the data onto a 2- or 3-dimensional space.

We use the fit_transform function to project the data. It takes two parameters, metric and lens.

The metric defines how the distance between data points is measured. The lens defines the axes of the projected space.

This tutorial uses metric None and the PCA lens, which amounts to dimensionality reduction with ordinary PCA.

In [6]:
metric = None
lens = [PCA(components=[0, 1])]
topology.fit_transform(metric=metric, lens=lens)
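To make the projection step concrete, here is a minimal NumPy sketch of ordinary PCA onto the first two components. It is not ReNom TDA's implementation: the function name pca_project is hypothetical, the input is synthetic, and any additional rescaling the library may apply to the point cloud is not shown.

```python
import numpy as np


def pca_project(x, components=(0, 1)):
    """Project the rows of x onto the selected principal components."""
    # Center each feature; PCA is defined on mean-centered data.
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[list(components)].T


rng = np.random.default_rng(0)
points = rng.normal(size=(200, 20))  # synthetic stand-in for the image data
cloud = pca_project(points)
print(cloud.shape)  # (200, 2)
```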

Mapping to topological space

Next, we create the topology.

We use the map function to map the point cloud to a topological space.

We set four parameters: resolution, overlap, eps and min_samples.

Resolution is the number of divisions of the projected space. It affects the number of nodes.

Overlap controls how easily nodes connect to each other. A larger value makes neighboring regions share more points, so more edges are created.

eps and min_samples are the parameters of the clustering method applied to the data inside each region.

In [7]:
topology.map(resolution=50, overlap=1, eps=0.4, min_samples=2)
created 3305 nodes.
created 22203 edges.
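The roles of resolution, overlap, eps and min_samples can be illustrated with a toy version of the Mapper procedure. This is a sketch under stated assumptions, not ReNom TDA's code: it assumes a square grid cover, assumes that overlap widens each cell by that fraction of its width on every side, assumes DBSCAN from scikit-learn as the clusterer, and assumes clustering runs on the original feature vectors. The function name mapper_sketch is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def mapper_sketch(data, cloud, resolution=10, overlap=0.5, eps=0.4, min_samples=2):
    """Toy Mapper: cover the 2-D point cloud, cluster each patch, link clusters."""
    # Scale the point cloud to the unit square so the grid is easy to define.
    span = np.ptp(cloud, axis=0)
    span[span == 0] = 1.0
    unit = (cloud - cloud.min(axis=0)) / span

    width = 1.0 / resolution  # side length of one grid cell
    pad = width * overlap     # assumed meaning of "overlap"
    nodes = []                # each node is a set of row indices
    for i in range(resolution):
        for j in range(resolution):
            low = np.array([i, j]) * width - pad
            high = low + width + 2 * pad
            inside = np.where(np.all((unit >= low) & (unit <= high), axis=1))[0]
            if len(inside) < min_samples:
                continue
            # Cluster the original rows that fall inside this patch.
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data[inside])
            for label in set(labels) - {-1}:  # -1 marks DBSCAN noise
                nodes.append(set(inside[labels == label].tolist()))

    # Two nodes are connected when they share at least one data point.
    edges = [(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]]
    return nodes, edges


rng = np.random.default_rng(0)
demo = rng.normal(scale=0.05, size=(60, 2))  # synthetic 2-D data
nodes, edges = mapper_sketch(demo, demo, resolution=4, overlap=0.5, eps=0.1)
print(len(nodes), "nodes,", len(edges), "edges")
```

Raising resolution produces more, smaller patches and therefore more nodes, while raising overlap makes adjacent patches share more points and therefore adds edges. The pairwise edge search is quadratic in the number of nodes, which is acceptable for a sketch but not for thousands of nodes.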

Color topology

Next, we color the topology using the color function.

In this tutorial, the topology is colored by the MNIST label values.

color_method can be "mean" or "mode", and color_type can be "rgb" or "gray".

In [8]:
topology.color(target, color_method="mode", color_type="rgb", normalize=True)
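As an illustration of what color_method="mode" summarizes, the sketch below computes the most frequent label among the points of each node. The node memberships and the function name node_modes are hypothetical, and how ReNom TDA converts the resulting values to RGB colors is not shown.

```python
import numpy as np


def node_modes(nodes, target):
    """Return the most frequent label among the points of each node."""
    labels = np.asarray(target).astype(int)
    # bincount counts each label; argmax picks the most common one
    # (ties resolve to the smallest label).
    return np.array([np.bincount(labels[list(node)]).argmax() for node in nodes])


target = np.array([0., 0., 1., 1., 1., 2.])
nodes = [[0, 1, 2], [2, 3, 4], [4, 5]]  # hypothetical node memberships
print(node_modes(nodes, target))  # [0 1 1]
```

For class labels such as digits, the mode keeps each node's value an actual label, whereas a mean over mixed nodes would produce values that correspond to no digit.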

Show topology

Finally, we show the topology.

The graph shows that data points with the same label lie close to each other.

In [9]:
topology.show(fig_size=(15, 15), node_size=1, edge_width=0.1, mode="spring", strength=0.05)
[Figure: output of topology.show(), the MNIST topology colored by label]