Baseball data mapping

An introduction to mapping baseball data using ReNom TDA.

In this tutorial, we visualize baseball data using the ReNom TDA module. You will learn the following:

  • How to analyze a topology graph.

Requirements

In [1]:
import numpy as np

import pandas as pd

from sklearn.cluster import DBSCAN

from renom.tda.topology import SearchableTopology
from renom.tda.lens import PCA

Import baseball data

We use 2016 hitter statistics from https://github.com/nyk510/baseball_dataset/tree/master/data
and calculate the following sabermetrics measures from them.
  • OPS(On-base Plus Slugging)
OPS = OBP + SLG
OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
SLG = (1B + 2 × 2B + 3 × 3B + 4 × HR) / AB
  • IsoP(Isolated Power)

IsoP = SLG - AVG

  • BABIP(Batting Average on Balls In Play)

BABIP = (H - HR) / (AB - K - HR + SF)

  • BB/K
  • PA/K
  • AB/HR
  • SecA(Secondary average)

SecA = (TB - H + BB + SB - CS) / AB

  • TA(Total Average)

TA = ( TB + BB + HBP + SB - CS ) / ( AB - H + CS + DP )

  • PS(Power-Speed-Number)

PS = ( HR × SB × 2) / ( HR + SB )

  • RC27(Runs Created per 27 outs)
RC = ( 2.4 × C + A ) × ( 3 × C + B ) ÷ (9 × C) - 0.9 × C
A = H + BB + HBP - CS - DP
B = TB + 0.26 ×(BB + HBP) + 0.53 × SF + 0.64 × SB - 0.03 × K
C = AB + BB + HBP + SF
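
The formulas above can be sketched in plain Python as follows. This is only an illustration: the stat line is made up and not taken from the dataset, the function name `sabermetrics` is hypothetical, and only the metrics whose formulas are written out above are computed (RC is computed as given, without the per-27-outs scaling).

```python
# Hypothetical season line for one hitter; the numbers are made up for
# illustration and are not taken from the dataset.
stats = {
    "AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
    "BB": 60, "HBP": 5, "SF": 5, "K": 100,
    "SB": 10, "CS": 5, "DP": 10,
}


def sabermetrics(s):
    """Compute the metrics listed above from raw counting stats."""
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]  # total bases
    avg = s["H"] / s["AB"]
    obp = (s["H"] + s["BB"] + s["HBP"]) / (s["AB"] + s["BB"] + s["HBP"] + s["SF"])
    slg = tb / s["AB"]
    # Runs Created terms A, B and C as defined above.
    a = s["H"] + s["BB"] + s["HBP"] - s["CS"] - s["DP"]
    b = (tb + 0.26 * (s["BB"] + s["HBP"]) + 0.53 * s["SF"]
         + 0.64 * s["SB"] - 0.03 * s["K"])
    c = s["AB"] + s["BB"] + s["HBP"] + s["SF"]
    return {
        "OPS": obp + slg,
        "IsoP": slg - avg,
        "BABIP": (s["H"] - s["HR"]) / (s["AB"] - s["K"] - s["HR"] + s["SF"]),
        "SecA": (tb - s["H"] + s["BB"] + s["SB"] - s["CS"]) / s["AB"],
        "TA": (tb + s["BB"] + s["HBP"] + s["SB"] - s["CS"])
              / (s["AB"] - s["H"] + s["CS"] + s["DP"]),
        # Undefined (division by zero) when a hitter has no HR and no SB.
        "PS": (s["HR"] * s["SB"] * 2) / (s["HR"] + s["SB"]),
        "RC": (2.4 * c + a) * (3 * c + b) / (9 * c) - 0.9 * c,
    }


metrics = sabermetrics(stats)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Several of these ratios are undefined for hitters with very few at-bats or no home runs, which is one reason missing values can appear in the data.
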
In [2]:
file_path = "hitter_metrics.csv"
pdata = pd.read_csv(file_path).dropna()

Extract categorical data & numerical data

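We split the columns into categorical data, such as team name and player name, and numerical data. The numerical data is then standardized so that each column has zero mean and unit variance.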

In [3]:
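# Columns with dtype "object" hold strings such as team and player names.
categorical_data = np.array(pdata.loc[:, pdata.dtypes=="object"])

# Keep the float and int columns, then standardize each column (z-score).
numerical_data = np.array(pdata.loc[:, np.logical_or(pdata.dtypes=="float", pdata.dtypes=="int")])
numerical_data = (numerical_data - np.average(numerical_data, axis=0)) / np.std(numerical_data, axis=0)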
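
The standardization step can be checked on a small array. The sketch below uses made-up values in place of `numerical_data`. The guard for constant columns is an assumption added for illustration; the cell above does not include it, and a constant column there would produce NaN values.

```python
import numpy as np

# Made-up 4x3 array standing in for numerical_data; the last column is constant.
x = np.array([
    [0.80, 10.0, 1.0],
    [0.70, 20.0, 1.0],
    [0.90, 30.0, 1.0],
    [0.60, 40.0, 1.0],
])

mean = x.mean(axis=0)
std = x.std(axis=0)  # population standard deviation, the np.std default

# Assumption: leave zero-variance columns at 0 instead of dividing by zero.
safe_std = np.where(std == 0, 1.0, std)
z = (x - mean) / safe_std

print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1, except 0 for the constant column
```

Standardizing matters here because the metrics live on very different scales (for example OPS is around 1 while AB/HR can be in the tens), and PCA would otherwise be dominated by the columns with the largest variance.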

Create topology instance

In [4]:
topology = SearchableTopology()

Register categorical data with the topology instance

In [5]:
topology.regist_categorical_data(categorical_data)

Create point cloud

In [6]:
metric = None
lens = [PCA(components=[0,1])]
topology.fit_transform(numerical_data, metric=metric, lens=lens)
projected by PCA.
finish fit_transform.
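
The lens `PCA(components=[0,1])` projects the standardized data onto its first two principal components, which gives the 2-D point cloud used in the next step. The following is a minimal NumPy sketch of that projection. It is not ReNom TDA's implementation, the function name `pca_project` is hypothetical, the input is random data, and the library may additionally rescale the projected coordinates.

```python
import numpy as np


def pca_project(data, components=(0, 1)):
    """Project the rows of `data` onto the selected principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[list(components)].T


rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 5))  # random stand-in for numerical_data
points = pca_project(sample)       # one 2-D point per row of `sample`
print(points.shape)
```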

Mapping to Topological Space

In [7]:
topology.map(resolution=25, overlap=0.7, clusterer=DBSCAN(eps=10, min_samples=1))
mapping start, please wait...
created 251 nodes.
calculating cluster coordination.
calculating edge.
created 935 edges.
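
The `map` call covers the projected point cloud with overlapping cells, clusters the points inside each cell, turns each cluster into a node, and connects nodes that share points. The 1-D sketch below illustrates that idea in plain Python. The function names and the way `overlap` widens each interval are assumptions made for illustration, so ReNom TDA's exact cover may differ. The clustering step groups points whose neighbours are within `eps`, which is what DBSCAN reduces to when `min_samples=1`, because every point is then a core point.

```python
def cover_intervals(lo, hi, resolution, overlap):
    """Split [lo, hi] into `resolution` intervals, each widened by `overlap`."""
    step = (hi - lo) / resolution
    pad = step * overlap
    return [(lo + i * step - pad, lo + (i + 1) * step + pad)
            for i in range(resolution)]


def threshold_clusters(indices, values, eps):
    """Group 1-D points whose sorted neighbours are at most `eps` apart."""
    clusters = []
    for i in sorted(indices, key=lambda k: values[k]):
        if clusters and values[i] - values[clusters[-1][-1]] <= eps:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def mapper_1d(values, resolution, overlap, eps):
    """Return (nodes, edges); each node is a set of point indices."""
    nodes = []
    for a, b in cover_intervals(min(values), max(values), resolution, overlap):
        inside = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(set(c) for c in threshold_clusters(inside, values, eps))
    # Two nodes are connected when they share at least one data point.
    edges = [(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges


lens_values = [0.0, 0.1, 0.2, 0.9, 1.0, 2.0, 2.1, 3.0]  # made-up lens values
nodes, edges = mapper_1d(lens_values, resolution=3, overlap=0.3, eps=0.5)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because neighbouring intervals overlap, the same data point can belong to several nodes. In general, a larger `resolution` produces more and smaller nodes, and a larger `overlap` produces more shared points and therefore more edges.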

Color topological graph & show

Next, we color the topological graph by each numerical column in turn and show the result. The last column, 年俸(推定), is the estimated annual salary.

In [8]:
for i in range(len(pdata.columns)-2):
    if i == (len(pdata.columns)-3):
        print("colored by %s." % pdata.columns[i+2]+"(salary)")
    else:
        print("colored by %s." % pdata.columns[i+2])
    topology.color(topology.data[:, i], dtype="categorical", ctype="rgb", normalized=False)
    topology.show(fig_size=(10,10), node_size=5, edge_width=0.5)
colored by OPS.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_1.png
colored by IsoP.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_3.png
colored by BABIP.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_5.png
colored by BB/K.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_7.png
colored by PA/K.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_9.png
colored by AB/HR.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_11.png
colored by SecA.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_13.png
colored by TA.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_15.png
colored by PS.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_17.png
colored by RC27.
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_19.png
colored by 年俸(推定).(salary)
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_16_21.png

Search nodes for a player

In [9]:
topology.color(topology.data[:, 0], dtype="categorical", ctype="rgb", normalized=False)
topology.search("大谷 翔平")
大谷 翔平 is in [63] data.
大谷 翔平 is in [242, 243, 245, 246, 247, 248] node.
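
The output first reports the row index of the matching record and then every node that contains that record. A single player can appear in several nodes because neighbouring cover cells overlap. The lookup can be sketched as below. This is not ReNom TDA's implementation: the names `search_nodes`, `categorical_rows` and `node_members` are hypothetical, the records are made up, and exact matching on a field is an assumption.

```python
def search_nodes(categorical_rows, node_members, query):
    """Return (row indices matching `query`, node indices containing them)."""
    # Assumption: a row matches when any of its fields equals the query.
    hits = [i for i, row in enumerate(categorical_rows) if query in row]
    hit_set = set(hits)
    found = [n for n, members in enumerate(node_members)
             if hit_set & set(members)]
    return hits, found


# Made-up (team, player) records and node memberships, for illustration only.
categorical_rows = [
    ("Team A", "Player 1"),
    ("Team A", "Player 2"),
    ("Team B", "Player 3"),
]
node_members = [[0], [0, 1], [1, 2], [2]]

player_rows, player_nodes = search_nodes(categorical_rows, node_members, "Player 3")
team_rows, team_nodes = search_nodes(categorical_rows, node_members, "Team A")
print(player_rows, player_nodes)
print(team_rows, team_nodes)
```

Searching for a team works the same way, except that several rows match and the resulting nodes are spread more widely over the graph.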

Show searched player in topological graph

In [10]:
topology.show(fig_size=(10,10), node_size=5, edge_width=0.5)
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_20_0.png

Search team

In [11]:
topology.color(topology.data[:, 0], dtype="categorical", ctype="rgb", normalized=False)
topology.search("ヤクルト")
ヤクルト is in [41, 42, 43, 44, 45, 46, 47, 48] data.
ヤクルト is in [11, 12, 13, 23, 24, 25, 59, 60, 61, 78, 79, 80, 81, 82, 87, 88, 89, 90, 91, 99, 100, 101, 102, 103, 108, 109, 110, 111, 112, 122, 123, 124, 132, 133, 134, 153, 154, 155, 194, 195, 209, 210, 249, 250] node.
In [12]:
topology.show(fig_size=(10,10), node_size=5, edge_width=0.5)
../../../_images/notebooks_tda-case-study_baseball-data-mapping_notebook_23_0.png

Conclusion

These graphs show that slugging measures such as OPS and RC27 vary mainly along the horizontal axis, while batting-eye measures such as BB/K and PA/K vary mainly along the vertical axis.
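
One way to check this reading quantitatively is to correlate each metric with the two PCA coordinates. The sketch below does this on synthetic data, not on the baseball dataset. The two latent factors and the column names are made up to stand in for slugging-type and batting-eye-type metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
power = rng.normal(size=n)  # synthetic "slugging" factor
eye = rng.normal(size=n)    # synthetic "batting eye" factor


def noise():
    """Small independent noise added to each synthetic column."""
    return 0.1 * rng.normal(size=n)


# Three columns driven by `power`, two by `eye`; stand-ins for real metrics.
names = ["ops_like", "isop_like", "rc27_like", "bbk_like", "pak_like"]
data = np.column_stack([power + noise(), power + noise(), power + noise(),
                        eye + noise(), eye + noise()])
data = (data - data.mean(axis=0)) / data.std(axis=0)

# 2-D PCA coordinates of the standardized data.
_, _, vt = np.linalg.svd(data, full_matrices=False)
coords = data @ vt[:2].T

# corr[j, k] is the correlation of column j with PCA axis k.
corr = np.array([[np.corrcoef(data[:, j], coords[:, k])[0, 1]
                  for k in range(2)]
                 for j in range(len(names))])

for name, (r1, r2) in zip(names, corr):
    print(f"{name}: axis 1 = {r1:+.2f}, axis 2 = {r2:+.2f}")
```

Applied to `numerical_data` and the column names of the real dataset, a table like this would show which metrics drive each axis of the graph.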