Mushroom Classification

In this tutorial, we show how to classify data whose variables are expressed as characters.

This is useful when classifying groups described by qualitative data, such as survey responses. For this tutorial, we use the Mushroom dataset, retrieved from the UCI Machine Learning Repository.

The mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms, G. H. Lincoff (Pres.), New York: Alfred A. Knopf. https://archive.ics.uci.edu/ml/datasets/mushroom

Required Libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • pandas
  • renom
  • scikit-learn 0.18.2
In [1]:
from __future__ import division, print_function
import numpy as np
import pandas as pd

import renom as rm
from renom.optimizer import Sgd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Loading Data

First, obtain the raw data and save it in your working directory, then load it.

From the output below, we can confirm that the first column represents whether each mushroom is edible. The other columns represent features of the mushroom (see the documentation attached to the dataset).

In [2]:
X_csv=pd.read_csv("agaricus-lepiota.data",header=None)
print(X_csv.head())
  0  1  2  3  4  5  6  7  8  9  ... 13 14 15 16 17 18 19 20 21 22
0  p  x  s  n  t  p  f  c  n  k ...  s  w  w  p  w  o  p  k  s  u
1  e  x  s  y  t  a  f  c  b  k ...  s  w  w  p  w  o  p  n  n  g
2  e  b  s  w  t  l  f  c  b  n ...  s  w  w  p  w  o  p  n  n  m
3  p  x  y  w  t  p  f  c  n  n ...  s  w  w  p  w  o  p  k  s  u
4  e  x  s  g  f  n  f  w  b  k ...  s  w  w  p  w  o  e  n  a  g

[5 rows x 23 columns]

Converting Data to One Hot Vector Representation

Convert the categorical variables to a one-hot vector representation and check the size of the resulting data.

In [3]:
X_num=pd.get_dummies(X_csv)
print(X_num.head())
print("The size of the dataset is:{}".format(X_num.shape))
   0_e  0_p  1_b  1_c  1_f  1_k  1_s  1_x  2_f  2_g  ...   21_s  21_v  21_y  \
0    0    1    0    0    0    0    0    1    0    0  ...      1     0     0
1    1    0    0    0    0    0    0    1    0    0  ...      0     0     0
2    1    0    1    0    0    0    0    0    0    0  ...      0     0     0
3    0    1    0    0    0    0    0    1    0    0  ...      1     0     0
4    1    0    0    0    0    0    0    1    0    0  ...      0     0     0

   22_d  22_g  22_l  22_m  22_p  22_u  22_w
0     0     0     0     0     0     1     0
1     0     1     0     0     0     0     0
2     0     0     0     1     0     0     0
3     0     0     0     0     0     1     0
4     0     1     0     0     0     0     0

[5 rows x 119 columns]
The size of the dataset is:(8124, 119)
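As a smaller illustration of what `get_dummies` does, consider a toy frame; the column names and values here are invented just for this sketch:

```python
import pandas as pd

# Toy categorical frame: each column holds single-character codes,
# as in the mushroom data.
toy = pd.DataFrame({"label": ["p", "e", "e"], "cap": ["x", "b", "x"]})

# get_dummies expands every categorical column into one 0/1 column
# per distinct value it contains (dtype=int forces integer flags).
onehot = pd.get_dummies(toy, dtype=int)
print(list(onehot.columns))  # ['label_e', 'label_p', 'cap_b', 'cap_x']
print(onehot.values.tolist())
```

Each original column expands into as many indicator columns as it has distinct values, which is why 23 raw columns become 119 here.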

Model Definition

The label for this dataset is either “edible” or “poisonous”, so the first 2 columns of the one-hot data represent the label. Therefore, we give the neural network 2 output units. The remaining 117 columns are explanatory variables, which will be used as inputs to the network.

In [4]:
sequential = rm.Sequential([
    rm.Dense(117),
    rm.Relu(),
    rm.Dense(59),
    rm.Relu(),
    rm.Dense(2)
])
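The layer sizes above (117 inputs → 117 → 59 → 2) can be sanity-checked with a plain NumPy forward pass. The weight matrices below are random placeholders chosen only to verify the shapes, not ReNom internals:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(20, 117)            # one batch of 20 samples, 117 features

# Hypothetical weight matrices matching Dense(117) -> Dense(59) -> Dense(2)
w1 = rng.randn(117, 117) * 0.01
w2 = rng.randn(117, 59) * 0.01
w3 = rng.randn(59, 2) * 0.01

h1 = np.maximum(x @ w1, 0)       # Relu
h2 = np.maximum(h1 @ w2, 0)      # Relu
out = h2 @ w3                    # raw scores for the 2 classes
print(out.shape)                 # (20, 2)
```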

Separating the Dataset into Training and Testing Sets

For this tutorial, we use the scikit-learn package to divide the dataset into training and testing sets.

20% of the dataset will be held out for testing.

In [5]:
X=X_num[X_num.columns[2:]]
Y=X_num[X_num.columns[:2]]
X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.2,random_state=0)

var_name=["X_train","X_test","y_train","y_test"]
var_list=[X_train.shape,X_test.shape,y_train.shape,y_test.shape]

for [l,m] in zip(var_name,var_list):
    print("the size of {} is {}".format(l,m))
the size of X_train is (6499, 117)
the size of X_test is (1625, 117)
the size of y_train is (6499, 2)
the size of y_test is (1625, 2)
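The split sizes above follow directly from `test_size=0.2`: scikit-learn rounds the test fraction up, so 8124 rows yield 1625 test samples and 6499 training samples. A quick check:

```python
import math

n_samples = 8124
test_size = 0.2

n_test = math.ceil(n_samples * test_size)   # test set is rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # 6499 1625
```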

Converting the DataFrames to NumPy Arrays

Convert each DataFrame to a NumPy array and confirm the data types.

In [6]:
X_train=X_train.reset_index(drop=True).as_matrix()
X_test=X_test.reset_index(drop=True).as_matrix()
y_train=y_train.reset_index(drop=True).as_matrix()
y_test=y_test.reset_index(drop=True).as_matrix()

var_name=["X_train","X_test","y_train","y_test"]
var_list=[type(i) for i in [X_train,X_test,y_train,y_test]]

for [l,m] in zip(var_name,var_list):
    print("the type of {} is {}".format(l,m))
the type of X_train is <class 'numpy.ndarray'>
the type of X_test is <class 'numpy.ndarray'>
the type of y_train is <class 'numpy.ndarray'>
the type of y_test is <class 'numpy.ndarray'>
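Note that `DataFrame.as_matrix` was later removed from pandas (deprecated in 0.23, removed in 1.0). On current pandas versions, the same conversion is done with `to_numpy()` (or the `.values` attribute), sketched here on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Modern replacement for df.as_matrix()
arr = df.reset_index(drop=True).to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```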

Training Loop

Make a random index with numpy.random.permutation and use it to construct mini-batches. In this section, we set the batch size to 20, the number of epochs to 100, and the learning rate to 0.001.

In [7]:
batch_size = 20
epoch = 100
N = len(X_train)
optimizer = Sgd(lr=0.001)
learning_curve = []
test_learning_curve = []

for i in range(epoch):
    perm = np.random.permutation(N)
    loss = 0
    for j in range(0, N // batch_size):
        train_batch = X_train[perm[j*batch_size : (j+1)*batch_size]]
        response_batch = y_train[perm[j*batch_size : (j+1)*batch_size]]

        with sequential.train():
            l = rm.softmax_cross_entropy(sequential(train_batch), response_batch)

        grad = l.grad()
        grad.update(optimizer)
        loss += l.as_ndarray()


    train_loss = loss / (N // batch_size)
    test_loss = rm.softmax_cross_entropy(sequential(X_test), y_test).as_ndarray()

    test_learning_curve.append(test_loss)
    learning_curve.append(train_loss)

    if i % 10 == 0:
        print("epoch:{:03d}, train_loss:{:.4f}, test_loss:{:.4f}".format(i, float(train_loss), float(test_loss)))
epoch:000, train_loss:0.5823, test_loss:0.4414
epoch:010, train_loss:0.0736, test_loss:0.0616
epoch:020, train_loss:0.0333, test_loss:0.0256
epoch:030, train_loss:0.0188, test_loss:0.0138
epoch:040, train_loss:0.0123, test_loss:0.0087
epoch:050, train_loss:0.0086, test_loss:0.0061
epoch:060, train_loss:0.0064, test_loss:0.0045
epoch:070, train_loss:0.0050, test_loss:0.0036
epoch:080, train_loss:0.0039, test_loss:0.0029
epoch:090, train_loss:0.0033, test_loss:0.0024
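The batching scheme in the loop above — shuffle the indices once per epoch, then slice fixed-size windows — can be seen on a toy array:

```python
import numpy as np

np.random.seed(0)
data = np.arange(10) * 10          # stand-in for X_train rows
batch_size = 3
N = len(data)

perm = np.random.permutation(N)    # a fresh random order each epoch
batches = [data[perm[j * batch_size:(j + 1) * batch_size]]
           for j in range(N // batch_size)]

# Every batch has exactly batch_size elements; the trailing
# N % batch_size samples (here 1) are skipped, as in the loop above.
print(len(batches), [len(b) for b in batches])  # 3 [3, 3, 3]
```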

Prediction and Evaluation

After training the network, convert its outputs into class predictions (0 or 1). Use a classification report to view the precision, recall, and F1 scores.

In [8]:
predictions = np.argmax(sequential(X_test).as_ndarray(), axis=1)
ans_train = np.array([list(row).index(1.0) for row in y_train])
ans_test = np.array([list(row).index(1.0) for row in y_test])

print(classification_report(ans_test, predictions))

plt.plot(learning_curve, linewidth=1, label="train")
plt.plot(test_learning_curve, linewidth=1, label="test")
plt.title("learning_curve")
plt.ylabel("error")
plt.xlabel("epoch")
plt.legend()
plt.grid()
plt.show()
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       852
          1       1.00      1.00      1.00       773

avg / total       1.00      1.00      1.00      1625
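Both conversions in the cell above reduce to `argmax`: the network's two scores per row collapse to a predicted class index, and a one-hot label row collapses to its true class index. A toy check, with score values invented for the example:

```python
import numpy as np

scores = np.array([[0.9, 0.1],     # looks like class 0
                   [0.2, 0.8]])    # looks like class 1
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

pred = np.argmax(scores, axis=1)
# list(row).index(1.0) in the cell above is equivalent to argmax
# applied to one-hot label rows:
truth = np.argmax(labels, axis=1)
print(pred.tolist(), truth.tolist())  # [0, 1] [0, 1]
```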

[Figure: learning curve — train and test error per epoch]

Discussion

Given the high precision and recall, we were able to predict the mushroom categories accurately. Also, by transforming the data into one-hot vectors, we were able to build a model from purely qualitative variables.

This tutorial also reconfirms a useful property of neural networks. The documentation attached to the mushroom data lists explicit rules for distinguishing poisonous from edible mushrooms, but as the results above show, the network learned to distinguish them without being given any such rules.