Object detection using YOLO

An example of object detection

You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
YOLO: Real-Time Object Detection

What is YOLO?

YOLO is a real-time object detection algorithm developed by Joseph Redmon et al. As the name "You Only Look Once" suggests, this algorithm looks at the image only once; it does not slide a window over the image to detect objects.

This section gives a brief description of YOLO. Because the target data format is somewhat complex, having a mental picture of the algorithm is useful for creating the target data.

YOLO does two things at the same time. The first is object detection, the second is classification.

YOLO handles the image as a set of grid cells. For example, Fig.1 is a raw image and Fig.2 is the same image divided into grid cells. In this case, the number of cells is 7 both vertically and horizontally.

Fig.1 Fig.2

In YOLO, each cell is responsible for predicting bounding boxes. For example, the red-colored cell predicts two bounding boxes, as shown in Fig.4. The number of bounding boxes is a hyperparameter: you can choose how many bounding boxes each cell predicts. In this case, each cell predicts two bounding boxes.

Fig.3 Fig.4

At the same time, each cell predicts the class probability conditioned on an object being present, for example P(Car|Object). Fig.5 shows the class probability of each cell.

Combining Fig.3, Fig.4 and Fig.5: the red-colored cell predicts the bounding boxes shown in Fig.4, and since the highest class probability of that cell is P(Car|Object), its boxes are colored orange.

Fig.5
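
At test time, the paper combines this conditional class probability with each box's objectness confidence to obtain a class-specific score per box: P(Class_i|Object) * P(Object) * IOU = P(Class_i) * IOU. Boxes are colored, and later filtered, according to this score.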

Coloring all of the remaining boxes in the same way, we obtain an image like Fig.6, where each color represents a class. As the final step, we apply NMS to the bounding boxes.

NMS, Non-Maximum Suppression, is a threshold-based algorithm for suppressing redundant, overlapping bounding boxes.

Fig.6 Fig.7
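
ReNom provides apply_nms and box_iou (imported and used later in this notebook). As a minimal sketch of the underlying idea: greedy NMS keeps the highest-scoring box and discards every box that overlaps it beyond an IoU threshold. The names iou and nms_sketch below are illustrative, not part of ReNom's API.

import numpy as np

def iou(a, b):
    # Boxes are given as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_sketch(boxes, scores, thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # all remaining boxes whose IoU with it exceeds the threshold.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep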

Requirements

In this notebook, ReNom version 2.3.0 or later is required.

In [1]:
import os
import sys
from xml.etree import ElementTree
from itertools import product
import urllib.request as request

import numpy as np
from tqdm import tqdm
from PIL import Image
from PIL import ImageDraw
import colorsys
import matplotlib.pyplot as plt

# ReNom version >= 2.3.0
import renom as rm
from renom.cuda import set_cuda_active
from renom.utility.trainer import Trainer
from renom.algorithm.image.detection.yolo import build_truth, Yolo, apply_nms, box_iou
from renom.utility.distributor import ImageDetectionDistributor
from renom.utility.image import *

set_cuda_active(True)

Prepare dataset

We use the PASCAL VOC dataset.
You can find the download link on the following page:
link
In [2]:
dataset_path = "VOCdevkit/VOC2012/Annotations/"

# Annotation files without the "2012_" prefix are used for training, the rest for testing.
train_file_list = [path for path in sorted(os.listdir(dataset_path)) if "2012_" not in path]
test_file_list = [path for path in os.listdir(dataset_path) if "2012_" in path]

tree = ElementTree.parse(os.path.join(dataset_path, train_file_list[-1]))

Check the dataset

First we check the contents of the dataset. The folder structure is as follows; we only use the “Annotations” and “JPEGImages” folders.
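
VOCdevkit/
└── VOC2012/
    ├── Annotations/          <- XML annotation files (used in this notebook)
    ├── ImageSets/
    ├── JPEGImages/           <- JPEG image files (used in this notebook)
    ├── SegmentationClass/
    └── SegmentationObject/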

In the “Annotations” folder, the bounding box information is provided in XML format. To read the data, we use the “ElementTree” module, which is part of the Python standard library.

The following is an example of parsing the contents of an XML file. It includes the file name, the object bounding boxes, the class of each object, and the image size.

In [3]:
def parse(node, indent=1):
    print("{}{} {}".format('    ' * indent, node.tag, node.text.strip()))
    for child in node:
        parse(child, indent + 1)

print("/// Contents of a XML file ///")
parse(tree.getroot())
/// Contents of a XML file ///
    annotation
        filename 2011_007214.jpg
        folder VOC2011
        object
            name person
            actions
                jumping 0
                other 1
                phoning 0
                playinginstrument 0
                reading 0
                ridingbike 0
                ridinghorse 0
                running 0
                takingphoto 0
                usingcomputer 0
                walking 0
            bndbox
                xmax 274
                xmin 77
                ymax 375
                ymin 67
            difficult 0
            pose Unspecified
            point
                x 154
                y 151
        object
            name person
            actions
                jumping 0
                other 1
                phoning 0
                playinginstrument 0
                reading 0
                ridingbike 0
                ridinghorse 0
                running 0
                takingphoto 0
                usingcomputer 0
                walking 0
            bndbox
                xmax 500
                xmin 182
                ymax 375
                ymin 1
            difficult 0
            pose Unspecified
            point
                x 411
                y 146
        segmented 0
        size
            depth 3
            height 375
            width 500
        source
            annotation PASCAL VOC2011
            database The VOC2011 Database
            image flickr

Getting the bounding box information

In [4]:
label_dict = {}
img_size = (224*2, 224*2)
cells = 7

def get_obj_coordinate(obj):
    global label_dict
    class_name = obj.find("name").text.strip()
    if label_dict.get(class_name, None) is None:
        label_dict[class_name] = len(label_dict)
    class_id = label_dict[class_name]
    bbox = obj.find("bndbox")
    xmax = float(bbox.find("xmax").text.strip())
    xmin = float(bbox.find("xmin").text.strip())
    ymax = float(bbox.find("ymax").text.strip())
    ymin = float(bbox.find("ymin").text.strip())
    # Convert corner coordinates (xmin, ymin, xmax, ymax)
    # to center position and size (x, y, w, h).
    w = xmax - xmin
    h = ymax - ymin
    x = xmin + w/2
    y = ymin + h/2
    return class_id, x, y, w, h

def get_img_info(filename):
    tree = ElementTree.parse(filename)
    node = tree.getroot()
    file_name = node.find("filename").text.strip()
    img_h = float(node.find("size").find("height").text.strip())
    img_w = float(node.find("size").find("width").text.strip())
    obj_list = node.findall("object")
    objects = []
    for obj in obj_list:
        objects.append(get_obj_coordinate(obj))
    return file_name, img_w, img_h, objects
In [5]:
train_data_set = []
test_data_set = []

for o in train_file_list:
    train_data_set.append(get_img_info(os.path.join(dataset_path, o)))

for o in test_file_list:
    test_data_set.append(get_img_info(os.path.join(dataset_path, o)))
In [6]:
# Example and class labels.
print("{}".format(train_data_set[-1]))
print()
print("%-12s: number"%("class name"))
print("----------------------")
for k, v in sorted(label_dict.items(), key=lambda x:x[1]):
    print("%-12s: %d"%(k, v))
('2011_007214.jpg', 500.0, 375.0, [(0, 175.5, 221.0, 197.0, 308.0), (0, 341.0, 188.0, 318.0, 374.0)])

class name  : number
----------------------
person      : 0
aeroplane   : 1
tvmonitor   : 2
train       : 3
boat        : 4
dog         : 5
chair       : 6
bird        : 7
bicycle     : 8
bottle      : 9
sheep       : 10
diningtable : 11
horse       : 12
motorbike   : 13
sofa        : 14
cow         : 15
car         : 16
cat         : 17
bus         : 18
pottedplant : 19

Making target data

The target data format of YOLO is a little complex.

Considering the processes described above, the target data format should be as follows.

Its shape is (N, cell^2 * (bbox * 5 + class)), where “N” is the batch size, “cell” is the number of cells along each axis (width and height), “bbox” is the number of bounding boxes each cell predicts, and “class” is the length of the one-hot class label. The 5 values per box are the center coordinates x and y, the width w, the height h, and a confidence score. For example, with cell=7, bbox=2 and 20 classes, each target vector has 7*7*(2*5+20) = 1470 elements.
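
ReNom's build_truth function (used in transform_to_yolo_format below) creates this target array for us. The following is a rough NumPy sketch of the encoding for a single image; the exact coordinate normalization inside build_truth may differ, and encode_truth_sketch is an illustrative name, not part of ReNom.

import numpy as np

def encode_truth_sketch(objects, img_w, img_h, cells=7, bbox=2, classes=20):
    # Per cell: bbox * (x, y, w, h, confidence) followed by a one-hot class vector.
    truth = np.zeros((cells, cells, bbox * 5 + classes))
    for class_id, x, y, w, h in objects:
        # The cell containing the object's center is responsible for detecting it.
        cx = min(int(x / img_w * cells), cells - 1)
        cy = min(int(y / img_h * cells), cells - 1)
        for b in range(bbox):
            truth[cy, cx, b * 5:b * 5 + 5] = [x, y, w, h, 1.0]
        truth[cy, cx, bbox * 5:] = np.eye(classes)[class_id]
    return truth.flatten()  # length: cells * cells * (bbox * 5 + classes)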

In [7]:
label_length = len(label_dict)
last_layer_size = cells*cells*(5*2+label_length)

def one_hot(label):
    oh = [0]*label_length
    oh[label] = 1
    return oh

def create_detection_distributor(train_set=True):
    label_data = []
    img_path_list = []
    label_list = []
    if train_set:
        file_list = train_file_list
        data_set = train_data_set
        # Augmentation settings
        augmentation = DataAugmentation(
            [
                Flip(1),
                Rotate(90),
                # Resize(size=img_size),
                Shift((20, 20)),
                # ColorJitter(v=(1.0, 1.5)),
                # Zoom(zoom_rate=(1.0, 1.1)),
                Rescale(option=[-1, 1])
            ],
            random=True
        )
    else:
        file_list = test_file_list
        data_set = test_data_set
        augmentation = DataAugmentation(
            [Rescale(option=[-1, 1])],
        )
    for i in range(len(file_list)):
        img_path = os.path.join("VOCdevkit/VOC2012/JPEGImages/", data_set[i][0])

        # obj[1]:X, obj[2]:Y, obj[3]:Width, obj[4]:Height, obj[0]:Class
        objects = []
        for obj in data_set[i][3]:
            detect_label = {"bndbox":[obj[1], obj[2], obj[3], obj[4]],
                            "name":one_hot(obj[0])}
            objects.append(detect_label)
        img_path_list.append(img_path)
        label_list.append(objects)
    class_list = [c for c, v in sorted(label_dict.items(), key=lambda x:x[1])]
    return ImageDetectionDistributor(img_path_list,
                                     label_list,
                                     class_list,
                                     imsize = img_size,
                                     augmentation=augmentation)

def transform_to_yolo_format(label):
    yolo_format = []
    for l in label:
        yolo_format.append(build_truth(l.reshape(1, -1), img_size[0], img_size[1], cells, label_length).flatten())
    return np.array(yolo_format)

def draw_rect(draw_obj, rect):
    # Draw a rectangle with a 3px-wide outline by drawing offset 1px rectangles.
    cor = (rect[0][0], rect[0][1], rect[1][0], rect[1][1])
    line_width = 3
    for i in range(line_width):
        draw_obj.rectangle(cor, outline="red")
        cor = (cor[0]+1,cor[1]+1, cor[2]+1,cor[3]+1)

train_detect_dist = create_detection_distributor(True)
test_detect_dist = create_detection_distributor(False)

Check Label Data

Note: You don’t have to run the following code; it is only for confirming the created label data. The notebook is executable without it.

In [8]:
sample, sample_label = train_detect_dist.batch(3, shuffle=True).__next__()

for Mth_img in range(len(sample)):
    example_img = Image.fromarray(((sample[Mth_img]+1)*255/2).transpose(1, 2, 0).astype(np.uint8))
    dr = ImageDraw.Draw(example_img)

    print("///Objects")
    # The distributor's label is a flat array of [x, y, w, h] + one-hot class per object.
    for i in range(0, len(sample_label[Mth_img]), 4+label_length):
        class_label = np.argmax(sample_label[Mth_img][i+4:i+4+label_length])
        x, y, w, h = sample_label[Mth_img][i:i+4]
        if x==y==h==w==0:
            break
        draw_rect(dr, ((x-w/2, y-h/2), (x+w/2, y+h/2)))
        print("obj:%d"%(i+1),
              "class:{:7s}".format([k for k, v in label_dict.items() if v==class_label][0]),
              "x:%3d, y:%3d width:%3d height:%3d"%(x, y, w, h))

    plt.figure(figsize=(4, 4))
    plt.imshow(example_img)
    plt.show()

///Objects
obj:1 class:motorbike x:192, y:257 width:384 height:297
obj:25 class:car     x:386, y:187 width: 86 height: 38
../../../_images/notebooks_image_processing_yolo_notebook_15_1.png
///Objects
obj:1 class:motorbike x:165, y:195 width: 68 height:117
obj:25 class:car     x: 24, y:190 width: 48 height:124
obj:49 class:car     x:259, y:189 width:155 height:110
obj:73 class:car     x:356, y:197 width:143 height: 91
obj:97 class:car     x:119, y:174 width: 48 height: 47
obj:121 class:car     x:420, y:183 width: 37 height: 26
obj:145 class:person  x:287, y:282 width:134 height:213
../../../_images/notebooks_image_processing_yolo_notebook_15_3.png
///Objects
obj:1 class:chair   x:343, y:225 width:207 height:377
obj:25 class:chair   x: 90, y:181 width:151 height:329
obj:49 class:diningtable x:188, y:250 width:236 height:393
../../../_images/notebooks_image_processing_yolo_notebook_15_5.png

Model definition

We define a CNN model and the loss function of YOLO. In the original paper, the loss is a sum-squared error over box coordinates (using the square roots of width and height), box confidence, and class probabilities, weighted by lambda_coord = 5 and lambda_noobj = 0.5; here, the Yolo class below provides the loss function.

In [9]:
# Convolutional neural network
model = rm.Sequential([
    # 1st Block
    rm.Conv2d(channel=64, filter=7, stride=2, padding=3),
    rm.LeakyRelu(slope=0.1),
    rm.MaxPool2d(stride=2, filter=2),

    # 2nd Block
    rm.Conv2d(channel=192, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.MaxPool2d(stride=2, filter=2),

    # 3rd Block
    rm.Conv2d(channel=128, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=256, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=256, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.MaxPool2d(stride=2, filter=2),

    # 4th Block
    rm.Conv2d(channel=256, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=256, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=256, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=256, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.MaxPool2d(stride=2, filter=2),

    # 5th Block
    rm.Conv2d(channel=512, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=512, filter=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, stride=2, padding=1),
    rm.LeakyRelu(slope=0.1),

    # 6th Block
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),
    rm.Conv2d(channel=1024, filter=3, padding=1),
    rm.LeakyRelu(slope=0.1),

    # 7th Block
    rm.Flatten(),
    rm.Dense(512),
    rm.LeakyRelu(slope=0.1),
    rm.Dense(4096),
    rm.LeakyRelu(slope=0.1),
    rm.Dropout(0.5),

    # 8th Block
    rm.Dense(last_layer_size),
])

# Loss function.
yolo_detector = Yolo(cells=cells, classes=label_length)

Train the YOLO model

We only train the dense layers here. The convolutional layers act as a fixed feature extractor: in the training loop below, their output is detached with as_ndarray, so gradients do not propagate into them.

In [10]:
N = len(train_data_set)
batch = 64
batch_loop = int(np.ceil(N/batch))

# Download the learned model weights.
if not os.path.exists("yolo.h5"):
    print("Weight parameters will be downloaded.")
    url = "http://www.docs.renom.jp/downloads/weights/yolo.h5"
    request.urlretrieve(url, "yolo.h5")

model.load("yolo.h5")
In [11]:
model_upper = rm.Sequential(model[:-7])
model_detector = rm.Sequential(model[-7:])

# Define weight decay
def weight_decay():
    wd = 0
    for m in model_detector:
        if hasattr(m, "params"):
            w = m.params.get("w", None)
            if w is not None:
                wd += rm.sum(w**2)
    return wd
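
This L2 penalty is added to the loss as 0.0005 * weight_decay() in the training loop below, which corresponds to a weight decay coefficient of 0.0005 on the detector layers.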
In [12]:
# Reset params of detector model for redoing the learning.
LEARN = False # True if relearning the model.
if LEARN:
    for layer in model_detector:
        if hasattr(layer, "params"):
            layer.params = {}
In [13]:
opt = rm.Sgd(momentum=0.9)

# We use a different learning rate in each epoch.
learning_rates = []  # [0.001] + [0.01]*60

for epoch in range(len(learning_rates) * LEARN):
    loss = 0
    test_loss = 0
    bar = tqdm(range(batch_loop))
    opt._lr = learning_rates[epoch]

    model_detector.set_models(inference=False)
    for j, (img, label) in enumerate(train_detect_dist.batch(batch, True)):
        if epoch==0:
            # Raise the learning rate gradually during the first epoch (warm-up).
            opt._lr = (0.01 - 0.001)/(batch_loop)*j + 0.001

        yolo_format_label = transform_to_yolo_format(label)
        h = model_upper(img).as_ndarray()

        with model_detector.train():
            z = model_detector(h)
            l = yolo_detector(z, yolo_format_label) + 0.0005*weight_decay()

        l.grad().update(opt)
        loss += l.as_ndarray()

        # Set descriptions to tqdm.
        bar.set_description("epoch {:03d} train loss:{:6.4f}".format(epoch, l.as_ndarray()[0]))
        bar.update(1)

    # Test
    model_detector.set_models(inference=True)
    for k, (img, label) in enumerate(test_detect_dist.batch(batch, True)):
        yolo_format_label = transform_to_yolo_format(label)
        h = model_upper(img).as_ndarray()
        z = model_detector(h)
        test_loss += yolo_detector(z, yolo_format_label) + 0.0005*weight_decay()
    test_loss = test_loss.as_ndarray()/(k+1)

    msg = "epoch {:03d} avg loss:{:6.4f} test loss:{:6.4f}".format(epoch, float(loss/(j+1)), float(test_loss))
    bar.set_description(msg)
    bar.update(0)
    bar.refresh()
    bar.close()

Detection test

Here we test the learned model using the test dataset.

In [14]:
sample_img, sample_label = test_detect_dist.batch(3, shuffle=True).__next__()
obj_list = []

model.set_models(inference=True)
for i in range(len(sample_img)):
    p = model(np.expand_dims(sample_img[i], axis=0)).as_ndarray().reshape(cells, cells, 5*2+label_length)
    objs = apply_nms(p, cells, 2, label_length, image_size=img_size, thresh=0.2)
    obj_list.append(objs)

for num in range(3):
    im = Image.fromarray(((sample_img[num] + 1)/2*255).transpose(1, 2, 0).astype(np.uint8))
    obj = obj_list[num]
    dr = ImageDraw.Draw(im)
    print("///Objects")
    for i in range(len(obj)):
        class_label = obj[i]["class"]
        # Predicted box coordinates are normalized to [0, 1]; scale to the 448x448 input.
        w = obj[i]["box"][2]*448
        h = obj[i]["box"][3]*448
        x = obj[i]["box"][0]*448
        y = obj[i]["box"][1]*448
        x1 = x - w/2
        y1 = y - h/2
        x2 = x + w/2
        y2 = y + h/2
        print("obj:%d"%(i+1),
          "class:{:7s}".format([k for k, v in label_dict.items() if v==class_label][0]),
          "x:%3d, y:%3d width:%3d height:%3d"%(x, y, w, h))
        draw_rect(dr, ((x1, y1), (x2, y2)))
    plt.imshow(im)
    plt.show()
///Objects
obj:1 class:person  x:209, y:233 width:237 height:296
../../../_images/notebooks_image_processing_yolo_notebook_24_1.png
///Objects
obj:1 class:person  x:297, y:231 width:120 height:191
../../../_images/notebooks_image_processing_yolo_notebook_24_3.png
///Objects
obj:1 class:person  x:270, y:278 width:240 height:366
../../../_images/notebooks_image_processing_yolo_notebook_24_5.png