
You Only Look Once: Unified, Real-Time Object Detection

[Figure: yolo_v1]

Abstract

  • Prior work on object detection repurposes classifiers to perform detection. Instead, they frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
  • The unified architecture is extremely fast: the base model processes images in real time at 45 frames per second, while Fast YOLO runs at 155 frames per second.

Introduction

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

They reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

[Figure: yolo_framework]

This unified model has several benefits over traditional methods of object detection.

  • YOLO is extremely fast: the base network runs at 45 frames per second and the fast version at more than 150 frames per second.
  • YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.
  • YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, it outperforms methods like DPM and R-CNN by a wide margin, so it is less likely to break down when applied to new domains or unexpected inputs.

Warning

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.

Unified Detection

They unify the separate components of object detection into a single neural network. Their network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Grid

The system divides the input image into an \(S \times S\) grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
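
As a minimal sketch (plain Python, not the authors' Darknet code), mapping an object's center to its responsible cell looks like this, assuming normalized coordinates in \([0, 1)\) and \(S = 7\) as in the paper's VOC setup:

```python
S = 7

def responsible_cell(cx, cy, s=S):
    """Return (row, col) of the grid cell containing the box center."""
    col = min(int(cx * s), s - 1)  # clamp so cx = 1.0 stays in range
    row = min(int(cy * s), s - 1)
    return row, col

print(responsible_cell(0.52, 0.31))  # -> (2, 3)
```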

Confidence Score

Each grid cell predicts \(B\) bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is. Formally, the confidence score is defined as \(\Pr(Object) \cdot IOU^{truth}_{pred}\).
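
The IOU term can be made concrete with a small sketch (plain Python, not the authors' Darknet code); the confidence target is the IOU itself when an object is present and zero otherwise:

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, gt_box):
    """Pr(Object) * IOU: the IOU itself if the cell has an object, else 0."""
    return iou(pred_box, gt_box) if gt_box is not None else 0.0
```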

Predictions

Each bounding box consists of 5 predictions: \(x, y, w, h\), and confidence.

  • The \((x, y)\) coordinates represent the center of the box relative to the bounds of the grid cell.
  • The width and height are predicted relative to the whole image.
  • The confidence prediction represents the IOU between the predicted box and any ground truth box.
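
A minimal sketch of this target encoding (plain Python; `encode_box` is a hypothetical helper, not from the paper), assuming normalized image coordinates in \([0, 1]\) and \(S = 7\):

```python
S = 7

def encode_box(cx, cy, w, h, s=S):
    """Encode an absolute (center, size) box as YOLO regression targets."""
    col = min(int(cx * s), s - 1)
    row = min(int(cy * s), s - 1)
    # (x, y): offset of the center within its cell, each in [0, 1).
    x = cx * s - col
    y = cy * s - row
    # (w, h): width and height stay relative to the whole image.
    return (row, col), (x, y, w, h)
```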

Classification

Each grid cell also predicts \(C\) conditional class probabilities, \(P(Class_i | Object)\). These probabilities are conditioned on the grid cell containing an object. They only predict one set of class probabilities per grid cell, regardless of the number of boxes \(B\).

At test time, they multiply the conditional class probabilities and the individual box confidence predictions, which gives class-specific scores for each box.
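
Written out, this product gives a class-specific confidence for each box:

\[
\Pr(Class_i \mid Object) \cdot \Pr(Object) \cdot IOU^{truth}_{pred} = \Pr(Class_i) \cdot IOU^{truth}_{pred}
\]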

[Figure: unified_detection]

Network Design

Their network architecture is inspired by the GoogLeNet model for image classification. Their network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, they simply use \(1 \times 1\) reduction layers followed by \(3 \times 3\) convolutional layers.
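
On PASCAL VOC the paper uses \(S = 7\), \(B = 2\), and \(C = 20\), so the final prediction is a \(7 \times 7 \times 30\) tensor:

\[
S \times S \times (B \cdot 5 + C) = 7 \times 7 \times (2 \cdot 5 + 20) = 7 \times 7 \times 30
\]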

[Figure: network_architecture]

Training

They pretrain the convolutional layers on the ImageNet 1000-class competition dataset. They use the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer. They use the Darknet framework for all training and inference.

They then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance. They add four convolutional layers and two fully connected layers with randomly initialized weights.

Note

They use a linear activation function for the final layer; all other layers use the leaky rectified linear activation.
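
Written out, the leaky activation is:

\[
\phi(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0.1x, & \text{otherwise}
\end{cases}
\]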

Sum-squared error

They optimize for sum-squared error in the output of the model. They use sum-squared error because it is easy to optimize; however, it does not perfectly align with the goal of maximizing average precision.

Warning

Sum-squared error weights localization error equally with classification error, which may not be ideal.

Class Imbalance

In every image, many grid cells do not contain any object. The loss pushes the confidence scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, they increase the loss from bounding box coordinate predictions (\(\lambda_{coord} = 5\)) and decrease the loss from confidence predictions for boxes that don't contain objects (\(\lambda_{noobj} = 0.5\)).

Box Size

Sum-squared error also weights errors in large boxes and small boxes equally, but the error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this, they predict the square root of the bounding box width and height instead of the width and height directly.
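
A quick numeric check (plain Python, with values chosen only for illustration) shows the effect: the same absolute width error yields a much larger squared difference for the small box once square roots are taken:

```python
import math

# Same absolute width error (0.05 of the image width) on a large and
# a small box; the square-root encoding penalizes the small box more.
for w_true, w_pred in [(0.80, 0.85), (0.05, 0.10)]:
    plain = (w_true - w_pred) ** 2                          # 0.0025 in both cases
    rooted = (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    print(f"w={w_true:.2f}: plain={plain:.4f}  sqrt={rooted:.6f}")
# large box: sqrt term ~ 0.00076; small box: sqrt term ~ 0.00858
```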

Association

YOLO predicts multiple bounding boxes per grid cell. At training time, only one bounding box should be responsible for each object.

Note

They assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth.

This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
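
A minimal sketch of this assignment (plain Python, reusing the `iou` helper sketched earlier):

```python
def responsible_predictor(pred_boxes, gt_box):
    """Index of the predicted box with the highest IOU against the ground truth."""
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], gt_box))
```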

[Figure: loss_function]
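
Written out, the multi-part loss from the paper is:

\[
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]

where \(\mathbb{1}^{obj}_{i}\) denotes that an object appears in cell \(i\), and \(\mathbb{1}^{obj}_{ij}\) that the \(j\)-th predictor in cell \(i\) is responsible for that object.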

Training Schedule

  • Batch size of 64, a momentum of 0.9, and a decay of 0.0005.
  • Learning rate schedule: slowly raise the rate from 1e-3 to 1e-2 over the first epochs, then train with 1e-2 for 75 epochs, 1e-3 for 30 epochs, and finally 1e-4 for 30 epochs.

    Hint

    If they start at a high learning rate, the model often diverges due to unstable gradients.

  • To avoid overfitting, they use dropout and extensive data augmentation.

Limitations of YOLO

Warning

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class.

This spatial constraint limits the number of nearby objects that the model can predict. The model struggles with small objects that appear in groups, such as flocks of birds.

Warning

The main source of error is incorrect localizations.

While they train on a loss function that approximates detection performance, the loss treats errors the same in small bounding boxes and large bounding boxes. A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU.

Comparison to Other Detection Systems

R-CNN

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, YOLO puts spatial constraints on the grid cell proposals, which helps mitigate multiple detections of the same object. YOLO also proposes far fewer bounding boxes: only 98 per image compared to about 2000 from Selective Search.