YOLO9000: Better, Faster, Stronger

Figure: YOLOv2

Abstract

They introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories.

  • They propose various improvements to the YOLO detection method, both novel and drawn from prior work.
  • Using a novel, multi-scale training method, the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy.
  • They propose a method to jointly train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. The joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data.

Introduction

Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. We would like detection to scale to the level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free).

They propose a new method to harness the large amount of classification data already available and use it to expand the scope of current detection systems. They use a hierarchical view of object classification that allows them to combine distinct datasets.

Note

Their method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.

Better

YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems.

  • Error analysis of YOLO compared with Fast R-CNN shows that YOLO makes a significant number of localization errors.
  • Furthermore, YOLO has relatively low recall compared to region proposal-based methods.

YOLOv2 mainly improves recall and localization while maintaining classification accuracy.

With YOLOv2 they want a more accurate detector that is still fast. Instead of scaling up the network, they simplify the network and then make the representation easier to learn. They pool a variety of ideas from past work with their own novel concepts to improve YOLO's performance.

Batch Normalization

Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization.

  • By adding batch normalization on all of the convolutional layers in YOLO, they got more than 2% improvement in mAP.
  • Batch normalization also regularizes the model. With batch normalization, they can remove dropout from the model without overfitting.
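
As a rough sketch of what this change looks like in a modern framework (PyTorch here; the original implementation is in Darknet), each convolution is followed by batch normalization and a leaky ReLU, with no dropout layer and no convolution bias:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    """Convolution + batch norm + leaky ReLU.

    Batch norm supplies the regularization, so no dropout layer is used;
    the convolution bias is dropped because batch norm adds its own shift.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```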

High Resolution Classifier

The original YOLO trains the classifier network at 224 x 224 and increases the resolution to 448 for detection.

Warning

This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.

For YOLOv2 they first fine-tune the classification network at the full 448 x 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher-resolution input. They then fine-tune the resulting network on detection. This high-resolution classification network gives an increase of almost 4% mAP.

Convolutional With Anchor Boxes

Instead of predicting coordinates directly, Faster R-CNN predicts bounding boxes using hand-picked priors. Using only convolutional layers, the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes.

Hint

Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map.

Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

Using anchor boxes they get a small decrease in accuracy. YOLO only predicts 98 boxes per image, but with anchor boxes the model predicts more than a thousand. Even though the mAP decreases, the increase in recall means that the model has more room to improve.
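
A small illustrative sketch of where that box count comes from, assuming a 13 x 13 output feature map with stride 32 and 9 anchors per cell (these exact numbers are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def anchor_centers(grid_size=13, stride=32):
    """One (x, y) anchor center in image coordinates per feature-map cell.

    Because the prediction layer is convolutional, offsets are predicted
    relative to these centers at every location of the feature map.
    """
    ys, xs = np.mgrid[0:grid_size, 0:grid_size]
    centers = (np.stack([xs, ys], axis=-1) + 0.5) * stride
    return centers.reshape(-1, 2)

anchors_per_cell = 9  # illustrative number of priors per location
print(len(anchor_centers()) * anchors_per_cell)  # 13 * 13 * 9 = 1521 boxes, vs. YOLOv1's 98
```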

Dimension Clusters

Instead of choosing priors by hand, they run k-means clustering on the training set bounding boxes to automatically find good priors.

Figure: cluster centroids (anchor box priors) found on the training set bounding boxes

If we use standard k-means with Euclidean distance, larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for the distance metric they use

\[d(box, centroid) = 1 - IOU(box, centroid)\]

The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes.
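
A minimal numpy sketch of this clustering (an illustration of the idea, not the paper's implementation): each ground-truth box is reduced to its (width, height), and the distance between a box and a centroid is 1 - IOU computed as if both shared the same corner. The paper settles on k = 5 priors as a good tradeoff between model complexity and recall.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, treating all boxes as anchored at the same corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(boxes_wh, k=5, iters=100, seed=0):
    """k-means on ground-truth (width, height) pairs with d = 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes_wh, centroids), axis=1)
        # Keep the old centroid if a cluster happens to be empty.
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids  # k (width, height) priors
```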

Note

Using k-means to generate the bounding box starts the model off with a better representation and makes the task easier to learn.

Direct Location Prediction

When using anchor boxes with YOLO, a second issue is model instability, especially during early iterations. Most of the instability comes from predicting the \((x, y)\) locations for the box.

RPN predicts values \(t_x\) and \(t_y\) and the \((x, y)\) center coordinates are calculated as:

\[x = (t_x \cdot w_a) + x_a\]
\[y = (t_y \cdot h_a) + y_a\]

Warning

This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets.

Instead of predicting offsets, they follow the approach of YOLOv1 and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. They use a logistic activation to constrain the network's predictions to fall in this range.

  • The network predicts 5 bounding boxes at each cell in the output feature map.
  • The network predicts 5 coordinates for each bounding box: \(t_x\), \(t_y\), \(t_w\), \(t_h\), and \(t_o\).

If the cell is offset from the top left corner of the image by \((c_x, c_y)\) and the bounding box prior has width and height \(p_w\), \(p_h\), then the predictions correspond to:

\[b_x = \sigma(t_x) + c_x\]
\[b_y = \sigma(t_y) + c_y\]
\[b_w = p_w e^{t_w}\]
\[b_h = p_h e^{t_h}\]
\[Pr(object) \cdot IOU(b, object) = \sigma(t_o)\]

Since they constrain the location prediction, the parametrization is easier to learn, making the network more stable.

Figure: bounding boxes with dimension priors and location prediction
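
A small numpy sketch of this decoding for a single predicted box, using the symbols above; coordinates stay in grid-cell units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, prior_wh):
    """Decode (t_x, t_y, t_w, t_h, t_o) into a box, per the equations above.

    cell_xy  -- (c_x, c_y): offset of the grid cell from the image's top-left corner
    prior_wh -- (p_w, p_h): width and height of the bounding-box prior
    """
    t_x, t_y, t_w, t_h, t_o = t
    b_x = sigmoid(t_x) + cell_xy[0]   # sigma() keeps the center inside its cell
    b_y = sigmoid(t_y) + cell_xy[1]
    b_w = prior_wh[0] * np.exp(t_w)   # width and height rescale the prior
    b_h = prior_wh[1] * np.exp(t_h)
    objectness = sigmoid(t_o)         # sigma(t_o) = Pr(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, objectness
```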

Fine-Grained Features

Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. They take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 x 26 resolution.

The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet.
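
A sketch of this passthrough (space-to-depth) idea in PyTorch; the original uses Darknet's reorg layer, and the exact channel ordering may differ. Each 2 x 2 spatial block is stacked into channels, so a 26 x 26 x 512 map becomes 13 x 13 x 2048 and can be concatenated with the final 13 x 13 features:

```python
import torch

def passthrough(x, stride=2):
    """Stack adjacent spatial features into channels (space-to-depth)."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)     # higher-resolution features from an earlier layer
coarse = torch.randn(1, 1024, 13, 13)  # final low-resolution detection features
merged = torch.cat([passthrough(fine), coarse], dim=1)  # 1 x 3072 x 13 x 13
```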