
YOLOv3: An Incremental Improvement


Abstract

  • A bunch of little design changes to make it better.
  • It's a little bigger than last time but more accurate.

The Deal

Bounding Box Prediction

Following YOLOv2, YOLOv3 predicts bounding boxes using dimension clusters as anchor boxes. The network predicts 4 coordinates for each bounding box, \(t_x\), \(t_y\), \(t_w\), \(t_h\). If the cell is offset from the top left corner of the image by \((c_x, c_y)\) and the bounding box prior has width and height \(p_w\), \(p_h\), then the predictions correspond to:

\[
b_x = \sigma(t_x) + c_x \qquad
b_y = \sigma(t_y) + c_y \qquad
b_w = p_w e^{t_w} \qquad
b_h = p_h e^{t_h}
\]

where \(\sigma\) is the logistic (sigmoid) function.
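A minimal NumPy sketch of these equations (the function name and the example numbers are mine, and converting the result to pixel units depends on how the grid and priors are defined, which this leaves out):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the YOLOv3 box equations to one raw prediction.

    (cx, cy) is the grid-cell offset and (pw, ph) the anchor prior size.
    """
    bx = sigmoid(tx) + cx     # b_x = sigma(t_x) + c_x
    by = sigmoid(ty) + cy     # b_y = sigma(t_y) + c_y
    bw = pw * np.exp(tw)      # b_w = p_w * e^{t_w}
    bh = ph * np.exp(th)      # b_h = p_h * e^{t_h}
    return bx, by, bw, bh

# Hypothetical numbers: a prediction in cell (7, 5) with a 116x90 prior.
print(decode_box(0.2, -0.1, 0.4, 0.1, cx=7, cy=5, pw=116, ph=90))
```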

YOLOv3 predicts an objectness score for each bounding box using logistic regression.

  • This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior.
  • If the bounding box prior is not the best but overlaps a ground truth object by more than some threshold, they ignore the prediction, following Faster R-CNN (they use a threshold of 0.5).

Hint

Unlike Faster R-CNN, they assign only one bounding box prior to each ground truth object.

If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
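A rough NumPy sketch of that assignment rule, assuming the prior-vs-ground-truth IoUs are already computed (the function name, the -1 "ignore" label, and the toy numbers are mine):

```python
import numpy as np

def assign_priors(iou, ignore_thresh=0.5):
    """Label each prior as positive (1), ignored (-1), or negative (0).

    iou: array of shape (num_priors, num_gt) with prior-vs-ground-truth overlaps.
    Each ground truth gets exactly one positive prior (its best match); priors
    that overlap some object above the threshold but are not the best match are
    ignored; everything else is negative and contributes only objectness loss.
    """
    labels = np.zeros(iou.shape[0], dtype=int)       # default: negative
    labels[(iou > ignore_thresh).any(axis=1)] = -1   # overlaps something: ignore
    best_prior_per_gt = iou.argmax(axis=0)           # one best prior per object
    labels[best_prior_per_gt] = 1                    # positive
    return labels

# Toy example: 4 priors, 2 ground-truth objects.
iou = np.array([[0.8, 0.1],
                [0.6, 0.0],
                [0.1, 0.7],
                [0.2, 0.3]])
print(assign_priors(iou))   # -> [ 1 -1  1  0]
```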

Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. They do not use a softmax, as they found it is unnecessary for good performance; instead they simply use independent logistic classifiers. During training they use binary cross-entropy loss for the class predictions.
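As a toy PyTorch sketch (not the authors' code), the class head gives each class its own sigmoid and the loss is binary cross-entropy per class rather than a softmax cross-entropy over all classes:

```python
import torch
import torch.nn.functional as F

num_classes = 80                              # e.g. COCO
class_logits = torch.randn(4, num_classes)    # raw class scores for 4 positive boxes

# Multilabel targets: a box may carry more than one class
# (useful when labels overlap, e.g. "woman" and "person").
targets = torch.zeros(4, num_classes)
targets[0, 0] = 1.0
targets[0, 1] = 1.0

# Independent logistic classifiers + binary cross-entropy, not softmax.
loss = F.binary_cross_entropy_with_logits(class_logits, targets)
print(loss)
```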

Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Their system extracts features from those scales using a concept similar to feature pyramid networks (FPN).

  • From their base feature extractor they add several convolutional layers. The last of these predicts a 3D tensor encoding bounding box, objectness, and class predictions.
  • Next they take the feature map from 2 layers previous and upsample it by 2x. They also take a feature map from earlier in the network and merge it with their upsampled features using concatenation. They then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
  • They repeat this design once more to predict boxes for the final scale (a simplified sketch of the multi-scale head follows below).
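A heavily simplified PyTorch sketch of this idea, shown with only two scales and with layer sizes and names of my own choosing (the real head uses more convolutional layers and three scales). At each scale the prediction tensor is N x N x [3 * (4 + 1 + num_classes)]: 3 boxes per cell, each with 4 coordinates, 1 objectness score, and the class scores.

```python
import torch
import torch.nn as nn

class TwoScaleHead(nn.Module):
    """Toy two-scale version of the YOLOv3-style head."""

    def __init__(self, c_deep=1024, c_skip=512, num_classes=80, boxes_per_cell=3):
        super().__init__()
        out = boxes_per_cell * (4 + 1 + num_classes)   # boxes + objectness + classes
        self.predict_coarse = nn.Conv2d(c_deep, out, kernel_size=1)
        self.reduce = nn.Conv2d(c_deep, 256, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(256 + c_skip, 512, kernel_size=3, padding=1)
        self.predict_fine = nn.Conv2d(512, out, kernel_size=1)

    def forward(self, deep, skip):
        # deep: coarse feature map (e.g. 13x13); skip: earlier map at 2x resolution (26x26)
        coarse_pred = self.predict_coarse(deep)
        up = self.upsample(self.reduce(deep))             # upsample coarse features by 2x
        fused = self.fuse(torch.cat([up, skip], dim=1))   # concatenate, then more convs
        fine_pred = self.predict_fine(fused)
        return coarse_pred, fine_pred

head = TwoScaleHead()
deep = torch.randn(1, 1024, 13, 13)
skip = torch.randn(1, 512, 26, 26)
coarse, fine = head(deep, skip)
print(coarse.shape, fine.shape)   # (1, 255, 13, 13) (1, 255, 26, 26)
```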

Feature Extractor

The new network is a hybrid approach between the network used in YOLOv2 (Darknet-19) and that newfangled residual network stuff. The network uses successive 3 x 3 and 1 x 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers, so they call it Darknet-53.

(Figure: the Darknet-53 architecture.)
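The building block is essentially a 1 x 1 convolution that squeezes channels, a 3 x 3 convolution that restores them, and a shortcut connection around the pair. A minimal PyTorch sketch with my own naming (Darknet-style conv blocks use batch norm and leaky ReLU):

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k):
    """Darknet-style conv block: convolution + batch norm + leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class DarknetResidual(nn.Module):
    """One residual block: 1x1 conv halves the channels, 3x3 conv restores them,
    and the shortcut connection adds the input back in."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, k=1),
            conv_bn_leaky(channels // 2, channels, k=3),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)   # torch.Size([1, 256, 52, 52])
```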

Things They Tried That Didn't Work

  • Focal loss. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions.
  • Dual IoU thresholds and truth assignment. Faster R-CNN uses two IoU thresholds during training. If a prediction overlaps a ground truth by at least 0.7 it is a positive example; if its best overlap falls in [0.3, 0.7] it is ignored; if it overlaps all ground truth objects by less than 0.3 it is a negative example (sketched below).
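In code form, that dual-threshold rule (which did not help YOLOv3) would look roughly like this NumPy sketch; the function name and toy numbers are mine, the 0.7 / 0.3 thresholds come from the bullet above:

```python
import numpy as np

def dual_threshold_labels(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Faster R-CNN-style labels from each prediction's IoU with every ground truth:
    positive (1) above pos_thresh, negative (0) below neg_thresh for all objects,
    ignored (-1) in between."""
    best_iou = iou.max(axis=1)            # best overlap per prediction
    labels = np.full(iou.shape[0], -1)    # default: ignored
    labels[best_iou >= pos_thresh] = 1
    labels[best_iou < neg_thresh] = 0
    return labels

iou = np.array([[0.75, 0.1],
                [0.50, 0.4],
                [0.20, 0.1]])
print(dual_threshold_labels(iou))   # -> [ 1 -1  0]
```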