
Rich feature hierarchies for accurate object detection and semantic segmentation

Abstract

In this paper, they propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.

The proposed approach combines two key insights:

  • one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects.
  • when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

Introduction

AlexNet marked the revival of CNNs. To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

They answer this question by bridging the gap between image classification and object detection. To do so, they focus on two problems:

  • localizing objects with a deep network
  • training a high-capacity model with only a small quantity of annotated detection data

Unlike image classification, detection requires localizing objects within an image. The potential solutions include

  • framing localization as a regression problem; this has been shown not to work well in practice.
  • building a sliding-window detector. However, units high up in the CNN have very large receptive fields and strides in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

They solve the CNN localization problem by operating within the "recognition using regions" paradigm, which has been successful for both object detection and semantic segmentation. Their system:

  • generates around 2000 category-independent region proposals for the input image
  • extracts a fixed-length feature vector from each proposal using a CNN
  • classifies each region with category-specific linear SVMs

They use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape.

The second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN.

  • The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning
  • This paper shows that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for training high-capacity CNNs when data is scarce (transfer learning; see the sketch below).
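A minimal sketch of this recipe, using torchvision's AlexNet as a stand-in for the paper's Caffe model (the 21-way head matches the paper's 20 VOC classes plus background; the optimizer settings are illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

# Supervised pre-training: start from AlexNet weights learned on ILSVRC.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet classifier with a randomly initialized
# (N + 1)-way layer: 20 PASCAL VOC classes plus background.
model.classifier[6] = nn.Linear(4096, 20 + 1)

# Domain-specific fine-tuning: continue SGD on warped region proposals at a
# small learning rate so the pre-trained weights are adapted, not clobbered.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```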

In DeCAF, the authors show that AlexNet can be used as a black-box feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

Object detection with R-CNN

Module design

Region proposals

R-CNN is agnostic to the particular region proposal method; however, they use selective search to enable a controlled comparison with prior detection work.
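The paper used the authors' original (MATLAB) selective search code; as an illustration, OpenCV's contrib module ships an implementation with a similar interface (assumes opencv-contrib-python is installed; the input file name is hypothetical):

```python
import cv2

img = cv2.imread("input.jpg")

# Selective search greedily merges an over-segmentation of the image,
# emitting each intermediate region as a candidate box.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()

rects = ss.process()      # (x, y, w, h) candidate boxes
proposals = rects[:2000]  # keep around 2000 proposals, as in R-CNN
```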

Feature extraction

They extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of AlexNet. Features are computed by forward propagating a mean-subtracted 227 x 227 RGB image through five convolutional layers and two fully connected layers.
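A sketch of the same computation with torchvision's AlexNet in place of the Caffe model; truncating the classifier after fc7 exposes the 4096-dimensional vector (the normalization constants below are torchvision's ImageNet statistics, not the paper's mean image):

```python
import torch
from PIL import Image
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier = model.classifier[:-1]  # drop fc8, keep the fc7 output
model.eval()

# Resize to the fixed 227 x 227 input and subtract the channel means.
preprocess = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

crop = Image.open("proposal.jpg").convert("RGB")  # a warped region proposal
with torch.no_grad():
    feat = model(preprocess(crop).unsqueeze(0))   # shape: (1, 4096)
```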

Regardless of the size or aspect ratio of the candidate region, they warp all pixels in a tight bounding box around it to the required size. Prior to warping, they dilate the tight bounding box so that at the warped size there are exactly \(p\) pixels of warped image context around the original box (they use \(p = 16\)).
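A rough sketch of that dilate-then-warp step (the paper fills in missing context with the image mean where the dilated box leaves the image; this sketch simply clips to the image border):

```python
import cv2

def warp_with_context(img, box, out_size=227, p=16):
    """Dilate `box` (x1, y1, x2, y2) so that, after warping to
    out_size x out_size, exactly p pixels of context surround it."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # p pixels of padding in the *warped* frame correspond to
    # p * side / (out_size - 2p) pixels in the original image.
    pad_x = p * w / (out_size - 2 * p)
    pad_y = p * h / (out_size - 2 * p)
    H, W = img.shape[:2]
    x1 = max(0, int(round(x1 - pad_x)))
    y1 = max(0, int(round(y1 - pad_y)))
    x2 = min(W, int(round(x2 + pad_x)))
    y2 = min(H, int(round(y2 + pad_y)))
    crop = img[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))  # anisotropic warp
```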

Object proposal transformations:

  • tightest square with context: encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size.
  • tightest square without context: same as above, but excludes the image content that surrounds the original object proposal.
  • warp: anisotropically scales each object proposal to the CNN input size.

Obviously more alternatives are possible.

Test-time detection

At test time,

  • run selective search on the test image to extract around 2000 region proposals
  • warp each proposal and forward propagate it through the CNN to compute features
  • for each class, score each extracted feature vector using the SVM trained for that class
  • given all scored regions in an image, apply greedy NMS independently for each class, rejecting a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (sketched below)
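A minimal numpy implementation of that greedy, per-class suppression:

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy NMS for one class. boxes: (N, 4) as (x1, y1, x2, y2);
    scores: (N,) SVM scores. Returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with every remaining lower-scoring box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        rest = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + rest - inter)
        # Reject regions overlapping the kept region beyond the threshold.
        order = order[1:][iou <= iou_threshold]
    return keep
```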