Thursday, March 3, 2016

Tues, Mar 8: Fast and Faster R-CNN

Fast R-CNN. Ross Girshick. ICCV 2015.
(additionally, the faster version)
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. NIPS 2015.


  1. The paper presents a new approach to object detection that builds on the earlier R-CNN. In Fast R-CNN, the inputs are an image together with a set of object proposals. The network uses a Region of Interest (RoI) pooling layer and two sibling output layers, which produce object class probabilities and bounding-box regression values, to detect the objects. Faster R-CNN, in turn, extends this approach by removing the need for external object proposals. A new set of layers, the Region Proposal Network (RPN), generates the proposals itself, based on the idea of sliding a small network over the feature map.
    The network achieves superior results compared to other approaches, in both mAP and speed (test time).
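    A minimal numpy sketch of the RoI pooling idea described above (a single-scale version of spatial pyramid pooling); the grid size, feature map, and coordinate convention here are toy assumptions, not the paper's settings:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool an RoI on the feature map into a fixed out_size x out_size grid.

    feature_map: (C, H, W) array; roi: (x0, y0, x1, y1) in feature-map coords.
    """
    C = feature_map.shape[0]
    x0, y0, x1, y1 = roi
    out = np.empty((C, out_size, out_size))
    # Split the RoI into a fixed grid of roughly equal cells.
    ys = np.linspace(y0, y1, out_size + 1).astype(int)
    xs = np.linspace(x0, x1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))  # max within each grid cell
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)  # one channel, 8x8
pooled = roi_pool(fmap, roi=(0, 0, 4, 4), out_size=2)
print(pooled.shape)  # (1, 2, 2): fixed-size output regardless of RoI size
```

    Because the output is always out_size x out_size per channel, proposals of any shape can feed the same fully connected layers.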

  2. The Fast R-CNN paper introduces a new convolutional neural network architecture for object detection that is much faster for both training and testing than existing networks, while improving test accuracy. The architecture reuses the convolutional layers of an existing image classification network. To perform object detection on an image, the full image is first passed through the convolutional layers to create a feature map. Object proposal regions are then mapped into this feature space, and a fixed-size feature representation is created for each proposal by max-pooling within a fixed number of cells of the feature-map region (a single-scale version of spatial pyramid pooling). This feature vector can then be fed through the fully connected layers of the original network. Finally, the output layer is split into two sibling output layers: a (K+1)-way softmax layer that estimates the probability of each class, and a 4-dimensional regression output that estimates the adjustments needed to fit the bounding box to the object. The training loss is simply the sum of the losses of these two outputs. The authors also describe a number of optimizations used to train the network efficiently, including sampling training examples hierarchically when forming mini-batches and “compressing” the fully connected layers by factorizing their weight matrices with truncated SVD. In their results, the authors demonstrate that Fast R-CNN achieves state-of-the-art object detection results on VOC 2007-12 with an order-of-magnitude speedup over R-CNN.
    The Faster R-CNN paper presents a method that generates object proposals and shares its architecture with Fast R-CNN to save computation. The Region Proposal Network in Faster R-CNN uses the same feature map as the Fast R-CNN detection network. On this feature map, a sliding-window approach is used: a small network is applied to each window with two outputs, an objectness score and a bounding-box adjustment. These outputs are replicated for a set of “anchor” boxes at each window, which represent candidate proposals of different scales and aspect ratios. The authors show that the proposal and detection networks can be trained efficiently with shared convolutional layers by alternating between the two output networks during training. In their results, the authors show that this system yields much faster training with state-of-the-art accuracy.

    Incorporating the bounding-box regression loss improves mAP even when regression is not used at test time. Could other auxiliary tasks likewise be incorporated into a network to improve classification and detection performance?
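    The summed two-output loss described above can be sketched as follows; the smooth-L1 form is from the paper, while the class probabilities, regression targets, and `lam` weighting here are illustrative:

```python
import numpy as np

def smooth_l1(x):
    # Fast R-CNN's robust regression loss: quadratic near zero, linear beyond |x| = 1.
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(cls_probs, true_class, bbox_pred, bbox_target, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc, with class 0 as background."""
    l_cls = -np.log(cls_probs[true_class])          # softmax log loss
    # The localization loss only applies to non-background RoIs.
    l_loc = smooth_l1(bbox_pred - bbox_target).sum() if true_class >= 1 else 0.0
    return l_cls + lam * l_loc

probs = np.array([0.1, 0.7, 0.2])  # softmax output over K+1 = 3 classes
loss = multitask_loss(probs, true_class=1,
                      bbox_pred=np.array([0.5, 0.0, 0.2, 0.1]),
                      bbox_target=np.array([0.4, 0.0, 0.0, 0.1]))
```

    For a background RoI (true_class = 0), only the classification term contributes, which is how the single summed loss trains both heads jointly.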

  3. Fast R-CNN presents a method to efficiently classify object proposals; rather than using a multi-stage pipeline for object detection, the author introduces a new RoI pooling layer. The RoI pooling layer extracts a fixed-length feature vector for each object proposal, which feeds two outputs: a probability estimate for each object class and a set of four numbers specifying the object's bounding box. The author achieved state-of-the-art mAP on the VOC datasets, with faster training and testing than R-CNN and SPPnet.

    The authors of Faster R-CNN improve upon Fast R-CNN by computing proposals with a deep CNN. They introduce a Region Proposal Network (RPN), which reuses the detector's convolutional feature maps to generate region proposals.

    What is the current state of the art in fast object detection? Does it use something similar to the Faster R-CNN region proposal network, or something else entirely?
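    A rough numpy sketch of how an RPN-style network enumerates candidate boxes ("anchors") over a feature map; the stride, scales, and ratios mirror the paper's defaults, but the (cx, cy, w, h) parameterization is a simplifying assumption:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors centred at each
    sliding-window position, returned as an (N, 4) array of (cx, cy, w, h)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride  # centre in image coordinates
            for s in scales:
                for r in ratios:
                    # Keep area s^2 while varying the aspect ratio r = w/h.
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

a = make_anchors(4, 6)
print(a.shape)  # (4 * 6 * 9, 4) = (216, 4)
```

    The RPN then scores each of these H * W * k anchors for objectness and regresses a box adjustment, keeping the top proposals.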

  4. The authors of the Faster R-CNN paper present a novel CNN-based pipeline for region-based object proposals that addresses the performance issues of the then-current state of the art in region proposals. Building on the RoI pooling layer introduced in the Fast R-CNN paper, the authors compute region proposals with a deep convolutional network they call an RPN (Region Proposal Network). For each candidate region, the RPN generates a most likely bounding box and an objectness score. With this architecture, the authors were able to leverage the same discriminative power as Fast R-CNN, but with state-of-the-art performance on PASCAL VOC 2007 and 2012 and MS COCO, performing better than the original R-CNN and SPPnet.

    It would be interesting to see how much this technique degrades in situations where many instances of the same object appear in the same image. How might this affect accuracy?

    Bounding boxes are pretty standard in this kind of research, but there are clearly some types of objects where a bounding box gives little to no information about the actual pose of the object in question. Practically, would it be possible to create an RPN-style architecture that gives more information than just a bounding box? For example, perhaps an n-polygonal boundary? Would this require additional layers?
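    A shape-level sketch of the RPN head discussed above, with the paper's 3x3 convolution replaced by a per-position (1x1) linear map for brevity; the random weights and hidden width are assumptions for illustration only:

```python
import numpy as np

def rpn_head(features, k=9, hidden=256, rng=np.random.default_rng(0)):
    """Tiny RPN head: a per-position linear map into a hidden representation,
    then two sibling 1x1 'convs' producing 2k objectness scores and
    4k box adjustments at every position. features: (C, H, W)."""
    C, H, W = features.shape
    w_hid = rng.normal(size=(hidden, C))
    w_cls = rng.normal(size=(2 * k, hidden))
    w_reg = rng.normal(size=(4 * k, hidden))
    hid = np.maximum(0, np.einsum('oc,chw->ohw', w_hid, features))  # ReLU
    scores = np.einsum('oc,chw->ohw', w_cls, hid)   # (2k, H, W)
    deltas = np.einsum('oc,chw->ohw', w_reg, hid)   # (4k, H, W)
    return scores, deltas

feats = np.random.default_rng(1).normal(size=(64, 5, 7))
scores, deltas = rpn_head(feats)
print(scores.shape, deltas.shape)  # (18, 5, 7) (36, 5, 7)
```

    Outputs richer than a box (say, an n-polygonal boundary) would mainly mean widening the regression head to 2n coordinates per anchor, though supervision and NMS would need rethinking.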

  5. In Fast R-CNN, Ross Girshick proposes a variety of techniques and ideas to improve on previous object detection and localization systems like R-CNN and SPPnet. The author's main results include decreased training and test times over both previous systems, state-of-the-art mean average precision (mAP) scores on the 2007, 2010, and 2012 PASCAL Visual Object Classes (VOC) datasets, and a novel exploration of which layers to fine-tune in VGG16, a pre-trained very deep network. Additional contributions of Fast R-CNN are the elimination of the multi-stage training pipelines of R-CNN and SPPnet, as well as the elimination of the gigabytes of intermediate feature caches that SPPnet requires during training.

    Discussion: Apart from the truncated-SVD compression of its learned weights, how exactly is Fast R-CNN able to make its forward pass at test time significantly faster than either SPPnet or R-CNN?
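    The truncated-SVD compression mentioned in the question can be sketched as follows; the layer sizes and rank here are illustrative, not the paper's actual VGG16 dimensions:

```python
import numpy as np

def truncate_fc(W, t):
    """Factor an FC weight matrix W (u x v) into two smaller layers via
    truncated SVD: W ~= U_t @ diag(S_t) @ V_t^T. One layer with u*v
    parameters becomes two layers totalling t*(u + v) parameters."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t]   # first replacement layer: (t, v)
    W2 = U[:, :t]                  # second replacement layer: (u, t)
    return W2, W1

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 64)) @ rng.normal(size=(64, 512))  # a rank-64 512x512 matrix
W2, W1 = truncate_fc(W, t=64)
x = rng.normal(size=512)
y_full = W @ x
y_trunc = W2 @ (W1 @ x)            # two cheap matmuls instead of one big one
print(W.size, W1.size + W2.size)   # 262144 65536
```

    Because detection evaluates the FC layers once per RoI (often thousands per image), shrinking them this way cuts a large share of test-time compute for a small mAP drop.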