Tuesday, February 16, 2016

Tues Feb 26: Object Detectors Emerge in Deep Scene CNNs

Object Detectors Emerge in Deep Scene CNNs. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba. ICLR, 2015.
Supplemental: Learning Deep Features for Scene Recognition using Places Database. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. NIPS 2014.


  1. The paper shows that object detectors emerge in the inner layers of CNNs trained for scene classification, even though the network receives no supervision for the object detection task. The paper first compares ImageNet-CNN and Places-CNN, showing how the units become more specific as the network becomes deeper: the deeper units of ImageNet-CNN respond to parts of objects, while those of Places-CNN respond to whole objects. This makes sense because the presence of a few key objects often defines the class of a scene. The spatial extent of each emerging detector was estimated by analyzing the impact of occluding image regions on the unit's response, which approximates its receptive field. The semantics of the units were then labeled by AMT (Amazon Mechanical Turk) workers, whose annotations show clearly more object detectors in Places-CNN as deeper layers are considered.
    Finally, Places-CNN was used to perform scene classification and object detection in a single forward pass, and the detection results were relatively good, especially for objects that appear frequently in the testing set.
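The region-impact analysis described above can be sketched as an occlusion sweep: blank out one patch at a time and record how much a class score drops. This is a minimal toy version, not the paper's actual pipeline; `score_fn` here is a hypothetical stand-in for a CNN's class score.

```python
import numpy as np

def discrepancy_map(image, score_fn, patch=8, stride=8):
    """Slide an occluding patch over the image and record how much the
    score drops when each region is blanked out.  Regions whose removal
    causes a large drop are the ones the (hypothetical) scorer relies
    on -- a simplified version of probing which image regions drive a
    unit's response."""
    H, W = image.shape
    heat = np.zeros((H // stride, W // stride))
    base = score_fn(image)
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0.0  # blank this region
            heat[i, j] = base - score_fn(occluded)
    return heat

# Toy scorer standing in for a CNN class score: it "cares" only about
# the top-left quadrant of the image.
score = lambda img: img[:8, :8].sum()

img = np.ones((16, 16))
heat = discrepancy_map(img, score)
# The drop is largest where the occluder covers the top-left quadrant.
```

Running this on a real network would mean replacing `score` with a forward pass and using a finer stride, but the logic of the sweep is the same.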


    The paper shows a correlation between detection performance and how frequently an object appears in the testing set. Do you think something similar happens with the training set (would more examples of an object in the training set produce better detectors for it)?

    The paper focuses mostly on the pool5 layer (the most sensitive to objects). Would it be possible to detect other kinds of features from other layers, such as information about texture and surfaces from the middle layers?

  2. In this paper, the authors analyze the responses of the inner layers of deep convolutional neural networks to show that networks trained for the scene classification task produce certain neurons that act as object detectors for some object types. For their analysis, the authors use a network trained on the Places dataset along with a network trained on ImageNet as a point of comparison. In their first experiment, the authors took correctly classified images of scenes and iteratively removed segments of the images until they were no longer correctly classified by the Places network. This illustrated that certain objects, such as beds or dining tables, are strong indicators to the network of certain types of scenes. In their second experiment, the authors visualized the receptive fields of certain neurons by finding the parts of images that caused the largest change in activation when removed. This illustrates the portions of images that each neuron considers "important". Next, the authors found the images that caused the highest activation for different neurons and had human test subjects provide a common theme for each group. The results showed that different neurons find different, useful categories, and that neurons at higher levels in the network find higher-level semantic concepts. Finally, the authors used their trained networks on the annotated SUN scene database to show the distribution of objects found by the higher internal layers of the networks.
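The iterative segment-removal experiment summarized above can be sketched as a greedy loop: keep removing the segment whose removal hurts the correct-class score the least, and stop just before the classification would flip. This is a toy illustration with a hypothetical `score_fn`, not the authors' implementation.

```python
import numpy as np

def minimal_representation(image, segments, score_fn, threshold):
    """Greedy sketch of removing image segments one at a time: each
    round, drop the segment whose removal keeps the (hypothetical)
    correct-class score highest, stopping when any further removal
    would push the score below the classification threshold.  What
    survives is the minimal set of regions the scorer needs."""
    remaining = set(int(s) for s in np.unique(segments))
    img = image.copy()
    while True:
        best_seg, best_score = None, -np.inf
        for seg in remaining:
            trial = img.copy()
            trial[segments == seg] = 0.0  # tentatively blank the segment
            s = score_fn(trial)
            if s > best_score:
                best_seg, best_score = seg, s
        if best_seg is None or best_score < threshold:
            break  # removing anything more would flip the classification
        img[segments == best_seg] = 0.0
        remaining.discard(best_seg)
    return img, remaining

# Toy setup: four quadrant segments; the scorer depends only on
# segment 0 (say, "the bed" in a bedroom scene).
segments = np.zeros((4, 4), dtype=int)
segments[:2, 2:] = 1; segments[2:, :2] = 2; segments[2:, 2:] = 3
score = lambda im: im[:2, :2].sum()
img, kept = minimal_representation(np.ones((4, 4)), segments, score, threshold=4.0)
# Only segment 0 survives; the other quadrants were removed without
# changing the decision.
```

With a real network, `score_fn` would be the softmax score of the ground-truth scene class and the segments would come from an image segmentation algorithm.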

    Perhaps not relevant for what they are trying to do, but it seems there are many other possible ways to remove information from images (e.g. lowering the resolution, using larger color bins, smoothing out textures, etc.). Are there any similar experiments using other removal methods?

  3. In Object Detectors Emerge in Deep Scene CNNs, Zhou et al. show that for deep convolutional neural nets trained on a scene database, specific units act as detectors for specific objects. This is surprising, since the net is trained to classify scenes, not objects. One explanation proposed by the authors is that scene recognition depends on the objects in the image, and that the neural nets automatically discover objects as intermediate features for the final scene classification. A question that I had was: how do the image segmentations given in the paper come from the receptive fields of particular units?
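One plausible answer to the question above is that a unit's activation map is thresholded and the resulting binary mask is upsampled to image resolution, so that the unit's receptive-field response becomes a rough object mask. The sketch below illustrates that idea only; the exact thresholding rule is an assumption here, and `act_map` stands in for a real unit's activations.

```python
import numpy as np

def segment_from_activation(act_map, image_size, k=1.0):
    """Threshold a unit's (hypothetical) activation map at
    mean + k * std, then upsample the binary mask to image resolution
    by nearest-neighbour replication, yielding a rough segmentation of
    the region the unit responds to."""
    mask = act_map > act_map.mean() + k * act_map.std()
    scale = image_size // act_map.shape[0]
    # Nearest-neighbour upsampling of the low-resolution mask.
    up = np.kron(mask.astype(int), np.ones((scale, scale), dtype=int))
    return up.astype(bool)

act = np.array([[0.1, 0.2],
                [0.1, 2.0]])   # one strongly activated unit location
mask = segment_from_activation(act, image_size=8)
# The mask covers the image quadrant under the high-activation cell.
```

In practice the activation map would come from a forward pass through the layer of interest, and the upsampling would use the layer's actual receptive-field geometry rather than simple replication.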