Paper discussion blog for CS 2951t at Brown University. Instructor: Genevieve Patterson
In DeepBox: Learning Objectness with Convolutional Networks, Kuo et al. propose and train a neural network to rerank the object proposals of existing computer vision systems. They train their re-ranking algorithm on both the COCO and PASCAL datasets. Existing object proposal algorithms such as Edge Boxes and Selective Search are extended with DeepBox, and both systems gain performance as measured by area under the precision-recall curve (AUC). A key technical feature is that the network is shallower than other state-of-the-art conv nets, so the algorithm runs faster than other current methods at test time. Furthermore, DeepBox improves performance on object categories that it was not trained on, which the authors claim is evidence that DeepBox is learning a general notion of objectness.

Question: What do the upper-level trained convolutional filters that are learning a general notion of objectness look like? What features of the images are they picking up on?
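The core reranking idea described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `score_fn` stands in for the 4-layer CNN, and the toy area-based scorer is purely for demonstration.

```python
# Hypothetical sketch of DeepBox-style reranking: a learned scorer
# (stand-in for the paper's 4-layer CNN) assigns new objectness
# scores to bottom-up proposals, and the pool is re-sorted.
def rerank(proposals, score_fn):
    """proposals: list of (x1, y1, x2, y2) boxes; score_fn: box -> objectness."""
    scored = [(score_fn(box), box) for box in proposals]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [box for _, box in scored]

# Toy scorer for illustration only: pretend larger boxes are more "object-like".
area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
boxes = [(0, 0, 10, 10), (0, 0, 50, 50), (0, 0, 20, 20)]
print(rerank(boxes, area)[0])  # → (0, 0, 50, 50)
```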
The paper proposes a new method to select image proposals from an initial pool of proposals. Building on bottom-up approaches, it re-ranks the proposals using a neural network of just four layers, allowing fast and lightweight selection. The network architecture and input image size were chosen by removing layers of AlexNet and testing the impact on the results. Training was divided into two stages. First, the network learns to differentiate background from foreground, sliding windows across the image to collect negative samples and perturbing the ground truth to make the positives generalize. Second, to train on the proposals from other methods, the same training procedure was used, but with those methods' proposals in place of the sliding windows. DeepBox outperforms all the other methods it is compared against at IoU = 0.7. It was also shown that the method is in fact learning a definition of objects more general than any specific category.

Discussion: Do you think the authors already had an idea of the network configuration, or did they arrive at it by experimenting with layers and input sizes? DeepBox looks like a general complement to object proposal methods; do you think a method designed from the beginning with DeepBox in mind would improve the results even more?
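The IoU = 0.7 criterion mentioned above is the standard intersection-over-union overlap between boxes. A minimal sketch (not the authors' code) of how such a threshold could separate jittered-ground-truth positives from background negatives when building training data:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = (10, 10, 50, 50)
perturbed = (12, 8, 52, 48)   # slightly jittered ground truth -> high overlap
far_away = (60, 60, 90, 90)   # background window -> no overlap
print(iou(gt, perturbed) > 0.7, iou(gt, far_away) > 0.7)  # → True False
```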
Kuo, Hariharan, and Malik propose a new method of ranking object proposals, called DeepBox. The algorithm takes a pool of bottom-up proposals and then ranks them using a CNN, which the authors created by trimming a preexisting CNN while trying to preserve performance. This resulted in a fast and lightweight CNN capable of detecting "objectness". They also implement an additional speed-up, which they call Fast DeepBox, that doesn't recompute features for overlapping regions. They trained on background vs. object boxes, as well as on the output of a bottom-up proposal method like Edge Boxes. DeepBox outperforms Edge Boxes, and does much better with multiple small objects or cluttered backgrounds. It is also able to recognize "objectness" even for categories it has not been trained on.

In the section "Sharing computation for faster reranking", how did they go from a feature map to fixed-length feature vectors for each bounding box? In the first image with red shading (Figure 6), why is some of the building shaded? What is the distinction between background, like the building, and things like the windows, which may be objects? As a general question, can things be objects in some contexts but background in others?
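On the fixed-length-feature question: one standard technique for this (region-of-interest max-pooling, as popularized by SPP-net and later Fast R-CNN) divides each box's region of the shared feature map into a fixed grid and max-pools each cell, giving the same-length vector regardless of box size. The paper's exact pooling details may differ; this sketch only illustrates the general idea.

```python
import numpy as np

# Hedged sketch of ROI max-pooling over a shared feature map: each
# box's region is split into a fixed grid (here 2x2) and each cell
# is max-pooled, so every box yields a vector of length grid*grid.
def roi_max_pool(feature_map, box, grid=2):
    """feature_map: (H, W) array; box: (x1, y1, x2, y2) in map coordinates."""
    x1, y1, x2, y2 = box
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = region[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            cells.append(cell.max())
    return np.array(cells)  # length grid*grid, independent of box size

fmap = np.arange(64, dtype=float).reshape(8, 8)
small = roi_max_pool(fmap, (0, 0, 4, 4))
large = roi_max_pool(fmap, (0, 0, 8, 8))
print(small.shape == large.shape)  # → True
```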
The authors propose a novel four-layer CNN (DeepBox) for ranking object proposals. Their architecture is created by pruning a much larger CNN down to something simpler and more performant, and the network is trained to re-rank bottom-up object proposals. Because of the highly simplified four-layer architecture, their solution is extremely fast compared to other state-of-the-art object detection networks. At the output layer, the network produces an "objectness" score for any given box, and it even scores "objectness" moderately well for classes it has not seen.

Sliding-window techniques have been mostly abandoned because they are expensive compared to bottom-up approaches. Is it likely that sliding-window techniques will come back as computational power increases, or are bottom-up approaches inherently more powerful even ignoring computational limitations? Collecting data for this problem also seems difficult, since which "objects" are salient in a particular image is arguably up to interpretation. That is, at what point does something large, like a building, stop being an "object" and start being part of the "background"? What about hills, mountains, etc.?