Thursday, January 28, 2016

Tues, Feb 2: MS COCO

Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. ECCV 2014.

COCO website

This is the first paper for which you'll post reading summaries. Here is the description of these summaries from the class website: Students will be expected to read one paper for each class. For each assigned paper, students must write a two or three sentence summary and identify at least one question or topic of interest for class discussion. Interesting topics for discussion could relate to strengths and weaknesses of the paper, possible future directions, connections to other research, uncertainty about the conclusions of the experiments, etc. Reading summaries must be posted to the class blog by 11:59pm the day before each class. Feel free to reply to other comments on the blog and help each other understand confusing aspects of the papers. The blog discussion will be the starting point for the class discussion. If you are presenting, you don't need to post a summary to the blog. 

Simply click on the comment link below this to post your short summary and one or more questions / discussion topics.


  1. The paper presents a new dataset that places object recognition in the context of scene understanding. It adopts the principle that a dataset of natural images (with many categories per image and a balanced number of instances per category) can push algorithms toward better generalization. To build the dataset, the categories were defined first: they include those from PASCAL VOC, and others were selected using criteria such as usefulness, ease of collecting images, and the usage frequency of the word. To acquire non-iconic (natural) images, pairwise combinations of terms were mainly searched on Flickr. These images were then annotated using a crowdsourcing strategy (Amazon's Mechanical Turk). First, workers labeled the presence of a super-category and the position of at least one instance; next, each instance was marked; and finally the segmentation was drawn. Precautions were taken against human error, such as having eight workers label each image in the first step, and a training task followed by worker selection in the third. As a result, the dataset produced has more instances per category than other widely used datasets (SUN, PASCAL VOC) and more object instances per image than ImageNet. As supported by an experiment using DPMv5, training on MS COCO may not always help (noisy images), but it can improve generalization (smaller performance gaps between datasets).


    Since there are more images per category, how would the results change using a more recent algorithm (e.g., one based on CNNs)?
    They used just one model (DPMv5); is that enough to draw conclusions about the dataset?
    Is the lack of "stuff" responsible for some of the results with the DPMv5 trained on MS COCO?
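The redundancy step mentioned above (eight workers per image) can be sketched with a toy aggregation rule. This is a hypothetical illustration only; the paper's exact rule for combining worker responses may differ from a simple threshold vote.

```python
# Hypothetical aggregation of binary "is this super-category present?"
# judgments from eight crowd workers. The paper's exact combination rule
# may differ; a simple threshold vote is shown here for illustration.
worker_votes = [1, 1, 0, 1, 1, 1, 0, 1]  # toy responses from 8 workers

def category_present(votes, threshold=0.5):
    """Declare the super-category present if enough workers agree."""
    return sum(votes) / len(votes) >= threshold

print(category_present(worker_votes))  # True: 6 of 8 workers agreed
```

Redundant labeling like this trades extra annotation cost for robustness against individual worker mistakes, which matters when a single missed label can drop an object from the ground truth.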

  2. This paper presents the Microsoft Common Objects in Context (COCO) computer vision dataset. The creation of the COCO dataset was motivated by the need to produce larger and more challenging vision datasets in order to advance state-of-the-art algorithms for image classification, object detection and object segmentation. The COCO dataset consists of more than 300,000 images with exact segmentations for all objects in 91 different categories, resulting in more than 2 million segmented objects in total. In their data collection, the authors focused on images that included objects in their natural context, rather than “iconic” images of objects where the target object is the explicit focus of the image. The images in the dataset were collected from Flickr using combinations of image categories as search terms. Ground truth object detections and segmentations were then produced by workers on Amazon Mechanical Turk using a novel interface. When considered against existing datasets such as ImageNet, PASCAL VOC and SUN, the COCO dataset is unique in having both a large, diverse collection of images and object instance segmentations for a large collection of object classes.

    Discussion: How could this process be extended to more complex object classes or “stuff” classes, as the authors note? What other useful experiments could be run using this dataset?

  3. In Microsoft COCO: Common Objects in Context, Lin, Maire, Belongie et al. present a very large dataset of over 300K annotated images. An important goal of the dataset is to provide an image set with many non-iconic images, more indicative of real visual data, annotated with many different common object types. The paper describes the crowdsourced annotation process and compares its precision and recall to those of expert labelers. A very interesting question to explore further is how well classifiers trained on non-iconic datasets like COCO perform on the real-life data one might encounter, compared to classifiers trained on iconic images.

  4. In this paper, the authors present a new dataset for scene understanding. They focused on limiting the number of iconic images in the dataset and on using fewer categories with more examples per category.

    The authors describe how the images were collected using searches for two objects, like "dog" and "car," and how the images were annotated using their efficient MTurk pipeline.

    Discussion: Why wasn't "stuff" included in the original paper, and has it been added in the year since it was published? How does the "long tail phenomenon" affect object classification when classifiers are trained on datasets with this feature?
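The two-object search strategy described above can be sketched in a few lines. The category list below is a hypothetical subset (the real dataset uses 91 categories, and the paper also pairs objects with scene terms):

```python
from itertools import combinations

# Hypothetical subset of COCO object categories; the real list has 91.
categories = ["dog", "car", "person", "chair"]

# Pairing categories biases search results toward non-iconic images that
# contain multiple objects in context, rather than one centered subject.
queries = [f"{a} {b}" for a, b in combinations(categories, 2)]

print(queries[0])    # "dog car"
print(len(queries))  # 6 pairs from 4 categories: C(4, 2)
```

The design rationale is that a photo matching "dog car" is much more likely to show a dog incidentally in a street scene than a studio-style portrait of a dog.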

  5. The paper introduces the MS COCO dataset, consisting of 91 object categories captured in their natural environment. Motivated by the shortcomings of previous vision datasets, the authors aimed to compile a set of images with more, and relatively balanced, numbers of instances per category, so that visual models for each category are better learned in their natural context. MS COCO provides 3 different granularities of ground-truth supervision, namely 1) category labels/captions, 2) instance bounding boxes, and 3) instance segmentations. The stages of data acquisition are explained and metrics to evaluate the quality of annotation are discussed.

    Discussion: The most important factors that affect any object detection pipeline are 1) scale of objects, 2) occlusion of objects, 3) articulation of non-rigid objects, and 4) rotation of the objects encountered (though some categories like "people" are almost always encountered upright).
    How can we evaluate the variation of these aspects for each object category in the dataset?
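The three supervision granularities listed above can be made concrete with a toy COCO-style annotation record. The values here are made up for illustration (real annotation files are JSON with parallel "images", "annotations", and "categories" lists):

```python
# Toy COCO-style annotation record (values invented for illustration).
annotation = {
    "image_id": 1,
    "category_id": 18,                  # a category label, e.g. "dog"
    "bbox": [100.0, 50.0, 80.0, 60.0],  # [x, y, width, height]
    "segmentation": [                   # polygon as a flat x,y list
        [100.0, 50.0, 180.0, 50.0, 180.0, 110.0, 100.0, 110.0]
    ],
    "area": 4800.0,
    "iscrowd": 0,
}

# The three supervision granularities mentioned above:
label = annotation["category_id"]        # 1) category label
box = annotation["bbox"]                 # 2) instance bounding box
polygon = annotation["segmentation"][0]  # 3) instance segmentation

print(box[2] * box[3])  # bbox area: 4800.0
```

Having all three granularities in one record is what lets the same dataset serve classification, detection, and segmentation benchmarks.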