Paper discussion blog for CS 2951t at Brown University. Instructor: Genevieve Patterson
In this paper, Patterson et al. used crowdsourcing techniques to build a database of image attributes on top of the SUN scene image database. Images can have multiple attributes, such as "outdoors", "natural", "vacationing", or "eating", whereas each image has only one category in the original database. A variety of techniques were used to ensure adequate and economical crowdsourced performance. Once this ground truth was established, an SVM binary classifier was built for the absence or presence of each particular scene attribute based on features like HoG, SIFT, and Self-Similarity, as well as an SVM built on a composite kernel of these features. One question for discussion: have deep features been tested for this classification task? And if so, what is their performance relative to the best combined-kernel SVM classifier?
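The per-attribute classifier setup described above can be sketched roughly as follows. This is a toy example assuming scikit-learn is available; the random arrays stand in for the paper's actual HoG/SIFT/Self-Similarity features and AMT labels:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_images, n_dims, n_attributes = 200, 64, 5

X = rng.normal(size=(n_images, n_dims))                # stand-in image features
Y = rng.integers(0, 2, size=(n_images, n_attributes))  # presence/absence labels

# One independent binary SVM per attribute, as in the paper's setup
# (the paper combines several feature kernels; a single RBF kernel is used here).
classifiers = [SVC(kernel="rbf").fit(X, Y[:, a]) for a in range(n_attributes)]

# Predicting with all classifiers yields a binary attribute vector per image.
attr_vector = np.array([clf.predict(X[:1])[0] for clf in classifiers])
print(attr_vector)
```

The key design point is that each attribute gets its own independent presence/absence classifier, so attributes are free to co-occur in a single image.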
The paper presents a new data set, the first large-scale attribute database, built over the SUN database. However, the final number of attributes was 102, although the SUN data set has more than 700 categories. The attributes were defined using Amazon Mechanical Turk (AMT), drawn from workers' text descriptions of images. The AMT workers also performed the annotation task; this phase started with a quiz (a training part) and then moved on to labeling. To identify good and bad workers, a system that looks for outliers in annotation time and frequency was used, and only the annotations from the good workers were considered (38 of 800 workers). Different experiments were done to establish the importance and usefulness of the attributes, ranging from recognizing attributes (where difficulty was related to an attribute's frequency), to using attributes as features instead of low-level ones (beating two of the three low-level features), to image retrieval (from text and from images to text). Would a larger data set produce better results? In addition, how does the small number of selected workers influence the creation of the data set? What is the current status of attribute classifiers and of their use as high-level features?
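The outlier-based worker filter mentioned above can be illustrated with a small sketch. The per-worker statistics and the cutoff are made up for illustration; the paper's actual filtering criteria and thresholds differ:

```python
import statistics

# Hypothetical per-worker stats: (worker_id, median seconds per HIT,
# fraction of labels marked positive). Values are invented.
workers = [
    ("w1", 45.0, 0.12),
    ("w2", 50.0, 0.15),
    ("w3", 2.0, 0.98),   # suspiciously fast and labels almost everything positive
    ("w4", 48.0, 0.10),
]

times = [t for _, t, _ in workers]
rates = [r for _, _, r in workers]
mu_t, sd_t = statistics.mean(times), statistics.stdev(times)
mu_r, sd_r = statistics.mean(rates), statistics.stdev(rates)

def is_good(t, r, k=1.0):
    # Keep workers within k standard deviations of the mean on both statistics.
    return abs(t - mu_t) <= k * sd_t and abs(r - mu_r) <= k * sd_r

good = [w for w, t, r in workers if is_good(t, r)]
print(good)
```

With only a handful of workers this simple z-score rule is fragile (extreme outliers inflate the standard deviation), which is one reason the authors combined several signals rather than relying on a single statistic.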
In this paper, the authors create a scene attribute database using crowd-sourcing techniques on AMT. They chose 102 scene attributes and collected labels on 14,000 images by presenting workers with many images and a single attribute at a time. Bad workers were filtered out, and good workers were cultivated. The authors then performed different experiments using the attributes and features that were learned. They attempted to predict scene categories from attributes, which performed better than some low-level features but did not do as well as all the features combined. They also used their attributes for image captioning. How well did the worker cultivation strategy work? In general, what metrics does one use to measure the success of a crowd-sourcing strategy? In the image captioning section, if a person were captioning an image, what would their average accuracy be? Does the fact that stop words were counted mean that simply choosing captions by randomly selecting stop words would outperform the other options? Like Adam, I would also be curious to see how deep features perform on this task.
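The "predict scene categories from attributes" experiment mentioned above can be sketched as follows. This is a synthetic toy assuming scikit-learn; the attribute vectors and scene labels are generated, not the paper's data, and the linear model is a stand-in for whatever classifier the authors used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_attrs, n_train, n_scenes = 102, 300, 3

# Made-up binary attribute "signatures" for three scene categories.
prototypes = rng.random((n_scenes, n_attrs)) < 0.2
y = rng.integers(0, n_scenes, size=n_train)

# Each image's attribute vector: its category prototype with ~10% of bits flipped.
flip = rng.random((n_train, n_attrs)) < 0.1
X = np.where(flip, ~prototypes[y], prototypes[y])

# A linear classifier on 102-dim attribute vectors instead of low-level features.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```

The point of the experiment is that a compact, human-interpretable 102-dimensional attribute vector can carry enough information to separate scene categories.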
This paper introduces a new extension to the SUN scene database called the SUN attribute database. This extension adds scene “attribute” labels to a subset of images in the original SUN dataset. There are 102 possible attributes included in the dataset, broken up into four high-level categories: materials, surface properties, functions, and spatial envelope attributes. The attributes used were generated from workers on Amazon Mechanical Turk (AMT) and filtered by the authors. AMT was then used to label the images in the dataset with attributes from the chosen set. The authors ran a number of experiments on the SUN attribute dataset, including: visualizing the space of scene attributes, predicting attribute labels given an image, using predicted scene attributes as an intermediate feature for predicting scene type, and using predicted scene attributes as a feature for image retrieval. These experiments showed that scene attributes can be useful features in a variety of computer vision tasks. Discussion: How much does performance on the tested tasks suffer if you include “borderline” attributes (attributes with exactly one vote from AMT)? Have there been any experiments using convolutional networks to classify attributes? Are all of the attributes useful for classifying scenes, and which are the most important?
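To make the "borderline" notion in the discussion question concrete: with three AMT annotators per image-attribute pair, an attribute is treated as present when at least two vote for it, so exactly one vote is the borderline case. A tiny illustration (the vote counts below are invented):

```python
# Vote counts out of 3 annotators for one hypothetical image.
votes = {
    "natural": 3,
    "open area": 2,
    "vacationing": 1,   # borderline: exactly one vote
    "enclosed": 0,
}

present = {a for a, v in votes.items() if v >= 2}     # majority consensus
borderline = {a for a, v in votes.items() if v == 1}  # one annotator disagrees with two

print(sorted(present))
print(sorted(borderline))
```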
The authors extend the existing SUN scene database by crowd-sourcing "attribute" labels for its 14,340 images, arriving at a set of 102 powerful, descriptive attributes. The authors show that attributes are particularly adept at scene description because they can form compositional relationships with one another (a "beach scene" might always be accompanied by the attributes "dark water" and "sandy", for example), and they do not suffer from the pitfalls that categories do when category boundaries are ill-defined. The crowd-sourcing was performed using Amazon Mechanical Turk, with some manual filtering also performed by the authors. A variety of experiments are also run demonstrating the viability of attributes as scene descriptors, including using attributes as a category-predicting feature and as an aid in description-based image retrieval tasks. If many more (and more complicated) attributes were introduced, it may be possible to organically derive a sort of attribute grammar based on the absence and presence of certain attributes in the target image. Is there existing work that does something like this? It seems like it would be very powerful to perform an attribute-based segmentation of the target images, e.g., identifying the areas of the image where each attribute is "active". How might one go about doing this?
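The description-based retrieval idea raised above can be sketched very simply: rank database images by the similarity between their (predicted) attribute vectors and a query attribute vector. The database entries and attribute ordering below are made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two attribute vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical binary attribute vectors; order: [outdoors, natural, sandy, eating]
database = {
    "beach.jpg":   [1, 1, 1, 0],
    "kitchen.jpg": [0, 0, 0, 1],
    "forest.jpg":  [1, 1, 0, 0],
}
query = [1, 1, 1, 0]  # e.g., "an outdoor, natural, sandy scene"

ranked = sorted(database, key=lambda k: cosine(database[k], query), reverse=True)
print(ranked)
```

Here the beach image ranks first (all query attributes present), the forest second (shares "outdoors" and "natural"), and the kitchen last, which is the qualitative behavior attribute-based retrieval aims for.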