Data-driven Vision, CS 2951t Spring 2016: Tues, Feb 2: Crowdsourcing Detectors with Minimal Training

Thursday, January 28, 2016

Tues, Feb 2: Crowdsourcing Detectors with Minimal Training

Tropel: Crowdsourcing Detectors with Minimal Training. Genevieve Patterson, Grant Van Horn, James Hays, Serge Belongie, Pietro Perona. Human Computation (HCOMP) 2015.

7 comments:

Unknown, February 1, 2016 at 11:52 AM
This comment has been removed by the author.
Unknown, February 1, 2016 at 11:52 AM
In this article, Patterson et al. present Tropel, a system for building visual detectors using crowdsourcing and limited training examples. The system is evaluated on a number of tasks, including classifying images of birds, detecting fashion items, and finding similar architectural elements in street scenes of Paris, and under various parameters such as the number of initial training examples and the number of iterations of the learning algorithm. Key features of the system are its low cost compared to traditional annotation systems and its use of human crowds to build visual detectors given only a single example.

Discussion: One issue to discuss is why, in the fashion task, the very reasonable top detections for an item (e.g. the glasses) are mixed with some very strange ones, such as a man's hands and a woman sitting down, or high-contrast patterns for the watch.
Unknown, February 1, 2016 at 11:57 AM
One more question I found interesting is one the authors raise themselves: how specific a concept can a crowd be trained to visually detect when minimal assumptions are placed on the workers?
Unknown, February 1, 2016 at 1:21 PM
This comment has been removed by the author.
Unknown, February 1, 2016 at 1:21 PM
This paper presents the Tropel system for crowdsourced visual detector training. Tropel is motivated by the goal of training detectors for arbitrary visual elements, especially those for which labelled training data does not exist and is expensive to procure. An example task, taken from the paper, is to detect instances of a specific item of clothing within a set of candidate images. As input, the Tropel system only requires a single example image of the target visual element (although the authors also ran tests starting with multiple examples) and an optional text label. The system then uses a combination of crowdsourcing and active learning to train a detector. First, the system finds the 200 nearest neighbor image patches in the unlabeled data collection. These are then given to human workers on Amazon Mechanical Turk, who label images based on whether or not they match the example. The full set of labeled images is used to train a linear SVM (or some other classifier). The process is then repeated until convergence, though in subsequent iterations the 200 highest scoring patches from the trained SVM are passed to the workers instead of the nearest neighbors. Using the CUB 200 dataset of birds, the authors show that the Tropel-trained detector achieves average-precision results that are nearly as good as a detector trained on a fully-labeled dataset. The authors further demonstrate the effectiveness of the system on difficult visual concepts using datasets created from images of clothing and architectural elements.
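
To make the loop described above concrete, here is a minimal sketch in plain Python. It is not the authors' code. The names `tropel_like_loop`, `train_linear_svm`, and `ask_crowd` are hypothetical, the linear SVM is a simple hinge-loss subgradient trainer standing in for whatever solver the paper used, a fixed number of rounds replaces the convergence test, and skipping already-labeled patches when picking the next batch is my own assumption.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_linear_svm(xs, ys, epochs=200, lr=0.1, reg=0.01):
    """Stand-in linear SVM: subgradient descent on the hinge loss, ys in {-1, +1}."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * (dot(w, x) + b) < 1  # margin checked before updating
            for i in range(dim):
                grad = reg * w[i] - (y * x[i] if violated else 0.0)
                w[i] -= lr * grad
            if violated:
                b += lr * y
    return w, b

def tropel_like_loop(seed, pool, ask_crowd, k=200, rounds=5):
    """seed: feature vector of the example; pool: feature vectors of unlabeled patches.
    ask_crowd(features) -> bool is a hypothetical stand-in for the Mechanical Turk step."""
    labeled = {}  # pool index -> +1 / -1
    # First round: candidates are the k nearest neighbors of the seed.
    candidates = sorted(range(len(pool)), key=lambda i: sq_dist(pool[i], seed))[:k]
    w, b = None, None
    for _ in range(rounds):
        for i in candidates:
            labeled[i] = 1 if ask_crowd(pool[i]) else -1
        # Train on the seed plus everything labeled so far.
        xs = [seed] + [pool[i] for i in labeled]
        ys = [1] + [labeled[i] for i in labeled]
        w, b = train_linear_svm(xs, ys)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Later rounds: candidates are the k highest-scoring unlabeled patches.
        candidates = sorted(unlabeled, key=lambda i: -(dot(w, pool[i]) + b))[:k]
    return w, b, labeled
```

In a real setting `pool` would hold CNN features of image patches and `ask_crowd` would aggregate several workers' answers; here both are left abstract.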

Discussion: What would be the best way to tune the system between more specific and more general visual topics? What is the tradeoff in terms of the number of images shown to workers at a time? Could the 200 images be split up and given to more workers?
Anonymous, February 1, 2016 at 6:05 PM
The paper presents a new method to create a detector without an annotated training set or any previous technical knowledge from the user. The premise is that it is impossible to predict the categories that a classifier will be required to work with. Therefore, the objective was to build a single system that could create a classifier from one positive sample. The algorithm for this on-demand system starts from a set of images and an example. First, using features computed by a CNN, it finds the K (in this case 200) nearest neighbors of the original seed; these are the candidates sent to the crowd. The second step consists of dividing the nearest neighbors into positive and negative samples. This part uses crowd work, with a policy of adopting a decision if 2/3 of the workers agree. Finally, an SVM is trained on the training set defined previously. This process is repeated until convergence or until a limit on the number of iterations is reached.
The evaluation of the Tropel detectors first compares the results with an SVM trained on an annotated dataset. The results of the proposed system (average performance) were lower than those of the conventional approach. Additionally, the number of iterations had little influence on the AP. The experiments also established that a single seed image is the minimum needed to define a detector. The resulting system is a relatively inexpensive pipeline that produces precise results for concepts that do not even have annotated datasets.
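
As an illustration of the 2/3 agreement policy mentioned above, here is a small sketch. The function name `aggregate_votes` is hypothetical, and returning `None` when neither side reaches the threshold is my assumption; the summary does not say what happens to patches without consensus.

```python
from fractions import Fraction

def aggregate_votes(votes, threshold=Fraction(2, 3)):
    """Combine workers' yes/no votes for one patch.

    Returns True or False when at least `threshold` of the workers agree,
    and None (assumed behavior) when there is no such consensus.
    """
    if not votes:
        return None
    yes = sum(1 for v in votes if v)
    frac_yes = Fraction(yes, len(votes))  # exact arithmetic, so 2 of 3 counts as 2/3
    if frac_yes >= threshold:
        return True
    if 1 - frac_yes >= threshold:
        return False
    return None
```

For example, two "yes" votes out of three give a positive label, while a one-to-one split gives no decision.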
Questions
Is there any special reason for 200 nearest neighbors?
If different features (instead of the features created by the pretrained CNN) were used, would the results change considerably?
Unknown, February 1, 2016 at 6:47 PM
The paper addresses the task of training visual detectors for objects/things that are not necessarily semantically meaningful, starting from a single seed exemplar. To do so, it sets up an interactive, incremental learning protocol that uses crowdsourcing to mine positive and negative exemplars from a dataset of unlabeled images; these exemplars are then used to train a classifier. The protocol is quantitatively evaluated using the CUB 200 bird dataset to train an object part detector, namely a species-specific head detector, and qualitatively evaluated using a fashion dataset and Paris street scenes. The results demonstrate that effective visual detectors can be obtained with minimal supervision and a human in the loop.

Discussion: From a classifier training perspective, the pipeline can be seen as “learning with noisy labels”: the humans in the loop label patches as positive or negative, and those labels are not necessarily correct. Since the evaluation scheme treats an overlap ratio of 0.3 as correct, the noise in the crowd-labeled training data could have been measured in the same way and reported. This would show how much label noise the humans introduce into training. Then, instead of a comparison to a one-shot SVM classifier using perfect labels, a comparison to a one-shot SVM classifier using the same training set with corrupted labels could have been done. This might show the effect of “incremental” training and the effect of boosting.
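
A rough sketch of the measurement I have in mind, assuming the overlap ratio is intersection-over-union between boxes given as (x1, y1, x2, y2) and that all patches come from one image with known ground-truth boxes. The function names are hypothetical; only the 0.3 threshold comes from the paper's evaluation scheme.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_noise_rate(patches, crowd_labels, gt_boxes, min_overlap=0.3):
    """Fraction of crowd labels that disagree with the overlap criterion.

    A patch counts as truly positive if it overlaps any ground-truth box
    by at least `min_overlap`.
    """
    if not patches:
        return 0.0
    disagreements = 0
    for box, crowd_says_positive in zip(patches, crowd_labels):
        truly_positive = any(iou(box, gt) >= min_overlap for gt in gt_boxes)
        if truly_positive != crowd_says_positive:
            disagreements += 1
    return disagreements / len(patches)
```

Reporting this rate per iteration would show how much corruption each round of crowd labeling adds to the training set.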

Is the final classifier a cascaded one using all the previously trained classifiers with proper weights?