Sunday, April 17, 2016

Tues, Mar 19: Exploring Nearest Neighbor Approaches for Image Captioning

Exploring Nearest Neighbor Approaches for Image Captioning. Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick. arXiv, 2015.


  1. The paper compares captions produced by KNN with different features against captions from dedicated caption-generation techniques and against human-written captions. For KNN, the features are either the GIST descriptor computed on a downsampled version of the image or the activations of the fc7 layer of the VGG16 network. For the network features, two training strategies are used: the first trained only on ImageNet, and the second fine-tuned to classify the 1000 most common words in the captions. Since the k neighbors could contain outliers, the set is filtered down to a denser subset of neighbors. To evaluate similarity among captions, the paper uses CIDEr and BLEU.
    As expected, GIST performed poorly in the tests, while fine-tuned fc7 features produced results comparable to state-of-the-art algorithms. Under automatic evaluation, the captions provided by humans do not clearly outperform the algorithms; however, when the judge is human, the gap is much larger (despite the similar automatic scores).
    Do you know how the human evaluation was done? Did they use AMT, for example?
    Do you think it is suitable to use quantitative metrics to judge natural language descriptions, given that these algorithms still do not beat humans?
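
    The retrieval step described in the summary above can be sketched as follows. This is an illustrative assumption, not the authors' code: the function name, the cosine-similarity choice, and the toy data are mine; the paper retrieves neighbors in GIST or VGG fc7 feature space.

```python
import numpy as np

def nearest_neighbors(query_feat, train_feats, k=3):
    """Return indices of the k training images whose features are closest
    to the query under cosine similarity. Illustrative sketch: the paper
    retrieves neighbors in GIST or VGG fc7 feature space."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q                    # cosine similarity to every training image
    return np.argsort(-sims)[:k]    # indices of the k most similar images

# Toy example: 5 "images" with 4-d features; image 0 is the query,
# so it should come back first, followed by the two closest rows.
train = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0]])
idx = nearest_neighbors(train[0], train, k=3)   # -> [0, 2, 4]
```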

  2. The authors of this paper explore using k-nearest neighbors to caption images by borrowing captions from similar, already-captioned images. They try three different features (GIST, NN, and fine-tuned NN) and two different similarity measures (BLEU and CIDEr). Of these, GIST performs poorly, in large part because the images it retrieves are not actually that similar.
    The fc7-fine features performed better than the ME-DMSM approach when there were more test images.
    The human evaluators do not rank the k-NN models' captions as highly as human captions.

    How do the automatic evaluation metrics referred to in this paper work?
    This method seems to have an upper-bound on possible performance, simply because the scenes which are occurring may never have occurred before - how well would a human choosing from the set of candidate captions perform?
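
    As a partial answer to the first question above: BLEU is built on clipped n-gram precision. A minimal sketch of the unigram case, with the caveat that real BLEU combines several n-gram orders with a brevity penalty, and CIDEr additionally weights n-grams by tf-idf across the reference set:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core of BLEU-1. Simplified
    illustration only: full BLEU uses 1- to 4-grams, multiple
    references, and a brevity penalty."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate word counts only up to its frequency in the reference.
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

# 4 of the 5 candidate words appear in the reference ("the" does not).
p = unigram_precision("a dog on the grass",
                      "a dog runs on green grass")   # -> 0.8
```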

  3. This paper explores nearest neighbors as a simple baseline approach to the image captioning task and demonstrates that even this simple approach can yield close to state-of-the-art results. The authors use the MS COCO image captioning dataset for their experiments. To find a caption for a given test image, the authors' system first finds a set of nearest-neighbor training images in some image feature space, then chooses the "consensus caption" from all the associated captions, where the consensus caption is the one most similar to all of the other candidates as measured by the BLEU or CIDEr score. The authors tested three image feature spaces: GIST, the last fully connected layer of VGG-net, and a fine-tuned version of that layer. In their results, the authors showed that their system performs close to the state of the art when measured using metrics like BLEU or CIDEr, but performs worse than other methods when evaluated by humans.

    Has there been any human evaluations that include feedback about why one caption is better than another? Could a system be trained to 'fine-tune' a caption from a nearest neighbor?
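
    The consensus-caption rule summarized above can be sketched as follows. The names here are illustrative, and `word_overlap` is a toy stand-in similarity; the paper scores candidates with BLEU or CIDEr.

```python
def word_overlap(a, b):
    """Toy caption similarity: Jaccard overlap of word sets.
    A stand-in for the BLEU/CIDEr similarity used in the paper."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def consensus_caption(candidates, sim=word_overlap):
    """Return the candidate caption with the highest mean similarity
    to the other candidates -- the 'consensus caption' rule."""
    def mean_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(sim(c, o) for o in others) / len(others)
    return max(candidates, key=mean_sim)

# Toy candidate pool gathered from the neighbors of one image; the
# outlier caption ("two dogs...") gets a low mean similarity and loses.
caps = ["a man riding a horse",
        "a man riding on a horse",
        "a man rides a brown horse",
        "two dogs playing in snow"]
best = consensus_caption(caps)   # -> "a man riding a horse"
```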

  4. The authors employ a baseline k-nearest-neighbor technique for image caption generation, leveraging an existing caption corpus (i.e., MS COCO). They then compare this technique's performance against standard neural-network-based caption generators. For retrieving the candidate neighbor sets, the authors evaluated standard GIST as well as fc7 and fc7-fine features from VGG-net. They found that fc7-fine features performed the best, and that their scheme achieved near state-of-the-art BLEU and CIDEr scores but worse performance when graded by actual humans, since humans tend to prefer unique captions while their scheme reuses labels from existing caption data.

    Do we think the rareness/uniqueness of individual words plays a part in formulating discerning captions, and if so, how could the authors leverage this?

    Is it possible that GIST would achieve higher performance if this system were to use much longer captions, where each patch of the image has its own mini sub-caption? This would allow for better high level comparisons that would probably generalize better across similar-image-space.