Wednesday, April 13, 2016

Thurs, April 14: Visual Question Answering

VQA: Visual Question Answering. S. Antol*, A. Agrawal*, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. ICCV, 2015.


  1. The paper proposes a new task for multi-discipline AI that joins NLP, computer vision, and knowledge representation and reasoning. A VQA system should understand a natural-language question together with an image and produce an answer (in both multiple-choice and open-ended formats). The image set was drawn from MS COCO, since it contains multi-object images in natural poses. In addition, an abstract scene dataset (built from clip-art "paperdoll" models) was created to enable exploration of higher-level reasoning tasks. To acquire the questions, each worker contributed three questions per image; workers could also see previously asked questions, to encourage them to create different ones. For the answers, 10 were collected for each question, and workers were instructed to respond with brief phrases rather than complete sentences. An answer was considered correct if 3 or more of the 10 workers (30%) agreed on it. Roughly 40% of the questions had a one-word answer, and almost 98% had answers of three words or fewer.
    To analyze the questions, a small group (3 people) answered them with no image present, showing that, especially for non-yes/no questions, the image is genuinely necessary. Additionally, AMT workers were asked to judge the youngest age group that could answer each question and whether it required outside knowledge. This showed a relation between questions most suitable for adults and the need for external knowledge. Finally, two baseline approaches were evaluated: a recurrent network (LSTM) and a two-layer perceptron. Both had much lower accuracy than humans.
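
    The agreement-based scoring described above (an answer counts as fully correct when at least 3 of the 10 human answers match it) is the paper's published accuracy formula, min(#matches / 3, 1). A minimal sketch, with function and variable names of my own choosing:

```python
def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against the 10 collected human answers.

    Per the paper's metric: full credit if at least 3 of the 10 workers
    gave the same answer, partial credit otherwise:
        acc = min(#matching humans / 3, 1)
    """
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Hypothetical answer pool for "what color is the banana?"
humans = ["yellow"] * 4 + ["green"] * 2 + ["ripe"] * 4
print(vqa_accuracy("yellow", humans))  # 4 matches -> 1.0
print(vqa_accuracy("green", humans))   # 2 matches -> ~0.67
```

    Note that under this metric a single prediction can earn partial credit, which is one way the authors soften the exact-string-match requirement.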

    Given the features that represent the question and the image, do you know how the answer is actually produced and compared? Do you know if they compute features for each candidate answer too?
    I could not find how they selected the workers who contributed the questions; do you have any ideas about it?
    What is the current usage of this dataset?

  2. This paper introduces a new dataset for the task of visual question answering (VQA). In this problem, the model is given an image and a text question as input and should produce an answer to the question in text form. The authors argue that this type of problem poses a significant challenge for current computer vision techniques and should allow for significant future research. The authors collected images for the VQA dataset from two sources: MS COCO and custom-created clip-art images assembled from a set of predefined figures, objects, and scenes. Together these sources total about 250,000 images across the train, test, and validation splits, and each image includes a human-provided caption. The authors collected 3 questions about each image using Amazon Mechanical Turk, prompting workers to ask questions that required the image to answer and that would be difficult for a robot. They collected answers by passing the images and questions to different workers, gathering 10 answers per question and considering any answer given 3 or more times to be fully correct. The authors also analyzed and visualized different aspects of the dataset, including the most frequent question types (as determined by the first few words), the distribution of answers, the level of maturity and access to the image needed to answer different questions, and how much the human answers agreed. As a baseline for the task, the authors trained a neural network taking inputs from both a CNN transformation of the image and a representation of the question text. The best model achieved 57% accuracy when allowed to give any answer and 63% accuracy when choosing from 18 candidate answers.
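
    The baseline shape described above (a CNN image embedding combined with a question embedding, then classified over a fixed answer vocabulary) can be sketched roughly as follows. This is a simplified stand-in using random weights and features, not the authors' actual architecture; all dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's exact sizes).
IMG_DIM, Q_DIM, FUSED_DIM, N_ANSWERS = 4096, 2048, 1024, 1000

# Pretend these came from a CNN (image) and an LSTM (question).
image_feat = rng.standard_normal(IMG_DIM)
question_feat = rng.standard_normal(Q_DIM)

# Project both modalities into a common space, fuse them
# element-wise, and classify over the K most frequent answers --
# the general shape of a late-fusion VQA baseline.
W_img = rng.standard_normal((FUSED_DIM, IMG_DIM)) * 0.01
W_q = rng.standard_normal((FUSED_DIM, Q_DIM)) * 0.01
W_out = rng.standard_normal((N_ANSWERS, FUSED_DIM)) * 0.01

fused = np.tanh(W_img @ image_feat) * np.tanh(W_q @ question_feat)
logits = W_out @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_answer_idx = int(np.argmax(probs))
print(predicted_answer_idx, probs.shape)  # index into the answer vocabulary
```

    Framing open-ended answering as classification over the most frequent answers is what makes the ~98% short-answer statistic so convenient: most answers fit in a small vocabulary.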

    Did the authors do any analysis of how often the model gave "plausible" answers to questions, as opposed to exactly correct ones? Is it possible to reword questions to trip up the model?

  3. The authors have created a new visual question answering (VQA) dataset. VQA builds upon 204,721 images from MS COCO, as well as 50,000 custom scene images constructed from clip-art. Using Mechanical Turk, the authors generated for each image relevant questions about that image (e.g. "what color is the banana?") and correct responses in natural language. The goal of the dataset is for a computer vision system to look at one of the images, read the question, and generate a natural-language response that is a close or exact match to the "correct" natural-language answer. When collecting the answers, the authors deemed any answer given by Mechanical Turk workers at least 3 times to be correct, so there are potentially multiple correct responses for any given question. The authors then evaluated several off-the-shelf AI techniques on their dataset. As the authors expected, the best model was deeper LSTM Q + norm I, since this model featured deeper networks and integrated both the image and the natural-language question text when generating a response. This model achieved 57% accuracy when its answer pool was unrestricted, and 63% accuracy when choosing from 18 multiple-choice answers.
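
    The "norm I" in the best model's name refers to L2-normalizing the CNN image features before they are combined with the question embedding. A minimal sketch of that step (function name mine):

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale a feature vector to unit L2 norm ("norm I" in the
    model names: the image embedding is normalized before being
    fused with the question embedding)."""
    return features / (np.linalg.norm(features) + eps)

v = np.array([3.0, 4.0])
print(l2_normalize(v))  # [0.6, 0.8]
```

    Normalization like this keeps the image features on a comparable scale to the question features, so neither modality dominates the element-wise fusion.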

    The choices the authors made for answer similarity could have massive effects on how accuracy is measured and, in turn, on how networks trained on this dataset behave. What metric did the authors use for comparing the natural-language answers?

    98% of questions had answers of three words or fewer. More complex answers are conceivable, but would obviously be more difficult for a network to construct. Is the value of this dataset diminished by the ease of generating simple three-word responses?