Sunday, March 13, 2016

Tues, Mar 15: Learning Visual Similarity

Learning Visual Similarity for Product Design with Convolutional Neural Networks. Sean Bell, Kavita Bala. Siggraph 2015.

Also read:
Learning Deep Representations for Ground-to-Aerial Geolocalization. Tsung-Yi Lin, Yin Cui, Serge Belongie, James Hays. CVPR 2015.


  1. The paper proposes the use of an embedding for visual search over interior design products. Queries against the embedding can be made with both "in-situ" and iconic images. The embedding is built by mapping each image to a position with a neural network; this network is composed of two networks that share the same parameters and are trained with a "contrastive" loss function. To collect the training data, the authors used Mechanical Turk, along with some strategies to select the best workers.
    To evaluate the technique, four different siamese architectures and two base networks were compared, measuring recall over the k nearest neighbors.
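    The recall-based evaluation described above can be sketched as a k-nearest-neighbor lookup in the embedding space. This is a toy NumPy sketch; the function name and data layout are illustrative, not from the paper:

```python
import numpy as np

def recall_at_k(query_embs, query_labels, db_embs, db_labels, k):
    """Fraction of queries whose true product label appears among the
    k nearest database embeddings (Euclidean distance)."""
    hits = 0
    for q, ql in zip(query_embs, query_labels):
        d = np.linalg.norm(db_embs - q, axis=1)   # distance to every database item
        topk = np.argsort(d)[:k]                  # indices of the k nearest items
        if ql in db_labels[topk]:
            hits += 1
    return hits / len(query_embs)

# Toy check: queries identical to database points retrieve themselves at k=1.
db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1, 2])
print(recall_at_k(db, labels, db, labels, k=1))  # -> 1.0
```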

  2. Sean Bell and Kavita Bala explore ways to match iconic images with items in scenes. They collect both scene and iconic image data (3 million room photos and 6 million product photos) from a design website, which also has tags connecting about 200,000 items in scenes to their corresponding iconic product photos.

    The authors used Mechanical Turk for quality control on these tags; they checked both that the items matched and retrieved accurate bounding boxes. They used sentinel images and duplication to check for inaccurate labels.

    The authors used a siamese network trained with a loss function composed of L_p (a penalty for a positive pair which is too far apart) and L_n (a penalty for a negative pair within a margin m). This resulted in a CNN which learned a visual similarity metric "powerful enough to place stylistically similar items nearby in space."

    Many of the examples in the paper show objects in the scene in the same orientation in which they appear in the iconic image. For a given product, does the source website have a few different orientations of that product, like many furniture websites do? That would allow us to train on all orientations, and may help make the system more orientation-invariant.
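    The L_p / L_n loss described in this summary can be sketched in the standard contrastive form (squared-distance version; the paper's exact formulation may differ slightly):

```python
import numpy as np

def contrastive_loss(e1, e2, same, m=1.0):
    """Contrastive loss on a pair of embeddings.
    same=True  -> L_p: penalize any distance between a positive pair.
    same=False -> L_n: penalize a negative pair only if it falls within margin m."""
    d = np.linalg.norm(e1 - e2)
    if same:
        return d ** 2
    return max(0.0, m - d) ** 2

# A positive pair at zero distance incurs no loss.
print(contrastive_loss(np.array([0.0, 0.0]), np.array([0.0, 0.0]), True))    # -> 0.0
# A negative pair outside the margin incurs no loss.
print(contrastive_loss(np.array([0.0, 0.0]), np.array([2.0, 0.0]), False))   # -> 0.0
# A negative pair inside the margin is penalized.
print(contrastive_loss(np.array([0.0, 0.0]), np.array([0.5, 0.0]), False))   # -> 0.25
```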

  3. This paper approaches the problem of visual search within the domain of interior design. In particular, the authors develop an approach to create a reduced-dimension embedding for images, such that images of the same or similar products are close in this space, while objects of different categories are far apart. To do this, the authors first collected a large dataset from a design website, which has images of interior-design-related objects both in natural settings and as iconic images. The authors used existing annotations to match iconic images of products to natural photos of the same products, then used a Mechanical Turk-based approach to filter out incorrect tags and to find exact bounding boxes for object instances.

    The authors trained a siamese CNN to create the image-to-embedding mapping function. This approach runs two images forward through the network simultaneously, then computes a joint loss function which penalizes distance if the images are of the same object and penalizes a margin minus the distance if the images are of different objects. This loss can also optionally include a classification loss. The last layer of the CNN serves as the actual embedding output. The authors used mini-batch gradient descent on many pairs of natural and iconic images to train the network.

    In their results, the authors showed that this strategy outperformed other approaches, including simply taking the last layer of an existing network as the embedding. The authors also found the network was able to find visually similar objects of different categories even when not explicitly trained to do so.

    How would this type of approach perform on more general object similarity tasks? Is there a way to model similarity as a range rather than as a binary label?
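    The shared-weight training described in this summary can be illustrated with a toy stand-in: a single linear map W plays the role of the CNN, both siamese branches use the same W, and gradient descent on a positive pair pulls its embeddings together. All names, sizes, and the learning rate are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# A linear embedding stands in for the CNN; both branches share these weights.
W = rng.normal(scale=0.1, size=(2, 4))

def embed(x):
    return W @ x  # the same W is used for either input (the "siamese" property)

def loss_and_grad(xa, xb, same, m=1.0):
    """Contrastive loss for one pair and its gradient w.r.t. the shared W."""
    diff = W @ xa - W @ xb
    d = np.linalg.norm(diff)
    if same:
        # L_p = d^2, with dL/dW = 2 * diff * (xa - xb)^T
        return d ** 2, 2.0 * np.outer(diff, xa - xb)
    if d >= m:
        return 0.0, np.zeros_like(W)
    # L_n = (m - d)^2
    return (m - d) ** 2, -2.0 * (m - d) / max(d, 1e-8) * np.outer(diff, xa - xb)

# Gradient steps on one positive pair shrink the distance between its embeddings.
xa, xb = rng.normal(size=4), rng.normal(size=4)
before = np.linalg.norm(embed(xa) - embed(xb))
for _ in range(50):
    _, g = loss_and_grad(xa, xb, same=True)
    W -= 0.01 * g
after = np.linalg.norm(embed(xa) - embed(xb))
print(before, after)  # the pair's distance shrinks after training
```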

  4. The authors apply CNNs to the domain of image similarity -- given a bounding box around an object in a scene image, their network computes an "embedding" of that object, which is matched against their database of training images. Furthermore, they perform a dimensionality reduction such that images of similar objects are embedded near each other. Using data from a design website, they match iconic images of products with regions of pictures containing these products, and then use Mechanical Turk to filter out incorrect labels and to get bounding boxes. They use a siamese CNN architecture to perform the image mapping and embedding. They pass two images through the network in a feed-forward fashion simultaneously, and then compute a joint loss function based on the "distance" between the images in product space. The last layer performs the actual embedding. The authors were able to outperform existing approaches using this strategy, and observed that their approach is able to identify cross-class visual similarity.

    Could such an approach be used for (3D) pose estimation?

    Would this approach work with many images of just one class?
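    As a thought experiment for the single-class question above, retrieval in a learned embedding can be restricted to one category at query time. This is a toy NumPy sketch of such a search, not the paper's system; the data layout is illustrative:

```python
import numpy as np

def search(query_emb, db_embs, db_categories, category=None, k=3):
    """Nearest-neighbor product search in embedding space, optionally
    restricted to a single category of database items."""
    idx = np.arange(len(db_embs))
    if category is not None:
        idx = idx[db_categories == category]  # keep only items of the requested class
    d = np.linalg.norm(db_embs[idx] - query_emb, axis=1)
    return idx[np.argsort(d)[:k]]             # indices of the k nearest matches

# Toy database: two chairs and two lamps in a 2-D embedding space.
db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
cats = np.array(["chair", "lamp", "chair", "lamp"])
print(search(np.array([0.1, 0.0]), db, cats, category="lamp", k=1))  # -> [1]
```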


  5. In Learning Visual Similarity for Product Design with Convolutional Neural Networks, Bell and Bala train a convolutional neural network to learn a similarity metric between indoor object patches taken from user photos and tagged, centered true product photos. The goal of training is to map pairs of object patches and true label photos close together in a deep feature space, and to drive pairs of object patches with randomly selected label photos far apart. The mechanism for this is a set of siamese network architectures in which two copies of the same network simultaneously process the patches and labels during training. The authors collect roughly a million photos from a design website and use crowdsourcing to refine user-generated object patch labels. Three interesting visual search applications are given: predicting the true label product of a test patch, finding test label products used in real-world photos, and a stylistic cross-category product search, which, given a test patch, finds the most visually similar patch or label of a given category such as outdoor lighting or bookcases. Intriguingly, the convolutional network seems to learn a notion of decorative style, as tested against user notions of style similarity, while only being explicitly trained for patch-label distance.

    Discussion: How are the distance loss and category loss weighted against each other in the (C) and (D) architectures that use both losses to compute their final loss? This seems to be a hyper-parameter of these two models.
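    One plausible answer to the discussion question is a scalar weight lambda on the classification term, L = L_emb + lambda * L_cls. The sketch below is a guess at that composition (softmax cross-entropy standing in for the category loss), not the paper's actual formula:

```python
import numpy as np

def softmax_xent(logits, label):
    """Softmax cross-entropy of a logit vector against an integer class label."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def combined_loss(d, same, logits_a, label_a, logits_b, label_b, m=1.0, lam=0.5):
    """Contrastive embedding loss plus a lam-weighted classification loss on
    both branches; lam is the hyper-parameter raised in the discussion."""
    emb = d ** 2 if same else max(0.0, m - d) ** 2
    cls = softmax_xent(logits_a, label_a) + softmax_xent(logits_b, label_b)
    return emb + lam * cls

# lam = 0 recovers the pure contrastive loss for a positive pair at distance 0.5.
print(combined_loss(0.5, True, np.array([2.0, 0.0]), 0,
                    np.array([0.0, 2.0]), 1, lam=0.0))  # -> 0.25
```

    Sweeping lam on a validation set would be one straightforward way to study the trade-off the question raises.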