Tuesday, March 15, 2016

Thurs, Mar 17: What makes Paris look like Paris?

What makes Paris look like Paris? Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. SIGGRAPH 2012.


  1. In What Makes Paris Look Like Paris, Doersch et al. seek to discover visual patches in city street-view images that are discriminative of particular geographical areas. One of their main contributions is a patch-discovery clustering algorithm that finds the patches best able to discriminate between images from one given area and images from another. Finding such patches leads to a number of interesting applications in computational geography, for instance plotting geo-discriminative elements within a city on a map, finding similar scenes among different cities, and discovering architectural or stylistic influences among cities or regions. These image patches are also helpful for artists, animators, and movie-makers who try to capture the look and feel of a place by focusing on its geo-discriminative elements. Furthermore, the automatically discovered patches are both geo-informative and human-interpretable as location-specific, for instance cast-iron railings in London or distinctive street signs in Paris.

    Discussion: How often did geo-discriminative patches appear in a typical image from Paris? That is, would the discovered patches be helpful for building the Paris/non-Paris classifier described at the beginning of the paper?

    How were the results on non-European and non-US cities like São Paulo, Mexico City, and Tokyo?

  2. The paper defines a new way to automatically find the elements that define a city (the work could be extended to extract the stylistic features of any set of images). The idea is to discover, using images from Google Street View, which visual characteristics define a place. From a set of patches, the authors look for those that occur frequently (so they generalize across the city) and that are specific to that location. The paper starts from random seeds for clusters, and through an iterative process the algorithm keeps those that are discriminative and frequent. To perform this iteration, for each seed the nearest neighbors are found using an SVM classifier trained on the images from the previous round.
    For evaluation, the authors ran the SVM detectors on a dataset of unseen images from cities such as Prague and Paris. The applications mainly involve identifying the features present in cities, and even the neighborhoods where particular stylistic characteristics appear.


    Do you know if the research was extended to specific objects and buildings?
    Do you think that it would be possible to use the technique to classify the location in "natural images"?
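The iterative seed-refinement procedure summarized above (nearest neighbors around a seed, keep the seed only if its neighbors mostly come from the target city, then retrain an SVM to sharpen the cluster) can be sketched on synthetic descriptors. This is a minimal illustration, not the paper's implementation: the toy data, cluster size, and number of rounds are all made up.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy stand-in for patch descriptors: "Paris" patches cluster near +1,
# "non-Paris" patches near -1 (the paper uses HOG + color descriptors).
paris = rng.normal(loc=1.0, scale=0.5, size=(200, 16))
other = rng.normal(loc=-1.0, scale=0.5, size=(200, 16))
X = np.vstack([paris, other])
is_paris = np.array([True] * 200 + [False] * 200)

seed = paris[0]     # one random seed patch
weights = None      # the learned per-cluster "distance metric"

for _ in range(3):
    # Score all patches: plain Euclidean distance in round 0,
    # then the SVM decision value in later rounds.
    if weights is None:
        scores = -np.linalg.norm(X - seed, axis=1)
    else:
        scores = X @ weights
    # The top-k nearest neighbors form the current cluster.
    members = np.argsort(scores)[::-1][:20]
    purity = is_paris[members].mean()
    # Discard the seed if its neighbors are not mostly from the target city.
    if purity < 0.5:
        break
    # Retrain a linear SVM: cluster members vs. everything else.
    y = np.zeros(len(X))
    y[members] = 1
    svm = LinearSVC(C=0.1).fit(X, y)
    weights = svm.coef_.ravel()

print(f"final cluster purity: {purity:.2f}")
```

On this separable toy data the cluster stays pure across rounds; the point is only to show how the SVM replaces the raw distance metric after the first iteration.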

  3. In this paper, the authors wanted to determine whether they could find visual elements which are distinctive for cities. They tested 11 people and found that, when shown an image, the participants could distinguish which images came from Paris with 79% accuracy.

    They collected images from Google Street View: approximately 10,000 images for each of 12 cities. In order to determine which features in the images are important, they define characteristic elements to be ones which are both frequently occurring and geographically discriminative. From a set of candidate seeds, they select the 1000 most geo-informative, and use these to cluster the images.

    They find that their visually informative elements can be used by people to determine whether a patch came from Paris. The authors also found that this worked less well for American cities than it does for European cities.

    I'd be interested to see how well a CNN performs at geo-localizing images. Extracting the features (possibly using a method we've talked about in class) would be much more difficult, though.
    I wonder whether the system would be better at determining neighborhood in US cities than which city it is; in other words, does Chinatown in New York look similar to Chinatown in San Francisco?
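The seed-selection step described in the summary above (rank candidate seeds by how geo-informative they are, keep the top ones) can be sketched as a nearest-neighbor purity score. The data and the choice of k here are illustrative assumptions, not values from the paper (which keeps roughly the top 1000 seeds):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Synthetic descriptors: half the corpus labelled "Paris", half "elsewhere".
paris = rng.normal(1.0, 0.6, size=(300, 8))
other = rng.normal(-1.0, 0.6, size=(300, 8))
X = np.vstack([paris, other])
from_paris = np.array([True] * 300 + [False] * 300)

# Candidate seeds drawn from the Paris half; rank each by the fraction of
# its k nearest neighbors (in the whole corpus) that also come from Paris.
seeds = paris[:50]
nn = NearestNeighbors(n_neighbors=20).fit(X)
_, idx = nn.kneighbors(seeds)
purity = from_paris[idx].mean(axis=1)

# Keep only the most geo-informative candidates for clustering.
top = np.argsort(purity)[::-1][:10]
print(purity[top])
```

A seed whose neighborhood is dominated by same-city patches is both frequent and discriminative, which is exactly the "characteristic element" criterion the summary mentions.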

  4. This paper aims to use an algorithmic approach to find elements in images that are characteristic of a specific geographical region. In particular, the paper focuses on finding architectural elements such as doorways, awnings, balconies, and street signs that can be used to identify a particular city from street-level photographs. To do this, the authors first created a dataset of photos from Google Street View taken in 12 different major cities. They then extracted small, high-contrast patches from these images and modeled them using HOG and color descriptors. These patches were clustered to find elements that are discriminative between cities. The clustering first finds a set of nearest neighbors for each target cluster-center patch and keeps the clusters whose nearest neighbors are mostly from the city of interest. The procedure then trains an SVM for each cluster to get a better distance metric for that cluster, and iterates between finding a new set of neighbors and training an SVM. In their results, the authors used user studies to show that the features found agreed with human understanding of what the distinctive elements for each city should be. In their discussion of applications, the authors mapped patterns of elements geographically and explored how similar elements can be matched across different cities.

    How well would the same approach work for finding distinctive elements across other types of categories, such as products, interiors, or time periods in architecture? Could this work be used to catch visual elements that look out of place in a given city?
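The HOG descriptor mentioned in the summary above summarizes a patch by histograms of gradient orientations. A heavily simplified, numpy-only stand-in (one global histogram instead of the real HOG's spatial cells and blocks, and without the color channel the paper also uses) looks like this:

```python
import numpy as np

def hog_like_descriptor(patch, n_bins=9):
    """Simplified stand-in for a HOG descriptor: a single histogram of
    gradient orientations, weighted by gradient magnitude and L2-normalized.
    Real HOG concatenates many such histograms over spatial cells."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A patch of vertical stripes puts all its mass in one orientation bin.
patch = np.tile([0, 0, 1, 1, 0, 0, 1, 1], (8, 1)) * 255.0
desc = hog_like_descriptor(patch)
print(desc.round(2))
```

The descriptor is what makes "distance between patches" meaningful in the first place; the clustering and SVM stages all operate on these vectors rather than on raw pixels.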

  5. The authors use a novel image retrieval pipeline to identify a city based on similarity to image patches taken from Google Street View panoramas throughout the city. This pipeline consists of taking a HOG and color descriptor for each patch and running these descriptors through an SVM. During the training process, descriptor classes emerge, and small subsets of descriptors are shown to be highly discriminative on a per-city basis. Once discriminative patch types (clusters) are identified, patch-type-specific SVMs are trained on these patch types, allowing a distance metric to be created for each patch type that measures how close an arbitrary patch is to matching that patch type. Using a human-based study, the authors were then able to verify that their approach largely matches human understanding of the distinctive features that differentiate the cities in the study.

    Would this type of approach work as well in cities that do not have long histories and unique architecture, such as American cities?

    Could something similar to this approach be replicated with a deep architecture? In other words, could we get a deep network to take a similar approach without having to explicitly tell it to cluster into patch types, and base the final decision on closeness to discriminative patch types?
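The per-patch-type "distance metric" described in summary 5 is just the signed decision value of a linear SVM trained for that cluster: large positive means close to the patch type. A minimal sketch on synthetic descriptors (the data and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# One discriminative "patch type" (positives) vs. generic background patches.
cluster = rng.normal(2.0, 0.4, size=(60, 12))
background = rng.normal(0.0, 1.0, size=(400, 12))
X = np.vstack([cluster, background])
y = np.array([1] * 60 + [0] * 400)

# A per-cluster linear SVM; its signed decision value acts as a learned
# similarity score for this patch type.
svm = LinearSVC(C=1.0).fit(X, y)

near = rng.normal(2.0, 0.4, size=(1, 12))   # patch resembling the cluster
far = rng.normal(-2.0, 1.0, size=(1, 12))   # unrelated patch
print(svm.decision_function(near), svm.decision_function(far))
```

Running every patch of a new image through each cluster's SVM and thresholding the scores is what turns the discovered patch types into detectors that can be evaluated on unseen cities.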