The paper presents a method for recovering image information from the encodings produced by different representations (HOG, SIFT, and CNNs). It treats the representation of an image as a function and aims to construct an approximate inverse. The problem is posed as optimization: find the image that minimizes the (Euclidean) distance between the representation of the reconstruction and the original representation. A "regulariser" is used to constrain the process toward natural-looking images, which required folding it into the loss function. Gradient descent handles the optimization: since CNNs are differentiable by construction, gradient descent is directly applicable, while for SIFT and HOG the authors approximate the features with equivalent CNNs.

For CNNs (AlexNet), the error stays within a certain limit (about 20%), and in the last layers especially the inverse becomes a composition of different object parts rather than a single similar image. Additional experiments examine invariance and the effect of restricting the features to particular regions or channels.

Discussion

How were the multiple reconstructions performed? Did the authors change a parameter to obtain them?

Is the inversion error just the normalized Euclidean distance? (An average error of 8.5% does not seem to match the reconstruction examples in the figures.)
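On the last question: our reading is that the reported error is the Euclidean distance between the reconstruction's code and the target code, normalized so that different representations are comparable. A toy numpy sketch of that metric (normalizing by the norm of the target code is our assumption, not necessarily the paper's exact normalizer):

```python
import numpy as np

def normalized_error(phi_rec, phi_target):
    # Euclidean distance between codes, divided by the norm of the
    # target code (our assumed normalizer).
    return np.linalg.norm(phi_rec - phi_target) / np.linalg.norm(phi_target)

rng = np.random.default_rng(0)
phi0 = rng.standard_normal(100)          # stand-in target representation
# A code that is scaled 10% away from the target gives exactly 10% error.
print(normalized_error(1.1 * phi0, phi0))  # → 0.1 (up to float rounding)
```

Note that a small error in code space can still correspond to a visually quite different image, which may explain the apparent mismatch with the figures.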

This paper introduces a new method for approximately reconstructing images from common representations used in computer vision, particularly HOG, DSIFT, and CNN features. The authors set up image reconstruction as an optimization problem, where they aim to find the image that minimizes the Euclidean distance between the given representation and the computed representation of the new image. To ensure that this reconstruction creates naturalistic images, the authors also introduce regularizer terms into the optimization objective. The first regularizer is simply the α-norm of the image for some exponent α, while the second is the total variation norm, which penalizes frequent large variations between adjacent pixels. The objective given in the paper can be optimized using gradient descent as long as it is possible to compute derivatives of the function used to construct the image representations. CNNs are naturally easy to differentiate, and the authors show that HOG and DSIFT features can be framed as specialized CNNs. The authors evaluated their results quantitatively using a normalized reconstruction error and found that their method outperformed existing inversion methods both qualitatively and quantitatively. They also visualized image reconstructions from different layers of a deep CNN, showing the visual information preserved at each layer.
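The objective described above can be sketched in a few lines of numpy. This is a minimal toy, not the paper's implementation: the representation Φ is a hypothetical linear map standing in for a feature extractor, the TV regularizer uses β = 2 so its gradient stays simple, and all the weights (`lam_a`, `lam_tv`, the learning rate) are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
A = rng.standard_normal((32, H * W))   # toy linear "representation" Phi(x) = A x
x_true = rng.standard_normal((H, W))
phi0 = A @ x_true.ravel()              # target code to invert

alpha, lam_a, lam_tv, lr = 6, 1e-4, 1e-4, 0.5   # hypothetical settings

def loss_and_grad(x):
    # Data term: normalized Euclidean distance between codes.
    r = A @ x.ravel() - phi0
    c = phi0 @ phi0
    loss = r @ r / c
    g = (2.0 / c) * (A.T @ r).reshape(H, W)

    # Alpha-norm regularizer: sum_i |x_i|^alpha.
    loss += lam_a * np.sum(np.abs(x) ** alpha)
    g += lam_a * alpha * np.abs(x) ** (alpha - 1) * np.sign(x)

    # Total-variation regularizer with beta = 2 (quadratic in the
    # finite differences, so the gradient is a simple scatter-add).
    dx = np.diff(x, axis=1); dy = np.diff(x, axis=0)
    loss += lam_tv * (np.sum(dx ** 2) + np.sum(dy ** 2))
    gtv = np.zeros_like(x)
    gtv[:, :-1] -= 2 * dx; gtv[:, 1:] += 2 * dx
    gtv[:-1, :] -= 2 * dy; gtv[1:, :] += 2 * dy
    g += lam_tv * gtv
    return loss, g

x = 0.1 * rng.standard_normal((H, W))  # random initialization
losses = []
for _ in range(1000):
    L, g = loss_and_grad(x)
    losses.append(L)
    x -= lr * g                        # plain gradient descent
print(losses[0], losses[-1])           # loss should drop substantially
```

Swapping the linear map for a differentiable CNN (and the plain update for the paper's momentum scheme) gives the general shape of their inversion procedure.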

What would be the results of performing reconstructions for artificially constructed representations? For example, what would happen if we reconstructed a CNN representation that had 1 for one class and 0 for all others on the output layer? Have the authors run similar experiments on any other representations, such as those mentioned in the introduction? Could the results from this paper be used to develop better deep network architectures?

The authors develop a novel means of reconstructing/inverting images from various features used in computer vision, including deep CNN features as well as HOG, DSIFT, etc. The main idea here is to attempt to "invert" features so that they most closely resemble their input form, as a way of seeing how much information from the original image is preserved and what invariances are introduced by the filter. To do this, the authors minimize the Euclidean distance between the target feature representation and the feature representation of the candidate reconstruction. CNNs lent themselves naturally to this approach as they are easy to differentiate (and thus perform gradient descent on), whereas the authors recast DSIFT and HOG features as special cases of a CNN to make them easily differentiable. This work ended up outperforming existing feature inversion techniques. Detailed visualizations were also created showing the inverses of various features.
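To give a flavor of the HOG/DSIFT-as-CNN recasting: the first HOG stage (directional image gradients) is just a bank of fixed convolution filters, one per orientation bin, and the later stages (binning, block pooling, normalization) can likewise be written as standard layer types. A numpy sketch of that first stage, with illustrative filters rather than the paper's exact "HOGNet" weights:

```python
import numpy as np

# Fixed "conv layer": one directional-derivative kernel per orientation bin.
K = 8
thetas = np.arange(K) * 2 * np.pi / K
dx = np.array([[0., 0., 0.], [-1., 0., 1.], [0., 0., 0.]])
dy = dx.T
# G_k = cos(theta_k) * d/dx + sin(theta_k) * d/dy
kernels = [np.cos(t) * dx + np.sin(t) * dy for t in thetas]

def conv2_valid(img, k):
    # Naive 'valid'-mode 2-D correlation (kept dependency-free).
    H, W = img.shape; kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def directional_responses(img):
    # K feature maps, one per orientation; the subsequent HOG stages
    # (approximate binning nonlinearity, block pooling, normalization)
    # are likewise expressible as CNN layers, making the whole feature
    # differentiable end to end.
    return np.stack([conv2_valid(img, k) for k in kernels])

img = np.random.default_rng(0).standard_normal((16, 16))
resp = directional_responses(img)
print(resp.shape)  # → (8, 14, 14)
```

Once the feature is written this way, the same gradient-descent inversion machinery used for CNNs applies unchanged.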

It would be interesting to see how the amount of invariance introduced by a particular filter/feature affects the Euclidean distance used by the authors.

Hi guys, there's a 15-minute talk by one of the authors here, which I thought was pretty useful/interesting.
