Wednesday, March 9, 2016

Thurs, Mar 10: Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, Trevor Darrell. CVPR 2015.


  1. This paper introduces a fully convolutional neural network model for the semantic segmentation task, which aims to find pixel-wise object class labels for images. The authors define a fully convolutional network (FCN) to be a network built only from convolutional layers, so the output is a non-linearly filtered “image” that retains spatial correspondence with the input image. The authors discuss how to reframe object classifiers as FCNs by recasting the fully connected layers as convolutions whose kernels span their entire input; these nets can then be scanned efficiently over an image to create a map of classification outputs. The authors introduce two methods to deal with the coarseness of the output and note that upsampling the output using bilinear (or learned) interpolation layers is an effective strategy. They also discuss how patch sampling can be recovered during training to correct class imbalances. In practice, the authors showed the results of converting several different classification networks into FCN segmentation networks. They also introduced a strategy for creating more fine-grained segmentations by connecting lower layers of the network into higher layers. In their results, the authors show that their approach achieves state-of-the-art results on several semantic segmentation tasks, including PASCAL VOC 2011/2012 and the SIFT Flow dataset.

    Why are the shift-and-stitch and patch-sampling tricks important if neither technique was needed in practice? The authors mention that there are diminishing returns to finer-grained versions of their skip-net structure; are there other ways that a more refined segmentation could be produced using this net?
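    A minimal sketch (not the authors' code) of the fully-connected-to-convolution equivalence described above: a fully connected layer applied to a flattened k x k input computes the same thing as a k x k convolution evaluated at one position, so sliding that convolution over a larger image yields a map of classification scores. Plain Python with made-up sizes; all names are hypothetical.

```python
def fc_layer(patch, weights):
    # Dot product of a flattened k x k patch with one class's weight vector,
    # exactly what a fully connected layer computes for a fixed-size input.
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat))

def conv_as_fc(image, weights, k):
    # Slide the same weights over the image as a k x k convolution
    # (stride 1, no padding), producing a spatial map of class scores.
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            patch = [r[j:j + k] for r in image[i:i + k]]
            row.append(fc_layer(patch, weights))
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
weights = [1, 0, 0, 1]  # toy 2 x 2 "fully connected" weights for one class

scores = conv_as_fc(image, weights, 2)  # 3 x 3 score map
# Each output position matches the FC layer run on that patch alone.
assert scores[0][0] == fc_layer([[1, 2], [5, 6]], weights)
```

    The efficiency gain in the paper comes from sharing the overlapping computation between neighboring patches, which this toy loop does not capture.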

  2. The paper presents a new technique to segment images (a pixel-by-pixel classification) using a fully convolutional network trained end to end. It extends the classification properties of neural networks to the pixel level. To create these networks, a classification net is adapted by exchanging the fully connected layers for convolutional layers that take the entire feature map as input. In addition, a new layer is appended at the end that upsamples the output of the last convolutional layer into a dense pixel classification. Finally, tests were made combining the outputs of the final convolutional layers with earlier convolutional layers (creating a DAG), yielding finer-grained predictions.
    The approach presented the best results when compared to other techniques. The authors compare against both object detection and scene parsing methods, since they treat both as pixel prediction tasks.

    Are the converted fully connected layers just convolutional layers whose receptive field is the entire feature map?
    Do you have any thoughts on why training on patches of the image, rather than whole images, does not affect the result?
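    The upsampling layer mentioned in the summaries can be made concrete with the standard bilinear-interpolation weights used to initialize a learnable "deconvolution" (transposed convolution). This is a 1-D plain-Python sketch under my own naming, not the paper's implementation:

```python
def bilinear_kernel(factor):
    # 1-D bilinear interpolation weights for an integer upsampling factor;
    # the standard initialization for a learnable upsampling layer.
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]

def upsample_1d(signal, factor):
    # Transposed convolution with the bilinear kernel: each input value
    # spreads its weighted contribution over `k` output positions.
    weights = bilinear_kernel(factor)
    k = len(weights)
    out = [0.0] * ((len(signal) - 1) * factor + k)
    for i, v in enumerate(signal):
        for j, w in enumerate(weights):
            out[i * factor + j] += v * w
    # Crop the convolution border so the output is len(signal) * factor.
    start = (k - factor) // 2
    return out[start:start + len(signal) * factor]

# 2x upsampling of [1, 3]: the interior values 1.5 and 2.5 are the
# bilinear interpolants; border values only get partial kernel support.
assert upsample_1d([1, 3], 2) == [0.75, 1.5, 2.5, 2.25]
```

    Because the paper makes these weights learnable, the net can refine this interpolation during training rather than keeping it fixed.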

  3. In Fully Convolutional Networks for Semantic Segmentation, Long, Shelhamer, and Darrell propose and build several varieties of convolutional neural networks for semantic segmentation tasks. They repurpose existing nets like GoogLeNet and VGG by deleting the final classification layers to obtain a net that outputs pixel-by-pixel predictions of what types of objects are where in the image. They fine-tune these networks on datasets like PASCAL VOC, NYUDv2 (images with depth information), and SIFT Flow to achieve state-of-the-art segmentation performance on each of these datasets with fast test times.

    Discussion: Exactly what type of functions and layers are allowed in their definition of a fully convolutional network?

  4. The authors of Fully Convolutional Networks for Semantic Segmentation create a new type of convolutional neural net to solve the problem of semantic segmentation. Fully convolutional networks are CNNs without any fully connected layers; the authors point out that a typical CNN like AlexNet can be converted into an FCN by interpreting its fully connected layers as convolutions. The authors discuss shift-and-stitch and upsampling as ways to provide denser outputs, but ultimately address the problem by fusing predictions from different layers instead.

    They converted AlexNet, VGG16, and GoogLeNet to FCNs; GoogLeNet had lower than expected segmentation performance, but VGG achieved state-of-the-art performance. Their approach gets state-of-the-art results on several datasets and also reduces inference time.

    Do you have any thoughts on why GoogLeNet didn't perform as well as VGG16? It looks like, when fusing predictions from the pooling and convolutional layers, they give equal weight to each - is this the best way to do this?
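    The equal-weight fusion asked about above amounts to upsampling the coarser score map and summing it elementwise with the finer one. A plain-Python sketch, with naive nearest-neighbour 2x upsampling standing in for the paper's learned deconvolution, and hypothetical scale parameters standing in for an unequal weighting:

```python
def upsample2x(grid):
    # Naive nearest-neighbour 2x upsampling, just to make shapes match;
    # the paper uses a learned (bilinear-initialized) deconvolution here.
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine, coarse_scale=1.0, fine_scale=1.0):
    # Skip-style fusion: upsample the coarse score map and add it
    # elementwise to the finer-layer score map. Equal weighting is
    # scale = 1.0; learned unequal scales would be one alternative.
    up = upsample2x(coarse)
    return [[coarse_scale * u + fine_scale * f
             for u, f in zip(urow, frow)]
            for urow, frow in zip(up, fine)]

fused = fuse([[1]], [[10, 20], [30, 40]])
assert fused == [[11, 21], [31, 41]]
```

    Note that the paper's 1x1 scoring convolutions on the intermediate layers are trained, so the net can in effect learn its own fusion weighting even when the final sum is unweighted.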