Monday, February 8, 2016

Tues, Feb. 9: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton. NIPS 2012.


  1. This paper introduces a deep convolutional neural network architecture for performing object classification on the ImageNet dataset. The system developed by the authors gives state-of-the-art performance on both the ILSVRC-2010 and ILSVRC-2012 contest datasets from ImageNet, with top-5 error rates of 17% and 15.3% on the respective challenges. The network consists of 8 layers: 5 convolutional layers and 3 fully-connected layers. The authors employ a number of important “tricks” in the network in order to reduce training time and prevent overfitting. These include: using non-saturating rectified linear unit (ReLU) outputs, using max-pooling layers with overlapping pools, using random neuron dropout in the fully-connected layers, and artificially augmenting the dataset with transformed and re-colored images. The authors also employ a GPU-based training scheme with a novel architecture that splits the network across two GPUs, reducing inter-GPU communication costs by having the neurons communicate across GPUs only on certain layers. The authors hypothesize that with more data and a larger network, convolutional neural networks could perform even better on object classification tasks.

    Discussion: What is the trick to hyperparameter tuning when the network takes so long to train? There doesn’t seem to be much justification given for the high-level structure of the network; what is the strategy for developing this type of structure? Did the authors perform any analysis of the errors on their test sets? For instance, are some object types more likely to be misclassified than others?
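    The non-saturating ReLU trick mentioned above is simple enough to sketch in a few lines of NumPy. The gradient comparison against a saturating tanh unit is my own illustration, not code from the paper, but it shows why the non-saturating unit trains faster:

    ```python
    import numpy as np

    def relu(x):
        # Non-saturating rectified linear unit: f(x) = max(0, x).
        return np.maximum(0.0, x)

    def relu_grad(x):
        # Gradient is 1 for every positive input, however large,
        # so the learning signal never vanishes on the active side.
        return (x > 0).astype(float)

    def tanh_grad(x):
        # For comparison: a saturating unit's gradient, 1 - tanh(x)^2,
        # shrinks toward 0 as |x| grows, slowing gradient descent.
        return 1.0 - np.tanh(x) ** 2

    x = np.array([-2.0, 0.5, 6.0])
    print(relu_grad(x))  # [0., 1., 1.]
    print(tanh_grad(x))  # the entry for x = 6.0 is nearly 0
    ```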

  2. The paper presents an approach driven by the ImageNet dataset (more than 1 million images) and a deep neural network. From the 22,000 categories in the dataset, a subset of 1000 was used, and the images were down-sampled to 256x256 pixels. The network has 60 million parameters in total, and two GPUs were used to handle the processing. The architecture presents several new features: an unusual activation function, the Rectified Linear Unit, which reduces training time; two GPUs that communicate with each other only in certain layers, each responsible for one half of the kernels; and local response normalization and overlapping pooling. The network has five convolutional layers and three fully-connected layers; the last is a softmax that maps the 4096 outputs to a thousand classes. A technique called dropout was also employed, in which each neuron produces 0 as its output with 50% probability during training.
    The technique achieved state-of-the-art results on ILSVRC-2012, with almost 10% lower error than the runner-up, and better still with a pre-trained model. Euclidean distances between the feature activations of test images also showed that similar photos lie close together, giving qualitative evidence of the network's performance.
    Was there a methodical way to define the structure and the parameters, especially momentum, weight decay, and learning rate?
    Does it get better results because of the amount of data used, or because it really learned patterns and relations in the data?
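    The dropout scheme described above can be sketched as a small NumPy routine (a minimal illustration under the paper's setup: drop probability 0.5 during training, outputs scaled by 0.5 at test time; function and argument names are my own):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p_drop=0.5, train=True):
        # During training, each neuron's output is set to 0 with
        # probability p_drop, so it cannot rely on the presence of
        # particular other neurons (reducing co-adaptation).
        if train:
            mask = rng.random(activations.shape) >= p_drop
            return activations * mask
        # At test time all neurons are used, but their outputs are
        # scaled by (1 - p_drop) to match the training-time expectation.
        return activations * (1.0 - p_drop)

    h = np.ones(8)
    print(dropout(h))               # roughly half the entries are 0
    print(dropout(h, train=False))  # every entry is 0.5
    ```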

  3. This paper presents a deep convolutional neural network trained on ImageNet, classifying 1.2 million images into 1000 different classes. They down-sampled images to 256x256 pixels, and created new training images with translations, reflections, and shifts in lighting.

    They used 5 convolutional layers and 3 fully-connected layers, which they trained using two GPUs. The authors had a few tricks to reduce overfitting, including overlapping pooling and data augmentation (mentioned above), and dropout.

    This system outperformed previously reported results in several competitions.

    Why does the restricted connectivity between kernels help? Has this been done with more than two GPUs?

    The images on the right in Figure 4 all seem very similar, but were there any test images for which the retrieved training images were less good or just weird?
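    The translation-and-reflection augmentation summarized above amounts to sampling random 224x224 patches and horizontal flips from each 256x256 image. A rough sketch (function names are my own, not from the paper's code):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def random_crop_flip(img, crop=224):
        # Sample a random crop x crop patch from the image, then
        # mirror it horizontally half the time. Each 256x256 image
        # yields many distinct training examples this way.
        h, w = img.shape[:2]
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        patch = img[top:top + crop, left:left + crop]
        if rng.random() < 0.5:
            patch = patch[:, ::-1]  # horizontal reflection
        return patch

    img = np.arange(256 * 256 * 3, dtype=float).reshape(256, 256, 3)
    print(random_crop_flip(img).shape)  # (224, 224, 3)
    ```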

  4. Leveraging a deep (8-layer) convolutional neural network architecture, the authors achieve state-of-the-art performance and competitively low error rates when classifying a subset of 1000 classes from the ILSVRC-2010 and ILSVRC-2012 ImageNet datasets. Their 8-layer architecture, which consists of 5 convolutional layers and 3 fully-connected layers, employs a novel melding of new and existing techniques to achieve this performance. Of particular note are the authors' use of Rectified Linear Units as a fast-to-evaluate activation function that speeds up training, and their synthetic augmentation of the original dataset with slightly modified or recolored versions of the original images. At the end of the day, the authors achieved more than 10% lower error than the next-best competition entry using this architecture.


    The convolutional architecture used by the authors involves many initial parameters, including the initial network topology, and the parameters used to configure and adjust learning rate, momentum, weight decay, and neuronal dropout. To what extent might it be possible to learn these parameters adaptively rather than hard coding them into the model?

    PCA-based recoloring, reflections, patch extraction, and other simple techniques were used to augment the original dataset and avoid overfitting. Why do the authors not simply add small amounts of random noise to the images? Does their augmentation technique truly eliminate overfitting, or is it somewhat deceptive?
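    The PCA-based recoloring questioned above can be sketched as follows: compute the principal components of the RGB values, then shift every pixel by a small random multiple of each component scaled by its eigenvalue. This is a rough illustration of the paper's lighting augmentation; the function name and the way the covariance is taken over a single image (rather than the whole training set, as in the paper) are my own simplifications:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def pca_color_shift(img, alpha_std=0.1):
        # img: H x W x 3 float array. Find eigenvectors p_i and
        # eigenvalues lambda_i of the 3x3 RGB covariance, then add
        # sum_i alpha_i * lambda_i * p_i to every pixel, with each
        # alpha_i drawn from N(0, alpha_std).
        flat = img.reshape(-1, 3)
        cov = np.cov(flat, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        alphas = rng.normal(0.0, alpha_std, size=3)
        shift = eigvecs @ (alphas * eigvals)
        return img + shift  # same color shift applied to every pixel

    img = rng.random((8, 8, 3))
    print(pca_color_shift(img).shape)  # (8, 8, 3)
    ```

    Unlike independent per-pixel noise, this perturbation moves all pixels coherently along the directions of natural color variation, so object identity under changed illumination is preserved.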