Paper discussion blog for CS 2951t at Brown University. Instructor: Genevieve Patterson
The paper proposes a new architecture for a decision forest that uses probabilistic nodes to route samples down the tree. The decision function in these nodes is differentiable, which makes it possible to learn it with stochastic gradient descent, for example. This function is the sigmoid of a function f; each node has its own f, but all of them share a common parametrization, since each f is an output unit of a deep network. Training a single tree in the forest therefore consists of minimizing a loss function with respect to the shared network parameters and the leaf predictions. The general training strategy alternates between updating the parameters of the prediction (leaf) nodes and, given those, applying stochastic gradient descent to the shared network parameters. Two configurations were used in the experiments. The first compared state-of-the-art forest classifiers with a shallow version of the proposed tree, without deep learning; on most datasets the proposed version outperformed the state-of-the-art technique. The second experiment used a modified version of GoogLeNet in which the three softmax layers were replaced by decision forests; the resulting network outperformed GoogLeNet.

Discussion
How is the shallow tree defined, i.e., which function is used in place of the CNN output?
Do they consider the results from the inner forests, or just the one that replaced the final layer?
Do you think the trees made the output less affected by noise? That is, would the trees filter the information to keep only what is most important for the decision?
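The probabilistic routing described above can be sketched in a few lines. This is my own illustration, not the authors' code: I assume a single tree of depth 2 and use linear split functions f in place of the deep-network outputs, so the shared parametrization of the paper is simplified away. Each inner node routes a sample left with probability sigmoid(f(x)), the probability of reaching a leaf is the product of the routing decisions along its path, and the tree's prediction is the resulting mixture of leaf class distributions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical depth-2 soft decision tree with 3 inner nodes and 4 leaves.
# In the paper each f is an output unit of a deep network; here each f is
# a linear function of the input, purely for illustration.
rng = np.random.default_rng(0)
n_features, n_classes = 4, 3
W = rng.normal(size=(3, n_features))            # one linear f per inner node
b = rng.normal(size=3)
pi = rng.dirichlet(np.ones(n_classes), size=4)  # class distribution per leaf

def predict(x):
    d = sigmoid(W @ x + b)        # d[n] = probability of routing LEFT at node n
    # Probability of reaching each leaf = product of decisions along its path.
    mu = np.array([
        d[0] * d[1],              # left,  left
        d[0] * (1 - d[1]),        # left,  right
        (1 - d[0]) * d[2],        # right, left
        (1 - d[0]) * (1 - d[2]),  # right, right
    ])
    return mu @ pi                # mixture of the leaf class distributions

x = rng.normal(size=n_features)
p = predict(x)
print(p)                          # a valid class distribution (sums to 1)
```

Because the leaf-reach probabilities always sum to one, the output is a proper distribution over classes regardless of the split parameters.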
The authors unite the worlds of convolutional neural networks and decision forests by demonstrating how to model and train stochastic, differentiable decision trees as a replacement for part of a convolutional neural network. They define a stochastic routing algorithm for decision trees, which enables the split node parameters to be learned via backpropagation. Using this technique, they were able to match or outperform state-of-the-art results for both decision trees and deep neural networks on MNIST and ImageNet. Most notably, by replacing the final softmax layers in GoogLeNet, they were able to outperform the best available GoogLeNet architecture, reaching 6.67% error without relying on any form of training data set augmentation.

Discussion
What is stopping them from using decision trees to replace neural networks entirely? If they can replace the softmax layers of a GoogLeNet, why can't they replace the other hidden layers as well?
Decision trees are known for the ease with which "knowledge" can be extracted from a trained model, whereas neural networks are notoriously resistant to knowledge extraction. Do their modified decision trees allow for easier knowledge extraction?
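The key point in the summary above is that stochastic routing makes the tree's loss differentiable in the split parameters, which is what allows backpropagation. A toy check of that claim (my own sketch, not the paper's implementation): for a depth-1 tree with a linear split function and fixed leaf distributions, the negative log-likelihood is a smooth function of the split weights, so a single finite-difference gradient step lowers the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy depth-1 soft tree: one split node, two leaves with fixed class
# distributions. Everything here is an assumed illustration.
rng = np.random.default_rng(1)
n_features = 4
w = rng.normal(size=n_features)      # linear split function f(x) = w . x
pi = np.array([[0.8, 0.1, 0.1],      # leaf reached when routed left
               [0.1, 0.1, 0.8]])     # leaf reached when routed right

x = rng.normal(size=n_features)
y = 0                                # true class of this sample

def loss(w):
    d = sigmoid(w @ x)               # probability of routing left
    p = d * pi[0] + (1 - d) * pi[1]  # mixture prediction over classes
    return -np.log(p[y])             # negative log-likelihood

# Finite-difference gradient and one gradient-descent step on the split
# weights; an autodiff framework would compute the same gradient exactly.
eps, lr = 1e-6, 0.5
grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                 for e in np.eye(n_features)])
w_new = w - lr * grad
print(loss(w), loss(w_new))          # the loss decreases after the step
```

A hard (non-stochastic) routing rule would make the loss piecewise constant in w and the gradient zero almost everywhere, which is exactly the obstacle the soft routing removes.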