In 2015, Nature published “Deep Learning.” Although three years ago can feel like a century ago in machine learning, the paper is full of interesting ideas and offers a good step-back view of the intuition behind deep learning.
Below are seven insights from the paper:
Comparing deep learning to linear classifiers
- Linear classifiers work by carving the input space into separate regions with a hyperplane (a subspace of n − 1 dimensions in an n-dimensional space).

Hyperplane splitting this 2-dimensional space (source)
- Image and speech recognition need to be “insensitive to irrelevant variations in input”. A shifted image of a dog should still be classified as a dog. On the other hand, a good classifier needs to be very sensitive to some minute variations. The paper gives the example of a Samoyed (a wolf-like dog) and a wolf sitting in the same spot: an image recognition classifier should tell them apart.
- Linear classifiers, or shallow classifiers, cannot do this from raw pixel inputs. They need good hand-designed feature extractors.
- Deep learning allows these features to be learned automatically.
- In a deep learning architecture: each subsequent layer increases both invariance (on what doesn’t matter to the output) and selectivity (on what matters).
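The hyperplane idea is easy to make concrete (a toy sketch of my own, not from the paper): in 2D, a linear classifier's decision boundary w·x + b = 0 is just a line, and classification is the sign of the dot product.

```python
import numpy as np

# A 2-D linear classifier: the decision boundary w . x + b = 0
# is a hyperplane (here, a line) splitting the plane into two regions.
w = np.array([1.0, -1.0])  # normal vector of the hyperplane
b = 0.0

def classify(x):
    """Return +1 or -1 depending on which side of the hyperplane x lies on."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([2.0, 1.0])))  # below the line y = x -> +1
print(classify(np.array([1.0, 2.0])))  # above the line y = x -> -1
```

Everything on one side of the line gets one label, everything on the other side gets the other, which is exactly why a shifted dog image (a tiny move in pixel space that can cross the boundary) trips up a pixel-level linear classifier.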
Local minima are not a real issue in practice
- In practice, the loss surface has lots of “saddle points” where the gradient is 0: the surface curves up along most dimensions, but curves down along the remainder.
- This doesn’t quite make sense to me: if there are no local minima to get stuck in, then what is the limitation of neural networks?
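A minimal illustration of a saddle point (my example, not the paper's): f(x, y) = x² − y² has zero gradient at the origin, yet its Hessian has one positive and one negative eigenvalue, so the origin is a saddle rather than a minimum, and gradient descent is not permanently trapped there.

```python
import numpy as np

# Toy saddle point: f(x, y) = x**2 - y**2 has zero gradient at the origin,
# but the surface curves up along x and down along y.

def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2."""
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],    # curvature along x: positive
                    [0.0, -2.0]])  # curvature along y: negative

print(grad(0.0, 0.0))               # the gradient vanishes at the origin
print(np.linalg.eigvalsh(hessian))  # mixed-sign eigenvalues -> saddle, not a minimum
```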
Pixel level labeling with CNNs
The paper claims that:
“Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars.”
How good is this actually?

The above result is from the 2015 paper: From Image-level to Pixel-level Labeling with Convolutional Networks. The output seems okay, but not great (it missed the horse’s face, and grabbed the wall along with the bus). Given convolutional nets’ compression of the image in hidden layers (which loses pixel-level spatial information), I believe pixel-level labeling will continue to be a difficult problem to solve.
Deep learning has exponential advantages over classic learning
- Learning n binary features allows 2^n combinations of features. This will include combinations not seen in the training data.
- This hints at a more generic principle that general algorithms are better than specific ones, and are exponentially so.
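The counting argument is easy to check directly (a quick sketch using only the standard library): with n independently learned binary features, the model can represent every one of the 2^n combinations, even ones absent from training.

```python
from itertools import product

# n binary features give 2**n possible combinations; a model that learns
# the features independently can represent combinations it never saw in training.
n = 3
combinations = list(product([0, 1], repeat=n))
print(len(combinations))  # 8 == 2**3
```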
Neural Turing machines can be taught reasoning!
- They can output a sorted list of symbols when the input is an unsorted list of symbols paired with priority values.
- I had not come across this previously. This is such an incredible result. The original paper by DeepMind/Google is here.
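The task itself is simple to state: each symbol arrives with a scalar priority, and the target output is the symbols ordered by priority. Here is a plain-Python statement of that target mapping (the remarkable part is that the Neural Turing Machine learns this behavior end to end from examples, rather than calling a sort routine):

```python
import random

# Priority-sort task: input is (symbol, priority) pairs in arbitrary order;
# the target output is the symbols ordered by their priorities.
random.seed(0)
pairs = [(chr(ord('a') + i), random.random()) for i in range(5)]
target = [symbol for symbol, priority in sorted(pairs, key=lambda p: p[1])]
print(target)  # the five symbols, reordered by ascending priority
```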
Future progress will be in ConvNets combined with RNNs trained where to look
- This will also apply to language processing: they predict that neural nets that learn to be attentive to one part at a time will build state of the art sentence and document comprehension.
- These are referred to as “Attention Mechanisms.” They continue gaining momentum and getting good results, particularly in machine translation. link
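The core of most attention mechanisms fits in a few lines of numpy (a generic scaled dot-product sketch, not the exact formulation of any paper cited here): a query is scored against a set of keys, the scores are softmaxed into weights, and the output is the weighted sum of the values.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: weight each value by how well
    its key matches the query."""
    scores = keys @ query / np.sqrt(query.shape[0])  # similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the scores
    return weights @ values

keys = np.eye(3)                   # three distinct one-hot keys
values = np.array([10.0, 20.0, 30.0])
query = np.array([0.0, 5.0, 0.0])  # strongly matches the second key
print(attention(query, keys, values))  # close to 20.0
```

This is "attending to one part at a time": the softmax concentrates the weights on the best-matching entry, so the model reads mostly from it while still passing gradients through all of them.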
Major progress will combine representation learning with complex reasoning.
- This is largely still missing from deep learning production models today.
- In 2017, DeepMind proposed Relation Networks to address relational reasoning.