Every Model Learned by Gradient Descent Is Approximately a Kernel Machine (paper review)


Stan Kriventsov

Review of paper by Pedro Domingos, University of Washington, 2020

Originally published in Deep Learning Reviews on December 14, 2020.

In this paper, the author shows that models trained using first-order gradient descent with a small learning rate are approximately kernel machines: their predictions can be written as a weighted sum of a similarity kernel evaluated between the input and each of the training points, so in effect they memorize the training data and use it at prediction time in much the same way as SVMs and other kernel methods. This insight should lead to a better general understanding of how deep neural networks operate and, hopefully, will help improve future algorithms.
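In symbols, the paper's main result can be stated roughly as follows (a paraphrased sketch of its Theorem 1, with my own notation rather than a verbatim quote):

```latex
% Approximate kernel-machine form of a model trained by gradient descent.
% y(x): trained model output; x_i: training inputs; w: model weights;
% c(t): the path traced by w during gradient descent.
y(x) \;\approx\; \sum_{i=1}^{m} a_i \, K^{p}(x, x_i) \;+\; b,
\qquad
K^{p}(x, x') \;=\; \int_{c(t)} \nabla_w y(x) \cdot \nabla_w y(x') \, dt
```

Here K^p is the "path kernel" (the tangent kernel integrated along the gradient-descent trajectory), the coefficients aᵢ depend on the derivatives of the loss at each training point along that path, and b is the output of the initial, untrained model.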

What can we learn from this paper?

That the output of a trained neural network can be viewed as a superposition of its training examples.

Prerequisites (what should one be familiar with to better understand the paper?)

  • Neural networks
  • Kernel methods
  • Neural tangent kernels

Discussion

Kernel machines are models that predict an output value for a given input x by calculating a weighted sum of the values of a symmetric positive-definite similarity function (kernel) K between x and each of the training inputs xᵢ, possibly with an added bias constant and a final nonlinearity applied to the result.
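As a concrete (if simplified) illustration, here is a minimal sketch of such a prediction in Python; the Gaussian (RBF) kernel and all names here are my own choices for illustration, not anything taken from the paper:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """A symmetric positive-definite similarity function K(x, x')."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def kernel_machine_predict(x, train_X, coeffs, bias=0.0, nonlinearity=None):
    """Weighted sum of kernel similarities between x and each training input,
    plus a bias, optionally passed through a final nonlinearity."""
    s = bias + sum(a_i * rbf_kernel(x, x_i) for a_i, x_i in zip(coeffs, train_X))
    return nonlinearity(s) if nonlinearity is not None else s

# Toy usage: three stored training inputs and their learned coefficients.
train_X = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
coeffs = [0.5, -1.2, 0.8]
print(kernel_machine_predict(np.array([0.5, 0.5]), train_X, coeffs))
```

Note that the prediction is built entirely from stored training inputs and per-example coefficients, which is exactly the sense in which kernel methods "remember" their training data.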