Puzzling Success of Overparameterization: Lottery Tickets or Escape Dimensions?
infoscience.epfl.chA related viewpoint is that overparametrization is good because the model is stranded when the Hessian has all positive/zero eigenvalues. If we treat the probability that a particular Hessian eigenvalue turns positive as a Bernoulli process, the chance of all eigenvalues going positive/zero exponentially decreases as the parameter count increases
You don't need billions of parameters for that, precisely because the risk of being stuck at a local minimum decreases exponentially with the number of parameters. Right?
I think of it like this. Imagine a network with two inputs and one output. What's happening during training is to orient a set of 2d planes in 3d space. Then for each x and y coordinate you can iterate through those planes and figure out where the normal at (x,y) hits them on the z axis, take the highest of your result and that's your network output.
Sometimes the solution needs one of the planes to make a big change to its orientation. and in the process of doing so the fitness will go down. Now if some other plane happened, by luck, to be in a better position to get to that state without lowering the fitness, then it will generally dominate over the other one, and the first one would start to "evolve" to be suppressed by the other one. The first one then becomes (usually, but not always) pretty much dead. Since the network has a fixed number of weights, it can't decide to just "add another plane". It has to make do with a fixed number of them. If too much of them become useless your training can't really recover, you're stuck at that minimum. The overparametrization allows for a bigger chance that some other plane's orientation will happen to be in a position to shadow counterproductive ones.
IIRC the original author of the Lottery Ticket Hypothesis now disavows that idea.
One intuitive way of looking at it is like so - let's say that you have a gaussian-looking plot. You want to fit a gaussian. You have a stupid simple model where you can slide your gaussian left and right.
If your initial starting point happens to be roughly within range, great, your optimizer will take care of it for you and slide it into the correct place. If you're too far, too bad, no meaningful gradient.
Instead, neural nets give you the option to spawn a gaussian anywhere you please. In this case, no sliding is necessary, but it comes at a heavy parametrization cost.
A while ago a lot of the discussion about overparameterization was about explaining "double descent", the observation that test error doesn't descend monotonically and actually hits a local maximum around the point where the model has just enough parameters to interpolate the data. My favorite article about double descent looks at this in terms of splines [1]. If I can try to summarize that article: when you are designing a parametrized model to fit to data, you have a choice. You can either:
1. Avoid overparameterization by design. Manually create or choose a space of functions that has limited degrees of freedom by construction.
2. Accept overparameterization and regularize.
The latter tends to be more robust, because of the bitter lesson. It's not practical to manually design an ideal, on-demand, just-right limited-parameter model for every dataset we are presented with. The best way to approach that ideal, it turns out, is really to just let the computer figure it out via regularized optimization over an overparameterized space.
Statisticians started moving in favor of overparameterization long before deep learning got off the ground. This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).
> [Regularisation] tends to be more robust, because of the bitter lesson.
This seems circular? Over-parametrisation and stochastic gradient decent provides automatic regularisation, and it turns out this regularisation is very good in practice. The bitter lesson is simply the a restatement of the empirical observation that this specific regularisation often works great, it doesn't come with any explanation.
> It's not practical to manually design an ideal, on-demand, just-right limited-parameter model for every dataset we are presented with.
What if that could be automated?
The process would need to have some knowledge of the desired outcome, much like a human expert would have a hunch of the design decisions to make.
Not necessarily. I'm a proponent of (admittedly not very popular) methodology of "train, do interpretability analysis, adjust model architecture".
It's not more popular for a few reasons: 1) you first need to train a full general model anyhow 2) interpretability is nontrivial and not guaranteed 3) once you make the architectural changes, you can't commit to that architecture as you might miss out in the future with more advancements 4) with modern transformers, there is limited amount of architectural "play" happening.
> This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).
Could you elaborate on this?
I was also confused at first. I think what OP meant is that ESL talks a lot about regularization but doesn't mention over parametrization much.
However, most of their example datasets are small, and they use complicated models, so it's sort of implicit.
They definitely don't mention double descent though.
Hi, I work on RL, or as it is known today, "classical" RL. I'm interested in knowing the latest work that explains double descent and in general optimisation behaviour of overparameterized neural networks. Do you have a survey paper or blog post or anything else to recommend?
How is this view inconsistent with the lottery ticket hypothesis?
I have a very hand-wavy explanation for how (but not fully why) overparametized nets tend to generalize rather than overfit.
First a couple of facts:
1) An ANN works by learning decision boundaries that separate and group training samples and their associated labels.
2) If you train an overparametized net on random data then it will memorize it, but if you train it on consistently labelled structured data lying on some lower dimensional manifold, then rather than memorizing it, it will instead generalize, so the behavior depends on the nature of the data it is trained on.
Now the hand-wavy bit:
As training progresses the weights move the decision surfaces around until each training sample maps to a region of output space corresponding to the correct label, with these regions of output/latent space being separated by the learnt decision surfaces.
Initially during training (up to the double descent phase in cases where that happens) these regions of "gerrymandered" output space may only correspond to a single or very few training samples, so there may be multiple disconnected regions each mapping to the label "cat", and another group of disconnected regions each mapping to the label "dog". This is the the overfitting phase.
Now, if the data permits, with the data manifold being consistently labelled (nothing that looks like a cat being labelled a dog), there will often be potential to merge some of these disconnected regions of output space that map to the same label. So, for example we might go from four small regions of "cat" space to two larger merged regions of "cat" space. This is the mechanism of generalization with the extra space contained by the merged regions corresponding to interpolation - no training samples "forced" those larger merged regions, but also none prevented it ("dog" that looks like a cat).
The question then remains why the dynamics of training may cause the decision surfaces to initially be highly "gerrimanderd" (because it's easier?), but on continued training to merge (because without any dogs among the cats there is no reason not to, and once merged no label error causing them to unmerge - a ratcheting up process from smaller to larger regions with increasing generalization?).
Isn't this trivial?
What's more interesting is as to why double descent happens