
weight | watcher
WeightWatcher (w|w) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, drawing on the earlier phenomenological theory of Heavy-Tailed Self-Regularization (HTSR), published in JMLR, Nature Communications, etc., and the more recent SemiEmpirical Theory of (Deep) Learning (SETOL), presented at NeurIPS 2023.
WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).
Easy to use
WeightWatcher requires just a few lines of code to generate layer-by-layer diagnostics for your Deep Learning models.
It supports most TensorFlow/Keras, PyTorch, and HuggingFace CV and NLP/Transformer models (Dense and Conv2D layers).
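For example, here is a minimal sketch of the basic workflow on a pretrained torchvision model (any supported Keras, PyTorch, or HuggingFace model works the same way):

    import weightwatcher as ww
    import torchvision.models as models

    # load any supported model; here, a pretrained VGG from torchvision
    model = models.vgg19_bn(pretrained=True)

    # analyze the model, layer by layer
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()   # pandas dataframe: one row per layer

    # aggregate the layer metrics into model-level quality metrics
    summary = watcher.get_summary(details)
    print(summary)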

What can weightwatcher do for you?
Here are just a few examples of what you can do with the tool:
Identify poorly trained layers in your models
Weightwatcher computes unique layer-by-layer quality metrics
The weightwatcher HTSR theory
tells us if and when a specific DNN layer has converged properly; it is unique in this regard,
being the only theory capable of this. When you run watcher.analyze(), you obtain a pandas dataframe containing
several layer quality metrics. In particular, the weightwatcher alpha metric can tell us whether a layer is well trained or not.
Specifically, the layer quality metric alpha should lie between 2 and 6.
Here we have run weightwatcher on two of the recently popular Bloom models.
We plot a histogram of the layer alpha values.
Notice that both models have several layers with alpha > 6; this is not great.
If you see such layers in your model, you may need to decrease the size of the layers or add data, and/or do a better job of optimizing your hyperparameters.
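As a minimal sketch, continuing from the basic example above (and assuming the column names of recent weightwatcher versions), you can plot the histogram and flag the suspect layers directly:

    import matplotlib.pyplot as plt

    # histogram of the layer alpha values
    details.alpha.plot.hist(bins=50)
    plt.axvline(x=6.0, color='red', label='alpha = 6')
    plt.xlabel('alpha')
    plt.legend()
    plt.show()

    # list the potentially under-trained layers (alpha > 6)
    print(details[details.alpha > 6][['layer_id', 'name', 'alpha']])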
Help you select the best pretrained model
You can compare models by looking at their layer alphas
The best models have layer alphas that lie between 2 and 6. This can be seen by comparing the layer alphas for
BERT and
XLNet.
The WeightWatcher Power-Law (PL) metric alpha is a DNN model quality metric; smaller is better.
This plot displays all the layer alpha values for these 2 popular models.
It is immediately clear that the XLNet layers look much better than the BERT layers:
the alpha values are smaller on average, and there are no alphas larger than 5 (alpha <= 5).
In contrast, the BERT alphas are larger on average, and BERT has too many large alphas.
This is totally consistent with the published results: in the original XLNet paper (from CMU and Google Brain),
XLNet outperforms BERT on 20 different NLP tasks.
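A sketch of such a comparison, assuming weightwatcher accepts HuggingFace transformers models directly and using the standard model names from the HuggingFace hub:

    import weightwatcher as ww
    from transformers import AutoModel

    # compare the layer alpha distributions of two pretrained models
    for name in ['bert-base-uncased', 'xlnet-base-cased']:
        model = AutoModel.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name,
              'mean alpha:', details.alpha.mean(),
              'layers with alpha > 5:', (details.alpha > 5).sum())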
Predict (trends in) test accuracies -- without training or test data
The average layer quality metrics are correlated with test accuracy
Weightwatcher provides several different layer quality metrics, such as alpha, alpha-hat, etc.
From these, we can make a model quality metric by simply taking a layer average.
One particularly useful model metric is the average alpha-hat, which is a weighted average
of the weightwatcher alpha layer quality metric.
The weightwatcher average alpha-hat metric is correlated with the reported test accuracies
for many production Computer Vision (CV) models like the VGG series, the ResNet series, etc.
The weightwatcher Nature paper
shows that the average alpha-hat metric is remarkably well
correlated with test accuracies for over 100 different CV models. Here, we show how the
average alpha-hat metric tracks the reported top 1 (and top 5) test accuracies for the
open-source VGG models.
And, again, this does NOT require access to the test or even the training data!
You can reproduce this yourself using this Notebook.
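Here is a sketch of this comparison for the torchvision VGG series; we assume the alpha-hat metric is reported as alpha_weighted in the summary, as in recent weightwatcher versions:

    import weightwatcher as ww
    import torchvision.models as models

    # smaller average alpha-hat should track higher reported test accuracy
    for name, ctor in [('vgg11', models.vgg11), ('vgg13', models.vgg13),
                       ('vgg16', models.vgg16), ('vgg19', models.vgg19)]:
        watcher = ww.WeightWatcher(model=ctor(pretrained=True))
        summary = watcher.get_summary(watcher.analyze())
        print(name, 'average alpha-hat:', summary['alpha_weighted'])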

Evaluate the information flow in different architectures
Correlation Flow shows how well information flows from data to labels
Deep learning models learn the patterns in the training data and encode them in the correlations in the layer weight matrices.
In doing so, information must flow from the data (the first layer) to the labels (the last layer).
We can visualize this flow using the weightwatcher tool by plotting the layer id against layer quality metric alpha:
we call this plot the Correlation Flow.
Here, we see how the Correlation Flow looks in the VGG architectures. In our Nature paper, we examine other architectures, such as the ResNet and DenseNet series.
The best models have good Correlation Flow, with stable alphas. In models with less optimal architectures, the layer alphas
may increase with layer id, as with the VGG models, and may even behave more erratically.
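A minimal sketch of a Correlation Flow plot for a single model, again using the details dataframe:

    import matplotlib.pyplot as plt
    import weightwatcher as ww
    import torchvision.models as models

    details = ww.WeightWatcher(model=models.vgg19(pretrained=True)).analyze()

    # Correlation Flow: the layer quality metric alpha vs. network depth
    plt.plot(details.layer_id, details.alpha, marker='o')
    plt.xlabel('layer id')
    plt.ylabel('alpha')
    plt.title('Correlation Flow')
    plt.show()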

Find (and remove) training anomalies
Correlation Traps can arise with bad learning rates, etc.
If we randomize the layer weight matrix, W, we can compute the randomized ESD (red) and compare this to the original trained ESD (green). When the layer is well correlated, and captures information well, the randomized ESD (red) will look very different from the original ESD (green). If the layer matrix W is well trained, its ESD will appear heavy tailed, even on a log scale. Moreover, when W is randomized, the randomized ESD will look like a near-perfect semi-circle with very crisp edges. We see both here for VGG FC3 (first / top plot).
In contrast, for VGG FC1, the
original and randomized ESDs do not differ much, except at the right edge. This suggests that this layer is not actually as well trained as it could be, despite this being a well-known, albeit older, open-source, production-quality DNN CV model. How can we tell? First, notice that the original ESD (green) looks almost identical to the randomized ESD here, except for a small shelf of density on the far right. Why does this happen? The orange vertical line depicts the maximum eigenvalue of the randomized ESD, which extends far to the right of the bulk edge (red).
This line marks the so-called Correlation Trap. It tends to trap the true correlations around it, distorting
the original ESD (green) and causing that odd shelf to form. The best-trained models do not have such traps.
These traps arise with unusually large learning rates or other sub-optimal hyperparameter settings. They can be removed using the weightwatcher SVDSharpness Transform, during and/or after training.
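As a sketch, the randomized-ESD comparison uses documented analyze() options, while the exact SVDSharpness call signature below is an assumption to check against your installed weightwatcher version:

    import weightwatcher as ww
    import torchvision.models as models

    model = models.vgg19(pretrained=True)
    watcher = ww.WeightWatcher(model=model)

    # randomize=True compares each layer ESD to its randomized counterpart
    # (adding columns such as rand_distance); plot=True saves the ESD plots
    details = watcher.analyze(randomize=True, plot=True)

    # remove correlation traps by clipping spurious large eigenvalues
    # (assumed call signature; check your weightwatcher version)
    sharpened_model = watcher.SVDSharpness(model=model)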
This represents one of several kinds of problems that can be detected using the weightwatcher tool.
Research and Publications
Presentations and Podcasts
The WeightWatcher Project: ICCF24
The science behind it and its applications for Climate Change
Silicon Valley ACM: April 2022
Technical usage and features for practitioners