
weight | watcher
WeightWatcher (w|w) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, drawing on the earlier phenomenological theory of Heavy-Tailed Self-Regularization (HTSR), published in JMLR, Nature Communications, etc., and the more recent SemiEmpirical Theory of (Deep) Learning (SETOL), presented at NeurIPS 2023.
WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).
Easy to use
WeightWatcher requires just a few lines of code to generate layer-by-layer diagnostics for your Deep Learning models.
It supports most TensorFlow/Keras, PyTorch, and HuggingFace CV and NLP/Transformer models (Dense and Conv2D layers).
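For example, here is a minimal sketch of the basic workflow on a pretrained torchvision model (any supported Keras, PyTorch, or HuggingFace model works the same way):

    import weightwatcher as ww
    import torchvision.models as models

    # load any supported model; here, a pretrained VGG from torchvision
    model = models.vgg19_bn(pretrained=True)

    # analyze the model, layer by layer
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()   # pandas dataframe: one row per layer

    # aggregate the layer metrics into model-level quality metrics
    summary = watcher.get_summary(details)
    print(summary)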

What can weightwatcher do for you?
Here are just a few examples of what you can do with the tool:
Identify poorly trained layers in your models
Weightwatcher computes unique layer-by-layer quality metrics
The weightwatcher HTSR theory
tells us if and when a specific DNN layer has converged properly; it is unique in this regard,
being the only theory capable of this. When you run watcher.analyze(), you obtain a pandas dataframe containing
several layer quality metrics. In particular, the weightwatcher alpha metric can tell us whether a layer is well trained or not.
Specifically, the layer quality metric alpha should lie between 2 and 6.
Here we have run weightwatcher on two of the recently popular Bloom models.
We plot a histogram of the layer alpha values.
Notice that both models have several layers with alpha > 6; this is not great.
If you see such layers in your model, you may need to decrease the size of the layers or add data, and/or do a better job of optimizing your hyperparameters.
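As a minimal sketch, continuing from the basic example above (and assuming the column names of recent weightwatcher versions), you can plot the histogram and flag the suspect layers directly:

    import matplotlib.pyplot as plt

    # histogram of the layer alpha values
    details.alpha.plot.hist(bins=50)
    plt.axvline(x=6.0, color='red', label='alpha = 6')
    plt.xlabel('alpha')
    plt.legend()
    plt.show()

    # list the potentially under-trained layers (alpha > 6)
    print(details[details.alpha > 6][['layer_id', 'name', 'alpha']])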
Help you select the best pretrained model
You can compare models by looking at their layer alphas
The best models have layer alphas that lie between 2 and 6. This can be seen by comparing the layer alphas for
BERT and
XLNet.
The WeightWatcher Power-Law (PL) metric alpha is a DNN model quality metric; smaller is better.
This plot displays all the layer alpha values for these 2 popular models.
It is immediately clear that the XLNet layers look much better than the BERT layers:
the alpha values are smaller on average, and there are no alphas larger than 5 (alpha <= 5).
In contrast, the BERT alphas are larger on average, and BERT has too many large alphas.
This is totally consistent with the published results: in the original XLNet paper (from CMU and Google Brain),
XLNet outperforms BERT on 20 different NLP tasks.
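A sketch of such a comparison, assuming weightwatcher accepts HuggingFace transformers models directly and using the standard model names from the HuggingFace hub:

    import weightwatcher as ww
    from transformers import AutoModel

    # compare the layer alpha distributions of two pretrained models
    for name in ['bert-base-uncased', 'xlnet-base-cased']:
        model = AutoModel.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name,
              'mean alpha:', details.alpha.mean(),
              'layers with alpha > 5:', (details.alpha > 5).sum())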
Predict (trends in) test accuracies -- without training or test data
The average layer quality metrics are correlated with test accuracy
Weightwatcher provides several different layer quality metrics, such as alpha, alpha-hat, etc.
From these, we can make a model quality metric by simply taking a layer average.
One particularly useful model metric is the average alpha-hat, which is a weighted average
of the weightwatcher alpha layer quality metric.
The weightwatcher average alpha-hat metric is correlated with the reported test accuracies
for many production Computer Vision (CV) models like the VGG series, the ResNet series, etc.
The weightwatcher Nature paper
shows that the average alpha-hat metric is remarkably well
correlated with test accuracies for over 100 different CV models. Here, we show how the
average alpha-hat metric tracks the reported top 1 (and top 5) test accuracies for the
open-source VGG models.
And, again, this does NOT require access to the test or even the training data!
You can reproduce this yourself using this Notebook.
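Here is a sketch of this comparison for the torchvision VGG series; we assume the alpha-hat metric is reported as alpha_weighted in the summary, as in recent weightwatcher versions:

    import weightwatcher as ww
    import torchvision.models as models

    # smaller average alpha-hat should track higher reported test accuracy
    for name, ctor in [('vgg11', models.vgg11), ('vgg13', models.vgg13),
                       ('vgg16', models.vgg16), ('vgg19', models.vgg19)]:
        watcher = ww.WeightWatcher(model=ctor(pretrained=True))
        summary = watcher.get_summary(watcher.analyze())
        print(name, 'average alpha-hat:', summary['alpha_weighted'])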

Evaluate the information flow in different architectures
Correlation Flow shows how well information flows from data to labels
Deep learning models learn the patterns in the training data and encode them in the correlations in the layer weight matrices.
In doing so, information must flow from the data (the first layer) to the labels (the last layer).
We can visualize this flow using the weightwatcher tool by plotting the layer id against layer quality metric alpha:
we call this plot the Correlation Flow.
Here, we see how the Correlation Flow looks in the VGG architectures. In our Nature paper, we examine other architectures, such as the ResNet and DenseNet series.
The best models have good Correlation Flow, with stable alphas. In models with less optimal architectures, the layer alphas
may increase with layer id, as with the VGG models, and may even behave more erratically.
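A minimal sketch of a Correlation Flow plot for a single model, again using the details dataframe:

    import matplotlib.pyplot as plt
    import weightwatcher as ww
    import torchvision.models as models

    details = ww.WeightWatcher(model=models.vgg19(pretrained=True)).analyze()

    # Correlation Flow: the layer quality metric alpha vs. network depth
    plt.plot(details.layer_id, details.alpha, marker='o')
    plt.xlabel('layer id')
    plt.ylabel('alpha')
    plt.title('Correlation Flow')
    plt.show()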

Find (and remove) training anomalies
Correlation Traps can arise with bad learning rates, etc.
If we randomize the layer weight matrix, W, we can compute the randomized ESD (red) and compare this to the original trained ESD (green). When the layer is well correlated, and captures information well, the randomized ESD (red) will look very different from the original ESD (green). If the layer matrix W is well trained, its ESD will appear heavy tailed, even on a log scale. Moreover, when W is randomized, the randomized ESD will look like a near-perfect semi-circle with very crisp edges. We see both here for VGG FC3 (first / top plot).
In contrast, for VGG FC1, the
original and randomized ESDs do not differ much, except at the right edge. This suggests that this layer is not actually as well trained as it could be, despite this being a well-known, albeit older, open-source, production-quality DNN CV model. How can we tell? First, notice that the original ESD (green) looks almost identical to the randomized ESD here, except for a small shelf of density on the far right. Why does this happen? The orange vertical line depicts the maximum eigenvalue of the randomized ESD, which extends far to the right of the bulk edge (red).
This line marks the so-called Correlation Trap. It tends to trap the true correlations around it, distorting
the original ESD (green) and causing that odd shelf to form. The best-trained models do not have such traps.
These traps arise with unusually large learning rates or other sub-optimal hyperparameter settings. They can be removed using the weightwatcher SVDSharpness Transform, during and/or after training.
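As a sketch, the randomized-ESD comparison uses documented analyze() options, while the exact SVDSharpness call signature below is an assumption to check against your installed weightwatcher version:

    import weightwatcher as ww
    import torchvision.models as models

    model = models.vgg19(pretrained=True)
    watcher = ww.WeightWatcher(model=model)

    # randomize=True compares each layer ESD to its randomized counterpart
    # (adding columns such as rand_distance); plot=True saves the ESD plots
    details = watcher.analyze(randomize=True, plot=True)

    # remove correlation traps by clipping spurious large eigenvalues
    # (assumed call signature; check your weightwatcher version)
    sharpened_model = watcher.SVDSharpness(model=model)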
This represents one of several kinds of problems that can be detected using the weightwatcher tool.
Research and Publications
Presentations and Podcasts
The WeightWatcher Project: ICCF24
The science behind it and its applications for Climate Change
Silicon Valley ACM: April 2022
Technical usage and features for practitioners