Writing an LLM from scratch, part 32c -- Interventions: removing dropout


This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something!

This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better?

Background

In a blog post last summer about architectural advances in LLMs since GPT-2, Sebastian Raschka wrote:

Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended).

I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting.

That makes quite a lot of sense. My own understanding of dropout was that it was a bit broader than just preventing overfitting -- it seemed to me to be similar to the mandatory vacation policies that financial firms use to prevent over-dependence on individuals. My instinct was that having knowledge distributed across different weights in the model was good in and of itself, even beyond its benefit for multi-epoch training.

But dropout comes at quite a high price. With the training parameters we've been using, we're literally discarding 10% of our calculations' results -- attention weights, feed-forward neuron activations, and so on -- as we do the forward pass. It's easy to see why that might hurt training.
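To make that concrete, here's a minimal NumPy sketch of the standard "inverted dropout" trick -- this is illustrative, not the actual training code, and the function name and shapes are made up:

```python
import numpy as np

def dropout(x, p, rng):
    """Inverted dropout: zero each activation with probability p during
    training, scaling the survivors by 1/(1 - p) so that the expected
    value of each activation is unchanged."""
    if p == 0.0:
        return x  # a drop rate of zero is a no-op, as in this run
    mask = rng.random(x.shape) >= p  # True = keep, with probability 1 - p
    return np.where(mask, x / (1.0 - p), 0.0)

rng = np.random.default_rng(0)
acts = np.ones((4, 4))
print(dropout(acts, 0.1, rng))  # ~10% of entries zeroed, the rest 1/0.9
```

With p=0.1, roughly one in ten of those carefully-computed activations gets thrown away on every forward pass.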

Let's give it a go.

The training run

The nice thing about this one is that, unlike the gradient clipping experiment, I didn't have to write any new code. The dropout level was already controlled by a setting in the model.json file, so by setting that to zero for this run, I could just kick it off and let it do its thing while I worked on something else.
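For the record, the change was just a one-field edit along these lines -- note that the key names here are my sketch of the config schema (following the conventions in Raschka's book), not a quote from my actual model.json:

```json
{
  "emb_dim": 768,
  "n_layers": 12,
  "n_heads": 12,
  "drop_rate": 0.0
}
```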

Here's what the training run chart looked like (please disregard the stuff about grad norms in the title and the axis -- I'll remove that for the next train):

Training chart for zero-dropout run

As you can see, we still have loss spikes, including one just after global step 20,000 that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping might have helped with that, but I'm very deliberately testing each intervention in isolation.

At the end of the training run, we got this:

Training complete in 11,376.067 seconds
Tokens seen: 3,260,252,160
Throughput: 286,589 tokens/second
Final train loss: 3.621

So, interestingly, it took 967 seconds -- about 16 minutes -- less time than the gradient clipping run, and about 15 minutes less than the baseline train. So while gradient clipping added a small amount of time (or maybe that was just noise), dropping dropout certainly seems to speed things up! I guess there's quite a lot of work involved in generating and applying the random masks that drop things out during the forward pass.
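As a quick sanity check, the throughput line in the log is just tokens seen divided by wall-clock time:

```python
tokens_seen = 3_260_252_160
seconds = 11_376.067
print(f"{tokens_seen / seconds:,.0f} tokens/second")  # → 286,589 tokens/second
```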

Anyway, with the model trained, it was time to download it, upload it to Hugging Face Hub, and run the evals.

Evals

First, the smoke test, where the model just needs to continue the sequence "Every effort moves you" -- it came up with something reasonably coherent:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-remove-dropout/model.json runs/8xa100m40-remove-dropout/checkpoints/best/model.safetensors
Every effort moves you to make the world a better place.
As an international student of the arts in the UK,

...but it was on the loss against the test set that it was most impressive:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets/ runs/8xa100m40-remove-dropout/model.json runs/8xa100m40-remove-dropout/checkpoints/best/model.safetensors
Fetching 4 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1086.75it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [04:54<00:00, 10.87it/s]
Loss against our test dataset: 3.641

That's an improvement of 0.051 on the baseline train's 3.692 -- more than three times what we got from gradient clipping!

Let's start keeping a table of these:

| Run | Test set loss | Improvement vs baseline |
| --- | --- | --- |
| 8xa100m40-baseline | 3.692 | -- |
| 8xa100m40-gradient-clipping | 3.678 | 0.014 |
| 8xa100m40-remove-dropout | 3.641 | 0.051 |
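The improvement column is just the baseline loss minus each run's loss -- nothing clever, but here's the arithmetic for the record:

```python
baseline = 3.692
runs = {
    "8xa100m40-gradient-clipping": 3.678,
    "8xa100m40-remove-dropout": 3.641,
}
for name, loss in runs.items():
    print(f"{name}: {baseline - loss:.3f}")
# 8xa100m40-gradient-clipping: 0.014
# 8xa100m40-remove-dropout: 0.051
```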

Now, of course, we don't know how these different interventions combine -- it would be naive to think that if we did both gradient clipping and dropout removal, we'd get a total loss reduction of 0.014 + 0.051 -- but, especially given that long-lived loss spike in this training run, it does feel like they might play well together.

Wrapping up

So, that's dropout covered. Which one next? I think a nice easy one that I should be able to get done on a Friday will be adding bias to the attention weight calculations. Let's give that a go and see if it makes things worse or better!

Stay tuned...

Here's a link to the next post in this series.