Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention projection layers.

In the code from the book, we have this:

class MultiHeadAttention(nn.Module):

    def __init__(
        self,
        d_in, d_out,
        context_length,
        dropout,
        num_heads,
        qkv_bias=False
    ):
        ...

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

        ...

    def forward(self, x):
        ...

        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

So: we initialise the weights W_q, W_k and W_v as linear layers rather than simple matrices of weights, and have a parameter qkv_bias to say whether or not we should add bias to those. In all of our trains so far we've set that to False.
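
To make that concrete, here's a quick illustrative sketch (not from the training code) of what the flag changes for a single projection at GPT-2 small's embedding size of 768:

import torch.nn as nn

d_in = d_out = 768  # emb_dim for GPT-2 small

# With qkv_bias=False, each projection is just a 768x768 weight matrix...
proj_no_bias = nn.Linear(d_in, d_out, bias=False)
print(sum(p.numel() for p in proj_no_bias.parameters()))  # 589,824

# ...and with qkv_bias=True it gains an extra 768-element bias vector.
proj_with_bias = nn.Linear(d_in, d_out, bias=True)
print(sum(p.numel() for p in proj_with_bias.parameters()))  # 590,592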

Why do we have this parameter, and where did it come from?

The background

In Raschka's book, the use of nn.Linear for these weights is introduced in section 3.4.2 with the wording:

We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.

So, it's presented essentially as a way of getting better initial weights for our untrained model, which makes good sense in and of itself -- but if that were the only reason, why not just hard-wire it to bias=False? Clearly there's more to it than that.
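
As an aside, the initialisation difference the book is talking about is easy to see for yourself. This is just an illustrative sketch, not code from the book: nn.Linear draws its weights from a fan-in-scaled Kaiming-uniform distribution, while torch.rand gives uniform values in [0, 1):

import torch
import torch.nn as nn

d_in = d_out = 768

# The manual approach from SelfAttention_v1: uniform values in [0, 1).
w_manual = nn.Parameter(torch.rand(d_in, d_out))

# nn.Linear's default initialisation, scaled down based on the layer's fan-in.
w_linear = nn.Linear(d_in, d_out, bias=False).weight

print(w_manual.mean().item(), w_manual.std().item())  # roughly 0.5 and 0.29
print(w_linear.mean().item(), w_linear.std().item())  # roughly 0.0 and 0.02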

Section 4.1 has a bit more information:

qkv_bias determines whether to include a bias vector in the Linear layers of the multi-head attention ... We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.

That "chapter 6" looks like a typo, as the real explanation is in chapter 5, section 5 (page 164 in my copy), where we do indeed load the OpenAI weights:

OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary.

So, that all makes sense so far. QKV bias was part of the original GPT-2 models, perhaps just because it was standard at the time, inherited from something else, or perhaps for some other reason -- I can't find any reference to it in the actual paper. But people have found it doesn't help, so no-one uses it these days.
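
You can see those bias terms in the released checkpoints, by the way -- for example, with Hugging Face's transformers library installed (not part of my training setup, just a quick check):

from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

# GPT-2 fuses Q, K and V into a single projection, c_attn, so each transformer
# block carries one bias vector with 3 * 768 = 2,304 elements.
print(model.h[0].attn.c_attn.bias.shape)  # torch.Size([2304])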

But... might an LLM of this specific size, or one otherwise similar to the GPT-2 small model we're training, benefit in some way from having bias?

That's what this experiment is for :-)

Parameters

One thing that occurred to me while setting this up is that we have been training on a Chinchilla-optimal number of tokens, 20x the number of parameters. Without QKV bias, we have 163,009,536 parameters, so the target is 3,260,190,720 tokens; rounded up to a whole number of batches, that comes to 3,260,252,160 in our current setup for these experiments (per-GPU micro-batches of 12 sequences across 8 GPUs, so each batch is 96 sequences of 1,024 tokens -- 98,304 tokens in total).

These extra bias terms will be parameters, though! We're essentially making our model larger by adding them, which changes the Chinchilla calculation. How much?

In [1]: params = {
   ...:     "vocab_size": 50257,
   ...:     "context_length": 1024,
   ...:     "emb_dim": 768,
   ...:     "n_heads": 12,
   ...:     "n_layers": 12,
   ...:     "drop_rate": 0.1,
   ...:     "qkv_bias": True
   ...: }

In [2]: from gpt import GPTModel

In [3]: model = GPTModel(params)

In [4]: sum(p.numel() for p in model.parameters())
Out[4]: 163037184

OK, that's essentially nothing -- 27,648 extra parameters (three 768-element bias vectors in each of the 12 transformer blocks) on top of 163 million. I make it less than two hundredths of a percentage point larger! The correct number of tokens goes up to 3,260,743,680, so if we wanted to be very pedantic, we're under-training. But I feel like training on a larger dataset would be worse in terms of comparability between the baseline and our "intervened-on" model with QKV bias.
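
For anyone who wants to check the arithmetic, here's a quick sanity-check sketch using the figures above (nothing from the training code itself):

params_no_bias = 163_009_536
params_with_bias = 163_037_184

print(params_with_bias - params_no_bias)     # 27,648 extra bias parameters
print(100 * 27_648 / params_no_bias)         # ~0.017% -- under two hundredths of a percent

# Tokens per batch: micro-batches of 12 sequences x 8 GPUs x 1,024-token context
tokens_per_batch = 12 * 8 * 1024             # 98,304

chinchilla_target = 20 * params_no_bias      # 3,260,190,720
batches = -(-chinchilla_target // tokens_per_batch)  # round up to a whole batch
print(batches * tokens_per_batch)            # 3,260,252,160 -- what we actually train on
print(20 * params_with_bias)                 # 3,260,743,680 -- the "correct" number with bias included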

So: we'll train a model with QKV bias on 3,260,252,160 tokens, accepting that it's a tiny bit less than Chinchilla-optimal. Let's see how it goes!

The run

Here's the model.json config file for this train. Running it gives this training chart:

Training run with QKV bias

Pretty standard, though the loss spikes look less prominent than they have been in the other trains. Might QKV bias actually help with model stability in some way...?

The train finished with these stats:

Training complete in 12,329.557 seconds
Tokens seen: 3,260,252,160
Throughput: 264,426 tokens/second
Final train loss: 3.719

Timing-wise, pretty much indistinguishable from the baseline train's 12,243.523 seconds. The final train loss looks a tad better, but we can't rely on that -- the test set loss is the important one.

So it was time to download it, upload it to Hugging Face Hub, and then on to the evals.

Evals

Firstly, our normal smoke test: how does the model continue "Every effort moves you"?

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-qkv-bias/model.json runs/8xa100m40-qkv-bias/checkpoints/best/model.safetensors
Every effort moves you toward success. The right questions are asked to become your business coach and help shape the future of their

Not bad at all, borderline coherent! Next, the loss on the test set:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets runs/8xa100m40-qkv-bias/model.json runs/8xa100m40-qkv-bias/checkpoints/best/model.safetensors
Fetching 4 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1701.54it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [04:52<00:00, 10.95it/s]
Loss against our test dataset: 3.669

Well, crap! Now that's a surprise. Let's look at it in the context of the other interventions to see just how surprising it is, given Raschka's comments (which were undoubtedly backed up by serious research):

Run                           Test set loss   Improvement vs baseline
8xa100m40-baseline            3.692           -
8xa100m40-gradient-clipping   3.678           0.014
8xa100m40-qkv-bias            3.669           0.023
8xa100m40-remove-dropout      3.641           0.051

So, adding QKV bias actually improved our test set loss by more than gradient clipping did!

The loss spikes in the training chart look smaller than in the other trains, so, speculating wildly, perhaps with a model of this size, the bias stabilises things somehow? Or perhaps what we're seeing is the model becoming that tiny bit smarter because it has some extra parameters -- albeit less than 0.02 percent more?

I'm not going to spend time investigating things now, but this is a really interesting result. One extra thing that does occur to me is that the direction research has taken since GPT-2 has definitely been towards larger models. The attention weight matrices are sized d_emb × d_emb, so excluding bias they have d_emb² weights each. Bias adds on another d_emb per matrix. So, as a model scales up, the attention-related non-bias weights will scale quadratically -- doubling d_emb quadruples their number -- while the bias weights will scale linearly.
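
Here's a back-of-the-envelope illustration of how quickly that ratio shrinks; the sizes below are the standard published GPT-2/GPT-3 configurations, not models I've trained:

# QKV bias parameters as a fraction of QKV weight parameters, per model.
# The ratio is just 1 / d_emb, so it shrinks as models get wider.
for name, d_emb, n_layers in [
    ("GPT-2 small", 768, 12),
    ("GPT-2 XL", 1600, 48),
    ("GPT-3", 12288, 96),
]:
    weights = 3 * d_emb * d_emb * n_layers   # the Q, K and V weight matrices
    biases = 3 * d_emb * n_layers            # their bias vectors
    print(f"{name}: bias is {100 * biases / weights:.3f}% of the QKV weights")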

So perhaps it's just that the effect -- whatever causes it -- gets rapidly swamped as you scale out of toy-model territory. That, at least, seems pretty plausible.

One final note to self, though: these improvements are small enough that I do find myself wondering whether they might be some kind of noise, despite the random seeds I'm setting:

    seed = 42
    random.seed(seed)                 # Python's built-in random module
    torch.manual_seed(seed)           # PyTorch's RNG
    torch.cuda.manual_seed_all(seed)  # RNGs on all GPUs
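
(As an aside: seeding alone doesn't pin down every source of randomness on GPUs -- cuDNN's algorithm auto-tuning and a few CUDA kernels are non-deterministic by default. I'm not using these in the training code, but PyTorch's standard knobs for that look something like this:)

import torch

# Prefer deterministic kernel implementations where they exist; with
# warn_only=True, ops without one just warn instead of raising.
torch.use_deterministic_algorithms(True, warn_only=True)

# Stop cuDNN from benchmarking and picking potentially non-deterministic algorithms.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True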

I think that at the end of this, before I do a final train, it would be worth doing another baseline train, measuring the test set loss again, and comparing. If it comes out exactly the same -- and I can bump up the number of significant figures in the output, it's just a formatting parameter -- then I don't need to worry. But if they vary to some degree, perhaps I'll need to update my mental model of what level of finding is significant, and what isn't.

Summing up

I think it goes without saying that QKV bias definitely goes onto the list of interventions we want to add when training our best-possible GPT-2 small-scale model, assuming that the random seed test goes well. That surprises me a bit -- I was expecting it to have negligible impact! That, of course, is why it's worth doing these tests.

Next up, I think, is trying to understand how we can tweak the learning rate and its associated parameters like weight decay. This will need a bit of a deep dive, so you can expect the next post late next week, or perhaps even later. I'm sure you can't wait ;-)