Writing an LLM from scratch, part 32e -- Interventions: the learning rate


I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In my training code, I have this code to create the optimiser:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

The values in there -- 0.0004 for the learning rate, and 0.1 for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book.

What do those values actually mean, and are those really the right values for them?

I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before.

The weight decay was pretty much an unknown for me -- I knew it was a parameter controlling the behaviour of the optimiser, but not how it did that.

In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later.

The learning rate: a refresher

If you're reading this blog, you almost certainly know what the learning rate is, but let's go over it briefly to build a solid foundation.

The way it's normally explained, using simple gradient descent, goes something like this. Let's assume that we're training a model with just one parameter, and it starts off set to -5. We run some training data through, and get a loss, let's say 44.44:

First loss point

We don't know what shape our loss curve is (if we did, we might be able to find the lowest loss algebraically), but we do know the gradient of the loss with respect to the parameter at the point we've measured; it happens to be -13. That is reasonably large and negative:

First loss point with gradient

We use that information to say that we want to move in the direction of a larger value for our parameter -- that is, in our case the gradient is negative, so we have a downhill slope towards the right, and we want to increase the parameter to move rightwards on that chart; if it were positive (an uphill slope) we'd want to decrease the parameter to move leftwards.

Simply subtracting the gradient from the parameter would lead to an update in the right direction, but it would be a very large one in this case -- we'd move 13 units to the right -- so we multiply the gradient by a small positive number, the learning rate (often written as a lower-case eta, like this: η), to move a small distance in that direction. Let's say η=0.3. That means we want to update our parameter:

p_new = p_old − (gradient × η)
      = −5 − (−13 × 0.3)
      = −5 + 3.9
      = −1.1

So now we run that through and get a new loss -- let's say it's 9.06 -- and a new gradient, which happens to be -5.2.

Two loss points with gradients

Now we can do another update, and our parameter will become 0.46, so we use that and work out another loss and gradient, which come to 3.3816 and -2.08.

Let's plot that one, but this time we'll draw back the veil and show the actual loss curve.

Three loss points with gradients, showing loss function

Now, it's worth reiterating that while we're training this model we don't know what that curve looks like -- we're just finding points on it, along with its gradient at those points, and using that information to work out which parameter value to explore next.

But it's pretty clear that as we continue, if the learning rate is the right kind of size, we'll get to the minimum eventually, because -- due to the nice smooth U-shape of the curve -- the gradient gets smaller the closer we get to the minimum.

It's also pretty clear that if the learning rate is smaller than an optimal value, in this simple case we will still find the right point, but it will take more steps because each one is smaller:

Slow learning with a low learning rate

And, of course, if the learning rate is too high, we might never converge -- we'd "bounce out of" the dip, with each update overshooting further than the last, the parameter swinging between values ever further either side of the minimum and zooming off to infinity:

No convergence with a too-high learning rate
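
The whole loop is only a few lines of Python. This sketch uses a made-up quadratic loss with its minimum at 1.5, chosen so that the gradient at -5 comes out as -13, matching the walkthrough above:

```python
def train(p, lr, steps, minimum=1.5):
    """Plain gradient descent on the toy loss (p - minimum) ** 2."""
    for _ in range(steps):
        grad = 2 * (p - minimum)  # analytic gradient of the quadratic
        p = p - lr * grad         # the update rule described above
    return p

# With eta = 0.3 we converge on the minimum...
print(train(-5.0, 0.3, 50))   # very close to 1.5
# ...while a too-high learning rate bounces out of the dip and diverges
print(train(-5.0, 1.2, 50))   # an absurdly large number
```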

OK, that's the basics. Why might we want to change from something that seems so logical and simple?

When a fixed learning rate fails

A few paragraphs back I said:

due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum

What if it doesn't? Imagine if we had something more like a V-shaped curve, like this:

A V-shaped loss curve

The gradient does not decrease as we get closer to the minimum, and so while we're in the downward-sloping part, each update is exactly the same distance:

Two updates on a V-shaped learning curve

Now, eventually we'll jump over the minimum:

Crossing the minimum on a V-shaped learning curve

In this example, I've used a gradient of -8.33 on the downward-sloping part of the curve, and +8.33 on the upward-sloping part, so our next update just bounces us back to where we were before!

Bounce-back on a V-shaped learning curve

Because the gradient isn't decreasing the closer we get to the minimum, we wind up just oscillating around it. That's not very helpful.
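
A tiny simulation shows the effect; the V-shaped loss here is a stand-in, slope * |p|, whose gradient always has the same magnitude and only changes sign:

```python
def v_grad(p, slope=8.33):
    # gradient of slope * |p|: constant size, only the sign flips at the minimum
    return slope if p > 0 else -slope

lr, p = 0.3, -5.0
history = []
for _ in range(40):
    p = p - lr * v_grad(p)  # every update moves exactly lr * slope units
    history.append(round(p, 4))

# once we cross the minimum at zero, we bounce between the same two points forever
```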

That's a slightly contrived example (though not entirely -- intuitively, with functions like ReLU or GELU in our real LLMs, it's easy to imagine crazy loss landscapes). But it does show that perhaps we might want to add in our own "artificial" way to decrease the size of the steps we take over the course of training our model rather than just relying on the gradients naturally flattening out for us.

Another way of looking at things is that as the model gets trained, we don't want batches of very new-looking data to cause big updates, taking us away from what was a good part of the loss landscape in terms of what we've seen so far. For example, imagine you've been training an LLM on a bunch of documents, which have so far been in English. Halfway through, it encounters a document in Byzantine Greek, the loss skyrockets, and you do a big update. That would be a problem! You might want it to learn a bit from it to push it slightly in a "the world is multi-lingual" direction, but you don't want it to lose a big chunk of the value from its previous training.

You might also see a kind of connection to the way that people learn over the course of their lives -- for babies, everything is new and they "update their parameters" constantly as they try to understand the world. Children are still pretty flexible, but as we get older we tend to update our beliefs less and less. That's not always optimal, but as a heuristic it's pretty adaptive.

Anyway, in general: for most training runs, we're going to want the learning rate to adjust over time. Most of the time this will be by reducing it, though there can be cases for increasing it again for periods. The general case of doing this is called "learning rate scheduling".

Learning rate scheduling

There are a bunch of ways that people adjust the learning rate over the course of a train; here are a few that cropped up a lot while I was researching this.

Step decay

If we want the learning rate to go down over time, and we know how many steps we're training for, we can just set it to (say) 0.0004 for the first quarter of our train, then 0.0002 for the next, then 0.0001, then finish off with 0.00005, like this:

LR step decay
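
PyTorch has a built-in scheduler for this pattern, StepLR, which multiplies the learning rate by gamma every step_size steps. A minimal sketch -- the one-element parameter is just there to give the optimiser something to hold:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.0004, weight_decay=0.1)
# halve the learning rate every 100 steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

lrs = []
for _ in range(350):
    optimizer.step()   # a real loop would compute a loss and backprop first
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
# the rate has now halved three times: 0.0004 -> 0.0002 -> 0.0001 -> 0.00005
```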

That can work pretty well! But there is one obvious oddity -- the big step changes in learning rate mean that the exact placement of the drops matters. Why are we treating the data, and the state of the model, immediately before a drop so differently from immediately after it? It would make more sense to have a smoother schedule.

What functions decay smoothly like that?

Exponential decay

An exponential curve does: let's say we just multiply the learning rate by a number that is a little smaller than one every step, so that it drops smoothly like this:

LR exponential decay
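
In PyTorch this is ExponentialLR; you give it the per-step multiplier gamma, which you can derive from where you want the rate to end up. A sketch, using a 10x total decay as an example:

```python
import torch

initial_lr, final_lr, total_steps = 0.0004, 0.00004, 10_000
# pick gamma so that initial_lr * gamma ** total_steps == final_lr
gamma = (final_lr / initial_lr) ** (1 / total_steps)  # just under 1.0

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=initial_lr)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for _ in range(total_steps):
    optimizer.step()
    scheduler.step()
# the learning rate has now decayed smoothly down to final_lr
```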

But there are lots of other curves like that, and one is particularly interesting:

Cosine decay

As you change θ from 0 to π, the value of cosθ goes smoothly from 1 to -1, so it's easy enough to rescale that so that our learning rate follows the same curve:

LR cosine decay

This is called a "cosine annealing" or "cosine decay" schedule, and was apparently inspired by the algorithms used for simulated annealing (an optimisation algorithm that was in turn inspired by how the atomic structures form in metals as they cool -- another one for the list of things to look into in the future...)

That solves the mystery from earlier: the cosine that the Chinchilla paper was talking about was exactly this. As it turns out, the cosine decay scheduling curve is quite popular in deep learning, because it has what amounts to two well-defined phases -- an initial high learning rate where lots of exploration of the loss landscape can happen, followed by a smooth transition to something more like fine-tuning to optimise the location in whatever part of the loss landscape we've wound up in.
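
Written out, the rescaling is simple: (1 + cos(πt/T)) / 2 runs smoothly from 1 down to 0, and we use it to interpolate between the peak and minimum rates. A sketch as a pure function:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 down to lr_min at step total_steps."""
    progress = step / total_steps                  # 0.0 -> 1.0 over the schedule
    scale = 0.5 * (1 + math.cos(math.pi * progress))  # 1.0 -> 0.0, cosine-shaped
    return lr_min + (lr_max - lr_min) * scale
```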

Cyclical schedules

Now, all of the above are assuming that we want the learning rate to start high and finish low, so that we can mimic the textbook gradient descent that we had at the start of this post.

Intuitively that feels nice, but on further thought, the important thing is really that we have a low learning rate at the end of the train, so that we can find as close a point as possible for the minimum at the part of the loss landscape we've found ourselves in.

But perhaps there's a case for having both high and low periods during the train, so that we don't get stuck in a local minimum -- something to jolt us out of where we were every now and then?

With a step function, that's easy: you could, for example, do this:

LR cyclic step

With an exponential, you could do something like this:

LR cyclic exp

With cosine decay, of course, things are even easier, because the cosine function is inherently cyclical, so we can just do this:

LR cyclic cosine

However, at least for our purposes, training an LLM using a Chinchilla-optimal number of training tokens, it makes sense to be guided by what the authors of the Chinchilla paper did. Appendix B says:

We find that setting the cosine cycle length too much longer than the target number of training steps results in sub-optimally trained models, as shown in Figure A1. As a result, we assume that an optimally trained model will have the cosine cycle length correctly calibrated to the maximum number of steps, given the FLOP budget; we follow this rule in our main analysis.

So, at this point, I think we have one important part of the intervention we want to make: we want to use a cosine learning rate scheduler, going from high near the start of the training run, down to low at the end over one cycle. Additionally, and also from appendix B in the paper:

we use a 10x learning rate decay in line with Rae et al. (2021)

...which means that if our learning rate starts at η, then we want it to decay down to η/10 by the end.

So, we just need to work out an initial value for η, and let it rip, right?

Well, not so fast...

Learning rate warmup

When our model is uninitialised, right at the start of the train, gradients are going to be pretty wild. It's going to be making random errors all of the time, and we'll be making huge jumps across the loss landscape. That sounds bad.

Additionally, those kinds of wild jumps can get the optimiser into a -- well, sub-optimal -- state. I haven't read enough about optimisers yet to have a solid handle on why, but that can wait -- intuitively it makes some kind of sense that erratic gradient updates might confuse it.

So, it makes a certain amount of sense to start off with a low learning rate so that we don't do that, and then to increase it gradually to the peak, and only then to schedule the gradual cosine decay. According to this (rather nice looking) masterclass on LLM training, it's typical to do this over "a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size", and we would just use a linear increase over that period:

η_t = η_peak × (t / warmup_steps)

I think we should do that; a simple linear warmup at the start -- let's relatively arbitrarily say 5% of our training steps going up to our desired peak learning rate. So our learning rate schedule should look something like this:

LR cosine after warmup
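
As a pure function, the whole schedule -- linear warmup to the peak, then one cosine half-cycle down to a tenth of it -- looks something like this sketch (the step counts are the illustrative ones used later in the post):

```python
import math

def scheduled_lr(step, peak_lr=0.0014, warmup_steps=1600, decay_steps=32000):
    if step < warmup_steps:
        # linear warmup: 0 at step 0, peak_lr at the end of warmup
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to peak_lr / 10
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    min_lr = peak_lr / 10
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```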

The initial learning rate value

So far I've written a lot about how we vary the learning rate over time, and that's all been very useful. But we still need to know what the value should be initially! In smaller-scale experiments you might just try a bunch of different numbers to see what worked well, but at more than US$30 per train, that's not practical here.

Unfortunately it's really quite hard to find good suggestions published anywhere. The GPT-2 paper is (as usual) reticent:

The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText

...and if you search for "learning rate training llm", you'll see lots of results for when people are fine-tuning existing LLMs (2×10⁻⁴ comes up a lot), but almost nothing about when you're training one from scratch.

I eventually came across this (long!) post from Hugging Face, which I definitely need to spend time going through in the future, because it covers a lot of the ground I've been going over in this post series. But for this post, I think the most relevant part is in the section "Scaling Laws for Hyperparameters", where they include a figure from this DeepSeek paper. Here it is, with some of the (also relevant) surrounding text:

DeepSeek hyperparameter optimisation

In our trains we're using something like 5×10¹⁸ total FLOPs. Now, they are specifically charting things in terms of non-embedding FLOPs, but I'm going to play a little fast and loose here and ignore that; reading off their chart, it looks like we should be using about 1.4×10⁻³ as our learning rate. We can double-check that against their formula, where C is the compute budget:

η_opt = 0.3118 × C^(−0.1250)
      = 0.3118 × (5×10¹⁸)^(−0.1250)
      = 0.3118 × 0.004598632978267703
      = 0.00143385376262387

Nice, a close match!

However, it's definitely worth noting that we're using a simple GPT-2 architecture, and they are using something quite different -- RMSNorm instead of LayerNorm, SwiGLU as the activation function on the feed-forward networks, Rotary Position Embedding rather than the fixed ones we're using, and so on.

As a sanity check: you can see that they also give a formula for the optimal batch size in terms of tokens. For our FLOP budget, that comes in at 381,782 tokens, which is about 373 of our 1,024-token sequences. That is quite a lot higher than the 97-or-so sequences that appeared to be optimal in our earlier experiments. That is a little concerning, though of course the 97 number came out of a very ad-hoc bit of curve-fitting.
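
Both fits are easy to sanity-check with a couple of lines of Python (the learning-rate coefficients are the ones quoted above; the batch-size coefficients are my reading of the DeepSeek figure, so treat them as approximate; C is the compute budget in FLOPs):

```python
C = 5e18  # our approximate total FLOP budget

eta_opt = 0.3118 * C ** -0.1250    # optimal learning rate fit
batch_opt = 0.2920 * C ** 0.3271   # optimal batch size fit, in tokens

print(eta_opt)           # about 0.00143
print(batch_opt / 1024)  # about 373 sequences of 1,024 tokens
```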

For now, I'm going to hope that that doesn't matter too much for the learning rate. This may come back to bite me; if the results of a train with 1.4×10⁻³ are radically worse than the existing rate of 4×10⁻⁴, I'll have to do a bit more investigation.

So, now I think we have all of the theoretical pieces in place to do a train. Let's move on to the practicalities.

The code

We started by looking at this:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

What should we change -- disregarding the weight_decay until the next post?

Based on the above, we want to do a linear warmup over about 5% of our steps, going up to a learning rate of 1.4×10⁻³, followed by a cosine decay down to one tenth of that, 1.4×10⁻⁴.

What does that look like in code?

Basic learning rate scheduling

The relevant API for scheduling the learning rate in PyTorch is, logically enough, in the torch.optim.lr_scheduler module, and there are a bunch of different scheduling classes. You create your optimiser, then create a scheduler for the shape you want, and then you can call step on the scheduler (after the step on the optimiser) to adjust the optimiser's learning rate over time.

Let's make that more concrete; one of the schedulers is LinearLR, which is what we'll need for our linear warmup period. It takes as its parameters:

  • optimizer, which is the optimiser we're applying it to.
  • start_factor, which the optimiser's learning rate is multiplied by to work out the value we start from.
  • end_factor, which is likewise applied to the optimiser's learning rate to work out the value we're heading for.
  • total_iters, which is the number of steps over which it should go from the initial learning rate to the final one.
  • last_epoch, which lets the scheduler know how many steps into its schedule it currently is -- this defaults to -1, meaning it hasn't started yet. This can be useful if you're resuming from a checkpoint, but for our purposes we can ignore it.

Let's say that we want to go from almost-zero to our optimiser's learning rate over 1,600 steps -- we'd create our scheduler like this:

    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer,
        start_factor=0.00001,
        end_factor=1.0,
        total_iters=1600
    )

...then in our training loop, after we've done the scaled step of the optimiser, we'd also step the scheduler:

        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

This confused me a little bit the first time I saw it; after all, if the scheduler hasn't been "triggered" when we step the optimiser, how does the optimiser know what learning rate to use? Surely it would just use whatever it was initialised with?

The answer is that when you create the optimiser, it stores away the learning rate that you give it in two places -- an "initial learning rate" and a "current learning rate". Next, when you create your scheduler, it uses the initial learning rate to work out the start and end values, and then sets the current one to the start value immediately. Just by creating a scheduler, you're changing the optimiser's current learning rate -- but not the initial one, which is important, as we'll see in a moment.
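
You can see that split directly: in PyTorch terms, the current rate lives in the optimiser's param_groups under "lr", and creating a scheduler stashes the original value under "initial_lr" before overwriting "lr". A quick demonstration:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.0014)
print(optimizer.param_groups[0]["lr"])  # 0.0014, as we asked for

scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.00001, end_factor=1.0, total_iters=1600
)
# merely creating the scheduler has changed the current learning rate...
print(optimizer.param_groups[0]["lr"])          # about 1.4e-08
# ...but the initial one is kept for other schedulers to read
print(optimizer.param_groups[0]["initial_lr"])  # 0.0014
```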

So, we have a scheduler that handles our warmup period nicely. Another scheduler that's relevant to our interests is the CosineAnnealingLR. This takes:

  • optimizer, which is the same as the LinearLR's.
  • T_max, which is the number of steps before it reaches its minimum
  • eta_min, the minimum learning rate we want to get to.
  • last_epoch, again the same as the LinearLR's.

On creation, this scheduler will read in the optimiser's initial learning rate -- note, not the current one -- and then the first time it's stepped, it will set the current learning rate to that value, and then for steps after that it will reduce it so that it follows a nice cosine decay, reaching eta_min after T_max steps.

So those two cover the two regimes that we want -- the warmup and then the cosine decay. But now we need to put them together; we want to do one and then the other.

Chaining learning rate schedulers

There's a very useful class, SequentialLR, which allows you to chain schedulers and tell it when each one takes over from the previous one.

Let's sketch out some code to use that to do a train with our new peak learning rate of 1.4×10⁻³, a warmup of 1,600 steps, followed by a cosine decay for the next 32,000 steps to one tenth of the peak learning rate:

    peak_lr = 0.0014
    warmup_period = 1600
    decay_period = 32000
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=peak_lr, weight_decay=0.1
    )
    warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer,
        start_factor=0.00001,
        end_factor=1.0,
        total_iters=warmup_period
    )
    cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=decay_period,
        eta_min=peak_lr / 10
    )
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer,
        schedulers=[warmup_scheduler, cosine_scheduler],
        milestones=[warmup_period],
    )

    ...

        # In the training loop
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

That actually works quite nicely! I wrote a dummy training loop to plot the current learning rate over a fake train using code like the above, and got this:

LR schedule from PyTorch code

...with the output confirming that the values were good at the "milestone" point, the start and the end:

Initial learning rate:  1.4000000000000001e-08
Step 1596 learning rate:  0.0013965000350000011
Step 1597 learning rate:  0.001397375026250001
Step 1598 learning rate:  0.0013982500175000012
Step 1599 learning rate:  0.0013991250087500011
Step 1600 learning rate:  0.0014
Step 1601 learning rate:  0.00139999999696394
Step 1602 learning rate:  0.0013999999878557604
Step 1603 learning rate:  0.0013999999726754609
Step 1604 learning rate:  0.0013999999514230418
Step 33596 learning rate:  0.00014000004857695852
Step 33597 learning rate:  0.00014000002732453932
Step 33598 learning rate:  0.00014000001214423973
Step 33599 learning rate:  0.00014000000303605992

I was initially a bit surprised by that, as at the time I ran it, I didn't realise that there was that split between the initial and the current learning rates on the optimiser, so I thought that the cosine scheduler would pick up whatever tiny starting value the warmup scheduler had overwritten the optimiser's learning rate with -- but that split saves the day.

That means that now we have the outline of how to schedule our learning rate. But before we can put that into the code, we need to think about how it affects our checkpoints.

Checkpoints

Just like the scaler and the optimiser, the learning rate scheduler -- or, indeed, our two schedulers here -- contains information about the state of the train. That means that if we recover from a checkpoint, we need to provide it with the information it needs. If we just created the schedulers afresh, they'd start from the beginning -- for example, if we restarted from step 20,000 in a train like the one above, we'd start a new warmup from pretty much zero, and then start a fresh cosine decay. That would be bad:

LR schedule with no scheduler checkpointing

(Dummy test code here.)

Now, we could use the last_epoch parameter to initialize them with the correct current global step. But they have a state dict, like most other PyTorch objects, so the simplest thing to do is just to write that to another checkpoint file:

torch.save(scheduler.state_dict(), checkpoint_dir / "scheduler.pt")

...and then load it likewise:

scheduler.load_state_dict(torch.load(checkpoint_dir / "scheduler.pt"))

(Dummy test code here.)

Conveniently, if you save the state dict of a SequentialLR, it will also include the state of all of its component schedulers, and likewise if you reload it, it will load the components' states back in too.

The one thing you have to be careful about is what they warn about in the PyTorch docs:

Initializing a scheduler overwrites its optimizer’s param_group["lr"]s. When restoring a checkpoint, initialize the scheduler before calling your optimizer's load_state_dict() to avoid overwriting the loaded learning rates.

Luckily enough, in our code as it stands, we create all of the things that are checkpointed -- the optimiser and the scaler so far, but shortly the scheduler as well -- before we load in the state dicts, so that drops out quite nicely.
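
Sketching out the full save/restore cycle (the helper function here is illustrative, not from the real training code), the key point is the ordering: build everything first, load the state dicts afterwards:

```python
import torch

def build_optimizer_and_scheduler(model, peak_lr=0.0014,
                                  warmup_period=1600, decay_period=32000):
    # same construction as in the sketch earlier in the post
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.00001, end_factor=1.0, total_iters=warmup_period)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=decay_period, eta_min=peak_lr / 10)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_period])
    return optimizer, scheduler

model = torch.nn.Linear(4, 4)  # stand-in for the real model
optimizer, scheduler = build_optimizer_and_scheduler(model)

for _ in range(2000):  # pretend we've trained for 2,000 steps...
    optimizer.step()
    scheduler.step()

# ...and checkpointed:
opt_state, sched_state = optimizer.state_dict(), scheduler.state_dict()

# On restart: create everything afresh *first*, then load the state dicts.
# Scheduler construction overwrites param_group["lr"], so loading the
# optimiser's state afterwards puts the checkpointed rate back in place.
optimizer2, scheduler2 = build_optimizer_and_scheduler(model)
scheduler2.load_state_dict(sched_state)
optimizer2.load_state_dict(opt_state)
```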

So, we have some sketched-out code -- it's time to put it in place for the real training run.

The actual code

I won't go through the details of the changes to my existing DDP training code, though you can see the diff here if you're interested.

Much of the complexity was due to keeping backward compatibility so that we don't have to always use a learning rate scheduler; remember that in this mini-series, I'm trying out various changes ("interventions") to the training loop in isolation, seeing whether each one improves things. So it's important to be able to easily train with or without learning rate scheduling; I did that with a schedule_learning_rate flag in the train.json config.

Implementation-wise, initially I was thinking that it would be easiest to always have a scheduler, and in the "non-scheduled" case just to set it to a linear one that didn't change the value over the course of the train. But in the end it turned out to be easier to use scheduler == None as the switch to tell the training loop which "mode" it was in.

The placement of the code to create the schedulers was also a little tricky; the "natural" place was just after the optimiser is created, like it is in the example code above. However, at that point, we don't know how many global steps we're going to have in the train, because we don't have the dataset -- which means that working out the numbers to pass in to the schedulers for the warmup and decay steps would be impossible. It turned out to be easiest to put it in the function load_datasets_and_train, just after the datasets are loaded, as at that point we have all of the information we need.

Anyway, that's the code done, so let's see what happens!

The training run, part 1: scheduling the learning rate

I wanted to do two trains; one with the learning rate scheduling, and one with just the new value for the learning rate, 0.0014 instead of 0.0004. I was expecting the updated learning rate alone to be too high and to cause a very choppy train, but had high hopes for the train with the scheduling. Here's how it did; the scheduled learning rate train first:

Training complete in 12,270.413 seconds
Tokens seen: 3,260,252,160
Throughput: 265,700 tokens/second
Final train loss: 3.654

Here's what the training loss looked like over that:

Loss for training run with learning rate scheduling

Quite a few loss spikes early on in the train when the learning rate is at its peak, but nothing unmanageable -- and, as you'd expect, things calmed down quite a lot later on.

I also charted the learning rate, to make sure it really was doing what I thought it was doing:

Learning rate for training run with learning rate scheduling

So, a pretty smooth train, and we definitely did the right learning rate scheduling. Time to upload it to Hugging Face, and see what the evals look like.

Evals for the scheduled learning rate:

Firstly, the smoke test:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-schedule-learning-rate/model.json runs/8xa100m40-schedule-learning-rate/checkpoints/best/model.safetensors
Every effort moves you closer to the fact they will not see or use their product if you are using a computer or notebook

Reasonably coherent, at least, though it's not super-impressive. On to the loss on our test set:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets runs/8xa100m40-schedule-learning-rate/model.json runs/8xa100m40-schedule-learning-rate/checkpoints/best/model.safetensors
Fetching 4 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4681.14it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [05:00<00:00, 10.66it/s]
Loss against our test dataset: 3.602

That's our best loss so far! Let's put it into the table:

| Run | Test set loss | Improvement vs baseline |
| --- | --- | --- |
| 8xa100m40-baseline | 3.692 | - |
| 8xa100m40-gradient-clipping | 3.678 | 0.014 |
| 8xa100m40-qkv-bias | 3.669 | 0.023 |
| 8xa100m40-remove-dropout | 3.641 | 0.051 |
| 8xa100m40-schedule-learning-rate | 3.602 | 0.090 |

So, it definitely looked like it was worth it. But was it the scheduling of the learning rate that helped, or just the change from 0.0004 to 0.0014?

The training run, part 2: just updating the learning rate

I kicked off a second run with no scheduling, just a learning rate of 0.0014, to see what would happen. After about an hour, I noticed that the loss chart had stopped updating. The last point had a maximum and minimum loss but no average -- but after that, nothing:

Loss for training run with learning rate updated

However, the learning rate was still being charted, so the train was definitely running:

Learning rate for training run with learning rate updated

Looking at the checkpoint metadata showed what had happened. At global step 1851, we had this:

  "min_train_loss": 6.198999881744385,
  "max_train_loss": 7.729835510253906,
  "avg_train_loss": NaN,

...and at the next checkpoint at step 2468, we had this:

  "min_train_loss": NaN,
  "max_train_loss": NaN,
  "avg_train_loss": NaN,

...and the same for all checkpoints thereafter.

Clearly the parameters had gone off the rails -- exactly what we'd expect with an excessive learning rate:

No convergence with a too-high learning rate

There was no point in continuing the train, as it was pretty much certainly unrecoverable, so I stopped it. Out of interest, I downloaded the model, but I couldn't even run the smoke test on it:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-update-learning-rate/model.json runs/8xa100m40-update-learning-rate/checkpoints/latest/model.safetensors
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.

So it was pretty clear that just updating the learning rate to 0.0014 was actively harmful. No need to upload that one to HF! And time to wrap up this experiment.

Conclusion

While this has been quite a long post, I've really only scratched the surface of how learning rates are set. If I were doing things in more detail, the best approach would probably be a "sweep" over multiple values, to try to at least approximate the best possible rate for this model.

That would be pretty expensive for me, though, so I decided to stick with the DeepSeek number. It might not be ideal for the specific architecture that I'm using, given how different that is to theirs, but given the results, it's a decent one compared to what I was using.

Something that I found interesting is that exactly how to schedule your learning rate is still an area of active research. Even in my relatively minimal research, I came across three alternatives to the mainstream warmup-then-cosine-decay pattern.

I'm sure there are many more. But for this train, I decided to stick to the mainstream, and the results were pretty good! To reiterate, this has been the most positive intervention so far:

| Run | Test set loss | Improvement vs baseline |
| --- | --- | --- |
| 8xa100m40-baseline | 3.692 | - |
| 8xa100m40-gradient-clipping | 3.678 | 0.014 |
| 8xa100m40-qkv-bias | 3.669 | 0.023 |
| 8xa100m40-remove-dropout | 3.641 | 0.051 |
| 8xa100m40-schedule-learning-rate | 3.602 | 0.090 |

So I'll stick with that, and move on to the next thing: what is the weight_decay parameter that we're passing in to the AdamW optimiser? Tune in next time :-)