I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post, I managed to train one that was almost (if not quite) there.
Now, back before I started digging into these interventions, I was doing three evals for each model I built: a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a measurement of that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate responses for a test set of instructions, and then used an LLM as a judge to score them.
The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality.
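To be concrete about what that technical measure is: it's just the average next-token cross-entropy over text the model never saw during training. Here's a minimal sketch -- my models use the from-scratch GPT code from the book, but the off-the-shelf Hugging Face GPT-2 below shows the same measurement, and test.txt is a placeholder for whatever held-back text you're testing against:

```python
# A minimal sketch of measuring test-set loss. The real evals use the
# from-scratch GPT model; the off-the-shelf Hugging Face GPT-2 here just
# illustrates the measurement, and "test.txt" is a placeholder path.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

ids = tokenizer(open("test.txt").read(), return_tensors="pt").input_ids

losses = []
with torch.no_grad():
    # Walk over the held-back text in context-length-sized chunks and record
    # the model's next-token cross-entropy loss for each one.
    for start in range(0, ids.size(1) - 1, 1024):
        chunk = ids[:, start : start + 1024]
        losses.append(model(chunk, labels=chunk).loss.item())

print(f"Test loss: {sum(losses) / len(losses):.3f}")
```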
Unfortunately, in January, I realised that my methodology was flawed: because I was asking the LLM to score each model in isolation, the LLM's natural randomness meant that results were not really comparable between models, at least for ones that were reasonably close in quality.
For example, if two models both replied to
Name the author of 'Pride and Prejudice'.
with:
The author of 'Pride and Prejudice' is Sarah Palin.
...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%.
My fix was to have two scripts:
- One that fine-tuned the model then got it to generate responses, then saved those responses in a file.
- One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other.
The details are here.
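To give a flavour of the second script, here's a stripped-down sketch of the comparative judging step; the file names, prompt wording, and judge model below are placeholders rather than my actual code -- the real details are in the post linked above.

```python
# Illustrative sketch of comparative judging: load the saved responses for
# several models and show them to the judge side by side, so that the scores
# are relative rather than absolute. Paths, prompt and judge model are placeholders.
import json
from openai import OpenAI

client = OpenAI()
response_files = {
    "model_a": "responses_model_a.json",
    "model_b": "responses_model_b.json",
}

# Each file is assumed to hold a list of {"instruction": ..., "response": ...} dicts,
# produced by the first (fine-tune and generate) script.
responses = {name: json.load(open(path)) for name, path in response_files.items()}

lines = ["Score each model's answer to each instruction from 0 to 100."]
for i, item in enumerate(responses["model_a"]):
    lines.append(f"\nInstruction {i + 1}: {item['instruction']}")
    for name in responses:
        lines.append(f"{name}: {responses[name][i]['response']}")

judgement = client.chat.completions.create(
    model="gpt-4o",  # whichever judge model you're using
    messages=[{"role": "user", "content": "\n".join(lines)}],
)
print(judgement.choices[0].message.content)
```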
Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try.
Now I have those, so let's give it a go!
The background, and the last test
At the end of the previous round of IFT tests, I had this table. It's sorted by the loss on the test set (shown to 3 decimal places), and has the score that the model got from an instruction fine-tuning run:
| | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 39.64 |
| OpenAI weights: small | 3.500 | 16.66 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 16.5 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.59 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.23 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 11.59 |
| Local FineWeb train | 3.944 | 11.32 |
| Local FineWeb-Edu extended train | 4.135 | 16.41 |
| Local FineWeb-Edu train | 4.167 | 15.77 |
There's a loose correlation where lower loss means a higher IFT score, with two weird exceptions: the two FineWeb-Edu training runs, which got much higher scores than you'd expect from their loss.
My working hypothesis was that there were two components that led to a model getting a good score:
- Its raw intelligence: lower-loss models were smarter, so they were better at instruction-following after the fine-tune.
- Its knowledge. All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on what amounted to minimally-curated data from the Internet. But FineWeb-Edu is meant to be "the most educational" subset of FineWeb, so it presumably is more dense in useful facts.
So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones might be dumb but knowledgeable. The ones in between, by contrast, might be both relatively dumb and not know very much.
There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB model seemed surprisingly good on the IFT results when considering its loss -- but perhaps there was some kind of step function, where as soon as a model got better than (say) 3.7 on the loss, it suddenly became smart in whatever way mattered.
All very hand-wavy, of course, but it was a hypothesis of sorts. Would the new models fit that pattern? It was time to find out.
The initial run, and the mystery
I didn't think it was worth adding all 14 models that I've trained in my intervention-testing to that table, so I decided to just add four of them:
- 8xa100m40-baseline, the baseline cloud-trained model for all of the interventions.
- 1xrtx3090-baseline, the locally-trained version of the same -- the first model from this post.
- 8xa100m40-stacked-interventions-1, the best model we managed to get in the cloud.
- 1xrtx3090-stacked-interventions, the best local model -- the second from this post.
Now, I already had files containing responses from fine-tuned versions of the other models, so I just needed to run the first of my two fine-tuning scripts against all four of the new models.
I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it used GPT-5.4. If you run the script multiple times, each time will normally give you different scores anyway; hopefully the ranking will remain roughly the same. So given that I was going to have to re-run the script to get new aggregate results, and those would not really be comparable to the original ones anyway, this seemed like a reasonable price to pay for (hopefully) a smarter judge.
I ran that once, and got some results that surprised me -- so much that I decided to do three runs and see if the results stood up. They did; here's the new table, with scores for each run, the average, and the rank that each one got based on the average.
| | Test loss | IFT score 1 | IFT score 2 | IFT score 3 | IFT average | IFT rank |
|---|---|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 43.44 | 41.83 | 41.30 | 42.19 | 1 |
| OpenAI weights: small | 3.499677 | 19.27 | 19.37 | 18.36 | 19.00 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 19.20 | 18.60 | 18.15 | 18.65 | 3 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 11.70 | 12.74 | 11.28 | 11.91 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 18.25 | 18.40 | 17.83 | 18.16 | 4 |
| 1xrtx3090-baseline | 3.683835 | 13.59 | 13.93 | 12.56 | 13.36 | 10 |
| 8xa100m40-baseline | 3.691526 | 17.72 | 17.33 | 16.26 | 17.10 | 6 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 14.87 | 15.05 | 13.68 | 14.53 | 8 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 12.65 | 13.34 | 12.55 | 12.85 | 11 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 14.39 | 14.72 | 12.87 | 13.99 | 9 |
| Local FineWeb train | 3.943522 | 12.66 | 13.06 | 11.67 | 12.46 | 12 |
| Local FineWeb-Edu extended train | 4.134991 | 17.64 | 16.93 | 16.29 | 16.95 | 7 |
| Local FineWeb-Edu train | 4.166892 | 17.94 | 18.92 | 17.05 | 17.97 | 5 |
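For reference, the "IFT average" and "IFT rank" columns are just the mean of the three runs and the ordering by that mean -- something like this, using the first three rows of the table:

```python
# How the "IFT average" and "IFT rank" columns are derived; only the first
# three rows of the table above are included, to keep the example short.
scores = {
    "OpenAI weights: medium": [43.44, 41.83, 41.30],
    "OpenAI weights: small": [19.27, 19.37, 18.36],
    "1xrtx3090-stacked-interventions": [19.20, 18.60, 18.15],
}

averages = {name: sum(runs) / len(runs) for name, runs in scores.items()}
for rank, name in enumerate(sorted(averages, key=averages.get, reverse=True), start=1):
    print(f"{rank}. {name}: {averages[name]:.2f}")
```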
You can see that relative rankings are fairly consistent across the IFT runs. But while in general the lower-loss runs get better IFT results, now there are even more exceptions to that trend than there were before.
Let's look down the "IFT rank" column, which is based on the IFT average:
- The first surprise is 8xa100m40-stacked-interventions-1. It has the fourth-best loss, but it's the worst model out of all of them on the instruction fine-tuning test! It was trained on exactly the same data as all of the others apart from the OpenAI ones and the FineWeb-Edu ones. Even more perplexingly, it was as close a match to 1xrtx3090-stacked-interventions as I could make it, but got completely different results. You might remember from the post that those two runs started with the same weights and had exactly the same training config; the only differences were that they were trained on different hardware, and that one used DDP with a real global batch size of 96, while the other used gradient accumulation to get the same batch size.
- 1xrtx3090-baseline also does much worse than you'd expect from its loss numbers; it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but much worse on the IFT test. Again, this one is essentially a clone of another: 8xa100m40-baseline, which was the same training run but using DDP rather than gradient accumulation. The same problem -- one of a pair of closely-matched models has worse results on the IFT test. But in this case, it's the gradient accumulation model that turned out bad.
That's a really odd situation. If the training runs using gradient accumulation rather than DDP had been consistently worse -- or vice versa -- then we could imagine some kind of connection. But in the first case GA beat DDP, while in the second it was the other way around.
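As a reminder of what that one difference actually is: gradient accumulation builds the same global batch of 96 sequences by summing gradients over several small batches on one GPU, rather than splitting them across eight GPUs with DDP. A toy sketch of the accumulation side, assuming a micro-batch of 12 and 8 accumulation steps (the stand-in model, data, and optimizer are illustrative only):

```python
# Toy sketch of gradient accumulation building the same global batch of 96
# sequences that DDP gets across 8 GPUs. The 12-sequence micro-batch and the
# tiny stand-in model/data/optimizer are illustrative assumptions only.
import torch
from torch import nn

model = nn.Linear(16, 50257)                       # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_batches = [(torch.randn(12, 16), torch.randint(0, 50257, (12,)))
                 for _ in range(8)]                # 8 micro-batches of 12 = 96

accum_steps = 8
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    loss = nn.functional.cross_entropy(model(inputs), targets)
    (loss / accum_steps).backward()                # scale so gradients match one batch of 96
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```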
Apart from that, we do still see that the two FineWeb-Edu models are doing much better than the others. And the remaining models are all pretty close together, both in terms of loss and in terms of their ranking, apart from the Local FineWeb train, which is bad in both.
It is, however, interesting that Local FineWeb-Edu extended train, which was trained on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of the IFT numbers. That wasn't the case in my previous tests.
All of this puzzled me. The "lots of knowledge makes a model better at this" idea
seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all,
if it was true, you'd expect the model trained on more data to be consistently better).
And the "smart, low-loss models are better" side seemed to be contradicted by
8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline's bad results.
What might be going on here?
Epochs of fine-tuning
Looking at the training code, one thing stood out to me. The process was:
- Fine-tune the model for a maximum of 100 epochs over the training set.
- If loss on a held-back validation set went above the previous epoch's result, we did an early exit and used the previous epoch's model to generate the responses (the loop is sketched below).
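In Python-ish pseudocode, that early-exit loop looks something like this; train_epoch and evaluate are stand-ins for the real Alpaca fine-tuning and validation-loss code:

```python
# A minimal sketch of the early-exit logic described above. train_epoch and
# evaluate are placeholders for the real fine-tuning and validation-loss code.
import copy

def fine_tune_with_early_exit(model, train_epoch, evaluate, max_epochs=100):
    best_model, prev_val_loss, epochs_kept = copy.deepcopy(model), float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(model)                 # one pass over the fine-tuning set
        val_loss = evaluate(model)         # loss on the held-back validation set
        if val_loss > prev_val_loss:       # got worse: fall back to last epoch's model
            return best_model, epochs_kept
        best_model, prev_val_loss = copy.deepcopy(model), val_loss
        epochs_kept = epoch + 1
    return best_model, epochs_kept
```

(The fixed-epoch runs later in this post are just the same loop with the early exit removed and max_epochs set to 4 or 7.)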
In practice, the early-exit code always cut in pretty quickly. I'd noticed that during my original generation of the results for the new models:
- 8xa100m40-baseline took 6 epochs until validation loss started rising.
- 1xrtx3090-baseline took 5.
- 8xa100m40-stacked-interventions-1 took 4.
- 1xrtx3090-stacked-interventions took 5.
I decided to regenerate responses for all of the models, and then run the new responses past the LLM judge again. But this time I would keep a record of how many epochs of training we got before the exit:
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 39.14 | 2 | 1 |
| OpenAI weights: small | 3.499677 | 24.93 | 2 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 16.97 | 4 | 5 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.40 | 4 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 20.73 | 7 | 3 |
| 1xrtx3090-baseline | 3.683835 | 13.61 | 6 | 9 |
| 8xa100m40-baseline | 3.691526 | 13.57 | 4 | 10 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 14.25 | 4 | 8 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 11.66 | 4 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 15.17 | 4 | 7 |
| Local FineWeb train | 3.943522 | 13.25 | 7 | 11 |
| Local FineWeb-Edu extended train | 4.134991 | 16.39 | 7 | 6 |
| Local FineWeb-Edu train | 4.166892 | 17.80 | 7 | 4 |
It was getting even harder to see any useful pattern! One thing that did
stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB
model was being instruction-trained for seven epochs. It was also rather noticeable
that the two FineWeb-Edu models had the same "advantage", if that's what it was. But
the Local FineWeb train had seven
epochs too, and got a poor score, the OpenAI models only got two each, and
led the pack, and 1xrtx3090-baseline got a pretty poor result given its six
epochs of training.
Still, what would happen if we got rid of that confounder? I did yet another set of runs; this time, I changed the fine-tuning/generation script to always do four epochs -- no early exit. I chose four because it was the modal number in the previous trains -- no strong reason for it beyond that.
Training for four epochs
Here's what came out at the end:
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 43.99 | 4 | 1 |
| OpenAI weights: small | 3.499677 | 25.70 | 4 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 14.46 | 4 | 4 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.07 | 4 | 11= |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 13.51 | 4 | 5 |
| 1xrtx3090-baseline | 3.683835 | 10.65 | 4 | 8 |
| 8xa100m40-baseline | 3.691526 | 12.55 | 4 | 6 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 11.41 | 4 | 7 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 9.48 | 4 | 13 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 10.07 | 4 | 11= |
| Local FineWeb train | 3.943522 | 10.16 | 4 | 10 |
| Local FineWeb-Edu extended train | 4.134991 | 10.54 | 4 | 9 |
| Local FineWeb-Edu train | 4.166892 | 15.09 | 4 | 3 |
Still no obvious pattern.
Training for seven epochs
What if we try seven epochs of training for all of them, so that they all get as much "benefit" (if that's what it is) as the FineWeb-Edu models?
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 40.74 | 7 | 1 |
| OpenAI weights: small | 3.499677 | 24.87 | 7 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 16.91 | 7 | 3 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.59 | 7 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 15.94 | 7 | 5 |
| 1xrtx3090-baseline | 3.683835 | 13.68 | 7 | 9 |
| 8xa100m40-baseline | 3.691526 | 14.82 | 7 | 7 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 10.82 | 7 | 11 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 10.70 | 7 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 13.81 | 7 | 8 |
| Local FineWeb train | 3.943522 | 13.09 | 7 | 10 |
| Local FineWeb-Edu extended train | 4.134991 | 16.27 | 7 | 4 |
| Local FineWeb-Edu train | 4.166892 | 15.54 | 7 | 6 |
Just as confused as ever...
Putting it all together
Here's a table with all of the ranks we got from these tests:
| | Initial rank | Updated script rank | 4-epoch rank | 7-epoch rank |
|---|---|---|---|---|
| OpenAI weights: medium | 1 | 1 | 1 | 1 |
| OpenAI weights: small | 2 | 2 | 2 | 2 |
| 1xrtx3090-stacked-interventions | 3 | 5 | 4 | 3 |
| 8xa100m40-stacked-interventions-1 | 13 | 13 | 11= | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 4 | 3 | 5 | 5 |
| 1xrtx3090-baseline | 10 | 9 | 8 | 9 |
| 8xa100m40-baseline | 6 | 10 | 6 | 7 |
| Cloud FineWeb, 8x H100 80 GiB | 8 | 8 | 7 | 11 |
| Cloud FineWeb, 8x A100 80 GiB | 11 | 12 | 13 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 9 | 7 | 11= | 8 |
| Local FineWeb train | 12 | 11 | 10 | 10 |
| Local FineWeb-Edu extended train | 7 | 6 | 9 | 4 |
| Local FineWeb-Edu train | 5 | 4 | 3 | 6 |
It's hard to draw much sense out of this, but a few things are clear:
- Performance on this test is correlated with loss, but it's far from the only factor (there's a quick check of that correlation sketched after this list).
- The OpenAI weights consistently lead the pack.
- Of our own models, 1xrtx3090-stacked-interventions, Cloud FineWeb, 8x A100 40 GiB, and Local FineWeb-Edu train do pretty well.
- Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train trained on a further 3B tokens of the FineWeb-Edu dataset, is consistently worse than the model it was based on.
- 8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline are consistently bad. Cloud FineWeb, 8x A100 80 GiB is also not great.
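On that first bullet: one quick way to quantify the correlation is to rank-correlate the test losses against (say) the seven-epoch IFT scores. Lower loss should go with a higher score, so a strongly negative correlation would mean loss explains most of the ranking; the numbers below are simply copied from the tables above.

```python
# Quick check on the loss-vs-IFT-score relationship, using the test losses and
# the 7-epoch IFT scores from the tables above (same model ordering in both lists).
from scipy.stats import spearmanr

test_loss = [3.231, 3.500, 3.538, 3.578, 3.674, 3.684, 3.692,
             3.725, 3.730, 3.771, 3.944, 4.135, 4.167]
ift_score = [40.74, 24.87, 16.91, 10.59, 15.94, 13.68, 14.82,
             10.82, 10.70, 13.81, 13.09, 16.27, 15.54]

rho, p_value = spearmanr(test_loss, ift_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```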
On the one hand, training different models for different numbers of epochs feels wrong for an evaluation like this, as they're being "treated differently". On the other hand, if it's meant to be a good evaluation of model usefulness in the real world, then individual models would be fine-tuned for different amounts of time, depending on validation loss. So perhaps it is better?
But the differing results are still quite a puzzle. I figured that a modern AI could easily build me a data exploration interface, specifically for the original results and seven-epoch ones, so I asked Claude and got this rather nice one.
After poring over that, though, I couldn't find a smoking gun -- for example, some
kind of systematic error that 8xa100m40-stacked-interventions-1 was always making
that pulled its score down.
I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this. If we consider the loss landscape that these models are all in, they've all been trained to try to get to a place with as low loss as we could manage. When we do the instruction fine-tune on them, we're changing the landscape -- the objective of "be better at following instructions" is different to "be better at minimising loss".
Now, those two landscapes could be completely different! You can imagine a task that we might set instead of instruction-following that could be completely uncorrelated with loss minimisation, or even inversely correlated.
But instruction-following is relatively close; it at least shares features like "generate coherent text". So when we do the instruction fine-tuning, what we're trying to do is to move from the place where the model ended up after its pre-training, to a place where performance on the new goal -- instruction-following -- is best.
Here's where I'm going to get more than a bit hand-wavy. You can easily imagine that some places where the loss was low, there might be downhill slopes pointing towards good locations in the new instruction-following landscape. With instruction fine-tuning, you'd be able to get a good IFT model.
But other places with low loss might not have that advantage; maybe they're at or near a poor "local minimum" in the IFT landscape -- that is, a place where there is no downhill route to a better place. So simple fine-tuning like this might never get a good result!
With this mindset, we might say that the OpenAI weights are pretty well-positioned,
not just in the loss landscape but also in the IFT landscape. The FineWeb-Edu
models happened to get lucky, and wind up in a place that (despite having poor loss)
is well-positioned for the IFT objective. By contrast, 8xa100m40-stacked-interventions-1
and 1xrtx3090-baseline were just unlucky: they got to a place where the loss
landscape was not well-correlated with the IFT landscape.
This seems plausible enough for me to use it as my working model for now, and see if I can work out some way to test it. Keeping track of the validation loss during the instruction fine-tuning process would certainly be a good start; unfortunately I only realised that after doing all of the tests above, and re-doing them would be quite a lot of work.
One final thing is worth repeating. Our two "unlucky" models,
8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline, each had a twin.
The former was the DDP-trained counterpart of the gradient-accumulated
1xrtx3090-stacked-interventions, while the latter was the gradient-accumulated counterpart
of 8xa100m40-baseline. So while something odd clearly happened, it doesn't look like
DDP or gradient accumulation by themselves are the culprit.
I think that at this point, it's best for me to draw a line under this -- I have a bunch of other things I'd like to get to, and this has become a bit of a side quest.
Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal. In some cases, it seems likely that lower-loss models can be worse for actual use.
Coming up next: I'm going to wrap up this "interventions" mini-series, and move on to the final steps in my LLM from scratch journey. See you then!