I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post, I managed to train one that was almost (if not quite) there.
Now, back before I started digging into these interventions, I was doing three evals for each model I built: a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a measurement of that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate responses for a test set of instructions, and then used an LLM as a judge to score them.
The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality.
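To be concrete about what that technical measure is: it's just the average next-token cross-entropy over text the model never saw during training. Here's a minimal sketch -- my models use the from-scratch GPT code from the book, but the off-the-shelf Hugging Face GPT-2 below shows the same measurement, and test.txt is a placeholder for whatever held-back text you're testing against:

```python
# A minimal sketch of measuring test-set loss. The real evals use the
# from-scratch GPT model; the off-the-shelf Hugging Face GPT-2 here just
# illustrates the measurement, and "test.txt" is a placeholder path.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

ids = tokenizer(open("test.txt").read(), return_tensors="pt").input_ids

losses = []
with torch.no_grad():
    # Walk over the held-back text in context-length-sized chunks and record
    # the model's next-token cross-entropy loss for each one.
    for start in range(0, ids.size(1) - 1, 1024):
        chunk = ids[:, start : start + 1024]
        losses.append(model(chunk, labels=chunk).loss.item())

print(f"Test loss: {sum(losses) / len(losses):.3f}")
```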
Unfortunately, in January, I realised that my methodology was flawed: because I was asking the LLM to score each model in isolation, the LLM's natural randomness meant that results were not really comparable between models, at least for ones that were reasonably close in quality.
For example, if two models both replied to
Name the author of 'Pride and Prejudice'.
with:
The author of 'Pride and Prejudice' is Sarah Palin.
...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%.
My fix was to have two scripts:
- One that fine-tuned the model then got it to generate responses, then saved those responses in a file.
- One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other.
The details are here.
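To give a flavour of the second script, here's a stripped-down sketch of the comparative judging step; the file names, prompt wording, and judge model below are placeholders rather than my actual code -- the real details are in the post linked above.

```python
# Illustrative sketch of comparative judging: load the saved responses for
# several models and show them to the judge side by side, so that the scores
# are relative rather than absolute. Paths, prompt and judge model are placeholders.
import json
from openai import OpenAI

client = OpenAI()
response_files = {
    "model_a": "responses_model_a.json",
    "model_b": "responses_model_b.json",
}

# Each file is assumed to hold a list of {"instruction": ..., "response": ...} dicts,
# produced by the first (fine-tune and generate) script.
responses = {name: json.load(open(path)) for name, path in response_files.items()}

lines = ["Score each model's answer to each instruction from 0 to 100."]
for i, item in enumerate(responses["model_a"]):
    lines.append(f"\nInstruction {i + 1}: {item['instruction']}")
    for name in responses:
        lines.append(f"{name}: {responses[name][i]['response']}")

judgement = client.chat.completions.create(
    model="gpt-4o",  # whichever judge model you're using
    messages=[{"role": "user", "content": "\n".join(lines)}],
)
print(judgement.choices[0].message.content)
```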
Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try.
Now I have those, so let's give it a go!
The background, and the last test
At the end of the previous round of IFT tests, I had this table. It's sorted by the loss on the test set (shown to 3 decimal places), and has the score that the model got from an instruction fine-tuning run:
| | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 39.64 |
| OpenAI weights: small | 3.500 | 16.66 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 16.5 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.59 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.23 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 11.59 |
| Local FineWeb train | 3.944 | 11.32 |
| Local FineWeb-Edu extended train | 4.135 | 16.41 |
| Local FineWeb-Edu train | 4.167 | 15.77 |
There's a loose correlation where lower loss means a higher IFT score, with two weird exceptions: the two FineWeb-Edu training runs, which got much higher scores than you'd expect from their loss.
My working hypothesis was that there were two components that led to a model getting a good score:
- Its raw intelligence: lower-loss models were smarter, so they were better at instruction-following after the fine-tune.
- Its knowledge. All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on what amounted to minimally-curated data from the Internet. But FineWeb-Edu is meant to be "the most educational" subset of FineWeb, so it presumably is more dense in useful facts.
So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones might be dumb but knowledgeable. The ones in between, by contrast, might be both relatively dumb and not know very much.
There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB model seemed surprisingly good on the IFT results when considering its loss -- but perhaps there was some kind of step function, where as soon as a model got better than (say) 3.7 on the loss, it suddenly became smart in whatever way mattered.
All very hand-wavy, of course, but it was a hypothesis of sorts. Would the new models fit that pattern? It was time to find out.
The initial run, and the mystery
I didn't think it was worth adding all 14 models that I've trained in my intervention-testing to that table, so I decided to just add four of them:
- 8xa100m40-baseline, the baseline cloud-trained model for all of the interventions.
- 1xrtx3090-baseline, the locally-trained version of the same -- the first model from this post.
- 8xa100m40-stacked-interventions-1, the best model we managed to get in the cloud.
- 1xrtx3090-stacked-interventions, the best local model -- the second from this post.
Now, I already had files containing responses from fine-tuned versions of the other models, so I just needed to run the first of my two fine-tuning scripts against all four of the new models.
I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it used GPT-5.4. If you run the script multiple times, each time will normally give you different scores anyway; hopefully the ranking will remain roughly the same. So given that I was going to have to re-run the script to get new aggregate results, and those would not really be comparable to the original ones anyway, this seemed like a reasonable price to pay for (hopefully) a smarter judge.
I ran that once, and got some results that surprised me -- so much that I decided to do three runs and see if the results stood up. They did; here's the new table, with scores for each run, the average, and the rank that each one got based on the average.
| | Test loss | IFT score 1 | IFT score 2 | IFT score 3 | IFT average | IFT rank |
|---|---|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 43.44 | 41.83 | 41.30 | 42.19 | 1 |
| OpenAI weights: small | 3.499677 | 19.27 | 19.37 | 18.36 | 19.00 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 19.20 | 18.60 | 18.15 | 18.65 | 3 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 11.70 | 12.74 | 11.28 | 11.91 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 18.25 | 18.40 | 17.83 | 18.16 | 4 |
| 1xrtx3090-baseline | 3.683835 | 13.59 | 13.93 | 12.56 | 13.36 | 10 |
| 8xa100m40-baseline | 3.691526 | 17.72 | 17.33 | 16.26 | 17.10 | 6 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 14.87 | 15.05 | 13.68 | 14.53 | 8 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 12.65 | 13.34 | 12.55 | 12.85 | 11 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 14.39 | 14.72 | 12.87 | 13.99 | 9 |
| Local FineWeb train | 3.943522 | 12.66 | 13.06 | 11.67 | 12.46 | 12 |
| Local FineWeb-Edu extended train | 4.134991 | 17.64 | 16.93 | 16.29 | 16.95 | 7 |
| Local FineWeb-Edu train | 4.166892 | 17.94 | 18.92 | 17.05 | 17.97 | 5 |
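For reference, the "IFT average" and "IFT rank" columns are just the mean of the three runs and the ordering by that mean -- something like this, using the first three rows of the table:

```python
# How the "IFT average" and "IFT rank" columns are derived; only the first
# three rows of the table above are included, to keep the example short.
scores = {
    "OpenAI weights: medium": [43.44, 41.83, 41.30],
    "OpenAI weights: small": [19.27, 19.37, 18.36],
    "1xrtx3090-stacked-interventions": [19.20, 18.60, 18.15],
}

averages = {name: sum(runs) / len(runs) for name, runs in scores.items()}
for rank, name in enumerate(sorted(averages, key=averages.get, reverse=True), start=1):
    print(f"{rank}. {name}: {averages[name]:.2f}")
```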
You can see that relative rankings are fairly consistent across the IFT runs. But while in general the lower-loss runs get better IFT results, now there are even more exceptions to that trend than there were before.
Let's look down the "IFT rank" column, which is based on the IFT average:
- The first surprise is 8xa100m40-stacked-interventions-1. It has the fourth-best loss, but it's the worst model out of all of them on the instruction fine-tuning test! It was trained on exactly the same data as all of the others apart from the OpenAI ones and the FineWeb-Edu ones. Even more perplexingly, it was as close a match to 1xrtx3090-stacked-interventions as I could make it, but got completely different results. You might remember from the post that those two runs started with the same weights and had exactly the same training config; the only differences were that they were trained on different hardware, and that one used DDP with a real global batch size of 96, while the other used gradient accumulation to get the same batch size.
- 1xrtx3090-baseline also does much worse than you'd expect from its loss numbers; it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but much worse on the IFT test. Again, this one is essentially a clone of another: 8xa100m40-baseline, which was the same training run but using DDP rather than gradient accumulation. The same problem -- one of a pair of closely-matched models has worse results on the IFT test. But in this case, it's the gradient accumulation model that turned out bad.
That's a really odd situation. If the training runs using gradient accumulation rather than DDP had been consistently worse -- or vice versa -- then we could imagine some kind of connection. But in the first case GA beat DDP, while in the second it was the other way around.
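As a reminder of what that one difference actually is: gradient accumulation builds the same global batch of 96 sequences by summing gradients over several small batches on one GPU, rather than splitting them across eight GPUs with DDP. A toy sketch of the accumulation side, assuming a micro-batch of 12 and 8 accumulation steps (the stand-in model, data, and optimizer are illustrative only):

```python
# Toy sketch of gradient accumulation building the same global batch of 96
# sequences that DDP gets across 8 GPUs. The 12-sequence micro-batch and the
# tiny stand-in model/data/optimizer are illustrative assumptions only.
import torch
from torch import nn

model = nn.Linear(16, 50257)                       # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_batches = [(torch.randn(12, 16), torch.randint(0, 50257, (12,)))
                 for _ in range(8)]                # 8 micro-batches of 12 = 96

accum_steps = 8
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    loss = nn.functional.cross_entropy(model(inputs), targets)
    (loss / accum_steps).backward()                # scale so gradients match one batch of 96
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```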
Apart from that, we do still see that the two FineWeb-Edu models are doing much better than the others. And the remaining models are all pretty close together, both in terms of loss and in terms of their ranking, apart from the Local FineWeb train, which is bad in both.
It is, however, interesting that Local FineWeb-Edu extended train, which was trained on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of the IFT numbers. That wasn't the case in my previous tests.
All of this puzzled me. The "lots of knowledge makes a model better at this" idea
seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all,
if it was true, you'd expect the model trained on more data to be consistently better).
And the "smart, low-loss models are better" side seemed to be contradicted by
8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline's bad results.
What might be going on here?
Epochs of fine-tuning
Looking at the training code, one thing stood out to me. The process was:
- Fine-tune the model for a maximum of 100 epochs over the training set.
- If loss on a held-back validation set went above the previous epoch's result, we did an early exit and used the previous epoch's model to generate the responses (the loop is sketched below).
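In Python-ish pseudocode, that early-exit loop looks something like this; train_epoch and evaluate are stand-ins for the real Alpaca fine-tuning and validation-loss code:

```python
# A minimal sketch of the early-exit logic described above. train_epoch and
# evaluate are placeholders for the real fine-tuning and validation-loss code.
import copy

def fine_tune_with_early_exit(model, train_epoch, evaluate, max_epochs=100):
    best_model, prev_val_loss, epochs_kept = copy.deepcopy(model), float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(model)                 # one pass over the fine-tuning set
        val_loss = evaluate(model)         # loss on the held-back validation set
        if val_loss > prev_val_loss:       # got worse: fall back to last epoch's model
            return best_model, epochs_kept
        best_model, prev_val_loss = copy.deepcopy(model), val_loss
        epochs_kept = epoch + 1
    return best_model, epochs_kept
```

(The fixed-epoch runs later in this post are just the same loop with the early exit removed and max_epochs set to 4 or 7.)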
In practice, the early-exit code always cut in pretty quickly. I'd noticed that during my original generation of the results for the new models:
- 8xa100m40-baseline took 6 epochs until validation loss started rising.
- 1xrtx3090-baseline took 5.
- 8xa100m40-stacked-interventions-1 took 4.
- 1xrtx3090-stacked-interventions took 5.
I decided to regenerate responses for all of the models, and then run the new responses past the LLM judge again. But this time I would keep a record of how many epochs of training we got before the exit:
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 39.14 | 2 | 1 |
| OpenAI weights: small | 3.499677 | 24.93 | 2 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 16.97 | 4 | 5 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.40 | 4 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 20.73 | 7 | 3 |
| 1xrtx3090-baseline | 3.683835 | 13.61 | 6 | 9 |
| 8xa100m40-baseline | 3.691526 | 13.57 | 4 | 10 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 14.25 | 4 | 8 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 11.66 | 4 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 15.17 | 4 | 7 |
| Local FineWeb train | 3.943522 | 13.25 | 7 | 11 |
| Local FineWeb-Edu extended train | 4.134991 | 16.39 | 7 | 6 |
| Local FineWeb-Edu train | 4.166892 | 17.80 | 7 | 4 |
It was getting even harder to see any useful pattern! One thing that did
stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB
model was being instruction-trained for seven epochs. It was also rather noticeable
that the two FineWeb-Edu models had the same "advantage", if that's what it was. But
the Local FineWeb train had seven
epochs too, and got a poor score, the OpenAI models only got two each, and
led the pack, and 1xrtx3090-baseline got a pretty poor result given its six
epochs of training.
Still, what would happen if we got rid of that confounder? I did yet another set of runs; this time, I changed the fine-tuning/generation script to always do four epochs -- no early exit. I chose four because it was the modal number in the previous trains -- no strong reason for it beyond that.
Training for four epochs
Here's what came out at the end:
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 43.99 | 4 | 1 |
| OpenAI weights: small | 3.499677 | 25.70 | 4 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 14.46 | 4 | 4 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.07 | 4 | 11= |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 13.51 | 4 | 5 |
| 1xrtx3090-baseline | 3.683835 | 10.65 | 4 | 8 |
| 8xa100m40-baseline | 3.691526 | 12.55 | 4 | 6 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 11.41 | 4 | 7 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 9.48 | 4 | 13 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 10.07 | 4 | 11= |
| Local FineWeb train | 3.943522 | 10.16 | 4 | 10 |
| Local FineWeb-Edu extended train | 4.134991 | 10.54 | 4 | 9 |
| Local FineWeb-Edu train | 4.166892 | 15.09 | 4 | 3 |
Still no obvious pattern.
Training for seven epochs
What if we try seven epochs of training for all of them, so that they all get as much "benefit" (if that's what it is) as the FineWeb-Edu models?
| | Test loss | IFT score | Epochs | IFT rank |
|---|---|---|---|---|
| OpenAI weights: medium | 3.231442 | 40.74 | 7 | 1 |
| OpenAI weights: small | 3.499677 | 24.87 | 7 | 2 |
| 1xrtx3090-stacked-interventions | 3.538161 | 16.91 | 7 | 3 |
| 8xa100m40-stacked-interventions-1 | 3.577761 | 10.59 | 7 | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 3.673623 | 15.94 | 7 | 5 |
| 1xrtx3090-baseline | 3.683835 | 13.68 | 7 | 9 |
| 8xa100m40-baseline | 3.691526 | 14.82 | 7 | 7 |
| Cloud FineWeb, 8x H100 80 GiB | 3.724507 | 10.82 | 7 | 11 |
| Cloud FineWeb, 8x A100 80 GiB | 3.729900 | 10.70 | 7 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771478 | 13.81 | 7 | 8 |
| Local FineWeb train | 3.943522 | 13.09 | 7 | 10 |
| Local FineWeb-Edu extended train | 4.134991 | 16.27 | 7 | 4 |
| Local FineWeb-Edu train | 4.166892 | 15.54 | 7 | 6 |
Just as confused as ever...
Putting it all together
Here's a table with all of the ranks we got from these tests:
| | Initial rank | Updated script rank | 4-epoch rank | 7-epoch rank |
|---|---|---|---|---|
| OpenAI weights: medium | 1 | 1 | 1 | 1 |
| OpenAI weights: small | 2 | 2 | 2 | 2 |
| 1xrtx3090-stacked-interventions | 3 | 5 | 4 | 3 |
| 8xa100m40-stacked-interventions-1 | 13 | 13 | 11= | 13 |
| Cloud FineWeb, 8x A100 40 GiB | 4 | 3 | 5 | 5 |
| 1xrtx3090-baseline | 10 | 9 | 8 | 9 |
| 8xa100m40-baseline | 6 | 10 | 6 | 7 |
| Cloud FineWeb, 8x H100 80 GiB | 8 | 8 | 7 | 11 |
| Cloud FineWeb, 8x A100 80 GiB | 11 | 12 | 13 | 12 |
| Cloud FineWeb, 8x B200 160 GiB | 9 | 7 | 11= | 8 |
| Local FineWeb train | 12 | 11 | 10 | 10 |
| Local FineWeb-Edu extended train | 7 | 6 | 9 | 4 |
| Local FineWeb-Edu train | 5 | 4 | 3 | 6 |
It's hard to draw much sense out of this, but a few things are clear:
- Performance on this test is correlated with loss, but it's far from the only factor (there's a quick check of that correlation sketched after this list).
- The OpenAI weights consistently lead the pack.
- Of our own models, 1xrtx3090-stacked-interventions, Cloud FineWeb, 8x A100 40 GiB, and Local FineWeb-Edu train do pretty well.
- Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train trained on a further 3B tokens of the FineWeb-Edu dataset, is consistently worse than the model it was based on.
- 8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline are consistently bad. Cloud FineWeb, 8x A100 80 GiB is also not great.
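On that first bullet: one quick way to quantify the correlation is to rank-correlate the test losses against (say) the seven-epoch IFT scores. Lower loss should go with a higher score, so a strongly negative correlation would mean loss explains most of the ranking; the numbers below are simply copied from the tables above.

```python
# Quick check on the loss-vs-IFT-score relationship, using the test losses and
# the 7-epoch IFT scores from the tables above (same model ordering in both lists).
from scipy.stats import spearmanr

test_loss = [3.231, 3.500, 3.538, 3.578, 3.674, 3.684, 3.692,
             3.725, 3.730, 3.771, 3.944, 4.135, 4.167]
ift_score = [40.74, 24.87, 16.91, 10.59, 15.94, 13.68, 14.82,
             10.82, 10.70, 13.81, 13.09, 16.27, 15.54]

rho, p_value = spearmanr(test_loss, ift_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```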
On the one hand, training different models for different numbers of epochs feels wrong for an evaluation like this, as they're being "treated differently". On the other hand, if it's meant to be a good evaluation of model usefulness in the real world, then individual models would be fine-tuned for different amounts of time, depending on validation loss. So perhaps it is better?
But the differing results are still quite a puzzle. I figured that a modern AI could easily build me a data exploration interface, specifically for the original results and seven-epoch ones, so I asked Claude and got this rather nice one.
After poring over that, though, I couldn't find a smoking gun -- for example, some
kind of systematic error that 8xa100m40-stacked-interventions-1 was always making
that pulled its score down.
I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this. If we consider the loss landscape that these models are all in, they've all been trained to try to get to a place with as low loss as we could manage. When we do the instruction fine-tune on them, we're changing the landscape -- the objective of "be better at following instructions" is different to "be better at minimising loss".
Now, those two landscapes could be completely different! You can imagine a task that we might set instead of instruction-following that could be completely uncorrelated with loss minimisation, or even inversely correlated.
But instruction-following is relatively close; it at least shares features like "generate coherent text". So when we do the instruction fine-tuning, what we're trying to do is to move from the place where the model ended up after its pre-training, to a place where performance on the new goal -- instruction-following -- is best.
Here's where I'm going to get more than a bit hand-wavy. You can easily imagine that some places where the loss was low, there might be downhill slopes pointing towards good locations in the new instruction-following landscape. With instruction fine-tuning, you'd be able to get a good IFT model.
But other places with low loss might not have that advantage; maybe they're at or near a poor "local minimum" in the IFT landscape -- that is, a place where there is no downhill route to a better place. So simple fine-tuning like this might never get a good result!
With this mindset, we might say that the OpenAI weights are pretty well-positioned,
not just in the loss landscape but also in the IFT landscape. The FineWeb-Edu
models happened to get lucky, and wind up in a place that (despite having poor loss)
is well-positioned for the IFT objective. By contrast, 8xa100m40-stacked-interventions-1
and 1xrtx3090-baseline were just unlucky: they got to a place where the loss
landscape was not well-correlated with the IFT landscape.
This seems plausible enough for me to use it as my working model for now, and see if I can work out some way to test it. Keeping track of the validation loss during the instruction fine-tuning process would certainly be a good start; unfortunately I only realised that after doing all of the tests above, and re-doing them would be quite a lot of work.
One final thing is worth repeating. Our two "unlucky" models,
8xa100m40-stacked-interventions-1 and 1xrtx3090-baseline, each had a twin.
The former was the DDP-trained counterpart of the gradient-accumulated
1xrtx3090-stacked-interventions, while the latter was the gradient-accumulated counterpart
of 8xa100m40-baseline. So while something odd clearly happened, it doesn't look like
DDP or gradient accumulation by themselves are the culprit.
I think that at this point, it's best for me to draw a line under this -- I have a bunch of other things I'd like to get to, and this has become a bit of a side quest.
Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal. In some cases, it seems likely that lower-loss models can be worse for actual use.
Coming up next: I'm going to wrap up this "interventions" mini-series, and move on to the final steps in my LLM from scratch journey. See you then!