Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models on Lambda Labs machines, using the GPT-2 architecture from the book. I used two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:

  1. A simple cross entropy loss over a fixed test set (see the sketch after this list).
  2. The results for an instruction fine-tune test that's covered in the book.
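
For the first of these, a minimal sketch of how the test loss might be calculated, assuming a PyTorch GPT model and a DataLoader that yields (input, target) batches of token IDs as in the book:

```python
import torch

def calc_test_loss(model, test_loader, device):
    """Average next-token cross entropy over a fixed test set."""
    model.eval()
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for input_batch, target_batch in test_loader:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)
            logits = model(input_batch)  # shape: (batch, seq_len, vocab_size)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), target_batch.flatten()
            )
            total_loss += loss.item()
            num_batches += 1
    return total_loss / num_batches
```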

Here were the results I got, sorted by the loss:

| Model | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 38.53 |
| OpenAI weights: small | 3.500 | 22.98 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 17.09 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.98 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.71 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 13.89 |
| Local FineWeb train | 3.944 | 16.01 |
| Local FineWeb-Edu extended train | 4.135 | 14.55 |
| Local FineWeb-Edu train | 4.167 | 16.86 |

Now, you'd expect there to be at least a loose correlation: the lower the loss, the higher the IFT score. But while we can see a difference between the OpenAI weights and our own models, within our own there doesn't seem to be any logical pattern.

I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.

In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.

Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset to teach it how to follow instructions; the dataset comprises a series of sequences like this:

Below is an instruction that describes a task.  Write a response that
appropriately completes the request.

### Instruction:

<some instructions>


### Input:

<optional, some input>

### Response:

More details in this post.
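
As a rough sketch, assuming each Alpaca sample is a dict with instruction, input and output keys, the prompt above can be assembled with a helper along these lines (similar in spirit to the book's format_input):

```python
def format_input(entry):
    """Build the Alpaca-style prompt for one sample dict."""
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The "### Input:" section is only included when the sample actually has an input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry.get("input") else ""
    return instruction_text + input_text
```

During fine-tuning, the text "\n\n### Response:\n" plus the sample's expected output is appended to this, so the model learns to complete the prompt; at evaluation time, the model generates whatever comes after "### Response:" itself.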

In the version I've settled on, I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples.
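
A minimal sketch of that loop, assuming a PyTorch model, an optimizer, and train/validation DataLoaders of (input, target) token-ID batches:

```python
import copy
import torch

def fine_tune_with_early_stopping(model, train_loader, val_loader, optimizer, device, max_epochs=20):
    """Fine-tune epoch by epoch; stop when validation loss starts rising
    and roll back to the weights from the previous epoch."""
    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(max_epochs):
        # One training epoch
        model.train()
        for input_batch, target_batch in train_loader:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)
            optimizer.zero_grad()
            logits = model(input_batch)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), target_batch.flatten()
            )
            loss.backward()
            optimizer.step()

        # Validation loss for this epoch
        model.eval()
        with torch.no_grad():
            val_losses = [
                torch.nn.functional.cross_entropy(
                    model(x.to(device)).flatten(0, 1), y.to(device).flatten()
                ).item()
                for x, y in val_loader
            ]
        val_loss = sum(val_losses) / len(val_losses)

        if val_loss > best_val_loss:
            break  # validation loss has started rising -- bail out
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)  # weights from before the loss rose
    return model
```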

Once that's done, the script hits the OpenAI API using GPT-5.1, with default parameters for all of the options (e.g. no explicit temperature), sending queries like this:

Given the input `
Below is an instruction that describes a task.  Write a response that
appropriately completes the request.

### Instruction:

Rewrite the sentence using a simile.


### Input:

The car is very fast.

### Response:
`
and correct output `
The car is as fast as lightning.
`,
score the model response `
The car is as fast as a cheetah.
`
on a scale of 0 to 100, where 100 is the best score.
Respond with the integer number only.

We do that for every model-generated response in the test set, then take the average of the scores and use that as our result.
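
A minimal sketch of that scoring loop, assuming the official openai Python package (with OPENAI_API_KEY set in the environment) and a list of test samples held as dicts with hypothetical prompt, correct_output and model_response keys:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_response(prompt, correct_output, model_response):
    """Ask the judge model for a 0-100 integer score for one response."""
    query = (
        f"Given the input `\n{prompt}\n`\n"
        f"and correct output `\n{correct_output}\n`,\n"
        f"score the model response `\n{model_response}\n`\n"
        "on a scale of 0 to 100, where 100 is the best score.\n"
        "Respond with the integer number only."
    )
    reply = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": query}],
    )
    return int(reply.choices[0].message.content.strip())

def average_ift_score(samples):
    """Average the judge's scores over every sample in the test set."""
    scores = [
        score_response(s["prompt"], s["correct_output"], s["model_response"])
        for s in samples
    ]
    return sum(scores) / len(scores)
```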

To see why that's problematic, imagine this simple instruction with no separate input:

Below is an instruction that describes a task.  Write a response that
appropriately completes the request.

### Instruction:

Name the author of 'Pride and Prejudice'.


### Response:

One response I've seen from my models was this:

The author of 'Pride and Prejudice' is 'Pride and Prejudice'.

That's obvious garbage, and should get a zero -- and GPT-5.1 consistently gives it one.

Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this:

The author of 'Pride and Prejudice' is Jane Austen.

That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset).

But now how about this one:

The author of 'Pride and Prejudice' is Sarah Palin.

One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response.

The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment.

I came to the conclusion that the best way to fix this is to score in a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, generate its responses for the test set, and store them in a file. Then, once we've done that for all of the models, we can score them all at once, prompting GPT-5.1 with something like this:

You are judging the comparative capabilities of a number of different LLM
models.  They have been trained to follow instructions.

The input was this:

`
{input}
`

An example correct output is this:

`
{correct_output}
`

Please produce a score of between 0 and 100 for each model, and respond
with a JSON structure like this (note that the number of models may differ
from this example):

`
{
    "Model 1": {"score": XXX, "comments": "optional comments"},
    "Model 2": {"score": YYY, "comments": "optional comments"},
    "Model 3": {"score": ZZZ, "comments": "optional comments"}
}
`

...where the XXX, YYY and ZZZ are the scores for the respective models.
You can optionally add the "comments" field if you want to explain your
reasoning.

Here are the models' responses:

# Model 1

{model 1 response}


# Model 2

{model 2 response}


# Model 3

{model 3 response}

The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better.

Here's the code:
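
As a rough sketch rather than the actual scripts, the heart of the second, batch-scoring step might look like this, assuming the official openai Python package, a prompt_template string like the one above (with {input} and {correct_output} placeholders), and a dict mapping model names to their saved responses for a given test sample:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_one_prompt(prompt_template, input_text, correct_output, model_responses):
    """Score every model's response to a single test prompt in one judge call.

    model_responses maps a model name (e.g. "Model 1") to its generated text;
    the return value is the judge's parsed JSON of per-model scores.
    """
    # Use str.replace rather than str.format, because the template itself
    # contains literal JSON braces in the example output structure.
    query = (
        prompt_template.replace("{input}", input_text)
                       .replace("{correct_output}", correct_output)
        + "\n\nHere are the models' responses:\n\n"
        + "\n\n".join(f"# {name}\n\n{text}" for name, text in model_responses.items())
    )
    reply = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": query}],
    )
    content = reply.choices[0].message.content.strip()
    # The judge sometimes wraps its JSON in code fences; strip them if present.
    content = content.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    return json.loads(content)
```

Averaging each model's scores across all of the test prompts then gives the per-model IFT scores in the table below.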

Running the first script against each of our models, and then the second script against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look):

| Model | Test loss | IFT score | Annotated results |
|---|---|---|---|
| OpenAI weights: medium | 3.231 | 39.64 | openai-medium-ift-test-results-annotated.json |
| OpenAI weights: small | 3.500 | 16.66 | openai-small-ift-test-results-annotated.json |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 16.5 | 8xa100m40-ift-test-results-annotated.json |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.59 | 8xh100m80-ift-test-results-annotated.json |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.23 | 8xa100m80-ift-test-results-annotated.json |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 11.59 | 8xb200m160-ift-test-results-annotated.json |
| Local FineWeb train | 3.944 | 11.32 | local-fineweb-ift-test-results-annotated.json |
| Local FineWeb-Edu extended train | 4.135 | 16.41 | local-fineweb-edu-extended-ift-test-results-annotated.json |
| Local FineWeb-Edu train | 4.167 | 15.77 | local-fineweb-edu-ift-test-results-annotated.json |

(Still sorted by loss so that you can compare it more easily with the one above.)

That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern.

It looks like we have three groups of models:

  1. The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
  2. The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
  3. The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.

I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower.

A hypothesis: there are two things that contribute to how good a model is at these IFT tests:

  1. The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
  2. The amount of information in the dataset. It doesn't matter how clever a model is: if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.

Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as

a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.

Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible.

By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training.

So we can imagine that the OpenAI models are smart but not knowledgeable, and the same goes for our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge.

Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, and they didn't wind up with a particularly amazing loss either. So they get low scores.

And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better.

Well, it sounds plausible ;-) I'd like to spend some time digging in to see if there's any indication that it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed on how you'd test that hypothesis in any real depth.

TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest.

So: let's get back to our regularly scheduled LLM training. Next up: how to upload our models to Hugging Face quickly and easily so that other people can play with them.

Here's a link to the next post in this series.