Quo vadis, LLM benchmarks?

LLM Benchmarks are dead. They don't give any signal, and labs, especially open ones, just train on test to generate hype on social media and to lure innocent bystanders into using them, robbing them of their time and money in broad daylight.

This (or variations of it) is something that people claim every time their favorite benchmark releases results that go against their intuition. Or, which seems to be increasingly the case: Because they didn't even look into the benchmark(s) once and thus interpret whatever they want into the fun graph.

Let's get the most important point out of the way first: Benchmarks are not meant to be taken at face value, but rather indicate the (relative) strengths of models and the general progress in the respective area of the benchmark. While comparisons at the top get fuzzy (a 1-3% difference is well within statistical error), bigger differences in capabilities are represented well by existing benchmarks. This feels obvious, but needs to be spelled out against the noise of social media every time a model gets a tiny improvement over SOTA on a single benchmark.

Benchmarks are always constrained #

Benchmarks always have to work with certain assumptions or rules, which makes them different from general, real-world usage. Coding benchmarks use repos with comprehensive test suites, QA benchmarks have to agree on final facts to check against, etc. In practice, those constraints show up in three places: how the task is specified, how the data is interpreted, and how much compute you can afford.

Prompts (and their corresponding grading functions) have to be so well defined that they allow for one (or a defined set of) solutions and not bad submissions, yet be vague enough that the answer is not spoiled in the prompt. Initially, I was not a fan of SWE-Bench Pro, which uses existing (GitHub Issue) texts and then makes them overly specific, but I've since come around and think that this is a rather good approach.

Even if you get the prompt "right", you can still have reasonable disagreement about the sample. Another aspect that I only got to appreciate by looking at hundreds of samples and discussing them with others: There are different, but valid viewpoints about any sample. You might dislike one for being too ambiguous, others think it's fine as-is. There are samples (and benchmarks) which are objectively bad or wrong, but for a lot of the well-known ones, there are reasons for samples to be included.

With the rampant advancements of model capabilities, a different constraint becomes an increasing problem to accurately measure capabilities, especially for independent organizations and academia: Money. Running benchmarks is expensive, especially if you run them multiple times to accurately report avg@X numbers.

And that cost pressure inevitably bleeds back into the benchmark design. To manage costs in some way, benchmarks often set some limit, whether it's wall-clock time or a certain amount of steps or tokens. But what if simply raising those limits results in way better performance? What happens to (your) bench if you were to 10x the money spent? What if you 100x or 1000x it? SOTA models can run for hours or even days uninterrupted, port entire libraries or build C compilers. How would the benchmark landscape change if it weren't constrained by those arbitrary limits?

Benchmarks have an elicitation problem #

Some benchmarks (deliberately) play into the weaknesses of current models to stay relevant for longer than they ought to be, while others use a reasonable amount of effort to make the samples fair, but challenging.

To give an example, here is a sample from a vision benchmark, which just wants to be hard, rather than measuring the vision capabilities of a model:

The corresponding question is:

In this image, consider the following values:

Consider all the cars that are driving on the road. Of these cars, one has a number plate which has a 2 digit prime number visible. How many letters does this car's manufacturer's name have?

There is a car reversing. What is the 2 digit prime number on the number plate of this car?

Consider which country this picture is taken in, how many letters does this country's name have in English?

How many white stripes can be seen on the road where it is normally used by pedestrians to cross the road?

How many active traffic lights indicating that road traffic can proceed are visible?

What is the product of all 5 of these values?

The models are used without tools, so aside from it being a vision benchmark that is needlessly convoluted, the models are also asked to reason and do math calculations about their intermediate results. It is hard to see real-world utility being measured here. Even funnier: The ground truth is wrong depending on whether you consider the one stripe next to the man valid or not.

Other benchmarks do not try hard enough to properly elicit the raw capabilities of models. This affects open models more than closed models, as the former are generally weaker, but can thrive when set up and used correctly. People genuinely use models like Kimi K2.5 in harnesses like OpenCode because it is a good model and not just to save money.

To showcase how weak elicitation affects the reported capabilities, let's look at an example benchmark: AlgoTune. It was made by Tübingen and Princeton people, who release one industry-defining software engineering benchmark after another, from SWE-bench Verified to CodeClash. AlgoTune is a fun benchmark, asking models to optimize small Python code, which often uses optimized libraries itself.

However, the ranking feels off:

GPT-OSS being placed higher than Opus 4.1 and GPT-5 Pro? o4-mini over GPT-5 and GLM-4.5? Looking into the samples, it becomes clear as to why: Each sample is constrained to $1 in API usage, so a lot of the more expensive models simply bust the pricing in their first message(s)!

The other issue is the harness: It includes a set of tools to look at the files, revert to a previous step and edit code, but the model has to return a block of reasoning, followed by the tool call in triple-backtick delimited markdown. This is not how models work these days! Unsurprisingly, the models struggle with this format, resulting in a lot of useless or broken "tool calls".

So, what happens when you fix those mistakes? To test this, I chose the "hardest" sample, vectorized_newton, which no model was able to solve. As the repo is rather bespoke (to run all samples efficiently on AWS), I used Codex to extract the issue, reference, and grader and then used both Codex CLI (+ GPT-5.3 Codex xhigh) as well as Kimi CLI (+ Kimi K2.5) to tackle the problem.

The results should not be surprising: Both models equally crush the problem, achieving "SOTA" immediately. I played around with the prompts as well: The initial AlgoTune prompt (minus the tool explanations) and a prompt which tries to get the model more into a loop to achieve even better outcomes. For this problem, it did not really matter. I also translated the latter into Chinese as well, which resulted in a worse performance for K2.5.

I tested a bunch of different problems with this setup in a similar (vibe-based, unscientific) way and the usage of a CLI always trumped the current SOTA by... a lot. This also holds when you limit for costs.

Those experiments also led to one of my favorite env hacks in recent times: For sha256_hashing, models are asked to optimize a SHA-256 hash function. However, the reference already uses Python's cryptography, which uses OpenSSL under the hood, so a real speedup is basically impossible. I told Codex repeatedly to optimize it, no matter what. It then came to the genius solution to disable OPENSSL_armcap with environment variables at import time, which means the CPU crypto capabilities are disabled, which makes OpenSSL (and thus the reference implementation) slower. Codex's own solution then used Apple's libcommonCrypto, which obviously isn't affected by the environment variable, resulting in a >5x speedup. Codex did call it out before and after it implemented it, so that was easy to catch!

I think the best way to balance the act between "we want to accurately report the capabilities of models" and "we want to compare models in a scientific way and not just products" is what SWE-bench does: Have one simple standard harness with sane defaults and one leaderboard where people (and organizations) can hill climb as hard as they want. To test raw capabilities, you have to use the production harnesses, which is what PostTrainBench does.

Conclusion #

Are benchmarks dead? Of course not. However, it is important to know what benchmarks are for and to look at the implementations (and data) of benchmarks to know what it measures and within which constraints it measures.

For benchmarks, it's important to either go with the current developments of the field or, probably even more important, acknowledge when it ceases to be a useful indicator. The fun of working in a field that is so rapidly advancing: There is a real possibility that harnesses (and their respective impact) are a non-factor for eliciting model capabilities in a year. Let's see how well this blog will age!