Not smarter, just better

I follow Hacker News, r/LocalLLaMA, and all the other channels out there where new models are posted along with their benchmark scores. Can we please stop kidding ourselves? Models are not actually getting smarter. At the top end of the market, OpenAI and Anthropic are deadlocked in model ability (for lack of a better word). I think this is why Grok was able to catch up so fast. We’re simply maxing out what the transformer architecture can produce with the data we have available for training.

But, if you use Claude or ChatGPT daily, then you might disagree. Both GPT-5 and Claude Sonnet 4 are kind of sorta better at a few niche things. Sure, there are some marginal improvements in each flagship release, but it’s not like upgrading from GPT-3 to GPT-4. It’s more like, “oh, GPT-whatever can read this cursive handwriting like 60% of the time.” It’s mostly a vibe check. Gone are the days of 10x improvements, at least until we discover a fundamentally new model architecture. (Something something the big-O of the transformer architecture etc etc.)

The models themselves are not smarter, but ChatGPT and Claude (the products) have gotten a lot better over the last two years. No more “my knowledge cutoff is whatever” type of misinformation, because they can now search the internet. You don’t need to ask for “step-by-step” because they can automatically run multiple tasks on their own. The product does exactly what I would do: search Google, write and eval some code, and then summarize the results.
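If you squint, the loop behind that behavior is simple plumbing. Here’s a minimal sketch in Python; call_model, web_search, and run_code are made-up stand-ins, not any vendor’s actual API:

```python
# A toy tool loop. Everything here is a hypothetical stand-in: call_model
# fakes a chat API, and the "tools" fake search and code execution.

def call_model(messages):
    # Fake model call: ask for one web search, then answer from it.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "web_search", "args": messages[0]["content"]}
    return {"content": "summary of: " + messages[-1]["content"]}

TOOLS = {
    "web_search": lambda q: f"(top results for {q!r})",
    "run_code": lambda src: f"(stdout from running {len(src)} chars)",
}

def answer(question, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)
        tool = reply.get("tool")
        if tool in TOOLS:
            # The model asked for a tool: run it, feed the result back.
            messages.append({"role": "tool", "content": TOOLS[tool](reply["args"])})
        else:
            # No tool call means the model is done.
            return reply["content"]
    return "(gave up after max_steps)"

print(answer("plan my family trip"))
```

That’s the whole trick: a while loop, some APIs, and a model that knows when to ask. Useful, yes; smarter, no.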

And just today Anthropic announced that Claude (still talking about the product) can read and write Excel files, slide decks, and more. The chatbot experience is getting much better, but again, the magic isn’t in the model, it’s all tooling. Watching Anthropic and OpenAI continue to pump engineering time into chatbot add-ons only reinforces my point. They know they’re stuck for the moment, so if they want a moat, they’ll need to build it the old-fashioned way.

Ok, but isn’t this all just better prompt engineering? I mean, sure, giving the model more relevant data will make LLMs perform better, but that wasn’t the AGI promise. We’re miles away from sneezing out a prompt and watching a billion-dollar product emerge token by token. You still cannot trust an LLM to make business-critical decisions, or even plan your upcoming family trip without fact-checking. And writing agents to check other agents just doesn’t work. Try it. You’ll be disappointed.

But here’s the thing: I’m not mad. I love these improvements. Please, keep them coming! OpenAI and Anthropic should be very proud of what they’ve released so far. I just want us to be honest about where and how the needle is moving. Let’s cut it out with the “this new flagship model is 50% better than the last one” marketing BS. Nobody believes you. What we all want is faster and cheaper inference, and more investment in product tooling. That’s it. No more juiced benchmark scores. No more delusions of AGI. Just keep making the product better. Because frankly, that’s more exciting than whatever synthetic benchmark you’re gaming this quarter.
