GPT-4.5 is #1 on Chatbot Arena (LMSYS) in all categories
lmarena.ai

I'm not surprised. You get the same voices poo-pooing scaling even though it just works, and then a wave of articles downplaying the hallucination problem. In reality, going from 61.8% to 37.1% on SimpleQA means there are many classes of problems that GPT-4.5 just solves (or at least tells you it can't). For something like OCR or data entry, these are real gains.
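To put the SimpleQA numbers in perspective, here's a quick back-of-envelope check (the 61.8% and 37.1% rates are the ones quoted above; the interpretation as hallucination rates is my assumption):

```python
# SimpleQA hallucination rates (%) quoted in the comment above
old_rate, new_rate = 61.8, 37.1

abs_drop = old_rate - new_rate      # absolute drop in percentage points
rel_drop = abs_drop / old_rate      # relative reduction in error rate

print(f"{abs_drop:.1f} pp absolute, {rel_drop:.0%} relative reduction")
```

In other words, roughly 40% of the errors the previous model made are gone, which is why whole classes of tasks flip from "unusable" to "usable".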
Also, there is likely over-fitting to current benchmarks and use cases that makes objectively dumber models perform better. Within the year, people will create reasoning models (allegedly) distilled from 4.5 that match or beat it on most use cases and benchmarks humans care about.
A separate problem is that post-training limits current models to merely "expert level". It's likely that superhuman abilities present in the base model are lost.
IMO scale beats all. It's just that scaling is hard, and there has been comparatively little of it since GPT-3. The entire human race, companies and countries alike, needs to come together and work on a distributed solution.
I’m surprised it topped the reasoning models for code generation and hard prompts. The style control results are also impressive: https://twitter.com/lmarena_ai/status/1896590154871210154