CursorBench 3.1

cursor.com

153 points by handfuloflight 16 hours ago · 93 comments

mdasen 15 hours ago

I'm a bit skeptical.

Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. Look at the DeepSWE benchmark (which is probably the hardest to game at this point): GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.

I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.

  • leerob 5 hours ago

    (I work at Cursor) When Composer 2.5 launched, we initially scored very competitively on AA's composite benchmark. I believe 3rd place overall. They have recently updated to use DeepSWE, which has more of a focus on very long-horizon tasks, and Composer isn't as good at those yet. We're aware and working on this for our next model.

    Overall, some benchmarks show Composer doing well, others not so much. We think the model is very capable at the given price point. There's lots to improve! If you see any specific behaviors or places the model isn't very good, lmk here or can email me lrobinson at cursor.com.

    • BugsJustFindMe 3 hours ago

      > We think the model is very capable at the given price point.

      The "price point" comparison is a lie though because Composer is only available with a monthly Cursor subscription, and Cursor's external-model-per-token charges for other models are not representative of what other models' monthly subscribers get. An OpenAI $200 subscription gets you at least as much GPT 5.5 as a $200 Cursor subscription gets you Composer 2.5.

    • giancarlostoro 4 hours ago

      How does it compare to a $100 Claude subscription at $60? Especially in terms of how much of it I can use, because I haven't found anything in the US that can get me similar usage to Claude at $100 per month or less. Really open to alternatives.

      Grok build only gave me roughly 10 hours of use for $40 for the entire month...

      I don't even care about long horizon, can I use it a reasonable amount of time through the month? I use AI for hobby projects, Claude gets me quite far, but I tire of dropping $100 every month. I'm not sending my money to some Chinese firm that now has access to my computer.

    • artooro 4 hours ago

      I never run long horizon tasks. So Composer 2.5 is great.

    • forgot-my-pw 4 hours ago

      Even with the new benchmark, Composer 2.5 seems to be just a bit worse than Opus 4.7. So I assume it's going to be about similar with Sonnet 5.0 at 1/6 of the cost.

    • ai_slop_hater 4 hours ago

      Don't lie. You forked a Chinese model.

  • CuriouslyC 6 hours ago

    Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

    Ironically, their benchmark might be more accurate than artificial analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.

    • leerob 5 hours ago

      (I work at Cursor) CursorBench includes many evals from actual engineering tasks from the Cursor team, which include our private codebase. This codebase is held-out from training so models haven't seen it, including Composer.

  • burmanm 13 hours ago

    DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only tested that it works fine with some favourite model of theirs).

    There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.

    None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.

    • extr 5 hours ago

      I think DeepSWE is flawed in a different way: the tasks look like someone took a bunch of big highly technical PRs they found really well done, and inverted it into specs for agents to autistically execute. This is not really how people use agents in practice IMO. And it's why DeepSWE is so generous to OAI models, rigid task execution is the thing they're best at. I think FrontierCode matches the vibes a lot better.

  • famouswaffles 15 hours ago

    Cursor sessions are pretty much what composer models are RL'd on. This bench and the training data are/should be basically the same distribution.

  • justachillguy 11 hours ago

    Naturally, given it’s their benchmark they have overfitted their model somewhat to it.

  • jmcqk6 2 hours ago

    I can't speak to benchmarks, but I have used Composer 2.5 extensively and it's performed quite well in my real world tasks.

  • muzani 12 hours ago

    Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.

    • maxdo 12 hours ago

      The opposite: I use Composer for everything, like triggering and monitoring a 10-step release process. It's a very capable model.

      • vorticalbox 10 hours ago

        This is my finding too; I have moved to it fully for most of the planning/coding.

        For most tasks it is capable and very cheap; a day's worth of tasks costs about $10.

        • urbsgpw 9 hours ago

          Same here, maybe I'm underusing it a bit, because for anything that is a bit more complex I tend to err on the safe side and go with Anthropic, but I wonder if that's just a placebo effect because I pay more for it.

          I do feel that they've really upped their game with composer this year though.

  • datadrivenangel 14 hours ago

    For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.

    • anon7000 11 hours ago

      Agreed. It’s worse than Opus of course. But Opus takes more than 10x longer to give you something to look at. I’m not kidding, I “benchmarked” a real ticket I was working on. Opus 4.7 took more than 30min. Opus 4.8 took over an hour. Composer 2.5 took 5min on the exact same prompt & local setup. My subjective review is that composer’s code was only like 10-20% worse. It still worked, it was just a bit less clean and a little more hacky. But it’s not like Opus is flawless either. At the end of the day, if it takes an hour to get to draft code I can look at and iterate on… that’s fucking impossible for me. Unless it did an excellent job. But as long as I still need to review and follow up with changes, Opus is just too slow. It’s really frustrating because it’s a lot slower than it was 6mo ago, and not noticeably better. Fable seems a step in the right direction but is $$$$

      • arcanemachiner 2 hours ago

        Opus is so damn slow that it's forced me to work on multiple items at a time, whenever possible. Waiting for one session? Just switch to another!

      • subhobroto 2 hours ago

        I'm not disputing what you're saying. The slowness of Opus in particular is pretty accurate but you should have been getting little popups in Cursor saying Opus is under load and to try switching to other models?

        I mentioned my frustration with Composer in another thread and why I rely on Opus, but Opus, at least via Cursor, is practically unusable for me M-F 9-5 EST. As a result, I have shifted my working schedule outside those hours when I use Opus. On weekends and nights, Opus via Cursor runs at the same speed as Composer but with vastly superior quality, to the point where it's not even comparable.

        Composer is not 20% worse than Opus for me. Composer hands me a quickly put together college project that was started the night before it was due. Opus hands me an actual production-ready deliverable that I could defend if I were sued in court.

  • subhobroto 2 hours ago

    > Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

    Your skepticism is well-founded IMHO. I have found that if you are one-shotting a Django/Next CRUD app, a typical React/Vue UI, shell scripts or GitHub Actions, Composer 2.5 is fantastic!

    But for anything outside the median of the last decade's web development - like free-body physics, kinematics, or optimization - Composer is horribly unpredictable.

    That's what makes it _dangerous_ IMHO.

    It isn't universally trash! Rather, it confidently makes subtle, incorrect assumptions. It will hallucinate formulas that don't appear in your specification and design docs. Then write tests that pass it.

    It inserts tiny footguns that require you to scrutinize every single token it generates. At that point, I would rather be coding by hand.

    Opus 4.8 max, on the other hand, refuses to guess, at least the way I have set it up. If there's any ambiguity about the implementation or how tests should be written, it stops and asks me for clarification. I actually trust the output without worrying about hidden disasters and ticking timebombs. I can confidently review the test suite, add a few edge cases on my own, spot check the code and be comfortable knowing there are no disastrous footguns lurking in the shadows only to come out in the darkness of production deployments.

    Let me repeat - Opus 4.8 max stops and asks me for clarification. It writes the tests I would have written. It writes tests that fail, exposing gaps and errors, that then allows me to iterate.

    Composer 2.5 OTOH will run with whatever it decides I meant and write something that steals productivity, not add to it.

    Same harness (Cursor), same rules, same prompts, vastly different outcomes!

    Yes, Opus is far more expensive, but it's worth it for the time saved on review and refactors, which are our current blockers.

    The real friction is that Cursor's marketing is so aggressive that the people paying the bills look at my Opus usage and demand to know why I'm not using the cheaper alternative!

    It's an impossible argument to win when the rest of the company's devs are happily building standard web apps on Composer without issue, blissfully unaware of how the model not only falls apart but is just unreliable on harder engineering problems.

    Fable 5 is in a league of its own. If history in the LLM space is any predictor of the future, in ~6 months (Q1 2027) we should have open weight models that are competitive with Fable 5. Without considering what it will take to run such a thing, I would be extremely excited to have open access to such a capability. Great times ahead!

  • WinstonSmith84 12 hours ago

    that benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8, last time I tried composer 2.5 it was truly dumb, and Fable to me looks to be on par with GPT 5.5 but .. different overall ... The best is to have a LLM-peer-review between GPT and Opus (now Fable) for best outcome.

  • ciaf 13 hours ago

    By the same token, Fable 5 is given a score of 77 vs 76 for GPT 5.5

  • whazor 13 hours ago

    I mean, they train their model on their training data. So it should score well on their own benchmark.

  • apothegm 10 hours ago

    Composer writes the worst, stupidest, most naive and straight-up brain-dead code you could imagine. Fast and cheap is about all it’s got going for it. I mostly use it for “sort these lines alphabetically” and stuff that’s a smidge too complex for regex find/replace.

    • bengale 2 hours ago

      It’s starting to feel like people need to say what language/stack and problem space they’re working in. It would be interesting to see why we’re seeing such wild variance.

    • simondotau 10 hours ago

      I primarily use Composer. I wanted to build something from scratch recently and, thinking I was missing out on something, I got Opus to build it. I wasn't blown away. I gave the same prompts to Composer and the code it came up with was different but similar in quality. I ended up continuing with the Composer code because its faster response time made it easier to iterate on improvements.

tekacs 15 hours ago

I'm pretty baffled by their choice of axes. I would have thought that the left was the cheapest, not the most expensive. I appreciate that this layout means that top right can be best, but it's still unintuitive to have this backwards cost axis IMO.

Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.

Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.

  • budsniffer952 12 hours ago

    >Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of

    Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.

    • smokel 10 hours ago

      Most interesting things in software engineering are (laughably) subjective.

      Just check out any conversation on dynamic vs static typing, talk to a Rust zealot, or ask a backend engineer if microservices were a mistake.

      It's unfortunate, and it makes it hard to have proper discussions on these subjects. It would be worthwhile to figure out how we can have more constructive arguments.

      • abirch 9 hours ago

        "Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?" -- George Carlin

      • tekacs 9 hours ago

        Thanks very much for saying this!

        Frankly, it feels like we should just sidestep arguments entirely and just all contribute our messy data/reports, and then see how we can meld all of it together, to find the best answers for our individual situations.

        Probably a good use of frontier AI, melding all of that!

    • tekacs 9 hours ago

      It's all closed code, so I don't have a great way of showing you, but this is all pretty easy to test for yourself, and a good chunk of it is fairly objective:

      On performance: just grab CC + Codex and try Opus 4.8 xhigh and GPT 5.5 xhigh side by side. Ask them a trivial question about something that's already in their context. Opus will churn for 30 seconds, and GPT 5.5 will respond in about three seconds. If you try the same with Fable 5 you'll notice way better adaptive thinking than Opus (it'll be quicker than Opus, even on xhigh – although often still slower than 5.5).

      I have many, many times done 'Opus xhigh, Opus max and GPT xhigh all tried to implement something' – Opus max is... hours and hours. Opus xhigh is usually ~1.5-2x GPT 5.5 xhigh. This feels like a pretty straightforward generalization of the first point. Again, just try racing three agents and see what you get.

      As far as 'right on the edge of what they're able to do', my specific tasks don't matter. Just find something that no matter how hard you try, with however many agents or combinations thereof, with arbitrarily detailed plans, agents can't seem to implement without massive mistakes or a hollowing-out of 'the point' of the implementation... and then try it on the 'following generation' of models. I've been doing this repeatedly with coding agents since I turned aider into a CC-like coding agent in early 2025 (this was my second one, my first modern-style coding agent was in Jan 2025): https://github.com/Aider-AI/aider/pull/3781

      A couple of examples of the latter thing that I tend to work on are database internals (indexes, query planner stuff, etc.; I built the DB in full before agents, it just works on it with me), very advanced UIs (try making a beautiful Rolex-like interactive visualization of the internals of a mechanical watch with Opus and see how far it gets – not very), and 'hardcore product questions' (all agents kinda suck at schema – Fable far less than prior ones). I have dozens and dozens of these that they can't do, though.

      • tyre 2 hours ago

        I can anecdotally back up that Opus takes a ridiculously long time to respond to basic questions. We’re talking, “you implemented this scoped feature on a web app, could you change the buttons to have a loading state like $EXAMPLE?” And it’ll be Discombobulating for 20+ seconds.

        I don’t remember this always being true.

  • andai 8 hours ago

    >it tends to leave big, dangerous holes hiding inside implementations unless babied.

    A brainwave: perhaps GLM or DeepSeek could be integrated into the mix for the purposes of red-teaming the code. Fable has been blinded to security by design[0], and the open models are pretty decent at it.

    [0] It's not clear what the situation with GPT-5.6 will be but the blog suggests similarly over-cautious safety filters.

    Amusingly the posts for recent Opus releases brag that they successfully made it worse at security! "during its [Opus 4.7] training we experimented with efforts to differentially reduce these ["cyber"] capabilities"

    • tekacs 6 hours ago

      I definitely use GPT-5.5 as a counterpart to validate these exact sorts of things in Anthropic models' implementations, in the (now-rarer) cases where I allow Anthropic's models _to_ implement.

      And yeah, it's a bit depressing to think that 5.6 might be similarly nerfed. Less secure software for us all, I guess... except BigCorps. :(

  • mklarmann 13 hours ago

    It’s Gartner. Top-right is where you want to be.

    • 0123456789ABCDE 13 hours ago

      Gartner Magic Quadrant charts don't break the natural expectation of values increasing left-to-right and bottom-to-top; the charts in Cursor's post do.

  • daft_pink 8 hours ago

    I agree; reversing the x axis makes this graph very hard to understand for the casual observer.

  • pbowyer 15 hours ago

    > I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

    Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.

    I think for programming the strength of GPT over Opus is winning here over the context window.

    • tekacs 11 hours ago

      > I think for programming the strength of GPT over Opus is winning here over the context window.

      On this, absolutely!

      I more often use Opus for planning than for implementation. In those cases I really do need the very large context window, because the agent has to read in a bunch of my code base and a bunch of previous plan files and product context and such, to understand what we're talking about.

      And then I need to go back and forth with it over a really extended period: getting into a bunch of details, asking it to load how things already work so that we can discuss options for evolution of those, etc.

      For that kind of thing, compaction completely destroys its effectiveness because even if you try to serialize out all the decisions made in the conversation into a plan file, the agent still loses e.g. the plan files and code files that it's read in that are adding sharp edges to its understanding of the scope of what's being planned.

      For implementation or something like what you're describing in the vein of benchmarking, often I can get away with compaction. Although even then, if the agent needs to have a lot "loaded" into its head, to implement something very, very subtle, complex or far-reaching, in those cases it can be really detrimental if it compacts.

  • subhobroto an hour ago

    > it tends to leave big, dangerous holes hiding inside implementations unless babied

    it's fascinating that I used these same exact words to express my distaste for Composer and my preference for Opus. I suspect, the domains and problems we are trying to solve need to be shared. I wrote about it here: https://news.ycombinator.com/item?id=48766275

    Would love to reach out to discuss more, if you're ok with it, or absolutely feel free to do the same as my email's in the profile like yours!

  • rc1 6 hours ago

    > I'm pretty baffled by their choice of axes

    To put their own model out in front?

  • cherryteastain 14 hours ago

    You can set GPT 5.5 to 1M context mode in Cursor but it costs more after the default 272k.

    • tekacs 13 hours ago

      Yeah I've done this, it's just unaffordably/impractically expensive compared to the official subscriptions :/

  • 0123456789ABCDE 13 hours ago

    opus@max is on average worse than opus@xhigh

    for supporting evidence, see first chart here: https://www.anthropic.com/news/claude-fable-5-mythos-5

sisve 22 minutes ago

If I understand the graph correctly;

Fable is using fewer tokens to achieve the same tasks compared to Sonnet and Opus. If so, that is a good thing. It feels like for a while there we were just spitting out more tokens to get a better result. If the models themselves are getting better without generating more tokens, that feels like a real win.

Q1: Why is number of steps relevant in this graph? What does it tell us?

Q2: And why have they flipped the horizontal axis so that 0 is on the right and not at the origin? Is that some kind of new smart thing? Can't say I have seen it before.

__natty__ 14 hours ago

It's hard to believe Composer 2.5 is that good. I tried to compare it with GLM 5.2 or Opus 4.6 and it lacked thinking about the problem and critical reasoning. It's great for executing plans made by other models, but even then it does some weird code manipulation that is far from how other files around actually work.

  • bmurphy1976 5 hours ago

    I'm not using Cursor at the moment, but when I did (not too long ago) my experience was similar. Plan with Opus, implement with Composer, clean up with Opus.

    Composer did a competent but not amazing job with a good plan. What I really liked though is it was fast! Opus could take 30 minutes to do something Composer would get done in 5-10 minutes. Of course the output wasn't perfect, but that's why I'd do a cleanup pass using Opus or Codex.

    It's all a balance though, constantly changing and completely dependent on the problem you're solving. I just remain flexible and adapt my process to what's working best in the moment.

    • ignoramous 3 hours ago

      Interesting. If I may: What was this "clean up" pass? A code review? A code review with specialized prompt? A focused review to check for edge cases / logic errors / api misuse? Or, something else specific to the codebase?

      Have you settled on what the clean up pass should look like? Or, do you keep experimenting with it?

      In case one might not have been aware: Composer 2 was Kimi Base 2.5 post-trained (RL'd) by Cursor: https://news.ycombinator.com/item?id=48507474. Composer 2.5 might be something totally different.

  • apwheele 10 hours ago

    I read these and think it is just the jagged edge. I do not doubt your personal experience, I have used Composer 2.5 (via Grok and the credits I get with my X premium account) the past month.

    I am not building rockets, but I have been quite impressed. All the models do dumb things sometimes, but it has done the work I have asked of it pretty well, and some of that work has struck me as impressive.

    It is fast on Grok, for other models I have worked extensively with I think it is better than gemini 3.1 (3.5 and antigravity for me is worse than the prior gemini cli). And is comparable to Opus 4.6. (Have not used the more recent models in Claude Code.)

andai 8 hours ago

Interesting that Opus 4.7 does better than 4.8. Too bad they didn't test 4.6, too. I witnessed a man here mocked yesterday for insisting it was better than its successors!

Although, the benchies are always tricksy ... On DeepSWE, GPT-5.5 beats Opus-4.8, by a fair margin, but on FrontierCode, the situation is the other way around.

The only benchmark you can trust is your actual workload!

nok22kon 13 hours ago

Every time a new benchmark appears, Chinese models score far lower than where existing benchmarks say they should be. Then after a while they recover :)

xyzsparetimexyz 13 hours ago

I wish all these sites would show pareto frontier graphs of cost/performance. That's the main 2 things that matter (I guess you could make it 3D with a speed param as well). https://paraplouis.github.io/llm-pareto-frontier/ is the best of these graphs I've seen but it doesn't update as frequently as I'd like.
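
For anyone who wants to build this kind of chart from their own numbers, here is a minimal sketch of picking out the Pareto frontier from (cost, score) pairs. The `Model` type, the `pareto_frontier` helper, and every value in `models` are hypothetical placeholders, not real benchmark results, and this is not how the linked site computes its graph.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float   # e.g. dollars per completed task (lower is better)
    score: float  # benchmark score (higher is better)


def pareto_frontier(models):
    """Return the models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, the higher score comes first.
    for m in sorted(models, key=lambda m: (m.cost, -m.score)):
        # Keep a model only if it scores strictly higher than everything at or below its cost.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Placeholder values for illustration only.
models = [
    Model("model-a", 1.0, 40.0),
    Model("model-b", 2.0, 35.0),   # dominated by model-a: costs more, scores lower
    Model("model-c", 3.0, 60.0),
    Model("model-d", 10.0, 62.0),
    Model("model-e", 10.0, 50.0),  # dominated by model-c and model-d
]

for m in pareto_frontier(models):
    print(f"{m.name}: cost={m.cost}, score={m.score}")
```

Adding speed as a third dimension would need a pairwise dominance check instead of the single sorted pass, since one sort order no longer settles which models dominate which.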

  • mrngld 10 hours ago

    That site is useless though because thinking tokens (and caching) and the efficiency thereof aren't accounted for. GLM5.2 is promoted by every 50 Cent Party the PLA can muster on the internet but it falls short because of its extremely verbose thinking. Anthropic models have the same problem but starting from a much higher base of real intelligence.

    Which is exactly why every credible comparison now represents cost associated with completing a task, not arbitrary input and output token costs.

    • ignoramous 3 hours ago

      > much higher base of real intelligence

      Not sure how much "real intelligence" is to be found in Mythos & Sol, but at this point, ignoring the intelligence gap, I find it totally impressive that the likes of GLM, Kimi, Qwen, MiMo hold their own at 2x to 4x less cost, and work for my use case just the same.

maxdo 12 hours ago

The most interesting part is cost. GPT 5.5 and Sonnet 5 cost the same amount of money as GLM 5.2 but are more capable models.

BugsJustFindMe 14 hours ago

I've used both Composer 2.5 and GPT 5.5 (both in Cursor and in Codex) extensively, and their claim that Composer 2.5 is anywhere close in performance to GPT 5.5 is absolutely farcical. It's faster, but it's nowhere near as good.

And given that you can only use Composer with a Cursor monthly subscription, cost comparisons are pointless since an equivalently priced OpenAI subscription gets you just as much usage of the better model.

baq 13 hours ago

Cursor’s model excels at Cursor’s benchmark; news at 11.

The other models however are reasonably where I’d expect them to be from experience piloting all of them. Fable is outclassing everything at most things at 10x the cost, but sometimes it isn’t a choice between cheap and expensive, but expensive and possible; I’ll need to learn where that boundary is just as it was the case with other models.

kandros 9 hours ago

I like Composer a lot as a general-purpose workhorse, but putting it over GPT 5.5 medium makes me lose trust in the whole graph; same with GLM being so low.

verse 14 hours ago

backwards X axis? is there a reason for that? it looks ridiculous

  • gkbrk 14 hours ago

    It looks very natural, cheaper is better after all. Performance axis going up, and cheapness axis going up match each other.

    • 0123456789ABCDE 13 hours ago

      gp's argument is that cheapness is a construct, derived from the real, and natural, cost parameter which most people are naturally accustomed to interpreting as increasing from left to right. cheapness would then replace the cost label, and feel natural. alas, this is not what we have here.

  • anon373839 14 hours ago

    This seems to be a common choice with AI industry graphs, to give you that “upward and outward” frontier shape.

shadeslayer_ 13 hours ago

Do these benchmarks even add any value at this point? This one is basically Cursor saying that their model is as good as the frontier ones at a fraction of the price. The independent benchmarks are probably part of training data now and the models are pattern-matching against them all the time. The final test of a model (and the harness, probably) is how good it works FOR YOU - since most of the models can pretty much do most of our tasks on a daily basis - it boils down to which one has the least friction to its usage.

bfjvibybd6cuvu6 13 hours ago

No shot 2.5 is beating out 4.8

tmach32 12 hours ago

Why would anyone take this benchmark seriously? Cursor is obviously biased here. They can design it and its presentation however they want to tell the story they want to tell.

mi_lk 9 hours ago

Cursor: Find me another benchmark where Composer 2.5 is a top 10 frontier coding model

  • leerob 5 hours ago

    (I work at Cursor) We score well on Terminal-Bench and SWE-bench Multilingual. DeepSWE, not so great yet, as it's more for very long-horizon tasks. We're planning to include more public benchmarks in our next model release.

xrisk 13 hours ago

Would like to see wall times. I feel that’s the part that annoys me most: my tasks aren’t particularly challenging, I just want them done fast.

luckilydiscrete 14 hours ago

insert obama medal meme

anilgulecha 15 hours ago

is composer 2.5 that good at that pricepoint? Seems like the gemini flash playbook of trying to get most bang for the buck.

  • soyin 2 hours ago

    I'm also using it as my daily driver. I've been trying Opus 4.8 this week to see if I was missing something but haven't noticed a meaningful difference.

    I'm working on a fairly routine full stack web app that isn't doing anything incredible. Once I had the patterns I wanted in place, it's been very capable of following those with new work. I also don't ever give it long running tasks, it's always focused and small chunks.

    My typical workflow is:

    1. /grill-me feature description
    2. Create a plan
    3. Manually review plan and tweak as needed (usually very little to none)
    4. Build the plan

    All with Composer 2.5. Earlier on in the project I used Claude and GPT for #1 and #2.

    I find it really hard to justify the other models for the performance/cost I'm getting with Composer 2.5. Maybe it's not as strong as the frontier models, but it's been plenty good enough for my use cases.

  • danfritz 15 hours ago

    It's my daily driver: it's fast, affordable, and with a bit of guidance gets the job done.

    I only reach for Claude when I need to plan something big or want a sparring partner to fire off some ideas.

    I think what a lot of people don't realize is that you don't need a frontier model for 80% of coding tasks. Composer 2.5 is often more than good enough, less token hungry and way faster.

    • shockembopper 15 hours ago

      I have been doing the same for quite a while now. Composer 2.5 is incredible when you’re working in the loop.

      • simondotau 10 hours ago

        When you normalise for time and money, Composer 2.5 is way, way, way, way better than anything else out there. Yes it requires more babysitting, but that's a good thing.

  • uf00lme 15 hours ago

    It's surprisingly usable and cheap enough to run in 'fast' mode when vibing something quick. For simple code I find I prefer the code it writes over that of the GLM or Gemini families.

  • fumar 15 hours ago

    It’s fast and affordable.

  • aabdi 15 hours ago

    Yes, it's very good.

o10449366 15 hours ago

I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?

I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model, especially leading up to IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token expensive that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.

  • cbg0 11 hours ago

    I've had no problems like the ones you've mentioned while using Opus 4.8. It does overthink stuff with higher effort levels but that's kind of expected.

  • pbowyer 15 hours ago

    > I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

    I keep Claude around for some specific tasks:

    - Linked up to Figma MCP to implement front-end stuff

    - Data analysis, in the "Connect AI to a data source and ask questions" way. I've tried both Opus 4.8 high and GPT 5.5 high for this and Opus is stronger because it gets the intent in the question better

    I used to keep it around for planning too, but the 4.8 plans have had more holes than swiss cheese.

  • anon373839 14 hours ago

    > I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model

    Related: Sonnet 5’s new tokenizer increases token usage by 30%. (https://simonwillison.net/2026/Jun/30/claude-sonnet-5/)

  • mrngld 10 hours ago

    Now that enterprise customers are pay-as-you-go with tokens I suspect we'll see renewed interest in OpenAI and their focus on token efficiency. At least I hope so if the alternative is abandoning the tools entirely.
