No, it doesn't cost Anthropic $5k per Claude Code user

martinalderson.com

480 points by jnord 18 days ago · 373 comments

hirako2000 18 days ago

> Qwen 3.5 397B-A17B is a good comparison

It is not. It's a terrible comparison. Qwen, DeepSeek, and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.

That's why the difference between OpenRouter prices and the official providers' isn't that big. Plus, who knows what OpenRouter providers do in terms of quantization. They may be getting 100x better efficiency, hence the competitive price.

That being said, not all users max out their plan, so it's not like each user costs Anthropic $5,000. The hemorrhage would be so brutal they would be out of business in months.

  • jychang 18 days ago

    That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper, and then you use that to claim that they're 10x more efficient.

    Opus isn't that expensive to host. Look at Amazon Bedrock's t/s numbers for Opus 4.5 vs other Chinese models. They're around the same order of magnitude, which means that Opus has roughly the same number of active params as the Chinese models.

    Also, you can select BF16 or Q8 providers on openrouter.

    • irthomasthomas 18 days ago

      Opus doubled in speed with version 4.5, leading me to speculate that they had promoted a Sonnet-sized model. The new, faster Opus was the same speed as Gemini 3 Flash running on the same TPUs. I think Anthropic's margins are probably the highest in the industry, but they have to split them with Google by renting its TPUs.

      • F7F7F7 17 days ago

        The conspiracy theorist side of me whispers "instead of the rumored Sonnet 5.0 you got Opus 4.6...suspicious"

    • aerhardt 17 days ago

      I guess it's less a tautology than an inversion of observed cause and effect?

    • grayxu 18 days ago

      This is not a valid argument. TPS is essentially QoS and can be adjusted; more GPUs allocated will result in higher speed.

      • yorwba 18 days ago

        There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
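
        A minimal sketch of that dependency (toy Python, not any provider's stack): each decode step consumes the token the previous step produced, so the loop is inherently serial no matter how many GPUs you add.

          def toy_model(ids):
              # Stand-in for a full forward pass over the whole sequence;
              # the data dependency, not the model, is the point here.
              return (sum(ids) + 1) % 50000

          def generate(model, prompt_ids, n_new):
              ids = list(prompt_ids)
              for _ in range(n_new):
                  next_id = model(ids)  # step t needs the token from step t-1...
                  ids.append(next_id)   # ...so steps cannot run in parallel
              return ids

          print(generate(toy_model, [1, 2, 3], 5))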

        • erichocean 18 days ago

          Partially true, you can predict multiple tokens and confirm, which typically gives a 2-3x speedup in practice.

          (Confirmation is faster than prediction.)

          Many model architectures are specifically designed to make this efficient.

          ---

          Separately, your statement is only true for the same-gen hardware, interconnects, and quantization.

        • grumpoholic 18 days ago

          With speculative decoding you can use more models to speed up generation, however.

          • salawat 15 days ago

            Yes, because speculation has NEVER bitten us in the ass before, right? Coughs in Spectre

            Speculative decoding is just running more hardware to get a faster prediction. Essentially, setting more money on fire if you're being billed per token.
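
            For the curious, a minimal sketch of the mechanism (toy Python; `target` and `draft` are stand-in callables, and real engines use a rejection-sampling rule rather than this greedy acceptance): the cheap draft proposes k tokens, the expensive target verifies them, and you keep the agreeing prefix. Extra compute traded for lower wall-clock latency, which is exactly the trade described above.

              def speculative_step(target, draft, ids, k=4):
                  # 1) Cheap draft model proposes k tokens sequentially.
                  proposed, ctx = [], list(ids)
                  for _ in range(k):
                      t = draft(ctx)
                      proposed.append(t)
                      ctx.append(t)
                  # 2) Target verifies. In a real engine these k scores come
                  #    from ONE batched forward pass; the loop here just makes
                  #    the acceptance logic explicit.
                  accepted, ctx = [], list(ids)
                  for t in proposed:
                      own = target(ctx)      # target's token is always usable
                      accepted.append(own)
                      ctx.append(own)
                      if own != t:           # draft diverged: stop accepting
                          break
                  return ids + accepted

              # Toy stand-ins: draft always agrees, so each step emits k tokens.
              target = lambda ids: (sum(ids) + 1) % 100
              draft = lambda ids: (sum(ids) + 1) % 100
              print(speculative_step(target, draft, [5, 9]))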

    • re-thc 18 days ago

      > That's a tautology. People think chinese models are 10x more efficient because they're 10x cheaper

      They do have different infrastructure/electricity costs, and they might not run on Nvidia hardware.

      It's not just the models.

      • jychang 18 days ago

        Except there are providers that serve both Chinese models AND Opus. On the same hardware.

        Namely, Amazon Bedrock and Google Vertex.

        That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Normalized inference software stack, even (most likely). It's about as close to a 1:1 comparison as you can get.

        Both Amazon and Google serve Opus at roughly ~1/2 the speed of the Chinese models. Note that they are not incentivized to slow down the serving of Opus or the Chinese models! So that tells you the ratio of active params for Opus and for the Chinese models.
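
        A back-of-the-envelope version of that inference, with my own illustrative numbers (only DeepSeek V3's 37B active parameters is a published figure): single-stream decode is roughly memory-bandwidth bound, so tokens/sec scales inversely with active-parameter bytes, and the observed speed ratio implies the active-param ratio.

          # Decode throughput ~ memory bandwidth / bytes of active weights
          # read per token. Bandwidth and quantization are assumptions.
          HBM_BW = 3.3e12                    # bytes/s, one H100-class chip

          def tok_per_s(active_params, bytes_per_param=1.0):   # ~Q8 weights
              return HBM_BW / (active_params * bytes_per_param)

          deepseek_active = 37e9             # DeepSeek V3, published
          speed_ratio = 0.5                  # "Opus at roughly ~1/2 the speed"
          implied = deepseek_active / speed_ratio
          print(f"implied Opus active params: ~{implied / 1e9:.0f}B")  # ~74B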

        • Shakahs 18 days ago

          AWS and GCP both have their own custom inference chips, so a better example for hosting Opus on commodity hardware would be Digital Ocean.

        • giancarlostoro 18 days ago

          And Microsoft's Azure. It's on all 3 major cloud providers. Which tells me they can make a profit through these cloud providers without having to pay for any hardware; they just take a small enough cut.

          https://code.claude.com/docs/en/microsoft-foundry

          https://www.anthropic.com/news/claude-in-microsoft-foundry

        • re-thc 18 days ago

          > Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models

          We were talking about 10x, not 0.5x.

          x86 vs arm64 could have different performance. The Chinese models could be optimized for different hardware, which could show massive differences.

          • atq2119 18 days ago

            These providers do not run models on CPUs, x86 vs. Arm is irrelevant.

            • re-thc 17 days ago

              They run Nvidia and Huawei hardware, for example. And mine was just an example.

        • raggi 18 days ago

          Deployments like Bedrock are nowhere near SOTA operational efficiency, 1-2 OOM behind. The hardware is much closer, but pipeline, scheduling, cache, recomposition, routing etc. optimizations blow naive end-to-end architectures out of the water.

          • Analemma_ 17 days ago

            Do you have evidence for any of this, or are you repeating a bunch of buzzwords you’ve heard breathlessly repeated on Twitter?

            • raggi 16 days ago

              Many techniques are documented in papers, particularly those coming out of the Asian teams. I know of work going on in western providers that is similarly advanced. In short, read the papers.

          • nullstyle 17 days ago

            Evidence?

      • fennecfoxy 18 days ago

        I mean GN has covered the Nvidia black market in China enough that we pretty much know that they run on Nvidia hardware still.

        • dryarzeg 18 days ago

          How is this related to inference, may I ask? Except for some very hardware-specific optimizations of model architecture, there's nothing to prevent one from hosting these models on one's own infrastructure. And that's what many OpenRouter providers, at least some of which are based in the US, are actually doing. Because most of the Chinese models mentioned here are open-weight (except for Qwen, which has one proprietary "Max" model), literally anyone can host them, not just someone from China. So it just doesn't really matter.

          • fennecfoxy 18 days ago

            I mean sure, but in terms of inference per dollar/per watt, Nvidia's GPUs are pretty much up there - unless China is pumping out domestic chips cheaply enough.

            Also, with Nvidia you get the efficiency of everything (including inference) built on/for CUDA; even efforts to catch AMD up are still ongoing, afaik.

            I wouldn't be surprised if things like DS were trained and are now hosted on Nvidia hardware.

            • re-thc 18 days ago

              > unless China is pumping out domestic chips cheaply enough

              They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.

              > I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware

              DS is "old". I wouldn't study them. The new ones have a mandate to at least run on local hardware. There are data center requirements.

              I agree it could still be trained on Nvidia GPUs (black market etc), but not running.

              • yorwba 18 days ago

                > The new 1s have a mandate to at least run on local hardware.

                They do? Source?

                But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)

                • re-thc 18 days ago

                  > with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China

                  They just have a China-only endpoint and likely a company under a different name.

                  Nothing to do with AI. TikTok is similar (global vs China operations).

  • Weaver_zhu 18 days ago

    Agree, but I guess Opus 4.6 is 10x larger, rather than Chinese models being 10x more efficient. It is said that GPT-4 was already a 1.6T model, and Llama 4 Behemoth is also much bigger than the Chinese open-weight models. Chinese tech companies are short of frontier GPUs, but they did a lot of innovation on inference efficiency (DeepSeek CEO Liang himself shows up in the author lists of the related published papers).

    • jychang 18 days ago

      No, Opus cannot be 10x larger than the Chinese models.

      If Opus were 10x larger than the Chinese models, then Google Vertex/Amazon Bedrock would serve it 10x slower than DeepSeek/Kimi/etc.

      That's not the case. They're in the same order of magnitude of speed.

      • Filligree 18 days ago

        They serve it about 2x slower. So it must have about 2x the active parameters.

        It could still be 10x larger overall, though that would not make it 10x more expensive.

        • jychang 16 days ago

          Yes, but I highly doubt they would increase sparsity much vs the Chinese models.

          That's how you get Llama 4.

          Pretty much every major lab settled on ~3-5% sparsity for a reason.

      • bakugo 18 days ago

        I agree that Opus almost definitely isn't anywhere near that big, but AWS throughput might not be a great way to measure model size.

        According to OpenRouter, AWS serves the latest Opus and Sonnet at roughly the same speed. It's likely that they simply allocate hardware differently per model.

        • jychang 16 days ago

          The numbers look about right. Opus 4.5 is about 1.5x the size of Sonnet 4.6, and Opus 4/4.1 is about 5x the size of Sonnet 4.5/4.6.

          Note that Opus 4.5 is about 1/3 the size of Opus 4/4.1 (and 1/3 the price in the API)

      • torginus 17 days ago

        My understanding is that for MoE with a top-k architecture, total model size doesn't really matter: you can have ten 32GB experts or a thousand, and if only 2-3 of them are active at a time, your inference workload will be identical; only your hard drive traffic will increase.

        Which seems to be the case, seeing how hungry the industry has lately been for hard drives.
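
        A toy top-k MoE layer makes the point concrete (shapes invented): per-token compute is k expert matmuls regardless of how many experts exist; only parameter storage grows with the expert count.

          import numpy as np

          def moe_layer(x, experts, router, k=2):
              scores = router @ x                  # route this token
              active = np.argsort(scores)[-k:]     # pick top-k experts
              # Only k matmuls run per token, no matter len(experts).
              return sum(experts[i] @ x for i in active) / k

          d = 64
          x = np.random.randn(d)
          for n_experts in (8, 1000):              # same per-token FLOPs
              experts = [np.random.randn(d, d) for _ in range(n_experts)]
              router = np.random.randn(n_experts, d)
              moe_layer(x, experts, router, k=2)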

    • bakugo 18 days ago

      GPT-4 was likely much larger than any of the SOTA models we have today, at least in terms of active parameters. Sparse models are the new standard, and the price drop that came with Opus 4.5 made it fairly obvious that Anthropic are not an exception.

    • logicprog 18 days ago

      Wasn't GPT-4 the model that was so expensive for OpenAI to run that they basically completely retired it in favor of later models, which became much stronger but weren't as expensive for them to run?

  • DanielHall 18 days ago

    Comparing open-source models like Qwen against Anthropic's models is absolutely foolish. First of all, Anthropic has never disclosed the actual parameter count or architecture of their models. Second, it's well known that these open-source models more or less distill from other models and use MoE, which allows them to run at much lower computational costs. Using Qwen as a comparison point only proves the blog post author is foolish. The article devoted such a large portion to discussing Qwen on OpenRouter that I find it hard to believe.

    • yorwba 18 days ago

      Anthropic is obviously also aware of the benefits of MoE and distilling a larger model into a smaller one, so they could run a model of the same size as Alibaba's for the same inference cost if they want to. Or they can run a slightly larger model for slightly higher cost. They definitely aren't running a much larger model (except potentially as a teacher for distillation training) because then they wouldn't be able to hit the output speeds they're hitting.

      • hirako2000 17 days ago

        They are fully aware, but they're playing a different game. R&D isn't something where you flip a parameter and get what the efficiency-oriented pipelines produce.

        Chinese models were built under constraints. As we know, limitations lead to innovation. So the "Chinese" R&D invested in optimisations. Teacher models were already there, so they likely built the best distillation processes, along with the best MoE. They actually published many of these works.

        Nuance, sure. Anthropic/OpenAI could revise their philosophy to adopt efficiency.

        But momentum shouldn't be underestimated. Plus, dollars per optimisation is different math altogether; it's not only about access to the latest Nvidia GPUs. At $400k a year per engineer, plus health coverage and pension contributions, hardware efficiency doesn't weigh as much as making sure engineering focuses on... the raw power factor, I suppose.

        • yorwba 17 days ago

          Every company is subject to constraints. A bigger budget is not an infinite budget. And there is no tradeoff between efficiency and raw power. An optimization that lets you build a similarly powerful model for less money also lets you build a more powerful model for the same amount of money.

          Honestly, I wonder what you think closed LLM companies do R&D on if not optimizations. And the nature of research is that most ideas that sound good turn out duds, so they already need to have an established process for testing many ideas quickly. Now if somebody publishes a new idea they haven't tried yet, setting up an experiment to try it out is just a routine task... But they aren't going to tell anybody the results, just quietly integrate it if it works.

          • hirako2000 16 days ago

            I concede we can't be sure what they do, since it's proprietary. Aside from leaks, which give us a sense of the philosophy.

            It's clear to me the economics would make the likes of OpenAI and Anthropic focus on raw power over optimisations. I never meant they wouldn't optimise anything, but the returns diminish earlier than for a company like Alibaba, or even Mistral.

            The Chinese models were trained in a context of compute scarcity. So for them it wasn't a matter of "routine" optimisations; it was optimisations or nothing.

            A year or two later, those optimisations allowed their models to be somewhat on par with the raw-power models from the US providers.

            Now, despite papers being published, a design is rather sticky; it's not as simple as plugging in an optimisation another lab came up with. It depends on the optimisation: perhaps multi-head wasn't that big of a deal to add in; MoE would have been less easy.

  • tom_m 17 days ago

    No they wouldn't. They have tons of funding. They absolutely can and do absorb costs like this. Don't think anyone is ever gonna tell you precise numbers (and it also varies based on workload of course)...but this is literally the business model of AI providers.

    Their goal (similar to Uber, DoorDash, Robinhood, etc.) is to get mass adoption. Their business models only work at this kind of scale.

    It's completely impossible to have consumers pay $20-60/mo and be a profitable business without mass adoption, where some use it less than others... and, perhaps more importantly, where the masses put pressure on their employers to pay for their tooling. This is why pricing does not need to come down.

    Quite literally I have engineers spending over $1,000/mo on Opus. That's the goal.

  • Havoc 18 days ago

    > Plus who knows what open routed providers do in term quantization

    The quantisation is shown in the provider section.

  • grayxu 18 days ago

    Actually, Opus might achieve a lower cost with the help of TPUs.

  • simianwords 18 days ago

    >It is not. It's a terrible comparison. Qwen, deepseek and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.

    I find it a good comparison because it's a useful baseline, since we have zero insider knowledge of Anthropic. It gives me an idea of the cost associated with a model of a certain size.

    I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect. Current Qwen models perform about as well as Sonnet 3, I think. Two years later, when Chinese models catch up with enough distillation attacks, they'll be as good as Sonnet 4.6 and still be profitable.

    • coldtea 18 days ago

      > I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.

      Define "much worse".

        +-------------------------------+-------------+----------+------------------+
        | Benchmark                     | Claude Opus | DeepSeek | DeepSeek vs Opus |
        +-------------------------------+-------------+----------+------------------+
        | SWE-Bench Verified (coding)   | 80.9%       | 73.1%    | ~90%             |
        | MMLU (knowledge)              | ~91         | ~88.5    | ~97%             |
        | GPQA (hard science reasoning) | ~79–80      | ~75–76   | ~95%             |
        | MATH-500 (math reasoning)     | ~78         | ~90      | ~115%            |
        +-------------------------------+-------------+----------+------------------+

      • Filligree 18 days ago

        Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

        Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.

        • ACCount37 17 days ago

          One of the main edges Anthropic has is that "personality tuning" gap. "Nice to use" is a differentiator when raw performance isn't.

          OpenAI can sometimes get an edge over Anthropic in hard narrow STEM tasks. I trust benchmarks over vibes there - and the benchmarks show the teams trading blows release after release. Tracking Claude Code vs OpenAI Codex on SWE-bench Verified feels like watching the back alley knife fight of the AI frontier.

          But the vibe of "how easy is that model to interact with" and "how easy it is to get it to do what you want it to" does matter a lot when you are the one doing the interacting. And Opus makes for a damn good daily driver.

        • cmrdporcupine 17 days ago

          At this point it's frankly not a fair comparison since DeepSeek 3.2 is now many months old and we're waiting for a newer model which has been rumoured as "any day now" since February. (We'll see).

          GLM5, the largest Qwen 3.5 model, and Kimi K2.5 are more fair comparisons, though they are, yes, a bit behind. They're more than capable for routine operations though.

          Anyways, I'm back to using Opus & Claude Code after a month on Codex/GPT5.3 and 5.4 and it's frankly a rather obvious downgrade. Anthropic is behind OpenAI at this point on coding models, and there's nothing to say they couldn't fall behind the Chinese models as well.

          The moat is very shallow. After the events of the last two weeks there's likely a significant % of international capital very interested in breaching it. I know I would like to see this... Anthropic basically said F U to any non-Americans, and OpenAI is ... yeah.

        • torginus 17 days ago

          Dunno, I was using Cursor today and for some reason it decided to switch to GPT 5.3 at some point, and I didn't even notice. I was sure that Opus was much better, but who knows?

        • coldtea 18 days ago

          >Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

          Ah, the "trust me bro" advantage. Couldn't it just be brand identity and familiarity?

          • vidarh 18 days ago

            I have a project where we've had Opus, Sonnet, DeepSeek, Kimi, and Qwen create and execute about 350 plans in aggregate so far. The quality difference, measured in plans where the agent failed to complete the tasks on the first run, is big enough that the cheaper models' effective cost comes out several times higher than Anthropic's subscription prices, though probably cheaper than the API prices once we have improved the harness further. At present the challenge is that the human intervention the cheaper models need drives up the cost.

            My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.

            I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.

            I do strongly believe the moat is narrow - with 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months, once they actually start approaching the Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Anthropic's 3.x models in terms of real-world ability.

            That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.

            Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.

            • oren1531 17 days ago

              Good point on the green/red dashboard. The opportunity cost angle is worth adding though. A failed run isn't just the wasted tokens and retry cost - it's also the task that didn't get done and the engineering required to diagnose why. On anything time-sensitive, that compounds fast.

              • vidarh 17 days ago

                Exactly. At the moment it's close enough to be a wash for some cases, or tilts seriously in one direction or the other for others. I expect improved harnesses mean more and more we'll just be able to re-run a couple of times, and fall back to "escalating" to Sonnet or even Opus, but whenever it involves engineering time, that's a big deal.

            • Bombthecat 18 days ago

              In 12 months, Opus will be better than now and you still won't use it lol

              • vidarh 17 days ago

                I still won't use what? I use Opus now, and I will use Opus then too, but as I clearly stated:

                My default model has now dropped to Sonnet, because Sonnet can now do most of my tasks, and we already use Kimi, Deepseek, and Qwen.

                They're just not cost-effective enough to be my main driver yet. They are however cheap enough that for things where the Claude TOS does not let me use my subscription, they still add substantial value. Just not nearly as much as I'd like.

                The bulk of my tasks won't get harder as time passes, and so will move down the value chain as the cheaper models get better.

                For the small proportion of my tasks that benefits from a smarter model, I will use the smartest model I can afford.

          • cesarvarela 17 days ago

            The harness makes a difference too.

      • yorwba 18 days ago

        Where are you getting those benchmark figures from? Math-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...

    • crooked-v 17 days ago

      > distillation "attacks"

      I find it really funny that anyone can call it this with a straight face when all the American models are based on heaps of illegally pirated books and TOS-breaking website scraping in the first place.

      • hirako2000 17 days ago

        Looters calling the police for being looted.

        Plus, Chinese-made distillation did good for the overall internet infrastructure. Millions of small Joes' WordPress websites, paid for and maintained out of goodwill, getting hammered by AI mining, versus making the VC-loaded pirates pay for what they seeded - I find the latter more fair.

  • lelanthran 18 days ago

    > That being said not all users max out their plan,

    These are not cell phone plans that the average Joe buys; they are plans purchased with the explicit goal of software development.

    I would guess that 99 out of every 100 plans are purchased with the explicit goal of maxing them out.

    • serial_dev 18 days ago

      I’m not maxing them out… I have issues that I need to fix, features I need to develop, and I have things I want to learn.

      When I have a feeling that these tools will speed me up, I use them.

      My client pays for a couple of these tools in an enterprise deal, and I suspect most of us on the team work like that.

      If my goal were to max out every tool my client pays for, I'd be working 24 hours a day and see no sunlight ever.

      I guess it’s like the all you can eat buffet. Everybody eats a lot, but if you eat so much that you throw up and get sick, you are special.

    • Ginden 18 days ago

      My employer bought me a Claude Max subscription. On heavy weeks I use 80% of the subscription. And among software engineers that I know, I'm a relatively heavy user.

      Why? Because in my experience, the bottleneck is in shareholders approving new features, not my ability to dish out code.

    • raihansaputra 18 days ago

      goal? yeah. but in reality just timing it right (starting a session at 7-8am, to get 2 sessions in a workday, or even 3 if you can schedule something at 5am), i rarely hit limits.

      if i hit the limit usually i'm not using it well and hunting around. if i'm using it right i'm basically gassed out trying to hit the limit to the max.

    • rustystump 18 days ago

      In SaaS this is not true. Most SaaS is highly profitable (or was, I suppose) because they knew that most of their customers would never max out their plans.

    • solumunus 18 days ago

      There’s absolutely no way that’s true.

overrun11 18 days ago

A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true and a lot of evidence that it isn't. It's just become a meme uncritically regurgitated.

This sloppy Forbes article has polluted the epistemic environment, because now there's a source to point to as "evidence."

So yes, this post author's estimate isn't perfect, but it is far more rigorous than the original Forbes article, which doesn't appear to even understand the difference between Anthropic's API costs and its compute costs.

  • mike_hearn 18 days ago

    I'd love to be a fly on the wall when this argument is tried in front of a bankruptcy court. It drives me nuts. Of course there's evidence that they're selling tokens at a loss.

    The only thing these companies sell are tokens. That's their entire output. OpenAI is trying to build an ad business but it must be quite small still relative to selling tokens because I've not yet seen a single ad on ChatGPT. It's not like these firms have a huge side business selling Claude-themed baseball caps.

    That means the cost of "inference" is all their costs combined. You can't just arbitrarily slice out anything inconvenient and say that's not a part of the cost of generating tokens. The research and training needed to create the models, the salaries of the people who do that, the salaries of the people who build all the serving infrastructure, the loss leader hardcore users - all of it is a part of the cost of generating each token served.

    Some people look at the very different prices for serving open-weights models and say: see, inference in general is cheap. But those costs are distorted by companies trying to buy mindshare by giving models away for free, and on top of that, both top labs keep claiming the Chinese are distilling them like crazy, including using many tactics to evade blocks! So apparently the cost of a model like DeepSeek is still partly being subsidized by OpenAI and Anthropic against their will. The cost of those tokens is higher than what's being charged; it's just being shifted onto someone else's books. Nice whilst it lasts, but this situation has been seen many times in the past, and eventually people get tired of having costs externalized onto them.

    For as long as firms are losing money whilst only selling tokens, that means those tokens are selling at a loss. To not sell tokens at a loss the companies would have to be profitable.

    • overrun11 18 days ago

      The article is about compute cost though. By "lose money on inference" I mean the assertion that inference has negative gross margins, which a lot of people truly believe. This is important because it's common to reason from this that LLMs are uneconomical and a ticking time bomb where prices will have to be jacked up several orders of magnitude just to cover the compute used for the tokens.

      • mike_hearn 18 days ago

        But there's no such thing as compute cost in the abstract. What exactly is compute cost for AI? Does it include:

        • Inference used for training? Modern training pipelines aren't just gradient descent, there's a ton of inference used in them too.

        • Gradient descent itself?

        • The CPUs and disks storing and managing the datasets?

        • The web servers?

        • The people paid to swap out failed components at the dc?

        Let's say you try and define it to mean the same as unit economics - what does it cost you to add an additional customer vs what they bring in. There's still no way to do this calculation. It's like trying to compute the unit economics of a software company. Sure, if you ignore all the R&D costs of building the software in the first place and all the R&D costs of staying competitive with new versions, then the unit economics look amazing, but there's still plenty of loss-making software startups in the world.

        Unit economics are a useful heuristic for businesses where there aren't any meaningful base costs required to stay in the game because they let you think about setup costs separately. Manufacturing toys, private education, farming... lots of businesses where your costs are totally dominated by unit economics. AI isn't like that.

        • overrun11 17 days ago

          Gross margins and cost of revenue are well defined accounting terms that apply to any type of business.

          > Does it include:

          > Inference used for training? Modern training pipelines aren't just gradient descent, there's a ton of inference used in them too.

          No because this is training and not inference. Just like how R&D costs for a drug aren't part of COGS either.

          > Gradient descent itself?

          No

          > The CPUs and disks storing and managing the datasets?

          Yes

          > The web servers?

          Yes

          > The people paid to swap out failed components at the dc?

          Yes to the extent they are swapping for inference and not training. If the same employees do both then the accountants will estimate what percent of their time is dedicated to each and adjust their cost accordingly.

          • mike_hearn 17 days ago

            We weren't talking about COGS, we were talking about "cost of compute", which isn't an accounting term.

            For the rest, anyone can define and apply an accounting metric but that doesn't mean it tells you anything useful. If you look at the unit cost of any typical IP business it's nearly zero. Yet, many companies lose money on making movies, video games, apps and books.

          • torginus 17 days ago

            I'm not familiar with accounting, but I suspect a lot of these cloud infrastructure companies don't throw out hardware for a very long time, just like AWS sells you its old stuff as white-label compute at a markup, behind which I think are mostly old pieces of hardware. As long as Anthropic keeps finding uses for the old GPUs, provided they don't break, it doesn't have to write off these assets, which means it doesn't incur costs using them, if it's clever with its books.

        • projektfu 17 days ago

          The marginal cost of the next token. That can include the power, the operating cost of the facility, repair costs, etc.

          The API price should hopefully incorporate the capitalized cost of the hardware, the facility rent, the cost to train the model, the r&d, cost of sales, etc., to make it profitable.

          Claude Code Max may be able to offer a good price by having a mix of higher and lower utilization of users and ignoring the fixed costs, treating it as a driver of API sales. But it doesn't make sense to essentially pay people to use it.

        • wasabi991011 17 days ago

          Your point that there are more relevant quantities to calculate for checking economic viability is fair, but that doesn't negate "cost of inference" being an interesting metric in itself.

    • emtel 17 days ago

      This comment defies common usage and accounting practices.

      When people say “selling at a loss” they mean negative unit economics. No one ever means this much more expansive definition you’ve invented.

    • SilverBirch 16 days ago

      What you are talking about isn't inference cost. Yes, fundamentally what matters is all the work that goes into the models, including R&D, training, and inference.

      But we talk about inference separately for a reason: inference cost is largely the scaling cost. Once you have a model, the margin on your inference is how you get to profitability; as long as your margin is positive, you can make the entire enterprise profitable by just selling more tokens. This is the same fundamental business that chip fabs work on. Yes, it costs them a lot to get to the next node, but what's important is the margin they can get on the wafers they sell, because they sell tonnes of wafers.

      It's pretty core to the concept of SaaS businesses that yes, you do consider all costs. But you want to focus on the margin of the bit that scales. This is why WeWork exploded: the thing they were scaling only scaled up at negative margin.

      The point is that if their inference margin is positive, they can "just" scale up and become profitable. If their inference margin is negative, then scaling up the business actually causes problems.
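
      Made-up numbers to make that concrete: with a positive per-token gross margin, fixed costs amortize away as volume grows; flip the margin negative and scale only deepens the hole.

        def annual_profit(tokens, price_per_mtok, cost_per_mtok, fixed):
            # inference gross profit minus fixed costs (training, R&D, SG&A)
            return tokens / 1e6 * (price_per_mtok - cost_per_mtok) - fixed

        FIXED = 5e9                             # $/yr, illustrative
        for vol in (1e12, 1e13, 1e14, 1e15):    # tokens served per year
            print(f"{vol:.0e} tok/yr: ${annual_profit(vol, 10, 3, FIXED) / 1e9:+.2f}B")
        # At a $7/MTok margin: -4.99B, -4.93B, -4.30B, +2.00B as volume scales.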

    • landl0rd 18 days ago

      Actually, you can slice out a lot of things. It's even a GAAP metric, i.e. one of the common baselines that public companies are required to report, known as gross margin: literally just (revenue - cogs) / revenue. It is distinct from net margin, but both are useful, and low gross vs net margin say very different things about the long-term prospects of the business.

    • jeremyjh 18 days ago

      This is all true, but it isn't really important for the argument people are making. What is more important is the marginal cost per token. If each token were sold at a marginal loss, their losses would scale with usage; that simply can't be happening with API pricing. But in general, yes, I agree with you, and I'm sure they are taking a huge loss on Claude Code.

      • mike_hearn 17 days ago

        It looks to me like their losses have scaled with usage, though? They keep predicting their losses will increase even as usage has gone stratospheric.

        • jeremyjh 17 days ago

          They are certainly making huge bets that are risky, and so yes on their P&L the L are scaling. That doesn't say anything at all about their marginal inference cost.

    • howmayiannoyyou 18 days ago

      You're missing costs.

      - Amortized training costs.

      - SG&A.

      - Capex depreciation.

      All the above impact profitability over various time horizons and have to be rolled into present and projected P&L and cash flow analysis.

      • ACCount37 18 days ago

        We have amortized training cost estimates. Inference-to-training compute over a model's lifetime is 10:1 or more for major models at major providers.

        In part due to base model reuse and all the tricks like distillation. But mainly, due to how much inference the big providers happen to sell.

        So, not the massive economic loss you'd need to push models away from being profitable. Capex and R&D take the cake there.
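
        A toy amortization under that 10:1 claim (dollar figures invented): spreading the training run over the tokens it serves adds only ~10% to the per-token compute cost.

          training = 1e9      # $ for the training run (illustrative)
          inference = 10e9    # $ of lifetime serving compute, per the 10:1 ratio
          tokens = 1e15       # tokens served over the model's lifetime
          overhead = ((inference + training) / tokens) / (inference / tokens) - 1
          print(f"training overhead per token: {overhead:.0%}")   # 10%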

    • trillic 17 days ago

      I don't think you are an accountant.

    • oneneptune 17 days ago

      One very minor note: Anthropic and others, like most "enterprise" solutions, also sell SSO + SCIM + audit logs. Their business plans have lower token limits and higher prices to cover the enterprise features, which should be essentially free to provide in 2026.

    • infecto 18 days ago

      It depends how we are looking at the business. Absolutely, at the end of the day a company is profitable or not, but when thinking about inference, which is largely a commodity these days, you would first think about its marginal cost. That is the cornerstone of the business. We have a pretty clear indication that API tokens are largely being sold above marginal cost. Especially for a brand-new business, that's critical, and something many unicorns never even hit.

      You're right that all the other costs are critical to measuring the profitability of the business, but for such a young industry that's the unknown. Does training get cheaper? Do we hit a theoretical limit on training? Are there further optimizations to be had?

      You don't take on large capex in an industrial business and then in year one argue that the business is doomed when you're selling the product above marginal cost but haven't yet recouped the costs that have been capitalized.

  • bodge5000 18 days ago

    > A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true

    There's quite a lot of evidence. No proof, I'd agree, but then there's no absolute proof to the contrary that I'm aware of either, so I don't know where you're getting this from.

    The two pieces of evidence I'm aware of are that 1) Anthropic doesn't want their subsidised plans being used outside of CC, which would imply that the money they're making off them isn't enough, and 2) last time I checked, API spending is capped at $5,000 a month.

    Like I say, neither of these is proof; you can come up with reasonable arguments against them, but once again the same could be said for evidence to the contrary.

    • overrun11 18 days ago

      > which would imply that the money their making off it isn't enough

      I don't think this logically follows. An unlimited buffet doesn't let you resell all of the food out the backdoor. At some level of usage any fixed price plan becomes unprofitable.

      I agree the 5k cap is interesting as evidence although as you said I suspect there are other reasons for it.

      As for evidence against it: The Information reported that OpenAI and Anthropic have had 30%+ gross margins for the last few years. Sam Altman and Dario have both claimed inference is profitable in various scattered interviews. Other experts seem to generally agree too. A quick search found a tweet from former PyTorch team member Horace He: https://x.com/typedfemale/status/1961197802169798775 and a response to it in agreement from Anish Tondwalkar, former researcher at OpenAI and Google Brain.

      • IsTom 18 days ago

        I get the other things, but believing Altman's words is not high on the list of things to be considered evidence.

        • bodge5000 17 days ago

          Nor Dario's, frankly; I was supposed to be out of a job by now according to his predictions over the years. I can totally buy that profitable inference is possible, but not because they said it is.

    • Majromax 17 days ago

      > 1) Anthropic doesn't want their subsidised plans being used outside of CC, which would imply that the money they're making off them isn't enough

      Claude Code use-cases also differ somewhat from general API use, where the former is engineered for high cache utilization. We know from overall API costs (both Anthropic and OpenRouter) that cached inputs cost an order of magnitude less than uncached inputs, but OpenCode/pi/OpenClaw don't necessarily have the same kind of aggressive cache-use optimizations.

      Vertically integrated stacks might also be able to have a first layer of globally shared KV cache for the system prompts, if the preamble is not user specific and changes rarely.

      > 2) last time I checked, API spending is capped at $5000 a month

      Per https://platform.claude.com/docs/en/api/rate-limits, that seems to only be true for general credit-funded accounts. If you contact Anthropic's sales team and set up monthly invoicing, there's evidently no fixed spending limit.

      • bodge5000 17 days ago

        > If you contact Anthropic's sales team and set up monthly invoicing, there's evidently no fixed spending limit.

        I don't think that's a smoking gun either. For a start, we don't know if the pricing would be the same as you'd get credit-funded, but also a monthly invoicing agreement is closer to their fixed plans (you spend X per month, regardless of usage) than to pay-per-use API credits, which may not be profitable.

        Not that that's a smoking gun either; I can see it both ways.

    • BoredomIsFun 18 days ago

      But a simple assumption that Anthropic runs a normal large MoE LLM (which it almost certainly does) suggests that the actual price of running it (mostly energy) is pretty small.

  • davewritescode 18 days ago

    > A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true and a lot of evidence that it isn't.

    I think it’s fairly obvious that Anthropic is lighting cash on fire and focusing on whether or not they’re losing money per token on inference is missing the forest for the trees.

    Tokens become less valuable when the models aren’t continuously trained and we have zero idea what Anthropic is paying for training.

  • barrell 18 days ago

    Does this not count as evidence? I would agree that it sounds a little shaky, but I would not say there is no evidence.

    https://www.wheresyoured.at/oai_docs/

  • infecto 18 days ago

    They are, and they are convinced the cost is not truly baked in because you need to factor in all the training and R&D. It's a mixture of folks that 1) are convinced AI is terrible, 2) hate Sam Altman, and 3) don't understand how businesses price products.

    We don't have clear evidence either way, but it heavily leans toward API pricing at least covering inference cost. Models these days have less and less differentiation, and for API use there must be some thought given to competing on cost, but it's not going to be winner-take-all. They leapfrog each other with each new model.

  • bob1029 18 days ago

    I think wafer-scale compute is a massive deal. It's already being leveraged for models you can use right now, and the reception on HN has been negligible. The entire model lives in SRAM, which is orders of magnitude faster than HBM/DRAM. I can't imagine they couldn't eventually break even using hardware like this in production.

  • pier25 17 days ago

    Nobody really knows, but the simple fact is these companies are not making any profit. Far from it.

elbasti 17 days ago

"Any conversation about token costs devolves into an ad-hoc, informally-specified, bug-ridden implementation of half of generally accepted accounting principles."

We have a way of determining if Anthropic is, or has the capability of being profitable, and what the levers to that may be. AI may be world-changing, but the accounting principles behind AI labs are no different than those behind a Pizza Hut.

Even if the cost of "inference + serving" is lower than the cost of selling a token, the relevant question is what is the depreciation schedule of the cost of training. ie, if I spend $1 on training, how long do I have before I have to spend $1 again?

Almost certainly, any reasonable depreciation schedule of the cost of training will result in leading labs being presently wildly unprofitable. So the question is:

What can be done to make training depreciate more slowly? Perhaps users can be persuaded to stick around using non-frontier models for longer, although then there's a shift in the competitive landscape.

If users cannot be persuaded (forced?) to use legacy models, then the entire business model is thrown into question, because there's no reason why training frontier models would ever get cheaper: even if it gets cheaper on the margin, surely that will result in more compute used to generate an even "better" model, resulting in more spend in the aggregate.

This doesn't mean that the AI industry is "doomed". A couple of things could happen, and this is where the frontier labs should be focusing their attention:

1. They could find a way to climb up the value chain and capture more of the consumer surplus.

2. There could be a paradigm shift in compute architecture/compute cost.

3. We could reach a limit of marginal utility, shifting consumption to legacy models, thereby lengthening the depreciation/utility of training.

Edit: My assertion of "Almost certainly, any reasonable depreciation schedule of the cost of training will result in leading labs being presently wildly unprofitable." is made with no real information, just a gut feeling, and should not be taken seriously.

  • nr378 17 days ago

    Dario has made a specific cohort argument here. His numbers (from various interviews) are: you train a model in 2023 for $100M, deploy it, and it earns $200M over its lifetime. Meanwhile you train the 2024 model for $1B, which goes on to earn $2B. Each vintage returns 2x on its training cost.

    However, the GAAP P&L tells the opposite story. You book $200M revenue in the same year you spend $1B training the next model, so you report an $800M loss. Next year you book $2B against $10B in training spend, reporting an $8B loss. The business looks like it's dying when every individual model generation actually generates a healthy profit.

    That's actually Dario's answer to your depreciation question. If each cohort earns back its training cost within its natural lifespan (however short that lifespan is), the depreciation schedule is already baked in. The model doesn't need to live forever, it just needs to return more than it cost before the next one replaces it. Whether that's actually happening at Anthropic is a different question, and one we can't answer without audited financials, but it's the claim Dario makes (and seems entirely reasonable from a distance).
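
    The cartoon in code form, using exactly the numbers above (2x return per vintage, 10x cost growth per year, all illustrative): each vintage is profitable on its own, yet the annual P&L shows widening losses because each year you expense the next, 10x-larger training run.

      # Each vintage costs 10x the previous and returns 2x its own cost,
      # with the revenue landing the year after the training spend.
      train = {2023: 0.1e9, 2024: 1e9, 2025: 10e9}
      revenue = {year + 1: 2 * cost for year, cost in train.items()}

      for year in (2024, 2025):
          pnl = revenue[year] - train[year]
          cohort = 2 * train[year - 1] - train[year - 1]
          print(f"{year}: GAAP P&L ${pnl / 1e9:+.1f}B, yet the {year - 1} "
                f"model alone nets ${cohort / 1e9:+.1f}B over its life")
      # 2024: -$0.8B reported, +$0.1B cohort; 2025: -$8.0B, +$1.0B cohort.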

    • calvinmorrison 17 days ago

      GAAP doesn't really work here. The R&D treadmill means you are always betting on next year, and it's NOT inventory or something you can defer your cost on. It's an upfront R&D expense.

      So what happens in year 10 when Anthropic spends $10B on training and only returns $8B? They're cooked.

      • Verdex 17 days ago

        Yeah, that's kind of what I'm wondering about.

        It's an interesting story about how even though all metrics show massive losses actually they have massive gains.

        Accounting is a rather mature field, so I figure that someone in the past has tried this stunt and there should probably be ways for dealing with it.

        Or do they always flame out after losing all the money? Knowing the history here would be informative.

    • elbasti 17 days ago

      If those numbers are correct, then my assertion that "Almost certainly, any reasonable depreciation schedule of the cost of training will result in leading labs being presently wildly unprofitable." is incorrect.

      And I admit that I made that assertion from my gut without actually knowing if it's true or not.

    • lovich 17 days ago

      If you have to continually spend greater amounts of money to keep up with the competition on every new model, then it is dying.

      Every single time a company comes around and goes "actually, GAAP is wrong, look at my new math that says we're good," it's led to much wailing and gnashing of teeth in the future when it inevitably isn't.

    • Verdex 17 days ago

      That's an interesting idea. I'm curious, though, are there any other industries and/or companies that have tried to pull this sort of thing off? And what ultimately happened to them?

      • stusmall 17 days ago

        Enron had a system like this. They regularly worked on large, long-term contracts that became profitable over years/decades. They wanted to pull profits forward, so they would estimate the total value of the contract and book the profit when it closed. Mark-to-market accounting wasn't unheard of at the time, but using it for assets without an active market was unique. Without a market to mark against, the numbers were best-guess projections.

        The problem is everyone along the line is incentivized to be aggressive with estimates (sales commissions are bigger, public financials look better) and discouraged from correcting the estimates when they go wrong.

        Estimating multi-year returns on frontier models looks harder than estimating returns on oil and gas projects in the 90s.

      • yunwal 17 days ago

        The bar for "wildly unprofitable" has risen quite a bit since then, but Amazon basically pioneered this.

    • kikimora 17 days ago

      Why would anyone use the $200M model when the $1B model is available? The company increases its bet with each iteration, increasing risk. It blows up at some point because it cannot guarantee a $2B return on a $1B investment.

      On the GAAP point: $200M or $1B or $10B is not a loss but cash converted into an asset. It won't affect the bottom line at all, unless the company revalues the asset and says it's now worth $1M instead of $200M. That would hit the bottom line.

    • skybrian 17 days ago

      If you can remember where you read it, could you share a link?

      • Avshalom 17 days ago

        https://youtu.be/GcqQ1ebBqkc?t=1027 is one such, but he doesn't actually say that each model has been profitable.

        He says "You paid $100 million and then it made $200 million of revenue. There's some cost to inference with the model, but let's just assume in this cartoonish cartoon example that even if you add those two up, you're kind of in a good state. So, if every model was a company, the model is actually, in this example is actually profitable. What's going on is that at the same time"

        Importantly, you'll notice that he's talking about revenue, and assumes that inference is cheap/profitable enough that 100M + Inference_Over_Lifetime < 200M.

  • root_axis 17 days ago

    > They could find a way to climb up the value chain and capture more of the consumer surplus.

    Yes, this is exactly why OpenAI and Anthropic are hyping AGI. If LLMs ever become good enough to replace workers, the first sign will be frontier model companies launching competitor businesses. It doesn't make sense to sell the formula for gold when you can just use it yourself.

    > There could be a paradigm shift in compute architecture/compute cost.

    Possible, but no signs of this on the horizon. If it does happen, it's impossible to predict when it will.

    > We could reach a limit of marginal utility, shifting consumption to legacy models, thereby lengthening the depreciation/utility of training.

    I'm not sure market dynamics will allow this any time soon. We seem to have already achieved a marginal utility equilibrium in terms of model size, so training new models on trending use-cases (e.g. synthetic data targeting tool calls, agentic workflows, computer use, etc) is really the driving force behind product differentiation. Nobody wants to admit "training new models isn't profitable" because that deflates the AGI singularity narrative that all this investment hinges on.

  • fritzo 17 days ago

    I'm not an accountant, but I would expect Pizza Hut's accounting to be significantly more complex than Anthropic's. A 50+ year old global franchise with physical supply chain partnerships vs an upstart SaaS company?

    • jchallis 17 days ago

      Your instincts are good here. Whatever complexity Pizza Hut has comes from being the weakest of the Yum! Brands siblings — KFC carries the international profit, Taco Bell owns domestic. Pizza Hut is slow growth, perpetual restructuring, and a weird inherited obligation to always serve Pepsi.

  • skybrian 17 days ago

    > Almost certainly, any reasonable depreciation schedule of the cost of training [...]

    Maybe not? This is an argument that has to be made using numbers. We can't do the estimate without the numbers.

    • elbasti 17 days ago

      This is correct. I regret that assertion and have added a comment reflecting that.

  • benlivengood 17 days ago

    The world labor market is ~$35T per year, so that is roughly the order of magnitude to balance against frontier model training cost. E.g. Dario Amodei has his "data center of PhDs" level, where he assumes that's "good enough" to stop training frontier models; if that can take even 5% of the global labor market, that's ~$1.5T a year in revenue, balanced against current model training costs of ~$1B. Three orders of magnitude might get us to PhD level? I think that is ultimately the bet the big AI companies are making. Even if $1T is the cost of PhD-level AI, three/four companies could depreciate that over 4-5 years while sharing that 5% of the global market.

    • freejazz 17 days ago

      > The world labor market is ~35T USD yearly, and so that is roughly the order of magnitude to balance against frontier model training cost.

      Crazy that people can write sentences like this with a straight face these days.

  • lokar 17 days ago

    Of course a model does not really depreciate, the problem is they are forced by competitive pressure to offer newer/better models at the same price.

    This is what the elites of the gilded age called "ruinous competition", and the solution today will be the same as back then: monopoly power. This has been the business plan of the tech VC industry for 25+ years.

    • lovich 17 days ago

      Do they not depreciate?

      The models don't learn without training, and they have finite context windows. As software updates around the world, don't they have to be trained on the new information to stay up to date?

      • lokar 17 days ago

        Fair, but in this context people are generally contemplating the need to replace the model with a new, much larger and more expensive model, not just refresh the training set.

        It's partly about updating what it "knows", but more about keeping up with competitive pressure on capabilities.

        • lovich 17 days ago

          I’m actually not familiar enough to know. Can models be refreshed for cheaper? I thought due to the black box nature of them that there would be no difference between updating and generating a whole new model.

          Maybe they can get to a "good enough" level where the next model isn't 10x the price, but if the business model requires ever-increasing sizes to paper over the R&D costs from the previous set, then I don't understand how they would transition to profitability.

        • freejazz 17 days ago

          People? There's a guy upthread quoting the Anthropic CEO on how they view the value of increasing training against the offset of the entirety of the $35T worldwide labor market... It's not "people". It's the salesmen.

anonzzzies 18 days ago

I calculated only last weekend that my team would cost, if we ran Claude Code at retail API prices, around $200k/mo. We pay $1,400/month in Max subscriptions. So that's $50k/user... But from the tokens CC reports in its JSON, a lot of this must be cached etc., so I doubt it costs anywhere near $50k, but I'm not sure how to figure out what it would cost and I'm sure as hell not going to try.

  • scandox 18 days ago

    I'm fascinated to know the kind of work that lets you intelligently allocate so many resources. I use Claude extensively and feel that I get great value out of it, but it seems I reach a limit on what I can sensibly do with it relatively quickly.

    • lukan 18 days ago

      Same for me, but I suppose it means letting agents loose more, checking the code less, and throwing away lots of the generated output.

  • sva_ 18 days ago

    Gemini CLI shows how much was saved through caching each session, and it's usually somewhere around 90%

  • neamar 18 days ago

    You can use `npx ccusage` to check your local logs and see how much it would have cost through the API.

  • tcbrah 18 days ago

    yeah the json token counts are super misleading. i run a bunch of claude agents for automation and like 85% of input tokens end up being cached reads - which cost 1/10th of the sticker price. so your $200k number is probably closer to $25-30k in real cost, and that's before you factor in that anthropic's own infra is way cheaper than retail API pricing. the $5k forbes number was always nonsense, but even the "corrected" estimates in TFA are probably still too high IMO
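
    a rough sanity check on that multiplier (crudely treating the whole bill as input tokens; every number here is an assumption):

        naive_bill = 200_000      # $/mo at sticker API prices
        cached_share = 0.85       # share of input tokens served as cached reads
        cache_discount = 0.10     # cached reads at ~1/10th the sticker input price
        effective = naive_bill * ((1 - cached_share) + cached_share * cache_discount)
        print(effective)          # ~$47k at retail; internal costs below retail
                                  # plausibly pull that toward the $25-30k range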

    • kleton 17 days ago

      I proxy all of my llm completion subscriptions. In a typical 7d span-

          model            completions  read      write    cached_read  cache_write
          claude-opus-4-6  11000        16900000  5840000  1312000000   66120000

      • tgrowazay 17 days ago

        17M uncached reads (input) and 6M of uncached writes (output) is

          $5x17+$25x6=$235 for Opus 4.6
        
          $2x17+$12x6=$106 for Gemini 3 Pro
        
          $0.60x17+$3.6x6=$31.80 for Qwen3.5 397B-A17B via Huggingface API

        • kleton 15 days ago

          You did not add up cache writes, which are $6.25 / MTok, which is another ~$400
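
          Putting the whole 7-day span together (a sketch assuming the prices quoted in this thread: $5/$25 per MTok in/out, cached reads at 1/10th the input price, cache writes at $6.25/MTok):

            reads        = 16.9e6 * 5.00 / 1e6    # ~$84.50  uncached input
            writes       = 5.84e6 * 25.00 / 1e6   # ~$146.00 output
            cache_reads  = 1312e6 * 0.50 / 1e6    # ~$656.00 cached reads
            cache_writes = 66.12e6 * 6.25 / 1e6   # ~$413.25 cache writes
            print(reads + writes + cache_reads + cache_writes)  # ~$1,300 for the week

          That's roughly $5-6k/month at sticker prices for this single subscription.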

  • jychang 18 days ago

    > but not sure how to figure out what it would cost and I'm sure as hell not going to try.

    Ask Opus to figure out how much it would cost. Lol.

  • aweb 18 days ago

    I'm surprised, isn't it forbidden to use the Max plan as part of a company? Just curious, as I thought it was forbidden by the ToS but I'm not sure if I have a good understanding of it

    • ffsm8 18 days ago

      There was nothing in the ToS, last time I checked, forbidding its use with Claude Code. It's only forbidden to use it in the running of the business.

      So getting Claude Code subscriptions for developers should be permissible and not be against anything... However, if you created a REST endpoint to e.g. run a preconfigured prompt as part of your platform, that would be against it.

      But I'm neither a lawyer nor do I work for Anthropic.

      • anonzzzies 18 days ago

        Ah, that makes sense. I hope they mean that then. We are just devs using it to write code; not selling it on.

      • ValentineC 17 days ago

        > It's only forbidden to utilize it in the running of the business.

        Sorry, but could you clarify what this means?

        • ffsm8 17 days ago

          The paragraph that followed was my clarification of that...

          Expressed differently: are you an individual using an official Anthropic application interactively? You're fine.

          Are you using it unattended, without an individual holding the reins? You should probably talk with a lawyer about whether that's permissible.

          Again, IANAL, nor do I work for Anthropic.

    • alex_c 18 days ago

      ?

      Claude Code has a Teams plan which includes Max tiers. Why would it be forbidden?

    • sunaurus 18 days ago

      Surely that can't be true? The expectation would be that people pay $200 a month for building open source and personal hobby software with Claude?

      • anonzzzies 18 days ago

        Yeah, that would end things really quickly. I use Pro for personal stuff. If the $200 plan weren't allowed for companies, I don't think anyone would use it, at all.

    • quikoa 18 days ago

      If they believe a sufficient number of users is locked in, then they may consider forbidding it later.

    • ValentineC 17 days ago

      > Just curious, as I thought it was forbidden by the ToS but I'm not sure if I have a good understanding of it

      Could you quote the relevant part that you think forbids it for us?

    • bloppe 18 days ago

      If that were true, then everyone I know is violating that tos

    • KptMarchewa 18 days ago

      Most companies forbid it though, since you're not covered by any legal protection - for example, Anthropic can use your data or code to train new models and more.

      • bdangubic 18 days ago

        This may have been the case a year-plus ago, but it's no longer true. It used to be most; now it's some/few.

        • itintheory 17 days ago

          Any references on this? I hear this argument a lot. In fact, in a talk on AI last week I heard someone say:

          "If you click the thumbs up button to rate a chat, the AI provider will use the contents for training, so our company's policy is never to click the thumbs up button"

          That seemed so farcical I had a hard time taking this person seriously. Enterprise plans must give some strong guarantees around data usage, right?

          • bdangubic 17 days ago

            Obviously I can speak only from my personal experience, but I alone have 5 examples of companies that were "no AI, IP and all that" and are now full-on "every developer must use CC, Cursor…"

            How many companies today don't have an "AI strategy" and aren't fearing being left behind? In my small circle we went from "most are not using AI" to "none are not using AI" in a fairly short period of time.

          • addandsubtract 17 days ago

            This is why most businesses only have ChatGPT subscriptions. Plus their integration into existing Microsoft products and billing.

            • bdangubic 17 days ago

              Trusting Microsoft seems like a right move /s

              • ziml77 17 days ago

                Microsoft already has all their business data in the form of handling document storage and email. Trusting another of their services to also not use that data for Microsoft's own purposes is reasonable.

osener 18 days ago

> Cost remains an ever present challenge. Cursor’s larger rivals are willing to subsidize aggressively. According to a person familiar with the company’s internal analysis, Cursor estimated last year that a $200-per-month Claude Code subscription could use up to $2,000 in compute, suggesting significant subsidization by Anthropic. Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company’s compute spend patterns.

This is the relevant quote from the original article.

eaglelamp 18 days ago

If Anthropic's compute is fully saturated, then the Claude Code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.

Anthropic's models may be similar in parameter size to models on OpenRouter, but none of the others are in the headlines nearly as much (especially recently), so the comparison is extremely flawed.

The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch based on gear count.

  • d1sxeyes 18 days ago

    But opportunity cost is not actual cost. “If everyone just kept paying but used our service less we would be more profitable” is true, but not in any meaningful way.

    Are Anthropic currently unable to sell subscriptions because they don’t have capacity?

    • mike_hearn 17 days ago

      The opportunity cost isn't selling subscriptions, the cost is the gap between what they could sell the GPU time for via their API vs what they're selling it for in a flat rate subscription. If you assume API demand is unlimited and GPU supply is fixed, then the opportunity cost is the 'real' loss of revenue that comes from redirecting supply away from customers willing to pay more to customers willing to pay less.

      • d1sxeyes 17 days ago

        > If you assume API demand is unlimited

        Doing a lot of heavy lifting here. Not everyone on a subscription plan would convert to a 200USD/mo API consumer.

        • mike_hearn 17 days ago

          No, but demand comes from a lot of sources beyond vibe coders.

          • d1sxeyes 17 days ago

            True, but you need to run the optimisation to find the point X at which the increased revenue (at lower margin) from Pro/Max subscribers outweighs the reduced revenue (at higher margin) from API consumers, and adjust your pricing to approach X.

            I'm sure that Anthropic have some very smart folks on that right now.

    • eru 18 days ago

      Opportunity costs are real. In many cases they are more real than 'actual costs'. However, I otherwise agree with you.

    • MaxikCZ 18 days ago

      > Are Anthropic currently unable to sell subscriptions because they don’t have capacity?

      Absolutely! I'm currently paying $170 to Google to use Opus in Antigravity without limits in full agent mode, because I tried Anthropic's $20 subscription and busted my limit within a single prompt. I'm not gonna pay them $200 only to find out I hit the limit after 20 or even 50 prompts.

      And after 2 more months my price is going to double to over $300, and I still have no intention of even trying the 20x Max plan, if it's really just 20x more prompts than Pro.

      • dtech 18 days ago

        This has absolutely nothing to do with whether they're limited by available compute...

        • MaxikCZ 18 days ago

          What? Wouldn't they give me more than 1 prompt of compute for my $20, if they had spare?

          • esrauch 18 days ago

            I don't think that logically follows.

            They have a business model and are trying to capture more revenue; fully saturating your compute isn't obviously a good business strategy.

      • cicko 18 days ago

        If anything, you are confirming that $170 covers heavy Opus use profitably for the provider.

  • Aeolun 18 days ago

    Opportunity cost is not the same thing as actual cost. They might have made more money if they were capable of selling the API instead of CC, but I would never tell my company to use CC all the time if I didn’t have a personal subscription.

    • eaglelamp 18 days ago

      You’re looking through the wrong end of the telescope. An investor is buying opportunity and it is a real cost to them.

      • kaliqt 18 days ago

        That still makes no sense, as they'd lose revenue, data, and scale if they didn't subsidize.

  • bob1029 18 days ago

    > If Anthropic's compute is fully saturated, then the Claude Code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.

    I think it's the other way around? Sparse use of GPU farms should be the more expensive thing. Full saturation means that we can exploit batching effects throughout.

    • eaglelamp 17 days ago

      If they have spare capacity then there is no opportunity cost to selling $100 subscriptions for exactly that reason. If they don’t have spare capacity then, at the margin, they could replace a subscription user with API calls that make them $5000: that’s opportunity cost.

      If you own equity in Anthropic you should care about that cost. Maybe you are willing to tolerate it to win market share, but for you to make the most profit you need that cost to shrink.

  • nottorp 18 days ago

    You know who also loves to use the term "opportunity cost"?

    The entertainment industry. They still tell you about how much money they're leaving on the table because people pirate stuff.

    What would happen in reality for entertainment is people would "consume" far less "content".

    And what would happen in reality for Anthropic is people would start asking themselves if the unpredictability is worth the price. Or at best switch to pay as you go and use the API far less.

  • KronisLV 18 days ago

    Don’t give them any ideas, please! I need my 100 USD subscription with generous Opus usage!

    • eru 18 days ago

      Google's Antigravity has Opus access, and I suspect it's subsidised.

  • YetAnotherNick 18 days ago

    You can rent the GPUs and everything needed to run the model. Opportunity cost is not a real cost here.

    The only thing that matters is whether the users would have paid $5000 if they didn't have the option to buy a subscription. And I highly doubt they would have.

  • the_gipsy 18 days ago

    I prefer car analogies

  • NooneAtAll3 18 days ago

    > The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch based on gear count

    I mean... Rolex is an overpriced brand whose cost to consumers is mainly marketing. Its production cost is nowhere close to its selling price, and looking at gears is a fair way of evaluating that.

    • fragmede 18 days ago

      > production cost is nowhere close to selling price

      When has production cost had anything to do with selling price?

      • eru 18 days ago

        Not directly. But if production cost is above selling price, you typically tend to get less production. And if production cost is (way) below selling price, that tends to invite competition.

ymaws 18 days ago

How confident are you in the Opus 4.6 model size? I've always assumed it was a beefier model with more active params than Qwen 397B (17B active on the forward pass).

  • Bolwin 18 days ago

    Yeah, that's a massive assumption they're making. I remember Musk revealed Grok was multiple trillions of parameters. I find it likely Opus is larger.

    I'm sure Anthropic is making money off the API, but I highly doubt it's 90% profit margins.

    • jychang 18 days ago

      > I find it likely Opus is larger.

      Unlikely. Amazon Bedrock serves Opus at 120 tokens/sec.

      If you want to estimate the actual price to serve Opus, a good rough estimate is to take the max price among Deepseek, Qwen, Kimi, and GLM and multiply it by 2-3. That would be a pretty close guess at the actual inference cost for Opus.

      It's impossible for Opus to have something like 10x the active params of the chinese models. My guess is around 50-100b active params and 800-1600b total params. I could be off by a factor of ~2, but I know I am not off by a factor of 10.

      • simianwords 18 days ago

        Are you sure you can use tps as a proxy?

        • jychang 18 days ago

          In practice, tps is a reflection of VRAM memory bandwidth during inference. So the tps tells you a lot about the hardware you're running on.

          Comparing tps ratios - by saying a model is roughly 2x faster or slower than another model - can tell you a lot about the active param count.

          I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the chinese models at most.
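
          To make the proxy concrete, this is the bandwidth roofline I'm gesturing at (a sketch: the 32B-active/8-bit/H100 numbers below are illustrative assumptions, and real serving adds batching and KV-cache traffic):

            # Decode roofline: each generated token streams every active param from HBM
            def max_decode_tps(active_params, bytes_per_param, hbm_bw):
                return hbm_bw / (active_params * bytes_per_param)

            # e.g. a hypothetical 32B-active model at 8-bit on one H100 SXM (~3.35 TB/s):
            print(max_decode_tps(32e9, 1.0, 3.35e12))  # ~105 tokens/sec ceiling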

          • throwdbaaway 16 days ago

            What about the VRAM requirement for the KV cache? That may matter more than memory bandwidth. With these GPUs, compute capacity is more plentiful than memory bandwidth, and memory bandwidth more plentiful than VRAM.

            DeepSeek got MLA, and then DSA. Qwen got gated delta-net. These inventions allow efficient inference both at home and at scale. If Anthropic has nothing like them, their inference cost can be much higher.

            DeepSeek also built https://github.com/deepseek-ai/3FS, which makes cached reads a lot cheaper with a much longer TTL. If Anthropic didn't invent something similar and uses an expensive solution like Redis, as the crappy TTL suggests, then that also contributes to higher inference cost.

          • fc417fc802 18 days ago

            > In practice, tps is a reflection of vram memory bandwidth during inference.

            > Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.

            You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training obviously). You just end up with an increasingly deep pipeline. So time to first token increases but aggregate tps also increases as you add additional hardware.

            • jychang 18 days ago

              That doesn't work. Think about it a bit more.

              Hint: what's in the kv cache when you start processing the 2nd token?

              And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling vram across gpus) but does not allow you to run models faster.

              Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited by how fast you can synchronize the all-reduce. And in general, models get the same boost on the same hardware - so the chinese models would have the same perf multiplier as Opus.

              Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.

              In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.

              • fc417fc802 18 days ago

                Oh I see. I went and confused total aggregate throughput with per-query throughput there didn't I.

    • nbardy 18 days ago

      You can estimate it from tokens/second.

      The trillions-of-parameters claim is about pretraining.

      It's most efficient in pretraining to train the biggest models possible: you get a sample-efficiency increase for each parameter increase.

      However, those models end up very sparse and incredibly distillable.

      And it's way too expensive and slow to serve models that size, so they are distilled down a lot.

    • wongarsu 18 days ago

      GPT-4 was rumoured/leaked to be 1.8T. Claude 3.5 Sonnet was supposedly 175B, so around 0.5T-1T seems reasonable for Opus 3.5. Maybe a step up to 1-3T for Opus 4.0.

      Since then, inference pricing for new models has come down a lot, despite increasing pressure to be profitable. Opus 4.6 costs 1/3rd what Opus 4.0 (and 3.5) cost, and GPT 5.4 costs 1/4th what o1 cost. You could take that as an indication that inference costs have also come down by at least that degree.

      My guess would have been that current frontier models like Opus are in the realm of 1T params with 32B active.

    • aurareturn 18 days ago

      Anthropic CEO said 50%+ margins in an interview. I'm guessing 50 - 60% right now.

  • daemonologist 18 days ago

    Even if it's larger, OpenRouter has DeepSeek v3.2 (685B/37B active) at $0.26/0.40 and Kimi K2.5 (1T/32B active) at $0.45/2.25 (mentioned in the post).

    • johndough 18 days ago

      Opus 4.6 likely has on the order of 100B active parameters. OpenRouter lists the following throughput for Google Vertex:

          42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
          143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
          70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
      
      For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B, which makes sense since denser models are easier to optimize. As a lower bound, we get 4576B / 42 ≈ 109B active parameters for Opus 4.6. (This makes the assumption that all three models use the same number of bits per parameter and run on the same hardware.)
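
      The same estimate as a quick script (assuming, as above, identical hardware, precision, and serving stack for all three models):

          # Calibrate "params streamed per second" on models with known active params
          calibration = {
              "GLM 4.7":       (143, 32e9),   # (tokens/sec, active params)
              "Llama 3.3 70B": (70,  70e9),   # dense, so all params are active
          }
          rates = [tps * p for tps, p in calibration.values()]
          params_per_sec = sum(rates) / len(rates)    # ~4.7e12 params/sec
          print(params_per_sec / 42 / 1e9)            # ~113B active for Opus 4.6 at 42 tps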

      • jychang 18 days ago

        Yep, you can also get similar analysis from Amazon Bedrock, which serves Opus as well.

        I'd say Opus is roughly 2x to 3x the price of the top Chinese models to serve, in reality.

  • codemog 18 days ago

    Also curious if any experts can weigh in on this. I would guess in the 1 trillion to 2 trillion range.

    • Chamix 18 days ago

      Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia 72-GPU Rubin clusters (and equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see, Vera Rubin racks come standard with 20 TB of unified memory (Blackwell NVL72 with 10 TB), and NVFP4 fits 2 parameters per byte...
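
      Spelling out that rack arithmetic (a sketch; 20 TB is the figure quoted above, and this ignores KV cache and activation memory):

          rack_hbm_bytes = 20e12           # Vera Rubin NVL72: ~20 TB unified memory
          params_per_byte = 2              # NVFP4: 4-bit weights, 2 params/byte
          print(rack_hbm_bytes * params_per_byte / 1e12)  # ~40T total params could fit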

      Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limits of world size via NVLink/TPU torus caps).

      So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

      Comparing Opus 4.6 or GPT 5.4 thinking or Gemini 3.1 Pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.

      • jychang 18 days ago

        Nobody is running 10s of trillion param models in 2026. That's ridiculous.

        Opus is 2T-3T in size at most.

        • Chamix 17 days ago

          What do you think labs are doing with the minimum 10TB of memory in NVLink-72 systems that were publicly reported to all start coming online in November/December of last year? And why would the 1 TB -> 10 TB jump matter so much - given Anthropic was previously wholly dependent on running Opus 4.x on TPUs - if the models were 2-3T at 4-bit and could fit in the 8x B200 systems (1.5 TB = 3T params) widely deployed during the Opus 4 era?

          You have presented a vibe-based rebuttal with no evidence or logic to outline why you think labs are still stuck in the single trillions of parameters (GPT-4 was ~1 trillion params!). Though you have successfully Cunningham'd me into saying that while anything I publicly state is derived from public info, working in the industry itself is a helpful guide for pointing at the right public info to reference.

          • johndough 17 days ago

            Could you point at some more public info about active parameter count? You said:

            > and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

            I can see ~100B, but that would be near the same order of magnitude. I find ~1000B active parameters hard to believe.

            • Chamix 17 days ago

              Sorry if that was unclear; I did mean 100Bs, as in the next order of magnitude. Even GPT-4 had ~220B active params, though the trend has been towards increased sparsification (a lower activation:total ratio). GPT-4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see if there was any value in the extreme inference cost of quadratically increasing compute with naïve-like attention). Nowadays you optimize your head size to your attention kernel arch and obtain performance principally through inference-time scaling (generate more tokens) and parallel consensus (GPT Pro, Gemini Deep Think, etc.), both of which favor faster, cheaper active heads.

              4o and other H100-era models did indeed drop their activated heads far smaller than GPT-4, into the 10s, just like current Hopper-era Chinese open source, but it went right back up again post-Blackwell with the 10x L2 bump (for KV cache), in congruence with nlogn attention mechanisms being refined. Similar story for Claude.

              The fun speculation is wondering about the true size of Gemini 3's internals, given the petabyte+ world size of their homefield Ironwood V7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models constructed by individual teams within DeepMind.

          • jychang 16 days ago

            Well, for one, Anthropic mostly uses Google TPUs and Amazon Inferentia2 chips, not Nvidia NVL72s. That's because... Google and Amazon are major investors in Anthropic.

            Secondly, you missed the entire AI industry trend of 2024-2025: the failure of the GPT-4.5 pretrain run, and the pullback from GPT-4 to GPT-4 Turbo to GPT-4o (each of which is smaller in parameter count). GPT-4 is 1.6T, GPT-4 Turbo is generally considered 1/2 to 1/4 that, and GPT-4o is even smaller (details below).

            Thirdly, we KNOW that GPT-4o runs on Microsoft Maia 100 hardware with 64GB each chip, which gives a hard limit on the size of GPT-4o and tells us that it's a much smaller distilled version of GPT-4. Microsoft says each server has 4 Maia 100 chips and 256GB total. We know Microsoft uses Maia 100s to serve GPT-4o for Azure! So we know that quantized GPT-4o fits in 256GB, and GPT-4 does not fit. It's not possible to have GPT-4o be some much larger model that requires a large cluster to serve- that would drop performance below what we see in Azure.

            Fourthly, it is not publicly KNOWN, but leaks say that GPT-4o is 200b-300b in size, which also tells us that running GPT-4 sized models is nonsense. This matches the information from Microsoft Maia servers above.

            Fifthly, OpenAI Head of Research has since confirmed that o1, o3, GPT-5 use the same pretrain run as 4o, so they would be the same size.[1] That means GPT-5 is not some 1T+ model! Semianalysis confirms that the only pretrain run since 4o is 4.5, which is a ~10T model but everyone knows is a failed run.

            Sixthly, Amazon Bedrock and Google Vertex serve models at approximately similar memory bandwidths when calculating tokens/sec, giving ~4900GB/sec for Google Vertex. Opus 4.5 aligns very well with ~100b of active params.

                42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
                143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
                70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
            
            For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B. There's calculations for Amazon Bedrock on the Opus 4.5 launch thread that compares it to gpt-oss-120b with similar conclusions.

            Seventhly, Anthropic distilled Opus 4/4.1 to 4.5, which is why it runs ~3x faster than Opus 4 while costing 1/3 the price in terms of API fees.

            Eighthly, no respectable model has a sparsity below 3% these days - ridiculously low sparsity gives you Llama 4. Every single cutting-edge model is around 3-5% sparsity. Knowing the active param count for Opus 4.5 gives you a very good estimate of total param count.

            The entire AI industry is moving AWAY from multi-trillion-parameter models. Everything is about increasing efficiency with the amount of parameters you have, not hyperscaling like GPT-4.5 which was shown to be a bad way forward.

            Nobody thinks Opus 4.5 is bigger than around 2T in size (so not 10T). Opus 4/4.1 may have been ~6T, but that's it. Any guess of 10T or above is patently ridiculous for both Opus 4/4.1 and Opus 4.5.

            [1] https://x.com/petergostev/status/1995744289079656834

            • Chamix 14 days ago

              I appreciate the detailed comment! I took the day off and am bored so have a brain dump of a reply - basically I think we are talking past each other on two major points:

              1. All the discussion about model size is CRITICALLY bisected into talking about TOTAL model size vs ACTIVE parameter size (of a "head" in an "Mixture of Experts"). Everything you've said trend-wise is mostly accurate for ACTIVE parameter count, which is what determines inference cost and speed.

              But I am primarily talking about TOTAL parameter count (which has to just fit inside cluster HBM). The total parameter count only affects training cost and has nothing to do with inference cost or speed. So there is no downside to making total parameter count as big as your inference cluster will fit.

              2. You touch on distillation, and this relates heavily to the post-GPT-4 base model (call it 5th gen, if GPT-4 was 4th gen), which indeed was used for all models through GPT-5.1.

              The actual base 5th-gen model was as large as OAI could fit on training clusters, and only then distilled down to whatever total size a release model targeted. The little secret with sparse MoE is that the entire model weights don't have to fit on a single HBM pool when training (again, plenty of public papers detail the techniques). This leads to the 2nd little secret: GPT-4.5 is ALSO using that same base model; as I said in another comment, 4.5 was all an experiment in testing a huge ACTIVE parameter model (which again is all that determines cost and speed), not so much total (which is capped by inference cluster hardware anyway!). How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x bigger in total than everything else? But it's easy to serve a model with active parameters 10x bigger!

              So this same huge 5th gen base model was distilled down and RLed over and over again in different permutations and sizes to feed the whole OAI model lineup, from o4-mini to advanced voice to gpt4.5 all the way until finally 5.2 starts using a new, "6th gen" base model (with various failed base model trainings between 5th and 6th) (shallotpeat!).

              Picking up misc pieces: yes, 4o was tiny when served at Q4, which is what Maia 100 did (with some Q6). We are still talking about a ~1T total model. Quantization, both static and dynamic, was the whole drive behind the gpt4-turbo variants, which led straight into 4o targeting an extremely economical deployment of the 5th-gen base. Economical was sorely needed (arrakis!), since this was all at the critical junction when 8xH100s had not been deployed quite at scale yet but AI use was rocketing off to mainstream, so we had silly situations like Azure being forced to serve on 256gb clusters. (We could go into a whole separate spiel about quantization + history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8.)

              But this DOES NOT mean o1 was tiny - it conveniently was deployed right when 8xH100s WERE available at scale. We split into the instant tree, where 4.1 was bigger than 4o and 5-instant was bigger than 4.1, etc., and the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking. Again, the ACTIVE counts were comparatively very small, especially as that let you cheaply experiment and then train with the substantial inference compute required for RL training/unrolling! But there was no reason not to field increasingly large distilled versions of the 5th-gen/6th-gen base models as the inference fleet buildouts (particularly in 2H 2025) came online! The same 5th- and now 6th-gen base models were refined and twisted (foundry!) into totally different end models and sizes.

              I just think this really all comes down to total vs active, not understanding that a huge base model can be distilled into arbitrarily sized release models, and then bizarrely giving weight to Meta's completely incompetent Llama 4 training run (I was there, Gandalf!) as giving any insight into what sort of sparsity ratio cutting-edge labs are using. You cannot learn anything about total parameter size from active parameter count + its derivatives (token speed, cost, etc.)! But on this topic we could again diverge into an entire debate; I'll just say Google is likely doing something like 0.1%-OOM in some production configs (Jim Keller is basically shouting extreme sparsity from the rooftops!).

              Brief rebuttal summary:

              1. Incorrect as of late 2025. There's been ample public reporting about Anthropic's dissatisfaction with "Project Rainier". Dario talked about Nvidia compute candidly on the Dwarkesh interview!

              2. Active vs Total

              3. 4o is small, 4-bit 4o on Azure even smaller. 4o is 5th gen base distilled not gpt-4 distilled.

              4. 256gb at Q4 fits 1T parameters! Active vs total

              5. 5th gen pretrain / base model is huge! 4.5 uses the same base as 4o and 5.1! Can be shrunk to arbitrary size before RL/post training create finished model! Active vs total

              6. Active vs total

              7. Active vs total, also Ironwood/TPUv7 and Blackwell give much cheaper Q4 inference

              8. Don't trust the Zuck

              Anyways, it's all a mess and I don't think it's possible to avoid talking past each other or misunderstanding in semi-casual conversation - even just today, Dylan Patel (who is extremely well informed!) was on the Dwarkesh podcast talking about 5.4-instant having a smaller active parameter count than GPT-4 (220B active), which is completely true, but it instantly gets misinterpreted on Twitter et al. as 5.4 being a smaller model than GPT-4, ignoring that 5.4-instant and 5.4-thinking are totally different models, etc. etc. - just too much nuance to convey easily.

              • jychang 9 days ago

                1. Claiming that gpt-4o and gpt-4.5 came from the same training run is ridiculous; gpt-4.5 was not distilled from the same pretrain as 4o.

                - Mark Chen has literally publicly said as much, it's a completely different pretrain run.

                - And clearly, if OpenAI had a good big base model before 4.5, they would have released it back in 2024.

                "How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else?" through pipeline parallelism, not tensor parallelism. Don't need to synchronize an all-reduce across clusters. You lose tons of tokens/sec per user though. That's exactly what we see with gpt-4.5 in real life- slow ~10token/sec inference.

                2. 4o was definitely not served fully at 4-bit/6-bit, and even at 4-bit, a 1T model wouldn't fit in a Maia cluster with a reasonable kv cache for users. You can't quant attention down to 4-bit/6-bit; that would give the model brain damage. A production environment would quant attention down to fp8 at most. Even local home users don't quant attention down to 4-bit. Unsloth UD Q4 quants usually quant attention to Q8. https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/mai...

                    blk.0.attn_qkv.weight  [2048, 8192]  Q8_0

                Also, Q4/Q6/Qwhatever are quants used by llama.cpp only, and nobody in a production environment would be using llama.cpp at all. So, saying "Qwhatever" is a clear indicator you have no clue what you're talking about.

                Since 4o predates widespread MLA, they're clearly using GQA, and thus you can estimate the size per token from an approximate attention head size. Note that Azure offers 4o with a max context of 128k tokens. That's about 4-8gb of kv cache at full context size. Even at 4-bit (it's not at 4-bit), 4o is 500b at most, if you actually want to serve customers! Providers do not do batch=1 inference; that would leave the GPU core idle while memory bandwidth is saturated. So they'd have to batch many users onto one machine, with all their kv caches resident in memory. There's just no way you can fit a 1T model with 8+ bit attention and a bunch of users' kv caches into 256GB, even if the ffn was fp4.

                3. Microsoft leaked the size of 4o, you know. And there's also other estimates. They all estimate 4o at around 200b. https://arxiv.org/pdf/2412.19260 or https://epoch.ai/gradient-updates/frontier-language-models-h...

                4. "(We could go into a whole separate spiel about quantization +history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)"

                More accurately, most deployments are FP4 for ffn, and still 8 bit or 16 bit for attention. And only the chinese labs train at FP8. There's very little reason to train at FP8 when your AdamW states and gradients are still FP32 and FP16. And note that even deepseek uses FP16/FP32 AdamW/gradients.

                https://arxiv.org/pdf/2412.19437 That's deepseek using FP8 live weight copy + FP32 master + FP32 grad + BF16 moments = 13 bytes per parameter. BF16 weights is 14 bytes per parameter. There's very little reason to use FP8 weights over BF16 weights during training - you don't save that much VRAM/compute - unless you're very desperate like Deepseek. Most labs now still train for W16A16 but apply QAT, rather than training at FP8. Even the chinese labs do this now - Kimi K2.5 is BF16 native, and just quantizes the ffn down to int4 with QAT. You can tell, because Kimi K2.5's attention tensors are BF16 and not FP8.

                4. "instant tree" "And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking." What you're describing is a massive waste of money. Nobody's doing that. Each time you distill a model to a different size, you have to do that separately. That's a waste of compute. Nope, openai just took the same model, and kept on posttraining it more and more, and published some checkpoints. That's what everyone does. The various gpt-4o-2024-05-13 and gpt-4o-2024-08-06 and gpt-4o-2024-11-20 and gpt-5 and gpt-5.1 ... and o1 and o3 and gpt-5-thinking models are NOT different sizes.

                Every lab takes a model and iterates on it, training it more and more. Creating a bunch of distills is expensive. Training compute is approximately Compute ≈ 6 × (active params) × (tokens trained). Post-training is basically just throwing a few more tokens into the model and doing some forward and backward passes. I don't know how many tokens they trained on, but it's somewhere in the 10T to 100T range. Distillation compute ≈ [2 × (big model active params) + 6 × (small model active params)] × (tokens trained). This is way more expensive per token than training! There are fewer passes, but you don't get the value you think from distills.
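
                A quick numeric sketch of those two formulas (the 100B/30B/10T figures are made-up illustrations):

                    # Training ≈ 6*N*D; distillation ≈ (2*N_teacher + 6*N_student)*D
                    N_big, N_small, D = 100e9, 30e9, 10e12  # active params, tokens (assumed)
                    train_small = 6 * N_small * D                 # ~1.8e24 FLOPs
                    distill     = (2 * N_big + 6 * N_small) * D   # ~3.8e24 FLOPs
                    print(distill / train_small)                  # ~2.1x the cost per token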

                Look at Deepseek! Deepseek V3? 671B total parameters checkpoint. R1? 671B total parameters checkpoint. V3 0324? 671B total parameters checkpoint. R1 0528? 671B total parameters checkpoint. V3.1 combined thinking and non-thinking? 671B total parameters checkpoint. V3.1 Terminus? 671B total parameters checkpoint. V3.2? 671B total parameters checkpoint.

                6. Sparsity matters. Nobody currently goes below 1% sparsity.

                MoE sparsity is just the ratio of active experts to total experts. Most labs settle on around 8 active out of 256 (like Deepseek, GLM, etc.) - roughly 3% by expert count, or ~5-6% in active-parameter terms. There's plenty of research showing that models break down at too high a sparsity, which is why total params is correlated with active params.

                Also, please don't use the word "head" to refer to a MoE expert. The word "head" has a specific meaning in ML and it's not that. It's referring to the component in multi-head attention. That's like using the word "transmission" when talking about a car but not referring to the actual transmission. It's making you look really weird.

                Actually, we know what architecture OpenAI was using a few years ago - because OpenAI released it. That was the whole point of gpt-oss. Notably, it uses mxfp4 for the MoE but still uses BF16 for GQA attention, and it has 4-of-128 expert sparsity (~3%). Yes, even OpenAI stayed around that sparsity level. And note that OpenAI clearly did not think quantizing attention is a good idea, even though they applied QAT to create an mxfp4 ffn.

                Basically, you have no clue what you're talking about. You're somehow claiming that OpenAI is doing a ton of distills - one for each of 4o/o1/o3/gpt-5/gpt-5.1, thinking and non-thinking - to different sizes... instead of just taking a model they already have and doing more training and more checkpoints like everyone else. They'd be insane if they were doing that.

        • johndough 18 days ago

          Do you have any clues to guess the total model size? I do not see any limitations to making models ridiculously large (besides training), and the Scaling Law paper showed that more parameters = more better, so it would be a safe bet for companies that have more money than innovative spirit.

          • magicalhippo 18 days ago

            > I do not see any limitations to making models ridiculously large (besides training)

            From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. The Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base, again a large change was more and better training data.

            [1]: https://news.ycombinator.com/item?id=47089780

      • aurareturn 18 days ago

        China is targeting H20 because that's all they were officially allowed to buy.

        • Chamix 18 days ago

          I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs × 96gb = 768gb = 768B parameters at FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open-source Chinese models.

          However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.

          This does somewhat beg the question of how nicely the closed-source variants, of undisclosed parameter counts, fit within the 1.1TB of H200 or 1.5TB of B200 systems.

          • aurareturn 18 days ago

            They do not have enough H200 or Blackwell systems to serve 1.4 billion people plus the rest of the world, so I doubt it's any meaningful number.

            • Chamix 18 days ago

              I assure you, the number of people paying to use Qwen3-Max or other similar proprietary endpoints is far less than 1.6 billion.

0xbadcafebee 17 days ago

There's a huge difference between cost of inference and profit margin of the "big" providers, and the cost of inference for cloud-hosted open-weights. It's the same as R&D cost of the pharmaceutical industry, versus cost of producing generic drugs. One is massively expensive, the other is cheap.

That said, for inference, OpenAI's margins were estimated at 70% [1] [2], and Anthropic's were estimated between 40% and 90% [3] [4], last year. They will not be profitable for years.

[1] https://phemex.com/news/article/openais-ai-profit-margin-cli... [2] https://www.saastr.com/have-ai-gross-margins-really-turned-t... [3] https://www.theinformation.com/articles/anthropic-projects-7... [4] https://www.investing.com/news/stock-market-news/anthropic-t...

  • vessenes 17 days ago

    Thank you for real data. Please moderate the use of the word "profitable" when talking to engineers! We get the same circle jerk over and over here.

    Profit implies a GAAP accrual of some sort. On any accrual schedule tied to reality, the companies are profitable now - that is, inference margin on each given model has more than paid for the capital costs of training and deploying those models.

    That the companies get to show a loss is a feature of cash-basis accounting: they made $100m net on that last model? Good news, we're spending $1b on the next! Infinite tax losses!

    The companies will not be cashflow-positive for years. Why does this persnickety difference matter? It matters to me because I care about the engineers here - and they seem collectively likely to either short every AI company that IPOs, or quietly ignore AI's impact on their livelihood, or head off into a corner and go catatonic - all based on a worldview that "this is collective insanity and everything here is going to eventually go bankrupt" - none of those are good outcomes. Shorting might be, but it should be done judiciously, with an understanding of the financial factors at play. So, long plea over - but allow me to plead: say "cashflow positive" if that's the point you're making.

scuff3d 18 days ago

This article is hilariously flawed, and it takes all of 5 seconds of research to see why.

Alibaba is the primary comparison point made by the author, but it's a completely unsuitable one. Alibaba is closer to AWS than to Anthropic in terms of business model: they make money selling infrastructure, not inference. It's entirely possible they see inference as a loss leader and are willing to offer it at or below cost to drive people onto the platform.

We also have absolutely no idea whether it's anywhere near comparable to Opus 4.6. The author is guessing.

So the article's primary argument is based on a comparison to a company with an entirely different business model, running a model the author is just making wild guesses about.

  • simianwords 18 days ago

    What? AWS is a good comparison if you want only infra-level costs, which is what the post is talking about.

himata4113 18 days ago

What people don't realize is that cache is *free*. Well, not free, but compared to the compute required to recompute it? Relatively free.

If you remove the cached-token cost from the pricing, the overall API usage drops from around $5,000 to $800 per month (or $200 per week) on the $200 Max subscription. Still 4x cheaper than the API, but not losing money either - if I had to guess, it's break-even, as the compute would most likely be going idle otherwise.

  • mike_hearn 18 days ago

    Cache definitely isn't free! We're in a global RAM shortage and KV caches sit around consuming RAM in the hope that there will be a hit.

    The gamble with caching is to hold a KV cache in the hope that the user will (a) submit a prompt that can use it and (b) that will get routed to the right server which (c) won't be so busy at the time it can't handle the request. KV caches aren't small so if you lose that bet you've lost money (basically, the opportunity cost of using that RAM for something else).
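
    To get a feel for the sizes, here's a back-of-envelope for a hypothetical GQA model (every config number is made up for illustration):

        # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
        layers, kv_heads, head_dim, fp16_bytes = 60, 8, 128, 2
        per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # ~240 KB/token
        print(per_token * 100_000 / 1e9)  # ~25 GB held hot for one 100k-token context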

    • otterley 18 days ago

      Why do you believe that caches are held in RAM? They don’t need RAM performance, and disk is orders of magnitude cheaper.

      • mike_hearn 17 days ago

        Because OpenAI specifically say that:

        https://developers.openai.com/api/docs/guides/prompt-caching...

        > When using the in-memory policy, cached prefixes generally remain active for 5 to 10 minutes of inactivity, up to a maximum of one hour. In-memory cached prefixes are only held within volatile GPU memory.

        You can opt-in to storing the caches on local disk but it's not the default. I haven't done the calculations for why they do this, but given that disaggregated parallel prefill and RDMA can recompute the KV cache very fast, you'd need a huge amount of bandwidth from disk to beat it (and flash drives wear out!).

    • himata4113 17 days ago

      It's free relative to GPU cost, even at today's RAM prices.

  • criemen 18 days ago

    > What people don't realize is that cache is free

    I'm incredibly salty about this - they're aggressively monetizing something that actually lets them sell their inference at premium prices to more users; without any caching, they'd have much less capacity available.

  • eru 18 days ago

    > [...] if I had to guess it's break even as the compute is most likely going idle otherwise.

    Why would it go idle? It would go to their next best use. At least they could help with model training or let their researchers run experiments etc.

    • himata4113 18 days ago

      Inference compute is vastly different from training compute; also, models have to stay hot in VRAM, which probably takes up most of it. There is limited use for THAT much compute as well; they are running things like the claude code compiler, and even then they're scratching the surface of the amount of compute they have.

      Training currently requires Nvidia's latest and greatest for the best models (they also use Google TPUs now, which are also technically the latest and greatest? They're more dual-purpose than anything, afaik, so that would be a correct assessment in that case).

      Inference can run on a hot potato if you really put your mind to it.

      • rafaelmn 18 days ago

        I think I've heard multiple times that a large % of training compute for SoTA models is inference used to generate training tokens; this is bound to happen with RL training.

      • eru 18 days ago

        They can run any number of inference experiments. Like a lot of the alignment work they have going on.

        I am not saying this would be a great use of their compute, but idle is far from the only alternative. (Unless electricity is the binding constraint?)

        • himata4113 18 days ago

          Electricity is charged whether you use it or not, so very unlikely. But sure, they can find uses for it, although they are not going to make as much money from those as from Claude Code subscriptions.

          • eru 18 days ago

            > Electricity is charged whenever you use it or not, [...]

            Huh, what? You know you can turn off unused equipment, and at least my Nvidia GPU draws more or fewer watts depending on load, even when turned on?

            Or does Anthropic have a flat-rate deal for electricity and cooling?

            • himata4113 17 days ago

              In datacenters, power allocation is a fixed cost.

              • eru 16 days ago

                If you are big enough, you can and will negotiate.

                • himata4113 16 days ago

                  The datacenter has a fixed cost for power; industrial power is not consumer power, especially at large scale. Scale really kicks in if you own your own power plant (e.g. hydro, wind, solar).

                  • eru 16 days ago

                    Data centres are more complicated than that.

                    For an example, even if you have a fixed power budget at the data centre level, you still have opportunity costs: if you turn some unused GPUs off, you can run other things hotter.

z3ugma 18 days ago

This is such a well-written essay. Every line revealed the answer to the immediate question I had just thought of

  • lovecg 18 days ago

    I can’t get past all the LLM-isms. Do people really not care about AI-slopifying their writing? It’s like learning about bad kerning, you see it everywhere.

    • crakhamster01 18 days ago

      I had a similar reaction to OP's for a different post a few weeks back - I think it was some analysis of the healthcare economy. Initially, as I was reading, I thought: "Wow, I've never read a financial article written so clearly." Everything in layman's terms. But as I continued to read, I began to notice the LLM-isms: oversimplified concepts, "the honest truth", "like X for Y", etc.

      Maybe the common factor here is not having deep/sufficient knowledge of the topic being discussed? For the article I mentioned, I feel like I was less focused on the strength of the writing and more on just understanding the content.

      LLMs are very capable at simplifying concepts and meeting the reader at their level. Personally, I subscribe to the philosophy of - "if you couldn't be bothered to write it, I shouldn't bother to read it".

      • ajkjk 18 days ago

        Alternate theory... a few months into the LLMism phenomenon, people are starting to copy the LLM writing style without realizing it :(

        • amonith 18 days ago

          This happens to non-native English speakers a lot (like me). My style of writing is heavily influenced by everything I read, and since I also do research using LLMs, I'll probably sound more and more like an AI as well, just by reading its responses constantly.

          I just don't know what natural writing is supposed to look like anymore. It's not in the books, it's disappearing from the internet - what's left? Some old blogs, for now, maybe.

          • crakhamster01 17 days ago

            The wave of LLM-style writing taking over the internet is definitely a bit scary. Feels like a similar problem to GenAI code/style eventually dominating the data that LLMs are trained on.

            But luckily there's a large body of well written books/blogs/talks/speeches out there. Also anecdotally, I feel like a lot of the "bad writing" I see online these days is usually in the tech sphere.

          • juuular 17 days ago

            Books definitely have natural writing, read more fiction! I recommend Children of Time by Adrian Tchaikovsky

    • weird-eye-issue 18 days ago

      I think you're just hallucinating because this does not come across as an AI article

      • lovecg 18 days ago

        I see quite a few:

        “what X actually is”

        “the X reality check”

        Overuse of “real” and “genuine”:

        > The real story is actually in the article. … And the real issue for Cursor … They have real "brand awareness", and they are genuinely better than the cheaper open weights models - for now at least. It's a real conundrum for them.

        > … - these are genuinely massive expenses that dwarf inference costs.

        This style just screams “Claude” to me.

      • hansvm 18 days ago

        It was almost certainly at least heavily edited with one. Ignoring the content, every single thing about the structure and style screams LLM.

      • lelanthran 18 days ago

        > I think you're just hallucinating because this does not come across as an AI article

        It has enough tells in the correct frequency for me to consider it more than 50% generated.

      • NetOpWibby 18 days ago

        Name checks out

    • raincole 18 days ago

      It's really unfortunate that we call well-structured writing 'LLM-isms' now.

    • Erem 18 days ago

      I don’t see the usual tells in this essay

    • 152334H 18 days ago

      People care, when they can tell.

      Popular content is popular because it is above the threshold for average detection.

      In a better world, platforms would empower defenders, by granting skilled human noticers flagging priority, and by adopting basic classifiers like Pangram.

      Unfortunately, mainstream platforms have thus far not demonstrated strong interest in banning AI slop. This site in particular has actually taken moderation actions to unflag AI slop on certain occasions...

    • rhubarbtree 18 days ago

      It is certainly very obvious a lot of the time. I wonder whether, if we revisited the automated slop-detection problem, we'd be more successful now... it feels like there are a lot more tells and models have become more idiosyncratic.

      • weird-eye-issue 18 days ago

        Tons of companies do this already. It's not like this is a problem that nobody is constantly revisiting...

        • rhubarbtree 17 days ago

          What’s one company that has revisited this recently and what’s their detection rate on what sample?

          • weird-eye-issue 17 days ago

            Companies like Originality.ai are always updating their models and you could use a simple Google search to answer your questions.

            • rhubarbtree 16 days ago

              You could also have had the courtesy to put that in your original post. But let’s not get meta.

              I did a quick test and it detected an AI summary of a random topic, even after two prompts to disguise it. So as expected it may have become a lot easier to detect.

              • weird-eye-issue 16 days ago

                There are literally hundreds of companies that are doing this. You could have the basic courtesy to do a Google search instead of asking.

                • sebastiennight 13 days ago

                  This is an Internet forum and one of the ways such places are valuable is that it enables you to ask questions to other humans and allows those other humans, if they'd like, to answer.

                  You will get better results asking questions like GP's than Googling because you're asking the specific person who made a claim to quote an example, so you can judge from the specific example they provide, rather than the Google results. The best answers are often technically interesting niche tools which don't have great SEO.

                  Case in point: the platform you recommended does not show up anywhere on my first page of Duck.com results.

faangguyindia 18 days ago

A Claude subscription is the equivalent of a spot instance.

And the APIs are the on-demand equivalent.

Priority goes to the APIs, and leftover compute serves the subscription plans.

When there is no capacity, subscriptions are routed to heavily quantized, cheaper models behind the scenes.

Selling subscriptions makes it cheaper to run inference at scale; otherwise much of your capacity just sits idle.

Also, these subscriptions help you train your model further on predictable workflows (because the model creators also control the client, like Qwen Code, Claude Code, Antigravity, etc.).

This is probably why they'll ban you for violating the TOS if you use their subscription models with other tools.

They aren't just selling a subscription; the subscription also helps them get better at the thing they're selling, which is coding, for coding models like Qwen, Claude, etc.

I've used Qwen Code, Codex and Claude.

Codex is 2x better than Qwen Code and Claude is 2x better than Codex.

So I'd expect Claude Opus to be at least 4-5x more expensive to run than the flagship Qwen coding model hosted by Alibaba.
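
To illustrate the spot/on-demand analogy, here's a toy scheduler in Python - purely speculative, since nobody outside these labs knows how they actually route traffic:

    import heapq, itertools

    # Hypothetical priority levels: lower number = served first.
    API, SUBSCRIPTION = 0, 1

    class InferenceQueue:
        """Toy scheduler: pay-per-token API traffic always jumps ahead;
        subscription traffic soaks up whatever capacity is left over."""
        def __init__(self):
            self.heap = []
            self.order = itertools.count()  # FIFO tie-breaker within a tier

        def submit(self, prompt, tier):
            heapq.heappush(self.heap, (tier, next(self.order), prompt))

        def next_prompt(self):
            return heapq.heappop(self.heap)[2] if self.heap else None

    q = InferenceQueue()
    q.submit("sub: refactor my tests", SUBSCRIPTION)
    q.submit("api: customer batch job", API)
    print(q.next_prompt())  # -> "api: customer batch job"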

  • popcorncowboy 18 days ago

    > Claude is 2x better than Codex

    This hasn't been true in a long time.

    • epolanski 18 days ago

      Not only that, but since the release of Codex 5.4 and 5.3 I've been running them in parallel, and I've been let down by Opus 4.6 with maximum thinking far more often than by the OpenAI models.

      In fact I'm more and more inclined to run my own benchmarks from now on, because I seriously distrust those I see online.

      Even if the benchmarks are indeed valid, they just don't reflect my use cases, my usage patterns, or the models' ability to navigate my projects and dependencies.

    • Huppie 18 days ago

      imho they're each better at different subsets of tasks: I find Codex better at reasoning through bugs and reviewing code compared to Opus, but for writing code I find Claude a lot better.

      Maybe that's just CLAUDE.md and memory causing the difference, of course.

      As a matter of preference, however, I like the way Claude Code works a lot better; instructing it to work with parallel subagents in worktrees etc. just matches the way I think these things should work, I guess.

    • elAhmo 18 days ago

      My impression as well, especially since 5.2, which I felt was on par with or better than Opus 4.5.

  • janalsncm 18 days ago

    > When there is no capacity, subscriptions are routed to Highly Quantized cheaper models behind the scenes.

    Have they announced this?

    • nl 18 days ago

      > Have they announced this?

      No and indeed they have said they never do this at all.

n_u 18 days ago

Good article! Small suggestions:

1. It would be nice to define terms like RSI or at least link to a definition.

2. I found the graph difficult to read. It's a computer font that is made to look hand-drawn and it's a bit low resolution. With some googling I'm guessing the words in parentheses are the clouds the model is running on. You could make that a bit more clear.

brianjeong 18 days ago

These margins are far greater than the ones Dario has indicated during many of his recent podcast appearances.

functionmouse 18 days ago

Was anyone under the impression that it does? Serious question. I've never heard that, personally.

  • versteegen 18 days ago

    Ed Zitron made that claim (in particular here: [1]). In the same article he admits he's not a programmer, and had to ask someone else to try out Claude Code and ccusage for him. He doesn't have any understanding of how LLMs or caching work. But he's prominent because he's received leaked financial details for Anthropic and OpenAI, e.g. [2]

    [1] https://www.wheresyoured.at/anthropic-is-bleeding-out/ [2] https://www.wheresyoured.at/costs/

    • sunaurus 18 days ago

      Maybe I'm misreading it, but I don't see him saying it's just the cost of *inference* alone (which is the strawman that the article in the OP is arguing against). He says:

      > this company is wilfully burning 200% to 3000% of each Pro or Max customer that interacts with Claude Code

      There is of course this meme that "Anthropic would be profitable today if they stopped training new models and only focused on inference", but people on HN are smart enough to understand that this is not realistic due to model drift, and also due to competition from other models. So training is forever a part of the cost of doing business, until we have some fundamental changes in the underlying technology.

      I can only interpret Ed Zitron as saying "the cost of doing business is 200% to 3000% of the price users are paying for their subscriptions", which sounds extremely plausible to me.

  • simianwords 18 days ago

    You would be surprised because there are lots of posters here who think that the cost is so enormous that this whole industry is unviable.

    • beepbooptheory 17 days ago

      Is that conceit somehow intrinsically absurd? Or is everyone just supposed to know?

      Like, I wish it were as simple as "if it wasn't viable, they wouldn't be in business," but alas that argument is kinda the more naive one in this world. Right?

      Or is there some intuition about energy/cost here that all the dumb posters miss, that you could tell us about?

      Please, anything, my company is dying.

      • simianwords 17 days ago

        It's the simplest intuition: you can run the equivalent of a two-year-old GPT model on your laptop today. So cost is obviously going down.

        It's so strange to see people still think costs are not going down...

        • beepbooptheory 16 days ago

          Wow never thought about it like that... Guess there is nothing to worry about! I feel so silly now.

  • dimgl 18 days ago

    Twitter.

  • crazygringo 18 days ago

    I mean, the very first paragraph of TFA is describing who is under that impression. Literally the first sentence:

    > My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute.

    • fulafel 18 days ago

      That's claiming that, worst case, a subscriber _can_ use that much. It's possible that's wrong too, but in any case a lot of services are built on the assumption that the average user doesn't max out the plan (toy numbers in the sketch below).

      So the article's title is obviously sensationalized.
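
      With entirely made-up numbers, the averaging effect is easy to see (these tiers and costs are hypothetical, not anyone's real data):

          # Hypothetical usage distribution for $200/mo subscribers.
          tiers = [
              (0.05, 500.0),  # 5% max out the plan (~$500 est. compute cost)
              (0.25, 150.0),  # heavy but not maxed out
              (0.70, 30.0),   # light or occasional use
          ]
          expected = sum(share * cost for share, cost in tiers)
          print(f"expected compute cost per subscriber: ${expected:.2f}")
          # -> $83.50, comfortably under the $200 subscription price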

      • vidarh 18 days ago

        I have no problem believing that a Claude Max plan can consume the equivalent of $5,000 worth of retail Opus usage. But one interesting thing you'll see if you e.g. have Claude write agents for you is that it's pretty aggressive about setting agents to use Sonnet or even Haiku. So not only will most people not exhaust their plans, but a lot of those who do will do so partly on the cheaper models. When you then factor in Anthropic's reported margins, and their ability to prioritise traffic (e.g. I'd assume that if their capacity is maxed out they'd throttle subscribers in favour of pay-per-token traffic? Maybe not, but it's what I'd do), I'd expect the real cost to them of a maxed-out plan to be much lower.

        Also, while Opus certainly is a lot better than even the best Chinese models, when I max out my Claude plan I make do with Kimi 2.5. Factoring in the re-runs of changes because of the lower quality, I'd spend maybe 2x as much per unit of work if I were to pay token prices for all my monthly usage with Kimi.

        I'd still prefer Claude if the price came down to parity, as it's less hassle with the harder changes, but their lead is effectively less than a year.
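
        As a toy illustration of that kind of tiering (the scoring and cutoffs here are invented, not how Claude Code actually decides):

            def pick_model(task_complexity: float) -> str:
                """Route a subagent task to the cheapest plausible model.
                task_complexity is in [0, 1]; the cutoffs are made up."""
                if task_complexity < 0.3:
                    return "haiku"   # lookups, file listings, simple greps
                if task_complexity < 0.7:
                    return "sonnet"  # routine edits and boilerplate
                return "opus"        # hard multi-file refactors

            print(pick_model(0.2), pick_model(0.9))  # -> haiku opus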

      • crazygringo 17 days ago

        The article discusses all of that, in great detail.

        The title does not seem sensationalized. It's literally a summary of the article.

        • fulafel 17 days ago

          It does and that is exactly why the article's title is sensationalized.

          The title is refuting a strawman argument that wasn't actually made, and that the article itself doesn't claim was made.

          • crazygringo 17 days ago

            > The title is refuting a strawman argument that wasn't actually made

            The argument was literally made in Forbes. It's linked to. What are you on about?

            Is there something I'm missing here?

            • fulafel 17 days ago

              The quote from Forbes in the article is a claim that a subscription could use up to 5k worth of tokens.

              Which is different from actually costing $5k in tokens per Claude Code user, since users won't max out their subscriptions. And there doesn't seem to be any stronger claim elsewhere in the article.

              But the title is about a strawman that it would cost Anthropic 5k per user which it seems nobody claimed.

              • crazygringo 17 days ago

                OK, I see what you're saying, thanks for clarifying.

                But headlines are short. This is so common, even in mainstream news, that I can't really complain about it. Especially when the full claim with "up to" is printed in the very first paragraph.

                And the entire point of the article is not about which users max out their subscriptions. It's about conflating retail prices with actual costs.

                So maybe the headline would be more accurate with "up to" in it, but the article itself is totally fine, and does not hinge on that distinction. The article is certainly not about a strawman.

aurareturn 18 days ago

By the way, one of the charts in the article shows that Opus 4.6 is 10x costlier than Kimi K2.5.

I thought there was no moat in AI? Even being 10x costlier, Anthropic still doesn't have enough compute to meet demand.

Those "AI has no moat" opinions are going to be so wrong so soon.

  • spiderice 18 days ago

    Claude Code Max obviously doesn't cost 10x more than Kimi. The article even confirms that you can get $5k worth of compute for $200 with Claude Code Max.

    So no, Claude would not be getting NEARLY as much usage as it's currently getting if it weren't for the $100/$200 monthly subscription. You're comparing Kimi to the price that most people aren't paying.

SilverBirch 16 days ago

Whilst this is interesting, I find the topic brought up on Odd Lots more interesting. The idea was this: once you've built a model, if you can sell tokens for a profit, this is a great business - just sell more tokens. But you can't just build a model and sell tokens; you need to build the best model to sell new tokens. So the question is much more "How much does it cost you to build a new SotA model" and then "How effectively can you monetize it". And since you need a SotA model, your only option if you have a bad model that isn't selling is to invest billions more into building a better model whose tokens you can sell.

So this turns into a death march.

If you are behind, the only thing you can do is make massive capital investments to catch up. Once you're ahead, you can sell tokens until someone else catches up. And, breaking the normal model of places like chip fabrication, your billions of investment may only keep you ahead for two months. So you have a tiny window to sell those tokens.

readthemanual 18 days ago

I think the main issue I have with the article is that the author's whole argument is based on 'Qwen wouldn't run at a loss'. But why wouldn't it? Despite it being a business, there are a number of reasons they might decide to run without profit for now: from trying to expand the user base, to the Chinese government sponsoring Chinese AI businesses.

  • martinald 18 days ago

    Hi, OP here! Even if Qwen wants to run at a loss, why would Together, DeepInfra, SiliconFlow, etc _all_ also want to run at a similar loss?

ineedaj0b 18 days ago

What CC costs internally is not public. How efficient it is, is not public.

…You could take efficiency-improvement rates from previous model releases (from x -> y) and assume they have already made similar “improvements” internally. That is likely closer to what their real costs are.

hattmall 18 days ago

Is it fair to say the OpenRouter models aren't subsidized, though? The article makes the case that the companies on there are running a business, but there are free models, and there are companies with huge AI budgets that want to gather training data and show usage.

dreis_sw 17 days ago

We'll only know for sure when either of these companies goes public. Google serves inference, and Google Cloud is profitable, but we don't know how much the inference is costing.

If they never go public, there's our answer as well.

lyu07282 18 days ago

What this doesn't mention is the "cost" to the public: the inevitable bailouts after it all comes crashing down again, the massive subsidies that datacenters get from taxpayers, the fresh water they consume, the electricity price hikes for everyone else, the noise, air and water pollution and the massive health impact on the population surrounding every datacenter, the jobs it destroys, and the innocent people it kills through the use of the technology in military targeting and autonomous weapons.

WhitneyLand 17 days ago

“Cursor has to pay Anthropic's retail API prices (or close to it) for access to Opus 4.6. So to provide a Claude Code-equivalent experience using Opus 4.6, it would cost Cursor ~$5,000 per power user per month. But it would cost Anthropic perhaps $500 max.”

Cursor seems to be in a tough spot. Just heard the swix podcast on their big new cloud agents thing, and it’s looking like a pretty small moat these days.

gmerc 18 days ago

Nobody gets RSI typing “iterate until tests pass”

  • arthurcolle 18 days ago

    Recursive self improvement and Repetitive Strain Injury being the same initialism is really funny to me

nickcoffee 17 days ago

Been running Claude Code and the $200/month has been one of the better value decisions I've made as a founder.

The more interesting question is where the margins go as inference costs keep dropping. At some point the pricing pressure flows to users.

d--b 18 days ago

And on top of that, Anthropic does not run their own compute clusters, do they? They probably get completely ripped off by whoever is renting them the processors.

$200 worth of actual computation is an awful lot of computation.

maxdo 18 days ago

I'm using the API directly for software development. I'm on track to pay ~$5k this month per user, some less, some more, and with daily use it just keeps growing.

  • tartoran 18 days ago

    What kind of software development do you do? Are you running a gas town? I assume you make your money back but still, are you sure you’re not wasting your tokens away?

    • maxdo 17 days ago

      Lots of greenfield subprojects around the same product, bug fixes, ticket management, etc. At my hourly rate it still makes sense, even though it's on the edge of not being that great.

      Yeah, I tried Gas Town. Not using it extensively.

ramesh31 17 days ago

Agent teams change everything. I can easily burn through 1m tokens in 15 minutes. There's no way the $200 price will hold once everyone is doing it.

tartoran 17 days ago

Nobody seems to mention free users, who have a pretty small limit, but there are a lot of users in this category. Who's subsidizing them?

  • amelius 17 days ago

    But that doesn't say anything about whether they can be profitable or not.

    • tartoran 17 days ago

      I assume they project that some of the free users will convert to paying users at some point. Whether it turns profitable in the end is another story. I have no idea...

vbezhenar 18 days ago

Why does Anthropic charge 10x for the API compared to subscriptions? They're not a monopoly, so one would expect margins to be thinner.

  • hobofan 18 days ago

    Monopoly isn't the only thing that allows you to charge large margins.

    API inference access is naturally a lot more costly to provide than the Chat UI and Claude Code, as there is a lot more load to handle at lower latency. In the products, they can smooth over load curves by handling some of the requests more slowly (which the majority of users in a background Code session won't even notice).

  • preommr 18 days ago

    It's why every integration basically tries to piggyback off of a subscription, and why Anthropic has to continuously play whack-a-mole trying to shut those services down.

  • timmmmmmay 17 days ago

    They're obviously trying to steer customers to the monthly subscription instead of the pay-per-token API.

    Now, the consensus of the commentards on this website, who don't have access to any of Anthropic's financial data, is that the monthly subscriptions are a money loser!

    So either the leading AI company's business dev team is wrong or the Jacker News comment section is wrong. It is a mystery.

akhrail1996 18 days ago

The comparison with Qwen/Kimi by "comparable architecture size" is doing a lot of heavy lifting. Parameter count doesn't tell you much when the models aren't in the same league quality-wise.

I wonder if a better proxy would be comparing by capability level rather than size. The cost to go from "good" to "frontier" is probably exponential, not linear - so estimating Anthropic's real cost from what it takes to serve Qwen 397B seems off.

darkwater 18 days ago

Well, IDK, I have used CC with API billing pretty extensively and managed to spend ~$1,000 in one month, more or less. I moved to a Max 20x subscription and am using it a bit less (I'm still scared), but not THAT much less, and I'm at around 10% weekly usage. I'm not counting the tokens, though.

behehebd 18 days ago

Did Anthropic do the oldest SaaS sales trick from the 2010s SaaS playbook? ;)

vmykyt 18 days ago

I have a very naive question:

People in the comments assume that Anthropic's models are about 10 times bigger than the Chinese ones, so the compute cost is 10 times higher.

But from the perspective of Big-O notation, only a few algorithms give you O(N); most highly optimized things are O(N log N).

So what is the big O for an open model serving a single request?

  • fancyfredbot 18 days ago

    It's a good question. Costs will be lumpy: inference servers have a preferred batch size, and once you have a server you can scale the number of users up to that batch size at relatively low cost. Then you need to add another server (or rack) for another large cost.

    However, I think it's fair to say the cost is roughly linear in the number of users beyond that (see the sketch below).

    There may be some aspects which are not quite linear when multiple users submit similar queries... but I don't think this would be significant.
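
    A back-of-the-envelope version of that stepwise cost curve (the server cost and batch slots are invented for illustration):

        import math

        SERVER_COST = 10_000  # hypothetical $/month per inference server
        BATCH_SLOTS = 64      # hypothetical concurrent users per server

        def monthly_cost(users: int) -> int:
            # Stepwise in servers, hence roughly linear in users at scale.
            return math.ceil(users / BATCH_SLOTS) * SERVER_COST

        for n in (10, 64, 65, 10_000):
            print(n, monthly_cost(n))
        # 10 and 64 users share one server; user 65 forces a second one.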

  • rat9988 18 days ago

    O(N log N) can be approximated as O(N) for most realistic use cases.

    As for LLMs, there is probably some constant cost added once the model can fit on a single GPU, but it should be almost linear.

ajstars 17 days ago

The compute cost debate misses a subtler point: the real cost multiplier isn't inference, it's context length. Most agent frameworks naively stuff 6-8k tokens into every prompt turn. If you route intelligently and compress memory hierarchically, you can bring that down to 200-400 tokens per turn with no quality loss. The model cost then becomes almost irrelevant.
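
A minimal sketch of that hierarchical compression idea - the summarize() below is a stand-in for whatever cheap model or heuristic you'd actually use:

    def summarize(text: str) -> str:
        # Stand-in: in practice, call a cheap model to compress old turns.
        return text[:120] + "..."

    def build_prompt(history: list[str], keep_verbatim: int = 2) -> str:
        """Keep the last few turns verbatim and fold everything older
        into one short memo, instead of restuffing it every turn."""
        old, recent = history[:-keep_verbatim], history[-keep_verbatim:]
        parts = ["[memory] " + summarize(" ".join(old))] if old else []
        return "\n".join(parts + recent)

    turns = [f"turn {i}: did a thing" for i in range(20)]
    print(build_prompt(turns))  # one compressed memo plus two full turns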

steveBK123 17 days ago

One consideration for me, regardless of the exact burn rate on inference, is the assumed increase in revenues via higher fees. One of the bull cases I often see is that the hockey-stick revenue growth continues longer/higher than the hockey-stick cost growth. Then it all prints money because people are spending 10x/100x/1000x what they are today.

In the real world ..

Where I work, AI is used heavily, and we are already tipping into cost-management mode at the firm level. Users are being aggressively steered to cheaper models, usage is throttled, and cost-attribution reports are sent out. This is already being done at the under-$1k/mo-per-user cost level. So there are some indications of revenue per user leveling out already.

Meanwhile, everyone I know who works anywhere near a computer has had AI shoved down their throat, with training, usage KPIs, annual goal-setting and mandated engagement. So we are already pretty saturated; it's not like there are giant new frontiers of users.

august11 16 days ago

They are clearly making money overall.

tom_m 17 days ago

Yea, it costs more than that.

zurfer 18 days ago

tldr: the author argues it is closer to costing 500 USD per month IF a user hits their weekly rate limits every week.

Which is probably a lot more correct than the other claims. However, it's also true that anybody who has to use the API might pay that much, creating a real cost-per-token moat for Anthropic's Claude Code vs other models, as long as they are so far ahead in terms of productivity.

otterley 18 days ago

The first clue that an article is likely bullshit is that it’s posted on Forbes. Forbes is now basically a contributor blog wearing the tattered remains of a designer suit.

sheepscreek 17 days ago

TL;DR: the premise for the calculation is completely flawed, even if the conclusion is correct.

> Qwen 3.5 397B-A17B is a good comparison point. It's a large MoE model, broadly comparable in architecture size to what Opus 4.6 is likely to be.

I stopped reading here. Frontier models have been rumoured to be in the TRILLIONS of parameters since the days of GPT-4. Besides, with agents, I think they're using more specialized models under the hood for certain tasks like exploration and web searches.

So while their cost won't be $5,000 or anywhere close, I still think it would be in the hundreds for heavy users. They may very well be losing money on the top 5-10% of Max users. Their real margin likely comes from business API customers.

Here's an interesting bit: OpenAI filed a document with the SEC recently that gave us a peek into its finances. The cost of all infrastructure stood at just ~30% of all revenue generated. That is a phenomenal improvement. I fell off my chair when I first learned that.

amelius 18 days ago

Tl;dr, their guesstimate:

> Anthropic is looking at approximately $500 in real compute cost for the heaviest users.

beepbooptheory 18 days ago

OK, but so it does cost Cursor $5k per Cursor power user?? Still seems pretty rough...

  • scriptsmith 18 days ago

    Yes, you could turn it around to say that using Anthropic models in Cursor, Copilot, Junie, etc. is 'subsidising' Claude Code users.

  • oefrha 18 days ago

    No, to use $5k in Cursor you have to pay $5k.

  • arthurcolle 18 days ago

    $5 = $5

    but $5 that I amortize over 7 years might end up being $1.7 maybe if I don't rapidly combust (supply chain risk)

  • dimava 18 days ago

    Cursor may be losing money only on $200-plan people who use over $200 of usage (the plan grants $400 worth).

    Everyone else pays them at API prices.

  • unlimit 18 days ago

    I wonder how they are defining a power user. How many tokens? What might the size of the code base be?

    • dietr1ch 18 days ago

      The $5k power user is the one that consistently uses all input and output tokens available under the Max subscription

secondary_op 18 days ago

Off-topic, but please stop hosting websites behind Cloudflare; plain nginx is enough. Cloudflare is a cancer.

  • AussieWog93 17 days ago

    Have you ever hosted a website without Cloudflare?

    In my case, the access logs alone from bots scanning for vulns grew so large that the server started creaking.

    Fortunately I wasn't running anything vulnerable!

  • notkyle 18 days ago

    Equating Cloudflare and NGINX seems a bit weird; they're entirely different tools with different purposes. Cloudflare has never done me any wrong as a web host or a web user. Calling it a cancer is very disingenuous.

dr_dshiv 18 days ago

I easily go through two pro max $200/m accounts and yesterday got a third pro account when I ran out.

It’s worth it, but I know they aren’t making money on me. But, of course I’m marketing them constantly so…

fnord77 18 days ago

> I'm fairly confident the Forbes sources are confusing retail API prices with actual compute costs

Aren't they losing money on the retail API pricing, too?

> ... comparisons to artificially low priced Chinese providers...

Yeah, no this article does not pass the sniff test.

  • versteegen 18 days ago

    > Aren't they losing money on the retail API pricing, too?

    No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else's.

    For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" - i.e. profit is 5.45x cost, so about 85% of revenue is profit (5.45 / 6.45 ≈ 0.85) - see my comment https://news.ycombinator.com/item?id=46663852

    • bandrami 18 days ago

      Weird that they're all looking for outside money then

      • vidarh 18 days ago

        They're all looking for outside money because everyone else is looking for outside money: they need to keep up with their competitors' investments in training. It's a game of chicken. Once their ability to raise more abates, they'll slow down new training runs and fund them out of inference margins instead, but the first one forced to do so will risk losing market share.

      • aurareturn 18 days ago

        Inference is profitable. No one is selling at a loss. It’s training to keep up with competitors that is causing losses.

        • bandrami 18 days ago

          > Inference is profitable

          Eh. We don't really know that, and the people saying that have an interest in the rest of the world believing it's true.

          • aurareturn 18 days ago

            How are we so sure that deep inside the moon isn't made out of cheese?

            • bandrami 17 days ago

              I remember Enron. Hell, I remember the S&Ls. I've seen this movie too many times to not know how it ends.
