Executive Summary
ARR was always a shortcut. A contractual fiction dressed up as financial truth. For fifteen years, nobody questioned it hard enough because the margins were fat and the costs were predictable. Then AI made the cost side of the P&L variable, and ARR stopped being incomplete. It became actively misleading.
"ARR is dead, and it's a metric that's not coming back."
Antti Pasila called it “one of the worst vanity metrics ever invented, a value extrapolation that blurs investors' focus.” Paul Lundin put it most precisely: “Half of ARR isn't real. It's a costume metric, dressed up to look predictable, recurring, and contractual when it's often none of those things.”
Ed Zitron spent months documenting exactly how bad this gets under AI economics. Anthropic leaked $14 billion in annualized revenue figures while court filings showed $5 billion in lifetime revenue. As Zitron reported, OpenAI published ARR graphics without disclosing they were annualized figures. The gap between what ARR implies and what a company actually earns has never been more dangerous to act on.
Anthropic, ARR claimed (leaked): $14B (annualized figure)
Anthropic, lifetime revenue (CFO affidavit): $5B (actual cumulative earnings on record)
ARR rewarded growth theater. AI punishes it.
The Conditions That No Longer Exist
Why ARR worked, then stopped
ARR made sense under one specific set of conditions: predictable 80–90% gross margins, near-zero marginal cost per user, and contracts that were actually annual and actually recurring. One seat, one price, one line item. Boards understood it. CFOs could forecast it. Nobody questioned it because nobody had a reason to.
AI products broke all three conditions simultaneously.
Every inference call costs real money. The compute behind a single Opus call or a complex code completion is a line item on your infrastructure bill. When a customer runs 10,000 API calls on a flat $49 subscription, ARR still looks fine. Gross margin is quietly dying.
Your top 5% of users might consume 75% of your compute costs while paying the same flat fee as everyone else. Those customers aren't low-margin. They're loss-making. ARR registers them identically to every profitable customer you have. This is the default state of any AI product with a flat pricing model, not an edge case to watch for.
The Structural Flaw, Made Concrete
The specific problem with ARR
Zitron's reporting makes the structural flaw concrete. Anthropic posted $14 billion in annualized revenue while its own CFO's court affidavit confirmed $5 billion in lifetime revenue. OpenAI published a graphic showing “$20bn ARR in 2025,” annualized, undisclosed, multiplied from a single month's revenue. These aren't isolated cases of creative accounting. They're what happens when a metric designed for predictable SaaS gets applied to a business where costs are volatile, margins swing weekly, and “recurring” is optimistic.
Illustrative case: how unit economics diverge over time
Imagine two customers, both paying $2,000 per month. Customer A runs light workloads, yielding a gross margin of 78%.
Customer B runs heavy, model-intensive jobs daily, yielding a gross margin of just 31%.
Your ARR dashboard shows two identical $24K lines. Customer B is destroying your unit economics, and ARR has no mechanism to show you that.
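The arithmetic behind the example is simple enough to sketch. The serving costs below are hypothetical numbers chosen to reproduce the two margins above:

```python
def gross_margin(monthly_revenue: float, monthly_serve_cost: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (monthly_revenue - monthly_serve_cost) / monthly_revenue

# Both customers pay $2,000/month; only the cost to serve them differs.
# Cost figures are illustrative assumptions, not real infrastructure data.
customers = {
    "A (light workloads)": {"revenue": 2000, "cost": 440},   # 78% margin
    "B (heavy inference)": {"revenue": 2000, "cost": 1380},  # 31% margin
}

for name, c in customers.items():
    arr = c["revenue"] * 12  # what the ARR dashboard shows: identical $24K lines
    margin = gross_margin(c["revenue"], c["cost"])
    print(f"{name}: ARR ${arr:,}, gross margin {margin:.0%}")
```

Both rows report the same ARR; only the margin column reveals that Customer B is the problem.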
ARR captures revenue; it doesn't capture cost. In classic SaaS, uniformly high margins made that omission tolerable. When margins vary this widely per customer, it isn't.
Replacement metric #1
Gross profit per token
Kyle Poyar and CJ Gustafson made this point on Mostly Growth: token consumption alone is a pass-through metric. What you charge versus what it costs to serve those tokens is what tells you whether you're building a business or subsidizing one.

Most AI companies today can't answer that question per customer. They can tell you total token consumption, but not gross profit per token, by customer, updated as usage happens. That data lives across three systems that don't talk to each other.
I firmly believe it is a billing architecture problem.
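A minimal sketch of the calculation itself, assuming hypothetical per-model cost rates (the model names and prices are illustrative, not any vendor's actual pricing):

```python
# Illustrative serving cost per 1,000 tokens, by model tier (assumed rates).
COST_PER_1K_TOKENS = {"small": 0.0004, "large": 0.0120}

def gross_profit_per_token(revenue: float, usage: dict) -> float:
    """Revenue minus serving cost for a billing period, per token served.

    usage maps model tier -> tokens consumed on that tier.
    """
    total_tokens = sum(usage.values())
    serve_cost = sum(tokens / 1000 * COST_PER_1K_TOKENS[model]
                     for model, tokens in usage.items())
    return (revenue - serve_cost) / total_tokens

# One hypothetical customer, one period: $2,000 revenue, mixed model usage.
profit = gross_profit_per_token(2000, {"small": 40_000_000, "large": 90_000_000})
# Positive here (~$7 gross profit per million tokens), but the same revenue
# with a heavier "large"-tier mix flips it negative.
```

The formula is trivial; the hard part is that `revenue`, `usage`, and the cost rates typically live in three different systems.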
Replacement metric #2
LTV to CAC is over
Gustafson's framing was blunt: LTV to CAC is “fun with numbers” for AI startups. Nobody actually knows what the lifetime of an AI customer is. Switching friction is lower than any previous generation of software. The tools to replace you now deploy in hours.
LTV compounds assumptions: low churn, predictable expansion, long retention. AI products haven't yet earned the right to project any of those. First-year value asks the sharper question: did this customer get enough value in the first twelve months to justify renewing? If the answer requires year two or three to become yes, LTV is obscuring a retention problem.
Replacement metric #3
Productivity per dollar spent
The old version, ARR per FTE, made sense when headcount was the main cost driver. Poyar's updated framing is more precise: ARR per dollar spent on both human headcount and AI agents. The ratio has to account for both.
An agent running 40-hour-equivalent workflows per week is not a software subscription. It's a workforce cost. If AI spend doesn't appear in the denominator of your efficiency metric, the metric is flattering you. Some companies running this number for the first time discover their ARR per FTE looks clean while their true efficiency ratio has been degrading for two years.
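As a sketch, assuming hypothetical figures for ARR, payroll, and agent spend:

```python
def efficiency_ratio(arr: float, headcount_cost: float, ai_agent_cost: float) -> float:
    """ARR per dollar spent on production capacity: humans plus AI agents."""
    return arr / (headcount_cost + ai_agent_cost)

# Hypothetical company: $12M ARR, $6M payroll, $1.5M annual agent/inference spend.
legacy = 12_000_000 / 6_000_000  # old ARR-per-headcount-dollar view: 2.0
true_ratio = efficiency_ratio(12_000_000, 6_000_000, 1_500_000)  # 1.6
# The legacy number looks clean; adding agent spend to the denominator
# reveals the real ratio is 20% lower.
```

The point is the denominator: if AI spend is missing from it, the metric flatters you by exactly the size of that spend.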
Replacement metric #4
Retention by output quality
Customer health scores predict churn by measuring proxy engagement: login frequency, feature adoption, support volume, NPS. In AI products, customers evaluate you on output quality, not how often they log in.
A customer who uses the product daily but gets mediocre outputs will leave. A customer who uses it once a week to automate one high-value workflow will stay for years. If your retention model runs on engagement signals while customers make decisions based on output quality, you're measuring the wrong leading indicator. You'll see the churn three quarters after you should have.
Output quality is harder to instrument than login frequency. That's why most teams don't track it. It's also the only signal that reliably predicts what happens at renewal.
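One way to sketch the difference, using an illustrative output-acceptance signal (e.g. the share of generated outputs a customer keeps without rework). The weights are assumptions for illustration, not a validated model:

```python
def health_score(output_acceptance: float, logins_per_week: float) -> float:
    """Blend retention signals, weighting output quality far above engagement.

    output_acceptance: fraction of outputs the customer accepts (0.0-1.0).
    Weights (0.8 / 0.2) are illustrative assumptions.
    """
    engagement = min(logins_per_week / 5, 1.0)  # cap: daily use isn't 7x weekly
    return 0.8 * output_acceptance + 0.2 * engagement

# The two customers from the paragraph above:
daily_but_mediocre = health_score(output_acceptance=0.30, logins_per_week=7)   # ~0.44
weekly_but_valuable = health_score(output_acceptance=0.90, logins_per_week=1)  # ~0.76
```

An engagement-only model ranks these two customers in exactly the opposite order, which is why it sees the churn three quarters late.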
The Fix
The billing layer is where this gets fixed
None of this requires a new dashboard. It requires new billing infrastructure that captures cost alongside revenue.
Stripe doesn't record what it cost you to serve customers, so the data needed to calculate gross profit per customer typically lives across three systems: your billing tool, your infrastructure cost reports, and your finance team's spreadsheets. By the time someone reconciles them, you're looking at last quarter.
You need a billing layer that meters token consumption, maps it against inference costs by model and tier, and surfaces gross profit per customer as usage happens. Not at month-end close. As it happens.
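A toy sketch of what such a meter does, with illustrative cost rates and a hypothetical customer (no real vendor API is assumed here):

```python
from collections import defaultdict

# Illustrative serving cost per 1,000 tokens, by model tier (assumed rates).
COST_PER_1K = {"fast": 0.0005, "frontier": 0.0150}

# Running per-customer ledger, updated on every usage event rather
# than reconciled at month-end close.
ledger = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})

def record_usage(customer: str, model: str, tokens: int, billed: float) -> None:
    """Meter one usage event: what was billed and what it cost to serve."""
    entry = ledger[customer]
    entry["revenue"] += billed
    entry["cost"] += tokens / 1000 * COST_PER_1K[model]

def gross_profit(customer: str) -> float:
    e = ledger[customer]
    return e["revenue"] - e["cost"]

# Hypothetical customer "acme": frontier-heavy usage, lightly billed.
record_usage("acme", "frontier", 2_000_000, billed=25.0)
record_usage("acme", "fast", 500_000, billed=1.0)
# gross_profit("acme") is already negative at this point in the period,
# visible as usage happens, not after the books close.
```

The mechanics are mundane; what matters is that cost and revenue land in the same ledger, keyed by customer, in real time.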
The companies still reporting ARR as their primary health metric are showcasing their billing system's limitations.