Ask HN: How are people forecasting AI API costs for agent workflows?
I’ve been experimenting with agent-based features and one thing that surprised me is how hard it is to estimate API costs.
A single user action can trigger anywhere from a few to dozens of LLM calls (tool use, retries, reasoning steps), and with token-based pricing the cost can vary a lot.
How are builders here planning for this when pricing their SaaS?
Are you just padding margins, limiting usage, or building internal cost tracking?
Also curious: would a service that offers predictable pricing for AI APIs (like a fixed subscription cost) actually be useful for people building agentic workflows?

We hit this exact problem and built NORNR (nornr.com) around it. Instead of trying to forecast costs upfront, we enforce them at runtime. Every agent action that involves spend has to request a mandate first. Policy decides approved, queued, or blocked based on budget, rate, context, whatever you define. Every decision gets a signed receipt so you have a full audit trail. It doesn't replace your cost estimates, but it puts a hard ceiling on what agents can actually spend. Free tier if you want to try it.

We've been dealing with this exact problem building agent-driven workflows. A few things that have actually helped:

The unpredictability is worse than the absolute cost. Our billing model broke several times not because costs were high, but because we couldn't bound them. One approach that helped: define a "token budget" per user action at design time. Cap total tokens per session and treat hitting the cap as a first-class outcome your product handles gracefully, not an error.

On the forecasting side, we track cost per workflow step rather than per request. Step-level cost is much more stable than request-level because it absorbs the variance in tool calls and retries. Once you have step costs, you can forecast by expected workflow composition.

On fixed subscription pricing for AI APIs: I'd actually pay a premium for that. The unpredictability creates a hidden cost: you over-provision margins and add complexity to your pricing tier design. A flat rate for a capacity bucket would eliminate both. The question I'd ask about any such service: how do they handle the tail cases where agents go off the rails and rack up 10x normal token usage? That's where the cost risk actually lives.

I'd break it into a few separate numbers instead of trying to estimate one blended API cost.
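A minimal sketch of the per-session token budget idea above (class and method names, and the cap values, are hypothetical; the point is that hitting the cap is a normal outcome the product handles, not an exception):

```python
from dataclasses import dataclass

@dataclass
class SessionBudget:
    """Hypothetical per-user-action token cap, fixed at design time."""
    cap: int
    used: int = 0

    def try_spend(self, tokens: int) -> bool:
        """Reserve tokens for the next call. Returning False means the
        cap was hit; the caller degrades gracefully (partial result,
        summary of work so far) instead of raising an error."""
        if self.used + tokens > self.cap:
            return False
        self.used += tokens
        return True

budget = SessionBudget(cap=10_000)
budget.try_spend(4_000)   # fits; proceed with the call
budget.try_spend(7_000)   # would exceed the cap: return what we have
```

The useful property is that total spend per user action is bounded no matter how many retries or tool calls the agent attempts.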
We usually look at cost per workflow run, runs per active account, and the heavier paths separately, then keep retries and tool calls as their own line items. That makes the pricing side easier to reason about.

this is exactly the problem we keep running into. the cost isn't just "how many tokens did this call use," it's "how many tokens did this entire user action consume across all the agent loops, retries, tool calls, and embeddings." most observability tools show you the LLM call as one flat span. you can see it cost X tokens but you can't correlate it with the API request that triggered it, or see that the agent looped 4 times because the first 3 outputs failed validation. so you end up building custom logging and hoping the numbers add up.

we've been building an APM (immersivefusion.com) where cost is a first-class dimension on every trace. so you can see one request flow from the UI through your backend through the agent workflow, and each span carries its token cost. the idea is you should be able to answer "what does a checkout cost when the recommendation agent is in the loop" without stitching together 3 different tools.

for the forecasting question specifically, i think the answer is you need a few weeks of production data with good instrumentation, and then you can build a distribution. the variance is real but it's not random; it's usually a few specific flows that blow up (retries on bad structured output like @hkonte mentioned, or RAG queries that hit the wrong chunk size). once you can see which flows are expensive, the guardrails become obvious.

also wrote a longer piece on this if anyone's interested: immersivefusion.com/blog/end-to-end-observability-from-ui-to-ai-agent-to-invoice

Agreed. The real cost unit becomes the whole agent workflow, not a single LLM call. One user action can trigger dozens of calls. We ran into the same issue and ended up building https://oxlo.ai to make the cost side more predictable for agent workloads.
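The step-level forecasting approach described in the thread can be sketched in a few lines (step names, per-step costs, and counts are all made up for illustration):

```python
# Hypothetical mean cost per workflow step (USD), measured from traces.
step_cost = {"plan": 0.002, "tool_call": 0.004, "validate": 0.001}

# Expected number of each step per workflow run. Retry and tool-call
# variance gets absorbed into these averages, which is why step-level
# numbers are more stable than per-request numbers.
steps_per_run = {"plan": 1.0, "tool_call": 3.2, "validate": 1.4}

cost_per_run = sum(step_cost[s] * n for s, n in steps_per_run.items())

# Forecast by composition: runs per active account, then accounts.
runs_per_account = 120          # hypothetical monthly figure
cost_per_account = cost_per_run * runs_per_account
```

Keeping retries and tool calls as their own line items then just means tracking their expected counts separately instead of folding them into one blended number.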
lowercasing doesn’t obfuscate the stench of LLM

I love the idea. @Edgee.ai we are tracking cost in real time by tag, LLM, ...but not yet on forecast, and indeed it would be very useful. Something to explore; thanks for the feedback.

That’s great. Real-time tracking is a big step already. The tricky part we kept running into was the variance itself, especially with retries and agent loops. That’s partly why we started experimenting with Oxlo.ai (https://oxlo.ai), where the pricing model absorbs that variance so builders don’t have to constantly model token risk.

Just add very hard high limits, plus instrumentation so you can track spend and re-evaluate the limits accordingly. This takes a couple of hours at most.

Sounds like a plan. But what if you could just pay a fixed cost every month and not worry about any of it?

What you suggested is the best way in my opinion, but given what OP asked, I gave my answer.

It feels like the traditional fixed SaaS pricing model is slowly shifting toward more consumption-based pricing.

That’s true, but AI is interesting because consumption-based pricing introduces a lot more variance than typical SaaS infrastructure. One user action can trigger dozens of model calls in an agent workflow. That’s partly why we started experimenting with models like https://oxlo.ai, where the pricing flips back to a fixed subscription and we absorb the usage spikes.

Honestly, if you're designing your agent workflows properly, with hard limits on retries and tool calls, the variance shouldn't be that wild. Most of the unpredictability comes from not having those guardrails in place early on. A few weeks of real production data usually shows the average cost is more stable than you'd expect.

True, but for early-stage builders it’s harder to design those guardrails upfront. A lot of the time you only discover the retry patterns and cost spikes once real users start hitting the system.

Fair point.
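The "hard limits plus instrumentation" suggestion above can be sketched like this (function names and the specific limits are hypothetical):

```python
import logging

log = logging.getLogger("agent.limits")

def run_with_limits(step, validate, max_retries=3, max_tool_calls=10):
    """Hard ceilings on agent loops. Hitting a limit is logged and
    returned as an outcome, so spend per action stays bounded."""
    tool_calls = 0
    for attempt in range(1, max_retries + 1):
        result = step()                        # one agent step (LLM + tools)
        tool_calls += result.get("tool_calls", 0)
        if tool_calls > max_tool_calls:
            log.warning("tool-call cap hit on attempt %d", attempt)
            return {"status": "capped", "reason": "tool_calls"}
        if validate(result):
            return {"status": "ok", "attempts": attempt, "result": result}
        log.info("output failed validation, retry %d/%d", attempt, max_retries)
    return {"status": "capped", "reason": "retries"}
```

Counting the "capped" outcomes is the instrumentation part: it tells you which flows keep hitting the limits, which is exactly the data you need to re-evaluate them later.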
And honestly, with more non-technical builders shipping agent-based products these days, that's probably where a service like this makes the most sense: for people who don't yet have the experience to know what guardrails to put in place.

Exactly. That’s actually why we started building Oxlo.ai. Early-stage builders usually just want to experiment without worrying too much about token cost spikes.

imo switching to local models could be an option

Local models solve the marginal cost problem, but they move the complexity into infrastructure and throughput planning instead.

makes sense, it really depends on the use case. I'm building my version of claw, openwalrus, with local LLMs as the first goal. I think I'll use local models myself for daily tasks that heavily depend on tool calling, but for coding or research I'll keep using remote models. this topic actually inspires me to introduce a built-in gas meter for tokens