Cost of AI-Driven Development

AI-driven SDLC is perhaps becoming the new norm, largely because of its virality. There is definite value in AI-SDLC, but there is also a lot of noise, and what makes the space chaotic is how hard it is to tell the two apart. We have seen this play out. Large-scale layoffs in the name of redundancy created global panic (arguably more than the wars happening around the same time). They were followed by some major production mishaps caused by blind reliance on AI, and then by a scramble to prevent such incidents and save face by re-examining everything from hiring to people practices.

Personally, FinOps is something I have been keen to fold into the data I use to make calibrated bets on my decisions. For the past year, I have been talking and blogging about the need to keep alternatives as a backup plan instead of relying solely on one vendor. We have witnessed vendor lock-in nightmares repeatedly for decades, and it seems the industry hasn't learned its lesson. History is repeating itself, and industry veterans are now calling it out. After Salesforce's CEO admitted that AI can't be trusted blindly, Uber's CTO Praveen Neppalli Naga confirmed in April 2026 that their annual budget had been completely consumed by high token-based costs from using Anthropic's Claude Code.

With that as the motivation, let us look at a scaled-down version through an experiment, one that is easy to extrapolate to your own scale when deciding which path to take. About 6-7 years ago, I rebuilt a native Android mobile app using Expo/React Native and published it to the Play Store, where it is still available today. After a long break, I picked it up again and upgraded it from Expo SDK 48 to SDK 54. Yesterday, I decided to record a day's worth of AI coding effort in terms of tokens and number of requests. For context, the whole day's task was to write unit tests for the app, which had virtually none. I used Claude Code with OpenRouter for this task, which let me collect the usage data summarized below:

Total Requests Made = 578 requests
Sum of all Input (Prompt) Tokens = 40,271,797 (40+ million tokens)
Sum of all Output (Completion) Tokens = 286,456
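The summary already hints at where the money goes. Dividing the daily totals by the request count gives the average size of each request. A minimal Python sketch using only the three figures above (the helper name `per_request_stats` is purely illustrative):

```python
# Daily totals from the summary above (Claude Code via OpenRouter, one day).
TOTAL_REQUESTS = 578
INPUT_TOKENS = 40_271_797
OUTPUT_TOKENS = 286_456


def per_request_stats(requests: int, input_tokens: int, output_tokens: int) -> dict:
    """Average tokens per request and the input-to-output token ratio."""
    return {
        "avg_input_per_request": input_tokens / requests,
        "avg_output_per_request": output_tokens / requests,
        "input_to_output_ratio": input_tokens / output_tokens,
    }


stats = per_request_stats(TOTAL_REQUESTS, INPUT_TOKENS, OUTPUT_TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

With these totals, each request averages roughly 69,700 input tokens against about 500 output tokens, a ratio of around 140 to 1. That imbalance is why input pricing dominates every cost estimate that follows.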

Visualizing the Gap

The table below shows the drastic difference in pricing for the exact same volume of work. For developers working on large Expo projects, choosing the right model isn't just a performance decision; it should also be a financial one.

Without a cost-effective routing strategy, the same day's work could have cost anything from the price of a coffee to the price of a high-end smartphone, depending on the model.

Model Tier	Total Cost ($)
Premium Reasoning (Claude 3 Opus / o1-preview)	~$620+
High-End General (GPT-4o / Claude 3.5 Sonnet)	~$125 - $205
Frontier Open Source (Llama 3.1 405B)	~$122
Efficiency King (DeepSeek V3 / Gemini 1.5 Flash)	~$3 - $6

These tiers are derived from the following table, which estimates what the same usage would cost across 20 high-performance models at each model's per-token input and output rates, sorted from most to least expensive.

Model	Input Cost ($)	Output Cost ($)	Total Cost ($)
Claude 3 Opus	604.08	21.48	625.56
o1-preview	604.08	17.19	621.26
GPT-4o	201.36	4.30	205.66
Gemini 1.5 Pro	140.95	3.01	143.96
Claude 3.5 Sonnet	120.82	4.30	125.11
Command R+	120.82	4.30	125.11
o1-mini	120.82	3.44	124.25
Mistral Large 2	120.82	2.58	123.39
Llama 3.1 405B	120.82	0.86	121.67
Grok-2	80.54	2.86	83.41
Codestral	40.27	0.86	41.13
Llama 3.1 70B	24.16	0.17	24.33
Phind-CodeLlama-34B	24.16	0.17	24.33
Qwen 2.5 72B	16.11	0.11	16.22
Claude 3 Haiku	10.07	0.36	10.43
GPT-4o-mini	6.04	0.17	6.21
DeepSeek V3	5.64	0.08	5.72
DeepSeek Coder V2	5.64	0.08	5.72
Gemini 1.5 Flash	3.02	0.09	3.11
Llama 3.1 8B	2.01	0.01	2.03
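Each row is plain arithmetic: the token count divided by one million, multiplied by the per-million-token rate, computed separately for input and output. The Python sketch below reproduces five rows. Its per-million prices are assumptions back-calculated from the table, not quotes from provider price sheets. It also assumes flat pricing, ignoring any prompt-caching or batch discounts a provider may apply, so check current pricing before reusing it.

```python
# Token totals from the one-day experiment.
INPUT_TOKENS = 40_271_797
OUTPUT_TOKENS = 286_456

# ASSUMED USD prices per million tokens (input, output), back-calculated from
# the table above; verify against each provider's current price sheet.
PRICES_PER_MILLION = {
    "Claude 3 Opus": (15.00, 75.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Llama 3.1 405B": (3.00, 3.00),
    "DeepSeek V3": (0.14, 0.28),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> tuple[float, float, float]:
    """Return (input_cost, output_cost, total_cost) in USD under flat per-million pricing."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost, output_cost, input_cost + output_cost


# Build (model, input_cost, output_cost, total) rows, most expensive first.
rows = sorted(
    ((model, *estimate_cost(INPUT_TOKENS, OUTPUT_TOKENS, *prices))
     for model, prices in PRICES_PER_MILLION.items()),
    key=lambda row: row[3],
    reverse=True,
)
for model, in_cost, out_cost, total in rows:
    print(f"{model:<20} {in_cost:>8.2f} {out_cost:>6.2f} {total:>8.2f}")
```

Swapping in your own token totals and current rates gives a like-for-like comparison for your workload.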

To conclude, the data from this one-day experiment makes one thing clear: the most expensive component of AI-SDLC isn't the intelligence; it's the context. Input tokens account for more than 96% of the total for every model in the table above.
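The input share can be checked directly from the table. This Python sketch copies a handful of rows verbatim and computes the fraction of each total that comes from input tokens. `input_share` is a hypothetical helper name, not part of any library.

```python
# (input_cost, output_cost) in USD, copied from the per-model table above.
TABLE_COSTS = {
    "Claude 3 Opus": (604.08, 21.48),
    "GPT-4o": (201.36, 4.30),
    "Claude 3.5 Sonnet": (120.82, 4.30),
    "Llama 3.1 405B": (120.82, 0.86),
    "DeepSeek V3": (5.64, 0.08),
}


def input_share(input_cost: float, output_cost: float) -> float:
    """Fraction of the total bill attributable to input (context) tokens."""
    return input_cost / (input_cost + output_cost)


shares = {model: input_share(*costs) for model, costs in TABLE_COSTS.items()}
for model, share in shares.items():
    print(f"{model:<20} {share:.1%} of the bill is input tokens")
```

For these rows the share lands between roughly 96% and 99%. For a workload shaped like this one, trimming the context sent with each request moves the bill far more than output pricing does.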

In a world where models must "ingest" tens of millions of tokens of existing code and context just to produce a day's worth of unit tests, we cannot afford to treat LLM providers as a utility like water or electricity. If a single developer can inadvertently rack up a $600 bill in 24 hours using premium reasoning models, it's easy to see how a company like Uber could burn through an annual budget in record time.
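To extrapolate the experiment to a team, as suggested earlier, multiply the daily cost by headcount and working days. The team size of 50 and the 220 working days a year below are hypothetical. The sketch also assumes every developer burns roughly this experiment's daily volume, which real usage will not match exactly.

```python
# Daily cost per developer in USD, taken from the per-model table.
DAILY_COST_USD = {
    "Claude 3 Opus": 625.56,
    "Claude 3.5 Sonnet": 125.11,
    "DeepSeek V3": 5.72,
}


def annual_cost(daily_cost: float, developers: int, working_days: int = 220) -> float:
    """Scale one developer's daily spend to a team-year.

    ASSUMPTION: every developer uses about the experiment's daily volume on
    every working day; 220 working days is a hypothetical default.
    """
    return daily_cost * developers * working_days


TEAM_SIZE = 50  # hypothetical team size
for model, daily in DAILY_COST_USD.items():
    print(f"{model:<18} ${annual_cost(daily, TEAM_SIZE):,.0f} per year")
```

Under these assumptions, the team would spend about $6.9 million a year at Claude 3 Opus rates versus roughly $63,000 at DeepSeek V3 rates.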

To survive this era of AI-driven development without going bankrupt or falling into a new generation of vendor lock-in, we need to shift our focus in 3 ways:

Multi-Model Orchestration: The "Efficiency Kings" like DeepSeek or Gemini Flash are not just "budget" options; they are the necessary workhorses for context-heavy tasks like unit testing. Reserve the premium models for architectural pivots, not boilerplate.
Institutional FinOps: Coding is no longer just a labor cost; it is now a variable compute cost. Engineering leads must become as proficient in token management as they are in memory management.
The "Trust but Verify" Mandate: As the Salesforce CEO and Uber CTO have signalled, the "AI-will-fix-it" euphoria is cooling. The value of the human developer is shifting from writing the code to governing the AI that writes it, ensuring that the "noise" of 40 million tokens actually results in "value" for the production system.
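The first two shifts can be combined into a simple routing rule. The Python sketch below is illustrative only. The task categories, the model labels (placeholders, not real gateway model IDs) and the daily cap are all hypothetical. Routine, context-heavy work goes to a low-cost workhorse model, and a premium model is used for architectural work only while its estimated cost still fits the day's budget.

```python
from dataclasses import dataclass
from enum import Enum


class TaskKind(Enum):
    """Hypothetical task categories; adapt them to your own workflow."""
    UNIT_TESTS = "unit_tests"
    BOILERPLATE = "boilerplate"
    ARCHITECTURE = "architecture"


# Placeholder labels for the two tiers, not real gateway model IDs.
PREMIUM_MODEL = "claude-3-opus"  # "Premium Reasoning" tier
WORKHORSE_MODEL = "deepseek-v3"  # "Efficiency King" tier


@dataclass
class Budget:
    """Tracks spend in USD against a hypothetical daily cap."""
    daily_cap: float
    spent: float = 0.0

    def can_afford(self, estimated_cost: float) -> bool:
        # True if this call would keep total spend within the cap.
        return self.spent + estimated_cost <= self.daily_cap

    def record(self, actual_cost: float) -> None:
        # Add the cost of a completed call to the running total.
        self.spent += actual_cost


def choose_model(kind: TaskKind, estimated_premium_cost: float, budget: Budget) -> str:
    """Default to the workhorse; use the premium tier only for architectural
    work, and only while the estimated premium cost fits the remaining budget."""
    if kind is TaskKind.ARCHITECTURE and budget.can_afford(estimated_premium_cost):
        return PREMIUM_MODEL
    return WORKHORSE_MODEL


budget = Budget(daily_cap=50.0)
print(choose_model(TaskKind.UNIT_TESTS, 20.0, budget))    # deepseek-v3
print(choose_model(TaskKind.ARCHITECTURE, 20.0, budget))  # claude-3-opus
budget.record(45.0)
print(choose_model(TaskKind.ARCHITECTURE, 20.0, budget))  # deepseek-v3 (would exceed cap)
```

In practice, the actual per-request cost (from your gateway's usage reporting or your own token accounting) would be passed to `record`. The design choice is simply that the cheap model is the default and the premium model has to be justified.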

The path forward isn't to retreat from AI, but to stop betting blindly on it. By using AI gateways and inference providers (OpenRouter, Groq, etc.) along with diversified model strategies, we can keep the "Fin" in "FinOps" from spiraling out of control while still capturing the "Ops" efficiency that AI promises. History may repeat itself, but your cloud bill doesn't have to.