We Built Features with Claude Code. Then We Measured the Code.
At Codeflash, we're heavy users of Claude Code ourselves: it's fast, capable, and lets a small team punch above its weight. Recently, we used it to build two major features: Java language support and React framework support.
These weren't toy projects. Java support added 52,000 lines across dozens of modules — parsers, context extractors, instrumentation, test runners, assertion transformers. React support added another 24,000 lines covering component discovery, profiling, benchmarking, and code replacement.
Claude Code helped us build both of these fast. Then Codeflash — our own optimization tool — ran on the PRs.
The results were sobering.
118 Performance Optimizations in Two PRs
Across just these two pull requests, Codeflash identified 118 functions that were performing significantly worse than they needed to.
These weren't edge cases or micro-benchmarks. These were functions in the hot path of our optimizer — code that runs on every optimization job, for every user.
What Did the Slow Code Look Like?
The optimizations fell into clear patterns — exactly the kind of inefficiencies that an engineer focused on correctness (or an AI that doesn't think about performance) would introduce:
Catastrophically inefficient algorithms. A type extraction function in our Java context module was 446x slower than it needed to be. Claude Code had implemented it with naive string scanning when tree-sitter-based extraction would have been far more efficient. A helper function finder was 74x slower for similar reasons.
Redundant computation everywhere. Functions were re-parsing data they'd already parsed, re-traversing trees they'd already walked, rebuilding strings character by character. An assertion target call builder was 19x slower than necessary because it was recomputing source byte conversions on every invocation instead of caching them. An import-insertion utility in the React PR was 36x slower due to redundant tree traversals.
Missing caching. Functions called repeatedly with the same inputs were computing results from scratch every time. A type definition extractor in the React PR was 16x slower because it wasn't memoizing intermediate results, and an export checker was 9x slower for the same reason.
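As an illustration of this memoization pattern — using a hypothetical function, not the actual code from either PR — Python's `functools.lru_cache` turns a repeated computation into a dictionary lookup:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def extract_type_definition(type_name: str) -> str:
    """Hypothetical stand-in for an expensive extraction step.

    With lru_cache, repeated calls with the same type_name return
    the cached result instead of re-scanning the source every time.
    """
    # Simulate expensive work: scan a long source string for a marker.
    source = ("type Other = {};\n" * 5000) + f"type {type_name} = {{}};"
    for line in source.splitlines():
        if line.startswith(f"type {type_name}"):
            return line
    return ""
```

The first call with a given `type_name` does the scan; every subsequent call with the same argument is a cache hit.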
Suboptimal data structures. Lists where sets should have been. Linear searches where hash lookups would work. String concatenation in loops instead of joins. A brace-balancing parser was 3x slower due to inefficient data structure choices, and a parentheses matcher had similar issues.
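Two of these patterns in miniature — an illustrative sketch, not the actual PR code:

```python
# List membership is O(n) per lookup; a set is O(1) on average.
known_exports = ["useState", "useEffect", "useMemo"]  # slow: linear scan
known_exports_set = set(known_exports)                # fast: hash lookup

def is_known_slow(name: str) -> bool:
    return name in known_exports      # scans the list every call

def is_known_fast(name: str) -> bool:
    return name in known_exports_set  # constant-time hash lookup

# String concatenation in a loop copies the accumulated string each
# iteration (O(n^2) total); str.join builds the result once (O(n)).
def build_report_slow(lines: list[str]) -> str:
    out = ""
    for line in lines:
        out += line + "\n"
    return out

def build_report_fast(lines: list[str]) -> str:
    return "\n".join(lines) + ("\n" if lines else "")
```

Both versions of each pair return identical results; only the asymptotic cost differs, which is exactly why the slow variant survives code review.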
These aren't obscure performance tricks. They're the kind of choices a performance-aware engineer makes instinctively — but that AI agents systematically skip.
A Real Example: 19x Slower Than Necessary
Here's actual code Claude Code wrote to convert byte offsets to character positions in a source file — and what Codeflash replaced it with (PR #1597):
What Claude Code wrote:

```python
# Called for every AST node found in the file
start_char = len(content_bytes[:start_byte].decode("utf8"))
end_char = len(content_bytes[:end_byte].decode("utf8"))
```

What Codeflash replaced it with:

```python
# Build a lookup table once, then binary search for every node
from bisect import bisect_right

cum_bytes = [0]
for ch in source.decode("utf8"):
    cum_bytes.append(cum_bytes[-1] + len(ch.encode("utf8")))

start_char = bisect_right(cum_bytes, start_byte) - 1
end_char = bisect_right(cum_bytes, end_byte) - 1
```

Same result. 19x faster. The original code decodes the entire byte prefix from the beginning of the file on every single call — O(n) per lookup. For a file with hundreds of AST nodes, that means re-decoding the same bytes hundreds of times. The optimized version builds a lookup table once and uses binary search — O(n) once, then O(log n) per lookup.
This is the kind of optimization that an experienced engineer spots quickly. Claude Code never considered it because it had no reason to — the original version is correct, readable, and obvious. It just happens to be an order of magnitude slower.
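To make the comparison concrete, here is a self-contained sketch of both approaches (function and variable names are ours, not the PR's), including a multi-byte character so character positions actually diverge from byte offsets:

```python
from bisect import bisect_right

def byte_to_char_naive(content_bytes: bytes, byte_offset: int) -> int:
    # O(n) per call: decodes the entire prefix on every lookup.
    return len(content_bytes[:byte_offset].decode("utf8"))

def build_offset_table(content_bytes: bytes) -> list[int]:
    # cum_bytes[i] is the byte offset where character i starts.
    cum_bytes = [0]
    for ch in content_bytes.decode("utf8"):
        cum_bytes.append(cum_bytes[-1] + len(ch.encode("utf8")))
    return cum_bytes

def byte_to_char_fast(cum_bytes: list[int], byte_offset: int) -> int:
    # O(log n) per call after the one-time O(n) table build.
    return bisect_right(cum_bytes, byte_offset) - 1

source = "πr² = area".encode("utf8")  # "π" and "²" are 2 bytes each
table = build_offset_table(source)
```

Calling `byte_to_char_fast(table, 3)` returns character index 2 (the "²"), matching the naive version at a fraction of the cost when queried hundreds of times.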
Of course, humans write unoptimized code too. The difference is frequency and invisibility. When a senior engineer writes a naive implementation, they usually know it's a shortcut — they'll leave a # TODO: optimize or flag it in review. When an AI agent does it across 118 functions in a single PR, nobody knows. There's no self-awareness, no flag, and no prompt asking for performance.
Why AI Agents Produce Slow Code
This isn't a bug in Claude Code or any specific tool. It's a structural problem with how LLMs approach code generation:
1. LLMs optimize for correctness, not performance. Training data rewards code that works, not code that works fast. The simplest correct implementation — O(n²) when O(n) exists, a list when a set would work — is what the model reaches for.
2. Optimization is an exploration problem, not a generation problem. You can't tell whether code is slow just by reading it — you have to benchmark it. And even once you know a function is slow, finding the fastest implementation requires trying multiple approaches and measuring each one against real data. It's also data-dependent: one benchmark might crown approach A the winner, while a different dataset favors approach B entirely. This kind of iterative, empirical search — generate candidates, run benchmarks, compare results, refine — is fundamentally different from how LLMs produce code, which is single-pass generation.
3. Performance requires system-level reasoning. Understanding that a function will be called 10,000 times in a loop, or that a data structure will grow to millions of entries, requires reasoning about the broader execution context. LLMs generate code locally — one file at a time — without this global view.
4. Users prompt for features, not performance. When you tell Claude Code "add Java support" or "build a React profiler," the implicit optimization target is working code, delivered fast. Nobody prompts "and make sure every function uses optimal data structures and algorithms." The incentive structure rewards shipping speed, not runtime speed.
5. Performance problems are invisible. A feature that works but runs 100x slower than it should looks exactly the same as an optimized one. There's no failing test, no error message, no red flag. The cost shows up later — in cloud bills, in user complaints about latency, in infrastructure that needs to scale sooner than it should.
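The generate-measure-compare loop from point 2 can be sketched with nothing more than the standard library's `timeit`; the candidate functions here are illustrative, not code from either PR:

```python
import timeit

def dedupe_with_list(items):
    # Candidate A: O(n^2) membership checks against a growing list.
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen

def dedupe_with_dict(items):
    # Candidate B: dict preserves insertion order with O(1) lookups.
    return list(dict.fromkeys(items))

data = list(range(2000)) * 2  # realistic input with duplicates

# The empirical step that single-pass generation skips: run every
# candidate on real data, measure, and keep the fastest one.
candidates = {"list": dedupe_with_list, "dict": dedupe_with_dict}
timings = {
    name: timeit.timeit(lambda f=fn: f(data), number=5)
    for name, fn in candidates.items()
}
winner = min(timings, key=timings.get)
```

Both candidates are correct; only measurement reveals that one is orders of magnitude faster on this input, and a different input shape could shift the margin.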
This Isn't Just Us. The Research Confirms It.
We've been studying this problem for a while. Our own research found that 90% of AI-suggested code optimizations are either incorrect or provide no real performance benefit. When we tested over 100,000 optimization attempts, 62% produced incorrect behavior, and of the ones that were correct, 73% yielded less than 5% improvement or actually made performance worse. This was with GPT-4o — at the time, one of the most capable models available.
The natural assumption would be that newer, more powerful models have solved this. They haven't.
The SWE-fficiency benchmark evaluated 11 frontier LLMs — including the latest models like Claude 4.6 Opus — on 498 real-world performance optimization tasks drawn from major Python libraries (NumPy, Pandas, scikit-learn, and others). The results tell the same story:
- The best models achieved less than 0.23x the speedup that human experts achieved on the same tasks
- Models frequently optimized the wrong functions entirely, missing the actual bottlenecks
- When models did find the right function, they settled for suboptimal solutions rather than pushing toward expert-level improvements
- Models regularly broke correctness while attempting to improve performance
The gap between what we measured with GPT-4o and what SWE-fficiency found with the latest frontier models is remarkably small. Better LLMs are not closing this gap — because the problem isn't model capability. It's the fundamental mismatch between how LLMs generate code (single-pass, correctness-first) and what optimization actually requires (iterative benchmarking, empirical exploration, system-level reasoning).
The Productivity Paradox
Here's the uncomfortable math: if AI agents make you 10x more productive at writing code, but the code they produce runs 2-5x slower, you've traded engineering time for compute cost. 52% of engineering leaders already report that increased AI usage directly leads to performance problems. For teams shipping to production, that tradeoff compounds:
- Higher cloud costs from inefficient compute usage
- Worse user experience from slower response times
- Earlier scaling problems as inefficient code hits production load
- Technical debt that accumulates invisibly until it becomes a crisis
If your leadership is sizing infrastructure based on pre-AI code performance assumptions, they're in for a surprise. The code shipping today runs meaningfully slower than the code it replaced, and nobody budgeted for that.
The productivity gains from AI coding are real. But they're not free. And the cost is hiding in a place most teams aren't looking.
What You Can Do About It
The answer isn't to stop using AI coding agents; they're too useful for that. The answer is to add a performance layer to your development workflow:
- Measure before you merge. Run performance analysis on AI-generated code before it hits production. If a function is 100x slower than it needs to be, you want to know before your users do.
- Don't trust "it works" as the bar. Correctness is necessary but not sufficient. A correct function that wastes 99% of its compute is still a problem.
- Automate performance checks in CI. Just as you wouldn't merge without tests passing, don't merge without performance analysis. Tools like Codeflash, continuous profilers, and benchmarking frameworks can catch regressions before they hit production. On our own PRs, automated analysis found 118 optimizations that would have shipped otherwise.
- Profile production code regularly. AI-generated code in production right now may already be running far slower than it should. Continuous profiling can surface these issues before they become crises.
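One lightweight version of "measure before you merge" is a budget check that fails CI when a hot function exceeds a time budget. This is a minimal sketch — the function, threshold, and input size are placeholders you'd tune to your own hot path and CI hardware:

```python
import time

def hot_path_function(n: int) -> int:
    # Placeholder for a function on your optimizer's hot path.
    return sum(i * i for i in range(n))

def check_time_budget(fn, budget_seconds: float, *args) -> float:
    """Fail loudly if fn exceeds its performance budget.

    Run this like a unit test in CI: a correct-but-100x-slower
    implementation fails the build instead of shipping silently.
    """
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise AssertionError(
            f"{fn.__name__} took {elapsed:.4f}s, budget {budget_seconds}s"
        )
    return elapsed

# Example check; a generous budget keeps the test stable across runners.
check_time_budget(hot_path_function, 1.0, 100_000)
```

Budget tests are cruder than full benchmarking or continuous profiling, but they catch the worst regressions — the 100x cases — at merge time for almost no setup cost.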