You Don’t Review Your Compiler Output

Someone on LinkedIn posted that every line of AI-generated code should be reviewed with the same rigor you’d apply to a junior developer’s pull request. The reply that stopped the thread was six words: “Do you review your compiler output?”

It’s a provocative analogy, and like all provocative analogies, it’s both illuminating and incomplete. But the argument it surfaces isn’t new. It has a name, a well-studied mechanism, and a history of playing out the same way every time. Understanding that mechanism — not just the precedent — tells you where the current transition is heading and, more importantly, what to do about it.

The pattern has a name

In 1997, Clayton Christensen published The Innovator’s Dilemma and described a mechanism that has since been observed across dozens of industries. Incumbents improve their products along dimensions their best customers value. They overshoot — the product eventually exceeds what most of the market actually needs. A disruptor enters from below: worse on the incumbent’s metrics, but cheaper, simpler, and accessible to people who were never going to buy the incumbent’s product at all. The disruptor improves. By the time it’s “good enough” on the old dimensions, it has already won on the new ones.

This pattern has played out inside computing itself at every major abstraction transition — not between companies, but between paradigms.

In 1954, John Backus proposed FORTRAN to IBM. The programmers of the era — Backus called them a “priesthood” — were hostile. They viewed the proposal as a “heretical notion” that a mechanical process could do “all the mysterious, inventive things they did to produce an efficient program.” Backus later recalled that they were “really opposed to those few mad revolutionaries that wanted to make programming easy enough so that everyone could do it.”

Their objections were not imaginary. Earlier automatic coding systems produced programs five to ten times slower than hand-coded assembly. Memory was measured in kilowords. The priesthood had a legitimate performance argument. But notice what the argument was really about: they were measuring quality on the dimension they had mastered — instruction-level efficiency — while the market was moving toward a different dimension entirely: could scientists who would never learn assembly write programs that worked?

It took Backus and his team three years to build FORTRAN. When it shipped in April 1957, the efficiency surprised even its creators. At a 1976 retrospective, Turing Award winner Richard Hamming described what it took to get there: “The opposition to FORTRAN and any automatic coding system seems to me very, very high. And the courage he had to persist I think should be recognized.” The room broke into spontaneous applause. Backus cut in: “It didn’t take any courage. It was a lot of fun.”

The same pattern recurred with each successive abstraction. When Java brought garbage collection into the mainstream in the mid-1990s, C++ programmers measured the ten-to-twenty-times performance penalty and declared it unfit for serious work. Bjarne Stroustrup quipped that “C++ is my favorite garbage collected language because it generates so little garbage.” They were right about the performance cost. Java won anyway, because it eliminated entire categories of bugs and made enterprise development feasible for a much larger population of programmers. It won on a dimension the incumbents weren’t measuring.

When containers and cloud infrastructure abstracted away physical servers, veteran sysadmins who knew their machines by name pushed back on the complexity and overhead. This was the “pets vs. cattle” shift that Randy Bias popularized in 2012. David Heinemeier Hansson repatriated Basecamp off AWS entirely in 2023, arguing that “the savings promised in reduced complexity never materialized.” Again: legitimate objections, measured on the incumbent’s scale.

Joel Spolsky gave this recurring dynamic its most honest articulation in 2002: “All non-trivial abstractions, to some degree, are leaky.” Abstractions save us time working, he observed, but they don’t save us time learning. The resisters at each transition were not fools. They were engineers who understood, correctly, that something was being lost. What they underestimated was how much more was being gained — and, critically, they were measuring the loss on their scale while the market was measuring the gain on a different one.

The overshoot

Here is where the Christensen lens reveals something the historical pattern alone doesn’t.

The studies that measure AI code quality keep finding real problems. A CodeRabbit analysis of 470 open-source pull requests found that AI-generated code produces 1.7 times more issues than human-written code: more logic errors, more maintainability problems, more security findings. The Stack Overflow 2025 survey found that 66% of developers cite “AI solutions that are almost right, but not quite” as their biggest frustration, and trust in AI tool outputs dropped from 40% to 29% in a single year. The METR randomized controlled trial (sixteen experienced developers, 246 real tasks, 143 hours of screen recordings) found that experienced developers were 19% slower with AI tools despite predicting they’d be 24% faster.

This is real. It would be dishonest to dismiss it.

But ask what’s being measured. Code elegance. Maintainability scores. Defect density. Architectural soundness. Stylistic consistency. These are the dimensions the profession values — the dimensions it has spent decades mastering. They are the incumbent’s metrics. And Christensen’s central insight is that incumbents always measure on their own scale, and the disruption always comes on a different one.

The market’s dimension is simpler: does this software solve the problem the person actually has?

The data on that question tells a different story. The Standish Group found that only 20% of software features are used “always” or “often” — 45% are never used at all. Pendo’s analysis of 615 SaaS products found 80% of features rarely or never used, with a median feature adoption rate of 6.4%. They estimated $29.5 billion in cloud R&D investment wasted on features nobody touches. Christensen himself pointed to the canonical example: “Who has ever used all of the features in Microsoft’s Word?” His observation about software specifically: “Microsoft has way overshot what the customer needs.”

The profession builds software optimized for a competitive feature checklist and measured against craft-community standards. The market needs software that solves a specific problem, reliably enough, soon enough, cheaply enough. Those two scales were roughly aligned when building software was expensive — you couldn’t afford to build the wrong thing, so careful engineering was the rational investment. Now that building is cheap, the scales are decoupling. You can build something that’s “worse” on the craft scale and “better” on the problem-solved scale, because it exists, it works for the use case at hand, and it was available Tuesday instead of next quarter. This is precisely the dynamic Christensen described — and it is already being operationalized in the market. Chamath Palihapitiya’s 8090 is built on the explicit thesis that AI can deliver 80% of an enterprise application’s value at 90% less cost, replacing vendor relationships that cost millions per year.

Reframing the quality data through this lens changes the interpretation. The METR study found experienced developers 19% slower — but the study’s own authors identified that 56% of participants were new to Cursor, the primary tool used, and that the AI’s context windows couldn’t handle the million-line codebases involved. This is early-stage disruption: the disruptor is immature, just as FORTRAN’s first compiler was immature. It tells you nothing about where the trajectory ends. Google’s 2025 DORA report found that AI adoption now correlates with higher delivery throughput but lower delivery stability — more code ships, more things break. Their conclusion: AI is a “mirror and multiplier.” Teams with strong testing and deployment infrastructure get stronger. Teams without it get worse. The quality problem is in the engineering practices, not in the AI — which is exactly what Christensen’s model predicts. The disruption rewards organizations that adapt their infrastructure to the new paradigm and punishes those that try to absorb new-paradigm output with old-paradigm processes.

The honest gap

None of this means the compiler analogy is complete.

A compiler transforms deterministically. Given the same input, it produces the same output, and that output conforms to a formal specification. When you don’t review your compiler output, you’re not exercising faith. You’re relying on that specification and on decades of accumulated testing, validation, and production use.

AI code generation is probabilistic. Given the same prompt, it may produce different output each time. There is no formal specification it conforms to. The gap is real, and acknowledging it is essential for any honest argument about this transition.
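The contrast can be made concrete with a toy sketch. It is illustrative only: Python’s built-in compile() stands in for a compiler, and toy_generator with its CANDIDATES list is a hypothetical stand-in for a sampling model, not any real AI API.

```python
import random

SOURCE = "total = sum(range(10))"

# Hypothetical outputs a sampling model might return for one prompt.
# The texts differ, but each one leaves the same value in `total`.
CANDIDATES = [
    "total = sum(range(10))",
    "total = 0\nfor i in range(10):\n    total += i",
    "total = (9 * 10) // 2",
]

def compile_bytes(source: str) -> bytes:
    """Compile with Python's built-in compiler and return the raw bytecode."""
    return compile(source, "<example>", "exec").co_code

def toy_generator(prompt: str, rng: random.Random) -> str:
    """Assumed stand-in for a model: picks one candidate at random."""
    return rng.choice(CANDIDATES)

def run(source: str) -> int:
    """Execute a snippet and return the value it assigns to `total`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"]

# Deterministic: the same source compiles to the same bytecode every time.
compiled_outputs = {compile_bytes(SOURCE) for _ in range(20)}
print(len(compiled_outputs))  # 1

# Probabilistic: the same prompt can yield different text on each call.
rng = random.Random()
generated_texts = {
    toy_generator("sum the integers below 10", rng) for _ in range(20)
}
print(len(generated_texts))  # between 1 and 3, varies from run to run

# The text varies, yet the observable behavior can still be checked.
print({run(text) for text in generated_texts})  # {45}
```

The last line hints at how the gap gets managed in practice: when the artifact is not reproducible, the thing left to pin down is what it does.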

But notice how practitioners are already managing that gap. A study from UC San Diego and Cornell, aptly titled “Professional Software Developers Don’t Vibe, They Control,” observed thirteen experienced developers working with AI coding agents. Nine of the thirteen reviewed every change, but they made this tractable by working in very small increments, an average of 2.1 steps at a time from plans that could contain seventy or more. The rest skipped code review entirely and monitored program behavior instead, rejecting suspicious outputs and debugging when things went wrong.

Neither approach is the exhaustive line-by-line audit that the LinkedIn commenter demanded. Both represent expertise deployed strategically — judgment about what to verify and how, rather than inspection of every artifact. This is recognizable as the same adaptation that occurred at every previous abstraction transition: engineers letting go of reading the output and investing in verifying the behavior.
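As a minimal sketch of what “verifying the behavior” can look like, the check below treats a sorting routine as opaque and judges it only by its outputs. The function names and the two candidates are hypothetical, chosen for illustration; nothing here is taken from the study.

```python
import random
from collections import Counter
from typing import Callable, List

def behaves_like_sort(candidate: Callable[[List[int]], List[int]],
                      trials: int = 200, seed: int = 0) -> bool:
    """Judge a sorting routine by its behavior, without reading its source."""
    rng = random.Random(seed)
    # A few fixed edge cases, then randomized inputs.
    cases = [[], [1], [3, 1, 3]]
    cases += [
        [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        for _ in range(trials)
    ]
    for data in cases:
        result = candidate(list(data))  # pass a copy so `data` stays intact
        in_order = all(a <= b for a, b in zip(result, result[1:]))
        same_items = Counter(result) == Counter(data)
        if not (in_order and same_items):
            return False
    return True

# Two hypothetical candidates, e.g. two different agent outputs.
def candidate_ok(xs: List[int]) -> List[int]:
    return sorted(xs)

def candidate_drops_duplicates(xs: List[int]) -> List[int]:
    return sorted(set(xs))  # plausible at a glance, but loses repeated values

print(behaves_like_sort(candidate_ok))                # True
print(behaves_like_sort(candidate_drops_duplicates))  # False
```

The second candidate is the kind of “almost right” output that is easy to miss when skimming a diff and hard to miss once the expected behavior is written down as a check.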

A colleague — another veteran, about my vintage — complained about the variable names his AI agent was generating. Too long, too verbose. He hated typing them. “You shouldn’t ever have to type them,” I told him. “And you WANT them long. Verbose, semantically rich identifiers give the agent more context about meaning and intent. They’re better prompts.” The objection that variable names should be short is an artifact of the era when code was something you physically typed. In the emerging paradigm, code is a communication medium between human and agent. Optimizing for human typing speed is measuring on the wrong scale — the same error, in miniature, as measuring AI code against hand-written craft standards when the market is measuring whether the problem got solved.

Where this is heading

In 1990, Richard Gabriel published an essay called “Worse Is Better.” His argument: Unix and C won not by being theoretically superior to their competitors, but by being simple enough to spread. The “right thing” — more correct, more complete, more elegant — took longer to build and ran only on sophisticated hardware. The “worse” alternative shipped, spread, and improved incrementally. Gabriel called Unix and C “the ultimate computer viruses.”

AI-generated code is following the same adoption curve. It is worse by the craft’s standards. It is spreading faster than the craft can object. And it is improving incrementally while the profession debates whether to let it through the gate.

And there is a deeper shift underneath the disruption that is worth watching, though its full implications are still developing. If building software costs nearly nothing, then software doesn’t need to serve millions of users to justify its existence. It can be personal — built by the person who has the problem, for that specific problem, used until it’s solved, then discarded. The product manager from the first post who built her own dashboard wasn’t producing inferior software by craft standards. She was producing exactly the right software by the only standard that mattered to her: it solved her problem. When building is free and the builder is the user, the market’s quality metric and the user’s quality metric converge completely. The implications of software that personal and that disposable are a subject I’ll return to.

For now, the immediate question is more bounded. The quality the profession defends is real. The market it serves may be shrinking. The compiler analogy has a genuine gap — no formal guarantees, no deterministic correctness. Line-by-line review of AI output doesn’t scale. So what fills that gap? How do you verify that the problem is solved without reading every line of how the machine solved it?

The answer turns out to be the oldest idea in software engineering — one the profession always knew was right but could never quite afford to practice. That’s the subject of the next post.

This is Part 2 of a three-part series on agentic coding and the future of software engineering. Part 1: Agentic Coding Makes Old Coders Young and Young Coders Old

References

Richard Gabriel — “Worse Is Better”, 1990

Clayton Christensen — The Innovator’s Dilemma (1997) and Competing Against Luck (2016); the core disruption mechanism and jobs-to-be-done framework

John Backus and the FORTRAN story — drawn from The New Stack, Wikipedia, and Richard Hamming’s remarks at the 1976 ACM retrospective

Bjarne Stroustrup on garbage collection — InformIT interview and collected quotes

“Pets vs. Cattle” — Randy Bias, Cloudscaling, 2012

David Heinemeier Hansson — “We Have Left the Cloud”, 2023

Joel Spolsky — “The Law of Leaky Abstractions”, 2002

Standish Group — feature usage data, 2002 (via Mountain Goat Software)

Pendo — 2019 Feature Adoption Report

Chamath Palihapitiya — 8090, AI-driven enterprise software disruption

CodeRabbit — “State of AI vs Human Code Generation”, 2025

Stack Overflow — 2025 Developer Survey, AI section

METR — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”, 2025

Google — 2025 DORA Report

Huang et al. — “Professional Software Developers Don’t Vibe, They Control”, arXiv, 2025