Your Code Was Never Deterministic


Paul Bernard

In recent articles I have drawn an analogy between AI code generation pipelines and compilers: specifically, pipelines that convert intent into working software. My thesis has been that our industry, software engineering, is undergoing a major evolutionary step, not dissimilar from when we built compilers to free ourselves from writing low-level machine code, enabling engineers to focus on higher-level concepts far more productively.

One counter-argument has been that AI, particularly LLMs, is not deterministic, and that the analogy is therefore not a useful one. Here is one such article. I keep seeing the same exchange play out, so I felt the need to address this counter-argument, since it is so widely held.

Someone says we should stop reviewing AI-generated code line by line and instead verify behavior. Check the tests. Check the contracts. Validate outputs. Move on.

Someone else responds with what sounds like a mic drop: “LLMs aren’t compilers. Compilers are deterministic. You can’t treat probabilistic generation like deterministic translation.”

That sounds convincing to many. It sounds precise. It sounds technical. It sounds safe.

But the more you sit with it, the more it becomes obvious that the argument is anchoring on the wrong part of the system.

Let me ground this in something concrete.

A few years ago, I think it was 2021, maybe late 2020, I was working on a reporting system. It shipped with a production defect that passed code review, passed CI, and passed integration testing. It had nothing to do with the compiler. The compiler behaved perfectly: same output every time. The issue was that the team encoded a business assumption about a particular calculation. That assumption turned out to be wrong, or it changed, or maybe it was just a bad assumption and the team heard what they wanted to hear. I honestly still don’t know which, and it doesn’t really matter. It was quickly corrected and we moved on.

The point is: the system behaved deterministically. It failed deterministically. It failed every time, in exactly the same way, across every environment. The determinism didn’t help us.

That experience isn’t unusual. It’s normal. Most production defects are like that. They’re not the result of compilers rolling dice. They’re the result of humans translating incomplete intent into code, and then verifying only parts of that translation. The compiler is just faithfully reproducing whatever mistakes you gave it, with absolute consistency. That consistency is useful for debugging. It is not protective.

So when someone says we can’t trust LLM output because it’s non-deterministic, I find myself asking: what exactly were we trusting before?

Were we trusting our test coverage? In most systems I’ve worked on, meaningful coverage hovered somewhere between aspirational and fictional. Were we trusting our specifications? I’ve inherited systems where the “spec” was a Confluence page last edited three years ago by someone who’d left the company. Were we trusting code review? I trust code review more than most things, honestly, but it’s still human. Humans miss stuff. I’ve approved PRs that I definitely should have pushed back on. Everyone has. Such decisions get made because commercial software is built in a risk-managed manner.

The truth is that software reliability has always come from layered verification. Tests. Monitoring. Contracts. Runtime safeguards. Gradual rollout. Observability. And a disturbing amount of tribal knowledge carried around in people’s heads that everyone pretends is documented somewhere.

The compiler is one tool in that chain. It’s the most mechanically precise one. But it’s not the reason the system works. It never was.

This is why the compiler analogy actually holds up better than its critics think, if you flip it around.

We stopped reading assembly decades ago. Not because compilers were flawless; early C compilers had real code-generation bugs. I know. I’ve been around long enough to have seen that first hand. People debugged compiler output for legitimate reasons. Dennis Ritchie himself had to work around codegen issues in the PDP-11 C compiler. We stopped reading assembly because we moved our verification effort up a layer. We built tests around C-level behavior. We built lint tools and static analyzers for C-level patterns. Assembly became an implementation detail that we validated indirectly through the behavior of the layer above it.

That wasn’t a story about determinism. It was a story about abstraction.

And, though this is going to sound like I’m reaching, bear with me: we’re in the middle of the same kind of shift right now. Today you can describe desired behavior in natural language, structured acceptance criteria, property-based definitions, or executable specs. An LLM turns that into an implementation. The implementation may vary from run to run. The specification doesn’t have to.

If your verification infrastructure is anchored at the specification layer, variability in implementation stops being scary. It’s just an internal detail. Like register allocation. Like branch prediction. Like all the other stuff we stopped caring about when we moved up.
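To make the "spec is stable, implementation varies" point concrete, here is a minimal sketch of a specification expressed as executable properties. Everything here is hypothetical illustration (the `satisfies_spec` helper and both implementations are mine, not from any real tool): the idea is simply that two entirely different implementations, say two separate LLM generations, can be accepted by the same unchanged spec.

```python
# A hypothetical spec for a sorting routine, expressed as properties
# (invariants) rather than as a reference to one blessed implementation.

def satisfies_spec(sort_fn, samples):
    """Check that sort_fn meets the behavioral contract on the samples."""
    for xs in samples:
        out = sort_fn(list(xs))
        # Invariant 1: output is ordered.
        assert all(a <= b for a, b in zip(out, out[1:]))
        # Invariant 2: output is a permutation of the input.
        assert sorted(xs) == sorted(out)
    return True

# Two different "implementations" -- think of them as two separate
# generations of the same spec.
impl_a = sorted                       # built-in Timsort

def impl_b(xs):                       # naive insertion sort
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

samples = [[], [3, 1, 2], [5, 5, 1], list(range(10, 0, -1))]
assert satisfies_spec(impl_a, samples)
assert satisfies_spec(impl_b, samples)   # variability is an internal detail
```

The spec never mentions which algorithm is used, which is exactly the point: it is anchored one layer above the implementation.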

Now, the immediate rebuttal I always get is: “But you still have to review the test code, and test code is still code, so you haven’t gained anything.” I’ve seen this in maybe a dozen threads at this point and every time it frustrates me a little more, because it’s treating the current primitive state of AI-assisted testing as the permanent endpoint.

Yes, if your workflow is “ask the LLM to generate tests, then hand-review every test the way you’d hand-review production code,” you’re just shuffling the problem around. That’s true and also not what anyone serious is proposing.

Imagine a different workflow. You define behavioral contracts and invariants. You define edge cases. You define performance boundaries. The model generates an implementation. Automated verification runs. Fails. The model revises. Verification runs again. That loop can execute dozens of times in seconds. The human isn’t scanning diff hunks trying to spot a subtle off-by-one error buried on line 347. The human is asking whether the specification actually captures intent. That’s a fundamentally different cognitive task, and it’s one humans are actually good at.
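That loop can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not any vendor's API: `generate` stands in for an LLM call, and `verify` stands in for whatever contract/invariant suite you have anchored at the spec layer. All names here (`verify`, `refine_until_green`) are hypothetical.

```python
# Sketch of a spec-anchored generate/verify/revise loop.

def verify(impl, cases):
    """Run the behavioral contract; return a list of failures."""
    failures = []
    for args, expected in cases:
        try:
            got = impl(*args)
            if got != expected:
                failures.append((args, expected, got))
        except Exception as exc:
            failures.append((args, expected, repr(exc)))
    return failures

def refine_until_green(generate, cases, max_attempts=20):
    """Call generate(feedback) until the contract passes or budget runs out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        impl = generate(feedback)        # in practice: an LLM call
        failures = verify(impl, cases)
        if not failures:
            return impl, attempt
        feedback = failures              # failures feed the next generation
    raise RuntimeError("spec not satisfied within attempt budget")

# Toy stand-in generator: the first attempt is buggy, the second is correct.
attempts = iter([lambda a, b: a - b, lambda a, b: a + b])
impl, n = refine_until_green(lambda fb: next(attempts),
                             cases=[((2, 3), 5), ((0, 0), 0)])
assert n == 2 and impl(4, 4) == 8
```

The human's job in this picture is to get `cases` (the contract) right; the retry loop itself runs at machine speed.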

Before someone says “that’s just TDD”: yes, it shares DNA with TDD, sure. But in TDD the human writes the failing test and then manually implements until it passes. The iteration speed is bounded by human cognition and typing speed. When a model closes that loop, the feedback cycle collapses by orders of magnitude. The model can attempt, fail, and retry at machine speed. That’s not a minor efficiency gain. That’s a qualitative change in how the build-test-fix cycle works. We never had a compiler that could see its own errors and try again. Now we sort of do.

And this also addresses the architectural coherence thing that people bring up: the argument that models can’t see the whole system, so they’ll duplicate logic, violate DRY, create inconsistent patterns. Which, fine, that’s often true today. But we never asked gcc to understand our service boundaries either. The compiler works on translation units. It has no idea what your dependency graph looks like. Architectural coherence comes from module systems, interface contracts, integration tests, and organizational discipline. Those mechanisms don’t disappear when the implementation is generated instead of hand-written. You modularize specifications. You define boundaries. You validate integration points. It’s the same discipline at a different layer.
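As a small illustration of what "modularize specifications and define boundaries" can look like in practice, here is a sketch using Python's `typing.Protocol`. The names (`RateStore`, `contract_check`, `InMemoryRates`) are hypothetical; the point is that the boundary contract is stated once and any implementation, hand-written or generated, is validated against it at the integration point.

```python
from typing import Protocol

class RateStore(Protocol):
    """Boundary contract: any implementation of this module must honor it."""
    def get_rate(self, currency: str) -> float: ...

def contract_check(store: RateStore) -> None:
    """Integration-point invariants, independent of the implementation."""
    rate = store.get_rate("USD")
    assert isinstance(rate, float) and rate > 0

class InMemoryRates:
    """One possible implementation -- could just as well be generated."""
    def __init__(self, rates):
        self._rates = dict(rates)

    def get_rate(self, currency: str) -> float:
        return self._rates[currency]

contract_check(InMemoryRates({"USD": 1.0, "EUR": 0.9}))
```

Static checkers like mypy verify the structural fit; the `contract_check` invariants verify runtime behavior at the boundary.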

I want to be honest about something, though. Right now, a lot of this feels aspirational.

Using AI to generate code today often feels like supervising a very fast intern who never sleeps and sometimes forgets what you told them five minutes ago. You generate. You review. You fix. You review again. The volume of output makes line-by-line inspection genuinely miserable. I don’t think people who complain about AI fatigue are wrong about the experience. They’re describing something real.

But I think the fatigue is a transitional artifact, not a permanent condition. It’s what happens when you’re caught between paradigms: using a new tool with old workflows, trying to apply line-level review discipline to a process that’s screaming for behavior-level verification instead. That’s partly a self-inflicted problem.

In the early days of high-level languages, there were engineers who genuinely distrusted them. They’d been writing assembly for years. They could see what was happening at the machine level. Compiler output looked opaque and wasteful. It felt irresponsible not to check it. And you know what, they weren’t crazy. Early compilers really did produce worse code than a skilled assembly programmer in many cases.

The industry didn’t solve that by getting better at reading assembly. It solved it by investing in better compilers, better type systems, better testing, and better tooling at the higher abstraction layer. The verification infrastructure moved up, and eventually the generated assembly stopped mattering.

I don’t have a neat conclusion for this. I think we’re early enough in this shift that a lot of the specifics are still blurry. The verification tooling for intent-based development is maturing quickly but is still very new. The workflows don’t have years’ worth of production experience under their belt. There’s a lot that requires attention.

But the argument about determinism is a dead end. It was always a dead end. Your software was never reliable because of deterministic compilation. It was reliable, to the extent it ever was, which honestly wasn’t as much as any of us like to pretend, because of the verification discipline wrapped around the whole messy pipeline from someone’s half-articulated idea to running production code.

That discipline still matters. What’s changing is the layer where we apply it. Whether that transition goes well depends on whether we build the right infrastructure around intent and behavior, or whether we spend the next decade squinting at AI-generated diffs and wondering why it never stops being exhausting.

It is for these reasons that the safe money, widely reflected in the massive spending by the likes of Microsoft, AWS, Google, Anthropic, OpenAI, and many others, is focused on maturing these higher-level pipelines that work at the intent level rather than the code level. Spec-driven development, or if you prefer “intent-based development,” is here to stay. It will get better, and engineers will be preoccupied with making the required pipelines better and better for some time to come, building on the billions of dollars being invested in this space by some of the biggest technology players on earth.