Your Code Review Process Is Theatre


You open the PR. You blanch at the 10,000-line diff. You leave a fussy comment, so everyone knows you’re Paying Attention. You click Approve; the green checkmark appears. Everything keeps trundling along.

The ritual is complete.

The phrase “code review” flatters the practice. In most organizations it means little more than ritualized exposure to a textual delta followed by institutional assent. The green checkmark is not judgment. It is a surrogate for judgment.

How We Ended Up Here

We shape our tools, and thereafter our tools shape us. — John Culkin, after Marshall McLuhan

Start with diff. Bell Labs described it in 1976 with brutal clarity: it reports file differences as “a minimal list of line changes” required to bring one file into agreement with another.1 That is a beautiful description of what diff is, and an equally beautiful description of what it is not. It is a textual change description. It is not a semantic model of program behaviour.

Then patch: software whose own manual defines it as a tool to “apply a diff file to an original.”2 Again: a precise solution to a real problem. If code changes are moving around as textual deltas, you need a reliable way to apply textual deltas.
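The division of labour between the two tools fits in a few lines of shell (file names here are illustrative):

```shell
# Two versions of a file; diff reports the minimal list of line
# changes needed to bring the first into agreement with the second.
dir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$dir/old.txt"
printf 'alpha\nBETA\ngamma\n' > "$dir/new.txt"

# diff exits 1 when the files differ, so tolerate that status.
diff "$dir/old.txt" "$dir/new.txt" > "$dir/change.patch" || true
cat "$dir/change.patch"   # the classic delta: 2c2 / < beta / > BETA

# patch applies that textual delta to the original.
patch -s "$dir/old.txt" "$dir/change.patch"
cmp -s "$dir/old.txt" "$dir/new.txt" && echo "files now agree"
```

Note what the delta carries: which lines changed, and nothing about what the change means.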

Then the world of emailed patches, maintainers, and upstreams. Git still carries this lineage in its own git request-pull command, defined as a way to “generate a request asking your upstream project to pull changes into their tree.”3 The Linux kernel documentation still describes pull requests in exactly this idiom: signed tags, diffstats, shortlogs, and an emailed [GIT PULL] request to the maintainer.4 In that world the patch is not just a code artifact. It is a trust boundary. A maintainer reviewing a patch from a stranger is doing gatekeeping in the literal sense. The review is adversarial because it has to be.
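That emailed-patch lineage is still runnable today. A minimal sketch, using a throwaway repository that serves as its own "upstream" URL purely for illustration:

```shell
# Throwaway repository: one commit of "upstream" history,
# then one commit of new work to be pulled.
dir=$(mktemp -d) && cd "$dir"
git init -q .
git config user.email dev@example.com
git config user.name  Dev
echo one > feature.c
git add feature.c && git commit -qm "base"
git tag start                  # the point upstream already has
echo two >> feature.c
git commit -qam "feature work"

# Draft the kernel-style [GIT PULL] mail body: where to fetch
# from, a shortlog, and a diffstat against the upstream point.
git request-pull start "file://$dir" HEAD
```

The output is the body of the mail a maintainer would receive: a fetch location, a summary of commits, and a diffstat – the trust-boundary artifact that the web pull request later wrapped in a button.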

Then GitHub takes this model, wraps it in a web interface, and normalizes it for everyone. Its own 2010 announcement describes pull requests as “our take on code review.”5 A workflow shaped by text transport and stranger-trust negotiation becomes the default interface for every software team, including small internal teams working in a single repository on a single product.

Every step in that chain solved a real problem. diff solved comparison. patch solved application. Pull requests solved repository-boundary gatekeeping. None of those solutions was universal. A workflow built for textual transport and trust boundaries got promoted into a general theory of engineering judgment. That was the category mistake.

This is path dependence in action. Software process does not converge on perfection. What survives is what met the constraints of its moment. We mistook survival for vindication, and then vindication for rigour. Historically contingent artifacts became load-bearing truths.

The Approval Is the Product

The purpose of a system is what it does. — Stafford Beer

If you want to understand what a process is for, ignore what people say about it and look at what it reliably produces.

Your PR workflow does not reliably produce understanding. It does not reliably produce defect detection. It does not reliably produce architectural coherence. What it reliably produces is an approval record.

That approval record is useful to managers, auditors, procurement questionnaires, compliance frameworks, and engineering dashboards. It is much less useful to the people deciding whether a change is safe to run in production.

The green checkmark is legible. That is the point. Scott’s phrase was that modern power tries to “make a society legible.”6 Organizations do the same thing to engineering work. The checkmark is easy to count, easy to require, easy to display in a control report, and easy to sell as rigour. It is much harder to count whether anyone actually understood the change. So the surrogate displaces the judgment it was supposed to stand in for.

This is why so many teams drift toward the same equilibrium:

  • Thorough review is expensive.
  • Superficial review is cheap.
  • Blocking a PR creates friction.
  • Approving a PR clears the queue.
  • Almost nobody measures review quality directly.
  • Almost everybody measures whether review happened.

Under those conditions, rubber-stamping is not a moral failure. It is the rational response to a badly designed system.

People are not stupid. They know the difference between a process that generates confidence and one that generates an institutional record. When you force them through the second kind while describing it as the first, they adapt accordingly. They optimize for the visible output. They leave a comment, skim the diff, click Approve, and move on to work that might actually change the system’s behaviour.

The artifact is legible. Legibility does real work – but it can’t do everything. The quality signal lives elsewhere, in the messy undergrowth we keep burning down to pursue legible rows of neat little checkmarks.

The Diff Is the Wrong Interface

A diff is precise about text and impoverished about meaning. That is not a flaw in diff. It is what diff is for. But that very parsimony is why it is the wrong encoding for many of the claims we ask it to carry.

But the process does not distinguish. It has settled on universality: everything becomes a PR, every PR needs approval, and the same ceremony is applied whether the change is a copy fix that needs an editor’s eye, an algorithmic improvement that needs domain expertise, a large refactoring that needs architectural vision, or a thousand-line machine-generated patch that needs – what, exactly? Nobody knows. The diff looks the same for all of them because textual variance is the only thing it can represent with any fidelity.

The historical mismatch makes this more than absurd; it makes it dangerous. An adversarial patch-review workflow makes sense when a maintainer is judging a patch from a stranger across a trust boundary. Universalizing that workflow for trusted internal teams was a category error. For engineers who know each other’s work and share a codebase, it is not fit for purpose. For genuinely risky changes, it is not merely wrong; it is actively dangerous. Too ceremonial to be efficient, too shallow to be reliable. A better reviewer staring harder at the wrong artifact is still staring at the wrong artifact.

What This Process Crowds Out

Defenders of the status quo usually point to the secondary benefits of review: knowledge sharing, mentoring, shared ownership, architectural consistency. Those benefits are real; they are important; yet every one of them is better delivered by a mechanism actually designed to produce it, rather than by harvesting whatever scraps of them survive GitHub Theatre.

What the process is actually doing – quietly, year after year – is destroying taste. Taste is an annoying word, which is one reason I like it. It offends the proceduralists. Good. What I mean by it is not aesthetic fussiness. I mean the earned capacity to look at a system and know that something is wrong before you can name the rule it violates: that the invariant is missing, that the abstraction will rot, that the tests are passing for the wrong reasons. It cannot be reduced to a checklist. It is cultivated slowly by people who have spent years building and breaking real systems.

Those people are not alienated because the work is hard. They are alienated because the work has been replaced by theatre – and the theatre is metastasizing. It has spread from pull requests into planning, into estimation, into standups, into every corner of the practice. The message is consistent: your judgment is not trusted; the checkmark is. Your discernment is not valued; the dashboard is.

Even this bad process remained tenable only because code output was slow enough for the theatre to keep up appearances. Machine-generated code is about to multiply the volume of unsupervised changes by orders of magnitude, and we are actively thinning the layer of human judgment required to supervise it.

Why AI Makes This Collapse Visible

AI did not break code review. It exposed that we never had real review at scale. The artifact did not get thinner. The output got cheaper.

There was a time when diff inspection was one of the best tools available. That time is gone. We now have type systems, static analysis, property tests, contract tests, schema validation, fuzzing, and CI machinery capable of far more mechanical verification than most teams bother to wire together.

Humans are good at intent, tradeoffs, product judgment, domain modelling, risk acceptance, and spotting when the whole shape of a change feels wrong. Machines are good at exhaustive checking, consistency, speed, and repeatability. The standard PR workflow misallocates cognition: it gives humans repetitive verification work machines are better at, then starves the architectural and semantic judgment machines still cannot supply.

If your review policy can be satisfied by somebody glancing at a diff on a phone between meetings, then what you have built is not a quality gate. It is a waiting room.

And here is the part that makes this urgent rather than merely annoying: code generation has arrived at industrial scale. A workflow that already produces thin assurance at human speed will collapse under machine speed. It will drown in output, routinize false confidence, and consume exactly the attention that should have been reserved for actual supervision. The bottleneck is no longer code production. It is supervisory attention.

What Review Should Become

None of this is an argument against review. It is an argument against confusing a historically contingent artifact with the thing itself.

The real question is now open again. If code is cheap and judgment is scarce, what are the right interfaces, institutions, and supervisory practices for allocating human attention over machine-speed output?

That is not a process tweak. It is a research agenda.

The diff is not going away. It will be generated by machines and consumed by machines, and that is fine – that is what it was designed for. What must go is the pretence that it is the natural interface for human judgment. It is not.

Enough With the Checkmark

The tools were good tools. diff solved a real problem. patch solved a real problem. Gatekeeping at the repository boundary solved a real problem. Every link in that chain was a reasonable response to the conditions of the time. The conditions changed; the workflow did not. That is not a story about bad actors. It is path dependence, and we are all stuck in it: the engineers performing confidence they do not feel, the managers measuring compliance instead of confidence, the organizations defending rigour they cannot define.

The mistake was not using these tools. The mistake was universalizing them. We mistook historically contingent artifacts for load-bearing truths.

I’m not pining for the mythical, prelapsarian past, when we all had offices and wrote specs and software worked. Software engineering has always been, at least partially, a shitshow. But the tools are better, the problem statement is clearer, and genuflecting at the green checkmark is beneath us.

I was there when these were expedients. I am here now that they have been mistaken for truths.

Zero-cost code has changed the economics of this craft. The bottleneck is no longer code production. It is supervisory attention. We can no longer afford to spend human judgment manufacturing institutional reassurance.

Code is now cheap. Judgment is not.