The Root Cause Fallacy


The phrase “root cause” implies a single point to fix. Somewhere you can fire the mythical silver bullet and solve all problems.

The trouble isn’t that root cause analysis (RCA) gives you wrong answers. It gives you an answer, and that’s even worse, because once you’ve got an answer you stop looking for any others.

RCA doesn’t fail because people do it badly. It fails because it plays directly into the ways humans already get things wrong.

Premature convergence. Techniques like 5 Whys perform a depth-first search that stops at the first leaf node. You get a cause. You miss the contributory factors sitting in every other branch you didn’t walk down. The depth feels like rigour, but it’s actually just tunnel vision justified with a methodology.
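The search analogy can be made literal. Here is a minimal sketch, with an invented causal tree for a hypothetical outage (the names are illustrative, not a real incident): 5 Whys follows the first "why" at each step and returns one leaf, while a full walk surfaces every contributing factor.

```python
# Hypothetical causal tree: each cause maps to its contributing causes.
causes = {
    "outage": ["bad deploy", "alert ignored"],
    "bad deploy": ["no canary"],
    "no canary": [],
    "alert ignored": ["on-call overloaded"],
    "on-call overloaded": [],
}

def five_whys(tree, node):
    """Depth-first, first branch only: stops at the first leaf it reaches."""
    children = tree.get(node, [])
    if not children:
        return node
    return five_whys(tree, children[0])  # every other branch is ignored

def contributing_factors(tree, node):
    """Walk every branch and collect all leaf causes."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children
            for leaf in contributing_factors(tree, child)]

print(five_whys(causes, "outage"))             # one "root cause"
print(contributing_factors(causes, "outage"))  # everything that fed in
```

The first function produces a single, satisfying answer ("no canary") and never visits the overloaded on-call branch at all. That is the tunnel vision in four lines of code.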

False dichotomy. Software is full of these. “It’s a people problem” versus “it’s a process problem.” The instinct is always to pick one. But capable people in a bad process look incompetent, and a great process with misaligned people generates beautifully efficient but wrong outputs. The failure lives in the fit between the two, not in either one. Same pattern everywhere: testing versus shipping, tech debt versus roadmap, speed versus direction. Root cause thinking doesn’t just oversimplify, it polarises. People end up arguing about which cause is the cause rather than mapping how they feed each other.

Narrative satisficing. Humans are story-completing machines. So are LLMs, for what it’s worth. Once we have a coherent causal narrative, we stop looking. It’s not that we’re lazy, it’s that a good story is indistinguishable from a good explanation. RCA exploits this tendency rather than guarding against it.

Blame a part, not the relationship. It’s always easier to point at a component than at the coupling between components. “The deploy caused the outage” is a root cause that teaches you nothing. The deploy, the missing canary, the alert that fired but was ignored, and the on-call engineer who was already handling three other things all came before the outage, and fixing any one of them could have stopped it. But it’s harder to name the interaction between systems in a JIRA ticket, so we file a ticket against a thing instead.

The result is almost always the same: a concrete action item (seemingly always adding another step to a process) without any consideration of the system as a whole. You get the satisfying feeling of having fixed something. Whether it’s the right something is another question entirely.

  • Think in contributing factors, not root causes. This isn’t just semantic. Language shapes inquiry. “Root cause” asks you to converge. “Contributing factors” asks you to keep looking.

  • Draw causal loop diagrams instead of causal chains. Tech debt causes slowness. Slowness creates pressure. Pressure creates more tech debt. If your diagram has no loops, you probably haven’t looked hard enough. Chains are comforting because they have endpoints. Loops are uncomfortable because they don’t. That discomfort is the point!

  • Ask “what conditions made this likely?” rather than “what caused this?” This reframes the inquiry around the state of the system, rather than a hunt for a silver bullet.

  • Borrow from safety science. James Reason’s model is worth internalising: failures happen when multiple gaps align simultaneously, not when one thing goes wrong. Every system has holes (incomplete tests, ambiguous runbooks, technical debt, etc). Individually, none of them are “the cause.” The incident happens when enough of them line up at the same time. Patching one hole feels productive. Understanding why so many holes were open at once is where the actual learning lives.
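The alignment idea from Reason’s model can be sketched numerically. In this toy simulation (the layer names and probabilities are invented for illustration), each defensive layer independently fails to catch a given change with some small probability, and an incident only occurs when every layer’s hole lines up at once.

```python
import random

random.seed(42)

# Hypothetical defensive layers and the chance each one fails to catch
# a problem on a given change. Numbers are illustrative only.
layers = {
    "tests": 0.10,
    "canary": 0.20,
    "alerting": 0.15,
    "on-call capacity": 0.30,
}

def incident(layer_probs):
    """An incident occurs only when every layer's hole aligns."""
    return all(random.random() < p for p in layer_probs.values())

trials = 100_000
incidents = sum(incident(layers) for _ in range(trials))
print(f"incident rate: {incidents / trials:.4%}")
# Analytically: 0.10 * 0.20 * 0.15 * 0.30 = 0.09% of changes
```

Patching any single hole shrinks the product, which is why it feels productive. But the more useful observation is that four layers were each leaky at the same time, and no one of them shows up as “the cause.”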
