The Measurement Problem in Software Engineering

For fifty years, the software industry has tried to measure developer productivity. Every attempt has failed—not because we haven’t found the right metric, but because we kept trying to count things without understanding them.

In 1982, Tom DeMarco wrote what became one of the field’s most quoted maxims: “You can’t control what you can’t measure.” Twenty-seven years later, he retracted it, asking himself three questions: Was the advice correct at the time? Is it still relevant? Do you still believe metrics are a must for successful software development?

“My answers are no, no, and no,” he wrote in IEEE Software. The statement, he admitted, may have “distracted us from the real point of computing.”

The pattern since then is striking: each generation of metrics gets critiqued by its own creators. Lines of code, function points, story points, velocity—all were reasonable responses to the previous approach’s failures, and all eventually revealed the same fundamental problem. We keep measuring what’s easy to count rather than what actually matters.

And now AI has arrived—both accelerating the crisis and, paradoxically, offering the first realistic path through it.

The first attempts counted output directly. IBM and other mainframe-era organizations measured lines of code (LOC), and Barry Boehm’s COCOMO model (1981) formalized the practice by correlating development effort with source lines. The problems became apparent almost immediately. Capers Jones (1986) documented that LOC metrics “make requirements and design invisible” and “penalize high-level languages.” He eventually declared that using LOC for productivity measurement “should be regarded as professional malpractice.”
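
Basic COCOMO makes the assumption explicit: effort is a function of size and nothing else. A minimal sketch of the published formula, using Boehm’s standard coefficients:

```python
# Basic COCOMO (Boehm, 1981): effort estimated purely from size in thousands
# of source lines (KLOC). Coefficients are the standard published values.
COCOMO_MODES = {
    "organic":       (2.4, 1.05),   # small teams, familiar problems
    "semi-detached": (3.0, 1.12),
    "embedded":      (3.6, 1.20),   # tight constraints, complex systems
}

def cocomo_effort(kloc: float, mode: str = "organic") -> float:
    """Estimated effort in person-months: E = a * KLOC^b."""
    a, b = COCOMO_MODES[mode]
    return a * kloc ** b

# A 32 KLOC organic project comes out to roughly 91 person-months, regardless
# of whether those lines are boilerplate or a subtle concurrency fix.
print(round(cocomo_effort(32), 1))
```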

Allan Albrecht’s function point analysis (1979) attempted to fix this by measuring user-visible functionality rather than implementation. It was a genuine conceptual advance—but Albrecht himself observed that function points remained “highly correlated to lines of code.”
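
The mechanics are straightforward: count five kinds of user-visible components and weight each by complexity. A rough sketch using the standard IFPUG average weights (the example counts are hypothetical):

```python
# Unadjusted function point count, using the standard average complexity
# weights from Albrecht/IFPUG. The counts below are illustrative only.
FP_WEIGHTS = {
    "external_inputs": 4,
    "external_outputs": 5,
    "external_inquiries": 4,
    "internal_logical_files": 10,
    "external_interface_files": 7,
}

def unadjusted_function_points(counts: dict[str, int]) -> int:
    """Sum of component counts weighted by average complexity."""
    return sum(FP_WEIGHTS[kind] * n for kind, n in counts.items())

# A hypothetical billing feature: 6 inputs, 4 outputs, 3 inquiries,
# 2 internal files, 1 external interface -> 83 function points.
print(unadjusted_function_points({
    "external_inputs": 6,
    "external_outputs": 4,
    "external_inquiries": 3,
    "internal_logical_files": 2,
    "external_interface_files": 1,
}))
```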

The Agile movement brought story points and velocity. Kent Beck introduced stories as “an antidote to requirements”—deliberately abstract units meant to facilitate conversation, not enable measurement. Ron Jeffries (2019), one of XP’s founders and a probable inventor of story points, now expresses regret: “I may have invented story points, and if I did, I’m sorry now.”

The most sophisticated recent framework is DORA, developed by Nicole Forsgren, Jez Humble, and Gene Kim. Their four metrics—deployment frequency, lead time, change failure rate, and mean time to recovery—emerged from rigorous statistical analysis across thousands of organizations. DORA represents a genuine advance: it measures delivery capability rather than activity volume. But Forsgren herself warns against misuse: DORA measures system flow, not individual productivity, and it captures only one slice of engineering work—one that says nothing about what commits actually accomplish for users.
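
To their credit, the four metrics are precisely defined. A sketch of how they might be computed from deployment and incident logs follows; the record shapes are my own, not part of any DORA tooling:

```python
# A sketch of the four DORA metrics computed from deployment and incident
# logs. Record shapes are hypothetical, and the inputs are assumed non-empty.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # did it degrade service?

@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime

def dora_metrics(deploys: list[Deployment], incidents: list[Incident], window_days: int) -> dict:
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": sum(d.caused_failure for d in deploys) / len(deploys),
        "mean_time_to_restore": sum((i.resolved_at - i.started_at for i in incidents), timedelta())
                                / len(incidents),
    }
```

Notice what never appears: nothing in these records describes what any individual change was for, which is exactly the gap Forsgren points to.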

Three fundamental challenges explain why measurement keeps failing.

The output is non-fungible. A 10-line fix to an authentication vulnerability isn’t “less” than a 500-line UI component—it might be worth considerably more. Fred Brooks identified this in The Mythical Man-Month (1975): the best programmers can be 5-10x more productive than mediocre ones, a variation that simple metrics cannot capture.

Context dominates. Deleting 5,000 lines of technical debt might be the most valuable work done all quarter, but it shows up as negative “productivity” in any volume-based metric. A one-line change to a critical system might require days of careful analysis, while a thousand lines of generated boilerplate takes minutes.

Gaming is trivially easy. Anthropologist Marilyn Strathern simplified Goodhart’s Law to its canonical form: “When a measure becomes a target, it ceases to be a good measure.” Software metrics are particularly vulnerable because the relationship between metric and goal is so loose. Write more code, split work into smaller tickets, inflate estimates, approve pull requests faster—the numbers go up while actual value stays flat.

Robert Austin’s Carnegie Mellon dissertation, Measuring and Managing Performance in Organizations (1996), proved mathematically that measurement-based management becomes dysfunctional when not all critical dimensions are measured. When managers can only observe one of two job dimensions, workers rationally shift effort toward the measured dimension at the expense of unmeasured value. DeMarco wrote the foreword, calling it a book that “needs to be on the desk of just about anyone who manages anything.”

Rather than proposing another metric, it’s worth asking what any genuine solution would need.

Understanding semantic change, not syntactic artifacts. Current metrics count what happened (lines changed, PRs merged) without understanding what it meant. A meaningful measure would need to comprehend what actually changed—whether a modification fixed a critical bug, introduced a new capability, or improved maintainability.

Deep context about codebase and architecture. The same diff has different significance in different contexts. A change to a payments module carries different risk than a change to a logging utility.

Distinguishing different forms of value. Engineers create value in many ways: writing code, reviewing others’ code, mentoring, designing systems, debugging production issues. Senior engineers often have outsized impact through review and design work that produces few commits. Traditional metrics make these contributions invisible.

Resistance to gaming. Goodhart’s Law suggests this requires measuring something close to the actual goal—the semantic value of changes—rather than proxies that can be optimized independently.
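
Taken together, these requirements describe a record that no traditional metric captures. Here is a hypothetical shape for a per-contribution assessment; every field name is illustrative rather than an existing standard or product schema:

```python
# A hypothetical per-contribution assessment record. Field names are
# illustrative only, not an existing standard or product schema.
from dataclasses import dataclass
from enum import Enum

class ContributionKind(Enum):
    CODE_CHANGE = "code_change"
    CODE_REVIEW = "code_review"
    DESIGN = "design"
    DEBUGGING = "debugging"
    MENTORING = "mentoring"

@dataclass
class ContributionAssessment:
    kind: ContributionKind   # value comes in more forms than commits
    summary: str             # what actually changed or was decided, in plain language
    semantic_category: str   # e.g. "fixes an auth bypass", not "+10 / -3 lines"
    context: str             # where in the system it lands, and why that matters
    estimated_risk: str      # judged against the architecture, not the diff size

example = ContributionAssessment(
    kind=ContributionKind.CODE_CHANGE,
    summary="Closes a session-fixation hole in the login flow",
    semantic_category="security fix",
    context="authentication service, on the request path for every user",
    estimated_risk="low regression risk, high cost if wrong",
)
```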

This describes a very sophisticated judge: one capable of reading code, understanding its context, and making nuanced assessments about value. For most of software engineering’s history, only humans could do this—and humans can’t do it at scale. A VP of Engineering can’t personally read every pull request across a 200-person organization. The information asymmetry between the people doing the work and the people making resource decisions has seemed insurmountable.

Just as the measurement problem seemed intractable, AI coding assistants arrived—and made everything more complicated.

The productivity gains are real but deeply asymmetric. Microsoft and GitHub’s 2024 field experiment across 4,867 developers found a 26% increase in completed pull requests with Copilot. But the METR randomized controlled trial (July 2025) found AI tools made experienced developers 19% slower on real-world tasks in mature repositories. The most striking finding: developers predicted a 24% speedup beforehand and estimated 20% faster completion afterward—a massive perception gap that persisted despite objective measurement.

The quality picture is troubling. GitClear’s analysis of 211 million changed lines found that code churn—code reverted or updated within two weeks—doubled after AI adoption. Uplevel Data Labs found developers with AI access showed a 41% increase in bug rate. Google’s DORA 2024 report found every 25% increase in AI adoption correlated with a 7.2% drop in system stability.

Here’s the fundamental problem: when a developer can generate 5,000 lines of code in an instant, lines of code stops meaning anything at all. Traditional metrics assumed that code output required human effort. AI breaks that assumption completely. A developer who thoughtfully reviews and refines AI-generated code might produce fewer commits than one who accepts suggestions uncritically. The first is creating more value; the second looks more productive.

Here’s the irony: the same technology that broke traditional metrics might be the first thing capable of replacing them.

Large language models can read code. They can understand what a diff actually does—whether it fixes a security vulnerability, introduces a new feature, or just adds boilerplate. They can distinguish a 10-line change that prevents a production outage from a 1,000-line change that reorganizes imports. They can do this at scale, across every commit in an organization, continuously.

This isn’t speculative. Google’s AutoCommenter research (2024) demonstrated LLM-backed systems that understand code semantically well enough to identify best-practice violations and provide actionable feedback. The same capability that enables “this code has a potential null pointer exception” also enables “this change fixes a critical bug in the authentication flow.”
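
As a concrete illustration, here is a minimal sketch of that kind of assessment. It assumes only an injected `ask_llm` callable that sends a prompt to a model and returns text; the prompt and helper are my own, with no particular vendor API implied:

```python
import json
from typing import Callable

# Hypothetical classifier: `ask_llm` is any callable that sends a prompt to a
# large language model and returns its text response; no vendor API is assumed.
PROMPT_TEMPLATE = """You are reviewing a code change.

Diff:
{diff}

Respond with a JSON object containing:
  "summary": one sentence describing what the change actually does,
  "category": one of "security_fix", "bug_fix", "feature", "refactor", "boilerplate",
  "significance": "low", "medium", or "high", with a one-sentence rationale.
"""

def assess_diff(diff: str, ask_llm: Callable[[str], str]) -> dict:
    """Describe a diff semantically instead of counting its lines."""
    response = ask_llm(PROMPT_TEMPLATE.format(diff=diff))
    return json.loads(response)  # a real system would validate and retry here
```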

But semantic understanding of individual changes is only half the problem. The other half is synthesis: turning thousands of assessments into something leaders can act on.

Traditional metrics cope with overwhelming activity by reducing everything to numbers: 47 PRs merged, 12,000 lines added, velocity up 15%. The numbers are digestible but meaningless. AI enables a different approach: understanding work at the atomic level, then synthesizing it into narrative. Not “the platform team merged 23 PRs” but “the platform team completed the Kubernetes migration and resolved three critical performance bottlenecks.” Not “this engineer’s commit count is down” but “this engineer spent the sprint on architectural review and mentoring, with high-impact contributions across six projects.”
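
A sketch of that synthesis step, again with hypothetical record fields (`team`, `summary`, `significance`) and the same kind of injected model callable:

```python
from collections import defaultdict
from typing import Callable

# Hypothetical synthesis step: group per-change assessments (plain dicts with
# "team", "summary", and "significance" keys) by team, then ask a model to
# write the narrative a leader would actually read.
def team_narratives(assessments: list[dict], ask_llm: Callable[[str], str]) -> dict[str, str]:
    by_team: dict[str, list[dict]] = defaultdict(list)
    for a in assessments:
        by_team[a["team"]].append(a)

    narratives = {}
    for team, items in by_team.items():
        bullets = "\n".join(f"- {a['summary']} (significance: {a['significance']})" for a in items)
        prompt = (
            f"Summarize what the {team} team accomplished this sprint, "
            f"emphasizing outcomes over activity counts:\n{bullets}"
        )
        narratives[team] = ask_llm(prompt)
    return narratives
```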

This is what leadership actually needs—not more numbers, but genuine understanding of what their organization is accomplishing.

This path has real constraints. AI can’t see everything that matters: the conversation that prevented a bad architectural decision, the mentoring that accelerated a junior engineer’s growth, the relationship-building that enabled cross-team collaboration.

And Goodhart’s Law doesn’t disappear just because measurement is more sophisticated. But semantic measurement has an advantage: it’s closer to the actual goal. Gaming lines of code is trivial because lines of code have no intrinsic relationship to value. Gaming a system that evaluates actual significance is harder—you’d have to create actual significance, which is the point.

Martin Fowler wrote in 2003 that software productivity “cannot be measured” because output cannot be measured—and even if it could, “true output” is business value delivered, which varies enormously and manifests over years. Kent Beck and Gergely Orosz, responding to McKinsey in 2023, called the consultancy’s measurement framework “absurdly naive” and warned it would “do far more harm than good to organizations.”

I’ve spent years working on this problem, and I don’t have clean answers. But I’ve become convinced of a few things.

Traditional proxy metrics—lines of code, velocity, PR counts—were always flawed. In the age of AI-generated code, they’re actively misleading. They measure what’s easy rather than what matters, they’re trivially gamed, and they corrupt the processes they’re meant to monitor.

Any serious path forward requires understanding what code actually does, not counting artifacts of work. This is a semantic problem, not a syntactic one. And for the first time, we have tools capable of semantic understanding at scale.

The goal isn’t perfect measurement—that’s neither achievable nor necessary. It’s measurement good enough to recognize that a 10-line security fix can be more valuable than 5,000 lines of AI-generated boilerplate. Good enough to make visible the engineers who create value through code review, debugging, and the unglamorous work that keeps systems running. Good enough to give leaders genuine insight into what their organizations are accomplishing, not just how busy they look.

Whether we can build systems that see work clearly—that understand what software actually accomplishes, synthesize detail into insight, and resist the gaming dynamics that have corrupted every previous approach—remains an open question.

That’s the question I’ve spent years working on.

I work on tools in this space at Maestro AI.