Part of PL research is building a science of programming. PL has a science of program correctness: we make precise claims about soundness / completeness / undefined behavior / etc., and we justify those claims with proofs. PL has a science of program performance: we make precise claims about relative speed or resource utilization compared to prior art, and we justify those claims with benchmarks.
PL researchers also care about program usability. Foundational ideas in our field are predicated on beliefs about people:
- Compilers: “We were after getting correct programs written faster, and getting answers for people faster” (Grace Hopper)
- Structured programming: “Our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed” (Edsger Dijkstra)
- Profiling: “We think we know what programmers generally do but our notions are rarely based on a representative sample of the programs which are actually run on computers” (Donald Knuth)
- Language design: “The primary purpose of a programming language is to help the programmer in the practice of his art” (Tony Hoare)
These concerns still ring true today – one can find dozens of examples in every PL venue (even POPL!) describing systems or techniques as “usable”, “intuitive”, “easy to reason about,” and so on. However, the thesis of this post is that PL research lacks a commensurate science of human factors for evaluating these claims. PL papers tend to make imprecise usability claims and justify them with weak methods like comparing lines of code (LOC). The goal of this post is to offer actionable advice for PL researchers who want to improve their human-centered vocabulary and methods. Rather than just saying “run more user studies”, I will argue that we should embrace qualitative methods, modernize usability metrics, and leverage prior work on programmer psychology.
User Studies Cannot Be The (Only) Answer
I am not the first person to argue for a more human-centered approach to PL. “Programmers Are Users Too” and “PL and HCI: Better Together” are excellent articles that advocate for importing ideas from the field of human-computer interaction (HCI) into PL research, such as user studies, iterative design, and contextual inquiry. These techniques are valuable; I use them in my own research on languages with sizable populations of existing users, such as Rust.
However, these techniques are often impractical for PL research. Consider trying to run a user study:
- You need to get approval from an ethics board. If it’s less than two months before your conference deadline, then it’s probably too late!
- You need to recruit participants. Consider any given niche of PL: cubical type theory, Racket macrology, e-graph optimizers, or quantum computing. There may be at most 100 people in the world who are qualified to understand your tool.
- You need to teach participants your tool. But no one meaningfully picks up complex languages like Haskell or Coq in a day. Do you have the money to pay for world-class programmers to learn your tool for potentially a week or more?
- You need to have your participants stress-test the conceptually important aspects of the tool. But more likely, they’ll get hung up on the “front-line UX”: bad error messages, confusing documentation, and so on. And what if your conceptual contribution is most apparent in large codebases? Then your participants need to learn the large codebase first! (How big is your grant again?)
- Hopefully you find a positive result at the end!
A relevant parable comes from the systems community: the Raft consensus protocol. The principal design goal of Raft is reflected in its paper title, “In Search of an Understandable Consensus Algorithm”. Diego Ongaro and John Ousterhout felt that existing algorithms (namely Paxos) were too complicated for most people to understand, so they invented an algorithm that they believed was more understandable. This belief has arguably been borne out in practice. Many production-grade distributed systems have adopted Raft in part due to its perceived understandability.
Yet, Ongaro and Ousterhout struggled to convince program committees. Their initial paper relied on qualitative justifications, and was repeatedly rejected from NSDI and SOSP for lack of a sufficiently rigorous evaluation. The pair eventually did a user study that involved teaching Raft and Paxos to students at Berkeley and Stanford, and then comparing the algorithms’ understandability via scores on a post-lecture exam. Ousterhout later said of the experience:
“User studies are […] a huge amount of work, and you don’t know whether it’s going to work until you reach the very end. […] We couldn’t tell whether the results were going to favor Raft or Paxos. And we were wondering, what are we going to do if Paxos wins? Do we cancel the project at this point?”
Thankfully, they found a small but positive result: test scores were 8% higher with Raft than with Paxos. And even still, PCs were unimpressed — the study was lambasted as unscientific, non-neutral, and based on too small a sample. Ousterhout remarked that the ultimate value of the study was that it “made the PCs stop complaining about a lack of evaluation.”
What can we learn from this parable? My takeaway is that these kinds of user studies are functionally useless. They consume an enormous amount of researcher time and energy. Their results are shallow — an 8% increase in test scores would not predict Raft’s substantial impact in practice. But the culprits here are not Ongaro and Ousterhout, but rather the culture in software systems research that refuses any other form of evidence for a human-centered claim. We need to develop better standards for usability evaluation so that the next Raft does not die on the vine when its last-ditch user study goes wrong.
Making “Usability” Precise
A good place to start is vocabulary: how can we articulate human-centered claims? For example, we would not generally accept a PL paper that merely claims a type system is “correct.” We expect a more precise statement, such as “sound with respect to undefined behavior”. However, there are myriad examples of PL papers that make claims about a system’s “usability”, point blank. Here’s a selection covering every major PL conference just in the past year:
- “Usability of quotient types in a general purpose language was the guiding principle in the design choices of Quotient Haskell” (Hewer and Hutton, POPL 2024)
- “We demonstrate the usability and versatility of the [JavaScript regular expression] mechanization through a broad collection of analyses, case studies, and experiments” (De Santo et al., ICFP 2024)
- “We plan to enhance [our language’s] usability by not requiring the user to specify the hyperparameters for every probabilistic program.” (Garg et al., PLDI 2024)
- “To demonstrate the usability and maturity of Byods, we implemented a realistic whole-program pointer analysis for LLVM IR” (Sahebolamri et al., OOPSLA 2023)
- “To improve the usability of Allo, we have implemented the frontend language in Python” (Chen et al., PLDI 2024)
In these quotes, usability contains many shades of meaning:
- Usability as “able to be used”: literally that a system works, and is not just theoretical.
- Usability as “able to be used in realistic contexts”: that a system can work, with enough effort, for a realistic type/scale of problem.
- Usability as “easy to be used by people”: that a person can work with the system productively, or that it has a low learning curve.
I believe we should separate the first two meanings from the third. “Able to be used [in realistic contexts]” is not really a human-centered claim as much as a logistical one, whereas “easy to be used by people” is a distinct category. I will propose that the first two meanings would be better called practicality than usability.
For the human-centered definition of usability, we should use more precise words to delineate why a system is more usable. For example, say the usability claim is “people used to have to manually specify a thing, but now we can infer that thing for them” (as in the Quotient Haskell and probabilistic programming examples). The reason this improves usability depends on why the user previously needed to specify the thing. Is the new system providing a “smart default”, i.e., a reasonable but not-guaranteed-perfect choice? Or is the new system providing a complete substitute that the user never needs to reason about? In the former case, I would say the system is reducing premature commitment (the need to provide details before necessary), while in the latter it is reducing diffuseness (the amount of notation needed to express an idea).
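To make the distinction concrete, here is a minimal, hypothetical sketch in Python. The API, the hyperparameter, and the helper function are invented for illustration; the point is only how the two design moves differ.

```python
# Hypothetical model-fitting API; all names and parameters here are invented.

# Baseline: the user must commit to a hyperparameter up front, and every call
# site carries the extra notation.
def fit_v1(model, data, learning_rate):
    ...

# Smart default: a reasonable but overridable choice. The decision can be
# deferred or revisited later, so premature commitment is reduced.
def fit_v2(model, data, learning_rate=0.01):
    ...

# Full inference: the system chooses the value itself and the user never
# reasons about it, so the notation shrinks and diffuseness is reduced.
def fit_v3(model, data):
    learning_rate = infer_learning_rate(model, data)  # hypothetical helper
    ...
```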
These concepts of premature commitment and diffuseness are just two of the many cognitive dimensions of notation, a set of criteria for evaluating the usability of programming tools derived from programming psychology experiments. The cognitive dimensions are not perfect or exhaustive, but they are a good starting point. I strongly encourage PL researchers to read the linked paper and consider adopting its vocabulary in lieu of general statements about usability. If you want concrete examples of papers using the cognitive dimensions, check out “Designing the Whyline” (Ko and Myers 2004) and “Protovis: A Graphical Toolkit for Visualization” (Bostock and Heer 2009).
Additionally, most usability claims need to be qualified with the expected audience. For instance, “we have implemented the frontend language in Python” only makes a tool more usable to people who know Python. (This raises the more interesting question: why pick Python?) If you want to make a usability claim, consider whether your system might be used differently along lines like these:
- Novice programmers vs. expert programmers
- Application developers vs. library developers vs. language developers
- People working in small teams vs. people working in large teams
- People working in open-source vs. people working in companies
Before moving on, I want to briefly comment on some other questionable words:
Intuitive: I have been on the record against this word for a while. At this point, I am convinced that no one in the PL community actually knows what it means. Let me quote Merriam-Webster:
intuition: the power or faculty of attaining to direct knowledge or cognition without evident rational thought and inference
I will assert there is very little in PL research that one can deduce without rational thought or inference. However, I am not focusing heavily on “intuitive” because it is less commonly used to describe systems than to describe steps in proofs. I believe what authors usually mean is “informally”, “abstractly”, “in summary”, or “at a high level”.
Complex: I suspect that researchers use this word far too freely (89% of papers last year used it at least once). Putting aside its use as a term of art (“computational complexity”), there are few intrinsic measures of complexity (e.g., of a system, concept, or proof). Complexity is in the eye of the beholder, but papers are rarely explicit about who is the beholder.
Thinking Past Lines of Code
Imagine you were the first person to invent the concepts of map, filter, and reduce. You probably believe these functions are useful, i.e., they make list processing more usable than C-style for-loops / if-statements or ML-style inductive functions. How would you justify that claim? In a PL paper today, you might find a subsection of the evaluation that looks like this:
“On a benchmark of 10 list processing tasks, we implemented each task in C and in Lambda-C. The C programs were on average 15 lines long, while the Lambda-C programs were on average 5 lines long. Therefore, the Lambda-C programs are 66% more usable.”
This is a caricature, but it is uncomfortably close to reality. Software systems research frequently relies on LOC as a proxy for every human-centered attribute of a system: usability, complexity, difficulty, and so on. Depending on the venue, ¼ to ¾ of (distinguished!) PL/SE/systems papers involve measurements of code size (Alpernas et al. 2020). This reliance is a symptom of the problem Ousterhout observed — qualitative evaluation is often not perceived as comparably rigorous to quantitative evaluation, so authors feel pressure to publish some kind of number to back up a usability claim.
The challenge for PL researchers is to identify evaluation methods with a high return on investment. User studies can (sometimes) provide significant returns in rigor, but the investment is also significant. Measuring code size is low investment, but it provides few returns beyond the illusion of rigor.
I believe the first step forward is that PL/SE/systems researchers must embrace qualitative evaluation. I’m referring to arguments based on design principles and case studies rather than numeric data. For instance, John Backus provided a powerful set of qualitative arguments for combinator-style data processing in his Turing lecture. I could rephrase a few of his arguments using the cognitive dimensions:
- Closeness of mapping: if your task is “do f to each element of a list”, then map f describes that task more directly than the equivalent for-loop (see the sketch after this list). For example, it should be easier for me to formally prove that map f implements the task than the loop-style program.
- Error-proneness: if you use an indexed for-loop (e.g., for(int i = 0; i < N; i++) { ... }) then you could easily slip up: i could be initialized to the wrong number, N could refer to the wrong length, i++ could be the wrong step, or you could use the wrong variable when indexing (e.g., a[j]). Combinators reduce the number of such errors.
- Role-expressiveness: when a programmer reads a map f, it is clearer at a glance what it’s doing than an equivalent for-loop.
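Here is the sketch referenced above, written in Python for brevity rather than C or an ML-family language; the task and data are invented:

```python
# Task: "apply f to each element of a list".
def f(x):
    return x * 2

xs = [1, 2, 3, 4]

# Loop style: the intent is scattered across index initialization, bound, step,
# and indexing, and each piece is a chance to slip up (error-proneness).
ys_loop = []
for i in range(0, len(xs)):
    ys_loop.append(f(xs[i]))

# Combinator style: the program text mirrors the task statement directly
# (closeness of mapping), and there is no index to get wrong.
ys_map = list(map(f, xs))

assert ys_loop == ys_map == [2, 4, 6, 8]
```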
When this kind of analysis is performed systematically across a carefully-selected range of examples, it should be considered a legitimate form of evaluation. Reviewers should not categorically require quantitative evidence as a precondition for publication. It’s equally important that communities develop shared standards for qualitative evaluation. Shared standards help authors design better evaluations, and they help reviewers more constructively critique those evaluations. Documents like the SIGPLAN Empirical Evaluation Guidelines should be modernized to reflect best practices for performing qualitative evaluations such as case studies.
I would argue that effective case study analysis includes characteristics like diverse selection (picking a reasonable and representative set of examples), detailed comparison (analyzing a case study in depth, e.g., giving “thick description”), and distinct insights (generating ideas that specifically need qualitative rather than quantitative analysis). For example, these are some recent PL papers that embody these characteristics:
- “Capturing Types” (Boruch-Gruszecki et al. 2024) shows how to evaluate the smoothness of a transition to an extended type system.
- “Associated Effects” (Lutze and Madsen 2024) shows how to build up increasingly complex examples that demonstrate type-level expressiveness against related work.
- “The Ultimate Conditional Syntax” (Cheng and Parreaux 2024) shows how to deeply compare against a wide range of related work.
Improving Usability Metrics
I don’t want to come across as saying that quantitative usability metrics like LOC are useless. For example, say that two systems do the same thing, but one is 100 LOC while the other is 10,000 LOC. Without any further information, most developers would guess that the 100 LOC system is more usable, at least along certain dimensions like readability.
The problem is that LOC is a very lossy proxy for usability. For example, a 2021 study evaluated code complexity metrics by comparing them to people’s subjective reports and objective measures of brain activity via fMRI (Peitek et al. 2021). Every evaluated metric, including LOC, had at best a small correlation with both subjective and objective complexity. So if two systems are 600 vs. 800 LOC, it’s not obvious that the 600 LOC system is 25% more usable.
My proposed second step forward is that researchers should develop new usability metrics that better map onto human experience. For example, rather than rejecting LOC comparisons entirely, we should only accept LOC comparisons that are very likely to indicate a difference in code complexity. Here’s a serious proposal for one way to do that. In the study of human perception, there’s a concept called Fechner’s law:
“Sensation experienced is proportional to the logarithm of the stimulus magnitude”
You may have encountered this concept in sound measurements. Audio decibels are a logarithmic scale, because a sound that is objectively twice as loud is not always perceived as twice as loud. This law actually holds for all the senses: sight, sound, touch, smell, and taste.
I claim that it is useful to conceptualize code size on a logarithmic scale because each step of the scale corresponds to a meaningfully different level of complexity. 1 LOC is a single action. 10 LOC is your average function. 100 LOC is an average class. 1,000 LOC is a library. 10,000 LOC is a small application. And so on. In terms of decibel lines of code (call it “deciloc”, or “dLOC”), those are 0, 10, 20, 30, and 40 dLOC, respectively. As a community, we could require papers to report code size in dLOC rather than raw LOC. I believe that would disincentivize researchers from relying on marginal LOC differences (say, less than 5 dLOC) just to provide the aesthetics of quantitative evaluation to their paper.
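For the arithmetic behind that proposal, here is a minimal sketch; the dloc helper is my own naming for the conversion described above.

```python
import math

def dloc(loc: int) -> float:
    """Convert raw lines of code to 'deciloc', a decibel-style logarithmic scale."""
    return 10 * math.log10(loc)

# The anchor points from the paragraph above: 1, 10, 100, 1,000, and 10,000 LOC.
for loc in [1, 10, 100, 1_000, 10_000]:
    print(f"{loc:>6} LOC = {dloc(loc):.0f} dLOC")

# A marginal difference like 600 vs. 800 LOC is only about 1.2 dLOC apart,
# well under the 5 dLOC threshold suggested above.
print(f"600 vs. 800 LOC: {dloc(800) - dloc(600):.1f} dLOC apart")
```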
More generally, I want researchers (especially those with the experience and capacity to run user studies) to both explore more usability metrics beyond LOC, and to validate those metrics. Validation is the concept in psychometrics of demonstrating whether a metric is a useful proxy for some human attribute. For instance, the previously cited fMRI study argues that LOC is not a particularly valid metric for either subjective or objective cognitive complexity.
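As a hedged sketch of what validating a metric might look like in practice, here is one common psychometrics-style check: correlating a candidate metric against human ratings of the same programs. All numbers below are fabricated for illustration only.

```python
# Validation sketch: does a candidate metric track human judgments?
from scipy.stats import spearmanr

metric_scores = [12, 15, 20, 24, 31, 35]        # e.g., dLOC of six benchmark programs
human_ratings = [1.8, 2.2, 3.0, 2.9, 5.1, 6.3]  # mean perceived complexity (1-7 scale)

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# A strong, replicated correlation with the human attribute of interest is
# (part of) what it would take to call the metric a valid proxy for it.
```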
Consider another proposal. Rather than comparing the size of two programs, what if we compared the size of the argument that the programs do what they’re supposed to? Imagine running a user study where you take a bunch of specifications, implement each spec in several programming models embedded in a proof assistant, and then formally prove that each program in each model implements its spec. I conjecture that the length of the proof would correlate better with cognitive complexity than the length of the program. If you ran a study showing that this is true, then you would have produced a validated metric that other researchers can use without needing to run a user study of their own. That is one way in which human-centered PL researchers can produce evaluation methods useful to the rest of the community.
For further inspiration, here are some creative metrics for system usability I’ve seen in prior work:
- “Automatically Scheduling Halide Image Processing Pipelines” (Mullapudi et al. 2016) evaluates a Halide auto-scheduler by measuring the time it takes for two Halide experts to construct a schedule of equivalent or greater performance than the automated algorithm.
- “How Profilers Can Help Navigate Type Migration” (Greenman et al. 2023) compares strategies for optimizing gradually-typed codebases by running thousands of simulations of programmers engaging in random variations on each strategy.
- “Catala: A Programming Language for the Law” (Merigoux et al. 2021) evaluates the perceived accessibility of their language to legal experts by repeatedly measuring participant confidence at different stages of a training session.
Calls to Action
If you want to make a claim about usability in your paper: Challenge yourself to be as specific as possible, and don’t stop at “usability.” Read the cognitive dimensions paper. Invent your own terms!
If you want to evaluate a claim about usability in your paper: Before you reach for lines of code, consider making a strong qualitative argument. Use the specificity of your argument to inspire better metrics than lines of code. But if you really want to use lines of code, consider a logarithmic scale (dLOC).
If you are reviewing a paper that makes a usability claim: Be demanding on the specifics. Don’t let papers get away with broad-brush statements like “Python makes something usable”. But be fair on the evaluation. Don’t just demand a user study or a lines-of-code comparison because that seems rigorous.
If you want to do research to improve usability evaluation: my inbox is always open (will_crichton@brown.edu), and I am recruiting PhD students (https://cel.cs.brown.edu/)!
About the author: Will Crichton is an incoming assistant professor at Brown University. His research combines programming language theory and cognitive psychology to develop a science of the human factors of programming. Will received his PhD from Stanford University in 2022 advised by Pat Hanrahan and Maneesh Agrawala, and he was a postdoc with Shriram Krishnamurthi before starting as a professor at Brown.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.