In 2016, Microsoft made a pull request to Node.js that added almost 3 million lines of code. The Node.js maintainers were actually super cool about it, since it enabled Node.js to use Microsoft’s pretty decent Javascript engine (ChakraCore) optimized for Windows. They were in a bit of a pickle, however, since they didn’t really get any heads up about it, and 3 million lines of code is a lot of fucking code.
Eventually, after one of the longest discussion threads to ever exist on a pull request, it was abandoned, left as a branch with the promise of staying current, and then that branch was itself abandoned in 2020. In 2021, Microsoft’s entire Javascript engine was abandoned too, although active development had stopped as early as 2018.
I’m pretty sure the pull request was too large, but what is the right size? That’s what we’re here to talk about today.
Before I get started, I want to bring some closure to the tale of Microsoft’s pull request (PR).
Should they have just merged it?
Hell no. Besides there just being too much stuff all at once, there was also the question of whether Microsoft’s code even belonged in Node.js. I’d like to point you to this blog post by my good friend Ashley Gullen (we’ve never met or interacted in any way). I’ll summarize it for you anyway since I wouldn’t be arsed to click the link either.
> 75% to 95% of the total work involved in software engineering is maintenance
And if you skim the discussion of Microsoft’s ill-fated PR, the subtext in criticism from maintainers was roughly “I don’t want to be responsible for maintaining this shit.”
Reviewability
Microsoft’s PR consisted entirely of code from an existing well-tested codebase that was surprisingly loved. But they missed perhaps the most consequential of all the -ilities, reviewability.
It doesn’t matter if your code is readable, decomposed, well-tested, and performant. If it’s not reviewable then it doesn’t get merged. Period.
I wrote a four-part series in 2020 on how to make code reviewable, and I stand by it today. The first and most important step to writing reviewable code is making sure that your reviewers are on board with the intent of it beforehand (Microsoft failed here).
After you clear that hurdle, it’s just a matter of ensuring that each review is digestible by the reviewers (Microsoft failed here too). You want your review to take about ten minutes, and you don’t want to include more than 4 “things” in it.
But 4 things and 10 minutes is perhaps an average, if not close to an upper bound, for your code reviews.
How small can you go? How small should you go?
Minimum reviewable unit
In academia, a minimum publishable unit is the minimum amount of information that can be used to generate a publication in a peer-reviewed venue.
In software engineering we can make the analog that a minimum reviewable unit is the minimum amount of information that can be used to generate a pull request in a code review.
“Minimum publishable unit” has negative connotations because researchers who do it are seen as trying to artificially get their “pub count” up, which is a simple and easy metric that institutions use as part of making hiring and tenure decisions (reference). Gee, who would have thought that such a metric would be gamed?
However, in the field of software engineering, a minimum reviewable unit can be the optimal way to incrementally deliver value. A fortunate side effect is that delivering code this way improves the metrics that companies measure about you (“Are they good metrics? It doesn’t matter!”).
The guide for submitting patches to the Linux kernel says:
> Solve only one problem per patch.
Easy enough. If our upper bound for a code review is four things, then the lower bound should be one thing. But what the hell is one thing?
If we got pedantic with it, the minimal amount of information we could change would be one bit. For example, changing the ASCII character 'b' (binary 01100010) into 'c' (binary 01100011). How many bug fixes have you seen that were just one character?
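If you want to verify the bit math for yourself, here's a two-line Python check (pure ASCII arithmetic, nothing hypothetical about it):

# 'b' is 0b01100010 and 'c' is 0b01100011; they differ in exactly one bit
assert ord('b') ^ ord('c') == 0b00000001
# flipping that low bit "fixes" the character
print(chr(ord('b') ^ 0b1))  # prints 'c'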
Though if someone on my team tries to merge their next feature one character at a time, then ritualistic flaying would be too light a punishment. So one bit might be a theoretical minimum, but what is a practical minimum?
Minimum practical reviewable unit
Stanislau the plasma physicist and his intern PeeWee are writing code for the next generation nuclear fusion reactor. Stanislau keeps submitting pull requests containing an entire research paper’s worth of equations. PeeWee has but a lowly bachelor’s in mathematics with no specific domain knowledge in plasma physics (what an idiot, right?). These pull requests take hours and hours for PeeWee to review, which is completely impractical.
If Stanislau isn’t going to spend more time carefully walking PeeWee through the code, explaining and defending what each piece does, then he needs to break up his pull requests into pieces that can be reviewed in a timely fashion by his intern. On their team, “one thing” is likely closer to what PeeWee considers to be “one thing”.
Erica the staff software engineer works on a giant monolithic web app that is showing its age. She can’t merge her changes until:
- The code builds from scratch (20 minutes)
- The unit tests pass (20 minutes)
- The king of France makes a royal decree that the code is acceptable (~6 months)
- The code coverage tool reports adequate unit testing (5 minutes)
- The regression tests pass (1 hr)
- The manual testers sign off (6 hours)
- The product owner signs off (6 hours)
It’s not uncommon for her pull requests to take an entire week to get merged into the main branch. It has become impractical for her to split her pull requests into very small pieces, because the overhead of maintaining stacked branches and multiple in-flight pull requests is just too much to cope with. As a result, each of her pull requests becomes about as large as they can reasonably be.
Until her company invests in streamlining the pipeline by, for example, splitting large independent modules into their own repositories, Erica’s situation is only going to get worse. On her team, “one practical thing” means “as much as I can reasonably put into a single review”.
Jordan works for a tech org that has dictated everyone must use feature branches instead of trunk-based development for their features. After all, feature flagging services cost money and their company is in the business of making money, not spending it on superfluous things that would make everyone’s lives easier and save them money in the long run. Also, the shitty coffee on the fourth floor costs $5. Developers are allowed to do whatever they want in their feature branches to get work done, but the rule is that once code is to be merged into the main branch it must undergo a thorough review.
It’s not just impractical for Jordan to have small pull requests, it’s impossible. Sure, they can have each commit to their feature branch reviewed, and theoretically the “big bang” merge into the main branch should just be a formality, but rules are rules. In Jordan’s company, “one thing” means “one feature”.
Finally, Alex is the senior developer on a “two-pizza team” that doesn’t actually get any pizza. Their build pipeline is really fast, and Alex only needs one other developer to sign off before they can merge a branch. Nice, right? Except Alex is inundated with six or more reviews every day. On bad days it’s as high as a dozen.
Alex’s teammates are assholes. Share the load, people. That is, unless Alex is the asshole who insists on this arrangement, but one developer would never be so conceited as to force all changes to go through them first, right?
“One thing” on Alex’s team should be very small, but the turnaround on reviews is longer than necessary because they all seem to go through one person. So, like Erica, devs will include just a little bit more in each pull request so that they don’t get sent to the end of the line.
It’s a social problem
The minimum practical reviewable unit is a social optimization problem that must take into account the specific team, project, CI/CD process, and culture. As Tom DeMarco and Timothy Lister wrote in Peopleware:
> The major problems of our work are not so much technological as sociological in nature.
Ideally we want “one thing” to be fairly small, but still self-contained. Here are some best practices to ensure that this is practical:
It should be digestible by the least-expert reviewer you’ll have; otherwise reviews will take a long time, and they’ll probably miss defects anyway.
Keep an eye on the time between when a pull request is opened and when it is merged. Pull out code from your application into other repositories where they can be consumed as libraries or even as separately-hosted services. Automate the hell out of your tests, and move the longest ones to run on a schedule that won’t stand in the way of a merge (in the TFS days we called these “rolling builds”).
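If you want actual numbers on that open-to-merge time, the GitHub REST API will happily give them to you. Here's a rough Python sketch (OWNER/REPO is a placeholder, and I'm ignoring auth and pagination for brevity):

import requests
from statistics import median
from datetime import datetime
# OWNER/REPO is a placeholder; no auth or pagination handling here
resp = requests.get(
    "https://api.github.com/repos/OWNER/REPO/pulls",
    params={"state": "closed", "per_page": 100},
)
def hours_open(pr):
    # GitHub timestamps look like "2021-04-14T11:19:54Z"
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    opened = datetime.strptime(pr["created_at"], fmt)
    merged = datetime.strptime(pr["merged_at"], fmt)
    return (merged - opened).total_seconds() / 3600
# closed-but-unmerged pull requests have merged_at set to null
turnarounds = [hours_open(pr) for pr in resp.json() if pr["merged_at"]]
print(f"median time to merge: {median(turnarounds):.1f} hours")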
Use trunk-based development so you can merge your changes into the main branch sooner. If you must use feature branches, find a way to rescope your features to be smaller, or alternatively allow for unreviewed or lightly-reviewed “big bang” merges, provided that every merge into the feature branch was thoroughly reviewed.
If the culture just won’t allow for a streamlined review, you can try to change it, which is slow even in small companies. You do this by finding the decision makers, which is usually at the director or even CTO level, and personally reaching out. Persistently reach out. Don’t stop reaching out. The squeaky wheel gets the grease. The alternative is to spend your days shitposting in Slack instead of working while you look around for other jobs, because just generally complaining to your colleagues isn’t going to accomplish squat. It’s pretty fun to just shitpost in Slack instead of working, so I can’t fault you if you go that route.
Empirical pull request size
Until AI completely takes over, you’ll just have to live with the fact that there’s no single quantifiable answer to the size of a minimum practical reviewable unit.
Let’s look at some numbers anyway. For fun.
I went around to some popular open source libraries and measured their commit sizes. I assumed that one commit = one pull request, which is definitely not always true, but is true enough for the repos I looked at.
Empirically, how large are pull requests?
The Linux kernel
The Linux kernel contains 1.3 million commits, which is not bad for a personal project from a random computer science undergraduate.

Over 44% of all commits changed 10 lines or fewer, and 4% of all commits changed just 1 line!

The distribution of commit sizes appears to follow a power-law distribution, where the median commit size is 14 LOC, but the average of 84 LOC is skewed by the long tail of larger commits.
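That median-versus-average gap is exactly what heavy tails do, by the way. Here's a toy demonstration with synthetic Pareto-distributed "commit sizes" (made-up numbers, not the kernel data):

import numpy as np
rng = np.random.default_rng(42)
# classical Pareto samples, scaled to look vaguely like commit sizes (synthetic!)
sizes = (rng.pareto(1.2, size=100_000) + 1) * 5
print(f"median: {np.median(sizes):.0f}, mean: {sizes.mean():.0f}")
# the handful of enormous "commits" in the tail drags the mean
# several times higher than the median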
The largest commit touched over 300k lines:
commit 4f29f9cf092b2d331ba2081566be3272962b7f96
Author: <redacted>
Date: Thu Apr 14 15:19:16 2022 -0400

drm/amd: add register headers for DCN32/321
Add register headers for DCN 3.2.0 and 3.2.1.
It added a ton of code for AMD’s Display Core Next (DCN) architecture that handles all graphics coming from the GPU to your monitor. But the size of this commit is an outlier among outliers; only 30 commits changed more than 100k lines of code, and the next largest touched only about 260k LOC (15% fewer LOC).
Suffice it to say that the maintainers of the Linux Kernel appear to have done a great job in keeping commits small and adhering to their “do one thing” mantra. Except for that one dev who added 300k LOC; I think they did two things.
PostgreSQL
Git wasn’t invented until 2005, but the maniacs at PostgreSQL used a time machine to make their first commit in 1996. (They actually used a tool called cvs2git to migrate their repository.)

Despite having a longer git history than Linux, the PostgreSQL repository only contains about 61k commits. It, too, skews heavily towards small commits, with 35% changing 10 LOC or fewer. And it, too, appears to follow a power-law distribution.

I promise you that’s a different graph than the one I posted for Linux. Go ahead and check. The only tell is that the median commit size of PostgreSQL is 23 LOC and its average is 359 LOC, which is also skewed heavily by outliers.
The largest commit in PostgreSQL is this one that clocks in at a whopping 515k LOC.
commit aeed17d00037950a16cc5ebad5b5592e5fa1ad0f
Author: <redacted>
Date: Mon Mar 13 20:46:39 2017 +0200

Use radix tree for character encoding conversions.
Replace the mapping tables used to convert between UTF-8 and other character encodings with new radix tree-based maps. Looking up an entry in a radix tree is much faster than a binary search in the old maps. As a bonus, the radix tree representation is also more compact, making the binaries slightly smaller.
[...]
I had no idea there were so many different character encodings. There’s JOHAB, KOI8-U, Shift JIS and a whole mess of Windows-specific encodings. Each different conversion probably could have been added in its own commit, though.
That huge commit, too, is an outlier among outliers. There are only 24 commits over 100k LOC, and the next largest one is 30% smaller!
It’s clear that the maintainers of PostgreSQL have also maintained good commit hygiene. While its commits appear to change more LOC on average, I noticed that they also tend to include more test code, although I haven’t collected any formal data to that effect.
ChakraCore
What about Microsoft’s ChakraCore, the reason for the doomed pull request into Node.js that I discussed at the beginning of this blog post?

It appears that some enthusiasts have been trickling changes into the repository since it was abandoned by Microsoft, but it’s pretty clear that the developers were already transitioning off the project by the time Microsoft announced in 2018 that they were adopting Chromium for their browser. If I had to guess based on the data, the developers found out in July of the same year.
The total number of commits sits at 13k, although only about half are non-merge commits, which indicates a strong preference for squashing within the team. How large are these commits, though?

There are definitely some outliers here, but overall it also appears to follow a power-law distribution in LOC changed. The median LOC is an even 30 while the average is extremely skewed at 1,264. 31% of all commits changed 10 LOC or fewer.
Let’s take a closer look at that skew. There are only 3 commits that changed more than 100k LOC, and the largest commit changed 1,385,288 LOC. What happened there?
commit bdf3216cce7f1d3ba9338cddce74f45a753ae942
Author: <redacted>
Date: Mon Nov 6 17:28:37 2017 -0800

Merge unreleased/rs3 to release/1.6
ChatGPT tells me that it’s likely a long-lived feature branch for Windows Redstone 3 (a Windows 10 release) but I can’t be fucked to verify that for myself. If true, it only reinforces my point about feature branches encouraging huge “big bang” merges (read the story about “Jordan” above).
The important thing is that it appears to be a merge commit in disguise. The next largest commit was 99% indentation changes, so let’s call it even and exclude these two as outliers. This brings the average commit size down to 987 LOC.

So how did Microsoft do? Skimming over their commits, it appears the ChakraCore developers were even more diligent about adding tests than the PostgreSQL team, so I’m unsurprised to see higher numbers. It’s been my experience that teams of professional software developers (professional as in “getting paid to do it”) generally produce larger pull requests thanks to testing requirements. It’s quite often that I find myself writing more test code than feature code.
Node.js
I’m obligated to analyze the Node.js repo after I spent the first two paragraphs of this article discussing it. Let’s look at commit sizes year-over-year.

Node.js is maintaining a strong pace of activity, averaging nearly 3k commits per year for the last five years, and 43k commits overall. Let’s take a look at the commit size distribution:

To nobody’s surprise it also follows a power-law distribution with the vast, vast majority of commits on the smaller side; 39% of commits changed 10 LOC or fewer. This pulls the median LOC changed to 19.
What’s a little strange about this repository is that the average LOC changed per commit is way higher at 686. Why is that?
The largest commit in Node.js changed a whopping 5 million lines of code:
commit 66da32c045035cf2710a48773dc6f55f00e20c40
Author: <redacted>
Date: Wed Apr 14 11:19:54 2021 +0200

deps,test,src,doc,tools: update to OpenSSL 3.0
This pull request updates the OpenSSL version that is statically linked with Node.js from OpenSSL 1.1.1 to quictls OpenSSL 3.0.0+quic.
[...]
The commit appears to consist mostly of build-generated files targeting various architectures. In fact, there are over 100 commits that change more than 100k lines, and 5 that change more than 1 million! Microsoft’s pull request doesn’t sound so large after all.
The second- and third-largest commits also happened to be OpenSSL upgrades. The fourth- and fifth-largest commits were related to upgrading a dependency on a localization library (ICU), and in classic C++ fashion, Node.js builds its C++ dependencies from source, meaning any time they add or upgrade a third-party library, they have to bring in all of its source files.
Node.js, like any other library, has dependencies on other libraries. It’s just that upgrading these dependencies in Node.js results in huge commits. The maintainers aren’t doing anything wrong here; building your dependencies from source is a totally valid dependency-management strategy in C++ codebases. It murders your already-hours-long build times, though. While such pull requests aren’t small in terms of LOC, they still feel small.
Like the other repositories we’ve looked at, I think that the Node.js maintainers have done an excellent job of writing reviewable code. Flipping through their commits, it seems as if they’ve been able to keep the median LOC changed per commit small while also adding tests with almost every change. Kudos.
21 lines of code
We looked at four fairly large open source projects to get an empirical idea of how large a pull request should be. It was hardly a comprehensive study, but with medians of 14 (Linux), 23 (PostgreSQL), 30 (ChakraCore), and 19 (Node.js), both the median of medians and the average of medians come out to about 21 LOC changed per commit. That means the vast, vast majority of pull requests that you see should be around this number.
That doesn’t mean a pull request with a larger LOC changed is bad.
The standard deviation on LOC changed per commit was large, thanks to the power-law distribution. It was as high as 31k in Node.js and as low as a paltry 5k in PostgreSQL. That means a commit that changes even 5k LOC is generally within a standard deviation of the average.
Ultimately what should be considered “minimal” is up to your company, team, culture, CI/CD pipeline, and tech stack. The biggest takeaway for me is that large pull requests should be rare, and exponentially rarer the larger they get.
In closing
Pull requests should ideally only change “one thing”, whatever that means to your team. Too bad the real world doesn’t fit nicely into a little box. The reality is that the size of pull requests is a reflection of your entire company. While I can’t tell you exactly how large your pull requests should be, I can recommend some strategies that will help move you towards your ideal. I talk about each of these in the “Minimum practical reviewable unit” section:
- Use trunk based development; avoid feature branches
- If you must use feature branches, redefine “feature” to be quite small
- Break off large logical chunks of code (“bounded contexts”) into separate repositories with separate CI/CD
- Define “one thing” in terms of what a novice or early-intermediate reviewer can digest
- Move the longest-running tests to a schedule instead of making them a required step in a pull request
Commit sizes follow a power-law distribution, meaning commits that change 10 LOC should be an order of magnitude more common than commits that change 100 LOC, which should be an order of magnitude more common than commits that change 1k LOC, and so on. Large pull requests have their place, but if one comes across your desk, you should question whether it is artificially large or if it should have been broken up. Try not to be a pedantic jerk to your coworker who submitted it, though. Just try to help them do better next time.
Appendix: Data collection methodology
From each repository, I only collected commits that:
- had a single parent (a.k.a. “not a merge commit”)
  - This necessarily excludes the first commit from each repository, which I can live with
- added or removed at least one line of code from its parent commit according to git diff --shortstat
A funny thing happens if you include merge commits in the analysis. In Git, a merge commit can be between any number of branches (an “octopus merge”), so the question of “how many lines changed” is ambiguous. You might also end up with a merge commit that accidentally adds a new root commit to the repository, and depending on how you diff that merge commit, it can appear like it added the entire repository to itself. This has apparently happened 4 times in the course of the Linux kernel’s development. Those are discussed in more depth here.
When I say “lines of code changed” throughout the blog post I mean the sum of lines added and lines removed. For example, if a commit added 1 line and removed 2 lines, the total lines of code changed is 3.
- I pulled the Linux kernel from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git on March 8th, 2025, as directed by the Linux contributor’s guide
- I pulled PostgreSQL from https://git.postgresql.org/git/postgresql.git on March 19th, 2025, as directed by the PostgreSQL wiki
- I pulled ChakraCore from https://github.com/chakra-core/ChakraCore.git on March 19th, 2025, per its GitHub page
- I pulled Node.js from https://github.com/nodejs/node.git on March 19th, 2025, as instructed by their Contributing.md page on GitHub
I walked the commit trees using pygit2 but invoked git diff via subprocess like so:
diff_output = subprocess.run(["git", "diff", "--shortstat", parent_hex, commit_hex], capture_output=True, text=True).stdout.strip()
and parsed the output via a regular expression:
diff_regex = re.compile(r"(\d+) insertions?\(\+\)|(\d+) deletions?\(-\)")
matches = diff_regex.findall(diff_output)
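Putting those pieces together, the collection loop looked roughly like this (a reconstructed sketch, not the verbatim script; the repository path is a placeholder):

import re
import subprocess
import pygit2
# the repository path is a placeholder
repo = pygit2.Repository("/path/to/linux")
diff_regex = re.compile(r"(\d+) insertions?\(\+\)|(\d+) deletions?\(-\)")
for commit in repo.walk(repo.head.target, pygit2.GIT_SORT_TOPOLOGICAL):
    if len(commit.parents) != 1:
        continue  # skip merge commits (and the parentless root commit)
    parent_hex, commit_hex = str(commit.parents[0].id), str(commit.id)
    diff_output = subprocess.run(
        ["git", "diff", "--shortstat", parent_hex, commit_hex],
        capture_output=True, text=True, cwd=repo.workdir,
    ).stdout.strip()
    added = removed = 0
    for ins, dels in diff_regex.findall(diff_output):
        added += int(ins or 0)
        removed += int(dels or 0)
    if added + removed > 0:  # the "at least one line changed" criterion
        print(commit_hex, added + removed)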
I saved the output for all commits into a comma-delimited file (.csv) with the following headers:
“Id”, “Date”, “Lines Changed”, “Lines Added”, “Lines Removed”. The Id column was the full commit hash and Date was in the format YYYY-MM-DD, e.g., 2025-03-19. Because I grabbed the date data from pygit2, it was in UTC time. The other columns I think are self-explanatory.
I analyzed the data using a mixture of pandas, numpy, and scipy. The quartiles shown in the graphs were computed using the pandas quantile function (except Q2 was calculated using mean). The “Q4” label was merely placed to the right past the “Q3” line to indicate that the remaining volume belonged in the fourth quartile.
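The summary stats boil down to a handful of pandas calls. A sketch (the CSV filename is a placeholder; the columns are the ones described above):

import pandas as pd
# filename is a placeholder; columns match the CSV format described above
df = pd.read_csv("commits.csv", parse_dates=["Date"])
q1, q3 = df["Lines Changed"].quantile([0.25, 0.75])
q2 = df["Lines Changed"].mean()  # as noted, the "Q2" I plotted is actually the mean
print(f"Q1={q1:.0f}, Q2(mean)={q2:.0f}, Q3={q3:.0f}")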
The log-log graphs used base two instead of base ten because I felt like it showed a little more nuance in the data even though base ten probably fit the data better. Plus, like any self-respecting computer scientist, I’m a slut for powers of two.
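A minimal version of one of those log-log plots, reusing the df from the snippet above:

import matplotlib.pyplot as plt
# how many commits changed exactly N lines?
counts = df["Lines Changed"].value_counts().sort_index()
plt.scatter(counts.index, counts.values, s=4)
plt.xscale("log", base=2)  # powers of two, as promised
plt.yscale("log", base=2)
plt.xlabel("Lines changed per commit")
plt.ylabel("Number of commits")
plt.show()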
The data for the stacked bar plots that showed commit sizes per year was built by first computing a “Bucket” column in my DataFrame that placed each commit into one of the buckets shown in the plot’s legend, based on the “Lines Changed” column. The bucket intervals were left-open and right-closed, so the “1-10” bucket includes commits that changed exactly 10 LOC. This was accomplished using the pandas cut function, after which the data was grouped by year via the groupby function (some variables are excluded for legibility):
bins = [0, 10, 20, 50, 100, 200, 500, np.inf]
bin_labels = ["1-10", "11-20", "20-50", "51-100", "101-200", "201-500", "500+"]
# df is a pandas.DataFrame
# pd is an alias for pandas
df["Bucket"] = pd.cut(df["Lines Changed"], bins=bins, labels=bin_labels, right=True)
grouped = df.groupby([pd.Grouper(key="Date", freq="YE"), "Bucket"], observed=False).size().unstack(fill_value=0)
In the above snippet, size() is a call to DataFrameGroupBy.size, which creates a new Series object that looks like this:
Date Bucket
2009-12-31 1-10 253
11-20 103
20-50 185
51-100 127
101-200 86
# ... more dates
and the following call to unstack(fill_value=0) pivots the buckets to be columns, producing a DataFrame that looks like this:
Bucket 1-10 11-20 20-50 51-100 101-200 201-500 500+
Date
2009-12-31 253 103 185 127 86 70 80
2010-12-31 695 254 334 244 147 100 131
2011-12-31 881 277 365 262 183 133 151
# ... more dates and buckets
The graphs were plotted using matplotlib. Credit goes to Dr. Stephanie Valentine for providing copious input on how to present these graphs against my usual style of “eye vomit that induces fear in young children”.
Feel free to republish my figures so long as you attribute them to me.
Until next time.