Optimizing Git’s Merge Machinery, #1

Editor’s note: This is the first post in a series by a Palantir Software Engineer on optimizing git’s merge and rename detection machinery. Click to read the second post.

In the past, I talked about my work on Renaming and deep directory hierarchies in Git. I alluded to some performance improvements that I would be implementing for the merge machinery and rename detection capabilities in git. In this and subsequent posts, I want to talk about those improvements, which are being made as part of contributing a new merge algorithm to git[1]. In this blog post series, I will describe the optimizations I made to git, which I humorously summarized in my Git Merge 2020 talk as follows:

  • Don’t do unnecessary work
  • Don’t redo work
  • Don’t redo unnecessary work
  • Fudge “unnecessary”

But which I could summarize slightly more accurately as:

  • Find paths where detecting renames wouldn’t change the result, and don’t detect renames for those paths
  • Don’t look for a better-than-perfect match
  • Don’t repeatedly detect the same renames done by the upstream branch when rebasing (or cherry-picking) a series of commits (and especially don’t repeat detecting the ones that were unnecessary in the first place)
  • Find useful heuristics to guide rename detection in order to avoid (or at least reduce) brute-force exhaustive pairwise comparisons

I have completed the implementation since my Git Merge 2020 talk, and have also made a number of improvements to the above optimizations and added two new sets of optimizations:

  • Tree-level merging (which looks stubbornly incompatible with rename detection, but some tricks can often get it to work)
  • Good, old-fashioned optimization work — measure performance, find slow codepaths, and fix them individually

This first blog of the series explains how we use git at Palantir and why we care about these optimizations, then provides the background required to understand the optimizations, and finally presents the first and super simple optimization: don’t look for a better-than-perfect match (which itself is an improvement on the existing optimization of speeding up git merges by detecting renames based on file hashes rather than looking at file content).

Is the merge algorithm just about git merge?

The merge machinery in git (called merge-recursive) powers several aspects of git:

  • merge
  • cherry-pick
  • rebase
  • revert
  • am -3
  • stash
  • checkout -m

so we are talking about more than just git merge.

Why does git’s merge algorithm need work?

In short:

  • we need efficient refactoring of large codebases
  • we want to move towards sparse operations and improve their effectiveness
  • the existing code was at the limits of its design

Efficient refactoring of large codebases

At Palantir, we have a few large-ish monorepos, roughly on the scale of the Linux kernel. Within these repositories, we’ve at times done deep refactorings, including renaming high-level directories. We want git to handle such work efficiently. If cherry-picking a patch to backport it to some older branch doesn’t work because it hits limits, or takes an extraordinarily long time, folks will resort to hand-patching and risk missing important bits. We had a case where one team lost huge amounts of time, slowing down their work for months, due to the difficulty of backporting patches after a refactoring that involved a lot of renames. This resulted in them calling a meeting where they requested an agreement to never refactor the code again; it was a rather suboptimal state of affairs that could have been avoided with a better merge capability.

Sparsity

Beyond refactoring workflows, we want developers to be able to concentrate on just their area of interest as though they had a much smaller repository. We created scripts that allowed sparse checkouts before git sparse-checkout existed (as sparse-checkout itself was based on a SKIP_WORKTREE ability that had been in git for a long time, but had no easy-to-access UI), and then reviewed the git sparse-checkout patches that were contributed by Microsoft and contributed several patches of our own for the feature.
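
These days the feature has a simple UI; an illustrative invocation (the directory names here are made up) looks like:

git sparse-checkout init --cone          # restrict patterns to whole directories
git sparse-checkout set services/foo libs/common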

Continuing work in that area involves creating sparse indexes (also being spear-headed by Microsoft), which would enable not only the working tree to be sparse, but git’s index (or staging area) to be sparse as well. Since the index is a pervasive data structure in git, such a change will require significant changes throughout the git codebase. The existing merge machinery is much more intricately linked with the index than most code areas, and changing that is effectively a complete rewrite. So, the sparse-index work will require a new merge algorithm, which is something we are contributing.

Connected to the ideas of sparse checkouts and sparse indexes are partial clones — the ability to download just the history for the paths of interest. I have long been interested in this ability, and believe I was the first to post some patches with a very rudimentary implementation of sparse clones. I never finished that implementation, but others eventually created a similar capability. Partial clones do a partial history download, and then lazily download extra objects as you access them. Naturally, for partial clones to be useful, they need to avoid downloading extra objects too frequently, downloading them in non-optimized ways, or downloading objects that turn out to be unnecessary. Sadly, rename detection could require inordinately many objects to be downloaded, even though it was known that in some cases the problem could be solved without rename detection and without downloading as many objects. Thus, finding ways to improve rename detection for merges is a big driving factor for performance with partial clones as well.
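
As an illustrative example, a “blobless” partial clone fetches commits and trees up front but defers file contents until they are needed (the URL below is hypothetical):

git clone --filter=blob:none https://example.com/big-repo.git
# later commands (e.g. git checkout, git log -p) lazily fetch missing blobs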

Design Limits

The design behind the implementation was of the form “this other bit of code almost does what we need, we just need to fix it up a little”. (That other bit of code was unpack-trees, if you must know, but that won’t mean anything to you unless you’re familiar with git’s code.) The “fix it up a little” grew and grew until the fixups were unworkable. Unfortunately, merge-recursive is one of those dark, almost mystical areas of the codebase that almost everyone avoids; the two most prolific git authors have stated:

  • “[It is] some pretty hairy code. Every time I start to look at it I get confused and can’t remember what breakthrough I thought I was close to making before.” (Jeff King)
  • “I’ve written off that code as mostly unsalvageable a long time ago.” (Junio Hamano)

I came along about 3–4 years after it was written, but I submitted enough fixes to this code that I became the area expert at some point. Most of my initial work was actually around correctness in the face of various crazy edge and corner cases, and there were some edge cases that were just extraordinarily hard to fix given the way the code was structured.

Why a new merge algorithm?

I initially started by just refactoring the existing merge algorithm, as I had been doing for a long time. When I got to the point of suggesting a change to some of the bigger pieces of the design, Junio Hamano (the maintainer of git) suggested just rewriting it instead. He pointed out that git had a pluggable design for its merge backend, allowing someone to write a new algorithm that could co-exist with an existing one. He suggested this would actually be preferable for users stability-wise, as it’d allow us to simultaneously let some folks try out the new algorithm while others used the old one (and even let users switch back and forth with a simple config switch).

This also freed me to rethink some of the fundamental data structures and assumptions with an eye towards both increased performance and fixing those annoying corner cases that were just too difficult to fix in the old design.
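
Concretely, once a git version shipping the new strategy is installed, switching between the algorithms looks roughly like this (the ort strategy is previewed in git-2.31, as discussed below):

git merge -s ort topic                  # try the new strategy for one merge
git merge -s recursive topic            # or stick with the old default
git config --global pull.twohead ort    # make ort the default when merging a single branch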

Background: basics of how merge/rebase/cherry-pick works

My optimization work requires knowing a few things about git:

  • Merging two branches is a three-way operation
  • Git stores hashes of files and trees in addition to hashes for commits
  • Git does on-the-fly rename detection; renames are not stored anywhere. Merges need renames for two different reasons, and merges need two sets of rename detection per merge. Sadly, rename detection is sometimes slow.
  • Rebasing and cherry-picking also perform a merge for each commit being transplanted (using a funny merge base), but the resulting commit is written with a single parent instead of as a merge commit.

If you know and understand these, feel free to skip over this Background section. If any are surprising or new to you, read the relevant section(s) below.

Three-way content merge

Assume the following history exists and the current branch is “main”:

      A---B---C topic
     /
D---E---F---G main

Let’s further assume that in one of your files on the “main” branch you have a section of code that looks like this:

line 1
line 2
line 3

However, the same file in the “topic” branch doesn’t have the second line. What should the merge machinery do here? Did “main” add that middle line, or did “topic” delete it? The only way to know is to consult a common point of history; the most recent common commit (the commit labelled “E” in our picture above) is what git calls the “merge base”[2]. If the merge base had the line, then clearly “topic” deleted it and the merge result should delete it. If the merge base didn’t have the line, then clearly “main” added it and the merge result should add it. If the merge base doesn’t match either side of history in this region of context, then you’ve got a merge conflict.

Thus, merging isn’t a two-way operation, but a three-way operation where git consults the merge base (commit E), your current commit (commit G, where “main” points), and the commit at the tip of the other branch (commit C, where “topic” points)[3]. (As a side note, you can take advantage of this by running git config --global merge.conflictStyle diff3; that will make information about the original version available to you when resolving merge conflicts.)

The basic idea behind the merge algorithm is thus to just do a three-way content merge for each file in the repository, using the versions of each file from those three commits[4].
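
To make this concrete, git exposes its file-level three-way merge directly through git merge-file, so you can reproduce the example above yourself (a minimal sketch; the file names are arbitrary):

printf 'line 1\nline 2\nline 3\n' > base     # the version from the merge base (E)
printf 'line 1\nline 2\nline 3\n' > ours     # the version from main (G)
printf 'line 1\nline 3\n' > theirs           # topic (C) deleted line 2
git merge-file -p ours base theirs           # prints the merged result:
# line 1
# line 3

Since only one side changed relative to the merge base, the deletion wins cleanly; had both sides changed that region differently, you would get conflict markers instead.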

Merkle trees

Many users are familiar with the fact that git has a hash for every commit. However, git also has a hash for every file (“blob”) and subdirectory (“tree”) in the commit as well, with tree hashes derived from the hashes of blobs and subtrees within it, and the commit hash derived from both the top-level tree and the commit author/date/message/etc. This is a data structure known as a Merkle Tree. You can see these hashes for yourself by running

git ls-tree -rt $COMMIT

or by repeatedly running

git cat-file -p $HASH

starting with the hash of a commit and then repeating for any hashes you find in the output of previous commands. Git internally stores and looks up everything by hash. Tree objects record the mappings between the hashes in git’s internal datastore and the filename that should be written to the working directory. (The hash-based storage also provides a form of automatic deduplication.)
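
For instance, walking one level of this structure might look like the following (the hashes shown are illustrative, not real):

git cat-file -p HEAD
# tree 9c463b2e...          <- hash of the commit's top-level tree
# parent 2f8d1a07...
# author ...
git cat-file -p 9c463b2e
# 100644 blob c01055a1...  hex.java     <- file ("blob") hash
# 040000 tree 7d05c0de...  src          <- subdirectory ("tree") hash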

Rename detection and merges

As noted in the previous section, git does not store revisions by file. It also has no metadata of any sort to track renames of files done in various commits. Instead, it detects renames on the fly for any relevant command.

There are two reasons that merges need rename detection:

  • for three-way content merging: we need to get the relevant three files paired (e.g. merge-base:src/oldname.java, side1:src/oldname.java, side2:src/subdir/newname.java) so we can merge their content
  • for directory rename detection: if one side of history adds a file to a directory, and the other side of history renames that directory, we can detect by the renames of files within the directory that the directory was itself renamed and move the new file to within the new directory (or at least notify the user that they have a conflict and need to decide whether to leave the new file in the old directory or move it to the new one)

Because merges have two sides as well as a merge-base, we have to do two sets of rename detection — from the merge-base to side1, and from the merge-base to side2.

Unfortunately, rename detection is a quadratic algorithm. It involves first getting the set of all N filenames unique to the merge base, then all M filenames unique to the given side, then doing a similarity comparison between the content of all N x M files and marking the best matches that are sufficiently similar as renames. Even a small number of renames is sufficient to make rename detection the slowest part of a merge.
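
You can watch this machinery at work outside of a merge, too: git diff’s -M option runs the same rename detection and reports a similarity score per detected rename (the output below is illustrative):

git diff --name-status -M $MERGE_BASE $SIDE
# R100  hex.java          hexadecimal.java           <- exact rename
# R087  src/oldname.java  src/subdir/newname.java    <- 87% similar

Because of the quadratic cost, git caps how many files it will consider before giving up; see the diff.renameLimit and merge.renameLimit configuration options.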

Rebasing and cherry-picking

The idea of a three-way merge between a merge base, side 1, and side 2 can also be viewed two other ways. You can equivalently think of it as applying the differences between the merge base and side 1 to side 2, or you can think of it as applying the differences between the merge base and side 2 to side 1.

These other ways of thinking about merges give us a way to understand how rebases or cherry-picks work. From a high level view, a rebase or cherry-pick means we need to take a commit and apply the changes from that commit to some new base commit. Stated slightly differently, we need to get the changes between the commit we are picking and its parent and apply them to some new base commit. But, as noted above, that’s equivalent to saying we want to do a three-way merge between the parent commit, the commit we are picking, and the new base commit. Thus, a rebase or a cherry-pick is implemented as a three-way merge with a funny-looking merge base[5]…it’s just that the merge result only gets recorded with one parent (the new base commit) at the end.

For rebases and cherry-picks of multiple commits, we do a sequence of merges, where the new base commit for the Nth three-way merge is the result from rebasing/cherry-picking the previous commit in the sequence.
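
As a quick illustration (branch names hypothetical), you can cherry-pick a commit and confirm that the result records a single parent, even though it was produced by a three-way merge:

git checkout main
git cherry-pick topic                # internally: a 3-way merge using topic^ as the base
git log -1 --pretty='%h parents: %p'
# 1234abc parents: 5678def          <- one parent, unlike a true merge commit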

Optimization: more complete use of exact renames

As noted above, rename detection works by first getting the set of all N filenames unique to the merge base, then all M filenames unique to the given side, then doing a similarity comparison between the content of all N x M files and marking the best matches that are sufficiently similar as renames. Can we do anything to decrease N & M? Are there special cases we can detect in a pre-pass step?

What if a file is renamed, but its contents are not modified? Such a file would have the exact same hash. Since git stores hashes of each file within the repository, we don’t even have to do any extra work to compute these. We can just check both the sets of N and M files to see if any from the two sets share a hash. So, for example, if our old files and their (abbreviated) hashes were:

c01055a1  hex.java
5eac0a57  word.java
acc057ed  fun.java

and the new files and their (abbreviated) hashes were:

c01055a1  hexadecimal.java
0b57ac1e  phrase.java
acc057ed  games.java
decea5ed  copy.java

Then we could detect that hex.java and hexadecimal.java are renames without opening either file, because the two files have the same hash. The same can be said for fun.java and games.java. That means hex.java and hexadecimal.java don’t need to be included when doing exhaustive pairwise comparisons, nor do fun.java and games.java. So, in the end, we’d only need to check the following pairs of files for similarity: (word.java, phrase.java), (word.java, copy.java). That drops us from 12 pairs down to 2.
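
Under the assumption of two refs, $BASE and $SIDE, containing the old and new file sets, you can observe the matching blob hashes directly (reusing the abbreviated hashes from above):

git ls-tree $BASE -- hex.java
# 100644 blob c01055a1...  hex.java
git ls-tree $SIDE -- hexadecimal.java
# 100644 blob c01055a1...  hexadecimal.java    <- same blob hash: exact rename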

This simple idea was implemented years ago…with a catch. The same code that does rename detection also does copy detection, and in copy detection a single source file might be paired with multiple destination files. Thus, when a common hash is found, you can remove a destination (decreasing M by one) but have to leave the number of sources alone. In the example above, that means we can remove hexadecimal.java and games.java, but not hex.java or fun.java — based on the fact that copy.java could be a copy of either of those files. So we’d end up with 3×2 = 6 total pairs of files to compare. 6 pairs is fewer than the original 12, but still a lot more comparisons than just 2.

Of course, we only need to do the six comparisons if we want copy detection. If we only want rename detection (as is the case for the merge algorithm), then we would only need two. Unfortunately, the code that implemented rename and copy detection always did the larger number of comparisons, which for rename detection basically meant it was spending time looking for a better match than the perfect match already found for a source file.
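
The distinction shows up in git’s diff options as well: -M asks for renames only, while -C additionally looks for copies, which is exactly why exact-matched sources couldn’t be pruned (output illustrative):

git diff --name-status -M $BASE $SIDE    # rename detection only
# R100  hex.java  hexadecimal.java
git diff --name-status -C $BASE $SIDE    # rename + copy detection
# R100  hex.java  hexadecimal.java
# C085  fun.java  copy.java              <- a source may match several destinations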

Granted, this optimization only applies when there are files moved/renamed without any edits between the merge base and the tip of the branch that moved/renamed it. So, the results depend heavily on the particular repository and particular commits you are looking at. However, making the rename detection not look for better than perfect matches provided a speedup factor of ~3 on one particular test case of interest involving rebasing 35 patches onto a branch that renamed ~26K files:

               Before                 After
mega-renames:  5504.231 s ± 5.150 s   1799.937 s ± 0.493 s

The first real-world testcase with lots of renames that I ever tried this optimization on (a simple cherry-pick of one commit) saw a speedup factor of two. This was by far the simplest optimization, but it still provided a pretty impressive speedup. For more details about this optimization, see the upstream thread.

This optimization will appear in git-2.31, which will be released in mid-March.

Author

Elijah Newren

[1] git-2.31 will have a basic preview of the new “ort” merge strategy (where “ort” stands for “ostensibly recursive’s twin”, intended to convey that it will eventually be a drop-in replacement for the existing default “recursive” strategy). It’s also a tongue-in-cheek joke: selecting merge strategies is done with the -s flag, and I chuckle seeing git merge -s ort, especially since the command argument parser also accepts git merge -sort. The version of ort in git-2.31 will include most of the ort-related code, but the remaining patches that are ready to submit will take at least another 3 months to get through the review process.

[2] For this post, I’ll ignore what happens when there is no merge base or when there is more than one merge base.

[3] Note that git commits store snapshots, not diffs. This surprises some users because commands like git log -p make the commits look like a diff by simplifying the output to only show the difference between a given commit and its parent, and commands like git pull are smart enough to grab only the objects unique to your new commits, not ones that are already reachable from your existing commits.

[4] I’m obviously ignoring lots of details here, such as anything touching gitattributes (binary files, line ending normalization, special merge drivers), any non-content conflicts (e.g. modify/delete conflicts, rename/delete conflicts, directory/file/submodule/symlink type conflicts, add/add conflicts, rename/rename conflicts, file location conflicts due to directory rename detection, any of the esoteric cases that result in nested conflict markers, etc.), non-unique merge bases, and so forth.

[5] Technically, there are multiple rebase backends, and only the default rebase backend (“merge”) behaves this way. The former default rebase backend (“apply”) operated by computing a patch between the commit to pick and its parent, and then trying to apply the patch onto the new base commit. However, the two methods are logically equivalent in most cases; the difference is that the merge backend has more information available and thus can avoid some shortcomings inherent to apply. Also, the fact that apply was the default until git-2.26 meant that a few more features grew up around it that have not yet been ported over to the newer backend.