Version-controlled databases using Prolly trees

Modern databases and filesystems make pervasive use of B-trees, which are tree structures optimized for storing sorted lists of keys and values on block devices. Dolt is an Apache 2.0-licensed project that makes clever use of a variant of a B-tree to support efficient version control for an entire database. The data structure it uses could well be of interest to other projects.

The company behind Dolt, DoltHub, makes its money selling hosted versions of three Apache-licensed open-source projects: Dolt, Doltgres, and DoltLite. The projects are intended to be drop-in replacements for MySQL, PostgreSQL, and SQLite, respectively. They have separate frontends for the different SQL dialects, but the projects share a common storage backend that supports version-control operations.

In practice, that allows the projects to expose Git-like operations as custom SQL functions:

    -- Modify the database
    INSERT INTO users VALUES (...);
    -- Add the changes to a table to staging
    SELECT dolt_add('users');
    -- Commit them
    SELECT dolt_commit('-m', 'Add Joe to users');
    -- See the resulting diff
    SELECT * FROM dolt_diff_users('HEAD~1', 'HEAD');

At any given time, the database has a current HEAD commit, some number of changes in the staging area, and some number of changes in the working area, just like Git. That deliberate similarity is a major selling point of the projects — the company's home page says: "You already know how to use Dolt." That may be an exaggeration, but it is true that having basic knowledge of Git and SQL seems sufficient to start making use of Dolt.

What might be less clear is what one would use a version-controlled database for. Simply making commits does not add much power over the existing concept of a database transaction. The real utility of Dolt comes from the ability to restore old commits, fork historical states of the database, and merge in changes after review. For example, a programmer might want to run an analysis against an old state of the database to determine what would have been reported as of a particular date. They could make a new branch pointing at a historical commit, cherry-pick a schema change that the analysis needs, and then run it.

Another potential use is to give automated tools a "real" database that they can work against, without giving them the ability to completely break the production database; any changes can be investigated separately and merged in only once it is clear that they won't break anything. Dolt is fully compatible with normal MySQL clients; the only difference is in the connection string, so the tools have no reason to think that they're not talking to the real production database.

    -- Connect to the normal db
    mysql://db-server:3306/mydb
    -- Connect to a named branch
    mysql://db-server:3306/mydb/branch-name
    -- Connect to a read-only historical commit
    mysql://db-server:3306/mydb/ia1ibijq8hq1llr7u85uivsi5lh3310p

Of the three projects, Dolt is the most mature. Doltgres recently entered beta, and DoltLite is somewhere in-between — the components adapted from SQLite have had plenty of testing in their original context, but the project is still on version 0.8.1. The company's performance testing of Dolt suggests that its speed is comparable to MySQL's, although the normal caveats about the difficulty of precise benchmarking apply.

That competitive performance is possible because Dolt only changes the storage layer of the database. The query planner, index maintenance, and so on all reuse MySQL, PostgreSQL, or SQLite's existing implementations. All of these databases use B-trees; Dolt uses a slight variation on a B-tree that allows efficient diffing and shares storage between mostly identical versions.

B-trees

B-trees have a lot in common with their simpler cousins, binary search trees. In a binary search tree, each node has a key, a value, and two (possibly empty) sub-trees. Keys that sort before the node's own key are stored in the left sub-tree, and keys that sort after it are stored in the right sub-tree. As long as some scheme is employed to keep the tree balanced (such as the use of red-black trees in the kernel), this results in reasonably efficient comparison-based searching: each comparison cuts out roughly half of the remaining candidates. Keys are looked up in time proportional to the base-two logarithm of the number of nodes in the tree.

B-trees take this core concept and remove a bunch of pointer indirections. Each node in a B-tree has an array of keys, values, and pointers to sub-trees. Typically, the size of the array is chosen such that one node fits on one sector or page of the underlying storage. To find an item in a B-tree, the computer scans over the array and either finds the key directly, or finds the two keys that the target key falls between, and goes on to check the corresponding sub-tree. This structure greatly reduces the number of levels of the tree and the number of pointer indirections, making the whole structure faster to access in practice. In a B-tree with sixteen keys per node, for example, a search visits only about one quarter as many nodes as in a binary tree, since each node resolves four levels' worth (log₂ 16 = 4) of binary comparisons at once.
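
As a rough sketch of that lookup procedure (using a hypothetical BTreeNode class and a binary search within each node's key array, rather than any real database's code), a search might look like this:

    # A minimal sketch of B-tree search; BTreeNode is a hypothetical class,
    # not taken from Dolt or any other project.
    from bisect import bisect_left

    class BTreeNode:
        # keys are sorted; values[i] belongs to keys[i]; for a node with n
        # keys there are n + 1 children, and children[i] covers the keys that
        # sort between keys[i - 1] and keys[i]. A leaf has children = None.
        def __init__(self, keys, values, children=None):
            self.keys = keys
            self.values = values
            self.children = children

    def search(node, key):
        while node is not None:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]    # found the key in this node
            if node.children is None:
                return None              # hit a leaf without finding the key
            node = node.children[i]      # descend into the matching sub-tree
        return None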

[A diagram of a small B-tree, created by 'CyHawk' and shared under a CC-BY-SA license]

This makes B-trees an excellent fit for databases, which have to maintain a balance between efficient searches for particular items, in-order scans over the items in a range, and access to the underlying storage. There are a handful of variations of B-trees used by different databases that provide small speedups. For example, moving all of the values to leaf nodes and storing only keys in the interior nodes of the tree allows for more keys per interior node, making the tree wider and shorter. An extension to that idea adds a "next" pointer to each leaf node so that in-order traversals can proceed along the leaves without needing to load any interior nodes. Dolt's core data structure, described below, incorporates both of those optimizations.
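
As an illustration of that second variation, a range scan over leaf-chained nodes might look like the following sketch (LeafNode is a hypothetical class, and finding the starting leaf would use an ordinary tree descent):

    # A sketch of an in-order range scan that follows the leaves' "next"
    # pointers; not taken from any of the projects discussed here.
    class LeafNode:
        def __init__(self, keys, values, next_leaf=None):
            self.keys = keys            # sorted keys stored in this leaf
            self.values = values        # values[i] belongs to keys[i]
            self.next_leaf = next_leaf  # the next leaf in key order

    def range_scan(leaf, low, high):
        # 'leaf' is the leaf that contains (or would contain) 'low'.
        while leaf is not None:
            for key, value in zip(leaf.keys, leaf.values):
                if key > high:
                    return              # past the end of the range
                if key >= low:
                    yield key, value
            leaf = leaf.next_leaf       # no interior nodes touched from here on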

The problem with plain B-trees, and the reason that Dolt is interesting, is that comparing entire B-trees is expensive. The order in which items are inserted and deleted can affect how the internal nodes of the B-tree are structured, which means that the only way to compare two B-trees is by iterating through both trees in their entirety. That downside doesn't matter much to traditional databases, but it poses real problems for version-control systems.

Snapshot-based version-control systems, such as Git, can be thought of as storing many snapshots of some source tree. Each commit is logically independent, and could in theory just be stored as-is. In practice, most Git repositories do not completely change between commits, and it's more efficient to store the differences between subsequent snapshots, rather than the snapshots themselves.

Git, which works with entire directory trees, relies on being able to quickly check whether a particular sub-tree has changed in order to produce compact diffs that include only the actual changes. B-trees aren't compatible with that kind of algorithm, since the same logical list of keys and values could be represented by two different B-trees. Naively applying a diff algorithm to a B-tree could produce a lot of inefficient churn between representations of internal nodes.

The problem comes from the way that B-trees split and merge nodes as items are added and removed. When a B-tree node is full, it is split into two smaller nodes that are each half-full. When a B-tree node is not full enough (a threshold that varies between implementations, but which is typically just under half-full), it is merged with a sibling to produce a fuller node. When correctly implemented, this keeps the B-tree balanced and prevents the nodes from wasting storage space. But it also means that splitting and merging decisions depend on the order of insertion into the tree.
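
A deliberately simplified toy model (leaves only, no interior nodes or merging; the names and constants here are invented for illustration) shows that order dependence: inserting the same twelve keys in two different orders leaves them grouped into different nodes:

    # Toy model of B-tree leaf maintenance: insert a key into the leaf that
    # covers it, then split that leaf in half when it overflows.
    MAX_KEYS = 4

    def insert(leaves, key):
        # Find the leaf whose key range the new key falls into.
        for i, leaf in enumerate(leaves):
            if i == len(leaves) - 1 or key < leaves[i + 1][0]:
                break
        leaf.append(key)
        leaf.sort()
        if len(leaf) > MAX_KEYS:
            mid = len(leaf) // 2
            leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]

    ascending, shuffled = [[]], [[]]
    for key in range(12):
        insert(ascending, key)
    for key in [7, 2, 11, 0, 5, 9, 3, 10, 1, 8, 4, 6]:
        insert(shuffled, key)

    # The two trees hold the same keys but group them into different leaves:
    print(ascending)   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9, 10, 11]]
    print(shuffled)    # [[0, 1], [2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]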

Prolly trees

Probabilistic B-trees (Prolly trees, for short) are Dolt's answer to this problem. They are almost identical to B-trees, except that the logic for inserting and removing items is updated such that any two trees that contain the same items are completely identical regardless of how items were inserted. The same property holds for individual sub-trees, so comparing two Prolly trees is much simplified: if a sub-tree hasn't changed between two versions, it will have the same hash, even if an item was inserted and then removed.

To accomplish this, Prolly trees need to be able to determine what size each node "should" be, based only on its contents and not on any internal state of the tree. At first, it might seem tempting to say that every node should be as full as possible, to maximize use of storage space. But that would require sometimes shuffling nodes at the same level of the tree back and forth between parents, which would destroy some of the nice properties of the B-tree and cause additional churn between versions. To keep insertion and deletion efficient, the typical node should be neither so full nor so empty that inserting or deleting a single item is likely to trigger a split or a merge.

The solution to this problem is to use a hash function to make the decision in a way that depends only on the data and not on how full a particular node happens to be. A hash is effectively a deterministic pseudorandom number that can be compared to a threshold to make a randomized decision, which is where the "probabilistic" part of Prolly trees comes in. To insert a new value into a node, the implementation puts it in the correct position in the array (based on the key), and then recalculates where the boundary between this node and its sibling should fall based on the hashes of the items. Each hash is compared to a chosen cutoff value (the threshold) to decide whether the node should end at that item. Since the values of the hashes are independent, so are the decisions about where to put the boundaries between nodes. Most of the time, therefore, adding an item won't result in a new boundary: its hash value will be above the threshold, and the node will still end at the same place (a preexisting item with a low hash value). Occasionally, an inserted item will have a low hash value, causing the node to split in two and keeping the overall sizes of the nodes balanced.

That picture is slightly complicated by performance considerations; in order to keep the nodes optimally sized, the threshold used is actually dynamic, based on the number of items before a given key in a node. That still means that the actual decisions about where to place the boundaries between nodes are made independently of the pre-existing structure of the tree. The locations of the boundaries depend only on the hashes of the items stored in the tree, which are the same regardless of what order items were inserted in.

This insertion logic is a little complicated, especially if the implementation is optimized for performance, but it's still reasonably compact:

    # Unoptimized Python example code
    # This function is called by some internal node to update one of its child
    # nodes; the return value is used to update the internal node. Node,
    # threshold(), and max_node_length are assumed to be defined elsewhere;
    # hash() stands in for a deterministic content hash of the key.
    def insert_into_node(node, new_items):
        sibling = node.next_sibling
        new_items = sorted(new_items + node.items, key=lambda item: item.key)
        collected = []
        nodes = []          # replacement nodes for this child
        sibling_nodes = []  # replacements for the next sibling, if it changes too

        while new_items:
            item = new_items.pop(0)
            collected.append(item)
            # End the node here if the item's hash falls below the (dynamic)
            # threshold, or if the node has reached its maximum size.
            if hash(item.key) < threshold(len(collected)) \
               or len(collected) >= max_node_length:
                nodes.append(Node(items=collected))
                collected = []

        if collected:
            # The while loop didn't end on a natural boundary
            if node.next_sibling is not None:
                # Push the leftover items into the next sibling; the second
                # return value is the sibling's first replacement node, which
                # this node's last replacement must point to.
                sibling_nodes, sibling = insert_into_node(node.next_sibling, collected)
            else:
                nodes.append(Node(items=collected))

        # Chain the replacement nodes together and reattach the sibling link.
        for i in range(1, len(nodes)):
            nodes[i - 1].next_sibling = nodes[i]
        if nodes:
            nodes[-1].next_sibling = sibling

        nodes += sibling_nodes
        return nodes, nodes[0]

With a suitably chosen threshold, the average insertion won't create a new node, and so only one node needs to be updated at each level of the tree. Theoretically, adding a single item could cause all of the nodes after it to be split up differently, but the chances of that are infinitesimal, because the hash values of the items are statistically independent. The hash values of all of the existing items would have to just barely miss the threshold in order to cause that kind of realignment, so it's not a practical performance problem. Since the decision of how to place items into nodes only depends on the data, the same nodes will be created every time.

There are some downsides to using a Prolly tree over a B-tree. For example, nodes need to store the hashes of sub-trees, which requires either more storage space or an additional indirection through a content-addressed block store to turn a hash into an address. Content-addressed storage is a scheme that many version-control systems use to deduplicate data: a particular diff or other object is stored in a location based on the hash of its contents. This ensures that duplicate items will reuse the same storage space. A Prolly tree implemented as part of a version-control system can reuse that existing content-addressed object storage for blocks, and refer to blocks by hash, leading to more compact internal nodes. Other systems will need their own solutions.
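
Content-addressed storage is simple to sketch in isolation; the following toy example (with hypothetical names, using SHA-256 as the content hash; a real store would add persistence and garbage collection) shows how duplicate blocks collapse into a single entry:

    # A toy content-addressed block store: blocks are keyed by a hash of
    # their contents, so storing an identical block twice costs nothing.
    import hashlib

    class BlockStore:
        def __init__(self):
            self.blocks = {}

        def put(self, data):
            address = hashlib.sha256(data).hexdigest()
            self.blocks[address] = data   # idempotent for identical contents
            return address

        def get(self, address):
            return self.blocks[address]

    store = BlockStore()
    addr = store.put(b"serialized tree node")
    # Storing the same node again yields the same address and no new block.
    assert store.put(b"serialized tree node") == addr
    assert len(store.blocks) == 1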

Another problem is that the performance of the tree is highly sensitive to how the probabilities of splitting are tuned. Using a constant probability results in the mean node being half-full, but node sizes then follow a geometric distribution, in which small nodes dominate and a single-entry node is the most common size of all. To cope with that, Prolly trees use a dynamic threshold that increases as the node fills up, so that the size of nodes ends up following a roughly normal distribution. Of course, some applications really do have requirements for worst-case performance rather than average-case performance, which is incompatible with the randomized layout of a Prolly tree.
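
One way to picture such a dynamic threshold is the following sketch (the constants, the ramp shape, and the helper names are invented for illustration, not taken from Dolt): the cutoff starts at zero, so small nodes almost never end, and then rises as the node approaches its target size:

    # A sketch of a size-dependent split threshold. item_hash() is a stand-in
    # for a deterministic content hash; the numbers are illustrative only.
    import hashlib

    TARGET_SIZE = 16        # desired typical number of items per node
    HASH_SPACE = 2 ** 32    # item hashes are treated as 32-bit integers

    def item_hash(key):
        # Deterministic 32-bit hash of a key's serialized (bytes) form.
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

    def threshold(items_so_far):
        # Never end a node that is less than half the target size; after
        # that, ramp the split probability up linearly, so that node sizes
        # cluster around TARGET_SIZE and none grows past 1.5x the target.
        if items_so_far < TARGET_SIZE // 2:
            return 0
        fraction = (items_so_far - TARGET_SIZE // 2) / TARGET_SIZE
        return int(min(fraction, 1.0) * HASH_SPACE)

    # The boundary decision from the insertion code above then becomes:
    #     ends_node = item_hash(item.key) < threshold(len(collected))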

Despite these drawbacks, Prolly trees could be a useful data structure for applications that depend on being able to compare snapshots of trees, or on being able to store trees in a content-addressable way.

History and future

Prolly trees were invented by Aaron Boodman for the now-defunct Noms database. Afterward, they were picked up by the InterPlanetary Linked Data project and by DoltHub. The storage backend used in Dolt is a forked version of Noms, although the project has seen a fair number of changes and optimizations since then. Some other implementations of Prolly trees exist as libraries, but none apply them to databases.

Dolt provides an appealing evolution on the traditional design of a database. Personally, however, I am excited to see how the ideas could apply in other programs that use B-trees, such as filesystems. Prolly trees are certainly not appropriate for all uses of B-trees, but filesystems share many of the same tradeoffs between in-order traversal, random access, and ease of snapshotting that databases do. Perhaps Prolly trees could help with space-efficient snapshots of large, frequently modified files.


