Btrfs


In the previous article, we explored XFS—a filesystem built for extreme scale that divides the disk into independent Allocation Groups, each with its own B+ trees for free space, inodes, and extent tracking. Like every other filesystem we've covered in this series (ext4, NTFS, and FAT32), XFS shares one fundamental characteristic: it modifies data in place. When you update a block, the new data overwrites the old data at the same disk location.

Now let me introduce you to Btrfs (B-tree filesystem), a filesystem that does something radical: it never modifies data in place. Every single change—to a file, a directory, or even filesystem metadata—writes to a brand new location on disk, leaving the old version untouched until everything is safely committed.

This copy-on-write (CoW) design sounds wasteful at first, but it’s what makes instant snapshots possible—and it combines with several other independent design decisions to give Btrfs its full feature set: a logical address layer that enables built-in RAID, per-block checksums for self-healing data integrity, transparent compression, and atomic transactions—all integrated directly into the filesystem, without external tools.

But before we look at how data is laid out on disk, we need to understand the shape of a Btrfs filesystem—because it looks quite different from everything we’ve covered so far.

A Different Kind of Filesystem

Every filesystem in this series so far lives on a single partition. You format /dev/sda1, and ext4 (or XFS, or NTFS) takes over that block device exclusively. Btrfs works differently.

Btrfs manages a pool of devices. You can create a Btrfs filesystem spanning multiple physical disks, and new devices can be added later—while the filesystem is mounted. The filesystem doesn’t see individual disks; it sees a unified storage pool. This is the first thing that makes Btrfs’s shape unusual.

The moment you have multiple devices, physical addresses stop being enough. A block at offset 500 MB could live on /dev/sda, /dev/sdb, or somewhere across both—raw physical offsets don’t tell you which device. So Btrfs introduces a logical address space: a single unified range of addresses that spans the entire pool, with a separate mapping layer—the chunk tree—translating those logical addresses to physical locations on whichever device actually holds them. And since that translation layer has to exist anyway, Btrfs makes it do double duty: the same mapping that routes a logical address to a physical one can route it to two physical locations on two different drives (mirroring), or split it across multiple drives (striping). RAID, in other words, falls out almost for free once you have the indirection. Different parts of the filesystem can even use different profiles—metadata mirrored for safety, data striped for performance.

Everything above the logical address layer—every tree, every file, every directory—works entirely in terms of logical addresses. The translation to physical devices is handled transparently by the chunk tree; nothing else in the filesystem needs to know which disk a block lives on, or even how many disks there are.

On top of that uniform address space, Btrfs supports subvolumes—independent filesystem trees that share the same storage pool. Each subvolume looks like its own filesystem from the outside: it has its own root directory, its own file and directory tree. But they all live in the same physical space, and blocks can be shared between them. A snapshot is just a subvolume created as a copy of another—initially sharing all blocks, diverging only as changes are made.

So when you look at a Btrfs filesystem, you’re not looking at one flat namespace on one block device. You’re looking at a pool of devices, a logical address layer on top, chunks carved out of that space by type, and multiple independent subvolume trees all coexisting within it. Each of those layers has its own on-disk structure—and almost all of them are built on the same foundation. Let’s understand that foundation first.

The B-Tree Building Block

You’ll hear “B-tree” constantly in the rest of this article. In Btrfs, every piece of metadata—free space, inodes, file data, checksums, chunk mappings—lives in one of Btrfs’s B-trees. They all share the same physical node format.

Each B-tree node occupies one or more contiguous pages, typically making a 16 KB block. Every node—whether it holds file data, checksums, or chunk mappings—starts with the same header carrying a checksum of the node itself, the filesystem UUID, the node’s own logical address, the transaction ID when it was written (generation), which tree it belongs to (owner), how many entries it contains (nritems), and its depth in the tree (level, 0 for leaves).

The generation field is particularly important: it tells you which transaction wrote this node. Its primary use is verifying pointer consistency during tree traversal—each key pointer in an internal node stores the expected generation of its child, and when Btrfs follows that pointer it confirms the child’s header matches. A mismatch means the pointer is stale or points at a CoW’d copy that has since been superseded. The csum field means every node is verified on every read—corruption is caught immediately, not silently propagated.

Beyond the shared header, nodes come in two distinct shapes depending on their depth in the tree.

Leaf nodes (level = 0) are where actual data lives. They use the classic slotted page layout: item descriptors grow inward from one end of the node, while the actual variable-length item data is packed inward from the other end. The two regions grow toward each other, and the node is full when they meet. Each item descriptor carries a key (for sorting), an offset (pointing to its data at the far end), and a size. This lets Btrfs store items of wildly different sizes—a short symlink target, a large directory entry, a tiny inode—in the same 16 KB block without wasting space. If you've read the PostgreSQL indexes article, this layout is similar to how PostgreSQL organises data in its pages.

Slotted page layout

Because item descriptors are fixed-size and sorted by key, looking up an item in a leaf is a straightforward binary search over the descriptor array—no need to scan the variable-length data at the far end. Once the right descriptor is found, its offset and size fields tell you exactly where in the node the item’s data lives, and you read those bytes directly. A single jump from descriptor to data, no indirection needed.
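To make that concrete, here is a tiny Python sketch of a slotted-page lookup. It is a toy model, not the kernel's code: the Item descriptor, the leaf_lookup helper, and the integer type code standing in for INODE_ITEM are all invented for illustration.

from bisect import bisect_left
from collections import namedtuple

# one fixed-size item descriptor: sort key, plus where its data sits in the node
Item = namedtuple("Item", "key offset size")

def leaf_lookup(node_bytes, items, key):
    # binary search the sorted descriptor array, never the variable-length data
    keys = [it.key for it in items]
    i = bisect_left(keys, key)
    if i == len(items) or items[i].key != key:
        return None                                        # key not in this leaf
    it = items[i]
    return node_bytes[it.offset : it.offset + it.size]     # one jump to the data

# toy 16 KB node: descriptors at the front, item data packed at the far end
node = bytearray(16384)
node[16380:16384] = b"stat"
items = [Item(key=(256, 1, 0), offset=16380, size=4)]      # 1 standing in for INODE_ITEM
print(leaf_lookup(bytes(node), items, (256, 1, 0)))        # b'stat'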

Internal nodes (level > 0) contain key-pointer pairs. Each pair records the minimum key of the subtree below, the logical address of the child node (blockptr), and the expected generation of that child—so Btrfs can detect a stale or CoW’d pointer on the way down. Finding the right child is again a binary search: you scan the key-pointer array to find the last entry whose key is less than or equal to your target, then follow its blockptr to the next level. You repeat this at each level until you reach a leaf.

Here’s how an internal node and two leaf nodes look on disk, with the internal node’s blockptr fields pointing down to its children:

Internal node layout

This node structure is also what makes copy-on-write possible. When Btrfs needs to modify a node, it never touches the original. Instead, it writes the modified version to a fresh location on disk, then updates the parent node to point to the new copy. But now the parent has changed—so the parent needs to be written to a new location too, updating its own parent, and so on all the way up to the root. The old nodes are left untouched on disk. Two things make this efficient: first, if a node was already written earlier in the same transaction, it hasn’t been committed yet and can simply be modified in place—no copy needed. Second, when an old node is no longer needed, Btrfs checks whether a snapshot still refers to it before freeing it—if so, the block is kept and shared, and only the new writes diverge.
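Here is a rough Python sketch of that update path, under the same caveat: Node and cow_insert are invented names for a toy in-memory model, not the kernel's structures. The point is only the shape of the operation: copy each node on the path down, repoint each fresh parent at its fresh child, and publish a new root at the end.

import copy

class Node:
    def __init__(self, level, keys, children=None, items=None):
        self.level = level              # 0 = leaf
        self.keys = keys                # minimum key per child (internal) or item key (leaf)
        self.children = children or []  # internal nodes: one child per key
        self.items = items or []        # leaves: one payload per key

def cow_insert(root, key, value):
    """Return a new root reflecting the insert; the old tree stays untouched."""
    new_root = copy.copy(root)          # shallow copy = "write this node to a new location"
    node = new_root
    while node.level > 0:
        i = 0                           # pick the last key <= target, as in the real descent
        for j, k in enumerate(node.keys):
            if k <= key:
                i = j
        child = copy.copy(node.children[i])   # CoW the child on the way down...
        node.children = list(node.children)
        node.children[i] = child              # ...and repoint the already-copied parent at it
        node = child
    # at the copied leaf: apply the change (a real leaf stays sorted and may split)
    node.keys = node.keys + [key]
    node.items = node.items + [value]
    return new_root                     # committing means publishing this new root pointer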

Each item is located by its key—and every key in every tree across the entire filesystem shares the same structure.

The Universal Key

What ties all these trees together is a single key used everywhere, made of three fields sorted in order:

key = ( objectid,  type,        offset  )

      ( 500,       INODE_ITEM,  0       )  → inode metadata
      ( 500,       EXTENT_DATA, 65536   )  → file data at offset 64 KB
      ( 12,        DIR_ITEM,    hash    )  → directory entry
      ( CSUM_OBJ,  EXTENT_CSUM, 0x1000000 )  → checksum for logical block

The three fields each play a distinct role:

  • objectid identifies the entity the item belongs to. For file and directory items it’s the inode number. For extent items it’s the logical start address of the extent. For checksum items it’s a fixed constant (EXTENT_CSUM) that groups all checksums together. It’s whatever makes sense as the primary identity for that category of item.

  • type is a small integer that says what kind of item this is—INODE_ITEM, EXTENT_DATA, DIR_ITEM, EXTENT_CSUM, and so on. It’s what lets many different item types share the same tree, sorted cleanly by entity first and then by kind.

  • offset is the most context-dependent of the three. For EXTENT_DATA items it really is a byte offset into the file—so all extents for a file are sorted by position. For DIR_ITEM items it’s a hash of the filename, not an offset at all. For INODE_ITEM items it’s always 0, since there’s only one inode item per inode. For EXTENT_CSUM items it’s the logical address of the first block covered. The meaning shifts depending on type, but the sort order stays consistent.

Items are sorted first by objectid, then by type, then by offset. This uniform structure means a single tree can hold many different item types interleaved together—inodes, extents, directory entries, and more all coexist in the same B-tree, sorted cleanly by this key. The same tree traversal code works for all of them. A file’s data extents sit right next to its inode item in the tree, sorted by file position. Directory entries cluster together under their parent inode. The layout is both elegant and cache-friendly—and we’ll see a concrete example of this when we look at FS trees.
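Since the key is just three integers compared left to right, the sorting behaviour is easy to demonstrate. A small Python illustration follows; the numeric type codes are placeholders chosen only for their relative order, and the key helper is invented:

# placeholder type codes; only their relative order matters here
INODE_ITEM, DIR_ITEM, EXTENT_DATA = 1, 84, 108

def key(objectid, item_type, offset):
    return (objectid, item_type, offset)   # Python tuples compare field by field, like Btrfs keys

items = [
    key(256, EXTENT_DATA, 0x100000),   # second extent of inode 256
    key(256, INODE_ITEM,  0),          # inode 256's stat data
    key(2,   DIR_ITEM,    0x3f2a),     # a directory entry under inode 2
    key(256, EXTENT_DATA, 0),          # first extent of inode 256
]
print(sorted(items))
# [(2, 84, 16170), (256, 1, 0), (256, 108, 0), (256, 108, 1048576)]
# inode 2's entry comes first, then everything for inode 256 clusters together,
# with its extents ordered by file offset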

With that foundation in place, we’re ready to walk the filesystem from the very first thing the kernel reads on mount.

The Superblock: Where Everything Starts

When the kernel mounts a Btrfs filesystem, the first thing it reads is the superblock. Unlike ext4 (which places its superblock right after a 1024-byte boot gap) or XFS (which puts it at the very start of AG 0), Btrfs writes the superblock at three fixed offsets: 64 KB, 64 MB, and 256 GB into the device. The first copy is always at 64 KB; the other two are redundant backups spread across the disk. If the first sectors of the device go bad, the filesystem can still be mounted from a backup superblock—the same resilience strategy ext4 uses with its backup superblocks, just placed differently.

The superblock holds the global metadata you’d expect: the total filesystem size, used bytes, the checksum algorithm in use (csum_type—CRC32, XXHASH, SHA256, or BLAKE2, chosen at mkfs time), and feature flags that tell the kernel which optional capabilities are active. It also carries its own checksum and a magic number so the kernel can quickly confirm it’s reading a valid Btrfs superblock.

One field deserves special attention: the filesystem UUID (fsid). Remember that a Btrfs filesystem can span multiple disks or partitions—what you’re mounting is not a single block device but a pool. The fsid is the 16-byte identifier for that pool as a whole, shared by every device in it. Every device that belongs to the same filesystem carries the same fsid in its superblock. Alongside that, each device also records its own per-device UUID and a devid number, so Btrfs knows not just that a device belongs to the pool, but exactly which device it is.

This is how Btrfs ties a multi-device pool together at mount time. When the system boots, the kernel (or udev, via btrfs device scan) reads the superblock of every block device it finds. For each one with a valid Btrfs magic number, it extracts the fsid and registers the device in an in-memory table. By the time you issue a mount command—even if you only name one of the devices—Btrfs looks up its fsid in that table, finds all other devices already scanned with the same fsid, and opens them all together. The num_devices field tells Btrfs exactly how many devices to expect, so it can warn you if some are missing before proceeding.
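A simplified model of that assembly step, with invented helper names and plain dictionaries standing in for the scanned superblocks (the real plumbing goes through udev and the kernel's device list):

from collections import defaultdict

def assemble_pools(superblocks):
    """superblocks: one dict per scanned device with fields read from its
    on-disk superblock (fsid, devid, num_devices)."""
    pools = defaultdict(dict)
    for sb in superblocks:                      # the "btrfs device scan" phase
        pools[sb["fsid"]][sb["devid"]] = sb     # register the device under its pool
    return pools

def can_mount(pools, fsid):
    devices = pools[fsid]
    expected = next(iter(devices.values()))["num_devices"]
    return len(devices) == expected             # otherwise warn that devices are missing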

The fields that matter most for actually navigating the filesystem are three tree root pointers: chunk_root (the chunk tree—the logical-to-physical translation layer we described earlier), root (the root tree, directory of all other trees), and log_root (the log tree, for fast fsync() recovery).

Bootstrap

All three pointers are logical addresses—but you can’t follow a logical address until the chunk map is loaded, and you can’t load the chunk map until you’ve read the chunk tree. Classic chicken-and-egg. Btrfs solves it by embedding a small set of chunk mappings directly inside the superblock. These cover just the SYSTEM block group—the region where the chunk tree itself lives. The kernel loads those embedded mappings first, which is enough to translate chunk_root and read the full chunk tree. Once the complete chunk map is in memory, everything else follows normally.

With the bootstrap complete and the chunk map live, let’s look at how the chunk tree actually works.

The Chunk Tree: Logical to Physical

The logical address space is divided into large regions called chunks—typically 1 GB each. A chunk is typed: it holds either DATA (file content), METADATA (B-tree nodes), or SYSTEM (the chunk mapping itself). When the filesystem needs to allocate space for a new file or a new metadata node, it picks a chunk of the appropriate type and allocates within it.

The chunk tree is the concrete implementation of this mapping. Unlike other trees that hold many different item types, the chunk tree contains only one: CHUNK_ITEM. Every key follows the same pattern—objectid is always the same fixed constant (representing the pool as a whole), type is always CHUNK_ITEM, and offset is the logical start address of the chunk. Since objectid and type never vary, the tree is effectively sorted by logical address, making lookups trivial: to translate a logical address, traverse the tree to find the largest key whose offset is less than or equal to it.

The payload of each item describes the full mapping: the chunk’s length, its type (DATA, METADATA, or SYSTEM), its RAID profile, and for each stripe, which device holds it and at what physical offset:

key=(CHUNK_ITEM, 0x0000_0000)  length=1 GiB  DATA|RAID1
  stripe[0]: devid=1  offset=0x1000_0000   ← /dev/sda
  stripe[1]: devid=2  offset=0x1000_0000   ← /dev/sdb

key=(CHUNK_ITEM, 0x4000_0000)  length=256 MiB  METADATA|DUP
  stripe[0]: devid=1  offset=0x5000_0000   ← /dev/sda
  stripe[1]: devid=1  offset=0x6000_0000   ← /dev/sda (same device, duplicated)

The key tells you where the chunk starts in logical space; the payload tells you how big it is, what it contains, and exactly where to find it on each physical device. Every other tree in Btrfs uses logical addresses, and every read or write goes through this table. The rest of the filesystem code never has to think about physical placement.
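As a toy model of that translation (Python, a flat sorted list standing in for the chunk tree, with invented Chunk and Stripe records mirroring the example above): find the chunk with the largest logical start at or below the address, then apply the same offset-within-chunk to each stripe.

from bisect import bisect_right
from collections import namedtuple

Stripe = namedtuple("Stripe", "devid physical")
Chunk = namedtuple("Chunk", "logical length stripes")

# sorted by logical start, like CHUNK_ITEM keys in the chunk tree
chunks = [
    Chunk(0x0000_0000, 1 << 30,   [Stripe(1, 0x1000_0000), Stripe(2, 0x1000_0000)]),  # DATA|RAID1
    Chunk(0x4000_0000, 256 << 20, [Stripe(1, 0x5000_0000), Stripe(1, 0x6000_0000)]),  # METADATA|DUP
]

def logical_to_physical(logical):
    i = bisect_right([c.logical for c in chunks], logical) - 1   # largest start <= logical
    c = chunks[i]
    assert c.logical <= logical < c.logical + c.length, "hole in the logical address space"
    delta = logical - c.logical
    # every stripe here holds a full copy (RAID1/DUP); striped profiles would split delta instead
    return [(s.devid, s.physical + delta) for s in c.stripes]

print(logical_to_physical(0x0040_0000))   # both mirrors, 4 MiB into each stripe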

Now that logical addresses are resolved, the kernel needs a way to find the rest of the filesystem—and that’s exactly what the root tree is for.

The Root Tree: A Directory of Trees

With the chunk tree loaded and logical addresses fully resolvable, the kernel can now follow the superblock’s root pointer into the root tree. This tree’s sole purpose is to act as a directory of all other trees—each entry maps a tree ID to its root node’s logical address.

The main item type is ROOT_ITEM, keyed by tree ID. The payload carries the logical address of that tree’s root node, along with bookkeeping like the tree’s generation and, for subvolumes, the inode of its root directory. System trees have fixed IDs (1–7); subvolumes and snapshots get IDs starting from 256. For a snapshot, the key also encodes the generation at which it was taken, so multiple snapshots of the same subvolume sort neatly together.

The root tree also tracks the parent–child relationships between subvolumes using ROOT_REF and ROOT_BACKREF items—a forward and a reverse link always written in pairs, recording which directory inside the parent contains the subvolume mountpoint and under what name.

A real root tree leaf looks roughly like this:

(1,   ROOT_ITEM,    0) → logical_addr=...   ← root tree itself
(2,   ROOT_ITEM,    0) → logical_addr=...   ← extent tree
(3,   ROOT_ITEM,    0) → logical_addr=...   ← chunk tree
(4,   ROOT_ITEM,    0) → logical_addr=...   ← device tree
(5,   ROOT_ITEM,    0) → logical_addr=...   ← default FS tree
(7,   ROOT_ITEM,    0) → logical_addr=...   ← checksum tree
(256, ROOT_ITEM,    0) → logical_addr=...   ← subvolume @256
(256, ROOT_REF,   257) → dirid=256, name="snap1"   ← 256 contains snap 257
(257, ROOT_ITEM,   42) → logical_addr=..., last_snapshot=42  ← snapshot at gen 42
(257, ROOT_BACKREF,256)→ dirid=256, name="snap1"   ← 257's parent is 256

You can see all the system trees in the first entries (IDs 1–7), followed by a user subvolume at ID 256, and then a snapshot of it at ID 257—with a ROOT_REF in the parent recording where the snapshot is mounted, and a ROOT_BACKREF in the snapshot pointing back to its parent.

Here’s a visual overview of how the superblock, root tree, and all the other trees relate to each other:

Root tree structure

When Btrfs needs to access, say, the checksum tree, it searches the root tree for the key (7, ROOT_ITEM, 0), reads the logical address from the payload, and follows that pointer in.

The root tree is also where subvolumes are registered. As we can see in the example above, each subvolume has its own ROOT_ITEM entry, and snapshots appear right alongside their source with ROOT_REF and ROOT_BACKREF pairs recording the parent–child relationship. When a snapshot is created, a new entry is added pointing to the same root node as the source; from that moment on, the two FS trees are independent but share all their blocks.

Each of those FS tree entries points to the real heart of the filesystem—where files and directories actually live.

FS Trees: Where Files and Directories Live

Each subvolume has its own FS tree, and this is where files and directories actually live. All the items for every file and directory in the subvolume are stored together in one big B-tree, sorted by that universal key. This is the tree where the mixed-item-types design really shines: a file’s inode, its hard link names, its extended attributes, and all its data extents all coexist in the same tree, sorted adjacent to each other by inode number.

The first item type you’ll find for any file is the inode.

Inodes

Every file and directory in Btrfs has an inode item stored in the FS tree with key (inode_number, INODE_ITEM, 0). The payload is the full stat data: size, timestamps, permissions, link count, and flags—equivalent to what ext4 stores in its inode table, but here it’s just another B-tree item. Alongside it, INODE_REF items record hard link names, keyed by (inode_number, INODE_REF, parent_inode)—one per hard link, carrying the name and its position in the parent directory’s index. Extended attributes follow under (inode_number, XATTR_ITEM, hash_of_name).

Now let’s look at how a file’s content is referenced.

Files

File data is mapped through EXTENT_DATA items, keyed by (inode_number, EXTENT_DATA, byte_offset_in_file). The offset in the key is the file position where this extent starts, so all extents for a file sort in order. Each payload points to a logical address in a data chunk. Small files skip the indirection entirely—their data is stored inline directly inside the item payload. We’ll look at how the extent tree tracks these allocations from the other side shortly.

So for a file with inode 256, its entries in the FS tree look like:

(256, INODE_ITEM,  0)        → size, timestamps, permissions
(256, INODE_REF,   2)        → name "myfile" in dir inode 2
(256, EXTENT_DATA, 0)        → file data bytes 0..1 MiB   → logical 0x1000000
(256, EXTENT_DATA, 0x100000) → file data bytes 1..3 MiB   → logical 0x2000000

The FS tree also needs to record where each file lives within the directory hierarchy.

Directories

Directory entries use two complementary item types under the directory’s inode number. DIR_ITEM, keyed by (dir_inode, DIR_ITEM, hash_of_filename), is used for name lookups: compute the hash, do a B-tree search, read the payload to get the target inode number and file type. DIR_INDEX, keyed by (dir_inode, DIR_INDEX, sequential_index), covers the other access pattern: ordered iteration. Every new entry gets the next sequential index, so readdir() can walk entries in stable creation order.

A directory with inode 2 looks like this in the tree:

(2, INODE_ITEM,  0)      → dir stat data
(2, INODE_REF,   1)      → name "mydir" in parent inode 1
(2, DIR_ITEM,    0x3f2a) → entry "myfile" → inode 256
(2, DIR_INDEX,   1)      → first entry (readdir order)

Notice how everything related to a file or directory—its stat data, its name, its extended attributes, its data extents—clusters together in the tree, because all those items share the same objectid. Reading a file top to bottom is a single sequential scan through adjacent leaf entries.

The FS tree tells us who owns which extents—but something also needs to track the reverse: which extents are allocated, how many trees share them, and where free space is.

The Extent Tree: Tracking Space

While the FS tree tracks what files own which extents, the extent tree tracks it from the other direction: for every allocated region, it records who’s using it and how many references there are.

The extent tree holds two main kinds of items.

Block group items describe the large fixed-size regions that Btrfs divides its logical address space into—one block group per chunk. Each one records how much space within it is used, its type (DATA, METADATA, or SYSTEM), and its RAID profile. When Btrfs needs to allocate space, it first picks a suitable block group before looking for a free spot inside it. (Btrfs has an opt-in block-group-tree feature that moves these into their own dedicated tree, but it is disabled by default.)

Extent items are the fine-grained records—one per allocated region. Data extents use key (logical_addr, EXTENT_ITEM, size). Metadata tree nodes use a separate variant, METADATA_ITEM, where the offset stores the node’s level rather than its size—this is a space-saving optimization since all metadata nodes are the same size. Both variants carry a reference count, a generation, and one or more backreferences pointing back to whoever is using this extent.

For the common case of one or two references, the backrefs are packed inline directly inside the extent item’s payload. When a shared extent accumulates too many references to fit inline, Btrfs spills the overflow into additional items immediately following in the tree—EXTENT_DATA_REF for data extents referenced by file, SHARED_DATA_REF when the parent leaf is known, TREE_BLOCK_REF for metadata referenced by tree ID, and SHARED_BLOCK_REF when the parent node is known. Because all these overflow items share the same objectid as their extent item, they cluster together naturally in the leaf.

A real extent tree leaf shows how this looks in practice:

(0x100000, EXTENT_ITEM,   0x4000) → refs=1  DATA
                                     backref: root=256, ino=300, offset=0
(0x104000, EXTENT_ITEM,   0x4000) → refs=2  DATA
                                     backref: root=256, ino=301, offset=0
                                     backref: root=257, ino=301, offset=0  ← shared by two snapshots
(0x200000, METADATA_ITEM, 1)      → refs=1  TREE_BLOCK
                                     backref: root=5
(0x210000, METADATA_ITEM, 0)      → refs=2  TREE_BLOCK
                                     backref: root=256
                                     backref: root=257  ← shared leaf between two snapshots
(0x400000, BLOCK_GROUP_ITEM, 1GiB)→ used=512MiB  DATA|RAID1

The backreferences are what make CoW snapshots efficient. When a snapshot is created and shares blocks with its source, those shared extents get a reference count of 2. When you modify a file in the snapshot, CoW allocates new blocks for the changed data—the old blocks’ reference count drops to 1, and the new blocks start at 1. When a snapshot is deleted, Btrfs decrements reference counts on all its extents and frees those that reach zero. Because updating a reference count is itself a B-tree modification (which triggers CoW), Btrfs uses delayed references: rather than updating the extent tree on every CoW, it batches reference count changes in memory and applies them at transaction commit time, significantly reducing write amplification.
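A minimal sketch of that bookkeeping, using a plain Python dictionary in place of the extent tree (and ignoring the delayed-reference batching): a snapshot bumps the count on a shared extent, a CoW write allocates a new extent at count 1 and drops one reference on the old one, and space is freed only when the count reaches zero.

refs = {}                                  # logical address -> reference count

def allocate(addr):
    refs[addr] = 1                         # new extent, a single owner

def share(addr):
    refs[addr] += 1                        # a snapshot now also points at this extent

def drop(addr):
    refs[addr] -= 1                        # one owner stopped using the extent
    if refs[addr] == 0:
        del refs[addr]                     # nobody left: the space is free again

allocate(0x104000)    # file extent in the source subvolume
share(0x104000)       # snapshot created: refs=2, nothing copied
allocate(0x300000)    # file modified in the source: CoW writes the new data elsewhere
drop(0x104000)        # the source drops the old extent; the snapshot still holds it
print(refs)           # the old extent survives at refs=1, the new one at refs=1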

At this point we have extents pointing to logical addresses and sizes, and we can translate those through the chunk tree to find exactly which physical device and offset holds each block. Space accounting keeps all of that coherent—but it says nothing about whether the data stored there is actually correct. That’s the job of the checksum tree.

The Checksum Tree: Integrity on Every Read

This is where Btrfs genuinely earns its reputation for data integrity. Every data block written to disk has its checksum stored here. Metadata nodes carry their checksum in their own node header, so the checksum tree is exclusively for file data.

The tree contains only one item type, EXTENT_CSUM, and all items share the same fixed objectid constant—so the tree is effectively sorted by logical address. The key’s offset is the logical address of the first sector covered, and the payload is a packed array of checksums, one per 4 KB sector, covering a contiguous run of logical space. How big each checksum is depends on the algorithm chosen at mkfs time and stored in the superblock:

CRC32c  → 4 bytes/sector   (~4000 sectors per leaf item, ~16 MB of data)
XXHASH  → 8 bytes/sector
SHA256  → 32 bytes/sector
BLAKE2b → 32 bytes/sector

To look up the checksum for a specific sector, Btrfs searches for the item whose offset is just below the target logical address, then skips forward by the right number of entries in the array. A real leaf looks like this:

(CSUM_OBJ, EXTENT_CSUM, 0x0000000) → [crc32(sector@0), crc32(sector@4K), ...] covers 0..8 MiB
(CSUM_OBJ, EXTENT_CSUM, 0x0800000) → [crc32(sector@8M), ...]                  covers 8..16 MiB

Gaps in the logical address space—holes, unwritten regions—simply have no item, and Btrfs skips verification for those ranges.
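The lookup and verification are simple arithmetic. A hedged Python sketch follows, with a dictionary standing in for the checksum tree; zlib.crc32 is used as a stand-in for the kernel's crc32c, and store_csums/verify are invented helper names.

import zlib                    # plain CRC-32 standing in for Btrfs's crc32c
from bisect import bisect_right

SECTOR = 4096
csum_items = {}                # EXTENT_CSUM key offset -> packed list of per-sector checksums

def store_csums(logical, data):
    csum_items[logical] = [zlib.crc32(data[i:i + SECTOR])
                           for i in range(0, len(data), SECTOR)]

def verify(logical, sector_data):
    starts = sorted(csum_items)
    start = starts[bisect_right(starts, logical) - 1]           # item at or just below the address
    expected = csum_items[start][(logical - start) // SECTOR]   # skip forward in the array
    return zlib.crc32(sector_data) == expected                  # mismatch = corruption, try a mirror

store_csums(0x1000000, b"x" * (1 << 20))        # checksums for a 1 MiB extent at logical 16 MiB
print(verify(0x1000000 + 8192, b"x" * SECTOR))  # True: the third sector matches its stored checksum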

On every read, Btrfs computes the checksum of the data and compares it to the stored value. A mismatch means corruption. On a RAID1 setup, Btrfs immediately reads the mirror, verifies it, repairs the corrupted copy, and returns the correct data—all transparently. You can also run btrfs scrub to proactively verify and repair every block in the background without taking anything offline.

Checksum verification

The root tree also registers a few other trees worth knowing about, even if we won’t go into their full detail here.

Other Trees

The device tree is the counterpart to the chunk tree: where the chunk tree maps logical addresses to physical stripes, the device tree records which physical ranges on each device are occupied. It’s keyed by device ID and physical offset, with each item pointing back to the logical chunk that owns that space. When Btrfs needs to find free physical space on a device—during balance, device remove, or resize—it walks the device tree to see what’s already claimed and what isn’t.

The free space tree is an on-disk index of which logical ranges within each block group are available for allocation. It’s loaded at mount time and used as the backing store when the in-memory per-block-group cache is cold or needs to be rebuilt—so Btrfs can find free space without scanning all extent items. Active allocations go through the in-memory structures, with the free space tree providing the persistent foundation underneath.

The log tree handles a special case: fsync() on an individual file. Rather than waiting for a full transaction commit, Btrfs writes just that file’s pending changes to a per-subvolume log tree that can be replayed quickly on the next mount. This makes fsync() fast without forcing an entire transaction commit for every call. The log_root pointer in the superblock points to a log_root_tree—a single B-tree that acts as a directory, with one entry per subvolume that has an active log. On mount after a crash, Btrfs finds the log_root_tree, iterates its entries, and replays each subvolume’s log in turn.

With all these trees in place, we can now look at the feature that ties all of this together—and the reason many people choose Btrfs in the first place.

Snapshots: The Killer Feature

Now that we understand the tree structure, we can see exactly why snapshot creation is so fast.

Creating a snapshot means adding a new entry to the root tree pointing to the same root node as the source FS tree, and recording the snapshot’s generation. That’s it—all data blocks are shared from the start.

Before snapshot:
  Root Tree → FS Tree 5 → [inode/dir/extent nodes...]

After snapshot:
  Root Tree → FS Tree 5   → [same inode/dir/extent nodes...]
           → FS Tree 256 → [same nodes! shared via ref count]

Creating a snapshot is an O(1) operation regardless of how many files are in the subvolume: it is nearly instantaneous even for a subvolume with millions of files, because no data is copied—only the root pointer is duplicated.

After the snapshot, CoW handles divergence naturally. When you modify a file in the original subvolume, the modified leaf node in the FS tree is CoW’d to a new location, which CoW’s its parent, all the way up to the root. The snapshot’s FS tree still points to the old nodes. The two trees diverge gradually as changes accumulate, sharing blocks for everything that hasn’t changed.

Snapshot CoW divergence
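In terms of the toy Node/cow_insert sketch from the B-tree section (still an invented model, not the real structures), a snapshot is nothing more than keeping a second name for the old root:

original = Node(level=0, keys=[(256, "INODE_ITEM", 0)], items=["stat data"])
snapshot = original                                    # snapshot = same root node, refcount bumped

original = cow_insert(original, (256, "EXTENT_DATA", 0), "new extent")
print(len(original.items), len(snapshot.items))        # 2 1: the trees diverged, old nodes intact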

Btrfs can also do read-only snapshots, which are guaranteed never to change and are ideal for backups. The btrfs send command can efficiently compare two snapshots and emit a stream of the differences—perfect for incremental backups, because it works at the filesystem level and understands shared CoW blocks rather than just byte-comparing files like rsync does.

We’ve now seen all the moving parts. Let’s watch them work together on two concrete operations.

Putting It All Together: Reading and Writing a File

Let’s trace two concrete operations to see how all these structures work together.

Starting with a read, since it’s the simpler of the two—no allocations, no CoW, just traversal.

Reading a File

Say we want to open /home/alice/notes.txt on a freshly mounted Btrfs filesystem.

The kernel starts with the superblock and follows chunk_root to the chunk tree, which establishes the logical-to-physical translation tables. Then it follows root to the root tree and looks up the default subvolume—determined by a special DIR_ITEM named "default" in the root tree, which the user can change with btrfs subvolume set-default. Now we’re in the FS tree for that subvolume.

Finding /home/alice/notes.txt requires traversal just like any filesystem, but now we can be specific about what’s searched at each step. Inode 256 is the root directory—that’s BTRFS_FIRST_FREE_OBJECTID, the well-known starting point:

  • Search the FS tree for (256, DIR_ITEM, hash("home")) → get inode number of home/
  • Search for (home_ino, DIR_ITEM, hash("alice")) → get inode number of alice/
  • Search for (alice_ino, DIR_ITEM, hash("notes.txt")) → get the file’s inode number

With the file’s inode number—say 512—we load (512, INODE_ITEM, 0) to check permissions and get the file size. Then we walk the EXTENT_DATA items sorted by file offset:

  • (512, EXTENT_DATA, 0) → logical address 0x1000000, length 1 MiB
  • (512, EXTENT_DATA, 0x100000) → logical address 0x2000000, length 2 MiB

The first item’s key tells us this is the extent at file offset 0—the very start of the file. Its payload gives us the logical address where that data lives on disk (0x1000000) and how long it is (1 MiB). The second item’s key offset is 0x100000, which is 1 MiB—exactly where the first extent ends—so it covers the next chunk of the file. Its payload points to logical address 0x2000000 with a length of 2 MiB.

Each logical address is translated through the chunk tree to a physical device and offset. The data is read, and before returning it the kernel looks up (CSUM_OBJ, EXTENT_CSUM, 0x1000000) in the checksum tree and verifies every sector. Corruption is caught here, every time.
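Condensed into a toy model (a plain Python dict standing in for the FS tree; the inode numbers, the h() name hash, and the payloads are all invented), the whole read boils down to a handful of point lookups followed by a short range scan over EXTENT_DATA keys:

def h(name):                         # stand-in for the directory-entry name hash
    return hash(name) & 0xffffffff

fs_tree = {
    (256, "DIR_ITEM", h("home")):      ("inode", 300),
    (300, "DIR_ITEM", h("alice")):     ("inode", 400),
    (400, "DIR_ITEM", h("notes.txt")): ("inode", 512),
    (512, "INODE_ITEM", 0):            ("size", 3 << 20),
    (512, "EXTENT_DATA", 0):           ("logical", 0x1000000, "len", 1 << 20),
    (512, "EXTENT_DATA", 0x100000):    ("logical", 0x2000000, "len", 2 << 20),
}

ino = 256                            # root directory: one DIR_ITEM lookup per path component
for name in ["home", "alice", "notes.txt"]:
    ino = fs_tree[(ino, "DIR_ITEM", h(name))][1]

inode = fs_tree[(ino, "INODE_ITEM", 0)]                    # size, timestamps, permissions
extents = sorted(k for k in fs_tree if k[:2] == (ino, "EXTENT_DATA"))
for k in extents:                                          # each maps a file range to a logical address
    print(k[2], "->", fs_tree[k])                          # next: chunk-tree translate, read, csum verify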

Writing is where CoW and transactions come into play.

Writing a File

Creating and writing notes.txt is more involved. When you call write(), the data goes into the page cache immediately—no disk activity yet. Btrfs uses delayed allocation: it reserves space in the block group’s free space counter but doesn’t pick specific blocks until the data actually needs to go to disk.

When the kernel decides to flush dirty pages—or when you call fsync()—Btrfs joins the currently running transaction. There is always one active transaction that multiple writers share concurrently; a new one is only opened when none exists. Transactions are the atomic unit of Btrfs’s crash safety: either all changes in a transaction are committed, or none of them are.

Within the transaction, Btrfs consults the free space tree to find available space in a data block group, allocates a new extent at—say—logical address 0x3000000, and writes the data there. Only after the data write completes does Btrfs update the other trees:

  • FS tree: insert (512, EXTENT_DATA, 0x200000) pointing to 0x3000000; update (512, INODE_ITEM, 0) with the new size and timestamps
  • Extent tree: insert (0x3000000, EXTENT_ITEM, size) with a backref pointing to inode 512 in this subvolume
  • Checksum tree: insert (CSUM_OBJ, EXTENT_CSUM, 0x3000000) with checksums for each new sector

All of these metadata updates CoW the affected tree nodes all the way up to their respective roots.

When the transaction commits, Btrfs flushes all the new and CoW’d tree nodes to their final locations on disk first—then writes a new superblock pointing to the new root tree root. That superblock write is the commit point. Because all the new data is already safely on disk before the superblock is touched, the atomicity doesn’t depend on write size—it comes from the CoW model itself. If the system crashes before the superblock write completes, its checksum will be invalid and the kernel will ignore it, falling back to the previous valid copy. If it completes, the new state is fully consistent.
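The write ordering is the whole safety argument, so here it is as a sketch: a toy "disk" dictionary with invented block names, and a commit function that models only the sequencing (flushes and barriers elided).

disk = {                                  # a toy disk: block name -> contents
    "sb":      {"root": "root_v1"},       # the superblock points at the current root tree
    "root_v1": {"files": 1},
}

def commit(new_root_name, new_blocks):
    # 1. write every new and CoW'd tree node to fresh locations; old blocks stay in place
    disk.update(new_blocks)
    # a crash here is harmless: the superblock still points at root_v1
    # 2. the commit point: a single superblock update flips the filesystem to the new state
    disk["sb"] = {"root": new_root_name}

commit("root_v2", {"root_v2": {"files": 2}})
print(disk["sb"], disk["root_v1"], disk["root_v2"])   # new root live, old root still intact on disk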

As described earlier, fsync() on an individual file is the one exception to this flow: rather than committing the whole transaction, Btrfs writes just that file's pending changes to the per-subvolume log tree (reachable via log_root in the superblock), which is replayed quickly on the next mount.

Let’s pull all of this together into one final picture.

Summary

Btrfs is built on four interlocking ideas: a copy-on-write B-tree engine, a logical address layer that abstracts physical devices, per-block checksums for data integrity, and reference-counted extents for shared blocks. None of these strictly requires the others, but together they give Btrfs its distinctive capabilities.

Everything in Btrfs is a B-tree, and all those trees share the same node format and the same universal (objectid, type, offset) key. Every node carries its own checksum and generation stamp—corruption is caught on every read, and stale pointers are detected on every tree traversal.

The root tree gives you access to everything else: the FS trees (one per subvolume, holding inodes, hard links, extended attributes, directory entries, and file extents all sorted together by inode number), the extent tree (block groups, allocated extents, and reference counts that make CoW snapshots cheap), the checksum tree (per-sector checksums for all data, verified on every read), and the free space tree (fast free space lookup within block groups).

Snapshots are O(1): creating one adds a single entry to the root tree pointing at the same root node as the source. CoW handles divergence naturally—modified nodes are written to new locations, leaving the snapshot’s pointers intact. The extent tree’s reference counts track sharing and free blocks when references are dropped.

All of this commits atomically: a single superblock write switches the entire filesystem from one consistent state to the next. No journal replay needed—if the system crashes before that write, the previous superblock is still valid and the filesystem is exactly as it was.

In the next article, we’ll explore ZFS—the filesystem that pioneered many of these ideas. ZFS takes data integrity even further, building the entire filesystem as a Merkle tree where every block is cryptographically linked to its parent. Same fundamental CoW architecture, but with different design priorities and a fascinating approach to pooled storage.


Want to dig into the source? The Btrfs implementation lives in fs/btrfs/ in the Linux kernel tree. Start with ctree.c for B-tree operations, transaction.c for the commit path, extent-tree.c for space allocation, volumes.c for multi-device and RAID, and disk-io.c for how the superblock and chunk bootstrapping work. Happy reading!