RAIDZ Expansion feature by ahrens · Pull Request #12225 · openzfs/zfs


Motivation and Context

This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful for
small pools (typically with only one RAID-Z group), where there isn't
sufficient hardware to add capacity by adding a whole new RAID-Z group
(typically doubling the number of disks).

For additional context as well as a design overview, see my talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.

Description

Initiating expansion

A new device (disk) can be attached to an existing RAIDZ vdev, by running
zpool attach POOL raidzP-N NEW_DEVICE, e.g. zpool attach tank raidz2-0 sda.
The new device will become part of the RAIDZ group. A "raidz expansion" will
be initiated, and the new device will contribute additional space to the RAIDZ
group once the expansion completes.

The feature@raidz_expansion on-disk feature flag must be enabled to
initiate an expansion, and it remains active for the life of the pool. In
other words, pools with expanded RAIDZ vdevs cannot be imported by older
releases of the ZFS software.

During expansion

The expansion entails reading all allocated space from existing disks in the
RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including
the newly added device).

The expansion progress can be monitored with zpool status.

Data redundancy is maintained during (and after) the expansion. If a disk
fails while the expansion is in progress, the expansion pauses until the health
of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting
for reconstruction to complete).

The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
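The rewrite described above can be pictured with a toy model (an illustration of the idea only, not the actual on-disk algorithm or its real sector layout): logical sectors that filled rows across the old set of disks are rewritten, in order, across the wider set, and the rows freed at the end of each device become the new capacity.

```python
def layout(num_sectors, width):
    """Map logical sector i -> (disk, row), filling each row of disks in turn."""
    return {i: (i % width, i // width) for i in range(num_sectors)}

before = layout(12, 4)  # 12 allocated sectors on a 4-disk group: rows 0-2
after = layout(12, 6)   # the same 12 sectors reflowed onto 6 disks: rows 0-1

# Every allocated sector survives the rewrite...
assert set(before) == set(after)

# ...and fewer rows are occupied afterwards; the freed rows are new space.
rows_before = 1 + max(r for _, r in before.values())  # 3 rows occupied
rows_after = 1 + max(r for _, r in after.values())    # 2 rows occupied
```

Note that in this model each block keeps its contents; only its position changes, which is why redundancy can be maintained throughout.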

After expansion

When the expansion completes, the additional space is available for use, and is
reflected in the available zfs property (as seen in zfs list, df, etc).

Expansion does not change the number of failures that can be tolerated without
data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old data-to-parity
ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the
larger set of disks. New blocks will be written with the new data-to-parity
ratio (e.g. a 5-wide RAIDZ2 that has been expanded once to 6-wide has 4 data
to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not
change, so slightly less space than expected may be reported for
newly-written blocks, according to zfs list, df, ls -s, and similar
tools.
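The space-accounting effect can be sketched numerically; the widths and parity counts below are the example figures from the text, and the raw capacity is a hypothetical round number, not anything read from a real pool.

```python
from fractions import Fraction

def data_fraction(width, parity):
    """Fraction of each stripe that holds data (ignoring padding and other
    allocation details)."""
    return Fraction(width - parity, width)

old = data_fraction(5, 2)  # blocks written before expansion: 3 data of 5 disks
new = data_fraction(6, 2)  # blocks written after expansion:  4 data of 6 disks

# Space is still reported using the original ("assumed") ratio, so for a
# given amount of raw capacity the reported figure understates what newly
# written blocks can actually hold:
raw_sectors = 6000               # hypothetical raw capacity after expansion
reported = raw_sectors * old     # 3600 data sectors reported
storable = raw_sectors * new     # 4000 data sectors actually storable
```

In this sketch the reported figure is about 10% below what new blocks can actually hold, which matches the text's "slightly less space than expected" caveat.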

Manpage changes

zpool-attach.8:

NAME
     zpool-attach — attach new device to existing ZFS vdev

SYNOPSIS
     zpool attach [-fsw] [-o property=value] pool device new_device

DESCRIPTION
     Attaches new_device to the existing device.  The behavior differs
     depending on whether the existing device is a RAIDZ device, or a
     mirror/plain device.

     If the existing device is a mirror or plain device ...

     If the existing device is a RAIDZ device (e.g. specified as "raidz2-0"),
     the new device will become part of that RAIDZ group.  A "raidz expansion"
     will be initiated, and the new device will contribute additional space to
     the RAIDZ group once the expansion completes.  The expansion entails
     reading all allocated space from existing disks in the RAIDZ group, and
     rewriting it to the new disks in the RAIDZ group (including the newly
     added device).  Its progress can be monitored with zpool status.

     Data redundancy is maintained during and after the expansion.  If a disk
     fails while the expansion is in progress, the expansion pauses until the
     health of the RAIDZ vdev is restored (e.g. by replacing the failed disk
     and waiting for reconstruction to complete).  Expansion does not change
     the number of failures that can be tolerated without data loss (e.g. a
     RAIDZ2 is still a RAIDZ2 even after expansion).  A RAIDZ vdev can be
     expanded multiple times.

     After the expansion completes, old blocks remain with their old
     data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but
     distributed among the larger set of disks.  New blocks will be written
     with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
     expanded once to 6-wide has 4 data to 2 parity).  However, the RAIDZ
     vdev's "assumed parity ratio" does not change, so slightly less space
     than expected may be reported for newly-written blocks, according to
     zfs list, df, ls -s, and similar tools.

Status

This feature is believed to be complete. However, like all PRs, it is subject
to change as part of the code review process. Since this PR includes on-disk
changes, it shouldn't be used on production systems before it is integrated
into the OpenZFS codebase. Tasks that still need to be done before integration:

  • Clean up ztest code
  • Additional code cleanup (address all XXX comments)
  • Document the high-level design in a "big theory statement" comment
  • Remove/disable verbose logging
  • Fix the few remaining test failures
  • Remove the first commit (needed to get cleaner test runs)

Acknowledgments

Thank you to the FreeBSD Foundation for
commissioning this work in 2017 and continuing to sponsor it well past our
original time estimates!

Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @Fmstrat for portions
of the implementation.

Sponsored-by: The FreeBSD Foundation
Contributions-by: Stuart Maybee stuart.maybee@comcast.net
Contributions-by: Fedor Uporov fuporov.vstack@gmail.com
Contributions-by: Thorsten Behrens tbehrens@outlook.com
Contributions-by: Fmstrat nospam@nowsci.com

How Has This Been Tested?

Tests added to the ZFS Test Suite, in addition to manual testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist: