The OpenZFS project has a bright and encouraging future as it continues to gain momentum and adoption as the reliable, flexible, and performant filesystem for use cases from small embedded appliances and IoT devices, through every scale of enterprise, to the biggest research clusters in the world.
Today, we’ll look at what is coming in the near term and where ZFS might take us over the next few years as the storage landscape continues to evolve.
OpenZFS 2.4
The next major release of OpenZFS is scheduled to arrive late this fall, likely in early November 2025. It builds on the renowned stability of ZFS while adding features and improvements to both usability and performance.
Top Klara-contributed features and fixes:
Other interesting features:
Planned Features
In addition to what will land in OpenZFS 2.4, there are other projects that are currently in development and would be expected to land in near-future releases of OpenZFS.
Label Redesign
Klara is actively developing a redesign of the ZFS on-disk label format. The new format makes the label area significantly larger (growing from 256 KiB to 256 MiB) to provide an array of new capabilities, and is built to adapt to future features that have not yet been designed. The primary goal of the redesign is to support larger sector sizes, as the market has started to introduce flash devices with native sector sizes of 64 or even 128 KiB. Currently, ZFS supports a maximum sector size of only 16 KiB, and even then with significant trade-offs to ZFS’s ability to rewind in the event of a disaster.
The ZFS label includes a dedicated 128 KiB area that stores a ring buffer of uberblocks. The uberblock is the top-level data structure for each transaction group and points to the consistent metadata for the entire pool as of that transaction group. To avoid the potential for data corruption when partially modifying a sector, ZFS stores each 1 KiB uberblock in its own sector. With the 512-byte sectors used by all storage media when ZFS was originally designed in the early 2000s, the buffer could hold 128 uberblocks. With modern 4 KiB sector devices, that works out to only 32 uberblocks, and other features like zpool checkpoints and multi-mount protection may each reserve an individual uberblock. With 16 KiB sectors and up to two reserved uberblocks, only 6 uberblocks would remain in the ring, which could cover only a few seconds’ worth of changes.
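To put rough numbers on that trade-off, here is a quick back-of-the-envelope calculation. The 128 KiB ring, the 1 KiB uberblock size, and the one-uberblock-per-sector rule come straight from the description above; the two reserved slots for the 16 KiB case are the checkpoint/multi-mount reservations just mentioned.

```python
# Back-of-the-envelope sizing of the current uberblock ring.
# The 128 KiB ring and 1 KiB uberblock size come from the text above;
# the reserved-slot counts are the checkpoint/MMP reservations.
RING_BYTES = 128 * 1024
UBERBLOCK_BYTES = 1024

def ring_slots(sector_bytes, reserved_slots=0):
    # Each uberblock occupies a whole sector (or 1 KiB, whichever is larger)
    # so that a partial sector write can never corrupt a neighbouring slot.
    slot_bytes = max(UBERBLOCK_BYTES, sector_bytes)
    return RING_BYTES // slot_bytes - reserved_slots

for sector, reserved in [(512, 0), (4096, 0), (16384, 2)]:
    print(f"{sector:>6} B sectors, {reserved} reserved -> "
          f"{ring_slots(sector, reserved)} rewindable uberblocks")
# 512 B -> 128, 4 KiB -> 32, 16 KiB with 2 reserved -> 6
```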
The new label design will feature a much larger area to store uberblocks, to allow a longer rewind window, as well as a separate area for reserved and best-effort uberblocks. These best-effort uberblocks will allow extreme rewind in cases where a catastrophic configuration error, such as multiple concurrent mounts, has left all of the recent uberblocks in an inconsistent state.
An improved allocation strategy for the uberblocks could also allow multiple non-consecutive uberblocks to be stored in a single large sector, ensuring that if an uberblock is corrupted by a partial write, the adjacent uberblocks are not impacted.
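The exact on-disk layout is still being designed, but the sketch below illustrates the general idea: if consecutive uberblocks are striped across different physical sectors, a torn write to any single sector can never take out two adjacent transaction groups. The slot function here is purely illustrative, not the actual allocation policy.

```python
# Illustrative only: one way to spread consecutive uberblocks across
# large sectors so a torn write to a single sector never destroys
# neighbouring transaction groups. Not the real OpenZFS layout.
def slot_for_txg(txg, slots_per_sector, num_sectors):
    ring_index = txg % (slots_per_sector * num_sectors)
    sector = ring_index % num_sectors    # stride across sectors first
    offset = ring_index // num_sectors   # then fill each sector
    return sector, offset

# With 4 x 64 KiB sectors holding 64 x 1 KiB slots each, txg N and
# txg N+1 always land in different physical sectors:
for txg in range(6):
    print(txg, slot_for_txg(txg, slots_per_sector=64, num_sectors=4))
```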
The expanded label area will also allow each disk to store the complete configuration of the pool, as opposed to the current model, where each disk only knows about its siblings in the same VDEV. This will allow ZFS to provide better diagnostics about missing disks, and also speed up the discovery and import of pools.
AnyRaid-Z
Building upon the initial work for AnyRaid Mirrors, Klara will be expanding the concept to allow using RAID-Z parity to further maximize the usable storage of a set of disks. AnyRaid will allow a collection of disks of various sizes to offer more usable capacity than traditional ZFS RAID-Z, while still maintaining the same data durability, withstanding the loss of 1–3 disks depending on the selected parity level. This flexibility allows budget deployments of ZFS to reap the benefits of incremental hardware upgrades rather than waiting until every disk in a VDEV has been replaced with a larger one.
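As a rough illustration of the capacity difference, consider a hypothetical pool built from 4, 8, 12, and 12 TB disks with single parity. The calculation below is deliberately simplified: it ignores padding, metadata, and the exact tile layout AnyRaid will use, and the tile-based figure is only a theoretical upper bound rather than a promised number.

```python
# Simplified comparison; ignores padding, metadata, and AnyRaid's
# exact tile layout, which is still being finalised.
disks = [4, 8, 12, 12]   # hypothetical disk sizes in TB
parity = 1               # raidz1-level durability

# Traditional RAID-Z: every disk contributes only as much as the smallest.
traditional = (len(disks) - parity) * min(disks)

# A tile-based layout can, at best, use the raw space minus the parity
# fraction; real layouts will land somewhere below this bound.
raw = sum(disks)
upper_bound = raw * (len(disks) - parity) / len(disks)

print(f"raw: {raw} TB, traditional raidz1: {traditional} TB, "
      f"tile-based upper bound: ~{upper_bound:.0f} TB")
# raw: 36 TB, traditional: 12 TB, upper bound: ~27 TB
```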
Forced Export
When a ZFS pool becomes suspended due to the loss of too many devices, whether from hardware failure, external connectivity issues, or the removal of removable media, no further activity is permitted until the devices are reconnected. In the case of removable media, that reconnection may never come. When there are multiple pools on the same system, this can cause some administrative commands to become stuck waiting on the suspended pool, blocking access to some functionality of the remaining pools.
With the forced export feature, when it is determined that the disconnected devices are not likely to return without further intervention, the suspended pool can be forcibly exported, discarding any asynchronously written data that has not yet been sent to disk. ZFS will be able to recover as if the system were restarted due to a power outage or other crash, but without the delay of actually undertaking a reboot. Quickly restoring full functionality to the remaining pools, and possibly the suspended pool as well, provides improved availability and uptime.
Many high-availability configurations may also benefit from this feature, allowing them to confidently force export a pool to quickly respond to an issue, then restore functionality by importing the pool on the secondary node.
AWS Enhancements
Using cloud infrastructure presents an interesting set of challenges and opportunities for ZFS. Take the example of AWS’s provisioned IOPS EBS volumes. These allow purchasing guaranteed performance for storage workloads.
The way ZFS batches writes, preserving available IOPS for reads and concentrating writes into large sequential ranges to optimize throughput, can work against these rate-limited configurations. A volume limited to 1,000 IOPS and 125 MiB/s with a write-heavy workload only reaches its full bandwidth while each transaction group is being flushed, typically once every 5 seconds. EBS does not let you bank the IOPS left unused over the previous few seconds, so the bursty pattern ends up making writes slower overall.
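To put numbers on that, here is a back-of-the-envelope comparison for the volume above. The 5-second transaction group interval is the ZFS default; the assumption that the flush saturates the volume for roughly one second of that interval is purely illustrative.

```python
# Illustrative throughput estimate for a 125 MiB/s provisioned volume
# with bursty, end-of-txg flushing versus steady writing.
bandwidth_mib_s = 125
txg_interval_s = 5
flush_seconds = 1   # assumption: the flush saturates the volume for ~1 s

# If writes only happen during the flush, the volume sits idle for the
# remaining ~4 seconds of every interval:
effective = bandwidth_mib_s * flush_seconds / txg_interval_s
print(f"effective write throughput: ~{effective:.0f} MiB/s "
      f"of a provisioned {bandwidth_mib_s} MiB/s")   # ~25 MiB/s

# Spreading the same writes across the whole interval could approach
# the full provisioned rate instead of this bursty fraction.
```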
A feature under development by Klara takes two approaches to maximize throughput on this type of provisioned storage, especially EBS. First, writes are deliberately spread across different regions within the volume to maximize network throughput between the multiple nodes that back the volume. Second, accumulated asynchronous writes begin flushing before the end of the transaction group, taking advantage of the available throughput and avoiding the rate limit during the crucial checkpointing phase.
Potential Features
In addition to the features that are actively under development, there are numerous ideas and initial designs for future features that are looking for partners with a use case who would be interested in collaborating on and/or funding development. If you are interested in any of these features, please contact us to discuss how we can work together to bring them into OpenZFS.
BRT Log
The Block Reference Tree (BRT) feature, also known as “reflinks” or “file cloning”, allows individual blocks or whole files to be copied by reference (cloned), keeping multiple copies of the same data in different files without consuming additional storage. This feature has been a game-changer for certain workloads, especially backups. A common backup operation is creating a synthetic “full” backup by combining the last full backup with the modifications from multiple more recent incremental backups. This approach avoids putting additional load on the systems being backed up, while saving significant space and simplifying and accelerating restoration.
BRT was designed to avoid the overhead and memory consumption of “always on” deduplication. When used at large scale, however, it suffers an issue similar to ZFS’s existing deduplication feature: the full BRT metadata must be updated for each transaction group. These problems were significantly mitigated in ZFS dedup by the Fast Dedup Log feature Klara introduced in OpenZFS 2.3. A similar log-and-batch mechanism for BRT would reduce the latency of cloning operations and improve the performance of mass deletions or overwrites of cloned blocks.
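Nothing about a BRT log has been finalized; the sketch below only illustrates the general log-and-batch pattern Fast Dedup uses, applied to clone reference counts: cheap append-only records during the transaction group, folded into the authoritative table once at sync time. The class and field names are invented for illustration and do not reflect actual OpenZFS data structures.

```python
from collections import defaultdict

# Illustrative log-and-batch pattern (not the actual OpenZFS design):
# clone/free operations append cheap log records, and the authoritative
# reference-count table is only updated in bulk at txg sync.
class BrtLikeLog:
    def __init__(self):
        self.refcounts = defaultdict(int)   # authoritative table
        self.log = []                       # cheap per-txg intent log

    def clone_block(self, dva):
        self.log.append((dva, +1))          # O(1), no table lookup needed

    def free_clone(self, dva):
        self.log.append((dva, -1))

    def sync_txg(self):
        # Coalesce the log so each block is touched at most once per txg.
        pending = defaultdict(int)
        for dva, delta in self.log:
            pending[dva] += delta
        for dva, delta in pending.items():
            self.refcounts[dva] += delta
            if self.refcounts[dva] <= 0:
                del self.refcounts[dva]     # last reference gone
        self.log.clear()
```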
Support for SMR Drives
Shingled Magnetic Recording, or SMR, increases the density of magnetic spinning media by taking advantage of the fact that the read head on an HDD is significantly smaller than the write head. By laying down the tracks with a partial overlap, like the shingles on a roof, it becomes possible to fit more readable tracks in the same physical space, increasing overall storage capacity.
With the introduction of energy-assisted recording mechanisms like HAMR (Heat Assisted Magnetic Recording) and MAMR (Microwave Assisted Magnetic Recording) beginning to see volume shipments, we expect to see a second wave of SMR drives enter the market, this time offering larger storage gains than the first generations of SMR drives.
The downside of SMR drives is that writing to an overlapped track requires rewriting each of the following tracks in order. Random writes require reading all of the overlapping tracks, applying the modification in a buffer, then rewriting all of the tracks. This results in a significant slowdown if the filesystem is not designed to cooperate with the drive to organize and align writes to match the “zones” of overlapping tracks.
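To get a feel for the cost, consider the worst case on a drive-managed SMR disk, where a small in-place update can force the drive to read, modify, and rewrite an entire zone. The 256 MiB zone size used here is a typical figure, not a fixed property of every drive.

```python
# Worst-case write amplification for an in-place update on a
# drive-managed SMR zone (256 MiB is a typical zone size, not universal).
ZONE_MIB = 256
update_kib = 4

rewritten_mib = ZONE_MIB   # whole zone read, modified, and rewritten
amplification = rewritten_mib * 1024 / update_kib
print(f"a {update_kib} KiB random write can trigger ~{rewritten_mib} MiB "
      f"of rewriting: ~{amplification:.0f}x write amplification")
```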
ZFS requires additional optimizations to make the best use of the extra capacity provided by SMR drives without suffering large performance penalties with drive-managed SMR, or incompatibilities with host-managed SMR.
Future Technology
There are a number of developments coming in the storage industry that could shape future work on ZFS. The open-source nature of ZFS allows it to keep adapting to changes in the storage industry, and invites those building new products to contribute to ZFS so those products can more easily be adopted by the myriad industries that rely on the reliability, flexibility, and performance of ZFS.
NVMe-connected HDDs
The NVMe (Non-Volatile Memory Express) interconnect has revolutionized the way storage is connected to compute, bringing increased concurrency through multiple queues, massive gains in bandwidth and IOPS, and significantly reduced latency for flash-based storage.
The technology was then extended with NVMe-oF (NVMe over Fabrics) to run over a number of existing transports, including TCP. This presents unique opportunities for more reconfigurable, software-defined storage by replacing traditional SAS interconnects with switched networking.
To that end, drive manufacturers, including Seagate, are introducing NVMe-connected HDDs. While the more advanced interface isn’t going to make HDDs any faster, the common interface will make infrastructure more reusable and adaptable, allow more disaggregated systems, and open up additional possibilities for high-availability configurations.
CXL
Compute Express Link (CXL) takes this concept of disaggregated computing even further. Using CXL.mem, it is possible to directly address the memory in another node, allowing the construction of huge memory systems and opening up the possibility of reallocating underutilized resources between nodes.
Utilizing this new generation of NUMA (Non-Uniform Memory Access) to maximize the available memory for caching, and using CXL’s other mechanisms to share CPU and I/O resources, could allow ZFS to scale out without the complexity of a distributed filesystem.
Conclusion
Storage will continue to advance, and the price-per-terabyte of both spinning and solid-state media will continue to come down, while the capacities continue to skyrocket. ZFS, a filesystem with renowned durability, decades of proven operations, and a demonstrated ability to adapt to new technologies, will remain a successful and compelling choice for those building storage infrastructure.