A data corruption bug in OpenZFS?

despairlabs.com

220 points by moviuro 2 years ago · 114 comments

cesarb 2 years ago

IMO, part of the issue is that something which used to be just a low-level optimization (don't store large sequences of zeros) became visible to userspace (SEEK_HOLE and friends). Quoting from this article:

"This is allowed; its always safe to say there’s data where there’s a hole, because reading a hole area will always find “zeroes”, which is valid data."

But I recall reading elsewhere a discussion about some userspace program which did depend on holes being present in the filesystem as actual holes (visible to SEEK_HOLE and so on) and not as runs of zeros.

Combined with the holes being restricted to specific alignments and sizes, this means that the underlying "sequence of fixed-size blocks" implementation is leaking too much over the abstract "stream of bytes" representation we're more used to. Perhaps it might be time to rethink our filesystem abstractions?
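
For anyone who hasn't poked at this interface: a minimal C sketch (assuming Linux/glibc, where SEEK_DATA and SEEK_HOLE need _GNU_SOURCE) that walks the data regions a filesystem chooses to report for a file. A filesystem is always allowed to report the whole file as one data region, which is exactly the leeway the quoted article describes.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t end = lseek(fd, 0, SEEK_END);
        off_t pos = 0;
        while (pos < end) {
            off_t data = lseek(fd, pos, SEEK_DATA);   /* next offset with data */
            if (data < 0) break;                      /* only a trailing hole remains */
            off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of that data region */
            printf("data %lld..%lld\n", (long long)data, (long long)hole);
            pos = hole;
        }
        close(fd);
        return 0;
    }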

  • codys 2 years ago

    > But I recall reading elsewhere a discussion about some userspace program which did depend on holes being present in the filesystem as actual holes (visible to SEEK_HOLE and so on) and not as runs of zeros.

    "treatment of on-disk segments as "what was written by programs" can cause areas of 0 to not be written by bmaptool copy":

    https://github.com/intel/bmap-tools/issues/75

    IMO, the issue here isn't filesystem or zfs behavior, it's that bmap-tool wants an extra "don't care bit" per block, which filesystems (traditionally) don't track, and which programs interacting with the filesystem don't expect to exist.

    Some of the comments I've made in this issue describe options to make things better.

    (FWIW: the original hn link discusses a different issue around seek hole/data, and the bmap-tool issue is backwards from the issue the parent posits: bmap-tool relies on explicit runs of zeros written not being holes, and particular behavior from programs writing data)

  • ajross 2 years ago

    Indeed, sparse files are simply a mistake to have included in Unix in the first place (I think we blame this on early SunOS? Not sure, though almost certain that 3BSD and v7 didn't have them). Yes, they have been used productively for various tricks, but they create a bunch of complexity that every filesystem needs to carry along with it. It's a bad trade.

    • retrac 2 years ago

      Sparse files make more sense if you see the file system and paging as unified. If you have allocated an array of 1 billion items, accessing the last item doesn't make the OS zero out everything from the 0th to the billionth item, allocating millions of pages along the way. Virtual memory is sparse; so just one page of virtual memory is allocated. Mmap'd sparse files behave the same way.
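
      A tiny C illustration of that paging behaviour (a sketch, not from the article; it assumes a 64-bit Linux system and relies on the usual demand-paging behaviour for anonymous mappings):

          #define _DEFAULT_SOURCE
          #include <stdio.h>
          #include <sys/mman.h>

          int main(void)
          {
              size_t n = 1000000000UL;               /* ~4 GB of virtual address space */
              int *a = mmap(NULL, n * sizeof(int), PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (a == MAP_FAILED) { perror("mmap"); return 1; }

              a[n - 1] = 42;                         /* faults in only the touched page */
              printf("last element: %d\n", a[n - 1]);
              /* RSS stays tiny; an mmap'd sparse file keeps its holes the same way */
              munmap(a, n * sizeof(int));
              return 0;
          }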

      • ajross 2 years ago

        No, I get it. I'm saying that's a bad design. The data structure for a VM system is a big tree of discontiguous mappings, which matches the API used for accessing it. If you make a random access to memory at an arbitrary spot, you expect to get a VM trap. If you want to map memory, you're expected to know the layout and manage the "holes" yourself (or else to let the OS manage your memory space for you).

        The data structure for a file is an ordered stream of bytes, which matches the API for accessing it. You can jump around by seeking, but there are no holes. Bytes start at 0 and go on from there. Want to seek() to an arbitrary value? Totally legal, presumptively valid.

        Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

        • avianlyric 2 years ago

          > Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

          Aren’t all modern file systems implemented as a tree of discontiguous regions? That’s the whole reason block allocators exist, and why file fragmentation (and defragmentation processes) are a thing.

          How could you reasonably expect to implement a filesystem that under the hood only operates on contiguous blocks of disk space? It would require the filesystem to have prior knowledge of the size of every file that is going to be written, so it could pre-allocate the contiguous sections. Or, the second a write made a file exceed the length of its contiguous empty section of disk, further writes would have to pause until the filesystem had finished copying the entire file to a new region with more space.

          With ZFS, its heavy dependence on tree structures of discontiguous address regions is what enables all of its desirable features. To say the complexity is needless is to implicitly say ZFS itself is pointless.

          • p_l 2 years ago

            The issue is that pretty much all other filesystems, at least on Linux, are effectively implemented as swap drivers with some hierarchical structure on top, because that's the interface pushed by Linux at the kernel level.

            In userland, we tend to think of streams of bytes, as provided by original Unix and as all the docs teach us to treat them: that read() and write() are the primitives and they do byte-aligned reads and writes.

            Except the actual Linux VFS has, as its core primitive, the mmap() + pagein/pageout mechanism, with read() and write() being simulated over the pagecache, which treats the files as mmap()ed memory regions. It's how IO caching is done on Linux, and it's a source of various issues for ZFS and for people using different architectures, because for a long time (changed quite recently, afaik) the Linux VFS only supported page-sized or smaller filesystem blocks. Which is a bit of a problem if you're a filesystem like ZFS where the file block can go from 512b to 4MB (or more) in the same dataset, or VMFS which uses 1MB blocks.

            • avianlyric 2 years ago

              What has any of that got to do with the bug described in the article? Presumably every filesystem is responsible for tracking the content of sparse files, and where the holes are. That's not something the Linux kernel is going to give you for free; the FS needs to tell the kernel which pages should be mapped to block addresses on disk and which pages should be simulated as contiguous blocks of zeros with no on-disk representation.

              • p_l 2 years ago

                It's related to the talk about filesystem interface metaphors in this specific subthread :)

          • ajross 2 years ago

            That's true of a storage backend, but not the metaphor presented. Again, the analogy would be a heap: heaps are discontiguous internally too, but you don't demand that users of malloc() understand that there can be a hole in the middle of their memory! Again, the bug here was (seems to have been, it's subtle) a glitch in the tracking of holes in files that didn't ever need to have been there in the first place.

            • avianlyric 2 years ago

              But ZFS doesn't demand that users be aware of holes in files. You can just call `seek()` and `read()` to anywhere, and ZFS will transparently provide zeros to fill the holes. Linux also allows software to become "hole-aware" using `lseek()`, but that's an optimisation that software can opt into, but can equally just ignore.

              The glitch in this case was a failure to correctly track dirty pages that have yet to be written to disk, and thus reading the on-disk data rather than the in-memory data within the dirty page. It just so happens this issue only appears in the code that's responsible for responding to queries about holes from software that's explicitly asking to know about the holes. ZFS itself never had any issues keeping track of the holes; the bookkeeping always converged on the correct state. It's just that during that convergence it was momentarily possible to be given old metadata about holes (i.e. what's currently on disk), rather than the current metadata about holes (i.e. what's currently only in memory, and about to be written to disk).

        • benlivengood 2 years ago

          There are pretty good reasons for treating files as sparse; virtualization and deduplication. Virtualization of storage devices without sparse files would be slowed tremendously by the need to allocate and zero large regions before use, essentially double-writing during the installation and initial provisioning stage. You can force the virtualization layer to implement sparse storage but then you get a host of incompatible disk image formats (vmdk, qcow2, etc.) and N times as many opportunities for bugs like the article describes to be introduced.

          Deduplication is basically a superset of sparse files where the zero block is a single instance of duplication. Deduplication isn't right for every task, but for basically any public shared storage some form of deduplication is vital to avoid wasting space on duplicate copies.

          Sparse/deduplicated files still maintain the read/write semantics of files as streams of bytes; they allow additional operations not part of the original Unix model. Exposing them to userspace probably isn't a mistake per se because it is essentially no different than ioctls or socket-related functions that are a vital part of Unix at this point.

          • ajross 2 years ago

            Those aren't general principles. They're just tricks. Some software uses them. No significant software paradigms are critically dependent on sparse files. Quite frankly almost no significant market-driving software uses them at all. Not sure what you have in mind, but a few examples might be helpful?

            All of them have a straightforward expression using contiguous storage. At best, sparse files allow you to reduce application-layer complexity. But as I'm pointing out, that comes at the cost of filesystem-layer complexity up and down the stack and throughout the kernel, and that's a bad trade.

    • cogman10 2 years ago

      This is a feature I was completely unaware of. Why would you choose to use a sparse file instead of multiple files?

      • wolf550e 2 years ago

        Imagine a torrent client (or an http client downloading a file in parallel using HTTP range requests). It creates an empty file, and then it has downloaded 1MB of data that belongs at offset 100GB and wants to write it to disk. It does not want to pay the price of waiting for 100GB of zeroes to be written. The other blocks will all be downloaded and written eventually, all out of order. If the filesystem had an atomic operation to transform a bunch of (block aligned) files into a single file (like AWS S3 Multipart Upload), then sparse files would not be needed for this case.
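
        A hedged C sketch of that write pattern (file name and sizes are made up, and it assumes a 64-bit off_t; on a filesystem with sparse-file support the unwritten 100GB becomes a hole rather than stored zeroes):

            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <unistd.h>

            int main(void)
            {
                int fd = open("piece.bin", O_CREAT | O_RDWR, 0644);
                if (fd < 0) { perror("open"); return 1; }

                size_t piece = 1 << 20;                      /* one 1 MB piece */
                char *buf = malloc(piece);
                memset(buf, 0xAB, piece);

                off_t where = 100LL * 1024 * 1024 * 1024;    /* it belongs 100 GB in */
                if (pwrite(fd, buf, piece, where) < 0) { perror("pwrite"); return 1; }

                /* ls -l reports ~100 GB size; du reports ~1 MB actually allocated */
                free(buf);
                close(fd);
                return 0;
            }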

        • loeg 2 years ago

          Fallocate is a much better interface for this than sparse files. The torrent client does not care how the underlying filesystem provides the ability to randomly write a large file. And fallocate is a much clearer signal to the filesystem than a sparse file.
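
          Roughly what that looks like (a sketch; the file name and size are made up, fallocate(2) with mode 0 is Linux-specific, and not every filesystem supports it, which is where the COW caveat further down comes in):

              #define _GNU_SOURCE
              #include <fcntl.h>
              #include <stdio.h>
              #include <unistd.h>

              int main(void)
              {
                  int fd = open("download.bin", O_CREAT | O_RDWR, 0644);
                  if (fd < 0) { perror("open"); return 1; }

                  off_t total = 100LL * 1024 * 1024 * 1024;   /* tell the fs the final size up front */
                  if (fallocate(fd, 0, 0, total) != 0)        /* mode 0: really reserve the blocks */
                      perror("fallocate");                    /* e.g. EOPNOTSUPP on some filesystems */

                  /* pieces can now be pwrite()n at arbitrary offsets into reserved space */
                  close(fd);
                  return 0;
              }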

          • avianlyric 2 years ago

            fallocate is just an interface to create sparse files. The result of using `fallocate` is a sparse file.

            • loeg 2 years ago

              You should read my comment in the context of the one it is replying to. That comment suggested a torrent client using seeks + writes to randomly insert chunks as they were downloaded. I have summarized this approach in my comment as "sparse files," expecting charitable readers to be familiar with the context. This method of creating sparse files does not tell the filesystem anything about the intent of the application and usually creates a bunch of fragmentation under torrent-like workloads.

              • avianlyric 2 years ago

                “Sparse file” is a specific term[1] referring to files where the filesystem tracks unwritten file content (i.e. content that would just be zeros if read) and doesn’t allocate space for it, typically in large preallocated files.

                To use the term “sparse file” to also refer to files with large contiguous runs of zeros, created via a seek operation, is just confusing. Those are quite explicitly not sparse files; they’re just files that happen to be full of zeros (all written to disk). “Sparse files” are quite explicitly the result of the optimisation to avoid writing pointless zeros when preallocating a large file that’s going to be written into in an unordered manner.

                Using the term “sparse files” to refer to both the “problem” and the “solution” is just unhelpful, and doesn’t align with the accepted meaning of the term.

                [1] https://en.m.wikipedia.org/wiki/Sparse_file

              • epcoa 2 years ago

                It’s not about being charitable. For those unfamiliar with the terminology this is just confusing, and for those that are familiar this discussion is all fundamental and well known anyway.

                Unfortunately, for COW filesystems including ZFS and btrfs, fallocate doesn’t do anything useful for preallocation. You’re still going to get fragmentation. The two methods outlined are essentially equivalent.

                • loeg 2 years ago

                  > For those unfamiliar with the terminology this is just confusing, and for those that are familiar this discussion is all fundamental and well known anyway.

                  Eh, agree to disagree.

                  > Unfortunately for COW filesystems including zfs and btrfs fallocate doesn’t do anything useful for preallocation.

                  Both ZFS and BtrFS have "nocow" modes that are probably more suitable to this type of use case. And other filesystems are widely used.

                  • finnh 2 years ago

                    Can you point me to docs for ZFS offering a nocow mode? I haven't used it in about a decade, but I can't see how that would work: WAFL/CoW is a pretty fundamental invariant in everything ZFS does.

        • myself248 2 years ago

          Out of curiosity, do you know off-hand how torrent clients do it on filesystems that don't support sparse files? There must be either a preallocate-the-whole-thing step, or a gather-the-pieces-together-and-write-out-the-large-file step. The latter would seem to briefly double the disk space needed at the end of the download, so I suspect they do the former.

          • ajross 2 years ago

            Chunk it up and reassemble, one assumes. Things aren't nearly as clear in the modern world of gigabit pipes into suburban households[1], but when these things were written the filesystem was 100x faster than the link to those peer connections from which the data was fetched. A final copy was only a small overhead.

            [1] Which is why all the stuff we used to torrent is in the cloud now.

          • cesarb 2 years ago

            AFAIK, they preallocate; and even on filesystems which support sparse files, most bittorrent clients have an option to always preallocate (to both reserve space and reduce fragmentation).

      • andoma 2 years ago

        Somewhat related is that a few filesystem types on Linux allow you to remove / insert bytes "within" a file. But it needs to be aligned to the filesystem block size. This uses the same syscall, fallocate(2), which can also be used to punch new sparse holes in a file where it previously had data.

        See https://man7.org/linux/man-pages/man2/fallocate.2.html
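
        A sketch of those fallocate(2) modes (Linux-specific; only some filesystems, e.g. ext4 and XFS, support the range-shifting flags, offsets and lengths must be block-aligned, and the file is assumed to be several blocks long; the 4096 here is just an assumed block size):

            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                int fd = open("data.bin", O_RDWR);
                if (fd < 0) { perror("open"); return 1; }

                off_t bs = 4096;   /* assumed filesystem block size */

                /* Punch a hole: deallocate a block in place; reads there return zeroes. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, bs) != 0)
                    perror("punch hole");

                /* Remove a block entirely, shifting everything after it down. */
                if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, bs, bs) != 0)
                    perror("collapse range");

                /* Open up a block-sized gap, shifting everything after it up. */
                if (fallocate(fd, FALLOC_FL_INSERT_RANGE, bs, bs) != 0)
                    perror("insert range");

                close(fd);
                return 0;
            }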

      • axus 2 years ago

        A memory map of a gigabyte/whatever, that uses a bunch of large addresses but typically only uses 1% of the available space. Saves someone the trouble of managing the map or compressing it.

        It does feel like a weird decision from a long time ago that we're stuck with. I thought it was some quirky Linux feature, but it's been around much longer: https://en.wikipedia.org/wiki/Sparse_file

      • vlovich123 2 years ago

        The number of file descriptors you can have open by a single program is limited and eats up kernel resources.

        • ajross 2 years ago

          Even for a torrent client, the number of active file descriptors is a function of the number of peer connections (e.g. a few dozen). It doesn't scale with the size of the output file.

mgerdts 2 years ago

When I think of a fs corruption bug, I think of something that causes fsck/scrub to have some work to do, sometimes resulting in a restore from backups. From the early reports of this, I was having a hard time understanding how it was a corruption bug. This excellent write up clears that up:

> Incidentally, that’s why this isn’t “corruption” in the traditional sense (and why a scrub doesn’t find it): no data was lost. cp didn’t read data that was there, and it wrote some zeroes which OpenZFS safely stored.

dannyw 2 years ago

Fascinating write up. As someone with a ZFS system, how can I check if I’m affected?

  • moviuroOP 2 years ago

    It's a very rare race condition, odds are very low that you were impacted. If you were, you would have noticed (heavy builds with files being moved around where suddenly files are zero).

    [0] https://bugs.gentoo.org/917224

    [1] https://github.com/openzfs/zfs/issues/15526 (referenced in the article)

  • dist-epoch 2 years ago

    https://github.com/openzfs/zfs/issues/15526#issuecomment-181...

    > zpool get all tank | grep bclone

    > kc3000 bcloneused 442M

    > kc3000 bclonesaved 1.42G

    > kc3000 bcloneratio 4.30x

    > My understanding is this: If the result is 0 for both bcloneused and bclonesaved then it's safe to say that you don't have silent corruption.

    • keep_reading 2 years ago

      bclones were only one way to trigger the corruption. This is not a good way to check.

      It's also not worth checking for because this bug has existed for many years. Your data probably wasn't affected. None of the massive ZFS storage companies out there ran into it by now either.

      Your data is fine. Sleep easy.

LanzVonL 2 years ago

It's important to note that the recent showstopper bugs have all been in OpenZFS, with the Oracle nee Sun ZFS being unaffected by either.

  • nimbius 2 years ago

    Oracle laid off basically every Solaris developer in 2017. They are, by all observation, simply not interested in the product anymore. It's probably the most mournful thing I've seen in tech in a very long time.

    OpenZFS is a mighty filesystem hobbled by an absolutely detestable license (the CDDL). Its greatest single contribution was in all likelihood to BSD, although it didn't seem to make the OS more popular as a whole.

    The latest and greatest from the OpenZFS crowd seems to be bullying Torvalds semi-annually into considering OpenZFS in Linux... which will never happen thanks to the CDDL, and so the forums devolve into armchair legal discussions of the true implications of the CDDL. You'll see a stable BTRFS and a continued effort to polish XFS/LVM/MDRAID before OpenZFS ever makes a dent.

    One could argue OpenZFS is a radioactive byproduct of one of the most lethal forces in open source in the past 20-some years: Oracle. They gobbled up OpenOffice and MySQL, and went clawing after Red Hat just shortly after mindlessly sending Sun to the gallows. They're an unmitigated carbuncle on some of the largest corporations in the entire world, surviving solely on perpetual licensing and the real-world threat of litigation. That they have a physical product at all in 2023 is a pretty amazing testament to the shambling money-corpse empire of Ellison.

    Ultimately the FOSS community under Torvalds is on the right track. Just because Shuttleworth thinks he can't be sued by Oracle for including ZFS in Ubuntu with some hastily reasoned shim doesn't mean Oracle won't nonchalantly send his entire company to the graveyard just for trying. Oracle is a balrog. Stay as far away as you can.

    • p_l 2 years ago

      Oracle isn't the copyright holder for OpenZFS. That's one part that the OpenSolaris and OpenZFS projects managed to ensure. What Oracle could do was to close OpenSolaris again under a proprietary license, something that Bryan Cantrill IIRC blamed on the use of copyright assignment, arguing that open source projects should never use it, with that as a specific example.

      OpenZFS devs have openly declared that no, they are not pushing to include OpenZFS into Linux kernel, and that separate arrangement is just fine, especially since it allows different release cadence and keeps code portable.

      Mainly there's an issue with certain Linux Kernel big name(s) that like to use GPL-only exports (something that has uncertain legal status) in a rather blunt way, and sometimes the reasoning is iffy.

    • paldepind2 2 years ago

      > Just because Shuttleworth thinks he can't be sued by Oracle for including ZFS in Ubuntu with some hastily reasoned shim doesn't mean Oracle won't nonchalantly send his entire company to the graveyard just for trying.

      Canonical has been shipping the kernel with ZFS for more than 7 years and so far they have not been sued by Oracle.

      • marcinzm 2 years ago

        Oracle doesn't sue for fun, and Canonical isn't exactly a massively successful company financially speaking. However, one day Oracle will want something from Canonical, and that is the day the lawyers will come out. Possibly once there's IPO money in its coffers to go after.

        • p_l 2 years ago

          Except Oracle has no capability to sue anyone over OpenZFS.

    • mardifoufs 2 years ago

      How is the CDDL any more detestable than the GPL family of licenses? Not saying that they are detestable in any way, but the CDDL is also a free software license so I don't get how it's worse or bad

      • p_l 2 years ago

        Specifically, the issue is with the GPL, which disallows licenses that have more requirements than itself - in the case of the CDDL, it's IIRC the patent sharing requirements and the language around file-specific applicability.

        And of course then there's the part that the GPL doesn't apply to linking at all - it applies to derivative code, which OpenZFS is not, and thus it does not violate the GPL to ship OpenZFS code linked with the Linux kernel.

    • avianlyric 2 years ago

      > You'll see a stable BTRFS and a continued effort to polish XFS/LVM/MDRAID before openZFS ever makes a dent.

      Right now I would put my money on bcachefs[1] rather than BTRFS. bcachefs is currently in the process of being merged into the kernel and will be in the next kernel release. Doesn't currently quite offer everything ZFS does, but it's very close and already appears more reliable than BTRFS, and once stuff like Erasure Coding is stable, it'll be more flexible than ZFS.

      [1] https://bcachefs.org

      • exitheone 2 years ago

        I applaud your optimism but the bcachefs install base is tiny compared to btrfs and still there are corruption and data loss stories on Reddit so maybe give it another 5-10 years of mainstream use to stabilize.

        I hope it'll beat btrfs eventually though.

        • avianlyric 2 years ago

          I’m mostly excited about having access to a filesystem that can happily handle a heterogeneous set of disks with RAID 5/6 style redundancy.

          BTRFS's RAID-5 implementation has known data loss issues (the write hole) that have existed for years now, and don’t seem likely to be fixed soon.

          Then there’s the roadmap feature of extending bcachefs's native allocation buckets to match up with physical buckets on storage media like SMR drives, and also SSDs that expose their underlying NAND arrangement, allowing bcachefs to orchestrate writes in a manner that best fits the target media. That creates the opportunity for bcachefs to get incredibly high performance on SMR drives (compared to filesystems that don’t understand SMR media), which would probably provide CMR-style performance on SMR drives in all but the most random write workloads.

          But yeah, there’s still some distance for bcachefs to go. Still, its inclusion into mainline, and the fact that mainline only accepts filesystems that have already demonstrated a high level of robustness and completeness (a semi-recent policy change driven by experiences with filesystems like BTRFS, which took so long to become complete and stable after merge), gives me hope we won’t need 5-10 years of mainstream use for bcachefs to stabilise.

    • MCUmaster 2 years ago

      Oracle can’t do a thing to an Isle of Man corporation.

    • rincebrain 2 years ago

      Who on earth is trying to bully Linus into anything? Where have you seen that?

      • LanzVonL 2 years ago

        He did go away for a vacation-style treatment a few years ago after offending Intel. Like a re-education camp.

        • rincebrain 2 years ago

          He did go away, yes, but that's not really related to the premise of people trying to convince him to merge a large wad of code under a license that is not GPLv2, and is not strictly less restrictive than GPLv2, into the Linux codebase.

    • rustcleaner 2 years ago

      This is bull**! It's time for a new license to throw off the old: The Uniform Pirating License! It's a license which you stick on anything for which you need a license, and it conveys all rights and zero obligations to you. Possession of the code is sufficient to run, change, and propagate code. Legal system be damned; we have cryptography and Tor, The State's law here is irrelevant (also, when did you give The State license to bully you around anyway?)!

      My fix: spin up a .onion to host my distribution of the Linux kernel containing ZFS integrated and BtrFS excised, do not answer abuse/legal emails, don't even have email to receive aforementioned emails. What's the pencil-necked shrimpy IP lawyer at Kernel Foundation going to do? Shut down Tor?

      • bb88 2 years ago

        There are some very creative solutions for getting around license restrictions.

        The LAME mp3 encoder originally was a series of patches that could be applied to the Fraunhofer ISO dist10 release.

      • rustcleaner 2 years ago

        Who cares if ACME Inc can't use the UPL due to corporate risk. It would be charmingly ironic if the world's best filesystem could really only be used by petty home users who can utilize the UPL with near-zero risks...

frankjr 2 years ago

I wonder if any large storage provider has been affected by this. I know Hetzner Storage Box and rsync.net both use ZFS under the hood.

  • mappu 2 years ago

    Wasabi Cloud Storage have a Sponsored-By tag on the git commit fixing the issue, so I assume they're highly involved somehow.

joshxyz 2 years ago

Anyone know what diagram tool he used? Thanks.

  • egberts1 2 years ago

    Plantuml, doable in.

    • guiambros 2 years ago

      Any idea which diagram in PlantUML more specifically? I looked at a handful of the PlantUML categories (each one with dozens of examples) and haven't seen anything like the diagrams in OP's post.

commandersaki 2 years ago

Excellent writeup robn!

lupusreal 2 years ago

Is anybody using bcachefs yet?

MenhirMike 2 years ago

Periodic reminder to check if your backups are working, and if you can also restore them. It doesn't matter which file system or operating system you use, make sure to backup your stuff. In a way that's immune to ransomware as well, so not just a RAID-1/5/Z or another form of hot/warm storage (RAID is not a backup, it's an uptime/availability mechanism) but cold storage. (I snapshot and tar that snapshot every night, then back it up both on tape and in the cloud.)

  • tetha 2 years ago

    It is also a good idea to test the restore procedures and documentation as well.

    Don't have the grizzled old storage admin / DBA test the backup. They know a million and one weird necessary workarounds and just execute them. However, if you need a restore and they are currently exploring caves or something, things turn dire. Have a chipper junior restore something based off of the documentation (and prepare to spend a few days updating documentation...)

    And make sure to test backups you don't regularly touch. And very much test those backups you really don't want to test.

    • ericbarrett 2 years ago

      As a grizzled old storage admin who somehow made a career out of database backups, I wholeheartedly agree with all of this. Especially having someone else do a test restore. They don't have to be junior, just not intimately familiar with the systems involved.

      • FridgeSeal 2 years ago

        I’ll go a step further:

        Have a different person do it each time, having them add to and refine the documentation and any tooling once they’ve done it. Keep any tools and scripts used fastidiously current; few things are worse than “to fix this issue, run the repair.sh script” only to find it stopped working 6 months ago because it relied on some extremely specific lib somewhere.

      • MenhirMike 2 years ago

        Oh, database backups are fun, especially if the database server is still running! You want multiple databases, all at a consistent point in time, without taking the system offline? SUFFER, YOU FOOL! The joys of realizing that file system snapshots won't help with data that's still to be committed, and that taking a backup database-by-database means that two databases whose data relies on each other are no longer properly in sync, really warm my heart. Oh wait, it's the whiskey that runs through my veins that does that; being a backup operator is a fantastic pathway into alcoholism. Especially once the databases become so large that the time it takes to take the backup becomes a performance concern for the running system.

        I think Postgres did it right by abbreviating their "Continuous Archiving and Point-in-Time Recovery" as PITR because it's very close to PITA. But PITR and CHECKPOINT actually make Postgres probably one of the better database systems to backup (and restore!), so yet another reason why I think it's a fantastic database.

    • toast0 2 years ago

      One nice thing about a circa-2010 MySQL setup is that the easiest way to set up a new replica is to restore a backup. If you have to do that from time to time, your backups get tested as part of the regular process.

  • bgro 2 years ago

    It’s always amazing to me how frequently backups silently fail. Every piece of backup software or common general-purpose tool for backing things up that I’ve seen has many points of silent failure, where it just gives up copying at some point in the process or skips over files for some reason without indicating what or why.

    If you don’t delete files as you go, now you have an unknown partial backup state that basically doubles your needed space.

    If you delete as you go, sometimes something happens and the process stops or corrupts so your data is now split and you may have lost something.

    Even trying to log all the failures during the process is amazingly difficult, and solutions to work around that specific problem themselves somehow introduce more and new types of silent failure, in some kind of irony.

    • MenhirMike 2 years ago

      Yes! The worst is that even if you set up all kinds of reports etc. on what you expect, if the backup runs for weeks/months successfully, you just stop paying attention and then when something fails, you won't notice it.

      I do think that file systems that support snapshots - like ZFS, but I think LVM can be used for stuff like ext4, and Apple APFS does too - are the way to go. Not sure how well NTFS's Shadow Copies/Volume Shadow Service work; I heard horror stories, but not sure if those are one-off freak accidents. Probably worth considering ReFS anyway these days on a Windows Server. But with a snapshot, you're at least insulating yourself mostly from changes to the data you're backing up. At the expense of managing snapshots, that is: getting rid of old ones after a while because they keep taking up space.

      • MenhirMike 2 years ago

        (Edit: Though a snapshot of the file system isn't enough if you need to back up services that are currently running. E.g., a database server might have stuff uncommitted in memory that wouldn't be captured by a file system snapshot. But database backups are their own beast to wrangle.)

  • amluto 2 years ago

    This particular bug won’t be easily caught just by testing backups, as the bytes in the filesystem never actually change. So you can diff the bytes on disk between the live system and the backup, and they’ll match.

    I like to keep a separate database of what files I expect to have along with their hashes. The off-the-shelf tooling for this is weak, to say the least. Even S3’s integrity checking support is desultory at best, and a bunch of S3 clones don’t implement it at all (cough minio cough).

  • stevenAthompson 2 years ago

    I see this advice repeated frequently, but it's always very general.

    Do you have any advice as to HOW the average home NAS user can affordably backup modern NAS devices?

    The last time I looked it could easily cost hundreds of dollars per month to back up as little as 40TB to the cloud.

    • prmoustache 2 years ago

      Well data protection is expensive, nobody said the contrary.

      Back up what you value the most, ignore what you don't, and apply tiers depending on what needs to be kept but can be transferred back home slowly versus what you need immediately in case of a failure.

      My rules of thumb are:

      - always invest 3x the price of your hot live NAS storage in backups. If you can't afford to buy 40TB of storage, you can't afford to have 10TB of live storage. Period. The goal is to have at least one copy locally and one externally, and to have extra space on the backup storage to account for retention, changes, and to help with migrations.

      - if you can't afford 3 redundant storage systems (RAID), favor having 3 copies on non-redundant storage (no RAID) over having fewer copies on redundant storage.

      An additional tip to reduce cost and avoid expensive cloud offerings is to find a reliable and trustworthy relative or friend who can host the external copy of your backup. Nebula or Tailscale now make it very easy without having to configure routers and stuff. In exchange you can offer to host that person's backup storage.

      Also, digitizing material stuff is nice, but printing digital photos is also a great way to preserve copies. I'd rather save the photos I cherish the most than have 3 backup copies of 10TB of blurry or unremarkable photos. After years of having them all digitally, I am investing back in printing photos and making albums. You can also print a photobook multiple times and have copies stored at a relative's place.

    • vladvasiliu 2 years ago

      As the sibling says, 40 TB is not exactly "average home nas" territory. What I personally do, though I don't have 40 TB available even if I counted all my hard drives together, is I just have a second device that can hold the data and back up to it regularly.

      My NAS has something like 5 TB used. It's all synced to an old server that can hold about 8 TB and that's off most of the time (no fun living next to a jet engine). This cold server lives at my parents' house.

      My "really important stuff" on the NAS, which is a few hundred GB of pictures and such, is regularly backed up to a bucket with object locking.

      My "super important stuff", which is my company's accounting and other such documents, and lives on my laptop, is backed up to the live NAS and handled there as the really important stuff. I also back up my laptop to two normally offline external drives, one of which lives in my apartment and the other at my parents' house.

      Everything non-cloud is ZFS, so after each backup to an external drive or "cold NAS", I run a scrub to make sure it is still operational. The live NAS runs a scrub every Monday morning.

      Granted, this is not a "modern NAS" environment, since it made no sense to me to forego the free servers that my employer was going to send to the trash and buy some expensive off-the-shelf solution without the guarantees of ZFS (despite the issue TFA talks about). I know about power usage, but my live NAS eats less than 50W at idle (which is 99% of the time), so breaking even with the electricity prices in France would take forever.

    • xoa 2 years ago

      I agree with you completely that it's used in too trite a way. Which I think has echoes to backups and a lot of other "data hygiene" things in general (like doing backups at all initially, or strong passwords, or setting up new systems), which our industry has a long and unfortunate history of leaving manual and assigning a PEBKAC to, when what was really needed was more automation. Manual effort doesn't scale, and cost is absolutely a critical issue for a long tail of data owners. A fundamental part of the entire value of ZFS, and NAS for that matter, is automating away all sorts of issues surrounding data integrity, from checksumming to disk integrity to backups, and doing so in a way that's highly dependable.

      Which is how it should be. Yes bugs can happen but there's only so many 9s most of us can chase on our budgets. And "always test backups" in particular adds cost. Testing means restoring onto hardware that you can then use live, separate from your actual primary hardware or at a minimum on primary hardware with >2x the set size and enough performance to squeeze it in during downtime or around work. So yet another big increase in cost. "Testing backups" isn't trivial.

    • gosub100 2 years ago

      I have about that much data and LTO-6 (2.5TB per tape), and it's a huge PITA. I'm probably doing it wrong, but this is what worked for me: making an ext4 filesystem as a file, exactly 2500GB in size, formatting it, and stuffing it with data until there is < 5GB free. Take the checksum and manifest of that file, and write it to tape (takes 4 hrs without verify, plus another 1-3 hrs (can't remember now, it's faster) to verify). Repeat until your 40TB is done.

      I know you can use ZFS snapshots but I'm not experienced enough to trust that I could make a 20-40TB snapshot without screwing something up. Plus it's all video files, so I can roughly keep track of what's what and I can ignore the stupid LTO compression.

      It takes days, it's noisy, and very tedious. But that's #hoarderLyfe lol

    • dewey 2 years ago

      The “average home NAS user” doesn’t have 40TB of data. With a subset of data that’s important, like photos, it’s not that expensive, and with Backblaze and other services that are directly integrated into operating systems like Synology's, it's also not that hard to do.

      • samastur 2 years ago

        I agree with the advice, which is what we do. The average home user (with emphasis on average) doesn't have 40TB, but a "normal" non-professional one might.

        We have about 9TB of photos. I can easily imagine someone like us who is into video having more than 40TB of videos.

        • feanaro 2 years ago

          When will you ever be able to appreciate and look at 9TB of photos?

          • k1t 2 years ago

            You don't always immediately know which ones will be important.

            Today you might take 10 photos of your family and keep the best one where everyone is smiling.

            But 10-20 years from now you will probably appreciate having kept the other 9 where the baby is crying, the kid is making a face, and grandma has started to wander off.

          • doublepg23 2 years ago

            AI tools analyze photos pretty well now. It’s very common they bubble up old photos I had forgotten about.

            • fomine3 2 years ago

              Good point, now AI is a real good excuse for thoughtless data hoarding.

          • pferde 2 years ago

            When you're old and retired, and are reminiscing about your kids or grandkids back when they were small, or about past vacations.

            My parents tend to take a lot of photos whenever the family is together, and it used to bother me. Only in recent years I started to understand them.

            • hotpotamus 2 years ago

              I've passed through the other end of this. I spent a few hundred hours scanning my father's and grandfather's slides, negatives, and prints on high-end scanners in 2010. There were thousands of images, and since then that number has probably increased by several orders of magnitude with digital cameras and then phones. The sheer number is beyond human comprehension. Now that images are so trivial to make, I value curation much more than sheer number. I suppose it's always a quantity vs quality thing.

    • grepfru_it 2 years ago

      LTO. I bought an LTO-5 system to backup 6TB of critical data and 12TB of nice-to-have data. LTO-6 is better if you can afford it.

      Downside to tape backup is you need throughput, or the ability to do disk-disk backups

      • dist-epoch 2 years ago

        For 20 TB LTO seems too expensive.

        20 TB of SSD costs about $1000.

        Or you could get a 20 TB hard drive for $300.

        • grepfru_it 2 years ago

          Drive failure and managing those drives are hidden costs you are not considering.

          I have had multiple hard drives fail and been left stranded. Tape fails but not nearly as often as disks

          LTO6 and LTO7 are not expensive for 20TB

    • MenhirMike 2 years ago

      If you really need 40TB of irreplaceable data, then I think S3 Glacier Deep Archive might be worth looking at. According to the Amazon calculator it's something like $45/month, though of course the data might take a while to get ready if you need to restore it. There are other S3 Storage tiers as well, that are a bit more expensive but offer quicker recovery. Backblaze B2 looks like it would be about $240/month, which is IMHO also pretty reasonable for 40TB. I haven't calculated the initial traffic costs though, I assume the first upload might be a bit costly, but once it's up there, you just pay storage until you need to restore it.

      If you can figure out how to split the data into categories, you could save money as well. E.g., which of this data is truly irreplaceable - stuff like personal photos, source code, whatever it is that can never be re-created. If you're running a business, then stuff that needs to be available immediately in order to keep the lights on. Those things need to be on storage that also gets backed up daily, preferably in full, and preferably to multiple clouds.

      Stuff that can be re-created from sources (e.g., rendered outputs) is less critical, because in the worst case you can just spend some days/weeks to re-create it.

      Also consider regular offline backups - put it on a tape drive or on some hard disks/SSDs or even optical media (yes, it would take something like 400 BDXL disks to back up 40 TB, but I assume the data doesn't rapidly change) and put it in some offsite storage facility in case your place burns down.

    • jhot 2 years ago

      My cheap solution for large datasets is to buy a Raspberry Pi and external hard drive(s), set it up at a friend's or relative's house, and set up Syncthing. One friend has a copy of my ripped discs, my parents have copies of my photos, etc. Make sure the remote instance is in read-only mode.

      For sensitive data I would run something else that can be a Restic target so the backup data is encrypted; I currently use a cloud drive that supports WebDAV for that.

    • wil421 2 years ago

      I don’t try to backup my Plex library. Most of my family pictures and videos are on my MBP and I rsync the picture folder a couple times a month to the NAS. Every 6 months I get my cold storage 6TB drive and back up what I can. My MBP runs Backblaze so I have another backup of my most critical items.

    • FeepingCreature 2 years ago

      AWS S3 Deep Glacier is really cheap nowadays (at least in some zones), on the order of $1/TB. As an average home NAS user with 8TB of data, I've finally taken the plunge and started backing it up. It was never worth the cost before.

      • gallexme 2 years ago

        How much is recovery of, let's say, 500GB a month / one full restore a year?

        • FeepingCreature 2 years ago

          Googling says 2c/GB, cheaper (10x) in bulk.

          • kstrauser 2 years ago

            You might wanna double-check your math. I used the AWS pricing calculator, said I wanted to store 8000GB in Glacier Deep Archive in us-east-2, and wanted to recover it using 16000 API requests (wild guess). That, plus $0.05-$0.09/GB transfer came out to about $960 to recover.

            Glacier is always super cheap as long as you don’t need to recover, and then it’s ferocious.

    • hosteur 2 years ago

      I use restic to back up my NAS to Hetzner storagebox.

      Also, you can probably tier your data. Maybe you don’t need same level of backup for all your 40TB.

    • throw0101b 2 years ago

      > The last time I looked it could easily cost hundreds of dollars per month to back up as little as 40TB to the cloud.

      You only have to backup the data that is important to you and you don't want to lose in case your house gets robbed, floods, burns down, etc.

      If you don't mind losing 40T of data, you don't have to back it up at all.

      Otherwise get another NAS, install it at a family member's or friend's house, and set up a VPN between the two; then use rsync/zfs-send/whatever.

    • linuxdude314 2 years ago

      Cloud archival tier storage is much cheaper than that now.

      Glacier vaults in S3 are quite affordable these days.

  • andyjohnson0 2 years ago

    I'd be interested to know what tape setup you use? I occasionally look into using LTO tapes for home backup, but the media and hardware always seems a bit too expensive compared to something like Backblaze (which I currently use).

    Also afaik tapes need a stable storage environment: how do you manage that?

    • grepfru_it 2 years ago

      LTO5. The cost of my LTO5 system is the cost of downloading all of my data once from a remote cloud provider. It's a no-brainer.

      • linuxdude314 2 years ago

        As someone who used to admin a 30PB+ LTO library I love me some tape, but unfortunately it’s not that simple of a value proposition.

        Bit rot is less of a thing with LTO, but still a thing... I.e. you will at some point need to update your LTO system and its storage media.

        The robot I owned was the library storage for movie frames at a major motion picture studio. We would upgrade every other release, so while I was there we were upgrading from LTO-5 to LTO-7.

        The robot was big, and would write data to two redundant tapes. One copy would be sent to Iron Mountain, the other stayed in the robot.

        Creating a backup like you are isn’t really protecting much if you don’t have a good facility to store the tapes in.

        Part of the point of paying for a service like AWS Deep Glacier is that it’s an offsite backup.

        An LTO backup has no advantage over a hard disk if your home catches on fire.

        • grepfru_it 2 years ago

          Well I store two sets of offsite tapes. One at my colo site and one in a climate controlled storage facility. I rotate my tapes weekly and the inbound tapes get restored and compared against the disk backup.

          I also ran IT departments for the last 30 years, so you probably shouldn’t use me as the scapegoat :)

    • MenhirMike 2 years ago

      For tapes, LTO is really the only game in town, every other tape format is dead. You can get LTO-4 tape drives for dirt cheap because companies have been upgrading them. Yeah, they'll be used, but those drives are meant for heavy duty, and you can just pick up some spares. I found that IBM Fibre-Channel drives are available aplenty, cheap, and they usually come with a front bezel for installation into either two 5.25" slots or something like a Dell PowerVault 114X. (Unlike Library Drives that usually come naked and in non-standard form factors). A FibreChannel host adapter, some cables and transceivers, for probably less than $20 combined, and you're good to go. LTO-4 tapes hold 800 GB and are readily available new for affordable prices as well.

      I did upgrade to an LTO-5 drive last year or so, after finding a new-in-box one at a liquidation sale for something like $450. The nice thing about LTO is that it's 2 generations R/W and 3 generations read - so the LTO-5 drive will read/write LTO-5 (1.5 TB) and LTO-4 tapes, and read LTO-3 tapes. I think with one of the new standards (LTO-8?) it's a bit more muddy, so check compatibility.

      I think that LTO-4 and LTO-5 is the sweet spot for hobbyists: You still need to spend some money on a drive or two and buy brand new tapes, but it's reasonably affordable. That said, for a business, I'd just bite the bullet and buy a new drive. Dell sells an external SAS LTO-7 drive brand new for $3700 list price, but I think there might be cheaper options. Together with some tapes and a SAS Controller, I'd say that for $5000 you can get a decent, brand new setup.

      I put the tapes in Turtle LTO Cases (https://turtlecase.com/products/lto-20-black), and they sit in a closet. It's not climate controlled or anything, but the place is roughly at a similar temperature year round. The tapes aren't THAT sensitive, but I'd definitely not store them in the garage where I might get a 50+ degree temperature difference throughout the year. That said, there are companies that offer off-site storage options with climate controlled environments. I haven't looked into their pricing since I didn't need it, but the nice thing about tapes is that you can just backup to two tapes and send the second tape off-site. LTO has built-in encryption support, so that's an option.

      Twice a year or so, I run a restore of the tape and compare it to the SHA256 that I took while backing up the file (I did build myself some rudimentary cataloging system to SHA256 hash every file, then back it up to tape with tar, and make a record of what file with what SHA256 got backed up when on what tape). I've yet to encounter any bit rot/defective tape issues, but YMMV.

      I do use Backblaze's B2 service as well for cold-ish storage. Though I only back up truly irreplaceable or inconvenient to recreate data into B2. That way, I have multiple copies of truly important stuff, I have stuff readily available where I am, and I have terabytes of stuff that isn't worth the expense for the cloud since I can re-create it, but nice to have a copy of.

      Tape Drives may be overkill for many and external hard drives (plural!) might be a better option for many. What I like about tape drives is that the media isn't "hot". If I have ransomware running wild, connecting an external hard drive puts everything on it at risk (hence the need for multiple drives), whereas with a tape, it would have to specifically try to rewind the tape and start overwriting, and I would notice it. But YMMV, I never had a ransomware problem myself, but I do have stuff I really don't want to lose, so multiple backups of it in multiple ways (Daily .tar archive on a hard drive, backed up to tape, and backed up to the cloud) should hopefully give defense in depth and the ability to at least recover some older state.

      • andyjohnson0 2 years ago

        Thank you, really, for taking the time to write all this: extremely informative. I think this will be my priority for Q1 next year.

  • KingOfCoders 2 years ago

    After every backup, it needs to be automatically checked that it isn't corrupted. At minimum, check the file size and see if it can be decrypted / untarred. Ideally, check the latest data.

    A backup that isn't checked isn't done.

  • mekster 2 years ago

    The periodic reminder is to use at least 2 different backup implementations, rather than relying on just one, such as Borg.

hulitu 2 years ago

> This whole madness started because someone posted an attempt at a test case for a different issue, and then that test case started failing on versions of OpenZFS that didn’t even have the feature in question.

One would expect more seriousness from filesystem maintainers, and serious regression testing before a release.

  • amelius 2 years ago

    Shouldn't we expect formal verification methods, even? Or is that too much to ask for?
