Making Backup Validation Easier

brokensandals.net

41 points by jaw 6 years ago · 17 comments

gruez 6 years ago

This seems worse than just hashing the file. Random bit flips will probably go undetected using this method, but won't be with hashing.

  • jawOP 6 years ago

    I'm mostly trying to address cases where there is no original file that I fully trust. If I'm exporting my data from some web app/service, I can't get a hash of the data as it is in the actual source of truth on their servers, and there's multiple points at which an error could be introduced before the completed export file lands on my machine.

    It's a good point that hashing is a better method when you have access to the original files.

    • close04 6 years ago

      > and there's multiple points at which an error could be introduced before the completed export file lands on my machine

      Aren't all bets off at this point? I mean, validating the backup seems like skipping steps if you're not validating the source. Scrolling through thumbnails is better than nothing, sure. But it's really prone to false negatives. Corrupted images can look good in a thumbnail, and your eyes might just miss even glaring corruption because you scrolled too fast. If it's not an image file, it gets more challenging.

      You seem to have one of those corner cases where basically no automated method can solve your problem but the volume of data is just low enough to alleviate the issues with a bit of manual intervention.

      • jawOP 6 years ago

        > Scrolling through thumbnails is better than nothing, sure.

        "Better than nothing" is pretty much what I'm going for here. Almost all my personal data stored in cloud services falls into this "corner case": I only have indirect access to the source, it's important enough to me that I want to do some level of checking, but it's not important enough to spend the huge amounts of time it would take to inspect every individual datum.

  • akie 6 years ago

    The contents of my backups are never the same from one day to the next, so hashes would be useless.

    • HideousKojima 6 years ago

      You don't need hashes to match between days at all, though. You simply hash the file that was just backed up and the backup copy of it, then compare the two.
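
      In shell terms, that per-file check might look roughly like this (the paths are made up for illustration):

          src="$HOME/docs/report.txt"       # file that was just backed up
          dst="/mnt/backup/docs/report.txt" # its backup copy
          if [ "$(sha256sum < "$src")" = "$(sha256sum < "$dst")" ]; then
              echo "OK: backup matches source"
          else
              echo "MISMATCH: backup differs from source" >&2
          fi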

      • tonyarkles 6 years ago

        I guess it depends somewhat on what you're backing up and what the anticipated failure modes might be. As an example, if there was a bug in my todo software that deleted a bunch of entries, the hash scheme wouldn't pick that up. You've just successfully backed up corrupted data, and you're not aware of it. SQL dumps would be another good example of this. If one day you do a backup and the backup reports that it has archived significantly fewer rows than yesterday, you know something's up. Maybe a fault lost some data, maybe the archiver is broken, etc.
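
        A rough sketch of that kind of day-over-day check (filenames made up, and assuming the dump uses one INSERT statement per row):

            old=$(grep -c '^INSERT' backup-yesterday.sql)
            new=$(grep -c '^INSERT' backup-today.sql)
            echo "rows: yesterday=$old today=$new"
            # flag a suspiciously large drop
            if [ "$new" -lt $(( old * 9 / 10 )) ]; then
                echo "WARNING: row count dropped by more than 10%" >&2
            fi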

        • gruez 6 years ago

          What you're describing is significantly harder than what's described in the blog post. Not only do you have to validate that a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (i.e., catch cases where the file format is valid but a chunk of it is missing).

          Most people solve this issue by keeping multiple versions, not by trying to "validate" the backups somehow.

          • tonyarkles 6 years ago

            Maybe I'm misinterpreting the output from their backup tool, but isn't that exactly what it's doing?

                Metrics for todoist-fullsync:
                
                name   1 days ago  8 days ago
                ------------------------------
                files  1           1
                bytes  82363       86661
                items  85          87
            
            The "items" line there seems like it's actually parsing the file and counting the number of entries in it? It's also captured in point #2: "Can be intuitively evaluated as plausible or suspicious. If the number of tasks in my Todoist export dropped from dozens to 1, that would be cause for concern."
            • jawOP 6 years ago

              Right. Aside from 'files' and 'bytes', the metrics are just the result of running shell commands specified in a config file. In this case it's `jq '.items | length' $PARCEL_PATH`, i.e., parse the file and print the length of the attribute named "items".

              Obviously, that won't catch all potential problems in the file, but it's a low-effort way to catch some.

          • jawOP 6 years ago

            I keep multiple versions as well, and also use third-party backup software on all these files. These techniques are meant to be part of something analogous to a 'defense in depth' against errors in the backup process, not thorough or foolproof.

            > Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).

            But you don't have to do that perfectly to get value out of it; for example:

            - If the .json file parses as JSON, then at least you probably didn't truncate the download mid-stream.

            - If it also contains a particular attribute, then you probably didn't save a structured error response instead of the actual data, or save the output of an endpoint that changed so radically that its responses may no longer be adequate.

            - If it also has roughly the number of elements you expect, you probably didn't miss entire pages of the response.
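
            In jq terms, those three checks might look something like this (the filename, attribute name, and threshold are just illustrative):

                f=todoist-export.json
                jq empty "$f" || echo "not valid JSON (likely truncated)" >&2
                jq -e 'has("items")' "$f" > /dev/null || echo "missing 'items' attribute" >&2
                n=$(jq '.items | length' "$f")
                [ "$n" -ge 50 ] || echo "only $n items; expected dozens" >&2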

      • derekp7 6 years ago

        This works except in the case where your backups include live database files (where you put the database in extended logging mode, back up the data files while they are being modified, then back up the logs).

        I haven't found a good way to verify these without doing a full database restore and seeing if the logs apply cleanly, along with having the DB do internal checks.

close04 6 years ago

I think making a list of the files to be copied and their hashes, then a list of the files that were copied and their hashes, then comparing the two lists should provide an even quicker way to validate. Or even hashing the entire source and destination (a hash of the list of hashes) and providing both values to the user to visually compare.
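
Something along these lines, for instance (directory names are just for illustration):

    # hash everything under source and destination, using paths relative to each root
    (cd /data/source && find . -type f -exec sha256sum {} + | sort) > source.sums
    (cd /mnt/backup && find . -type f -exec sha256sum {} + | sort) > backup.sums
    diff source.sums backup.sums && echo "all files match"
    # or a single "hash of the list of hashes" per side, to compare by eye
    sha256sum source.sums backup.sums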

As far as I can tell the method described in the article doesn't really validate the backups in any way, just provides some statistics that will fail in very plausible ways.

And of course, if the data is important to you and there are special circumstances that could affect the process, nothing beats an actual restore test.

  • jawOP 6 years ago

    I replied to a similar point about hashing here - https://news.ycombinator.com/item?id=23032633

    You're correct that the methods I described are a far cry from actually guaranteeing that the backup has no errors. In the same way that a unit test doesn't prove code is error-free, but _can_ justify increased confidence in the code, I'm interested in techniques that can justify increased confidence in my backups. Particularly in cases where I don't have direct access to the original data, and where exhaustively checking the data manually is too time-consuming to be worth it.

wila 6 years ago

What I did was move all my work into a VMware virtual machine.

Then I wrote software for backing up VMs automatically (disclaimer: this is a commercial product I sell).

There are options for getting an email on success, failure, or both. The VM files are all hashed.

VMs are easy to restore, so an actual restore test is easy to do without risking overwriting the original. If a file hash does not match on restore, my software will complain but continue the restore anyway.

FWIW, all my code etc... is also in source control, so I am not relying on a single layer for that.
