Ask HN: Tool for Validating Filesystem Structure
My backend application is split into an SQL database and a filesystem with user data.
Each folder in the root application directory belongs to a user (identified by id), and subsequent subfolders identify a project. There are certain requirements for the name of each user folder and each project folder, and inside each project folder some other files and folders must necessarily exist, while extra ones should never appear.
I'm wondering if a tool or type of filesystem exists that lets me enforce structure, much like an SQL database would enforce relationships and datatypes in a given column. I could have written a script to do such validations, but it seems like a frequent enough problem that a general-purpose tool should exist, and it would be easier to maintain a bunch of rule files than a complex script.
Validations would include:

- in this depth of the filesystem, only folders can exist (no files), and they should match this regex
- files named "projectCover.png" under a project folder should not exceed a certain file size
- each project folder should have a folder named "uploads", and for each file inside it, a "thumbnails/<file>.png" should also exist.
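For concreteness, a hand-rolled version of checks like these might look roughly like the sketch below; the regexes, the size limit, the data root, and the assumption that user folders are numeric ids are all invented placeholders:

    import re
    from pathlib import Path

    USER_RE = re.compile(r"^\d+$")        # assumption: user folders are numeric ids
    PROJECT_RE = re.compile(r"^[\w-]+$")  # assumption: project folder naming rule
    MAX_COVER_BYTES = 2 * 1024 * 1024     # example limit for projectCover.png

    def validate(root: Path) -> list[str]:
        errors = []
        for user in root.iterdir():
            if not user.is_dir() or not USER_RE.match(user.name):
                errors.append(f"bad user folder: {user}")
                continue
            for project in user.iterdir():
                if not project.is_dir() or not PROJECT_RE.match(project.name):
                    errors.append(f"bad project folder: {project}")
                    continue
                cover = project / "projectCover.png"
                if cover.is_file() and cover.stat().st_size > MAX_COVER_BYTES:
                    errors.append(f"cover too large: {cover}")
                uploads = project / "uploads"
                if not uploads.is_dir():
                    errors.append(f"missing uploads/: {project}")
                    continue
                for f in uploads.iterdir():
                    thumb = project / "thumbnails" / (f.name + ".png")
                    if f.is_file() and not thumb.is_file():
                        errors.append(f"missing thumbnail for: {f}")
        return errors

    if __name__ == "__main__":
        for err in validate(Path("/srv/app/data")):  # data root is a placeholder
            print(err)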
Another validation isn't as trivial: database rows have file paths which should point to images in the filesystem, and I would like to validate that these indeed exist. Storing the binaries in the database would remove these possible data inconsistencies, but has practical issues.

Frequently people will ask for a very specific thing without understanding the greater context of what they are asking for. If they knew the greater context, they would be asking different questions. I suspect this question is one of those, but since it feels foreign to me, I am probably the one who doesn't understand the greater context. Why don't you need replication? What are the limitations you have (economic?) that resulted in this architecture? An example question might be: what value does the directory structure add over just hashing the file, storing the file by its hash in the same directory with all the other files, and then referencing the hash in your SQL?
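As an aside, the store-by-hash layout being suggested here looks roughly like the sketch below; the blob directory and the SQL schema are invented for illustration:

    import hashlib
    from pathlib import Path

    BLOB_DIR = Path("/srv/app/blobs")  # assumption: one flat directory of content-addressed files

    def store_blob(payload: bytes) -> str:
        """Write the payload under its SHA-256 hash and return the hash."""
        BLOB_DIR.mkdir(parents=True, exist_ok=True)
        digest = hashlib.sha256(payload).hexdigest()
        (BLOB_DIR / digest).write_bytes(payload)
        return digest

    # A database row then references the returned hash instead of a path, e.g.
    #   UPDATE projects SET cover_hash = ? WHERE id = ?
    # (the table and column names here are hypothetical).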
You make good points.

> Why don't you need replication?

We do not need replication right now because our user base is low, but I don't see how a hierarchical file structure would prevent us from doing it.

> what value does the directory structure add over just hashing the file, storing the file by its hash in the same directory with all the other files, and then referencing the hash in your SQL?
This would remove the need for validating the hierarchical file structure. However, having one makes it easier to simply navigate the filesystem with pre-existing tools (find, grep, or bash scripts) to help debug issues, find certain projects, etc. Using your solution would abstract away the filesystem, and it would become a black box to us without the database's aid. This would also make it harder to spot certain problems - e.g., is an image being referenced in a project it doesn't belong to? It also doesn't help me validate max file sizes per type of file, nor does it help me check that all paths in the database actually exist in the filesystem. Having a file's name be its hash is an interesting way to assure integrity, though.

> We do not need replication right now because our user base is low

What happens when the hard drive or hard drives in this machine fail?

> I don't see how a hierarchical file structure would prevent us from doing it.

Because you are tied to one machine. Replication across machines creates a consistency problem. How do you ensure consistency? Either you write to N machines, which means when one fails it doesn't get written to and you have to reconcile later, or you write to one and replicate to another. (I am not a database guy and I'm on a programming hiatus, so maybe I don't understand.) If you cronned rsync or a backup script, how much data are you willing to lose? If a change is made right after a backup happens, is it a big deal to lose it if the machine fails before the next backup?

The hash question was more to assess why your files need to be in a hierarchical structure rather than in a key-value store, where the key is the file hash and the value is the contents of that file. I was imagining an example where nginx serves files directly out of these folders. If you think you are going to grow significantly, tying information directly to a machine will create incredible operational pain in the future. Abstracting it away from a machine lets you do things like stick your data into s3, which will let you avoid a large number of operational pitfalls.

> Using your solution would abstract the filesystem

That's where I am going. If you do grow and your data outgrows a machine or machines, I imagine you will have to do this. If your data is in s3, I imagine you are going to define config files which get consumed both wherever your "write" call is made and by a script that validates those assumptions and emails a report of violations. I would assume a client and a cronned validation script that look something like:
    class ValidatedS3Writer:
        """Refuse any write that does not pass validation first."""

        def __init__(self, s3, validator):
            self.s3 = s3
            self.validator = validator

        def write(self, path, payload):
            # reject writes that violate the layout/size rules before they reach storage
            self.validator.validate(path, payload)
            self.s3.write(path, payload)

    # S3, Validator, and VALIDATION_CONFIG_YAML are placeholders in this sketch
    s3 = ValidatedS3Writer(S3(), Validator(VALIDATION_CONFIG_YAML))
----
    if __name__ == "__main__":
        # cronned job: re-validate every path the database knows about
        validator = Validator(VALIDATION_CONFIG_YAML)
        for path in db.all_paths():
            validator.validate(path, s3.read(path))  # or s3.metadata(path)
Operational pain comes from how state is managed, SPOFs, and consistency models. So anytime you have a SPOF, it merits thinking about what it would take to mitigate it.

Anyway, I don't know your use case in detail; if this is a premature optimization, well, that is the root of all evil. However, if it is not an optimization but an architectural decision, I would encourage you to think about:

1. What happens when a machine fails?

2. What happens when you hit some machine-based limit?

If these files in a hierarchy rarely change and are pointed to by, say, an nginx config, then making sure your backups are sane (and testing them, actually testing them) and writing a validation script is probably a better investment, one that allows you to spend more time on product.

Thank you for the input! I was thinking about the BSD utility mtree https://man.freebsd.org/cgi/man.cgi?mtree(8) and then I remembered the recently posted https://github.com/mactat/framed
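For the database-to-filesystem check asked about at the top of the thread, a minimal cron-able cross-check might look like the sketch below; the SQLite file, the table and column names, and the data root are all invented placeholders:

    import sqlite3
    from pathlib import Path

    ROOT = Path("/srv/app/data")  # assumption: application data root

    def dangling_references(db_file: str = "app.db") -> list[str]:
        """Return paths referenced in the database that no longer exist on disk."""
        db = sqlite3.connect(db_file)
        missing = []
        # schema is hypothetical: a projects table with a cover_path column
        for (rel_path,) in db.execute("SELECT cover_path FROM projects"):
            if not (ROOT / rel_path).is_file():
                missing.append(rel_path)
        return missing

    if __name__ == "__main__":
        for p in dangling_references():
            print(f"dangling reference: {p}")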