Lock Files Considered Harmful


TLDR: lock files considered harmful; use lock files.

One easy way to ensure consistency/reproducibility is to reference external things with a hash

That is exactly what a lock/sum/etc. file does: it provides a hash for a referenced external thing so that when I go to rebuild, the result is consistent and reproducible.
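To make that concrete, here is roughly what one entry in npm’s package-lock.json looks like (the integrity digest below is a placeholder, not the real one); other ecosystems’ lock/sum files pair the same three things: a name, a resolved location, and a hash.

```json
{
  "packages": {
    "node_modules/left-pad": {
      "version": "1.3.0",
      "resolved": "https://registry.npmjs.org/left-pad/-/left-pad-1.3.0.tgz",
      "integrity": "sha512-<placeholder digest>"
    }
  }
}
```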

rather than relying on some arbitrary name or HTTP URL that isn’t dependent on the content

This is a quality-of-life improvement. You can actually define hashes directly within Python requirements.txt files and thereby not have a “lock file”. When I used to work with Python I hated using these hashes: they were hard/annoying to figure out (at the time, maybe it has gotten easier) and, depending on which lock files we are talking about, they did not improve the developer experience.

Golang’s sum file is, IMO, perfect. I tell go to update deps, it retrieves the updates I want and updates the hashes for me in the sum file. If I want to be very specific about updates, I can ask go to add one specific dep to my repo and add it to my sum file. If I want to take it to the next step I can vendor my dependencies, and the sum file still guarantees that the vendored files in the repo have not been tampered with, hopefully even during transport (although if you can tamper with files you can tamper with the hashes in the lock file…)
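Roughly what I mean, as a sketch (the package name, module path, and digest below are placeholders, not real values):

```sh
# Python: hashes can live right in requirements.txt, e.g.
#   somepkg==1.2.3 --hash=sha256:<64 hex characters>

# Go: the tooling maintains go.sum for me
go get -u ./...                     # pull the updates I want; go.sum is rewritten
go get example.com/somedep@v1.2.3   # or add/pin one specific dependency
go mod tidy                         # record exactly what the module needs
go mod vendor                       # optionally copy dependency source into ./vendor
go mod verify                       # confirm downloaded modules still match go.sum
```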

The core problem is that humans are bad at staring at hashes and comparing them to make sure they match what is expected. This is the same reason URLs rarely point to IP addresses directly for the general public: people are bad at numbers but good at names/words. That’s where lock files come in as the best middle ground we currently have. I define the name of a package and some arbitrary range or extremely specific version(s), and the language’s tooling converts that into exactly what you want: a guaranteed list of hashes and references to external content. Which could still disappear at any time.
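Concretely, in Go the human-readable half lives in go.mod and the hash half lives in go.sum (a real module path for illustration; hash values elided):

```
go.mod:  require github.com/pkg/errors v0.9.1

go.sum:  github.com/pkg/errors v0.9.1 h1:…
         github.com/pkg/errors v0.9.1/go.mod h1:…
```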

Hashes can be given alongside such legacy mechanisms

HTTP/S is legacy? Or am I misunderstanding what is being called legacy here?

Git is a popular way to do this (its commit IDs are content hashes), although other Merkle-trees/chains like IPFS, BitTorrent, etc. may be better at distribution.

Git hashes are still bad for humans, so bad that most of us look only at short hashes or use tools to compare the full hashes for us. Or, worse yet, we completely ignore them and just copy/pasta.
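A couple of everyday git commands make the point (it is the shape of the output that matters, not any particular hash):

```sh
git rev-parse HEAD          # the full 40-hex-character commit ID
git rev-parse --short HEAD  # the abbreviation most of us actually read
git log --oneline -5        # short hashes; tooling does the real comparing
```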

Other than git over SSH, git is typically still used over HTTPS. The native git protocol can only be used to read public repositories; last I checked, there is no auth on that protocol.

IPFS does have hashes, but the content is not guaranteed to continue to exist, so you will eventually have missing content, which is arguably worse. IPFS also uses the hash as the reference, so it effectively cuts out the human-readable reference unless you use IPNS (its name service). So if you were to use IPFS you’d still end up with a list of raw hashes in a file, maybe with comments as to what the hashes point at, which IMO sounds awful. How do I know what hash 1 is? Load it? Which implies trusting that hash.
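A purely hypothetical sketch of the kind of file I mean (the CIDs are made-up placeholders, not real content):

```
# deps.cids — hypothetical pin list
# left-pad 1.3.0 source tarball (I think?)
bafybe<placeholder>
# …and whatever this one was
Qm<placeholder>
```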

BitTorrent has the same issue: with no seeds, and peers that collectively don’t have 100% of the content, you have lost the data unless a seed comes back online. BitTorrent also uses hashes plus a way to reference the resource it wants (a lock file, effectively). A torrent file typically looks like a lock file: the content is chopped up into blocks, and those blocks have reference pointers and hashes.
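A rough sketch of what a .torrent’s metainfo carries (BitTorrent v1; it is bencoded in reality, and these values are illustrative):

```
announce:           http://tracker.example/announce   # where to find peers
info.name:          dataset.tar
info.piece length:  262144                            # content split into fixed-size pieces
info.pieces:        <concatenated 20-byte SHA-1 hashes, one per piece>
```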

Computers by definition are meant to change and be changed, and the content on them is subject to the same thing. There is only so much you can do about it; hashes and lock files seem like a small price to pay to get close to perfect reproducibility.

I just don’t see a way to escape lock files (a hash plus a reference/resource pointer, for human readability).