Docker layers are a horrible dependency model


TL;DR: OCI layers are a list. Dependencies are a DAG. A list is a toposort of a DAG. Terrible.
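To make that concrete, here is a minimal sketch in Python. The package names and the `invalidated_*` helpers are made up for illustration, and it assumes the usual caching behavior where each layer sits on top of the previous one. It compares what has to be rebuilt when one dependency changes in a DAG versus in a flattened list.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: node -> set of things it depends on.
DEPS = {
    "libc": set(),
    "openssl": {"libc"},
    "zlib": {"libc"},
    "python": {"openssl", "zlib"},
    "nginx": {"openssl", "zlib"},
}


def flatten(deps):
    """Return one valid topological order of the DAG, i.e. a possible layer list."""
    return list(TopologicalSorter(deps).static_order())


def invalidated_in_dag(deps, changed):
    """Nodes that transitively depend on `changed`, plus `changed` itself."""
    dirty = {changed}
    grew = True
    while grew:
        grew = False
        for node, needs in deps.items():
            if node not in dirty and needs & dirty:
                dirty.add(node)
                grew = True
    return dirty


def invalidated_in_list(order, changed):
    """In a linear stack, everything from `changed` onward gets rebuilt."""
    return set(order[order.index(changed):])
```

With the explicit order `["libc", "openssl", "zlib", "python", "nginx"]`, changing `python` dirties only `python` in the DAG, but `python` and `nginx` in the list, even though `nginx` does not depend on `python`. The list threw that information away.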

Introduction

Docker was designed to allow even completely braindead idiots to deploy software. This is not a joke. Docker was developed at dotCloud, Inc., which offered a PaaS solution at the time. They noticed that their customers were struggling to deploy software on their platform. So they developed a packaging system for entire operating system images that worked with the paradigms that pure developers were used to. Namely: VCS commit history.

Now, anyone who knows a little more about installing operating systems (low bar, I know) than your average pure developer can tell you that a literally materialized history of changes that the system has undergone is a fucking terrible way to describe OS build instructions. Anyone could come up with a more efficient system than that.

But for dotCloud, it didn't matter how horribly inefficient this system was. After all, if customers required 2GB worth of disk space to run a simple Hello World web server, who profited? The PaaS provider.

Why VCS history is useful, and system mutation history isn't

Source code is written for two reasons: To describe an exact behavior to a machine and to make it possible for humans to understand and modify that description. Having an exact history of the bytes that are in those source code files is useful because the logical content exists in almost the simplest form possible. Structural changes that don't affect the logical meaning of the content (such as running a code formatter after the fact) are rare occurrences.

The files of an operating system constantly change in logically meaningless ways: Cache files are modified, the internal layout of binaries changes between updates, and so on. It makes no sense to capture these changes, yet this is precisely what Docker does.

Humans care about the source code and its history. Compilers care about the source code as it is right now. The runtime environment does not care about the source code at all. Docker treats an OS image as if it were source code, and your deployment target is forced to know exactly which data was downloaded into your package manager's metadata cache when the software was built.
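As a rough illustration of the metadata cache point, the sketch below hashes two hypothetical file trees that differ only in a cache timestamp. The paths and contents are invented, and `layer_digest` is a simplified stand-in for content-addressing a layer with SHA-256, not the real tar-based format.

```python
import hashlib


def layer_digest(files):
    """Simplified stand-in for a layer digest: SHA-256 over sorted path/content pairs."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(b"\0")
        h.update(files[path])
        h.update(b"\0")
    return "sha256:" + h.hexdigest()


def without_cache(files):
    """Drop the logically meaningless cache files before hashing."""
    return {p: c for p, c in files.items() if not p.startswith("/var/cache/")}


# Two hypothetical builds of the same software, run a minute apart.
build_a = {
    "/usr/bin/hello": b"\x7fELF...same binary...",
    "/var/cache/pkg/index": b"fetched-at=2024-01-01T00:00:00Z",
}
build_b = {
    "/usr/bin/hello": b"\x7fELF...same binary...",
    "/var/cache/pkg/index": b"fetched-at=2024-01-01T00:01:00Z",
}

# Same software, different digests: the cache timestamp leaks into the identity.
print(layer_digest(build_a) == layer_digest(build_b))
# Ignoring the cache, the two builds are indistinguishable.
print(layer_digest(without_cache(build_a)) == layer_digest(without_cache(build_b)))
```

The only thing separating the two builds is a file that nothing at runtime will ever read, yet the deployment target has to treat them as different layers.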

How dependencies work

Layers: How dependencies don't work

Of course, nobody uses docker commit anymore (right?), so surely the linear history design isn't actually a problem, right?

But what about multi-stage builds?