Some system administrators running Ubuntu 20.04 had a rough time on June 8, when Ubuntu published kernel packages containing a particularly nasty bug that was caused by an Ubuntu-specific patch to the kernel. The bug led to a kernel panic whenever a Docker container was started. Fixed packages were made available on June 10, but there are questions about what went wrong with handling the patch; in particular, it is surprising that kernel 5.13, which has been beyond its end-of-life for months, made it onto machines running Ubuntu 20.04, which is supposed to be a long-term support release.
Ubuntu's kernel release lifecycle
Unless it is following a rolling-release model, a Linux distribution project will often pick a kernel branch and stick with it for the lifetime of a distribution release. For example, a release that ships with a 5.4 kernel, as Ubuntu 20.04 did, might receive updates to later 5.4.x kernels, but is unlikely to be upgraded to 5.15 until the next major release of the distribution. For this reason, such projects often prefer or even require a branch that has been designated as a long-term maintenance branch by the stable kernel team. It's easier for a distribution maintainer to sleep at night knowing that the version of the software they are shipping is supported upstream.
Debian and Red Hat both adhere to these rules when picking kernels for
their releases; Ubuntu claims to do so as well, at least for its LTS
(long-term support) releases. Those releases are made every two years, and
supported for five. They ship with a long-term stable kernel; Ubuntu provides
updates to that for the lifetime of the release.
Ubuntu also makes non-LTS releases at six-month intervals in between
the LTS releases. In contrast to the LTS releases, these releases are only
supported for about nine months, and are declared end-of-life (EOL) three months after
a newer release is made available. Because of their relatively short
shelf-life, Ubuntu does not restrict itself to long-term kernels for these
releases. The most recent non-LTS release, Ubuntu 21.10, shipped with
Linux 5.13, which is not a long-term branch. In fact, the 5.13
branch was declared EOL on September 18, 2021, almost a month before
Ubuntu 21.10 was released on October 14.
Users who prioritize stability highly value the long window of support that comes with an LTS release, but five years is a relative eternity in the world of hardware, particularly in fast-moving areas like graphics. In order to support newer hardware, Ubuntu periodically publishes new hardware enablement (HWE) stacks for its LTS releases. These consist of packages backported from the latest (possibly non-LTS) release. The HWE stack includes updated kernel packages, and may also include updated Xorg and Mesa packages.
According to Ubuntu, the HWE stack is enabled by default for new desktop installs of Ubuntu, but needs to be explicitly chosen for server installs. This opt-in policy also seems to only apply to users installing from the ISO image; the default Ubuntu 20.04 images on Amazon AWS, Azure, and Google Cloud all come with the HWE kernel pre-installed. Many system administrators (including me) choose the HWE stack for their servers as well, either out of a desire for features only available in newer kernels or out of a need for a kernel that works with their hardware.
When considered independently from each other, the decision to bypass long-term kernels for non-LTS releases and the decision to publish HWE kernels to extend the hardware support of LTS releases both seem reasonable. In combination, though, these two decisions can lead to a somewhat surprising situation; users running a "long-term support" distribution can end up running a version of Linux which is considered end-of-life by the kernel developers.
As of this writing, users running the HWE kernel on Ubuntu 20.04 will get a 5.13 kernel backported from 21.10. Ubuntu 22.04, which is the next LTS release, includes the 5.15 kernel, which is a long-term stable branch. This is currently available to 20.04 users under the name "hwe-20.04-edge". It will presumably replace the kernel from 21.10 as hwe-20.04 sometime before July 14, when Ubuntu 21.10 is itself EOL. For now, though, and for the past few months, anyone running the HWE kernel on 20.04 is running a kernel based on 5.13. Since the HWE kernel is the default kernel on all three major clouds, problems with it can affect a large slice of Ubuntu's users.
A tale of four filesystems
The HWE kernel allows using newer hardware and kernel features with Ubuntu LTS, but it seems that this may come with some cost to stability. The root cause of the kernel crash lies at the intersection of no fewer than four different filesystems, although none of them is a filesystem in the traditional sense of something that writes data to persistent storage.
The first is overlayfs. As the name might suggest, overlayfs allows overlaying the files in one directory (the "upper" directory, in overlayfs parlance) on top of the files in another (the "lower" directory). This results in a mount point that contains all of the files in both the upper and the lower directories; if both directories contain a file with the same name, overlayfs presents the version present in the upper directory. Any changes made to an overlayfs mount are reflected in the upper directory. The functionality provided by overlayfs is particularly valuable to container runtimes such as Docker that store container images as a series of layers; overlayfs provides an efficient way of constructing a container's root directory from these layers. It has been a part of the kernel since version 3.18 in 2014.
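The lookup rule described above can be modeled in a few lines of Python. This is only a toy analogy and says nothing about how overlayfs is implemented: the two directories are represented by plain dictionaries with made-up file names and contents, and `collections.ChainMap` is used because it happens to share the two relevant properties, namely that reads search the first mapping before the second and that writes land only in the first.

```python
from collections import ChainMap

# Hypothetical directory contents: file name -> file data.
lower = {"etc/hostname": "base-image", "bin/sh": "shell from the lower layer"}
upper = {"etc/hostname": "my-container"}

# ChainMap searches `upper` first, then `lower`, like a merged overlay view.
merged = ChainMap(upper, lower)

# A name present in both layers resolves to the upper version.
hostname = merged["etc/hostname"]

# A name present only in the lower layer is still visible.
shell = merged["bin/sh"]

# Writes go to the first mapping (the upper layer); the lower layer is untouched.
merged["etc/motd"] = "hello"
```

The analogy stops at lookups and writes; it does not attempt to model deletion of files that exist only in the lower directory.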
The second filesystem involved is AUFS, which does everything that overlayfs does, and a lot more, but its implementation is significantly more complex. AUFS weighs in at about 35,000 lines of code, whereas overlayfs is about 12,000. AUFS was first submitted for inclusion into the kernel in 2008, but was never merged; since then, it has continued to be maintained out-of-tree. Ubuntu included AUFS in its kernels through version 20.10, but dropped it in 21.04.
The third filesystem is shiftfs, which was originally created by James Bottomley in 2018 to allow remapping the user and group IDs in a mounted filesystem. While it has never been merged upstream, it has been included in Ubuntu's tree since the 5.0 kernel series. Canonical's LXD project can use shiftfs to speed up the creation of unprivileged containers, where the root user inside the container is mapped to a user other than root outside of it; otherwise, filesystems would need to have their user and group IDs changed to be used in that way. It is unlikely that shiftfs will ever land in Linus Torvalds's tree, though, as its functionality is entirely duplicated by the ID-mapped mounts that were added to the kernel in version 5.12. LXD has since been updated to use ID-mapped mounts when available.
The fourth and final filesystem in our story is procfs. As is generally known, each process running on a Linux system has a corresponding directory in /proc. Among a great many other things, each of these directories contains a subdirectory named map_files, which has a collection of symbolic links. Each link corresponds to a range of addresses in the process's address space that has been mapped to a file; the name of each link indicates the range of addresses that are mapped, and the destination is the file that is mapped to that range. For example:
$ ls -l /proc/$$/map_files/
total 0
lr-------- 1 jordan everybody 64 Jun 22 16:21 55e0cc120000-55e0cc14d000 -> /usr/bin/bash
lr-------- 1 jordan everybody 64 Jun 22 16:21 55e0cc14d000-55e0cc1fe000 -> /usr/bin/bash
...
The most prominent user of the map_files subdirectory is perhaps the Checkpoint/Restore In Userspace (CRIU) tool, which allows for "checkpointing" a process by serializing its entire state to disk, and later "restoring" it by recreating the process from its serialized state.
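The link names in map_files encode the mapped address range as two hexadecimal numbers separated by a dash. As a minimal sketch, the names from the listing above can be decoded as follows; `MappedRange` and `parse_map_files_name` are hypothetical helper names invented for this illustration, not part of any library or of CRIU.

```python
from typing import NamedTuple


class MappedRange(NamedTuple):
    start: int  # first address of the mapping
    end: int    # address just past the end of the mapping

    @property
    def size(self) -> int:
        return self.end - self.start


def parse_map_files_name(name: str) -> MappedRange:
    """Split a map_files entry name such as '55e0cc120000-55e0cc14d000'."""
    start_hex, end_hex = name.split("-")
    return MappedRange(int(start_hex, 16), int(end_hex, 16))


# The two entries shown in the listing above.
first = parse_map_files_name("55e0cc120000-55e0cc14d000")
second = parse_map_files_name("55e0cc14d000-55e0cc1fe000")
```

Here the first range ends exactly where the second begins, so the two links describe adjacent mappings of the same file.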
What does the patch do?
The patch that caused the kernel panic when creating Docker containers was intended to correct a problem when using overlayfs and shiftfs together. If a process mapped a file from such a mount, the symbolic link in map_files would point to the original "unshifted" version of the file, instead of the path inside the shiftfs mount. This broke checkpointing and restoring Docker containers, because the symbolic links in map_files pointed to files in filesystems that weren't mounted inside the container.
This problem was discovered early in 2020, and fixed shortly after the release of Ubuntu 20.04. At the time, AUFS was included in Ubuntu's kernel. The developers of AUFS had also faced challenges related to differentiating between the real name of a file and its alias inside of an AUFS mount. To address this, the AUFS patch introduces an additional field called vm_prfile to the kernel's vm_area_struct, which is populated with AUFS's name for the file. To fix the problem with overlayfs and shiftfs, Ubuntu's developers needed to keep track of a file's alias inside of a synthetic mount, and, since AUFS had already added vm_prfile for a similar purpose, they chose to reuse it instead of introducing another field. Knowing that their fix was dependent on AUFS being enabled, they also chose to guard it in an #ifdef block — if AUFS was not configured into the kernel, then the patch became a no-op.
How things went wrong
When Ubuntu's developers ported the shiftfs-related patches from their 5.8 kernel branch to their 5.13 and 5.15 kernels, the patch that corrected the problem with map_files and shiftfs was left out, because it depended on AUFS, which had been dropped from Ubuntu's kernel. When those kernels were backported to Ubuntu 20.04, where AUFS continues to be supported, the missing patch was noticed, and it was applied to Ubuntu's 5.13 and 5.15 trees as well.
Unfortunately, the internals of overlayfs changed over time in a way that eventually caused the patch to be incorrect. As a result, when a file on an overlayfs is mapped into memory, the function added by the patch attempts to release a reference to a struct file using fput(), but the structure had already been freed due to an earlier fput() call. That causes the kernel to panic.
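The failure mode described above, releasing a reference that has already been released, can be illustrated with a toy reference counter. This is a sketch in Python rather than kernel code: `ToyFile` is a made-up class, its `get()` and `put()` methods are only loose analogues of taking and dropping a reference to a struct file, and where this model raises a tidy exception the real kernel ends up touching freed memory.

```python
class ToyFile:
    """Toy stand-in for a reference-counted object; not the kernel's struct file."""

    def __init__(self) -> None:
        self.refcount = 1   # the creator holds the first reference
        self.freed = False

    def get(self) -> None:
        """Take an additional reference."""
        if self.freed:
            raise RuntimeError("use after free")
        self.refcount += 1

    def put(self) -> None:
        """Drop a reference (loose analogue of fput())."""
        if self.freed:
            raise RuntimeError("put on an object that was already freed")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def balanced() -> ToyFile:
    f = ToyFile()
    f.get()   # a second holder takes a reference
    f.put()   # the second holder is done
    f.put()   # the creator is done; the object is freed exactly once
    return f


def unbalanced() -> None:
    f = ToyFile()
    f.put()   # last reference dropped; the object is freed
    f.put()   # extra release on a freed object: raises RuntimeError
```

Every release must be paired with a reference that the caller actually holds; a change elsewhere that alters who owns which reference can silently turn a correct `put()` into an extra one.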
On Ubuntu 21.10, where 5.13 is the default kernel, this didn't cause any problems. Since AUFS is not enabled, the #ifdef block around the code introduced by the patch prevented it from being compiled into the kernel. The problem occurred when 5.13 and 5.15 were rebuilt for Ubuntu 20.04. Since an HWE kernel needs to support all of the features that are supported by the kernel it is replacing, AUFS was enabled in these builds, and the code containing the extraneous fput() was compiled in.
The problem was noticed in May, almost immediately after the patch was added back in. However, it appears that 5.13 was overlooked; the patch was reverted in Ubuntu's 5.15 branch and replaced with a version that did not call fput(), but the incorrect version remained in the 5.13 branch and made it into the 5.13 HWE kernel.
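The interaction between the #ifdef guard and the missed branch can be summarized as a small truth table. The sketch below only restates the conditions given in the text; the `KernelBuild` class and the build labels are shorthand invented for this illustration and do not correspond to actual package names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelBuild:
    name: str
    aufs_enabled: bool      # is the #ifdef-guarded code compiled in?
    has_extra_fput: bool    # does the branch still carry the incorrect patch?

    @property
    def panics(self) -> bool:
        # The bug only bites when the bad code exists and is compiled in.
        return self.aufs_enabled and self.has_extra_fput


builds = [
    KernelBuild("21.10 native 5.13", aufs_enabled=False, has_extra_fput=True),
    KernelBuild("20.04 HWE 5.13", aufs_enabled=True, has_extra_fput=True),
    KernelBuild("20.04 HWE-edge 5.15", aufs_enabled=True, has_extra_fput=False),
]

affected = [b.name for b in builds if b.panics]
```

Only the 5.13 build for 20.04 satisfies both conditions, which matches the kernel that reached users.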
According to the changelog, the problematic kernel package was built on June 3, although it may not have been published to Ubuntu's package repositories for some time afterward. The problem was reported on June 8. Until updated packages were made available on June 10, the only recourse available to affected users was to manually roll back to a previous kernel.
Conclusion
Maintaining an out-of-tree kernel patch for any length of time is an arduous task. As much as Linux has an iron-clad guarantee of user-space compatibility, it provides zero assurances about the stability of internal kernel interfaces between versions. Things that do not get merged often quickly fall by the wayside, due to the sheer level of effort required to keep up with changes elsewhere in the kernel.
When Ubuntu ships out-of-tree patches with its LTS releases, it is signing its kernel developers up for the task of maintaining them for at least five years, often across multiple branches of the kernel simultaneously. Sometimes these bets pay off; Ubuntu included overlayfs in its kernel before it was merged, and now it is maintained upstream. On the other hand, even though Ubuntu dropped support for AUFS in 2021, it shipped AUFS in 20.04 and so is on the hook for supporting it until 2025. Its latest LTS release, 22.04, still contains support for shiftfs; those patches will be hanging around in Ubuntu's tree until at least 2027. As the problem with the patch demonstrates, keeping these patches up-to-date is no simple task; changes in other parts of the kernel can and will cause problems, and catching them requires careful attention.
Based on those timelines, it doesn't seem like things are set to get any easier for Ubuntu's kernel developers anytime soon. Indeed, things may actually be destined to become harder; as the kernel now provides equivalent functionality, interest in these out-of-tree alternatives is likely to wane, which will place the burden of maintenance even more squarely upon Ubuntu's shoulders. The bets that don't pay off turn into debt, with compound interest.
In the end, it appears that Ubuntu fell victim to at least some level of self-inflicted complexity. Ubuntu's developers quickly caught and fixed the problem, but only in one of the affected branches. Unfortunately, the branch that was missed is the one that was shipped to users.
| Index entries for this article | |
|---|---|
| GuestArticles | Webb, Jordan |