The difficulty of safe path traversal

Aleksa Sarai, as the maintainer of the runc container runtime, faces a constant battle against security problems. Recently, runc has seen another instance of a security vulnerability that can be traced back to the difficulty of handling file paths on Linux. Sarai spoke at the 2025 Linux Plumbers Conference (slides; video) about some of the problems runc has had with path-traversal vulnerabilities, and to ask people to please use libpathrs, the library that he has been developing for safe path traversal.

Sarai began by defining what he meant by path safety. There are two kinds of path safety that are relevant to runc, he said. The first is "regular" path safety, which applies to any application working with files: when operating on a path, one of the components might change unexpectedly. For example, a program could be reading several files from a directory using absolute paths, only for a directory in the middle of the path to be changed, causing the program to see a mixture of files from different directories. That kind of time-of-check-to-time-of-use (TOCTTOU) error comes up all of the time in path-handling code. He shared a slide showing 14 different CVEs in runc since 2017, all of which involved this kind of problem. LWN covered one in 2024 and one in 2019 that were particularly noteworthy. The second kind of path safety deals with the peculiarities of virtual filesystems, and needs to be handled separately, Sarai added.
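As a rough illustration of that failure mode (a sketch for this article, not code from runc or libpathrs), the following Python reads two files from a directory twice: once through full paths, and once through a file descriptor for the directory. The helper names read_all(), read_all_fd(), and demo() are invented for the example, and the swap callback stands in for an attacker who, in reality, would be racing from another process.

```python
import os
import tempfile


def read_all(dir_path, names, swap=None):
    """Read each file via a full path; the path is re-resolved every time."""
    out = []
    for i, name in enumerate(names):
        if swap and i == 1:
            swap()  # simulate a rename racing with us between two reads
        with open(os.path.join(dir_path, name)) as f:
            out.append(f.read())
    return out


def read_all_fd(dir_path, names, swap=None):
    """Resolve the directory once, then open files relative to that fd."""
    dir_fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        out = []
        for i, name in enumerate(names):
            if swap and i == 1:
                swap()
            fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
            with os.fdopen(fd) as f:
                out.append(f.read())
        return out
    finally:
        os.close(dir_fd)


def demo():
    """Run both readers while 'data' is swapped for 'other' mid-read."""
    results = []
    for reader in (read_all, read_all_fd):
        with tempfile.TemporaryDirectory() as base:
            for d, tag in (("data", "old"), ("other", "new")):
                os.mkdir(os.path.join(base, d))
                for name in ("a", "b"):
                    with open(os.path.join(base, d, name), "w") as f:
                        f.write(f"{tag}-{name}")
            data = os.path.join(base, "data")

            def swap():
                os.rename(data, data + ".bak")
                os.rename(os.path.join(base, "other"), data)

            results.append(reader(data, ["a", "b"], swap))
    return results


if __name__ == "__main__":
    print(demo())
```

The path-based reader ends up with one file from each directory, while the descriptor-based reader keeps seeing the directory it originally opened, even after that directory has been renamed.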

There are some partial solutions to regular path safety, Sarai said. The openat2() system call, for example, lets programmers specify to the kernel how it should handle certain kinds of ambiguities. On the other hand, openat2() is often blocked by seccomp() filters: its flags are passed in a structure behind a pointer, and seccomp filters cannot dereference pointers, so they cannot check which flags were requested. Where openat2() is unavailable, the only real alternative is to implement path traversal in user space, by opening each directory along the path by hand. That is "quite finicky, but you can do it." The Go programming language recently added standard library support for doing that kind of traversal, for instance. Sarai's recommendation is to use libpathrs, the MPL-and-LGPL-licensed Rust library (with C bindings) that he has written to do this delicate dance correctly.
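To give a feel for what "opening each directory along the path by hand" involves, here is a deliberately simplified Python sketch. It is not the algorithm used by libpathrs or by Go's standard library: those resolve symbolic links and ".." components safely, whereas this hypothetical open_beneath() helper simply refuses both, which is easier to get right but far more restrictive.

```python
import os


def open_beneath(root_fd, path, flags=os.O_RDONLY):
    """Open `path` below `root_fd`, walking one component at a time.

    Simplification (an assumption of this sketch): '..' is rejected and
    symbolic links are never followed, so legitimate links fail too.
    """
    parts = [p for p in path.split("/") if p not in ("", ".")]
    if not parts:
        raise ValueError("empty path")
    if ".." in parts:
        raise PermissionError("'..' components are not allowed")

    cur = os.dup(root_fd)  # private copy, so the caller's fd is never closed
    try:
        # Every intermediate component must be a real directory.
        for part in parts[:-1]:
            nxt = os.open(
                part,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=cur,
            )
            os.close(cur)
            cur = nxt
        # The final component may be any file type except a symlink.
        return os.open(parts[-1], flags | os.O_NOFOLLOW, dir_fd=cur)
    finally:
        os.close(cur)
```

Because each step is an open relative to a descriptor that is already held, renaming or replacing a directory elsewhere in the tree cannot redirect the walk partway through; at worst, the lookup fails.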

The second kind of path safety that runc has to deal with, above and beyond what most applications deal with, is what Sarai calls "strict" path safety. Some virtual filesystems, such as procfs, require extra attention to detail to use safely. On a normal filesystem, one does not typically care which exact inode one opens, so long as it is for a file located at the right path. But on procfs, it's important to operate on the exact file the program was expecting: "overmounts or fake mounts can trick us into doing dangerous things." For example, a program that reexecutes itself using /proc/self/exe would be quite broken if a user bind-mounted a different file into that path.

This isn't a new problem. For regular files, one can use the RESOLVE_NO_XDEV option to openat2() to avoid crossing any filesystem boundaries. But that doesn't work for procfs's magic links (such as /proc/self/exe) "for some reason." [Sarai later wrote to me to clarify that the reason is actually straightforward: since /proc/self/exe refers to an executable file on another filesystem, of course RESOLVE_NO_XDEV blocks it. The trick is making sure that the magic link hasn't been overmounted with a reference to the wrong other filesystem.] The solution is to use the filesystem mount API to obtain a file descriptor for the root of procfs that no other process can interfere with.

That solution doesn't work inside nested user namespaces, however. So, programs that need to work inside nested namespaces (such as runc, when people want to nest containers) have to resort to some extremely delicate checks that particular paths haven't been overmounted.

Sarai expected some people to be skeptical that this was a real problem, on the basis that only root can mount things. That's not really true in the context of runc, though. There are, of course, mount namespaces to deal with. But there is also the fact that runc is used as a backend by a large number of different high-level containerization programs. Those programs often give users a lot of flexibility to configure container mounts, which can result in unprivileged mounting into containers that runc is working with, sometimes in ways that permit a program to escape the container. Almost all of the vulnerabilities in runc have been misconfiguration bugs, he said. Sometimes, runc can prevent those by recognizing an obviously invalid configuration, and sometimes the higher-level programs need to be patched to avoid generating the problematic configuration in the first place.

He then went through a series of pop quizzes to illustrate the difficulties that come up in configuring a container. Two of runc's most recent vulnerabilities, CVE-2025-31133 and CVE-2025-52565, involved using a mount and a symbolic link to trick runc into giving a container access to /proc/sys/kernel/core_pattern, the procfs file used to configure the kernel's core-dump handler. Writing into that file could let processes escape from inside a container.

In the talk, Sarai didn't go into detail on why setting the core-dump handler could enable an escape, but he was probably referring to the problem Christian Brauner highlighted in May. When the kernel launches the configured core-dump handler, it runs with full privileges in the root namespace; if the container can configure the core-dump handler to be a file that it controls, this allows it to effectively take over the system.

The solution in runc was to use much stricter validation of special inodes, to move to libpathrs for path traversal, and to use TIOCGPTPEER to validate that console files are really console files and not sneakily overmounted regular files. But the work also brought to light some potential kernel changes that could make writing this kind of path-handling code much safer. Sarai suggested adding a RESOLVE_NO_DOTDOT option to openat2(), to ban the use of ".." in paths altogether. He said that it would also help to block all overmounts of procfs magic links; most have been blocked since kernel version 6.12, but "most" and "all" are different prospects in the world of security.

For people writing user-space applications, Sarai's recommendation is to switch to a more file-descriptor based design, rather than relying on paths. Ideally, use openat2() or libpathrs to handle path manipulation. Every system call that works with path names is potentially dangerous, he said.
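Python's standard library has no openat2() wrapper, so the sketch below issues the raw system call through ctypes. Several things here are assumptions rather than anything from the talk: the system-call number 437 (correct for x86-64 and arm64, but not for every architecture), the RESOLVE_* values copied from the kernel's linux/openat2.h header, and the openat2() helper name itself. The call can also fail with ENOSYS or EPERM on kernels older than 5.6 or under a seccomp filter that blocks it.

```python
import ctypes
import os

SYS_OPENAT2 = 437  # assumption: x86-64/arm64 numbering; differs on e.g. alpha

# Resolve flags, as defined in <linux/openat2.h>.
RESOLVE_NO_XDEV = 0x01
RESOLVE_NO_MAGICLINKS = 0x02
RESOLVE_NO_SYMLINKS = 0x04
RESOLVE_BENEATH = 0x08
RESOLVE_IN_ROOT = 0x10


class OpenHow(ctypes.Structure):
    """Mirror of the kernel's struct open_how."""
    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("mode", ctypes.c_uint64),
        ("resolve", ctypes.c_uint64),
    ]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def openat2(dir_fd, path, flags, resolve, mode=0):
    """Open `path` relative to `dir_fd` with the given RESOLVE_* restrictions."""
    how = OpenHow(flags, mode, resolve)
    fd = _libc.syscall(
        ctypes.c_long(SYS_OPENAT2),
        ctypes.c_long(dir_fd),
        ctypes.c_char_p(os.fsencode(path)),
        ctypes.byref(how),
        ctypes.c_size_t(ctypes.sizeof(how)),
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return fd
```

With RESOLVE_BENEATH, a lookup that would leave the directory behind dir_fd, whether through ".." or through an absolute symbolic link, fails with EXDEV instead of silently escaping. A program would typically open its root directory once and pass that descriptor to every later call.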

One member of the audience asked whether Sarai had any advice for safely dealing with paths in cgroupfs (the virtual filesystem for manipulating control groups). Sarai replied that the RESOLVE_NO_XDEV flag to openat2() was quite helpful for making sure that one stays within cgroupfs. Version 1 control groups are "annoying," and he didn't have much advice for dealing with them correctly. For version 2 control groups, however, he recommended opening a file descriptor for the root of the filesystem and checking the inode number. If that is correct, it's much more certain that the program is interacting with the real cgroupfs.

Another member of the audience asked how the CVEs that Sarai had highlighted were discovered. "Not through fuzzing or anything, just people looking," he answered. In 2018, he "provided a very general script" for creating scenarios that can result in path-traversal problems. Since then, people have been poking at runc and slowly evolving the attacks to reach increasingly obscure corner cases. Each year's CVEs tend to be an evolution of the previous year's, he said.

Someone else asked whether he thought that using virtual filesystems such as procfs and cgroupfs to present kernel interfaces was a mistake. "Well, there's several attacks that could never happen if it were [designed using system calls instead], and that makes you wonder, right?" Sarai replied. On the other hand, virtual filesystems do provide some nice benefits. The LXC container runtime has a fake procfs implementation to help control-group-unaware programs to identify process and memory limits, for example, he said.

That didn't satisfy some people, who pointed out that there are also ways to intercept system calls. At that point, however, the session ran out of time and the discussion spilled over into the hallway.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for helping with travel to Tokyo to cover the Linux Plumbers Conference. ]
