As ye clone(), so shall ye AUTOREAP


The facilities provided by the kernel for the management of processes have evolved considerably in the last few years, driven mostly by the advent of the pidfd API. A pidfd is a file descriptor that refers to a process. Unlike a process ID, which can be reused once its process is gone, a pidfd is an unambiguous handle for a process; that makes it a safer, more deterministic way of operating on processes. Christian Brauner, who has driven much of the pidfd-related work, is proposing two new flags for the clone3() system call, one of which changes the kernel's security model in a somewhat controversial way.

The existing CLONE_PIDFD flag was added (by Brauner) for the 5.2 kernel release, initially for clone(); clone3() itself followed in 5.3. The flag causes the call to create and return a pidfd for the newly created process (or thread). That gives the parent process a handle on its child from the outset. This pidfd can be used to, among other things, detect when the child has exited and obtain its exit status.
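
As an illustration of the existing flag, here is a minimal sketch of creating a child with CLONE_PIDFD and collecting its exit status through the pidfd. It assumes a kernel that supports clone3() and waitid() with P_PIDFD (5.4 or later) and kernel headers that provide struct clone_args; the helper name run_child_with_pidfd() is made up for this example, and the fallback definition of P_PIDFD is only there for older C libraries.

```c
#define _GNU_SOURCE
#include <linux/sched.h>    /* struct clone_args, CLONE_PIDFD */
#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3           /* waitid() ID type for pidfds; missing from older libc headers */
#endif

/* Hypothetical helper: create a child that exits with `code`, wait for it
 * through its pidfd, and return the exit status (or -1 on any error). */
int run_child_with_pidfd(int code)
{
    int pidfd = -1;
    struct clone_args args;

    memset(&args, 0, sizeof(args));
    args.flags = CLONE_PIDFD;
    args.pidfd = (uint64_t)(uintptr_t)&pidfd;   /* the kernel stores the pidfd here */
    args.exit_signal = SIGCHLD;

    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                            /* child: exit right away */

    /* The pidfd becomes readable once the child has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    if (poll(&pfd, 1, -1) < 0) {
        close(pidfd);
        return -1;
    }

    /* Reap the child by pidfd rather than by PID. */
    siginfo_t info;
    memset(&info, 0, sizeof(info));
    int rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);
    close(pidfd);
    if (rc < 0 || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;
}
```

Calling run_child_with_pidfd(7) should return 7 on a suitable kernel. The poll() call is not strictly needed before a blocking waitid(); it is included to show that a pidfd can be watched in an event loop like any other file descriptor.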

The classic Unix way of obtaining exit-status information, though, is with one of the wait() family of system calls. When a process exits, it will go into the "zombie" state until its parent calls a form of wait() to reap the zombie's exit information and clean up the process entry itself. If a parent creates a lot of children and fails to reap them, it will fill the system with zombie processes — a classic beginner's process-management mistake. Even if the parent does not care about what happens to its children after they are created, it normally needs to carefully clean up after those children when they exit.
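
The classic pattern can be sketched in a few lines of C. The helper name spawn_and_reap() is hypothetical; fork() and waitpid() are the standard POSIX calls.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then reap it.
 * Returns the child's exit status, or -1 on error. */
int spawn_and_reap(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: exits immediately */

    /* From the child's exit until waitpid() returns, the child is a zombie:
     * its process-table entry exists only to hold the exit status. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A parent that skips the waitpid() call leaves one zombie behind for every child it creates, for as long as the parent itself keeps running.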

There are ways of getting out of that duty. The kernel will send a SIGCHLD signal to the parent of an exited process; by default that signal is ignored, but the child still lingers as a zombie until it is reaped. If the parent explicitly sets the signal's disposition to SIG_IGN, though, its children will be automatically cleaned up without bothering the parent. This setting applies to all children, though; if a parent does care about some of them, it cannot use this approach. Among other things, that means that library code that creates child processes cannot make the cleanup problem go away by setting SIGCHLD to SIG_IGN without creating problems for the rest of the program.

One of the new flags that Brauner is proposing is CLONE_AUTOREAP, which would cause the newly created process to be automatically cleaned up once it exits. There would be no need (or ability) to do so with wait(), and no SIGCHLD signal sent. The flag applies only to the just-created process, so it does not affect process management elsewhere in the program. It also is not inherited by grandchild processes. If the child is created with CLONE_PIDFD as well, the resulting pidfd can be used to obtain the exit status without the need to explicitly reap the child. Otherwise, that status will be unavailable; as Brauner noted, this mode "provides a fire-and-forget pattern".

The CLONE_AUTOREAP flag seems useful and uncontroversial. The other proposed flag, CLONE_PIDFD_AUTOKILL, may also be useful, but it has raised some eyebrows. This flag ties the life of the new process to the pidfd that represents it; if that pidfd is closed, the child will be immediately and unceremoniously killed with a SIGKILL signal. The use of CLONE_PIDFD is required with CLONE_PIDFD_AUTOKILL; otherwise, the pidfd that keeps the child alive would never exist in the first place. CLONE_PIDFD_AUTOKILL also requires CLONE_AUTOREAP, since the kill is likely to be triggered when the parent process exits (closing its file descriptors), at which point the parent can no longer clean up after its children.

The intended use case for this feature is "container runtimes, service managers, and sandboxed subprocess execution - any scenario where the child must die if the parent crashes or abandons the pidfd". In other words, this flag is meant to be used in cases where processes are created under the supervision of other processes, and they should not be allowed to keep on running if the supervisor goes away.

Linus Torvalds quickly pointed out a potential problem after seeing version 3 of the patch series: this feature would allow the killing of setuid processes that the creator would otherwise not have the ability to affect. A suitably timed kill might interrupt a privileged program in the middle of a sensitive operation, leaving the system in an inconsistent (and perhaps exploitable) state. "If I'm right", Torvalds said, "this is completely broken. Please explain." Brauner responded that he was aware that he was proposing a significant change to how the kernel's security model works, but was insistent that this capability is needed; he wanted a discussion on how it could be safely implemented:

"My ideal model for kill-on-close is to just ruthlessly enforce that the kernel murders anything once the file is released. I would value input under what circumstances we could make this work without having the kernel magically unset it under magical circumstances that are completely opaque to userspace."

Jann Horn questioned whether this change was actually problematic. There are various ways to kill setuid processes now, he said, including the use of SIGHUP signals or certain systemd configurations. "I agree that this would be a change to the security model, but I'm not sure if it would be that big a change." He suggested that, if this change is worrisome, it could be restricted to situations where the "no new privileges" flag has been set; that is how access to seccomp() is regulated now. Brauner indicated that this option might be workable, and the current version of the series (version 4 as of this writing) automatically sets "no new privileges" in the newly created child when CLONE_PIDFD_AUTOKILL is used.
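
For reference, the "no new privileges" flag is already available to user space through prctl(). The sketch below sets it for the calling thread and reads it back; the helper name is hypothetical. This is not the code from the patch series, which sets the flag inside the kernel on behalf of the child.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set no_new_privs for the calling thread and return
 * the flag as read back from the kernel (1 when set, -1 on error). */
int enable_no_new_privs(void)
{
    /* Once set, the flag cannot be cleared; it is inherited across fork()
     * and preserved across execve(), so setuid and file-capability bits on
     * executables no longer grant extra privileges. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

With the flag set, a child that executes a setuid program runs it without elevated privileges, which removes the scenario Torvalds was worried about.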

Meanwhile, Ted Ts'o suggested that, rather than killing the child with SIGKILL, the kernel could instead send a sequence of signals, starting with SIGHUP, then going to SIGTERM, with a delay between them. That would give a setuid program the ability to catch the signal and clean up properly before exiting. Only if the target process doesn't take the hint would SIGKILL be sent. Even then, he was not convinced that this feature could be made secure: "I bet we'll still see some zero days coming out of this, but we can at least mitigate likelihood of security breach."
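
The escalation that Ts'o described would happen inside the kernel, but the same idea is a familiar user-space pattern. Here is a rough sketch of that analogue, with the grace period, the polling interval, and all function names chosen arbitrarily for illustration; none of it comes from the patch series.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll for up to grace_ms for child `pid` to exit.
 * Returns 1 once the child is reaped, 0 if it is still running, -1 on error. */
static int reap_within(pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */

    for (int waited = 0; ; waited += 10) {
        pid_t r = waitpid(pid, NULL, WNOHANG);
        if (r == pid)
            return 1;
        if (r < 0)
            return -1;
        if (waited >= grace_ms)
            return 0;
        nanosleep(&tick, NULL);
    }
}

/* Hypothetical helper: escalate SIGHUP -> SIGTERM -> SIGKILL against a child,
 * pausing grace_ms after each polite signal.
 * Returns the last signal that had to be sent, or -1 on error. */
int terminate_gently(pid_t pid, int grace_ms)
{
    static const int polite[] = { SIGHUP, SIGTERM };

    for (int i = 0; i < 2; i++) {
        if (kill(pid, polite[i]) < 0)
            return -1;
        int r = reap_within(pid, grace_ms);
        if (r != 0)
            return r < 0 ? -1 : polite[i];
    }

    /* The child did not take the hint; SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) < 0 || waitpid(pid, NULL, 0) != pid)
        return -1;
    return SIGKILL;
}

/* Demo: a child that ignores SIGHUP but leaves SIGTERM at its default. */
int demo_escalation(void)
{
    void (*old)(int) = signal(SIGHUP, SIG_IGN);   /* inherited by the child */

    pid_t pid = fork();
    if (pid < 0) {
        signal(SIGHUP, old);
        return -1;
    }
    if (pid == 0)
        for (;;)
            pause();                              /* child: wait to be killed */

    signal(SIGHUP, old);                          /* parent: restore its own handling */
    return terminate_gently(pid, 100);
}
```

In the demo, the child shrugs off SIGHUP and dies on SIGTERM, so demo_escalation() returns SIGTERM. A program that handles SIGHUP or SIGTERM gets a chance to clean up first, which is the property Ts'o was after.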

Brauner did not respond to that last suggestion, and the conversation wound down, relatively quickly, without any definitive conclusions having been reached. There is clearly a bit of tension between the need for supervisors to have complete control over the applications they manage and the need to keep existing security boundaries intact. Whether that tension is enough to keep relatively new process-management approaches from being implemented remains to be seen.
