Bring your own init: PID 1 handoff

microsandbox ships its own init system called agentd. It's the critical component that powers all the magic we get to do from userspace.

Every sandbox boots through agentd, and at boot agentd is PID 1. It mounts the guest rootfs, prepares runtime state, wires networking, and makes the guest controllable from the host. Without it, the sandbox does not become a microsandbox; it is just a kernel looking for userspace.

But Linux only has one PID 1.

That is where things get interesting, because a lot of the Linux service ecosystem assumes that one slot belongs to systemd, OpenRC, runit, s6, or whatever init the distro was built around. Package scripts call systemctl. Daemons expect a service manager. Shutdown semantics route through init. If you've ever installed a .deb that calls systemctl enable inside a container and watched it fail with "Failed to connect to bus", or tried to stand up postgres + redis as services for a CI integration test, you've felt the edge of this.

So we needed both worlds: our init first, because it powers the sandbox; your init after, because it powers the distro. The handoff is how we make that work.

In Docker, Podman, and most container runtimes, the process you ask for is PID 1.

docker run --rm alpine sh

That sh is PID 1 in the container's pid namespace. PID 1 has special responsibilities the kernel hands to no one else: it inherits and must reap orphaned children, signals it has no handler for are dropped instead of applying their default action, and when it exits the kernel kills everything else in the namespace. Most application processes were never written with that in mind.

The day-to-day consequences are the usual list of papercuts:

  • Zombie processes. Anything sh forks and forgets about doesn't get reaped. Give it a while, and they pile up.
  • Signal handling falls apart. SIGTERM to PID 1 is ignored unless the process registered a handler, so containers exit slowly or hang until the runtime gives up and sends SIGKILL.
  • systemctl doesn't work. It can't find systemd because there is no systemd. Same for loginctl, D-Bus session bus, journalctl --user.
  • Service tests are awkward. If your CI image needs to stand up postgres and redis as units to run an integration test, you can't.

The standard escape hatch is to wrap your process with a tiny init shim like tini or dumb-init. That fixes signals and zombies but does nothing for the systemctl problem. The other escape hatch is "run systemd in a container", which is possible but usually requires runtime-specific flags, cgroup wiring, elevated privileges, or image conventions that make it less portable.
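To make the shim idea concrete, here is a minimal sketch of the two jobs such a wrapper does: forward signals to the workload and reap whatever children show up. It is not how tini or dumb-init are implemented; run_as_init is a name made up for this example, and orphans are only reparented to it when it really is PID 1 (or a subreaper).

```python
import os
import signal


def run_as_init(argv):
    """Run argv as a child, forward signals to it, reap children, return its exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # become the workload
        except OSError:
            os._exit(127)  # shell convention for "command not found"

    def forward(signum, _frame):
        # Pass the signal on instead of letting PID 1 swallow it.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # workload already exited

    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, forward)

    while True:
        # wait() returns any exited child, which includes orphans
        # reparented to this process when it is PID 1.
        pid, status = os.wait()
        if pid == child:
            code = os.waitstatus_to_exitcode(status)
            # Killed by signal N: report 128 + N, as shells do.
            return code if code >= 0 else 128 - code
```

A wrapper script would call run_as_init with the workload's argv and exit with the returned code. Note what is still missing: nothing here answers systemctl.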

There's a more honest fix: don't run your workload as PID 1 at all. Run a real init. The trick is doing that without giving up the init work microsandbox has to do first.

The thing that makes a real init handoff possible is that microsandbox owns the guest boot path. A container runtime hands PID 1 directly to the workload and that's the whole story. Because we're booting a microVM, we get an earlier hook: run our own agent as the temporary PID 1, prepare the guest, then replace PID 1 with the image's init before the workload ever sees it.

Inside the microVM, the kernel execs our agentd binary as PID 1. agentd is small on purpose. It mounts the rootfs, configures the network interface, and sets up the runtime directories the host-side control plane needs.

From here, the path depends on whether --init was passed. Without --init, agentd stays as PID 1, builds its async runtime, opens the host control channel, and serves exec / filesystem requests. That path is fast and quiet, and it is the right default for one-shot workloads.

With --init, agentd does its boot-time setup, then forks before building the runtime or opening the host channel.

                kernel exec
                     |
                  agentd (pid 1)
                     |
                   .fork()
                  /        \
       agentd (pid 1)        agentd (child)
              |                   |
       exec(systemd)        builds runtime,
              |             opens host channel,
       systemd (pid 1)      serves requests

The parent execs the user-chosen init. After the exec, systemd is PID 1. The kernel doesn't care that this slot was previously held by something else; PID 1 is whatever lives in the slot. The child agentd continues as a normal process under the new init, then builds its async runtime, opens the host control channel, and starts serving exec and filesystem requests.
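The fork-then-exec shape is small enough to sketch. This is an illustration of the pattern, not agentd's code: handoff and serve are hypothetical names, and serve stands in for "build the runtime, open the host channel, serve requests".

```python
import os


def handoff(init_argv, serve):
    """Fork, keep the agent's work in the child, exec the init in the parent.

    On success this never returns in the parent: execv replaces the process
    image while the PID (1 inside the guest) stays the same.
    """
    pid = os.fork()
    if pid == 0:
        # Child: an ordinary process whose parent is about to become the init.
        try:
            serve()
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: same PID as before the fork; exec swaps the binary in place.
    os.execv(init_argv[0], init_argv)
```

The order matters. Forking before any runtime or channel exists means the child starts from a clean, single-threaded state, and the parent has nothing open that the init would inherit by accident.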

What you end up with inside the guest:

$ ps -fA --forest
UID     PID   PPID    CMD
root      1      0    /sbin/init                     # systemd
root    212      1     \_ /init.krun                 # surviving child agentd
root    214    212     |   \_ bash                   # whatever you're running
root    236      1     \_ /lib/systemd/systemd-journald

systemd is PID 1 and the ancestor of every guest process. The surviving child agentd runs under systemd and spawns host-issued exec sessions as its own children. When the user runs systemctl status, they get a real answer.
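If you would rather check this from a script than read a process tree, the kernel exposes PID 1's command name under /proc, and systemd creates /run/systemd/system when it boots as the system manager. A small sketch; the function names and the proc_root / run_root parameters are only there for this example (the parameters make it testable outside a guest).

```python
from pathlib import Path


def pid1_name(proc_root="/proc"):
    """Return the command name of PID 1 as the kernel reports it."""
    return (Path(proc_root) / "1" / "comm").read_text().strip()


def booted_with_systemd(run_root="/run"):
    """True when systemd is the running system manager (same check as sd_booted)."""
    return (Path(run_root) / "systemd" / "system").is_dir()
```

After a handoff to systemd, pid1_name() inside the guest should report systemd rather than the agent.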

On shutdown, the child agentd hands off again in reverse: it asks the new PID 1 to power off through the init's normal shutdown path where one exists, and falls back to a signal path if it doesn't. The init owns the teardown order, so units stop in dependency order instead of being kill -9'd.
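One way to picture "normal path first, signal as fallback" is a small decision function. This is a guess at the shape, not agentd's logic: the tool list is an assumption, and the fallback signal is left to the caller because it depends on the init (systemd, for example, documents SIGRTMIN+4 sent to PID 1 as "start poweroff.target").

```python
import shutil
import signal


def poweroff_plan(fallback_signal, which=shutil.which):
    """Decide how to ask PID 1 to power off.

    Returns ("command", argv) when a shutdown tool is on PATH,
    otherwise ("signal", (1, fallback_signal)).
    """
    candidates = (
        ["systemctl", "poweroff"],  # systemd's own front door
        ["poweroff"],               # provided by most other inits
    )
    for argv in candidates:
        if which(argv[0]):
            return ("command", argv)
    # No tool found: fall back to signalling PID 1 directly.
    return ("signal", (1, fallback_signal))
```

A caller would run the argv with subprocess.run, or pass the (pid, signal) pair to os.kill.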

The handoff mechanic isn't tied to systemd. It's just an execve(2). The kernel doesn't care what binary lands in PID 1, only that something does. So --init accepts any absolute path:

msb run my-image --init /sbin/init -- bash         # systemd, sysvinit, whatever
msb run my-image --init /init -- bash              # s6's stage 1
msb run my-image --init /sbin/runit -- bash        # runit
msb run my-image --init /sbin/openrc-init -- bash  # openrc

The convenience form --init auto probes a list of well-known paths and picks the first that exists:

/sbin/init
/lib/systemd/systemd
/usr/lib/systemd/systemd

That covers most Debian, Ubuntu, and Fedora-derived systemd images. For everything else, --init <abs-path> is exact and reproducible.

If --init auto fails to find anything, agentd refuses to boot and kernel.log lists every path it tried. We picked that over a silent fallback because "systemd not found" silently dropping back to agentd-as-PID-1 produced confusing failures: systemctl returns "Failed to connect to bus" and you spend an hour wondering which step in the boot sequence broke.
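The probe-and-refuse behavior is simple to sketch. The candidate list is the one above; resolve_init, InitNotFound, and the root parameter are names invented for this example, and the real check may be stricter than "the path exists".

```python
import os

AUTO_CANDIDATES = (
    "/sbin/init",
    "/lib/systemd/systemd",
    "/usr/lib/systemd/systemd",
)


class InitNotFound(RuntimeError):
    """Raised when --init auto finds none of the well-known paths."""


def resolve_init(spec, candidates=AUTO_CANDIDATES, root="/"):
    """Return the init path for an --init value: an absolute path or "auto"."""
    if spec != "auto":
        if not os.path.isabs(spec):
            raise ValueError(f"--init needs an absolute path or 'auto', got {spec!r}")
        return spec  # explicit paths are taken as-is
    tried = []
    for path in candidates:
        tried.append(path)
        if os.path.exists(os.path.join(root, path.lstrip("/"))):
            return path  # first hit wins
    # No silent fallback: fail loudly and say what was probed.
    raise InitNotFound("no init found; tried: " + ", ".join(tried))
```

Raising with the full list of probed paths is the point of the design: the failure names its own cause instead of surfacing later as a bus error.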

Three flags control the handoff. --init turns it on; the other two are optional:

  • --init <auto | path>: pick the init.
  • --init-arg <arg>: append an argument to the init's argv. Repeatable. Defaults to [<cmd>] when empty.
  • --init-env KEY=VAL: env merged on top of the inherited environment.

Concrete: boot under systemd's multi-user target with the standard container marker env var:

msb run ghcr.io/superradcompany/debian-systemd:12 \
  --init /lib/systemd/systemd \
  --init-arg --unit=multi-user.target \
  --init-env container=microsandbox \
  -- bash
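Here is one reading of how those flags could turn into the argv and environment passed to execve. It is a sketch under assumptions: build_init_invocation is a made-up helper, and it takes "defaults to [<cmd>]" to mean the argv is just the init path when no --init-arg is given.

```python
import os


def build_init_invocation(init_path, init_args=(), init_env=(), inherited=None):
    """Return (argv, env) for exec'ing the init.

    init_args: values of repeated --init-arg flags, in order.
    init_env:  "KEY=VAL" strings from repeated --init-env flags.
    """
    argv = [init_path, *init_args]  # argv[0] is the init itself
    env = dict(os.environ if inherited is None else inherited)
    for pair in init_env:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VAL, got {pair!r}")
        env[key] = value  # later flags override inherited values
    return argv, env
```

For the command above, this yields ["/lib/systemd/systemd", "--unit=multi-user.target"] and an environment with container=microsandbox layered over whatever was inherited; the pair would then go to os.execve(argv[0], argv, env).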

The SDKs expose the same concept with language-native shapes. Rust uses SandboxBuilder::init and init_with. TypeScript uses .init and .initWith on the builder. Python takes init= on Sandbox.create, accepting a string, a dict, or an InitConfig-like value when you need argv and env.

Once systemd is PID 1, the guest behaves like a normal Linux box:

msb run ghcr.io/superradcompany/debian-systemd:12 --memory 1G --init auto -- bash

Inside:

apt update && apt install -y nginx
systemctl enable --now nginx
curl -s localhost | head -1
# <!DOCTYPE html>

That's it. There's no init wrapper, no tini flag, no special cgroup mode. The Debian package post-install hook works because it can talk to systemd, and --now starts the unit because that's what systemctl does on this box.

For repeatable CI you'll want to bake nginx into the image (or apply a rootfs patch) so the apt step doesn't run on every sandbox start — the interactive install above is just the shortest demo.

The same pattern works for many service-oriented packages: systemd units start, systemctl talks to PID 1, and daemon-based workflows become practical. Postgres, Redis, sshd, Cloudflare WARP, and "I need a session bus to test my GTK app" all get a lot closer to bare metal. Some services may still need the right image, capabilities, kernel modules, or networking setup to fully come up. --init is a foundation, not a guarantee.

Three things to be aware of when you reach for --init:

Memory. As a rough sizing rule, systemd's idle resident set lands around 50 MiB; add the daemons it manages and the working set grows fast. The agent-only default fits comfortably in 256 MiB, and a systemd boot with a couple of services usually wants closer to 1 GiB. Treat these as starting points, not budgets — set --memory for your actual workload.

Boot latency. With agent-only PID 1, the guest is "ready" by the time agentd finishes its setup, which is sub-second. With systemd, you also wait for multi-user.target (or whatever target you booted into) to finish. Fresh Debian images settle in 1-3 seconds; whether that's noise or signal depends on what you're doing.

Operational surface. Once you're under a real init, you also inherit its operational surface: timers, sockets, dependency graphs, target ordering. For most use cases that's the point, but if you only need a process supervisor, runit or s6 will do the same job in a fraction of the resident set.

Reach for --init when:

  • A service install or test expects systemctl, journalctl, D-Bus, or session-bus things.
  • You want to test "this is what running on a real Linux server looks like" without spinning up a real Linux server.
  • You're running a workload that ships only as .deb / .rpm packages with systemd post-install scripts.
  • You want per-task egress identity through a VPN client (WARP, Tailscale) that runs as a system daemon.

Stay with the agent-only default when:

  • The workload is a single binary or a Python script. There's nothing for systemd to manage; you'd just be paying for boot time.
  • You're under tight memory budgets (sub-256 MiB).
  • Boot latency matters and you're churning sandboxes per request.

The --init handoff is feature-complete and stable. We also publish multi-arch guest images for Debian, Ubuntu, Fedora, and Alpine with the right init pre-wired and rebuilt weekly, so you don't have to lean on third-party *-systemd forks.

If you want to try the handoff now, this is the one-liner:

msb run ghcr.io/superradcompany/debian-systemd:12 --init auto -- bash

PID 1 is whatever you tell us it is.