This is post 2 of a series; post 1 covered the binary rewriter.
Yesterday’s post showed how the binary rewriter replaces every syscall instruction in a Linux binary with a trap. The process thinks it’s making system calls. Instead, a small shim intercepts each one, checks policy, and decides what to do. The process runs inside a lightweight KVM-based VM with no operating system — just the shim.
That raises two immediate questions: if there’s no kernel, who handles the syscalls? And what does a VM look like when there’s no OS inside it?
This post answers both.
A Linux kernel is a massive piece of software. It manages hardware, schedules processes, enforces permissions, routes signals, maintains dozens of filesystem types, handles networking from raw Ethernet to TCP congestion control, and implements roughly 450 system calls. All of this exists because the kernel must handle the general case — any number of processes, any hardware, any workload.
But look at a single-process container workload. A Python script that reads data, calls an API, and writes output. What does it actually need from the kernel?
| Kernel subsystem | What it does | Does a single process need it? |
|---|---|---|
| Process scheduler | Decides which process runs next | No — one process, it always runs |
| IPC (pipes, shared memory, message queues) | Processes communicate with each other | No — nobody to talk to |
| User/group permissions | Controls access between users | No — one process, one identity |
| Device drivers | Manages hardware devices | No — no hardware to access |
| Virtual filesystem layer | Manages mount namespaces, overlayfs, procfs | No — just needs to read and write files |
| Signal routing | Delivers signals between processes | Minimal — no sender to receive from |
| Multi-process memory management | COW fork, shared mappings, per-process page tables | No — one address space |
| Networking stack | Full TCP/IP, routing, netfilter, socket buffers | Partial — needs socket I/O, but not the full stack |
Most of what a kernel does is coordination between competing processes and abstraction over diverse hardware. A single-process workload with known I/O patterns needs almost none of this.
Instead of stripping down a Linux kernel — which, as discussed in the previous post, leads to entangled dependencies and hacks — the approach is to write the syscall handlers from scratch. Implement just what the process needs. Nothing more.
Here’s the actual dispatch table from the shim. Every syscall the process makes ends up here:
File I/O — the basics of reading and writing data:
- read, write, open, close, openat — standard file operations
- lseek — seek within a file
- stat, fstat, statx, newfstatat — file metadata
- access — check if a path exists
- readlink, readlinkat — resolve symbolic links
- getdents64 — list directory entries
- getcwd — current working directory
- pread64 — read at a specific offset
- writev — scatter-gather write
- ioctl — device control (mostly stubbed — returns ENOTTY for terminals)

Memory management — the process needs to allocate memory:
- brk — extend the heap
- mmap — map anonymous memory (with page tracking)
- munmap — release mapped memory
- mprotect — change page permissions
- mremap — resize a mapping (allocate new, copy, free old)
- madvise — advisory hints (accepted, ignored)

Networking — socket operations for HTTP/API calls:
- socket — create a socket
- connect — connect to a remote host
- bind — bind to a local address
- sendto, recvfrom — send and receive data
- getsockname, getpeername — socket address queries
- poll — wait for I/O readiness

I/O multiplexing — event loops for async runtimes:
- epoll_create1, epoll_ctl, epoll_wait — epoll interface
- pipe2 — create a pipe pair
- eventfd2 — event notification

Process identity and time:
- getpid, gettid — process/thread ID (returns 1)
- getuid, getgid, geteuid, getegid — user/group IDs (returns 1000)
- uname — system identification
- clock_gettime — high-resolution timestamps (computed from TSC, no VM exit)
- getrandom — random bytes

Process lifecycle:
- exit, exit_group — terminate
- clone, fork, vfork — spawn (escalated to hypervisor — creates a new VM)
- execve — execute a new binary (escalated to hypervisor)
- wait4 — wait for child (escalated to hypervisor)

Runtime stubs — syscalls that runtimes like Python/musl probe for during startup. They don’t do real work, but returning an error would cause the runtime to crash or fall into slow paths:
- rt_sigaction, rt_sigprocmask — signal handling (returns 0, no signals delivered)
- sigaltstack — alternate signal stack (returns 0)
- set_tid_address, set_robust_list — thread setup (returns safe defaults)
- arch_prctl — set FS/GS base for TLS
- prlimit64 — resource limits (returns configured maximums)
- futex — futex operations (returns 0 — safe for single-threaded)
- rseq — restartable sequences (returns ENOSYS, glibc handles this gracefully)
That’s roughly 60 syscalls that the shim handles today — enough to run a statically-linked CPython 3.12 binary through startup, HTTP calls, file I/O, and shutdown. Other runtimes will have different requirements. Go’s runtime probes different syscalls at startup. Node.js with V8 exercises a different set. The dispatch table grows as we test against more workloads — each new runtime might add a handful of cases. But the shape holds: single-process workloads use a small fraction of the 450 syscalls Linux provides, and the hypervisor backstop means we don’t need to implement everything before we start.
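In code, the dispatch table is essentially one match on the syscall number. A minimal sketch — the constants are the real x86-64 syscall numbers, but the handlers are hypothetical stand-ins, not the shim’s actual code:

```rust
// Sketch of the dispatch shape: one match on the syscall number.
const SYS_BRK: u64 = 12;
const SYS_GETPID: u64 = 39;

fn sys_getpid() -> u64 { 1 }                // single process: always PID 1
fn sys_brk(_addr: u64) -> u64 { 0x30_0000 } // placeholder heap pointer

/// Some(result) if handled locally; None means escalate to the
/// hypervisor, which checks policy and usually denies (-EPERM/-ENOSYS).
fn dispatch(nr: u64, a1: u64) -> Option<u64> {
    match nr {
        SYS_BRK => Some(sys_brk(a1)),
        SYS_GETPID => Some(sys_getpid()),
        _ => None,
    }
}
```

The default arm is the hypervisor backstop: nothing here has to be exhaustive, because falling through is always safe.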
Here’s a detail that surprises people: the guest process runs at ring 0 — the same privilege level as the shim. There’s no user/kernel boundary inside the VM. No ring 3 to ring 0 transition on each syscall. No SYSENTER/SYSEXIT overhead.
In a traditional OS, the ring 0/ring 3 split exists to protect the kernel from the process. But in a single-process VM, there’s nothing to protect — the shim is the kernel, and the process is the only thing running. The real isolation boundary isn’t between rings inside the VM. It’s the VM itself. KVM and the hypervisor enforce that the guest — shim and process together — can’t touch host memory, can’t access host devices, can’t escape the VM. That boundary is enforced by hardware (VT-x, EPT page tables), not by ring transitions.
Running everything at ring 0 has practical benefits. The INT3 trap from the rewritten syscall instruction stays within ring 0 — same-privilege interrupt, no stack switch, no segment reload. The CPU pushes three values (RIP, CS, RFLAGS) instead of five (which includes a stack switch for cross-ring interrupts). The shim handler runs, computes the result, and IRETQ returns to the process. The round-trip is faster because there’s no privilege transition to perform.
It also simplifies the shim. No need for separate kernel and user page tables. No need for SWAPGS to switch segment bases. No need to manage TSS entries for stack switching. The shim’s page tables are the process’s page tables. Everything that would normally exist to maintain the user/kernel boundary — and everything that can go wrong at that boundary — is simply absent.
The natural concern: if the process runs at ring 0, can’t it overwrite the shim? Today, the shim’s code pages are mapped read-only in the guest page tables, so a direct write faults. But a ring-0 process could, in principle, modify the page tables themselves. The answer is hardware memory protection keys — Intel’s PKS (Protection Keys for Supervisor). With PKS, the shim’s pages — and the page table pages themselves — are tagged with a key the process cannot write through, even at ring 0. It’s a hardware-enforced separation within a single privilege level: no ring transition, no performance cost. PKS is on the roadmap before shipping; it’s what will protect the shim’s process state, the policy table, and the page tables from modification by guest code. The architecture is designed around PKS from the start — it just hasn’t been wired up yet.
Not all syscalls are handled the same way. The shim has three distinct paths, and the choice matters for both performance and security:
Tier 1: Emulate in the shim (nanoseconds)
Most syscalls never leave the VM. brk extends a pointer. getpid returns 1. clock_gettime reads the TSC and converts it to nanoseconds using a frequency the hypervisor provided at boot. mmap tracks allocations in a simple list. read and write on file descriptors operate on ring buffers in shared memory.
No VM exit. No hypervisor involvement. The shim computes the result and returns it directly. This is the fast path, and it’s where the vast majority of syscalls go.
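As an illustration of how little state a tier-1 handler carries, here is a toy version of the brk handling described above — simplified to raw bounds checks; the addresses and limits are made up:

```rust
/// Toy heap state: the program break is just a pointer between a base
/// and a limit (the real shim reads these from its data page).
struct Heap { base: u64, brk: u64, limit: u64 }

impl Heap {
    /// Linux-style brk semantics: an in-range address moves the break;
    /// anything else (including brk(0), the query form) returns the
    /// current break unchanged.
    fn brk(&mut self, addr: u64) -> u64 {
        if addr >= self.base && addr <= self.limit {
            self.brk = addr;
        }
        self.brk
    }
}
```

No page tables are touched here because the heap region is pre-mapped; the handler is pure pointer arithmetic, which is why it costs nanoseconds.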
Tier 2: Delegate to the hypervisor (microseconds)
Some operations genuinely need host resources. connect() needs to open a real TCP connection. fork() needs to create a new VM. Writing to stdout (fd 1) needs to reach the host terminal.
For these, the shim writes the syscall number and arguments into a shared memory region — the governance mailbox — and triggers a VM exit via an I/O port write. The hypervisor reads the mailbox, performs the real operation, writes the result back, and resumes the VM. One round-trip, a few microseconds.
Tier 3: Deny (nanoseconds)
Anything not in the dispatch table falls through to the default case. The shim escalates it to the hypervisor, which checks policy. If the syscall isn’t authorized — and for the ~390 syscalls that aren’t implemented, it never is — the hypervisor returns -EPERM or -ENOSYS.
The denied syscall never executes. No side effects, no partial state changes. The process gets an error code and continues.
```
Syscall arrives at shim
 │
 ├─ In dispatch table?
 │   ├─ Can emulate locally?  → handle in shim (ns)
 │   └─ Needs host resources? → governance mailbox → VM exit → hypervisor (µs)
 │
 └─ Not in dispatch table → escalate → hypervisor denies (-EPERM / -ENOSYS)
```
This is the key difference from unikernels.
Unikernels compile the application and a minimal kernel into a single image. When the unikernel encounters something it can’t handle — a system call it didn’t implement, a device it doesn’t have a driver for, a networking edge case — it’s stuck. There’s no fallback. The scope of what you must implement up front is enormous, and getting it wrong means the application crashes.
The shim doesn’t have this problem. Anything it can’t handle, it escalates to the hypervisor. The hypervisor is a normal Linux process on the host, with full access to the host kernel. It can make real system calls, open real sockets, access real files. The guest process doesn’t know the difference — it made a system call and got a result back.
This changes the engineering economics completely:
| | Unikernel | Shim + Hypervisor |
|---|---|---|
| Must implement before shipping | Everything the app needs | Just the hot path |
| Handling edge cases | Crash or return error | Delegate to hypervisor |
| Adding new syscall support | Rebuild and redeploy the image | Add a case to the dispatch table |
| Unmodified binaries | Usually no — need to recompile | Yes — binary rewriter handles it |
This doesn’t mean delegation is free. Each syscall that gets escalated to the hypervisor is its own design decision. How much guest state does the hypervisor need to read? Can it access the guest’s memory buffers directly, or does data need to be copied? Does the hypervisor need to maintain state across multiple calls (like a file position for sequential reads)? Can the operation be performed asynchronously, or does the guest block until the hypervisor responds?
For write(1, buf, len) this is straightforward — the hypervisor reads len bytes from the guest’s buffer and writes them to the host’s stdout. For connect() it’s more involved — the hypervisor needs to perform a real TCP handshake on the host, manage the resulting socket, and set up a ring buffer pair for subsequent I/O. For fork() it’s a major operation — snapshot the guest’s memory, send it to a pool daemon, spin up a new VM with the snapshot.
Each delegated syscall is a small protocol between the shim and the hypervisor. The governance mailbox carries the arguments, but the hypervisor needs to know what those arguments mean and how to act on them. This is engineering work — not as much as implementing a full kernel, but not zero either.
The system works because the set of syscalls that actually need delegation is small. Most calls are emulated locally in the shim. The few that need the hypervisor are well-defined, and you add them one at a time as workloads demand them.
Abstract tiers are useful, but code makes it concrete. Here’s one syscall from each of the first two tiers.
When a process calls clock_gettime(CLOCK_REALTIME, &ts), a normal kernel goes through the vDSO, reads clock sources, applies NTP adjustments. The shim does this instead:
```rust
pub fn now_ns() -> u64 {
    // Boot-time values the hypervisor wrote into the clock data page:
    // TSC at boot, the matching Unix timestamp, and the TSC frequency.
    let tsc_at_boot = clock_field(0);
    let unix_ns_at_boot = clock_field(8);
    let tsc_freq_khz = clock_field(16);
    if tsc_freq_khz == 0 {
        return 0; // clock data not initialized yet
    }
    // Cycles since boot, divided by kHz/1000 = cycles-per-microsecond.
    let elapsed_tsc = rdtsc().wrapping_sub(tsc_at_boot);
    let elapsed_us = elapsed_tsc / (tsc_freq_khz / 1000);
    unix_ns_at_boot + elapsed_us * 1000
}
```
The hypervisor writes three values into a known memory location at boot: the TSC value at boot time, the corresponding Unix timestamp, and the TSC frequency. The shim reads the TSC directly with rdtsc — which doesn’t cause a VM exit on modern CPUs — computes the elapsed time, and returns wall-clock nanoseconds.
No VM exit. No host interaction. A Python process calling time.time() millions of times in a loop pays nanoseconds per call instead of microseconds.
This is a practical shortcut, not the final design. TSC-based time drifts over long-running VMs because there’s no NTP correction. For short-lived agent tasks (seconds to minutes), the drift is negligible. For longer workloads, we’d use KVM’s paravirtual clock (kvm-clock) or periodically sync against the host. The architecture supports either — the shim just reads from a fixed memory location, and what the hypervisor writes there can change without touching the shim code.
When a process writes to stdout, the output needs to reach the host terminal. The shim can’t handle this locally — it needs the hypervisor. This is the escalation path:
```rust
pub fn escalate(nr: u64, a1: u64, a2: u64, a3: u64, a4: u64, a5: u64) -> u64 {
    unsafe {
        let mb = mailbox();
        // Write syscall number and arguments into the shared mailbox
        mb.add(0).write_volatile(nr);
        mb.add(1).write_volatile(a1);
        mb.add(2).write_volatile(a2);
        mb.add(3).write_volatile(a3);
        mb.add(4).write_volatile(a4);
        mb.add(5).write_volatile(a5);
        mb.add(7).write_volatile(0); // pre-clear return value
        compiler_fence(Ordering::SeqCst);
        // Trigger KVM_EXIT_IO → hypervisor wakes and handles the request
        outl(NEXUS_GOV_PORT, nr as u32);
        compiler_fence(Ordering::SeqCst);
        mb.add(7).read_volatile() // return value written by hypervisor
    }
}
```
The mailbox is a fixed-address struct in guest memory — 8 qwords: syscall number, 6 arguments, and a return value. The shim fills in the arguments, executes outl on a designated I/O port, which triggers KVM_EXIT_IO on the host side. The hypervisor reads the mailbox, performs the real write(1, buf, len) on the host, writes the result back into the mailbox, and resumes the VM. One round-trip, a few microseconds.
No virtio queues. No shared ring buffers for control. No feature negotiation. Just a struct in memory and an I/O port trigger. The simplicity is deliberate — this is a hot path that needs to be auditable in minutes, not days.
So the shim handles syscalls and the hypervisor handles delegation. But where does all of this live in memory? In a normal VM, the guest OS manages its own address space — the hypervisor hands it a chunk of RAM and lets it allocate. Here, there’s no guest OS to manage anything. The hypervisor places every page with a specific purpose, and the guest gets exactly what it needs.
The address space is split into two halves via two PDPT entries:
```
PDPT[0] → Low memory  (0x00000000 – 0x3FFFFFFF, 1 GiB)
          ELF binaries, brk heap. This is where the process lives.

PDPT[1] → High memory (0x40000000 – 0x5FFFFFFF, 512 MiB)
          System area: page tables, shim, stack, rings, mmap pool.
          No user code loads here.
```
Low memory is entirely for the guest binary. ELF segments load at their toolchain-native addresses (0x200000, 0x400000, etc.) without conflicting with system infrastructure. The first 2MB (PD_low[0]) is not present — any NULL pointer dereference traps immediately.
The system area at 1 GiB has a fixed layout:
```
SYS_BASE + Offset   Size    Purpose
─────────────────────────────────────────────────────────
0x0000              4KB     PML4 (page map level 4)
0x1000              4KB     PDPT (page directory pointer table)
0x2000              4KB     PD_low  (page directory for 0–1 GiB)
0x3000              4KB     PD_high (page directory for 1–2 GiB)
0x4000              4KB     GDT + IDT + governance mailbox
  +0x100                      IDT (256 entries)
  +0xE00                      GOV mailbox (64 bytes, at SYS_BASE + 0x4E00)
0x5000              ~28KB   Shim code (.text + .rodata)
0x18000             96KB    mmap page tables (24 PTs)
0x20000             4KB     Shim data page
  +0x800                      Initial brk GPA
  +0x808                      Heap limit
  +0x900                      Policy table (128 × 8 bytes)
  +0xD00                      Clock data (TSC freq, boot time)
0x200000            2MB     Guest stack (grows down)
0x1000000           48MB    mmap region (bump allocator)
```
Every page has a specific purpose. There’s no general-purpose allocator, no free list, no dynamic allocation of system structures. The hypervisor knows the exact state of every byte before the guest starts.
The page table pages (PML4, PDPT, PD_low, PD_high) are mapped read-only in the guest. The guest cannot modify its own address space. It can’t mark new pages as executable. It can’t remap shim memory as writable. It can’t create new mappings outside the regions the hypervisor set up.
This isn’t enforced by a policy check — it’s enforced by the page table permissions themselves. A write to any page table page triggers a fault. There’s no syscall to call, no privilege to escalate. The mechanism for changing the memory layout simply doesn’t exist inside the guest.
At SYS_BASE + 0x4E00 — offset 0xE00 into the GDT page — sits the governance mailbox: 64 bytes of shared memory between the shim and the hypervisor:
```
Offset  Type  Field
──────────────────────────
0x00    u64   syscall_nr
0x08    u64   arg1
0x10    u64   arg2
0x18    u64   arg3
0x20    u64   arg4
0x28    u64   arg5
0x30    u64   arg6
0x38    u64   return value
```
The shim writes the syscall number and arguments, triggers KVM_EXIT_IO via an outl instruction, and the hypervisor reads the mailbox on the host side. After handling the request, the hypervisor writes the return value and resumes the VM.
Why not virtio? Virtio is designed for high-throughput data transfer between guest and host — descriptor rings, available/used ring buffers, feature negotiation, driver initialization. For a syscall escalation path that carries 8 values per invocation, that’s enormous overhead. The mailbox is a fixed struct at a fixed address. No initialization. No negotiation. No driver code in the guest.
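To check that the 64-byte layout above holds together, it can be written as a #[repr(C)] struct and the offsets verified mechanically — a sketch with assumed field names (the shim itself indexes raw qwords rather than using a struct):

```rust
/// The 64-byte governance mailbox: syscall number, six argument
/// slots, and a return value, in declaration order thanks to repr(C).
#[repr(C)]
struct GovMailbox {
    syscall_nr: u64, // offset 0x00
    args: [u64; 6],  // offsets 0x08..0x38
    ret: u64,        // offset 0x38, written by the hypervisor
}

/// Byte offset of the return-value field, computed from pointers
/// (avoids depending on any particular Rust version's offset_of).
fn ret_offset() -> usize {
    let mb = GovMailbox { syscall_nr: 0, args: [0; 6], ret: 0 };
    (&mb.ret as *const u64 as usize) - (&mb as *const GovMailbox as usize)
}
```

With only u64 fields there is no padding, so the struct comes out to exactly 64 bytes and the return value lands at 0x38, matching the table.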
The governance mailbox handles control flow — syscall escalation. But for streaming network I/O, a synchronous mailbox is too slow. Each send() or recv() would need a VM exit to move data.
Network sockets use ring buffers in shared memory. Each connected socket gets a ring pair — one for transmit, one for receive — mapped at a fixed GPA region starting at 0xC0000000 (3 GiB):
```
Ring pair layout (per socket):
┌──────────────────────────────────┐
│  Control (64 bytes)              │
│  head, tail, capacity, flags     │
├──────────────────────────────────┤
│  TX ring (64KB)                  │
│  shim writes → hypervisor reads  │
├──────────────────────────────────┤
│  RX ring (64KB)                  │
│  hypervisor writes → shim reads  │
└──────────────────────────────────┘
64 ring pairs total, ~8 MiB
```
Single producer, single consumer. Lock-free. The shim writes to the TX ring and the hypervisor drains it on the host side, sending the data over the real TCP connection. The hypervisor fills the RX ring with incoming data from the network, and the shim reads from it. Head and tail pointers are cache-line aligned to avoid false sharing.
What we’re doing here is essentially emulating what a network device provides at its lowest level — a TX ring and an RX ring for moving bytes between software and the wire. The difference is that a real NIC’s ring buffer feeds into a full TCP/IP stack in the guest kernel: socket buffers, congestion control, routing tables, netfilter, segmentation offload. The guest doesn’t need any of that. The entire TCP/IP stack is delegated to the hypervisor. The shim presents a socket API to the process, translates it into ring buffer reads and writes, and the hypervisor handles the actual protocol work on the host — where a mature, battle-tested network stack already exists.
When a process calls send(fd, buf, len), the shim copies data into the socket’s TX ring and returns immediately — no VM exit. For recv(), the shim checks the RX ring; if data is available, it copies it out without a VM exit. Only when the ring is empty does the shim stall or escalate.
The ring buffers are backed by a shared memfd — the hypervisor and the guest see the same physical pages. The common case (data available, ring not full) involves zero VM exits.
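The head/tail discipline described above is standard single-producer, single-consumer ring logic. A minimal in-memory sketch using plain indices and a Vec — the real rings live in shared guest memory and would use volatile or atomic accesses:

```rust
/// Toy SPSC byte ring. The producer advances head, the consumer
/// advances tail; one slot stays empty to distinguish full from empty.
struct Ring { buf: Vec<u8>, head: usize, tail: usize }

impl Ring {
    fn new(cap: usize) -> Self {
        Ring { buf: vec![0; cap], head: 0, tail: 0 }
    }
    /// Producer side (shim → TX ring). Returns false when the ring is full.
    fn push(&mut self, b: u8) -> bool {
        let next = (self.head + 1) % self.buf.len();
        if next == self.tail { return false; } // full: caller must wait
        self.buf[self.head] = b;
        self.head = next;
        true
    }
    /// Consumer side (hypervisor drains TX). None when the ring is empty.
    fn pop(&mut self) -> Option<u8> {
        if self.tail == self.head { return None; } // empty
        let b = self.buf[self.tail];
        self.tail = (self.tail + 1) % self.buf.len();
        Some(b)
    }
}
```

Because each index is written by exactly one side, no lock is needed; in the shared-memory version the only extra requirement is ordering (the data write must be visible before the head update).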
File I/O is fundamentally different from network I/O — files have random access, seek positions, and known sizes. A ring buffer doesn’t make sense here.
Instead, file contents live in a page-backed cache region at GPA 0x80000000 (2 GiB). The hypervisor loads file data into this region before or during guest execution, backed by hugepages when available. Each file in the VFS metadata table points to a data GPA within this cache where its contents reside.
This isn’t a traditional buffer cache. In a multi-process environment, the kernel’s page cache is a shared resource — it has to handle eviction pressure from competing processes, speculative readahead that may be wasted, and cache coherency across processes accessing the same file. None of that applies here. There’s one process. We know exactly what it will need at startup (the runtime), and the policy tells us what task-specific files it will access. There’s no contention, no eviction, no wasted readahead. We can make informed decisions about what to pre-load and what to load lazily — the noise of a multi-process environment is simply absent.
When the process calls read(fd, buf, len), the shim looks up the fd’s data GPA and current file position, then copies directly from the cache pages into the process’s buffer. No ring buffer, no VM exit — just a memcpy from one guest address to another. lseek() updates the position. pread() reads at an arbitrary offset. Random access works naturally because the file is just a contiguous region of memory.
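That read path amounts to a bounds-checked copy plus a position update. A toy sketch, with the file’s contents as an in-memory slice standing in for the cache pages (names are illustrative):

```rust
/// Toy open-file state: contents plus a seek position. The real shim
/// copies from cache pages at the file's data GPA instead of a Vec.
struct OpenFile { data: Vec<u8>, pos: usize }

impl OpenFile {
    /// read(): copy up to buf.len() bytes from the current position,
    /// clamped to end-of-file, and advance the position.
    fn read(&mut self, buf: &mut [u8]) -> usize {
        let n = buf.len().min(self.data.len().saturating_sub(self.pos));
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        n
    }
    /// lseek(SEEK_SET)-style: set an absolute position.
    fn seek(&mut self, pos: usize) { self.pos = pos.min(self.data.len()); }
}
```

Random access falls out for free: seek is an assignment, and pread is the same copy with an explicit offset instead of the stored position.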
Where does the file data come from? The hypervisor reads it from the host filesystem. The policy configuration specifies a mapping — host path to guest path — and the hypervisor loads the contents into the cache region. But not all files are loaded the same way.
Pre-cached files are loaded before the guest starts. These are files that the runtime needs immediately at startup — Python’s standard library, SSL certificates, shared configuration. If CPython tries to import os and the file isn’t there yet, the interpreter crashes before user code ever runs. Pre-caching ensures the runtime boots without stalling. The cost is paid once, before KVM_RUN, and can be amortized across a warm pool of pre-booted VMs that all share the same base files.
Lazy-loaded files are loaded on demand — the metadata entry exists in the table (the guest can see the file in directory listings and stat it), but the data pages aren’t populated yet. When the process actually reads the file, the shim checks a data_ready flag in the metadata entry. If the data isn’t loaded, it signals the hypervisor via an I/O port and spins until the flag is set. The hypervisor loads the data from the host into the cache pages, sets the flag, and the shim resumes.
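On the shim side, this reduces to a flag check around the load. A single-threaded simulation of that handshake, with a callback standing in for the I/O-port signal and the hypervisor’s fill — data_ready follows the text, the rest is assumed:

```rust
/// Toy VFS entry: metadata always present, data populated lazily.
struct Entry { data_ready: bool, data: Option<Vec<u8>> }

/// First access triggers the load; later accesses hit the cache.
/// `load` stands in for "signal the hypervisor, spin until ready".
fn read_lazy<'a>(e: &'a mut Entry, load: impl Fn() -> Vec<u8>) -> &'a [u8] {
    if !e.data_ready {
        e.data = Some(load()); // real shim: outl(...), then spin on the flag
        e.data_ready = true;
    }
    e.data.as_deref().unwrap()
}
```

The property that matters is that the load runs exactly once per file: the first read pays the VM exit, every later read is a flag check and a copy.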
Lazy loading matters for several reasons. First, task-specific data: a warm pool of pre-booted VMs can’t know what task they’ll run — the input data is different every time. The VM boots with the runtime pre-cached, gets assigned a task, and task-specific files are loaded lazily when the process first touches them. The first read pays a VM exit; every subsequent read is a memcpy.
Second, memory. A Python installation with its standard library and common packages can be hundreds of megabytes. Most of those files are never touched in a given task — the process imports a handful of modules, not the entire stdlib. Pre-caching everything would make each VM far too memory-hungry, especially when running hundreds of them on a single machine. Lazy loading means the VM only pays memory for files the process actually reads.
Third, and this is something we’re still exploring: dynamic loading opens the door to content hooks. Because the hypervisor controls when and how file data is loaded into the cache, it has an interception point — it can inspect, transform, or substitute file contents before the guest sees them. A file that contains sensitive data on the host could be redacted based on the agent’s policy before being loaded into the guest cache. A configuration file could be rewritten per-task based on the agent’s permissions. The guest’s read() returns whatever the hypervisor placed in the cache pages, and the guest has no way to know whether that matches what’s on disk. This is future work, but the architecture supports it because of the lazy loading design.
The guest never sees host paths. It sees whatever guest paths the policy defines. A file at /data/customers.csv in the guest might come from /var/tasks/job-4821/input.csv on the host. The mapping is entirely controlled by the hypervisor — the guest has no way to discover or access host paths that aren’t in its VFS configuration.
To be clear: this is not a filesystem. There’s no ext4, no overlayfs, no mount table, no inodes, no dentries. It’s a flat list of memory regions with names. open("/data/input.csv") walks the metadata table for a match. Found → allocate an fd. Not found → -ENOENT. The file doesn’t exist because access was denied — it doesn’t exist because there is literally nothing at that path.
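The whole of open() then reduces to a linear scan plus an fd allocation. A sketch under assumed names:

```rust
const ENOENT: i64 = 2;

/// Flat metadata table: (guest path, index into the cache region).
struct Vfs { entries: Vec<(&'static str, usize)>, next_fd: i64 }

impl Vfs {
    /// open(): exact-match scan of the table. No symlinks, no mounts,
    /// no permission bits — presence in the table *is* the policy.
    fn open(&mut self, path: &str) -> i64 {
        match self.entries.iter().find(|(p, _)| *p == path) {
            Some(_) => { let fd = self.next_fd; self.next_fd += 1; fd }
            None => -ENOENT,
        }
    }
}
```

There is no error path for "exists but forbidden" because forbidden files are simply never placed in the table.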
This eliminates entire categories of filesystem attacks:
| Attack | Traditional filesystem | In-memory VFS |
|---|---|---|
| Path traversal (../../etc/shadow) | Possible if misconfigured | No directory tree to traverse |
| Symlink following | Possible | Not implemented today — if added, resolved in the metadata table, not the filesystem |
| TOCTOU race conditions | Possible | Single process, no concurrent mutation — contents are fixed once loaded |
| Mount namespace escape | Possible with privileges | No mount concept |
| Inode/dentry exhaustion | Possible | Fixed table, no allocation |
Every syscall that isn’t implemented in the shim doesn’t exist as an attack surface. But more importantly, even the syscalls that are implemented are simpler — and simpler means more auditable.
Consider open() in the Linux kernel. It handles path resolution across mount namespaces, follows symbolic links (up to 40 levels deep), checks permissions against uid/gid/capabilities, negotiates with the filesystem driver, allocates inodes and dentries, handles O_CREAT/O_EXCL atomicity, manages file locks, updates access timestamps. It’s thousands of lines of code with decades of CVEs.
open() in the shim: walk a flat metadata table. If the path is there, allocate a file descriptor. If not, return -ENOENT. No mount namespaces, no symlinks, no permission checks beyond “does this entry exist.” The table itself defines what exists.
The same applies across the board:
| Syscall | Linux kernel complexity | Shim complexity |
|---|---|---|
| mmap | VMA trees, page fault handlers, COW, shared mappings, file-backed mappings, huge pages | Bump allocator, anonymous only |
| connect | Full TCP state machine, routing table, netfilter, congestion control | Write destination to mailbox, hypervisor does the connect |
| getpid | Namespace-aware PID lookup across PID namespaces | Return 1 |
| clock_gettime | vDSO, multiple clock sources, NTP adjustments | Read TSC, multiply by pre-computed frequency |
The shim’s clock_gettime implementation is instructive. The hypervisor reads the host’s TSC frequency at boot and writes it into a known memory location. The shim reads the TSC directly (which doesn’t cause a VM exit on modern CPUs), multiplies by the frequency, and returns the result. No VM exit, no host interaction, nanosecond precision. Ten lines of code where the kernel has hundreds.
The ~390 syscalls that aren’t in the dispatch table aren’t bugs to fix. They’re attack surface that doesn’t exist.
No ptrace — the process can’t debug or inspect itself. No mount — the process can’t modify its filesystem view. No setuid — the process can’t escalate privileges. No kexec_load — the process can’t replace the kernel. No bpf — the process can’t install eBPF programs. No perf_event_open — the process can’t access performance counters.
These syscalls are regularly the source of privilege escalation CVEs in containers. In the shim, they return -ENOSYS. The code path that could be exploited doesn’t exist — not in the shim, not anywhere in the VM. There is no kernel to exploit.
An important consequence of honoring the syscall ABI: the process doesn’t need to be recompiled, re-linked, or modified in any way beyond the binary rewrite from post 1. A statically-linked Python binary, compiled by someone else, downloaded from a package repository, runs unmodified. So does a Go binary, a Rust binary, or a C program linked against musl.
The shim doesn’t care what language the binary was written in or how it was compiled. It cares about what syscalls the binary makes and what arguments it passes. As long as those fall within the ~60 implemented syscalls — which they do for typical single-process workloads — the binary runs.
This is where the library OS and unikernel approaches fall apart in practice. They require rebuilding the application against a custom runtime. That means maintaining compatibility with every language ecosystem, every package manager, every build system. The binary rewrite + syscall ABI approach sidesteps all of this: if it runs on Linux, and it uses fewer than ~60 syscalls, it runs on the shim.
A single VM with a shim and a hypervisor backstop is useful. A thousand of them on one machine is a platform. But a thousand VMs with 512MB of system area each would consume half a terabyte of RAM. That doesn’t work.
Tomorrow’s post covers how shared memory, warm pools, and copy-on-write make it possible to run thousands of these VMs on a single machine — and why the fixed, deterministic memory layout described here is what makes sharing possible in the first place.
This is post 2 of a 7-part series on building a minimal VM runtime. Subscribe to get the rest.
If you have questions or want to discuss — reach out on LinkedIn.