Hey all. I wrote a blog post recently that was meant to be a small update on things I’ve been working on with TinyKVM. However, it sort of came out of nowhere and didn’t explain the big-picture architecture.
TinyKVM
TinyKVM is a lightweight specialty hypervisor that sandboxes a limited set of regular, unmodified Linux programs: typically programs that act as servers and run some kind of event loop, like Rust's tokio. Its primary design focus is compute and request-response workloads, and it especially thrives on running specialized static executables. It can reset a program back to a previous state in record time, and the intention is for the reset mechanism to be used on every single request.
And, I guess you can still run simple programs inside it:
$ kvmserver --allow-read run uname -a
Linux tinykvm 3.5.0 x86_64 x86_64 x86_64 GNU/Linux
$ kvmserver --allow-read run perl /usr/games/cowsay Hello TinyKVM World!
______________________
< Hello TinyKVM World! >
----------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

So, the overall architecture is designed for per-request isolation and scales concurrently through tiny VM forks. Tiny KVM VM forks.
Per-request isolation
Normally when you host a web service, each time a request comes in, that request will be handled inside a stateful application that can remember things over time. The drawback is that a break-in can be made to affect future requests. The break-in might never be able to escape the application if it’s sandboxed or jailed (or both), but it will still have access to the application itself and it could make it start doing shady things with future client requests (or backend services). That is, it could set up shop in your application.
Per-request isolation reduces the fallout by resetting the whole application back to a known good state after every single request, no matter what. It will appear as if the whole VM just disappeared after the request ended and was replaced with a new one for the next request, losing any and all changes that were made. We can say that each request is handled in an ephemeral request VM. This safety mechanism lowers the blast radius of an attack and eliminates all types of resident temporal attacks. It also removes garbage collection from the equation, as there is never any opportunity for it to run, avoiding potentially extreme GC pauses.
An example:
let state = 0;
Deno.serve({ port: 8000 }, (_req) => {
  state += 1;
  return new Response(String(state));
});

We'll run this program with Deno in the TinyKVM CLI:
export DENO_V8_FLAGS=--predictable,--max-old-space-size=64,--max-semi-space-size=64
export DENO_NO_UPDATE_CHECK=1
kvmserver -e -t 1 --warmup 1000 --allow-all run deno run --allow-all local.ts

We assume deno is in PATH. The -e argument means ephemeral and enables per-request isolation, while -t is the number of tiny VM forks that make up the concurrency of our isolated application. During startup it will send 1000 warmup requests, which will definitely activate the V8 JIT. After warmup it becomes ephemeral and starts listening for requests. Let's send the 1001st request:
$ curl -D - http://127.0.0.1:8000
HTTP/1.1 200 OK
content-type: text/plain;charset=UTF-8
vary: Accept-Encoding
content-length: 5
date: Mon, 27 Oct 2025 17:40:25 GMT

1001
$ curl -D - http://127.0.0.1:8000
HTTP/1.1 200 OK
content-type: text/plain;charset=UTF-8
vary: Accept-Encoding
content-length: 5
date: Mon, 27 Oct 2025 17:40:49 GMT
1001
We sent two requests, and yet the number didn't get incremented. We know it was incremented during warmup, because we sent 1000 warmup requests and it now reads 1001. The request VM was reset back to a former state between our two requests.
Concurrency and memory usage
TinyKVM enables massive concurrency by making tiny forked VMs from warmed-up main VMs. These tiny forks use copy-on-write pages to reduce their memory footprint. The forks are single-threaded and run at native performance. Some page-table walking is avoided by using huge pages essentially everywhere until a write happens.
A Deno hello world program needs only 13.5 MB RSS to start with a VM snapshot, and around 228 kB of memory per fork instance. For reference, the Deno main executable is 109 MB.
The RSS for each VM at load time is nearly the same, as no guest memory is in use yet. Ideally the numbers would be equal, but there are small unexplained differences. What's interesting is of course the additional memory used after a VM fork has been created, and then after a request has been processed. We can see that the scalability is quite insane. It's hard to believe, actually.
Gated persistence for per-request isolation
While we cannot have persistence in the request VMs themselves, as they keep getting reset after requests conclude, we can have a separate VM for that. TinyKVM supports a safe memory-sharing RPC mechanism that lets you call a function in a remote VM while pausing the caller VM. We've also discovered that you can resume Deno from a paused state in a custom FFI function and the regular Deno event loop will still work. We'll write our request and the answer into a shared buffer passed to the remote VM. This is true zero-copy, bidirectional IPC. Here's our example persisted program:
import { connect } from "jsr:@db/redis";

const redisClient = await connect({ hostname: "127.0.0.1", port: 6379 });
await redisClient.set("value", "0");
let result = 0;
while (true) {
  const buffer = waitForRemoteBuffer(result);
  if (buffer.length === 0) {
    result = -1;
    continue;
  }
  const redis_answer = await redisClient.incr("value");
  const response = "Hello, " + redis_answer + " from Persisted Deno!";
  const { read, written } = new TextEncoder().encodeInto(response, buffer);
  result = read < response.length ? -1 : written;
}
Our separate program with persistence sits in an endless loop waiting for a buffer. When a request (with a buffer) is received, we process it and then go back around, calling waitForRemoteBuffer again with the result. The result is handed back to the caller, which is waiting in a paused state. Any changes made to the buffer we received are also visible in the caller. The waitForRemoteBuffer function calls an FFI function that is compiled automatically by KVM server and made available through a fixed libkvmserverguest.so filename. So, all persistence programs use the same API.
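To give a concrete sense of how such a binding might look from the guest side, here is a minimal sketch using Deno's FFI. The symbol names and signatures are assumptions for illustration only; the real interface is whatever libkvmserverguest.so actually exports:

// Hypothetical binding: symbol names and signatures are illustrative only.
const lib = Deno.dlopen("libkvmserverguest.so", {
  wait_for_remote_buffer: { parameters: ["isize"], result: "pointer" },
  remote_buffer_length: { parameters: [], result: "usize" },
});

function waitForRemoteBuffer(result: number): Uint8Array {
  // Hand back the previous result and pause until the next request arrives.
  const ptr = lib.symbols.wait_for_remote_buffer(result);
  if (ptr === null) return new Uint8Array(0);
  const len = Number(lib.symbols.remote_buffer_length());
  // Wrap the shared memory region directly, without copying.
  return new Uint8Array(Deno.UnsafePointerView.getArrayBuffer(ptr, len));
}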
Our request program now looks something like this:
Deno.serve({ port: 8000 }, (_req) => {
  const remote_buffer = new Uint8Array(256);
  const len = remoteResume(remote_buffer, remote_buffer.byteLength);
  if (len < 0) {
    return new Response("Internal Server Error", { status: 500 });
  }
  const remote_str = new TextDecoder().decode(
    new Uint8Array(remote_buffer.buffer, 0, len),
  );
  return new Response(remote_str);
});

This program is the one handling the request, and it will be reset after the request concludes. We're accessing our persistence program through the remoteResume FFI function.
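For symmetry, a hedged sketch of the caller-side binding might look like this (again, the symbol name and signature are assumptions, not the actual exported API):

// Hypothetical binding, mirroring the storage side; illustrative only.
const lib = Deno.dlopen("libkvmserverguest.so", {
  remote_resume: { parameters: ["buffer", "usize"], result: "isize" },
});

function remoteResume(buffer: Uint8Array, length: number): number {
  // Pauses this request VM, shares the buffer with the storage VM,
  // and returns the number of bytes written (negative on error).
  return Number(lib.symbols.remote_resume(buffer, length));
}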
We can load both programs at the same time in KVM server: our so-called storage program along with the main request program, like so:
export DENO_V8_FLAGS=--predictable,--max-old-space-size=64,--max-semi-space-size=64
export DENO_NO_UPDATE_CHECK=1
kvmserver -e -t 1 --warmup 1000 --allow-all storage deno run --allow-all remote.ts ++ run deno run --allow-all local.ts

They are now completely separate from each other, living in separate VMs. In this example, requests can only access persistence through a single buffer.
You should see an extra line when starting KVM server now:
Storage VM initialized. init=416ms
Listening on http://0.0.0.0:8000/ (http://localhost:8000/)
Warming up the guest VM listening on 0.0.0.0:8000 (1 threads * 1000 connections * 1 requests)
Program 'deno' loaded. epoll vm=1 ephemeral-kwm huge=0/0 init=161ms warmup=55ms rss=73MB

A storage VM has been initialized from a completely separate program. The storage VM has persistence and won't get reset unless it crashes. Since we're using a KV-store as our storage, we don't lose any data even if a mistake crashes and restarts our persistent VM. Now when we make cURL requests we will see that we appear to have persistence again, while having added only a meager 50 MB to RSS. And most importantly, we maintain a reduced blast radius for any attacks.
We could use Redis directly from the request VM, but then any break-in would also have access to it. Only allowing access to persistence through a bottleneck that you can apply scrutiny to is a good defense strategy.
Let’s send some cURL requests now and see what happens:
$ curl -D - http://127.0.0.1:8000
HTTP/1.1 200 OK
content-type: text/plain;charset=UTF-8
vary: Accept-Encoding
content-length: 32
date: Mon, 27 Oct 2025 18:23:47 GMT

Hello, 1001 from Persisted Deno!
$ curl -D - http://127.0.0.1:8000
HTTP/1.1 200 OK
content-type: text/plain;charset=UTF-8
vary: Accept-Encoding
content-length: 32
date: Mon, 27 Oct 2025 18:23:48 GMT
Hello, 1002 from Persisted Deno!
This time the requests were able to count upwards. This persistence is achieved through the new sandbox-to-sandbox RPC mechanism that I tried very hard to explain in the last blog post.
What is the storage/persistence program for?
The storage/persistence program is entirely optional. Its main purpose is to allow a tenant to remember things across requests while still benefiting from per-request isolation, where nothing can be stored at all. It's fairly easy to whip up two JS programs that talk to each other, as shown above.
The main benefit of the persisted program (the storage VM) is that it can naturally have more privileges, like access to a database or being able to connect to other services. Yet it is also sandboxed, and doesn’t have full access to the host system. Meanwhile the main program (in the request VMs) can have heavily reduced privileges, and should perhaps not have any filesystem access at all and only very limited network access, if any.
Another use case for the persistent program is database connection pooling, which can realistically only be done from the persisted program: Request VM ←→ Storage VM (pooled) ←→ Database.
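As a hedged sketch of what pooling could look like inside the storage program (the pool size and round-robin policy are illustrative choices, not something KVM server mandates):

import { connect, type Redis } from "jsr:@db/redis";

// Hypothetical connection pool living in the storage VM. The connections
// survive across requests because the storage VM is never reset.
const POOL_SIZE = 4;
const pool: Redis[] = await Promise.all(
  Array.from({ length: POOL_SIZE }, () =>
    connect({ hostname: "127.0.0.1", port: 6379 })),
);

let next = 0;
function getConnection(): Redis {
  // Simple round-robin over the persistent connections.
  const conn = pool[next];
  next = (next + 1) % POOL_SIZE;
  return conn;
}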
Custom event loops
It turns out that Deno's event loop still works even though we have a custom FFI-based outer event loop. If you look back at the persisted program, you'll notice it's looping around and pausing on a custom FFI function. I've never quite understood why it Just Works, but isn't it nice? We have custom system calls that let us access a different VM, and that VM is sitting in a custom event loop, and yet we can still await on a Redis client?
while (true) {
  let buffer = ffi_function();        // pause until the next request arrives
  let val = await redis.get("value"); // Deno's own event loop still runs here
  write_to(buffer, val);
}

I'm guessing that a call to await something will run everything pending in Deno's event loop first, and only then return. But it's nice that it works.
Validation
So, does all of this really work?
$ ./wrk -c1 -t1 -L http://127.0.0.1:8000 -H "Connection: close"
Running 10s test @ http://127.0.0.1:8000
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   119.77us   94.24us   4.47ms   99.21%
    Req/Sec     8.01k   373.70     8.30k    88.12%
  Latency Distribution
     50%  111.00us
     75%  115.00us
     90%  126.00us
     99%  196.00us
  80554 requests in 10.10s, 14.67MB read
Requests/sec:   7975.70
Transfer/sec:      1.45MB

Looks OK to me. It should be accessing our persistent program, which in turn accesses Redis on every single request. We also close the connection, as that is required in KVM server.
$ curl -D - http://127.0.0.1:8000
HTTP/1.1 200 OK
content-type: text/plain;charset=UTF-8
vary: Accept-Encoding
content-length: 34
date: Wed, 29 Oct 2025 12:12:25 GMT

Hello, 369859 from Persisted Deno!
Yep, it’s been counting! Not bad!
Benchmarks
I did some benchmarks against TinyKVM in a specialized server that avoids emulating I/O (and doesn't need Connection: close). You might think that I should be benchmarking against state-of-the-art IPC memory-sharing solutions like iceoryx2, but that is not possible, as we can't share memory while the caller VM is executing: the caller VM could write garbage into the shared memory while the other side is using it, crashing it (or worse). Since that is completely off the table, and we have to write the full request into a buffer anyway, we might as well use Redis. The difference between pipes and a local TCP socket is very small.
The first column is our unique RPC method. The point is not to say that accessing Redis is somehow bad here; it's actually quite good, and when the server is not busy the two methods are largely the same. Rather, the point is to show that accessing storage isn't expensive either. It is a bidirectional VM-to-VM communication method with shared memory and low latency, and it is safe to use because the caller is paused.
The third benchmark is the unfortunate reality without storage access: you can't reuse connections in an ephemeral VM, so a fresh connection has to be opened on every request (as the VM gets fully reset). Storage is non-ephemeral, which is what enables the improved Redis performance in the second column. All in all, we avoid around 30% of the overhead when we want to access Redis.
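For concreteness, the fresh-connection variant from the third benchmark would look roughly like this sketch (assuming the same local Redis as before):

import { connect } from "jsr:@db/redis";

// Without the storage VM, every request pays for a new TCP connection,
// because the ephemeral VM (and any open sockets) is wiped afterwards.
Deno.serve({ port: 8000 }, async (_req) => {
  const redis = await connect({ hostname: "127.0.0.1", port: 6379 });
  const value = await redis.incr("value");
  redis.close();
  return new Response("Hello, " + value + " from Ephemeral Deno!");
});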
Finally, we will make the system busy using schbench:
Since the RPC method avoids the scheduler, it has predictable and low p99 latency even when the server is busy. We can see that just reading and writing from a socket (this time with connection reuse) incurs unbounded latency when things are busy around us.
So, I hope these benchmarks explain why we do the weird things that we do in per-request isolation land. We're dealing with adversarial tenants and have to architect accordingly in order to safely access other services. To sum it up:
- Every request is ephemeral and gets completely wiped after conclusion
- We cannot reuse open connections in ephemeral VMs
- We compensate with a persistent storage VM, which is itself just a tenant program
- We have easy access to the storage VM through bidirectional communication and shared memory
- It's faster to access Redis through a connection pool in our storage VM program than to open a new connection on every request
I think there are enough sandboxes out there today that sandbox single programs or full systems very well. I’ve always wanted to work on obscure near-idiotic things, and I want to avoid building things that already exist. Currently that has led me down the garden path of per-request isolation.
Thanks for reading!
-gonzo