Building a Linux Container Runtime from Scratch

edera.dev

217 points by curmudgeon22 9 months ago · 69 comments

pss314 9 months ago

I loved this hands-on presentation, Containers From Scratch by Liz Rice, from a few years ago: https://www.youtube.com/watch?v=8fi7uSYlOdc

Today, Linux containers in (less than) 100 lines of shell by Michael Kerrisk was published: https://www.youtube.com/watch?v=4RUiVAlJE2w
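
For anyone who wants to follow along, here is a minimal sketch in Go in the spirit of Liz Rice's demo (an illustration rather than her exact code; it assumes Linux, root privileges, and a ./rootfs directory containing a shell):

  // main.go: run a command inside new UTS, PID and mount namespaces.
  package main

  import (
      "os"
      "os/exec"
      "syscall"
  )

  func must(err error) {
      if err != nil {
          panic(err)
      }
  }

  func main() {
      switch os.Args[1] {
      case "run": // re-exec ourselves as "child" inside fresh namespaces
          cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
          cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
          cmd.SysProcAttr = &syscall.SysProcAttr{
              Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
          }
          must(cmd.Run())
      case "child": // we are now PID 1 inside the new namespaces
          must(syscall.Sethostname([]byte("container")))
          must(syscall.Chroot("./rootfs")) // assumes ./rootfs exists, with a /proc dir inside
          must(os.Chdir("/"))
          must(syscall.Mount("proc", "proc", "proc", 0, "")) // so ps works
          cmd := exec.Command(os.Args[2], os.Args[3:]...)
          cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
          must(cmd.Run())
      }
  }

Something like `sudo go run main.go run sh` should then drop you into a shell with its own hostname, its own PID 1, and a private /proc.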

  • seungwoolee518 9 months ago

    Michael Kerrisk wrote a series of articles about Linux namespaces on lwn.net [0]

    [0]: https://lwn.net/Articles/531114/#series_index

  • Brian_K_White 9 months ago

    That bash/busybox demo is awesome. The code is at: https://man7.org/tlpi/code/ (/tlpi-dist/consh/ in the tar)

    I still used lxc-utils in my rc script, which now seems like positively cheating; I may as well have used docker.

  • Brian_K_White 9 months ago

    On my birthday, while attending Arisia in January 2010, I wrote a single rc script with about 30 non-boilerplate lines of bash (the 3 functions) that does:

      * start all enabled containers on boot
      * stop all running containers at shutdown (ie gracefully wait for them all to shut themselves down before letting the host proceed to shut itself down)
      * start/stop/status any specified container on command
      * list all containers (known/configured, running or not)
      * every container has a gnu screen console
      * simple config file per container to define network & root dir etc.
    
    (these links point to the latest versions of the wiki page and the referenced rclxc package, but I created the wiki page and the script on Jan 18 2010, despite what the wiki history shows. The weird link for the rpm is because home:aljex no longer exists on the openSUSE build service)

    https://en.opensuse.org/SDB:LXC

    https://anna.lysator.liu.se/pub/opensuse/repositories/home%3...

    A whopping 3 files in the package: one is just a symlink, and another is just a single rmdir command. No daemon; the script only runs to do something. Not even systemd, just plain old sysv init.

    I never developed it beyond essentially a proof of concept because my company's owner listened to vmware salespeople, but I did use it in quasi-production for a year or two. (some developer vms, a few internal services, 20 or so customers)

    But to me it did prove the concept and I would have liked to just work on that instead of using vmware or anything else. I completely gag when I look at kubernetes or even just podman when I had this so long ago and got so much function out of so little code and complication.

    I mean, it would obviously get larger and more complicated as it grew to handle more cases and supply more features. I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code. I feel like once you cross that point you have wandered off the track and are doing bad engineering in some way; you need to go back, figure out where you started driving in your sleep, and get back on track solving the problem of getting the necessary job done in some sensible way.

    • kubafu 9 months ago

      > I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code.

      And that should be the right approach 90% of the time. Thanks for your comment!

Joker_vD 9 months ago

> Importantly, we designed Styrolite with full awareness that Linux namespaces were never intended as hard security boundaries—a fact that explains why container escape vulnerabilities continue to emerge. Our approach acknowledges these limitations while providing a more robust foundation.

So what do you do, exactly?

  • klysm 9 months ago

    Say “it’s probably fine” and hope that the people building the foundational systems are protecting us

    • Joker_vD 9 months ago

      No, I mean, what do the Edera developers do differently, in order to provide a more robust foundation with this new container runtime called Styrolite? They still use Linux namespaces, as far as I can tell from TFA.

      • denhamparry 9 months ago

        Edera developer here, we use Styrolite to run containers with Edera Protect. Edera Protect creates Zones to isolate processes from other Zones, so that if someone were to break out of a container, they'd only see the zone's processes, not the host operating system or the hardware on the machine. The key difference between us and other isolation implementations is that there is no performance degradation, you don't have to rebuild your container images, and we don't require specific hardware (e.g. you can run Edera Protect on bare metal, on public cloud instances, and everything in-between).

        • xmodem 9 months ago

          What underlying primitives are you relying on to provide isolation, if not linux namespaces?

          How does your approach compare to Google's gVisor?

          • asmor 9 months ago

            It's Xen, and they even explain why it's not KVM here: https://github.com/edera-dev/krata/blob/main/FAQ.md

          • sys_call 9 months ago

            gVisor emulates a kernel in userspace, providing some isolation but still relying on a shared host kernel. The recent Nvidia GPU container toolkit vulnerability allowed privilege escalation and a container escape to the host because of a shared inode.

            Styrolite runs containers in a fully isolated virtual machine guest with its own kernel, separate from the host kernel. Styrolite doesn't run a userspace kernel that traps syscalls; it uses a type 1 hypervisor for better performance and security. You can read more in our whitepaper: http://arxiv.org/abs/2501.04580

            • xmodem 9 months ago

              Thanks for the explanation. So you are using virtualisation-based techniques. I had incorrectly inferred from other comments that you were not.

              I skimmed the paper and it suggests your hypervisor can work without CPU-based virtualisation support - that's pretty neat.

              Many cloud environments do not have support for nested virtualisation extensions available (and also it tends to suck, so you shouldn't use it for production even if it is available). So there aren't many good options for running containers from different security domains on the same cloud instance. gVisor has been my go-to for that up until now. I will be sure to give this a shot!

            • 0x1ceb00da 9 months ago

              So it's a lightweight way of running docker images inside a virtual machine?

              • sys_call 9 months ago

                Yes, precisely. This also provides container operators with the benefits of a hypervisor, like memory ballooning and dynamically allocating CPU and memory to workloads, improving resource utilization and reducing the node overprovisioning that is common today.

            • klysm 9 months ago

              So it’s a VM?

        • znpy 9 months ago

          > Edera Protect creates Zones to isolate processes from other Zones

          What do you mean by "zone" exactly?

          • sys_call 9 months ago

            A zone is jargon for a virtual machine guest environment (an homage to Solaris Zones). Styrolite and Edera run containers inside virtual machine guests for improved isolation and resource management.

            • znpy 9 months ago

              > an homage to Solaris Zones

              i asked specifically because the word "zones" reminded me of solaris zones :)

              > Styrolite and Edera run containers inside virtual machine guests for improved isolation and resource management.

              do you have your own vmm, or is it firecracker with make-up and a wig?

            • klysm 9 months ago

              How exactly is this an improvement over VMs?

              • sys_call 9 months ago

                We run unmodified containers in a VM guest environment, so you get the developer ergonomics of containers with the security and hardware controls of a VMM.

  • flkenosad 9 months ago

    Anyone know if it's possible to update the Linux kernel so that namespaces are hard security boundaries? I wonder what that would entail.

    • eyberg 9 months ago

      When we speak of 'hard security boundaries', most people in this space are comparing to existing hardware-backed isolation such as virtual machines. There are many container escapes each year because the chunk of API that containers are required to cover is so large, but more importantly because there is no isolation at the CPU level (e.g. Intel VT-x instructions such as VMREAD, VMWRITE, VMLAUNCH, VMXOFF, VMXON).

      This is what the entire public cloud is built on. You don't really read articles that often where someone is talking about breaking vm isolation on AWS and spying on the other tenants on the server.

      • vaylian 9 months ago

        > There are many container escapes each year because the chunk of api that they are required to cover is so large

        What API? The kernel syscall API?

        If we assume for a moment that there are no bugs in the Linux namespace implementation, would containers be as safe as virtual machines?

      • flaminHotSpeedo 9 months ago

        > This is what the entire public cloud is built on.

        Well... The entire public cloud except Azure. They've been caught multiple times with vulnerabilities stemming from the lack of hardware-backed isolation between tenants.

        • richardwhiuk 9 months ago

          Azure has the same level of isolation for VMs at a hardware level as AWS.

          • flaminHotSpeedo 9 months ago

            How Azure isolates VMs is completely unrelated, because containers are not VMs. And if you meant to assert that Azure uses hardware-assisted isolation between tenants in general, that was not the case for Azurescape [1] or ChaosDB [2].

            [1] https://unit42.paloaltonetworks.com/azure-container-instance...

            [2] https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...

            • richardwhiuk 9 months ago

              It is the case for VMs that customers create.

              It hasn't always been the case for managed services, but I don't think that's true for AWS either.

              • flaminHotSpeedo 9 months ago

                Unmanaged VMs created directly by customers still aren't relevant to this discussion. The whole point here is that everyone else uses some form of hardware-assisted isolation between tenants, even in managed services that vend containers or other higher-order compute primitives (i.e. Lambda, Cloud Functions, and hosted notebooks/shells).

                Between first- and second-hand experience I can confidently say that, at a bare minimum, the majority of managed services at AWS, GCP, and even OCI use VMs to isolate tenant workloads. Not sure about OCI, but at least in GCP and AWS, security teams that review your service will assume that customers will break out of containers no matter how the container capabilities/permissions/configs are locked down.

    • GardenLetter27 9 months ago

      A lot of use cases don't want that, though. It's nice having lightweight network namespaces, for example, just to separate the network stack for tunneling while still having X and Wayland work fine with the applications running there.
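
      As an illustration of how lightweight that is, here is a hedged Go sketch of entering a fresh network namespace (an assumption-laden example, not from the article: Linux only, needs CAP_SYS_ADMIN or a new user namespace, and uses golang.org/x/sys/unix):

        // netns_shell.go: give this process its own network stack, then exec a shell.
        package main

        import (
            "os"

            "golang.org/x/sys/unix"
        )

        func main() {
            // Detach from the host's network namespace (needs CAP_SYS_ADMIN).
            if err := unix.Unshare(unix.CLONE_NEWNET); err != nil {
                panic(err)
            }
            // The shell below sees only a fresh, down loopback interface;
            // every other namespace (mount, PID, user, ...) is still shared.
            if err := unix.Exec("/bin/sh", []string{"sh"}, os.Environ()); err != nil {
                panic(err)
            }
        }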

    • fulafel 9 months ago

      Have a look at gVisor for one approach.

  • z3t4 9 months ago

    Once you have set up the namespaces, you drop all capabilities, so if the program gets hacked while it's running it can do very little.
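
    For concreteness, a hedged Go sketch of emptying the capability bounding set (illustration only, with assumptions: Linux, golang.org/x/sys/unix, and CAP_SETPCAP still held when this runs):

      // Drop every capability from the bounding set by probing PR_CAPBSET_DROP
      // until the kernel rejects the capability number with EINVAL.
      package main

      import (
          "fmt"

          "golang.org/x/sys/unix"
      )

      func dropAllCapabilities() error {
          for c := 0; ; c++ {
              err := unix.Prctl(unix.PR_CAPBSET_DROP, uintptr(c), 0, 0, 0)
              if err == unix.EINVAL {
                  return nil // past the highest capability this kernel knows about
              }
              if err != nil {
                  return fmt.Errorf("dropping capability %d: %w", c, err)
              }
          }
      }

      func main() {
          if err := dropAllCapabilities(); err != nil {
              panic(err)
          }
          // From here on, even a compromised process cannot regain
          // capabilities by exec'ing setcap/setuid binaries.
          fmt.Println("bounding set emptied")
      }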

    • denhamparry 9 months ago

      Edera developer here. I agree! But there are instances where we need to run with additional capabilities, and we're also dependent on people knowing how to do the right thing. We're trying to improve this by making it the default, while also improving the overall performance and efficiency of running containers.

    • znpy 9 months ago

      honest question: how is this any better than running non-root containers?

      They can do very little anyway, that way.

      • sys_call 9 months ago

        Non-root containers still operate under a shared kernel. Non-root containers that run under a vulnerable kernel can lead to privilege escalation and container escapes.

        Styrolite is a container runtime engine that runs containers in a virtual machine guest environment with no shared kernel state. It uses a type 1 hypervisor to fully isolate a running container from the node and other containers. It's similar to Firecracker or Kata containers, but doesn't require bare metal instances (runs on standard EC2, etc) and utilizes paravirtualization.

seungwoolee518 9 months ago

When I was digging into containers (i.e. Linux namespace capabilities), lwn.net's series of articles helped me a lot.

https://lwn.net/Articles/531114/#series_index

shortrounddev2 9 months ago

I've seen many examples of people creating containers for Linux; I wish it were comparably easy to create containers for Windows. The fundamental software exists on Windows (AppContainers are how UWP apps work), but the documentation around AppContainers is very sparse/opaque, because Microsoft doesn't want you to use AppContainers to make a general-purpose sandbox environment like Snap or Flatpak; they want you to write UWP apps. It would be immensely helpful if you could run any arbitrary win32 or higher application in a sandboxed AppContainer where the NT system calls only had access to, say, the application's local folder and its %APPDATA% folder.

Alas, I think that Microsoft has simply given up on native application support on Windows. Currently the only good ways to write native apps for Windows are still Win32/MFC and WinForms.

In fact, I think that secretly even Microsoft knows that everyone hates their UI frameworks/runtimes (and the fact that Microsoft deprecates them 2 years into their lifespan), because Microsoft STILL provides modern .NET 8/9 bindings for WinForms in 2025. If only they would just replace the GDI renderer with Direct2D, it would be literally perfect.

  • pjmlp 9 months ago

    Windows containers exist; they are based on job objects, and Microsoft took the approach of exposing the same APIs the Docker world expects, as a means of integrating with the DevOps container world's expectations.

    https://learn.microsoft.com/en-us/virtualization/windowscont...

    You missed GDI+. The Direct2D API is a COM mess that we only put up with because of DirectX, and the DirectX team doesn't like .NET, thus nothing like XNA or Managed DirectX will ever happen again.

    WPF also exists, and since Build 2025 it has regained parity with WinUI among the official Windows GUI frameworks that aren't in maintenance mode, unlike Forms and MFC.

    However, WinUI 3.0 with WinAppSDK has been a mess of a project since Project Reunion was announced back in 2021; after almost four years it is still a shadow of the UWP tooling. This is where I agree with you: it was so badly managed that nowadays only the Windows development team really cares about it, most likely because their jobs depend on having to use WinUI.

    But if you so wish to go through the pains of WinUI, there is Win2D.

    • shortrounddev2 9 months ago

      While Windows containers exist, the documentation surrounding them at the API level is sparse. Anything from Azure just tells you to use Docker.

      As far as I can tell, GDI+ is still software-rendered? DirectX COM objects aren't difficult to work with at all; I've never understood why people hate them so much. The point of using Direct2D would be to provide hardware rendering for WinForms.

      WPF is OK compared to WinUI 3, but it still suffers from XAML.

      • pjmlp 9 months ago

        Because the API was designed to be compatible with Docker tooling.

        GDI and GDI+ have been hardware accelerated for years now:

        https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

        Maybe because COM tooling sucks; in C++ land, Microsoft reinvents the approach to using COM every couple of years, and it is too much C/C++ style instead of a proper modern C++ approach to handling COM.

        While in .NET land, the DirectX team couldn't care less and leaves it to the community to make the interop work without issues.

        The XAML hate comes mostly from outside traditional Windows developer circles.

        • shortrounddev2 9 months ago

          Yes, but the point is not to have to use Docker to containerize an app; it would be nice to be able to containerize an app with a built-in runtime, or something that is just literally not Docker. Microsoft could solve so many of its security issues with an equivalent to Snap.

          Again, I don't get what the COM hate is about. In DirectX, it's basically just become a simple way to manage the life cycle of an object.

          And XAML hate is the hill I'm willing to die on. UI should be defined in either a DOM or a WinForms-like API, but not a mix of the two. XAML is just straight up one of the worst things Microsoft has created.

        • shortrounddev2 9 months ago

          Also, the hardware acceleration in GDI and especially GDI+ is not totally complete. Text rendering in GDI+ is still handled in software, and only some operations in GDI are hardware accelerated.

m00dy 9 months ago

We are an algorithmic trading company [0], and our trading strategies are primarily built as pure Rust libraries. We've been searching for a way to sandbox the strategies we host, as not all of them are signed or open source for verification. Styrolite seems like a promising solution to address this issue, so we’re planning to give it a try.

[0]: https://cycletop.xyz

  • denhamparry 9 months ago

    Edera developer here! Thank you for sharing; any feedback you have would be great! Edera Protect is written in Rust too, and our focus is on performance as well as isolation.

pzmarzly 9 months ago

Why not use any of the existing OCI runtimes? They take a well-defined[0] JSON description as input and are pretty self-contained (a single static binary). And because they are separate binaries, not libraries, you don't need to worry about things like thread safety or FD leaking.

[0] https://github.com/opencontainers/runtime-spec/blob/main/con...
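
For reference, here is an abridged config.json in the shape the runtime-spec defines (an illustrative excerpt, not a complete or canonical file; see the spec link above for the full field set):

  {
    "ociVersion": "1.0.2",
    "process": {
      "terminal": true,
      "user": { "uid": 0, "gid": 0 },
      "args": ["sh"],
      "cwd": "/"
    },
    "root": { "path": "rootfs", "readonly": true },
    "linux": {
      "namespaces": [
        { "type": "pid" },
        { "type": "mount" },
        { "type": "uts" },
        { "type": "network" }
      ]
    }
  }

A runtime like runc then consumes a bundle directory containing this file plus a rootfs, e.g. `runc run <container-id>` from inside the bundle.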

  • zamalek 9 months ago

    "I don't need the full capabilities of OCI." In my (now very much stagnating) Nix-like pet project[1] I merely want a hermetic build environment. Rolling my own container runtime was no more difficult than, what would likely be, a nightmare of emulating a complete OCI container for the simple purpose that I'm after.

    Simple problems need simple solutions, and OCI is really complex. I was initially overjoyed by the prospect of deleting my code, but it looks like this project doesn't have rootless/shadowutils support yet (which is solely useful for not having to worry about su or caps during development).

    [1]: https://github.com/porkg/porkg/tree/rs

  • r3trohack3r 9 months ago

    I’m currently exploring this for an AI context, because I haven’t found a better solution for letting K8s manage AI workloads that need direct GPU access on macOS.

    • denhamparry 9 months ago

      Edera developer here. Edera Protect is being developed to manage access to a node's GPU hardware alongside the containers running your workloads. We talk a lot about isolation between containers, but we're also focused on adding this isolation throughout the stack, from containers/processes down to hardware.

      • r3trohack3r 9 months ago

        Sounds compelling. I can’t see any mention of Apple silicon on your site; any intention of supporting it?

    • pm90 9 months ago

      You're running a Kubernetes cluster with nodes that are running macOS?

    • brcmthrowaway 9 months ago

      Why are you building AI anything

  • harha_ 9 months ago

    The beginning of the article answers your question.

infogulch 9 months ago

How does this compare to the recently discussed Landrun?

https://news.ycombinator.com/item?id=43445662

cedws 9 months ago

Isn’t the gold standard of containerisation gVisor? You can’t get much more restrictive than proxying and filtering syscalls. As far as I remember, it’s the default runtime on GKE.

  • denhamparry 9 months ago

    Edera developer here. gVisor is restrictive, but at a cost in performance. Personally, I'd say Edera Protect goes one level deeper: we create Edera Protect Zones to provide isolation, so a Zone is isolated from the OS and hardware of the machine running the container. So we don't proxy or filter syscalls; the isolation is a layer deeper. We are also focused on ensuring that Edera Protect is as performant as (if not better than) running a container today with containerd.

    Finally, if you wanted to, you could run gVisor within Edera Protect, but we feel that Edera Protect already provides the security benefits that gVisor offers.

    • cedws 9 months ago

      Thanks, but what is a “Protect Zone” at a technical level? Why does it provide stronger isolation than syscall filtering?

    • raesene9 9 months ago

      How would you say it compares to Firecracker?

  • raesene9 9 months ago

    If you want better isolation than is provided by Linux namespaces et al., then yes, something like gVisor or Firecracker (https://firecracker-microvm.github.io/) provides a likely better level of isolation.

  • sys_call 9 months ago

    gVisor runs a userspace kernel that proxies syscalls to a shared host kernel. Running an "application kernel" in userspace impacts performance because everything goes through two schedulers. Virtual machine isolation is more restrictive because it doesn't share any kernel state with other containers. We have a whitepaper that compares the performance of gVisor and Styrolite/Edera if you want to see the differences: http://arxiv.org/abs/2501.04580

TechDebtDevin 9 months ago

Cookie consent card won’t disappear. Brave mobile.

asicsp 9 months ago

See also:

* https://ericchiang.github.io/post/containers-from-scratch/

* https://indradhanush.github.io/blog/life-of-a-container/
