Faster filesystem access with Directfs
gvisor.dev

Accessing local file systems from a container? What heresy is this? Containers must all be stateless webscale single-"process" microservices with no need of local file systems and other obsolescent concepts.
Next thing you know someone will run as many as two whole "processes" in a container!
Having dispensed with that bit of bitter sarcasm: solving their local filesystem performance/security problems is great and all, but what I'd like to see for containers is to utilize an already-invented wheel, remote block devices, à la iSCSI and friends. I dream of getting there with Cloud Hypervisor or some such, where every container has a kernel that can transparently network-mount whatever it has the credentials to mount from whatever 'worker' node it happens to be running on.
In k8s that already exists via CSI[0], but kubelet handles the setup/teardown signaling and it requires a 3rd-party provisioner daemon, so it sits at a higher level than the container runtime (runsc in this case).
Yes. I know. K8s has delivered the moral equivalent of what we've had built into our OS kernels[1] since before some of the people reading this were born, and they've only had to add two layers of complexity, fragility, and inscrutability on top of k8s itself, one of which is a third-party dependency.
This is my excited face. :|
[1] 2005: https://lwn.net/Articles/131747/
No, k8s has not delivered that. It's built an orchestration layer on top of iSCSI, NVMe-oF, or whatever "remote disk" tech the kernel has implemented, and abstracted that from devs, which was the whole point of k8s.
> abstracted that from devs which was the whole point of k8s
That may be the point, but the actual impact is that "devs" became "devops" and now spend some multiple of their actual software-development time puzzling over operations abstractions.
This would mean that every container has its own buffer cache, you can no longer have intentional shared state (K8s secrets, shared volumes, etc.), and must construct block overlays instead of cheap file overlays. You’re definitely losing some of the advantages a container brings.
There are other advantages — low fixed resource costs, global memory management and scheduling, no resource stranding, etc. — but the core intent of gVisor is to capture as many valuable semantics as possible (including the file system semantics) while adding a sufficiently hard security boundary.
I’m not saying moving the file system up into the sandbox is bad (which is basically what a block device gives you), just that there are complex trade-offs. The gVisor root file system overlay is essentially that (the block device is a single sparse memfd, with metadata kept in memory) but applied only to the parts of the file system that are modified.
A container, being basically a chroot, consumes a rather small amount of resources, mostly as space in namespace and ipfilter tables.
If your containers use many of the same base layers (e.g. the same Node or Python image), the code pages will be shared, as they would be shared with plain OS processes.
Running several processes in a container is the norm. First, you run with --init anyway, so there is a `tini` parent process inside. Then, Node workers and Java threads are pretty common.
Running several pieces of unrelated software in a container is less common, that's true.
Containers are a way to isolate processes better, and to package dependencies. You could otherwise be doing that with tools like selinux and dpkg, and by setting LD_nnn env variables. Containers just make it much easier.
> Running several processes in a container is the norm.
I'm highly aware. The reason the word "process" is quoted in my highly down-voteable comment is the misuse of the term "process" by Docker et al. to mean "application." Google the "one process per container" mantra to see what I mean. Somehow the Docker crowd were oblivious to the 60+ year old concept of and terminology related to OS processes when they promulgated their guidance on how containers should be used.
I try not to indulge too many hang-ups in life, but that particular bit of damage is insufferable.
I quite like containers to limit/reserve the RAM/CPU use for certain processes. For example, imagine a tiny service used by a few concurrent users that needs a SQL DB, an app server, and a reverse proxy (for SSL/caching) in front. I'm quite happy to put stuff like this on a tiny VM with 1 vCPU and 1 GB RAM. Monthly cost: ~$5 for compute. I typically reserve/limit 64MB/128MB for nginx, 384MB/512MB for MariaDB, and 256MB/384MB for the app server (PHP etc.). I have CPU share reservations/limits too. Of course it requires tuning the configs, but it runs great (verified with load testing and actual use). If you put the same software on the same host with no reservations/limits, there are situations where latencies grow a lot, or the whole thing freezes because one component consumes too many resources.

If anyone knows any non-container low-overhead ways to partition a single vCPU and a gig of RAM like this, I'd be interested to hear about it.
Systemd has a mechanism[0] for configuring those limits.
I believe you can limit a unit to 1 vCPU and 256MB of memory by using something like the following:
[Service]
# 100% of one core
CPUQuota=100%
MemoryMax=256M
Red Hat has some documentation[1] as well if the systemd stuff is too oblique.
[0]: https://www.freedesktop.org/software/systemd/man/systemd.res...
[1]: https://access.redhat.com/documentation/en-us/red_hat_enterp...
> If anyone knows any non-container low-overhead ways to partition a single vCPU and a gig of RAM like this, I'd be interested to hear about it.
You can use cgroups[1] to do this, because that's what your container runtime is doing. People don't know this because they think these features are something their container runtime provides, and that's what they use, so no one discovers it.

Plus, the user-facing tools for cgroups are slightly hideous. And that won't ever get fixed for the reasons previously stated. Sigh.
Also, I'm sure a lot of people would appreciate learning about your tuning techniques, containers or otherwise. Consider writing it up.
[1] circa 2007...
You can set up the same limits via systemd units, as they use the same interfaces as containers. They can be applied via systemd overrides to existing services, so they are pretty straightforward.

Just beware of the same problems, like the swap trap (limiting only memory, and not memory plus swap, will just make apps that hit the memory limit start to swap like hell).
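For reference, avoiding that trap in a systemd override means capping swap explicitly alongside memory. A sketch of a drop-in (hypothetical unit name; MemoryMax/MemorySwapMax are the cgroup-v2 settings):

```ini
# /etc/systemd/system/myapp.service.d/limits.conf  (hypothetical unit)
[Service]
MemoryMax=512M
# Without this, a unit hitting MemoryMax just swaps instead of being OOM-killed.
MemorySwapMax=64M
```

Apply with `systemctl daemon-reload` and a restart of the unit; `systemd-cgtop` shows the resulting per-unit usage.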
These designs always seem so complex... And one overlooked feature of any API could totally break the sandbox.
Whereas a simple 'we run everything in a VM' seems much simpler and less fragile.
'We run this process in a VM-like mode where Linux syscalls aren't allowed but instead we define a new syscall-like interface which goes to privileged host code' seems like a good compromise. But in this case, that host code should have special abilities to mmap files into the address space of the 'VM' to make IO fast and efficient.
One way to do this would be to use undefined instruction traps to enter a debugger, which could then implement a syscall-like API. That would make it portable to any OS, yet ultra fast.
This article is not very good at explaining what it is they are actually describing. Is directfs just a way to access the host's local fs? If so, then my understanding is that they used to access the local fs over RPC (horrible overhead) to sandbox it. Now they've replaced the part of the operating system's filesystem API that resolves paths to file descriptors with their tool, so once a file descriptor is obtained, the container can talk directly to the fs.
To me this addresses a very narrow use case, where you have to run untrusted containers on trusted hosts. I imagine the main target users for this are people who want to offer a service like Fargate and run multiple customers on a single host. Why would they want to do that instead of separating customers with VMs? My suspicion is this has something to do with the increasing availability of very energy-efficient ARM servers that have hundreds of cores per socket. My impression is that traditional virtualisation on ARM is rarely used (I'm not sure why, as KVM supports it, and ARM has had hw support since ARMv8.1). So "containers to the rescue".
Personally, I'd much rather the extra security that enables untrusted containers to access the host's fs were implemented in the container runtime, not as a separate component. Or, given the "security issues" it addresses, perhaps even in the host's operating system?
> Personally, I'd much rather the extra security that enables untrusted containers to access the host's fs were implemented in the container runtime, not as a separate component. Or, given the "security issues" it addresses, perhaps even in the host's operating system?
Isn’t that exactly what the original gofer/RPC solution is? The gvisor container runtime operates in userland to ensure that compromises in the runtime don’t result in an immediate compromise of the system kernel.
But running in userland and intercepting syscalls that do IO always has significant performance implications, because all your IO now needs multiple copy operations to get into the reading process's address space: userland processes generally can't directly interact with each other's address spaces (to ensure process isolation) without asking the kernel to come in and do all the heavy lifting.
So if you want fast local IO, you have to find a way to allow the untrusted processes in the container to make direct syscalls, so that you can avoid all the additional high-latency hops in userland and let the kernel directly copy data into the target process's address space.
For the container runtime to magically provide direct host fs access itself, with native-level performance, the runtime would have to operate as part of the kernel. Which is exactly how normal containers work, comes with a whole load of security risks, and is ultimately the reason gVisor exists.
I still don’t know why Google has gvisor and AWS has firecracker. Isn’t the firecracker approach strictly better than Google’s approach?
Firecracker is hardware-based virtualization. gVisor is not virtualization at all but more like advanced sandboxing: it intercepts syscalls and proxies them on processes' behalf. That means gVisor is slower on I/O (which this new feature is trying to address), but it also means it's easier to implement and operate, and you can run it in more environments (for example in VMs where nested virtualization is not supported).
What are the reasons these days to not enable nested virtualization? I know AWS doesn’t.
Afaik their hardware just didn't support it; not sure why it's still not supported in this day and age.
Performance used to be a problem with nested virt but afaik both hw and software have caught up
If you want to join us in the peanut gallery, AWS originally "adapted" Google's crosvm for firecracker.
gVisor, if not using hw-backed virtualization, has absolutely horrendous performance because of, amongst other things, ptrace, which is one reason why this blogpost exists.
Note that ptrace is only one platform and it’s no longer even the default. It’s been replaced by systrap. When running on bare metal, the KVM platform provides the best performance: https://gvisor.dev/docs/architecture_guide/platforms/
Outside of the peanut gallery we just roll our own VMM; VMX and friends are well established at this point, so why settle for a hacky impl.?
This sort of feels like seeing someone riding a bike and saying: why don’t they just get a car? The simple fact is that containers and VMs are quite different.
I’m responding to what I believe is the intent of the comment, but I will also point out that on a literal level it doesn’t make sense. Whether something uses VMX and friends or not is a red herring, as gVisor also “rolls its own” VMM [1] and certainly makes use of VMX and friends.
[1] https://github.com/google/gvisor/tree/master/pkg/sentry/plat...
Apologies, my reply was aimed at Firecracker; I appreciate that gVisor is a sandbox solution/KVM shim rather than a true VMM.
Ah, now it makes more sense to me. Thank you for the clarification.
Firecracker may be better but it's irrelevant if I cannot use it in my environment.
In particular firecracker runs on bare metal or VMs that support nested virtualization, which unfortunately is not widely available in the clouds (and bare metal is expensive)
Firecracker is good and all, but if one wants to use it, one has to change one's ecosystem and its communication with other servers. Why change your entire ecosystem for one tool rather than build a tool to fit your ecosystem? I really like the concept of firecracker-containerd, but it still needs some modifications. Also, I wouldn't expect Google to put their entire Cloud Run and App Engine in the hands of AWS (even though it's FOSS).
Firecracker does not work with long-running processes. It's only good for function-as-a-service / serverless stuff.
What is your definition of long running processes?
AWS Fargate (containers as a service for ECS/EKS) uses Firecracker under the hood, and you can easily have the container up for weeks, and probably even for months.
Similarly, Fly.io also uses Firecracker, and again, you can have weeks/months long uptime on containers.
This is a step back.
The reason to have this in a separate process is so it can be audited "to death" because the code base is small.
gvisor itself is so big that doing an exhaustive audit is out of the question. Google has mostly switched to fuzzing because the code bases have all become too bloated to audit them properly.
The reason you have gvisor is to contain something you consider dangerous. If that contained code managed to break out and take over gvisor, it is still contained in the kernel level namespaces and still cannot open files unless the broker process agrees. That process better be as small as possible then, so we can trust it to not be compromisable from gvisor.
EDIT: Hmm looks like they aren't removing the broker process, just "reducing round-trips". Never mind then. That reduces the security cost to you not being able to take write access away at run time to a file that was already opened for writing.
The reason you can focus auditing on the second process is because you have a security architecture that enables that. Of course the security mechanisms you’re relying on there need to be exercised and occasionally fall apart too (meltdown, MDS, etc.).
Process isolation is not the only tool that you have to build a secure architecture. In this case, capabilities are still being limited by available FDs in the first process (as well as seccomp and the aforementioned namespacing and file system controls), and access to FDs is still mediated by the second process. There is no such thing as “being able to take access away … to a file that was already opened”, as this is simply not part of the threat model or security model being provided. You still need to be diligent about these security mechanisms as well.
The idea that Google has given up and just does fuzzing is nonsense. Fuzzing is a great tool, and has become more common and standardized — that’s all. It is being added to the full suite of tools.
As I understand it, the new model is that the process gets an opened fd passed by the broker and can then read and write to it as fd permissions allow.
The old model, however, was that read and write were translated to RPC calls to the broker. In that model you can take write access away even after you have given it to a process, because you have not actually given it. All writes still go through the broker process.
> The old model, however, was that read and write were translated to RPC calls to the broker.
In the old model, reads/writes were not translated to RPCs. Only for regular files, the broker was donating FDs to the sentry (userspace kernel) and the sentry was allowed to perform read(2)/write(2) directly. This was done as a performance optimization long back.
What is different with directfs is that now the broker additionally donates FDs for other types of files as well (directories, sockets, etc.) and the sandbox is allowed to operate on those FDs with more syscalls like mkdirat, symlinkat, etc. This drastically increases the independence of the sandbox in performing filesystem operations, so it does not need to invoke the broker via RPCs.
As described, the sentry is still constrained to operating on only the container filesystem via namespaces and other Linux security primitives.
I'm new to this kernel space, but aren't write operations more of a security risk than reads? If so, why not break the gofer into 2 categories, one for writes and one for reads, and embed the read one in the sentry's userspace? This may not show any significant performance gain in real-world use, but it gets both benefits.
> aren't write operations more of a security risk than reads
I think, in the context of security, this is like asking if it's worse to die by a car or die by a bus.
Lol at least one is recoverable
Security exists because of the meaning of the bits. If those bits represent credentials to your bank account, then "recoverable" hits different.
When you think of security you gotta think of Confidentiality, Integrity and Availability.
If you make reads less secure than writes, then you'd be weakening the Confidentiality aspect.
One would only need to read your password via some unsecured hole, once.
The rest of the identity theft and pillaging your accounts would require no security weaknesses, just things working correctly in presence of legitimate credentials.
The risk here is that there's a bug in the kernel that enables DoS / local code execution by the caller. Also, as others pointed out, reads can be equally harmful if you read SSH private keys and whatnot.
What is directfs? The linked webpage doesn't say
The gVisor sandbox doesn't provide direct access to the local file system of the host machine. It routes file requests over RPC to the outside Gofer server running on the host machine. The Gofer server reads the files on the host machine and ships the data back to the sandbox over RPC. This setup is understandably slow.
Linux allows one process to send an opened file descriptor to another process over a domain socket with the SCM_RIGHTS message [1]. The directfs setup basically lets the Gofer process open a file on the host machine and ship the file descriptor to the sandbox process. The sandbox can then read and write directly on the local file system using the file descriptor.
How the heck can this be securely isolated? Well, via the magic of the pivot_root and umount Linux system calls. First, Gofer only sends file descriptors for files the sandbox is permitted to access, like the files under /sandbox/foobar/. Second, the Gofer process does a pivot_root to change its own file system root "/" to "/sandbox/foobar/". It then umounts its old "/" to make it completely inaccessible to any opened file descriptors. This prevents someone from using an opened file descriptor to change directory to ../.., ../../etc/passwd, or anywhere in the old root's directories.
I believe this is how it works, based on the reading of the blog post.
I found this [1]
"We recently landed support for directfs feature in runsc. This is a filesystem optimization feature. It enables the sandbox to access the container filesystem directly (without having to go through the gofer). This should improve performance for filesystem heavy workloads.
You can enable this feature by adding `--directfs` flag to the runtime configuration. The runtime configuration is in `/etc/docker/daemon.json` if you are using Docker. This feature is also supported properly on k8s.
We are looking for early adopters of this feature. You can file bugs or send feedback using this link. We look forward to hearing from you!
NOTE: This is completely orthogonal to the "Root Filesystem Overlay Feature" introduced earlier. You can stack these optimizations together for max performance."
[1] https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE/m/pqI...
I think it's a gVisor-specific concept. The page says:
> Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner.
So, it's likely this is not a filesystem, but just an implementation detail.
Yes, it's a gVisor feature. They basically utilize the SCM_RIGHTS[0] Linux API to open files from the gofer process outside of the sandbox and then pass the opened fds into the sandbox.
Not to be confused with DirectStorage, which is a DirectX API that lets the video card load textures from NVME SSD local storage more efficiently.
I was expecting something about GPUs as well.
IMO it doesn’t make much sense to call things that run on the CPU “direct.” Direct access to resources is the assumption if you are running on the CPU, right?
"Direct" here is more analogous to the Direct as in DirectX and Direct3D.
directfs has nothing to do with DirectX.
I find it deeply ironic this needs to be said here.
As the other commenter wrote, I was replying directly to the parent and was referring to DirectStorage. I use DirectX apis daily haha
DirectStorage does, though.
Ah... Okay, I think I see how the comment should have been read now...?
I will blame whoever named directfs for using a confounding name one way or the other. :V
Author of the blog here.
Point taken. In retrospect, directfs is not a good name, as it gives the impression of a new filesystem implementation. I should have named it more along the lines of "direct access mode", as some of you have pointed out.
Thanks for the feedback.
I think the comment which was in response to mine is pretty ambiguous, it isn’t obvious to me what they mean by “here” (in the article, in the comment I responded to, or in my comment?).
When will gVisor be able to run processes in a Secure Enclave?
I thought gVisor was DOA. I guess this post confirms it.