Abusing Privileged and Unprivileged Linux Containers

101 points by gtank 10 years ago · 53 comments


rwmj 10 years ago

I think what Intel are doing with Clear Containers is really interesting. They are encapsulating containers inside VMs, avoiding the security problems of containers.

To do this efficiently they've had to make a bunch of changes on the VM side so the overhead is much smaller than an ordinary VM (of the order of 150ms and 20MB of RAM). I've also been looking at this and am hoping to give a talk about it at the KVM Forum in August (http://events.linuxfoundation.org/events/kvm-forum).

colemickens 10 years ago

Windows Containers with Hyper-V isolation are similar. https://msdn.microsoft.com/en-us/virtualization/windowsconta...
sillysaurus3 10 years ago

150ms for which operation?
- philips 10 years ago
  
  150ms to boot the VM that is running the container processes with the stripped down Linux Kernel, lkvm, and DAX. This is about what we observe in the rkt Clear Containers "stage1": https://coreos.com/blog/rkt-0.8-with-new-vm-support/
  One note: you can run multiple "container processes", like redis and a redis dashboard, inside of these Clear Container VMs. This means in the case of Kubernetes we will only incur the cost of the startup time and Kernel/init overhead once per pod instead of once per process.
  - sillysaurus3 10 years ago
    
    This is going to sound amusing at best, but would you clarify what it means to boot?
    To justify the question a bit: booting traditionally meant physically turning a system on. The boot time included BIOS initialization, a concept now blurred by the advent of virtualization.
    150ms is such an absurdly short amount of time that I'm left wondering what booting is in this context.
    
    wmf 10 years ago
    
    VMs usually do have BIOS (sometimes you can see it flicker on the screen) but like NeutronBoy said, the hypervisor just creates the virtual hardware devices in a pre-initialized state so the BIOS has to do almost no work and it completes in a fraction of a second. Clear Containers boots even faster by not using BIOS; the hypervisor directly loads the kernel and initrd into RAM. So in this case "booting" means starting the kernel, mounting the root filesystem (accelerated using DAX), running init, starting dockerd, etc.
    
    rwmj 10 years ago
    
    Actually Clear Containers does have a BIOS, because Linux requires one eg to read E820 data and to set up the virtual video. However the one Clear Containers uses (from kvmtool) is extremely minimal -- it's literally enough of a BIOS just to answer the int calls that modern Linux makes at boot and that's all. IIRC it's hundreds of lines of code only.
    
    mikewhy 10 years ago
    
    > 150ms is such an absurdly short amount of time that I'm left wondering what booting is in this context.
    Clear Linux was announced about a year ago, and it does boot absurdly quickly
    https://clearlinux.org
    https://lwn.net/Articles/644675/
    
    philips 10 years ago
    
    It does this thanks to a technology called DAX and the fact that systemd boots really fast.
    https://www.kernel.org/doc/Documentation/filesystems/dax.txt
    
    rwmj 10 years ago
    
    DAX is a small part of it, but Intel made many changes throughout the stack, mostly to the Linux kernel.
    
    NeutronBoy 10 years ago
    
    > This is going to sound amusing at best, but would you clarify what it means to boot?
    I imagine that most of the time would be in mocking some/all of the hardware interfaces to present to the VM, and running your init processes (and all that entails for whatever OS you're running).
    
    e12e 10 years ago
    
    For perspective, I just timed (not very well) how long it takes Windows to run the C program "exit": "int main() { return 0; }", compiled with gcc 4.8.1, -O3 -std=c11 -Wall, stripped[1]. From a warm disc cache it takes ~3ms. From cold(er) it takes ~19s.
    Taking a 50x hit to run "exit" from a container doesn't sound great, but it doesn't sound all that far fetched either.
    [1] time util from pstools, as installed by scoop.sh - same reason I used gcc (not e.g. msvc): it's all in my path atm, no work needed :)
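A rough way to repeat this kind of measurement is sketched below. It is not the method used above: it assumes a Python child process that does nothing as a stand-in for the C "exit" program, so the absolute numbers will be much larger than for a stripped native binary. The function name `time_noop_process` is made up for illustration.

```python
import subprocess
import sys
import time


def time_noop_process(runs: int = 5) -> float:
    """Return the best wall-clock time, in milliseconds, needed to start
    and reap a child process that exits immediately."""
    # Assumption: a Python child running "pass" stands in for the C no-op.
    cmd = [sys.executable, "-c", "pass"]
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


if __name__ == "__main__":
    print(f"best of 5 runs: {time_noop_process():.1f} ms")
```

Taking the best of several runs approximates the warm-cache case; the first run after a reboot is closer to the cold one.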
    
    e12e 10 years ago
    
    ~19ms, obviously, not 19 seconds.
geggam 10 years ago

Because running a process in a VM doesn't meet what requirement?
- rwmj 10 years ago
  
  Your question is confusing, what does it mean?
  If you meant "running a process in a VM meets what requirement?", the requirement is security, which as the paper here shows is not available with simple containers running as processes on the host.

subway 10 years ago

Containerization in Linux is fugly. There is no core concept of containers in the kernel, you just have a set of loosely integrated namespaces abused by the likes of lx[cd] and docker.

gshulegaard 10 years ago

I don't share your opinion. The kernel exposes a collection of primitives (including but not limited to: cgroups, namespaces, and copy-on-write storage[1]) which can be used to create isolated sandboxes. The kernel itself doesn't bind the primitives together because I believe Linus would consider that "user space"... and I would agree.
Instead this is left up to other tools like LXC. Also note that higher-level features such as network support are also left up to the higher-level tool.
Docker and LXC have core differences in their vision of what a container should be [2]. Also, Docker used to be based on LXC, but has since written its own library, libcontainer, which handles the interaction with the kernel primitives.
To me, Docker's philosophy and libcontainer implementation are... as you say, fugly, but LXC's approach and implementation are not.
I also don't think of the kernel exposing primitives and letting user space tools bind them together as inherently bad. I actually prefer it this way and think it leaves the kernel cleaner/leaner/better off.
[1] http://www.slideshare.net/jpetazzo/anatomy-of-a-container-na...
[2] https://www.flockport.com/lxc-vs-docker/
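To make the "collection of primitives" point concrete, here is a small sketch that lists the namespaces the current process belongs to by reading the symlinks under `/proc/self/ns`. The helper names `parse_ns_link` and `current_namespaces` are made up for this example; the only thing assumed about the kernel is the documented `type:[inode]` link format. Two processes that share a namespace see the same inode number for it.

```python
import os
import re
from typing import Dict, Tuple

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def current_namespaces(ns_dir: str = "/proc/self/ns") -> Dict[str, int]:
    """Map each entry under ns_dir to the inode identifying its namespace.
    Returns an empty dict if the directory is missing (e.g. not on Linux)."""
    result: Dict[str, int] = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            target = os.readlink(os.path.join(ns_dir, name))
            _, inode = parse_ns_link(target)
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry; skip it
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in current_namespaces().items():
        print(f"{name:20s} {inode}")
```

Running this on the host and again inside a container, then comparing inode numbers, shows which namespaces a given tool actually unshared.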
- tptacek 10 years ago
  
  The original design wasn't intended to provide that kind of isolation, and the primitives that are exposed are retrofit; every new containerization design needs an audit that captures the entire exposed functionality of the Linux kernel.
  You can just skim this paper to see the problems: non-namespaced identifiers leak in procfs, UID "slides" expose containers to each other's resource limits, there are non-namespaced non-containerized kernel functions exposed to root inside of containers, and so on.
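As an illustration of how cheap it is to harvest identifiers from procfs, the sketch below walks whatever procfs is mounted and collects PIDs and command names. It does not reproduce any specific finding from the paper; `visible_processes` is a made-up helper, and what it returns depends entirely on which PID namespace the mounted procfs belongs to.

```python
import os
from typing import Dict


def visible_processes(proc_root: str = "/proc") -> Dict[int, str]:
    """Return {pid: command name} for every numeric directory under proc_root."""
    found: Dict[int, str] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                found[int(entry)] = handle.read().strip()
        except OSError:
            continue  # process exited in the meantime, or file is unreadable
    return found


if __name__ == "__main__":
    for pid, name in sorted(visible_processes().items()):
        print(pid, name)
```

If the output inside a container includes processes that do not belong to it, the procfs it sees is not scoped to the container's own PID namespace.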
  - gshulegaard 10 years ago
    
    That's interesting... it was my impression that some of the kernel features were added specifically as a result of the kernel patches that were originally part of the OpenVZ project. Once the kernel adopted official primitives the original OpenVZ patches were deprecated. It was also at this time that LXC started with some of the same developers from the OpenVZ project.
    I could be wrong...but that path dependency seems to indicate that while they were implemented as more general kernel features...one of their motivating use cases was container isolation.
    Can anyone more informed clarify the history for me?
    
    tptacek 10 years ago
    
    I'm not evaluating the container features in isolation. Considered by themselves, they might be perfectly coherent. The problem is that every feature of the kernel with a namespace of any sort needs to be aware of those container features, and namespaces leak into each other unexpectedly, because most of them are very old and were implemented long before anyone considered containerization.
    
    duskwuff 10 years ago
    
    To the best of my knowledge, the container features in the vanilla kernel today (cgroups, as used by LXC, docker, etc) originated at Google, where they were used more for resource allocation than for containerization per se. The kernel patches developed by Virtuozzo/Parallels for OpenVZ were never upstreamed, and were considerably different in design from cgroups.
    
    cyphar 10 years ago
    
    They're talking about namespaces. Cgroups are not an isolation mechanism, and there have been significant rewrites of the core since Google worked on them. Most of the namespace work came from Odin (Parallels) as well as Virtuozzo and others.
zxcvcxz 10 years ago

Fugly? Compared to what alternatives? The offerings from Microsoft are even fuglier.
- jclulow 10 years ago
  
  In illumos, a descendant of (Open)Solaris, we have a first class container primitive called "zones". In SmartOS, the Joyent-backed distribution of illumos, we also have support for running an entire Linux userland (e.g. Ubuntu or CentOS) in this substrate.
  You can have the best of both worlds: a secure container substrate, designed from the ground up as a coherent whole like Jails; and the vast packaging ecosystem provided by Ubuntu.
- voidz 10 years ago
  
  FreeBSD jails, I would think.
  - subway 10 years ago
    
    See also: Solaris zones.
    
    hinkley 10 years ago
    
    Yeah when that giant Oracle boat finally turns we are all in for it.
    
    cyphar 10 years ago
    
    illumos Zones then.
  - Scramblejams 10 years ago
    
    Unless I've missed something (and I may have!), FreeBSD's jails have a very respectable security track record. Really, really want to make use of them.
    I can't give up Debian's package system, though, so I'm left hoping that kFreeBSD will amount to something someday and I use Xen or KVM in the meantime... :-(
    
    Freaky 10 years ago
    
    > I can't give up Debian's package system, though
    Why not? What would you miss from it?
    
    voltagex_ 10 years ago
    
    I run Debian Testing and FreeBSD 10. I haven't found too much from Debian that I can't get in FreeBSD 10. I could even run a Debian/kFreeBSD jail if I really wanted to.
    What really does my head in is that a default Debian install can pull down 2 megabytes a second from a server over SFTP, and a default FreeBSD 10 server can only do ~800 kilobytes per second (FreeBSD 9 was worse).
    
    Freaky 10 years ago
    
    > What really does my head in is that a default Debian install can pull down 2 megabytes a second from a server over SFTP, and a default FreeBSD 10 server can only do ~800 kilobytes per second
    Shouldn't be that much of a difference. You might try OpenSSH from ports, maybe the HPN patches will help if you're on a high latency connection.
  - zxcvcxz 10 years ago
    
    But do enterprise companies use FreeBSD jails regularly? AFAIK they're basically used as toys by developers.
    
    2trill2spill 10 years ago
    
    Jails are used on the PlayStation 4, and with some 36 million PS4s sold so far that's a huge use of jails in production. Here's a quote from an article talking about it:
    "We can prove the existence of FreeBSD jails being actively used in the PS4's kernel through the auditon system call being impossible to execute within a jailed environment"
    This quote is from: https://cturt.github.io/ps4.html
    
    educar 10 years ago
    
    It's obvious that your parent was talking about servers in production. Isn't this entire thread about that?

gtankOP 10 years ago

URL mistake. Direct link: https://www.nccgroup.trust/globalassets/our-research/us/whit...

mytummyhertz 10 years ago

author here. hi world!

ak217 10 years ago

Hi Jesse! We got an audit from you last year (I was the one who pushed to get you to specifically look at our PaaS containerization). As a result I spent a while wrangling with Ubuntu LXC unprivileged containers, and now I know more about cgmanager than I wanted to.
Glad to see you've added a lot of detail to your research. It's very necessary!
- mytummyhertz 10 years ago
  
  :D
vishvananda 10 years ago

Very thought-provoking whitepaper. As someone who has been working on securing containers for the past year or so, it gave me some additional avenues to pursue.
- mytummyhertz 10 years ago
  
  awesome! glad to hear :D
sbierwagen 10 years ago

.
- mytummyhertz 10 years ago
  
  that's a weird way to spell hilarious
  edit: (original comment this was in reply to said 'that's an unfortunate username')

zxcvcxz 10 years ago

>As such, it discloses the names and PIDs of all processes running on the system...

So I don't really see how this is considered a big vulnerability, unless the goal is security by obscurity, but then we could go even further and obfuscate the whole system.

>NET_RAW abuse

Hard to blame LXC/Docker for something that has to do with the configuration of the bridge, plus for some setups this is desired functionality.

>DoS

Some of these are interesting but I don't see how filling up the disk space is a problem with containers and not operating systems in general, and I feel like a lot of these DoS attacks are all just basic OS limitations but I don't know enough to make an informed statement.

Sanddancer 10 years ago

Security by obscurity is relying purely on nondisclosure of information. Minimizing information leakage is sound practice. PIDs, names, etc, can give a lot of information as to the configuration of the app running in your container -- how often external processes are run, potentially vulnerable software that you may be using in utilities, such as an old version of imagemagick, etc. While there's no substitute for keeping your system up to date, frustrating an attacker's ability to get information on your system is also pretty standard practice.
Regarding NET_RAW, this is a case where you want reasonable defaults. Needing raw sockets is an exceptional condition for most container setups, and again, gives a greater threat exposure. Even ignoring the potential for things like ARP spoofing, filling up a MAC table on a lot of switches makes them fail over into being essentially rackmount hubs, which can allow for even greater amounts of service denial and information leakage.
Filling up disk space is an area that is problematic with Linux-based containers because, in order to keep a process gone awry or a malicious process from using up all disk space, you have to do things like set up fixed-size loopback filesystems ahead of time, which impose performance and space constraints that make your containers less flexible than containers under Solaris zones, for example. Under ZFS, you can directly configure a container to only be able to use x amount of space, without needing to set up loopback devices or other complexities. This allows you to set up limits, but at the same time, means that if a dataset needs it, you just need to run a single command to give it more space.
Yes, a lot of these issues can be easily mitigated, however, they're all symptoms of poor defaults. A good container system should help manage and mitigate these sorts of issues, so they only need to be thought of once, instead of by everyone implementing them.
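On the NET_RAW point, one way to check whether a process inside a container actually holds the capability is to read the effective capability mask from `/proc/self/status`. The sketch below assumes the standard `CapEff:` hex field and the bit index 13 for `CAP_NET_RAW` from `linux/capability.h`; the helper names are made up for illustration.

```python
from typing import Optional

CAP_NET_RAW = 13  # bit index of CAP_NET_RAW in linux/capability.h


def has_capability(cap_mask_hex: str, cap_bit: int) -> bool:
    """Return True if the given bit is set in a hex capability mask."""
    return bool((int(cap_mask_hex, 16) >> cap_bit) & 1)


def effective_caps(status_path: str = "/proc/self/status") -> Optional[str]:
    """Return the CapEff hex string from a status file, or None if the
    file or the field is unavailable (e.g. not running on Linux)."""
    try:
        with open(status_path) as handle:
            for line in handle:
                if line.startswith("CapEff:"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        return None
    return None


if __name__ == "__main__":
    mask = effective_caps()
    if mask is None:
        print("could not read the effective capability mask")
    else:
        print("CAP_NET_RAW held:", has_capability(mask, CAP_NET_RAW))
```

Running it inside a container before and after dropping the capability in the runtime's configuration is a quick way to confirm the default actually changed.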
- justincormack 10 years ago
  
  Raw sockets are there for ping. Hopefully we can remove this as distros switch to ICMP sockets finally.

ck2 10 years ago

Oh lovely, it contains PoC

Hope there was previous disclosure.

mytummyhertz 10 years ago

yup, everything here was disclosed previously

X86BSD 10 years ago

Linux does not have containers. It has namespaces and cgroups. Jails (FreeBSD) and zones (illumos) are containers. Please, stop claiming containers exist on Linux.

cyphar 10 years ago

It's like Linux capabilities. The kernel community has a very odd view of history and generally refuses to learn from others' efforts.
