Why use a chroot(2) on NetBSD? The most basic
reason, for me, is I just like working with NetBSD. It's
comfortable. I may write a post going into more detail why I
feel that way, but we'll leave this at, "I just think it's neat"
for now.
There's also the question of how much isolation you actually need. To me, the main advantages of containers, jails or zones are the existing tooling for deployment pipelines and having not-quite-VMs to play with, rather than security. And those are primarily advantages for larger teams and systems. I neither need nor want any of that complexity for my own little webserver. Ansible and rsync are my deployment tools. The server software is written in a memory-safe language, which should cut down on RCE exploits. And if someone does get in, I want them to find themselves in an annoying box of limited utility, but it doesn't need to be maximally isolated.
For a lot of small, self-hosted infrastructure, big complicated orchestration systems will be harder to work with.
Another reason is to just learn what chroot(2)
is, as it's an evolutionary step in the development of process
isolation tools. Illumos zones, FreeBSD jails and Linux
containers are all technologies that are a take on "how do we
push isolation further than chroot(2)?" Which
means chroot(2) is a good starting point for
building a mental model of the next generation's tools.
What is process isolation?
The stability and security of Unix operating systems is built on process isolation. The zeroth level is the robust memory isolation that every process gets. No process can see directly into another process' memory space. This is the thing that allows us to have any kind of security. It also allows for stability, because one errant program cannot corrupt the data of everything else on the system (as it could on classic Mac OS and MS-DOS, where that caused a lot of system crashes).
The next layer is the user metaphor in a Unix OS. If we run a process as an unprivileged user, we can limit the process' access to the filesystem. Two processes running as two different, unprivileged users can be kept from accessing each other's files.
The isolation provided by separate users is tied to the filesystem because the filesystem is the basic, unifying user interface metaphor for a Unix OS. This is what "everything is a file" means. The OS kernel presents the filesystem to both organize and expose data stored on disk, but also to expose hardware interfaces. The simplest granularity for designating access to files in the filesystem is users, followed by groups of users. Because hardware is represented as files, the same access control mechanism works for device driver interfaces and data.
The isolation this provides is a lot, and possibly as much as you need. The unprivileged user can be cut off from hardware access and limited in where on the filesystem it can write files, if anywhere at all. Hopefully, sensitive files will be unreadable.
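A tiny, hypothetical illustration of those permission bits in action (the file path is made up):

```shell
# Hypothetical demo: a file only its owning user can read or write.
f=/tmp/perm-demo.txt
echo "secret" > "$f"
chmod 600 "$f"
# The first column of ls -l now reads -rw-------: read/write for the
# owner, nothing for group or other. A process running as any other
# unprivileged user gets "permission denied" trying to open it.
ls -l "$f" | cut -c1-10
```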
However: this can be taken a step further. The chroot(2)
system call and its related wrapper command chroot(8),
can be used to limit the view a process has on the filesystem.
chroot should be read as short for change root
directory. This means that sensitive files and device files can
be kept out of a chroot, which can limit attack vectors if a
process running in that chroot is compromised.
If there are no device files, then there is no accessing hardware from inside the chroot. If password hashes and private keys are outside the chroot, they cannot be read. If there are no set UID binaries in the chroot, flaws in those cannot be used for privilege escalation. If there are no compilers or interpreters, those cannot be leveraged. Which means an attacker will find the characters "../" to be of no use.
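A minimal sketch of the idea, with a made-up jail path (chroot(8) itself has to run as root): the jail is just a directory, and anything not placed under it simply does not exist for the confined process.

```shell
# Hypothetical jail directory; running chroot(8) requires root.
JAIL=/tmp/demo-jail
mkdir -p "$JAIL/bin" "$JAIL/lib"
# After copying in a shell and its libraries, root could run:
#   chroot "$JAIL" /bin/sh
# Inside, "/" is $JAIL: there is no /etc, no /dev, and "../" from
# the new root just leads back to the same directory.
ls "$JAIL"
```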
This does have limits. Running ps ax in a
chroot still shows the complete process table for the whole
system. A process in a chroot uses the same network interfaces
as the rest of the system. Only the filesystem view is
limited.
My genchroot.sh Script
This is the core of building the chroot on the filesystem. The chroot is just a directory, so the script expects a directory as the first parameter. It then builds a not-quite minimal root directory structure inside that supports running binaries generated by the go compiler. There are a number of extras in there that I added for debugging while polishing the process. Those can be removed, but they're pretty harmless. I'll note them.
The script was written assuming it will be run as root. I feel like that's cleaner than scattering doas invocations everywhere. It also allows me to have ansible run it as root.
I'm going to break up the script with commentary, so let's get started. All scripts are available in copy-pastable blocks on this page.
#!/bin/ksh
Yes, I use pdksh. I primarily run NetBSD and OpenBSD depending on hardware support. It's guaranteed to be there. But really, it's because David Korn got his book signed by KoЯn. @prahou did warn me it would turn into a whole lifestyle, and here I am, not using bash anymore.
It's a good shell. Maybe try it.
CHROOT_BASE=$1

if [ -z "$CHROOT_BASE" ]
then
	echo "Missing chroot basedir!"
	exit 1
fi

if [ ! -d $CHROOT_BASE ]
then
	mkdir -p $CHROOT_BASE
fi

mkdir -p $CHROOT_BASE/bin
mkdir -p $CHROOT_BASE/lib
mkdir -p $CHROOT_BASE/run
mkdir -p $CHROOT_BASE/dev
mkdir -p $CHROOT_BASE/etc
mkdir -p $CHROOT_BASE/usr/libexec

cp /bin/ksh $CHROOT_BASE/bin/ksh
This is some straightforward boilerplate to set up the skeleton directory structure.
The one thing missing here is permissions on
run/ and possibly adding a tmp/
directory. I ended up fixing the run/ directory
ownership in my ansible scripts, which allows me to tailor it to
use mode 0755 and be owned by the unprivileged user
that the service runs as.
The other option worth considering is using the sticky
bit via mode 1777 on the run/
directory. This is what you would use on a tmp/
directory. It means that users can only delete or rename files
they own in the directory, which allows the directory to have
wide-open (777) permissions while preventing users from
messing with each other's files; see the
sticky(7) manpage.
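For example, a shared directory with the sticky bit might be set up like this (the path is just for illustration):

```shell
# Hypothetical world-writable directory with the sticky bit set.
d=/tmp/sticky-demo
mkdir -p "$d"
chmod 1777 "$d"
# The trailing "t" in the mode marks the sticky bit: anyone may create
# files here, but only a file's owner (or root) can delete or rename it.
ls -ld "$d" | cut -c1-10
```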
This next for loop is trickier and more complicated than I'd like. I will be able to get rid of the i386 stuff with NetBSD 11, as the 32-bit compatibility libs are getting broken out into a separate set.
I used readelf -d path/to/binary to get the list
of libraries to add to the chroot. This is also one of the
reasons for doing this with programs written in go: its compiler
links against very few dynamic libraries. This is also why I
use /bin/ksh above. The Almquist shell
(/bin/sh) needs libedit and libterminfo (which then
probably needs the terminfo db).
I didn't use ldd for digging up library dependencies, because on NetBSD 10.1, it doesn't like the ELF headers that go generates. ldd will recurse through the dependencies of a binary, so it will find all of them. You can compare readelf and ldd output to figure out whether you've got multiple layers of dependencies. In that case, you might consider switching to a different isolation system: either tooling that builds more complete chroot environments, or a more advanced isolation mechanism.
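As a sketch of what that inspection looks like (the target binary and the library names in the output will differ between systems):

```shell
# List the shared libraries an ELF binary declares it needs.
# /bin/ls is just an example target; substitute your own binary.
readelf -d /bin/ls | grep NEEDED
# Each (NEEDED) line names a library the chroot's lib/ must provide.
```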
for glob in libc.so libpthread.so libresolv.so
do
REALLIB=$(find /lib /usr/lib -type f -name "${glob}*" \
| grep -v i386 | head -1
)
SDIR=$(dirname $REALLIB)
REALLIB=$(basename $REALLIB)
cp $SDIR/$REALLIB $CHROOT_BASE/lib/$REALLIB
The first find command in the loop uses -type f
to find the real library binary, which is then copied into
the chroot. Again: definitely do NOT hardlink to the external
libraries!
find /lib /usr/lib -type l -name "${glob}*" | \
grep -v i386 | grep -v $REALLIB | \
sed 's%/.*/%%' | sort | uniq | \
while read liblink
do
(cd $CHROOT_BASE/lib; ln -s $REALLIB $liblink)
done
done
The second find command uses -type l to get a
list of the symlinks to a given library that act as aliases (and
which an ELF binary might use to reference a library it is
dynamically linked against). So we'll need to re-create those
symlinks in the chroot.
There are two fun little flourishes here. The sort |
uniq turns the stream into a set rather than a plain
list: any one line can appear only once. The second
flourish is piping into the stdin of a while loop. This is
totally legal in all POSIX shells. It's also a functional (if
potentially a bit slow) workaround for the find ... | xargs
... pattern when there are files with spaces in
their names, which get split by xargs and cause errors. Instead of using
xargs, you run the command on one file at a time inside the while
loop, where the variable holding the filename can be guarded with
double-quotes. The fork(2)/exec(2)
overhead can build up if you are doing this on a lot of files,
which is why it can be slow.
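The same pattern in isolation, with a deliberately space-laden filename to show why it beats xargs here (the demo directory is made up):

```shell
# Hypothetical demo of piping find into a while-read loop.
d=/tmp/while-demo
mkdir -p "$d/src" "$d/out"
touch "$d/src/plain" "$d/src/has space"
find "$d/src" -type f | sort | uniq | while read f
do
    # Quoting "$f" keeps "has space" as one argument, where
    # find | xargs cp would have split it into two words.
    cp "$f" "$d/out/"
done
ls "$d/out"
```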
mkdir -p $CHROOT_BASE/libexec
cp /libexec/ld.elf_so $CHROOT_BASE/libexec/ld.elf_so
(cd $CHROOT_BASE/usr/libexec/; ln -s ../../libexec/ld.elf_so)
We need the ELF loader. The system uses this code to
exec(2) new processes. It has to be copied, and
MUST NOT be hardlinked; hardlinking it is possibly even more
dangerous than hardlinking libc, because everything uses it.
The question "what is the ELF loader?" is too much to get into here, but I would highly encourage readers to spend some time in that rabbit hole. It's a key piece of the OS and ties together how the compiler and the linker build executable binaries, how those binaries share code and how processes start up.
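One quick way to peek into that rabbit hole (the interpreter path in the output is OS-specific: /libexec/ld.elf_so on NetBSD, something like /lib64/ld-linux-x86-64.so.2 on a common Linux system):

```shell
# Show the "program interpreter" an ELF binary asks for at exec(2) time.
# /bin/ls is just an example; any dynamically linked binary works.
readelf -l /bin/ls | grep interpreter
```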
cp /usr/bin/env $CHROOT_BASE/bin/env
cp /usr/bin/id $CHROOT_BASE/bin/id
These are all optional; I added them in for debugging. They should be mostly harmless, and are safe to remove.
Previous versions of the script also included the rescue binary. NetBSD ships a statically linked rescue binary that's basically its version of busybox. Its main use is to make sure you have all the tools you might need to repair a very broken system in single user mode without libc present. That turns out to also be a very handy set of properties for utilities in a chroot. You only need one file copied in, plus links set up so it can be called by different names.
Because it's a single binary, all of the tools are always present, even if you only create some of the links. Which is why I removed it once I'd gotten everything figured out.
You might copy in some of the base utilities, but you'd be
surprised what you can accomplish with the POSIX shell
built-ins. One such hack is using echo * instead
of ls. Read up on your shell, get creative. (But
also note: these tricks are for deliberately limited
environments; use your utilities in conventional ways in full
environments so others can better understand.)
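For instance, in a chroot with no ls binary, plain pathname expansion still gives you a directory listing (the demo directory is made up):

```shell
# echo and globbing are handled by the shell itself, so this works
# even when there is no ls binary inside the chroot.
d=/tmp/builtin-demo
mkdir -p "$d"
touch "$d/a" "$d/b" "$d/c"
cd "$d"
echo *
```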
if [ ! -c $CHROOT_BASE/dev/null ]
then
	if [ -e $CHROOT_BASE/dev/null ]
	then
		rm $CHROOT_BASE/dev/null
	fi
	/sbin/mknod $CHROOT_BASE/dev/null c 2 2
	chmod 0666 $CHROOT_BASE/dev/null
fi
We need one device file, the bit bucket. It's an important convention on Unix systems, and you'll miss it pretty quickly. It should be the only one you need, though.
cp /etc/resolv.conf $CHROOT_BASE/etc/resolv.conf
Finally, we need to configure which DNS server to use. The machines I wrote this script for are all on static network configurations, so we can just copy the system resolv.conf file. There's no resolvconf(8) funny business nor anything worse complicating the DNS configuration. resolvconf is meant to solve problems for dynamically configured networking, which crops up for desktop users and for cloud servers without static IPs. I still use a lot of static IPs, and I like that NetBSD stays out of my way here.
Running the Service, part 1, rc.d
Another thing that I like about the BSDs is the init system. It's very minimal and just runs a series of shell scripts. Most of the modern niceties for BSD init scripts come from shell script libraries that ship with the OS and handle the boilerplate in all the scripts. NetBSD's /etc/rc.subr library knows how to jump into a chroot to run a service for you, so this script is very short.
The rc.subr(8)
and rc(8) man pages
may be helpful references for the discussion below.
#!/bin/sh
#
# PROVIDE: caddy
# REQUIRE: DAEMON

. /etc/rc.subr

name="caddy"
rcvar="$name"
command="/caddy_start.sh"
name just sets up the name of the service, and
determines the prefix used in variables below.
rcvar is used to match against
/etc/rc.conf to determine if the service should be
active or skipped; see rc.conf(5).
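So, assuming this script is installed as /etc/rc.d/caddy, enabling the service is a single rc.conf(5) line (a config fragment, not a runnable script):

```shell
# /etc/rc.conf
caddy=YES
```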
The path for the command variable is relative to the root of the chroot, not the absolute path. It's what gets run to start the service in the chroot. I'm using a shell script shim, which is covered below.
procname="/bin/caddy"
pidfile="/bits/www/run/caddy.pid"
required_dirs="/bits/www/ /bits/www/data/caddy /bits/www/config/caddy"
required_files="/bits/www/caddyfile /bits/www/bin/caddy"
caddy_chroot="/bits/www"
caddy_chroot needs to be the same value that was
passed to genchroot.sh and used as
CHROOT_BASE.
The caddy configuration files and directories are things that I set up separately via ansible. I'm not going to go into caddy configuration here. All we really need to know is that it needs a data directory, a config directory, the binary itself and a caddyfile, its configuration file.
We tell the rc script about these so it can do a quick sanity check and make sure they all exist before trying to start the service. This is optional.
caddy_user="www"
caddy_group="www"
caddy_groups="www"
This is where we tell the rc script what user and groups to execute command as inside the chroot. If you try to run a webserver or some other service that wants to open a port less than 1024, you'll run into trouble. I've got a post coming soon about recompiling the kernel for this server. Disabling the privileged port check was the deciding factor to go with a custom kernel.
If you are just going to open a high port (like 8080 or 8443), this will work fine. With a stock kernel, and opening 80 and 443, you'll need to run the webserver as root.
# 384MB in kilobytes
caddy_memlimit=393216
caddy_precmd()
{
ulimit -m $caddy_memlimit
}
This is where we can get a bit more isolation with setrlimit(2).
If you have a function called ${name}_precmd in
your rc file, it gets run as root right before the service gets
started. In this case, we use the shell built-in ulimit to set
a maximum memory usage. I've got a small VPS, so I do want a
limit. From personal experience with caddy, 384MB should be
plenty and not cause problems.
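Since ulimit -m takes kilobytes, the value in the comment is just 384 multiplied by 1024, which shell arithmetic can confirm:

```shell
# 384 MB expressed in kilobytes, matching caddy_memlimit above.
echo $((384 * 1024))   # prints 393216
```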
load_rc_config $name
run_rc_command "$1"
This last line is the one that actually runs the service and does all the magic.
The second to last line is left as an exercise for the reader.
Running the Service, part 2, inside the chroot
Inside the chroot, I used a shim shell script to do some minimal environment setup and start caddy. I clear a couple variables with unset, and export some relevant variables for the chroot. This is the main reason for the shim.
The two XDG variables are expected by caddy and are functionally a part of its configuration.
#!/bin/sh

unset SU_FROM
unset LOGNAME
unset ENV

export LD_LIBRARY_PATH=/lib
export PATH=/bin
export USER=www
export HOME=/
export XDG_DATA_HOME=/data
export XDG_CONFIG_HOME=/config

(exec /bin/caddy run -c /caddyfile --pidfile /run/caddy.pid >/dev/null 2>&1) &
Finally, the last line runs caddy. I use a subshell to partially daemonize caddy, as it expects to run under a supervision suite. It's a bit dirty, but it works just fine.
This is how the webserver that delivered this page is started and isolated.
Published: 1771617000