A modern Proxmox Docker architecture with disposable VMs, VirtIO-FS, and ZFS

Published: 2026-06-07, Revised: 2026-06-20

TL;DR The Linux kernel's security model is constantly evolving. In 2026, my Docker-in-LXC nesting became increasingly fragile and needed a replacement. Here I describe a state-of-the-art architecture for Proxmox. The post outlines deploying lightweight VMs via cloud-init linked clones, isolating services in rootless Docker namespaces, and using VirtIO-FS with native VFS idmapped mounts for resource-efficient ZFS storage passthrough.

Motivation

Back in 2021, I documented my approach to running Docker inside unprivileged LXC containers on Proxmox. At that time, my hypervisor was quite resource-restricted (a 2013 Xeon with 32GB of memory). This made unprivileged LXC the obvious choice. It worked well for years (and it continues to work, largely - read on).

The Linux kernel has moved on. With the adoption of cgroup v2, stricter AppArmor profiles, and tighter UserNS restrictions, running Docker inside LXC slowly turns into a battle against the kernel's security model.

In 2023, I wrote about a successor system architecture for running Mastodon in rootless Docker behind an Nginx proxy. I have used this setup extensively for cloud VMs and at work. The concept combines the improved isolation of VMs with the resource efficiency of running individual services within unprivileged, rootless Linux namespaces.

The logic behind this is that not every 20MB application requires the memory overhead of its own dedicated VM. Conversely, mixing unrelated services (like Nextcloud and Immich) into the same Docker daemon is a security anti-pattern. If a dependency chain attack compromises a library in your photo indexing app, that attacker should not also gain read/write access to your entire Nextcloud document repository. Enforcing the principle of separation of concerns becomes even more critical today because of the rise of dependency chain attacks.

What those previous posts did not cover is how to use this new architecture with the hypervisor's storage layer.

If your storage sits directly on the Proxmox Hypervisor (e.g. attached as a JBOD), there is a need for a clean way to share specific paths into isolated VMs. In a homelab, VirtIO-FS bind-mounts are preferable to NFS sharing. NFS is often slow and difficult to configure with user and permission mapping. Imagine, for example, that one system needs read/write access (e.g., Nextcloud handling automatic photo uploads from client phones), while another system strictly needs read-only access (e.g., Immich indexing those same photos), all with custom UID and GID mapping for each service. This is the sweet spot I am exploring with this blog post.

This guide outlines the architecture concept from the ZFS and Hypervisor configuration, up to the application layer. To understand this setup, a grasp of persistent versus ephemeral data is good to have, along with an understanding of how to layer down these components from the hypervisor into a nested rootless Docker environment. If you don't know what I am talking about, you may still find individual parts interesting for copying or best practice. Pick what is relevant to you.

Specifically, I address the following documentation and concept gaps:

Ephemeral storage: ZFS snapshots can easily grow in size when used with Docker. I show how to separate the persistent data from ephemeral, disposable Docker OS Image content found in ~/.local/share/docker. This separation keeps backups small.
VFS ID-mapping: Network file systems can cause UID/GID mismatches and add network overhead. With VirtIO-FS, I use the Linux kernel's Virtual File System to translate the hypervisor's UID to the guest's unprivileged UID. This avoids exposing the host file structure. I utilize the X-mount.idmap fstab option for this. Documentation on this specific implementation is not easy to find. It builds upon the idmapped mounts feature introduced by Christian Brauner in Linux 5.12 ¹ and its later integration with util-linux v2.39 into the standard mount utility ².
Linked clones: I initially considered template creation unnecessary. A manual VM installation only takes about 30 minutes. However, treating VMs as disposable cattle provides a structurally better architecture. It saves disk space, prevents configuration drift, and minimizes human error. And it works perfectly with ZFS cloning. I will show the commands to create a dedicated tooling VM, how to customize a .qcow2 image, and the steps to deploy an IP-agnostic template via Cloud-Init in Proxmox.
Isolation: Combining systemd rootless user namespaces with discrete VirtIO-FS mounts allows running multiple applications on a single VM. These applications remain isolated and cannot access each other's data or docker environments. I will provide some guidance on how to decide when to use Docker inside rootless namespaces versus one dedicated VM for a single service.

Architecture Overview

You may have seen the core proxy concept in my previous posts. We use a standard setup of Nginx on the VM host as a central reverse proxy. It forwards traffic through localhost to individual services, which run in their own rootless Docker namespaces. I consider this the typical economical setup. The resources of a single VM are shared with multiple largely isolated services.

At the base, we have a central ZFS pool that hosts all data. I use a common setup design with 3 hardware pools:

rpool/bpool - my Proxmox bootpool, consisting of a mirror of two small SSD drives
tank_ssd - my VM/service drive, a ZFS mirror of two SSDs. All fast data goes here, including VMs, logs, temporary file folders etc.
tank_hdd - my data drive for the slow and big data. Currently, this is a 6x8TB raidz2.

                                                 [ Web/LAN ]
                                                     |
                                        443 / 80 (Port Forwarding)
                                                     |
+-----------------------------------------------------------------------------------------------------------+
| Proxmox Hypervisor                                 v                                                      |
|                                                                                                           |
|  +-----------------------------------------+               +-----------------------------------------+    |
|  | VM 1: Heavyweight (Single-Tenant)       |               | VM 2: Lightweight (Multi-Tenant)        |    |
|  | Pattern: Rootful Docker                 |               | Pattern: Rootless User Namespaces       |    |
|  |                                         |               |                                         |    |
|  |            [ Nginx / SSL ]              |               |             [ Nginx / SSL ]             |    |
|  |                   |                     |               |                 |      |                |    |
|  |          127.0.0.1:8080                 |               |    127.0.0.1:8081      127.0.0.1:8082   |    |
|  |                   v                     |               |                 v      v                |    |
|  |    +-----------------------------+      |               |     +-------------+  +-------------+    |    |
|  |    | systemd: docker.service     |      |               |     | User:       |  | User:       |    |    |
|  |    | (Daemon runs as root)       |      |               |     | funkwhale   |  | immich      |    |    |
|  |    |                             |      |               |     | UID: 2001   |  | UID: 2002   |    |    |
|  |    | +-------------------------+ |      |               |     |             |  |             |    |    |
|  |    | | Nextcloud Container     | |      |               |     | +---------+ |  | +---------+ |    |    |
|  |    | | App UID: 33 (www-data)  | |      |               |     | | Docker  | |  | | Docker  | |    |    |
|  |    | +-------------------------+ |      |               |     | | App:1000| |  | | App:1000| |    |    |
|  |    +--------------|--------------+      |               |     | +----|----+ |  | +----|----+ |    |    |
|  |                   |                     |               |     +------|------+  +------|------+    |    |
|  |                   |                     |               |            |                |           |    |
|  |       [ /srv/nextcloud/data ]           |               |   [ /srv/media/fw ] [ /srv/media/im ]   |    |
|  +-------------------|---------------------+               +------------|----------------|-----------+    |
|                      | VirtIO-FS                                        | VirtIO-FS      | VirtIO-FS      |
|                      v                                                  v                v                |
| ........................................................................................................  |
| : ID-Mapping Translation Layer (Linux Kernel on Hypervisor)                                            :  |
| :                                                                                                      :  |
| :  X-mount.idmap=b:1005:33:1                  X-mount.idmap=b:1005:2100999:1  ...b:1005:2200999:1      :  |
| :....................|..................................................|................|.............:  |
|                      |                                                  |                |                |
|                      v                                                  v                v                |
|  +---------------------------------------------------------------------------------------------------+    |
|  | ZFS Storage (tank_ssd & tank_hdd)                                                                 |    |
|  |                                                                                                   |    |
|  |   [ /secure ] (Persistent / Encrypted / Backed Up)                                                |    |
|  |     - /media/secure/nextcloud/data (Owned by Hypervisor UID 1005)                                 |    |
|  |     - /media/secure/00_Alex/Music  (Owned by Hypervisor UID 1005)                                 |    |
|  |                                                                                                   |    |
|  |   [ /ephemeral ] (Disposable / Unencrypted / No Backup)                                           |    |
|  |     - /vm-200-disk-1 (/var/lib/docker)                                                            |    |
|  |     - /vm-201-disk-1 (/mnt/ephemeral_docker/funkwhale)                                            |    |
|  +---------------------------------------------------------------------------------------------------+    |
+-----------------------------------------------------------------------------------------------------------+

Now, for deploying VMs and services, you have to decide between two variants. Heavyweight: a single service in a single VM. Lightweight: a single VM with multiple rootless users, each with its own rootless Docker daemon managed by systemd. If you want some guidance on that decision, read the section below.

Decide what type of VM-Deployment you need

I use docker (or podman) in both types, which makes migration between the two variants relatively easy. So this is largely a system design question for me. For instance, I decided to use the Heavyweight variant (single service on a dedicated VM) for the following:

Nextcloud
Gitlab
Home Assistant
Mailcow (mailcow-dockerized)

These are all mission critical services in my Homelab. I wanted them to be maximally isolated. They also require a lot of special Firewall rules, which are easy to configure on the IP-level of a single VM. Also, Nextcloud has r/w access to a large part of my underlying ZFS storage layer. If another service in that VM became compromised, that access privilege would offer a pretty large attack surface that I wanted to avoid.

The second variant, Lightweight, I use for smaller and dedicated feature-services:

Immich: Only needs read-access to the portion of my Nextcloud user's InstantUpload/Camera folders
Funkwhale: Only needs read-access to the portion of my Nextcloud user's music folders
Grafana/Invidious/Miniflux etc.: These don't need any access to my tank_hdd. They are very small, and spending the overhead of a full VM on each of them would be a pity

How many VMs of each type you create is up to you. I have several VLANs that I use to segment my user network access. There's a private development VLAN (Gitlab, MQTT, Grafana). Then there's a "guest" services VLAN, akin to a demilitarized zone, for internal services like Funkwhale, Invidious, Miniflux (etc.) that need to be accessed by a large portion of my private network users. Having these services organized in separate VLANs simply makes it easier to manage firewall rules.

The key is not how to assign services, but to follow the same organization pattern on each level of the hierarchy. Because every VM is organized the same way, you will know immediately where to find the data and how to deal with service updates/backups. Being consistent is the main benefit that makes such a hierarchical system easy to manage.

Creating a qcow2 VM template

To utilize disposable VMs, we must first build a Cloud-Init template. In homelabs and small enterprises, there is a habit of treating servers as "Pets": lovingly hand-crafted and manually patched systems that we tend to individually. A modern architecture considers compute as "Cattle": numbered, identical, and replaced when sick. If you are still the "Pet" type, I strongly advise switching to the Cattle approach, for your peace of mind.

By separating the state (the persistent data on ZFS) from the compute (the ephemeral "Cattle" running the OS), we eliminate configuration drift. Rather than clicking through a Debian ISO installer and maintaining a server for years, we download an official pre-built qcow2 cloud image. However, instead of doing this on the hypervisor, we use a disposable tooling VM. This adheres to the "Infrastructure as Markdown" ³ philosophy. The documentation is the server. We spin up a temporary VM, use it to build our customized Debian 13 template, and then destroy it. When Debian 14 arrives, we just run these commands again.

Create the Tooling VM

[Proxmox Hypervisor] First, download the vanilla cloud image to the ISO directory on your Proxmox hypervisor.

cd /var/lib/vz/template/iso/
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

Create the Tooling VM.

qm create 999 \
    --name build-vm \
    --memory 2048 \
    --net0 virtio,bridge=vmbr1,tag=40 \
    --agent 1
qm importdisk 999 debian-13-genericcloud-amd64.qcow2 tank_ssd

Attach the disk and configure boot.

qm set 999 --scsihw virtio-scsi-single
qm set 999 --scsi0 tank_ssd:vm-999-disk-0
qm set 999 --boot order=scsi0
qm set 999 --serial0 socket --vga serial0

Setup Cloud-Init (so we can log into it).

Adjust ipconfig0 to your network: 192.168.40.0/24 is my "service" VLAN, and 192.168.40.1 is its gateway.

qm set 999 --ide2 tank_ssd:cloudinit
qm set 999 --ciuser debian
qm set 999 --cipassword strong_password
qm set 999 --ipconfig0 ip=192.168.40.99/24,gw=192.168.40.1

Resize the disk so we have room to work, then start it

qm disk resize 999 scsi0 +10G
qm start 999

To pass the .qcow2 file back and forth between the hypervisor and the Tooling VM, we map the Proxmox ISO directory via VirtIO-FS:

In the Proxmox GUI, go to Datacenter -> Resource Mapping -> Directories. Click Add (ID: proxmox_iso, Path: /var/lib/vz/template/iso/).
Go to VM 999 -> Hardware. Add VirtioFS (Select proxmox_iso, Tag: proxmox_iso).
Under Processors, ensure NUMA is checked (required for VirtIO-FS).

Customize the VM Image

[Tooling VM] SSH into your newly booted Tooling VM.

Dedicated SSH User vs. Direct Root Login

Notice that we created a standard user (debian) in the Cloud-Init configuration rather than enabling root access.

Allowing direct SSH access to the root account is a security anti-pattern. It exposes your most privileged account and breaks monitoring. The standard practice is therefore to connect via SSH using an unprivileged user, and then elevate to root using sudo -i. Since all we ever do on these VMs is administration, I decided to not add a password here for the user. This is up to you.

Once logged in, elevate to root and mount the Proxmox ISO directory.

sudo -i
mkdir -p /mnt/iso
mount -t virtiofs proxmox_iso /mnt/iso

Locale Errors in Cloud Images

Cloud images are stripped bare to save space. When installing tooling like libguestfs, you will often see locale warnings. Exporting the C.UTF-8 fallback locale fixes this:

apt-get update && apt-get install -y locales
export LANG=C.UTF-8 LC_ALL=C.UTF-8
update-locale LANG=C.UTF-8 LC_ALL=C.UTF-8

Install the customization tools and prepare a copy of the image.

apt update && apt install -y libguestfs-tools qemu-guest-agent
apt purge -y mdadm && apt autoremove -y

cd /mnt/iso

Make a copy to work on.

cp debian-13-genericcloud-amd64.qcow2 debian-13-template-v1.qcow2

Inject the QEMU Guest Agent into the disk image and wipe any unique system states (like MAC addresses or Machine IDs).

# Force libguestfs to use the direct backend (avoids nested virtualization errors)
export LIBGUESTFS_BACKEND=direct

# Inject the agent
virt-customize -a debian-13-template-v1.qcow2 \
  --install qemu-guest-agent

# Clean the image so it is ready for cloning
virt-sysprep -a debian-13-template-v1.qcow2

Create the template, destroy the tooling VM

[Proxmox Hypervisor] Log out of the tooling VM 999. Back on the Proxmox Host, we destroy the tooling VM and lock the modified image into a permanent Base Template (ID 9000).

Destroy the Tooling VM permanently:

qm stop 999
qm destroy 999 --purge

Create the final Base Template VM. We use ostype l26 for Linux 2.6+ ⁴. I used tag=40 for my VLAN - adjust this according to your setup.

qm create 9000 \
    --name "debian-13-template" \
    --memory 2048 --net0 virtio,bridge=vmbr1,tag=40 \
    --agent 1 --ostype l26

Import the customized disk to the persistent, encrypted storage.

qm importdisk 9000 /var/lib/vz/template/iso/debian-13-template-v1.qcow2 encrypted_zfs

Attach hardware, enable SSD emulation and TRIM.

qm set 9000 \
    --scsihw virtio-scsi-single \
    --scsi0 encrypted_zfs:vm-9000-disk-0,ssd=1,discard=on

Configure Cloud-Init drives and Boot Order.

qm set 9000 --ide2 encrypted_zfs:cloudinit
qm set 9000 --boot order=scsi0
qm set 9000 --serial0 socket --vga serial0
qm set 9000 --cpu host
qm set 9000 --cores 2

Apply your default sysadmin credentials. These will be inherited by all clones.

qm set 9000 --ciuser alex
qm set 9000 --sshkeys "ssh-ed25519 AAAAC3NzaC1... your_public_key_here"

Convert the VM into a read-only template with qm template 9000. This renames the disk to base-9000-disk-0.

Disk Sizing, growpart, and the dmesg GPT Warning

You may have noticed that we kept the template's disk size small (2GB). This is intentional. When we deploy a new VM, this template is cloned. We can then resize it to our needs (see Mounting to the VM) before booting.

When this newly resized VM boots, expect to see a warning in your hypervisor's dmesg log:

GPT:Primary header thinks Alt. header is not at the end of the disk.

This is harmless. When Proxmox expanded the virtual disk from 2GB to 30GB, the hypervisor's kernel instantly noticed that the backup GPT header was sitting at the 2GB mark instead of the end of the disk.

However, Debian Cloud-Init images include growpart. When the new VM boots, growpart detects the extra space, moves the GPT headers to the new 30GB boundary, and expands the root partition. By the time you SSH into the VM, the problem is fixed. This process leaves the warning behind in the hypervisor logs.

Template base snapshot

When you create a VM template (like our base VM 9000) and deploy linked clones from it, Proxmox creates a ZFS snapshot called @__base__. This snapshot acts as the read-only anchor for all cloned VMs.

If you run a zfs list -t snapshot, you will see it:

NAME                                          USED  AVAIL  REFER  MOUNTPOINT
tank_ssd/secure/base-9000-disk-0@__base__       88K     -   678M  -

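Unlike regular snapshots (including automated Sanoid snapshots), which are harmless to prune, never delete the '@__base__' snapshot. Every linked clone depends on it. ZFS refuses a plain destroy while clones exist, and forcing it with zfs destroy -R would destroy all dependent clone disks along with it.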

To distinguish between VMs and templates, tags are a great Proxmox feature to visually label different groups in your overview:

Figure: Proxmox tagging.

The last step: Create a linked clone from template 9000 as the new VM 200.

qm clone 9000 200 --name nextcloud-vm --full 0

creating a clone of VM 9000 with ID 200
create full clone of drive ide2 (encrypted_zfs:vm-9000-cloudinit)
allocated target volume 'encrypted_zfs:vm-200-cloudinit'
create linked clone of drive scsi0 (encrypted_zfs:base-9000-disk-0)

Summary: Creating a new VM is as simple as cloning ID 9000, selecting Linked Clone and starting it! It takes seconds and uses almost zero storage. For the customizations needed to add an ephemeral Docker disk, read on.

Persistent vs. ephemeral data

Obviously, I create and manage my zpools on the Hypervisor (Proxmox). The VM/service pool (tank_ssd) separates ZFS datasets into two main hierarchies:

tank_ssd/secure - Persistent data that is encrypted, snapshotted, and backed up offsite.
tank_ssd/ephemeral - Ephemeral data (VM base images, Docker storage layers), not backed up, not snapshotted, not encrypted.

This was not always the case. However, I realized that mixing ephemeral data (temporary files, logs, docker base image layers) with persistent data (configuration files, content) is not a good idea (see the 12-Factor principle). It makes backups difficult and resource intensive, and ZFS snapshots grow in size without any benefit. This separation is also the main benefit of docker/podman: they help to abstract away and layer the ephemeral vs. persistent data. Once you've used this system for a while, it becomes a driver of productivity, because at each level of the system you will know what is precious and must be kept, and what is just temporary data passing by.

[Proxmox Hypervisor] Let's say you already have tank_ssd. First, create the persistent secure dataset.

zfs create \
    -o encryption=aes-256-gcm \
    -o keyformat=passphrase \
    -o compression=on tank_ssd/secure

Add it to your Proxmox storage configuration:

nano /etc/pve/storage.cfg

Add:

zfspool: encrypted_zfs
        pool tank_ssd/secure
        content rootdir,images
        mountpoint /tank_ssd/secure
        sparse 1

Afterwards, create the ephemeral storage (for Docker etc.), with encryption off:

zfs create -o compression=lz4 tank_ssd/ephemeral

We will later make sure that no private service data is written to this storage layer. Add this to storage.cfg:

zfspool: ephemeral_docker
        pool tank_ssd/ephemeral
        content images
        sparse 1

Reload storage afterwards with:

We will later add backup=0 to any VM disk that uses ephemeral_docker and map rootless docker user homes (~/.local/share/docker) to it. If you use sanoid/syncoid ⁵ for ZFS snapshot automation, you can now head to your /etc/sanoid/sanoid.conf and only add tank_ssd/secure.

VirtIO-FS and ID-Mapping

In a standard setup, sharing files from the hypervisor to a VM usually means setting up an NFS or SMB server on the host and connecting to it via the virtual network. This introduces network stack overhead, caching issues, and permission mapping headaches. Previously, in my Docker-in-unprivileged-LXC setup, I used lxc.idmap to mount hypervisor directories into the LXCs running Docker. This option is not available for VMs.

Like the bind mounts I used with LXC, VirtIO-FS bypasses the network completely. It is a shared file system that lets virtual machines access a directory tree on the host through a socket served by the virtiofsd daemon. With the Linux Virtual File System (VFS) ID-mapping feature, we can get the same effect as with lxc.idmap. For example, my music files are owned by UID 1005 on my hypervisor, but the isolated Nextcloud user inside my VM requires UID 33 (www-data). VFS ID-mapping translates these IDs on the fly at the kernel level with little to no performance penalty.

virtiofsd --translate-uid

The virtiofsd binary natively supports --translate-uid, but when running as a root daemon (the Proxmox default), it crashes unless inside a namespace. ⁶ I therefore used the X-mount.idmap feature of the standard mount utility instead, which I found more reliable.

Umbrella Directories

QEMU has a limit on how many PCI devices you can attach to a single VM. Additionally, for every VirtIO-FS share you attach, Proxmox spawns a dedicated virtiofsd background process. If you attach 10 different service directories to a single VM, the VM can hang during boot because QEMU trips over a race condition by waiting for 10 daemons to establish their socket files.

If you need to mount multiple hypervisor folders to a VM, the solution is to create a single umbrella directory on the hypervisor first, bind-mount all required directories into it, and then pass just this single umbrella directory to the VM.

[Proxmox Hypervisor] Create the umbrella folder for the multi-tenant Docker VM (ID 201).

mkdir -p /mnt/virtiofs_mapped/vm201_media

..or for the single-tenant Nextcloud VM (ID 200 in my example):

mkdir -p /mnt/virtiofs_mapped/nc_data

Create subdirectories for the isolated services.

mkdir -p /mnt/virtiofs_mapped/vm201_media/funkwhale/alex/01_Music
mkdir -p /mnt/virtiofs_mapped/vm201_media/funkwhale/anne/01_Music
mkdir -p /mnt/virtiofs_mapped/vm201_media/immich/photos

VFS ID-Mapping with `/etc/fstab`

Now we bind-mount our actual ZFS datasets into those subdirectories. For ID-mapping, we apply the X-mount.idmap parameter. If you have difficulties imagining this (me too!), here's a diagram:

[Hypervisor]           [VFS ID-Map]            [VM 201]
UID 1005 (Alex)   -->  b:1005:999:1   -->      UID 999 (Funkwhale)

The syntax is b:<host_uid>:<guest_uid>:<count>, where the leading b applies the mapping to both UIDs and GIDs. In the diagram above, we are telling the hypervisor to take files owned by Host UID 1005 and make them appear through this specific mount point as Guest UID 999.
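To make the notation concrete, here is a small illustrative lookup written in plain shell. It only mimics the arithmetic of the b:<host_uid>:<guest_uid>:<count> entries as they are described in this post; it does not touch the kernel or any mount. The function name idmap_lookup is made up for this sketch. IDs that no entry covers are printed as "unmapped" here; on a real idmapped mount they typically show up as the overflow ID (nobody).

```shell
# Illustrative only: translate one ID through a space-separated list of
# b:<host>:<guest>:<count> entries (notation as used in this post).
# Usage: idmap_lookup <id> "<entry> [<entry> ...]"
idmap_lookup() (
    id=$1
    for entry in $2; do
        # Split an entry such as "b:1005:999:1" into its four fields.
        IFS=: read -r _type from to count <<EOF
$entry
EOF
        if [ "$id" -ge "$from" ] && [ "$id" -lt $((from + count)) ]; then
            echo $((to + id - from))
            exit 0
        fi
    done
    echo unmapped
)

idmap_lookup 1005 "b:1005:999:1"      # host UID 1005 -> 999
idmap_lookup 1000 "b:1005:999:1"      # not covered   -> unmapped
```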

[Proxmox Hypervisor] Edit /etc/fstab and map Nextcloud data folder (Host 1005) to Nextcloud VM (Guest 33) as read and write:

/media/secure/nextcloud/data /mnt/virtiofs_mapped/nc_data none bind,noauto,X-mount.idmap=b:100000:0:33\040b:1005:33:1\040b:100034:34:65502 0 0

\040: This is the octal code for a space character. It is required because fstab uses spaces as column delimiters.
b:1005:33:1: Map Host UID 1005 to Guest UID 33 (Nextcloud/www-data).
b:100000:0:33: Map Host 100000 (standard SubUID start) to Guest 0-32 range.
b:100034:34:65502: Map the remaining range.
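The three-part mapping follows a simple pattern: everything below the target guest ID comes from the SubUID range, the target ID itself comes from the real host owner, and everything above it comes from the SubUID range again. The helper below is a sketch of that arithmetic, assuming a 65536-wide guest range and a SubUID base of 100000. The name idmap_punch is hypothetical and not part of any tool.

```shell
# Hypothetical helper: build an X-mount.idmap value that maps one host ID
# onto one guest ID and fills the rest of the 0-65535 guest range from a
# subordinate ID base.
# Usage: idmap_punch <host_id> <guest_id> [subid_base]
idmap_punch() (
    host_id=$1
    guest_id=$2
    base=${3:-100000}
    if [ "$guest_id" -lt 1 ] || [ "$guest_id" -gt 65534 ]; then
        echo "guest id must be between 1 and 65534" >&2
        exit 1
    fi
    above=$((guest_id + 1))      # first guest ID after the punched-out one
    rest=$((65536 - above))      # number of guest IDs above it
    # "\040" is the fstab escape for the space between entries.
    printf 'b:%s:0:%s\\040b:%s:%s:1\\040b:%s:%s:%s\n' \
        "$base" "$guest_id" "$host_id" "$guest_id" \
        "$((base + above))" "$above" "$rest"
)

idmap_punch 1005 33
# prints: b:100000:0:33\040b:1005:33:1\040b:100034:34:65502
```

The output for host 1005 and guest 33 is the same value used in the Nextcloud fstab line above, which is a quick way to double-check the counts when you adapt the mapping to another guest UID.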

Similarly, this is my mapping for Music (Host 1005) to Funkwhale (Guest 999) as read-only:

/media/secure/00_Alex/02_Music/01_Music /mnt/virtiofs_mapped/vm201_media/funkwhale/alex/01_Music none bind,noauto,ro,X-mount.idmap=b:1005:999:1 0 0
/media/secure/00_Anne/02_Music /mnt/virtiofs_mapped/vm201_media/funkwhale/anne/01_Music none bind,noauto,ro,X-mount.idmap=b:1005:999:1 0 0

b:1005:999:1: Map Host UID 1005 to Guest UID 999 (rootless Funkwhale user).
ro: Read-only - Funkwhale does not need to modify files.
noauto: Do not auto-mount on boot - see my notes on ZFS Encryption and manual unlocking below.

Apply or refresh the mounts on the hypervisor afterwards with:

Mounting to the VM

[Proxmox Hypervisor] We only need to attach the single umbrella directory to the VM. First register the directory in Proxmox:

pvesh create /cluster/mapping/dir \
    --id vm201_media \
    --map node=parrot,path=/mnt/virtiofs_mapped/vm201_media

Then attach it to the VM as virtiofs0.

qm set 201 --virtiofs0 dirid=vm201_media

[VM 201] Inside the VM 201, mount the tag in /etc/fstab:

vm201_media /srv/media virtiofs ro,relatime 0 0

Note

I did not add nofail to it because I wanted the VM to fail to boot should any of the VirtIO mounts be unavailable, just as a precaution.

When you ls -alhn /srv/media/funkwhale inside the VM, the files report as owned by 999, while remaining untouched on the hypervisor's ZFS pool.

Example ls.

ls -alhn /srv/media/funkwhale/alex
total 147K
drwxr-xr-x   5   0   0   5 May 15 15:43 .
drwxr-xr-x   4   0   0   4 May 15 15:44 ..
drwxr-xr-x 103 999 999 103 Dec 28 17:46 01_Music
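If you want to check the mapping from a script rather than by eye, a small assertion helper is enough. This is a sketch, assuming GNU stat as shipped with Debian; the function name assert_owner is made up for this example.

```shell
# Sketch: fail loudly if a path is not owned by the expected numeric UID.
# Usage: assert_owner <path> <expected_uid>
assert_owner() {
    actual=$(stat -c '%u' "$1") || return 1
    if [ "$actual" != "$2" ]; then
        echo "owner of $1 is $actual, expected $2" >&2
        return 1
    fi
}

# Example inside VM 201:
# assert_owner /srv/media/funkwhale/alex/01_Music 999
```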

More Examples

For my Nextcloud VM, the steps are the same. Running cat /etc/pve/mapping/directory.cfg on my Proxmox host shows the following:

nc_data
    map node=parrot,path=/mnt/virtiofs_mapped/nc_data

nc_alex
    map node=parrot,path=/mnt/virtiofs_mapped/nc_alex

vm201_media
    map node=parrot,path=/mnt/virtiofs_mapped/vm201_media

Here is my VM 200 (Nextcloud) /etc/fstab:

nc_data /srv/nextcloud/data/nextcloud/data virtiofs rw,relatime 0 0
nc_alex /mnt/data/00_Alex virtiofs ro,relatime 0 0

Here, I did not use the umbrella-directory strategy yet and mapped multiple folders via individual virtio-fs sockets.

Boot Conditions

If you implement this, you will run into two Linux race conditions. Here is how to fix them.

ZFS vs. fstab Boot Race

[Proxmox Hypervisor] During hypervisor boot, systemd parallelizes tasks. It will read the /etc/fstab and execute bind mounts before ZFS finishes importing pools. Your bind mounts will attach to empty placeholder directories.

If your datasets mount automatically (a standard ZFS setup), append x-systemd.requires=zfs-mount.service to your fstab bind options. This forces the bind mount to wait for ZFS.
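As a sketch of what "append to your bind options" means in practice, the filter below adds the option to the fourth field of an fstab entry read from standard input. The sample paths are placeholders, and the helper name add_zfs_dependency is hypothetical. Review the result before writing it back to /etc/fstab.

```shell
# Sketch: append the systemd dependency to the options field (field 4)
# of every fstab entry piped in. Fields are re-joined with single spaces.
add_zfs_dependency() {
    awk '{ $4 = $4 ",x-systemd.requires=zfs-mount.service"; print }'
}

# Placeholder entry for an auto-mounted dataset:
printf '%s\n' '/tank/music /mnt/virtiofs_mapped/music none bind,ro,X-mount.idmap=b:1005:999:1 0 0' \
    | add_zfs_dependency
```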

If you use manual passphrases (a manually decrypted ZFS setup), you must use the noauto flag in fstab, as I did, so the system doesn't mount them at boot. You then use an unlock.sh script after booting:

#!/bin/bash

zfs mount -l tank_hdd/data
zfs mount -l tank_ssd/secure

# Find all virtiofs mappings in fstab and mount them
awk '$2 ~ /^\/mnt\/virtiofs_mapped/ {print $2}' /etc/fstab | xargs -r -n1 mount

echo "ZFS Unlocked and VFS Bind Mounts attached."
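Before trusting the awk filter with real mounts, it can help to run it against a sample file and look at what it selects. The sketch below uses a temporary file with made-up entries; only the awk expression is taken from the script above.

```shell
# Dry run: which mount points would the filter from unlock.sh pick up?
list_virtiofs_targets() {
    awk '$2 ~ /^\/mnt\/virtiofs_mapped/ {print $2}' "$1"
}

# Sample fstab with made-up entries.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/pve/root / ext4 errors=remount-ro 0 1
/tank/a /mnt/virtiofs_mapped/nc_data none bind,noauto 0 0
/tank/b /mnt/virtiofs_mapped/vm201_media/funkwhale/alex/01_Music none bind,noauto,ro 0 0
/tank/c /mnt/other none bind 0 0
EOF

list_virtiofs_targets "$sample"   # lists only the two /mnt/virtiofs_mapped targets
rm -f "$sample"
```

Running list_virtiofs_targets /etc/fstab on the hypervisor shows the exact list that the unlock script would hand to mount.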

VirtIO-FS 'Warm Reboot' Bug

If you type reboot inside a VM using VirtIO-FS, the VM kernel gracefully halts, but the KVM process on the hypervisor does not actually die; it just resets. Consequently, the virtiofsd socket daemon never restarts. When the guest boots back up, the virtiofsd socket daemon is out of sync with the guest kernel. This causes the VM's mount sequence to hang permanently. Here's a related discussion ⁷. I was able to reliably reproduce this behaviour.

Therefore, never use reboot for VirtIO-FS VMs. Always use Shutdown -> wait for it to stop -> Start. This cleanly destroys and recreates the socket daemons.
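On the hypervisor, the same cycle can be scripted. The following is a minimal sketch, not a hardened tool: the 120-second timeout is an arbitrary assumption, and the QM variable exists only so the function can be dry-run with QM=echo.

```shell
# Sketch of a cold restart for a VirtIO-FS VM.
# Usage: cold_restart <vmid>     (dry run: QM=echo cold_restart 201)
cold_restart() {
    vmid=$1
    # qm shutdown waits for the guest to power off (here: up to 120 s).
    # Only if that succeeds is the VM started again, which gives it
    # freshly spawned virtiofsd processes.
    "${QM:-qm}" shutdown "$vmid" --timeout 120 && "${QM:-qm}" start "$vmid"
}
```

If the shutdown times out, the function stops there instead of starting the VM on top of a half-stopped instance.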

Rootless Docker Service Isolation

With the VMs deployed and persistent ZFS data safely mounted as read-only (or read-write where necessary), we can now configure the software.

For the "Lightweight" multi-tenant VM, we want to run several unrelated services (Funkwhale, Immich, Miniflux, etc.). Mixing them into a single root-level Docker daemon conflicts with the zero-trust principle. Instead, we use systemd to spawn individual rootless Docker daemons for dedicated service accounts. If the Miniflux container is compromised, the attacker is still confined to an unprivileged user account with no access to the Docker sockets of the other services or to the underlying OS.
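The separation is visible in the socket paths. By default, a rootless Docker daemon listens on $XDG_RUNTIME_DIR/docker.sock, which is /run/user/<uid>/docker.sock for a systemd user session, and that directory is only accessible to its owner. The helper below is a sketch that prints this default path for an account; the function name is made up, and the path is wrong if you changed the daemon's socket location.

```shell
# Sketch: print the default rootless Docker socket URL for an account.
# Usage: rootless_docker_host <username>
rootless_docker_host() {
    uid=$(id -u "$1") || return 1
    echo "unix:///run/user/${uid}/docker.sock"
}

# Example, run as an admin on the VM ("funkwhale" is the service account
# from the diagram above):
# sudo -u funkwhale DOCKER_HOST="$(rootless_docker_host funkwhale)" docker ps
```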

Podman vs Docker

You may point to Podman here, because Podman is natively rootless and needs neither a daemon nor additional workarounds. I agree. However, rootless Docker under a dedicated user account is also robust, and I decided to stick with docker-compose.yml because it is still the industry (and homelab) standard. This minimizes the time I need to translate examples or templates. Rootless Docker and Podman are mostly interchangeable at this level. Both run processes in isolated Linux namespaces on your host VM (not the Hypervisor!).

Figure: htop on the VM 201 (multi-tenant docker): Different dockerd processes running under different unprivileged user accounts.

To keep our hypervisor backups small, we ensure that the Docker image layers (which rootless Docker stores in ~/.local/share/docker) are written to our ephemeral_docker ZFS dataset, not the VM's backed-up root disk.

VM configuration with ephemeral Docker disk

[Proxmox Hypervisor] Let's provision our multi-tenant VM (ID 201). We clone the template, attach an unbacked-up 20GB ephemeral disk for Docker, configure the network via Cloud-Init, enable NUMA (required for VirtIO-FS), and boot.

# Clone the template and add the ephemeral Docker disk
qm clone 9000 201 --name docker-vm --full 0
qm set 201 --scsi1 ephemeral_docker:20,ssd=1,backup=0

# Configure network, enable NUMA, expand the OS disk, and boot
qm set 201 --ipconfig0 ip=192.168.40.77/24,gw=192.168.40.1 \
           --nameserver 192.168.40.1 --searchdomain local.example.com

# Set network segment (vmbr0, vmbr1, etc.) and VLAN tag
qm set 201 --net0 virtio,bridge=vmbr1,tag=40

qm set 201 --memory 2048 --numa 1
qm disk resize 201 scsi0 50G
qm start 201

[VM 201] SSH into the new VM and elevate to root. Since this is a fresh clone, we wipe the cached cloud-init state and fix the locales (I had to do this - YMMV).

ssh alex@192.168.40.77
sudo -i

# Clean cloud-init caches
cloud-init clean

# Fix locales to prevent generation errors
apt-get update && apt-get install -y locales rsync xfsprogs
export LANG=C.UTF-8 LC_ALL=C.UTF-8
update-locale LANG=C.UTF-8 LC_ALL=C.UTF-8
sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen
locale-gen

We format the newly attached virtual disk for Docker as XFS. I chose XFS over ext4 here because its dynamic inode allocation and parallel write performance suit container workloads well.

Do not copy & paste /dev/sda or /dev/sdb

Because the Linux kernel probes SCSI devices asynchronously during boot, device names (sda, sdb) are not deterministic. Your OS disk might be sdb and your empty Docker disk might be sda.

First, list your block devices to identify the empty disk. Look for the disk matching the size you just provisioned (e.g., 20G or 50G) that has no partitions (no branches beneath it), and no mount points.

lsblk
> NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
> sda       8:0    0   50G  0 disk                 <-- This is our empty 50G disk
> sdb       8:16   0   50G  0 disk 
> ├─sdb1    8:17   0 49.9G  0 part /               <-- This is the OS disk
> ├─sdb14   8:30   0    3M  0 part
> └─sdb15   8:31   0  124M  0 part /boot/efi
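
To avoid misreading the tree, the same check can be scripted. This is a minimal sketch under assumptions: the helper name disks_without_partitions is made up, and it only reports whole disks that have no partitions. It does not look at mount points or existing filesystem signatures, so cross-check the candidate with lsblk and blkid before formatting anything.

```shell
# Sketch: print whole disks that have no partitions.
# Expects the output of `lsblk -rn -o NAME,TYPE,PKNAME` on stdin.
# The function name is an assumption for illustration.
disks_without_partitions() {
    awk '
        $2 == "disk" { disks[$1] = 1 }       # remember every whole disk
        $2 == "part" { has_part[$3] = 1 }    # PKNAME is the parent disk
        END {
            for (d in disks)
                if (!(d in has_part)) print d
        }
    ' | sort
}

# Usage: lsblk -rn -o NAME,TYPE,PKNAME | disks_without_partitions
```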

Format it:
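
The formatting step boils down to a single mkfs.xfs call on the empty disk. A minimal sketch, assuming the empty disk is /dev/sda as in the lsblk example above (yours may differ): the hypothetical helper format_xfs_if_blank refuses to format when blkid already finds a signature on the device. The BLKID and MKFS variables are illustrative overrides for dry runs and are not part of any tool.

```shell
# Sketch: format a disk as XFS only if blkid finds no existing signature.
# BLKID and MKFS are overridable so the logic can be dry-run; the function
# name is an assumption for illustration.
format_xfs_if_blank() {
    dev=$1
    # blkid exits 0 when it finds a signature on the device.
    if "${BLKID:-blkid}" "$dev" >/dev/null 2>&1; then
        echo "Refusing to format $dev: existing signature found" >&2
        return 1
    fi
    "${MKFS:-mkfs.xfs}" "$dev"
}

# Usage (device name taken from the lsblk example, verify yours first):
# format_xfs_if_blank /dev/sda
```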

Before we mount it, we must get the UUID of the new filesystem.

blkid /dev/sda
# Output: /dev/sda: UUID="c323dab0-9e23-4b2d..." BLOCK_SIZE="512" TYPE="xfs"

Mount order and UUID

Do not use /dev/sdb or /dev/sda in your /etc/fstab. The Linux kernel probes and initializes block devices (like virtio-scsi disks) asynchronously to speed up boot times ⁸. If systemd tries to mount /dev/sdb before the virtual controller finishes initializing the device node, the mount will fail, and your VM will halt the boot process. I experienced this randomly and solved it by using the UUID= format. This forces systemd to create a device dependency and wait for the udev subsystem to announce that the specific disk is fully ready.

Add this to /etc/fstab:

UUID=c323dab0-9e23-4b2d-adc8-2e166c8ad53c /mnt/ephemeral_docker xfs defaults,discard 0 0
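
To avoid copy-paste mistakes with the UUID, the entry can be generated instead of typed. This is a minimal sketch: the helper name fstab_entry and its argument order are assumptions, and the mount options simply mirror the line above.

```shell
# Sketch: build an fstab entry from a filesystem UUID.
# The function name and argument order are assumptions for illustration.
fstab_entry() {
    uuid=$1
    mountpoint=$2
    fstype=${3:-xfs}
    printf 'UUID=%s %s %s defaults,discard 0 0\n' "$uuid" "$mountpoint" "$fstype"
}

# Usage (run as root; /dev/sda is the example device from above):
# fstab_entry "$(blkid -s UUID -o value /dev/sda)" /mnt/ephemeral_docker >> /etc/fstab
```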

Note

For Heavyweight instances (rootful Docker on the VM host, e.g. VM 200 in my case - Nextcloud), use instead:

UUID=c323dab0-9e23-4b2d-adc8-2e166c8ad53c /var/lib/docker xfs defaults,discard 0 0

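Create the mount point (/mnt/ephemeral_docker on the lightweight VM, /var/lib/docker on heavyweight instances), reload mount points and mount the disk.

mkdir -p /mnt/ephemeral_docker   # heavyweight instances: /var/lib/docker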
systemctl daemon-reload
mount -a

Install standard Docker via the official repository. Follow the docs.

See the commands (as of 2026-06-09)

# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/debian
Suites: $(. /etc/os-release && echo "$VERSION_CODENAME")
Components: stable
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/docker.asc
EOF

sudo apt update && \
    sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Creating Service Accounts (Rootful vs. Rootless)

At this stage, your path diverges depending on the VM type:

For a Single-Tenant VM (e.g., Nextcloud):

[VM 200] We create an unprivileged user, give them ownership of their directory, and simply add them to the docker group. The service runs via the standard root-level daemon.

useradd -r -s /bin/bash -m -d /srv/nextcloud -U nextcloud
usermod -L nextcloud
usermod -aG docker nextcloud

# to login as the newly created user, use:
# sudo -u nextcloud -H bash
# or 
# su - nextcloud

For a Multi-Tenant VM (e.g., Funkwhale, Immich):

[VM 201] We disable the root daemon, install the rootless dependencies, and configure user namespaces. Read the full story here. Here I just list the direct commands:

systemctl stop docker.socket docker.service
systemctl disable --now docker.socket docker.service
rm /var/run/docker.sock

apt-get install -y uidmap dbus-user-session systemd-container docker-ce-rootless-extras slirp4netns

Create the dedicated user and lock the account:

useradd -r -s /bin/bash -m -d /srv/funkwhale -U funkwhale
usermod -L funkwhale

Align User Identities

If you are migrating between systems, it makes sense to keep the UID/GID of the previous owners, so you don't need to chown files.

First, on the old box, get the ID of the user files you are migrating:

id funkwhale
# Example output:
# > uid=998(funkwhale) gid=998(funkwhale) groups=998(funkwhale)

Create the user on the new VM using these numbers:

groupadd -g 998 funkwhale
useradd -u 998 -g 998 -m -d /srv/funkwhale -s /bin/bash funkwhale
usermod -L funkwhale

# to login as the newly created user, use:
# machinectl shell funkwhale@

You can then use rsync with --numeric-ids to keep UIDs/GIDs intact 1:1 on transfer. Another rsync tip that may come in handy: add -x to tell it to skip any mount points.

If you are forced to change the UID on the new VM (e.g., because the UID/GID already exists), drop --numeric-ids and use --chown instead, for example --chown=1001:1001. This overrides user and group ownership with the new IDs.

[VM 201] We now create a folder on our ephemeral XFS disk, assign it to the service account, and bind-mount it over the default rootless Docker path.

mkdir -p /mnt/ephemeral_docker/funkwhale
chown funkwhale:funkwhale /mnt/ephemeral_docker/funkwhale

sudo -u funkwhale mkdir -p /srv/funkwhale/.local/share/docker

Append the bind mount to /etc/fstab:

/mnt/ephemeral_docker/funkwhale /srv/funkwhale/.local/share/docker none bind 0 0

Run afterwards:

systemctl daemon-reload && mount -a

Rootless Docker will write to ~/.local/share/docker, and the Linux kernel will redirect writes to our unbacked-up ephemeral disk.

Configuring Rootless Namespaces

[VM 201] Next, we allocate subuid and subgid ranges for the new user. Use this one-liner on the VM (root) to find the next available ID block and assign it. It will update /etc/subuid and /etc/subgid on the host.

Info

If you prefer Ansible, use this playbook and run it with:

ansible-playbook -i inventories/hosts 1_setup_rootless_user.yml -K

The steps without Ansible. Replace funkwhale with your Linux service user name in the command below.

USER_NAME="funkwhale"; RANGE_SIZE=65536; next_start=$(awk -F: '{print $2 + $3}' /etc/subuid /etc/subgid 2>/dev/null | sort -n | tail -1); next_start=${next_start:-100000}; echo "Using start_id=$next_start"; usermod --add-subuids ${next_start}-$((next_start + RANGE_SIZE - 1)) --add-subgids ${next_start}-$((next_start + RANGE_SIZE - 1)) "$USER_NAME"
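
The one-liner is dense, so here is the range calculation on its own as a readable sketch. The function name next_subid_start is an assumption; the logic is the same idea as above: take the highest start + count across both files and fall back to 100000 when nothing is allocated yet.

```shell
# Sketch: compute the next free sub-ID start from subuid/subgid style files.
# Each line has the form name:start:count. Function name is an assumption.
# Always pass at least one file, otherwise awk waits on stdin.
next_subid_start() {
    start=$(awk -F: 'NF >= 3 { end = $2 + $3; if (end > max) max = end }
                     END { if (max > 0) print max }' "$@" 2>/dev/null)
    # Fall back to 100000 when no range has been allocated yet.
    echo "${start:-100000}"
}

# Usage (as root on the VM):
# start=$(next_subid_start /etc/subuid /etc/subgid)
# usermod --add-subuids "$start-$((start + 65535))" \
#         --add-subgids "$start-$((start + 65535))" funkwhale
```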

IPv4 Forwarding

Rootless Docker relies on slirp4netns to route internal container traffic. By default on Debian, IPv4 forwarding is disabled. This causes rootful containers to lose internet access (WARNING: IPv4 forwarding is disabled. Networking will not work.). While slirp4netns bypasses the kernel's IP forwarding rules in rootless mode, you may still get this warning. I enabled forwarding anyway to silence the warning and to prevent networking edge cases. The steps below also allow the ping command, which containers often use for healthchecks.

At the VM level:

echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/99-docker.conf
echo "net.ipv4.ping_group_range = 0 2147483647" >> /etc/sysctl.d/99-docker.conf
sysctl -p /etc/sysctl.d/99-docker.conf

See ¹⁰.

[VM 201: funkwhale namespace] Finally, log into the service account using machinectl and install the daemon:

machinectl shell funkwhale@
dockerd-rootless-setuptool.sh install

Add the required variables to the user's ~/.bashrc as instructed by the installer, reload the shell.

[VM 201] Then enable lingering so the daemon starts on boot:

# CTRL+D
loginctl enable-linger funkwhale
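
A quick way to confirm the setting took effect, as a minimal sketch: the helper name linger_enabled is an assumption, and it only wraps loginctl show-user, which prints Linger=yes or Linger=no for a user known to logind.

```shell
# Sketch: confirm that lingering is enabled for a service account.
# The function name is an assumption for illustration.
linger_enabled() {
    # loginctl prints "Linger=yes" or "Linger=no" for the given user.
    [ "$(loginctl show-user "$1" --property=Linger 2>/dev/null)" = "Linger=yes" ]
}

# Usage: linger_enabled funkwhale && echo "lingering is on"
```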

Migrating Data to rootless

If you are migrating existing data from a Rootful Docker setup (or an LXC) into Rootless Docker or GID/UID-mapped nested users, you will face GID and UID mismatches. For example, a PostgreSQL container internally expects to run as UID 70. For our rootless namespace, UID 70 is mapped to a high-number subuid on the host (e.g., 2100069).

Do not try to do this math yourself. I found the following approach more reliable.

First, as the root user of the VM, claim ownership of migrated files for your service account:

chown -R funkwhale:funkwhale /srv/funkwhale/data

Then, log in as the service account (machinectl shell funkwhale@). Start a temporary container as root, mount the database directory, and chown it from the inside:

# Postgres needs to own the files as UID 70 inside the container.
# Rootless Docker will automatically calculate the correct host subuid!
docker run --rm -v "$(pwd)/data/postgres:/data" -u root alpine chown -R 70:70 /data

When you inspect the files from the VM host, you will see they are owned by the mapped subuid, and your database will boot.
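
For verification only (the chown from inside the container remains the recommended way), the expected owner can be worked out from the subuid range. This is a minimal sketch with a made-up helper name; it relies on the usual rootless mapping where container UID 0 is the service account itself and container UID n (n >= 1) is subuid_start + n - 1.

```shell
# Sketch: expected host UID for a container UID under rootless Docker.
# Container UID 0 maps to the service account itself; container UID n (n >= 1)
# maps to subuid_start + n - 1. The function name is an assumption.
host_uid_for() {
    container_uid=$1
    account_uid=$2
    subuid_start=$3
    if [ "$container_uid" -eq 0 ]; then
        echo "$account_uid"
    else
        echo $((subuid_start + container_uid - 1))
    fi
}

# Usage: compare with the real owner of the migrated directory
# host_uid_for 70 998 2100000
# stat -c '%u' /srv/funkwhale/data/postgres
```

With a range starting at 2100000, container UID 70 lands on 2100069, which matches the example number mentioned earlier.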

You still need to set up the actual docker-compose.yml. I use this pattern:

funkwhale@docker-vm:~$ tree -L 3
.
├── data
│   ├── media
│   │   ├── __sized__
│   │   └── attachments
│   ├── music
│   │   ├── alex
│   │   └── anne
│   ├── postgres
│   └── redis
│       └── dump.rdb
└── docker
    └── docker-compose.yml

All persistent data is referenced in docker-compose.yml from the ~/data folder. Docker volumes are used only for temporary or cache data.

Example docker-compose.yml

This is not a complete docker-compose.yml. It illustrates the bind-mount pattern for persistent data to the ~/data folder of the user who owns the docker socket (e.g. /srv/funkwhale/data).

services:
  postgres:
    restart: unless-stopped
    env_file: .env
    image: postgres:15-alpine
    container_name: funkwhale-postgres
    volumes:
      - /srv/funkwhale/data/postgres:/var/lib/postgresql/data

  redis:
    restart: unless-stopped
    env_file: .env
    image: redis:7-alpine
    container_name: funkwhale-redis
    volumes:
      - /srv/funkwhale/data/redis:/data

  celeryworker:
    restart: unless-stopped
    image: funkwhale/api:${FUNKWHALE_VERSION:-latest}
    container_name: funkwhale-celeryworker
    depends_on:
      - postgres
      - redis
    env_file: .env
    ...
    volumes:
      - /srv/funkwhale/data/music:/music:ro
      - /srv/funkwhale/data/media:/media
  ...
  front:
    restart: unless-stopped
    image: funkwhale/front:${FUNKWHALE_VERSION:-latest}
    container_name: funkwhale-front
    ...
    volumes:
      - /srv/funkwhale/data/music:/music:ro
      - /srv/funkwhale/data/media:/media:ro
      ...
      - "${STATIC_ROOT}:/usr/share/nginx/html/staticfiles:ro"
    ports:
      - "127.0.0.1:8081:80"

The Funkwhale front container binds to port 8081 on the host (localhost only). Nginx on the host listens on 80/443, does the SSL termination, and forwards requests to this port.

We have now reached the maximum depth of our nesting system. Below, I will describe some of the systems I have in place to help me organise it.

Automation, Reverse Proxy, and Organization

Organisation, system, and structure are the backbone of a setup like this, which spans multiple levels, files, and users. When separating persistent data, ephemeral storage, hypervisors, VMs, and user namespaces, I find it critical to have a system in place to keep track of it all.

For this, local Git repositories are a great tool that I adopted more and more over time. While ZFS snapshots give me peace of mind for disaster recovery, I consult these Git repositories to understand and think about the changes I make to my infrastructure.

Git-Based Configuration Tracking

I apply the tracking pattern at three levels:

Hypervisor: One repository tracking /etc/pve, /etc/fstab, ZFS configs, and network settings.
VMs: One repository per VM tracking /etc/nginx, /etc/fstab, and cron jobs.
User Namespaces: One repository per service user tracking docker-compose.yml and service configuration.

To automate this, I place a filelist.txt and a sync script in /root/config/ (or ~/.config/ for unprivileged users). Every time I modify a system configuration, I run the script, commit, and push.

Git tracking template

My tracking template is pretty simple:

[VM 201] A filelist.txt example for a Docker VM:

## storage & network
/etc/fstab
/etc/resolv.conf
/etc/hostname

## rootless docker mapping
/etc/subuid
/etc/subgid

## cron jobs
/etc/cron.daily/restart_invidious
/etc/cron.daily/nextcloud_update

## nginx configuration
/etc/nginx/.htpasswd
/etc/nginx/sites-available
/etc/nginx/sites-enabled

My get-files.sh script pulls live files into the repo folder. It replicates the absolute file paths relative to the repo root:

#!/bin/bash
set -eu
REPO_PATH="/root/config"

echo "Copying files from filelist.txt..."
# Safely ignore empty lines and comments, preserve paths and permissions
grep -vE '^\s*(#|$)' "$REPO_PATH/filelist.txt" | rsync -a -r --files-from=- / "$REPO_PATH"

echo "Completed."
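
rsync reports an error for listed paths that do not exist, which stops the script under set -e. A small pre-check can name the stale entries first. This is a minimal sketch; the helper name missing_from_filelist is an assumption, and it reuses the same comment and blank-line filter as the script above.

```shell
# Sketch: list filelist.txt entries that do not exist on this system.
# The function name is an assumption for illustration.
missing_from_filelist() {
    # Same comment/blank-line filter as get-files.sh, then test each path.
    grep -vE '^[[:space:]]*(#|$)' "$1" | while IFS= read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}

# Usage: missing_from_filelist /root/config/filelist.txt
```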

An optional restore-files.sh script to interactively push files back from the repo to the OS:

#!/bin/bash
set -eu
REPO_PATH="/root/config"
AUTO=false

echo "Starting interactive restore process..."
while IFS= read -r item; do
    item="${item%/}"
    repo_item="$REPO_PATH$item"
    parent_dir=$(dirname "$item")

    if [[ ! -e "$repo_item" ]]; then
        echo "[-] Skipping $item (Not found in repo)"
        continue
    fi

    if [[ "$AUTO" == false ]]; then
        read -p "Restore $item? [y/N/all/quit]: " choice < /dev/tty
        case "$choice" in
            [yY]|[yY][eE][sS]) ;;
            [aA]|[aA][lL][lL]) AUTO=true ;;
            [qQ]|[qQ][uU][iI][tT]) echo "Aborting."; exit 0 ;;
            *) echo "    Skipped."; continue ;;
        esac
    fi

    echo "    Restoring $item..."
    mkdir -p "$parent_dir"
    rsync -a "$repo_item" "$parent_dir/"
done < <(grep -vE '^\s*(#|$)' "$REPO_PATH/filelist.txt")
echo "Restore complete."

How to git push in isolated namespaces

Also, instead of fiddling with SSH Agent forwarding while switching between sudo -i and different user accounts, I really recommend using GitLab Deploy Keys (or GitHub Deploy Keys). They are limited in scope to a single repository, easy to set up via ssh-keygen, and don't require the constant tending that Personal Access Tokens do. See a small post about it here ⁹.

Example filelist.txt

[Proxmox Hypervisor] Here is the filelist.txt for tracking my Proxmox host:

## container & VM configs
/etc/pve/lxc/100.conf
/etc/pve/lxc/108.conf
/etc/pve/qemu-server/9000.conf
/etc/pve/qemu-server/200.conf
/etc/pve/qemu-server/201.conf

## storage & network config
/etc/pve/storage.cfg
/etc/fstab
/etc/pve/mapping/directory.cfg
/etc/network/interfaces

## datacenter & cluster
/etc/pve/datacenter.cfg
/etc/pve/user.cfg
/etc/pve/nodes/parrot/config

## Sanoid (ZFS Snapshots)
/etc/sanoid/sanoid.conf

## custom modules & tuning
/etc/modules-load.d/modules.conf
/root/prox_config_backup.sh

Info

Managing VMs via this filelist.txt script works for me as a simple way to track infrastructure. The logical next step for larger environments is managing these linked clones via Terraform or Ansible. You can read how I automate the inside of these rootless environments using Ansible here.

Nginx VM Setup and SSL

Because processes in rootless Linux namespaces cannot bind privileged ports (below 1024) by default, I use Nginx on the VM host to handle SSL termination and proxy traffic to the isolated rootless users over localhost. For example, Nextcloud might bind to 127.0.0.1:8080, while Funkwhale binds to 127.0.0.1:8081.
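
The port boundary can be checked with a small helper. This is a minimal sketch: the function name is_privileged_port is an assumption, and the default threshold of 1024 is the usual value of the net.ipv4.ip_unprivileged_port_start sysctl, which may be set differently on your VM.

```shell
# Sketch: decide whether a port needs extra privileges to bind.
# The threshold defaults to 1024, the usual value of the
# net.ipv4.ip_unprivileged_port_start sysctl. Function name is an assumption.
is_privileged_port() {
    port=$1
    threshold=${2:-1024}
    [ "$port" -lt "$threshold" ]
}

# Usage with the live kernel setting:
# is_privileged_port 443 "$(sysctl -n net.ipv4.ip_unprivileged_port_start)" \
#     && echo "443 must stay with Nginx on the VM host"
```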

Nginx Reverse Proxy Examples

[VM 200] My Nextcloud /etc/nginx/sites-available/nextcloud.conf (single-tenant VM):

map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

types {
    text/javascript mjs;
    application/json map;
}

server {
    listen 80;
    server_name cloud.local.example.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    http2 on;
    server_name cloud.local.example.com;

    ssl_certificate /srv/nextcloud/ssl/wildcard.local.example.com.fullchain;
    ssl_certificate_key /srv/nextcloud/ssl/wildcard.local.example.com.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    add_header Referrer-Policy "no-referrer" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Permitted-Cross-Domain-Policies "none" always;
    add_header X-Robots-Tag "none" always;
    add_header X-Download-Options "noopen" always;
    add_header Strict-Transport-Security "max-age=15552000; includeSubDomains" always;

    location /push/ {
        proxy_pass http://127.0.0.1:7867/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        client_max_body_size 0;
        proxy_request_buffering off;
        proxy_read_timeout 36000s;
    }

    location ~ \.(?:mjs|js\.map)$ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        add_header Content-Type "text/javascript" always;
    }

    location /.well-known/carddav { return 301 $scheme://$host/remote.php/dav; }
    location /.well-known/caldav  { return 301 $scheme://$host/remote.php/dav; }
}

[VM 201] My /etc/nginx/sites-available/funkwhale.conf (Lightweight Multi-Tenant VM):

upstream fw {
    # depending on your setup, you may want to update this
    server 127.0.0.1:8081;
}
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    listen 80;
    listen [::]:80;
    server_name funk.local.example.com;
    location / { return 301 https://$host$request_uri; }
}
server {
    listen      443 ssl;
    listen [::]:443 ssl;
    server_name funk.local.example.com;


    # SSL
    ssl_certificate     /etc/nginx/ssl/wildcard.local.example.com.fullchain;
    ssl_certificate_key /etc/nginx/ssl/wildcard.local.example.com.key;
    # ssl_trusted_certificate /etc/letsencrypt/live/exam

    # logging
    access_log /var/log/nginx/funk.local.example.com.access.log;
    error_log /var/log/nginx/funk.local.example.com.error.log warn;

    # Security related headers

    # If you are using S3 to host your files, remember to add your S3 URL to the
    # media-src and img-src headers (e.g. img-src 'self' https://<your-S3-URL> data:)

    add_header Content-Security-Policy "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; img-src 'self' data:; font-src 'self' data:; object-src 'none'; media-src 'self' data:";

    # compression settings
    gzip on;
    gzip_comp_level    5;
    gzip_min_length    256;
    gzip_proxied       any;
    gzip_vary          on;

    gzip_types
        application/javascript
        application/vnd.geo+json
        application/vnd.ms-fontobject
        application/x-font-ttf
        application/x-web-app-manifest+json
        font/opentype
        image/bmp
        image/svg+xml
        image/x-icon
        text/cache-manifest
        text/css
        text/plain
        text/vcard
        text/vnd.rim.location.xloc
        text/vtt
        text/x-component
        text/x-cross-domain-policy;

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Port $server_port;
        proxy_redirect off;

        # websocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;

        client_max_body_size 100M;
        proxy_pass   http://fw/;
    }
}

SSL certificates setup

Obviously, I use a Split-Brain DNS setup. For SSL certificates, I do not want my individual VMs communicating with Let's Encrypt directly and thereby violating zero-trust. Instead, I use my central firewall (OPNsense/pfSense) to handle the DNS-API challenges, and use a simple script (like ssl_get) triggered via cron to pull the wildcard certificates internally to the VMs.

But: evaluate for yourself whether wildcard certs are an acceptable risk in your security context.

Automated Container Updates

I prefer automated daily updates for the majority of my services. The logic is: I would rather accept a few minutes of downtime after a failed automated update than run outdated, vulnerable code. Given the rise of dependency chain attacks, and considering that rootless namespaces largely limit the attack surface, I prefer to stay on the latest patch release (major.minor.patch).

See my examples on how I automate this using standard /etc/cron.daily/ scripts.

[VM 200] For a Rootful Docker setup (like Nextcloud), the cron script I use is shown below. Note the inclusion of docker image prune -f. Without this, the ephemeral 20GB Docker disk would fill up fast with old image layers.

#!/bin/sh
# /etc/cron.daily/nextcloud_update
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Ensure we execute as the service user
if [ "$(id -u)" -eq 0 ]; then
    exec sudo -H -u nextcloud "$0" "$@"
fi

# Pull and apply updates
/usr/bin/docker compose -f /srv/nextcloud/docker/docker-compose.yml pull
/usr/bin/docker compose -f /srv/nextcloud/docker/docker-compose.yml up -d

# Update Nextcloud apps inside the container
/usr/bin/docker compose -f /srv/nextcloud/docker/docker-compose.yml exec -u www-data app \
    php occ app:update --all

# Clean up
/usr/bin/docker image prune -f

[VM 201] For a Rootless Docker setup (like Invidious or Funkwhale), cron runs commands without a full systemd login session, which means the Docker environment variables are missing. I therefore explicitly declare the XDG_RUNTIME_DIR in the script below so that the user's rootless socket is available:

#!/bin/sh
# /etc/cron.daily/restart_invidious

if [ "$(id -u)" -eq 0 ]; then
    exec sudo -H -u invidious "$0" "$@"
fi

# Set rootless docker environment variables for Cron
export XDG_RUNTIME_DIR="/run/user/$(id -u)"
export DOCKER_HOST="unix://$XDG_RUNTIME_DIR/docker.sock"

docker compose -f /srv/invidious/docker/docker-compose.yml pull
docker compose -f /srv/invidious/docker/docker-compose.yml up -d
docker image prune -f

Automated VM Updates

I use Ansible to keep my VMs up-to-date (apt). See my post with the Ansible script here. It includes all necessary checks for making sure that processes in rootless namespaces and nested docker binaries are restarted on host updates.

Conclusions

At first, this architecture may appear complex. Passing storage from ZFS through hypervisor VFS ID-maps, into VirtIO-FS sockets, attaching it to disposable cloud-init VMs, and finally handing it down to rootless systemd namespaces - what a mess!

In reality, however, the complexity is heavily abstracted. Every layer is compartmentalized into neat, manageable chunks. Because of the strict user-namespacing and Git configuration tracking, I always know exactly what layer of the hierarchy I am working on.

I consider this architecture the sweet spot for homelabs and small enterprises. It bridges the gap between "Pet" servers (where everything is hand-crafted and requires manual effort) and full Kubernetes clusters (I have worked with K8s and set up clusters, but I still consider these overkill for non-distributed systems).

There are no surprises or undocumented tweaks. By separating the persistent state from the ephemeral compute, I have a system that is robust, reproducible, secure, and requires very little attention. A design for the next decade!*

(* we'll see)

Resource Comparison

After migrating my stack (4 LXCs to 4 VMs), I looked at the new resource usage.

Storage:

zfs list
NAME                                USED  AVAIL  REFER  MOUNTPOINT
tank_ssd                            121G   740G   104K  /tank_ssd
tank_ssd/ephemeral                 4.96G   740G    96K  /tank_ssd/ephemeral
tank_ssd/ephemeral/vm-200-disk-0    148M   740G   148M  -
tank_ssd/ephemeral/vm-201-disk-0   4.77G   740G  4.77G  -
...
tank_ssd/secure                     116G   740G   248K  /tank_ssd/secure
tank_ssd/secure/base-9000-disk-0    678M   740G   678M  -
...
tank_ssd/secure/vm-200-cloudinit    288K   740G   108K  -
tank_ssd/secure/vm-200-disk-0      16.2G   740G  13.1G  -
tank_ssd/secure/vm-201-cloudinit    288K   740G   108K  -
tank_ssd/secure/vm-201-disk-0      3.80G   740G  2.82G  -
...
tank_ssd/secure/vm-9000-cloudinit   124K   740G   108K  -

This output nicely shows ZFS's thin provisioning. The persistent VM OS disks consume little actual space (e.g., just 2.82G for the multi-tenant VM 201), while all disposable Docker image layers live in their own unbacked-up ephemeral datasets. The base template base-9000-disk-0 consumes only 678MB because it uses a stripped-down Debian cloud image combined with ZFS compression and thin provisioning. This keeps storage use efficient and enables linked clone deployments with minimal overhead. Great for expensive SSDs.

Figure: Hypervisor memory usage before and after the migration.

The memory graph shows the memory footprint increasing from about 30 GiB to 49 GiB after migrating my four unprivileged LXCs to four full VMs. Interpreting this metric is a bit tricky. I allocated slightly more RAM to the new VMs, so the guest kernels use the extra space for caching. Proxmox reports this as used memory from the hypervisor's perspective, even though the guest OS would free it if an application requested it.
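
To see how much of a guest's memory is actually needed by applications rather than held as reclaimable cache, MemAvailable in /proc/meminfo is a useful estimate. This is a minimal sketch with a made-up helper name; it prints MemTotal minus MemAvailable in KiB and is meant as a rough cross-check from inside a guest, not as a replacement for the hypervisor graph.

```shell
# Sketch: memory actually in use by applications, in KiB, from meminfo-style
# input on stdin (MemTotal minus MemAvailable). Function name is an assumption.
mem_in_use_kib() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { print total - avail }'
}

# Usage inside a guest: mem_in_use_kib < /proc/meminfo
```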

Figure: CPU Pressure Stall on Proxmox.

While memory usage increased, CPU Pressure Stall actually decreased. Once my final LXC was migrated and shut down around June 14, the background noise and resource use on the hypervisor quieted down. The hypervisor isolates VM workloads better than the shared-kernel LXC architecture did. This results in a smoother overall system.

LLM disclaimer

This architecture is the result of a multi-month effort to conceptualize, test, and document a robust successor to my legacy LXC setups. I want to explicitly mention that while this system is running successfully in my homelab, I am always learning. I will update this document if the system changes. Also to note: I did use LLMs to help work through the complexity and synthesize the documentation, based on my raw terminal notes, into this readable format. I manually edited, tested, and reviewed each line myself. If you spot an error or a better way to do something, let me know in the comments!

If you find this blog post too difficult or complex to follow, download the raw format here and drop it into your favourite LLM, to let it guide you.

Changelog

2026-06-20

Add Resource Comparison after LXC->VM migration

2026-06-13

Add link to Ansible playbook for rootless namespace creation
Improve formatting (add line breaks for long commands)

2026-06-11

Add docker-compose.yml example

2026-06-07

Initial post.