Moving a 6-node NETLAB+ cluster off VMware to Proxmox

The cluster worked fine. That was the problem.

Six Lenovo SR630 nodes, 196 cores, about 2.75 TB of RAM, running NETLAB+ for live coursework: Cisco, Palo Alto, Security+, CySA+, ethical hacking, A+. Students logged in every week and spun up real pods on it. Nothing was broken.

Then Broadcom ended perpetual licensing and moved VMware to per-core subscriptions, and the number stopped making sense. At Broadcom’s 2024 list prices, our 196 cores penciled out to roughly $26,000 a year on vSphere Foundation ($135/core) or about $69,000 on the full Cloud Foundation bundle ($350/core), and the rates have only climbed since. Opening renewal quotes across the industry ran two to five times prior spend; negotiated deals often settled nearer 1.3x to 2x. Either way, over a five-year cycle it’s a low-hundreds-of-thousands line item to keep running software that was already installed and already doing the job. On a community-college lab budget, that’s not a renewal. It’s a wall.

So the cluster moved to Proxmox. The hypervisor swap was the easy part. The parts worth writing about came after: restoring hundreds of gigs of pod images over the public internet, and a Proxmox 9 upgrade that detonated the environment and sent me rebuilding it from my own notes.

The money

	VMware (Broadcom subscription)	Proxmox VE
License model	Per-core annual subscription; perpetual licenses ended early 2024	AGPLv3, no license cost
List price per core/yr (2024 launch list)	$135 (vSphere Foundation) to $350 (Cloud Foundation)	$0
196 cores, annual	$26,000 to $69,000 (before renewal multipliers)	$0
196 cores, 5-year	$130,000 to $345,000	$0
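
For anyone who wants to rerun the arithmetic with their own core count, here is a minimal sketch. It uses only the per-core list prices quoted in this post and ignores renewal multipliers and negotiated discounts. The table rounds the annual figure before multiplying, so the exact products come out slightly different.

```python
# Back-of-the-envelope licensing math from the list prices quoted in this post.
CORES = 196
YEARS = 5

# USD per core per year (2024 launch list, as quoted above).
LIST_PRICE_PER_CORE = {
    "vSphere Foundation": 135,
    "Cloud Foundation": 350,
}


def license_cost(cores: int, price_per_core: int, years: int = 1) -> int:
    """Flat list-price cost; ignores renewal multipliers and discounts."""
    return cores * price_per_core * years


for bundle, price in LIST_PRICE_PER_CORE.items():
    annual = license_cost(CORES, price)
    total = license_cost(CORES, price, YEARS)
    print(f"{bundle}: ${annual:,}/yr, ${total:,} over {YEARS} years")
```

Swap in your own core count and term; the shape of the bill stays the same.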

This was never “VMware bad.” It did the job for years and the feature set was never the complaint. The problem is that a vendor can re-price a working platform overnight and your only move is pay or leave. That’s not a technology risk, it’s an architecture risk that happens to land on an invoice.

Proxmox covered what the lab actually used: KVM/QEMU, clustering, live migration, and a real Linux host underneath instead of an appliance you poke at through a sanctioned API. Licensing went to zero. The work didn’t disappear, it moved from defending a renewal to running the platform. That’s the better bill.

How the VMs came across

Everyone asks this first, so: there was no VMDK conversion on my end. A normal self-hosted VMware-to-Proxmox move does involve qemu-img or an OVF import; mine didn’t, because NDG, the company behind NETLAB+, does that ingest upstream and distributes everything as Proxmox Backup Server snapshots from a server they host. You add their PBS as a datastore, point it at the netlab namespace, and pull: the NETLAB-VE management appliance first, then every course pod, restored into qcow2 on a node-local SSD datastore.

So “migrating” the student workloads wasn’t a lift-and-shift. It was a clean re-pull onto rebuilt infrastructure. Which sounds simple, right up until you do the math on moving that much data over a WAN.

Pulling pods over the internet is the slow, flaky part

The NETLAB-VE management appliance alone restores as nine virtual disks, one of them 100 GB, and every course pod stacks more behind it. Each disk comes down with pbs-restore from NDG’s backup server over the public internet, slow and flaky in equal measure. One appliance disk, mostly empty, still logged a pbs-restore speed of 3.57 MB/s; the big data disks were the real wait. When a transfer dropped, it looked like this:

progress 4% (read 1291845632 bytes, zeroes = 9% ...)
restore failed: connection reset
error before or during data restore, some or all disks were not completely restored.
TASK ERROR: ... pbs-restore ... failed: exit code 255

A reset partway through doesn’t unwind cleanly. pbs-restore clears its own temp qcow2 volumes, but it leaves the VM config behind (the task log itself warns that state is NOT cleaned up), so you delete that by hand before retrying the whole disk set from zero. Multiply that across every pod for every course, on a back-to-school deadline.
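
To put a rough number on a from-zero retry, here is a small sketch. It assumes, purely for illustration, that the 3.57 MB/s logged for one disk holds for a full 100 GB transfer, and it ignores any empty regions the restore might skip. The reset point is hypothetical.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours to move size_gb at a sustained rate (decimal units: 1 GB = 1000 MB)."""
    seconds = (size_gb * 1000) / rate_mb_per_s
    return seconds / 3600


# Assumption: the one logged rate applies to the whole disk.
one_pass = transfer_hours(100, 3.57)
print(f"100 GB at 3.57 MB/s: {one_pass:.1f} h")

# Hypothetical: a reset at 80% discards that progress and restarts from zero,
# so the total is 0.8 of a pass wasted plus one full pass.
wasted_fraction = 0.8
print(f"with one reset at 80%: {one_pass * (1 + wasted_fraction):.1f} h")
```

The point is the multiplier, not the exact hours: every late failure costs nearly a full extra pass.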

The lesson that stuck: when the images live on a server you don’t control, your migration timeline is bounded by the link to it, not by your local switch. Stage early, restore in parallel where the storage can take it, and assume a few will die partway and need a cleanup before the retry.
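
The restore-cleanup-retry loop can be sketched like this. The `restore` and `cleanup` callables are hypothetical placeholders for whatever actually runs the restore and removes the leftover VM config; nothing here is NDG's or Proxmox's tooling, and the worker count is an assumption you would tune to what the storage can take.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def restore_with_retry(
    vmid: int,
    restore: Callable[[int], None],
    cleanup: Callable[[int], None],
    max_attempts: int = 3,
) -> int:
    """Run restore(vmid); on failure run cleanup(vmid) and retry from zero.

    Returns the number of attempts used; re-raises the last error if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            restore(vmid)
            return attempt
        except Exception:
            # Clear leftover state (for example the stale VM config) before retrying.
            cleanup(vmid)
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


def restore_all(
    vmids: Iterable[int],
    restore: Callable[[int], None],
    cleanup: Callable[[int], None],
    workers: int = 2,
) -> Dict[int, int]:
    """Restore several VMs in parallel; returns attempts used per VM id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            vmid: pool.submit(restore_with_retry, vmid, restore, cleanup)
            for vmid in vmids
        }
        return {vmid: future.result() for vmid, future in futures.items()}
```

Keeping cleanup inside the retry loop matters: a retry against a half-restored VM is the failure you are trying to avoid.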

This is also where LACP earns its keep and lets you down in the same breath. A bond across the 1G links gives you resilience and aggregate throughput across many flows, but a single restore stream is one flow and rides one link. Bonding is not a speed-up for one big transfer, and expecting it to be is a good way to lose an afternoon staring at a port graph wondering why the other link is idle.
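
Why one stream sticks to one link is easier to see in code. The sketch below is a simplified stand-in for a layer3+4 style transmit hash, not the real algorithm any switch or kernel uses; the addresses and ports are made up. The only property it shares with the real thing is the one that matters here: the link choice depends on the flow's addresses and ports, and nothing else.

```python
import zlib
from collections import Counter

LINKS = 4  # four bonded 1G ports, as in this build


def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              links: int = LINKS) -> int:
    """Map a flow to one bond member (illustrative hash, not a real policy)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % links


# One restore stream: every packet carries the same addresses and ports,
# so every packet lands on the same member link.
restore_flow = ("10.0.0.11", "203.0.113.10", 51514, 8007)
links_used = {pick_link(*restore_flow) for _ in range(1000)}
print(f"links used by one flow: {len(links_used)}")

# Many flows (here, different source ports) spread across the members.
spread = Counter(
    pick_link("10.0.0.11", "203.0.113.10", port, 8007)
    for port in range(50000, 50400)
)
print(sorted(spread.items()))
```

Parallel restores help precisely because they are separate flows, which gives the hash something to spread.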

The network

I treated this as an infrastructure project, not a hypervisor swap, and the network is why. Each host had a dual-port 10G NIC split across two switches: one port carried management to the Cisco as a plain access port, the other was the dedicated cluster link into a UniFi 24-port running flat with no VLANs, doing nothing but corosync traffic. The VM data path was four 1G ports bonded with LACP (802.3ad) and trunked on the Cisco, carrying the lab and production VLANs. Splitting it that way kept VM placement flexible and kept the cluster heartbeat off the busy links.

The unglamorous rule that saved me later: write down the entire physical-to-logical map before it turns into tribal knowledge. Six hosts times onboard NICs, 10G add-in cards, bonds, bridges, VLAN roles, and switch ports is enough combinations that “I’ll remember it” is a lie you only tell yourself once.
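
One way to keep that map honest is to hold it as data and let a script check it. This is a minimal sketch; every hostname, interface name, switch name, and port number below is hypothetical, and the only check shown is for two cables claiming the same switch port.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Link:
    host: str         # hypothetical hostname, e.g. "pve1"
    nic: str          # interface name on the host
    role: str         # "mgmt", "corosync", or "vm-bond"
    switch: str       # which switch the cable lands on
    switch_port: int  # port number on that switch


def double_booked_ports(links: List[Link]) -> List[Tuple[str, int]]:
    """Return (switch, port) pairs that more than one cable claims."""
    counts = Counter((link.switch, link.switch_port) for link in links)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical excerpt of a map, with one deliberate mistake on cisco port 2.
EXAMPLE_MAP = [
    Link("pve1", "ens1f0", "mgmt", "cisco", 1),
    Link("pve1", "ens1f1", "corosync", "unifi", 1),
    Link("pve2", "ens1f0", "mgmt", "cisco", 2),
    Link("pve2", "ens1f1", "corosync", "unifi", 2),
    Link("pve3", "ens1f0", "mgmt", "cisco", 2),
]

print(double_booked_ports(EXAMPLE_MAP))
```

A flat file that a script can read also doubles as the rebuild checklist later.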

Corosync is the piece people skip and regret. It’s the cluster heartbeat, and it cares about latency and jitter, not bandwidth, which is the whole reason it got that dedicated switch to itself. Run the heartbeat across the same links as bulk VM or restore traffic and the cluster is fine right up until those links saturate, which is the exact moment you need it steady.

The stakes are concrete. With six nodes, quorum is four, a strict majority, and if corosync can’t hold it the cluster freezes: no VM starts and no config writes until quorum is back. Six being even sharpens it, since a clean three-three split leaves neither half with the votes. The nodes were therefore spread across separate UPS units so that a single UPS failure drops a minority, not half the cluster, and the heartbeat stayed on its own uncontended link. A purist would note it was a single corosync ring where two independent rings are the resilient answer; on a teaching-lab budget I spent what I had on isolation.
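
The quorum arithmetic is small enough to check by hand, but here it is as a sketch, assuming one vote per node. The UPS layout at the end is hypothetical (two nodes per unit across three units), chosen only to show why spreading nodes keeps a single UPS failure survivable.

```python
def quorum(total_votes: int) -> int:
    """Strict majority of votes, assuming one vote per node."""
    return total_votes // 2 + 1


def is_quorate(votes_present: int, total_votes: int) -> bool:
    """True if the surviving partition holds a strict majority."""
    return votes_present >= quorum(total_votes)


NODES = 6
print(f"quorum for {NODES} nodes: {quorum(NODES)}")
print(f"3-3 split, either half quorate: {is_quorate(3, NODES)}")

# Hypothetical layout: nodes per UPS unit. Losing any one unit should
# still leave a quorate cluster.
NODES_PER_UPS = [2, 2, 2]
for lost in NODES_PER_UPS:
    survivors = NODES - lost
    print(f"lose a UPS with {lost} nodes -> {survivors} left, "
          f"quorate: {is_quorate(survivors, NODES)}")
```

Put three nodes on one UPS and the same loop prints a frozen cluster, which is the whole argument for counting before you plug things in.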

Then I upgraded to Proxmox 9 and it blew up

After the cluster was stable and the pods were in, I upgraded to Proxmox 9. Proxmox 9 moves the base OS from Debian 12 (Bookworm) to Debian 13 (Trixie). NDG’s NETLAB+ stack expected Bookworm.

It just stopped working. The NETLAB-VE web portal came up dead, nothing in it would function, and there was no clean error to chase. I raised it in a NETLAB+ workshop session and got the answer that would have saved me a weekend: the docs carried a note not to upgrade because it breaks on the newer base OS, and I’d read right past it. Trixie had shipped partway through the project and NDG wasn’t going to pivot their supported stack to chase it. So the fix was not heroics. It was a rebuild: back to Proxmox 8, pinned where NETLAB+ wanted it, pods restored again.

Here’s the part worth underlining. The rebuild was fast, and not because I’m fast. It was fast because I’d written down every step the first time: install order, storage layout, bond config, VLAN tags, the pod restore sequence. The first build was archaeology. The rebuild was a checklist. That gap is the entire argument for documenting infrastructure while you build it instead of swearing you’ll do it after.

The lesson under the lesson: a major hypervisor upgrade is not a patch cycle when a second vendor’s product sits on your base OS. Their support matrix sets your schedule, and the line that matters is easy to skim past when you’re feeling confident. I’d read the docs. I just didn’t catch the one note that said don’t make this exact jump, and that’s the expensive way to learn the difference between Bookworm and Trixie.
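
A guard I could have run before the upgrade looks something like this. It is a sketch, not anything NDG ships: the supported set is an assumption taken from this post (Bookworm only), and the check simply compares a target codename against it. `VERSION_CODENAME` is the field Debian writes to /etc/os-release.

```python
from typing import Dict, Set

# Assumption drawn from this post, not from NDG's documentation.
SUPPORTED_CODENAMES: Set[str] = {"bookworm"}


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines in the style of /etc/os-release."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def current_codename(path: str = "/etc/os-release") -> str:
    """Read the running host's Debian codename."""
    with open(path, encoding="utf-8") as handle:
        return parse_os_release(handle.read()).get("VERSION_CODENAME", "")


def upgrade_is_safe(target_codename: str,
                    supported: Set[str] = SUPPORTED_CODENAMES) -> bool:
    """True only if the base OS after the upgrade is in the supported set."""
    return target_codename.lower() in supported


print(f"upgrade to trixie safe: {upgrade_is_safe('trixie')}")
```

The check is trivial. The discipline is keeping the supported set in one place and updating it from the vendor's matrix, not from memory.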

Keeping NETLAB and Proxmox agreeing with each other

NETLAB+ keeps its own database of pods and VMs. Proxmox keeps the actual VMs. They get out of sync the moment an operation fails partway or gets done in the wrong place. Delete a pod in NETLAB and the underlying SPOD VMs can linger on Proxmox. Delete on Proxmox first and NETLAB shows ghosts. Templates won’t delete until the SPODs cloned from them are removed first. Clone placement has its own failure mode (could not update datastore selection for vm 758 (index out of bounds)), and pods marked “absent” show up in one system but not the other.

The fix is boring and order-dependent: power off the pod’s VMs, delete them on Proxmox by their SPOD ID, then clear the pod in NETLAB. The tooling won’t enforce the order for you. Pods are deployed as linked clones (differentials and pointers off a template), which is great for storage and exactly why the delete order bites when you get it wrong.
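
Spotting drift between the two inventories comes down to set differences. The sketch below assumes you have already exported the VM IDs each side believes in; how you get those lists is up to you, and every ID shown is made up. The second helper encodes the template rule: a template is only deletable once no clone still points at it.

```python
from typing import Dict, List, Set


def reconcile(netlab_vmids: Set[int], proxmox_vmids: Set[int]) -> Dict[str, Set[int]]:
    """Compare what NETLAB thinks exists with what Proxmox actually holds."""
    return {
        # VMs still on Proxmox that NETLAB no longer tracks.
        "lingering_on_proxmox": proxmox_vmids - netlab_vmids,
        # VMs NETLAB still lists that Proxmox no longer has.
        "ghosts_in_netlab": netlab_vmids - proxmox_vmids,
    }


def deletable_templates(clones_by_template: Dict[int, Set[int]],
                        existing_vmids: Set[int]) -> List[int]:
    """Templates with no surviving clones, so they are safe to delete."""
    return sorted(
        template
        for template, clones in clones_by_template.items()
        if not (clones & existing_vmids)
    )


# Hypothetical inventories.
netlab = {700, 701, 758}
proxmox = {700, 701, 702}
print(reconcile(netlab, proxmox))

# Hypothetical template -> clone mapping.
templates = {600: {700, 701}, 601: {758}}
print(deletable_templates(templates, proxmox))
```

Run it before and after every bulk delete and the ghosts show up while they are still easy to explain.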

Where it landed

A 6-node Proxmox cluster, a dedicated 10G heartbeat network, LACP on the production links, NETLAB+ integrated, every course pod deployed, and $0 in hypervisor licensing. Students launch labs without knowing the hypervisor changed underneath them, which is one of the best success metrics a teaching platform can have after a migration.

The real win wasn’t the license savings, though those are real. It’s that when something breaks now, I’m looking at a Linux host with a shell and config files I can read instead of an appliance and a support ticket. The failure modes are mine to reason about. When you’re the one keeping it online, that beats the line item.

Would I do it again? Yes. I did, twice.