I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear.
TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two consumer boxes talk fast enough to run tensor-parallel inference and FSDP workloads across both machines: ~95 Gb/s bidirectional raw RDMA, ~7 µs one-way latency, a MiniMax-M2.7 TP=2 inference run that does not fit on one box, and a Gemma 3 27B LoRA FSDP step falling from 1359 s over Ethernet to 126 s over 4-HCA USB4 RDMA.

- ~48 Gb/s per direction (~95 Gb/s bidi total) sustained ib_write_bw, 4-HCA aggregate at 1 MiB / 8 QPs with IOMMU off, vs ~2.3 Gb/s over the onboard 2.5 GbE and ~9 Gb/s for soft-RoCE on top of thunderbolt-net at the per-rail level.
- ~7 µs one-way ib_write_lat at 64 B, single QP, vs ~28 µs over RXE/2.5 GbE and ~65 µs over RXE/TBnet.
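
To put the headline numbers side by side, here is a small sketch that turns the figures quoted above into ratios. Every input is an approximate value from this summary (the "~" figures), so the outputs are rough too. The dictionary labels and the `ratio` helper are just illustrative names, and note that the bandwidth rows compare the 4-HCA aggregate against per-rail baselines, exactly as the bullets do.

```python
# Rough ratios from the approximate figures quoted in this summary.
# All inputs are "~" values from the text, so treat the results as ballpark.

def ratio(baseline: float, improved: float) -> float:
    """Return how many times larger `baseline` is than `improved`."""
    return baseline / improved

# (larger value, smaller value): times in seconds / microseconds, rates in Gb/s
comparisons = {
    "FSDP step: Ethernet vs USB4 RDMA (s)": (1359.0, 126.0),
    "Bandwidth: USB4 RDMA vs 2.5 GbE (Gb/s)": (48.0, 2.3),
    "Bandwidth: USB4 RDMA vs RXE/TBnet (Gb/s)": (48.0, 9.0),
    "Latency: RXE/2.5 GbE vs USB4 RDMA (us)": (28.0, 7.0),
    "Latency: RXE/TBnet vs USB4 RDMA (us)": (65.0, 7.0),
}

ratios = {name: ratio(a, b) for name, (a, b) in comparisons.items()}

for name, value in ratios.items():
    print(f"{name}: ~{value:.1f}x")
```

That works out to roughly 10.8x on the FSDP step, about 21x the per-direction bandwidth of the onboard 2.5 GbE, and about 4x lower latency than RXE over 2.5 GbE.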
DISCLAIMER: this is research code, most of it AI-generated, and it loads experimental kernel modules on machines I was willing to crash repeatedly. I made an effort to understand enough of it to keep it on track, but there are almost certainly false assumptions and sharp edges throughout. No warranty, no support promise, not production software.