Running a 754-Billion Parameter LLM on a 16GB RAM Consumer PC
"Saying it's impossible is not engineering. Saying we don't know how yet is science."
MoE-on-a-Potato is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) Large Language Models on consumer-grade, budget hardware.
We successfully ran GLM-5.1 (a 754B parameter model, 176GB GGUF size) on a Ryzen 5 5600G (6 Cores / 12 Threads) CPU, Vega 7 iGPU, and 16GB DDR4 RAM without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.
Project Dashboard Preview
For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard: Interactive Phase 3 Dashboard
Experimental Phases & Key Findings
Our journey progressed through three scaling phases, measuring token generation speed, load times, and system peak RAM usage.
Multi-Phase Performance Comparison
| Metric | Phase 1: DeepSeek (16B MoE) | Phase 2: GLM-4.7-Flash (30B MoE) | Phase 3: GLM-5.1 (754B MoE) |
|---|---|---|---|
| Active Parameters | 2.4B | 3B | 40B |
| Model GGUF Size | 10.4 GB | 18.5 GB | 176.0 GB |
| File Location | Main SSD | Main SSD | Secondary NVMe Partition |
| Model Load Time | 7.8 seconds | 29.0 seconds | 492.1 seconds (~8.2 Min) |
| Context Warmup Time | Negligible | Negligible | 242.0 seconds (~4.0 Min) |
| Prompt Processing Speed | 34.56 t/s | 2.71 t/s | 0.16 t/s (6.45 s/token) |
| Token Generation Speed | 25.15 t/s | 6.59 t/s | 0.05 t/s (20.89 s/token) |
| Peak System RAM Usage | 4.28 GB | 4.41 GB | 8.34 GB (Limit: 16 GB!) |
| Actual Model RAM Footprint | ~2.50 GB | ~3.00 GB | ~6.80 GB |
In-Depth Technical Breakdown
1. Phase 1: Proof of Concept - DeepSeek-V2-Lite-Q4_K_M (16B MoE)
- Goal: Build a custom `llama.cpp` compilation optimized for AVX2 and the Vulkan backend to offload processing to the integrated Radeon Vega 7 iGPU.
- Outcome:
  - By configuring memory mapping (`--mmap`), we kept the RAM footprint to 4.28 GB (a 47% memory saving over full-RAM loading).
  - iGPU offload (`-ngl 10`) increased token generation speed to 25.15 t/s (a 33.1% speedup compared to CPU-only).
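The two configurations compared in Phase 1 (memory-mapped with iGPU offload versus a full-RAM, CPU-only baseline) can be sketched as command lines. This is a minimal illustration, not the project's actual launch script: the binary name `llama-cli`, the model path, the prompt, and the helper name `build_llama_command` are assumptions. It relies on llama.cpp memory-mapping the model by default, with `--no-mmap` forcing a full load into RAM.

```python
import shlex

def build_llama_command(model_path, gpu_layers=0, use_mmap=True,
                        prompt="Hello", n_predict=32, binary="llama-cli"):
    """Assemble an argv list for a llama.cpp run (hypothetical helper)."""
    cmd = [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]
    if gpu_layers > 0:
        # Offload this many layers to the GPU backend (Vulkan in this project).
        cmd += ["-ngl", str(gpu_layers)]
    if not use_mmap:
        # Memory mapping is the default; this flag forces a full load into RAM.
        cmd.append("--no-mmap")
    return cmd

# Hypothetical model path; substitute the real GGUF location.
MODEL = "models/DeepSeek-V2-Lite-Q4_K_M.gguf"

mmap_igpu = build_llama_command(MODEL, gpu_layers=10)
full_ram_cpu = build_llama_command(MODEL, gpu_layers=0, use_mmap=False)

print(shlex.join(mmap_igpu))
print(shlex.join(full_ram_cpu))
```

Passing either list to `subprocess.run` would launch the benchmark; the commands are only printed here so the sketch runs without a llama.cpp build present.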
2. Phase 2: Scaling Up - GLM-4.7-Flash-Q4_K_M (30B MoE)
- Goal: Run a 30B model (18.5GB) that physically exceeds the available free system RAM (~12-13GB after OS/APU overhead).
- Outcome:
  - Full RAM loading was skipped because it would have guaranteed an OOM crash. Under `mmap`, the model booted successfully with only 6.14 GB peak RAM.
  - iGPU offload (`-ngl 10`) pushed token generation to 6.59 t/s while lowering the CPU-side RAM usage to 3.0 GB (net 51% RAM saving).
3. Phase 3: The Ultimate Test - GLM-5.1-IQ1_M (754B MoE)
- Goal: Run a colossal 754-billion parameter model (176GB GGUF split files) on our 16GB RAM potato PC.
- Storage Setup: The weights were hosted on `/media/osman/CC46433D46432792`, which is an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g adds driver overhead, throttling sequential reads to ~650 MB/s and adding CPU wait times.
- Outcome:
  - No crashes, no OOM errors: The model initialized and mapped 176GB of virtual memory with a maximum system RAM footprint of only 8.34 GB!
  - Inference Speeds: Generated tokens at 0.05 t/s (20.89 seconds per token) with prompt processing at 0.16 t/s.
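Mapping 176GB on a 16GB machine works because a memory-mapped file is paged in on demand: only the pages that are actually touched need to be resident. The sketch below illustrates the idea with Python's standard `mmap` module on a small sparse stand-in file. The 64 MiB size, the `TAIL` marker, and the helper name are made up for illustration, and this is not a description of llama.cpp's loader.

```python
import mmap
import os
import tempfile

def read_slice_via_mmap(path, offset, length):
    """Map the whole file read-only and copy out a single slice.

    The mapping covers the full file, but only the pages backing the
    requested slice have to be brought into physical memory.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

# Build a stand-in "weights" file: mostly empty, with a marker at the end.
size = 64 * 1024 * 1024  # illustrative size only
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(size)      # extend the file without writing its contents
    tmp.seek(size - 4)
    tmp.write(b"TAIL")      # hypothetical marker, not part of the GGUF format
    path = tmp.name

tail = read_slice_via_mmap(path, size - 4, 4)
os.remove(path)
print(tail)
```

The same principle is what lets the operating system evict cold expert weights and re-read them from disk later, which is also why disk throughput ends up dominating generation time.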
Scientific Validation: The SSD Bottleneck
Our Phase 3 benchmark provided empirical evidence that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is SSD read bandwidth.
During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In IQ1_M (1.6 bits/weight average), this equates to approximately 10 GB of weights per forward pass.
At the ~650 MB/s read throughput of the NTFS partition, streaming 10 GB takes about 15.4 seconds, while the measured generation time is 20.89 seconds per token. The roughly 5.5-second delta represents the CPU compute overhead (AVX2 layer execution). This close agreement indicates that local MoE generation speed at this scale is governed mainly by SSD throughput.
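The arithmetic behind this comparison can be reproduced in a few lines. The sketch uses only figures quoted in this README (40B active parameters, 1.6 bits/weight, ~10 GB per token, ~650 MB/s, 20.89 s/token) and assumes decimal units (1 GB = 10^9 bytes). A pure 1.6-bit estimate comes out somewhat below the quoted ~10 GB; the difference may reflect tensors stored at higher precision, which is an assumption here rather than something the benchmark measured.

```python
# Figures quoted in this README; decimal units (1 GB = 1e9 bytes) are assumed.
ACTIVE_PARAMS = 40e9            # active parameters per token
BITS_PER_WEIGHT = 1.6           # IQ1_M average, as stated above
BYTES_READ_PER_TOKEN = 10e9     # ~10 GB per forward pass, as stated above
SSD_BYTES_PER_SEC = 650e6       # ~650 MB/s through ntfs-3g
MEASURED_SEC_PER_TOKEN = 20.89  # Phase 3 generation speed

def weights_bytes(params, bits_per_weight):
    """Size of `params` weights at a given average bit width."""
    return params * bits_per_weight / 8

def io_seconds(bytes_to_read, bytes_per_sec):
    """Time to stream `bytes_to_read` at a sustained read throughput."""
    return bytes_to_read / bytes_per_sec

pure_quant_gb = weights_bytes(ACTIVE_PARAMS, BITS_PER_WEIGHT) / 1e9
io_time = io_seconds(BYTES_READ_PER_TOKEN, SSD_BYTES_PER_SEC)
compute_delta = MEASURED_SEC_PER_TOKEN - io_time

print(f"pure 1.6-bit estimate: {pure_quant_gb:.1f} GB")
print(f"disk time per token:   {io_time:.2f} s")
print(f"remaining (compute):   {compute_delta:.2f} s")
```

With these inputs the disk time is a little over 15 seconds and the remainder is about 5.5 seconds, matching the delta discussed above.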
Next Step: Expert Caching / Pinning Layer
To address the SSD I/O bottleneck, we proposed an Expert Cache/Pinning Layer architecture for llama.cpp and GGML.
Mixture-of-Experts activations follow a highly skewed, Zipfian power-law distribution:
- Pinning just 12 out of 64 routed experts (only ~18% of the model parameters) in fast memory (GPU VRAM or locked physical RAM via `mlock`) achieves a 73% cache hit rate.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds to 0.24 t/s (~4.2 seconds/token).
- Pinning 24 experts achieves an 85% hit rate, projecting speeds to 0.43 t/s (~2.3 seconds/token), making a 754B model genuinely usable for slow offline batch agentic tasks.
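These projections can be reproduced with a simple I/O-only model: only cache misses go to disk, and the time per token is the missed bytes divided by SSD throughput. The sketch below uses the 10 GB and ~650 MB/s figures from this README. The function name is hypothetical, and the model deliberately ignores CPU compute time and cache-hit latency, which appears to be how the quoted projections were derived.

```python
BYTES_PER_TOKEN_UNCACHED = 10e9  # ~10 GB read per token with no cache
SSD_BYTES_PER_SEC = 650e6        # ~650 MB/s through ntfs-3g

def projected_tokens_per_sec(hit_rate,
                             bytes_per_token=BYTES_PER_TOKEN_UNCACHED,
                             ssd_bytes_per_sec=SSD_BYTES_PER_SEC):
    """I/O-only projection: time per token = missed bytes / SSD throughput."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1) for an I/O-bound estimate")
    miss_bytes = (1.0 - hit_rate) * bytes_per_token
    return ssd_bytes_per_sec / miss_bytes

for pinned, hit in [(12, 0.73), (24, 0.85)]:
    tps = projected_tokens_per_sec(hit)
    print(f"{pinned} pinned experts, {hit:.0%} hits: "
          f"{tps:.2f} t/s ({1 / tps:.1f} s/token)")
```

Because compute time is left out, these figures are best read as upper bounds; adding the ~5.5 seconds of per-token compute measured in Phase 3 would lower them.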
┌───────────────────────────────────────────┐
│           Expert Cache (mlock)            │
│        (12 Hot Experts - 73% Hit)         │
└─────────────────────┬─────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        ▼                           ▼
[Cache Hit (73% of tokens)]   [Cache Miss (27% of tokens)]
Read from RAM/VRAM            Page-in from NVMe SSD
Latency: ~0.1s                Latency: ~15.3s
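A hit rate like the one in the diagram can be estimated offline from a routing trace: count how often each expert is selected, pin the most frequent ones, and measure what fraction of activations they cover. The sketch below is one way to do that. The trace format (one list of expert IDs per token) and the function name are assumptions for illustration and are not taken from `expert_profiler.py`; the tiny trace is synthetic.

```python
from collections import Counter

def pinned_hit_rate(routing_trace, num_pinned):
    """Fraction of expert activations served by the `num_pinned` hottest experts.

    `routing_trace` is a list with one entry per token, each entry being
    the list of expert IDs the router selected for that token.
    """
    counts = Counter(expert for token in routing_trace for expert in token)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    pinned = {expert for expert, _ in counts.most_common(num_pinned)}
    hits = sum(n for expert, n in counts.items() if expert in pinned)
    return hits / total

# Synthetic trace: 4 tokens, 2 experts selected per token.
trace = [[0, 1], [0, 2], [0, 1], [3, 0]]
rate = pinned_hit_rate(trace, num_pinned=2)
print(f"hit rate with 2 pinned experts: {rate:.2f}")
```

Picking the pinned set from the same trace it is evaluated on gives an optimistic number; a held-out trace would be a fairer test of how stable the hot set is across prompts.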
Detailed implementation proposals, code structures, and GGML modifications can be found in the expert_cache_design.md file.
Project Directory Map
- `monitor.py`: Real-time logging of CPU, system RAM, and VRAM utilization.
- `expert_profiler.py`: Analyzes expert routing distributions under local execution.
- `expert_cache_design.md`: Architecture design and GGML code changes for MoE hot-expert pinning.
- `experiments/`:
  - `faz1_deepseek_v2_lite/`: Logs and benchmark summaries for the 16B model.
  - `faz2_glm_4.7_flash/`: Performance metrics and meta-conversation logs for the 30B model.
  - `faz3_glm_5.1_iq1_m/`: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.
"Don't tell us it won't work. We'll build it, run it, and benchmark it."