GitHub - infiniteregrets/kv-psi: Use Linux Pressure Stall Information to trim an LLM KV cache

PSI KV Governor

PSI KV Governor is a small reference implementation for using Linux Pressure Stall Information to trim an LLM KV cache when the system is under memory pressure.
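
The idea can be sketched in a few lines. This is an illustrative sketch only, not the project's actual code: the function names and the `threshold` value are hypothetical, and the meaning given to `keep`, `tail`, and `min_prune` (named after the benchmark flags `--keep`, `--tail`, and `--min-prune`) is an assumption about how those flags behave.

```python
from typing import Optional, Tuple


def plan_prune(n_tokens: int, keep: int, tail: int, min_prune: int) -> Optional[Tuple[int, int]]:
    """Return the half-open token range [start, end) to drop, or None.

    Assumed semantics: preserve the first `keep` tokens and the last `tail`
    tokens, and only prune if the middle span has at least `min_prune` tokens.
    """
    start = keep
    end = n_tokens - tail
    if end - start < min_prune:
        return None
    return (start, end)


def govern(n_tokens: int, psi_some_avg10: float, *, threshold: float = 1.0,
           keep: int = 64, tail: int = 256, min_prune: int = 64) -> Optional[Tuple[int, int]]:
    """Prune only when memory pressure is at or above a (hypothetical) threshold."""
    if psi_some_avg10 < threshold:
        return None
    return plan_prune(n_tokens, keep, tail, min_prune)
```

Under these assumptions, a 1000-token cache under pressure would drop tokens 64 through 743, while a cache too short to free `min_prune` tokens is left alone.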

Requirements

  • Linux with PSI enabled: cgroup memory.pressure or /proc/pressure/memory
  • Python 3.10+
  • llama.cpp build dependencies for the runner
  • a GGUF model, for example models/SmolLM2-135M-Instruct-Q2_K.gguf

Check PSI:

cat /proc/pressure/memory
PYTHONPATH=src python benchmarks/pressure_bench.py --preflight-only
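
To read the same values from Python, the PSI file can be parsed directly. Each line starts with `some` or `full`, followed by `avg10=`, `avg60=`, `avg300=`, and `total=` fields. The helper names below are hypothetical and are not part of this repository; treat this as a minimal sketch.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text into {"some": {"avg10": ...}, "full": {...}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        # Each field looks like "avg10=0.00"; "total" is in microseconds.
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_psi(path: str = "/proc/pressure/memory") -> Dict[str, Dict[str, float]]:
    """Read and parse a PSI file (system-wide or a cgroup's memory.pressure)."""
    with open(path) as f:
        return parse_psi(f.read())
```

Calling `read_psi()["some"]["avg10"]` would then give the 10-second average share of time in which at least one task was stalled on memory.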

Basic Usage

Run the reference simulator:

PYTHONPATH=src python -m psi_kv_governor.cli simulate

Build the llama.cpp runner:

scripts/build_llama_runner.sh

Download the small benchmark model if needed:

python scripts/download_demo_model.py

PSI Benchmark

Run the benchmark in both variant orders. The order matters because the PSI avg10 value, cache, and zram/swap state can carry over from the first pressure run into the second.

PYTHONPATH=src python benchmarks/pressure_bench.py \
  -c 2048 \
  -n 1536 \
  --keep 64 \
  --tail 256 \
  --min-prune 64 \
  --pressure-mib 6000 \
  --pressure-step-mib 1024 \
  --pressure-warmup-s 10 \
  --variant-cooldown-s 45 \
  --out-dir data/bench-pressure/fixed-first

PYTHONPATH=src python benchmarks/pressure_bench.py \
  --variant-order psi-first \
  -c 2048 \
  -n 1536 \
  --keep 64 \
  --tail 256 \
  --min-prune 64 \
  --pressure-mib 6000 \
  --pressure-step-mib 1024 \
  --pressure-warmup-s 10 \
  --variant-cooldown-s 45 \
  --out-dir data/bench-pressure/psi-first
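
The commands above use a fixed `--variant-cooldown-s 45` between variants. As an illustration of the carry-over concern, one alternative is to poll avg10 until it decays below a threshold before starting the next run. This is a hypothetical sketch, not something the benchmark script is known to do; `wait_until_calm`, its parameters, and the default threshold are all assumptions.

```python
import time
from typing import Callable


def wait_until_calm(
    read_avg10: Callable[[], float],
    threshold: float = 0.5,
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until avg10 <= threshold; return False if the timeout expires first."""
    deadline = clock() + timeout_s
    while True:
        if read_avg10() <= threshold:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The `read_avg10` callable is injected so the loop can be driven by any PSI source (system-wide or per-cgroup) and tested without real memory pressure.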

Recent Jetson result:

run          variant  decoded  tok/s  prunes  final KV  external PSI some/full
fixed-first  fixed    1536     94.00  0       1547      1.61/1.61
fixed-first  PSI      1536     88.80  4       1291      4.14/3.94
psi-first    PSI      1536     96.16  2       1004      2.46/2.33
psi-first    fixed    1536     89.76  0       1547      5.56/5.56
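
To see the order effect in these numbers, the short sketch below computes the PSI variant's throughput relative to the fixed variant within each run. It uses only the tok/s values from the table above; the variable and function names are illustrative.

```python
# (run, variant, tok/s) copied from the Jetson result table.
rows = [
    ("fixed-first", "fixed", 94.00),
    ("fixed-first", "PSI", 88.80),
    ("psi-first", "PSI", 96.16),
    ("psi-first", "fixed", 89.76),
]


def psi_vs_fixed_pct(rows):
    """Return {run: percent change in tok/s of PSI relative to fixed}."""
    by_key = {(run, variant): tps for run, variant, tps in rows}
    runs = sorted({run for run, _, _ in rows})
    return {
        run: (by_key[(run, "PSI")] / by_key[(run, "fixed")] - 1.0) * 100.0
        for run in runs
    }


delta = psi_vs_fixed_pct(rows)
for run, pct in delta.items():
    print(f"{run}: PSI vs fixed = {pct:+.2f}%")
```

The PSI variant comes out about 5.5% slower when it runs second and about 7.1% faster when it runs first. In both orders the variant that ran first had the higher tok/s, which is why a single ordering is not enough to compare the two.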

Result directories:

  • data/bench-pressure/real-psi-6000m-1536tok-cooldown
  • data/bench-pressure/real-psi-6000m-1536tok-cooldown-psi-first