Local LLM Inference Optimization: The Complete Guide

Note: This post was drafted with significant AI assistance, synthesizing notes, bench results, and scripts from the l3ms homelab toolkit and the series of model-running posts on this site. The experiments, numbers, and failure modes documented here are real - the synthesis and prose are AI-assisted.

Preface

Over the past year I've written posts on running gpt-oss-120b, Qwen3-Coder-Next, Gemma 4 26B, Qwen3.6-35B-A3B, and Gemma 4 MTP locally on consumer hardware. Each post has its own notes, failure modes, and tuning results - but the same lessons keep appearing: enable XMP, pin to P-cores, quantize your KV cache, don't trust the power profile.

This is my attempt at a master reference. Instead of re-discovering flags in every new model post, I want one doc to link back to. If you're hitting a performance wall, starting from scratch, or just want to understand what each knob actually does - start here.

The scope is intentionally wide. The numbers in this guide come from one machine: an RTX 4070 12 GB, i5-12600K, 32 GB of DDR5-6000, Linux, CUDA, and recent llama.cpp builds. The Apple Silicon, AMD, multi-GPU, and server sections are useful starting points, but I haven't tested those setups myself.

When I give a number, it came from this box. Things I still need to test are called out as such.

1. TL;DR: Start Here

If you want maximum control and performance: use llama.cpp directly. This guide assumes that path.
If you want desktop UX, model browsing, and a good local OpenAI-compatible endpoint: LM Studio is perfectly reasonable.
If you want multi-user serving, batching, and production throughput: evaluate vLLM.
If you are on Apple Silicon: compare llama.cpp Metal with mlx; unified memory changes the sizing math.
If token generation (TG) is slow on MoE models: check RAM speed before touching flags. Enabling XMP took my machine from roughly one-third speed back to normal.
If you hit VRAM limits: reduce context, quantize KV cache, lower --parallel, then tune layer placement.
If you use MTP speculative decoding: benchmark draft acceptance and KV cache precision together; raw TPS is not enough.
If you are running a single-user homelab: prefer --parallel 1, explicit context sizing, and static placement once you have a stable config.

1.1 Where to Jump In

This is a reference, not a linear tutorial. Start with the part that matches the problem:

MoE generation is slow: check RAM speed, then layer placement and P-core pinning.
The model does not fit or dies later in a session: start with --fit, context and KV cache, then the known OOM causes.
Vision fails at load or on the first image: go to Vision / Multimodal. The projector and image batch need their own headroom.
MTP is no faster than normal decoding: check draft acceptance and KV precision, not just reported TG.
You use LM Studio or Ollama: the hardware, OS, and security sections still apply. Most llama.cpp flags do not.

1.2 Safe Starting Profiles

These are conservative baselines for a single-user server. They are starting points, not universal optimums; model architecture still changes the memory math.

Avoid the exciting failure modes: leave at least 512 MiB of fit headroom, test GGML_CUDA_GRAPH_OPT before keeping it, exclude E-cores on hybrid Intel CPUs, and benchmark MTP KV precision per model.

2. Optimization Priority Checklist

Ordered by typical impact. Each item links to the section with the full explanation.

3. What to Measure Before Tuning

Optimization only makes sense if you know which phase is slow.

Do not optimize from a single short prompt. Short prompts hide KV cache costs, long-context VMM growth, and parallel-slot allocation. Benchmark at the context length you actually serve.

4. Glossary

5. The Inference Landscape

5.1 Why Run Locally?

Privacy: prompts never leave your machine.
Cost: no per-token bill, just the hardware and electricity you are already paying for.
Control: any model, any quant, any parameters. No deprecations, no rate limits, no pricing changes.
Offline: works without internet.
Experimentation: swap models, tune parameters, run evals without API contracts.

5.2 Cloud vs Local - Honest Tradeoffs

Most serious users end up with both: cloud APIs for frontier tasks, local for everything privacy-sensitive, experimental, or routine.

5.3 Local Inference Tools

This guide focuses on llama.cpp. Most concepts (KV cache, quantization, layer placement) generalize across tools.

5.4 Backends Within llama.cpp

This document assumes CUDA. Where behavior differs on Vulkan or CPU-only, it's marked [Vulkan] or [CPU]. If you're on AMD/Intel GPU, most flag logic is the same but CUDA-specific env vars don't apply.

6. Hardware

6.1 The Memory Hierarchy - The Most Important Mental Model

Token generation speed is limited by how fast the runtime can stream active weights through the memory hierarchy. Rough bandwidth numbers:

VRAM (GDDR on the GPU board)   ~500–1000 GB/s
Unified memory (Apple Silicon)  ~200–400 GB/s
System RAM                       ~50–200 GB/s   (varies hugely by speed and channel config)
NVMe SSD                          ~5–7 GB/s
SATA SSD / HDD                    ~0.1–0.55 GB/s

Dense models: every token reads all active weights. Model must fit in VRAM for full-speed inference. Any spill to system RAM causes a large TG drop.

MoE models often stream expert weights from system RAM, making memory bandwidth critical. On my machine, enabling XMP took generation from roughly one-third speed back to normal.
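A rough way to reason about this is a bandwidth ceiling: if every generated token has to stream the active weights once, TG cannot exceed bandwidth divided by the bytes streamed per token. The sketch below is a back-of-the-envelope estimate, not a benchmark. The function names are made up for illustration, and the 3 GB of weights streamed per token is a hypothetical figure.

```shell
# Theoretical DDR bandwidth in GB/s: MT/s x 8 bytes per transfer x channels.
ddr_bandwidth_gbs() {
  awk -v mts="$1" -v ch="$2" 'BEGIN { printf "%.1f\n", mts * 8 * ch / 1000 }'
}

# Upper bound on tokens/s if each token streams `gb` gigabytes of weights once.
tg_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

ddr_bandwidth_gbs 6000 2   # DDR5-6000, dual channel -> 96.0
tg_ceiling 96 3            # hypothetical 3 GB streamed per token -> 32.0
```

Real throughput lands below this ceiling, but the ratio explains why RAM running below its rated speed shows up directly in MoE generation speed.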

Check that your RAM is running at its rated speed:

sudo dmidecode -t memory | grep -E "Speed|Configured"
# "Configured Memory Speed" must match your XMP/EXPO profile speed.
# If it doesn't, enable XMP/EXPO in BIOS.

Check this before touching llama.cpp flags. It takes a minute and can save hours of tuning the wrong thing.
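The comparison above can be automated. This is a minimal sketch that assumes dmidecode prints `Speed:` and `Configured Memory Speed:` lines per DIMM (older dmidecode versions label the second field differently). The function name check_ram_speed is my own.

```shell
# Compare rated vs configured speed for each populated DIMM.
# Reads `dmidecode -t memory` output on stdin; exits non-zero if any DIMM runs below its rating.
check_ram_speed() {
  awk -F': *' '
    /^[[:space:]]*Speed:/ { rated = $2 + 0 }            # "6000 MT/s" -> 6000, "Unknown" -> 0
    /^[[:space:]]*Configured Memory Speed:/ {
      conf = $2 + 0
      if (rated > 0 && conf > 0) {                       # skip empty slots
        n++
        if (conf < rated) { bad++; verdict = "enable XMP/EXPO" } else { verdict = "ok" }
        printf "DIMM %d: rated %d, running %d -> %s\n", n, rated, conf, verdict
      }
      rated = 0
    }
    END { exit (bad > 0) ? 1 : 0 }
  '
}

# Usage: sudo dmidecode -t memory | check_ram_speed
```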

6.2 GPU / VRAM

VRAM is the primary inference resource. More VRAM = more layers on GPU = faster inference.

iGPU display trick (desktop NVIDIA): route display through the motherboard video output instead of the GPU. This frees 500–1000 MB VRAM the GPU was using for desktop composition.

6.3 CPU

For dense models (all in VRAM): CPU is nearly idle during inference. Core count has minimal impact.

For MoE hybrid: the CPU runs the expert forward passes for every token, so CPU-side expert work sets the TG ceiling. On Intel hybrid CPUs (12th gen+), P-cores and E-cores have significantly different throughput for matrix operations, and letting inference threads land on E-cores drags TG down by 20–30%. Always pin to P-cores:

taskset -c 0-11 llama-server ...   # i5-12600K: logical CPUs 0-11 are the P-core threads

Thread count: set --threads to the number of P-core hardware threads, minus 1–2 for the OS (10 on this CPU). More threads than that is counterproductive.

6.4 What Is "Good Enough" Throughput?

For coding agents: TG dominates. First-token latency matters less once the context is warm. Every t/s gained compounds across a full session.

7. OS Choice

7.1 Linux

Highest-performance path for CUDA inference.

~15–20% TPS advantage over Windows in practice: leaner CUDA driver overhead, better scheduling under sustained load, direct memory control.
Full access to CPU governor, NUMA, huge pages, cgroups, headless mode.
Recommended distributions: CachyOS (performance-tuned kernels, best power management tuning surface), Ubuntu/Debian (easiest CUDA packages), Arch (rolling, latest drivers).

7.2 Windows

Reasonable; CUDA support is solid.

Some overhead from Windows scheduler and CUDA runtime.
Power plan: set to "High Performance" or "Ultimate Performance". Default "Balanced" throttles CPU under sustained inference load.
NVIDIA Control Panel → "Power management mode" → "Prefer maximum performance".
WSL2: close to native for CUDA; some VRAM overhead from virtualization. Generally acceptable if dual-boot isn't an option.

7.3 macOS

Different runtime stack - CUDA guidance does not apply.

Metal backend activates automatically.
Apple Silicon unified memory: CPU and GPU share the same physical pool. A machine with 128 GB RAM can use most of that as "VRAM" (macOS keeps a share back for the system), which raises the dense model size ceiling dramatically.
Memory bandwidth is excellent (~400 GB/s on M3 Max); competitive with mid-range NVIDIA for the models it can run.
mlx framework is worth evaluating alongside llama.cpp for Metal workloads.

7.4 Linux OS Tuning

These settings have measurable impact on TG throughput.

CPU Governor and Power Profile

# Must be "performance" on all cores
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# EPP must also be "performance" - governor alone is not enough on intel_pstate
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference

# Verify actual P-core frequency near max boost
grep "cpu MHz" /proc/cpuinfo | sort -t: -k2 -rn | head -6   # sort on the MHz value after the colon

The power-profiles-daemon trap: KDE and GNOME ship with power-profiles-daemon, which can set a non-performance HWP (hardware P-state) mode on some boots. The insidious symptom: every sysfs check looks right (scaling_governor and energy_performance_preference report "performance", cpu MHz sits near boost) - but TG runs 20–30% below expected, and it varies between boots. The degradation happens at the hardware MSR level, where standard tooling doesn't look.

Fix - replace with tuned-ppd:

# Arch / CachyOS
sudo pacman -S tuned-ppd       # removes power-profiles-daemon automatically
sudo systemctl enable --now tuned
sudo tuned-adm profile throughput-performance
# Reboot. TG will be stable and correct across all boots.

# Ubuntu / Debian
sudo apt install tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile throughput-performance
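To look below sysfs, one option is to read the HWP request register directly. This is a hedged sketch: it assumes an Intel CPU with HWP, the msr kernel module, and the rdmsr tool from msr-tools. It decodes the energy-performance preference field (bits 31:24) of IA32_HWP_REQUEST (MSR 0x774), where 0 means maximum performance and 255 maximum power saving. The helper name hwp_epp is made up, and whether this exposes the exact failure described above is untested.

```shell
# Extract the EPP field (bits 31:24) from an IA32_HWP_REQUEST value given as hex,
# with or without a leading 0x (rdmsr prints it without).
hwp_epp() {
  echo $(( (0x${1#0x} >> 24) & 0xff ))
}

hwp_epp 80003f08   # made-up register value -> 128 (balanced)

# Usage on a live system (needs root):
#   sudo modprobe msr
#   hwp_epp "$(sudo rdmsr -p 0 0x774)"
```

Comparing the value across a good boot and a bad boot is more informative than the absolute number.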

Transparent Huge Pages

cat /sys/kernel/mm/transparent_hugepage/enabled
# Recommended: [always]
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Headless Mode

# Stop desktop compositor - frees 200-400 MB RAM + compositor VRAM
sudo systemctl isolate multi-user.target
sudo sync && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

# Restore when done
sudo systemctl isolate graphical.target

Without a display server, run zellij in any TTY: it gives split panes and full terminal multiplexing without X or Wayland.

8. Why llama.cpp, and How to Build It

8.1 Why Not Ollama?

Ollama builds on llama.cpp but exposes a minimal, mostly fixed-default configuration surface. Most of the flags in §10-§17 of this guide - layer placement, fit parameters, batch sizes, CUDA env vars - are unavailable or only partly exposed in Ollama.

Ollama is the right choice for quick setup and model management. If you're reading this guide, you've outgrown it.

8.2 Building from Source

Always build from source. Distro packages are often outdated and are not compiled for your specific GPU architecture. MoE inference performance improves significantly with each llama.cpp release.

CUDA build:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build && cd build

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_NATIVE=ON \
  -DGGML_LTO=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89
#  89 = RTX 40-series | 86 = RTX 30-series | 75 = RTX 20-series | 61 = GTX 10-series

cmake --build . --config Release \
  --target llama-server llama-bench llama-fit-params llama-cli --parallel
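If you are unsure which architecture number applies, the driver can report it. A minimal sketch, assuming a reasonably recent NVIDIA driver whose nvidia-smi supports the compute_cap query field. The helper name arch_from_cap is my own.

```shell
# Convert a compute capability such as "8.9" into the CMake value "89".
arch_from_cap() {
  printf '%s\n' "$1" | tr -d '. '
}

# Only query the driver when nvidia-smi is present; otherwise print nothing.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  echo "-DCMAKE_CUDA_ARCHITECTURES=$(arch_from_cap "$cap")"
fi
```

Recent CMake versions (3.24+) also accept -DCMAKE_CUDA_ARCHITECTURES=native, which detects the installed GPU at configure time.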

Vulkan build (AMD, Intel GPU):

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_VULKAN=ON \
  -DLLAMA_CURL=ON \
  -DGGML_NATIVE=ON
# CUDA-specific build flags do not apply here

Keep your build updated. Pull and rebuild regularly - especially before benchmarking a new model.

8.3 Key Binaries

9. Model Selection and Quantization

9.1 Dense vs MoE - Choose Your Tuning Strategy

9.2 Quantization Reference

UD (Unsloth Dynamic) quants: allocate higher bits to attention-sensitive layers and lower bits to robust layers. Better perplexity than uniform quants at the same average bit width. Generally the best choice when available.

Rule of thumb: use the highest quant that fits your VRAM + RAM budget. Q5_K_XL or UD-Q5_K_XL is a strong default. Drop to Q4 only when necessary.

9.3 Quantization-Aware Training (QAT)

Standard Post-Training Quantization (PTQ) quantizes weights after the model is fully trained. When going down to 4-bit, this rounding process can throw away critical precision, leading to regressions in reasoning, logic, and acrostic constraints.

Quantization-Aware Training (QAT) bypasses this degradation by modeling low-precision rounding noise during the training or fine-tuning process. This enables the model weights to adapt to the low-bit limits.

Accuracy Recovery: In the Gemma 4 QAT builds I tested, 4-bit QAT behaved much closer to Q8 than ordinary post-training 4-bit quantization.
VRAM Savings: A 26B MoE model in dynamic Q5 consumes ~18 GB (standard Q8_0 is larger still), spilling heavily to system RAM on a 12 GB card. The QAT Q4 model drops to ~14.2 GB, so far more of the model's layers load directly into VRAM and run at GPU speed.

10. Layer Placement - The Core Optimization for MoE

For dense models fully in VRAM: use -ngl 99 and skip to §11.

For MoE hybrid setups, layer placement is where most performance lives. The goal: keep as many blocks as possible on GPU (especially early layers and attention), while offloading expert weights to RAM.

10.1 `--n-gpu-layers` (`-ngl`)

How many transformer blocks to load onto GPU. Start at 99 (all layers). Drop if CUDA OOM.

llama-server -m model.gguf --n-gpu-layers 99    # all on GPU
llama-server -m model.gguf --n-gpu-layers 37    # 37 on GPU, rest on CPU

10.2 `--n-cpu-moe`

Integer count N: keep the expert weights of the first N MoE layers on CPU. Quick coarse control.

--n-cpu-moe 31   # first 31 MoE layer experts on CPU

⚠️ RAM ceiling: for a ~60 GB model, putting all experts on CPU tries to load ~60 GB into RAM. On a 64 GB system that leaves almost nothing for the OS and can hard-crash the machine. Always use llama-fit-params (§10.4) to find safe values before setting this high.
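A quick sanity check before raising --n-cpu-moe is to compare the weights you plan to keep on CPU against available RAM. A minimal sketch for Linux: the helper names are made up, and the 40 GiB of CPU-side experts and 8 GiB reserve are hypothetical numbers to replace with your own.

```shell
# Read /proc/meminfo-style text on stdin and print MemAvailable in whole GiB.
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }'
}

# Succeeds when CPU-side weights plus a reserve fit in available RAM.
# Args: available_gib cpu_weights_gib reserve_gib
fits_in_ram() {
  [ $(( $2 + $3 )) -le "$1" ]
}

# Hypothetical plan: 40 GiB of experts on CPU, keep 8 GiB spare for the OS.
if [ -r /proc/meminfo ]; then
  if fits_in_ram "$(mem_available_gib < /proc/meminfo)" 40 8; then
    echo "ok to try"
  else
    echo "too tight: pick a smaller quant or keep more experts on GPU"
  fi
fi
```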

10.3 `--override-tensor` (`-ot`) - Fine-Grained Placement

Per-tensor, per-layer placement via regex. Most control, most complexity.

# All expert projections in all layers → CPU:
--override-tensor ".ffn_(up|down|gate)_(ch|)exps=CPU"

# Layers 5+ experts → CPU; keep layers 0-4 fully on GPU:
--override-tensor "blk\.(5|[6-9]|[0-9][0-9]+)\.ffn_(up|down|gate)_(ch|)exps=CPU"

The shared expert gotcha: some models (Qwen3.5-122B, certain gpt-oss variants) have two expert tensor families:

Routed experts: ffn_{up,down,gate}_exps
Shared expert (always active, 1 per layer): ffn_{up,down,gate}_shexp

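A pattern matching only _exps leaves _shexp on GPU, silently consuming VRAM and causing CUDA OOM. Note that (ch|)exps does not cover this case: it matches _exps and _chexps, but _shexp has no trailing s, so the shared expert tensors are never caught. A pattern that captures both families has to name the shared one explicitly:

# ((ch|)exps|shexp) matches _exps / _chexps (routed) and _shexp (shared):
--override-tensor ".ffn_(up|down|gate)_((ch|)exps|shexp)=CPU"

Safe to use even for models without shared experts - the extra alternative simply never matches.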
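A quick way to see what a pattern actually catches is to run it against tensor names with grep -E before trusting it in a run script. This sketch assumes llama.cpp applies the pattern as a regex search over each tensor name. The tensor list is illustrative rather than dumped from a real model, and the helper names are made up.

```shell
# Illustrative tensor names in the style llama.cpp prints at load time.
sample_tensors() {
  printf '%s\n' \
    'blk.3.attn_q.weight' \
    'blk.3.ffn_up_exps.weight' \
    'blk.3.ffn_down_exps.weight' \
    'blk.3.ffn_gate_shexp.weight' \
    'blk.12.ffn_gate_exps.weight'
}

# Print the sample tensors a pattern would move (nothing if none match).
ot_matches() {
  sample_tensors | grep -E -- "$1" || true
}

# Routed experts only: the shared-expert tensor is NOT listed.
ot_matches '\.ffn_(up|down|gate)_(ch|)exps'

# Routed and shared experts: all four ffn_* tensors are listed.
ot_matches '\.ffn_(up|down|gate)_((ch|)exps|shexp)'
```

For a real model, replace sample_tensors with the tensor names from the llama-server load log.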

10.4 `--fit on` - Recommended Starting Point

Auto-probes free VRAM at startup, computes optimal -ngl + -ot placement automatically. Zero manual tuning.

# --fit-ctx:    minimum context to guarantee fits; KV cache for this ctx is accounted for
# --fit-target: VRAM headroom in MiB to leave free
# (comments cannot follow a trailing backslash, so they live up here)
llama-server \
  -m model.gguf \
  --fit on \
  --fit-ctx 65536 \
  --fit-target 512 \
  ...

--fit-target note: CUDA's VMM pool grows as context fills. --fit-target 128 can look great in a short bench and OOM later, so I use at least 512 MiB for a persistent server. Current llama.cpp builds also include mmproj memory in the fit calculation (PR #21489). My vision profile still uses 2048 MiB because it has been stable on a 12 GB card; older builds need that headroom set manually.

Dry run without starting a server:

# -fitt: fit-target in MiB; -fitc: fit-ctx in tokens
llama-fit-params \
  -m /path/to/model.gguf \
  -fitt 512 \
  -fitc 65536
# Outputs something like: -c 65536 -ngl 49 -ot "blk\.8\.ffn_...=CPU,..."

This output is what you hardcode for static placement.

10.5 Static vs Dynamic Placement

For a stable homelab server, derive placement once with llama-fit-params and hardcode it in your run script. Use --fit on when testing new models or after hardware changes.

11. Context and KV Cache

11.1 `--ctx-size` - Choosing Context Length

The KV cache grows linearly with context and lives in VRAM. Large context on small VRAM can push expert layers off GPU.

Approximate KV VRAM usage (varies by model, head count, and layer structure):

On a 12 GB card at 128k context with f16 KV, 8 GB goes to KV cache - leaving only ~4 GB for model weights and attention. Context choice directly affects how many GPU layers you can afford.

Practical guidance: coding sessions work well at 64k; long-context RAG may need 128k+; vision inference is typically safest at 64k on 12 GB VRAM.

11.2 KV Cache Quantization (`-ctk`, `-ctv`) [CUDA]

Quantizing the KV cache roughly halves (q8_0) or further reduces (q4_0) its VRAM footprint relative to f16.

-ctk q8_0 -ctv q8_0

The compounding effect: on a 12 GB card with 64k context, switching f16 → q8_0 KV frees ~2 GB. That 2 GB lets llama-fit-params keep one to two additional GPU layers - translating directly to higher TG. Confirmed on Qwen3-Coder-Next: q8_0 KV at 64k unlocked 2 extra GPU layers and added ~2 t/s TG vs f16 KV.

At short bench contexts (512 tokens), the KV cache is tiny and this effect is near-zero. Always test at your real serving context length.
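The sizing above can be estimated before loading anything. A minimal sketch, assuming a model where every layer uses ordinary full attention (sliding-window or hybrid layers need less). The 32-layer, 4-KV-head, 128-dim model is hypothetical, picked so that f16 at 128k lands near the 8 GB figure above. The block sizes are the GGML ones: f16 is 2 bytes per element, q8_0 is 34 bytes per 32 elements, q4_0 is 18 bytes per 32 elements.

```shell
# Estimate KV cache size in MiB.
# Args: n_layers n_kv_heads head_dim ctx_tokens cache_type
kv_cache_mib() {
  case "$5" in
    f16)  bytes_per_32=64 ;;   # 2 bytes per element
    q8_0) bytes_per_32=34 ;;   # 32 int8 values + one f16 scale
    q4_0) bytes_per_32=18 ;;   # 32 4-bit values + one f16 scale
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  # Factor 2 = one K and one V tensor per layer.
  echo $(( 2 * $1 * $2 * $3 * $4 * bytes_per_32 / 32 / 1048576 ))
}

# Hypothetical model: 32 layers, 4 KV heads, head dim 128.
kv_cache_mib 32 4 128 131072 f16    # 8192
kv_cache_mib 32 4 128 65536  f16    # 4096
kv_cache_mib 32 4 128 65536  q8_0   # 2176
```

For this made-up model, moving from f16 to q8_0 at 64k frees 1920 MiB, in line with the ~2 GB quoted above.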

11.3 `--parallel` - Concurrent Inference Slots

Each slot needs its own KV cache space. Giving 4 slots the same per-slot context as a single slot takes roughly 4x the KV VRAM.

--parallel 1   # single user homelab: reclaims KV VRAM for model weights

On gpt-oss-120b, dropping --parallel 4 → --parallel 1 freed ~540 MiB VRAM - enough for one more GPU layer and +1 t/s TG.

11.4 `--flash-attn on` (`-fa`)

Reduces attention memory traffic and avoids materializing the full attention matrix, which makes long-context inference much more practical on constrained VRAM. No meaningful downside on CUDA in my testing.

--flash-attn on

Always enable. Quantizing the V cache (-ctv) depends on it, so treat it as a prerequisite for §11.2.

[Vulkan]: Flash attention support varies by driver version. Verify before relying on it.

12. Batch Sizes

12.1 `--batch-size` (`-b`) - Prompt Processing Throughput

Controls how many tokens are processed in one forward pass during prefill. Higher = better PP throughput; more VRAM required.

--batch-size 2048   # high throughput (~maximum for most models)
--batch-size 1024   # balanced; good default
--batch-size 512    # conservative; use for vision or tight VRAM

Reduce if you hit CUDA OOM during the prefill phase specifically.

12.2 `--ubatch-size` (`-ub`) - Physical Micro-Batch

Physical sub-batch within a logical batch. Keep it ≤ --batch-size: the micro-batch is carved out of the logical batch, so a larger value cannot take effect.

--ubatch-size 512   # typical default

Vision/multimodal critical: an image tokenizes to several hundred tokens. If --ubatch-size < image token count, llama.cpp throws an assertion during vision inference. Use --ubatch-size 512 or higher and test with your actual image sizes. On 12 GB VRAM, --batch-size 256 --ubatch-size 512 is the vision baseline that has been stable for me; because it sets the micro-batch above the logical batch, confirm the effective n_batch and n_ubatch values in the startup log.

13. Sampling Parameters

Sampling controls the probability distribution at each decode step. These affect output quality and - via vocabulary truncation (top-k) - slightly affect speed.

Start with model card defaults. Most GGUF releases specify tested values. Use those before experimenting.

13.1 `--temp` - Temperature

0.0: greedy / deterministic. Best for coding agents where reproducibility matters.
0.7: standard creative chat.
1.0: no rescaling; follows raw model distribution. Most modern instruction-tuned models are calibrated for this.

13.2 `--top-k` - Vocabulary Truncation

Keeps only top K most probable tokens before sampling.

0: disabled, full vocabulary (most diverse; slowest to compute). llama.cpp's own default is 40, so set 0 explicitly if you want this.
100: safe performance cap - confirmed no measurable quality loss on coding tasks (gpt-oss-120b)
20–64: model-specific tighter caps

13.3 `--top-p` - Nucleus Sampling

Keeps the smallest set of most-probable tokens whose cumulative probability reaches p, and drops the rest. Applied after top-k.

1.0: no filtering (gpt-oss, Sarvam defaults)
0.95: standard for chat/code (Gemma 4, Qwen3 defaults)

13.4 `--min-p`

Filters tokens below min-p × max_token_probability.

0.0: off (most models)
0.01: light floor (Qwen3-Coder-Next)

13.5 `--repeat-penalty`

1.0: no penalty. Recommended for code - code naturally repeats patterns (variable names, keywords) and penalizing them degrades output.
1.1–1.3: mild penalty for prose.

Recommended Defaults by Model Family

14. Threading and CPU Control

14.1 `--threads` (`-t`)

CPU threads for the token generation phase (expert compute in hybrid MoE setups).

--threads 10   # P-core count minus 1-2 for OS headroom

More threads than available P-cores is counterproductive - they contend for the same memory bus and typically reduce TG.

14.2 `--threads-batch`

CPU threads for the prompt processing (PP, prefill) phase. PP is a burst workload, so you can give it every hardware thread the process is pinned to (all 12 P-core threads here).

--threads-batch 12

14.3 `taskset` - P-Core Pinning [Linux]

Most reliable way to keep inference off E-cores on Intel 12th gen+ (and other hybrid architectures):

taskset -c 0-11 llama-server ...   # pin all process threads to P-cores

Verify your P-core range from CPU documentation or lstopo. On the i5-12600K, logical CPUs 0–11 (6 P-cores × 2 threads) belong to the P-cores; 12–15 are the four E-cores.
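The P-core range and a matching --threads value can also be derived from sysfs. A hedged sketch: it assumes a hybrid Intel CPU on a recent kernel that exposes the P-core list at /sys/devices/cpu_core/cpus, and does nothing if that file is missing. The helper name count_cpus is made up.

```shell
# Count logical CPUs in a kernel-style cpulist such as "0-11" or "0-3,8-11".
count_cpus() {
  total=0
  old_ifs=$IFS; IFS=,
  for part in $1; do
    lo=${part%-*}; hi=${part#*-}        # "5" gives lo=hi=5, "0-11" gives lo=0 hi=11
    total=$(( total + hi - lo + 1 ))
  done
  IFS=$old_ifs
  echo "$total"
}

# Assumption: hybrid Intel CPUs expose their P-core logical CPUs in this file.
pcore_list_file=/sys/devices/cpu_core/cpus
if [ -r "$pcore_list_file" ]; then
  pcores=$(cat "$pcore_list_file")
  threads=$(( $(count_cpus "$pcores") - 2 ))   # leave two logical CPUs for the OS
  echo "taskset -c $pcores llama-server --threads $threads ..."
fi
```

On the i5-12600K this should print the same 0-11 range and --threads 10 used in this guide.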

14.4 `--poll`

Controls CPU spin aggressiveness while waiting for GPU kernel completion.

0: yield/sleep
100: busy spin

On hybrid CPU+GPU inference, TG is flat across poll values: GPU kernel execution and PCIe transfer dominate synchronization. Confirmed across multiple sweeps - results were within noise at every poll level. Leave it at the default of 50, or set 0 to reduce idle CPU load. It is not worth tuning.

14.5 `--numa`

NUMA affinity modes: distribute, isolate, numactl.

Single-socket systems: skip. There is only one NUMA node; these modes provide no benefit and can hurt performance. Use taskset -c for affinity instead.

Relevant only on dual-socket server hardware (AMD EPYC, Intel Xeon) where NUMA topology is real.

15. Memory Control

15.1 `--no-mmap`

Without this, llama.cpp uses memory-mapped I/O. Expert weight accesses during decode are non-sequential - the OS page fault handler triggers repeatedly for cold pages, adding latency jitter to TG.

With --no-mmap, the entire model loads into RAM before inference begins. No page faults.

--no-mmap   # recommended for all hybrid MoE and persistent server setups

Tradeoff: longer startup. Worth it for any persistent server.

15.2 `--mlock`

Pins model pages in RAM, preventing the OS from swapping them under memory pressure.

--mlock

Important when vm.swappiness is high (many Linux distributions default to 60–150 with ZRAM) or when running close to the RAM ceiling. Without it, a swap event mid-session can make TG appear to stall. Skip only if RAM is critically tight.
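One thing to check before relying on --mlock: the kernel refuses to lock more memory than the process's memlock limit allows, and that limit is often small for non-root users. A minimal sketch comparing the limit from `ulimit -l` (reported in KiB, or "unlimited") with the model file size. The helper name mlock_fits is made up.

```shell
# Succeeds when a model of `model_bytes` can be locked under the given memlock limit.
# Args: limit as printed by `ulimit -l` (KiB or "unlimited"), model size in bytes
mlock_fits() {
  [ "$1" = "unlimited" ] && return 0
  [ $(( $1 * 1024 )) -ge "$2" ]
}

# Usage (GNU stat):
#   mlock_fits "$(ulimit -l)" "$(stat -c %s /path/to/model.gguf)" \
#     || echo "raise the memlock limit before using --mlock"
```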

16. Priority and Process Settings

16.1 `--prio`

Scheduling priority for the inference process. Scale 0–3 (0 = normal, 3 = realtime).

--prio 2   # high priority; reduces OS scheduling jitter on TG

16.2 `--no-warmup`

Skips the initial warmup pass at startup, so the first real request pays that one-time initialization cost instead.

--no-warmup   # reduces startup time; safe for persistent servers

17. CUDA-Specific Settings [CUDA]

17.1 Environment Variables

export GGML_CUDA_GRAPH_OPT=0    # Baseline; compare against 1 on your real workload

LLAMA_SET_ROWS was in my older scripts, but current llama.cpp does not read it. I removed it.

GGML_CUDA_GRAPH_OPT caveat: test 0 against 1 on a real session, not a short prompt. Graph re-capture can increase memory use as context grows, and some models get slower with it enabled. I leave it off unless it wins the long run.

17.2 Notable Build Flags

17.3 cuBLAS - Tested and Closed

GGML_CUDA_FORCE_CUBLAS=ON is a build option that forces cuBLAS routines over the default GGML MMQ (mul_mat_q, quantized matrix multiplication) kernels.

Tested on mxfp4 and Q4 models: slower than default. GGML MMQ has native mxfp4/Q4 paths tuned for consumer decode batch sizes (1–16 tokens). cuBLAS is optimized for large datacenter batches. Result: ~45 t/s PP regression, no TG improvement. Default build wins on consumer hardware. May be worth re-evaluating on 24+ GB cards where larger batch sizes make cuBLAS more competitive.

18. Speculative Decoding & MTP (Multi-Token Prediction)

Autoregressive token generation (TG) is memory-bandwidth bound: the GPU must read all active model weights from memory for every token it generates. Speculative decoding works around this bottleneck by using a lightweight "draft" model to guess upcoming tokens, which the base model then verifies in a single forward pass.

On models trained with Multi-Token Prediction (MTP) heads (like Gemma 4 or Qwen 3.6), native MTP speculative drafting gave me roughly 2–2.6x TG (numbers in §18.3).

18.1 MTP Drafting Configuration Flags

Instead of pairing the base model with an unrelated draft model, mainline llama.cpp supports native companion MTP draft models:

--spec-draft-model: Path to the companion MTP GGUF file (e.g. mtp-gemma-4-26B-A4B-it.gguf ~460MB).
--spec-type draft-mtp: Tells llama-server to run in MTP verification mode.
--spec-draft-n-max: The maximum candidate sequence length drafted per iteration.
- For larger models (e.g., Gemma 4 26B), set to 2. Higher values introduce computational overhead that hurts TG.
- For lighter models (e.g., Gemma 4 12B), set to 4 to capture longer draft runs.

18.2 Target and Draft KV Cache Precision

MTP has two distinct caches. -ctk and -ctv set the target model cache; -ctkd and -ctvd set the draft model cache. Treat them as separate tuning decisions.

In my Gemma 4 MTP tests, quantizing the target cache with -ctk q8_0 -ctv q8_0 drove draft acceptance close to zero. Switching it to f16 used more VRAM but kept acceptance above 70%, which was much faster overall. Qwen may behave differently, so I test this per model.

Start with a full-precision draft cache, then compare q8_0 and f16 for the target cache:

# Memory-saving target cache; full-precision draft cache
-ctk q8_0 -ctv q8_0 -ctkd f16 -ctvd f16

# Gemma baseline that preserved acceptance in my tests
-ctk f16 -ctv f16 -ctkd f16 -ctvd f16

Record acceptance rate, TG, and VRAM. Saving memory is pointless if the draft model stops landing tokens.
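To see why acceptance matters more than raw TPS, it helps to estimate how many tokens each target forward pass yields. This sketch uses the standard simplifying assumption that every drafted token is accepted independently with the same probability a. It ignores the cost of running the draft model, so treat it as an upper bound on the gain. The function name is made up.

```shell
# Expected tokens produced per target forward pass with draft length n and
# per-token acceptance probability a: (1 - a^(n+1)) / (1 - a).
tokens_per_pass() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (a >= 1) t = n + 1
    else        t = (1 - a ^ (n + 1)) / (1 - a)
    printf "%.2f\n", t
  }'
}

tokens_per_pass 0.7 2   # 2.19 tokens per pass
tokens_per_pass 0 2     # 1.00: no better than normal decoding, plus draft overhead
```

With acceptance near zero every pass still produces exactly one token, which is why a memory-saving cache setting that kills acceptance ends up slower overall.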

18.3 Speculative Performance Gains

Tested on a single RTX 4070 12GB:

Gemma 4 26B Baseline: 38.5 tok/s
Gemma 4 26B QAT + MTP: 100.60 tok/s (2.6x speedup)
Gemma 4 12B QAT + MTP: 120.80 tok/s (2.0x speedup over the 12B model's own baseline)

19. Vision / Multimodal

19.1 `--mmproj`

Path to the multimodal projector file:

--mmproj /path/to/mmproj-BF16.gguf

Typically 1–3 GB. Allocates in VRAM at startup alongside the model.

19.2 OOM Failure Modes on Constrained VRAM

Failure 1 - mmproj allocation: current llama.cpp includes projector weights and compute buffers in --fit. If it still fails at load, check for an older build, other processes using VRAM, or a hardcoded placement created without the projector.

Fix: update llama.cpp and run --fit with the projector supplied. My 12 GB profile uses --fit-target 2048 and has been stable.

Failure 2 - batch assertion: image token count exceeds --ubatch-size. An image can tokenize to several hundred tokens; if the batch is too small, llama.cpp asserts.

Fix: use --ubatch-size 512 or higher.

19.3 Safe Vision Profile (12 GB VRAM)

This is the profile I actually use. You may be able to lower the 2048 MiB margin on a current build, but I prefer the headroom.

llama-server \
  -m model.gguf \
  --mmproj mmproj.gguf \
  --ctx-size 65536 \
  --fit on --fit-ctx 65536 --fit-target 2048 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --batch-size 256 --ubatch-size 512 \
  --no-mmap --mlock \
  --parallel 1

Separate text and vision servers on different ports if running both workloads from the same GPU.

20. ik_llama.cpp Fork [Advanced]

For highly specialized environments, the ikawrakow/ik_llama.cpp fork exists. It focuses on MoE-specific kernel optimizations (such as fused MoE kernels).

However, it is not covered in detail in this guide because:

No Upstreaming: Nothing developed in ik_llama.cpp is expected to make it upstream officially or directly.
Specialized Tuning: It serves as a specialized option for custom, architecture-specific tuning once you have maximized standard configurations.

21. Security Notes

Local inference servers are still HTTP services. Do not expose llama-server, LM Studio, or any model gateway directly to the public internet without authentication, firewalling, and rate limits.

Minimum safe defaults:

Bind to localhost for local tools unless you explicitly need LAN access.
Put a reverse proxy with authentication in front of anything reachable outside the machine.
Assume prompts, outputs, and tool calls may appear in app logs, shell history, reverse proxy logs, or frontend histories.
Treat model files like software dependencies: check license terms, source, and expected file hashes when possible.
Keep separate endpoints for trusted local agent workflows and anything exposed to other devices.

If you need remote access, prefer a private VPN (Tailscale, WireGuard) or a locked-down tunnel over opening the raw inference port.
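A quick audit of what is actually reachable from other machines: the sketch below filters `ss -ltn` output for listening sockets that are not bound to loopback. It assumes the usual column layout, where the fourth column is the local address and port. The function name is made up.

```shell
# Print listening sockets that are reachable from other machines.
# Reads `ss -ltn` style output on stdin; column 4 is the local address:port.
exposed_listeners() {
  awk '$1 == "LISTEN" && $4 !~ /^(127\.|\[::1\])/ { print $4 }'
}

# Usage: ss -ltn | exposed_listeners
```

If the inference port shows up in that list and you did not intend LAN access, rebind the server to 127.0.0.1.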

22. Diagnostic Checklist

Run before benchmarking or when TG is unexpectedly low.

# 1. RAM speed - most common culprit for MoE TG underperformance
sudo dmidecode -t memory | grep -E "Speed|Configured"
# "Configured Memory Speed" must match your XMP/EXPO rated speed

# 2. CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Expected: performance

# 3. EPP - must be "performance" (governor alone is not enough on intel_pstate)
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Expected: performance (not balance_performance, not powersave)

# 4. Actual CPU frequency - P-cores should be near rated max boost
grep "cpu MHz" /proc/cpuinfo | sort -t: -k2 -rn | head -6   # sort on the MHz value after the colon

# 5. Free VRAM - start inference from near-empty
nvidia-smi | grep MiB

# 6. Thermal - sustained load under throttle temperature
cat /sys/class/thermal/thermal_zone*/temp
# In millidegrees; 80000 = 80°C; throttle typically starts 85–105°C depending on chip

# 7. Background CPU hogs
ps aux --sort=-%cpu | head -10

# 8. Swap activity (high vm.swappiness systems)
grep -E "pswpin|pswpout" /proc/vmstat
# Growing non-zero values = model weights being swapped mid-session

# 9. PCIe link speed [CUDA]
nvidia-smi -q | grep -A 3 "PCIe Generation"
# Expected: Current Gen = 3 or 4 (check while inference is running; the link can downshift at idle)

# 10. Active tuned profile (if using tuned-ppd)
sudo tuned-adm active
# Expected: throughput-performance