Local LLM inference on AMD GPUs and Apple Silicon — no ROCm, no MLX, one binary.
35B parameter model running locally — Zig + Vulkan/Metal, no ROCm, no MLX
Supported Platforms
| Platform | GPU | Backend | Status |
|---|---|---|---|
| Linux | AMD RDNA4 (RX 9070, AI PRO R9700) | Vulkan | Primary — hand-tuned shaders |
| Linux | AMD RDNA3 (RX 7900 XTX, etc.) | Vulkan | Supported |
| macOS | Apple Silicon (M1, M2, M3, M4, M5) | Metal | Supported — native MSL shaders |
Start Here
Works the same on Linux (AMD GPU) and macOS (Apple Silicon):
```
git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast

# On RDNA4 Linux, enable cooperative matrix (skip on macOS)
export RADV_PERFTEST=coop_matrix

# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check

# See which models fit this machine
./zig-out/bin/zinc model list

# Download a model
./zig-out/bin/zinc model pull qwen3-8b-q4k-m

# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id qwen3-8b-q4k-m --prompt "Hello" --chat

# Or open the chat UI in your browser
./zig-out/bin/zinc chat
```
The server exposes the built-in chat UI at / and an OpenAI-compatible API at /v1.
What Works Today
- Single-stream CLI inference on the validated models listed below
- OpenAI-compatible `/v1` API with streaming
- Built-in browser chat UI with thinking mode support
- Managed model workflow: `list`, `pull`, `use`, `active`, `rm`
- `zinc chat` — start server and open browser in one command
- AMD path: RDNA4-tuned Vulkan shaders (wave64, cooperative matrix, fused ops)
- Apple Silicon path: native Metal shaders (MSL, zero-copy mmap, simdgroup ops)
- Auto-detection: ZINC picks the right backend (Vulkan or Metal) at build time
Still Rough
- Continuous batching and multi-tenant serving are still roadmap work
- The supported-model list is intentionally narrow
- Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)
The Problem
Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:
- AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
- Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).
The Solution
ZINC builds an inference engine tuned for the hardware you actually have.
Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.
One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.
Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.
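For context on "point at a GGUF": a GGUF file announces itself with a small fixed header (magic, version, tensor count, metadata count). A minimal parsing sketch following the public GGUF spec — an illustration only, not ZINC's loader:

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    tensor_count, metadata_kv_count = struct.unpack_from("<QQ", data, 8)
    return version, tensor_count, metadata_kv_count

# Synthetic header for illustration: version 3, 291 tensors, 20 metadata keys
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 20)
print(read_gguf_header(header))  # (3, 291, 20)
```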
Supported Models
The list below matches the current managed model catalog, not a broader wishlist.
- Qwen3.5 35B-A3B UD Q4_K_XL — supported on AMD RDNA4 32 GB and Apple Silicon
- Qwen3.6 35B-A3B UD Q4_K_XL — experimental on AMD RDNA4 32 GB and Apple Silicon
- OpenAI GPT-OSS 20B Q4_K_M — supported on Apple Silicon
- Qwen3 8B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
- Gemma 4 31B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
- Gemma 4 12B (26B-A4B MoE) Q4_K_M — experimental on AMD RDNA4 32 GB and Apple Silicon

Use `zinc model list --json` for machine-readable model metadata. Current throughput and latency numbers live on the public benchmarks page: zolotukhin.ai/zinc/benchmarks
Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, Q5_0, MXFP4, F16, F32
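Of the formats listed, Q8_0 is the simplest: blocks of 32 weights share one scale and each weight is stored as an int8. An illustrative round-trip of that block scheme — this mirrors the GGUF block layout conceptually, not ZINC's kernels:

```python
BLOCK = 32  # weights per Q8_0 block

def quantize_q8_0(xs):
    """One block: scale = max|x| / 127, each weight rounded to int8."""
    assert len(xs) == BLOCK
    d = max(abs(x) for x in xs) / 127.0 or 1.0
    qs = [max(-127, min(127, round(x / d))) for x in xs]
    return d, qs

def dequantize_q8_0(d, qs):
    return [d * q for q in qs]

xs = [(-1) ** i * i / 10.0 for i in range(BLOCK)]
d, qs = quantize_q8_0(xs)
ys = dequantize_q8_0(d, qs)
max_err = max(abs(x - y) for x, y in zip(xs, ys))  # bounded by half a scale step
```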
Quick Start
Prerequisites
| Tool | Install |
|---|---|
| Zig 0.15.2+ | ziglang.org/download |
| Vulkan loader + tools | apt install libvulkan-dev vulkan-tools (Linux) or brew install vulkan-loader vulkan-headers (macOS) |
| glslc (Linux) | apt install glslc |
| Bun for tests and the docs site | curl -fsSL https://bun.sh/install \| bash |
Important: On Linux with RDNA4, newer glslc releases can cause a large regression. Use the system package version.
Build ZINC
```
git clone https://github.com/zolotukhin/zinc.git
cd zinc

# Build the CLI and server
# macOS: shaders are skipped; Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFast
```
The binary is placed in zig-out/bin/zinc. Compiled SPIR-V shaders go to zig-out/share/zinc/shaders/.
Use `ReleaseFast` for any performance measurement or server deployment. A plain `zig build` (Debug) is not a fair throughput baseline.
Run a Preflight Check First
Before your first prompt, run --check. The target state is a clean READY [OK] run with no warnings.
```
# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check

# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check

# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf

# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen35-35b-a3b-q4k-xl
```
`--check` verifies:

- host environment and RDNA4-specific shell hints
- compiled shader assets
- Vulkan device discovery and the selected GPU
- GGUF metadata when you pass `-m /path/to/model.gguf`
- managed-model compatibility when you pass `--model-id <id>`
- estimated single-GPU VRAM fit for the current runtime
If --check reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.
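The VRAM-fit estimate can be approximated by hand. A back-of-envelope sketch — the formula and the example numbers are assumptions for illustration, not ZINC's actual heuristic:

```python
def fits_in_vram(n_params_b, bits_per_weight, ctx_len, n_layers, kv_dim,
                 vram_gb, overhead_gb=1.0):
    """Rough single-GPU fit check: weights + f16 KV cache + fixed overhead."""
    weights_gb = n_params_b * bits_per_weight / 8        # B params * bytes/weight
    # KV cache: 2 tensors (K and V) * ctx * layers * dim * 2 bytes (f16)
    kv_gb = 2 * ctx_len * n_layers * kv_dim * 2 / 1024**3
    return weights_gb + kv_gb + overhead_gb <= vram_gb

# Hypothetical 8B model at ~4.5 bits/weight on a 32 GB card
print(fits_in_vram(8, 4.5, ctx_len=8192, n_layers=36, kv_dim=1024, vram_gb=32))  # True
```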
Choosing Models
The README keeps the supported-model section concise and leaves the full managed-model workflow to the docs.
Use the website docs for model selection, cache management, and API details.
Run a Prompt
```
./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"
```

Run the Server

Start the server — no `--prompt` flag means server mode:

```
./zig-out/bin/zinc -m /path/to/model.gguf -p 8080
```
Then open http://localhost:8080/ in your browser for the built-in chat interface.
Use the API
ZINC exposes an OpenAI-compatible API at /v1.
For the actual request examples and SDK usage, use the website docs instead of the README:
- Running ZINC for CLI, server mode, and first-run examples
- Serving HTTP API for `curl`, OpenAI SDK examples, endpoint behavior, and response shapes
The built-in chat UI is served at /, the API is under /v1, and the health endpoint is /health.
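The shapes on the wire follow the standard OpenAI conventions. An offline sketch of the request body and of parsing a streamed response — the sample chunks below are synthetic illustrations, not captured from ZINC:

```python
import json

# Request body an OpenAI-compatible client POSTs to /v1/chat/completions
body = {
    "model": "qwen3-8b-q4k-m",  # catalog id from `zinc model list`
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

# Streamed responses arrive as server-sent events: `data: {json}` lines,
# terminated by `data: [DONE]`. Synthetic sample for illustration:
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]

def collect_deltas(lines):
    """Concatenate the content deltas from an OpenAI-style SSE stream."""
    out = []
    for line in lines:
        payload = line.removeprefix("data: ")
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content", ""))
    return "".join(out)

print(collect_deltas(sample))  # Hello!
```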
Development
For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).
Quick start:
```
zig build -Doptimize=ReleaseFast   # build
zig build test                     # run all tests
./zig-out/bin/zinc --check         # verify GPU/runtime setup
```
See also: CONTRIBUTING.md · Code of Conduct
Architecture
Benchmarks
The tables below are pulled directly from the latest published artifact at zolotukhin.ai/zinc/benchmarks. Latest refresh: 2026-05-02 (both targets). Numbers are median tok/s across the suite's runs on a fresh boot, ZINC and llama.cpp on the same hardware, weights, and prompt.
AMD RDNA4 — Radeon AI PRO R9700 (Vulkan)
| Model | ZINC prefill | llama.cpp prefill | ZINC % | ZINC decode | llama.cpp decode | ZINC % |
|---|---|---|---|---|---|---|
| Qwen 3 8B (dense) | 115.85 | 84.73 | 137% | 60.52 | 109.59 | 55% |
| Qwen 3.5 35B A3B (MoE+SSM) | 74.56 | 168.13 | 44% | 103.72 | 96.76 | 107% |
| Qwen 3.6 35B A3B (MoE+SSM) | 72.20 | 179.97 | 40% | 99.40 | 104.48 | 95% |
| Gemma 4 26B A4B (MoE) | 56.03 | 334.08 | 17% | 102.82 | 106.81 | 96% |
| Gemma 4 31B (dense) | 44.35 | 135.78 | 33% | 44.53 | 32.17 | 138% |
| GPT-OSS 20B (MoE) | 85.64 | 267.35 | 32% | 83.41 | 181.72 | 46% |
Apple Silicon M4 Max (Metal)
| Model | ZINC prefill | llama.cpp prefill | ZINC decode | llama.cpp decode | ZINC % decode |
|---|---|---|---|---|---|
| Qwen 3.5 35B A3B (MoE+SSM) | 29.5 | 124.42 | 46.36 | 77.70 | 60% |
| Qwen 3.6 35B A3B (MoE+SSM) | 25.2 | 130.32 | 38.13 | 79.88 | 48% |
| Gemma 4 12B (MoE) | 27.4 | 279.51 | 43.68 | 92.17 | 47% |
| Qwen 3 8B (dense) | 24.7 | 74.67 | 32.74 | 83.67 | 39% |
| GPT-OSS 20B (MoE) | 20.0 | 177.01 | 25.98 | 119.85 | 22% |
| Gemma 4 31B (dense) | 19.5 | 73.28 | 5.16 | 25.14 | 21% |
Where we stand vs llama.cpp
- Ahead of llama.cpp: Qwen 3 8B prefill on RDNA4 (1.37x), Qwen 3.5 35B-A3B decode on RDNA4 (1.07x), Gemma 4 31B dense decode on RDNA4 (1.38x).
- Within striking distance (≥90% of llama.cpp): Qwen 3.6 35B-A3B decode on RDNA4 (95%), Gemma 4 26B A4B decode on RDNA4 (96%).
- Active gap: Qwen 3.5/3.6 35B-A3B prefill on RDNA4 sits at ~40% of llama.cpp because the entire batched prefill path is gated off for any model with `n_experts > 0` or `ssm_d_inner > 0`. The wire-up that closes this is documented in the cycle-50 field report.
- In flight: Metal prefill is uniformly bottlenecked (20–30 tok/s) because the per-token Metal path doesn't amortize weight reads across prompt tokens. Metal decode is improving fastest on the larger A3B models (46.4 tok/s on Qwen 3.5 35B-A3B); the Gemma 4 31B decode floor at 5.16 tok/s is the active optimization target.
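The ZINC % columns in the tables above are simply the ratio of the two medians, rounded to a whole percent. Checking a few RDNA4 rows:

```python
def zinc_pct(zinc_toks, llama_toks):
    """ZINC throughput as a rounded percentage of llama.cpp's."""
    return round(100 * zinc_toks / llama_toks)

# Rows from the RDNA4 table above
assert zinc_pct(115.85, 84.73) == 137   # Qwen 3 8B prefill
assert zinc_pct(60.52, 109.59) == 55    # Qwen 3 8B decode
assert zinc_pct(44.53, 32.17) == 138    # Gemma 4 31B dense decode
```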
For local benchmark commands, harnesses, and methodology, see the public benchmarks page: zolotukhin.ai/zinc/benchmarks
Current Status
| Component | Status |
|---|---|
| Vulkan infrastructure | Done |
| GGUF parser + model loader | Done |
| GPU detection (RDNA3/4) | Done |
| Native BPE tokenizer (from GGUF) | Done |
| GLSL compute shaders (16) | Done |
| Compute graph + architecture builders | Done |
| Forward pass (decode loop) | Working — 103.72 tok/s on RDNA4 and 46.36 tok/s on Apple M4 Max for Qwen 3.5 35B-A3B |
| Forward pass (prefill loop) | Working — 70+ tok/s on RDNA4 long-context for Qwen 3.5/3.6 35B-A3B; Metal prefill in flight |
| GPU SSM shaders + cmd batching | Done — RDNA decode is 99+ tok/s on Qwen 3.5/3.6 35B and 83 tok/s on GPT-OSS 20B |
| HTTP server + OpenAI API | Done — Qwen 35B-A3B raw API ~100 tok/s on RDNA4 and ~46 tok/s on Apple M4 Max |
| Continuous batching | Phase 4 |
| TurboQuant KV compression | Phase 5 |
Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.
Next Steps
The next push is closing the prefill gap to llama.cpp on hybrid MoE-plus-SSM models:
- Wire
mul_mm_q4kinto SSM proj prefill — the tiled Q4_K GEMM is in the tree but only routes the language-model head where N=1 wastes the BN tile. The SSM proj fires 4 DMMVs per layer per token; batching them across the prompt is the deferred cycle-40 refactor. - Port the
gated_delta_net.cublock-resident state pattern — today every prompt token re-reads and re-writes the full 2 MB SSM state per layer. Loading state once per workgroup and walking all tokens inside the kernel collapses 18 GB of state DRAM traffic per prefill to 4 MB. - Open
canUseBatchedPrefillRdnafor MoE+SSM hybrids — the entire batched prefill body (flash_attn_batched,rope_batched,dmmv_q4k_batch_kpar) is gated off whenn_experts > 0orssm_d_inner > 0. Once items 1 and 2 land, dropping the gate activates Br-row attention batching on the same workload. - Land the cycle-50 micro-restructure pattern on MoE inner loops — wider threads-per-row plus halved per-thread register slabs lifted ssm_delta_net by 2.7%. The same shape change is untried on
dmmv_q4k_moe_kparanddmmv_q4k_moe_fused_down_acc. - Ship batched Metal prefill across the catalog — the Gemma path landed; Qwen 3.5/3.6 and GPT-OSS still route through the per-token Metal path that produces the 0.2–10 tok/s prefill numbers above.
The full plan and 50-cycle field report is in the cycle-50 blog post.
License
MIT
