zinc: Zig INferenCe Engine — Local LLM inference on AMD GPUs and Apple Silicon




Local LLM inference on AMD GPUs and Apple Silicon — no ROCm, no MLX, one binary.

[Demo: ZINC chat streaming inference on AMD RDNA4 — a 35B parameter model running locally, Zig + Vulkan/Metal, no ROCm, no MLX]

Supported Platforms

| Platform | GPU | Backend | Status |
|---|---|---|---|
| Linux | AMD RDNA4 (RX 9070, AI PRO R9700) | Vulkan | Primary — hand-tuned shaders |
| Linux | AMD RDNA3 (RX 7900 XTX, etc.) | Vulkan | Supported |
| macOS | Apple Silicon (M1, M2, M3, M4, M5) | Metal | Supported — native MSL shaders |

Start Here

Works the same on Linux (AMD GPU) and macOS (Apple Silicon):

git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast

# On RDNA4 Linux, enable cooperative matrix
export RADV_PERFTEST=coop_matrix  # skip on macOS

# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check

# See which models fit this machine
./zig-out/bin/zinc model list

# Download a model
./zig-out/bin/zinc model pull qwen3-8b-q4k-m

# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id qwen3-8b-q4k-m --prompt "Hello" --chat

# Or open the chat UI in your browser
./zig-out/bin/zinc chat

The server exposes the built-in chat UI at / and an OpenAI-compatible API at /v1.
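
As a minimal sketch of that API (assuming the standard OpenAI chat-completions route and that the catalog id doubles as the model name — check the Serving HTTP API docs for the exact request shapes):

# hypothetical request against a local ZINC server on port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4k-m",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'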

What Works Today

  • Single-stream CLI inference on the validated models listed below
  • OpenAI-compatible /v1 API with streaming
  • Built-in browser chat UI with thinking mode support
  • Managed model workflow: list, pull, use, active, rm (see the sketch after this list)
  • zinc chat — start server and open browser in one command
  • AMD path: RDNA4-tuned Vulkan shaders (wave64, cooperative matrix, fused ops)
  • Apple Silicon path: native Metal shaders (MSL, zero-copy mmap, simdgroup ops)
  • Auto-detection: ZINC picks the right backend (Vulkan or Metal) at build time
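
A plausible end-to-end pass through that managed workflow, assuming each verb is a zinc model subcommand as listed above (exact flags may differ; see the docs):

# list what fits this machine, pin one model, and clean up later
./zig-out/bin/zinc model list
./zig-out/bin/zinc model pull qwen3-8b-q4k-m
./zig-out/bin/zinc model use qwen3-8b-q4k-m
./zig-out/bin/zinc model active                # confirm the selection
./zig-out/bin/zinc model rm qwen3-8b-q4k-m     # reclaim disk space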

Still Rough

  • Continuous batching and multi-tenant serving are still roadmap work
  • The supported-model list is intentionally narrow
  • Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)

The Problem

Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:

  • AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
  • Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).

The Solution

ZINC builds an inference engine tuned for the hardware you actually have.

Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.

One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.

Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.
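
For example, the official OpenAI SDKs read their target URL and key from environment variables, so an existing app can often be repointed without code changes (the variable names below are the SDKs' convention, not ZINC flags):

# point an existing OpenAI-SDK client at a local ZINC server
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local-placeholder   # placeholder; a local server presumably ignores it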

Supported Models

The supported-model set matches the current managed model catalog (zinc model list shows what fits your machine), not a broader wishlist.

Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, Q5_0, MXFP4, F16, F32

Quick Start

Prerequisites

  • Zig 0.15.2+: ziglang.org/download
  • Vulkan loader + tools: apt install libvulkan-dev vulkan-tools (Linux) or brew install vulkan-loader vulkan-headers (macOS)
  • glslc (Linux): apt install glslc
  • Bun (for tests and the docs site): curl -fsSL https://bun.sh/install | bash

Important: On Linux with RDNA4, newer glslc releases can cause a large performance regression. Use the system package version.
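
A quick way to confirm the toolchain before building (standard version/summary flags for these tools):

vulkaninfo --summary   # Linux: confirms the Vulkan loader sees your GPU
glslc --version        # Linux: shader compiler present (prefer the system package on RDNA4)
zig version            # expect 0.15.2 or newer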

Build ZINC

git clone https://github.com/zolotukhin/zinc.git
cd zinc

# Build the CLI and server
# macOS: shaders are skipped
# Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFast

The binary is placed in zig-out/bin/zinc. Compiled SPIR-V shaders go to zig-out/share/zinc/shaders/. Use ReleaseFast for any performance measurement or server deployment. Plain zig build is not a fair throughput baseline.

Run a Preflight Check First

Before your first prompt, run --check. The target state is a clean READY [OK] run with no warnings.

# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check

# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check

# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf

# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen35-35b-a3b-q4k-xl

--check verifies:

  • host environment and RDNA4-specific shell hints
  • compiled shader assets
  • Vulkan device discovery and the selected GPU
  • GGUF metadata when you pass -m /path/to/model.gguf
  • managed-model compatibility when you pass --model-id <id>
  • estimated single-GPU VRAM fit for the current runtime

If --check reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.
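
If you script your setup, one conservative pattern is to gate everything else on a clean preflight. This assumes --check exits nonzero on failure, which is the usual CLI convention but worth verifying against the docs:

# only start serving if the preflight passes
./zig-out/bin/zinc --check && ./zig-out/bin/zinc chat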

Choosing Models

The README keeps the supported-model section concise and leaves the full managed-model workflow to the docs.

Use the website docs for model selection, cache management, and API details.

Run a Prompt

./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"

Run the Server

Start the server — no --prompt flag means server mode:

./zig-out/bin/zinc -m /path/to/model.gguf -p 8080

Then open http://localhost:8080/ in your browser for the built-in chat interface.
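
The managed catalog works in server mode too; combining flags already shown in this README, serving a pulled model by id should look like:

./zig-out/bin/zinc --model-id qwen3-8b-q4k-m -p 8080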

Use the API

ZINC exposes an OpenAI-compatible API at /v1.

For the actual request examples and SDK usage, use the website docs instead of the README:

  • Running ZINC for CLI, server mode, and first-run examples
  • Serving HTTP API for curl, OpenAI SDK examples, endpoint behavior, and response shapes

The built-in chat UI is served at /, the API is under /v1, and the health endpoint is /health.
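
A minimal smoke test against those routes (the response bodies aren't documented in this README, so treat them as opaque; /v1/models is a guess from the OpenAI convention, not a documented ZINC endpoint):

curl http://localhost:8080/health      # liveness
curl http://localhost:8080/v1/models   # model listing, if the compatible surface includes it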

Development

For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).

Quick start:

zig build -Doptimize=ReleaseFast   # build
zig build test                      # run all tests
./zig-out/bin/zinc --check          # verify GPU/runtime setup

See also: CONTRIBUTING.md · Code of Conduct

Architecture

[Architecture diagram]

Benchmarks

The tables below are pulled directly from the latest published artifact at zolotukhin.ai/zinc/benchmarks. Latest refresh: 2026-05-02 (both targets). Numbers are median tok/s across the suite's runs on a fresh boot, with ZINC and llama.cpp measured on the same hardware, weights, and prompt. The ZINC % columns express ZINC's throughput as a percentage of llama.cpp's (for example, 115.85 / 84.73 ≈ 137%).

AMD RDNA4 — Radeon AI PRO R9700 (Vulkan)

| Model | ZINC prefill | llama.cpp prefill | ZINC % (prefill) | ZINC decode | llama.cpp decode | ZINC % (decode) |
|---|---|---|---|---|---|---|
| Qwen 3 8B (dense) | 115.85 | 84.73 | 137% | 60.52 | 109.59 | 55% |
| Qwen 3.5 35B A3B (MoE+SSM) | 74.56 | 168.13 | 44% | 103.72 | 96.76 | 107% |
| Qwen 3.6 35B A3B (MoE+SSM) | 72.20 | 179.97 | 40% | 99.40 | 104.48 | 95% |
| Gemma 4 26B A4B (MoE) | 56.03 | 334.08 | 17% | 102.82 | 106.81 | 96% |
| Gemma 4 31B (dense) | 44.35 | 135.78 | 33% | 44.53 | 32.17 | 138% |
| GPT-OSS 20B (MoE) | 85.64 | 267.35 | 32% | 83.41 | 181.72 | 46% |

Apple Silicon M4 Max (Metal)

| Model | ZINC prefill | llama.cpp prefill | ZINC decode | llama.cpp decode | ZINC % (decode) |
|---|---|---|---|---|---|
| Qwen 3.5 35B A3B (MoE+SSM) | 29.5 | 124.42 | 46.36 | 77.70 | 60% |
| Qwen 3.6 35B A3B (MoE+SSM) | 25.2 | 130.32 | 38.13 | 79.88 | 48% |
| Gemma 4 12B (MoE) | 27.4 | 279.51 | 43.68 | 92.17 | 47% |
| Qwen 3 8B (dense) | 24.7 | 74.67 | 32.74 | 83.67 | 39% |
| GPT-OSS 20B (MoE) | 20.0 | 177.01 | 25.98 | 119.85 | 22% |
| Gemma 4 31B (dense) | 19.5 | 73.28 | 5.16 | 25.14 | 21% |

Where we stand vs llama.cpp

  • Ahead of llama.cpp: Qwen 3 8B prefill on RDNA4 (1.37x), Qwen 3.5 35B-A3B decode on RDNA4 (1.07x), Gemma 4 31B dense decode on RDNA4 (1.38x).
  • Within striking distance (≥90% of llama.cpp): Qwen 3.6 35B-A3B decode on RDNA4 (95%), Gemma 4 26B A4B decode on RDNA4 (96%).
  • Active gap: Qwen 3.5/3.6 35B-A3B prefill on RDNA4 sits at ~40% of llama.cpp because the entire batched prefill path is gated off for any model with n_experts > 0 or ssm_d_inner > 0. The wire-up that closes this is documented in the cycle-50 field report.
  • In flight: Metal prefill is uniformly bottlenecked (20–30 tok/s) because the per-token Metal path doesn't amortize weight reads across prompt tokens. Metal decode is improving fastest on the larger A3B models (46.4 tok/s on Qwen 3.5 35B-A3B); the Gemma 4 31B decode floor at 5.16 tok/s is the active optimization target.

For local benchmark commands, harnesses, and methodology, see the benchmarks page at zolotukhin.ai/zinc/benchmarks.

Current Status

| Component | Status |
|---|---|
| Vulkan infrastructure | Done |
| GGUF parser + model loader | Done |
| GPU detection (RDNA3/4) | Done |
| Native BPE tokenizer (from GGUF) | Done |
| GLSL compute shaders (16) | Done |
| Compute graph + architecture builders | Done |
| Forward pass (decode loop) | Working — 103.72 tok/s on RDNA4 and 46.36 tok/s on Apple M4 Max for Qwen 3.5 35B-A3B |
| Forward pass (prefill loop) | Working — 70+ tok/s on RDNA4 long-context for Qwen 3.5/3.6 35B-A3B; Metal prefill in flight |
| GPU SSM shaders + cmd batching | Done — RDNA decode is 99+ tok/s on Qwen 3.5/3.6 35B and 83 tok/s on GPT-OSS 20B |
| HTTP server + OpenAI API | Done — Qwen 35B-A3B raw API ~100 tok/s on RDNA4 and ~46 tok/s on Apple M4 Max |
| Continuous batching | Phase 4 |
| TurboQuant KV compression | Phase 5 |

Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.

Next Steps

The next push is closing the prefill gap to llama.cpp on hybrid MoE-plus-SSM models:

  1. Wire mul_mm_q4k into SSM proj prefill — the tiled Q4_K GEMM is in the tree but currently routes only the language-model head, where N=1 wastes the BN tile. The SSM proj fires 4 DMMVs per layer per token; batching them across the prompt is the deferred cycle-40 refactor.
  2. Port the gated_delta_net.cu block-resident state pattern — today every prompt token re-reads and re-writes the full 2 MB SSM state per layer. Loading state once per workgroup and walking all tokens inside the kernel collapses 18 GB of state DRAM traffic per prefill to 4 MB.
  3. Open canUseBatchedPrefillRdna for MoE+SSM hybrids — the entire batched prefill body (flash_attn_batched, rope_batched, dmmv_q4k_batch_kpar) is gated off when n_experts > 0 or ssm_d_inner > 0. Once items 1 and 2 land, dropping the gate activates Br-row attention batching on the same workload.
  4. Land the cycle-50 micro-restructure pattern on MoE inner loops — wider threads-per-row plus halved per-thread register slabs lifted ssm_delta_net by 2.7%. The same shape change is untried on dmmv_q4k_moe_kpar and dmmv_q4k_moe_fused_down_acc.
  5. Ship batched Metal prefill across the catalog — the Gemma path landed; Qwen 3.5/3.6 and GPT-OSS still route through the per-token Metal path that produces the 20–30 tok/s prefill numbers above.

The full plan and 50-cycle field report are in the cycle-50 blog post.

License

MIT