Your Mac has a GPU. Your Mac has RAM. Why are you paying someone else to think?
This guide gets you a fully local agentic coding setup: Claude Code talking to Qwen 3.5-35B-A3B via llama.cpp, all running on your Apple Silicon Mac. No API keys. No cloud. No surprise invoices. Just you, your M-series chip, and 35 billion parameters doing your bidding on localhost.
Based on this article.
TL;DR — Just Run the Scripts
./install.sh    # one-time setup: Homebrew, Node, llama.cpp, Claude Code, settings.json, .zshrc
source ~/.zshrc
./run.sh        # start the model server, then open a new terminal and type: claude
Both scripts are idempotent — run them as many times as you want, they won't break anything or duplicate .zshrc entries or overwrite existing settings.json.
How It Works
You --> Claude Code --> llama-server (localhost:8131) --> Qwen 3.5-35B-A3B (GGUF on Metal)
Claude Code thinks it's talking to Anthropic's API. It's actually talking to a quantized open-weight model running on your Mac's GPU. Same agentic workflow, same tool use, zero egress. It's beautiful.
Prerequisites
You're self-hosting a 35B parameter model. You'll need some hardware.
| What | Minimum | Ideal |
|---|---|---|
| Mac | Apple Silicon (M1 / M2 / M3 / M4) | Any Pro / Max variant |
| Unified Memory | 24 GB | 64 GB |
| Free Disk | ~25 GB | ~40 GB (room to try other models) |
| macOS | 13.0+ (Ventura) | Latest |
| Node.js | 18+ | 22+ |
| Homebrew | Installed | You're self-hosting LLMs, of course you have Homebrew |
| Patience | First model download is ~20 GB | Go make coffee |
16 GB RAM? You can still play — just use -c 32768 instead of -c 131072 in the server command. It works, just with less context. More on that below.
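Want to sanity-check your machine against that table before committing to a 20 GB download? Stock macOS tooling (plus Node) covers it:

```bash
# Quick pre-flight check
sysctl -n machdep.cpu.brand_string                     # chip, e.g. "Apple M2 Pro"
echo "$(( $(sysctl -n hw.memsize) / 1073741824 )) GB unified memory"
sw_vers -productVersion                                # macOS version, want 13.0+
df -h / | awk 'NR==2 {print $4 " free on /"}'          # free disk
node --version                                         # want v18+
```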
Step 1 — Install llama.cpp
This is the engine. It serves your model as an OpenAI/Anthropic-compatible API over HTTP. Metal GPU acceleration is enabled by default on Apple Silicon — no flags needed.
Option A — Homebrew (recommended, you already have it):
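```bash
brew install llama.cpp
```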
Option B — Build from source (for the "I compile my own kernels" crowd):
brew install cmake ninja git
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -B build -G Ninja -DGGML_METAL=ON && cmake --build build --config Release -j 8
If you built from source, add llama-server to your PATH:
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc
Verify it's working:
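```bash
which llama-server      # should print a path
llama-server --version  # should print version and build info
```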
Step 2 — Start the Model Server
This downloads Qwen 3.5-35B-A3B (~20 GB, Q4_K_M quantization) from HuggingFace on first run. Subsequent starts use the cached model.
llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--port 8131 \
-ngl 999 \
-t 4 \
-c 131072 \
-b 512 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--keep 1024 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--swa-full \
--no-context-shift \
--chat-template-kwargs '{"enable_thinking": false}' \
--mlock \
--no-mmap

Leave this terminal open. First startup takes 10–30 seconds. Go stare at Activity Monitor if you want — watching 20 GB of model weights load into unified memory is oddly satisfying.
Tune -t for your chip
The -t flag sets CPU threads for non-GPU work. Match it to your performance cores:
| Your Mac | Set -t to |
|---|---|
| M1 / M2 / M3 / M4 (base) | 4 |
| M1 / M2 / M3 / M4 Pro / Max | 8 |
| M2 / M3 Ultra | 16 |
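Not sure how many performance cores you have? On Apple Silicon, macOS reports the P-core cluster as perflevel0:

```bash
sysctl -n hw.perflevel0.physicalcpu   # number of performance cores
```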
On 16 GB RAM
Swap the context size flag. Everything else stays the same:
# use this instead of -c 131072
-c 32768

You lose long-context capability but the model still works great for normal coding tasks.
What all these flags do
| Flag | Why it matters |
|---|---|
| `-ngl 999` | Offload every layer to the Metal GPU. This is the single biggest speedup. Without it, your CPU does all the work and your GPU sits there looking pretty. |
| `-t 4` | CPU threads for non-offloaded work. Match to your P-core count (table above). |
| `-b 512` | Prompt batch size. 512 beats 2048 on M2 base in benchmarks. Counterintuitive, but true. |
| `--swa-full` | The hidden performance flag. Qwen 3.5 uses sliding window attention. Without this, every follow-up request reprocesses the entire prompt from scratch. With it, prompt caching works. The difference is ~10x on follow-up latency. |
| `--no-context-shift` | Required when using `--swa-full`. Context shifting is incompatible with SWA. |
| `--chat-template-kwargs '{"enable_thinking": false}'` | Disables the model's internal chain-of-thought. In agentic workflows, those thinking tokens are wasted — Claude Code manages its own reasoning. |
| `--cache-type-k/v q8_0` | Quantize the KV cache. Near-zero quality loss, measurable throughput improvement. Free lunch. |
| `--keep 1024` | Pin the system prompt in cache. Claude Code sends a chunky system prompt — no point re-processing it every turn. |
| `--mlock` | Lock the model in RAM. Prevents macOS from deciding your model weights are a great candidate for swap. |
| `--no-mmap` | Don't memory-map the model file. More stable on macOS, especially under memory pressure. |
| `-fa on` | Flash attention. Faster prompt evaluation. |
Step 3 — Verify the Server
Open a new terminal tab (the other one is busy serving a 35B model) and poke the API:
curl http://localhost:8131/v1/models
You should see JSON with the model ID. If you get connection refused, the model is still loading — give it another 10 seconds.
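For an end-to-end smoke test, you can also hit the OpenAI-compatible chat endpoint llama-server exposes and confirm tokens actually come back. The model name is whatever /v1/models reported; the one below assumes the Step 5 settings:

```bash
curl http://localhost:8131/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```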
Step 4 — Install Claude Code
You probably already have this. If not:
Option A — Native installer (auto-updates):
curl -fsSL https://claude.ai/install.sh | bash

Option B — npm:
npm install -g @anthropic-ai/claude-code
Option C — Homebrew:
brew install --cask claude-code
Homebrew conflict? If you see Error: It seems there is already a Binary at '/opt/homebrew/bin/claude', run rm /opt/homebrew/bin/claude first, then reinstall.
Verify:
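```bash
claude --version
```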
Step 5 — Configure ~/.claude/settings.json
This is the proper way to configure Claude Code. The env block sets environment variables for every session (no .zshrc needed), permissions pre-approves common tools so the model doesn't ask for confirmation on every ls, and a few flags disable telemetry and features that don't make sense for local inference.
mkdir -p ~/.claude   # make sure the config directory exists
cat << 'SETTINGS' > ~/.claude/settings.json
{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8131",
    "ANTHROPIC_AUTH_TOKEN": "local",
    "ANTHROPIC_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "unsloth/qwen3.5-35b-a3b",
    "CLAUDE_CODE_SUBAGENT_MODEL": "unsloth/qwen3.5-35b-a3b",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "128000",
    "DISABLE_PROMPT_CACHING": "1",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_TELEMETRY": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_NON_ESSENTIAL_MODEL_CALLS": "1"
  },
  "permissions": {
    "allow": [
      "Bash(git *)", "Bash(npm *)", "Bash(npx *)", "Bash(node *)",
      "Bash(python *)", "Bash(python3 *)", "Bash(pip *)", "Bash(pip3 *)",
      "Bash(brew *)", "Bash(ls *)", "Bash(cat *)", "Bash(head *)",
      "Bash(tail *)", "Bash(find *)", "Bash(grep *)", "Bash(rg *)",
      "Bash(mkdir *)", "Bash(cp *)", "Bash(mv *)", "Bash(rm *)",
      "Bash(echo *)", "Bash(curl *)", "Bash(which *)", "Bash(env *)",
      "Bash(cd *)", "Bash(pwd)", "Bash(wc *)", "Bash(sort *)",
      "Bash(uniq *)", "Bash(diff *)", "Bash(chmod *)", "Bash(touch *)",
      "Bash(sed *)", "Bash(awk *)", "Bash(xargs *)", "Bash(tee *)",
      "Read", "Edit", "Write", "Glob", "Grep", "WebFetch", "WebSearch"
    ],
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)"
    ]
  }
}
SETTINGS
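If the JSON is malformed, Claude Code may quietly ignore your overrides, so a quick validation pass after writing the file doesn't hurt:

```bash
python3 -m json.tool ~/.claude/settings.json > /dev/null && echo "settings.json is valid JSON"
```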
What these settings do
| Setting | Why |
|---|---|
| `env.ANTHROPIC_BASE_URL` | Points Claude Code at your local llama-server instead of Anthropic's cloud. |
| `env.ANTHROPIC_AUTH_TOKEN` | Any non-empty string. Satisfies the auth check without a real API key. |
| `env.*_MODEL` | Maps every model tier (Opus, Sonnet, Haiku, subagents) to your local Qwen 3.5. |
| `env.CLAUDE_CODE_MAX_OUTPUT_TOKENS` | Allows longer responses. Default is 32K — local models have no billing, so crank it. |
| `env.DISABLE_PROMPT_CACHING` | Prompt caching is an Anthropic API feature. Your local server handles its own caching via `--swa-full`. |
| `env.DISABLE_AUTOUPDATER` | You're running local. Auto-updates would just add network calls you don't need. |
| `env.DISABLE_TELEMETRY` | Running local to keep things private? Then don't phone home. |
| `env.DISABLE_ERROR_REPORTING` | Same reasoning. Your errors, your business. |
| `env.DISABLE_NON_ESSENTIAL_MODEL_CALLS` | Stops Claude Code from making extra model calls for things like spinner text. Every token counts on local inference. |
| `permissions.allow` | Pre-approves common shell commands and all file tools. Without this, Claude Code asks for permission on every single tool call. Gets old fast. |
| `permissions.deny` | Keeps .env and secrets off-limits, because even local models shouldn't read your credentials. |
Already have a settings.json? The install.sh script detects existing content and backs it up before writing. Or just merge the env and permissions blocks into your existing file manually.
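If you would rather script the merge, here is a hedged jq sketch. new-settings.json is a hypothetical file holding just the env and permissions blocks from above, and note that jq's * operator replaces arrays like permissions.allow outright rather than concatenating them:

```bash
cp ~/.claude/settings.json ~/.claude/settings.json.bak      # back up first
jq -s '.[0] * .[1]' ~/.claude/settings.json.bak new-settings.json > ~/.claude/settings.json
```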
Step 6 — Shell configuration (backup + cclocal helper)
The settings.json above handles everything Claude Code needs. But having the env vars in .zshrc too gives you a fallback and lets the cclocal helper function work:
cat << 'EOF' >> ~/.zshrc
# Claude Code + Local Qwen 3.5 — because localhost > cloud
export LM_MODEL="unsloth/qwen3.5-35b-a3b"
export ANTHROPIC_BASE_URL="http://127.0.0.1:8131"
export ANTHROPIC_AUTH_TOKEN="local"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=128000
export ANTHROPIC_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_OPUS_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_SONNET_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="$LM_MODEL"
export CLAUDE_CODE_SUBAGENT_MODEL="$LM_MODEL"
EOF
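Reload your shell so the exports take effect in the current terminal:

```bash
source ~/.zshrc
```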
Step 7 — Launch Claude Code
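With the server from Step 2 still running, open a new terminal (so settings.json and your fresh .zshrc are picked up) and start Claude Code:

```bash
claude
```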
Or, if you prefer a no-config one-liner (works even without settings.json or .zshrc changes):
ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude
You should see unsloth/qwen3.5-35b-a3b as the active model. Type something. Watch your llama-server terminal light up with inference logs. Smile.
Daily Driver: Quick Start Script
Once everything is installed, this is your daily workflow. One command, two services:
llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --port 8131 \
  -ngl 999 \
  -t 4 \
  -c 131072 \
  -b 512 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --swa-full \
  --no-context-shift \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --mlock \
  --no-mmap &

sleep 15 && ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude
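The sleep 15 is a guess at load time. If you want the script to wait until the server is actually ready, llama-server exposes a /health endpoint you can poll instead; a minimal sketch:

```bash
# start llama-server in the background as above, then:
until curl -sf http://127.0.0.1:8131/health > /dev/null; do
  sleep 2   # keep polling until the model has finished loading
done
ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude
```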
Shell Helper: cclocal
For the refined developer who wants a clean command:
cat << 'FUNC' >> ~/.zshrc
cclocal() {
  local port=8131
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  ANTHROPIC_BASE_URL="http://127.0.0.1:${port}" \
  ANTHROPIC_AUTH_TOKEN="local" \
  claude "$@"
}
FUNC
Now you can:
cclocal        # Qwen 3.5 on port 8131
cclocal 8127   # Qwen3-Coder on port 8127
cclocal 8129   # GLM-4.7-Flash on port 8129
Other Models Worth Trying
Already running Qwen 3.5 and want more? You can run multiple models on different ports and switch with cclocal <port>.
| Model | Port | Size | Good For | Command |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B | 8127 | ~30 GB | Pure coding. If you only write code, this might be better than Qwen 3.5. | llama-server --fim-qwen-30b-default --port 8127 |
| GLM-4.7-Flash | 8129 | ~18 GB | Lighter weight, still capable. Good if you're on 24 GB. | llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --port 8129 -c 131072 -b 2048 -ub 1024 --parallel 1 -fa on --jinja --chat-template-file ~/llama.cpp/models/templates/glm-4.jinja |
| GPT-OSS-20B | 8123 | ~20 GB | Fast baseline. 17–38 tok/s on M1 Max. | llama-server --gpt-oss-20b-default --port 8123 |
| Qwen3-Coder-Next-80B-A3B | 8130 | ~46 GB | SOTA coder. Needs 64 GB RAM. Worth it if you have the metal. | llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL --port 8130 -c 131072 -b 2048 -ub 1024 --parallel 1 -fa on --jinja |
Troubleshooting
Things that might go wrong and how to fix them without rage-quitting.
| Symptom | Likely Cause | Fix |
|---|---|---|
| `connection refused` on curl | Server still loading | Wait 30s, retry. First load is slow. |
| Painfully slow responses | Missing `--swa-full` or `-ngl 999` | Restart llama-server with both flags. This is the #1 gotcha. |
| Claude Code uses the cloud model | Env vars not set | Re-export them in the terminal you're using. Check with `echo $ANTHROPIC_BASE_URL`. |
| `failed to find a memory slot` | Context too large for available RAM | Use `--parallel 1` and/or reduce `-c` to 32768. |
| Auth errors from Claude Code | Missing auth token | `export ANTHROPIC_AUTH_TOKEN=local` — any non-empty string works. |
| Wrong model responding | Model ID mismatch | Run `curl localhost:8131/v1/models` and compare with `echo $LM_MODEL`. |
| First request takes forever | Cold start, model loading into memory | Normal. 10–30 seconds. Subsequent requests are fast. |
| System swapping / beachball | Model doesn't fit in RAM | Reduce `-c`, or try a smaller model like GLM-4.7-Flash (~18 GB). |
| Homebrew binary conflict | Previous install left `/opt/homebrew/bin/claude` | `rm /opt/homebrew/bin/claude && brew reinstall --cask claude-code` |
Stopping
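If the server is running in the background (as in the quick-start script above), stop it from any terminal; a simple approach, assuming nothing else on your machine matches the process name:

```bash
pkill -f llama-server
```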
Or Ctrl+C in the llama-server terminal. Your model weights aren't going anywhere — next launch picks up the cached GGUF instantly.
Credits
- Tattva Tarang's original article for the core setup
- Konstantinos' LM Studio walkthrough for environment variable details
- claude-code-tools docs for the flag reference and model catalog
- llama.cpp for making all of this possible on consumer hardware
- Your Mac, for being a surprisingly good inference box