How to Run Qwen3.5 Locally With Claude Code (No API Bills, Full Agentic Coding)


Your Mac has a GPU. Your Mac has RAM. Why are you paying someone else to think?

This guide gets you a fully local agentic coding setup: Claude Code talking to Qwen 3.5-35B-A3B via llama.cpp, all running on your Apple Silicon Mac. No API keys. No cloud. No surprise invoices. Just you, your M-series chip, and 35 billion parameters doing your bidding on localhost.

Based on this article.


TL;DR — Just Run the Scripts

./install.sh   # one-time setup: Homebrew, Node, llama.cpp, Claude Code, settings.json, .zshrc
source ~/.zshrc
./run.sh       # start the model server, then open a new terminal and type: claude

Both scripts are idempotent — run them as many times as you want; they won't break anything, duplicate .zshrc entries, or overwrite an existing settings.json.
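
If you're curious how that works, the usual pattern is to guard appends with a marker check and back up files before writing. A minimal sketch of the idea (not necessarily the exact logic install.sh uses):

# Append the env block only if its marker comment isn't already in ~/.zshrc
# (the marker matches the comment used in Step 6).
if ! grep -qF '# Claude Code + Local Qwen 3.5' ~/.zshrc 2>/dev/null; then
  cat << 'EOF' >> ~/.zshrc

# Claude Code + Local Qwen 3.5 — because localhost > cloud
export ANTHROPIC_BASE_URL="http://127.0.0.1:8131"
export ANTHROPIC_AUTH_TOKEN="local"
EOF
fi

# Never clobber an existing settings.json: back it up first.
[ -f ~/.claude/settings.json ] && cp ~/.claude/settings.json ~/.claude/settings.json.bak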


How It Works

You  -->  Claude Code  -->  llama-server (localhost:8131)  -->  Qwen 3.5-35B-A3B (GGUF on Metal)

Claude Code thinks it's talking to Anthropic's API. It's actually talking to a quantized open-weight model running on your Mac's GPU. Same agentic workflow, same tool use, zero egress. It's beautiful.


Prerequisites

You're self-hosting a 35B parameter model. You'll need some hardware.

| What | Minimum | Ideal |
| --- | --- | --- |
| Mac | Apple Silicon (M1 / M2 / M3 / M4) | Any Pro / Max variant |
| Unified Memory | 24 GB | 64 GB |
| Free Disk | ~25 GB | ~40 GB (room to try other models) |
| macOS | 13.0+ (Ventura) | Latest |
| Node.js | 18+ | 22+ |
| Homebrew | Installed | You're self-hosting LLMs, of course you have Homebrew |
| Patience | First model download is ~20 GB | Go make coffee |

16 GB RAM? You can still play — just use -c 32768 instead of -c 131072 in the server command. It works, just with less context. More on that below.


Step 1 — Install llama.cpp

This is the engine. It serves your model as an OpenAI/Anthropic-compatible API over HTTP. Metal GPU acceleration is enabled by default on Apple Silicon — no flags needed.

Option A — Homebrew (recommended, you already have it):

brew install llama.cpp

Option B — Build from source (for the "I compile my own kernels" crowd):

brew install cmake ninja git
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -B build -G Ninja -DGGML_METAL=ON && cmake --build build --config Release -j 8

If you built from source, add llama-server to your PATH:

echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc

Verify it's working:

llama-server --version


Step 2 — Start the Model Server

This downloads Qwen 3.5-35B-A3B (~20 GB, Q4_K_M quantization) from HuggingFace on first run. Subsequent starts use the cached model.

llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --port 8131 \
  -ngl 999 \
  -t 4 \
  -c 131072 \
  -b 512 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --swa-full \
  --no-context-shift \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --mlock \
  --no-mmap

Leave this terminal open. First startup takes 10–30 seconds. Go stare at Activity Monitor if you want — watching 20 GB of model weights load into unified memory is oddly satisfying.
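
Where did those 20 GB go? On macOS, llama.cpp typically caches -hf downloads under ~/Library/Caches/llama.cpp, and the LLAMA_CACHE environment variable can point the cache somewhere else (exact path and variable support may vary by build):

# Inspect the cached GGUF (usual macOS default location; yours may differ)
ls -lh ~/Library/Caches/llama.cpp/

# Optional: relocate the cache, e.g. to an external drive
export LLAMA_CACHE="/Volumes/External/llama-cache"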

Tune -t for your chip

The -t flag sets CPU threads for non-GPU work. Match it to your performance cores:

| Your Mac | Set -t to |
| --- | --- |
| M1 / M2 / M3 / M4 (base) | 4 |
| M1 / M2 / M3 / M4 Pro / Max | 8 |
| M2 / M3 Ultra | 16 |
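
Not sure how many performance cores you have? On Apple Silicon, perflevel0 is the performance cluster:

# Number of performance (P) cores; use this value for -t
sysctl -n hw.perflevel0.physicalcpu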

On 16 GB RAM

Swap the context size flag. Everything else stays the same:

# use this instead of -c 131072
-c 32768

You lose long-context capability but the model still works great for normal coding tasks.

What all these flags do

| Flag | Why it matters |
| --- | --- |
| -ngl 999 | Offload every layer to Metal GPU. This is the single biggest speedup. Without it, your CPU does all the work and your GPU sits there looking pretty. |
| -t 4 | CPU threads for non-offloaded work. Match to your P-core count (table above). |
| -b 512 | Prompt batch size. 512 beats 2048 on M2 base in benchmarks. Counterintuitive, but true. |
| --swa-full | The hidden performance flag. Qwen 3.5 uses sliding window attention. Without this, every follow-up request reprocesses the entire prompt from scratch. With it, prompt caching works. The difference is ~10x on follow-up latency. |
| --no-context-shift | Required when using --swa-full. Context shifting is incompatible with SWA. |
| --chat-template-kwargs '{"enable_thinking": false}' | Disables the model's internal chain-of-thought. In agentic workflows, those thinking tokens are wasted — Claude Code manages its own reasoning. |
| --cache-type-k/v q8_0 | Quantize the KV cache. Near-zero quality loss, measurable throughput improvement. Free lunch. |
| --keep 1024 | Pin the system prompt in cache. Claude Code sends a chunky system prompt — no point re-processing it every turn. |
| --mlock | Lock the model in RAM. Prevents macOS from deciding your model weights are a great candidate for swap. |
| --no-mmap | Don't memory-map the model file. More stable on macOS, especially under memory pressure. |
| -fa on | Flash attention. Faster prompt evaluation. |

Step 3 — Verify the Server

Open a new terminal tab (the other one is busy serving a 35B model) and poke the API:

curl http://localhost:8131/v1/models

You should see JSON with the model ID. If you get connection refused, the model is still loading — give it another 10 seconds.
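
If you want to see actual tokens come back before wiring up Claude Code, llama-server also exposes an OpenAI-style chat endpoint (the model field can be omitted; the server answers with whatever it has loaded):

curl -s http://localhost:8131/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with the word: ready"}], "max_tokens": 16}'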


Step 4 — Install Claude Code

You probably already have this. If not:

Option A — Native installer (auto-updates):

curl -fsSL https://claude.ai/install.sh | bash

Option B — npm:

npm install -g @anthropic-ai/claude-code

Option C — Homebrew:

brew install --cask claude-code

Homebrew conflict? If you see Error: It seems there is already a Binary at '/opt/homebrew/bin/claude', run rm /opt/homebrew/bin/claude first, then reinstall.

Verify:

claude --version


Step 5 — Configure ~/.claude/settings.json

This is the proper way to configure Claude Code. The env block sets environment variables for every session (no .zshrc needed), the permissions block pre-approves common tools so the model doesn't ask for confirmation on every ls, and a handful of flags disable telemetry and features that don't make sense for local inference.

mkdir -p ~/.claude
cat << 'SETTINGS' > ~/.claude/settings.json
{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8131",
    "ANTHROPIC_AUTH_TOKEN": "local",
    "ANTHROPIC_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "unsloth/qwen3.5-35b-a3b",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "unsloth/qwen3.5-35b-a3b",
    "CLAUDE_CODE_SUBAGENT_MODEL": "unsloth/qwen3.5-35b-a3b",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "128000",
    "DISABLE_PROMPT_CACHING": "1",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_TELEMETRY": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_NON_ESSENTIAL_MODEL_CALLS": "1"
  },
  "permissions": {
    "allow": [
      "Bash(git *)", "Bash(npm *)", "Bash(npx *)", "Bash(node *)",
      "Bash(python *)", "Bash(python3 *)", "Bash(pip *)", "Bash(pip3 *)",
      "Bash(brew *)", "Bash(ls *)", "Bash(cat *)", "Bash(head *)",
      "Bash(tail *)", "Bash(find *)", "Bash(grep *)", "Bash(rg *)",
      "Bash(mkdir *)", "Bash(cp *)", "Bash(mv *)", "Bash(rm *)",
      "Bash(echo *)", "Bash(curl *)", "Bash(which *)", "Bash(env *)",
      "Bash(cd *)", "Bash(pwd)", "Bash(wc *)", "Bash(sort *)",
      "Bash(uniq *)", "Bash(diff *)", "Bash(chmod *)", "Bash(touch *)",
      "Bash(sed *)", "Bash(awk *)", "Bash(xargs *)", "Bash(tee *)",
      "Read", "Edit", "Write", "Glob", "Grep",
      "WebFetch", "WebSearch"
    ],
    "deny": [
      "Read(./.env)", "Read(./.env.*)", "Read(./secrets/**)"
    ]
  }
}
SETTINGS

What these settings do

| Setting | Why |
| --- | --- |
| env.ANTHROPIC_BASE_URL | Points Claude Code at your local llama-server instead of Anthropic's cloud. |
| env.ANTHROPIC_AUTH_TOKEN | Any non-empty string. Satisfies the auth check without a real API key. |
| env.*_MODEL | Maps every model tier (Opus, Sonnet, Haiku, subagents) to your local Qwen 3.5. |
| env.CLAUDE_CODE_MAX_OUTPUT_TOKENS | Allows longer responses. Default is 32K — local models have no billing, so crank it. |
| env.DISABLE_PROMPT_CACHING | Prompt caching is an Anthropic API feature. Your local server handles its own caching via --swa-full. |
| env.DISABLE_AUTOUPDATER | You're running local. Auto-updates would just add network calls you don't need. |
| env.DISABLE_TELEMETRY | Running local to keep things private? Then don't phone home. |
| env.DISABLE_ERROR_REPORTING | Same reasoning. Your errors, your business. |
| env.DISABLE_NON_ESSENTIAL_MODEL_CALLS | Stops Claude Code from making extra model calls for things like spinner text. Every token counts on local inference. |
| permissions.allow | Pre-approves common shell commands and all file tools. Without this, Claude Code asks for permission on every single tool call. Gets old fast. |
| permissions.deny | Keeps .env and secrets off-limits, because even local models shouldn't read your credentials. |

Already have a settings.json? The install.sh script detects existing content and backs it up before writing. Or just merge the env and permissions blocks into your existing file manually.
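
Prefer to script the merge? A rough sketch with jq (assumes jq is installed, and that you saved the JSON block above to /tmp/local-qwen-settings.json instead of writing it straight to settings.json). Note that jq's * operator replaces arrays such as permissions.allow wholesale rather than concatenating them, so review the result:

# Back up, then deep-merge the new env/permissions blocks into the existing file.
cp ~/.claude/settings.json ~/.claude/settings.json.bak
jq -s '.[0] * .[1]' ~/.claude/settings.json.bak /tmp/local-qwen-settings.json > ~/.claude/settings.json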


Step 6 — Shell configuration (backup + cclocal helper)

The settings.json above handles everything Claude Code needs. But having the env vars in .zshrc too gives you a fallback and lets the cclocal helper function work:

cat << 'EOF' >> ~/.zshrc

# Claude Code + Local Qwen 3.5 — because localhost > cloud
export LM_MODEL="unsloth/qwen3.5-35b-a3b"
export ANTHROPIC_BASE_URL="http://127.0.0.1:8131"
export ANTHROPIC_AUTH_TOKEN="local"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=128000
export ANTHROPIC_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_OPUS_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_SONNET_MODEL="$LM_MODEL"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="$LM_MODEL"
export CLAUDE_CODE_SUBAGENT_MODEL="$LM_MODEL"
EOF
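
Reload and check that the variables landed:

source ~/.zshrc
env | grep -E 'ANTHROPIC|CLAUDE_CODE|LM_MODEL'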

Step 7 — Launch Claude Code

With settings.json in place, open a new terminal and run:

claude

Or, if you prefer a no-config one-liner (works even without settings.json or .zshrc changes):

ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude

You should see unsloth/qwen3.5-35b-a3b as the active model. Type something. Watch your llama-server terminal light up with inference logs. Smile.


Daily Driver: Quick Start Script

Once everything is installed, this is your daily workflow. One command, two services:

llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --port 8131 \
  -ngl 999 \
  -t 4 \
  -c 131072 \
  -b 512 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --swa-full \
  --no-context-shift \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --mlock \
  --no-mmap &

sleep 15 && ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude
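
If you wrap this in a run.sh, polling llama-server's /health endpoint until it reports ready is a bit more robust than a fixed sleep 15 (assuming your build exposes /health; recent ones do). A sketch of the replacement for that last line:

# Wait until /health returns 200 (model loaded), then launch Claude Code.
until curl -sf http://127.0.0.1:8131/health > /dev/null; do
  sleep 1
done
ANTHROPIC_BASE_URL=http://127.0.0.1:8131 ANTHROPIC_AUTH_TOKEN=local claude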

Shell Helper: cclocal

For the refined developer who wants a clean command:

cat << 'FUNC' >> ~/.zshrc

cclocal() {
    local port=8131
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        port="$1"
        shift
    fi
    ANTHROPIC_BASE_URL="http://127.0.0.1:${port}" \
    ANTHROPIC_AUTH_TOKEN="local" \
    claude "$@"
}
FUNC

Reload your shell (source ~/.zshrc or open a new terminal), and now you can:

cclocal              # Qwen 3.5 on port 8131
cclocal 8127         # Qwen3-Coder on port 8127
cclocal 8129         # GLM-4.7-Flash on port 8129

Other Models Worth Trying

Already running Qwen 3.5 and want more? You can run multiple models on different ports and switch with cclocal <port>.

| Model | Port | Size | Good For | Command |
| --- | --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B | 8127 | ~30 GB | Pure coding. If you only write code, this might be better than Qwen 3.5. | llama-server --fim-qwen-30b-default --port 8127 |
| GLM-4.7-Flash | 8129 | ~18 GB | Lighter weight, still capable. Good if you're on 24 GB. | llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --port 8129 -c 131072 -b 2048 -ub 1024 --parallel 1 -fa on --jinja --chat-template-file ~/llama.cpp/models/templates/glm-4.jinja |
| GPT-OSS-20B | 8123 | ~20 GB | Fast baseline. 17–38 tok/s on M1 Max. | llama-server --gpt-oss-20b-default --port 8123 |
| Qwen3-Coder-Next-80B-A3B | 8130 | ~46 GB | SOTA coder. Needs 64 GB RAM. Worth it if you have the metal. | llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL --port 8130 -c 131072 -b 2048 -ub 1024 --parallel 1 -fa on --jinja |
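
Got the RAM for two at once? Keep Qwen 3.5 on 8131 and start a second server in another terminal (the GLM-4.7-Flash command is straight from the table above), then switch Claude Code between them with cclocal:

# Terminal 2: GLM-4.7-Flash on its own port
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --port 8129 -c 131072 -b 2048 -ub 1024 \
  --parallel 1 -fa on --jinja --chat-template-file ~/llama.cpp/models/templates/glm-4.jinja

# Terminal 3: pick your backend per session
cclocal              # Qwen 3.5 on 8131
cclocal 8129         # GLM-4.7-Flash on 8129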

Troubleshooting

Things that might go wrong and how to fix them without rage-quitting.

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| connection refused on curl | Server still loading | Wait 30s, retry. First load is slow. |
| Painfully slow responses | Missing --swa-full or -ngl 999 | Restart llama-server with both flags. This is the #1 gotcha. |
| Claude Code uses the cloud model | Env vars not set | Re-export them in the terminal you're using. Check with echo $ANTHROPIC_BASE_URL. |
| failed to find a memory slot | Context too large for available RAM | Use --parallel 1 and/or reduce -c to 32768. |
| Auth errors from Claude Code | Missing auth token | export ANTHROPIC_AUTH_TOKEN=local — any non-empty string works. |
| Wrong model responding | Model ID mismatch | Run curl localhost:8131/v1/models and compare with echo $LM_MODEL. |
| First request takes forever | Cold start, model loading into memory | Normal. 10–30 seconds. Subsequent requests are fast. |
| System swapping / beachball | Model doesn't fit in RAM | Reduce -c, or try a smaller model like GLM-4.7-Flash (~18 GB). |
| Homebrew binary conflict | Previous install left /opt/homebrew/bin/claude | rm /opt/homebrew/bin/claude && brew reinstall --cask claude-code |
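
When something's off, a two-line sanity check covers the most common failures (server reachable, Claude Code pointed at it):

curl -s http://localhost:8131/v1/models || echo "llama-server not reachable on 8131"
echo "ANTHROPIC_BASE_URL=${ANTHROPIC_BASE_URL:-<not set>}"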

Stopping

Press Ctrl+C in the llama-server terminal, or pkill llama-server if you started it in the background. Your model weights aren't going anywhere — next launch picks up the cached GGUF instantly.


Credits