How to Run Local LLMs with Claude Code | Unsloth Documentation


How to Run Local LLMs with Claude Code

Guide to use open models with Claude Code on your local device.

This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. Run using any open model like Qwen3.6, DeepSeek and Gemma.

For this tutorial, we’ll use the open models Gemma 4 and Qwen3.5, which are strong agentic and coding models (they work on a device with 24GB of RAM or unified memory). For inference, we'll use Unsloth Studio and llama.cpp, which let you run and serve LLMs on macOS, Linux, and Windows. You can swap in any other model; just update the model names in your scripts.


For model quants, we will utilize Unsloth Dynamic GGUFs to run any LLM quantized, while retaining as much accuracy as possible.

Before setting up our local LLM, we need to install Claude Code. Claude Code is a terminal-based coding agent that understands your codebase and handles complex Git workflows using natural language.

Paste into your terminal to install Claude Code:

curl -fsSL https://claude.ai/install.sh | bash

After install, navigate to your project folder. Then type claude into the shell to begin.

cd ~/projects/my-project 
claude

🕵️Fixing 90% slower inference in Claude Code

Recent Claude Code versions prepend a Claude Code attribution header, which invalidates the KV cache and makes inference with local models about 90% slower.

The attribution is a line prepended to the start of the system prompt (x-anthropic-billing-header: cc_version=...; cch=...;) whose value changes on every request, so the whole prompt prefix misses the KV cache each turn.
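To see why a changing first line is so costly, here is a small illustrative sketch. It assumes the server can only reuse cached work for the longest prefix that two consecutive prompts share. Characters stand in for tokens, the cch values are invented placeholders, and common_prefix_len is a hypothetical helper.

```shell
# Illustration only: characters stand in for tokens, and the cch values
# below are invented placeholders, not real Claude Code output.
common_prefix_len() {
  # Print how many leading characters the two arguments share.
  a=$1; b=$2; n=0
  while [ -n "$a" ] && [ -n "$b" ]; do
    ca=${a%"${a#?}"}   # first character of a
    cb=${b%"${b#?}"}   # first character of b
    [ "$ca" = "$cb" ] || break
    a=${a#?}; b=${b#?}; n=$((n + 1))
  done
  echo "$n"
}

system_prompt="You are a coding agent. [several thousand tokens of instructions]"

# Header first: the prompts diverge almost immediately, so little is reusable.
common_prefix_len "cch=aaa1; $system_prompt" "cch=bbb2; $system_prompt"

# Header removed: consecutive requests share the entire system prompt.
common_prefix_len "$system_prompt" "$system_prompt"
```

The first call prints 4: only the characters before the changing value match, so everything after them would have to be recomputed on every turn.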

The simplest fix is to disable it inline when you launch Claude Code, so there is no file to edit:
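A minimal sketch of that inline form, assuming the --settings flag accepts a JSON string with the same "env" structure as the settings file. The variable and function names are illustrative only.

```shell
# Assumption: --settings takes an inline JSON string shaped like settings.json.
ATTRIBUTION_OFF='{"env": {"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"}}'

# Hypothetical wrapper: launch Claude Code with the header disabled,
# passing any extra arguments straight through.
claude_no_attribution() {
  claude --settings "$ATTRIBUTION_OFF" "$@"
}

# Usage: claude_no_attribution
```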

Recent Claude Code releases also honor export CLAUDE_CODE_ATTRIBUTION_HEADER=0; older builds ignored the shell variable, so the --settings form above (or the settings file below) is the reliable choice.

To make it permanent, set "CLAUDE_CODE_ATTRIBUTION_HEADER" to "0" inside the "env" section of ~/.claude/settings.json. For example, run cat > ~/.claude/settings.json, paste a JSON object containing that entry, then press ENTER followed by CTRL+D to save it. If you already have a ~/.claude/settings.json file, just add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to its "env" section and leave the rest of the file unchanged.
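As a sketch of the permanent variant, the helper below writes a minimal settings file containing only that entry. The function name is hypothetical, and it deliberately refuses to touch an existing file, since in that case the key should be merged into the existing "env" section by hand.

```shell
# Path named in the text; override SETTINGS_FILE to write somewhere else.
SETTINGS_FILE="${SETTINGS_FILE:-$HOME/.claude/settings.json}"

# Hypothetical helper: create a minimal settings file at the given path.
write_minimal_settings() {
  # Never clobber an existing file; merge the key manually in that case.
  if [ -e "$1" ]; then
    echo "exists: add the key to the \"env\" section of $1 manually" >&2
    return 1
  fi
  mkdir -p "$(dirname "$1")"
  printf '%s\n' \
    '{' \
    '  "env": {' \
    '    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"' \
    '  }' \
    '}' > "$1"
}

# Usage: write_minimal_settings "$SETTINGS_FILE"
```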

Before we begin, we need to complete setup for the specific model you're going to use. We use Unsloth (a web UI) and llama.cpp, which are open-source frameworks for running and serving LLMs on macOS, Linux, and Windows devices.

Unsloth also has unique self-healing tool-calling and web search capabilities.


For this tutorial, we will serve local models and connect them to Claude Code through the Unsloth UI. Unsloth works on Windows, WSL, Linux and macOS.

  • Search, download, run GGUFs and safetensor models

  • Fast CPU + GPU inference via llama.cpp

See below for install instructions:

Example of Qwen3.6 2-bit running in Unsloth.

Launch the terminal from Mac, then install Unsloth by entering the command below.

Unsloth will start setting up the environment and installing the required packages. Type Y and press Enter when asked if you want to allow Studio to start now. This will start Unsloth on local port 8888.

If you chose not to start Unsloth during the installation process, you can always start the Unsloth app using unsloth studio -p 8888. If you would like your Unsloth instance to be accessible by clients outside of your computer, add -H 0.0.0.0 to the unsloth studio command.

Open your browser of choice and type http://127.0.0.1:8888 in the URL box. If this is your first time installing Unsloth, you will be forwarded to the Password page, where you will need to create a new password. After that, Unsloth should open on the Chat page.

Model Loading + API Guide

Before using the API, load a model from the Select model dropdown in the top-left corner of the Chat page.

In this guide, we’ll use: unsloth/gemma-4-26B-A4B-it-GGUF with the recommended UD-Q4_K_XL quantization.

Before using the Client, send a quick message:

This confirms that the model loaded correctly and is ready to respond.

In Studio, open Settings → API to view or create your API key.

Treat your API key like a password and avoid exposing it in screenshots or repositories.

Now that the local LLM is set up, we configure Claude Code to work with Unsloth or llama.cpp. We start by setting the following environment variables. By default, these variables do not persist between sessions.

Config: Set the local API URL:
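A minimal sketch, assuming Studio is listening on the default port 8888 from the install step and that the base URL is given without a /v1 suffix:

```shell
# Assumption: Unsloth Studio is listening on its default local port 8888.
export ANTHROPIC_BASE_URL="http://localhost:8888"
```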

Copy your key from Unsloth Studio → Settings → API (or from the console when you start it with unsloth run, where it is printed as sk-unsloth-...), then set it. Also set an empty ANTHROPIC_API_KEY so Claude Code does not prompt for a cloud key:
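As a sketch, this could look like the following. Passing the Unsloth key through ANTHROPIC_AUTH_TOKEN is an assumption here, and the key value is a placeholder to replace with your own.

```shell
# Assumption: the Unsloth key is passed via ANTHROPIC_AUTH_TOKEN.
# "sk-unsloth-..." is a placeholder; paste your own key and keep it private.
export ANTHROPIC_AUTH_TOKEN="sk-unsloth-..."

# Empty on purpose so Claude Code does not prompt for a cloud key.
export ANTHROPIC_API_KEY=""
```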

Optional: Use the name of the model currently loaded in Unsloth as a default.

Use the full model id exactly as it appears in GET http://localhost:8888/v1/models (the same string you pass to claude --model).
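One way to look up the id and set it as the default is sketched below. It assumes /v1/models returns an OpenAI-style list of objects with an "id" field and that the id matches the repository name used in this guide. model_ids is a hypothetical helper, and the auth header in the usage comment is also an assumption.

```shell
# Assumption: /v1/models returns an OpenAI-style list, for example
#   {"object": "list", "data": [{"id": "unsloth/gemma-4-26B-A4B-it-GGUF"}]}
# Hypothetical helper: print every "id" value found in JSON read from stdin.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage (the server must be running; the auth header is an assumption):
#   curl -s -H "Authorization: Bearer $ANTHROPIC_AUTH_TOKEN" \
#     http://localhost:8888/v1/models | model_ids

# Then make the id you saw the default model for Claude Code.
export ANTHROPIC_MODEL="unsloth/gemma-4-26B-A4B-it-GGUF"
```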

Start Claude Code with the model that is currently loaded in Unsloth.

We will use gemma-4-26B-A4B-it-GGUF, but you can use any Unsloth compatible model.
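A minimal launch sketch, assuming the id below matches what the server reports for the loaded model. The variable and wrapper names are hypothetical.

```shell
# Assumption: this id matches what /v1/models reports for the loaded model.
LOCAL_MODEL="unsloth/gemma-4-26B-A4B-it-GGUF"

# Hypothetical wrapper so extra flags can be appended at the call site.
claude_local() {
  claude --model "$LOCAL_MODEL" "$@"
}

# Usage: claude_local
```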

For an extra speed boost on local models, you can also launch with --bare --exclude-dynamic-system-prompt-sections. See Optional: shrink the system prompt below.

Claude Code should open and display the selected model.

Try this prompt to research and rank high-quality SFT datasets.

After you submit the prompt, the agent will search the web, evaluate findings, and write the final report. This may take a few minutes.

Some workflows may require you to approve actions or answer follow-up prompts.

Once complete, the generated sft_report.md will look similar to this.

If you see "Unable to connect to API (ConnectionRefused)", remember to clear ANTHROPIC_BASE_URL by running unset ANTHROPIC_BASE_URL.

If you find open models to be 90% slower, first see the section "Fixing 90% slower inference in Claude Code" above to fix the KV cache being invalidated.

Optional: shrink the system prompt

Claude Code was built for Anthropic's hosted models, so its default system prompt is large. On local models you can trim it for faster responses and better KV-cache reuse by adding two flags when you launch:
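A sketch of such a launch, using only the two flags named in this section. The wrapper name is hypothetical.

```shell
# Both flags are the ones named in this section; the wrapper is illustrative.
claude_lean() {
  claude --bare --exclude-dynamic-system-prompt-sections "$@"
}

# Usage: claude_lean --model "$ANTHROPIC_MODEL"
```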

--bare skips auto-discovery of hooks, skills, plugins, MCP servers and CLAUDE.md (Claude keeps Bash and file read/edit), and --exclude-dynamic-system-prompt-sections moves per-machine sections out of the prompt prefix. Both shrink the prompt and improve KV-cache reuse, which makes local models noticeably faster. They are optional and do not change the connection setup above.

Optional: tune the Unsloth server

Claude Code uses the model running in Unsloth. You can customize how the server behaves when starting it.

Use --disable-tools when driving Claude Code (or any external coding agent). By default Unsloth Studio runs its own server-side tools, which swallow the agent's tool calls, so Claude Code answers but never edits files. --disable-tools switches to passthrough, so Claude Code's own Write/Edit/Bash tools are used.

Use --reasoning off to turn thinking off, or --reasoning on to turn it on for models that support reasoning.

This starts the server on 0.0.0.0:8888, allowing other devices on your local network to connect.

Use -p to change which port the server runs on. Use -H 0.0.0.0 if you want phones, laptops, or other devices on your network to connect.

For more advanced runtime configuration, see the main API tuning section.

Before we begin, we need to complete setup for the specific model you're going to use. We use llama.cpp, an open-source framework for running LLMs on macOS, Linux, Windows and other devices. llama.cpp includes llama-server, which lets you serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.

We'll be using Qwen3.5-35B-A3B with specific settings for fast, accurate coding tasks. If you want a smarter model or don't have enough VRAM, Qwen3.5-27B is a great choice, but it will be ~2x slower than 35B-A3B. You can also use other Qwen3.5 variants like 9B, 4B or 2B, or Qwen3-Coder-Next, which is fantastic if you have enough VRAM.

We need to install llama.cpp to deploy and serve local LLMs for use in Claude Code. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and then continue as usual; Metal support is on by default.
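A sketch of a from-source build along those lines, assuming git, cmake and a C++ toolchain are already installed. Both function names are hypothetical helpers, and only the llama-server target is built here.

```shell
# Hypothetical helper: pick the CMake CUDA switch described above.
# "gpu" -> NVIDIA build; anything else -> CPU-only or Apple Metal build.
ggml_cuda_flag() {
  if [ "$1" = "gpu" ]; then
    echo "-DGGML_CUDA=ON"
  else
    echo "-DGGML_CUDA=OFF"
  fi
}

# Clone and build llama-server; the binary lands in llama.cpp/build/bin/.
build_llama_cpp() {
  git clone https://github.com/ggml-org/llama.cpp &&
  cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF "$(ggml_cuda_flag "$1")" &&
  cmake --build llama.cpp/build --config Release -j --target llama-server
}

# Usage: build_llama_cpp gpu    (or: build_llama_cpp cpu)
```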

Download and use models locally

Download the model via huggingface_hub in Python (after installing it with pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our collection. If downloads get stuck, see the Hugging Face Hub, XET debugging page.

We used unsloth/Qwen3.5-35B-A3B-GGUF, but you can use another variant like 27B or any other model like unsloth/Qwen3-Coder-Next-GGUF.
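As a command-line alternative to a Python script, the sketch below fetches only the files for the chosen quant. It assumes a recent huggingface_hub release that ships the hf command. The local directory name and the filename pattern are illustrative choices, and download_gguf is a hypothetical helper.

```shell
# Repo and quant names come from this guide; the "*UD-Q4_K_XL*" pattern and
# the local directory name are illustrative assumptions.
REPO_ID="unsloth/Qwen3.5-35B-A3B-GGUF"
QUANT="UD-Q4_K_XL"
INCLUDE_PATTERN="*${QUANT}*"

download_gguf() {
  # Fetch only the files for the chosen quant instead of the whole repository.
  hf download "$REPO_ID" --include "$INCLUDE_PATTERN" --local-dir "$REPO_ID"
}

# Usage: download_gguf
```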

To deploy Qwen3.5 for agentic workloads, we use llama-server. We apply Qwen's recommended sampling parameters for thinking mode: temperature 0.6, top_p 0.95, top_k 20. Keep in mind that these numbers change if you use non-thinking mode or other tasks.

Run this command in a new terminal (or a tmux session). The configuration below should fit on a 24GB GPU such as an RTX 4090 (it uses about 23GB). --fit on will also offload automatically, but if you see bad performance, reduce --ctx-size.

We used --cache-type-k q8_0 --cache-type-v q8_0 to quantize the KV cache and reduce VRAM usage. For full precision, use --cache-type-k bf16 --cache-type-v bf16. Note that a bf16 KV cache might be slightly slower on some machines.

You can also disable thinking for Qwen3.5, which can improve performance for agentic coding tasks. To disable thinking with llama.cpp, add this to the llama-server command:
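A sketch of a llama-server launch using the settings discussed in this section. The binary path, the model file placeholder, the alias and the context size are assumptions, so the memory figure quoted above may not apply to it exactly. serve_local_model is a hypothetical helper.

```shell
# Assumptions: the binary path follows a from-source build, and MODEL_FILE is
# a placeholder; point it at the .gguf file you actually downloaded.
LLAMA_SERVER="${LLAMA_SERVER:-./llama.cpp/build/bin/llama-server}"
MODEL_FILE="${MODEL_FILE:-path/to/your-model-UD-Q4_K_XL.gguf}"
CTX_SIZE="${CTX_SIZE:-131072}"   # illustrative; lower it if performance is poor

serve_local_model() {
  # Port, sampling values and KV cache types are the ones named in this guide.
  "$LLAMA_SERVER" \
    --model "$MODEL_FILE" \
    --alias "unsloth/Qwen3.5-35B-A3B" \
    --port 8001 \
    --temp 0.6 --top-p 0.95 --top-k 20 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --fit on \
    --ctx-size "$CTX_SIZE"
}

# Usage: serve_local_model
```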

--chat-template-kwargs "{\"enable_thinking\": false}"

Start Claude Code with llama-server

We used unsloth/GLM-4.7-Flash-GGUF, but you can use anything like unsloth/Qwen3.6-27B-GGUF.

Navigate to your project folder (mkdir project ; cd project) and run:
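A minimal sketch of that step, assuming llama-server is listening on port 8001 and was started without an API key, so any non-empty placeholder satisfies Claude Code. The model name is also an assumption and should match the model your server is actually serving. The wrapper name is hypothetical.

```shell
# Assumption: llama-server listens on port 8001 without an API key.
export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_API_KEY="sk-no-key-required"   # placeholder, not a real key

# Assumption: set this to the model (or alias) your server is serving.
LOCAL_MODEL="${LOCAL_MODEL:-unsloth/Qwen3.5-35B-A3B}"

# Hypothetical wrapper so extra flags can be appended at the call site.
claude_llama() {
  claude --model "$LOCAL_MODEL" "$@"
}

# Usage: claude_llama
```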

To use Qwen3.6-35B-A3B, simply change it to:

To let Claude Code execute commands without any approvals, launch it as follows (BEWARE: this lets Claude Code run any code it likes without asking for approval!):
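A sketch using Claude Code's --dangerously-skip-permissions flag. The wrapper name is hypothetical, and it is safest to use it only in a throwaway project directory or a container.

```shell
# WARNING: this disables every approval prompt, so the agent can run any
# command on its own. The wrapper name is hypothetical.
claude_unattended() {
  claude --dangerously-skip-permissions "$@"
}

# Usage: claude_unattended --model "$ANTHROPIC_MODEL"
```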

Try this prompt to install and run a simple Unsloth finetune:

After waiting a bit, Unsloth will be installed in a venv via uv, and loaded up:

and finally you will see a successfully finetuned model with Unsloth!

If you see "Unable to connect to API (ConnectionRefused)", remember to clear ANTHROPIC_BASE_URL by running unset ANTHROPIC_BASE_URL.

If you find open models to be 90% slower, first see the section "Fixing 90% slower inference in Claude Code" above to fix the KV cache being invalidated.