How to Run Local LLMs with Claude Code & OpenAI Codex

Run Claude Code and OpenAI Codex on your local device guide.

This step-by-step guide shows you how to connect open LLMs to Claude Code and Codex entirely locally, complete with screenshots. Run using any open model like DeepSeek, Qwen and Gemma.

For this tutorial, we’ll use GLM-4.7-Flash, the strongest 30B MoE agentic & coding model as of Jan 2026 to autonomously fine-tune an LLM with Unsloth. You can swap in any other model, just update the model names in your scripts.

Claude Code Tutorial OpenAI Codex Tutorial

We use llama.cpp which is an open-source framework for running LLMs on your Mac, Linux, Windows etc. devices. Llama.cpp contains llama-server which allows you to serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.

For model quants, we will utilize Unsloth Dynamic GGUFs to run any LLM quantized, while retaining as much accuracy as possible.

📖 Step #1: Install Llama.cpp Tutorial

We need to install llama.cpp to deploy/serve local LLMs to use in Codex etc. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Download and use models locally

Download the model via huggingface_hub in Python (after installing via pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Collection here.

We used unsloth/GLM-4.7-Flash-GGUF , but you can use anything like unsloth/Qwen3-Coder-Next-GGUF - see Qwen3-Coder-Next

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/GLM-4.7-Flash-GGUF",
    local_dir = "unsloth/GLM-4.7-Flash-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)

To deploy GLM-4.7-Flash for agentic workloads, we use llama-server. We apply Z.ai's recommended sampling parameters (temp 1.0, top_p 0.95) and enable --jinja for proper tool calling support.

Run this command in a new terminal (use tmux or open a new terminal). The below should fit perfectly in a 24GB GPU (RTX 4090) (uses 23GB) --fit on will also auto offload, but if you see bad performance, reduce --ctx-size . We used --cache-type-k q8_0 --cache-type-v q8_0 for KV cache quantization to reduce VRAM usage.

./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --fit on \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --port 8001 \
    --jinja \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on \
    --batch-size 4096 --ubatch-size 1024 \
    --ctx-size 131072

You can also disable thinking for GLM-4.7-Flash which can improve performance for agentic coding stuff. To disable thinking with llama.cpp add this to the llama-server command:

--chat-template-kwargs "{"enable_thinking": false}"

Claude Code is Anthropic's agentic coding tool that lives in your terminal, understands your codebase, and handles complex Git workflows via natural language.

Install Claude Code and run it locally

curl -fsSL https://claude.ai/install.sh | bash
# Or via Homebrew: brew install --cask claude-code

Configure

Set the ANTHROPIC_BASE_URL environment variable to redirect Claude Code to your local llama.cpp server:

export ANTHROPIC_BASE_URL="http://localhost:8001"

Also you might need to set ANTHROPIC_API_KEY depending on the server. For example:

export ANTHROPIC_API_KEY='sk-no-key-required' ## or 'sk-1234'

Session vs Persistent: The commands above apply to the current terminal only. To persist across new terminals:

Add the export line to ~/.bashrc (bash) or ~/.zshrc (zsh).

If you see Unable to connect to API (ConnectionRefused) , remember to unset ANTHROPIC_BASE_URL via unset ANTHROPIC_BASE_URL

Missing API key

If you see this, set export ANTHROPIC_API_KEY='sk-no-key-required' ## or 'sk-1234'

Use Powershell for all commands below:

irm https://claude.ai/install.ps1 | iex

Configure

Set the ANTHROPIC_BASE_URL environment variable to redirect Claude Code to your local llama.cpp server:

$env:ANTHROPIC_BASE_URL="http://localhost:8001"

Session vs Persistent: The commands above apply to the current terminal only. To persist across new terminals:

Run setx ANTHROPIC_BASE_URL "http://localhost:8001" once, or add the $env: line to your $PROFILE.

🌟Running Claude Code locally on Linux / Mac / Windows

We used unsloth/GLM-4.7-Flash-GGUF , but you can use anything like unsloth/Qwen3-Coder-Next-GGUF - see Qwen3-Coder-Next

Navigate to your project folder (mkdir project ; cd project) and run:

To set Claude Code to execute commands without any approvals do (BEWARE this will make Claude Code do and execute code however it likes without any approvals!)

Try this prompt to install and run a simple Unsloth finetune:

After waiting a bit, Unsloth will be installed in a venv via uv, and loaded up:

and finally you will see a successfully finetuned model with Unsloth!

IDE Extension (VS Code / Cursor)

You can also use Claude Code directly inside your editor via the official extension:

Alternatively, press Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (Mac), search for Claude Code, and click Install.

If you see Unable to connect to API (ConnectionRefused) , remember to unset ANTHROPIC_BASE_URL via unset ANTHROPIC_BASE_URL

👾 OpenAI Codex CLI Tutorial

Codex is OpenAI's official coding agent that runs locally. While designed for ChatGPT, it supports custom API endpoints, making it perfect for local LLMs. See https://developers.openai.com/codex/windows/ for installing on Windows - it's best to use WSL.

Mac (Homebrew):

Universal (NPM) for Linux

Configure

First run codex to login and setup things, then create or edit the configuration file at ~/.codex/config.toml (Mac/Linux) or %USERPROFILE%\.codex\config.toml (Windows).

Use cat > ~/.codex/config.toml for Linux / Mac:

Navigate to your project folder (mkdir project ; cd project) and run:

Or to allow any code to execute. (BEWARE this will make Codex do and execute code however it likes without any approvals!)

And you will see:

OpenAI's Codex is removing wire_api = "chat" support it seems - it still works as of January 29th 2026. We should switch to wire_api = "responses" but it keeps error-ing out with: {"error":{"code":400,"message":"'type' of tool must be 'function'","type":"invalid_request_error"}}

Try this prompt to install and run a simple Unsloth finetune:

and you will see:

and if we wait a little longer, we finally get:

claude --model unsloth/GLM-4.7-Flash

claude --model unsloth/GLM-4.7-Flash --dangerously-skip-permissions

You can only work in the cwd project/. Do not search for CLAUDE.md - this is it. Install Unsloth via a virtual environment via uv. See https://unsloth.ai/docs/get-started/install/pip-install on how (get it and read). Then do a simple Unsloth finetuning run described in https://github.com/unslothai/unsloth. You have access to 1 GPU.

brew install --cask codex

apt update
apt install nodejs npm -y
npm install -g @openai/codex

[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000

codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp --search

codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp --search --dangerously-bypass-approvals-and-sandbox

You can only work in the cwd project/. Do not search for AGENTS.md - this is it. Install Unsloth via a virtual environment via uv. See https://unsloth.ai/docs/get-started/install/pip-install on how (get it and read). Then do a simple Unsloth finetuning run described in https://github.com/unslothai/unsloth. You have access to 1 GPU.