🌠Qwen3-Coder-Next: How to Run Locally
Guide to run Qwen3-Coder-Next locally on your device!
Qwen releases Qwen3-Coder-Next, an 80B-parameter MoE model (3B active parameters) with 256K context, built for fast agentic coding and local use. Its performance is comparable to that of models with 10–20× more active parameters.
It runs in 46GB of RAM/VRAM/unified memory (85GB for 8-bit) and is non-reasoning, so it gives ultra-quick code responses. The model excels at long-horizon reasoning, complex tool use, and recovery from execution failures.
Feb 4 update: llama.cpp fixed a bug in the vectorized key_gdiff calculation, which resolves the previous looping and output issues. We updated the GGUFs - please re-download them and update llama.cpp for better outputs.
You'll also learn how to run the model with Codex & Claude Code. For fine-tuning, Qwen3-Coder-Next fits on a single B200 GPU for bf16 LoRA in Unsloth.
Qwen3-Coder-Next Unsloth Dynamic GGUFs to run: unsloth/Qwen3-Coder-Next-GGUF
Run GGUF Tutorial | Codex & Claude Code | FP8 vLLM Tutorial
Don't have 46GB of RAM or unified memory? No worries, you can run our smaller quants like 3-bit. The rule of thumb is that the sum of your compute (disk space + RAM + VRAM) should be at least the size of the quant. If the quant fully fits in your RAM/VRAM, expect 20+ tokens/s. If it doesn't fit, it will still work by offloading, just more slowly.
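As a quick sanity check, here is a tiny sketch of that rule of thumb; the hardware numbers are made-up examples, so substitute your own:

# Rule-of-thumb check for whether a quant will fit (example numbers - replace with yours).
quant_size_gb = 46   # e.g. the 4-bit UD-Q4_K_XL quant
vram_gb = 24
ram_gb = 32
free_disk_gb = 200

if vram_gb + ram_gb >= quant_size_gb:
    print("Fits in RAM/VRAM: expect 20+ tokens/s")
elif vram_gb + ram_gb + free_disk_gb >= quant_size_gb:
    print("Fits only with disk offloading: it will run, but slower")
else:
    print("Too large: pick a smaller quant such as 3-bit")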
To achieve optimal performance, Qwen recommends these settings:
- Temperature = 1.0, Top_P = 0.95, Top_K = 40
- Min_P = 0.01 (llama.cpp's default is 0.05)
- Repeat penalty = disabled or 1.0
- Context: up to 262,144 tokens supported natively, but you can set it to 32,768 tokens for lower memory use.
The exact settings depend on your use-case. Because this guide uses the 4-bit quant, you will need around 46GB of RAM/unified memory. We recommend using at least 3-bit precision for best results.
NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. So specifying enable_thinking=False is no longer required.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
You can pull the model directly from Hugging Face. You can increase the context to 256K if your RAM/VRAM can fit it; using --fit on will also auto-determine the context length.
You can use the recommended parameters: temperature=1.0, top_p=0.95, top_k=40, min_p=0.01
Or download the model with the commands below (after installing the Hugging Face CLI via pip install huggingface_hub). You can choose UD-Q4_K_XL or other quantized versions. If downloads get stuck, see Hugging Face Hub, XET debugging.
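If you prefer to download from Python rather than the hf CLI, a roughly equivalent sketch using huggingface_hub's snapshot_download is:

from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL shards of the GGUF repo into a local folder.
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-Next-GGUF",
    local_dir = "unsloth/Qwen3-Coder-Next-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)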
Then run the model in conversation mode:
Also, adjust the context window as required, up to 262,144 tokens.
🦙Llama-server serving & deployment
To deploy Qwen3-Coder-Next for production, we use llama-server. Open a new terminal (for example via tmux), then deploy the model via:
Then in a new terminal, after doing pip install openai, we can run the model:
This will print the model's output. We extracted the HTML and ran it, and the example Flappy Bird game it generated worked well!
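The client shown in the code listing uses the server's default sampling. If you want to pass the recommended samplers explicitly and save the generated page to disk, a sketch is below; it assumes the llama-server above is running on port 8001 and that the server accepts llama.cpp's extra top_k/min_p fields through the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
completion = client.chat.completions.create(
    model = "unsloth/Qwen3-Coder-Next",
    messages = [{"role": "user", "content": "Create a Flappy Bird game in HTML"}],
    temperature = 1.0,
    top_p = 0.95,
    # top_k and min_p are not part of the core OpenAI API, so pass them as extras.
    extra_body = {"top_k": 40, "min_p": 0.01},
)

# Save the generated page so you can open it in a browser.
with open("flappy_bird.html", "w") as f:
    f.write(completion.choices[0].message.content)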
👾 OpenAI Codex & Claude Code
To use the model for local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to 'Qwen3-Coder-Next', and make sure you follow the correct Qwen3-Coder-Next parameters and usage instructions. Use the llama-server we just set up above.
After following the instructions for Claude Code, for example, you will see:
We can then ask it to, say, Create a Python game for Chess:
If you see API Error: 400 {"error":{"code":400,"message":"request (16582 tokens) exceeds the available context size (16384 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":16582,"n_ctx":16384}}, that means you need to increase the context length (see 📐How to fit long context).
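To check what context size your running llama-server actually has, you can query it directly. This sketch assumes a recent llama-server build that exposes the /props endpoint on port 8001; response fields can vary between versions, so we simply print the whole response:

import json, urllib.request

# Inspect the running server's settings; look for the context size (n_ctx) in the output.
with urllib.request.urlopen("http://127.0.0.1:8001/props") as response:
    print(json.dumps(json.load(response), indent = 2))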
🎱 FP8 Qwen3-Coder-Next in vLLM
You can now use our new FP8 Dynamic quant of the model for high-quality, fast inference. First install vLLM from the nightly builds. Change --extra-index-url https://wheels.vllm.ai/nightly/cu130 to match your CUDA version (found via nvidia-smi) - only cu129 and cu130 are currently supported.
Then serve Unsloth's dynamic FP8 version of the model. You can also enable FP8 KV cache to reduce KV cache memory usage by 50% by adding --kv-cache-dtype fp8. We served it on 4 GPUs, but if you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 (or remove this argument). Use tmux to launch the command below in a new terminal, then press CTRL+B followed by D to detach - use tmux attach-session -t0 to return to it.
You should see something like the below. See Tool Calling with Qwen3-Coder-Next for how to use Qwen3-Coder-Next via the OpenAI API and tool calling - this works for both vLLM and llama-server.
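As a quick check that the server came up correctly (assuming port 8001 as in the command above), you can list the served models through the OpenAI-compatible endpoint:

from openai import OpenAI

# Verify the vLLM (or llama-server) endpoint is reachable before wiring in tools.
client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
for model in client.models.list():
    print(model.id)   # should include unsloth/Qwen3-Coder-Next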
🔧Tool Calling with Qwen3-Coder-Next
In a new terminal, we create some tools, like adding 2 numbers, executing Python code, executing Linux commands and much more:
We then use the functions below (copy, paste and execute them), which parse the function calls automatically and call the OpenAI endpoint for any model:
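The exact tool definitions from the original notebook aren't reproduced on this page, so here is a minimal, illustrative sketch of what they could look like; the function names and schemas below are assumptions, not the guide's exact code:

import subprocess

# Illustrative tool implementations - the guide's actual tools may differ.
def add_numbers(a: float, b: float):
    return a + b

def execute_python(code: str):
    # WARNING: exec() runs arbitrary model-generated code - only do this in a sandbox.
    local_vars = {}
    exec(code, {}, local_vars)
    return local_vars

def execute_terminal(command: str):
    result = subprocess.run(command, shell = True, capture_output = True, text = True)
    return result.stdout + result.stderr

# OpenAI-style JSON schemas so the model knows how to call each tool.
tools = [
    {"type": "function", "function": {
        "name": "add_numbers",
        "description": "Add two numbers together.",
        "parameters": {"type": "object",
                       "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                       "required": ["a", "b"]}}},
    {"type": "function", "function": {
        "name": "execute_python",
        "description": "Execute Python code and return the local variables it created.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
    {"type": "function", "function": {
        "name": "execute_terminal",
        "description": "Run a Linux shell command and return its output.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]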
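Likewise, the parsing helper itself isn't shown on this page. The sketch below implements an unsloth_inference-style loop with the same call signature used in the examples that follow: it sends the messages plus tools to the OpenAI-compatible endpoint, executes any tool calls the model makes, and feeds the results back. It assumes the server from earlier is on port 8001, that it accepts the extra top_k/min_p fields, and that the tools from the previous sketch are defined - treat it as an illustration rather than the guide's exact code:

import json
from openai import OpenAI

# Assumes the llama-server / vLLM endpoint from earlier is running on port 8001.
client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
available_functions = {
    "add_numbers": add_numbers,
    "execute_python": execute_python,
    "execute_terminal": execute_terminal,
}

def unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00):
    # Keep calling the model until it stops requesting tools.
    while True:
        response = client.chat.completions.create(
            model = "unsloth/Qwen3-Coder-Next",
            messages = messages,
            tools = tools,
            temperature = temperature,
            top_p = top_p,
            # top_k / min_p are server-specific extensions (llama-server and vLLM accept them).
            extra_body = {"top_k": top_k, "min_p": min_p},
        )
        message = response.choices[0].message
        messages.append(message.model_dump(exclude_none = True))
        if not message.tool_calls:
            print(message.content)
            return messages
        # Execute each requested tool and append its result for the next round.
        for tool_call in message.tool_calls:
            function = available_functions[tool_call.function.name]
            arguments = json.loads(tool_call.function.arguments)
            result = function(**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })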
Now we'll showcase multiple methods of running tool-calling for many different use-cases below:
Execute generated Python code
Execute arbitrary terminal functions
We confirm the file was created - and it was!
See Tool Calling Guide for more examples for tool calling.
🛠️ Improving generation speed
If you have more VRAM, you can try moving more MoE layers onto the GPU, or keeping whole layers there.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU. This effectively lets you fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regular expression to keep more layers on the GPU if you have more capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
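If you find the layer-range regex fiddly to write by hand, here is a small illustrative helper that generates an equivalent -ot pattern for offloading the expert tensors of layers first_layer..max_layer to the CPU (adjust max_layer to your model's layer count):

# Build a llama.cpp -ot pattern that offloads gate/up/down MoE experts to the CPU
# for layers first_layer..max_layer (equivalent in spirit to the regex above).
def moe_offload_pattern(first_layer: int, max_layer: int) -> str:
    layers = "|".join(str(i) for i in range(first_layer, max_layer + 1))
    return rf"\.({layers})\.ffn_(gate|up|down)_exps.=CPU"

# Example: offload experts from layer 6 onwards, assuming a 48-layer model.
print(moe_offload_pattern(6, 47))   # pass the printed string to llama.cpp via -ot "..."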
The latest llama.cpp release also introduces a high-throughput mode - use llama-parallel. Read more about it here. You can also quantize the KV cache to 4-bit, for example, to reduce VRAM/RAM data movement, which can also make generation faster. The next section covers KV cache quantization.
📐How to fit long context
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM/VRAM data movement. The allowed options for K quantization (the default is f16) are listed below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, although they are slightly slower - for example q4_1 or q5_1. So try out --cache-type-k q4_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON and use --flash-attn to enable it. Once Flash Attention is enabled, you can then use --cache-type-v q4_1.
If you are using our Dynamic FP8 quants (see 🎱 FP8 Qwen3-Coder-Next in vLLM), FP8 KV cache quantization can approximately double the supported context length. Add --kv-cache-dtype fp8.
Qwen3-Coder-Next is the best-performing model for its size, and its performance is comparable to models with 10–20× more active parameters. Benchmarks it was evaluated on include:
- SWE-Bench Verified (w/ SWE-Agent)
- SWE-Bench Multilingual (w/ SWE-Agent)
- SWE-Bench Pro (w/ SWE-Agent)
- Terminal-Bench 2.0 (w/ Terminus-2 json)
# Build llama.cpp from source (use -DGGML_CUDA=OFF for CPU-only inference)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

# Run the model in conversation mode directly from Hugging Face
./llama.cpp/llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40

# Download the UD-Q4_K_XL quant with the Hugging Face CLI
pip install -U huggingface_hub
hf download unsloth/Qwen3-Coder-Next-GGUF \
--local-dir unsloth/Qwen3-Coder-Next-GGUF \
--include "*UD-Q4_K_XL*"./llama.cpp/llama-cli \
--model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40

# Serve the model over an OpenAI-compatible API with llama-server
./llama.cpp/llama-server \
--model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3-Coder-Next" \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--port 8001

# Query the server with the OpenAI Python client
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-Next",
messages = [{"role": "user", "content": "Create a Flappy Bird game in HTML"},],
)
print(completion.choices[0].message.content)

# Install uv if you don't have it for faster environment installs
curl -LsSf https://astral.sh/uv/install.sh | sh
# Make a new Python environment - not needed if you want to change your whole system
uv venv unsloth_fp8 --python 3.12 --seed
source unsloth_fp8/bin/activate
uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

# Serve Unsloth's FP8 Dynamic quant with vLLM
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/Qwen3-Coder-Next-FP8-Dynamic \
--served-model-name unsloth/Qwen3-Coder-Next \
--tensor-parallel-size 4 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--max-model-len 200000 \
--gpu-memory-utilization 0.93 \
--port 8001

# Tool-calling example: execute generated Python code
messages = [{
"role": "user",
"content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00)

# Tool-calling example: execute terminal commands (write a file, then read it back)
messages = [{
"role": "user",
"content": [{"type": "text", "text": "Write 'I'm a happy Sloth' to a file, then print it back to me."}],
}]
messages = unsloth_inference(messages, temperature = 1.0, top_p = 1.0, top_k = 40, min_p = 0.00)