🌠Qwen3-Coder-Next: How to Run Locally
Guide to run Qwen3-Coder-Next locally on your device!
Qwen releases Qwen3-Coder-Next, an 80B-parameter MoE model (3B active parameters) with 256K context, built for fast agentic coding and local use. Its performance is comparable to that of models with 10–20× more active parameters.
It runs in 46GB of RAM/VRAM/unified memory (85GB for 8-bit) and is non-reasoning, so it gives ultra-quick code responses. The model excels at long-horizon reasoning, complex tool use, and recovery from execution failures.
Feb 4 update: llama.cpp fixed a bug in the vectorized key_gdiff calculation, which resolves the previous looping and output issues. We updated the GGUFs - please re-download them and update llama.cpp for better outputs.
You'll also learn how to run the model with Codex & Claude Code. For fine-tuning, Qwen3-Coder-Next fits on a single B200 GPU for bf16 LoRA in Unsloth.
Qwen3-Coder-Next Unsloth Dynamic GGUFs to run: unsloth/Qwen3-Coder-Next-GGUF
Run GGUF Tutorial | Codex & Claude Code | FP8 vLLM Tutorial
Don't have 46GB of RAM or unified memory? No worries, you can run our smaller quants like 3-bit. The rule of thumb is that the sum of your compute (disk space + RAM + VRAM) should be at least the size of the quant. If the quant fully fits in your RAM/VRAM, expect 20+ tokens/s. If it doesn't fit, it will still work by offloading, just more slowly.
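As a quick sanity check, here is a tiny sketch of that rule of thumb; the hardware numbers are made-up examples, so substitute your own:

# Rule-of-thumb check for whether a quant will fit (example numbers - replace with yours).
quant_size_gb = 46   # e.g. the 4-bit UD-Q4_K_XL quant
vram_gb = 24
ram_gb = 32
free_disk_gb = 200

if vram_gb + ram_gb >= quant_size_gb:
    print("Fits in RAM/VRAM: expect 20+ tokens/s")
elif vram_gb + ram_gb + free_disk_gb >= quant_size_gb:
    print("Fits only with disk offloading: it will run, but slower")
else:
    print("Too large: pick a smaller quant such as 3-bit")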
To achieve optimal performance, Qwen recommends these settings:
- Temperature = 1.0, Top_P = 0.95, Top_K = 40
- Min_P = 0.01 (llama.cpp's default is 0.05)
- Repeat penalty = disabled or 1.0
- Context: up to 262,144 tokens supported natively, but you can set it to 32,768 tokens for lower memory use.
The exact settings depend on your use-case. Because this guide uses the 4-bit quant, you will need around 46GB of RAM/unified memory. We recommend using at least 3-bit precision for best results.
NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. So specifying enable_thinking=False is no longer required.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
You can pull the model directly from Hugging Face. You can increase the context to 256K if your RAM/VRAM can fit it; using --fit on will also auto-determine the context length.
You can use the recommended parameters: temperature=1.0, top_p=0.95, top_k=40, min_p=0.01
Or download the model with the commands below (after installing the Hugging Face CLI via pip install huggingface_hub). You can choose UD-Q4_K_XL or other quantized versions. If downloads get stuck, see Hugging Face Hub, XET debugging.
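If you prefer to download from Python rather than the hf CLI, a roughly equivalent sketch using huggingface_hub's snapshot_download is:

from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL shards of the GGUF repo into a local folder.
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-Next-GGUF",
    local_dir = "unsloth/Qwen3-Coder-Next-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)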
Then run the model in conversation mode:
Also, adjust the context window as required, up to 262,144 tokens.
🦙Llama-server serving & deployment
To deploy Qwen3-Coder-Next for production, we use llama-server. Open a new terminal (for example via tmux), then deploy the model via:
Then in a new terminal, after doing pip install openai, we can run the model:
This will print the model's output. We extracted the HTML and ran it, and the example Flappy Bird game it generated worked well!
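The client shown in the code listing uses the server's default sampling. If you want to pass the recommended samplers explicitly and save the generated page to disk, a sketch is below; it assumes the llama-server above is running on port 8001 and that the server accepts llama.cpp's extra top_k/min_p fields through the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
completion = client.chat.completions.create(
    model = "unsloth/Qwen3-Coder-Next",
    messages = [{"role": "user", "content": "Create a Flappy Bird game in HTML"}],
    temperature = 1.0,
    top_p = 0.95,
    # top_k and min_p are not part of the core OpenAI API, so pass them as extras.
    extra_body = {"top_k": 40, "min_p": 0.01},
)

# Save the generated page so you can open it in a browser.
with open("flappy_bird.html", "w") as f:
    f.write(completion.choices[0].message.content)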
👾 OpenAI Codex & Claude Code
To use the model for local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to 'Qwen3-Coder-Next', and make sure you follow the correct Qwen3-Coder-Next parameters and usage instructions. Use the llama-server we just set up above.
After following the instructions for Claude Code, for example, you will see:
We can then ask it to, say, Create a Python game for Chess:
If you see API Error: 400 {"error":{"code":400,"message":"request (16582 tokens) exceeds the available context size (16384 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":16582,"n_ctx":16384}}, that means you need to increase the context length (see 📐How to fit long context).
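To check what context size your running llama-server actually has, you can query it directly. This sketch assumes a recent llama-server build that exposes the /props endpoint on port 8001; response fields can vary between versions, so we simply print the whole response:

import json, urllib.request

# Inspect the running server's settings; look for the context size (n_ctx) in the output.
with urllib.request.urlopen("http://127.0.0.1:8001/props") as response:
    print(json.dumps(json.load(response), indent = 2))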
🎱 FP8 Qwen3-Coder-Next in vLLM
You can now use our new FP8 Dynamic quant of the model for high-quality, fast inference. First install vLLM from the nightly builds. Change --extra-index-url https://wheels.vllm.ai/nightly/cu130 to match your CUDA version (found via nvidia-smi) - only cu129 and cu130 are currently supported.
Then serve Unsloth's dynamic FP8 version of the model. You can also enable FP8 KV cache to reduce KV cache memory usage by 50% by adding --kv-cache-dtype fp8. We served it on 4 GPUs, but if you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 (or remove this argument). Use tmux to launch the command below in a new terminal, then press CTRL+B followed by D to detach - use tmux attach-session -t0 to return to it.
You should see something like the below. See Tool Calling with Qwen3-Coder-Next for how to use Qwen3-Coder-Next via the OpenAI API and tool calling - this works for both vLLM and llama-server.
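As a quick check that the server came up correctly (assuming port 8001 as in the command above), you can list the served models through the OpenAI-compatible endpoint:

from openai import OpenAI

# Verify the vLLM (or llama-server) endpoint is reachable before wiring in tools.
client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
for model in client.models.list():
    print(model.id)   # should include unsloth/Qwen3-Coder-Next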
🔧Tool Calling with Qwen3-Coder-Next
In a new terminal, we create some tools, like adding 2 numbers, executing Python code, executing Linux commands and much more:
We then use the functions below (copy, paste and execute them), which parse the function calls automatically and call the OpenAI endpoint for any model:
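The exact tool definitions from the original notebook aren't reproduced on this page, so here is a minimal, illustrative sketch of what they could look like; the function names and schemas below are assumptions, not the guide's exact code:

import subprocess

# Illustrative tool implementations - the guide's actual tools may differ.
def add_numbers(a: float, b: float):
    return a + b

def execute_python(code: str):
    # WARNING: exec() runs arbitrary model-generated code - only do this in a sandbox.
    local_vars = {}
    exec(code, {}, local_vars)
    return local_vars

def execute_terminal(command: str):
    result = subprocess.run(command, shell = True, capture_output = True, text = True)
    return result.stdout + result.stderr

# OpenAI-style JSON schemas so the model knows how to call each tool.
tools = [
    {"type": "function", "function": {
        "name": "add_numbers",
        "description": "Add two numbers together.",
        "parameters": {"type": "object",
                       "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                       "required": ["a", "b"]}}},
    {"type": "function", "function": {
        "name": "execute_python",
        "description": "Execute Python code and return the local variables it created.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
    {"type": "function", "function": {
        "name": "execute_terminal",
        "description": "Run a Linux shell command and return its output.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]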
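Likewise, the parsing helper itself isn't shown on this page. The sketch below implements an unsloth_inference-style loop with the same call signature used in the examples that follow: it sends the messages plus tools to the OpenAI-compatible endpoint, executes any tool calls the model makes, and feeds the results back. It assumes the server from earlier is on port 8001, that it accepts the extra top_k/min_p fields, and that the tools from the previous sketch are defined - treat it as an illustration rather than the guide's exact code:

import json
from openai import OpenAI

# Assumes the llama-server / vLLM endpoint from earlier is running on port 8001.
client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")
available_functions = {
    "add_numbers": add_numbers,
    "execute_python": execute_python,
    "execute_terminal": execute_terminal,
}

def unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00):
    # Keep calling the model until it stops requesting tools.
    while True:
        response = client.chat.completions.create(
            model = "unsloth/Qwen3-Coder-Next",
            messages = messages,
            tools = tools,
            temperature = temperature,
            top_p = top_p,
            # top_k / min_p are server-specific extensions (llama-server and vLLM accept them).
            extra_body = {"top_k": top_k, "min_p": min_p},
        )
        message = response.choices[0].message
        messages.append(message.model_dump(exclude_none = True))
        if not message.tool_calls:
            print(message.content)
            return messages
        # Execute each requested tool and append its result for the next round.
        for tool_call in message.tool_calls:
            function = available_functions[tool_call.function.name]
            arguments = json.loads(tool_call.function.arguments)
            result = function(**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })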
Now we'll showcase multiple methods of running tool-calling for many different use-cases below:
Execute generated Python code
Execute arbitrary terminal functions
We confirm the file was created - and it was!
See Tool Calling Guide for more examples for tool calling.
🛠️ Improving generation speed
If you have more VRAM, you can try moving more MoE layers onto the GPU, or keeping whole layers there.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU. This effectively lets you fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regular expression to keep more layers on the GPU if you have more capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
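If you find the layer-range regex fiddly to write by hand, here is a small illustrative helper that generates an equivalent -ot pattern for offloading the expert tensors of layers first_layer..max_layer to the CPU (adjust max_layer to your model's layer count):

# Build a llama.cpp -ot pattern that offloads gate/up/down MoE experts to the CPU
# for layers first_layer..max_layer (equivalent in spirit to the regex above).
def moe_offload_pattern(first_layer: int, max_layer: int) -> str:
    layers = "|".join(str(i) for i in range(first_layer, max_layer + 1))
    return rf"\.({layers})\.ffn_(gate|up|down)_exps.=CPU"

# Example: offload experts from layer 6 onwards, assuming a 48-layer model.
print(moe_offload_pattern(6, 47))   # pass the printed string to llama.cpp via -ot "..."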
The latest llama.cpp release also introduces a high-throughput mode - use llama-parallel. Read more about it here. You can also quantize the KV cache to 4-bit, for example, to reduce VRAM/RAM data movement, which can also make generation faster. The next section covers KV cache quantization.
📐How to fit long context
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM/VRAM data movement. The allowed options for K quantization (the default is f16) are listed below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, although they are slightly slower - for example q4_1 or q5_1. So try out --cache-type-k q4_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON and use --flash-attn to enable it. Once Flash Attention is enabled, you can then use --cache-type-v q4_1.
If you are using our Dynamic FP8 quants (see 🎱 FP8 Qwen3-Coder-Next in vLLM), FP8 KV cache quantization can approximately double the supported context length. Add --kv-cache-dtype fp8.
Qwen3-Coder-Next is the best-performing model for its size, and its performance is comparable to models with 10–20× more active parameters. Benchmarks it was evaluated on include:
- SWE-Bench Verified (w/ SWE-Agent)
- SWE-Bench Multilingual (w/ SWE-Agent)
- SWE-Bench Pro (w/ SWE-Agent)
- Terminal-Bench 2.0 (w/ Terminus-2 json)
# Build llama.cpp from source (use -DGGML_CUDA=OFF for CPU-only inference)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

# Run the model in conversation mode directly from Hugging Face
./llama.cpp/llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40

# Download the UD-Q4_K_XL quant with the Hugging Face CLI
pip install -U huggingface_hub
hf download unsloth/Qwen3-Coder-Next-GGUF \
--local-dir unsloth/Qwen3-Coder-Next-GGUF \
--include "*UD-Q4_K_XL*"./llama.cpp/llama-cli \
--model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40

# Serve the model over an OpenAI-compatible API with llama-server
./llama.cpp/llama-server \
--model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3-Coder-Next" \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--port 8001

# Query the server with the OpenAI Python client
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-Next",
messages = [{"role": "user", "content": "Create a Flappy Bird game in HTML"},],
)
print(completion.choices[0].message.content)

# Install uv if you don't have it for faster environment installs
curl -LsSf https://astral.sh/uv/install.sh | sh
# Make a new Python environment - not needed if you want to change your whole system
uv venv unsloth_fp8 --python 3.12 --seed
source unsloth_fp8/bin/activate
uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

# Serve Unsloth's FP8 Dynamic quant with vLLM
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/Qwen3-Coder-Next-FP8-Dynamic \
--served-model-name unsloth/Qwen3-Coder-Next \
--tensor-parallel-size 4 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--max-model-len 200000 \
--gpu-memory-utilization 0.93 \
--port 8001

# Tool-calling example: execute generated Python code
messages = [{
"role": "user",
"content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00)

# Tool-calling example: execute terminal commands (write a file, then read it back)
messages = [{
"role": "user",
"content": [{"type": "text", "text": "Write 'I'm a happy Sloth' to a file, then print it back to me."}],
}]
messages = unsloth_inference(messages, temperature = 1.0, top_p = 1.0, top_k = 40, min_p = 0.00)