MiniMax-M2.5: How to Run Guide | Unsloth Documentation


Run MiniMax-M2.5 locally on your own device!

MiniMax-M2.5 is a new open LLM achieving SOTA results in coding, agentic tool use, search, and office work, scoring 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp.

The 230B-parameter (10B active) model has a 200K context window, and the unquantized bf16 weights require 457GB. Unsloth Dynamic 3-bit GGUFs reduce the size to 101GB (-62%): MiniMax-M2.5 GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so the 3-bit quant has important layers upcast to 8 or 16-bit. You can also fine-tune the model via Unsloth, including on multiple GPUs.

The 3-bit dynamic quant UD-Q3_K_XL uses 101GB of disk space. This fits nicely on a 128GB unified-memory Mac at ~20+ tokens/s, and runs even faster, at 25+ tokens/s, with a single 16GB GPU and 96GB of RAM. The 2-bit quants, including the largest one, will fit on a 96GB device.

For near-full precision, use Q8_0 (8-bit), which uses 243GB and fits on a 256GB RAM device / Mac at 10+ tokens/s.

For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
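As a quick convenience check on Linux (this snippet is our addition, not from the original guide), the total can be read from /proc/meminfo, plus nvidia-smi for VRAM if an NVIDIA GPU is present:

```shell
# Total system RAM in GiB (Linux)
awk '/MemTotal/ {printf "System RAM: %.1f GiB\n", $2 / 1048576}' /proc/meminfo

# Total VRAM, if an NVIDIA GPU is available
command -v nvidia-smi >/dev/null 2>&1 && \
    nvidia-smi --query-gpu=memory.total --format=csv,noheader
```

Compare the sum against the size of the quant you plan to download (e.g. 101GB for UD-Q3_K_XL).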

MiniMax recommends the following parameters for best performance: temperature = 1.0, top_p = 0.95, top_k = 40.

Default Settings (Most Tasks)

  • Repeat penalty = 1.0 (i.e. disabled)

  • Maximum context window: 196,608

  • Min_P = 0.01 (default might be 0.05)

Recommended system prompt:

You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.
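With llama.cpp, this system prompt can be passed on the command line via the -sys flag (a sketch assuming a recent llama-cli build that supports -sys; the quant choice mirrors the tutorial below):

```shell
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 \
    -sys "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax."
```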

Run MiniMax-M2.5 Tutorials:

For these tutorials, we will be using the 3-bit UD-Q3_K_XL quant, which fits on a 128GB RAM device.

Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

If you want llama.cpp to download and load the model directly, you can do the below: the suffix after the colon (e.g. :UD-Q3_K_XL) selects the quantization type. You can also download via Hugging Face first (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 200K (196,608) tokens.

Follow this for most default use-cases:

export LLAMA_CACHE="unsloth/MiniMax-M2.5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --ctx-size 16384 \
    --flash-attn on \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40

Download the model via the commands below (after pip install huggingface_hub hf_transfer). You can choose UD-Q3_K_XL (the dynamic 3-bit quant) or other quantized versions like UD-Q6_K_XL. We recommend the 3-bit dynamic quant UD-Q3_K_XL to balance size and accuracy. If downloads get stuck, see the Hugging Face Hub XET debugging guide.

hf download unsloth/MiniMax-M2.5-GGUF \
    --local-dir unsloth/MiniMax-M2.5-GGUF \
    --include "*UD-Q3_K_XL*" # Use "*Q8_0*" for 8-bit

You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it for CPU-only inference.

./llama.cpp/llama-cli \
    --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --ctx-size 16384 \
    --seed 3407

🦙 Llama-server & OpenAI's completion library

To deploy MiniMax-M2.5 for production, we use llama-server, which exposes an OpenAI-compatible API. In a new terminal (e.g. via tmux), deploy the model via:
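A sketch of the server launch, mirroring the sampling flags used with llama-cli above; the port number (8001) is our choice, not from the original:

```shell
./llama.cpp/llama-server \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --ctx-size 16384 \
    --flash-attn on \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --port 8001
```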

Then in a new terminal, after doing pip install openai, do:
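A minimal client sketch using the openai library; the base_url (port 8001), dummy API key, and model string are assumptions and must match however you launched llama-server:

```python
from openai import OpenAI

# Point the client at the local llama-server OpenAI-compatible endpoint.
# llama-server does not check the API key by default, so any string works.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="unsloth/MiniMax-M2.5-GGUF",  # llama-server serves whatever model it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax."},
        {"role": "user", "content": "Write a Fibonacci function in Python."},
    ],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```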

Benjamin Marie (third-party) benchmarked MiniMax-M2.5 using Unsloth GGUF quantizations on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).

Unsloth quants, regardless of their precision, perform much better than their non-Unsloth counterparts in both accuracy and relative error (despite being 8GB smaller).

Key results:

  • Best quality/size tradeoff here: Unsloth UD-Q4_K_XL. It's the closest to the Original: only 6.0 points down, and "only" +22.8% more errors than baseline.

  • Other Unsloth Q4 quants perform closely together (~64.5–64.9 accuracy). IQ4_NL, MXFP4_MOE, and UD-IQ2_XXS are all basically the same quality on this benchmark, with ~33–35% more errors than Original.

  • Unsloth GGUFs perform much better than other non-Unsloth GGUFs, e.g. lmstudio-community's Q4_K_M (even though the Unsloth quant is 8GB smaller) and AesSedai's IQ3_S.

You can view the benchmarks in table format further below:
