Self-Hosted LLMs — 2026 Rankings
The definitive ranking of self-hostable LLMs for enterprise — compared across quality, speed, hardware requirements, and cost. Find the best open-weight model for your infrastructure.
Roshan Desai · Last updated: 2026-03-24
Best Self-Hosted LLMs by Task — Benchmark Rankings
Which self-hosted model is best for coding, reasoning, or agentic tasks? The rankings below compare every open-weight model across four benchmark categories:
- **Advanced knowledge** · MMLU-Pro, a harder 10-option variant of MMLU
- **Graduate reasoning** · GPQA Diamond, PhD-level science questions
- **Instruction following** · IFEval, instruction-following accuracy
- **Chatbot Arena** · crowdsourced Elo from human preference votes (LMArena)
Self-Hosted LLM Benchmark Scores & Hardware Requirements
Complete benchmark results, VRAM requirements, and licensing for every major self-hostable LLM, listed alphabetically by model name.
| Model | Developer | Params | Context | License | VRAM (INT4) | VRAM (FP16) | MMLU-Pro | GPQA Diamond | IFEval | Arena Elo | SWE-bench | HumanEval | LiveCodeBench | AIME | MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Command R+ | Cohere | 104B | 131K | CC-BY-NC | 55 GB | 208 GB | N/A | N/A | N/A | 1262 | N/A | N/A | N/A | N/A | N/A |
| DeepSeek R1 | DeepSeek | 671B | 128K | MIT | 351 GB | 1340 GB | 84.0 | 71.5 | 83.3 | 1398 | 49.2 | 90.2 | 65.9 | 87.5 | 97.3 |
| DeepSeek V3.2 | DeepSeek | 685B | 130K | N/A | 351 GB | 1367 GB | 85.0 | 79.9 | N/A | 1423 | 67.8 | N/A | 74.1 | 89.3 | N/A |
| Devstral-2-123B | Mistral | 123B | 256K | Modified MIT | 65 GB | 246 GB | N/A | N/A | N/A | N/A | 72.2 | N/A | N/A | N/A | N/A |
| DS-R1-Distill-Llama-70B | DeepSeek | 70B | 128K | MIT | 36 GB | 140 GB | N/A | 65.2 | N/A | N/A | N/A | 86.0 | 57.5 | 70.0 | 94.5 |
| DS-R1-Distill-Qwen-14B | DeepSeek | 14B | 128K | MIT | 8 GB | 28 GB | N/A | 59.1 | N/A | N/A | N/A | N/A | 53.1 | N/A | 93.9 |
| DS-R1-Distill-Qwen-32B | DeepSeek | 32B | 128K | MIT | 17 GB | 64 GB | N/A | 62.1 | N/A | N/A | N/A | 85.4 | 53.1 | 72.0 | 94.3 |
| DS-R1-Distill-Qwen-7B | DeepSeek | 7B | 128K | MIT | 4 GB | 14 GB | N/A | 49.1 | N/A | N/A | N/A | N/A | N/A | N/A | 92.8 |
| Gemma 3 12B | Google | 12B | 128K | Gemma License | 8 GB | 24 GB | 60.0 | 40.9 | N/A | 1342 | N/A | 85.4 | N/A | N/A | N/A |
| Gemma 3 27B | Google | 27B | 128K | Gemma License | 14 GB | 54 GB | 67.5 | 42.4 | N/A | 1366 | N/A | N/A | 29.7 | N/A | 89.0 |
| GLM-4.7 | Zhipu AI | 355B | 200K | MIT | 180 GB | 710 GB | 84.3 | 85.7 | 88.0 | 1441 | 73.8 | 94.2 | 84.9 | 95.7 | N/A |
| GLM-5 | Zhipu AI | 744B | 200K | MIT | 386 GB | 1490 GB | 70.4 | 86.0 | 88.0 | 1454 | 77.8 | 90.0 | 52.0 | 84.0 | 88.0 |
| GPT-oss 120B | OpenAI | 117B | 128K | Apache 2.0 | 62 GB | 234 GB | 90.0 | 80.9 | N/A | 1355 | 62.4 | 88.3 | 60.0 | 97.9 | N/A |
| GPT-oss 20B | OpenAI | 20B | 128K | Apache 2.0 | 11 GB | 40 GB | 85.3 | 71.5 | N/A | 1318 | N/A | N/A | N/A | 98.7 | N/A |
| Hunyuan 2.0 | Tencent | 406B | 256K | Tencent License | 215 GB | 812 GB | N/A | N/A | N/A | N/A | 53.0 | N/A | N/A | N/A | N/A |
| Kimi K2.5 | Moonshot | 1T | 262K | MIT | 542 GB | 2000 GB | 87.1 | 87.6 | 94.0 | 1438 | 76.8 | 99.0 | 85.0 | 96.1 | 98.0 |
| Llama 3.1-8B | Meta | 8B | 131K | Llama License | 5 GB | 16 GB | 48.3 | 32.8 | 80.4 | 1212 | N/A | 72.6 | N/A | N/A | 51.9 |
| Llama 3.3 70B | Meta | 70B | 131K | Llama License | 38 GB | 140 GB | 68.9 | 50.7 | 92.1 | 1319 | N/A | 88.4 | N/A | N/A | 77.0 |
| Llama 4 Maverick | Meta | 400B | 1M | Llama License | 206 GB | 800 GB | 80.5 | 69.8 | N/A | 1328 | N/A | 62.0 | 43.4 | N/A | N/A |
| Llama 4 Scout | Meta | 109B | 10M | Llama License | 58 GB | 218 GB | 74.3 | 58.2 | N/A | 1323 | N/A | N/A | N/A | N/A | N/A |
| MiMo-V2-Flash | Xiaomi | 309B | 262K | MIT | 159 GB | 618 GB | 84.9 | 83.7 | N/A | 1393 | 73.4 | 84.8 | 80.6 | 94.1 | N/A |
| MiniMax M2.5 | MiniMax | 230B | 205K | Apache 2.0 | 117 GB | 460 GB | 76.5 | 85.2 | 87.5 | 1404 | 80.2 | 89.6 | 65.0 | 86.3 | N/A |
| Mistral Large 3 | Mistral | 675B | 256K | Apache 2.0 | 355 GB | 1350 GB | N/A | 43.9 | N/A | 1416 | N/A | 92.0 | 82.8 | 88.0 | 93.6 |
| Mistral Small 3.1 | Mistral | 24B | 131K | Apache 2.0 | 14 GB | 48 GB | 66.8 | 40.7 | 79.8 | 1304 | N/A | 87.2 | N/A | N/A | N/A |
| Nemotron Ultra 253B | Nvidia | 253B | 128K | Open Weight | 135 GB | 506 GB | N/A | 76.0 | 89.5 | 1348 | N/A | N/A | 66.3 | 72.5 | 97.0 |
| Phi-4 | Microsoft | 14B | 16K | MIT | 9 GB | 28 GB | 70.4 | 56.1 | 64.6 | 1256 | N/A | 82.6 | N/A | N/A | 80.4 |
| Phi-4-mini | Microsoft | 3.8B | 131K | MIT | 3 GB | 8 GB | 52.8 | 30.4 | N/A | N/A | N/A | 72.0 | N/A | N/A | N/A |
| Qwen 2.5-72B | Qwen | 72B | 131K | Apache 2.0 | 37 GB | 145 GB | 71.1 | 49.0 | 86.5 | 1303 | N/A | 86.6 | N/A | N/A | 83.1 |
| Qwen 3.5 | Qwen | 397B | 262K | Apache 2.0 | 207 GB | 794 GB | 87.8 | 88.4 | 92.6 | 1450 | 76.4 | N/A | 83.6 | N/A | N/A |
| Qwen3-235B-A22B | Qwen | 235B | 131K | Apache 2.0 | 120 GB | 470 GB | N/A | 71.1 | N/A | 1423 | N/A | N/A | 70.7 | 81.5 | N/A |
| Qwen3-30B-A3B | Qwen | 30B | 131K | Apache 2.0 | 16 GB | 60 GB | 68.7 | 60.0 | N/A | 1384 | N/A | N/A | N/A | 76.7 | 95.2 |
| Qwen3-Coder-Next | Qwen | 80B | 256K | Apache 2.0 | 42 GB | 160 GB | 78.4 | 53.4 | 89.1 | N/A | 70.6 | 94.1 | 74.5 | 89.2 | 83.5 |
| Qwen3.5-4B | Qwen | 4B | 262K | Apache 2.0 | 2 GB | 8 GB | 79.1 | 76.2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Qwen3.5-9B | Qwen | 9B | 262K | Apache 2.0 | 5 GB | 18 GB | 82.5 | 81.7 | N/A | N/A | N/A | N/A | 65.6 | N/A | N/A |
| Step-3.5-Flash | Stepfun | 196B | 262K | Apache 2.0 | 102 GB | 392 GB | 85.8 | N/A | N/A | 1389 | 74.4 | 81.1 | 86.4 | 99.8 | N/A |
VRAM estimates are based on model weight size only: FP16 uses 2 bytes per parameter (e.g. 70B model = 140 GB), INT4 uses 0.5 bytes per parameter (e.g. 70B model = 35 GB). Actual usage is typically 10–20% higher due to KV cache, activations, and framework overhead. Tools like Ollama default to 4-bit quantization, so real-world usage is often closer to the INT4 figure.
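For capacity planning, the same rule of thumb is easy to script. Here is a minimal sketch in Python, assuming the byte-per-parameter figures above and a 15% overhead midpoint; the function name and defaults are illustrative, not taken from any particular tool:

```python
def estimate_vram_gb(params_billions: float, precision: str = "fp16",
                     overhead: float = 0.15) -> float:
    """Rough VRAM estimate: weight size plus runtime overhead.

    Weights: FP16 = 2 bytes/param, INT8 = 1 byte, INT4 = 0.5 bytes.
    Overhead: assumed 15% midpoint of the 10-20% range above, covering
    KV cache, activations, and framework overhead.
    """
    bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}[precision]
    weights_gb = params_billions * bytes_per_param  # 1B params at 1 byte/param is ~1 GB
    return weights_gb * (1 + overhead)

# The 70B example from the note above:
print(round(estimate_vram_gb(70, "fp16")))  # 161 (140 GB weights + 15%)
print(round(estimate_vram_gb(70, "int4")))  # 40  (35 GB weights + 15%)
```

If you deploy through Ollama, which pulls 4-bit quantized weights by default, budget against the INT4 column plus that overhead margin rather than the FP16 figure.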
Deploy These Models with Onyx
Onyx is the open-source AI platform that lets you self-host any of these LLMs and connect them to your team's docs, apps, and people.