Self-Hosted LLMs — 2026 Rankings
The definitive ranking of self-hostable LLMs for enterprise — compared across quality, speed, hardware requirements, and cost. Find the best open-weight model for your infrastructure.
Roshan Desai · Last updated: 2026-03-24
Best Self-Hosted LLMs by Task — Benchmark Rankings
Which self-hosted model is best for coding, reasoning, or agentic tasks? The rankings below compare every open-weight model across four benchmark categories:
- **Advanced knowledge** · MMLU-Pro, a harder 10-option variant of MMLU
- **Graduate reasoning** · GPQA Diamond, PhD-level science questions
- **Instruction following** · IFEval, instruction-following accuracy
- **Chatbot Arena** · crowdsourced Elo from human preference votes (LMArena)
Self-Hosted LLM Benchmark Scores & Hardware Requirements
Complete benchmark results, VRAM requirements, and licensing for every major self-hostable LLM, listed alphabetically by model name.
| Model | Developer | Params | Context | License | VRAM (INT4) | VRAM (FP16) | MMLU-Pro | GPQA Diamond | IFEval | Arena Elo | SWE-bench | HumanEval | LiveCodeBench | AIME | MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Command R+ | Cohere | 104B | 131K | CC-BY-NC | 55 GB | 208 GB | N/A | N/A | N/A | 1262 | N/A | N/A | N/A | N/A | N/A |
| DeepSeek R1 | DeepSeek | 671B | 128K | MIT | 351 GB | 1340 GB | 84.0 | 71.5 | 83.3 | 1398 | 49.2 | 90.2 | 65.9 | 87.5 | 97.3 |
| DeepSeek V3.2 | DeepSeek | 685B | 130K | N/A | 351 GB | 1367 GB | 85.0 | 79.9 | N/A | 1423 | 67.8 | N/A | 74.1 | 89.3 | N/A |
| Devstral-2-123B | Mistral | 123B | 256K | Modified MIT | 65 GB | 246 GB | N/A | N/A | N/A | N/A | 72.2 | N/A | N/A | N/A | N/A |
| DS-R1-Distill-Llama-70B | DeepSeek | 70B | 128K | MIT | 36 GB | 140 GB | N/A | 65.2 | N/A | N/A | N/A | 86.0 | 57.5 | 70.0 | 94.5 |
| DS-R1-Distill-Qwen-14B | DeepSeek | 14B | 128K | MIT | 8 GB | 28 GB | N/A | 59.1 | N/A | N/A | N/A | N/A | 53.1 | N/A | 93.9 |
| DS-R1-Distill-Qwen-32B | DeepSeek | 32B | 128K | MIT | 17 GB | 64 GB | N/A | 62.1 | N/A | N/A | N/A | 85.4 | 53.1 | 72.0 | 94.3 |
| DS-R1-Distill-Qwen-7B | DeepSeek | 7B | 128K | MIT | 4 GB | 14 GB | N/A | 49.1 | N/A | N/A | N/A | N/A | N/A | N/A | 92.8 |
| Gemma 3 12B | Google | 12B | 128K | Gemma License | 8 GB | 24 GB | 60.0 | 40.9 | N/A | 1342 | N/A | 85.4 | N/A | N/A | N/A |
| Gemma 3 27B | Google | 27B | 128K | Gemma License | 14 GB | 54 GB | 67.5 | 42.4 | N/A | 1366 | N/A | N/A | 29.7 | N/A | 89.0 |
| GLM-4.7 | Zhipu AI | 355B | 200K | MIT | 180 GB | 710 GB | 84.3 | 85.7 | 88.0 | 1441 | 73.8 | 94.2 | 84.9 | 95.7 | N/A |
| GLM-5 | Zhipu AI | 744B | 200K | MIT | 386 GB | 1490 GB | 70.4 | 86.0 | 88.0 | 1454 | 77.8 | 90.0 | 52.0 | 84.0 | 88.0 |
| GPT-oss 120B | OpenAI | 117B | 128K | Apache 2.0 | 62 GB | 234 GB | 90.0 | 80.9 | N/A | 1355 | 62.4 | 88.3 | 60.0 | 97.9 | N/A |
| GPT-oss 20B | OpenAI | 20B | 128K | Apache 2.0 | 11 GB | 40 GB | 85.3 | 71.5 | N/A | 1318 | N/A | N/A | N/A | 98.7 | N/A |
| Hunyuan 2.0 | Tencent | 406B | 256K | Tencent License | 215 GB | 812 GB | N/A | N/A | N/A | N/A | 53.0 | N/A | N/A | N/A | N/A |
| Kimi K2.5 | Moonshot | 1T | 262K | MIT | 542 GB | 2000 GB | 87.1 | 87.6 | 94.0 | 1438 | 76.8 | 99.0 | 85.0 | 96.1 | 98.0 |
| Llama 3.1-8B | Meta | 8B | 131K | Llama License | 5 GB | 16 GB | 48.3 | 32.8 | 80.4 | 1212 | N/A | 72.6 | N/A | N/A | 51.9 |
| Llama 3.3 70B | Meta | 70B | 131K | Llama License | 38 GB | 140 GB | 68.9 | 50.7 | 92.1 | 1319 | N/A | 88.4 | N/A | N/A | 77.0 |
| Llama 4 Maverick | Meta | 400B | 1M | Llama License | 206 GB | 800 GB | 80.5 | 69.8 | N/A | 1328 | N/A | 62.0 | 43.4 | N/A | N/A |
| Llama 4 Scout | Meta | 109B | 10M | Llama License | 58 GB | 218 GB | 74.3 | 58.2 | N/A | 1323 | N/A | N/A | N/A | N/A | N/A |
| MiMo-V2-Flash | Xiaomi | 309B | 262K | MIT | 159 GB | 618 GB | 84.9 | 83.7 | N/A | 1393 | 73.4 | 84.8 | 80.6 | 94.1 | N/A |
| MiniMax M2.5 | MiniMax | 230B | 205K | Apache 2.0 | 117 GB | 460 GB | 76.5 | 85.2 | 87.5 | 1404 | 80.2 | 89.6 | 65.0 | 86.3 | N/A |
| Mistral Large 3 | Mistral | 675B | 256K | Apache 2.0 | 355 GB | 1350 GB | N/A | 43.9 | N/A | 1416 | N/A | 92.0 | 82.8 | 88.0 | 93.6 |
| Mistral Small 3.1 | Mistral | 24B | 131K | Apache 2.0 | 14 GB | 48 GB | 66.8 | 40.7 | 79.8 | 1304 | N/A | 87.2 | N/A | N/A | N/A |
| Nemotron Ultra 253B | Nvidia | 253B | 128K | Open Weight | 135 GB | 506 GB | N/A | 76.0 | 89.5 | 1348 | N/A | N/A | 66.3 | 72.5 | 97.0 |
| Phi-4 | Microsoft | 14B | 16K | MIT | 9 GB | 28 GB | 70.4 | 56.1 | 64.6 | 1256 | N/A | 82.6 | N/A | N/A | 80.4 |
| Phi-4-mini | Microsoft | 3.8B | 131K | MIT | 3 GB | 8 GB | 52.8 | 30.4 | N/A | N/A | N/A | 72.0 | N/A | N/A | N/A |
| Qwen 2.5-72B | Qwen | 72B | 131K | Apache 2.0 | 37 GB | 145 GB | 71.1 | 49.0 | 86.5 | 1303 | N/A | 86.6 | N/A | N/A | 83.1 |
| Qwen 3.5 | Qwen | 397B | 262K | Apache 2.0 | 207 GB | 794 GB | 87.8 | 88.4 | 92.6 | 1450 | 76.4 | N/A | 83.6 | N/A | N/A |
| Qwen3-235B-A22B | Qwen | 235B | 131K | Apache 2.0 | 120 GB | 470 GB | N/A | 71.1 | N/A | 1423 | N/A | N/A | 70.7 | 81.5 | N/A |
| Qwen3-30B-A3B | Qwen | 30B | 131K | Apache 2.0 | 16 GB | 60 GB | 68.7 | 60.0 | N/A | 1384 | N/A | N/A | N/A | 76.7 | 95.2 |
| Qwen3-Coder-Next | Qwen | 80B | 256K | Apache 2.0 | 42 GB | 160 GB | 78.4 | 53.4 | 89.1 | N/A | 70.6 | 94.1 | 74.5 | 89.2 | 83.5 |
| Qwen3.5-4B | Qwen | 4B | 262K | Apache 2.0 | 2 GB | 8 GB | 79.1 | 76.2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Qwen3.5-9B | Qwen | 9B | 262K | Apache 2.0 | 5 GB | 18 GB | 82.5 | 81.7 | N/A | N/A | N/A | N/A | 65.6 | N/A | N/A |
| Step-3.5-Flash | Stepfun | 196B | 262K | Apache 2.0 | 102 GB | 392 GB | 85.8 | N/A | N/A | 1389 | 74.4 | 81.1 | 86.4 | 99.8 | N/A |
VRAM estimates are based on model weight size only: FP16 uses 2 bytes per parameter (e.g. 70B model = 140 GB), INT4 uses 0.5 bytes per parameter (e.g. 70B model = 35 GB). Actual usage is typically 10–20% higher due to KV cache, activations, and framework overhead. Tools like Ollama default to 4-bit quantization, so real-world usage is often closer to the INT4 figure.
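For capacity planning, the same rule of thumb is easy to script. Here is a minimal sketch in Python, assuming the byte-per-parameter figures above and a 15% overhead midpoint; the function name and defaults are illustrative, not taken from any particular tool:

```python
def estimate_vram_gb(params_billions: float, precision: str = "fp16",
                     overhead: float = 0.15) -> float:
    """Rough VRAM estimate: weight size plus runtime overhead.

    Weights: FP16 = 2 bytes/param, INT8 = 1 byte, INT4 = 0.5 bytes.
    Overhead: assumed 15% midpoint of the 10-20% range above, covering
    KV cache, activations, and framework overhead.
    """
    bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}[precision]
    weights_gb = params_billions * bytes_per_param  # 1B params at 1 byte/param is ~1 GB
    return weights_gb * (1 + overhead)

# The 70B example from the note above:
print(round(estimate_vram_gb(70, "fp16")))  # 161 (140 GB weights + 15%)
print(round(estimate_vram_gb(70, "int4")))  # 40  (35 GB weights + 15%)
```

If you deploy through Ollama, which pulls 4-bit quantized weights by default, budget against the INT4 column plus that overhead margin rather than the FP16 figure.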
Deploy These Models with Onyx
Onyx is the open-source AI platform that lets you self-host any of these LLMs and connect them to your team's docs, apps, and people.