The Formula:
Max Concurrent Requests = Available Memory / KV Cache per Request
Step-by-Step Breakdown:
1. Total VRAM Available
Total VRAM = Number of GPUs × VRAM per GPU
Example: 2 × 24GB = 48GB total
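A tiny sketch of this step in Python (names are illustrative, values from the example):

```python
# Step 1: pool VRAM across all GPUs
num_gpus = 2
vram_per_gpu_gb = 24
total_vram_gb = num_gpus * vram_per_gpu_gb
print(total_vram_gb)  # 48
```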
2. Model Memory (Adjusted for Quantization)
Adjusted Model Memory = Base Model Memory × Quantization Factor
Example: 14GB (7B model) × 0.5 (INT4) = 7GB
The model weights are loaded once and stay in memory.
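As a sketch, with the example's figures (the quantization factor depends on your scheme; the names here are illustrative):

```python
# Step 2: weights shrink by the quantization factor and stay resident in VRAM
base_model_memory_gb = 14    # 7B model at FP16, as in the example above
quant_factor = 0.5           # factor used in the example above
adjusted_model_memory_gb = base_model_memory_gb * quant_factor
print(adjusted_model_memory_gb)  # 7.0
```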
3. KV Cache per Request
KV Cache = (Context Length × Adjusted Model Memory × KV Overhead) / 1000
Example: (8192 × 7GB × 0.2) / 1000 = 11.47GB per request
This memory is needed for each active request's attention cache.
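The same heuristic as code (this mirrors the formula above; it is a rough estimate, not an architecture-exact figure):

```python
# Step 3: per-request KV-cache heuristic from the formula above
context_length = 8192
adjusted_model_memory_gb = 7.0
kv_overhead = 0.2
kv_cache_per_request_gb = context_length * adjusted_model_memory_gb * kv_overhead / 1000
print(round(kv_cache_per_request_gb, 2))  # 11.47
```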
4. Available Memory for Inference
Available = Total VRAM - System Overhead - Model Memory
Example: 48GB - 2GB - 7GB = 39GB
This is what's left for KV caches after loading the model.
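In code (names are illustrative, values from the example):

```python
# Step 4: subtract system overhead and resident weights from total VRAM
total_vram_gb = 48
system_overhead_gb = 2
adjusted_model_memory_gb = 7.0
available_gb = total_vram_gb - system_overhead_gb - adjusted_model_memory_gb
print(available_gb)  # 39.0
```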
5. Maximum Concurrent Requests
Max Requests = Available Memory / KV Cache per Request
Example: 39GB / 11.47GB ≈ 3.4, i.e. about 3 full-context requests at once
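Putting the five steps together, a minimal self-contained sketch (function and parameter names are illustrative; the defaults mirror the example above):

```python
def max_concurrent_requests(num_gpus, vram_per_gpu_gb, base_model_memory_gb,
                            quant_factor, context_length,
                            kv_overhead=0.2, system_overhead_gb=2.0):
    """Steps 1-5 of the breakdown above; returns a fractional request count."""
    total_vram_gb = num_gpus * vram_per_gpu_gb                           # step 1
    model_gb = base_model_memory_gb * quant_factor                       # step 2
    kv_per_request_gb = context_length * model_gb * kv_overhead / 1000   # step 3
    available_gb = total_vram_gb - system_overhead_gb - model_gb         # step 4
    return available_gb / kv_per_request_gb                              # step 5

print(round(max_concurrent_requests(2, 24, 14, 0.5, 8192), 1))  # 3.4
```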
What the Results Mean:
- < 1 request: Can't handle even one full-context request; use a smaller context or a GPU with more memory
- 1-2 requests: Basic serving capability, suitable for personal use
- 3-5 requests: Good for small-scale deployment
- 10+ requests: Production-ready for moderate traffic
Special Handling for Mixture-of-Experts (MoE) Models:
MoE models (like Mixtral, DeepSeek V3/R1, Qwen3 MoE, Kimi K2, GLM-4.5) work differently:
- Total Parameters: The full model size (e.g., Mixtral 8x7B has ~47B total parameters, not a literal 8 × 7B = 56B, because the attention layers are shared)
- Active Parameters: Only a subset of experts runs per token (e.g., ~13B active for Mixtral)
- Memory Calculation: We automatically use active memory for these models
- Why this matters: Only the active experts need to be resident in memory at any one time, so sizing uses the active footprint rather than the full model
Example: Mixtral 8x7B shows "~94GB total, ~16GB active" - we calculate using 16GB
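A sketch of the selection logic (field names and figures are illustrative, mirroring the Mixtral example above):

```python
# For MoE models, size with the active-expert footprint rather than the total
model = {"name": "Mixtral 8x7B", "is_moe": True,
         "total_memory_gb": 94, "active_memory_gb": 16}

memory_for_sizing_gb = (model["active_memory_gb"] if model["is_moe"]
                        else model["total_memory_gb"])
print(f'{model["name"]}: size the deployment with {memory_for_sizing_gb} GB')  # 16, not 94
```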
Important Notes:
- This is a rough estimate; actual usage varies by model architecture
- Assumes the worst case: every request uses the full context window. In practice most requests use far less, so you can usually handle more concurrent requests
- KV cache grows linearly with the tokens actually in use, not the maximum context
- Different attention mechanisms (MHA, MQA, GQA) change KV-cache size significantly; see the sketch after this list
- Framework overhead and memory fragmentation can impact real-world performance
- Dynamic batching and memory management can improve real-world throughput
- MoE models: Memory requirements can vary based on routing algorithms and expert utilization patterns
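For readers who want a tighter bound than the step-3 heuristic, the standard per-token KV-cache size depends directly on the attention layout: 2 (K and V) × layers × KV heads × head dimension × bytes per element. A minimal sketch, using Llama-2-7B's configuration as the illustrative default (it uses MHA, so KV heads equal attention heads; GQA models simply have fewer KV heads):

```python
# Architecture-aware KV-cache estimate (contrast with the step-3 heuristic above)
def kv_cache_gb(num_tokens, num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
    return num_tokens * bytes_per_token / 1024**3

print(kv_cache_gb(8192))                  # MHA, FP16: ~4.0 GB for a full 8K context
print(kv_cache_gb(8192, num_kv_heads=8))  # GQA with 8 KV heads: ~1.0 GB
```

Note how the estimate scales linearly with the token count and drops sharply under GQA or MQA, which is why the flat heuristic above can differ noticeably from real-world usage.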