The Formula:
Max Concurrent Requests = Available Memory / KV Cache per Request
Step-by-Step Breakdown:
1. Total VRAM Available
Total VRAM = Number of GPUs × VRAM per GPU
Example: 2 × 24GB = 48GB total
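A tiny sketch of this step in Python (names are illustrative, values from the example):

```python
# Step 1: pool VRAM across all GPUs
num_gpus = 2
vram_per_gpu_gb = 24
total_vram_gb = num_gpus * vram_per_gpu_gb
print(total_vram_gb)  # 48
```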
2. Model Memory (Adjusted for Quantization)
Adjusted Model Memory = Base Model Memory × Quantization Factor
Example: 14GB (7B model) × 0.5 (INT4) = 7GB
The model weights are loaded once and stay in memory.
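As a sketch, with the example's figures (the quantization factor depends on your scheme; the names here are illustrative):

```python
# Step 2: weights shrink by the quantization factor and stay resident in VRAM
base_model_memory_gb = 14    # 7B model at FP16, as in the example above
quant_factor = 0.5           # factor used in the example above
adjusted_model_memory_gb = base_model_memory_gb * quant_factor
print(adjusted_model_memory_gb)  # 7.0
```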
3. KV Cache per Request
KV Cache = (Context Length × Adjusted Model Memory × KV Overhead) / 1000
Example: (8192 × 7GB × 0.2) / 1000 = 11.47GB per request
This memory is needed for each active request's attention cache.
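The same heuristic as code (this mirrors the formula above; it is a rough estimate, not an architecture-exact figure):

```python
# Step 3: per-request KV-cache heuristic from the formula above
context_length = 8192
adjusted_model_memory_gb = 7.0
kv_overhead = 0.2
kv_cache_per_request_gb = context_length * adjusted_model_memory_gb * kv_overhead / 1000
print(round(kv_cache_per_request_gb, 2))  # 11.47
```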
4. Available Memory for Inference
Available = Total VRAM - System Overhead - Model Memory
Example: 48GB - 2GB - 7GB = 39GB
This is what's left for KV caches after loading the model.
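In code (names are illustrative, values from the example):

```python
# Step 4: subtract system overhead and resident weights from total VRAM
total_vram_gb = 48
system_overhead_gb = 2
adjusted_model_memory_gb = 7.0
available_gb = total_vram_gb - system_overhead_gb - adjusted_model_memory_gb
print(available_gb)  # 39.0
```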
5. Maximum Concurrent Requests
Max Requests = Available Memory / KV Cache per Request
Example: 39GB / 11.47GB ≈ 3.4, i.e. about 3 full-context requests at once
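Putting the five steps together, a minimal self-contained sketch (function and parameter names are illustrative; the defaults mirror the example above):

```python
def max_concurrent_requests(num_gpus, vram_per_gpu_gb, base_model_memory_gb,
                            quant_factor, context_length,
                            kv_overhead=0.2, system_overhead_gb=2.0):
    """Steps 1-5 of the breakdown above; returns a fractional request count."""
    total_vram_gb = num_gpus * vram_per_gpu_gb                           # step 1
    model_gb = base_model_memory_gb * quant_factor                       # step 2
    kv_per_request_gb = context_length * model_gb * kv_overhead / 1000   # step 3
    available_gb = total_vram_gb - system_overhead_gb - model_gb         # step 4
    return available_gb / kv_per_request_gb                              # step 5

print(round(max_concurrent_requests(2, 24, 14, 0.5, 8192), 1))  # 3.4
```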
What the Results Mean:
- < 1 request: Can't handle even one full-context request; use a smaller context or a GPU with more memory
- 1-2 requests: Basic serving capability, suitable for personal use
- 3-5 requests: Good for small-scale deployment
- 10+ requests: Production-ready for moderate traffic
Special Handling for Mixture-of-Experts (MoE) Models:
MoE models (like Mixtral, DeepSeek V3/R1, Qwen3 MoE, Kimi K2, GLM-4.5) work differently:
- Total Parameters: The full model size (e.g., Mixtral 8x7B has ~47B total parameters, not a literal 8 × 7B = 56B, because the attention layers are shared)
- Active Parameters: Only a subset of experts runs per token (e.g., ~13B active for Mixtral)
- Memory Calculation: We automatically use active memory for these models
- Why this matters: Only the active experts need to be resident in memory at any one time, so sizing uses the active footprint rather than the full model
Example: Mixtral 8x7B shows "~94GB total, ~16GB active" - we calculate using 16GB
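A sketch of the selection logic (field names and figures are illustrative, mirroring the Mixtral example above):

```python
# For MoE models, size with the active-expert footprint rather than the total
model = {"name": "Mixtral 8x7B", "is_moe": True,
         "total_memory_gb": 94, "active_memory_gb": 16}

memory_for_sizing_gb = (model["active_memory_gb"] if model["is_moe"]
                        else model["total_memory_gb"])
print(f'{model["name"]}: size the deployment with {memory_for_sizing_gb} GB')  # 16, not 94
```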
Important Notes:
- This is a rough estimate; actual usage varies by model architecture
- Assumes the worst case: every request uses the full context window. In practice most requests use far less, so you can usually handle more concurrent requests
- KV cache grows linearly with the tokens actually in use, not the maximum context
- Different attention mechanisms (MHA, MQA, GQA) change KV-cache size significantly; see the sketch after this list
- Framework overhead and memory fragmentation can impact real-world performance
- Dynamic batching and memory management can improve real-world throughput
- MoE models: Memory requirements can vary based on routing algorithms and expert utilization patterns
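For readers who want a tighter bound than the step-3 heuristic, the standard per-token KV-cache size depends directly on the attention layout: 2 (K and V) × layers × KV heads × head dimension × bytes per element. A minimal sketch, using Llama-2-7B's configuration as the illustrative default (it uses MHA, so KV heads equal attention heads; GQA models simply have fewer KV heads):

```python
# Architecture-aware KV-cache estimate (contrast with the step-3 heuristic above)
def kv_cache_gb(num_tokens, num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
    return num_tokens * bytes_per_token / 1024**3

print(kv_cache_gb(8192))                  # MHA, FP16: ~4.0 GB for a full 8K context
print(kv_cache_gb(8192, num_kv_heads=8))  # GQA with 8 KV heads: ~1.0 GB
```

Note how the estimate scales linearly with the token count and drops sharply under GQA or MQA, which is why the flat heuristic above can differ noticeably from real-world usage.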