SelfHostLLM - GPU Memory Calculator for LLM Inference


The Formula:

Max Concurrent Requests = Available Memory / KV Cache per Request

Step-by-Step Breakdown:

1. Total VRAM Available

Total VRAM = Number of GPUs × VRAM per GPU

Example: 2 × 24GB = 48GB total

2. Model Memory (Adjusted for Quantization)

Adjusted Model Memory = Base Model Memory × Quantization Factor

Example: 14GB (7B model) × 0.5 (INT4) = 7GB

The model weights are loaded once and stay in memory.
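As a minimal sketch of this step (the function name and signature are mine, not the calculator's), with the quantization factor passed in directly:

```python
def adjusted_model_memory(base_memory_gb: float, quant_factor: float) -> float:
    """Step 2: memory held by the model weights after quantization."""
    return base_memory_gb * quant_factor

print(adjusted_model_memory(14.0, 0.5))  # 7.0 GB, matching the INT4 example above
```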

3. KV Cache per Request

KV Cache = (Context Length × Adjusted Model Memory × KV Overhead) / 1000

Example: (8192 × 7GB × 0.2) / 1000 = 11.47GB per request

This memory is needed for each active request's attention cache.
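The same heuristic expressed in Python; the 0.2 KV-overhead factor and the /1000 scaling are taken straight from the example above, and the function name is illustrative:

```python
def kv_cache_per_request(context_length: int, adjusted_model_memory_gb: float,
                         kv_overhead: float = 0.2) -> float:
    """Step 3: rough KV-cache footprint for one request at full context."""
    return (context_length * adjusted_model_memory_gb * kv_overhead) / 1000

print(round(kv_cache_per_request(8192, 7.0), 2))  # 11.47 GB per request
```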

4. Available Memory for Inference

Available = Total VRAM - System Overhead - Model Memory

Example: 48GB - 2GB - 7GB = 39GB

This is what's left for KV caches after loading the model.

5. Maximum Concurrent Requests

Max Requests = Available Memory / KV Cache per Request

Example: 39GB / 11.47GB = 3.4 requests
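Putting the five steps together, here is a rough end-to-end sketch that reproduces the worked example; all names and defaults are illustrative, not the calculator's actual code:

```python
def max_concurrent_requests(num_gpus: int, vram_per_gpu_gb: float,
                            base_model_memory_gb: float, quant_factor: float,
                            context_length: int, kv_overhead: float = 0.2,
                            system_overhead_gb: float = 2.0) -> float:
    total_vram = num_gpus * vram_per_gpu_gb                                # step 1
    model_memory = base_model_memory_gb * quant_factor                     # step 2
    kv_per_request = (context_length * model_memory * kv_overhead) / 1000  # step 3
    available = total_vram - system_overhead_gb - model_memory             # step 4
    return available / kv_per_request                                      # step 5

# 2 × 24GB GPUs, 7B model at INT4 (factor 0.5), 8192-token context
print(round(max_concurrent_requests(2, 24, 14, 0.5, 8192), 1))  # ≈ 3.4
```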

What the Results Mean:

  • < 1 request: Can't handle full context length, need smaller context or better GPU
  • 1-2 requests: Basic serving capability, suitable for personal use
  • 3-5 requests: Good for small-scale deployment
  • 10+ requests: Production-ready for moderate traffic

Mixture-of-Experts (MoE) Models:

MoE models (like Mixtral, DeepSeek V3/R1, Qwen3 MoE, Kimi K2, GLM-4.5) work differently:

  • Total Parameters: The full model size (e.g., Mixtral 8x7B has ~47B total parameters; the "8x7B" name overstates it because attention layers are shared across experts)
  • Active Parameters: Only a subset of experts is used per token (e.g., ~13B active for Mixtral)
  • Memory Calculation: We automatically use active memory for these models
  • Why this matters: The memory estimate scales with the active experts, not the full parameter count

Example: Mixtral 8x7B shows "~94GB total, ~16GB active" - we calculate using 16GB
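A sketch of how that active-vs-total choice could be expressed in code, using the memory figures quoted above (the ModelSpec type is hypothetical, not part of the calculator):

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    total_memory_gb: float                  # all expert weights
    active_memory_gb: float | None = None   # set only for MoE models

def memory_for_calculation(spec: ModelSpec) -> float:
    """Use active-expert memory for MoE models, full model memory otherwise."""
    if spec.active_memory_gb is not None:
        return spec.active_memory_gb
    return spec.total_memory_gb

mixtral = ModelSpec("Mixtral 8x7B", total_memory_gb=94.0, active_memory_gb=16.0)
print(memory_for_calculation(mixtral))  # 16.0 GB is what enters the formula
```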

Important Notes:

  • This is a rough estimate - actual usage varies by model architecture
  • Assumes the worst case: all requests use the full context window. In reality, most requests use much less, so you may be able to handle more concurrent requests
  • KV cache grows linearly with actual tokens used, not maximum context
  • Different attention mechanisms (MHA, MQA, GQA) affect memory usage; see the sketch after this list
  • Framework overhead and memory fragmentation can impact real-world performance
  • Dynamic batching and memory management can improve real-world throughput
  • MoE models: Memory requirements can vary based on routing algorithms and expert utilization patterns
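For readers who want an architecture-aware estimate instead of the heuristic above, the exact per-request KV-cache size follows from the model config: 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per value. A minimal sketch, assuming a Llama-2-7B-style configuration:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Exact KV-cache size for one sequence: keys + values in every layer.
    num_kv_heads equals the full head count for MHA, 1 for MQA, and a smaller
    group count for GQA, which is why GQA/MQA models need far less cache."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Llama-2-7B-style config: 32 layers, 32 KV heads, head_dim 128, FP16
print(kv_cache_bytes(32, 32, 128, 8192) / 1e9)  # ≈ 4.3 GB at 8192 tokens
```

Because it depends on layer count, KV-head count, and precision rather than total model memory, this figure can differ noticeably from the calculator's rough per-request estimate.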