GitHub - neul-labs/fast-litellm: High-performance Rust acceleration for LiteLLM - providing significant performance improvements for connection pooling, rate limiting, and memory-intensive workloads.

5 min read Original article ↗

CI PyPI License: MIT Python Versions

High-performance Rust acceleration for LiteLLM - providing significant performance improvements for connection pooling, rate limiting, and memory-intensive workloads.

Why Fast LiteLLM?

Fast LiteLLM is a drop-in Rust acceleration layer for LiteLLM that provides targeted performance improvements where it matters most:

  • 3.2x faster connection pooling with DashMap lock-free data structures
  • 1.6x faster rate limiting with atomic operations
  • 1.5-1.7x faster token counting for large texts
  • 42x more memory efficient for high-cardinality rate limiting (1000+ unique keys)
  • Lock-free concurrent access using DashMap for thread-safe operations

Built with PyO3 and Rust, it seamlessly integrates with existing LiteLLM code with zero configuration required. Performance gains are most significant in connection pooling, rate limiting, and memory-intensive workloads.

Installation

# Using uv (recommended)
uv add fast-litellm

# Or using pip
pip install fast-litellm

Quick Start

import fast_litellm  # Automatically accelerates LiteLLM
import litellm

# All LiteLLM operations now use Rust acceleration where available
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

That's it! Just import fast_litellm before litellm and acceleration is automatically applied.

Architecture

The acceleration uses PyO3 to create Python extensions from Rust code:

┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Python Package                                      │
├─────────────────────────────────────────────────────────────┤
│ fast_litellm (Python Integration Layer)                    │
│ ├── Enhanced Monkeypatching                                │
│ ├── Feature Flags & Gradual Rollout                        │
│ ├── Performance Monitoring                                 │
│ └── Automatic Fallback                                     │
├─────────────────────────────────────────────────────────────┤
│ Rust Acceleration Components (PyO3)                        │
│ ├── core               (Advanced Routing)                   │
│ ├── tokens             (Token Counting)                    │
│ ├── connection_pool    (Connection Management)             │
│ └── rate_limiter       (Rate Limiting)                     │
└─────────────────────────────────────────────────────────────┘

Features

  • Zero Configuration: Works automatically on import
  • Production Safe: Built-in feature flags, monitoring, and automatic fallback to Python
  • Performance Monitoring: Real-time metrics and optimization recommendations
  • Gradual Rollout: Support for canary deployments and percentage-based feature rollout
  • Thread Safe: Lock-free data structures using DashMap for concurrent operations
  • Type Safe: Full Python type hints and type stubs included

Performance Benchmarks

Benchmarks comparing production-grade Python implementations (with thread-safety) vs Rust:

Component Speedup Memory Best For
Connection Pool 3.2x faster Same HTTP connection management
Rate Limiting 1.6x faster Same Request throttling, quota management
Large Text Tokenization 1.5-1.7x faster Same Processing long documents
High-Cardinality Rate Limits 1.2x faster 42x less memory Many unique API keys/users
Concurrent Connection Pool 1.2x faster Same Multi-threaded workloads
Small Text Tokenization 0.5x (Python faster) Same Short messages (FFI overhead)
Routing 0.4x (Python faster) Same Model selection (FFI overhead)

Key Insights

Use Rust acceleration for:

  • Connection pooling (3x+ speedup)
  • Rate limiting (1.5x+ speedup)
  • Large text token counting (1.5x+ speedup)
  • High-cardinality workloads (40x+ memory savings)

⚠️ Python may be faster for:

  • Small text token counting (FFI overhead dominates)
  • Routing with complex Python objects

Run benchmarks yourself:

python scripts/run_benchmarks.py --iterations 200

See BENCHMARK.md for detailed results.

Proxy Mode (Gunicorn)

Fast LiteLLM works with LiteLLM's proxy server. When using gunicorn, the easiest approach is to create a small wrapper module and use the --preload flag.

Create app.py:

import fast_litellm  # Apply acceleration before litellm loads
from litellm.proxy.proxy_server import app

Run with gunicorn:

gunicorn app:app --preload -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:4000

The --preload flag ensures fast_litellm patches litellm in the master process before workers fork, so all workers inherit the accelerated components.

For more deployment options (Docker, systemd, config files), see the Proxy Integration Guide.

Configuration

Fast LiteLLM works out of the box with zero configuration. For advanced use cases, you can configure behavior via environment variables:

# Disable specific features
export FAST_LITELLM_RUST_ROUTING=false

# Gradual rollout (10% of traffic)
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10

# Custom configuration file
export FAST_LITELLM_FEATURE_CONFIG=/path/to/config.json

See the configuration section in CLAUDE.md for more options.

Compatibility

Compatibility Tests

Component Supported Versions
Python 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
Platforms Linux (x86_64, aarch64), macOS (x86_64, ARM64), Windows (x86_64)
LiteLLM Latest stable release
PyO3 0.24+

Rust is not required for installation - prebuilt wheels are available for all major platforms.

For detailed compatibility information, see COMPATIBILITY.md.

Development

To contribute or build from source:

Prerequisites:

  • Python 3.8+ (3.12 recommended)
  • Rust toolchain (1.70+)
  • uv for package management (recommended)
  • maturin for building Python extensions

Setup:

git clone https://github.com/neul-labs/fast-litellm.git
cd fast-litellm

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install maturin
uv add --dev maturin

# Build and install in development mode
uv run maturin develop

# Run unit tests
uv add --dev pytest pytest-asyncio
uv run pytest tests/

Integration Testing

Fast LiteLLM includes comprehensive integration tests that run LiteLLM's test suite with acceleration enabled:

# Setup LiteLLM for testing
./scripts/setup_litellm.sh

# Run LiteLLM tests with acceleration
./scripts/run_litellm_tests.sh

# Compare performance (with vs without acceleration)
./scripts/compare_performance.py

This ensures Fast LiteLLM doesn't break any LiteLLM functionality.

Documentation

Troubleshooting

Having issues? See our Troubleshooting Guide for common problems and solutions.

Contributing

We welcome contributions! Please see our Contributing Guide.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Links