Fast LiteLLM
Rust acceleration for LiteLLM that speeds up connection pooling, rate limiting, and memory-intensive workloads.
Why Fast LiteLLM?
Fast LiteLLM is a drop-in Rust acceleration layer for LiteLLM that provides targeted performance improvements where it matters most:
- 3.2x faster connection pooling with DashMap concurrent hash maps
- 1.6x faster rate limiting with atomic operations
- 1.5-1.7x faster token counting for large texts
- 42x more memory efficient for high-cardinality rate limiting (1000+ unique keys)
- Thread-safe concurrent access through DashMap's sharded maps, which keeps contention between threads low
Built with PyO3 and Rust, it integrates with existing LiteLLM code and requires no configuration. Gains are concentrated in the areas listed above; small-text token counting and routing are currently slower than pure Python because of FFI overhead (see Performance Benchmarks).
Installation
    # Using uv (recommended)
    uv add fast-litellm

    # Or using pip
    pip install fast-litellm
Quick Start
    import fast_litellm  # Automatically accelerates LiteLLM
    import litellm

    # All LiteLLM operations now use Rust acceleration where available
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}],
    )
That's it! Just import fast_litellm before litellm and acceleration is automatically applied.
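If the package may be missing in some environments (an assumption about your deployment, not something Fast LiteLLM requires), the import can be guarded so the application still runs on plain LiteLLM. The following is a minimal sketch using only the standard library; `enable_acceleration` is a hypothetical helper name, not part of the fast_litellm API.

```python
import importlib


def enable_acceleration(module_name: str = "fast_litellm") -> bool:
    """Try to import the acceleration module and report whether it loaded.

    Hypothetical helper: call it before importing litellm so that any
    import-time patching has a chance to run first.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Package not installed: continue with pure-Python LiteLLM.
        return False
    return True
```

Calling `enable_acceleration()` once at startup, before `import litellm`, preserves the import order described above while tolerating a missing wheel.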
Architecture
The Rust components are compiled into Python extension modules with PyO3, and a Python integration layer patches them into LiteLLM:
┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Python Package                                      │
├─────────────────────────────────────────────────────────────┤
│ fast_litellm (Python Integration Layer)                     │
│ ├── Enhanced Monkeypatching                                 │
│ ├── Feature Flags & Gradual Rollout                         │
│ ├── Performance Monitoring                                  │
│ └── Automatic Fallback                                      │
├─────────────────────────────────────────────────────────────┤
│ Rust Acceleration Components (PyO3)                         │
│ ├── core (Advanced Routing)                                 │
│ ├── tokens (Token Counting)                                 │
│ ├── connection_pool (Connection Management)                 │
│ └── rate_limiter (Rate Limiting)                            │
└─────────────────────────────────────────────────────────────┘
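The "Enhanced Monkeypatching" and "Automatic Fallback" boxes in the diagram can be illustrated with a small pure-Python sketch. This is not Fast LiteLLM's implementation: `patch_with_fallback`, `demo`, and `flaky_fast_count` are hypothetical names, and a plain Python function stands in for the Rust-backed one.

```python
import functools
import types


def patch_with_fallback(target, attr_name, fast_impl):
    """Replace target.attr_name with fast_impl, falling back on any error."""
    original = getattr(target, attr_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        try:
            return fast_impl(*args, **kwargs)
        except Exception:
            # Accelerated path failed: use the original Python implementation.
            return original(*args, **kwargs)

    setattr(target, attr_name, wrapper)
    return original  # returned so the caller can restore it later


# Stand-in for a Python module exposing a pure-Python function.
demo = types.SimpleNamespace(count_tokens=lambda text: len(text.split()))


def flaky_fast_count(text):
    # Stand-in for a Rust-backed function that rejects some inputs.
    if not text:
        raise ValueError("empty input not supported")
    return len(text.split())


patch_with_fallback(demo, "count_tokens", flaky_fast_count)
```

After patching, `demo.count_tokens("a b c")` goes through the fast path, while `demo.count_tokens("")` raises inside the fast path and is answered by the original function instead.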
Features
- Zero Configuration: Works automatically on import
- Production Safe: Built-in feature flags, monitoring, and automatic fallback to Python
- Performance Monitoring: Real-time metrics and optimization recommendations
- Gradual Rollout: Support for canary deployments and percentage-based feature rollout
- Thread Safe: Concurrent data structures built on DashMap keep shared state safe across threads
- Type Safe: Full Python type hints and type stubs included
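DashMap is a concurrent hash map for Rust that partitions its entries into shards, so threads working on different keys rarely contend with each other. The sketch below illustrates that sharding idea in Python with one lock per shard. It is a teaching aid under that assumption, not the Rust code used by Fast LiteLLM, and `ShardedCounter` is a hypothetical name.

```python
import threading


class ShardedCounter:
    """Counter map split into shards, each guarded by its own lock."""

    def __init__(self, shard_count: int = 16) -> None:
        self._shards = [({}, threading.Lock()) for _ in range(shard_count)]

    def _shard_for(self, key: str):
        return self._shards[hash(key) % len(self._shards)]

    def increment(self, key: str, amount: int = 1) -> int:
        counts, lock = self._shard_for(key)
        with lock:  # only threads hitting the same shard contend here
            counts[key] = counts.get(key, 0) + amount
            return counts[key]

    def get(self, key: str) -> int:
        counts, lock = self._shard_for(key)
        with lock:
            return counts.get(key, 0)


def hammer(counter, keys, rounds):
    # Each worker increments every key `rounds` times.
    for _ in range(rounds):
        for key in keys:
            counter.increment(key)


counter = ShardedCounter()
keys = [f"api-key-{i}" for i in range(8)]
threads = [
    threading.Thread(target=hammer, args=(counter, keys, 1000)) for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With four workers doing 1000 rounds each, every key ends at a count of 4000, which shows that no increments are lost under concurrent access.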
Performance Benchmarks
The table compares thread-safe, production-grade Python implementations against their Rust counterparts. A speedup below 1x means the Python implementation is faster:
| Component | Speedup | Memory | Best For |
|---|---|---|---|
| Connection Pool | 3.2x faster | Same | HTTP connection management |
| Rate Limiting | 1.6x faster | Same | Request throttling, quota management |
| Large Text Tokenization | 1.5-1.7x faster | Same | Processing long documents |
| High-Cardinality Rate Limits | 1.2x faster | 42x less memory | Many unique API keys/users |
| Concurrent Connection Pool | 1.2x faster | Same | Multi-threaded workloads |
| Small Text Tokenization | 0.5x (Python faster) | Same | Short messages (FFI overhead) |
| Routing | 0.4x (Python faster) | Same | Model selection (FFI overhead) |
Key Insights
✅ Use Rust acceleration for:
- Connection pooling (3x+ speedup)
- Rate limiting (1.5x+ speedup)
- Large text token counting (1.5x+ speedup)
- High-cardinality workloads (40x+ memory savings)
❌ Keep the Python implementation for:
- Small text token counting (FFI overhead dominates)
- Routing with complex Python objects
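The split between small and large inputs is what a simple linear cost model predicts: crossing the FFI boundary adds a fixed cost per call, which only pays off once the per-token savings outweigh it. The sketch below works through that arithmetic. All numbers are hypothetical and chosen for illustration; they are not measurements from this project.

```python
def call_time(fixed_overhead_us: float, per_token_us: float, tokens: int) -> float:
    """Linear cost model: a fixed per-call cost plus a per-token cost."""
    return fixed_overhead_us + per_token_us * tokens


def break_even_tokens(py_fixed, py_per_token, rust_fixed, rust_per_token) -> float:
    """Input size above which the Rust path is faster under the linear model."""
    if rust_per_token >= py_per_token:
        return float("inf")  # Rust never catches up in this model
    return (rust_fixed - py_fixed) / (py_per_token - rust_per_token)


# Hypothetical costs in microseconds: (fixed per call, per token).
PY = (0.5, 0.10)
RUST = (5.0, 0.05)  # higher fixed cost to cross the FFI boundary, cheaper per token

threshold = break_even_tokens(PY[0], PY[1], RUST[0], RUST[1])

# Ratio > 1 means Rust is slower than Python for a 10-token input.
rust_over_py_small = call_time(*RUST, 10) / call_time(*PY, 10)

# Ratio > 1 means Rust is faster than Python for a 10,000-token input.
py_over_rust_large = call_time(*PY, 10_000) / call_time(*RUST, 10_000)
```

With these made-up costs the break-even point is about 90 tokens: below it the fixed overhead dominates and Python wins, above it the cheaper per-token work dominates and Rust wins.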
Run benchmarks yourself:
    python scripts/run_benchmarks.py --iterations 200
See BENCHMARK.md for detailed results.
Configuration
Fast LiteLLM works out of the box with zero configuration. For advanced use cases, you can configure behavior via environment variables:
    # Disable specific features
    export FAST_LITELLM_RUST_ROUTING=false

    # Gradual rollout (10% of traffic)
    export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10

    # Custom configuration file
    export FAST_LITELLM_FEATURE_CONFIG=/path/to/config.json
See the configuration section in CLAUDE.md for more options.
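A value such as `canary:10` implies a percentage-based rollout. How Fast LiteLLM parses and applies the value is not documented here, so the sketch below shows one common approach: hash a stable identifier into a bucket from 0 to 99 and enable the feature when the bucket falls below the percentage. The accepted grammar and the names `parse_flag` and `is_enabled` are assumptions made for this example.

```python
import hashlib


def parse_flag(value: str) -> int:
    """Turn a flag string into a rollout percentage (0-100).

    Assumed grammar for this sketch: "true", "false", or "canary:<percent>".
    """
    value = value.strip().lower()
    if value == "true":
        return 100
    if value == "false":
        return 0
    if value.startswith("canary:"):
        percent = int(value.split(":", 1)[1])
        if not 0 <= percent <= 100:
            raise ValueError(f"percentage out of range: {percent}")
        return percent
    raise ValueError(f"unrecognised flag value: {value!r}")


def is_enabled(flag_value: str, request_id: str) -> bool:
    """Deterministically place a request in or out of the rollout."""
    percent = parse_flag(flag_value)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0-99
    return bucket < percent
```

Hashing the identifier rather than drawing a random number keeps the decision stable: the same request or user ID always lands on the same side of the rollout.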
Compatibility
| Component | Supported Versions |
|---|---|
| Python | 3.9, 3.10, 3.11, 3.12, 3.13 |
| Platforms | Linux (x86_64, aarch64), macOS (x86_64, ARM64), Windows (x86_64) |
| LiteLLM | Latest stable release |
| PyO3 | 0.24+ |
Installation does not require a Rust toolchain: prebuilt wheels are available for all major platforms.
For detailed compatibility information, see COMPATIBILITY.md.
Development
To contribute or build from source:
Prerequisites:
- Python 3.9+
- Rust toolchain (1.70+)
- uv for package management (recommended)
- maturin for building Python extensions
Setup:
    git clone https://github.com/neul-labs/fast-litellm.git
    cd fast-litellm

    # Install uv (if not already installed)
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Create virtual environment
    uv venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate

    # Install maturin
    uv add --dev maturin

    # Build and install in development mode
    uv run maturin develop

    # Run unit tests
    uv add --dev pytest pytest-asyncio
    uv run pytest tests/
Integration Testing
Fast LiteLLM includes integration tests that run LiteLLM's own test suite with acceleration enabled:
    # Set up LiteLLM for testing
    ./scripts/setup_litellm.sh

    # Run LiteLLM tests with acceleration
    ./scripts/run_litellm_tests.sh

    # Compare performance (with vs without acceleration)
    ./scripts/compare_performance.py
This checks that enabling acceleration does not change LiteLLM's behavior for the cases its test suite covers.
Documentation
- API Reference - Complete API documentation
- Contributing Guide - Development setup and guidelines
Contributing
We welcome contributions! Please see our Contributing Guide.
License
This project is licensed under the MIT License - see the LICENSE file for details.