A simple L7 proxy for vLLM that manages LoRA adapter storage via NVMe, routes requests, and pins workloads to nodes.
loraplex sits between your clients and vLLM. It routes requests across a cluster using consistent hashing, manages LoRA adapter files on disk (fetching on demand from HuggingFace, S3, or HTTP, with LRU eviction), and provides node affinity through configurable hash keys. By default it hashes on the adapter name, but it can hash on any request header, enabling session pinning for prefix cache reuse, document-based routing for RAG workloads, or tenant isolation. vLLM's lora_filesystem_resolver reads adapter files from the same directory loraplex writes to.
- Quick Start
- How It Works
- API
- Examples and Deployment Modes
- Architecture Details
- Reference
- Development
- License
Quick Start
Install
```sh
go install github.com/shayonj/loraplex/cmd/loraplex@latest
```
Or build from source:
```sh
git clone https://github.com/shayonj/loraplex.git
cd loraplex
make build
```

Run (single node)
```sh
# Start vLLM with filesystem resolver pointed at loraplex's storage directory
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
export VLLM_PLUGINS=lora_filesystem_resolver
export VLLM_LORA_RESOLVER_CACHE_DIR=/mnt/nvme/loraplex
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora --max-loras 200 --max-lora-rank 64

# Start loraplex (writes adapter files to the same directory vLLM reads from)
./bin/loraplex \
  --listen :9090 \
  --vllm-url http://localhost:8000 \
  --dir /mnt/nvme/loraplex
```
Point clients at localhost:9090 instead of localhost:8000. Base model requests pass through unchanged. LoRA adapter requests trigger loraplex to fetch and store the adapter files, then proxy to vLLM which loads them from the shared directory.
```sh
curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-org/your-lora-adapter",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```
Run (multi-node)
On each node, pass --self and --peers:
```sh
# Node 1
./bin/loraplex --listen :9090 --self 10.0.0.1:9090 \
  --vllm-url http://localhost:8000 \
  --peers 10.0.0.1:9090,10.0.0.2:9090,10.0.0.3:9090

# Node 2
./bin/loraplex --listen :9090 --self 10.0.0.2:9090 \
  --vllm-url http://localhost:8000 \
  --peers 10.0.0.1:9090,10.0.0.2:9090,10.0.0.3:9090
```
Run (Kubernetes)
In K8s, loraplex discovers peers automatically via the Endpoints API so you don't need to list peers manually. See examples/k8s/ for full manifests including RBAC, headless Service, and Deployment.
```sh
kubectl apply -f examples/k8s/manifests.yaml
```
Key requirements:
- A headless Service (`clusterIP: None`) selecting loraplex pods so the Endpoints API lists their IPs.
- A ServiceAccount with RBAC permission to `get` Endpoints in the namespace.
- Both the vLLM and loraplex containers must share the storage directory via an `emptyDir` volume.
- vLLM environment variables (`VLLM_ALLOW_RUNTIME_LORA_UPDATING`, `VLLM_PLUGINS`, `VLLM_LORA_RESOLVER_CACHE_DIR`) must be set on the vLLM container.
Use --config to load a YAML file instead of CLI flags. See config.example.yaml for all options. CLI flags override config file values.
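A minimal config sketch assembled from the options described later in this README; `config.example.yaml` remains the authoritative, fully annotated reference, so check key names and defaults against it:

```yaml
# Sketch only: see config.example.yaml for the complete option list.
storage:
  dir: /mnt/nvme/loraplex     # must match VLLM_LORA_RESOLVER_CACHE_DIR
  max_size: "100GB"           # LRU eviction kicks in past this

origins:
  - type: huggingface
    token: ${HF_TOKEN}        # env vars are expanded

invalidation:
  ttl: "1h"                   # "0" disables background freshness checks

routing:
  hash_on: model              # or tenant, tenant/model, header:<name>, ...
  overflow_threshold: 0.8

discovery:
  mode: static                # or k8s, file
```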
How It Works
loraplex and vLLM share a directory on disk. loraplex writes adapter files there, and vLLM's filesystem resolver reads from there. loraplex does not modify requests to vLLM or inject paths. It ensures the files exist before proxying, and vLLM's resolver independently discovers them.
```
loraplex writes ──▶ /mnt/nvme/loraplex/{adapter}/adapter_config.json
                    /mnt/nvme/loraplex/{adapter}/adapter_model.safetensors
vLLM reads     ◀── VLLM_LORA_RESOLVER_CACHE_DIR=/mnt/nvme/loraplex
```
vLLM manages adapters in GPU slots and CPU memory with its own LRU cache (--max-loras, --max-cpu-loras). loraplex manages the layer below that: files on disk. It ensures adapter files are present in the shared directory before proxying requests to vLLM, handles on-demand fetching from remote origins, bounds disk usage with LRU eviction, and routes requests to the node that already has the adapter stored.
```
Client request (model: "acme/summarizer-lora")
  │
  ▼
loraplex (:9090)
  │
  ├─ base model? ──▶ passthrough to vLLM (no storage needed)
  │
  ├─ consistent hash ──▶ this node owns it?
  │          │                      │
  │       no │                  yes │
  │          ▼                      ▼
  │    forward to      ensure adapter files exist in shared dir
  │    owner node                   │
  │          │            on disk? ──▶ proxy to vLLM
  │          ▼                      │
  │    (owner runs                  │ not on disk
  │     same flow)                  ▼
  │                     fetch from HuggingFace/S3
  │                     write to shared dir
  │                                 │
  │                                 ▼
  │                     proxy to vLLM
  │
  ▼
vLLM (:8000)
  filesystem resolver finds adapter in shared dir
  loads adapter weights into CPU/GPU memory
  runs inference
```
The diagram above shows the default behavior, where `hash_on` is set to `model`. When `hash_on` uses a request header (e.g. `header:X-Session-ID`), the same consistent hashing applies to all requests, including base model requests. No adapter files are fetched for base model requests, but routing still pins the request to a deterministic node. This is useful for workloads where landing on the same vLLM instance matters, like multi-turn conversations that benefit from vLLM's prefix caching (`--enable-prefix-caching`) or RAG queries where the document context is already in a node's KV cache.
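The owner lookup itself is plain consistent hashing. A minimal sketch of the idea (illustrative, not loraplex's actual implementation): each node is placed at several virtual positions on a ring, and a key's owner is the first position at or after the key's hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// point is one virtual position on the hash ring.
type point struct {
	hash uint32
	node string
}

// buildRing places each node at `replicas` virtual positions so keys
// spread evenly; more replicas gives a smoother balance.
func buildRing(nodes []string, replicas int) []point {
	ring := make([]point, 0, len(nodes)*replicas)
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%s#%d", n, i)
			ring = append(ring, point{h.Sum32(), n})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })
	return ring
}

// ownerFor maps a hash key (adapter name, session ID, ...) to the first
// ring point at or after the key's hash, wrapping around at the end.
func ownerFor(ring []point, key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	kh := h.Sum32()
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= kh })
	if i == len(ring) {
		i = 0 // wrap around the ring
	}
	return ring[i].node
}

func main() {
	nodes := []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9090"}
	ring := buildRing(nodes, 64)
	fmt.Println(ownerFor(ring, "acme/summarizer-lora"))
}
```

Because the mapping depends only on the key and the node set, every node computes the same owner independently, with no coordination; when a node joins or leaves, only the keys adjacent to its ring positions move.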
API
loraplex is a transparent OpenAI-compatible proxy. All vLLM endpoints pass through.
Proxy endpoints (passthrough to vLLM)
- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /v1/models`
- Any other path is proxied as-is
Admin endpoints
```sh
# Storage stats
curl http://localhost:9090/admin/cache/stats

# Pre-warm adapters
curl -X POST http://localhost:9090/admin/cache/warmup \
  -d '{"adapters": ["acme/summarizer-lora", "acme/translator-lora"]}'

# Evict an adapter (evicts locally and broadcasts to all peers)
curl -X POST http://localhost:9090/admin/cache/evict \
  -d '{"adapter": "acme/summarizer-lora"}'

# List peers
curl http://localhost:9090/admin/peers

# Health check (includes vLLM health, load, pending requests)
curl http://localhost:9090/healthz

# Prometheus metrics
curl http://localhost:9090/metrics
```
Examples and Deployment Modes
loraplex is composable. Pick the mode that fits your stack:
Standalone gateway. loraplex handles storage and routing. Point a load balancer at your loraplex nodes.
Storage-only sidecar. Deploy loraplex 1:1 with each vLLM pod. Use an external router (AIBrix, gateway-api-inference-extension) for request distribution. Set discovery.mode: static with only localhost as a peer.
Kubernetes. Use discovery.mode: k8s with a headless Service. loraplex discovers peers automatically via the Endpoints API. Requires a ServiceAccount with RBAC to read Endpoints. See examples/k8s/.
See the examples/ directory for ready-to-use configs:
| Path | Use case |
|---|---|
| `single-node.yaml` | One loraplex + one vLLM, HuggingFace origin |
| `multi-node.yaml` | 3-node cluster with static peer discovery |
| `cache-only-sidecar.yaml` | Storage sidecar behind an external router (AIBrix, etc.) with S3 origin |
| `session-affinity.yaml` | Pin agent sessions to one node for prefix cache hits across turns |
| `document-routing.yaml` | Route RAG queries by document so the context stays in one node's KV cache |
| `k8s/` | Full K8s deployment: RBAC, headless Service, Deployment with vLLM sidecar |
Architecture Details
Storage
loraplex manages a single directory on disk. This is the directory vLLM's filesystem resolver reads from. Point it at your fastest available local storage: NVMe, tmpfs, or any local mount.
The `EnsureAdapter` call guarantees adapter files exist in this directory:
- Already on disk? Nothing to do.
- Not on disk? Download from origin, write to the directory.
| Config | Default | Description |
|---|---|---|
| `storage.dir` | `/mnt/nvme/loraplex` | Path to the adapter storage directory. Must match `VLLM_LORA_RESOLVER_CACHE_DIR`. |
| `storage.max_size` | `"100GB"` | Maximum total size of stored adapters. When full, the least recently used adapter is evicted. |
vLLM also has its own CPU memory LRU cache (--max-cpu-loras) for loaded adapter weights. If vLLM still has the weights in CPU memory, it won't re-read from disk at all. The disk storage matters most when vLLM's CPU cache is also full and it needs to reload from the filesystem.
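The disk-level LRU described above can be modeled as a size-bounded cache: accesses move an adapter to the front, and inserts evict from the back until the new total fits. This is an illustrative sketch, not loraplex's code:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	id   string
	size int64
}

// diskLRU bounds the total bytes of stored adapters.
type diskLRU struct {
	max   int64
	used  int64
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newDiskLRU(max int64) *diskLRU {
	return &diskLRU{max: max, order: list.New(), items: map[string]*list.Element{}}
}

// touch records an access so the adapter survives eviction longer.
func (c *diskLRU) touch(id string) {
	if el, ok := c.items[id]; ok {
		c.order.MoveToFront(el)
	}
}

// add stores an adapter, evicting least recently used entries until it
// fits, and returns the IDs evicted to make room.
func (c *diskLRU) add(id string, size int64) []string {
	var evicted []string
	for c.used+size > c.max && c.order.Len() > 0 {
		back := c.order.Back()
		e := back.Value.(*entry)
		c.order.Remove(back)
		delete(c.items, e.id)
		c.used -= e.size
		evicted = append(evicted, e.id)
	}
	c.items[id] = c.order.PushFront(&entry{id, size})
	c.used += size
	return evicted
}

func main() {
	c := newDiskLRU(100)
	c.add("acme/a", 40)
	c.add("acme/b", 30)
	c.touch("acme/a") // "a" is now most recently used
	fmt.Println(c.add("acme/c", 50)) // evicts "acme/b", the LRU entry
}
```

In the real system each entry's size is the adapter's on-disk footprint and eviction deletes the files; the `touch` on every `EnsureAdapter` hit is what keeps hot adapters resident.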
Routing
Consistent hashing maps each request's hash key to a primary owner node. By default the key is the adapter name, so a 3-node cluster stores 3x the adapters of a single node instead of redundant copies. When overflow protection kicks in, multiple nodes may store the same hot adapter (see below).
When a request arrives at the wrong node, it forwards to the owner. If the owner is down, the request is retried on the next peer in the ring. If both fail, the receiving node handles it locally (fetches the adapter itself). Forwarding adds a few milliseconds of overhead within a datacenter.
Overload protection. Each node reports its load (pending requests / max concurrent) via /healthz. Peers poll each other's health periodically. When a node's reported load exceeds overflow_threshold (default 0.8), other nodes stop forwarding to it and handle requests locally instead. If a peer is unreachable, its load is treated as 1.0 so no traffic is sent to it. This is bidirectional: if node B is overloaded with a hot adapter, nodes A and C notice via health polling and absorb the overflow by storing and serving that adapter locally. The overflow_local_total metric tracks how often this happens. Actual inference rate limiting is left to vLLM (--max-num-seqs, --max-num-batched-tokens).
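The forwarding decision above reduces to a small preference walk: try the owner, then the next ring peer, and absorb locally when both are shedding. A sketch under that simplification (names and the two-candidate walk are illustrative; the real retry logic may differ):

```go
package main

import "fmt"

// pickNode applies the overflow rule: forward to the key's owner if it
// is healthy, else to the next peer on the ring, else handle locally.
// loads holds each peer's last reported load (pending / max concurrent);
// unreachable peers are recorded as 1.0 so they are never chosen.
func pickNode(owner, next, self string, loads map[string]float64, threshold float64) string {
	for _, n := range []string{owner, next} {
		if loads[n] < threshold {
			return n
		}
	}
	return self
}

func main() {
	loads := map[string]float64{"node-a": 0.95, "node-b": 0.4}
	// node-a is above the 0.8 threshold, so the request goes to node-b.
	fmt.Println(pickNode("node-a", "node-b", "self", loads, 0.8))
}
```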
Hash key. The hash_on config controls what gets hashed to determine the owner node:
| `hash_on` | Hash key | Use case |
|---|---|---|
| `model` (default) | The `model` field from the request body | Route by adapter name. Most common for LoRA workloads. |
| `tenant` | The `X-Tenant-ID` header | Full tenant-node affinity. All of a tenant's adapters land on the same node. |
| `tenant/model` | `{tenant}/{model}` | Tenant A and tenant B using the same adapter name route to different nodes. Falls back to model-only if no tenant header. |
| `header:<name>` | Value of the named header | Route by any request header. Use `header:X-Session-ID` to pin conversations to a node for prefix cache reuse, or `header:X-Document-ID` for RAG context affinity. Works for base model and adapter requests alike. |
| `header:<name>/model` | `{header}/{model}` | Composite: route by header + adapter name. Falls back to model-only if the header is missing. |
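Putting the table together, key derivation might look like the following sketch. The composite fallbacks come from the table above; falling back to the model for the plain `tenant` and `header:<name>` modes when the header is absent is an assumption of this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// hashKey derives the consistent-hash key from the hash_on setting,
// the request's model field, and its headers.
func hashKey(hashOn, model string, headers map[string]string) string {
	switch {
	case hashOn == "tenant":
		if t := headers["X-Tenant-ID"]; t != "" {
			return t
		}
	case hashOn == "tenant/model":
		if t := headers["X-Tenant-ID"]; t != "" {
			return t + "/" + model
		}
	case strings.HasPrefix(hashOn, "header:"):
		name := strings.TrimPrefix(hashOn, "header:")
		composite := strings.HasSuffix(name, "/model")
		name = strings.TrimSuffix(name, "/model")
		if v := headers[name]; v != "" {
			if composite {
				return v + "/" + model
			}
			return v
		}
	}
	// hash_on: model, or fallback when the configured header is absent
	// (the non-composite fallback is assumed, not documented above).
	return model
}

func main() {
	h := map[string]string{"X-Session-ID": "sess-42"}
	fmt.Println(hashKey("header:X-Session-ID/model", "acme/summarizer-lora", h))
}
```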
Peer discovery. Three modes for how nodes find each other:
| Mode | Config | How it works |
|---|---|---|
| `static` | `--peers 10.0.0.1:9090,10.0.0.2:9090` | Fixed peer list. Use for bare metal or fixed-size clusters. |
| `k8s` | `discovery.mode: k8s` | Polls K8s Endpoints API for a headless Service. Auto-discovers pods as they scale. |
| `file` | `discovery.mode: file` | Shared directory with heartbeat files. Use with NFS or EFS without K8s. |
In K8s mode, loraplex polls the Endpoints API every 5 seconds. When pods scale up or down, the consistent hash ring rebuilds automatically. If a request is forwarded to a peer that went down between polls, it retries the next peer on the ring before falling back to local.
Origins
loraplex fetches adapter files from one or more remote origins. If the first origin fails, it tries the next.
| Origin | Content ID for invalidation | Config |
|---|---|---|
| HuggingFace | Git commit SHA of the repo | `type: huggingface`, optional `token` |
| S3 | ETag from `HeadObject` | `type: s3`, `bucket`, `region`, optional `prefix` |
| HTTP | `ETag` header from `HEAD` request | `type: http`, `base_url` |
S3 uses the standard AWS credential chain (environment variables, IAM role, ~/.aws/credentials). Adapter files are expected at s3://{bucket}/{prefix}/{adapter-id}/adapter_config.json and adapter_model.safetensors.
Invalidation
loraplex uses a stale-while-revalidate pattern. Stored adapters are served immediately, and staleness checks happen in the background without blocking requests.
- Each stored adapter has a `.loraplex_meta` file that records the origin, content ID (ETag or commit SHA), fetch timestamp, and TTL.
- On every access, if the TTL has expired, a background goroutine calls `Head` on the origin. This is a lightweight check (S3 `HeadObject`, HuggingFace API call, or HTTP `HEAD`) that returns the current content ID without downloading files.
- If the content ID matches what's stored, the timestamp is refreshed. No re-download.
- If the content ID has changed (adapter was retrained and re-uploaded), the old files are evicted and a fresh copy is fetched from origin.
The current request always gets the stored version. The next request after the background re-fetch completes gets the updated adapter. Set invalidation.ttl to control how often checks happen (default: "1h"). Set to "0" to disable.
Reference
CLI Flags
| Flag | Default | Description |
|---|---|---|
| `--version` | | Print version and exit |
| `--config` | | Path to YAML config file |
| `--listen` | `:8080` | Listen address |
| `--self` | | Self address for ring identity (e.g. `10.0.0.1:9090`) |
| `--vllm-url` | `http://localhost:8000` | vLLM backend URL |
| `--dir` | `/mnt/nvme/loraplex` | Adapter storage directory (should match `VLLM_LORA_RESOLVER_CACHE_DIR`) |
| `--discovery` | `static` | Discovery mode (`static`, `file`, `k8s`) |
| `--peers` | | Comma-separated peer list for static discovery |
Config File
See config.example.yaml for a complete annotated example. CLI flags override config file values. Environment variables in config are expanded (${HF_TOKEN}).
storage configures adapter storage. dir is the directory loraplex writes adapter files to. It should match VLLM_LORA_RESOLVER_CACHE_DIR so vLLM's filesystem resolver can find them. Point it at your fastest local storage (NVMe, tmpfs, or any mount). max_size bounds total disk usage with LRU eviction. Sizes are strings like "100GB", "500MB", "1TB".
origins defines where to fetch adapters. Multiple origins are tried in order until one succeeds. Each entry needs a type (huggingface, s3, or http). HuggingFace accepts an optional token for private repos. S3 requires bucket and region, with an optional prefix. HTTP takes a base_url where adapters are served at {base_url}/{adapter-id}/adapter_config.json.
invalidation controls background freshness checks. ttl sets how long before a stored adapter's content ID (ETag or commit SHA) is re-checked against the origin. The check is lightweight (HEAD request, not a full download) and non-blocking. Set to "0" to disable.
routing controls request distribution. hash_on determines the consistent hash key (default: model). Supports model, tenant, tenant/model, header:<name>, and header:<name>/model for custom or composite routing. overflow_threshold (0.0-1.0) controls when overloaded nodes shed work to other nodes. fallback_to_base_model serves the base model if an adapter can't be loaded.
discovery sets how peers find each other. static uses an explicit list. k8s watches a headless Service (requires RBAC for Endpoints). file uses a shared directory with heartbeat files.
Prometheus Metrics
All metrics are prefixed with loraplex_.
| Metric | Type | Description |
|---|---|---|
| `requests_total` | counter | Total requests by tenant and source |
| `request_duration_seconds` | histogram | Request latency by tenant and source |
| `cache_hits_total` | counter | Adapter lookups by source (local, origin) |
| `cache_evictions_total` | counter | Evictions from storage |
| `cache_size_bytes` | gauge | Current storage size |
| `cache_items` | gauge | Current adapter count in storage |
| `peer_forwards_total` | counter | Requests forwarded to peers |
| `forward_duration_seconds` | histogram | Peer forwarding latency |
| `overflow_local_total` | counter | Requests handled locally due to owner overload |
| `peers_active` | gauge | Number of active peers in the ring |
| `ring_rebuilds_total` | counter | Ring rebuild events |
| `vllm_healthy` | gauge | 1 if vLLM health check passes |
Development
```sh
make test        # unit + e2e tests
make lint        # go vet
make build       # build binary to bin/
make build-linux # cross-compile for linux/amd64
make docker      # build Docker image
```
License
Apache 2.0