Deploy data-heavy Python services with warm memory, sticky routing, and zero infrastructure code.
What memrun is
Memrun is a Python SDK and deployment platform for services that are memory-bound, not CPU-bound. The kind of services where you load a 2GB dataset or a large ML model, then serve many requests against it. Where the expensive part is getting data into memory, not processing it.
You write a handler function with a decorator. Memrun handles everything else: provisioning workers on Hetzner VMs, routing related requests to the same worker via Kafka partition keys, managing a 600GB NVMe-backed cache per worker, bounding concurrency with semaphores, and reporting health via heartbeats.
from memrun import MemoryService

svc = MemoryService(
    name="doc-search",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
    timeout_seconds=300,
)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    embeddings = await ctx.get_or_fetch(req["embeddings_path"])
    results = search(embeddings, req["query"], top_k=req.get("top_k", 10))
    return {"matches": results}
Deploy with one command:
memrun deploy handler.py --name doc-search --memory 32Gi --disk 600Gi
That’s it. No Dockerfiles, no Kubernetes manifests, no Terraform. The platform boots VMs, installs your handler, and starts serving.
The problem it solves
Every time I built a data-intensive service on Lambda or Cloud Run, the same pattern emerged:
1. Request arrives
2. Fetch 1-3GB from S3 (200-800ms)
3. Deserialize into working structures (100-500ms)
4. Do the actual computation (50-200ms)
5. Return result
6. Container gets killed
7. Next request: repeat from step 1
Steps 2-3 dominate runtime. Steps 6-7 make it worse. The actual work is step 4, but you’re paying for steps 2-3 on every single request.
With memrun, steps 2-3 happen once. The LRUCache stores fetched data on NVMe. The SharedWorkerContext keeps decoded structures in memory across requests. Sticky routing via Kafka partition keys ensures the same user/dataset combination always lands on the same worker. After the first request, subsequent requests skip straight to step 4.
How it works, concretely
The SDK
You define a MemoryService with resource declarations:
svc = MemoryService(
    name="analytics",                 # DNS-compatible service name
    memory="32Gi",                    # RAM per worker
    disk="600Gi",                     # NVMe cache per worker
    max_workers=50,                   # Maximum worker count
    min_workers=2,                    # Minimum (always-on) workers
    concurrency=16,                   # Max concurrent requests per worker
    timeout_seconds=300,              # Per-request timeout
    env={"MODEL_VERSION": "v3"},      # Environment variables
)
Init handlers
Load expensive resources once per worker lifetime:
@svc.init_handler()
async def setup(ctx):
    """Runs once before the worker accepts requests."""
    # Download and load a large model
    model_bytes = await ctx.cache.get_or_fetch_from_s3(
        "s3://models/transformer-large.bin"
    )
    model = deserialize_model(model_bytes)
    ctx.set_object("model", model)

    # Pre-build an index
    index = build_faiss_index(model.embeddings)
    ctx.set_object("index", index)
The init_handler runs after WorkerRuntime loads the handler module and before the Kafka consumer starts. If it fails, the worker reports failure and the deployment is marked as FAILED. Objects stored via ctx.set_object() persist in SharedWorkerContext._objects for the worker’s entire lifetime.
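In rough terms, the startup order is: load the handler module, run init handlers, then start consuming. A simplified sketch of that sequence (the names below other than those mentioned above are illustrative; the real WorkerRuntime internals may differ):

# Simplified sketch of the startup order described above; not memrun source.
async def start_worker(runtime):
    module = runtime.load_handler_module()   # imports your handler.py
    ctx = runtime.shared_context             # SharedWorkerContext

    # Init handlers run before any request is consumed
    for init_fn in module.init_handlers:
        try:
            await init_fn(ctx)
        except Exception:
            runtime.report_failure()         # deployment marked FAILED
            raise

    await runtime.start_kafka_consumer()     # only now do requests flow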
Request handlers
The handler receives a WorkerRequestContext and request payload:
@svc.handler(sticky_key="user_id:dataset_id")
async def handle(ctx, req):
    # Objects set in init_handler are available here
    model = ctx.get_object("model")
    index = ctx.get_object("index")

    # Cache-through read: NVMe first, then S3
    dataset = await ctx.get_or_fetch(req["dataset_path"])

    # Your actual logic
    results = model.query(dataset, req["question"])
    ranked = index.search(results, top_k=10)

    # Optionally persist results to S3
    result_url = await ctx.upload_result(
        json.dumps(ranked).encode(),
        key=f"results/{ctx.request_id}.json",
    )
    return {"ranked": ranked, "result_url": result_url}
The sticky_key="user_id:dataset_id" tells memrun to extract those fields from the request payload, hash them, and use the hash as the Kafka partition key. Same values, same partition, same worker. The cache stays warm.
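You don't implement any of this yourself, but as a rough sketch of the idea (the field parsing and hash choice here are illustrative, not necessarily memrun's exact scheme):

import hashlib

def partition_key(payload: dict, sticky_key: str = "user_id:dataset_id") -> bytes:
    # Concatenate the declared fields in order, then hash them. Kafka's
    # partitioner maps a given key to a fixed partition, so identical field
    # values always land on the same worker.
    values = [str(payload.get(field, "")) for field in sticky_key.split(":")]
    return hashlib.sha256("|".join(values).encode()).digest()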
The cache hierarchy
ctx.get_or_fetch(s3_url) does this:
- Check the LRUCache (backed by NVMe at /var/lib/memrun/cache/{service}/)
- On miss: fetch from S3 via aioboto3, write to the cache, return
- On hit: read from NVMe (or the OS page cache if recently accessed), return
The cache uses LRU eviction when full. No TTL by default — data stays cached until space is needed. The index is persisted to index.json so workers can recover cache state after restarts.
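In pseudocode, the read path is roughly this (a sketch of the behavior described above, not memrun's actual source; only get_or_fetch is a real SDK name):

# Sketch of the cache-through read path described above.
async def get_or_fetch(self, s3_url: str) -> bytes:
    cache_path = self._cache.path_for(s3_url)   # under /var/lib/memrun/cache/{service}/
    if cache_path.exists():
        return cache_path.read_bytes()          # NVMe hit (or OS page cache)

    data = await self._s3.download(s3_url)      # miss: fetch via aioboto3
    self._cache.write(s3_url, data)             # may trigger LRU eviction
    return data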
For in-memory objects (decoded models, indices, parsed dataframes), use the object store:
# Store once
ctx.set_object("parsed_df", parse_parquet(raw_bytes))
# Retrieve on every request (microseconds, no deserialization)
df = ctx.get_object("parsed_df")
Bounded concurrency
Each worker runs at most N requests concurrently (default 16). The RequestExecutor enforces this with an asyncio.Semaphore:
# Internal implementation
async with self._semaphore:
    result = await asyncio.wait_for(
        handler(ctx, payload),
        timeout=self._timeout_seconds,
    )
When all 16 slots are occupied, the worker stops pulling from Kafka. Messages queue up on the broker. This is explicit backpressure — no silent degradation, no OOM kills from unbounded parallelism. If the queue grows too deep, you scale workers.
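As a rough illustration of that loop, here is what bounding the consumer with a semaphore can look like, assuming an aiokafka consumer (memrun's actual consumer code is not shown here):

import asyncio
from aiokafka import AIOKafkaConsumer

async def consume_loop(consumer: AIOKafkaConsumer, handle_one, concurrency: int = 16):
    # Acquiring a slot before fetching the next message means a saturated
    # worker simply stops pulling from Kafka; messages queue on the broker.
    sem = asyncio.Semaphore(concurrency)

    async for msg in consumer:
        await sem.acquire()                     # blocks when all slots are busy

        async def run(m=msg):
            try:
                await handle_one(m)
            finally:
                sem.release()

        asyncio.create_task(run())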
Local development
Test your handler locally without deploying:
if __name__ == "__main__":
    svc.serve(host="0.0.0.0", port=8000)
Or deploy to local Docker containers:
memrun deploy handler.py --name analytics --local
This uses LocalProvisioner to start Docker containers instead of Hetzner VMs. Same handler code, same cache behavior, local infrastructure.
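You can also skip the server entirely and drive the handler function from a test, stubbing out the context. FakeContext below is a hypothetical helper, not part of the SDK; it fakes only the methods a typical handler touches, backed by local fixture files:

# Hypothetical local test harness; FakeContext is not part of the memrun SDK.
import asyncio
from pathlib import Path

from handler import handle  # your @svc.handler function

class FakeContext:
    """Stubs the small slice of the worker context a handler uses."""

    def __init__(self, fixtures: Path):
        self.fixtures = fixtures
        self._objects = {}

    async def get_or_fetch(self, url: str) -> bytes:
        # Map s3://bucket/path/file to a local fixture of the same name
        return (self.fixtures / Path(url).name).read_bytes()

    def set_object(self, key, value):
        self._objects[key] = value

    def get_object(self, key):
        return self._objects.get(key)

async def main():
    ctx = FakeContext(Path("./fixtures"))
    print(await handle(ctx, {"user_id": "u1", "dataset_path": "s3://bucket/data.bin", "query": "revenue by month"}))

asyncio.run(main())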
CLI operations
# Deploy
memrun deploy handler.py --name my-service --memory 32Gi --disk 600Gi
# List services
memrun ls
memrun ls --watch # Live refresh
# Invoke
memrun invoke my-service --payload '{"user_id": "u1", "query": "..."}'
memrun invoke my-service --payload '{"user_id": "u1", "query": "..."}' --sync
# Scale
memrun scale my-service --workers 20
# Monitor
memrun status my-service
memrun logs my-service --follow
memrun queue my-service # Queue depth and consumer lag
# Manage infrastructure
memrun servers --service my-service
memrun delete-server my-service worker-abc123
# Teardown
memrun delete my-service --force
What happens when you deploy
The full lifecycle:
- Package: the CLI bundles handler.py + manifest.json into a tarball
- Upload: the tarball goes to S3 at memrun-artifacts/deployments/{service}/{id}/package.tar.gz
- Schedule: the API creates a DeploymentModel and Scheduler.deploy() begins a rolling deployment
- Provision: for each worker, ServerProvisioner generates cloud-init YAML and creates a Hetzner VM
- Boot: cloud-init installs Python, mem-worker, memrun, and uv
- Download: the worker downloads the handler tarball from a presigned S3 URL
- Dependencies: WorkerRuntime._install_dependencies() runs uv sync --script handler.py for PEP 723 inline dependencies
- Init: _load_handler() imports your module and discovers @svc.handler and @svc.init_handler functions
- Warmup: the init handler runs (load models, build indices)
- Ready: the worker starts a Kafka consumer on topic memrun.requests.{service} and sends its first heartbeat
- Serve: requests flow; stats are reported every 10 seconds via heartbeat
Rolling deployments: new workers are provisioned and confirmed ready before old ones are deprovisioned. Kafka rebalances partitions to include new consumers.
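Sketched out, the rolling order is roughly the following (names like provision_worker are illustrative; Scheduler.deploy() is the real entry point):

# Simplified sketch of the rolling strategy described above.
async def rolling_deploy(scheduler, service, new_version, count):
    new_workers = []
    for _ in range(count):
        worker = await scheduler.provision_worker(service, new_version)  # Hetzner VM + cloud-init
        await scheduler.wait_for_ready(worker)   # init handler finished, first heartbeat seen
        new_workers.append(worker)

    # Old workers go away only after the new set is confirmed ready;
    # Kafka then rebalances partitions onto the new consumers.
    for worker in scheduler.current_workers(service):
        if worker not in new_workers:
            await scheduler.deprovision_worker(worker)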
PEP 723 inline dependencies
Your handler can declare its own dependencies without a requirements.txt:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "numpy>=1.24",
# "pandas>=2.0",
# "faiss-cpu>=1.7",
# ]
# ///
from memrun import MemoryService
import numpy as np
import pandas as pd
import faiss
svc = MemoryService(name="vector-search", memory="16Gi", disk="200Gi")
@svc.handler(sticky_key="collection_id")
async def handle(ctx, req):
    ...
WorkerRuntime detects PEP 723 metadata and runs uv sync --script handler.py before loading the handler. Dependencies are installed once per deployment, not per request.
Example: document QA service
A complete example — a service that answers questions over large document corpora:
# /// script
# requires-python = ">=3.11"
# dependencies = ["sentence-transformers>=2.2", "faiss-cpu>=1.7", "numpy>=1.24"]
# ///
from memrun import MemoryService
from sentence_transformers import SentenceTransformer
import faiss
import json
import numpy as np

svc = MemoryService(
    name="doc-qa",
    memory="16Gi",
    disk="400Gi",
    max_workers=20,
    concurrency=8,
    timeout_seconds=60,
)

@svc.init_handler()
async def setup(ctx):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    ctx.set_object("encoder", model)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    encoder = ctx.get_object("encoder")

    # Cache-through: the first request fetches from S3, subsequent ones hit NVMe
    index_bytes = await ctx.get_or_fetch(
        f"s3://corpora/{req['corpus_id']}/index.faiss"
    )
    metadata_bytes = await ctx.get_or_fetch(
        f"s3://corpora/{req['corpus_id']}/metadata.json"
    )

    # faiss expects a uint8 array rather than raw bytes
    index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))
    metadata = json.loads(metadata_bytes)

    # Encode query and search
    query_vec = encoder.encode([req["question"]])
    distances, indices = index.search(query_vec, req.get("top_k", 5))

    results = []
    for i, idx in enumerate(indices[0]):
        if idx >= 0:
            results.append({
                "text": metadata[idx]["text"],
                "score": float(distances[0][i]),
                "source": metadata[idx].get("source", ""),
            })

    return {"results": results}
Deploy and invoke:
# Deploy
memrun deploy doc_qa.py --name doc-qa --memory 16Gi --disk 400Gi
# Query
memrun invoke doc-qa --sync --payload '{
"corpus_id": "legal-contracts-2024",
"question": "What are the termination clauses?",
"top_k": 5
}'
The first request for corpus_id=legal-contracts-2024 fetches the FAISS index and metadata from S3 and caches them on NVMe. Every subsequent request for the same corpus hits the cache. The sentence transformer model loads once in init_handler and stays in memory for the worker’s lifetime.
Example: financial time-series analysis
from memrun import MemoryService
import pandas as pd
import io
svc = MemoryService(
    name="timeseries",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
)

@svc.handler(sticky_key="portfolio_id")
async def handle(ctx, req):
    # Each portfolio has ~500MB of historical data
    raw = await ctx.get_or_fetch(
        f"s3://market-data/portfolios/{req['portfolio_id']}/history.parquet"
    )
    df = pd.read_parquet(io.BytesIO(raw))

    # Filter by date range
    mask = (df["date"] >= req["start"]) & (df["date"] <= req["end"])
    subset = df[mask]

    # Compute requested metrics
    result = {}
    if "returns" in req.get("metrics", []):
        result["returns"] = subset["close"].pct_change().describe().to_dict()
    if "volatility" in req.get("metrics", []):
        result["volatility"] = float(subset["close"].pct_change().std() * (252 ** 0.5))
    if "sharpe" in req.get("metrics", []):
        excess = subset["close"].pct_change() - 0.04 / 252
        result["sharpe"] = float(excess.mean() / excess.std() * (252 ** 0.5))

    return result
The 500MB parquet file for each portfolio loads once from S3 and stays cached on NVMe. A financial analyst running multiple queries against the same portfolio pays the load cost once.
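Re-parsing 500MB of parquet on every warm request still costs something, so you could go a step further and keep the decoded DataFrame in the object store, keyed by portfolio. A small variation on the handler above (it assumes get_object returns None for an unknown key; adjust if the SDK raises instead):

# Variation: cache the decoded DataFrame per portfolio so warm requests
# skip both the S3 fetch and the parquet parse.
async def load_portfolio(ctx, portfolio_id: str) -> pd.DataFrame:
    key = f"df:{portfolio_id}"
    df = ctx.get_object(key)          # assumes None is returned for a missing key
    if df is None:
        raw = await ctx.get_or_fetch(
            f"s3://market-data/portfolios/{portfolio_id}/history.parquet"
        )
        df = pd.read_parquet(io.BytesIO(raw))
        ctx.set_object(key, df)
    return df

Whether that is worth the RAM depends on how many distinct portfolios each worker serves; sticky routing keeps that set small.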
Infrastructure
memrun runs on Hetzner Cloud. Worker VMs are selected based on memory and disk requirements:
cpx11: 2 vCPU, 4GB RAM, 40GB NVMe ~€4/mo
cpx21: 3 vCPU, 8GB RAM, 80GB NVMe ~€8/mo
cpx31: 4 vCPU, 16GB RAM, 160GB NVMe ~€15/mo
cpx41: 8 vCPU, 32GB RAM, 240GB NVMe ~€29/mo
cpx51: 16 vCPU, 32GB RAM, 360GB NVMe ~€54/mo
For comparison, an AWS r6g.xlarge (4 vCPU, 32GB RAM) costs ~$150/month. The economics work because memory-bound workloads don’t need hyperscaler network or instance variety — they need large RAM, fast NVMe, and low cost per GB.
The control plane (FastAPI + PostgreSQL + Kafka/Redpanda + MinIO/S3) runs separately. Workers communicate with it via HTTP heartbeats and result reporting.
Configuration
# ~/.memrunctl/config.toml
[local]
api_url = "http://localhost:8000"
[production]
api_url = "https://api.memrun.net"
api_token = "your-token"
The CLI reads this config to determine where to deploy. Use --local flag for Docker-based local development.
Tradeoffs and limitations
Things memrun is not good for:
- Sub-second latency requirements. Workers process requests from a Kafka queue. There’s inherent latency from the message broker. If you need sub-50ms responses, use a direct HTTP service.
- Purely CPU-bound workloads. If you’re not reading significant data, traditional serverless is simpler and likely cheaper.
- Stateless request-reply APIs. If requests are independent with no data reuse, sticky routing adds no value.
- Scale-to-zero. Workers are long-lived by design. min_workers=0 is supported, but cold starts mean provisioning a VM, which is slow.
Things to be aware of:
- Kafka rebalancing during scaling. When workers are added or removed, some partitions move. Affected keys pay one cold-start penalty.
- Cache is not replicated. If a worker dies, its NVMe cache is lost. Data reconstructs from S3 on the replacement worker.
- Single-region currently. Workers run in one Hetzner location. No multi-region routing yet.
- Python only. The SDK and worker runtime are Python. No polyglot support currently.
What’s next
We’re using memrun in production for our own data-intensive services. Things we’re working on:
- Queue-depth-based autoscaling (currently manual via memrun scale)
- Multi-region worker placement
- Cache pre-warming during deployments (transfer hot keys from old workers)
- Metrics dashboard (currently stats are via heartbeat API)
If you’re building something data-heavy and tired of paying the state tax on every request, give it a look. Feedback welcome.