Deploy data-heavy Python services with warm memory, sticky routing, and zero infrastructure code.
What memrun is
Memrun is a Python SDK and deployment platform for services that are memory-bound, not CPU-bound. The kind of services where you load a 2GB dataset or a large ML model, then serve many requests against it. Where the expensive part is getting data into memory, not processing it.
You write a handler function with a decorator. Memrun handles everything else: provisioning workers on Hetzner VMs, routing related requests to the same worker via Kafka partition keys, managing a 600GB NVMe-backed cache per worker, bounding concurrency with semaphores, and reporting health via heartbeats.
from memrun import MemoryService

svc = MemoryService(
    name="doc-search",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
    timeout_seconds=300,
)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    embeddings = await ctx.get_or_fetch(req["embeddings_path"])
    results = search(embeddings, req["query"], top_k=req.get("top_k", 10))
    return {"matches": results}
Deploy with one command:
memrun deploy handler.py --name doc-search --memory 32Gi --disk 600Gi
That’s it. No Dockerfiles, no Kubernetes manifests, no Terraform. The platform boots VMs, installs your handler, and starts serving.
The problem it solves
Every time I built a data-intensive service on Lambda or Cloud Run, the same pattern emerged:
1. Request arrives
2. Fetch 1-3GB from S3 (200-800ms)
3. Deserialize into working structures (100-500ms)
4. Do the actual computation (50-200ms)
5. Return result
6. Container gets killed
7. Next request: repeat from step 1
Steps 2-3 dominate runtime. Steps 6-7 make it worse. The actual work is step 4, but you’re paying for steps 2-3 on every single request.
With memrun, steps 2-3 happen once. The LRUCache stores fetched data on NVMe. The SharedWorkerContext keeps decoded structures in memory across requests. Sticky routing via Kafka partition keys ensures the same user/dataset combination always lands on the same worker. After the first request, subsequent requests skip straight to step 4.
How it works, concretely
The SDK
You define a MemoryService with resource declarations:
svc = MemoryService(
    name="analytics",                 # DNS-compatible service name
    memory="32Gi",                    # RAM per worker
    disk="600Gi",                     # NVMe cache per worker
    max_workers=50,                   # Maximum worker count
    min_workers=2,                    # Minimum (always-on) workers
    concurrency=16,                   # Max concurrent requests per worker
    timeout_seconds=300,              # Per-request timeout
    env={"MODEL_VERSION": "v3"},      # Environment variables
)
Init handlers
Load expensive resources once per worker lifetime:
@svc.init_handler()
async def setup(ctx):
    """Runs once before the worker accepts requests."""
    # Download and load a large model
    model_bytes = await ctx.cache.get_or_fetch_from_s3(
        "s3://models/transformer-large.bin"
    )
    model = deserialize_model(model_bytes)
    ctx.set_object("model", model)

    # Pre-build an index
    index = build_faiss_index(model.embeddings)
    ctx.set_object("index", index)
The init_handler runs after WorkerRuntime loads the handler module and before the Kafka consumer starts. If it fails, the worker reports failure and the deployment is marked as FAILED. Objects stored via ctx.set_object() persist in SharedWorkerContext._objects for the worker’s entire lifetime.
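In rough terms, the startup order is: load the handler module, run init handlers, then start consuming. A simplified sketch of that sequence (the names below other than those mentioned above are illustrative; the real WorkerRuntime internals may differ):

# Simplified sketch of the startup order described above; not memrun source.
async def start_worker(runtime):
    module = runtime.load_handler_module()   # imports your handler.py
    ctx = runtime.shared_context             # SharedWorkerContext

    # Init handlers run before any request is consumed
    for init_fn in module.init_handlers:
        try:
            await init_fn(ctx)
        except Exception:
            runtime.report_failure()         # deployment marked FAILED
            raise

    await runtime.start_kafka_consumer()     # only now do requests flow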
Request handlers
The handler receives a WorkerRequestContext and request payload:
@svc.handler(sticky_key="user_id:dataset_id")
async def handle(ctx, req):
    # Objects set in init_handler are available here
    model = ctx.get_object("model")
    index = ctx.get_object("index")

    # Cache-through read: NVMe first, then S3
    dataset = await ctx.get_or_fetch(req["dataset_path"])

    # Your actual logic
    results = model.query(dataset, req["question"])
    ranked = index.search(results, top_k=10)

    # Optionally persist results to S3
    result_url = await ctx.upload_result(
        json.dumps(ranked).encode(),
        key=f"results/{ctx.request_id}.json",
    )
    return {"ranked": ranked, "result_url": result_url}
The sticky_key="user_id:dataset_id" tells memrun to extract those fields from the request payload, hash them, and use the hash as the Kafka partition key. Same values, same partition, same worker. The cache stays warm.
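You don't implement any of this yourself, but as a rough sketch of the idea (the field parsing and hash choice here are illustrative, not necessarily memrun's exact scheme):

import hashlib

def partition_key(payload: dict, sticky_key: str = "user_id:dataset_id") -> bytes:
    # Concatenate the declared fields in order, then hash them. Kafka's
    # partitioner maps a given key to a fixed partition, so identical field
    # values always land on the same worker.
    values = [str(payload.get(field, "")) for field in sticky_key.split(":")]
    return hashlib.sha256("|".join(values).encode()).digest()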
The cache hierarchy
ctx.get_or_fetch(s3_url) does this:
- Check the LRUCache (backed by NVMe at /var/lib/memrun/cache/{service}/)
- On miss: fetch from S3 via aioboto3, write to the cache, return
- On hit: read from NVMe (or the OS page cache if recently accessed), return
The cache uses LRU eviction when full. No TTL by default — data stays cached until space is needed. The index is persisted to index.json so workers can recover cache state after restarts.
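In pseudocode, the read path is roughly this (a sketch of the behavior described above, not memrun's actual source; only get_or_fetch is a real SDK name):

# Sketch of the cache-through read path described above.
async def get_or_fetch(self, s3_url: str) -> bytes:
    cache_path = self._cache.path_for(s3_url)   # under /var/lib/memrun/cache/{service}/
    if cache_path.exists():
        return cache_path.read_bytes()          # NVMe hit (or OS page cache)

    data = await self._s3.download(s3_url)      # miss: fetch via aioboto3
    self._cache.write(s3_url, data)             # may trigger LRU eviction
    return data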
For in-memory objects (decoded models, indices, parsed dataframes), use the object store:
# Store once
ctx.set_object("parsed_df", parse_parquet(raw_bytes))
# Retrieve on every request (microseconds, no deserialization)
df = ctx.get_object("parsed_df")
Bounded concurrency
Each worker runs at most N requests concurrently (default 16). The RequestExecutor enforces this with an asyncio.Semaphore:
# Internal implementation
async with self._semaphore:
    result = await asyncio.wait_for(
        handler(ctx, payload),
        timeout=self._timeout_seconds,
    )
When all 16 slots are occupied, the worker stops pulling from Kafka. Messages queue up on the broker. This is explicit backpressure — no silent degradation, no OOM kills from unbounded parallelism. If the queue grows too deep, you scale workers.
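As a rough illustration of that loop, here is what bounding the consumer with a semaphore can look like, assuming an aiokafka consumer (memrun's actual consumer code is not shown here):

import asyncio
from aiokafka import AIOKafkaConsumer

async def consume_loop(consumer: AIOKafkaConsumer, handle_one, concurrency: int = 16):
    # Acquiring a slot before fetching the next message means a saturated
    # worker simply stops pulling from Kafka; messages queue on the broker.
    sem = asyncio.Semaphore(concurrency)

    async for msg in consumer:
        await sem.acquire()                     # blocks when all slots are busy

        async def run(m=msg):
            try:
                await handle_one(m)
            finally:
                sem.release()

        asyncio.create_task(run())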
Local development
Test your handler locally without deploying:
if __name__ == "__main__":
    svc.serve(host="0.0.0.0", port=8000)
Or deploy to local Docker containers:
memrun deploy handler.py --name analytics --local
This uses LocalProvisioner to start Docker containers instead of Hetzner VMs. Same handler code, same cache behavior, local infrastructure.
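You can also skip the server entirely and drive the handler function from a test, stubbing out the context. FakeContext below is a hypothetical helper, not part of the SDK; it fakes only the methods a typical handler touches, backed by local fixture files:

# Hypothetical local test harness; FakeContext is not part of the memrun SDK.
import asyncio
from pathlib import Path

from handler import handle  # your @svc.handler function

class FakeContext:
    """Stubs the small slice of the worker context a handler uses."""

    def __init__(self, fixtures: Path):
        self.fixtures = fixtures
        self._objects = {}

    async def get_or_fetch(self, url: str) -> bytes:
        # Map s3://bucket/path/file to a local fixture of the same name
        return (self.fixtures / Path(url).name).read_bytes()

    def set_object(self, key, value):
        self._objects[key] = value

    def get_object(self, key):
        return self._objects.get(key)

async def main():
    ctx = FakeContext(Path("./fixtures"))
    print(await handle(ctx, {"user_id": "u1", "dataset_path": "s3://bucket/data.bin", "query": "revenue by month"}))

asyncio.run(main())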
CLI operations
# Deploy
memrun deploy handler.py --name my-service --memory 32Gi --disk 600Gi
# List services
memrun ls
memrun ls --watch # Live refresh
# Invoke
memrun invoke my-service --payload '{"user_id": "u1", "query": "..."}'
memrun invoke my-service --payload '{"user_id": "u1", "query": "..."}' --sync
# Scale
memrun scale my-service --workers 20
# Monitor
memrun status my-service
memrun logs my-service --follow
memrun queue my-service # Queue depth and consumer lag
# Manage infrastructure
memrun servers --service my-service
memrun delete-server my-service worker-abc123
# Teardown
memrun delete my-service --force
What happens when you deploy
The full lifecycle:
- Package: the CLI bundles handler.py + manifest.json into a tarball
- Upload: the tarball goes to S3 at memrun-artifacts/deployments/{service}/{id}/package.tar.gz
- Schedule: the API creates a DeploymentModel and Scheduler.deploy() begins a rolling deployment
- Provision: for each worker, ServerProvisioner generates cloud-init YAML and creates a Hetzner VM
- Boot: cloud-init installs Python, mem-worker, memrun, and uv
- Download: the worker downloads the handler tarball from a presigned S3 URL
- Dependencies: WorkerRuntime._install_dependencies() runs uv sync --script handler.py for PEP 723 inline dependencies
- Init: _load_handler() imports your module and discovers @svc.handler and @svc.init_handler functions
- Warmup: the init handler runs (load models, build indices)
- Ready: the worker starts a Kafka consumer on topic memrun.requests.{service} and sends its first heartbeat
- Serve: requests flow; stats are reported every 10 seconds via heartbeat
Rolling deployments: new workers are provisioned and confirmed ready before old ones are deprovisioned. Kafka rebalances partitions to include new consumers.
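Sketched out, the rolling order is roughly the following (names like provision_worker are illustrative; Scheduler.deploy() is the real entry point):

# Simplified sketch of the rolling strategy described above.
async def rolling_deploy(scheduler, service, new_version, count):
    new_workers = []
    for _ in range(count):
        worker = await scheduler.provision_worker(service, new_version)  # Hetzner VM + cloud-init
        await scheduler.wait_for_ready(worker)   # init handler finished, first heartbeat seen
        new_workers.append(worker)

    # Old workers go away only after the new set is confirmed ready;
    # Kafka then rebalances partitions onto the new consumers.
    for worker in scheduler.current_workers(service):
        if worker not in new_workers:
            await scheduler.deprovision_worker(worker)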
PEP 723 inline dependencies
Your handler can declare its own dependencies without a requirements.txt:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "numpy>=1.24",
# "pandas>=2.0",
# "faiss-cpu>=1.7",
# ]
# ///
from memrun import MemoryService
import numpy as np
import pandas as pd
import faiss
svc = MemoryService(name="vector-search", memory="16Gi", disk="200Gi")
@svc.handler(sticky_key="collection_id")
async def handle(ctx, req):
    ...
WorkerRuntime detects PEP 723 metadata and runs uv sync --script handler.py before loading the handler. Dependencies are installed once per deployment, not per request.
Example: document QA service
A complete example — a service that answers questions over large document corpora:
# /// script
# requires-python = ">=3.11"
# dependencies = ["sentence-transformers>=2.2", "faiss-cpu>=1.7", "numpy>=1.24"]
# ///
from memrun import MemoryService
from sentence_transformers import SentenceTransformer
import faiss
import json
import numpy as np

svc = MemoryService(
    name="doc-qa",
    memory="16Gi",
    disk="400Gi",
    max_workers=20,
    concurrency=8,
    timeout_seconds=60,
)

@svc.init_handler()
async def setup(ctx):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    ctx.set_object("encoder", model)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    encoder = ctx.get_object("encoder")

    # Cache-through: the first request fetches from S3, subsequent ones hit NVMe
    index_bytes = await ctx.get_or_fetch(
        f"s3://corpora/{req['corpus_id']}/index.faiss"
    )
    metadata_bytes = await ctx.get_or_fetch(
        f"s3://corpora/{req['corpus_id']}/metadata.json"
    )

    # faiss expects a uint8 array rather than raw bytes
    index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))
    metadata = json.loads(metadata_bytes)

    # Encode query and search
    query_vec = encoder.encode([req["question"]])
    distances, indices = index.search(query_vec, req.get("top_k", 5))

    results = []
    for i, idx in enumerate(indices[0]):
        if idx >= 0:
            results.append({
                "text": metadata[idx]["text"],
                "score": float(distances[0][i]),
                "source": metadata[idx].get("source", ""),
            })

    return {"results": results}
Deploy and invoke:
# Deploy
memrun deploy doc_qa.py --name doc-qa --memory 16Gi --disk 400Gi
# Query
memrun invoke doc-qa --sync --payload '{
"corpus_id": "legal-contracts-2024",
"question": "What are the termination clauses?",
"top_k": 5
}'
The first request for corpus_id=legal-contracts-2024 fetches the FAISS index and metadata from S3 and caches them on NVMe. Every subsequent request for the same corpus hits the cache. The sentence transformer model loads once in init_handler and stays in memory for the worker’s lifetime.
Example: financial time-series analysis
from memrun import MemoryService
import pandas as pd
import io
svc = MemoryService(
    name="timeseries",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
)

@svc.handler(sticky_key="portfolio_id")
async def handle(ctx, req):
    # Each portfolio has ~500MB of historical data
    raw = await ctx.get_or_fetch(
        f"s3://market-data/portfolios/{req['portfolio_id']}/history.parquet"
    )
    df = pd.read_parquet(io.BytesIO(raw))

    # Filter by date range
    mask = (df["date"] >= req["start"]) & (df["date"] <= req["end"])
    subset = df[mask]

    # Compute requested metrics
    result = {}
    if "returns" in req.get("metrics", []):
        result["returns"] = subset["close"].pct_change().describe().to_dict()
    if "volatility" in req.get("metrics", []):
        result["volatility"] = float(subset["close"].pct_change().std() * (252 ** 0.5))
    if "sharpe" in req.get("metrics", []):
        excess = subset["close"].pct_change() - 0.04 / 252
        result["sharpe"] = float(excess.mean() / excess.std() * (252 ** 0.5))

    return result
The 500MB parquet file for each portfolio loads once from S3 and stays cached on NVMe. A financial analyst running multiple queries against the same portfolio pays the load cost once.
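Re-parsing 500MB of parquet on every warm request still costs something, so you could go a step further and keep the decoded DataFrame in the object store, keyed by portfolio. A small variation on the handler above (it assumes get_object returns None for an unknown key; adjust if the SDK raises instead):

# Variation: cache the decoded DataFrame per portfolio so warm requests
# skip both the S3 fetch and the parquet parse.
async def load_portfolio(ctx, portfolio_id: str) -> pd.DataFrame:
    key = f"df:{portfolio_id}"
    df = ctx.get_object(key)          # assumes None is returned for a missing key
    if df is None:
        raw = await ctx.get_or_fetch(
            f"s3://market-data/portfolios/{portfolio_id}/history.parquet"
        )
        df = pd.read_parquet(io.BytesIO(raw))
        ctx.set_object(key, df)
    return df

Whether that is worth the RAM depends on how many distinct portfolios each worker serves; sticky routing keeps that set small.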
Infrastructure
memrun runs on Hetzner Cloud. Worker VMs are selected based on memory and disk requirements:
cpx11: 2 vCPU, 4GB RAM, 40GB NVMe ~€4/mo
cpx21: 3 vCPU, 8GB RAM, 80GB NVMe ~€8/mo
cpx31: 4 vCPU, 16GB RAM, 160GB NVMe ~€15/mo
cpx41: 8 vCPU, 32GB RAM, 240GB NVMe ~€29/mo
cpx51: 16 vCPU, 32GB RAM, 360GB NVMe ~€54/mo
For comparison, an AWS r6g.xlarge (4 vCPU, 32GB RAM) costs ~$150/month. The economics work because memory-bound workloads don’t need hyperscaler network or instance variety — they need large RAM, fast NVMe, and low cost per GB.
The control plane (FastAPI + PostgreSQL + Kafka/Redpanda + MinIO/S3) runs separately. Workers communicate with it via HTTP heartbeats and result reporting.
Configuration
# ~/.memrunctl/config.toml
[local]
api_url = "http://localhost:8000"
[production]
api_url = "https://api.memrun.net"
api_token = "your-token"
The CLI reads this config to determine where to deploy. Use --local flag for Docker-based local development.
Tradeoffs and limitations
Things memrun is not good for:
- Sub-second latency requirements. Workers process requests from a Kafka queue. There’s inherent latency from the message broker. If you need sub-50ms responses, use a direct HTTP service.
- Purely CPU-bound workloads. If you’re not reading significant data, traditional serverless is simpler and likely cheaper.
- Stateless request-reply APIs. If requests are independent with no data reuse, sticky routing adds no value.
- Scale-to-zero. Workers are long-lived by design. min_workers=0 is supported, but cold starts mean provisioning a VM, which is slow.
Things to be aware of:
- Kafka rebalancing during scaling. When workers are added or removed, some partitions move. Affected keys pay one cold-start penalty.
- Cache is not replicated. If a worker dies, its NVMe cache is lost. Data reconstructs from S3 on the replacement worker.
- Single-region currently. Workers run in one Hetzner location. No multi-region routing yet.
- Python only. The SDK and worker runtime are Python. No polyglot support currently.
What’s next
We’re using memrun in production for our own data-intensive services. Things we’re working on:
- Queue-depth-based autoscaling (currently manual via memrun scale)
- Multi-region worker placement
- Cache pre-warming during deployments (transfer hot keys from old workers)
- Metrics dashboard (currently stats are via heartbeat API)
If you’re building something data-heavy and tired of paying the state tax on every request, give it a look. Feedback welcome.