The real value is warm state, locality, and cache — not generic compute.
The limits of today’s serverless
AWS Lambda, Google Cloud Run, and Fly.io Machines all share a common abstraction: stateless requests handled by ephemeral containers that scale up and down on demand. This works well when requests are independent, data fits in memory or comes from fast remote stores, cold starts are acceptable, and compute is the bottleneck.
But it breaks down when each request reads gigabytes of data, when the same user sends consecutive requests over similar data, when memory bandwidth dominates CPU, or when the working set is much larger than RAM but benefits from caching. In these cases, statelessness isn’t a feature. It’s a tax.
The class of problems we keep fighting
Interactive analytics over large datasets. Matrix and tensor queries. Document or embedding search over large corpora. Time-series slicing. Scientific and financial analysis pipelines.
These workloads share common characteristics. They read one to three gigabytes per request. They run for ten to fifteen seconds. They are memory-bound, not CPU-bound. They exhibit strong temporal locality — follow-up queries overlap heavily with previous ones. Throughput matters more than raw parallelism.
Running these on pure serverless results in constant cache misses, repeated object storage reads, egress costs, and poor tail latency. What these workloads want is warm state, not infinite scale.
Warm memory is the product
Most platforms treat memory as something you allocate per request, or something you get for free if the container happens to be reused. That’s backwards for memory-bound systems.
For these workloads, warm RAM is the expensive, valuable resource. Fast local disk is the second tier. Remote object storage is the cold backup. The performance difference isn’t marginal:
- RAM → microseconds
- NVMe / SSD → low milliseconds
- Object storage → tens to hundreds of milliseconds
In memrun, this hierarchy is explicit. Each worker gets a SharedWorkerContext that owns a disk-backed LRUCache and an in-memory object store. The LRUCache lives on NVMe at /var/lib/memrun/cache/{service_name}/ with configurable capacity up to 600GB. The in-memory store (_objects: dict[str, Any]) holds decoded structures — models, indices, parsed dataframes — across the worker’s lifetime.
When a handler calls ctx.get_or_fetch(s3_url), the lookup order is:
- In-memory object store (microseconds)
- NVMe-backed LRU cache (low milliseconds)
- S3 fetch + cache write (hundreds of milliseconds)
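In code, that cache-through lookup is only a few branches. Here is a minimal sketch, assuming the LRUCache exposes a get/put interface and that S3 access goes through an aioboto3 session; memrun's real SharedWorkerContext does more (eviction accounting, locking, metrics), so treat the details as illustrative:

```python
# Minimal sketch of the three-tier lookup; the LRUCache get/put interface
# and URL parsing below are assumptions, not memrun's exact code.
from typing import Any
from urllib.parse import urlparse

import aioboto3


class SharedWorkerContext:
    def __init__(self, cache):
        self._objects: dict[str, Any] = {}   # tier 1: decoded objects in RAM (microseconds)
        self._cache = cache                  # tier 2: NVMe-backed LRU cache (low milliseconds)
        self._session = aioboto3.Session()   # tier 3: S3 (tens to hundreds of milliseconds)

    async def get_or_fetch(self, s3_url: str):
        # 1. In-memory object store, keyed here by URL
        if s3_url in self._objects:
            return self._objects[s3_url]

        # 2. Disk-backed LRU cache on NVMe
        data = self._cache.get(s3_url)
        if data is not None:
            return data

        # 3. Fetch from S3, then write through to the local cache
        parsed = urlparse(s3_url)
        async with self._session.client("s3") as s3:
            resp = await s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
            data = await resp["Body"].read()
        self._cache.put(s3_url, data)
        return data
```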
A platform that ignores this hierarchy leaves performance on the table. A platform that embraces it treats memory as the product, with compute attached.
Locality beats elasticity
Here’s something I keep coming back to: serving the same user twice on the same machine is often worth more than adding ten more machines.
In memrun, locality is implemented through Kafka partition keys. Each service declares a sticky_key (like "user_id:dataset_id"), and the platform hashes request payloads by that key to determine the Kafka partition. Since each partition is consumed by exactly one worker in the consumer group (memrun-{service_name}), related requests always land on the same machine.
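The routing arithmetic fits in a few lines. A sketch, assuming the sticky key is built by joining the named payload fields and hashed with any stable function (the helper name and field values below are illustrative, not memrun's actual implementation):

```python
# Illustrative sketch of sticky routing via a stable hash over the sticky key.
import hashlib


def partition_for(payload: dict, sticky_key: str, num_partitions: int) -> int:
    # "user_id:dataset_id" -> e.g. "u123:sales-2024"
    key = ":".join(str(payload[field]) for field in sticky_key.split(":"))
    # A stable hash maps the same key to the same partition every time,
    # hence to the same worker in the memrun-{service_name} consumer group.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


p = partition_for(
    {"user_id": "u123", "dataset_id": "sales-2024", "question": "..."},
    sticky_key="user_id:dataset_id",
    num_partitions=12,
)
# p is identical on every call, so follow-up requests land on the same worker.
```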
The result: reused decoded data structures, OS page cache hits, hot NVMe blocks, and precomputed indices. Elasticity, by contrast, spreads related work across machines, destroys cache warmth, and increases total I/O.
This doesn’t mean never scale. It means scale deliberately. When memrun’s Scheduler.scale_service() adds workers, Kafka rebalances partitions across the new consumer group membership. Some keys move to new workers and pay one cold-start penalty. But the majority stay put. The goal isn’t maximum utilization. The goal is minimum data movement.
A missing service category
What’s missing from existing platforms is long-lived workers with large memory footprints, fast local disks, sticky routing, bounded concurrency, and explicit backpressure. Not functions as a service, but memory as a service with compute attached.
This is what memrun provides. The primitives:
- Workers stay alive for hours. Provisioned on Hetzner VMs (2-16 vCPU, 4-32GB RAM, up to 360GB NVMe) or local Docker containers.
- Sticky routing via Kafka partition keys. Same key, same worker.
- NVMe LRU cache per worker. Up to 600GB, with automatic eviction.
- Bounded concurrency via asyncio.Semaphore. Default 16 concurrent requests per worker.
- At-least-once delivery via manual Kafka offset commits (sketched after this list). No lost requests.
- Init handlers for pre-loading models and data before the worker accepts requests.
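The at-least-once item deserves a closer look. A minimal sketch of the consume-then-commit pattern, assuming aiokafka and the topic and consumer-group naming used elsewhere in this post, stripped of memrun's concurrency and timeout handling:

```python
# Sketch of at-least-once consumption with manual offset commits (aiokafka assumed).
from aiokafka import AIOKafkaConsumer


async def consume(service_name: str, handle_request) -> None:
    consumer = AIOKafkaConsumer(
        f"memrun.requests.{service_name}",
        bootstrap_servers="localhost:9092",
        group_id=f"memrun-{service_name}",
        enable_auto_commit=False,   # commit only after the handler succeeds
    )
    await consumer.start()
    try:
        async for msg in consumer:
            await handle_request(msg.value)
            # Committing after processing means a crash mid-request
            # redelivers it rather than losing it: at-least-once.
            await consumer.commit()
    finally:
        await consumer.stop()
```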
What serverless warm memory actually looks like
The programming model is a Python decorator that declares affinity and resources:
```python
from memrun import MemoryService

svc = MemoryService(
    name="matrix-qa",
    memory="32Gi",
    disk="600Gi",
    max_workers=50,
    concurrency=16,
    timeout_seconds=300,
)


@svc.init_handler()
async def setup(ctx):
    """Runs once per worker before accepting requests."""
    model = load_model("s3://models/latest.bin")
    ctx.set_object("model", model)


@svc.handler(sticky_key="user_id:dataset_id")
async def handle(ctx, req):
    model = ctx.get_object("model")
    data = await ctx.get_or_fetch(req["dataset_path"])
    result = model.query(data, req["question"])
    return {"answer": result}
```
The init_handler runs once when the worker boots, before it starts consuming from Kafka. This is where you load ML models, build indices, or establish connections. The loaded objects persist in SharedWorkerContext._objects and are accessible to every request via ctx.get_object().
The handler receives a WorkerRequestContext with access to:
- ctx.cache — the NVMe-backed LRUCache
- ctx.storage — async S3 client (aioboto3)
- ctx.get_or_fetch(s3_url) — cache-through reads
- ctx.get_object(name) / ctx.set_object(name, obj) — in-memory persistence
- ctx.upload_result(data, key) — write results back to S3
Concurrency is bounded by the RequestExecutor’s semaphore:
```python
async with self._semaphore:  # asyncio.Semaphore(concurrency)
    result = await asyncio.wait_for(
        handler(ctx, payload),
        timeout=self._timeout_seconds,
    )
```
This prevents I/O collapse. 16 requests run concurrently; the 17th waits in the Kafka consumer backlog. Backpressure is real and visible — no silent degradation.
This still feels serverless
Despite using long-lived workers, the developer never touches infrastructure:
```bash
memrun deploy handler.py \
  --name matrix-qa \
  --memory 32Gi \
  --disk 600Gi \
  --concurrency 16 \
  --sticky-key user_id:dataset_id
```
What happens behind the scenes:
- The CLI packages handler.py into a tarball with a manifest.json
- Uploads to S3 (memrun-artifacts/deployments/{service}/{deployment_id}/package.tar.gz)
- API creates a DeploymentModel and triggers the Scheduler
- Scheduler.deploy() calls the provisioner for each worker
- For Hetzner: ServerProvisioner generates cloud-init config, boots a VM
- Cloud-init installs mem-worker + uv, downloads the handler package
- WorkerRuntime starts: loads the handler, installs its PEP 723 dependencies (see the sketch after this list), runs init_handler, starts the Kafka consumer
- Worker sends heartbeat to the API every 10 seconds confirming readiness
- Kafka consumer begins pulling from memrun.requests.matrix-qa
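Dependencies travel with the handler file itself. Assuming the standard PEP 723 inline-metadata format that uv understands, the top of handler.py might look like this (the package names are illustrative):

```python
# /// script
# dependencies = [
#     "numpy>=1.26",
#     "pandas>=2.2",
# ]
# ///
from memrun import MemoryService
import numpy as np
import pandas as pd

# ...service and handler definitions as shown earlier...
```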
The developer doesn’t provision machines, configure caches, manage routing tables, or handle retries. They deploy logic. The platform owns the memory.
The abstraction boundary shifts. Traditional serverless abstracts away machines. Serverless warm memory abstracts away machines and cache management and locality routing. The developer’s mental model gets simpler, not more complex.
Why hyperscalers don’t offer this
This model conflicts with hyperscaler economics. They optimize for multi-tenant elasticity. Statelessness maximizes utilization. Warm state reduces fleet fungibility. Dedicated warm memory for one customer is capacity unavailable for another.
From the user's side, though, memory-bound workloads are better served by lower utilization, because locality is worth more than elasticity. This is why people keep rebuilding similar systems: Redis-backed workers, stateful microservices, bespoke analytics backends. They’re all reinventing the same missing abstraction because the platforms don’t provide it.
memrun runs on Hetzner instead of hyperscalers for this reason. A cpx41 (8 vCPU, 32GB RAM, 240GB NVMe) costs ~€29/month. Comparable memory-optimized instances on AWS cost 5-10x more. When memory is the product, commodity hardware with large NVMe is the right substrate.
Cost efficiency through constraint
Something counterintuitive: memory-bound workloads get cheaper when you stop scaling.
With fixed worker pools, predictable memory usage, cheap local NVMe, and minimal data movement, cost becomes linear and stable. You trade peak elasticity for sustained efficiency. For workloads with predictable load and strong locality, this is the right tradeoff.
The economics flip. Traditional serverless charges for compute time and rewards minimizing it. Serverless warm memory charges for reserved capacity and rewards maximizing cache hit rates. Different cost models for different workload shapes.
This is not better serverless
This category does not replace AWS Lambda, Cloud Run, or App Runner. It complements them.
Use classic serverless when requests are independent, compute dominates, and latency tolerance is high. Use serverless warm memory when data reuse matters, memory bandwidth is the bottleneck, locality is everything, and predictability beats elasticity.
The distinction isn’t quality. It’s fit.
The mental model shift
The shift is from asking “how do I scale this to zero?” to asking “how do I keep the right memory hot?”
Traditional serverless optimizes for the cold path: fast startup, stateless execution, quick teardown. Serverless warm memory optimizes for the warm path: preserve state, reuse computation, minimize data movement.
Both are valid. The question is which matches your workload.
The future of data-intensive systems isn’t just faster CPUs or bigger clusters. It’s treating warm memory and locality as first-class resources. We built memrun to prove it works. The code is the argument.