A hypothetical search engine on S3 with Tantivy and warm cache on NVMe


I’ve been curious about how far you can push object storage as a foundation for database-like systems. In previous posts, I explored moving JSON data from PostgreSQL to Parquet on S3 and building MVCC-style tables with constant-time deletes using S3’s conditional writes. These experiments showed that decoupling storage from compute unlocks interesting trade-offs: lower costs and simpler operations in exchange for higher cold-query latency.

Search engines traditionally don’t fit this model. These systems commonly require always-on clusters with fast disks and enough RAM to keep indexes hot. When data volume grows, costs scale linearly because the search index and the data it represents must both live on expensive compute-attached storage. This works fine for high-QPS workloads where every millisecond matters and query volume justifies the infrastructure cost. But for low-to-medium-volume search, where data is large and queries are infrequent, paying for always-on clusters feels wasteful.

This post continues that exploration. Can we build a practical search system where the storage lives entirely on S3 and compute scales independently? Using Tantivy as the index format, let’s design a system where indexers build immutable shard directories and publish them to S3 as frozen snapshots. Query nodes lazily mirror shards to local NVMe and serve searches from that fast local copy. The design stays simple (mostly :D) by keeping all components stateless and using S3 as the only durable store with CAS-based coordination.

Consider a typical search deployment serving 100 million documents. A traditional search cluster might need several nodes with SSDs to hold the index, plus replicas for availability. Even at moderate query rates, you’re paying for that infrastructure around the clock. The index takes up space, the nodes consume CPU and memory, and scaling means adding more nodes with their own copies of the data.

The underlying issue is coupling storage with compute. Search indexes are write-once structures that change only when new data arrives or old data is compacted. Yet they sit on expensive compute-attached storage because query latency demands fast local access. This made sense when object storage had high latency and limited throughput, but S3 and similar systems now support hundreds of MB/s per object and sub-10ms first-byte latency for reasonably sized requests.

What if we could store the index on S3 at $0.023/GB/month and only spin up compute when queries arrive? The challenge is making queries fast enough despite the round-trip cost to object storage.

Tantivy as the index format

Tantivy is a full-text search library written in Rust, inspired by Apache Lucene. It builds inverted indexes using standard techniques like term dictionaries, posting lists with skip data, and BM25 scoring. The important characteristic for our purposes is that Tantivy stores its index as a directory of immutable segment files on disk. Each segment contains a self-contained portion of the index with its own term dictionary, postings, and stored fields.

When you write documents to Tantivy, they accumulate in memory until you commit. The commit operation writes new segment files to disk and updates a meta.json file that lists all current segments. Subsequent commits create additional segments. Background merging combines small segments into larger ones to keep query performance reasonable. Once written, segment files never change. Deletes are tracked in separate tombstone files, and updates are implemented as delete-then-insert.
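
As a rough sketch, building and committing one shard looks something like this (the field names and the 50 MB writer budget are arbitrary choices here, and exact signatures vary slightly between Tantivy releases):

use tantivy::schema::{Schema, STORED, STRING, TEXT};
use tantivy::{doc, Index};

fn build_shard(dir: &str) -> tantivy::Result<()> {
    // Define a schema: a raw-tokenized id plus two full-text fields.
    let mut schema_builder = Schema::builder();
    let doc_id = schema_builder.add_text_field("doc_id", STRING | STORED);
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // The index is just a directory of segment files on local disk.
    let index = Index::create_in_dir(dir, schema)?;
    let mut writer = index.writer(50_000_000)?; // 50 MB indexing buffer

    writer.add_document(doc!(
        doc_id => "doc_12345",
        title => "Building search on S3",
        body => "distributed search on object storage"
    ))?;

    // Commit writes new segment files and updates meta.json; segments never change afterwards.
    writer.commit()?;
    Ok(())
}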

This immutable-segment design is exactly what we need for S3. We can build indexes locally, freeze them by disabling further writes, and upload the entire directory to S3 as an immutable snapshot. Query nodes can then download these segments once and reuse them for many queries.

Architecture overview

The system has three types of components and one shared storage layer:

Client writes                              Client queries
      ↓                                           ↓
┌──────────────┐                          ┌──────────────┐
│  WAL on S3   │                          │   Router     │
│ (durable log)│                          │ (rendezvous) │
└──────┬───────┘                          └──────┬───────┘
       │                                         │
       │                                         ↓
       ↓                               ┌─────────────────┐
┌──────────────┐                       │  Query Node     │
│   Indexer    │                       │                 │
│  (stateless) │                       │  Reads _head    │
│              │                       │  (version ptr)  │
│ Builds shard │──CAS publish─────────▶│                 │
│ Tantivy dirs │         │             │  Mirrors shards │
└──────────────┘         │             │  to /mnt/nvme/  │
                         │             │                 │
                         ↓             │  Opens Tantivy  │
                  ┌────────────────┐   │  MmapDirectory  │
                  │   S3 Storage   │   │                 │
                  │                │   │  Runs BM25      │
                  │ _head (CAS)    │   │  Returns top-k  │
                  │ index/v000123/ │   └─────────────────┘
                  │   shard_0000/  │
                  │   shard_0001/  │
                  │   ...          │
                  │                │
                  │ wal/NNN.log    │
                  └────────────────┘

Writes go to S3 as append-only log entries. Indexers consume the log, build Tantivy shards, and publish immutable index epochs to S3. Query nodes use consistent hashing to pick a home node for each namespace, mirror the needed shards to local NVMe, and serve queries from that local copy. S3 acts as the single source of truth; everything else is ephemeral.

Write path and durability

When a client wants to index documents, it sends a batch to the write endpoint. The system appends this batch to a write-ahead log stored on S3. Each log entry is a standalone object with a monotonically increasing sequence number:

s3://search/namespace_acme/wal/00000001.log
s3://search/namespace_acme/wal/00000002.log
s3://search/namespace_acme/wal/00000003.log

The write returns success only after S3 confirms the object was written. This makes writes durable immediately, though query visibility lags until indexing completes. For applications that can tolerate a few seconds of indexing delay, this trade-off eliminates the need for complex replication protocols or consensus algorithms. The write path is a simple PUT to S3 with standard retry logic.
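
A minimal sketch of the WAL append using the AWS Rust SDK, assuming an SDK version with S3 conditional-write support (the bucket name and key layout follow the example above; the If-None-Match: * condition rejects the PUT if another writer already claimed that sequence number, so the caller can bump the sequence and retry):

use aws_sdk_s3::{primitives::ByteStream, Client};

// Append one WAL entry as a standalone object with a zero-padded sequence number.
// If another writer already owns this sequence number, S3 returns 412 and we retry with seq + 1.
async fn append_wal(
    client: &Client,
    namespace: &str,
    seq: u64,
    batch: Vec<u8>,
) -> Result<(), aws_sdk_s3::Error> {
    let key = format!("{namespace}/wal/{seq:08}.log");
    client
        .put_object()
        .bucket("search")
        .key(key)
        .if_none_match("*") // conditional write: only succeed if the object does not exist yet
        .body(ByteStream::from(batch))
        .send()
        .await?;
    Ok(())
}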

If you need strong consistency where a query immediately sees just-written documents, you can implement a fallback path that scans recent WAL entries on the fly. We’ll discuss that later, but the basic system treats indexing as asynchronous.

Indexer design

The indexer is a stateless process that watches for new WAL entries, builds Tantivy indexes, and publishes completed shards to S3. It can run as a scheduled job, a long-lived worker polling a queue, or triggered by S3 event notifications. The important characteristic is that it holds no persistent state across runs. All information comes from S3 and all output goes back to S3.

The indexing flow looks like this:

  1. Fetch the current head pointer to learn the latest published index version
  2. List WAL entries newer than the last indexed position
  3. Download and decode those WAL entries into document batches
  4. Partition documents by shard: shard_id = hash(namespace, doc_id) % num_shards
  5. For each shard, open a Tantivy index in a local temporary directory
  6. Add documents to the Tantivy writer and commit
  7. Run a merge to consolidate segments if needed
  8. Upload the entire shard directory to s3://search/{namespace}/index/v{next}/shard_{id}/
  9. Compare-and-swap the _head pointer to v{next} (atomic publish)
  10. If CAS conflicts (412), retry or abort; GC unreferenced epochs later

The shard count is chosen based on expected data volume and desired parallelism. For a namespace with tens of millions of documents, 32 or 64 shards might be reasonable. Each shard ends up as a standalone Tantivy index of a few hundred MB to a few GB. This size is small enough to mirror quickly but large enough to amortize S3 request overhead.
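
The partitioning step from the list above is just a deterministic hash. A sketch (std’s DefaultHasher keeps it dependency-free, though a production system would want a hash that is stable across builds, such as a seeded xxHash):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// shard_id = hash(namespace, doc_id) % num_shards
fn shard_for(namespace: &str, doc_id: &str, num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    namespace.hash(&mut hasher);
    doc_id.hash(&mut hasher);
    hasher.finish() % num_shards
}

// shard_for("namespace_acme", "doc_12345", 32) always lands on the same shard,
// which is what makes partition pruning by the shard key possible later on.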

The indexer workflow in detail:

Indexer Process (stateless, can run anywhere)

  ┌─────────────────────────────────────────────────────┐
  │ 1. Discover latest published version from S3        │
  │    GET _head                                        │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 2. List WAL entries newer than last indexed         │
  │    s3://ns/wal/00000124.log, 00000125.log, ...      │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 3. Download and decode WAL batches                  │
  │    Parse JSON/protobuf documents                    │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 4. Partition by shard: hash(doc_id) % num_shards    │
  │    doc_1 → shard_7, doc_2 → shard_14, ...           │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 5. Build Tantivy indexes in /tmp/                   │
  │    For each shard:                                  │
  │      - Create Tantivy writer                        │
  │      - Add documents                                │
  │      - Commit (write segment files)                 │
  │      - Merge segments if needed                     │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 6. Upload shard directories to S3                   │
  │    PUT /tmp/shard_0000/* → s3://ns/index/vNext/...  │
  │    PUT /tmp/shard_0001/* → s3://ns/index/vNext/...  │
  │    ... (all shards)                                 │
  └────────────────┬────────────────────────────────────┘
                   ↓
  ┌─────────────────────────────────────────────────────┐
  │ 7. Publish atomically                               │
  │    CAS PUT _head → {"version": vNext, ...}            │
  └────────────────┬────────────────────────────────────┘
                   ↓
               [Complete]

The indexer is a simple batch process with no persistent state. If it crashes mid-run, you can restart it and it will pick up where it left off by reading the latest published version from S3. The actual Tantivy interaction would be in Rust or via language bindings.

Publishing epochs with a CAS head pointer

To make a new index version visible atomically, we use a small head pointer object that gets updated via compare-and-swap after all shards are uploaded. This approach is similar to the manifest pointer pattern I described in An MVCC-like columnar table on S3, where S3’s conditional write primitives enable coordination without external services.

The S3 layout looks like this:

s3://search/namespace_acme/
├── _head                              ← mutable pointer (CAS only)
│   {"version": 123, "published_at": "2025-11-11T10:23:45Z"}
│
├── index/v000123/                     ← immutable epoch
│   ├── shard_0000/{tantivy segment files}
│   ├── shard_0001/{tantivy segment files}
│   └── ...
│
└── wal/
    ├── 00000124.log
    └── ...

The _head object acts as the single source of truth for the current index version. When an indexer completes building a new epoch, it follows this protocol:

Indexer publish protocol

  ┌──────────────────────────────────────────┐
  │ 1. Upload all shards to index/vNext/     │
  │    (immutable, write-once)               │
  └────────────────┬─────────────────────────┘
                   ↓
  ┌──────────────────────────────────────────┐
  │ 2. GET _head to read current version     │
  │    Response: ETag "abc123", body {"v":N} │
  └────────────────┬─────────────────────────┘
                   ↓
  ┌──────────────────────────────────────────┐
  │ 3. CAS write new _head                   │
  │    PUT _head                             │
  │    If-Match: "abc123"                    │
  │    Body: {"version": N+1, ...}           │
  └────────────────┬─────────────────────────┘
                   ↓
  ┌──────────────────────────────────────────┐
  │ Success: 200 OK                          │
  │   → vNext is now visible                 │
  │                                          │
  │ Conflict: 412 Precondition Failed        │
  │   → Another indexer published first      │
  │   → Retry or abort                       │
  └──────────────────────────────────────────┘

The If-Match header ensures that only one indexer can successfully update the head pointer to a given version. If multiple indexers race, exactly one wins the CAS and the others get a 412 error. The losing indexers’ epochs remain valid on S3 (they’re immutable and complete), but they’re not referenced by _head. A garbage collection job can clean up these orphaned epochs later based on age or a retention policy.
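
A sketch of the publish step with the AWS Rust SDK, again assuming conditional-write (If-Match on PUT) support, which S3 and its SDKs gained relatively recently; the very first publish, when _head does not exist yet, would use If-None-Match: * instead:

use aws_sdk_s3::{primitives::ByteStream, Client};

// Read the current head, then atomically swap it to the new version.
// If another indexer published first, the If-Match fails with 412 and the caller retries or aborts.
async fn publish_head(
    client: &Client,
    namespace: &str,
    next_version: u64,
) -> Result<(), aws_sdk_s3::Error> {
    let key = format!("{namespace}/_head");

    // Step 2: read current _head and remember its ETag.
    let current = client.get_object().bucket("search").key(&key).send().await?;
    let etag = current.e_tag().unwrap_or_default().to_string();

    // Step 3: CAS write the new pointer; only succeeds if _head is unchanged since we read it.
    // A real implementation would also include published_at and other metadata in the body.
    let body = format!(r#"{{"version": {next_version}}}"#);
    client
        .put_object()
        .bucket("search")
        .key(&key)
        .if_match(etag)
        .body(ByteStream::from(body.into_bytes()))
        .send()
        .await?;
    Ok(())
}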

Query nodes discover the current version by reading _head:

Query node version discovery

  ┌──────────────────────────────────────────┐
  │ Check local version cache (TTL: 30s)     │
  └────────────────┬─────────────────────────┘
                   ↓
            Cached & fresh?
                   │
         ┌─────────┴─────────┐
         │                   │
        Yes                 No
         │                   │
         ↓                   ↓
    Use cached      ┌──────────────────────┐
    version         │ GET _head from S3    │
                    │ Cache response + ETag│
                    └──────────┬───────────┘
                               ↓
                    ┌──────────────────────┐
                    │ Compare with local   │
                    │ shard version        │
                    └──────────┬───────────┘
                               ↓
                      Version changed?
                               │
                    ┌──────────┴───────────┐
                   Yes                    No
                    │                      │
                    ↓                      ↓
         Mirror new shards        Use local shards
         from S3 to NVMe         (fast path)

The version cache with a 30-second TTL means a query node hits S3 for a version check at most roughly once per 30 seconds per namespace, not on every query. The ~15ms GET amortizes across every query that arrives within the same window; even in the worst case of queries spaced more than 30 seconds apart, each query pays a single extra ~15ms GET, which is small next to query execution and negligible next to a cold mirror.

This design provides atomic visibility without coordination beyond S3’s conditional writes. Multiple indexers can safely race by using CAS on the head pointer. Readers always see a complete, consistent epoch because the head pointer only updates after all shards are uploaded. If an indexer crashes mid-upload, the incomplete epoch remains invisible because _head still points to the previous version.

Query routing with rendezvous hashing

When a query arrives, we need to decide which query node should handle it. The goal is to route queries for the same namespace (or namespace and shard combination) to the same node as much as possible. This maximizes the chance that the needed Tantivy shard is already mirrored to that node’s NVMe, avoiding a cold fetch from S3.

Rendezvous hashing (also called highest random weight hashing) is a good fit here. For each query, we compute a hash score for every available node by combining the query key (namespace or namespace plus shard ID) with the node’s identifier. The node with the highest score wins. When a node is added or removed, only the keys that would have landed on that node are remapped. Everything else stays put, preserving cache locality better than naive modulo hashing.
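
Rendezvous hashing itself is only a few lines. A sketch (with the same caveat as earlier about preferring a build-stable hash in production):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Highest-random-weight hashing: score every node for the key, pick the maximum.
fn pick_node<'a>(key: &str, nodes: &'a [String]) -> Option<&'a str> {
    nodes
        .iter()
        .max_by_key(|node| {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            node.hash(&mut h);
            h.finish()
        })
        .map(|node| node.as_str())
}

fn main() {
    let nodes = vec!["node_a".to_string(), "node_b".to_string(), "node_c".to_string()];
    // All queries for a namespace land on the same node as long as the node set is stable.
    println!("{:?}", pick_node("namespace_acme", &nodes));
    // Removing a node only remaps the keys that were pinned to it; everything else keeps its assignment.
}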

Routing, caching, and indexing architecture:

┌─────────────┐
│   Clients   │
└──────┬──────┘
       │ queries
       ↓
┌────────────────────────────────────────────────────────────────────┐
│                         Router / Load Balancer                     │
│                                                                    │
│  Rendezvous hash: compute score for each node                     │
│    hash(namespace | node_id) → pick highest                       │
│                                                                    │
│  ns=acme   → node_b  (score: 0xa1)  ← cache sticky                │
│  ns=widgets → node_a (score: 0x9e)  ← cache sticky                │
└────────┬──────────────────────────────────┬────────────────────────┘
         │                                  │
         ↓                                  ↓
┌─────────────────────┐          ┌─────────────────────┐
│   Query Node A      │          │   Query Node B      │
│                     │          │                     │
│  NVMe Cache:        │          │  NVMe Cache:        │
│  ┌────────────────┐ │          │  ┌────────────────┐ │
│  │ ns=widgets     │ │          │  │ ns=acme (warm) │ │
│  │ /mnt/nvme/     │ │          │  │ /mnt/nvme/     │ │
│  │  widgets/      │ │          │  │  acme/v123/    │ │
│  │   v123/        │ │          │  │   shard_0000/  │ │
│  │   shard_0000/  │ │          │  │   shard_0001/  │ │
│  └────────────────┘ │          │  └────────────────┘ │
│                     │          │                     │
│  Tantivy MmapDir    │          │  Tantivy MmapDir    │
│  BM25 queries       │          │  BM25 queries       │
└──────────┬──────────┘          └──────────┬──────────┘
           │                                │
           │ on cache miss                  │
           │ GET _head, mirror shards       │
           │                                │
           └────────────────┬───────────────┘
                            ↓
           ┌─────────────────────────────────────┐
           │         S3 Storage (durable)        │
           │                                     │
           │  _head (mutable, CAS)               │
           │    {"version": 123, ...}            │
           │                                     │
           │  index/v000123/  (immutable)        │
           │    shard_0000/{tantivy segments}    │
           │    shard_0001/{tantivy segments}    │
           │    ...                              │
           │                                     │
           │  wal/00000124.log (append-only)     │
           └─────────────────┬───────────────────┘
                             ↑
                             │ writes
                             │
                ┌────────────┴──────────────┐
                │  Indexer (stateless)      │
                │                           │
                │  1. Consume WAL           │
                │  2. Build Tantivy shards  │
                │  3. Upload to S3          │
                │  4. CAS update _head      │
                └───────────────────────────┘

Query flow:

  • Client → Router hashes namespace → picks node with highest score
  • Node checks NVMe cache:
      • Hit (warm): query Tantivy locally (~10ms)
      • Miss (cold): GET _head, mirror shards (~4s first time)
  • All queries for same namespace hit same node → cache stays hot

Indexer flow:

  • Background: consume WAL → build shards → upload → CAS _head
  • Next query discovers new version via _head (cached 30s TTL)

The router is stateless and deterministic. Any router instance can compute the same node assignment given the same node list. In practice, you’d run a small load balancer or API gateway that maintains the current node set via health checks or service discovery and forwards queries to the chosen node.

Key properties of this routing:

  • Cache locality: All queries for namespace acme go to node_b, keeping its NVMe cache hot
  • Minimal remapping: If node_c fails, only namespaces hashed to node_c remap (roughly 1/N)
  • Cold start tolerance: First query to a namespace on a node mirrors shards (~4 seconds), subsequent queries hit warm cache (~10ms)
  • Version discovery: Each node independently discovers new versions by reading _head with TTL cache

If you want fault tolerance, you can compute scores for the top two or three nodes and send hedged requests. If the primary node is slow (perhaps fetching shards from S3 for the first time), the backup node can answer. This adds complexity but can significantly improve tail latency without coordination.

Query node design

The query node’s job is to accept search requests, ensure the needed Tantivy shard is available locally, execute the search, and return results. The core insight is that Tantivy’s MmapDirectory lets it read index files via memory-mapped I/O. If those files are on fast local NVMe, queries run at nearly the same speed as a traditional in-memory index (I have only run very basic tests, so YMMV). The trick is getting the files onto NVMe efficiently.

When a query arrives for a namespace and shard, the node follows this sequence:

  1. Discover current version by reading _head (cached with 30s TTL)
  2. Check if the shard directory exists locally at /mnt/nvme/{namespace}/index/v{version}/shard_{id}/
  3. If missing or the version changed, mirror the shard from S3 to local NVMe
  4. Open a Tantivy reader on the local directory (MmapDirectory)
  5. Parse the query, run BM25 scoring with WAND/BlockMax pruning
  6. Return the top-k document IDs and any stored fields

The mirroring step can be done as a simple recursive download of all files under the S3 prefix, or more efficiently by tracking which files already exist locally and their ETags. If the shard hasn’t changed since the last mirror, no download is needed. If only a few segment files were added by a merge, only those new files need to be fetched. Because each query pins to the epoch version it read from _head and shard files live under version-scoped, immutable prefixes, a mirror never mixes bytes from different epochs; per-file If-Match checks against the ETags recorded at listing time add an extra guard against an object being silently replaced.
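
A sketch of that mirroring step with the AWS Rust SDK (bucket and path names follow the layout above; SDK accessor details vary between releases, and a real implementation would paginate the listing, compare ETags against already-mirrored files, and verify checksums):

use aws_sdk_s3::Client;
use std::path::Path;
use tokio::{fs, io::AsyncWriteExt};

// Mirror every object under one shard prefix (e.g. "namespace_acme/index/v000123/shard_0007/")
// to local NVMe, skipping files that already exist locally.
async fn mirror_shard(
    client: &Client,
    prefix: &str,
    local_dir: &Path,
) -> Result<(), Box<dyn std::error::Error>> {
    fs::create_dir_all(local_dir).await?;

    // Single-page listing for brevity; a fuller version would use the paginator.
    let listing = client
        .list_objects_v2()
        .bucket("search")
        .prefix(prefix)
        .send()
        .await?;

    for object in listing.contents() {
        let key = object.key().unwrap_or_default();
        let file_name = key.rsplit('/').next().unwrap_or(key);
        let target = local_dir.join(file_name);
        if target.exists() {
            continue; // already mirrored; an ETag comparison would catch changed files
        }

        // Optionally add .if_match(<etag from the listing>) here to fail instead of
        // silently fetching a replaced object.
        let resp = client.get_object().bucket("search").key(key).send().await?;
        let bytes = resp.body.collect().await?.into_bytes();

        let mut file = fs::File::create(&target).await?;
        file.write_all(&bytes).await?;
    }
    Ok(())
}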

Query API and shard execution

The query interface is a straightforward HTTP REST API. Clients send search requests as JSON over HTTP and get back ranked results. This keeps the system language-agnostic and easy to integrate with any application.

A minimal search endpoint looks like this:

POST /api/v1/{namespace}/search
Content-Type: application/json

{
  "query": "distributed search on object storage",
  "filters": {
    "customer_id": ["acme"],
    "created_at": {
      "gte": "2025-11-01",
      "lt": "2025-11-10"
    }
  },
  "top_k": 20,
  "fields": ["title", "snippet", "created_at"]
}

The response includes the ranked results with their scores and requested fields:

{
  "took_ms": 45,
  "total_hits": 1247,
  "results": [
    {
      "doc_id": "doc_12345",
      "score": 8.42,
      "fields": {
        "title": "Building search on S3",
        "snippet": "...distributed search on object storage with...",
        "created_at": "2025-11-05T14:23:00Z"
      }
    }
  ]
}

Tantivy’s QueryParser handles the query string and supports standard search syntax including term queries, phrase queries in quotes, boolean operators, wildcards, and fuzzy matching. The filters are applied as additional constraints on indexed fields before scoring.
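
On the query node side, executing the parsed query against a mirrored shard is a few lines of Tantivy (a sketch against a recent Tantivy release; the field names are placeholders matching the earlier schema sketch, and a real handler would also apply the filters and fetch stored fields for the response):

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::Index;

// Open the mirrored shard and run a BM25-ranked top-k search.
fn search_local_shard(shard_dir: &str, query_str: &str) -> tantivy::Result<()> {
    // Index::open_in_dir uses MmapDirectory, so reads go through the page cache on NVMe.
    let index = Index::open_in_dir(shard_dir)?;
    let reader = index.reader()?;
    let searcher = reader.searcher();

    let schema = index.schema();
    let title = schema.get_field("title")?;
    let body = schema.get_field("body")?;

    // QueryParser supports terms, quoted phrases, and boolean operators.
    let parser = QueryParser::for_index(&index, vec![title, body]);
    let query = parser.parse_query(query_str)?;

    // BM25 scoring with a top-k collector; returns (score, doc address) pairs.
    let top_docs = searcher.search(&query, &TopDocs::with_limit(20))?;
    for (score, address) in top_docs {
        println!("{score:.2} {address:?}");
    }
    Ok(())
}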

Shard scatter-gather and partition pruning

When a query arrives, the critical question is which shards need to be searched. This depends entirely on whether the query includes filters that align with how documents were partitioned across shards.

Scenario 1: No partition pruning

If documents were sharded randomly or by doc_id, and the query has no identifying filters, we must search all shards:

Query: "distributed search"
Filters: none

Router
  ↓
Query Node (coordinator)
  ├─→ Shard 0 → top-20 local results
  ├─→ Shard 1 → top-20 local results
  ├─→ Shard 2 → top-20 local results
  └─→ ... (all 32 shards)

Merge: 32 × 20 results → global top-20

Each shard returns its local top-k results. The coordinator merges these using a min-heap to produce the global top-k. This scatter-gather pattern is standard in distributed search systems. The coordinator can query shards in parallel if they’re cached locally, or forward sub-queries to other nodes that own those shards via rendezvous hashing.
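
A sketch of that merge step in Rust, keeping a bounded min-heap of the k best hits seen so far (the Hit struct is a stand-in for whatever per-shard result type you actually use):

use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

#[derive(Debug, Clone)]
struct Hit {
    score: f32,
    doc_id: String,
}

// Order hits by score only, using a total ordering over f32.
impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.score.total_cmp(&other.score) == Ordering::Equal
    }
}
impl Eq for Hit {}
impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score.total_cmp(&other.score)
    }
}

// Merge per-shard top-k lists into a global top-k. The heap holds the k best hits seen so far
// with the weakest on top, so each new hit either evicts the weakest or is discarded.
fn merge_top_k(shard_results: Vec<Vec<Hit>>, k: usize) -> Vec<Hit> {
    let mut heap: BinaryHeap<Reverse<Hit>> = BinaryHeap::with_capacity(k + 1);
    for hits in shard_results {
        for hit in hits {
            heap.push(Reverse(hit));
            if heap.len() > k {
                heap.pop(); // drop the current weakest hit
            }
        }
    }
    // Unwind into descending score order for the response.
    let mut merged: Vec<Hit> = heap.into_iter().map(|Reverse(h)| h).collect();
    merged.sort_by(|a, b| b.score.total_cmp(&a.score));
    merged
}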

Scenario 2: Partition-aware filtering (best case)

If documents were sharded deterministically by a field like customer_id and the query includes that filter, we can prune to a single shard:

Query: "distributed search"
Filters: customer_id = "acme"

Router calculates:
  shard = hash("acme") % 32 = 7

Query Node
  └─→ Shard 7 only → top-20 results

Return: shard 7 top-20 directly

This is dramatically more efficient. One shard query instead of 32 means 32× less data scanned, 32× fewer S3 requests on cold queries, and much faster response times. The trade-off is that your sharding strategy must align with common query patterns.

Scenario 3: Time-based partitioning

A common pattern is partitioning by date, especially for time-series data like logs. If the index is organized as separate epochs per day and the query includes a date range, we prune to only relevant epochs:

Query: "error logs"
Filters: created_at >= "2025-11-09"

Partition structure:
  index/v_2025_11_07/  ← skip
  index/v_2025_11_08/  ← skip
  index/v_2025_11_09/  ← query
  index/v_2025_11_10/  ← query
  index/v_2025_11_11/  ← query

Query only: v_2025_11_09, v_2025_11_10, v_2025_11_11
Merge results from 3 epochs

This combines well with retention policies where old epochs are garbage collected after a certain period.

The key insight is that the sharding and partitioning scheme should match your query patterns. If most queries filter by customer or tenant, shard by that field. If most queries are time-range scans, partition by date. Without this alignment, you pay the scatter-gather cost on every query.

Version discovery and cache refresh

Query nodes need to know which index version to use. This happens on a per-query basis with a short-lived cache to avoid hitting S3 on every query.

The discovery flow looks like this:

Query arrives
  ↓
Check version cache (TTL: 30 seconds)
  ├─ Cached & fresh → use cached version
  └─ Expired or missing
       ↓
       GET s3://{namespace}/_head (or scan for the latest _DONE marker)
       ↓
       Update cache with discovered version + ETag
       ↓
Compare local shard version vs discovered version
  ├─ Match → open local shard (fast path)
  └─ Mismatch → mirror new shard from S3 (cold path)
       ↓
       Download shard files from the pinned epoch prefix (verify ETags)
       ↓
       Store in /mnt/nvme/{namespace}/v{version}/shard_{id}/
       ↓
Open Tantivy MmapDirectory on local shard
  ↓
Execute query and return results

The version cache with TTL prevents every query from hitting S3. In practice, a 30-second TTL means that after a new epoch is published, it takes at most 30 seconds for query nodes to discover it. For most workloads this lag is acceptable and dramatically reduces S3 request costs.
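
The cache itself is tiny. A sketch, with the S3 GET of _head abstracted behind a callback (the ETag is kept around for the consistency checks described later):

use std::time::{Duration, Instant};

// Cached _head contents plus when we fetched them.
struct HeadCache {
    version: u64,
    etag: String, // retained so shard listings/downloads can be checked against this epoch
    fetched_at: Instant,
    ttl: Duration,
}

impl HeadCache {
    fn is_fresh(&self) -> bool {
        self.fetched_at.elapsed() < self.ttl
    }
}

// Return the cached version if still fresh, otherwise refetch _head and update the cache.
// `fetch_head` stands in for the actual S3 GET of the _head object.
fn current_version(cache: &mut Option<HeadCache>, fetch_head: impl Fn() -> (u64, String)) -> u64 {
    if let Some(cached) = cache.as_ref() {
        if cached.is_fresh() {
            return cached.version;
        }
    }
    let (version, etag) = fetch_head();
    *cache = Some(HeadCache {
        version,
        etag,
        fetched_at: Instant::now(),
        ttl: Duration::from_secs(30),
    });
    version
}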

An optional optimization is running a background thread that periodically checks for new versions and prefetches shards to NVMe before queries arrive. This amortizes the cold download cost over time rather than making the first query after a publish pay the full penalty. The trade-off is additional S3 bandwidth for prefetching that might not be needed if query patterns are sparse.

If a new epoch is published mid-query (for example, between when the coordinator discovers the version and when it fetches a shard), the query keeps reading from the epoch it pinned at the start; those objects remain on S3 until garbage collection. Only if a pinned object has already been garbage collected, or a per-file ETag check fails, does the node need to re-read _head and retry with the newer version. In practice this window is small because epochs change infrequently and old epochs are retained for a while before GC.

The actual implementation would be in Rust using Tantivy directly or via language bindings. The key point is that after the first mirror, subsequent queries hit local NVMe and run fast. The mirror operation itself is bounded by S3 throughput and shard size. A 500 MB shard might take 5–10 seconds to download on the first query, but then serves thousands of queries without additional S3 traffic.

Caching and NVMe as ephemeral storage

The local NVMe acts as a read-through cache of S3 objects. It’s not durable and can be wiped at any time. If a node restarts or is replaced, it starts with an empty cache and rebuilds it on demand as queries arrive. This is acceptable because the cold query penalty is a one-time cost per shard per node, and subsequent queries are fast.

To keep the cache bounded, you can implement a simple LRU eviction policy. Track the last access time for each shard directory and periodically delete the least recently used ones when disk space runs low. More sophisticated admission policies like TinyLFU can improve hit rates, but basic LRU is usually sufficient to start.
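
A sketch of a bare-bones LRU sweep over the cache directory (it uses file mtimes as a proxy for recency and only walks one level, so it assumes the cache root’s immediate children are shard directories; a real node would track access times and sizes explicitly):

use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Collect shard directories with their size and newest mtime, then delete the least
// recently touched ones until the cache fits under the byte budget again.
fn evict_lru(cache_root: &Path, max_bytes: u64) -> std::io::Result<()> {
    let mut dirs: Vec<(SystemTime, u64, PathBuf)> = Vec::new();
    let mut total: u64 = 0;

    for entry in fs::read_dir(cache_root)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let (size, newest) = dir_stats(&path)?;
        total += size;
        dirs.push((newest, size, path));
    }

    dirs.sort_by_key(|(mtime, _, _)| *mtime); // oldest first
    for (_, size, path) in dirs {
        if total <= max_bytes {
            break;
        }
        fs::remove_dir_all(&path)?;
        total = total.saturating_sub(size);
    }
    Ok(())
}

// Sum of file sizes and the most recent modification time directly inside one shard directory.
fn dir_stats(dir: &Path) -> std::io::Result<(u64, SystemTime)> {
    let mut size = 0;
    let mut newest = SystemTime::UNIX_EPOCH;
    for entry in fs::read_dir(dir)? {
        let meta = entry?.metadata()?;
        size += meta.len();
        newest = newest.max(meta.modified()?);
    }
    Ok((size, newest))
}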

The cache key structure is straightforward: each shard directory is stored at a path that includes the namespace and version. If a new version is published, it gets a new directory and the old one eventually ages out. You don’t need explicit invalidation because queries are pinned to a specific version via the _head pointer.

Consistency and version pinning

Because S3 objects are immutable and shards are frozen after upload, there’s no risk of partially updated indexes. A query either sees an entire epoch or doesn’t see it at all. The _head pointer only updates after all shards are uploaded, so readers never observe incomplete epochs.

To avoid mixing objects from different versions during a race, pin each query to the epoch it discovered: cache the _head response along with its ETag, and fetch shard files only from that epoch’s version-scoped prefix. As an extra guard, record each file’s ETag when listing the epoch and send it as If-Match on the corresponding GET; if an object were ever replaced underneath, S3 would return a 412 Precondition Failed error and the node could re-read _head and retry with the updated version.

This pattern keeps a query internally consistent: all shard files fetched for a single query come from the same epoch. If a new epoch is published mid-query, the pinned epoch is still intact on S3, and the precondition checks catch the rare case where something has changed underneath so the query can restart or fail gracefully.

In practice, version churn is low for most search workloads. New epochs are published every few minutes or hours depending on indexing batch size. Queries typically complete in milliseconds to a few seconds, so the window where a version might change during a query is small.

Handling deletes and updates

Tantivy implements deletes by writing tombstone bitmaps that mark documents as deleted without rewriting segments. When you call delete_term() on an index writer and commit, Tantivy creates a .del file alongside the segment. Queries automatically skip deleted documents when scanning postings.

Updates are handled as delete-then-insert: delete the old document and add a new one. This generates a new segment for the insert and a tombstone for the delete. Over time, segments accumulate tombstones and become less efficient. Background merging recompacts segments and discards deleted documents physically.

In our architecture, each indexer run can apply new deletes and updates, then publish a fresh epoch. The old epoch remains on S3 until garbage collection removes it. Query nodes continue serving the old epoch until they discover the new one, then switch over atomically.

For workloads with frequent deletes, you might want to run incremental indexing and merging more often to keep tombstone overhead low. For append-heavy workloads with rare deletes, you can index in larger batches and merge less frequently.

Optional: Strong reads via WAL tail scan

If you need strong consistency where a query immediately reflects just-written documents, you can add a fallback path in the query node. After running the Tantivy query, scan the recent WAL entries that haven’t yet been indexed and filter/merge those results with the search results.

This adds latency and complexity, so it’s only worth it if your application requires it. The implementation would look like:

  1. Run the normal Tantivy query against the latest epoch
  2. Fetch WAL entries newer than the epoch’s indexed position
  3. Decode and filter those documents against the query locally
  4. Merge the two result sets by score and return the top-k

For most use cases, eventual consistency with a few seconds of lag is acceptable and avoids this extra work.

Optional: Uploading Tantivy indexes to S3 vs rehydration

The design as described uploads Tantivy shard directories to S3 after indexing. This makes sense for durability and fast recovery. However, you could also skip the upload and rely on rehydration from the WAL.

In this variant, indexers build Tantivy indexes locally and keep them in memory or on local disk. Query nodes do the same: when they need a shard, they download the relevant WAL entries and build the Tantivy index on the fly. This avoids the S3 upload step and reduces storage costs, but it means every query node pays the indexing cost on first access.

The trade-off depends on your workload. If query volume is low and index rebuilds are cheap (small data, fast indexing), rehydration might be simpler. If query volume is high or indexing is expensive, uploading prebuilt indexes to S3 amortizes the cost across many queries and nodes.

A hybrid approach is also possible: upload only merged, optimized indexes to S3 but keep recent incremental segments in memory. Query nodes fetch the base index from S3 and layer on recent updates from the WAL. This balances freshness with efficiency.

Putting it together: full query flow with timing

Let’s walk through a concrete query to see where time is spent. Suppose we have a namespace with 10 million documents split across 32 shards. Each shard is about 400 MB on disk. A user queries for “distributed search on object storage” with top_k=20.

Cold query (first time this shard is accessed on this node):

Step 1: Router picks node via rendezvous hashing (0.1ms)
Step 2: Node checks local cache, shard missing (1ms)
Step 3: GET s3://search/ns/_head (15ms)
Step 4: Mirror shard_0007 files (400 MB at 100 MB/s = 4000ms)
Step 5: Open Tantivy MmapDirectory (50ms)
Step 6: Parse query and run BM25 (8ms)
Step 7: Return top-20 results (1ms)

Total: ~4075ms

The cold query is dominated by the shard download. For a 400 MB shard, a few seconds is reasonable. Subsequent queries to the same shard hit local NVMe.

Warm query (shard already local):

Step 1: Router (0.1ms)
Step 2: Node checks local cache, shard exists (0.5ms)
Step 3: Open Tantivy reader (reused from previous query, 0ms)
Step 4: Parse query and run BM25 (8ms)
Step 5: Return results (10ms)

Total: ~20ms

The warm query is nearly two orders of magnitude faster. This is why routing and cache locality matter. If queries for a namespace land on the same node, most of them hit warm cache and feel instant.

Cost comparison

Let’s compare costs for a hypothetical workload to explore the economics. These are rough estimates and actual costs will vary significantly based on instance choice, query patterns, and cache hit rates.

Assumptions: 100 million documents, 10 GB of Tantivy indexes across 32 shards, 100 queries per hour, 24x7 operation.

Traditional search cluster setup:

  • 3 nodes with 500 GB SSD each: ~$1500/month
  • Storage on SSD for indexes + data: included in instance cost
  • Idle most of the time but always running

S3-backed Tantivy design:

  • S3 storage: 10 GB at $0.023/GB/month = $0.23/month
  • Query nodes: 2 instances with 500 GB NVMe, average 50% utilization = $750/month
  • S3 GET requests: 100 queries/hour × 24 × 30 × 2 shards avg per query × $0.0004/1k ≈ $0.06/month
  • S3 bandwidth if every query were cold: 100 queries/hour × 24 × 30 × 400 MB × $0.09/GB ≈ $2,592/month

Assuming a 90% warm hit rate after the cache warms up:

  • Bandwidth cost: $2,592 × 0.1 ≈ $259/month

Estimated total: $0.23 + $750 + $0.06 + $259 ≈ $1,009/month

Important caveats:

These numbers are illustrative only. Real costs vary based on:

  • Instance type and region: i3.xlarge vs i4i.large vs spot instances can change compute costs by 2-5×
  • Cache hit rate: a 50% warm rate pays roughly 5× the cold-query bandwidth of a 90% warm rate
  • Shard fan-out: querying 1 shard vs all 32 shards changes GET request costs by 32×
  • Query patterns: Zipfian (few hot shards) vs uniform (all shards equally accessed) dramatically affects cache efficiency
  • S3 transfer: using VPC endpoints vs internet egress, same-region vs cross-region

The more significant advantage is elasticity. Query nodes can scale to zero during idle periods. If your traffic is bursty (for example, only during business hours or certain days), compute costs drop proportionally. With very low query rates (a few queries per day), S3 costs are negligible and you can run query nodes on-demand for a few dollars per month.

The broader conclusion holds: for low to moderate QPS workloads with large data volumes, decoupling storage and compute can cut costs anywhere from tens of percent (as in the worked example above) to several-fold once you factor in scale-to-zero compute and keeping S3 traffic in-region, with the exact savings depending heavily on your specific patterns.

When this design fits and when it doesn’t

This architecture is a good fit for:

  • Log search, archive search, or infrequent analytical queries over large datasets
  • Multi-tenant systems where most tenants query rarely but data volume is high
  • Workloads where indexing latency of seconds to minutes is acceptable
  • Cases where cost matters more than sub 50ms p99 latency

It’s not a good fit for:

  • High-QPS search serving thousands of queries per second with strict SLA
  • Applications requiring immediate consistency (writes visible in milliseconds)
  • Very small datasets where the overhead of S3 and NVMe mirroring outweighs the simplicity of keeping everything in memory

For high-QPS scenarios, a traditional in-memory search system or a hybrid design with a hot tier on SSDs makes more sense. But for the growing class of workloads where data is large, queries are infrequent, and cost matters, decoupling storage from compute with S3 and NVMe caching offers a compelling alternative.

Operational simplicity

One of the strongest benefits of this design is operational simplicity IMO. There’s no cluster to manage, no replication to configure, no shard rebalancing to tune. Indexers are stateless and can run as batch jobs, pollers, event-driven systems or even Lambda functions (in theory at least). Query nodes are stateless and can be added or removed without coordination. The only stateful component is S3, which AWS operates for you.

Intentionally skipped topics

This design glosses over several practical concerns that would need attention in a production system:

Write throughput and S3 request limits: S3 has per-prefix rate limits (3,500 PUT/s, 5,500 GET/s). High-frequency WAL writes or burst indexer uploads could hit these limits. Solutions exist—batching writes, partitioning WAL by shard, using shard-specific prefixes for indexes—but they add complexity. For moderate write volumes, the basic design works fine.

Version discovery TTL and freshness: The 30-second cache TTL means queries can lag behind the latest published epoch by up to 30 seconds. This is acceptable for many workloads but not all. You can reduce the TTL, add a background watcher thread that prefetches new versions, or reuse the queue the indexer already consumes to notify query nodes when a new epoch is published.

I failed to acknowledge Quickwit and other similar systems and tools in this space, like turbopuffer, earlier (writing is hard).

Quickwit is a production-ready distributed search engine built on Tantivy and S3 that solves these exact problems. I actually wasn’t familiar with some of the intricacies until after writing this post (whoops). Quickwit is fantastic and what you should absolutely use if you need distributed search on object storage. They’ve solved the hard production challenges like exactly-once ingestion from Kafka/Pulsar, sophisticated control plane scheduling, cluster membership with failure detection, and operational tooling.

This post is an architectural exploration and thought experiment + POC (proof of concept), not a practical alternative. I’m exploring what a minimal design might look like (from an educational POV) if you pushed “simplicity” to the extreme.

Wrapping up

This exploration shows that search on object storage is not only feasible but practical for certain workloads. By building on Tantivy’s immutable segment design and S3’s conditional writes, we get a system where indexers publish frozen snapshots and query nodes lazily mirror shards to local NVMe. Rendezvous hashing keeps queries sticky to nodes for cache locality. CAS on a tiny _head pointer ensures atomic epoch visibility without coordination.

The result is a system that’s cheap to run, operationally simple, and scales reads and writes independently. Indexers can process data in batches without impacting query performance. Query nodes can scale up during traffic spikes and down to zero during idle periods. S3 storage costs pennies per gigabyte while NVMe caching keeps query latency reasonable for warm queries.

This won’t replace every search system, but it was informative and interesting for me to explore, and the design patterns here (immutable epochs, CAS coordination, NVMe read-through caching, deterministic routing) apply well beyond search. The same approach works for many systems where you can precompute read-optimized data structures and serve them from cheap object storage with a fast local cache. Many OLAP systems like ClickHouse and StarRocks seem to be adopting similar patterns, and data lake formats like Iceberg and Delta even use the same manifest-pointer style for snapshot isolation.

There’s still plenty of room to push further. Until next time