Building on Sand vs. Building on Stone
Here’s a truth about databases that took me years to internalize: your storage engine isn’t just another component. It’s the foundation everything else sits on. Get it wrong and you’ll spend the next year rewriting.
When I started this project, I had a choice to make. Pure Elixir elegance with CubDB, or battle-tested C++ power with RocksDB. The benchmarks made the decision obvious, but the reasoning behind those numbers is what I want to share.
What Your Storage Engine Actually Does
Strip away the marketing and a storage engine does five things:
Survives crashes. Write-ahead logs, fsync, careful recovery. Your data must outlive your process.
Handles point operations. get(key) and put(key, value) are the atoms everything else is built from.
Supports range queries. “Give me all keys between A and Z” is non-negotiable for anything beyond caching.
Batches atomically. Writing one key at a time is death by a thousand fsyncs.
Manages space. Deleted data must eventually disappear. Fragmentation must stay bounded.
Almost everything else (compression, column families, bloom filters) is optimization on top of these primitives.
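To make those five responsibilities concrete, here is how I picture the contract, sketched as a hypothetical Elixir behaviour. The module and callback names are mine for illustration, not taken from CubDB or RocksDB:
defmodule MyDB.StorageEngine do
  @moduledoc "Hypothetical contract covering the five storage-engine primitives."

  @type key :: binary()
  @type value :: binary()
  @type db :: term()

  # Point operations: the atoms everything else is built from.
  @callback get(db, key) :: {:ok, value} | :not_found | {:error, term()}
  @callback put(db, key, value) :: :ok | {:error, term()}
  @callback delete(db, key) :: :ok | {:error, term()}

  # Range queries: ordered iteration from first key to last key.
  @callback range(db, first :: key, last :: key) :: Enumerable.t()

  # Atomic batches: many writes, one fsync.
  @callback write_batch(db, [{:put, key, value} | {:delete, key}]) :: :ok | {:error, term()}

  # Crash safety (WAL + fsync) and space management (compaction) are
  # properties of the engine rather than calls the application makes,
  # so they don't show up as callbacks here.
end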
One Engine, Many Data Models
Here’s what’s fascinating: with clever key design, a single ordered key-value store can emulate almost any data model.
Document database? Structure keys as collection:doc_id for documents and collection:doc_id:index:value for secondary indexes. Range scans become index lookups.
Graph database? Encode edges as edge:from:type:to and reverse:to:type:from. Graph traversal becomes prefix scans. You pay 2x write cost for bidirectional traversal.
Time-series? Keys like metric:timestamp:tags give you natural time ordering. Recent data? Scan from the end. Old data? Delete in bulk by key prefix.
The storage engine doesn’t know what data model you’re building. It just sees ordered bytes. Your key encoding is your schema.
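To make that concrete, here is roughly what those key layouts look like as plain Elixir helpers. The separators and field order are my own choices for the example; a real implementation needs an encoding that keeps binary sort order correct (zero-padded timestamps, escaped separators, and so on):
defmodule MyDB.Keys do
  # Document model: one key per document, plus one per secondary index entry.
  def doc(collection, doc_id), do: "#{collection}:#{doc_id}"
  def doc_index(collection, doc_id, index, value), do: "#{collection}:#{doc_id}:#{index}:#{value}"

  # Graph model: forward and reverse edge keys (the 2x write cost).
  def edge(from, type, to), do: "edge:#{from}:#{type}:#{to}"
  def reverse_edge(from, type, to), do: "reverse:#{to}:#{type}:#{from}"

  # Time-series: zero-padded timestamps so lexicographic order equals time order.
  def point(metric, timestamp, tags) do
    ts = timestamp |> Integer.to_string() |> String.pad_leading(19, "0")
    "#{metric}:#{ts}:#{tags}"
  end
end

# A prefix scan over "users:" returns every document and index entry in the
# users collection, in key order; scanning a metric's prefix from the end
# gives the most recent samples first.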
A caveat here: serious time-series and analytics databases use columnar storage for good reason. When you’re working strictly with numbers, you unlock optimizations that key-value stores simply can’t match: run-length encoding for repeated values, delta encoding for timestamps, SIMD operations across columns, and compression ratios that would make RocksDB blush. If your workload is pure analytics, tools like TimescaleDB, ClickHouse, or QuestDB exist for a reason. What we’re building handles time-series as one capability among many, probably good enough for operational data, but not competing with specialized OLAP engines.
The Architecture Trade-offs
Different engines optimize for different workloads. There’s no free lunch.
B-Trees (SQLite, PostgreSQL) give you O(log n) reads and handle mixed workloads well. But random writes cause random I/O, and you fight fragmentation forever.
LSM-Trees (RocksDB, Cassandra) turn random writes into sequential writes, which is perfect for SSDs. The cost? Read amplification (checking multiple files) and write amplification (data gets rewritten during compaction).
Hash-based engines (Redis, Bitcask) give you O(1) lookups but sacrifice range queries entirely. Your keyspace must fit in memory.
For a distributed database with write-heavy workloads and range query requirements, LSM-trees are the right architecture.
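The reason LSM writes are sequential is visible in the write path itself: every put is an append to a log plus an insert into a sorted in-memory table, and sorted files only get written later, in bulk. Here is a toy model of that path, mine rather than RocksDB’s actual internals; real engines add sequence numbers, group commit, bloom filters, and multiple levels:
defmodule LSM.WritePath do
  # Toy model: append to the WAL, update the sorted in-memory memtable,
  # flush to a sorted file once the memtable grows past a threshold.

  def put(state, key, value) do
    :ok = append_to_wal(state.wal_path, {key, value})        # sequential append
    memtable = :gb_trees.enter(key, value, state.memtable)   # sorted, in memory

    if :gb_trees.size(memtable) >= state.flush_threshold do
      flush_to_sstable(memtable)                              # one big sequential write
      %{state | memtable: :gb_trees.empty()}
    else
      %{state | memtable: memtable}
    end
  end

  defp append_to_wal(path, entry),
    do: File.write(path, :erlang.term_to_binary(entry), [:append])

  defp flush_to_sstable(memtable) do
    data = memtable |> :gb_trees.to_list() |> :erlang.term_to_binary()
    File.write("sstable-#{System.unique_integer([:positive])}.dat", data)
  end
end
Reads then have to check the memtable and potentially several on-disk files, which is exactly the read amplification mentioned above.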
The Elixir Dilemma: CubDB vs RocksDB
I wanted to use CubDB. Really, I did.
Pure Elixir means no compilation headaches, no NIF crash risks, easy debugging. CubDB is well-designed and handles concurrency through a clean GenServer interface.
But I ran the benchmarks anyway.
The numbers told a clear story. But the memory usage during batch operations was the real shock: CubDB used 26,000x more memory than RocksDB for the same batch size. It builds the entire batch in memory before writing; RocksDB streams to disk.
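For reference, the shape of the batch-write comparison was roughly the following. This is a from-memory sketch, assuming CubDB’s put_multi/2 and erlang-rocksdb’s :rocksdb.write/3; check the current docs for exact signatures before reusing it:
entries = for i <- 1..100_000, do: {"key:#{i}", :crypto.strong_rand_bytes(128)}

# CubDB: pure Elixir, one atomic commit; the whole batch lives in memory first.
{:ok, cub} = CubDB.start_link(data_dir: "bench_cubdb")
CubDB.put_multi(cub, entries)

# RocksDB via the erlang-rocksdb NIF: the same batch as one atomic write.
{:ok, db} = :rocksdb.open(~c"bench_rocksdb", create_if_missing: true)
:rocksdb.write(db, for({k, v} <- entries, do: {:put, k, v}), [])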
For development, testing, or small embedded use cases, CubDB is excellent. For a distributed database handling production workloads? I needed the horsepower.
Living with NIFs
Here’s the uncomfortable truth about RocksDB in Elixir: it’s a Native Implemented Function. NIFs run native code directly on the BEAM’s scheduler threads, outside the VM’s safety net. A crash in C++ code doesn’t trigger supervisor restarts; it kills your entire VM. The “let it crash” philosophy stops at the NIF boundary.
So how do you live with this?
Isolation. Each shard runs in its own process. A RocksDB crash takes down one shard, not the cluster.
Replication. Data exists on multiple nodes. One node crash doesn’t lose data.
Monitoring. Track NIF call durations. Long-running operations are warning signs.
Testing. Stress tests, chaos engineering, fuzzing. Find the crashes before production does.
RocksDB is mature. Crashes are rare. But you architect for the failure mode anyway.
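The monitoring point is the easiest to make concrete. A thin wrapper that times every call into the NIF and emits a telemetry event is enough to catch operations that start hogging a scheduler. A minimal sketch using :timer.tc/1 and the telemetry library, with made-up event names and an illustrative threshold:
defmodule MyDB.Storage.Instrumented do
  require Logger

  # Illustrative threshold: complain when one NIF call holds a scheduler this long.
  @slow_call_us 1_000

  def timed(operation, fun) do
    {duration_us, result} = :timer.tc(fun)

    :telemetry.execute([:mydb, :storage, :call], %{duration_us: duration_us}, %{operation: operation})

    if duration_us > @slow_call_us do
      Logger.warning("slow storage call #{operation}: #{duration_us}µs")
    end

    result
  end
end

# Every call that crosses the NIF boundary goes through the wrapper, e.g.:
# MyDB.Storage.Instrumented.timed(:get, fn -> :rocksdb.get(db, key, []) end)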
Column Families: The Hidden Superpower
One RocksDB feature deserves special mention: column families. Think of them as logical databases within a single instance.
{:ok, cf_documents} = RocksDB.create_column_family(db, "documents", [])
{:ok, cf_indexes} = RocksDB.create_column_family(db, "indexes", [])
{:ok, cf_metadata} = RocksDB.create_column_family(db, "metadata", [])
Why does this matter?
Different compression per data type: heavy Zstd for documents, light Snappy for indexes. Separate compaction schedules, so frequently updated indexes don’t trigger document rewrites. Independent cache policies, so hot indexes stay cached while cold documents don’t evict them.
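In code, that tuning lives in the options passed when each column family is created. Revisiting the calls above with options filled in; the option names follow RocksDB’s own settings as I recall the erlang-rocksdb binding exposing them, so treat the exact keys as illustrative:
# Large, compressible documents: heavier Zstd, bigger memtables.
{:ok, cf_documents} =
  RocksDB.create_column_family(db, "documents",
    compression: :zstd,
    write_buffer_size: 128 * 1024 * 1024
  )

# Small, hot index entries: cheap Snappy, smaller memtables.
{:ok, cf_indexes} =
  RocksDB.create_column_family(db, "indexes",
    compression: :snappy,
    write_buffer_size: 32 * 1024 * 1024
  )

# Rarely written metadata: defaults are fine.
{:ok, cf_metadata} = RocksDB.create_column_family(db, "metadata", [])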
This is how this project will implement document collections, graph structures, and time-series semantics, all sharing one storage engine, each tuned for its access patterns.
The Write Amplification Tax
LSM-trees have a dirty secret. That 1KB value you wrote? It gets written again when memtables flush. And again during L0->L1 compaction. And again for L1->L2. A single logical write can become 10–20 physical writes.
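A toy trace makes the multiplication easier to see. The rewrite count here is illustrative, one pass per level for a value that settles in L3; real counts depend on the compaction strategy and how often levels overlap:
# One logical 1 KB put, counted each time the bytes hit disk again.
writes = [
  wal: 1_024,            # appended to the write-ahead log
  memtable_flush: 1_024, # flushed from the memtable into an L0 SSTable
  l0_to_l1: 1_024,       # rewritten when L0 files compact into L1
  l1_to_l2: 1_024,       # rewritten again as L1 compacts into L2
  l2_to_l3: 1_024        # ...and once more for each level the data falls through
]

physical_bytes = writes |> Keyword.values() |> Enum.sum()
amplification = physical_bytes / 1_024
# => 5.0 for this simplified trace; with deeper trees and repeated compactions, 10-20x is common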
For SSDs, this matters. It impacts both throughput and drive lifespan.
The tuning knobs exist: larger memtables, fewer compaction levels, different compaction strategies. But you’re always trading something. Lower write amplification usually means higher read amplification or space amplification.
This is the tax you pay for sequential I/O patterns and excellent write throughput. Worth it for write-heavy distributed workloads. Not worth it for read-dominated single-node databases.
What I Learned from Production Systems
Before committing to RocksDB, I studied how others use it.
CockroachDB documented their RocksDB journey extensively; write amplification and compaction tuning were their biggest challenges. TiKV contributed improvements back upstream and found similar issues at scale. MyRocks at Meta handles massive throughput but required custom patches.
The lesson: default settings are rarely optimal. RocksDB is powerful, but you’ll spend time understanding its internals.
The Foundation is Set
With RocksDB as our storage layer, we have:
- Durable, crash-safe persistence
- Efficient point reads and range queries
- Column families for different data types
- Battle-tested code running at massive scale
Next post, we’ll build on this foundation with distributed consensus: how Raft keeps our metadata consistent, how we separate the CP metadata plane from the AP data plane, and how keys actually route to the right shard.
What storage engines have you used in production? Did write amplification ever bite you?