TLDR (but stay for the animations!): Lance is a successor to Iceberg / Delta Lake, more optimized for random reads, and supports adding ad-hoc columns without needing to copy all the data.
Some big things happened in the big data world in 2025:
- Iceberg V3 spec got released and added cool stuff like VARIANT.
- turbopuffer announced vector search on top of object storage (similar to Quickwit).
- Apache Fluss lets Flink manage real-time streams with tiering to object storage.
- Datadog bought Quickwit.
- Databricks bought Neon.
I'm noticing a theme here. If I write about you in a blog post, someone will buy you...
But something way bigger flew completely under my radar, most likely because I was pretty busy building at $DAY_JOB (some pretty cool stuff, I must say).
This thing is called Lance. It's a file format (like Apache Parquet), a table format (like Apache Iceberg), and a catalog spec (like Iceberg's REST catalog spec).
The Lance file format is similar to Parquet, but better optimized for random reads (WHERE id = 123), while still preserving Parquet's performance for sequential scans.
Official docs here.
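To get a feel for it, here's a minimal sketch using the `pylance` Python package (my own example, not from the official docs; the exact API may differ slightly): write a small dataset, then fetch individual rows by position.

```python
# pip install pylance pyarrow
import lance
import pyarrow as pa

# Build a small Arrow table and write it out as a Lance dataset.
n = 100_000
table = pa.table({
    "id": pa.array(range(n)),
    "value": pa.array([i * 2 for i in range(n)]),
})
lance.write_dataset(table, "demo.lance", mode="overwrite")

ds = lance.dataset("demo.lance")

# Random access: fetch specific rows by position without scanning everything.
print(ds.take([123, 4_567, 98_765]).to_pydict())

# A sequential scan of the whole dataset still works, Parquet-style.
print(ds.to_table().num_rows)
```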
Something interesting to test is how Parquet would behave if we configured it to store 64 KB pages instead of the default 1 MB 🤔.
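That's easy to try with PyArrow, which exposes the page size via `data_page_size` (a quick sketch; the 64 KB value is just the experiment above, not a recommendation):

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({"id": pa.array(range(n)), "value": pa.array(range(n))})

# Shrink data pages from the ~1 MB default to 64 KB, so a point lookup
# has to decode far less data per page.
pq.write_table(
    table,
    "small_pages.parquet",
    data_page_size=64 * 1024,  # bytes per data page
)
```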
Lance table format
The Lance table format is similar to Iceberg, but it lets you add columns ad hoc without rewriting all the data just to backfill the new column for every row, while still preserving Iceberg-style MVCC.
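Here's a rough sketch of what that looks like with `pylance`, assuming its `add_columns` API that backfills a new column from a SQL expression (check the docs for the exact signature):

```python
import lance

ds = lance.dataset("demo.lance")

# Backfill a new column from a SQL expression. Existing data files stay
# untouched; only the new column's values get written.
ds.add_columns({"value_doubled": "value * 2"})

print(ds.schema)
```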
Another great feature of Lance tables is that they support indexes: BTree, inverted indexes (full-text search), and vector indexes (e.g. HNSW).
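A hedged sketch of building those indexes on the dataset from above (the index type names are from my memory of the pylance API, so double-check them against the docs):

```python
import lance

ds = lance.dataset("demo.lance")

# Scalar (BTree) index to speed up `WHERE id = 123` style lookups.
ds.create_scalar_index("id", index_type="BTREE")

# Inverted index for full-text search on a string column (if you had one):
# ds.create_scalar_index("title", index_type="INVERTED")

# Vector index on an embedding column (IVF_PQ here; HNSW variants exist too):
# ds.create_index("embedding", index_type="IVF_PQ",
#                 num_partitions=256, num_sub_vectors=16)
```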
Official docs here.
Thanks to AI?
Apparently there's another open-source file format competing with Parquet, called Vortex, created by SpiralDB, which seems like a direct competitor to LanceDB.
These technologies only came about because of a need for multi-modal data lakes now that AI is so prevalent.
I wonder what other technologies will come from this AI software era.