I think it was around 11:30 AM on Tuesday.
I’d just fed my cat Lux and was checking my feed before sitting down for lunch.
There it was: DuckLake.
The tagline caught me first. “SQL as a Lakehouse Format.”
I clicked. I read. I sat there for a long time, not eating.
By that evening, I was already experimenting. Testing the format. Building the repo. Thinking hard about what this means for Iceberg. And by night, I was writing.
DuckLake isn’t an enhancement.
It’s a rethinking.
No manifests. No metastore. No coordination service.
Just a database, some Parquet, and transactions that behave like transactions.
In a space where complexity has calcified, DuckLake felt lucid.
Not naïve-simple, but architecturally clear. Direct. Composable.
This article is what emerged from a few hours of chasing a hunch: checking whether clarity in a lakehouse wasn't just possible, but real.
The Metadata Maze We Built
Let me paint you a picture of modern lakehouse architecture.
You want to update a single row in your data lake. In a file-based format like Iceberg, here’s what happens:
- Write a new Parquet data file containing the updated row (plus a delete file marking the old one)
- Write a new manifest file describing the data file
- Rewrite the manifest list to include the new manifest
- Write a new metadata JSON file capturing the new snapshot
- Atomically swap the catalog pointer to the new metadata file
For a single row change, you just wrote potentially dozens of files and performed a complex coordination dance across eventually-consistent storage.
We built this because we wanted to avoid databases. Databases were “legacy.” They didn’t scale. They locked you in.
But then something funny happened. As our file-based metadata systems grew more complex, we added databases back. Iceberg catalogs. Delta Lake catalogs. Hive metastores. We needed ACID guarantees for our metadata, so we put our metadata in systems that provide ACID guarantees.
We just never questioned why we were still encoding table state in files when we already had a database in the stack.
The DuckDB team questioned it.
DuckLake: The Database Strikes Back
Here’s DuckLake’s radical proposition: if you’re going to use a database for your catalog anyway, why not use it for all your metadata?
No JSON manifests. No Avro files. No complex file hierarchies. Just SQL tables managing SQL tables.
The architecture is almost embarrassingly simple:
- Data files live in object storage (Parquet, like always)
- All metadata lives in a SQL database
- Every operation is a SQL transaction
That’s it. That’s the whole format.
When you insert data into a DuckLake table, here’s what happens:
BEGIN TRANSACTION;
INSERT INTO ducklake_data_file VALUES (...);
INSERT INTO ducklake_table_stats VALUES (...);
INSERT INTO ducklake_snapshot VALUES (...);
COMMIT;

One transaction. A few rows in metadata tables. No file coordination. No eventual consistency. No pointer swapping.
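To make the idea concrete, here is a runnable mirror of that transaction using SQLite as a stand-in for the catalog database (DuckLake would use DuckDB or PostgreSQL). The table names come from the article; the columns are simplified for illustration and do not match DuckLake's real schema.

```python
import sqlite3

# SQLite stands in for the catalog database; columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ducklake_data_file (snapshot_id INTEGER, path TEXT, row_count INTEGER);
    CREATE TABLE ducklake_table_stats (snapshot_id INTEGER, record_count INTEGER);
    CREATE TABLE ducklake_snapshot (snapshot_id INTEGER PRIMARY KEY, committed_at TEXT);
""")

# "with conn" wraps the block in BEGIN ... COMMIT, rolling back on error.
with conn:
    conn.execute("INSERT INTO ducklake_data_file VALUES (1, 's3://lake/data-0001.parquet', 1000)")
    conn.execute("INSERT INTO ducklake_table_stats VALUES (1, 1000)")
    conn.execute("INSERT INTO ducklake_snapshot VALUES (1, '2025-01-15T14:00:00')")

snapshots = conn.execute("SELECT COUNT(*) FROM ducklake_snapshot").fetchone()[0]
print(snapshots)  # 1 — the insert became a single atomic metadata commit
```

If any statement inside the `with` block fails, none of the three metadata rows land. That is the entire commit protocol.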
The simplicity is almost offensive after years of complexity theater.
What This Actually Looks Like
I spent Tuesday evening building a real-time analytics pipeline on DuckLake. The kind of thing that usually requires Kafka, Flink, and a small army of YAML files.
Here’s how it started:
ATTACH 'ducklake:analytics.ducklake' AS lake;
USE lake;

CREATE TABLE events (
event_id VARCHAR PRIMARY KEY,
user_id VARCHAR,
event_type VARCHAR,
timestamp TIMESTAMP,
properties JSON
);
Standard SQL. Nothing fancy. But this table now has superpowers:
Time Travel:
-- What did this table look like an hour ago?
SELECT * FROM events AT (TIMESTAMP => '2025-01-15 14:00:00');

-- Show me version 42 specifically
SELECT * FROM events AT (VERSION => 42);
Change Tracking:
-- What changed between versions 40 and 42?
SELECT * FROM ducklake_table_changes('lake', 'main', 'events', 40, 42);

Multi-Table Transactions:
BEGIN TRANSACTION;
INSERT INTO users VALUES ('user123', 'Alice');
INSERT INTO events VALUES ('evt456', 'user123', 'signup', now(), '{}');
COMMIT;

Both tables update atomically. If either operation fails, neither happens. This is basic database functionality that somehow became revolutionary in the lakehouse world.
The Performance Story
DuckLake’s performance characteristics are different from Iceberg and Delta. Not just faster; fundamentally different.
Query Planning: Instead of making multiple HTTP calls to reconstruct table state from files, DuckLake issues a single SQL query to the metadata database. The database returns exactly the files needed for your query, pre-filtered by partition and statistics.
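The planning step can be sketched as one catalog query. SQLite and the column names below are stand-ins (DuckLake's real metadata schema differs); the point is that partition and min/max pruning happen in a single round trip instead of a manifest crawl.

```python
import sqlite3

# Hypothetical, simplified file-level metadata for an events table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ducklake_data_file (
        path TEXT, partition_day TEXT, min_user_id TEXT, max_user_id TEXT
    )
""")
conn.executemany(
    "INSERT INTO ducklake_data_file VALUES (?, ?, ?, ?)",
    [
        ("s3://lake/f1.parquet", "2025-01-14", "a", "m"),
        ("s3://lake/f2.parquet", "2025-01-15", "a", "m"),
        ("s3://lake/f3.parquet", "2025-01-15", "n", "z"),
    ],
)

# One query replaces the metadata-json / manifest-list / manifest crawl:
# return only files on the right day whose user_id range can contain 'bob'.
files = conn.execute("""
    SELECT path FROM ducklake_data_file
    WHERE partition_day = '2025-01-15'
      AND min_user_id <= 'bob' AND max_user_id >= 'bob'
""").fetchall()
print(files)  # only f2 survives pruning
```

The database does the filtering; the query engine only ever hears about the files it actually needs to open.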
Small Changes: Traditional formats write new metadata files for every change, no matter how small. DuckLake can optionally inline tiny changes directly into the metadata database. A single-row insert becomes a few bytes in a database table, not a new file on object storage.
Concurrent Writes: The bottleneck shifts from file coordination to database transaction throughput. PostgreSQL can handle thousands of transactions per second. That’s thousands of concurrent table updates, not the handful that file-based formats struggle with.
I ran a simple benchmark: 100 concurrent processes, each inserting a single row every second. Iceberg started throwing conflicts after about 10 processes. DuckLake handled all 100 without breaking a sweat.
Demo Scenarios That Actually Matter
I’ve built five demo scenarios that showcase what DuckLake enables. These aren’t toy examples; they’re the real-world patterns that break traditional formats.
Demo 1: Transaction Rollback Safety
The Scenario: An e-commerce system where inventory and orders must stay consistent. What happens when something goes wrong mid-transaction?
The Implementation:
# From demos/01_transaction_rollback/demo.py
def demo_failed_transaction(conn):
    """Demonstrate a failed transaction with rollback."""
    try:
        conn.execute("BEGIN TRANSACTION")
        # Insert order
        conn.execute("""
            INSERT INTO orders (order_id, product_id, quantity, customer_name)
            VALUES (2, 2, 10, 'Bob')
        """)
        # Update inventory
        conn.execute("""
            UPDATE inventory
            SET quantity = quantity - 10
            WHERE product_id = 2
        """)
        # Oops! Duplicate order_id - this will fail
        conn.execute("""
            INSERT INTO orders (order_id, product_id, quantity, customer_name)
            VALUES (2, 3, 5, 'Bob')  -- Duplicate!
        """)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        # Both tables remain unchanged!
Traditional formats can’t roll back across tables. If the order inserts but the inventory update fails, you’re stuck with inconsistent data. DuckLake’s true ACID transactions mean all-or-nothing across any number of tables.
Demo 2: Time Travel Debugging
The Scenario: Customer data was accidentally deleted. When did it happen? What exactly was lost? Can we recover it?
The Implementation:
# From demos/02_time_travel/demo.py
def recover_deleted_data(conn, good_snapshot):
    """Recover the accidentally deleted data."""
    # Count records before recovery
    before_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    print(f"Records before recovery: {before_count}")  # Shows: 0

    # Recover the data from a good snapshot
    conn.execute(f"""
        INSERT INTO customers
        SELECT * FROM customers AT (VERSION => {good_snapshot})
    """)

    # Count records after recovery
    after_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    print(f"Records after recovery: {after_count}")  # Shows: 6
    print(f"Recovered records: {after_count - before_count}")
Data incidents happen. DuckLake provides built-in forensics without external tools. Every change is a snapshot, and you can query any snapshot instantly.
Demo 3: Schema Evolution Without Downtime
The Scenario: Add a new column to a production table while applications are actively writing to it.
The Implementation:
# From demos/03_schema_evolution/demo.py
# Background writer simulates production traffic
def background_writer(catalog_path):
    while keep_writing:  # module-level flag toggled by the demo
        with DuckLakeConnection(catalog_path) as conn:
            conn.execute("""
                INSERT INTO events (event_type, event_data)
                VALUES ('background_write', '{"count": ...}')
            """)

# Meanwhile, schema changes happen transactionally
def add_column_with_default(conn):
    conn.execute("BEGIN TRANSACTION")
    conn.execute("ALTER TABLE events ADD COLUMN priority INTEGER DEFAULT 5")
    conn.execute("COMMIT")
    # Background writer continues without interruption!
Schema changes shouldn’t require downtime. DuckLake makes DDL transactional, so changes are atomic and don’t interfere with concurrent operations.
Demo 4: Small File Optimization
The Scenario: IoT sensors sending data every second. Compare metadata overhead between DuckLake and traditional formats.
The Results:
File System Comparison After 100 Updates:
- Traditional Format: 404 files (101 data files + 303 metadata files)
- DuckLake: 2 files (1 catalog + 1 data file)
- Reduction: 99.5%
Small frequent updates are common in real applications. Traditional formats create a new set of metadata files for each update. DuckLake can inline small changes directly into the catalog, avoiding file explosion entirely.
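A toy version of Demo 4's comparison is easy to run locally. The file layout and table name below are illustrative, not DuckLake's actual on-disk format, and SQLite again stands in for the catalog; the contrast in file counts is the point.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# Traditional pattern: every tiny update lands as its own file (real
# formats would add several metadata files per commit on top of this).
file_dir = os.path.join(workdir, "file_per_update")
os.makedirs(file_dir)
for i in range(100):
    with open(os.path.join(file_dir, f"update-{i:03}.parquet"), "w") as f:
        f.write("one row")
traditional_files = len(os.listdir(file_dir))

# Inlined pattern: the same 100 changes become rows in one catalog file.
catalog_path = os.path.join(workdir, "catalog.db")
conn = sqlite3.connect(catalog_path)
conn.execute("CREATE TABLE inlined_changes (seq INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO inlined_changes VALUES (?, ?)",
    [(i, "one row") for i in range(100)],
)
conn.commit()
inlined_rows = conn.execute("SELECT COUNT(*) FROM inlined_changes").fetchone()[0]

print(traditional_files, inlined_rows)  # 100 files versus one file holding 100 rows
```

Object stores charge you per request and per listing; a hundred rows in one database file is a very different bill than a hundred files plus their metadata.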
Demo 5: Catalog Portability
The Scenario: Start with local development, then seamlessly move to production with PostgreSQL.
The Implementation:
# Development with local catalog
catalog_path = "ducklake:local_dev.ducklake"
with DuckLakeConnection(catalog_path, "dev") as conn:
    conn.execute("CREATE TABLE products (...)")
    # Develop locally

# Production with PostgreSQL catalog - same code!
pg_catalog = "ducklake:postgresql://localhost/ducklake_catalog"
with DuckLakeConnection(pg_catalog, "prod") as conn:
    conn.execute("CREATE TABLE products (...)")
    # Deploy to production
Development to production transitions should be seamless. DuckLake supports multiple catalog backends without code changes.
The Rough Edges
I’m not here to sell you a perfect system. DuckLake is young, and it shows.
Catalog Dependency: Your metadata database becomes a single point of failure. Traditional formats can limp along with stale metadata; DuckLake can’t. You need proper database operations: backups, monitoring, high availability.
SQL Compatibility: DuckLake requires a “real” SQL database with ACID support and primary keys. Your NoSQL document store won’t cut it. This isn’t necessarily bad, but it’s a constraint.
Ecosystem Maturity: Iceberg has years of production hardening and broad tool support. DuckLake is experimental. You’re an early adopter, with all the risks that entails.
Migration Complexity: Moving from Iceberg/Delta to DuckLake isn’t trivial. The data files are compatible, but the metadata models are completely different.
But here’s what surprised me: most of DuckLake’s limitations feel like reasonable trade-offs, not fundamental flaws. You’re trading file-based complexity for database dependency. For most organizations, that’s a good trade.
Where This Goes
I’ve seen enough technology shifts to recognize the shape of this one.
DuckLake isn’t just a new table format. It’s a bet on a different future: one where we stop pretending that files are databases and start using actual databases for database problems.
The implications ripple outward:
Simplified Architectures: When your lakehouse is just “a database and some Parquet files,” your operational complexity drops dramatically. No custom catalog services. No manifest file management. No eventual consistency debugging.
Better Developer Experience: SQL is the lingua franca of data. When everything is SQL (from table creation to time travel to change tracking) the learning curve flattens.
Operational Sanity: Database operations are well-understood. Backup strategies, monitoring, scaling patterns; we’ve solved these problems. File-based metadata coordination? We’re still figuring that out.
Performance Ceiling: Database transaction throughput is your performance ceiling, not file system coordination. Modern databases can handle tens of thousands of transactions per second. That’s a high ceiling.
The DuckDB team isn’t just building a better mousetrap. They’re questioning whether we need mousetraps at all when we have cats.
Try It Yourself
It’s late Tuesday night now. The demos are in place, the benchmarks are fresh, and the questions are still unfolding.
That’s what happens when you find technology that makes hard problems feel solvable again.
DuckLake is available in DuckDB v1.3.0 “Ossivalis.” Installation is three commands:
# Install DuckDB
brew install duckdb  # or your preferred method

# Start DuckDB and install the extension
duckdb
INSTALL ducklake;
I’ve built a collection of demos and benchmarks that showcase what DuckLake enables. Real scenarios, not marketing fluff. The kind of problems that keep data engineers up at night.
👉 https://github.com/TFMV/ducktales
The Moral of the Tale
We spent a decade building increasingly complex systems to avoid databases.
Then we added databases back to manage the complexity.
DuckLake asks a simple question: what if we just used the database from the start?
Sometimes the most radical innovation is remembering what worked before we made it complicated.
The duck has landed. The lake is calling.
Lux is asleep now. I’m still turning it over.