DuckLake: Why Early-Stage Startups Should Stop Cosplaying as Netflix


Emma Wirt


Let me be extremely clear about my bias upfront: I work at Fika Ventures, a seed to Series A fund. We typically provide enough capital for our founders to get 18–24 months of runway and hire teams of 5–15 engineers, with zero time to spare for distributed systems cosplay.

So when DuckDB launched DuckLake in May this year, I immediately recognized it as the antidote to everything wrong with modern data infrastructure. Here’s my take on why this matters for early-stage companies, and why mature orgs can stick with Iceberg.

The Architecture Decision That Actually Matters at Series A

Imagine this: your startup has 12 engineers. You’re processing 50GB of data daily. Your biggest data challenge is that your CEO wants ‘real-time dashboards’ because your main competitor claims to have them.

And you’re evaluating Apache Iceberg because… what… Uber uses it?

Okay, so let’s talk about what Iceberg actually is: a distributed metadata management system designed for organizations with thousands of engineers and exabytes of data. It assumes you have a dedicated team for your data platform, an infinite S3 budget for storing redundant manifest data, and engineers who understand the nuances of snapshot isolation.

DuckLake makes a different assumption: you have a PostgreSQL instance and you’d like to query some Parquet files. That’s all!
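
Concretely, the whole setup is a handful of statements. This is a minimal sketch based on the DuckLake extension's ATTACH syntax as I understand it; the connection string, bucket, and table are hypothetical, so check the options against your DuckDB/DuckLake version:

INSTALL ducklake;
INSTALL postgres;  -- the catalog lives in PostgreSQL
INSTALL httpfs;    -- the data files live in S3

-- Hypothetical catalog database and bucket
ATTACH 'ducklake:postgres:dbname=lake_catalog host=localhost' AS lake
  (DATA_PATH 's3://my-startup-bucket/lake/');

CREATE TABLE lake.events (user_id BIGINT, event_type VARCHAR, ts TIMESTAMP);
INSERT INTO lake.events VALUES (42, 'signup', TIMESTAMP '2025-11-01 10:00:00');
SELECT event_type, count(*) FROM lake.events GROUP BY event_type;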

The Technical Architecture — Iceberg

Iceberg’s architecture is genuinely brilliant at Netflix scale. Here’s what happens during a simple update:

  1. Client reads current metadata pointer from a catalog
  2. Fetches manifest list (S3 GET)
  3. Fetches relevant manifest files (multiple S3 GETs)
  4. Determines which data files are affected
  5. Writes new data files
  6. Creates new manifest files
  7. Creates new manifest list
  8. Creates new metadata file with complete history
  9. Attempts atomic swap in catalog
  10. Handles conflicts if someone else updated

Each step involves network calls, JSON parsing, and careful coordination. The design assumes S3 eventual consistency is your enemy and that conflicts are common because hundreds of jobs are writing simultaneously.
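
To make step 9 concrete: with a SQL-backed catalog (the JDBC catalog, for instance), the “atomic swap” boils down to a compare-and-swap on a pointer row. This is a conceptual sketch, not Iceberg’s actual code, and the table and column names only approximate the JDBC catalog’s layout:

-- Only succeeds if nobody else committed since we read v123
UPDATE iceberg_tables
SET metadata_location = 's3://bucket/warehouse/events/metadata/v124.metadata.json'
WHERE table_namespace = 'analytics'
  AND table_name = 'events'
  AND metadata_location = 's3://bucket/warehouse/events/metadata/v123.metadata.json';
-- 0 rows updated means a conflict: re-read the metadata, re-plan, retry steps 1-9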

The Technical Architecture — DuckLake

Now here is DuckLake’s architecture for the same update:

BEGIN;
-- update which Parquet files belong to the table
UPDATE ducklake_data_file SET ... WHERE ...;
-- record a new snapshot pointing at the new file set
INSERT INTO ducklake_snapshot VALUES (...);
COMMIT;

It’s just SQL, in a transaction, with ACID guarantees.
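
And from the application’s side you never touch those metadata tables at all; you write ordinary SQL against the attached lake and DuckLake does the catalog bookkeeping inside the same transaction. A sketch with hypothetical tables:

BEGIN;
UPDATE lake.orders SET status = 'shipped' WHERE order_id = 1042;
INSERT INTO lake.order_events VALUES (1042, 'shipped', TIMESTAMP '2025-11-01 10:00:00');
COMMIT;
-- Behind the scenes: a few new Parquet fragments plus a handful of catalog rows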

Why This Matters: The Startup Physics of Data Infrastructure

At the seed stage, your entire dataset fits in memory (or possibly a spreadsheet). At Series A, you’re dealing with gigabytes, maybe terabytes. The physics of your problem are fundamentally different from Netflix’s:

  1. Your entire metadata catalog is <100MB (Netflix: terabytes)
  2. You have <10 concurrent writers (Netflix: thousands)
  3. Your QPS is in the hundreds (Netflix: millions)
  4. You can tolerate 100ms latency (Netflix: not a chance)

Using Iceberg at this stage is like deploying Kubernetes for a Rails app with 100 users. You’re solving problems you don’t have while ignoring the problems you do have.

The Streaming Story Nobody Wants To Admit

Here’s what happens when early-stage companies try to do streaming with traditional lakehouses:

They start with Kafka (because that’s what people do). Then they realize Kafka creates tiny files in S3 (bummer). So they add Flink for micro-batching (nice). But Flink needs state management (what), so they add RocksDB. But now they need checkpointing, so they add more S3. But the small files are still a problem, so they add a compaction job. But the compaction job conflicts with queries, so they add a scheduler…

Six months later, you’ve created 5 systems to maintain and your ‘real-time’ dashboards update every 30 minutes.

DuckLake with Data Inlining says: just INSERT into the catalog database. Inline up to 10k rows in PostgreSQL. Flush to Parquet when convenient. Queries see everything immediately.

-- This is your entire streaming infrastructure
INSERT INTO events VALUES (...);
-- Data is immediately queryable
-- Background process handles Parquet generation
-- No coordination required
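
If you want to control the inlining behavior explicitly, it’s a catalog option plus a flush call. Fair warning: data inlining is an experimental feature, and the option and function names below (DATA_INLINING_ROW_LIMIT, ducklake_flush_inlined_data) are my reading of the DuckLake docs, so treat them as assumptions and verify against your version:

-- Assumed option name: small inserts stay inline in the catalog database
ATTACH 'ducklake:postgres:dbname=lake_catalog' AS lake
  (DATA_PATH 's3://my-startup-bucket/lake/', DATA_INLINING_ROW_LIMIT 1000);

INSERT INTO lake.events VALUES (42, 'click', TIMESTAMP '2025-11-01 10:00:01');

-- Assumed function name: flush inlined rows out to Parquet on your own schedule
CALL ducklake_flush_inlined_data('lake');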

The Interoperability Trap (And Why It Doesn’t Matter)

“But what about vendor lock-in?”

This is simply the wrong question at Series A. The right question is: “Can I ship features fast enough to get to Series B?”

But fine, let’s address it: DuckLake stores your actual data in Parquet files, and the metadata lives in documented SQL tables. Want to migrate? Great, it’s a SQL export and a Python script. Now try migrating between Iceberg catalog implementations. I’ll wait… still waiting…
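
If you ever do leave, the exit is unglamorous: the data is already open Parquet, and the metadata is plain rows in tables defined by the DuckLake spec. A rough sketch, with hypothetical paths:

-- Your data: copy the Parquet out, or just point the new engine at the same bucket
COPY (SELECT * FROM lake.events) TO 's3://new-home/events.parquet' (FORMAT parquet);

-- Your metadata: ordinary rows in the catalog database (query it with psql or DuckDB's postgres extension)
SELECT * FROM ducklake_snapshot;   -- table history
SELECT * FROM ducklake_data_file;  -- which Parquet files belong to which table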

The recent v0.3 DuckLake release (this October) added Iceberg compatibility, which is clever marketing, but it misses the point. If you’re small enough to use DuckLake, you shouldn’t have Iceberg tables to migrate. If you’re big enough to have Iceberg tables worth migrating, you probably shouldn’t be using DuckLake.

When DuckLake is Wrong (I’m not a Zealot) ← I learned that word today

DuckLake is wrong for:

  • Multi-region deployments: PostgreSQL replication across regions is painful. Iceberg’s eventually consistent model actually does make sense here
  • True multi-tenancy at scale: If you’re Snowflake, managing millions of tables, you need something more sophisticated than PostgreSQL schemas
  • Petabyte scale: At some point, even the metadata doesn’t fit in PostgreSQL. This point is way further out than you think (PostgreSQL can handle TB-scale tables), but it exists
  • Organizations with dedicated data platform teams: If you have 10+ engineers just working on data infrastructure, you can afford Iceberg’s complexity. You might even need it

The Uncomfortable Truth About “Modern Data Stacks”

The entire lakehouse movement is built on a false premise: that separation of storage and compute requires reinventing metadata management.

BigQuery uses Spanner for metadata. Snowflake uses FoundationDB. Both handle orders of magnitude more data than most startups ever will. They didn’t reinvent metadata management — they used databases.

DuckLake just makes this pattern accessible to the rest of us. It’s not innovative. It’s obvious. Which is exactly why it’s a great move.

A Simple Framework for Founders

Take this with a grain of salt; I’m just a lover of the duck!

If you’re pre-Series B, here’s your data stack:

  1. Day 1: DuckDB + local files
  2. Product market fit: DuckDB + DuckLake + S3
  3. Series A: Add PostgreSQL for metadata, keep everything else
  4. Series B: Evaluate if you need more complexity (my fingers are crossed you do not)

Total infrastructure cost: <$500/month

Total engineers required: 0.5

Time to implement: 1 week
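
Concretely, steps 1 through 3 are mostly a change to where the catalog and the data live; your queries don’t change. A sketch (you’d run one ATTACH per stage, and every connection string and path here is made up):

-- Day 1: everything local, the catalog is just a DuckDB file
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/');

-- Product-market fit: same catalog, data files move to S3
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-startup-bucket/lake/');

-- Series A: catalog moves to PostgreSQL so multiple writers can share it
ATTACH 'ducklake:postgres:dbname=lake_catalog host=db.internal' AS lake
  (DATA_PATH 's3://my-startup-bucket/lake/');

-- The queries stay the same at every stage
SELECT date_trunc('day', ts) AS day, count(*) FROM lake.events GROUP BY day;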

Compare this to the ‘modern’ approach:

  • Iceberg + Spark + Kafka + Airflow + dbt + Tableau
  • Cost: $15k/month minimum
  • Engineers required: 3–5
  • Time to implement: 3–6 months

The difference? That’s 2 engineers building features for 6 months. That’s the difference between getting to Series B or not.

This is especially critical for fintech and proptech companies, where every engineering hour spent on data infra is an hour not spent on compliance, security, or the domain-specific features that actually differentiate you in these regulated markets.

The Real Innovation

DuckLake’s real innovation isn’t technical. It’s cultural. It’s saying:

  • “You don’t need distributed systems to build a company.”
  • “Your metadata is data. Treat it like data.”
  • “Boring technology is good technology.”

In a world where every startup wants to be ‘data-driven’ but can’t actually query their data because it’s locked behind 17 layers of abstraction, DuckLake offers something radical: simplicity.

Your startup doesn’t need to prepare for Netflix scale. It needs to survive until next year and hopefully many years after that. DuckLake gets this. Most of the data infrastructure world doesn’t.

That’s my stance, and no, we’re not investors, and this is not a paid placement. I tell every portfolio company (and honestly, everyone I meet) to look at DuckLake, not because it’s wildly innovative, but because it’s beautifully simple.

I work at Fika Ventures, where we invest in technical founders solving hard problems with beautifully simple solutions ;). If you’re building something real and need capital (not complexity), reach out. We fund founders and help them make smarter technical decisions: compelling, scalable architectures with minimal technical debt.