____ _ _ _
| _ \ _ _ ___| | _| | __ _| | _____
| | | | | | |/ __| |/ / | / _` | |/ / _ \
| |_| | |_| | (__| <| |__| (_| | < __/
|____/ \__,_|\___|_|\_\_____\__,_|_|\_\___|
+ dlt + dbt + S3 + Neon
─────────────────────────────────────────────
The Modern Open Lakehouse on Your Laptop
DuckLake + dlt + dbt: The Open Lakehouse Stack
A complete, runnable demo of a modern lakehouse architecture using entirely open-source tools — no Spark cluster, no Databricks account, no warehouse license required.
This project demonstrates a realistic ELT pipeline where raw data lands as parquet files in S3, gets ingested into a DuckLake catalog via dlt, and is transformed into analytics-ready tables with dbt.
Architecture
┌──────────────────────────────────────────────┐
│ S3 (Object Storage) │
│ │
│ events_landing/ ducklake/raw/ │
│ ├── users.parquet ├── (parquet) │
generate_data.py ────► │ ├── events.parquet ducklake/dev/ │
(source sim) │ ├── invoices.parquet ├── (parquet) │
│ └── support_tickets └── ... │
│ .parquet │
└─────────┬────────────────────────┬───────────┘
│ │
│ read │ write
▼ │
┌───────────────── ────┐ ┌─────────┴─────────┐
│ dlt (ingestion) │ │ dbt (transform) │
│ │ │ │
│ load_raw.py │ │ staging (views) │
│ • reads landing zone│ │ intermediate │
│ • writes to DuckLake│ │ marts (tables) │
│ • adds _dlt metadata│ │ │
└──────────┬───────────┘ └─────────▲─────────┘
│ │
│ catalog ops │ catalog ops
▼ │
┌──────────────────────────-─────────┐
│ Neon Postgres (Metadata DB) │
│ │
│ schema: raw (DuckLake) │
│ schema: dev_analytics (DuckLake) │
└────────────────────────────────────┘
Data flows through three distinct stages:
| Step | Tool | What happens |
|---|---|---|
| Generate | generate_data.py |
Simulates source system data, writes parquet files to S3 landing zone |
| Ingest | load_raw.py (dlt) |
Reads from landing zone, loads into DuckLake RAW layer with lineage metadata |
| Transform | dbt run |
Builds staging views, intermediate aggregations, and mart tables in DuckLake |
What Makes This Interesting
DuckLake is a new open lakehouse format that separates compute (DuckDB), metadata (any SQL database), and storage (any object store). Unlike Iceberg or Delta, the catalog is just Postgres — no HMS, no Unity Catalog, no REST catalog server.
dlt handles ingestion with its native ducklake destination — you get automatic schema evolution, load metadata (_dlt_load_id, _dlt_id), and state tracking out of the box.
dbt-duckdb attaches to DuckLake catalogs natively, reading from raw and writing transformations to a separate dev_analytics (or prod_analytics) DuckLake — all in-process, no server.
The whole pipeline runs on your laptop but hits real cloud infrastructure (S3 for storage, Neon Postgres for metadata). Swap Neon for RDS, swap S3 for GCS — the architecture stays the same.
Tech Stack
| Component | Role | Why |
|---|---|---|
| DuckDB | Query engine | In-process OLAP, zero infrastructure |
| DuckLake | Lakehouse format | Open catalog, Postgres-native metadata |
| dlt | Data ingestion | Python-native EL with built-in ducklake destination |
| dbt | Transformation | SQL-first modeling, staging → intermediate → marts |
| Neon | Metadata catalog | Serverless Postgres for DuckLake metadata |
| S3 | Object storage | Parquet file storage for both landing zone and lakehouse |
| uv | Package manager | Fast Python dependency management |
Project Structure
ducklake/
├── generate_data.py # Step 1: Generate source parquet → S3 landing zone
├── load_raw.py # Step 2: dlt ingestion → DuckLake RAW
├── ducklake_dbt/ # Step 3: dbt transformations
│ ├── dbt_project.yml
│ ├── macros/
│ │ └── generate_schema_name.sql
│ └── models/
│ ├── staging/ # 1:1 views on raw tables
│ │ ├── _sources.yml
│ │ ├── stg_users.sql
│ │ ├── stg_events.sql
│ │ ├── stg_invoices.sql
│ │ └── stg_tickets.sql
│ ├── intermediate/ # Business logic aggregations
│ │ ├── int_revenue_by_user.sql
│ │ └── int_user_events_sessionized.sql
│ └── marts/ # Analytics-ready tables
│ ├── dim_users.sql
│ ├── fct_daily_active_users.sql
│ └── fct_revenue.sql
├── .env.example # Environment variable template
├── run.py # Run the full pipeline (generate → load → transform)
├── pyproject.toml # Python dependencies (managed by uv)
└── uv.lock
Data Model
The demo generates a realistic SaaS product dataset:
Raw tables (loaded via dlt):
users— 2,000 user accounts with plan tiers and geographyevents— 50,000 product events (page views, signups, upgrades, feature usage)invoices— 5,000 billing records with payment statussupport_tickets— 1,500 support requests with priority and resolution
Mart tables (built by dbt):
dim_users— Enriched user dimension with lifetime revenue and support metricsfct_daily_active_users— Daily active users with event type breakdownsfct_revenue— Revenue facts by date, plan, country, and invoice status
Getting Started
Prerequisites
- Python 3.12+
- uv (
curl -LsSf https://astral.sh/uv/install.sh | sh) - AWS CLI with SSO configured (
aws configure sso) - A Postgres database (Neon free tier works great)
- An S3 bucket
Setup
1. Clone and install dependencies
git clone <this-repo> cd ducklake uv sync
2. Configure environment
cp .env.example .env
# Edit .env with your Postgres URL, S3 bucket, and connection details3. Authenticate AWS
aws sso login --profile $AWS_PROFILERun the Pipeline
The easiest way to run the full pipeline is with run.py:
uv run python run.py # 1 batch of source data (default) uv run python run.py 10 # 10 batches (10x activity data)
This runs all three steps in order:
- Generate source parquet files to the S3 landing zone
- Load into DuckLake RAW via dlt (incremental — only new files are loaded)
- Transform with dbt (staging views, intermediate views, mart tables)
You can also run each step individually:
Individual steps
Step 1 — Generate source data to S3 landing zone
uv run python generate_data.py # 1 batch uv run python generate_data.py 10 # 10 batches
Run this as many times as you want — each run adds new activity data for the same stable set of 2,000 users.
Step 2 — Ingest into DuckLake RAW via dlt
uv run python load_raw.py
This reads parquet files from the S3 landing zone and loads them into the DuckLake RAW layer. dlt tracks which files have already been loaded — only new files are picked up on subsequent runs.
Step 3 — Transform with dbt
# Export temp AWS creds (dbt can't use SSO directly) eval "$(aws configure export-credentials --profile $AWS_PROFILE --format env)" # Source env vars and run dbt set -a && source .env && set +a uv run dbt run --project-dir ducklake_dbt
Verify
Query the results directly with DuckDB:
import duckdb con = duckdb.connect() con.execute("INSTALL ducklake; LOAD ducklake;") # ... attach your DuckLake and query away con.execute("SELECT * FROM dev_analytics.marts.dim_users LIMIT 5").show()
How DuckLake Works
DuckLake splits a traditional data warehouse into three independent layers:
┌─────────────────────────────────┐
│ DuckDB (Compute) │ ← Runs anywhere: laptop, CI, Lambda
├─────────────────────────────────┤
│ Postgres (Metadata) │ ← Schema, partitions, snapshots
├─────────────────────────────────┤
│ S3 (Storage) │ ← Parquet files, append-only
└─────────────────────────────────┘
- No vendor lock-in — metadata is Postgres tables, data is Parquet on S3
- Time travel — DuckLake tracks snapshots, so you can query any prior state
- Multi-writer safe — Postgres handles catalog concurrency
- Zero servers — DuckDB runs in-process, Neon and S3 are serverless
Why This Stack
The traditional modern data stack (Fivetran + Snowflake + dbt) costs real money and requires managed services. This stack achieves the same architecture pattern — extract → load → transform — with:
- $0 compute — DuckDB runs locally
- ~$0 metadata — Neon free tier (0.5 GB storage)
- ~$0 storage — S3 costs pennies for demo-scale data
- Full ELT lineage — dlt tracks every load, dbt tracks every transform
- Production-grade patterns — The same separation of concerns scales up
License
MIT