Data Engineering Whitepapers

A curated list of influential whitepapers in the field of data engineering.

# Data Lakehouse

Data Lakehouse combines the best of data warehouses and data lakes into a single architecture.

Foundational papers on distributed systems that power modern data infrastructure at scale.

Data warehouses and OLAP systems optimized for analytical queries over large datasets.

Data Warehousing: Dremel: Interactive Analysis of Web-Scale Datasets
Dremel Encoding: Dremel: Interactive Analysis of Web-Scale Datasets (Year 2011). The encoding is used in Apache Parquet and Apache Drill (whose code seeded Apache Arrow). ^51fc46
DWA: DWA Whitepapers
Redshift Files: Why TPC Is Not Enough: An Analysis of the Amazon Redshift Fleet ^2fbdb2
OLAP - ClickHouse: Lightning Fast Analytics for Everyone
The Snowflake Elastic Data Warehouse: “The mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or “Snowflake” for short.”

Data processing frameworks for batch processing and streaming computation.

DuckDB gets its own category as a single-file OLAP database.

MotherDuck: DuckDB in the cloud and in the client - A paper that introduces the 1.5-tier architecture. ^9f7136
Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Unnesting Arbitrary Queries: How to unnest arbitrary subqueries in SQL.
Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB: This paper introduces FlockMTL, an extension for DuckDB that deeply integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) capabilities into database management systems. See an implementation example
Don’t Hold My Data Hostage – A Case For Client Protocol Redesign: Shows that transferring a large amount of data from a database to a client program is a surprisingly expensive operation, and compares how different clients perform. ^caa24f

All about SQL, the domain-specific language to query databases and more.

Spanner: Becoming a SQL System: Google Spanner evolved from a distributed key-value store into a full SQL database system ^786591A
SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL ^663863
What Goes Around Comes Around by Michael Stonebraker, Joey Hellerstein: This paper summarizes 35 years of data model proposals, grouped into 9 different eras. It discusses the proposals of each era and shows that there are only a few basic data modeling ideas, most of which have been around for a long time. ^37c66e

Relational databases organize data into tables with rows and columns, pioneered by Edgar F. Codd.

NoSQL databases trade relational guarantees for horizontal scalability and flexible schemas.

Schema Evolution addresses how databases handle changes to data structures over time.

Patterns for organizing data assets, governance, and discovery across the enterprise.

Git for Data applies version control concepts to datasets and data pipelines.

Reproducible data science over data lakes: Replayable data pipelines with Bauplan and Nessie ^a13d97
Git for Data Paper by XetData: Proposes a system that extends Git to efficiently handle terabyte-scale machine learning datasets through content-defined chunking and deduplication. ^2e514f
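
To make the terms "content-defined chunking" and "deduplication" from the entry above concrete, here is a minimal Python sketch. It is not the XetData implementation: the gear-style rolling hash, the size limits, and the names `chunk`, `store`, and `restore` are all assumptions chosen for illustration. Chunk boundaries are picked from the bytes themselves instead of at fixed offsets, and each chunk is stored once under its SHA-256 digest.

```python
import hashlib
import random

# Hypothetical parameters for illustration only (not taken from the paper).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # fixed random value per byte
BITS = 6                      # cut when the top 6 hash bits are zero
MIN_SIZE, MAX_SIZE = 32, 256  # chunk size limits in bytes


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by a gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Older bytes shift out of the 32-bit hash, so at a cut point
        # the hash depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = size >= MIN_SIZE and (h >> (32 - BITS)) == 0
        if at_boundary or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store(data: bytes, blobs: dict[str, bytes]) -> list[str]:
    """Store each chunk once, keyed by its SHA-256 digest; return the key list."""
    keys = []
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        blobs.setdefault(key, c)  # identical chunks are stored only once
        keys.append(key)
    return keys


def restore(keys: list[str], blobs: dict[str, bytes]) -> bytes:
    """Rebuild the original bytes from an ordered list of chunk keys."""
    return b"".join(blobs[k] for k in keys)
```

Because boundaries depend on content, inserting a few bytes near the start of a file typically changes only the chunks around the edit, and later chunks keep their digests. Storing both versions through `store` with the same `blobs` dictionary then adds only the changed chunks.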

Academic and industry research on database design, extensibility, and human-data interaction.