A curated list of influential whitepapers in the field of data engineering.
# Data Lakehouse
A data lakehouse combines the strengths of data warehouses and data lakes in a single architecture.
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics ^6a7f75
- Building a Database on S3 (2008) - Before Open Table Formats
# Distributed Systems & Storage
Foundational papers on distributed systems that power modern data infrastructure at scale.
- Google File System (GFS): The Google File System ^fdbf43
- Data Engineering Architecture: The Google File System
- MapReduce: MapReduce: Simplified Data Processing on Large Clusters
# Data Warehousing & OLAP
Data warehouses and OLAP systems optimized for analytical queries over large datasets.
- Data Warehousing: Dremel: Interactive Analysis of Web-Scale Datasets
- Dremel Encoding: Dremel: Interactive Analysis of Web-Scale Datasets (2011). The encoding is used in Apache Parquet and Apache Drill (whose code seeded Apache Arrow). ^51fc46
- DWA: DWA Whitepapers
- Redshift Files: Why TPC Is Not Enough: An Analysis of the Amazon Redshift Fleet ^2fbdb2
- OLAP - ClickHouse: Lightning Fast Analytics for Everyone
- The Snowflake Elastic Data Warehouse: “The mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or ‘Snowflake’ for short.”
# Processing Engines
Data processing frameworks for batch processing and streaming computation.
- Apache Spark: Spark: Cluster Computing with Working Sets
- Streaming: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- Vectorized Engine: Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask ^db4178
# DuckDB
DuckDB gets its own category as a single-file OLAP database.
- MotherDuck: DuckDB in the cloud and in the client: A paper that introduces the 1.5-tier architecture. ^9f7136
- Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
- Unnesting Arbitrary Queries: How to unnest arbitrary subqueries in SQL.
- Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB: This paper introduces FlockMTL, an extension for DuckDB that deeply integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) capabilities into database management systems. See an implementation example
- Don’t Hold My Data Hostage – A Case For Client Protocol Redesign: Shows that transferring a large amount of data from a database to a client program is a surprisingly expensive operation, and compares how different clients perform. ^caa24f
# SQL
All about SQL, the domain-specific language to query databases and more.
- Spanner: Becoming a SQL System: Google Spanner evolved from a distributed key-value store into a full SQL database system ^786591A
- SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL ^663863
- What Goes Around Comes Around by Michael Stonebraker and Joey Hellerstein: This paper summarizes 35 years of data model proposals, grouped into 9 different eras. It presents the proposals of each era and shows that there are only a few basic data modeling ideas, most of which have been around a long time. ^37c66e
# Relational Model
Relational databases organize data into tables with rows and columns, pioneered by Edgar F. Codd.
- A Relational Model of Data for Large Shared Data Banks, 1970 by Edgar F. Codd ^8d1ede
- C-Store: A Column-oriented DBMS, 2005: Organizing data by columns rather than rows. ^fe90b8
- MonetDB/X100: Hyper-Pipelining Query Execution: The beginning of vectorized query execution, by MonetDB. ^4e54dd
# NoSQL
NoSQL databases trade relational guarantees for horizontal scalability and flexible schemas.
- Bigtable: A Distributed Storage System for Structured Data, 2006
- Dynamo: Amazon’s Highly Available Key-value Store, 2007
# Schema Evolution
Schema Evolution addresses how databases handle changes to data structures over time.
- 1995: A survey of schema versioning issues for database systems by John F. Roddick ^eb6505
- 2012: Automating the database schema evolution process
# Data Architecture & Governance
Patterns for organizing data assets, governance, and discovery across the enterprise.
- Data Mesh: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Data Catalog: Ground: A Data Context Service
- Semantic Layer: Measures in SQL by Julian Hyde and John Fremlin
- Analytics Development Lifecycle (ADLC): The Analytics Development Lifecycle (ADLC) by dbt
# Git for Data
Git for Data applies version control concepts to datasets and data pipelines.
- Reproducible data science over data lakes: Replayable data pipelines with Bauplan and Nessie ^a13d97
- Git for Data Paper by XetData: Proposes a system that extends Git to efficiently handle terabyte-scale machine learning datasets through content-defined chunking and deduplication. ^2e514f
# Database Extensibility & Research
Academic and industry research on database design, extensibility, and human-data interaction.
- Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility
- HILDA: Human-in-the-Loop Data Analysis: A Personal Perspective ^678997
- Bluesky: The AT Protocol: Usable Decentralized Social Media with Martin Kleppmann ^039de7
# Other Lists
- Schedule - CMU 15-721 :: Advanced Database Systems (Spring 2023): see the papers linked to each presentation.
- Data Engineer Handbook Whitepapers
Origin: Data Engineering Vault
Created 2024-01-05
