One Year of dbt


Adam Boscarino


dbt (data build tool) is an open source project that focuses on the T in ELT (Extract, Load, Transform). Developers use it to write SQL queries that transform data already loaded into a data warehouse into analytics-ready tables and views. It is also one of the hottest tools in the data space. However, just over a year ago we were not sure if it was a fit for Devoted. We had a similar internal solution, DAG Builder, that handled our SQL transformations and were not convinced that introducing a new tool into our data stack was worthwhile. Our tune changed after we watched the wonderful talks at Coalesce 2020. Inspired by the dbt community, the Devoted Data Engineering team launched a small proof-of-concept with a single model to test dbt’s capabilities. We quickly found the tool lived up to the hype. In the year since, dbt has become the backbone of our data platform. In this post, we’ll cover some of the reasons we migrated to dbt and the unique aspects of our dbt deployment, including our integration with Airflow and the custom tooling we built to make our developers’ lives easier.


We don’t need dbt

This might seem like an odd statement in a post about using dbt, but it was the initial reaction from our team (and a common refrain in the Data Engineering community). dbt is after all “just” SQL and Jinja! We were already using both those things in our Airflow DAGs through our in-house DAG Builder, so why would we need to introduce another tool?

The answer, in retrospect, is obvious. Our small team could never match the power of community and open source. While watching Coalesce talks, we discovered many features we were planning on adding to our internal tooling already existed in dbt. Additionally, we saw teams adding features on top of dbt, like dynamic data masking, that we hadn’t even thought of yet. It became glaringly obvious that we were wasting our time trying to compete with a tool that already offered so much of what we wanted. We needed dbt.

Integrating with Airflow

As heavy Apache Airflow users, we knew from the outset that Airflow was how we would deploy dbt in production. That left us with determining how to integrate the two tools. We researched several approaches to the problem, including a great series from Astronomer. We also knew we wanted our integration to maintain the best practices from our existing in-house DAG Builder:

  • Smaller DAGs are preferred
  • No single failure should prevent unrelated tables and reports from being updated
  • Only fully tested data is exposed to end users
  • It is better to have stale data than incorrect data

The solution we settled on was to structure our project so that models are grouped by business area in separate directories (we call this a “model group”). In each model config, we then set a staging_schema and final_schema.


Each model group builds dbt models in the staging_schema. This is a “private” schema that is not exposed to end users or other model groups. After the model group is built, we run dbt test against it and, if it passes, the tables are published to the final_schema. This is a “public” schema that’s exposed to end users and other model groups. If a model is using a table from outside of its model group then it uses the final_schema version of that table, which guarantees that it is using fully-tested data. We achieved this by customizing the ref macro to understand what group a given model is in based on its directory and set the schema accordingly.
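A rough sketch of that ref override using dbt’s builtins namespace; the group-detection logic, config key, and schema handling here are assumptions for illustration, not Devoted’s actual macro:

```sql
-- Hypothetical sketch of a custom ref macro. Cross-group references are
-- redirected to the public final schema; in-group references keep the
-- private staging schema that builtins.ref resolves to.
{% macro ref(model_name) %}
    {% set rel = builtins.ref(model_name) %}
    {# fqn[1] is the top-level directory under models/, i.e. the model group #}
    {% set node = graph.nodes.values()
        | selectattr('name', 'equalto', model_name) | first %}
    {% if node and node.fqn[1] != model.fqn[1] %}
        {# Cross-group reference: read the fully tested, published table #}
        {{ return(rel.replace_path(schema=node.config.final_schema)) }}
    {% else %}
        {# In-group reference: read the private staging table #}
        {{ return(rel) }}
    {% endif %}
{% endmacro %}
```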


Each model group is then deployed to Airflow as an independent DAG. Developers can also connect model groups using External Task Sensors, though this is not required. This approach has worked well for us. It has helped limit the impact of any single dbt failure and ensured we only deliver fully tested data to our end users.
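The directory-to-DAG mapping can be sketched like this; the paths and DAG naming are illustrative, and the real deployment generates Airflow DAGs rather than a plain dict:

```python
from pathlib import PurePosixPath

# Illustrative sketch of the grouping rule: one Airflow DAG per top-level
# directory under models/ (directory and DAG names are hypothetical).
def model_groups(model_paths):
    """Map each model group's DAG id to the models it contains."""
    groups = {}
    for path in model_paths:
        parts = PurePosixPath(path).parts  # ("models", "<group>", "<model>.sql")
        groups.setdefault(f"dbt_{parts[1]}", []).append(parts[-1])
    return groups
```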


dbtctl

In addition to our custom integration with Apache Airflow, we have developed an internal CLI to improve developer productivity in dbt. The tool, dbtctl (pronounced dbt-cuttle/cuddle), is written in Go and its main use is acting as a wrapper around the standard dbt CLI. It automatically sets the appropriate environment variables and authentication, allowing developers to run commands without needing to worry about setting up any connections.
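A wrapper in this style can be sketched in a few lines of Python (dbtctl itself is written in Go, and the environment variable names and paths below are assumptions, not the real tool’s):

```python
import os
import subprocess

# Hypothetical sketch of a dbtctl-style wrapper around the dbt CLI.
def dbt_env(user):
    """Build the environment a wrapped dbt invocation runs with."""
    env = dict(os.environ)
    env.update({
        "DBT_PROFILES_DIR": "/etc/dbt",       # shared profiles.yml (assumed path)
        "DBT_TARGET_SCHEMA": f"dev_{user}",   # per-developer schema (assumed var)
    })
    return env

def run_dbt(args, user):
    """Forward a subcommand to the real dbt CLI with env/auth prepared."""
    return subprocess.run(["dbt", *args], env=dbt_env(user))
```

With something like this, `dbtctl run` becomes `run_dbt(["run"], user)` and developers never touch connection setup themselves.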


Additionally, it has several other commands to make it easier to develop dbt models. Some of the most useful include:

lint

  • All dbt models must pass Devoted-specific linting rules!

init

  • Generate a dbt model from a “regular” SQL file
  • Create a schema.yml file from a SQL file
  • Create a Devoted “standard” DAG file for a dbt model group
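As an illustration of the schema.yml generation step, a minimal `init`-style generator could derive a stub from a model’s output columns. This is a naive regex sketch under assumed names; the real tool would need a proper SQL parser:

```python
import re

# Hypothetical sketch: pull the output column aliases out of a simple SQL
# file and emit a schema.yml stub for the developer to fill in.
def schema_stub(model_name, sql):
    select_list = re.search(r"select(.*?)from", sql, re.I | re.S).group(1)
    # Take the last token of each select item (the alias or bare column name).
    cols = [c.strip().split()[-1].split(".")[-1] for c in select_list.split(",")]
    lines = "\n".join(f"      - name: {c}" for c in cols)
    return f"models:\n  - name: {model_name}\n    columns:\n{lines}"
```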

clone

  • Clones all ref and source tables from a given model group to a developer’s dev database, allowing them to quickly start developing on a model
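A sketch of what the core of `clone` could emit, assuming a warehouse with zero-copy cloning such as Snowflake (the statement form, table names, and schema names here are illustrative):

```python
# Hypothetical sketch of the clone command's core: emit one statement per
# upstream ref/source table of a model group.
def clone_statements(upstream_tables, dev_schema):
    """Generate zero-copy clone DDL for each fully qualified upstream table."""
    stmts = []
    for fq_table in upstream_tables:
        table = fq_table.split(".")[-1]
        stmts.append(
            f"create or replace table {dev_schema}.{table} clone {fq_table}"
        )
    return stmts
```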


CI/CD

Another area we have improved dramatically thanks to dbt is our CI/CD process for SQL pipelines. In our aforementioned internal solution, queries were not tested via CI and it was up to the individual developer to manually test their SQL changes. With dbt, we have been able to develop a much more robust solution.

On every PR, we create a testing schema (this gets dropped at the end of the CI run), identify the models changed by the PR, and then build those models and all potentially impacted downstream models in the testing schema. We were able to keep tests fast even with hundreds of models by using a custom version of the source macro to include a LIMIT 0 clause on all source tables. This guarantees that our dbt models will at least execute successfully in production. We also use our custom dbt linter to validate that every dbt model and schema file meets Devoted’s standards before merging.
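The LIMIT 0 trick can be sketched as an override of dbt’s built-in source macro; the CI-detection environment variable here is an assumption:

```sql
-- Hypothetical sketch: in CI, wrap every source in a LIMIT 0 subquery so
-- models compile and run against real columns without scanning any data.
{% macro source(source_name, table_name) %}
    {% if env_var('CI', 'false') == 'true' %}
        (select * from {{ builtins.source(source_name, table_name) }} limit 0)
    {% else %}
        {{ builtins.source(source_name, table_name) }}
    {% endif %}
{% endmacro %}
```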


These improvements have vastly reduced the number of production failures caused by SQL errors and have greatly increased developer productivity and confidence.

dbt unit test framework

Over the past year, we have benefited greatly from dbt’s built-in test functionality. dbt tests are an excellent tool for validating data. We use them when we want to make sure the “id” column in a model is unique, or enforce that a model should never have more than n rows. These tests validate the data after the model runs in our staging schemas. However, dbt tests do not test units of business logic that are embedded in our SQL.
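For example, the uniqueness check on an “id” column is declared in a standard schema.yml (the model and column names here are illustrative):

```yaml
# Illustrative schema.yml; model and column names are hypothetical.
version: 2
models:
  - name: members
    columns:
      - name: id
        tests:
          - unique
          - not_null
```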

After watching the excellent talk presented by Michelle Ark from Shopify at a Data Quality Meetup, we became interested in providing a toolset for unit testing our dbt models the way you would unit test a function in Python. Since the Shopify code wasn’t open sourced, we built our own version on top of pytest to “mock” models.

In our dbt unit test framework, developers provide mock data in the form of a CSV string that is then loaded into a MockModel class. The MockModel then reads a model’s compiled SQL from dbt’s manifest.json file and replaces relation identifiers with the mocked data.
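The approach can be sketched roughly as follows; this is a simplified illustration, not Devoted’s actual framework, with SQLite standing in for the warehouse and all names assumed:

```python
import csv
import io
import sqlite3

class MockModel:
    """Hypothetical sketch of the unit test helper described above."""

    def __init__(self, compiled_sql):
        self.sql = compiled_sql
        self.conn = sqlite3.connect(":memory:")

    def mock(self, relation, csv_text):
        """Load a CSV string as a table and point the compiled SQL at it."""
        rows = list(csv.reader(io.StringIO(csv_text.strip())))
        header, data = rows[0], rows[1:]
        mock_name = "mock_" + relation.replace(".", "_")
        self.conn.execute(f"create table {mock_name} ({', '.join(header)})")
        placeholders = ", ".join("?" for _ in header)
        self.conn.executemany(
            f"insert into {mock_name} values ({placeholders})", data
        )
        # Rewrite the relation identifier in the compiled SQL to the mock table.
        self.sql = self.sql.replace(relation, mock_name)
        return self

    def run(self):
        """Execute the rewritten compiled SQL and return the result rows."""
        return self.conn.execute(self.sql).fetchall()
```

A pytest test then mocks each upstream relation with a CSV string and asserts on the rows `run()` returns.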


This allows developers to write tests for individual pieces of SQL and verify they are working as expected. This has helped increase confidence when changing complicated transformations and improved data quality and trust in our platform.

Closing Thoughts

Over the last year, dbt has become a key piece of the data platform at Devoted and lived up to our wildest hopes and dreams. We have gone from a single proof-of-concept to 1,100+ models. It has fully replaced our previous internal SQL+Jinja tooling and powers almost all of our data transformations.

While dbt itself is extremely powerful, our investment in building on top of dbt and integrating it into our existing stack has been game changing. dbtctl has empowered our amazing team of data analysts and data scientists to own their data models end-to-end. Additionally, our CI/CD tooling and testing framework have made the team more confident in making changes and improved developer productivity. You don’t have to take Data Engineering’s word for it though.

“Running things in CI has really increased my confidence in changes that go out. It catches a lot of column name mismatches, etc., before they can hit prod. We couldn’t do that before dbt.” — Data Scientist

“The entire development process is faster and more robust than our previous approach, due to the ease of out-of-the-box CLI and schema testing, and the tools y’all have built to make that work both faster (clone, init, etc) and more solid (linting, etc)” — Data Scientist

“One of the development improvements that I’ve enjoyed which has increased velocity is having model groups and being able to run and test specific models within the group, rather than having to run the whole group or DAG.” — Data Scientist

Written by Adam Boscarino and Jason Brownstein.

Special thanks to Andrew Hoffman, Jennifer Thompson, and Adam Baratz for reviewing.