Avalanche: Streaming Postgres to Snowflake

Adam Boscarino

Overview

Over the past two years, the Devoted Health Data Engineering team has evolved its data architecture to support the needs of a rapidly growing business with increasing data volumes. In this post, we will cover the journey from the team’s initial batch data pipeline to avalanche, a change-data-capture service for streaming Postgres data to Snowflake in near real-time.

Postgres to Snowflake: None to Batch

The team has built a relatively modern data stack using Apache Airflow for workflow management and Snowflake as our data warehouse. These tools replaced our initial “startup” stack of cron jobs and Redshift. Our main data source for Snowflake is our production databases. These 16 Postgres databases (one for each microservice) have over 4,000 tables that are regularly updated, both with new data and DDL changes (read: multiple PRs per day), by our amazing Engineering team.

To load them into Snowflake we used our existing stack and did the obvious thing, going from None to Batch (as described in Chris Riccomini’s great talk on the Future of Data Engineering). A single Airflow DAG would query each database, identify the table metadata, extract the full datasets from Postgres to S3, and then load them into Snowflake with COPY INTO statements. A series of validation checks within the DAG gave us confidence in the data. The pipeline was scheduled to run every hour and proved very reliable. Data flowed from production to Snowflake, it became the backbone of our data platform, and everyone was happy!
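For a sense of what each hourly run boiled down to per table, here is a rough Python sketch. It is not our actual DAG code, validation checks are omitted, and the table, bucket, and stage names are all hypothetical.

    import io

    import boto3
    import psycopg2
    import snowflake.connector

    TABLE = "members"            # hypothetical table name
    BUCKET = "example-extracts"  # hypothetical S3 bucket

    def extract_to_s3(pg_dsn: str) -> str:
        """Dump the full Postgres table to CSV and push it to S3."""
        buf = io.BytesIO()
        with psycopg2.connect(pg_dsn) as conn, conn.cursor() as cur:
            cur.copy_expert(f"COPY {TABLE} TO STDOUT WITH CSV HEADER", buf)
        buf.seek(0)
        key = f"batch/{TABLE}.csv"
        boto3.client("s3").upload_fileobj(buf, BUCKET, key)
        return key

    def load_into_snowflake(sf_params: dict, key: str) -> None:
        """Reload the Snowflake table from the fresh extract."""
        conn = snowflake.connector.connect(**sf_params)
        try:
            cur = conn.cursor()
            cur.execute(f"TRUNCATE TABLE {TABLE}")
            # "@raw_stage" is a hypothetical external stage pointing at the bucket.
            cur.execute(
                f"COPY INTO {TABLE} FROM @raw_stage/{key} "
                "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
            )
        finally:
            conn.close()

At small data volumes this full-reload pattern is hard to beat for simplicity, which is exactly why it was the right first step.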

TL;DR

  • Batch Airflow DAG loads data from Postgres to Snowflake
  • Scheduled to run every hour
  • Full reloads on almost every table

The Breaking Point

As they say, nothing good lasts forever. You may already have identified the biggest issue with our batch solution: fully reloading every table every hour would eventually hit a scaling limit. As a fast-growing startup, we reached that point very quickly. Our membership increased, our data volumes increased, and soon our hourly job was running for 55 minutes!

Knowing we needed a new solution, the Data Engineering team pulled together and began investigating options. Our initial idea was to simply make our existing extract queries incremental. However, we discovered this approach could not accurately replicate our sources because we allow hard deletes in some tables in our production databases, and an incremental query never sees a row that no longer exists. The performance of the query needed to identify incremental data was also quite poor on some of the larger tables.
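To make the hard-delete problem concrete, here is the kind of watermark-style incremental extract we were considering, with hypothetical table and column names. It only ever returns rows that still exist, so a deleted row quietly lives on in the warehouse forever.

    # Hypothetical watermark-based incremental extract. Rows hard-deleted in
    # Postgres after `last_run` never appear in this result set, so the stale
    # copy would linger in Snowflake.
    INCREMENTAL_SQL = """
        SELECT *
        FROM members                     -- hypothetical table
        WHERE updated_at > %(last_run)s  -- hypothetical watermark column
    """

    def fetch_incremental(cur, last_run):
        cur.execute(INCREMENTAL_SQL, {"last_run": last_run})
        return cur.fetchall()  # inserts and updates only; deletes are invisible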

Like good engineers, we resisted the urge to immediately start building a homegrown solution and next explored managed ETL services. We ran proofs of concept with several popular vendors, and these solutions seemed promising initially. However, we soon encountered a number of issues, including slow initial syncs and struggles handling tables with atypical data types like jsonb. Some vendors also couldn’t automatically handle the frequent DDL changes and additions our Engineering team makes to our production tables. With the speed we are adding new features to our platform, these DDL changes can happen on a daily basis, so manual intervention was a non-starter. No single vendor had a solution that perfectly fit our use case and production environment.

We are also big believers in open source (Airflow, Amundsen, and other OSS projects are key pieces of our stack), so naturally the next step was to research open-source projects that could help solve our issue. However, we soon discovered that the main OSS tool in this space, Debezium, would require introducing new infrastructure (Apache Kafka) to our tech stack that we had neither the expertise nor the bandwidth to manage.

Our Debezium research set us down the path of learning about Change Data Capture. CDC is a database process that tracks all DML changes made to tables (inserts, updates, and deletes). It is the technology that powers Debezium (and likely many of the vendors in this space as well). In Postgres, these changes are captured in the Write-Ahead Log (WAL), which can then be used for things like replication to another Postgres server… or a data warehouse. After our exhaustive research and with our newfound knowledge, we decided to build a solution that would meet all of our needs. It was time to move our data pipeline from Batch to Streaming.
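If you want to see CDC in action without any extra infrastructure, Postgres exposes logical decoding through plain SQL functions. The sketch below uses a hypothetical slot name and DSN and assumes wal_level = logical plus the wal2json output plugin; it creates a replication slot and peeks at the decoded WAL changes.

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # hypothetical DSN
    conn.autocommit = True
    cur = conn.cursor()

    # One-time setup: register a slot so Postgres retains WAL for us to decode.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'wal2json')"
    )

    # ...run some INSERTs/UPDATEs/DELETEs elsewhere, then peek at the changes.
    cur.execute(
        "SELECT lsn, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL)"
    )
    for lsn, change in cur.fetchall():
        print(lsn, change)  # JSON describing each insert, update, and delete

avalanche consumes its slot continuously rather than peeking like this, but the underlying mechanism, a replication slot over the WAL, is the same.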

TL;DR

  • Hourly full reloads have a scaling limit
  • Incremental extracts didn’t work because of hard deletes
  • Managed services struggled with some of our databases
  • We didn’t want to introduce new infrastructure to our stack (no Kafka)

Avalanche: Batch to Streaming

Obviously, the first thing we did after deciding to write our own Postgres to Snowflake CDC solution was to come up with a name. After tossing around a few ideas, some worse than others, the team decided avalanche was the winner and we were ready to actually start working.

Design Goals

Now that we had a cool name, we established our goals for the project.

Snowflake is the final destination

  • A record should only be acknowledged as being read from the Postgres WAL after it is in our target Snowflake tables. Yes, this tightly couples the two systems but avoids the need for additional infrastructure, such as Kafka or Kinesis.

Reliability/resilience

  • System should be able to recover from failures/restarts
  • No missed records — guaranteed at-least-once delivery to Snowflake target tables

Low latency

  • Our initial goal was to have new data flow from Postgres to Snowflake in less than 5 minutes

Minimal impact on production database

  • System should read data from the WAL as quickly as possible to avoid filling the RDS disk (see the monitoring sketch after this list)

Minimal manual intervention

  • DDL changes and new tables should automatically flow into Snowflake
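On the production-impact goal specifically: a logical replication slot forces Postgres to retain WAL until its consumer acknowledges it, so a stalled consumer can eat RDS disk. A hypothetical check for that (made-up DSN; pg_replication_slots and the LSN functions are standard Postgres) looks like this:

    import psycopg2

    LAG_SQL = """
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
    FROM pg_replication_slots
    """

    with psycopg2.connect("dbname=example") as conn:  # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(LAG_SQL)
            for slot_name, lag_bytes in cur.fetchall():
                print(f"{slot_name}: {lag_bytes} bytes of retained WAL")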

Architecture

avalanche is a Go service that works by taking a snapshot of data in Postgres, loading that snapshot into Snowflake, and then consuming the WAL via a replication slot from the point of the snapshot forward. WAL messages are flushed to Snowflake when one of three things happens: 60 seconds has passed, a certain volume of data has been read into memory, or a DDL statement has been identified.
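avalanche itself is written in Go, but the flush policy is easy to sketch. The hypothetical Python buffer below flushes on the same three triggers: an elapsed-time limit, a size threshold, or a DDL message. The threshold values and the flush_fn/is_ddl hooks are made up.

    import time

    FLUSH_INTERVAL_SECS = 60
    FLUSH_BYTES = 16 * 1024 * 1024  # hypothetical size threshold

    class WalBuffer:
        """Buffer decoded WAL messages and flush on time, size, or DDL."""

        def __init__(self, flush_fn):
            self.flush_fn = flush_fn  # hypothetical hook that writes rows to Snowflake
            self.messages = []
            self.buffered_bytes = 0
            self.last_flush = time.monotonic()

        def add(self, msg: str, is_ddl: bool) -> None:
            self.messages.append(msg)
            self.buffered_bytes += len(msg)
            timed_out = time.monotonic() - self.last_flush >= FLUSH_INTERVAL_SECS
            if is_ddl or timed_out or self.buffered_bytes >= FLUSH_BYTES:
                self.flush()

        def flush(self) -> None:
            if self.messages:
                self.flush_fn(self.messages)  # ack the WAL only after this succeeds
            self.messages = []
            self.buffered_bytes = 0
            self.last_flush = time.monotonic()

A real service would also flush on a background timer even when no new messages arrive; the sketch only checks the clock when a message comes in.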

WAL entries are inserted into “raw” tables in Snowflake and then merged into “final” tables. We have at-least-once guarantees to the “raw” tables and use MERGE statements to ensure no duplicates are loaded into the “final” tables. avalanche only sends an acknowledgement back to Postgres after a record has been merged into the “final” table.
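The raw-to-final step is what turns at-least-once loads into duplicate-free final tables. A MERGE of roughly this shape (all table, key, and column names hypothetical, and not avalanche’s actual SQL) deduplicates the raw CDC rows by primary key and then upserts or deletes in the final table:

    # Hypothetical shape of the raw -> final merge: keep only the latest raw
    # row per primary key, then apply it to the read-optimized final table.
    MERGE_SQL = """
    MERGE INTO final.members AS f
    USING (
        SELECT *
        FROM raw.members
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY wal_lsn DESC) = 1
    ) AS r
    ON f.id = r.id
    WHEN MATCHED AND r.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET name = r.name, updated_at = r.updated_at
    WHEN NOT MATCHED AND r.op <> 'delete' THEN
        INSERT (id, name, updated_at) VALUES (r.id, r.name, r.updated_at)
    """

    def merge_raw_to_final(sf_cursor) -> None:
        # Only after a statement like this succeeds is the corresponding WAL
        # position acknowledged back to Postgres.
        sf_cursor.execute(MERGE_SQL)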

The other benefit of the MERGE approach is that downstream consumers of this data are reading from tables and not views. These tables are optimized for reads, whereas many other CDC solutions simply use JSON blobs with views and are optimized for writes.


In addition to normal DML changes, avalanche uses a database event trigger in Postgres to capture all DDL changes to a table. These changes then flow to the WAL, and we can consume them like normal messages. If avalanche detects that a DDL change has happened, it stops streaming, translates the Postgres statement to Snowflake syntax, and immediately applies it to the Snowflake table in question. After that, it resumes streaming as normal with no manual intervention required.
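Postgres event triggers make the DDL-capture side pleasantly small. The sketch below (hypothetical names, and not avalanche’s actual trigger) logs each DDL statement into an ordinary table, so the change then rides through the WAL like any other insert. Note that creating an event trigger requires superuser privileges.

    # Hypothetical DDL-capture setup: an event trigger that writes every DDL
    # statement into a regular table, which the WAL then carries downstream.
    import psycopg2

    DDL_CAPTURE = """
    CREATE TABLE IF NOT EXISTS ddl_log (
        executed_at timestamptz DEFAULT now(),
        command_tag text,
        statement   text
    );

    CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger AS $$
    BEGIN
        INSERT INTO ddl_log (command_tag, statement)
        VALUES (tg_tag, current_query());
    END;
    $$ LANGUAGE plpgsql;

    CREATE EVENT TRIGGER capture_ddl
        ON ddl_command_end
        EXECUTE FUNCTION log_ddl();
    """

    with psycopg2.connect("dbname=example") as conn:  # hypothetical DSN
        conn.cursor().execute(DDL_CAPTURE)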


We have been running avalanche in production for close to a year now and it has proven to be a reliable system that met all of our initial design goals. It has already become an established piece of Devoted Health’s tech stack and continues to perform and scale as our business (and data volumes) grow.

TL;DR

  • avalanche streams new data from Postgres to Snowflake using the Postgres Write-Ahead Log
  • It loads data to “raw” tables and then merges into “final” tables that are optimized for reads
  • DDL changes are translated from Postgres to Snowflake and applied immediately

The Future

Now that avalanche has proven its value to the organization, we are continuing to add new features and make improvements. A few of our future goals include:

  • Automated Dynamic Data Masking in Snowflake
  • Faster initial full imports
  • A UI!

If this sounds like the type of project you’d be interested in contributing to, we are looking for a Senior Data Engineer to join our team!

Also a big thank you to the brains behind avalanche, Osman Qamar, and other contributors to the project including Mallori Harrell, Karla Goodreau, Jason Brownstein, Jenny Kim, and Josiah Baker.