Show HN: PipeRider – open-source Data Impact Analysis for dbt changes
Hi HN! This is CL and we’re building PipeRider[0]. PipeRider is an open-source data impact analysis tool for dbt, designed to run during pull requests.
Why? In a previous life I worked on distributed version control systems[1] prior to git, on weird data systems like an in-Postgres REST server with plv8, and on building civic tech communities with open data. It always startled me when some new characteristic of the data was uncovered and we had to change the schema and modeling, then recheck all downstream uses of the data (if we even could).
Fast forward to the modern era of data systems: the data engineering stack is certainly becoming more like the software engineering stack, with dbt as one of the main game changers. But the engineering practices and tooling aren’t quite there yet.
We are building the missing pieces of the “data-pipeline-as-code” puzzle by making pull requests on data systems more informative and bringing confidence to reviewing and testing the impact of code changes. Here’s an example PR of the campaign finance project[2]. The goal is to augment the CI process to help teams move faster when changing data and pipeline logic, by making them aware of both the intentional and the unintentional impact on downstream models and metrics.
As a version control nerd, I personally love semantic-aware diffs that can help inform stakeholders, such as diff-formatted law amendment proposals in congressional bills. So, what can we do for complex data systems that are now pull-request-able? One thing is the "lineage diff" - a way to visualize the data models you are adding or changing, and to help you make sense of the impact.
Here's another example based on a pull request in the Danish Parliament data[3] dbt project[4]. (Note: the lineage diff is part of the advanced impact analysis, which is not open source.)
You can try out Lineage Diff in this online viewer[5] by uploading two manifests from your dbt project to see the impact of your code changes on lineage.
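To make the idea of diffing two manifests concrete, here is a minimal sketch in Python. It is not PipeRider's implementation, just an illustration under some assumptions: that each dbt `manifest.json` has a top-level `nodes` dict keyed by unique ID, that each node carries a `checksum` entry, and that the manifest has a `child_map` from node ID to direct children. The function names `load_manifest` and `lineage_diff` are made up for this example.

```python
import json
from collections import deque


def load_manifest(path):
    """Load a dbt manifest.json file into a dict."""
    with open(path) as f:
        return json.load(f)


def lineage_diff(base, target):
    """Compare two manifest dicts (hypothetical helper, not a PipeRider API).

    Assumes each manifest has "nodes" (unique_id -> node with a "checksum")
    and that the target manifest has "child_map" (unique_id -> children).
    """
    base_nodes = base.get("nodes", {})
    target_nodes = target.get("nodes", {})

    added = set(target_nodes) - set(base_nodes)
    removed = set(base_nodes) - set(target_nodes)
    # A node counts as modified when its checksum entry differs.
    modified = {
        uid
        for uid in set(base_nodes) & set(target_nodes)
        if base_nodes[uid].get("checksum") != target_nodes[uid].get("checksum")
    }

    # Breadth-first walk of the target graph to collect everything
    # downstream of an added or modified node.
    child_map = target.get("child_map", {})
    downstream = set()
    queue = deque(added | modified)
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in downstream:
                downstream.add(child)
                queue.append(child)
    # Report only nodes that are impacted without being changed themselves.
    downstream -= added | modified

    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "modified": sorted(modified),
        "downstream_impacted": sorted(downstream),
    }
```

With two manifests on disk (the paths here are placeholders), usage would look like `lineage_diff(load_manifest("base/manifest.json"), load_manifest("target/manifest.json"))`. One known gap in this sketch: nodes downstream of a removed model are only found if you also walk the base manifest's `child_map`.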
I’d love your feedback!
[1] https://news.ycombinator.com/item?id=32668334
[2] https://github.com/g0v/tw_campaign_finance/pull/2
[3] https://github.com/bgarcevic/danish-democracy-data/pull/8
[4] https://cloud.piperider.io/clkao/default/runs/33a74e0bef5f48...
[5] https://cloud.piperider.io/online-viewer
(Edit: link formatting)
Curious how PipeRider compares to Great Expectations?
Thanks for the question! GX is a great tool for data testing, which is also useful during dbt PRs, similar to dbt tests. PipeRider additionally compares data profiles between the merge base and the PR, and presents more dbt-specific information.
Co-pilot for the data stack. Cool!