ELT Schedules Can Improve Root Cause Analysis for Data Engineers
montecarlodata.com
At Monte Carlo, we've been working on root cause analysis (RCA) for data failures such as ETL job failures, timeouts, and data delays. I think there's a lot that can be done from a data science perspective to automate RCA or to provide better insight into data pipeline problems.
We put together this blog post showing how an orchestration DAG (such as a dbt schedule DAG) can be converted into a Bayesian network (BN). You can then ask causal attribution questions as conditional probability queries against the BN. The idea is still fairly preliminary, but I think it could be extended in all sorts of interesting ways, e.g. attributing bad row-level data to upstream transformations.
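To make the idea concrete, here is a minimal Python sketch of the DAG-to-BN conversion and a conditional probability query. It is not the code from the blog post: the task names, the per-task failure rates, and the rule that a task always fails when any upstream task has failed are all assumptions made up for illustration, and inference is done by brute-force enumeration rather than with a BN library.

```python
from itertools import product

# Hypothetical orchestration DAG: task -> list of upstream tasks (parents).
DAG = {
    "extract_orders": [],
    "extract_users": [],
    "stage_orders": ["extract_orders"],
    "stage_users": ["extract_users"],
    "orders_report": ["stage_orders", "stage_users"],
}

# Assumed probability that each task fails on its own (illustrative numbers).
OWN_FAILURE = {
    "extract_orders": 0.05,
    "extract_users": 0.01,
    "stage_orders": 0.02,
    "stage_users": 0.02,
    "orders_report": 0.01,
}


def p_fail_given_parents(task, state):
    """Assumed CPD: fail for certain if any parent failed, else fail at own rate."""
    if any(state[parent] for parent in DAG[task]):
        return 1.0
    return OWN_FAILURE[task]


def joint(state):
    """Joint probability of a full assignment (True = failed) under the BN."""
    prob = 1.0
    for task in DAG:
        p_fail = p_fail_given_parents(task, state)
        prob *= p_fail if state[task] else 1.0 - p_fail
    return prob


def query(target, evidence):
    """P(target failed | evidence), by enumerating every assignment."""
    tasks = list(DAG)
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(tasks)):
        state = dict(zip(tasks, values))
        if any(state[t] != v for t, v in evidence.items()):
            continue  # assignment contradicts the observed evidence
        p = joint(state)
        denominator += p
        if state[target]:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    # The report failed: rank upstream tasks by posterior failure probability.
    evidence = {"orders_report": True}
    upstream = [t for t in DAG if t not in evidence]
    for task in sorted(upstream, key=lambda t: -query(t, evidence)):
        print(f"{task}: {query(task, evidence):.3f}")
```

Enumeration is exponential in the number of tasks, so this only works for toy DAGs. A real schedule would need a proper inference algorithm and failure rates estimated from run history.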
Would be interested to hear what people think.