How do you monitor data pipelines?
How are you monitoring your data pipelines? Setting up good, sustainable alerting to catch data problems is important, and I’m curious what solutions teams are using today.
Things that would be important to catch:
* A table that should be getting new data every day is no longer receiving data
* A table’s schema changes
* A column that should be unique is no longer unique
* A column that shouldn’t have nulls has nulls
* A numerical column has values that go beyond the expected range
* The distribution of a categorical column shifts past some threshold (e.g. more than 80% “no” values in a column)
Also are there other obvious things that are important to catch?
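For concreteness, here’s a rough sketch of what these checks might look like as plain assertions over a pandas DataFrame. The table, column names, and thresholds are made up for illustration; dedicated tooling would express the same ideas more declaratively.

```python
# A minimal sketch of the checks listed above, using pandas.
# The file, column names, and thresholds below are hypothetical placeholders.
import pandas as pd

# The schema we expect the table to have (column name -> dtype).
EXPECTED_SCHEMA = {
    "id": "int64",
    "created_at": "datetime64[ns]",
    "status": "object",
    "amount": "float64",
}

def run_checks(df: pd.DataFrame) -> list[str]:
    failures = []

    # Freshness: the table should have received new rows in the last day.
    if df["created_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=1):
        failures.append("no new rows in the last 24 hours")

    # Schema: column names and dtypes should match what we expect.
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != EXPECTED_SCHEMA:
        failures.append(f"schema drift: {actual}")

    # Uniqueness: 'id' should be unique.
    if not df["id"].is_unique:
        failures.append("duplicate values in 'id'")

    # Nulls: 'amount' should never be null.
    if df["amount"].isnull().any():
        failures.append("nulls found in 'amount'")

    # Range: 'amount' should stay within an expected band.
    if not df["amount"].between(0, 10_000).all():
        failures.append("'amount' outside expected range [0, 10000]")

    # Distribution: alert if more than 80% of 'status' values are "no".
    if df["status"].value_counts(normalize=True).get("no", 0) > 0.8:
        failures.append("more than 80% 'no' values in 'status'")

    return failures

if __name__ == "__main__":
    df = pd.read_parquet("daily_orders.parquet")  # hypothetical table export
    for failure in run_checks(df):
        print(f"ALERT: {failure}")
```

In practice the same checks would run on a schedule after each pipeline load and feed whatever alerting channel the team already uses, rather than printing to stdout.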