Why can't we automate data cleaning?

7 points by obi-wan 5 years ago · 4 comments


st1x7 5 years ago

We can and we do automate it. Given enough sample data, plus context about the data source and how the cleaned dataset will be used, you can write a script that automates the cleaning for any future sample with the same context. You might get some errors if your future samples are very different from what you used to build the script, or if you just missed some edge cases. This is still automation - build once, run multiple times with no additional effort.

What we can't do is create a generalised function called clean(data) that can be applied to any dataset. The reason is that the original format, data source, goals, domain knowledge, and personal judgement vary so much between datasets that it is pointless to even try.
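
To make the "build once, run multiple times" point concrete, here is a minimal sketch of a context-specific cleaner. Everything in it is an assumption made up for illustration: the signup export, the field names `email` and `signup_date`, the two accepted date formats, and the rule that slash-separated dates are day-first. Those choices are exactly the kind of domain knowledge a generic clean(data) could not guess.

```python
from datetime import datetime

# Hypothetical schema and rules: these formats and field names are
# assumptions for illustration, not taken from any real dataset.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # slash dates assumed day-first


def parse_date(raw):
    """Try each known format; return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_signups(rows):
    """Clean rows from one specific (hypothetical) signup export.

    Returns (cleaned, rejected) so unexpected rows are surfaced
    for review instead of being silently dropped.
    """
    cleaned, rejected = [], []
    seen = set()
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        date = parse_date(row.get("signup_date") or "")
        if "@" not in email or date is None:
            rejected.append(row)
            continue
        if email in seen:
            continue  # duplicate email: keep only the first occurrence
        seen.add(email)
        cleaned.append({"email": email, "signup_date": date})
    return cleaned, rejected
```

Calling `clean_signups(rows)` on each new export from the same source needs no further effort. A row with a date format the script has never seen ends up in `rejected`, which is the "missed edge case" failure mode, and the same code would quietly misread month-first dates from a different source.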

thedevindevops 5 years ago

We have. There are quite a few data wrangling, or 'data munging', tools out there.

  • st1x7 5 years ago

    On a side note: do "data wrangling" and (especially) "data munging" sound cringey to anyone else? It's hard to put my finger on why exactly, but the terms are really off-putting.

LeviIsaac 5 years ago

It’s clear that data cleaning, like modelling, is not immune to automation. As a result, it’s likely that data scientists will find themselves leaning more and more on their subject matter expertise, communication and engineering skills in the future, rather than spending their time dealing with missing values, hyperparameter optimization or model selection.
