An open-source framework for data-centric AI
Senior data scientists know that great ROI in real-world ML projects comes from finding and fixing issues in the dataset rather than tinkering too much with models. But today this is mostly done manually via ad hoc scripts (Jupyter notebooks). In data-centric AI, we also use software that can automatically detect data issues (mislabeled examples, outliers, etc.) to make all this more systematic (better coverage, reproducibility, efficiency, etc.). While some companies are starting to offer commercial platforms for data-centric AI, cleanlab is fully open-source, it is a complete software framework that can be used for many data types and ML tasks, and I've published all of the novel algorithms it uses to help you improve messy real-world ML datasets.
In one line of Python, cleanlab can automatically:
(1) find mislabeled data + train robust models
(2) detect outliers
(3) estimate consensus + annotator-quality for datasets labeled by multiple annotators
(4) suggest which data is best to label or re-label next (active learning)
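To give a feel for item (1), here is a minimal pure-Python sketch of one way to flag likely label errors from a model's predicted probabilities. It is a simplified illustration under stated assumptions, not cleanlab's actual implementation: the function name `find_likely_label_issues`, the per-class mean threshold, and the toy data are all hypothetical, and the probabilities are assumed to be out-of-sample (e.g. from cross-validation).

```python
# Simplified illustration only; not cleanlab's implementation.
# All names and the toy data below are hypothetical.

def find_likely_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks wrong.

    labels:     list of int class ids, one per example
    pred_probs: list of per-class probability lists, one per example
                (assumed to come from a model that did not train on them)
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the average predicted probability of class k
    # over the examples that were labeled k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    issues = []
    for i, (probs, given) in enumerate(zip(pred_probs, labels)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue
        likely = max(confident, key=lambda k: probs[k])
        if likely != given:
            issues.append(i)

    # Rank flagged examples: lowest probability on the given label first.
    issues.sort(key=lambda i: pred_probs[i][labels[i]])
    return issues


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issue_indices = find_likely_label_issues(labels, pred_probs)
print(issue_indices)  # prints [2]
```

Using a per-class threshold instead of a fixed cutoff like 0.5 is meant to account for a model being systematically more or less confident about some classes than others.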
It has quick 5-minute tutorials for many types of data (image, text, tabular, audio, etc.) and ML tasks (classification, entity recognition, image/document tagging, etc.).
Engineers have used cleanlab at Google to clean speech data and train robust models on it, at Amazon to estimate how often the Alexa device doesn't wake, at Wells Fargo to train reliable financial prediction models, and at Microsoft, Tesla, Facebook, etc. Hopefully you'll find cleanlab useful in your ML applications. It's super easy to try out!
Beyond feature engineering, data-centric AI can also help on Kaggle. This notebook shows how easily cleanlab can improve the training dataset for an XGBoost model, producing a 12% reduction in error without any change to the existing model, training, or data-processing code:
https://www.kaggle.com/code/ulytkch/cleanlab-data-centric-ai...
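The general pattern can be sketched as follows. This is not the code from the notebook: `drop_flagged` and `train_model` are hypothetical names, `train_model` is a toy stand-in for whatever existing training code you already have (such as fitting an XGBoost model), and the flagged indices are assumed to come from a label-issue detection step.

```python
from collections import Counter


def drop_flagged(features, labels, flagged_indices):
    """Return copies of features and labels without the flagged examples."""
    flagged = set(flagged_indices)
    kept = [i for i in range(len(labels)) if i not in flagged]
    return [features[i] for i in kept], [labels[i] for i in kept]


def train_model(features, labels):
    """Hypothetical stand-in for existing, unchanged training code.

    This toy version just memorizes the most common label.
    """
    return Counter(labels).most_common(1)[0][0]


features = [[0.1], [0.2], [0.9], [0.8], [0.7]]
labels = [0, 0, 0, 1, 1]
flagged = [2]  # assumed output of a label-issue detection step

# The only new step: clean the training data before the usual training call.
clean_features, clean_labels = drop_flagged(features, labels, flagged)
model = train_model(clean_features, clean_labels)
```

The training function is called exactly as before; only the data passed to it changes.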
We are looking for more contributors to cleanlab in 2023. Help shape the future of data-centric AI and ensure it remains free software, especially if you love Python and practical tools for real-world data science!
Learn more about data-centric AI from Andrew Ng: