Ask HN: Dataset version control for ML / data science?
To data scientists, machine learning engineers and data engineers -- how do you manage your datasets? What tools and workflows, if any, do you use to version your data alongside your code?
Currently, my workflow for data analysis / modelling is essentially the following (steps 3-5 are sketched in code after the list):
1. Write SQL query for desired dataset
2. Run query to produce CSV
3. Hash the file contents to use as an identifier
4. Upload the file to S3
5. Reference the file in Jupyter notebook / scripts etc.
6. Return to step 1 or 2, depending on whether I'm updating a report or creating a new experiment with new data.
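For concreteness, steps 3-5 look something like this in Python (the bucket and file names are placeholders, and boto3 is just what I happen to use for S3):

    import hashlib
    import boto3

    def sha256_of(path, chunk_size=1 << 20):
        # Stream the file in chunks so multi-GB CSVs don't need to fit in memory
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    csv_path = "report.csv"                  # output of the SQL query (placeholder)
    digest = sha256_of(csv_path)
    key = "datasets/{}.csv".format(digest)   # content-addressed S3 key

    boto3.client("s3").upload_file(csv_path, "my-data-bucket", key)
    # Notebooks / scripts then reference s3://my-data-bucket/datasets/<digest>.csv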
I'm curious whether people have experience using tools such as DVC [0] for managing experiments. Git LFS could be useful, but it seems to be aimed more at binary assets than at large datasets running to many GBs.
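From what I can tell from the DVC docs, reading a versioned file from a notebook would look roughly like this (the repo URL, path, and tag below are made up):

    import dvc.api
    import pandas as pd

    # Open a DVC-tracked file at a specific Git revision (tag, branch, or commit);
    # the repo URL, file path, and rev here are hypothetical
    with dvc.api.open(
        "data/training.csv",
        repo="https://github.com/example-org/analytics",
        rev="v1.2.0",
    ) as f:
        df = pd.read_csv(f)

That would roughly replace steps 3-5 above, with Git tags standing in for the file hashes.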
[0] https://dvc.org/