I’ve worked on many data science and machine learning projects, big and small: from projects that fit in a few notebooks to ones that grew into tens of thousands of lines of code.
I found that starting a project assuming it’s going to grow is always a good thing.
It only takes a small amount of extra effort at the beginning to be able to maintain long-term productivity.
Using git to manage code is a must. A way to manage data is a must as well, but the jury is still out on the best way to do it.
Here I’m going to present a library I recently open-sourced called lazydata that is built specifically to manage data for growing ML projects.
Traditional ways of managing data
If you are working on a machine learning project you will inevitably be working with various versions of the dataset (raw, processed in various ways, train/dev splits, augmented), and will be trying out different models (baselines, different parameters, architectures, algorithms).
You will also need to move these from your local computer to the cloud, and also share them with your teammates and your future self.
To manage this proliferation of data and model files, there are traditionally two options: manual file management and storing everything in git.
The table below summarises the different approaches and their biggest risk.
The lazydata way
The core concept of lazydata is, well, laziness. It is the assumption that as the project grows you will produce many files, of which you will immediately need only a small subset.
It’s always good to be able to go back to that model you produced a few months ago.
But, it doesn’t mean that everyone who wants to check out your project needs to download the whole history — they most likely only want a small subset.
Enabling lazydata takes a couple of lines of code
So how do you use lazydata? First, install it (this requires Python 3.5+):
$ pip install lazydata
Then go to your project root directory and type:
$ lazydata init
This will create a new file lazydata.yml which will hold a list of your tracked data files.
Next you can use lazydata in your code to track your data files:
# my_script.py
from lazydata import track
import pandas as pd

df = pd.read_csv(track("data/my_big_table.csv"))
print("Data shape:", df.shape)
When you run this script for the first time, lazydata is going to start tracking your file.
Tracking involves:
- Creating a copy of your file in ~/.lazydata. This is your local cache, where your file versions are preserved so you can go back to them if you need to and push them to the cloud.
- Adding a unique reference (the SHA256 hash) of your data file to lazydata.yml. You add this yml file to git to track the data files you had in each specific commit. This lets you do forensics later if needed, and neatly tracks your data files as if they were code.
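The SHA256 reference is just a content hash: two files with identical bytes share one hash, and any change to the bytes produces a different one, so the hash pins an exact version of the data. A minimal sketch of that idea using the standard library's hashlib (an illustration of the concept, not lazydata's internals):

```python
import hashlib

def file_sha256(path):
    """Return the SHA256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 8 KB chunks so large data files don't need to fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because the digest depends only on the file contents, committing it to git records exactly which version of the file the code was run against.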
And you are done! If you ever change your data file, a new version will be recorded and you can simply continue working as you normally do.
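Conceptually, the local cache behaves like a content-addressed store: each version of a file is kept under its own hash, so caching a changed file adds a new entry rather than overwriting the old one. A rough sketch of that behaviour (my own illustration with a hypothetical `cache_file` helper, not lazydata's actual cache layout):

```python
import hashlib
import shutil
from pathlib import Path

def cache_file(path, cache_dir):
    """Copy a file into a content-addressed cache, keyed by its SHA256."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / digest
    if not dest.exists():  # earlier versions stay untouched
        shutil.copy2(path, dest)
    return digest
```

Caching the same file twice is a no-op, while editing the file and caching it again stores a second, independent version you can go back to.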
To share your data files with your teammates, or simply to back them up, add a remote to your project and push your files:
$ lazydata add-remote s3://mybucket
$ lazydata push
Once your teammates pull the latest code they’ll also get the lazydata.yml file. They can then use the S3 remote you set up and the SHA256 unique reference to pull the file with lazydata pull.
Alternatively, they can simply run your script to lazy-download the missing file:
$ python my_script.py
## lazydata: Downloading tracked file my_big_table.csv ...
## Data shape: (10000, 100)
And that is it! You can now treat your S3 bucket as a permanent archive of all of your data files, without the fear that your collaborators will have to download any files they don’t need.
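The lazy-download step boils down to: before a tracked file is read, check whether it exists locally, and only if it is missing fetch the version named by its hash from the remote. A toy version of that logic, using a local directory in place of S3 and a hypothetical `lazy_fetch` helper (not lazydata's actual implementation):

```python
import shutil
from pathlib import Path

def lazy_fetch(path, digest, remote_dir):
    """Fetch a tracked file from the remote only if it is missing locally."""
    path = Path(path)
    if not path.exists():
        print(f"Downloading tracked file {path.name} ...")
        path.parent.mkdir(parents=True, exist_ok=True)
        # The remote stores each version under its SHA256 digest
        shutil.copy2(Path(remote_dir) / digest, path)
    return path
```

The first call downloads the file; every later call finds it already present and returns immediately, which is what keeps collaborators from pulling data they never touch.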
In the above example we applied lazydata to data inputs, but you can also use track() on data outputs, e.g. on a model weights file you just saved — it works in exactly the same way.
This lazy semantics enables you to get the best of both worlds: store everything but also keep your data files managed and organised as the project grows.
To learn more about the library, check out the GitHub page: