Committing changes to a 130GB Git repository without full checkouts [video]
youtube.com

Hey HN, would appreciate feedback on a version-control-for-data toolset I am building, creatively called the Data Manager.

When working with large data repositories, full checkouts are problematic. Many git-for-data solutions create a new copy of the entire dataset for each commit, and to my knowledge none of them allow contributing to a data repo without a full checkout. The video presents a workflow that does not require full checkouts of the datasets and still allows committing changes in Git. Specifically, it becomes possible to check out kilobytes in order to commit changes to a 130-gigabyte repository, versions included. Note that only diffs are committed, at row, column, and cell level, so the diffing that appears in the GUI will look odd: it interprets the old diff as the file to be compared with the new one, when in fact both are just diffs.

The goal of the Data Manager is to version datasets, and structured data in general, in a storage-efficient way, and to easily identify and deploy to S3 dataset snapshots, identified by repository and commit SHA (and optionally a tag), that need to be pulled for processing. S3 is also used to upload heavy files, which are then referenced by pointer, not URL, in Git commits. The no-full-checkout workflow shown applies naturally to adding data and can be extended to edits or deletions, provided the old data is known. That requirement ensures the creation of bidirectional diffs, which enable navigating Git history both forward and backward, useful when caching snapshots.
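To make the bidirectional-diff idea concrete, here is a minimal sketch of a row-level diff that records both added and removed rows, so history can be replayed forward or backward. The format and function names are illustrative, not the Data Manager's actual on-disk format, and the set-based logic ignores duplicate rows for simplicity.

```python
def diff(old_rows, new_rows):
    """Row-level bidirectional diff: stores both directions of the change.
    Illustrative sketch only; duplicate rows are ignored by the set logic."""
    old, new = set(old_rows), set(new_rows)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

def apply_forward(rows, d):
    # old snapshot + diff -> new snapshot
    return sorted((set(rows) - set(d["removed"])) | set(d["added"]))

def apply_backward(rows, d):
    # new snapshot + diff -> old snapshot (history navigated in reverse)
    return sorted((set(rows) - set(d["added"])) | set(d["removed"]))

v1 = ["a", "b", "c"]
v2 = ["b", "c", "d"]
d = diff(v1, v2)
assert apply_forward(v1, d) == sorted(v2)
assert apply_backward(v2, d) == sorted(v1)
```

Because the diff carries both directions, a snapshot can be reconstructed from the nearest cached snapshot on either side of the target commit.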
The burden of checking out and building snapshots from diff history is currently borne by localhost, but that may change, as mentioned in the video. Smart navigation of Git history from the nearest available snapshots, building snapshots with Spark, and other ways to save on data transfer and compute are being evaluated. This paradigm enables hibernating or cleaning up history on S3 for datasets that are no longer needed to create snapshots, such as deleted datasets, provided snapshots of earlier commits are not needed. Individual data entries could also be removed for GDPR compliance using versioning on S3 objects, orthogonal to Git.
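A sketch of the "referenced by pointer, not URL" idea from the post: the Git commit contains only a tiny content-addressed pointer, while the heavy payload lives on S3 and is resolved to a URL only at fetch time. Field names here are hypothetical, not the Data Manager's actual pointer format.

```python
import hashlib
import json

def make_pointer(s3_bucket: str, payload: bytes) -> str:
    """Build a small pointer file to commit to Git in place of the heavy
    payload. The payload itself is uploaded to S3; the pointer holds a
    content address, not a URL, so storage can move without rewriting
    history. Illustrative format, not the Data Manager's real one."""
    sha = hashlib.sha256(payload).hexdigest()
    return json.dumps({
        "oid": f"sha256:{sha}",   # content address of the payload
        "size": len(payload),
        "bucket": s3_bucket,      # resolved to a presigned URL at fetch time
    }, indent=2)

pointer = make_pointer("my-data-bucket", b"row1,row2,row3\n")
```

The pointer weighs a few hundred bytes regardless of payload size, which is what keeps the checked-out repo in the kilobyte range.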
The prototype already cures the pain point I built it for: it was impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small but frequent changes to any of the datasets in the repo, and (4) while being able to see the diffs in Git for each commit, to enable collaborative discussion and reverting or further editing where necessary.

Some background: I am building natural language AI algorithms that are (a) easily retrainable on editable training datasets, meaning changes or deletions in the training data are reflected quickly, without traces of past training and without retraining the entire language model (sounds impossible), and (b) able to explain decisions back to individual training data. LLMs have fixed training datasets, whereas editable datasets call for a system that manages data efficiently. I also wanted something that integrates naturally with common, tried-and-tested tools such as Git, S3, and MySQL, hence the Data Manager.
I am considering open source: is that the best way to go? Which license should I choose?

Looks like you've just reinvented GVFS (https://github.com/microsoft/VFSForGit) for a specific use case? Or is this just a partial clone? Or a shallow clone? Or both? It's unclear from the video if this is 130 GB of current state at the branch head or 130 GB of commit history.

It's like GVFS, but for pieces of a file at a time as well: rows, columns, or cells. A snapshot is recreated by putting those pieces together. If you have ten million rows in one file and only add a thousand rows daily, each commit will only contain those thousand rows in its tree, not the sum total; what your favorite diff tool then shows is really just the bidirectional diff. It is the low-level materialization of the diff paradigm Git applies during merges and rebases, where the three-way difference between object trees is taken and acted upon, but here the diffs themselves (data-level diffs on top of file-level ones) are placed under those trees. This overrides Git semantics: Git now deduplicates the diffs, not the entire original files, in order to recognize them as new objects and commit the new tree. In the video you can see the same file diff being overwritten in Git, representing a new piece in every commit. While you don't need the ten million rows to commit each new thousand rows, they are needed upon merging in order to detect conflicts. Object content referenced by S3 pointers is fetched if and when needed, but the Git objects themselves are always fetched, since they are really small. Strictly speaking it is neither a partial nor a shallow clone, as all the objects and trees are downloaded in the current implementation, but the S3 pointers enable similar delaying and filtering behavior, as with DVC. Sorry if the repo size is unclear; hope this is better:
~180 kB: current state at the branch head, includes pointers to S3 (exact size depends upon packs and indices), plus full history, also with pointers
~890 MB: current state at the branch head, after downloading all files referenced by pointers in the Git history from S3, plus full history, with pointers
~130 GB: commit history; this is what the repo would weigh with DVC or Git LFS. This repo corresponds to a use case with many small updates.

As the repo grows (even as the second, 890 MB state increases in size, let alone the fully materialized history), this enables working with the first (kilobytes) and still committing changes.

I use git lfs. There are filter options for all commands so you don't need to check out any more data than you want or need to. Works like a charm for me! I'd be curious to hear what features you are missing. We have repositories that would be as big as 100 GB if you downloaded all large files for the full history, but I guess I don't see why you would want to do that?

Which repo size after the filters do you work with on your machine, and how many GBs do you have in Git LFS, that is, in the cloud? I hear people complain about costs, but it depends upon scale and change frequency, which can increase total repo size.

We wrote our own LFS API server (which is actually not very hard; about 100 lines of Python was enough, and it performs at scale) so we can directly leverage Azure Blob Storage. If you don't walk this path and enable LFS on GitHub or something like that, the costs are obscene, yes. For us it's dirt cheap. If I check out head of my repo and don't filter anything, it's a couple of GBs. Inside the Azure Blob Storage container that backs our LFS API server, there's probably terabytes of data. It's really very, very much. We don't have any performance problems; one API instance can handle it. Of course we did make sure to implement it well: it's Uvicorn/Starlette, all IO is async, and all CPU-"intensive" work like JSON (de)serialization runs in a background threadpool.

That is really interesting and raises the question of how frequently you have changes in your data that lead to new commits. I am assuming here that you don't dedupe anything, that is, you throw the entire files into Azure with each version, since it's cheap enough for your purposes.
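The commenter's ~100-line server isn't shown, but the core trick they describe — replying to the Git LFS Batch API with download/upload URLs instead of streaming any data — can be sketched as a plain function that builds the batch response. The URL scheme below is a placeholder, not a real presigned Azure or S3 URL, and the response shape follows the public Git LFS Batch API spec.

```python
def lfs_batch_response(operation, objects, base_url):
    """Build a Git LFS Batch API response: for each requested object
    (identified by content hash 'oid'), hand back an href the client
    transfers directly against, so the API server never touches the bytes.
    base_url stands in for real presigned blob-storage URLs."""
    assert operation in ("download", "upload")
    return {
        "transfer": "basic",
        "objects": [
            {
                "oid": obj["oid"],
                "size": obj["size"],
                "actions": {
                    operation: {
                        "href": f"{base_url}/{obj['oid']}",  # hypothetical URL scheme
                        "expires_in": 3600,
                    }
                },
            }
            for obj in objects
        ],
    }
```

In the real server this function would sit behind a single async `POST /objects/batch` route, with URL signing delegated to the storage SDK; that separation is why one instance can front terabytes of data.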
Also, how frequently do you move head, even without committing anything new, perhaps to use another branch?

LFS stores files by content hash, so deduplication happens that way. But you're right that if you frequently make small changes to a single large file, it's wasteful. In our case, though, we don't frequently change files; we just get lots and lots of new big files coming in all the time. Moving head, as in, to check out another branch locally? Somewhat regularly, I guess. I suppose you're wondering about performance in that scenario? It's usually quite good, since git-lfs does some local caching as well. I've never needed to wait longer than a couple of seconds. I'm usually on a wired 1000/1000 Mbit optic fibre connection, and transfers go directly to and from an Azure Blob Storage container (the LFS API server only generates download and upload URLs; it intentionally doesn't transfer any data), with parallel connections and chunking etc., so it doesn't really get any better than that. And all of that is out-of-the-box functionality too. :)

Sorry, I should have been more specific: I meant block deduplication, or any form of deduplication at a level lower than the entire file. File deduplication can only get you so far, depending on the use case. XetHub does block deduplication, whereas I am implementing data-level deduplication, which is slower in recreating dataset snapshots (it can be parallelized and delegated) but allows savings on disk space with small but frequent changes, and can be tied to collaborative features to show diffs, comment on them, and revert or edit changes where needed, all while pointing clearly to specific commits. And also, potentially, to fork data or cumulative changes. Yes, I meant either checking out other branches locally or, in the general case, pointing to another branch to indicate to any services that data from that branch should be made available wherever it's consumed.
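The distinction the two commenters are drawing — file-level deduplication by content hash (the LFS model) versus anything finer — can be shown in a few lines: identical files collapse to one stored object, but a one-row change to a large file stores the whole file again. The store here is a plain dict standing in for blob storage.

```python
import hashlib

def store(blobs, storage=None):
    """File-level content-addressed store, as Git LFS dedupes: the key is
    the hash of the whole file, so any change produces a new full object.
    A dict stands in for the real blob-storage backend."""
    storage = {} if storage is None else storage
    for blob in blobs:
        storage[hashlib.sha256(blob).hexdigest()] = blob
    return storage

big = b"row\n" * 1_000_000

store_a = store([big, big])                 # same file twice: one object
store_b = store([big, big + b"new row\n"])  # one appended row: two full objects
```

Block- or data-level deduplication (XetHub, or the row/column/cell diffs discussed above) would instead store only the changed pieces in the second case, at the cost of reassembly work when a snapshot is materialized.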
I am assuming that each incoming new file is then added to data pipelines, possibly just a few. Sounds like you are in the sweet spot where you have the speed you want and, given infrequent changes, you are fine with the versions taking up terabytes on Azure, since they are mostly new data. Just read the second paragraph.*

Currently expanding merge resolution assistance to deal with the general merge conflict case, as well as implementing revert and cherry-pick assistance. Unsure if that is what you were wondering?

You probably don't want to do that with 100 GB if most of your commits are new data rather than changes. Yet I wonder whether all incoming new files are queued into the same pipelines, and whether the reason they are separate files is to avoid dealing with one giant cumulative file whose older parts would not be deduped in Git LFS, which is a great reason, or whether those files are different file types in terms of contents and intended use, another great reason, or something else altogether. Are you processing data by including anything in a given folder in a given commit?

*I have just (only now) read the second paragraph in your message. Not sure if that came across correctly; that first sentence was too compressed.

Where do git sparse checkouts stop and this begin?

The two are not mutually exclusive, in principle. Depending on workflows, sizes, and change frequencies, each has advantages. Sparse checkouts are useful with small files that specific individuals can focus upon, among other things. This is for keeping server-side repo size small when versioning larger datasets that change daily, and for avoiding the requirement to have a sparse checkout of those large datasets in workflows with incoming data in the first place. This is an interesting use case: https://news.ycombinator.com/item?id=35763004

Honestly, it'd be nice if there was something like a PNGCrush for Git repos. Or even if Git offered zstd compression, that would be cool too.
Git does compress repos, but the fact that versioning repositories with (huge) data is still an open problem suggests it is not the kind of compression that fixes it. I might be mistaken; are you aware of any interesting compression methods applied to version control?

I know git uses deflate, hence my first paragraph. That doesn't mean those deflate trees are optimal, as you can see with tools like OxiPNG optimizing deflate compression to reduce PNG file sizes by about half. The same optimization could be applied to git blobs in theory; it would be cool if there was a tool that did that. My second point was more about git being upgraded to add a new compression algorithm like zstd instead of deflate, much like the hash algorithm is being switched from SHA-1 to SHA-2 (though if I were in charge, I'd go with BLAKE3 because it's far faster).

Fully agree. Compression in many cases removes the ability to diff easily, however. In a large dataset where, in terms of size, 1% of the original data undergoes changes, or new data the size of 1% of the original dataset is added, I think compressing does not compare with just deduplicating the unchanged 99% in terms of storage; but when speed is the #1 factor, the discussion is more nuanced. It might be interesting to have a combination of deduplication and better compression of the changes, in some form, to get the optimal tradeoff. Repo sizes in ML these days are high; I'm curious which repository compression techniques are being evaluated and deployed.
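The 1%-change argument above is easy to put in numbers. A back-of-envelope sketch, using the thread's 130 GB repo as the example and assuming each commit changes 1% of the data by size; the function and its parameters are illustrative, not a measurement.

```python
def storage_after_n_commits(total_gb, change_frac, n_commits,
                            compression_ratio=1.0):
    """Compare storing a full copy per commit (DVC/LFS-style for an edited
    file) against storing only the changed fraction after the first commit.
    compression_ratio > 1 models applying compression on top of either."""
    full_copies = total_gb * n_commits / compression_ratio
    deduplicated = (total_gb
                    + total_gb * change_frac * (n_commits - 1)) / compression_ratio
    return full_copies, deduplicated

full, dedup = storage_after_n_commits(total_gb=130, change_frac=0.01,
                                      n_commits=100)
# full copies: ~13000 GB vs deduplicated: ~258.7 GB after 100 commits
```

Even a generous 2x compression ratio only halves both columns, which is why deduplication dominates for small, frequent changes; compression of the deduplicated diffs is then a complementary win, not a competing one.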