Benchmarking Versioning Tools: S3, DVC, Git LFS, and XetHub
about.xethub.com
Interesting. I'm always skeptical of blog posts companies make about themselves, but I can definitely see how block-level deduplication could be a huge help.
I wonder how it scales with the number of blocks (e.g. with 10 TB of data)? I would guess that it really helps with the amount of data stored, but then slows way down as more data gets added because of the extra overhead of tracking and deduping blocks.
Also, does it work across repositories?
What specifically is meant by "Modern Development Experience"? I see GitHub eliminating a lot of development pain points with its tools for repository collaboration and integration, which is what I associate with the current trend in repositories. This post focuses only on keeping data with the code, which is an interesting assumption...
re: "modern development experience" - more and more technical projects can be seen as "just" software projects with huge data dependencies. For gaming, the data dependencies are binary texture files, sound files, and more. For biotech these are binary formats from equipment and for analysis. And for ML these are images, video, text (and binary formats for text like Parquet).
One of the core motivations behind XetHub is to enable teams across industries to benefit from the workflow we've used in software for 15+ years. We've used this workflow for so long that it is easy to overlook its benefits.
Software teams have a clear picture of who is working on what, what is in flight, what is in review, and what is remaining. Anyone on the team can easily pick up work in progress from someone else or start a new derivation of work without concern about interference. Teams can be distributed across timezones and yet everyone feels connected to the project and is able to contribute without disruption.
The power of a GitHub-style workflow for team collaboration comes from being able to experiment freely (branches or forks), review easily (pull requests), and observe (passively learn) best practices from the team (issues, code review feedback).
The dedupe is optimistic and is designed to scale to the 1-10 PB range. We are working on a more detailed architecture blog post. We can dedupe across repositories, but right now we do not, largely for privacy reasons: sharing blocks between different people can cause information leakage.
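To make the per-repository point concrete, here is a toy sketch of block-level dedupe. It is not XetHub's implementation: the fixed 4-byte blocks, the `BlockStore` class, and the repo names are all made up for illustration, and a real system would likely use much larger, content-defined chunks. The only idea it shows is that keying blocks by (repository, hash) dedupes within a repo while never sharing blocks across repos.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


class BlockStore:
    """Toy content-addressed store that dedupes blocks per repository."""

    def __init__(self):
        # (repo name, SHA-256 hex digest) -> block bytes
        self.blocks = {}

    def put(self, repo, data):
        """Store data and return the ordered list of block digests."""
        manifest = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Only stores the block if this repo has not seen it before.
            self.blocks.setdefault((repo, digest), block)
            manifest.append(digest)
        return manifest

    def get(self, repo, manifest):
        """Rebuild the original bytes from a manifest of digests."""
        return b"".join(self.blocks[(repo, digest)] for digest in manifest)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()
v1 = store.put("repo-a", b"AAAABBBBCCCC")
v2 = store.put("repo-a", b"AAAABBBBDDDD")  # only the last block changed
after_repo_a = store.stored_bytes()

store.put("repo-b", b"AAAABBBBCCCC")  # same bytes, different repo
after_repo_b = store.stored_bytes()

print(after_repo_a, after_repo_b)
```

The second version in repo-a only adds one new block, while the identical file in repo-b is stored again in full. That is the storage cost of not sharing blocks across repositories, traded for not leaking whether someone else already has a given block.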
(disclaimer: XetHub co-founder here) What other tools should we add to this benchmarking set?
Last year we benchmarked this set along with LakeFS. Should we add LakeFS back?