SnowFS – a fast, scalable version control file storage for graphic files

github.com

86 points by karateka 5 years ago · 28 comments


lucideer 5 years ago

From the README:

> Why not Git/Git-LFS, libgit2, or SVN?

> Disadvantages:

> (Without Git-LFS): Heavy cost with zipping, packing, and delta-compression for larger files

Given the caveat (without Git-LFS) it seems odd to include this in the list

> If not properly tracked, binaries become accidentally part of "base" history

That's a big "if", and not an inherent problem. This could easily be resolved by any good design-focused UI (e.g. SnowTrack), so this seems a poor argument against using Git as a backend.

> Removing older commits is cumbersome due to Gits commit hashing integrity

This (like the first bullet point) does not apply to Git-LFS.

> Complicated rewriting history procedure

What?

> Issues with binaries >4GB on Windows

A known bug in Git-LFS that they're working to fix. There are workarounds provided in the linked tickets (that could be leveraged by a UI / abstraction layer like SnowTrack).

This is the first item in the bullet list that is a real disadvantage of Git LFS, but the workaround for it seems much less effort than developing a new VCS backend from scratch.

> Slow in binary modification detection

I'm not sure if this applies to Git or Git LFS; there's little detail provided. But if it's significant, this is probably the only really compelling disadvantage listed.

> Git uses a restrictive license

And finally we see the real reason for not using Git.

---

NOTE: I don't mean to make out that building an alternative VCS to Git is not worth pursuing. Nor that it needs any specific justification. Just that listing a justification that seems (to me) mostly disingenuous is worth pointing out.

sebastian_io 5 years ago

Thanks for listing out the points you disagree with! The project is still in early alpha, and so is the README. Anything that is ambiguous or unclear is therefore worth addressing.
The main requirement is performance (a missing point in your list). If Git would be a good candidate as a versioning system for DCC software packages, it would have been picked up by now, but it didn't happen, among other things because of the reasons listed above. Git addresses a completely different target audience and lifecycle than SnowFS. The commit hash integrity is a problem in CG/VFX productions, so is the 4GB limitation, as well as the I/O performance for large binary files. The fact that these issues are still there is fully understandable, given the responsibilities and dependencies of that project. That's why SnowFS tries to address the niche requirements with its lightweight implementation.
In terms of the license, this is intentionally the weakest argument of all. The GPL doesn't prevent anyone from shipping Git as an external program alongside commercial software, and the same goes for libgit2 with its linking exception. So there is not even a real benefit here. But the chosen MIT license is an open invitation for everyone.
P.S. Certain features and technical solutions will be feature-proposed to libgit2
- lucideer 5 years ago
  
  > Thanks for listing out the points you disagree with!
  > The main requirement is performance (a missing point in your list)
  The points I listed were not selected by me! I listed all the points in the project README verbatim. Other than the 2nd-to-last point (slow binary mod detect), performance wasn't otherwise listed in the README, so I don't know why you're calling me out for omitting it.
  > If Git would be a good candidate as a versioning system for DCC software packages, it would have been picked up by now
  You seem to be implying something else was picked up instead of Git? Other than SnowFS (which you say is in early alpha), what else has surpassed Git in this space? If nothing else has yet been picked up in place of Git, this argument isn't applicable.
  > The commit hash integrity is a problem in CG/VFX productions, so is the 4GB limitation, as well as the I/O performance for large binary files
  You seem to be again comparing SnowFS to Git-without-Git-LFS (other than the 4GB limitation which I already addressed in my comment). This is, as I said, disingenuous. Why keep making this selective comparison?
  > P.S. Certain features and technical solutions will be feature-proposed to libgit2
  That's cool, and I wish them the best of luck with developing these solutions. As I said, I've no issue with SnowFS; new approaches are always cool. I just think their listed justification is disingenuous; they'd be better off simply stating they want to develop something new and leaving it at that.
  - sebastian_io 5 years ago
    
    > Other than the 2nd-to-last point (slow binary mod detect), performance wasn't otherwise listed in the README, so I don't know why you're calling me out for omitting it.
    The ones I was referring to for performance are:
    - Support for instant snapshots
    - Support for instant rollback
    About alternatives, Perforce and PlasticSCM are currently commonly used. But I understand your objections, and will check if I can handle certain things differently in the README. Thanks again for your input!
    
    lucideer 5 years ago
    
    Thanks, and apologies if it came across as overly critical.
    I just think things like this can be impactful to perceptions of the work done on efforts such as Git-LFS, and describing things based on their own merits is often a better approach than pointing out lack elsewhere.
    A VCS UI for design is something I've been looking for for a LONG time, so I signed up to the SnowTrack public beta immediately. I was just a little confused/concerned to learn it won't have a widely-supported backend to ease things such as synchronisation across devices, sharing resources via a hosted service, etc. Curious to see how this gets handled in the final product.
    
    sebastian_io 5 years ago
    
    I really appreciate your critical view of the project, because it makes me reflect on my own stance and arguments and check whether they hold up. E.g. I just removed the "without LFS" argument because you are right, it is not a sustainable argument, and I will address a few more soon to clear things up.
    I learned a lot during the development of SnowFS, and the open-source community is the best place to share my experiences; that's why I put it on GitHub. In the end, I would be super happy if these insights could make it over to Git and Git-LFS.
    We have a Discord channel, you are very welcome to swing by anytime for a virtual beer :-)
- unqueued 5 years ago
  
  Have you looked into git-annex?
  Git annex lets you track references to binary files, only using git for storing references to file hashes.
  And you can use custom back ends to efficiently store differential data.
  For example, I have an annex repo that stores about 150G of text files, but it uses bup to compress it down to about 20G, while I can still have access to different versions via git.
  1: https://git-annex.branchable.com/special_remotes/bup/
  - sebastian_io 5 years ago
    
    Impressive numbers! Unfortunately I know git-annex only on paper. I gave it a try a while ago, but it was a bumpy start, admittedly most likely user-error. Would you mind sharing some details about it (e.g. file numbers, etc)? Can I invite you for a chat? Doesn't need to be long, but might be more suitable for a chat
    
    unqueued 5 years ago
    
    Sure, how would you like to get in touch? You have a discord, right? I actually was looking at your project and was thinking of opening a simple PR. (same username)
    I have some more example git-annex repos:
    This is an annex repo I made of this popular abandonware website:
    https://github.com/unqueued/repo.macintoshgarden.org-fileset
    And some podcasts
    https://github.com/unqueued/radiolab-fileset
    https://github.com/unqueued/ratholeradio-archive
    What's cool is that people can use standard pull requests to add files to the repo. And the repo itself is small, but it can represent huge filesets. Datalad has some really fascinating medical imaging data repos that are massive (https://www.datalad.org/datasets.html).
    If you wanna see a really good example of a repo with versioned binary files, check out the git-annex repo of previous git-annex binary releases:
    https://downloads.kitenet.net/.git/
    You can just use standard git workflows to see previous revisions of a file (well, previous hashes) but it is really easy to hook into.
    
    sebastian_io 5 years ago
    
    Very excited for a PR. Any help and support is very welcome. :-)
    I just cloned one of the repos; it seems I really should look more into annex. Feel free to join the Discord channel, that would be the easiest way to go from there
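
The idea described in this subthread (git tracks only small references to file hashes, while the large binaries live in a separate store) can be illustrated with a minimal sketch. This is not git-annex's actual key format or storage layout; the `sha256:` pointer format, the `.ptr` suffix, and the function names `add_to_store`, `write_pointer`, and `resolve_pointer` are all hypothetical, chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_store(src: Path, store: Path) -> str:
    """Copy src into the store under its SHA-256 digest and return the digest."""
    h = hashlib.sha256()
    with src.open("rb") as f:
        # Hash in 1 MiB chunks so large files are never loaded fully into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    target = store / digest
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(src, target)
    return digest


def write_pointer(pointer: Path, digest: str) -> None:
    """Write the small text file that a VCS would track instead of the binary."""
    pointer.write_text(f"sha256:{digest}\n")


def resolve_pointer(pointer: Path, store: Path) -> bytes:
    """Read a pointer file and return the content it refers to."""
    digest = pointer.read_text().strip().split(":", 1)[1]
    return (store / digest).read_bytes()


def demo() -> bool:
    """Round-trip a stand-in binary through the store via its pointer file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        big = root / "texture.psd"
        big.write_bytes(b"\x00" * 1024)  # stand-in for a large binary
        digest = add_to_store(big, root / "store")
        pointer = root / "texture.psd.ptr"
        write_pointer(pointer, digest)
        return resolve_pointer(pointer, root / "store") == big.read_bytes()


print(demo())
```

Because the tracked pointer file is a few dozen bytes regardless of the binary's size, the repository itself stays small, which matches the "the repo itself is small, but it can represent huge filesets" observation above.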

mrmrcoleman 5 years ago

This is great, well done. While at (now defunct) Dotscience we did a lot of work on Dotmesh which you might find interesting: https://github.com/dotmesh-io/dotmesh

I would also look at data science/ML as a potential use for the tool as there are real issues with using Git for training data.

One last point, which is more of a tip: show, don't tell. If you did some side-by-side workflow walkthroughs showing the difficulties with other tools, it would make it easier for people to see that this problem is real (which it definitely is).

sebastian_io 5 years ago

Interesting project, thanks for sharing the link! I will check that out. And for the tip, agree 100%! :-)

jarym 5 years ago

I need to look into this more but unless you already know C/C++ really well then Rust may be a better choice.

sebastian_io 5 years ago

Same here. The goal is a C++ backport, because I am a C++ dev, but Rust is on my radar and I'm happy to learn it.

amelius 5 years ago

But what if I have software and images (e.g. as in game software)? Which version control system do I use?

Honestly, I think the effort was better spent on an improved version of Git.

lhoff 5 years ago

Maybe a mixed approach similar to what dvc [0] is doing. Fully integrated with git but tracked and stored elsewhere.
[0] https://dvc.org/

sebastian_io 5 years ago

Programmers, game makers and 2D/3D artists are very different target audiences with very different needs. To name one example, commit hash integrity, which is a foundation of Git, is a must-have for software projects but might not be useful in environments like VFX or CG productions.
- mbreese 5 years ago
  
  Also, the project is very clear that it is for individuals, with no collaboration tools built in. While git does have utility for a solo developer, having a separate system for a solo artist (who isn't working with assets for developers) starts to make more sense.
- amelius 5 years ago
  
  > Programmers, game makers and 2D/3D artists are very different target audiences with very different needs.
  Yes, so if you want to make them use the same version control system (not a strange requirement, see Google monorepo), then the system should be all-encompassing, and not limited to one type of user.
  And it may turn out that features meant for one type of user are also useful for other types of user. For example, programmers may need to store large image files as part of a deep-learning image classification system.
  I'm usually not a fan of packing as many features as possible into a single tool, but in case of version control it makes sense. (A pluggable/modularized approach should of course be used if possible).
  - sebastian_io 5 years ago
    
    You are absolutely right! A tool that can cover everyone's needs is a great goal, and a pluggable or modularized approach might even be the best. SnowFS is not a one-way street; certain features are currently being investigated to be feature-proposed for libgit2. In other words, this project is a great way to explore the possibilities without any historical or technical barriers, because it is so lightweight. The best proof is the benchmarks.

iaml 5 years ago

AFAIK big studios use perforce.
- sebastian_io 5 years ago
  
  In most productions I've seen, Perforce is still the way to go, with PlasticSCM sometimes used for Unity

kevlar1818 5 years ago

Very interesting. I'd like to learn more about how it works. How does this compare to DVC[1], for instance?

I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.

Are the mentioned benchmarks published somewhere?

[1]: https://dvc.org [2]: https://github.com/kevin-hanselman/dud

sebastian_io 5 years ago

SnowFS is still in its early stages, so it will take shape over time. As for the benchmarks, these are the results on a 2019 MacBook with APFS for a 4GB file:
[git add texture.psd: 20164ms]
[snow add texture.psd: 4596ms]
[git rm texture.psd: 575ms]
[snow rm texture.psd: 111ms]
[git checkout HEAD~1: 9739ms]
[snow checkout HEAD~1: 1ms]
You might get slightly slower speeds on NTFS for 'add' and 'checkout', but it is still very performant
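
To put the timings reported in the comment above into relative terms, here is a small sketch that computes the git-to-snow ratio per operation. The numbers are copied verbatim from the comment; the dictionary layout and the `speedup` helper are just illustrative, and the ratios only restate those reported figures for one machine and one file rather than adding any new measurement.

```python
# Timings in milliseconds, copied from the comment above
# (4GB file, 2019 MacBook, APFS).
timings_ms = {
    "add": {"git": 20164, "snow": 4596},
    "rm": {"git": 575, "snow": 111},
    "checkout HEAD~1": {"git": 9739, "snow": 1},
}


def speedup(git_ms: float, snow_ms: float) -> float:
    """Return how many times faster snow was than git for one operation."""
    return git_ms / snow_ms


speedups = {op: speedup(t["git"], t["snow"]) for op, t in timings_ms.items()}

for op, factor in speedups.items():
    print(f"{op}: {factor:.1f}x")
# prints:
# add: 4.4x
# rm: 5.2x
# checkout HEAD~1: 9739.0x
```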

erlend_sh 5 years ago

There is a somewhat related discussion going on about Pijul’s suitability for use with game editors: https://discourse.pijul.org/t/using-pijul-for-a-game-editor/...

sebastian_io 5 years ago

That looks really cool! I will check the project out, thanks!

pjmlp 5 years ago

JavaScript[0] isn't what I would call a fast storage repository, but I guess it works out for prototyping.

In the context of porting to C and C++, or making it execute faster, I can see a few options that need only minor rewrites:
- Use AssemblyScript and generate native code via WebAssembly AOT compilers.
- Try to adapt the TypeScript to C++ compiler from Microsoft's MakeCode project.
- Implement your own C++ code generator.

It would be much easier than maintaining multiple code bases in parallel, plus any memory corruption issues would most likely be bugs in the code generator.

[0] - Yes, I know the source code is TypeScript.

sebastian_io 5 years ago

Indeed, TS/JS is great for its quick turnaround times when prototyping. But the I/O work is executed by the underlying C/C++ layer, and for the rest TS/JS is fast enough. A full C++ backport is still on the horizon, though
