Palantir Foundry's dataset version control, a diff-based Git for data

blog.palantir.com

3 points by eliomattia 3 years ago · 4 comments

I just found this blog post. It seems Palantir Foundry, which does not come up often when researching git for data tools, includes a version control system for datasets that stores diffs in their own cloud-based filesystem. According to the author, one of the founding engineers of the platform, diffs are:

> particularly useful for append-only datasets of immutable records such as system logs or sensor readings which are often among the largest (and fastest-growing) datasets our customers use

Diffs seem to consist of additional files in separate folders:

> behind the scenes we effectively store each diff in a separate folder in the backing file system (e.g.,datasetA/diff1, datasetA/diff2, …) so that the whole dataset is simply represented by datasetA/*.

Without going into technical detail, the author suggests that deletes are handled logically rather than physically, so datasetA/* may not reflect the actual whole dataset. I infer that they might be logging changes under the hood in a Git-like fashion.

> It’s a bit more complicated than this because users can selectively delete files from those diffs
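
To make my reading of this concrete, here is a minimal Python sketch of append-only diff folders with logical deletes. This is only my guess at the general idea, not Foundry's actual implementation: the function names (`write_diff`, `current_view`, `demo`), the file names, and the in-memory `deleted` set standing in for whatever metadata log they keep are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_diff(dataset: Path, diff_name: str, files: dict[str, str]) -> None:
    """Append a new diff folder (e.g. datasetA/diff1) holding new files."""
    diff_dir = dataset / diff_name
    # exist_ok=False: an existing diff folder is never overwritten.
    diff_dir.mkdir(parents=True, exist_ok=False)
    for name, content in files.items():
        (diff_dir / name).write_text(content)


def current_view(dataset: Path, deleted: set[str]) -> list[str]:
    """Return datasetA/*/* minus the files marked as logically deleted."""
    paths = (
        p.relative_to(dataset).as_posix()
        for p in dataset.glob("*/*")
        if p.is_file()
    )
    return sorted(p for p in paths if p not in deleted)


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / "datasetA"
        write_diff(dataset, "diff1", {"part-0.csv": "a,1\n"})
        write_diff(dataset, "diff2", {"part-1.csv": "b,2\n"})
        # Hypothetical delete log: the file stays on disk but is hidden.
        deleted = {"diff1/part-0.csv"}
        physical = current_view(dataset, deleted=set())
        logical = current_view(dataset, deleted)
        return physical, logical
```

In this sketch `demo()` returns two listings: the physical one still contains `diff1/part-0.csv`, while the logical one does not. That gap is what I mean by datasetA/* not necessarily reflecting the whole dataset.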

However, it seems that the raw versioning data they manage is not directly available to clients or users:

> a simple request that we frequently get from our customers: “can we export our datasets from Palantir Foundry to our existing data lake or S3 bucket?“ While this is of course possible, it is important to understand that such exported datasets lack precisely those versioning and sandboxing features that make Foundry a great tool for collaborative data engineering.

This could be a mechanism for vendor lock-in, tied to the very important ACID guarantees of their implementation.

I came across their post while doing research on existing solutions for dataset versioning. Some extra background here: https://news.ycombinator.com/item?id=35930895

RealQwertie 3 years ago

Check out terminusdb. Has most/many of the same features and is open source.
- eliomattiaOP 3 years ago
  Really interesting, also diff-based, and 3.5 years in development.
  On the homepage I read "An in-memory, distributed, and open-source document graph database". Do you know whether the whole database, including all documents, needs to be in memory, and what happens when the datasets exceed available memory space? Or is it perhaps in-memory per document, one-by-one?
  Do you have to create diffs manually with terminusdb (CLI example below from https://terminusdb.com/products/terminusdb/), or can they be detected automatically from, e.g., SQL database tables or files in a folder, similarly to Git monitoring a working directory and committing changes based on its contents?

      # Add more philosophers to new branch
      echo '{ "name": "Plato" }' | terminusdb doc insert admin/philosophers/local/branch/changes
      echo '{ "name": "Aristotle" }' | terminusdb doc insert admin/philosophers/local/branch/changes
      # Look at the difference between branches
      terminusdb diff admin/philosophers --before-commit main --after-commit changes | jq
      # Apply the differences to main
      terminusdb apply admin/philosophers --before-commit main --after-commit changes
  - ggleason 3 years ago
    
    TerminusDB represents the data using succinct data structures which reduces the required memory substantially over many other representations. Each branch needs to be capable of being loaded into memory completely - but individual revisions are loaded separately.
    Diffs can be constructed between two objects, or you can get sets of diffs of objects between commits automatically. You can manually construct diffs and use them to patch branches.
    We don't have a conversion tool from SQL database tables, but it's something on my list.
