Graviton Database: ZFS for key-value stores

github.com

86 points by autopoiesis 5 years ago · 58 comments

aftbit 5 years ago

>Graviton is currently alpha software.

More like the "BTRFS for key-value stores" ;)

Kidding aside, I dislike when new unproven software claims the name of industry standards like this. When I saw the headline, I was hoping this somehow actually leveraged ZFS's storage layer, but actually it is just a new database that thinks Copy-on-Write is cool.

  • innagadadavida 5 years ago

    Title is very clickbaity; this is just another KV store, completely unrelated to ZFS.

    • mayama 5 years ago

      Even their README is clickbaity then. I quickly glanced at their repo and thought it was somehow related to ZFS before reading the comments here.

  • BryanG2 5 years ago

    Every piece of new, unproven software is alpha at launch; I've yet to see any software that was perfect from launch. Graviton provides what ZFS offers, plus features beyond ZFS. :)

ysleepy 5 years ago

Nice!

I implemented pretty much the same trade-off set in an authenticated storage system.

Single writer, radix Merkle tree, persistent storage, hashed keys, proofs.

I guess it is a local maximum within that trade-off space.

I like how the time travelling/history is always touted as a feature (which it is), but it really just means the garbage collector/pruning part of the transaction engine is missing. Postgres and other mvcc systems could all be doing this, but they don't. The hard part of the feature is being able to turn it off.

I'll probably have a look around later; the diffing looks interesting. Not sure yet whether it's done using the Merkle tree (likely) or some commit-walking algorithm.

  • mulander 5 years ago

    > I like how the time travelling/history is always touted as a feature (which it is), but it really just means the garbage collector/pruning part of the transaction engine is missing. Postgres and other mvcc systems could all be doing this, but they don't.

    Postgres actually did tout it as a feature in "THE IMPLEMENTATION OF POSTGRES" by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1] search for "time travel" in the PDF. I added the relevant quotes below for easier access ;)

    This was back when PostgreSQL had the POSTQUEL language (before SQL was added); there was special syntax to access data at specific points in time:

    > The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.

    Quoting the paper again:

    > For example to find the salary of Sam at time T one would query:

        retrieve (EMP.salary)
        using EMP [T]
        where EMP.name = "Sam"
    
    
    > POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.

    [1] - https://dsf.berkeley.edu/papers/ERL-M90-34.pdf

    • rmetzler 5 years ago

      Is this still possible with Postgres?

      • mulander 5 years ago

        Yes and no. Or, to be precise: to a certain degree, but not through an exposed language feature.

        PostgreSQL still does copy-on-write, so old versions of a row exist and are present in storage. However, there is now an autovacuum process that regularly goes over the records, marking those no longer visible to any transaction as re-usable, so eventually the old records get overwritten.

        You can get at the older versions of the rows directly on disk, or perhaps it would be possible to get the db to return such older versions of the rows. It seems that by default even trying to get at them with `ctid` is not possible, so that may require hacking PostgreSQL itself or using an extension, which seems to actually exist[1].

        [1] - https://github.com/omniti-labs/pgtreats/tree/master/contrib/...

derefr 5 years ago

Does anyone know of an embedded key-value store that does do versioning/snapshots, but doesn’t bother with cryptographic integrity (and so gets better OLAP performance than a Merkle-tree-based implementation)?

My use-case is a system that serves as an OLAP data warehouse of representations of how another system’s state looked at various points in history. You’d open a handle against the store, passing in a snapshot version; and then do OLAP queries against that snapshot.

Things that make this a hard problem: The dataset is too large to just store the versions as independent copies; so it really needs some level of data-sharing between the snapshots. But it also needs to be fast for reads, especially whole-bucket reads—it’s an OLAP data warehouse. Merkle-tree-based designs really suck for doing indexed table scans.

But, things that can be traded off: there’d only need to be one (trusted) writer, who would just be batch-inserting new snapshots generated by reducing over a CQRS/ES event stream. It’d be that (out-of-band) event stream that’d be the canonical, integrity-verified, etc. representation for all this data. These CQRS state-aggregate snapshots would just be a cache. If the whole thing got corrupted, I could just throw it all away and regenerate it from the CQRS/ES event stream; or, hopefully, “rewind” the database back to the last-known-good commit (i.e. purge all snapshots above that one) and then regenerate only the rest from the event stream.

I’m not personally aware of anything that targets exactly this use case. I’m working on something for it myself right now.

Two avenues I’m looking into:

• something that acts like a hybrid between LMDB and btrfs (i.e. a B-tree with copy-on-write ref-counted pages shared between snapshots, where those snapshots appear as B-tree nodes themselves); a toy sketch follows this list

• “keyframe” snapshots as regular independent B-trees, maybe relying on L2ARC-like block-level dedup between them; “interstitial” snapshots as on-disk HAMT ‘overlays’ of the last keyframe B-tree, that share nodes with other on-disk HAMTs, but only within their “generation” (i.e. up to the next keyframe), such that they can all be rewritten/compacted/finalized once the next keyframe arrives, or maybe even converted into “B-frames” that have forward-references to data embedded in the next keyframe.
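
To make the first bullet concrete, here's a toy Go sketch (a plain binary tree stands in for ref-counted B-tree pages, so this is not Graviton's design): each snapshot is just another root pointer, and unchanged subtrees are shared between versions via path copying.

    package main

    import "fmt"

    // node is an immutable tree node; unchanged subtrees are shared
    // between snapshots (copy-on-write via path copying).
    type node struct {
        key, val    string
        left, right *node
    }

    // insert returns a new root, copying only the nodes on the path from
    // the root down to the inserted key; everything else is shared.
    func insert(n *node, key, val string) *node {
        if n == nil {
            return &node{key: key, val: val}
        }
        cp := *n // shallow copy: children still point at shared subtrees
        switch {
        case key < n.key:
            cp.left = insert(n.left, key, val)
        case key > n.key:
            cp.right = insert(n.right, key, val)
        default:
            cp.val = val
        }
        return &cp
    }

    func get(n *node, key string) (string, bool) {
        for n != nil {
            switch {
            case key < n.key:
                n = n.left
            case key > n.key:
                n = n.right
            default:
                return n.val, true
            }
        }
        return "", false
    }

    func main() {
        // Each committed version is just another root in a dense, linear history.
        snapshots := []*node{nil}
        head := func() *node { return snapshots[len(snapshots)-1] }

        snapshots = append(snapshots, insert(head(), "alice", "100"))
        snapshots = append(snapshots, insert(head(), "bob", "200"))
        snapshots = append(snapshots, insert(head(), "alice", "150"))

        v1, _ := get(snapshots[1], "alice")
        v3, _ := get(snapshots[3], "alice")
        fmt.Println(v1, v3) // "100 150": old versions stay readable after later writes
    }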

  • noahdesu 5 years ago

    Worked on a project with similar goals [0]. By no means a production level implementation, but much of the system exists as a proof-of-concept.

    [0]: https://makedist.com/projects/cruzdb/

  • random3 5 years ago

    You need something like HBase, but embedded. MVCC would give you the snapshot isolation (perhaps there's something with fewer guarantees?) and you'd need lexicographic key ordering to do efficient scanning. You'd only need the memory layout (e.g. LSM) if you keep a write-ahead log from which to recover.

    LevelDB / RocksDB (and related) may be close, but not sure about MVCC aspects (see https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/)

    • derefr 5 years ago

      You misinterpreted, I think. The point isn’t “snapshot isolation” in the MVCC sense (working with multiple snapshots-in-progress); it’s the ability to, in essence, work with the database the way git works with commits: opening a transaction “on top of” a base commit, then “committing” that transaction to create a new commit object, with its own explicit ref, where you can later “check out” an arbitrary ref.

      Except, unlike git, this database wouldn’t need to be able to create new commits off of anywhere but the HEAD; and also wouldn’t need to be able to have more than one in-progress write transaction open at a time. No need for MVCC at all; and no need for a DAG. The “refs” would always just be a dense linear sequence.

      Also, unlike git (or a cryptographically-verified / append-only store), there’s no need to keep around “deleted” snapshots. It would actually be a huge benefit to be able to purge arbitrary snapshots from the database, without needing to do a mark-and-sweep liveness pass to write out a new copy of the database store.

      The key constraint that differentiates this from e.g. a Kafka Streams-based CQRS/ES aggregate, is that you should be able to reopen and work with any historical database version instantly, with equal-amortized-cost lookups from any version, without needing to first do O(N) work to “replay” from the beginning of time to the snapshot, or to “rewind” from the HEAD to the snapshot. This system would need all snapshots to be equally “hot” / “online” for query access, not just the newest one.

      In other words, such a database should work just like filesystem snapshots do in a copy-on-write filesystem.
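
      To sketch the shape of API I mean (Go, with entirely hypothetical names; this is not any existing library):

          package versionedkv

          // Store is a hypothetical single-writer, multi-version KV store with
          // git-like commit/checkout semantics but only a linear history
          // (no DAG, no MVCC, at most one write transaction at a time).
          type Store interface {
              Begin() (WriteTx, error)           // open the one write tx, always on top of HEAD
              At(version uint64) (ReadTx, error) // open any committed version read-only, instantly
              Head() uint64                      // newest committed version
              Purge(version uint64) error        // drop an old version without a mark-and-sweep pass
          }

          type ReadTx interface {
              Get(key []byte) ([]byte, error)
              // Scan supports whole-bucket reads for OLAP-style queries.
              Scan(prefix []byte, fn func(k, v []byte) error) error
              Close() error
          }

          type WriteTx interface {
              ReadTx
              Put(key, value []byte) error
              Delete(key []byte) error
              Commit() (uint64, error) // creates version Head()+1; older versions stay readable
          }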

      • ec109685 5 years ago

        It seems like coupling a database checkpoint process with the file system’s snapshot process should be theoretically possible:

        1) The database is informed a snapshot is needed
        2) The database finalizes any in-progress writes and starts logging new writes to another file
        3) Take a file system snapshot
        4) Inform the database the snapshot is finished

        With the file system snapshot taken in step 3, you should have a perfect and quick representation of the database at that point in time (when the database was informed it should stop logging).
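
        A rough Go sketch of those four steps, assuming a ZFS dataset and the zfs CLI; the database-side methods are hypothetical placeholders:

            package dbsnapshot

            import (
                "fmt"
                "os/exec"
                "time"
            )

            // Checkpointer is a stand-in for the database side of the protocol;
            // both methods are hypothetical and correspond to steps 1-2 and 4.
            type Checkpointer interface {
                PrepareSnapshot() error // finalize in-progress writes, switch to a new log file
                SnapshotDone() error    // resume writing to the normal log
            }

            // SnapshotWithZFS runs the four steps against a ZFS dataset such as
            // "tank/db". It assumes the zfs CLI is available and the process has
            // permission to create snapshots.
            func SnapshotWithZFS(db Checkpointer, dataset string) (string, error) {
                if err := db.PrepareSnapshot(); err != nil { // steps 1 and 2
                    return "", err
                }
                defer db.SnapshotDone() // step 4, even if the snapshot fails

                name := fmt.Sprintf("%s@db-%d", dataset, time.Now().Unix())
                out, err := exec.Command("zfs", "snapshot", name).CombinedOutput() // step 3
                if err != nil {
                    return "", fmt.Errorf("zfs snapshot: %v: %s", err, out)
                }
                return name, nil
            }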

        • derefr 5 years ago

          File system snapshots have the analogous desired properties for files; but filesystem snapshots are actually quite heavyweight, because they deal with dirents, inodes, extents, etc. CoW filesystem snapshots are designed for ops-task-granularity usage, e.g. daily backups; not for per-transaction historical archiving. CoW filesystems tend to fall over once you get to 100K snapshots. (I tested!) A database that took a snapshot after every CQRS/ES transaction could be expected to potentially have billions of snapshots.

          A system that did its snapshots “inline” to itself, by e.g. managing a pool of pages with a free-list the way LMDB does — but where Txs ultimately add a new version to the root bucket as a snapshot, rather than replacing the root bucket page with themselves — would get a lot closer to allowing one to have at least tens-of-millions of snapshots online. At that point, to achieve a billion snapshots online, you “only” need to shard your timeline across a cluster of 100 nodes.

          This is precisely one of the experiments I’m trying. :)

  • smbrian 5 years ago

    Have you looked at LSM k-v stores (RocksDB being the obvious one)?

    Under the hood they are based on building immutable layers of data that are implicitly merged. Clones that share data are cheap (check out rocksdb-cloud for a rocksdb fork which adds this copy-on-write-esque snapshotting). When overwriting a key, the old value will get lazily garbage-collected, but there are ways around that.

    I haven't explored it for this usecase, but seems like it might work...

    • derefr 5 years ago

      LSM trees are fast for writes, but not fast for reads. In fact, there’s essentially no difference in performance between an LSM tree, and a Merkle-tree whose keyspace is encoded within a B-tree.

  • shrubble 5 years ago

    If I am understanding you properly, couldn't you do this with SQL by specifying the range of data that represents the timeseries you care about, selected via materialized view?

    If you used a stored procedure to compute the range that becomes the view, then all you need to store are the parameters to feed to the stored procedure again, which data you could itself store in a separate table.

    • derefr 5 years ago

      As I said in a sibling comment — every version needs to be “hot” / “online” at the same time. The point of this system is to allow for random access to OLAP queries for arbitrary historical versions of the system; and, in fact, to even do time-series reports that perform a given analysis against every available version of the data, hopefully with some degree of parallelism. In matview terms, that means that every version of the data needs to be concurrently materialized.

      Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.
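
      (Back-of-the-envelope: 20 GB per fully materialized version × 1,000,000 versions ≈ 20 PB if nothing is shared between versions.)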

      Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)

      • shrubble 5 years ago

        So use views instead of materialized views. How much time/effort could be saved by simply making better queries? Are you sure you are looking for, or controlling for, the right problems?

        • derefr 5 years ago

          The goal is to host infrastructure to accelerate arbitrary user queries, à la a Business Intelligence data-warehouse backend. We don't get to specify what queries the users are doing. It's the classic problem that SQL RDBMSes were introduced to solve: having the data, and having to shape it in advance of knowledge of the reporting workload.

  • ysleepy 5 years ago

    Well, it all depends on what operations you need. Exact key lookup/write? Easy: just use any KV store, append the version to the key, then do a ceil/floor lookup in the KV with the key + target version.
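
    A minimal sketch of that encoding and floor lookup in Go (store-agnostic; a sorted slice stands in for the KV store's key space, and the helper names are made up):

        package main

        import (
            "encoding/binary"
            "fmt"
            "sort"
        )

        // versionedKey appends a big-endian version to the user key, so that
        // for a fixed key the entries sort in version order inside any
        // lexicographically ordered KV store. (A real implementation would
        // need a separator that cannot appear inside keys.)
        func versionedKey(key string, version uint64) string {
            var v [8]byte
            binary.BigEndian.PutUint64(v[:], version)
            return key + "\x00" + string(v[:])
        }

        func main() {
            // A sorted slice stands in for the store's key space.
            var keys []string
            put := func(k string, ver uint64) { keys = append(keys, versionedKey(k, ver)) }
            put("alice", 3)
            put("alice", 7)
            put("alice", 12)
            put("bob", 5)
            sort.Strings(keys)

            // Floor lookup: newest version of "alice" at or before version 10.
            target := versionedKey("alice", 10)
            i := sort.SearchStrings(keys, target) // first key >= target
            if i < len(keys) && keys[i] == target {
                fmt.Println("exact hit at version 10")
            } else if i > 0 {
                prev := keys[i-1]
                // A real implementation must also check prev still carries the
                // "alice" prefix before treating it as a hit.
                ver := binary.BigEndian.Uint64([]byte(prev[len(prev)-8:]))
                fmt.Println("floor hit at version", ver) // prints 7
            }
        }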

    Supporting efficient range scans is hard though.

    EDIT: yeah, OLAP will be hard.

moralestapia 5 years ago

I love the idea, but I think you (the author) need a lot of time/support to polish this. You probably need a team.

Also,

>Superfast proof generation time of around 1000 proofs per second per core.

Does this limit in any way things like read/write performance or usability in general?

  • jjirsa 5 years ago

    > You need a team probably

    The cardinal rule of database development:

    http://www.dbms2.com/2013/03/18/dbms-development-marklogic-h...

    • moralestapia 5 years ago

      Yup, I do not mean to discourage the authors.

      I truly like the project and I have a few things in mind that could make use of it already. (Heck, one of them is pretty much Graviton + a front end).

      But I cannot just jump into it as there's some real money involved and no one wants to experiment with that.

      I see a bright future for Graviton, once it becomes tested and stable in production environments.

  • ysleepy 5 years ago

    The proof is just the nodes on the path through the Merkle tree to the root (or rather, all sibling nodes along that path).

    So proof "generation" is just fetching the nodes and sending them to the client.
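
    To illustrate with a generic binary SHA-256 Merkle tree (not Graviton's actual layout), generating a proof is just collecting the sibling hashes on the way up, and verifying is recomputing the root:

        package main

        import (
            "bytes"
            "crypto/sha256"
            "fmt"
        )

        func h(parts ...[]byte) []byte {
            s := sha256.New()
            for _, p := range parts {
                s.Write(p)
            }
            return s.Sum(nil)
        }

        // proveAndRoot returns the sibling hashes from leaf i up to the root
        // (the proof) plus the root hash. Assumes a power-of-two leaf count.
        func proveAndRoot(leaves [][]byte, i int) (proof [][]byte, root []byte) {
            level := make([][]byte, len(leaves))
            for j, l := range leaves {
                level[j] = h(l)
            }
            for len(level) > 1 {
                proof = append(proof, level[i^1]) // sibling at this level
                var next [][]byte
                for j := 0; j < len(level); j += 2 {
                    next = append(next, h(level[j], level[j+1]))
                }
                level, i = next, i/2
            }
            return proof, level[0]
        }

        // verify recomputes the root from the leaf plus the sibling hashes,
        // which is all a client needs to check membership.
        func verify(leaf []byte, i int, proof [][]byte, root []byte) bool {
            acc := h(leaf)
            for _, sib := range proof {
                if i%2 == 0 {
                    acc = h(acc, sib)
                } else {
                    acc = h(sib, acc)
                }
                i /= 2
            }
            return bytes.Equal(acc, root)
        }

        func main() {
            leaves := [][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")}
            proof, root := proveAndRoot(leaves, 2)
            fmt.Println(verify([]byte("c"), 2, proof, root)) // true
        }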

  • jopari 5 years ago

    I asked the devs, and apparently "proof generation won't affect general read/write performance". But I guess this is the kind of thing where benchmarks might be useful, and there don't seem to be any public yet.

Rochus 5 years ago

What is the use case? Why is it important that "All keys, values are backed by blake 256 bit checksum"?

  • jopari 5 years ago

    It seems to be intended as the backend database for the Dero blockchain smart contract platform: https://medium.com/deroproject/graviton-zfs-for-key-value-st...

    The post claims: "The features included in Graviton provide the missing functionality that prevented Stargate RC1 from reaching deployment on our mainnet."

    I'm not sure, but I guess that this checksumming is relevant for storing the Merkle trees encoding the blockchain. I don't know why the previous choice of database wasn't suitable.

  • naivedevops 5 years ago

    ZFS stores checksums of files to detect bit rot. Since they are comparing their database to ZFS, I guess it stores checksums for the same reason. If bit rot occurs, you don't need to discard the entire database, just the affected entry. If the entry had already been there for some time, you might even be able to restore it from a backup.
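
    A minimal sketch of that idea in Go, using BLAKE2b-256 from golang.org/x/crypto/blake2b as a stand-in for whichever BLAKE variant Graviton actually uses:

        package kvcheck

        import (
            "bytes"
            "errors"

            "golang.org/x/crypto/blake2b"
        )

        // entry is a stored value plus the checksum recorded at write time.
        type entry struct {
            value []byte
            sum   [32]byte
        }

        // ErrCorrupt means this single entry has rotted (or been tampered
        // with); only it needs restoring, not the whole database.
        var ErrCorrupt = errors.New("checksum mismatch: entry is corrupt")

        func put(value []byte) entry {
            return entry{value: value, sum: blake2b.Sum256(value)}
        }

        // get re-hashes the value on every read and compares it with the
        // stored checksum, the same way ZFS verifies block checksums on read.
        func get(e entry) ([]byte, error) {
            if sum := blake2b.Sum256(e.value); !bytes.Equal(sum[:], e.sum[:]) {
                return nil, ErrCorrupt
            }
            return e.value, nil
        }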

    • Rochus 5 years ago

      I can understand it with a file system; but in a typical key/value store application the data elements are much smaller (likely even smaller than the hash result).

    • GordonS 5 years ago

      Isn't a 256-bit Blake hash a little OTT, versus a simple CRC, or even a faster, smaller hash like MurmurHash or Jenkins-one-at-a-time?

      • jlokier 5 years ago

        It's a cryptographic hash, so it will detect tampering with the data, which a simple CRC, MurmurHash or Jenkins would not.

        • GordonS 5 years ago

          Still, I'd like an option to use a faster, more efficient CRC or hash - bit rot is usually the main threat, rather than tampering. Not to mention that if a user can tamper with the data they can probably just create a new hash at the same time.

          Using a cryptographic hash as a souped-up CRC seems rather odd, given how many more CPU cycles and RAM it will use, but I don't know the reasoning behind the decision; there must be one.

          • jlokier 5 years ago

            > if an attacker can tamper with the data they can probably just create a new hash at the same time

            That's true for ordinary databases, but this was developed for a blockchain and uses a Merkle hash tree.

            An attacker can only tamper with the data and create a new hash for a data item by also creating a new hash for every node up to the root of the tree. In a blockchain context, even that isn't enough, they'd have to modify the blockchain nodes as well, as I presume they periodically record tree root hashes.

            The hash tree gives it some other interesting features too. O(n) diff time, where n is the number of changes output in the diff, is probably due to having a hash tree.

            The fast diff would also work with a non-cryptographic hash, but it would be considered not quite reliable enough against occasional, random errors. With a cryptographic hash, for non-security purposes we treat the values as reliably unique for each input. For example, see Git which depends on this property.
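
            A sketch of that hash-guided diff (my own simplification; it assumes both versions share the same fixed trie shape, as they would with hashed keys): equal subtree hashes mean the whole subtree is skipped, so the work scales with the number of changes.

                package merklediff

                import "bytes"

                // node is a simplified Merkle tree node: an interior node's hash
                // covers its children's hashes; a leaf's hash covers its key/value.
                type node struct {
                    hash        []byte
                    key, value  []byte // leaves only
                    left, right *node
                }

                func isLeaf(n *node) bool { return n != nil && n.left == nil && n.right == nil }

                // diff emits the leaves that differ between two versions of the tree.
                // Because identical subtrees have identical hashes, unchanged regions
                // are skipped wholesale, so the work grows with the number of changes
                // rather than with the size of the database.
                func diff(a, b *node, emit func(old, new *node)) {
                    switch {
                    case a == nil && b == nil:
                        return
                    case a != nil && b != nil && bytes.Equal(a.hash, b.hash):
                        return // identical subtree: skip it entirely
                    case isLeaf(a) || isLeaf(b):
                        emit(a, b) // changed, inserted or deleted entry
                        return
                    }
                    diff(child(a, false), child(b, false), emit)
                    diff(child(a, true), child(b, true), emit)
                }

                func child(n *node, right bool) *node {
                    if n == nil {
                        return nil
                    }
                    if right {
                        return n.right
                    }
                    return n.left
                }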

          • GordonS 5 years ago

            I meant to say "attacker", rather than "user".

bdcravens 5 years ago

You can run a Graviton database. You can also run a database on a Graviton:

https://aws.amazon.com/about-aws/whats-new/2020/07/announcin...

For best results, run Graviton on a Graviton:

https://aws.amazon.com/ec2/graviton/

  • Mister_Snuggles 5 years ago

    Naming things is difficult.

    • ethbr0 5 years ago

      Not according to git!

      We can just call this: ad58cd9088995cfb528187b11c275dad60ce2ec5

      And the chip: 59b54f61dd17c27744e884542e35b34172e2cc79

      So easy!

TomTinks 5 years ago

This is definitely something to look into. So far Dero looks like a pretty solid project with out-of-the-box thinking.

byteshock 5 years ago

If latency and performance are an issue, there are also solutions like RocksDB or LevelDB.

  • jopari 5 years ago

    There's a brief comparison with RocksDB and LevelDB in the README file, which concludes: "If you require a high random write throughput or you need to use spinning disks then LevelDB could be a good choice unless there are requirements of versioning, authenticated proofs or other features of Graviton database."

    • byteshock 5 years ago

      This was a reply to another comment in the thread that suggested using SQLite. I commented using the Octal iOS app. Not sure why it didn’t post correctly...

ramoz 5 years ago

Comparison to Badger? Badger is also go-native and, for me, has been exceptional at scale and for read-heavy workloads on SSD.

Ref: https://github.com/dgraph-io/badger

  • jopari 5 years ago

    I think the key differentiating feature of Graviton is the tree of authenticated proofs of data consistency. (AFAICT this is particularly important for scalably updating and verifying a large blockchain history.)

    • ramoz 5 years ago

      Ah, figured as much but am not as familiar with that use case. Thanks!

  • jjirsa 5 years ago

    Define “at scale”

BryanG2 5 years ago

Could someone paste timing results of diffing for very large data sets?

nickcw 5 years ago

What I'd really like is a multiprocess safe embeddable database written in pure Go. So a database which is safe to read and write from separate processes.

Unfortunately I don't think this one is multiprocess safe.

  • sneak 5 years ago

    I too feel the “pure Go” pull, but is your use case so precarious or latency-sensitive that you can’t simply use SQLite? That’s what I do in these situations.

  • jopari 5 years ago

    I checked with the devs and they say writing is only safe from a single process, but reading is multiprocess safe. So I think this confirms your thought.

  • carlosf 5 years ago

    Not sure if easy to embed, but Consul does offer strong consistency if explicitly configured.

  • ramoz 5 years ago

    Badger supports concurrent ACID.

    https://github.com/dgraph-io/badger

AtlasBarfed 5 years ago

... doesn't Cassandra do a lot of this?

  • ramoz 5 years ago

    Cassandra is not, traditionally/practically, an embedded db.
