Introduction to LSM Trees: May the Logs Be with You

priyankvex.wordpress.com

128 points by priyankvex 7 years ago · 14 comments

DocSavage 7 years ago

I found this older introduction to be pretty good and part of a series: https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2...

Besides the use of LSM trees in RocksDB and leveldb-like databases, there is also the WiscKey approach (https://www.usenix.org/node/194425) that helps read/write amplification by keeping the LSM tree small and mostly removing the values to a separate value log. There's a pure Go implementation of the WiscKey approach used by dgraph: https://github.com/dgraph-io/badger
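Roughly, the key/value separation WiscKey describes can be sketched like this (a simplified illustration; class and method names are mine, not badger's API):

```python
# Simplified sketch of the WiscKey idea (illustrative names, not badger's
# API): the LSM tree stays small because it maps keys to offsets in a
# separate value log; values themselves are only ever appended to the log.

class ValueLog:
    def __init__(self):
        self.log = bytearray()

    def append(self, value):
        """Append a value and return (offset, length) for the index."""
        offset = len(self.log)
        self.log += value
        return offset, len(value)

    def read(self, offset, length):
        return bytes(self.log[offset:offset + length])


class WiscKeyStore:
    def __init__(self):
        self.vlog = ValueLog()
        self.index = {}  # stand-in for the (now much smaller) LSM tree

    def put(self, key, value):
        # Only a small (key -> offset) entry enters the tree, so
        # compaction moves far fewer bytes (less write amplification).
        self.index[key] = self.vlog.append(value)

    def get(self, key):
        offset, length = self.index[key]
        return self.vlog.read(offset, length)
```

The price is that overwritten values leave stale bytes behind in the log; the paper discusses reclaiming that space separately.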

  • alexott 7 years ago

    He (Alex Petrov) is also writing a book for O'Reilly on the database internals: https://www.goodreads.com/book/show/44647144-database-intern...

    • dominotw 7 years ago

Weird that O'Reilly doesn't list that on their own website.

      • clumsysmurf 7 years ago

        Seems like O'Reilly really goes out of their way to hide ALL books these days. When you go to the main site (https://www.oreilly.com/), where do you see anything related to books? All I see is "online learning", "blended courses", "conferences" and "ideas".

        I'm a bit upset by this, because I've found the Safari experience terrible.

  • indogooner 7 years ago

    I found the build up to this through flavors of disk IO in parts 1 and 2 to be very useful to me. Basic stuff but because I had not done system programming since my college days it was a good refresher (and eye-opener).

lichtenberger 7 years ago

If you just need to fetch values by key from the main storage (the keys might even be as simple as ones produced by a sequence generator), you can avoid the asynchronous background compaction overhead, and thus the unpredictable read or write peaks, by hashing the keys if they aren't already integer/long identifiers: basically, you store a persistent hash array based trie (persistent both in the on-disk sense and in the functional, immutable sense). This can easily be extended to store a new revision through copy-on-write semantics. Instead of storing whole page snapshots, however, storage advances now permit fine-granular access to your data, so you can apply lessons learned from backup systems to version the data pages themselves, and even improve on that.

Disclaimer: I'm the author of a recent article about a free open source storage system I'm maintaining, which versions data at its very core: "Why Copy-on-Write Semantics and Node-Level-Versioning are key to Efficient Snapshots": https://hackernoon.com/sirix-io-why-copy-on-write-semantics-...
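The copy-on-write trie idea above can be sketched roughly like this (my own simplified illustration, not SirixDB's actual code): each update copies only the nodes on the path from root to leaf, so every previous revision stays readable through its old root.

```python
# Minimal sketch of a persistent hash trie with copy-on-write
# (illustrative only; ignores full-hash collisions for brevity).

FANOUT_BITS = 4          # 16-way branching per trie level
FANOUT = 1 << FANOUT_BITS

class Node:
    __slots__ = ("children", "entry")
    def __init__(self, children=None, entry=None):
        self.children = children if children is not None else [None] * FANOUT
        self.entry = entry              # (key, value) if this is a leaf

def _idx(key, depth):
    return (hash(key) >> (depth * FANOUT_BITS)) & (FANOUT - 1)

def put(root, key, value, depth=0):
    """Return a NEW root; the old root (previous revision) is untouched."""
    if root is None:
        return Node(entry=(key, value))
    if root.entry is not None:          # leaf
        if root.entry[0] == key:
            return Node(entry=(key, value))
        # Collision: push the old leaf one level down, then insert again.
        node = Node()
        node.children[_idx(root.entry[0], depth)] = Node(entry=root.entry)
        return put(node, key, value, depth)
    # Internal node: path-copy it, recurse into exactly one child.
    node = Node(children=list(root.children))
    i = _idx(key, depth)
    node.children[i] = put(node.children[i], key, value, depth + 1)
    return node

def get(root, key, depth=0):
    if root is None:
        return None
    if root.entry is not None:
        return root.entry[1] if root.entry[0] == key else None
    return get(root.children[_idx(key, depth)], key, depth + 1)
```

Reading an old revision is just keeping its root around; nothing is ever overwritten in place.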

  • zzzcpan 7 years ago

    You don't actually need to do asynchronous background compaction at all. You can do compaction whenever, in small incremental steps that don't cause any spikes in read or write latencies. Spreading it across all writes gets you slightly slower but latency-capped writes. It's unfortunate that LevelDB popularized the compaction-in-a-thread idea. It's a pretty bad one.
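The incremental approach described here could look roughly like this (an illustrative toy, not LevelDB's code): every write pays for a bounded number of merge steps, so compaction work is amortized across writes instead of running in a background thread.

```python
import heapq

class IncrementalStore:
    """Toy LSM-ish store that compacts a little on every write
    instead of in a background thread (illustrative sketch only)."""

    def __init__(self, memtable_limit=4, steps_per_write=2):
        self.memtable = {}
        self.runs = []                  # sorted runs, oldest first
        self.memtable_limit = memtable_limit
        self.steps_per_write = steps_per_write
        self._merge = None              # (iterator, output) of merge in progress

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}
        self._compact_a_little()        # every write pays a bounded cost

    def _compact_a_little(self):
        for _ in range(self.steps_per_write):
            if self._merge is None:
                if len(self.runs) < 2:
                    return
                oldest, newer = self.runs.pop(0), self.runs.pop(0)
                self._merge = (heapq.merge(oldest, newer,
                                           key=lambda kv: kv[0]), [])
            it, out = self._merge
            item = next(it, None)
            if item is None:            # merge finished: install the new run
                self.runs.insert(0, out)
                self._merge = None
            elif out and out[-1][0] == item[0]:
                out[-1] = item          # duplicate key: the newer run wins
            else:
                out.append(item)

    def drain(self):
        # Finish any outstanding merge work (e.g. at shutdown).
        while self._merge is not None or len(self.runs) >= 2:
            self._compact_a_little()
```

Tuning `steps_per_write` trades per-write latency against how quickly the number of runs shrinks; no single write ever blocks on a full merge.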

    • lichtenberger 7 years ago

      Good catch :-) Right, but merging/compaction work still has to be done, maybe too much if you just need to fetch a value by its key and thus only need an equality scan (no range scans or other comparisons). For the latter case I've implemented an AVL tree, which is also versioned, stored in our record pages, and best read fully in-memory (but it doesn't have to be). For sure there are plenty of possible optimizations, for instance spatio-temporal or full-text indexes, but I guess I'll first look into cost-based rewrite rules for the query compiler and replication/partitioning for horizontal scaling. Too many ideas I guess ;-) but the best thing would be a great open source community :-)

    • dominotw 7 years ago

      Would you know why LevelDB chose compaction in a thread vs. the method you are describing?

  • eeZah7Ux 7 years ago

    I'd like to see a comparison with using mmap and letting the kernel do the paging.

    • lichtenberger 7 years ago

      Basically, I'd like to provide the I/O layer with memory-mapped file regions, so that it's simply a configuration option whether you use the RandomAccessFile implementation or one based on memory-mapped files. I think for the JVM there's Chronicle. Thanks so much for asking :-) and by the way, any contribution would be the best I can hope for.

webshit155 7 years ago

I don't understand the hash index part. I guess that for every segment on disk, you also have a hash table for it, correct? Also, this part doesn't seem to be very memory-efficient.
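For what it's worth, one common reading of that design (a Bitcask-style sketch; the names here are mine, not the article's): each immutable segment gets its own in-memory hash table mapping key to byte offset, and a lookup checks segments newest-first.

```python
import io

class Segment:
    """An append-only segment with an in-memory hash index
    mapping key -> byte offset of the value (sketch only)."""

    def __init__(self):
        self.buf = io.BytesIO()   # stands in for the segment file on disk
        self.index = {}

    def write(self, key, value):
        self.buf.seek(0, io.SEEK_END)
        self.index[key] = self.buf.tell()
        self.buf.write(value.encode() + b"\n")  # assumes no "\n" in values

    def read(self, key):
        if key not in self.index:
            return None
        self.buf.seek(self.index[key])
        return self.buf.readline().rstrip(b"\n").decode()


def lookup(segments, key):
    # Newest segment first: later writes shadow earlier ones.
    for seg in reversed(segments):
        value = seg.read(key)
        if value is not None:
            return value
    return None
```

And yes, the memory concern is real: every distinct key lives in memory once per segment that contains it, which is one reason such designs want few, well-compacted segments.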
