Ask HN: Why are there no open source NVMe-native key value stores in 2023?

99 points by nphase 2 years ago · 73 comments


Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key value stores or DBs out in the wild yet.

Why do you think that is? Are there possibly other projects out there that I'm not familiar with?

diggan 2 years ago

I don't remember exactly why I have any of them saved, but here are some experimental data stores that seem to roughly fit what you're looking for:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"

- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."

- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"

formerly_proven 2 years ago

There's actually an NVMe command set which allows you to use the FTL directly as a K/V store. (This is limited to 16-byte keys [1], however, so it is not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and the benefits were smaller than expected, so it's dead now.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...

  • londons_explore 2 years ago

    Presumably there is some way to use the hash of the actual key as the key, and then store both key and value as data?

    16 bytes is long enough that collisions will be super rare, and while you obviously need to write code to support that case, it should have no performance impact.
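
    A rough sketch of that scheme (Python, purely illustrative; "dev" stands in for a hypothetical put/get handle to the device, not any real NVMe-KV API):

        import hashlib

        def device_key(user_key: bytes) -> bytes:
            # Truncate a cryptographic hash down to the 16 bytes the device accepts.
            return hashlib.sha256(user_key).digest()[:16]

        def put(dev, user_key: bytes, value: bytes) -> None:
            # Store the full key alongside the value so a collision can be detected.
            record = len(user_key).to_bytes(4, "big") + user_key + value
            dev.put(device_key(user_key), record)

        def get(dev, user_key: bytes):
            record = dev.get(device_key(user_key))
            if record is None:
                return None
            klen = int.from_bytes(record[:4], "big")
            stored_key, value = record[4:4 + klen], record[4 + klen:]
            if stored_key != user_key:
                # A 16-byte hash collision is astronomically unlikely, but a real
                # store would still need a fallback path (e.g. an overflow record).
                raise KeyError("hash collision")
            return value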

    • formerly_proven 2 years ago

      A 32-byte key would allow using NVMe KV directly for content-addressed storage; many of those systems use 256-bit / 32-byte cryptographic hashes as keys. Notable exception would be git with 20-byte keys.

      • londons_explore 2 years ago

        You can still do this... Just use the first 16 bytes as the key, and the 2nd 16 bytes as the start of the data.

  • londons_explore 2 years ago

    I think some devices built the block storage on top of the key-value store. I.e. when you write "hello..." (4k bytes) to address 123, it actually saves key: 123, value: "hello...".

    If so, that is probably the reason for a 16 byte key - there is just no way anybody needs a key bigger than 16 bytes for an address anytime soon.
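
    A toy illustration of that layering, with a hypothetical "kv" put/get object standing in for the drive's internal key-value store:

        BLOCK = 4096

        class BlockOnKV:
            """Emulate a block device on a KV interface: the logical block address,
            padded out to 16 bytes, is the key; the 4 KiB block is the value."""

            def __init__(self, kv):
                self.kv = kv  # hypothetical object with put(key, value) / get(key)

            def write_block(self, lba: int, data: bytes) -> None:
                assert len(data) == BLOCK
                self.kv.put(lba.to_bytes(16, "little"), data)

            def read_block(self, lba: int) -> bytes:
                # Unwritten blocks read back as zeroes.
                return self.kv.get(lba.to_bytes(16, "little")) or b"\x00" * BLOCK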

  • londons_explore 2 years ago

    I could imagine that if this mode isn't widely used, drive manufacturers haven't given much thought to performance, and it therefore might suck.

jiggawatts 2 years ago

Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.

The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.

Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...

  • rwmj 2 years ago

    Is there not any danger of tenants rewriting the firmware on these drives, and surprising (or compromising) future tenants? AIUI this is the central reason why even "baremetal" cloud instances still have a minimal hypervisor between the tenant and the hardware.

    • nixgeek 2 years ago

      I’m not sure what makes you think a “minimal hypervisor” exists — Oracle Cloud Infrastructure doesn’t have a hypervisor of any sort between you and its .metal instance types. Don’t think Amazon EC2 does either.

      • rwmj 2 years ago

        Amazon have their own partitioning hypervisor for this purpose. It sits below any hypervisor that might be visible to the tenant.

    • wmf 2 years ago

      The top clouds (AWS/Azure/Google) have custom firmware to solve this problem. Second-tier clouds probably don't, so customers can reflash firmware.

      • otterley 2 years ago

        If your second sentence is true -- and I hope it isn't! -- that would be a gaping security hole.

        • wmf 2 years ago

          To be fair, some of the bare metal providers reflash firmware when the machine is reprovisioned. In theory firmware "implants" could survive reflashing but I don't know if such a thing has ever been seen in the wild.

          • nerpderp82 2 years ago

            This needs to be taken into account when running on metal instances with different cloud providers. You would also want an assurance that metal instances aren't ever repurposed to be VM hosts in the future.

    • idanp 2 years ago

      Virtualization can happen in the hardware itself, e.g. SR-IOV.

gavinray 2 years ago

What do you mean by non-embedded?

You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:

https://github.com/OpenMPDK/xNVMe

https://github.com/OpenMPDK/KVSSD

https://github.com/OpenMPDK/KVRocks

nerpderp82 2 years ago

Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.... a couple hours after this post (zero comments [0])

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

[0] https://news.ycombinator.com/item?id=37899886

threeseed 2 years ago

Crail [1] is a distributed K/V store on top of NVMe-oF.

[1] https://craillabs.github.io

nerpderp82 2 years ago

Aerospike does direct NVME access.

https://github.com/aerospike/aerospike-server/blob/master/cf...

There are other occurrences in the codebase, but that is the most prominent one.

bestouff 2 years ago

Naive question: are there really meaningful gains from addressing an NVMe disk natively, versus using a regular key-value database on a filesystem?

  • chaos_emergent 2 years ago

    I believe that NVMe uses multiple I/O queues, compared to serialized access with SATA, and I think you’d be able to sidestep unnecessary abstractions like file systems and block-based access with an NVMe-specific datastore.

    I’m also curious whether different, more performant data structures can be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.
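
    As a very rough illustration of dropping one of those layers: reading the raw namespace device with O_DIRECT skips the filesystem and page cache, though it still goes through the kernel block layer (full kernel bypass is what SPDK-style userland drivers add). A minimal sketch, assuming a Linux box with a namespace at /dev/nvme0n1:

        import mmap
        import os

        BLOCK = 4096

        # O_DIRECT requires aligned buffers and offsets; an anonymous mmap is
        # page-aligned, so it qualifies.
        fd = os.open("/dev/nvme0n1", os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, BLOCK)

        os.preadv(fd, [buf], 0)      # read the first 4 KiB straight from the device
        print(buf[:16].hex())
        os.close(fd)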

  • creshal 2 years ago

    Latency ought to be much better, since you're skipping several abstraction layers in the kernel.

    But that's about it. And the latency is still worse than in-memory solutions.

    Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of ns extra latency?

    • klodolph 2 years ago

      Agreed. The classic reason is when you have latency needs, but your data set is large enough that RAM is cost-prohibitive, and random-access enough that disk won’t work. The cost savings from switching to NVMe have to justify the higher NRE cost, and simultaneously, you have to be sensitive to latency.

      • creshal 2 years ago

        Individual NVMe drives are also rather small – the biggest I can find is 30TB, which is still more than what AWS offers me as RAM, but not much. Once you start adding custom algorithms to spread your data over multiple "raw" NVMe drives to get more capacity, the latency gap between your custom solution and existing, well-optimized file system stacks starts to erode. Might as well stick to existing kv stores on ZFS or something, rather than roll your own project that might be able to beat it, maybe.

    • adgjlsfhk1 2 years ago

      While you can get 24 TB of RAM, there is a pretty big cost difference. 2 TB of RAM costs roughly $10,000, compared to $130 for the same amount of NVMe storage (or $230 for 12 TB of a good hard drive). Sure, the NVMe is ~3.5x more expensive per terabyte than the hard drive, but the latency will be dramatically lower and the throughput dramatically higher. You can build a 24 TB RAM system, but at that point the cost of the server will be entirely the RAM. The case for NVMe-based storage at this point is that at only ~3.5x the cost of a hard drive, you can switch all your storage over, and as long as you don't need tons of storage (i.e. less than 100 TB), the SSDs will be a minority of the cost of the system.
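
      For reference, the rough per-terabyte arithmetic behind those figures (using the prices quoted above, and assuming the $130 NVMe figure is for 2 TB):

          ram_per_tb  = 10_000 / 2   # $5,000/TB
          nvme_per_tb = 130 / 2      # $65/TB, assuming $130 buys 2 TB of NVMe
          hdd_per_tb  = 230 / 12     # ~$19/TB

          print(round(nvme_per_tb / hdd_per_tb, 1))   # ~3.4x: NVMe vs. hard drive
          print(round(ram_per_tb / nvme_per_tb))      # ~77x: RAM vs. NVMe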

      • creshal 2 years ago

        All that applies to regular kv stores abstracted through filesystems and block device layers just fine.

        But when your latency requirements are so tight that you cannot possibly afford the latency penalty of a filesystem, you better have a good business case to justify either developing a custom bare-metal NVMe store (which is $$$$$ and takes time) or getting a multi-TB RAM system, which is also $$$$$, but far more predictable, and can be put into production today, not 6+ months later when you finish developing your custom kv store.

        For the other 99.999% of use cases, sure, just go with NVMe backing your regular virtualization/containerization infrastructure.

  • threeseed 2 years ago

    Significant gains if you want a distributed key-value database because you can take advantage of NVMEoF.

  • di4na 2 years ago

    Yes, mostly on the durability side. NVMe actually has the relevant API to be sure that a write was flushed, while POSIX-like filesystem APIs usually do not handle it.

delfinom 2 years ago

https://github.com/OpenMPDK/KVRocks

Given, however, that most of the world has shifted to VMs, I don't think this kind of KV storage is accessible for that reason alone: the disks are often split out to multiple users. So the overall demand for this would be low.

  • londons_explore 2 years ago

    NVMe drives allow namespaces to be created, effectively letting multiple users all share an NVMe device without interfering with each other.

    • moondev 2 years ago

      A note for those unaware: consumer-grade NVMe devices (basically all M.2 form factor drives, in my experience) only support a single namespace. If you want to explore creating multiple namespaces, you will need an enterprise-grade U.2 drive.

      Some U.2 drives even support thin provisioning, similar to how a hypervisor treats a sparse disk file, but for physical hardware.
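
      If you want to check what your own hardware exposes, a quick look at sysfs is enough (assuming the usual Linux layout, where each controller's namespaces appear as nvmeXnY entries under it):

          import glob
          import os

          # List the namespaces each NVMe controller exposes; consumer M.2 drives
          # will typically show exactly one (e.g. nvme0 -> ['nvme0n1']).
          for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
              namespaces = sorted(
                  os.path.basename(p)
                  for p in glob.glob(os.path.join(ctrl, "nvme*n*"))
                  if os.path.isdir(p)
              )
              print(os.path.basename(ctrl), namespaces)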

otterley 2 years ago

Because you haven't written it yet!

infamouscow 2 years ago

I work on a database that is a KV-store if you squint enough and we're taking advantage of NVMe.

One thing they don't tell you about NVMe is you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.

caeril 2 years ago

> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.

You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.

You don't get to do both simultaneously.

Embedded is a feature for performance-aware software, not a bug.

rubiquity 2 years ago

I think it's mostly because, while the internal parallelism of NVMe is fantastic, our logical use of these drives is still largely sequential.

CubsFan1060 2 years ago

Interesting article here: https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...

Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/m...

TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.

javierhonduco 2 years ago

There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB https://github.com/apache/kvrocks

  • eatonphil 2 years ago

    Does RocksDB speak NVMe directly?

    > High-performance storage engines. There are a number of storage engines and key-value stores optimized for flash. RocksDB [36] is based on an LSM-Tree that is optimized for low write amplification (at the cost of higher read amplification). RocksDB was designed for flash storage, but at the time of SATA SSDs, and therefore cannot saturate large NVMe arrays.

    From this slightly tangent mention, I am guessing not.

    https://web.archive.org/web/20230624195551/https://www.vldb....

Already__Taken 2 years ago

A seaweedFS volume store sounds like a good candidate to split some of the performance volumes across the nvme queues. You're supposed to give it a whole disk to use anyway.

espoal 2 years ago

I'm building one: https://github.com/yottaStore/yottaStore

zupa-hu 2 years ago

Is there any performance gain over writing append-only data to a file?

I mean, using a merkle tree or something like that to make sense of the underlying data.

  • dboreham 2 years ago

    Writing to append-only files is a terrible idea if you want to query quickly.

    (yes it's fashionable, but it's still terrible for random read performance)

    • zupa-hu 2 years ago

      Care to elaborate? How is reading from an append-only file backed by a memory indexed DB slower compared to either 1) a mutated file, or 2) either append-only or mutated raw NVMe disk storage?

      I mean, what's the trick NVMe can do to be meaningfully faster?

      • LAC-Tech 2 years ago

        Your views are intriguing and I wish to subscribe to your newsletter.

        But seriously, I've been thinking about append-only files + a memory-indexed DB for the past couple of weeks. Any prior art or links or papers or anything, lay it on me.

        • zupa-hu 2 years ago

          I've been using it in production for 8 years in Boomla. It's closed source though. I haven't found any prior art myself, so just went from first principles. Take a look at the data structure of Git for inspiration. (Merkle tree)

          Write speed wasn't my primary motivation though. I wanted a data storage solution that is hard to fuck up. Hard to beat append only in this regard. Plus everything is stored in merkle trees like in Git, so there is the added benefit of data integrity checks. Yes, bit rot is real, and I love to have a mechanism in place to detect and fix those.
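
          For a concrete picture of the pattern under discussion, here is a minimal sketch of an append-only log with an in-memory offset index (the same general shape as log-structured stores like Bitcask); compaction, checksums, and error handling are omitted:

              import os

              class AppendOnlyKV:
                  """Append-only log file plus an in-memory index of key -> record offset."""

                  def __init__(self, path: str):
                      self.f = open(path, "a+b")
                      self.index = {}
                      self._rebuild_index()

                  def _rebuild_index(self):
                      # Replay the log once at startup to rebuild the in-memory index.
                      self.f.seek(0)
                      while True:
                          offset = self.f.tell()
                          header = self.f.read(8)
                          if len(header) < 8:
                              break
                          klen = int.from_bytes(header[:4], "big")
                          vlen = int.from_bytes(header[4:], "big")
                          key = self.f.read(klen)
                          self.f.seek(vlen, os.SEEK_CUR)   # skip over the value
                          self.index[key] = offset         # later writes win

                  def put(self, key: bytes, value: bytes):
                      self.f.seek(0, os.SEEK_END)
                      offset = self.f.tell()
                      self.f.write(len(key).to_bytes(4, "big") +
                                   len(value).to_bytes(4, "big") + key + value)
                      self.f.flush()
                      os.fsync(self.f.fileno())            # make the append durable
                      self.index[key] = offset

                  def get(self, key: bytes):
                      offset = self.index.get(key)
                      if offset is None:
                          return None
                      self.f.seek(offset)
                      header = self.f.read(8)
                      klen = int.from_bytes(header[:4], "big")
                      vlen = int.from_bytes(header[4:], "big")
                      self.f.seek(klen, os.SEEK_CUR)       # skip the stored key
                      return self.f.read(vlen)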

znpy 2 years ago

I once attended a presentation by a presales engineer from Aerospike, and IIRC they're doing some NVMe-in-userspace stuff.

altairprime 2 years ago

“Lazyweb, find me an NVMe key-value store” is how we phrased requests like this twenty years ago.

Who could afford to develop and maintain such a niche thing, in today’s economy, without either a universal basic income or a “non-free” license to guarantee revenue?

brightball 2 years ago

SolidCache and SolidQueue from Rails will be doing that when released.

Otherwise though…you have the file system. Is that not enough?

  • andruby 2 years ago

    Is that discussion/implementation of nvme available somewhere in public?

    https://github.com/rails/solid_cache didn't include anything about NVME that I could find.

    • andrenotgiant 2 years ago

      I think the original question came up after the recent Rails keynote where they mention that, with NVMe speeds, disk is cheaper and almost as fast as memory, so Redis is not as vital. https://youtu.be/iqXjGiQ_D-A?t=2836

      So Solid Cache and Solid Queue just use the database (MySQL), which uses NVMe.

      So now, in addition to: "You don't need a queue, just use Postgres/MySQL", we have "You don't need a cache, just use Postgres/MySQL"

      • andruby 2 years ago

        Right, that is cool, but unrelated to the OP's point of using NVMe directly and bypassing the filesystem. Or does MySQL have a storage driver that talks directly at the NVMe level? (I haven't used MySQL in more than a decade, mostly PostgreSQL now.)

ilyt 2 years ago

It becomes complex when you want to support multiple NVMe drives.

Even more complex when you want any kind of redundancy, as you'd essentially need to build some kind of RAID-like layer into your database.

Also, a few terabytes of NVMe in RAID 10 plus PostgreSQL or something similar covers about 99% of companies' needs for speed.

So you're left with the 1% needing that kind of speed.
