Ask HN: Why are there no open source NVMe-native key value stores in 2023?
Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key value stores or DBs out in the wild yet.
Why do you think that is? Are there possibly other projects out there that I'm not familiar with?

I don't remember exactly why I have any of them saved, but these are some experimental data stores that seem to somewhat fit what you're looking for:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"
- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."
- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"

See https://www.snia.org/educational-library/key-value-standardi... for some description of the special command set to get an NVMe drive to natively work as a key-value store. Also https://www.snia.org/sites/default/files/ESF/Key-Value-Stora...

How do you tell which NVMe drive models support the KV API? Is this something that you can experiment with on a consumer drive, or do you need specific enterprise SSD models? Samsung's uNVMe evaluation guide (from 2019) device support section just states the following (quoted at the bottom of this thread). I can't find detailed spec sheets listing which NVMe command sets are supported, even for their enterprise drives. I reached out to Samsung support to ask. After being sent from one department to another and receiving some very clearly incorrect advice from their sales support, they eventually sent me to an online form for the memory department. I'm still waiting for a response a week later.

Interesting, never heard of this before! Do you have any other resources to share? How can I play with this today?

How is that implemented? Btree, hashtable?

Implementation-defined. The API resembles a map access, though.

RonDB is open-source and supports on-disk data on NVMe disks.
http://mikaelronstrom.blogspot.com/2022/04/variable-sized-di...

Hey, thanks for the mention! UDisk, however, hasn't been open-sourced yet. Still considering it :)

You could also configure Redis to transact everything to disk and choose NVMe as the target.

That would save via the file system, not bypass the kernel to access the NVMe drive directly from user space.

NVMe drives themselves have a bunch of features that make them amenable to K/V storage directly. Good overview: https://www.mydistributed.systems/2020/07/towards-building-h...

There's actually an NVMe command set which allows you to use the FTL directly as a K/V store. (This is limited to 16-byte keys [1], however, so it is not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and the benefits were smaller than expected, so it's dead now.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...

Presumably there is some way to use the hash of the actual key as the key, and then store both key and value as data? 16 bytes is long enough that collisions will be super rare, and while you obviously need to write code to support that case, it should have no performance impact. (A sketch of this idea is included below.)

A 32-byte key would allow using NVMe KV directly for content-addressed storage; many of those systems use 256-bit / 32-byte cryptographic hashes as keys. A notable exception would be git with 20-byte keys.

You can still do this... Just use the first 16 bytes as the key, and the second 16 bytes as the start of the data.

I think some devices built the block storage on top of the key-value store. I.e., when you write "hello..." (4k bytes) to address 123, it actually saves key: 123, value: "hello...". If so, that is probably the reason for a 16-byte key - there is just no way anybody needs a key bigger than 16 bytes for an address anytime soon. I could imagine that if this mode isn't widely used, drive manufacturers haven't given much thought to performance, and it therefore might suck.

Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way. The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example. Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...

Is there not any danger of tenants rewriting the firmware on these drives, and surprising (or compromising) future tenants? AIUI this is the central reason why even "baremetal" cloud instances still have a minimal hypervisor between the tenant and the hardware.

I'm not sure what makes you think a "minimal hypervisor" exists — Oracle Cloud Infrastructure doesn't have a hypervisor of any sort between you and its .metal instance types. Don't think Amazon EC2 does either.

Amazon have their own partitioning hypervisor for this purpose. It sits below any hypervisor that might be visible to the tenant.

The top clouds (AWS/Azure/Google) have custom firmware to solve this problem. Second-tier clouds probably don't, so customers can reflash firmware. If your second sentence is true -- and I hope it isn't! -- that would be a gaping security hole.
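A rough sketch of the hash-the-key workaround discussed a few comments up, in Python. The `FakeKVDevice`, `kv_put`, and `kv_get` names are made up for illustration; the device class merely stands in for whatever 16-byte-key store/retrieve interface a real KV-capable drive exposes.

```python
import hashlib

class FakeKVDevice:
    """Hypothetical 16-byte-key KV device; stands in for the NVMe KV command set."""

    def __init__(self):
        self._store = {}

    def put(self, key16: bytes, value: bytes) -> None:
        assert len(key16) == 16, "device keys are limited to 16 bytes"
        self._store[key16] = value

    def get(self, key16: bytes):
        return self._store.get(key16)


def _device_key(user_key: bytes) -> bytes:
    # Collapse an arbitrary-length application key into the 16 bytes the device accepts.
    return hashlib.blake2b(user_key, digest_size=16).digest()


def kv_put(dev, user_key: bytes, value: bytes) -> None:
    # Store the full application key alongside the value (length-prefixed)
    # so collisions can be detected on read.
    payload = len(user_key).to_bytes(4, "little") + user_key + value
    dev.put(_device_key(user_key), payload)


def kv_get(dev, user_key: bytes):
    payload = dev.get(_device_key(user_key))
    if payload is None:
        return None
    klen = int.from_bytes(payload[:4], "little")
    stored_key = payload[4:4 + klen]
    if stored_key != user_key:
        # 16-byte hash collision: vanishingly rare, but must be handled,
        # e.g. by falling back to a secondary bucket. Not done in this sketch.
        raise KeyError("hash collision for %r" % user_key)
    return payload[4 + klen:]


if __name__ == "__main__":
    dev = FakeKVDevice()
    kv_put(dev, b"user:12345:profile", b'{"name": "HN"}')
    print(kv_get(dev, b"user:12345:profile"))
```

The collision branch is the only extra code the approach really needs; everything else stays a plain map-style access, matching the "implementation-defined, resembles a map access" description earlier in the thread.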
To be fair, some of the bare metal providers reflash firmware when the machine is reprovisioned. In theory, firmware "implants" could survive reflashing, but I don't know if such a thing has ever been seen in the wild. This needs to be taken into account when running on metal instances with different cloud providers. You would also want an assurance that metal instances aren't ever repurposed to be VM hosts in the future.

Virtualization can happen in the hardware itself, e.g. SR-IOV.

What do you mean by non-embedded? You might also be interested in xNVMe and the RocksDB/Ceph KV drivers: https://github.com/OpenMPDK/xNVMe

Super helpful, thanks. What I mean is something akin to a single-node daemon with network capabilities. Something as basic as a memcached or Redis type of interface to start.

I think there's actually a standard defined for a networked KV API over NVMe, written by the SNIA (as others have mentioned), though I'm not super knowledgeable about it. I think Redfish/Swordfish are maybe meant for this sort of thing: https://www.snia.org/forums/smi/swordfish There's a video on NVMe and NVMe-oF management, for instance: https://www.youtube.com/watch?v=56VoD_1iGIs&list=PLH_ag5Km-Y...

Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.... a couple of hours after this post (zero comments [0]):

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

There's also Crail [1], which is a distributed K/V store on top of NVMe-oF.

Aerospike does direct NVMe access: https://github.com/aerospike/aerospike-server/blob/master/cf... There are other occurrences in the codebase, but that is the most prominent one.

Naive question: are there really expected gains from addressing an NVMe disk natively versus using a regular key-value database on a filesystem?

I believe that NVMe uses multiple I/O queues, compared to serialized access with SATA, and I think you'd be able to sidestep unnecessary abstractions like file systems and block-based access with an NVMe-specific datastore. I'm also curious whether different, more performant data structures can be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.

SATA also has multiple I/O queues. It's called "NCQ". The exact semantics vary per protocol, but it's a feature of most protocols, at least in the currently used revisions:
https://en.wikipedia.org/wiki/Native_Command_Queuing

That's one queue per drive. NVMe allows multiple queues per drive, commonly used to assign one queue per CPU core.

Most filesystems will make use of multiple IO queues - i.e. if an application sends many different read requests, they may be satisfied out of order.

Latency ought to be much better, since you're skipping several abstraction layers in the kernel. But that's about it. And the latency is still worse than in-memory solutions. Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of ns extra latency?

Agreed. The classic reason is when you have latency needs, but your data set is large enough that RAM is cost-prohibitive, and random-access enough that disk won't work. The cost savings from switching to NVMe have to justify the higher NRE cost, and simultaneously, you have to be sensitive to latency.

Individual NVMe drives are also rather small – the biggest I can find is 30TB, which is still more than what AWS offers me as RAM, but not much. Once you start adding custom algorithms to spread your data over multiple "raw" NVMe drives to get more capacity, the latency gap between your custom solution and existing, well-optimized file system stacks starts to erode. Might as well stick to existing kv stores on ZFS or something, rather than roll your own project that might be able to beat it, maybe.

While you can get 24TB of RAM, there is a pretty big cost difference. 2 TB of RAM costs roughly $10,000, compared to roughly $130 for 2 TB of NVMe storage (or $230 for 12 TB of a good hard drive). Sure, the NVMe is ~3.5x more expensive than the hard drive, but the latency will be dramatically lower and the throughput will be dramatically higher. Sure, you can build a 24 TB RAM system, but at that point the cost of the server will be entirely the RAM. The reason for NVMe-based storage at this point is that at only ~3.5x the cost of a hard drive, you can switch all your storage over, and as long as you don't need tons of storage (i.e. less than 100TB), the SSDs will be a minority of the cost of the system.

All that applies to regular kv stores abstracted through filesystems and block device layers just fine. But when your latency requirements are so tight that you cannot possibly afford the latency penalty of a filesystem, you'd better have a good business case to justify either developing a custom bare-metal NVMe store (which is $$$$$ and takes time) or getting a multi-TB RAM system, which is also $$$$$, but far more predictable, and can be put into production today, not 6+ months later when you finish developing your custom kv store. For the other 99.999% of use cases, sure, just go with NVMe backing your regular virtualization/containerization infrastructure.

Significant gains if you want a distributed key-value database, because you can take advantage of NVMe-oF.

Yes, mostly on the durability side. NVMe actually has the relevant API to be sure that a write was flushed, while POSIX-like filesystem APIs usually do not handle it.

https://github.com/OpenMPDK/KVRocks

Given, however, that most of the world has shifted to VMs, I don't think KV storage is accessible for that reason alone, because the disks are often split out to multiple users. So the overall demand for this would be low.
NVMe allows namespaces to be created, effectively letting multiple users all share an NVMe device without interfering with each other. A note for those unaware: consumer-grade NVMe devices (basically all M.2 form-factor drives, in my experience) only support a single namespace. If you want to explore creating multiple namespaces, you will need an enterprise-grade U.2 drive. Some U.2 drives even support thin provisioning, like how a hypervisor treats a sparse disk file, but for physical hardware.

Because you haven't written it yet! I work on a database that is a KV store if you squint enough, and we're taking advantage of NVMe. One thing they don't tell you about NVMe is that you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is that after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.

> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS. You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency. You don't get to do both simultaneously. Embedded is a feature for performance-aware software, not a bug.

I think it's mostly because, while the internal parallelism of NVMe is fantastic, our logical use of these drives is still largely sequential.

Interesting article here: https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl... Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/m... TL;DR: Grafana Cloud needed tons of caching, and it was expensive. So they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.

There's Kvrocks. It uses the Redis protocol and it's built on RocksDB: https://github.com/apache/kvrocks

Does RocksDB speak NVMe directly?

> High-performance storage engines. There are a number of storage engines and key-value stores optimized for flash. RocksDB [36] is based on an LSM-Tree that is optimized for low write amplification (at the cost of higher read amplification). RocksDB was designed for flash storage, but at the time of SATA SSDs, and therefore cannot saturate large NVMe arrays.

From this slightly tangential mention, I am guessing not. https://web.archive.org/web/20230624195551/https://www.vldb....

A SeaweedFS volume store sounds like a good candidate to split some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.

I'm building one:
https://github.com/yottaStore/yottaStore

Is there any performance gain over writing append-only data to a file? I mean, using a merkle tree or something like that to make sense of the underlying data.

Writing to append-only files is a terrible idea if you want to query quickly. (Yes, it's fashionable, but it's still terrible for random read performance.)

Care to elaborate? How is reading from an append-only file backed by a memory-indexed DB slower compared to either 1) a mutated file, or 2) either append-only or mutated raw NVMe disk storage? I mean, what's the trick NVMe can do to be meaningfully faster? Your views are intriguing and I wish to subscribe to your newsletter. But seriously, I've been thinking about append-only files + a memory-indexed DB for the past couple of weeks - any prior art or links or papers or anything, lay it on me. (A toy sketch of the pattern is included further down.)

I've been using it in production for 8 years in Boomla. It's closed source though. I haven't found any prior art myself, so I just went from first principles. Take a look at the data structure of Git for inspiration (Merkle tree). Write speed wasn't my primary motivation though. I wanted a data storage solution that is hard to fuck up. Hard to beat append-only in this regard. Plus everything is stored in Merkle trees like in Git, so there is the added benefit of data integrity checks. Yes, bit rot is real, and I love having a mechanism in place to detect and fix it.

I once attended a presentation by some presales engineer from Aerospike, and IIRC they're doing some NVMe-in-userspace stuff.

"Lazyweb, find me an NVMe key-value store" is how we phrased requests like this twenty years ago. Who could afford to develop and maintain such a niche thing, in today's economy, without either a universal basic income or a "non-free" license to guarantee revenue?

SolidCache and SolidQueue from Rails will be doing that when released. Otherwise though… you have the file system. Is that not enough?

Is that discussion/implementation of NVMe available somewhere in public? https://github.com/rails/solid_cache didn't include anything about NVMe that I could find.

I think the original question came up after the recent Rails keynote, where they mention that, with NVMe speeds, disk is cheaper and almost as fast as memory, so Redis is not as vital: https://youtu.be/iqXjGiQ_D-A?t=2836 So Solid Cache and Solid Queue just use the database (MySQL), which uses NVMe. So now, in addition to "You don't need a queue, just use Postgres/MySQL", we have "You don't need a cache, just use Postgres/MySQL".

Right, that is cool, but unrelated to the OP's point of using NVMe directly and bypassing the filesystem. Or does MySQL have a storage driver that talks directly at the NVMe level? (I haven't used MySQL in more than a decade, mostly PostgreSQL now.)

It becomes complex when you want to support multiple NVMe drives, and even more complex when you want any kind of redundancy, as you'd essentially need to build something RAID-like into your database. Also, a few terabytes of RAID-10 NVMe + PostgreSQL or similar covers about 99% of companies' needs for speed. So you're left with the 1% needing that kind of speed.
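A toy sketch of the append-only file plus in-memory index pattern asked about above, in plain Python. The record framing and the `AppendOnlyKV` class are made up for illustration; real systems add checksums, compaction, and a deliberate fsync policy. As for prior art, Bitcask (Riak's log-structured storage engine) is probably the closest well-known example of exactly this layout.

```python
import os

class AppendOnlyKV:
    """Toy append-only log with an in-memory index mapping key -> file offset.

    Record layout (made up for this sketch): 4-byte key length, 4-byte value
    length, key bytes, value bytes. No checksums, compaction, or fsync policy.
    """

    def __init__(self, path: str):
        self.index = {}                       # key -> offset of the record header
        self._fh = open(path, "ab+")          # appends always go to the end
        self._rebuild_index()

    def _rebuild_index(self):
        # On startup, scan the log once to rebuild the in-memory index.
        self._fh.seek(0)
        offset = 0
        while True:
            header = self._fh.read(8)
            if len(header) < 8:
                break
            klen = int.from_bytes(header[:4], "little")
            vlen = int.from_bytes(header[4:], "little")
            key = self._fh.read(klen)
            self._fh.seek(vlen, os.SEEK_CUR)
            self.index[key] = offset          # later records win
            offset += 8 + klen + vlen

    def put(self, key: bytes, value: bytes):
        self._fh.seek(0, os.SEEK_END)
        offset = self._fh.tell()
        record = (len(key).to_bytes(4, "little")
                  + len(value).to_bytes(4, "little")
                  + key + value)
        self._fh.write(record)
        self._fh.flush()                      # durability policy left out here
        self.index[key] = offset

    def get(self, key: bytes):
        # Reads are a single seek + read; the in-memory index keeps lookups O(1).
        offset = self.index.get(key)
        if offset is None:
            return None
        self._fh.seek(offset)
        header = self._fh.read(8)
        klen = int.from_bytes(header[:4], "little")
        vlen = int.from_bytes(header[4:], "little")
        self._fh.seek(klen, os.SEEK_CUR)
        return self._fh.read(vlen)


if __name__ == "__main__":
    db = AppendOnlyKV("/tmp/aokv.log")
    db.put(b"hello", b"world")
    print(db.get(b"hello"))
```

Whether this beats a mutate-in-place design comes down to the read path: the index turns every read into one seek, so the append-only layout mainly costs you space until compaction, not query latency.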
https://github.com/OpenMPDK/uNVMe/blob/master/doc/uNVMe2.0_S...

> Guide Version: uNVMe2.0 SDK Evaluation Guide ver 1.2
> Supported Product(s): NVMe SSD (Block/KV)
> Interface(s): NVMe 1.2