We Replaced an SSD with Storage Class Memory

engineering.mongodb.com

125 points by ssvss 5 years ago · 70 comments

georgewfraser 5 years ago

Andy Pavlo talks about this in his class at CMU. You shouldn’t expect to get better performance by running a disk-optimized storage engine on memory, because you’re still paying all the overhead of locks and pages to work around the latency of disk, even though that latency no longer exists. Instead, you have to build a new, simpler storage engine that skips all the bookkeeping of a disk-oriented storage engine.

https://youtu.be/a70jRWLjQFk

  • pocket_cheese 5 years ago

    For anyone interested in how a database works underneath the hood, I could not recommend a better MOOC than Andy Pavlo's database courses. His intro to database course is so freaking good.

  • mackle_hair 5 years ago

    I believe there are other issues with SCM, aren't there? For instance, if you flip the same bit over and over, you'll wear out the media, so just like with SSDs there have to be wear-leveling algorithms. But because the media has gotten so fast, the wear leveling would theoretically become a larger percentage of the bottleneck of using the drive.

    But I totally agree, rethinking how data is stored is going to be key to the adoption of these types of new media. There's a company called VAST Data that has built out a full storage solution utilizing the unique properties of SCM, very cool: https://vastdata.com/
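
    To make the wear-leveling point concrete, here is a toy sketch (nothing like a real controller, just an illustration of the bookkeeping): writes to a logical block get remapped onto the least-worn physical block, and that mapping work is the overhead that starts to matter once the media itself is nearly free.

      # Toy wear-leveling sketch: remap each write onto the least-worn
      # physical block. Real controllers are far more sophisticated; this
      # only illustrates the bookkeeping that wear leveling adds per write.
      class ToyWearLeveler:
          def __init__(self, physical_blocks: int):
              self.wear = [0] * physical_blocks   # write count per physical block
              self.mapping = {}                   # logical block -> physical block

          def write(self, logical_block: int, data: bytes) -> int:
              target = min(range(len(self.wear)), key=self.wear.__getitem__)
              self.mapping[logical_block] = target
              self.wear[target] += 1
              # A real device would also relocate live data and program the media here.
              return target

      wl = ToyWearLeveler(physical_blocks=8)
      for _ in range(100):
          wl.write(0, b"same logical block, spread across the media")
      print(wl.wear)  # roughly even wear instead of 100 writes to one block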

renewiltord 5 years ago

Jesus Christ, this is insane. Almost a Terabyte of 12.6 Gbps reads? I have a bunch of geospatial entity resolution workloads that I could absolutely smash with this. For way cheaper than the fat mem instances.

  • jiggawatts 5 years ago

    GBps, not Gbps! They were getting 12.6 gigabytes per second, which would be 100 gigabits per second almost exactly.

    However, even their 6-DIMM test produces only 300 Gbps, which is insufficient to saturate a modern 400 GbE network adapter for either reads or writes.
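
    Spelling out the unit conversions with the figures above (bytes to bits is just a factor of 8):

      read_GB_per_s = 12.6
      print(read_GB_per_s * 8)     # ~100.8 Gbps, i.e. roughly one 100 GbE link
      six_dimm_Gbps = 300
      print(six_dimm_Gbps / 400)   # 0.75 -> can't saturate a 400 GbE adapter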

    This would be most relevant on a "single master" system storing some sort of simple data where consistency requirements mean that the "writer" cannot be distributed. In a situation like this, the NIC and the storage bandwidth are the ultimate limits.

    In general, Intel SSDs and Intel Optane have poor but consistent bandwidth, and consistently low latency. Coupled with the high price and small capacity, they have their niche, but they're not a clear winner in any category.

    As a reference point for how crazy high bandwidths are these days, NVIDIA sells a turnkey solution with 200 GB/sec network bandwidth (1.6 Tbps!): https://www.nvidia.com/en-au/data-center/dgx-a100/

    • lmilcin 5 years ago

      Where this is useful is when the application needs to process much more data than it then needs to transmit over the network.

      For a database system like MongoDB this could be perfect, depending on the workload.

  • lrem 5 years ago

    Back in the day when I interviewed for Google I had this beautiful question. The interviewer fished for a basic distributed key-value store. I just kept coming up with a single-machine solution to his numbers. "No, I really can have that storage+bandwidth, here's the part number."

    I'm still wondering if that interview cost me a level.

    • lmilcin 5 years ago

      Being an interviewer myself, I can say interviewers tend to be stubborn. They have an excellent question (at least they think they do) that they have spent a lot of time thinking through and discussing with other candidates. Your answer causes them to miss the opportunity to have a real discussion and understand your knowledge of the topic.

      It is better to play along and only mention the real solution at the end, to finish on a high note.

      • thechao 5 years ago

        Alternately, the interviewer should recognize that their question is flawed & move on. I’ve had candidates nail a 30 minute discussion question with a 2 minute nonobvious answer. Shit happens: you’re not the smartest person in the room that day.

        • dijit 5 years ago

          It should go without saying that, as an interviewer, I am never the smartest person in the room. I am a person with different experiences, and I need to find out if you’re a fit, not if you’re unintelligent. If you got to me, you’re not unintelligent.

          • lmilcin 5 years ago

            Tell this to my candidates :)

            I have interviewed a couple of candidates who were offended by the fact that I was asking hard questions that I knew the answers to.

            This resulted in remarks like "I have never seen this question on the Internet" or "So what is the answer to this question anyway?" or "If I knew questions would be this hard I would not bother to come".

            Somebody should explain to people that it doesn't matter how hard the questions are, only how the candidates compare. I am fully prepared for candidates to study the questions that are available on the Internet; I want to see how they deal with something that requires a bit more than a couple of hours of effort and rote memorization.

            • mushi 5 years ago

              Is this because you were interviewing candidates who were incapable of saying “I don’t know”? That’s a good filter during any interview.

          • AnIdiotOnTheNet 5 years ago

            Most interviewers though aren't actually any good at it and very often don't like not being the smartest person in the room.

            • lmilcin 5 years ago

              Whether you are the interviewer or the interviewee, putting your ego before the goal of the interview is just going to waste the time of both parties.

              I can't hire a person who can't restrain their ego for the duration of the interview, and an interviewer who is focusing on anything other than figuring out whether the candidate is the right person for the job is doing a disservice to everybody.

              • sukilot 5 years ago

                The thing to remember is that in some companies the interviewer doesn't care about hiring and is only there because they were forced to be.

          • dhosek 5 years ago

            I guess it depends on where you're doing the interviewing. I've had surprisingly unintelligent people get to me in the interviewing stages. It amazes me how many developers, even experienced ones, can't write a simple recursive function. And we're not even looking at "can you come up with a recursive function as a solution to this problem?"; we're looking at "write a recursive function to calculate the factorial" as the question.
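
            For reference, the kind of answer being asked for is only a few lines; a minimal version in Python:

              def factorial(n: int) -> int:
                  """Recursive factorial: n! = n * (n-1)!, with 0! = 1! = 1 as the base case."""
                  if n <= 1:
                      return 1
                  return n * factorial(n - 1)

              assert factorial(5) == 120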

            • joefourier 5 years ago

              How often do recursive functions come up in your daily programming life? For me they're rare enough that it makes sense that many otherwise competent developers draw a blank when asked to write one in the context of a high-stress interview.

              They may look clever, but they're often not the ideal solution compared to an iterative one that's simpler to understand and often easier to optimise. I write plenty of DSP, GPGPU programming, and computer vision code, and I can't honestly remember the last time I wrote a recursive function.

              • sukilot 5 years ago

                Recursive functions come up quite often when investigating the root cause of outages.

                Unclear if that means we need candidates who are better at recursion, or better at avoiding recursion.

              • dhosek 5 years ago

                Recursion is pretty fundamental in functional programming—it's preferred over loops. A smart compiler will do tail-call optimization and eliminate the overhead of increasing the stack size. Even if you're working in a non-FP development environment, knowledge of FP techniques makes one a stronger programmer. And recursion is not some esoteric technique in any event.
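
                To illustrate, the tail-recursive form threads an accumulator so the recursive call is the last thing the function does; a TCO-capable compiler can turn that into a loop, though note that CPython, for one, does not:

                  def factorial_tail(n: int, acc: int = 1) -> int:
                      # The recursive call is in tail position: nothing remains to do
                      # after it returns, so a TCO-capable compiler can reuse the frame.
                      if n <= 1:
                          return acc
                      return factorial_tail(n - 1, acc * n)

                  assert factorial_tail(5) == 120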

              • throwaway_pdp09 5 years ago

                > I write plenty of DSP, GPGPU programming, and computer vision ... I can't honestly remember the last time I wrote a recursive function.

                Not remotely surprising, is it? Hopefully if you went for a suitable job they wouldn't ask unsuitable questions. Then again, writing a recursive factorial should pain no one.

                • dhosek 5 years ago

                  Exactly. I wasn't asking people to do something difficult. I don't think there's a simpler recursive function than this; if you can't do this, then you don't have basic knowledge of what a recursive function is. I learned how to do this in high school.

              • lmilcin 5 years ago

                It is not about how often they come up, but whether you can think in an abstract way about a convoluted problem.

                Think about an IQ test, which shows abstract puzzles that have no connection to the real world. They are part of the test because they provide measurement points on tasks that require different types of abstract thinking without prerequisite knowledge.

                Again, it is not possible to verify that you have all the necessary knowledge for the job, nor would it be useful (no candidate knows everything for the position, and if they did they would likely be overqualified).

                I of course ask questions that test the candidate's knowledge in the most useful areas, but I want to devote at least part of the interview to finding out how smart they are on abstract tasks that don't require special knowledge.

                • sukilot 5 years ago

                  I don't want to let abstract convoluted problem solvers anywhere near my production systems

                  • mlyle 5 years ago

                    You probably want to avoid people with a propensity for viewing things in abstract and convoluted terms... not to avoid people with the ability to solve abstract and/or convoluted problems.

              • dekhn 5 years ago

                They come up about once a year for me. That's about how often I have to write a tree traversal.
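
                The once-a-year version tends to look something like this minimal in-order traversal sketch (the Node class here is just illustrative):

                  from dataclasses import dataclass
                  from typing import Iterator, Optional

                  @dataclass
                  class Node:
                      value: int
                      left: "Optional[Node]" = None
                      right: "Optional[Node]" = None

                  def in_order(node: Optional[Node]) -> Iterator[int]:
                      # Visit the left subtree, then the node itself, then the right subtree.
                      if node is None:
                          return
                      yield from in_order(node.left)
                      yield node.value
                      yield from in_order(node.right)

                  tree = Node(2, Node(1), Node(3))
                  assert list(in_order(tree)) == [1, 2, 3]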

                • sukilot 5 years ago

                  Why traverse a tree recursively and risk blowing your stack?

                  • mlyle 5 years ago

                    You'd sure need a pretty tiny stack and/or an unbelievable amount of context on stack to worry too much about this.

                    If our per-callstack frame is 10k bytes, and we can spend 1 megabyte of stack on recursion (both are pessimistic).. we can go 100 levels deep. Which means we can expect to run into problems when there are more than ~5E29 elements.

                    This assumes we don't write the recursion in a way that the compiler can elide it entirely with tail-call optimization.

                    I'm not a huge fan of recursion, but let's not resort to too much hyperbole in our arguments against it. :P
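
                    The arithmetic behind that estimate, using the same (pessimistic) assumptions and a reasonably balanced binary tree, lands in the same ballpark as the ~5E29 figure:

                      frame_bytes = 10_000           # pessimistic per-call stack frame
                      stack_budget = 1_000_000       # 1 MB of stack reserved for recursion
                      depth = stack_budget // frame_bytes
                      print(depth)                   # 100 levels
                      print(2 ** depth)              # ~1.3e30 nodes in a balanced tree that deep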

                    • jlokier 5 years ago

                      > If our per-callstack frame is 10k bytes, and we can spend 1 megabyte of stack on recursion (both are pessimistic).. we can go 100 levels deep. Which means we can expect to run into problems when there are more than ~5E29 elements.

                      This is incorrect.

                      Your limit is only for fully balanced trees. On a fully unbalanced tree your solution will crash at about 100 elements.

                      Do we see a lot of unbalanced trees? Yes, most of the time in my experience. If there's a balanced tree, there's probably a data structure and the function is already supplied. Writing a tree traversal function comes up when working with unbalanced things like program ASTs, JSON/HTML/XML parsed data, Lisp-style lists, filesystem traversal, etc.

                      > This assumes we don't write the recursion in a way that the compiler can elide it entirely with tail-call optimization.

                      The compiler cannot elide away tree traversal with tail-call optimisation. Only one branch.

                      A really smart compiler could transform it into a loop with an explicit node-stack using an array, or avoid the stack and use in-place pointer reversal if concurrency rules permit and there's a spare bit. But I've never seen a really smart compiler do either of those things.
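
                      For reference, the explicit-node-stack transformation mentioned above looks roughly like this (an in-order traversal over the same illustrative Node shape as the sketch further up; a heap-allocated list replaces the call stack, so depth is bounded by memory rather than stack size):

                        def in_order_iterative(root):
                            # Replace the call stack with an explicit list of pending nodes.
                            out, stack, node = [], [], root
                            while stack or node is not None:
                                while node is not None:       # walk down the left spine
                                    stack.append(node)
                                    node = node.left
                                node = stack.pop()
                                out.append(node.value)        # visit the node
                                node = node.right             # then handle its right subtree
                            return out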

                      • mlyle 5 years ago

                        > Your limit is only for fully balanced trees. On a fully unbalanced tree your solution will crash at about 100 elements.

                        5E29 assumes not-quite-balanced. Yes, I'm assuming some kind of balanced binary tree data structure, not other things that are tree-like. There are some things that are much better than 5E29, like btrees. :P

                        If you have really deep unbalanced trees, you may want to have a smaller stack frame and/or pay the pain of doing things iteratively with your own stack. (Or have upward links in your tree and do it purely iteratively but slow).

                        > The compiler cannot elide away tree traversal with tail-call optimisation. Only one branch.

                        Yah, sorry. Just for search, not for full traversal.

                    • joefourier 5 years ago

                      I have 20KB total RAM on the MCU I'm currently programming for. It's not too hyperbolic to imagine real-world cases where some recursive functions would be unsuitable.

                      They're handy for some parsing and scripting scenarios however, I'll grant you that.

                    • dekhn 5 years ago

                      On my 1995 DOS-based computer, the C compiler I used at the time could handle roughly 6 levels deep.

                      • mlyle 5 years ago

                        Probably you were limited to a single 64k segment for stack... still would have required a really big amount of context per call to blow stack so quickly.

                        Back in the mid 80's on DOS I had no problem recursing 30+ levels deep.

                  • dekhn 5 years ago

                    Hmm, let's see. The last time I blew the computer's stack due to tree recursion was in 1995 (it was six levels deep). Since then, I've never seen a stack overflow due to tree recursion (presumably because none of the trees I've operated on were deeper than the hardware stack, and I switched from C to Python where the stack is far, far deeper).

                    If I were writing a server with tighter memory requirements, I could certainly transform my code if desired.

    • asdfasgasdgasdg 5 years ago

      Even if you can do a particular problem on a single machine, sometimes that isn't the right call. In a work-scheduled cluster environment, a task that wants the entire machine may have trouble getting a slot unless it has the priority to preempt everything else going on on that machine. We call such VMs "picky" and they don't get scheduling guarantees.

      • lrem 5 years ago

        Heh, sure. But Borg was not part of the stated question, just of the expected answer. It wasn't even close to being needed to meet the stated parameters (IIRC the whole data set would fit into a single SSD card, so one could even have room for growth without scaling out).

      • michaelt 5 years ago

        I'm having trouble thinking of how you'd end up with a "cluster" that couldn't provide the power of a single machine.

        • asdfasgasdgasdg 5 years ago

          To make it concrete: consider a cluster of ten machines with 256 GB of RAM. Team A and team B both schedule jobs with a requirement of 129 GB of RAM and five replicas each. The replicas get scheduled, one on each machine. Team C wants to schedule a task that takes an entire 256 GB of RAM. If they don't have sufficient priority to preempt the other jobs, then they will not schedule.
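
          A toy sketch of that exact scenario (the numbers are the hypothetical ones above, nothing Borg-specific):

            machines = [{"free_gb": 256} for _ in range(10)]

            def schedule(req_gb: int) -> bool:
                # First fit: place the job on any machine with enough free RAM.
                for m in machines:
                    if m["free_gb"] >= req_gb:
                        m["free_gb"] -= req_gb
                        return True
                return False

            # Teams A and B place ten 129 GB replicas in total, one per machine.
            print(sum(schedule(129) for _ in range(10)))  # 10 -> each machine has 127 GB left
            print(schedule(256))                          # False -> the whole-machine job won't fit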

          In real heterogenous-workload production clusters, every available machine likely has several VMs scheduled on it if the cluster isn't idle. There is never a full machine that's free unless some special effort has been taken to make it so.

        • fragmede 5 years ago

          The knapsack problem is known to be NP-complete, and an algorithm that fixates on getting jobs the size of a single machine to run will fail to run multi-machine jobs as successfully as one that does not. It's a far more interesting algorithm to think about than sorting lists of numbers. Job priorities are easy enough to add in, but the far more practical issue is noisy neighbors. Even limiting things to single jobs on single machines, network and storage bandwidth have bottlenecks the cluster scheduler has to optimize for.

          To make things more complicated, a long-lived cluster is going to be made up of different classes of machines, with different CPU micro-architectures, so 'single machine' is overly constraining. E.g. it's not interesting that a job with a 4 MiB requirement can always run if your job needs 32 GiB.

        • throwaway_pdp09 5 years ago

          http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

          "Rather than making your computation go faster, the systems introduce substantial overheads which can require large compute clusters just to bring under control.

          In many cases, you’d be better off running the same computation on your laptop."

          My limited experience fits this in that a bit of smarts on a single box beats a bunch of boxes.

          (the link is a very good read BTW)

          • yencabulator 5 years ago

            The biggest problem with that write-up is ignoring availability. "Fast all the way up to the crash" can be much worse than slow and steady.

            Of course, for a batch job with a runtime under 11 minutes, that probably doesn't really matter too much. Just don't generalize that too much.

    • sukilot 5 years ago

      Your interviewer may have lacked skill in guiding the process, but you could have taken the question in the spirit it was given and answered "now if the usage grows 10x as much as hardware advances" or "if we need to be resilient to hardware failures" or "if we need to roll out a logic upgrade slowly" and continued to solve the problem.

      > I'm still wondering if that interview cost me a level.

      Unlikely. Candidate level is decided before the interview.

    • twic 5 years ago

      An interviewer once gave me a problem about scraping some information out of a log file. I wrote a short shell pipeline. The interviewer asked for a more complicated analysis. I wrote a longer pipeline. The interviewer added more requirements, with aggregation, state, etc in order to push me to write an actual program. I wrote an even longer pipeline.

      By this point we had both realised that this was a battle of wits: could he come up with a problem that I couldn't solve with a pipeline?

      At the end of the interview, I had a pipeline that took up most of a piece of A4 paper to write out. I had won the battle, and was offered the job.

      Of course, I would not advise you to actually write a pipeline like that in production, but it's a fun exercise.

      Anyway, the moral of this story is that if the interviewer wants you to solve a problem a certain way, and you can solve it in a simpler way, then a good interviewer will mark you up, not down. Perhaps at Google they didn't; they don't really seem like a company that has it together.

      • sukilot 5 years ago

        Your thesis is that a company that has it together would only write software that is either a shell pipeline or can be solved in 45 minutes?

        > would not advise you to actually write a pipeline like that in production,

        So when you are interviewing for a production job, why would you fight for non-production quality solutions?

    • ur-whale 5 years ago

      If the interviewer was gauging your political skills/adaptability, it doesn't look like you fared too well.

      • Frost1x 5 years ago

        That seems like an excuse for poor behavior more so than a deeper level of testing (testing when you know you're right but someone with more authority tells you otherwise). While possible, I'd have to think it's unlikely.

      • lrem 5 years ago

        Do you test new grads for their political skills? I don't.

    • dominotw 5 years ago

      > solution to his numbers.

      Why did he have concrete numbers? Couldn't he simply have increased them when you proposed a single-machine solution?

  • Jabbles 5 years ago

    Does anyone have good numbers for a) RAM, and b) the cost of each of these solutions?

    (RAM is obviously volatile.)

cm2187 5 years ago

Stupid question. What are the use cases for such massively fast write speeds?

If you are storing data to disk at that speed, you fill even the biggest Optane drives in a couple of minutes. So it would have to be an application where you need to overwrite a huge amount of data over and over again.
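
Back-of-the-envelope, with assumed illustrative numbers (a 1.5 TB drive and ~10 GB/s of sustained writes; the real figures depend on the product):

  capacity_gb = 1500          # assumed drive capacity in GB
  write_gb_per_s = 10         # assumed sustained write bandwidth
  print(capacity_gb / write_gb_per_s / 60)   # ~2.5 minutes to fill the drive once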

  • stonewhite 5 years ago

    That also necessitates matching network bandwidth.

    It reminds me of the Bugatti Veyron situation, its tires lasting 15 minutes at top speed yet it runs out of fuel in 12.

    • falcolas 5 years ago

      I'd rather run out of fuel and drift to a stop than have a blowout and crash while driving at top speed. I'm not sure where the problem lies.

      • ReactiveJelly 5 years ago

        It's not a problem with the car, it's an analogy that the fuel and the network are bottlenecks where normally the tires and storage might be.

        • falcolas 5 years ago

          Still doesn’t make sense. The car is correctly configured (for safety purposes), and it would also make sense to be able to write to storage faster than your network speeds (no database writes exactly what it receives from the network; it adds metadata and structuring).

  • dr_zoidberg 5 years ago

    Digital forensics is a situation where you need extreme IO but you aren't really writing all that much -- it's mostly searches over the forensic copy of the computers you are analyzing.

    As for write speed, you don't really need it unless you're doing file/data recovery, but even then you can just make indexes to the disk image.

    AFAIK, very few (if any) are using SCM in their forensic expert workstations, so I can't really tell if you'd saturate the storage capacity or the bandwidth. But in digital forensics, NVMe drives have been a must for a while, so this would definitely help.

  • yxhuvud 5 years ago

    You can, for example, flush a temporary DB table to storage while doing computations if it ends up too big to fit in memory. Just because it is fast doesn't mean it needs to be sustained throughput.

  • lmilcin 5 years ago

    A lot of problems don't scale that easily. You may want to prolong the life of the solution by just giving it faster storage. Think in terms of giving faster storage to an Oracle DB vs. migrating your app to some better-scaling NoSQL solution. SCM might seem expensive until you see the bill for the development costs.

    Another use is to have local storage. You want most of the communication to be between your CPU and RAM/SCM, with networking only transmitting results. In the case of MongoDB that would be something like a very heavy aggregation that needs to look at a lot of data but transfers a relatively small amount of results to the client (think counting the number of objects meeting an arbitrary set of specifications). MongoDB actually requires storage for writing intermediate results of some types of operations, so it is not purely about read speed.
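
    For example, a counting aggregation of that sort might look like this with PyMongo (the connection string, database, collection, and field names are made up for illustration; allowDiskUse lets the server spill intermediate results to local storage, which is exactly where faster storage helps):

      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017")
      events = client["demo"]["events"]   # hypothetical database/collection

      pipeline = [
          {"$match": {"status": "active", "score": {"$gte": 100}}},  # arbitrary specifications
          {"$group": {"_id": "$region", "count": {"$sum": 1}}},      # scans a lot, returns little
      ]
      # allowDiskUse=True permits stages to write intermediate results to storage.
      for row in events.aggregate(pipeline, allowDiskUse=True):
          print(row["_id"], row["count"])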

  • deepstack 5 years ago

    A time-series DB is one thing that comes to mind.

lichtenberger 5 years ago

You can efficiently read data at 256-byte granularity (4 cache lines) with Optane memory (due to checksums). I think it makes much more sense to read/write fine-granular changes, for instance aligning pages to 64 or 256 bytes instead of 4 KB pages, where you often write too much data and pollute the caches with probably unnecessary data. There's a paper about how to add cache-line-aligned mini-pages (16 cache lines): https://db.in.tum.de/~leis/papers/nvm.pdf
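
A rough way to see the write-amplification side of that argument, assuming a single changed 64-byte record and the page sizes mentioned above:

  record_bytes = 64                     # one modified cache line
  for page_bytes in (4096, 256, 64):
      amplification = page_bytes / record_bytes
      print(f"{page_bytes}-byte pages: {amplification:g}x more written than changed")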

  • lichtenberger 5 years ago

    I think, along those lines, reading and writing fine-granular data, especially when storing the full history of the data, is the way to go (in addition to path copying and log-structured storage). Furthermore, pointer swizzling might be needed to get almost in-memory database performance.

    I'm trying to address this with my versioned open-source DBMS project, where only page fragments are stored and a bunch of them are read in parallel to reconstruct a full page in memory. Adding mini-pages plus a simple cache at some point with hot data from several page fragments is at least orthogonal:

    https://github.com/sirixdb/sirix | https://sirix.io

  • ayende 5 years ago

    Writing on a page (4 KB / 8 KB) boundary is almost always a good idea, because there is an overhead per page that you have to account for. It can be in the range of 16-64 bytes, so having a 256-byte mini-page is probably a bad idea.

    Most of the _data_ is also not going to fit in 256 bytes anyway.

    • lichtenberger 5 years ago

      It might add some overhead; I guess it depends on the page implementation, but 16 bytes seems to be the minimum, I agree. That said, if someone changes only one record, the best thing is to write at the smallest granularity possible on the storage medium.

      So, it might well be that someone is interested in only one or a few records. Why then fetch and cache a whole 4 KB page if latency is good in both cases (4 KB and 256 bytes)? On the other hand, I agree that you should probably cache more data from a hot page.

  • wtallis 5 years ago

    FYI, Intel Optane Memory is a wildly different product from the one under discussion, which is Optane DC Persistent Memory or Optane DCPMM. The former is a low-capacity consumer SSD caching software solution, and the latter is persistent memory modules in a DIMM form factor.

    • lichtenberger 5 years ago

      Yes, I mean Optane DC Persistent Memory. It's confusing because they seem to market three different product categories under almost the same name (Optane Memory, Optane DC Persistent Memory, and Optane SSDs).

Pelic4n 5 years ago

So you need to do THAT to get decent perfs with MongoDB. Good to know!

  • jiggawatts 5 years ago

    Unless I'm missing something, their benchmark graphs at the bottom of the report show that there is no significant benefit to using SCM with MongoDB! Their internal overheads must be high enough to swamp the few microseconds gained from the faster storage.

    • Pelic4n 5 years ago

      That was a joke. I've been using MongoDB in production for years and I'm salty. :)

  • fortran77 5 years ago

    That was my first thought, too. You need these tricks to make MongoDB "WebScale"

    http://www.mongodb-is-web-scale.com/

    (I can get better results for Key-Value storage by using an SQL Database--Postgres or MariaDB--for key/values over MongoDB. And if you know SQL well, you can get even better results using a real relational database and optimal queries than pulling out keys/values and ad-hocking your relations in some Javascript code or whatever these web kids are doing.)

    • jd_mongodb 5 years ago

      You don't pick a database like MongoDB purely for its key/value query performance. SQL will do really well until it does really badly, for example in cases where you exceed the limitations of the box/VM it is running on.
