Testing disks: Lessons from our odyssey selecting replacement SSDs

bbc.co.uk

131 points by yavor-atanasov 8 years ago · 77 comments

wtallis 8 years ago

The biggest lesson to take away from this is probably that they thought they knew how to test an SSD, but were quite obviously clueless:

> we run a fairly comprehensive set of block-level tests using fio, consisting of both sequential and random asynchronous reads and writes straight to the disk. Then we throw a few timed runs of the venerable dd program at it.

Running dd as a benchmark is a major red flag. It shows that they didn't know what they were doing with fio, and didn't trust its results. They later started using IOzone and a custom-written tool to accomplish stuff they should have done with fio in their initial testing.

They also did not mention pre-conditioning the drives or ensuring that their tests run long enough to reach a steady state. This is one of the most important aspects of enterprise SSD testing and they would have known that if they'd consulted any outside resources on the subject instead of making up their own testing guidelines from a position of extreme ignorance about the fundamentals of the hardware they were using and the details of their own workload.
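To make "reaching a steady state" concrete, here is a minimal Python sketch of a steady-state check, loosely modelled on published industry methodology such as SNIA's. The 5-round window and the 20% / 10% thresholds are assumptions from my recollection of that methodology, and `is_steady_state` is a hypothetical helper, not anything the article's authors used.

```python
def is_steady_state(rounds, window=5, max_excursion=0.20, max_slope_excursion=0.10):
    """Return True if the last `window` per-round results (e.g. IOPS) look steady.

    Assumed criteria (hedged, see lead-in):
      1. max - min within the window is at most 20% of the window average.
      2. the drift of the least-squares line across the window is at most
         10% of the window average.
    """
    if len(rounds) < window:
        return False

    y = rounds[-window:]
    mean = sum(y) / window
    if mean == 0:
        return False

    # Criterion 1: spread of the measurements.
    if (max(y) - min(y)) > max_excursion * mean:
        return False

    # Criterion 2: slope of the best linear fit, expressed as total drift
    # from the first to the last round of the window.
    x_mean = (window - 1) / 2
    numerator = sum((xi - x_mean) * (yi - mean) for xi, yi in enumerate(y))
    denominator = sum((xi - x_mean) ** 2 for xi in range(window))
    slope = numerator / denominator
    drift = abs(slope) * (window - 1)
    return drift <= max_slope_excursion * mean
```

The idea is to keep running rounds of the same workload on a pre-conditioned drive and only start recording results once a check like this passes; a fresh-out-of-box drive will typically fail it because throughput is still falling.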

They really should stop calling any of their tests "comprehensive".

  • jmiserez 8 years ago

    I think you just expected too much from this article.

    This is not a comprehensive guide to testing SSDs, it’s the story of what the author went through when trying to test SSDs. It’s well written and the author seemed to really engage with the topic and describe all the setbacks he had and research they did. I did not think he presented himself as an expert, just a software engineer tasked with upgrading their SSDs. And who knows, maybe this was only a 20% time project.

    There are a lot of blog posts on HN that have much less actual content and where the authors have much less of a clue, yet often the response is overwhelmingly positive because someone took the time to write it up. You should really be more charitable here.

    And to address the calls for “outside experts”: If everyone called in outside experts for everything hardware (or software) related, we software engineers would never get to do anything cool or learn some new framework. We’d just be watching an outside expert do their thing. And outside “experts” are not necessarily better, often they might just sell themselves better. And who is going to check their work if the knowledge is all outsourced?

    I think it’s great that the BBC lets their engineers do this and learn along the way, and a place where that is possible sounds like a nice place to work. It’s not like they had any downtime or anything because of this.

    • wtallis 8 years ago

      > And to address the calls for “outside experts”:

      You completely misinterpreted (and misquoted) me on that one. I wasn't implying that they should hire a consultant for this kind of thing, but they should at least have bothered to read anything about the methodology used by SSD reviewers or the industry standard storage testing methodology freely published by organizations like SNIA. It's clear the BBC guys didn't even spend an afternoon trying to read up on how to evaluate SSD performance; they just jumped in and started re-inventing the wheel, hitting all the foreseeable problems along the way. It looks like they now have a clue and have learned a lot from the process, but this is not how you should handle this kind of upgrade.

      • jmiserez 8 years ago

        EDIT: no need to downvote wtallis, he has a valid point!

        Sorry I misread and misunderstood that, I did not mean to misquote.

        What you’re saying is of course absolutely true, they should have done much more research at the beginning.

        However, at the start of a seemingly easy task, research may not be the first thing to spring to mind (although it should be).

  • fjsolwmv 8 years ago

    BBC published a technical (but really PR) article written by amateurs posing as pros, instead of consulting reputable experts?

    • tomcart 8 years ago

      To be clear, this isn't a news article written by our journalists - it is a piece written by the team themselves that we felt may be of interest to others, and that might help us do things better in the future. While I enjoyed reading it, I can assure you that SSD performance testing doesn't move the BBC PR needle compared to the identity of the new Doctor Who.

      We're acutely aware that we've still got much to learn in this space, so if there are thoughts you have on how we could do better we're all ears.

      Finally, while I assured you it wasn't a PR piece we're always looking for engineers in this area (and across the whole BBC) so if you'd be interested in helping us improve, get in touch.

      • mrguyorama 8 years ago

        Do you accept American Engineers? /s

        I found the piece to be wonderful. I don't do large scale storage work, so I'm very un-knowledgeable in the area, but it's great to see someone else's struggles other than Amazon or a backup service. And it is yet another indicator that the BBC cares about quality content instead of just pushing up some stock price.

        Thanks for your write-up.

      • matt4077 8 years ago

        Don't let the gratuitous negativity get to you! I may only see a very small slice of what the BBC does, what with being one of those continental imperialists and all. But what does find its way to me has always seemed to be excellent.

        (Not a job application. Unless you know something about the EU/British future that I don't.)

    • wtallis 8 years ago

      The problem isn't that the amateurs wrote the article, it's that the amateurs made the purchasing decisions that created the story in the first place.

    • sandworm101 8 years ago

      No. They were duped by their own IT department. The reporters thought they were talking to experts on all things IT-related. The reality was that their team, while no doubt experts on many things, knew little about testing SSDs. This wasn't PR but a simple mistake in journalism. They should have talked to outside experts before publishing.

nickcw 8 years ago

This is the problem IMHO

> We also looked up whether our HBA used TRIM in its current configuration. It turns out, in RAID mode, the HBA did not support TRIM. We did do some trim-enabled testing with a different machine, but these results are hard to compare fairly. In any case, we can't currently enable TRIM on our production systems.

In our experience SSD write performance goes to sh*t if you don't regularly TRIM them.

Running fstrim once a day is enough to keep them healthy.

RAID cards not passing TRIM is a big problem for us too...

(Experience from day job at Hosting Provider)

  • masklinn 8 years ago

    > In our experience SSD write performance goes to sh*t if you don't regularly TRIM them.

    Interesting, is that because of the load? It seemed "modern" SSDs have GCs good enough that trim isn't quite necessary anymore to ensure good performances in consumer loads.

    > RAID cards not passing TRIM is a big problem for us too...

    Are there NVMe RAID cards? I assume they'd necessarily pass the command along considering deallocate is just one parameter/option of the DATA SET MANAGEMENT command, or do RAID cards just drop the entire command?

    • takeda 8 years ago

      > Interesting, is that because of the load? It seemed "modern" SSDs have GCs good enough that trim isn't quite necessary anymore to ensure good performances in consumer loads.

      A drive has no way to tell whether filesystem is using a given block or not. TRIM is a way for the filesystem to tell it that. So I would imagine the GC that you're referring to is working on the blocks marked with TRIM.

      BTW, besides running fstrim from cron on Linux, you can also mount the filesystem with the discard option, so that it sends TRIM commands when files are deleted.

      • wtallis 8 years ago

        > So I would imagine the GC that you're referring to is working on the blocks marked with TRIM.

        Not necessarily. Since flash doesn't support in-place modification of data, any change to a portion of a file (or other FS data structure) that writes less than a contiguous 16MB (depending on the flash) will create a need for GC on the drive with or without TRIM. You can put a drive into a state of needing to do a lot of GC even without changing the quantity of live data.

    • wtallis 8 years ago

      Modern consumer SSDs still benefit from TRIM, but are mostly able to keep up with GC just fine without it when subjected to typical consumer IO workloads, which are full of idle time for the drive to catch up. But if you fill an SSD to the brim, it'll slow down, and the cheapest SSDs will slow down a lot.

      There are no hardware RAID solutions for NVMe, though there are now several hardware platforms supporting software RAID for NVMe devices in their motherboard firmware so you can boot from a NVMe RAID array. As with any other RAID solution, translating trim/unmap/deallocate commands takes a bit of effort, and less mature NVMe RAID solutions don't necessarily bother.

linsomniac 8 years ago

This reminds me of testing I did years ago on ... CD-ROMs. Funny how lessons from old technology can apply to new technology.

Around 15 years ago my company did a Linux distribution on CDs: KRUD. It was updated monthly, and we had something like 400 subscribers. For various reasons we burned these CDs in house on a cluster I built.

We would burn, eject, read and checksum, and if the read test succeeded we would ship it out. We found some users with some discs had problems reading them. We contacted these users and paid them to return the CDs and did further testing on them.

Our initial test was using dd, and we found that the discs that were not obviously damaged in shipping, would tend to pass tests on some of our CD-ROM drives, but fail on others. But when they did succeed, they would tend to take longer than normal.

I wrote a new test program that instead of using dd directly used SCSI read commands, and timed every one. It would then count the number of reads that were "slow" (like 2x normal) and those that were "really slow" (like 5x), and if these got over a certain threshold we would throw away the disc.
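The classification step described above can be sketched in Python roughly as follows. This is an illustration only: the original tool issued raw SCSI reads, which is not shown here, and the use of the median as "normal" plus the rejection thresholds are assumptions of mine; `classify_reads` and `should_reject` are hypothetical names.

```python
import statistics


def classify_reads(latencies, slow_factor=2.0, very_slow_factor=5.0):
    """Count reads that were "slow" (>= 2x normal) and "really slow" (>= 5x).

    `latencies` is a non-empty list of per-read durations in seconds.
    "Normal" is taken to be the median latency (an assumption).
    """
    baseline = statistics.median(latencies)
    slow = sum(
        1 for t in latencies
        if slow_factor * baseline <= t < very_slow_factor * baseline
    )
    very_slow = sum(1 for t in latencies if t >= very_slow_factor * baseline)
    return slow, very_slow


def should_reject(latencies, max_slow=10, max_very_slow=2):
    """Reject the medium if either count exceeds its (hypothetical) threshold."""
    slow, very_slow = classify_reads(latencies)
    return slow > max_slow or very_slow > max_very_slow
```

The point of timing every operation rather than the whole run is that a handful of very slow reads barely move the total, yet they are exactly the signal that separates a marginal disc from a good one.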

Being able to time the raw operations was incredibly useful, and seems like it could have shown the authors of this paper problems before being deployed to production.

Except, they didn't really seem to do very thorough testing of the drives. Running stress testing on a 1TB drive for an hour seems pretty short.

Also in my above job we did hosting. We found that if we burned in disks by reading/writing to them 10 times ("badblocks -svw -p 10"), we would almost never experience drive failures on the Hitachi drives we were using. If we didn't do this, the drives would have a fairly high chance of falling out of the RAID array in production.

As drive sizes increased from 20GB to 200GB to 1TB, these tests started taking weeks to complete. But, they were totally worth it.

HarryHirsch 8 years ago

Flash memory has three operations, read, write and erase, the last two destructively. If you pretend they are harddisks with two operations of read and write you go through all sorts of contortions. Sometimes you fall flat on the face, as seen here.
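A toy Python model makes the mismatch concrete. It is a deliberately simplified sketch: the class name, the four-page block, and the naive `rewrite_page` helper are all illustrative assumptions, and real NAND has far larger blocks and many more constraints.

```python
class FlashBlock:
    """Toy erase block: a page can be programmed once, then only erased
    together with every other page in the block."""

    def __init__(self, pages_per_block=4):
        self.pages = [None] * pages_per_block  # None means "erased"

    def read(self, page):
        return self.pages[page]

    def program(self, page, data):
        # Flash cannot overwrite in place: the page must be erased first.
        if self.pages[page] is not None:
            raise ValueError("page already programmed; erase the whole block first")
        self.pages[page] = data

    def erase(self):
        # Erase works on the whole block and destroys every page in it.
        self.pages = [None] * len(self.pages)


def rewrite_page(block, page, data):
    """Naive 'disk-style' overwrite: copy the block out, erase it, and
    program every surviving page back along with the new data."""
    saved = list(block.pages)
    saved[page] = data
    block.erase()
    for index, contents in enumerate(saved):
        if contents is not None:
            block.program(index, contents)
```

Pretending this is a disk means every small overwrite turns into something like `rewrite_page`. As I understand it, real drives avoid that cost by writing the new data to a fresh page elsewhere and remapping, which is the job of the flash translation layer.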

Why don't operating systems treat SSDs as flash memory, and why doesn't the file system cooperate with the underlying hardware instead of pretending it's a disk? For home use that may even work, but in a demanding environment the extra complexity will invariably fail.

This is a genuine question, I'm an amateur here.

  • pkaye 8 years ago

    Speaking as someone who used to work on SSD firmware, here are some rambling thoughts... Yes moving some of the FTL to the OS will help a lot in reducing the complexity for the SSD developer but the problems are just moved up the levels. The OS will probably still have to use a COW scheme aware of the block and page size restrictions of the underlying flash. And you can't do a raw disk copy without accounting for defective blocks. Maybe the SSD will still handle basic ECC protection and data scrambling but the OS will now have to handle read disturb, wear leveling, defect management, and data recovery using signal processing. But many of these characteristics will change from one NAND technology to another so someone will have to characterize and update the algorithms. I would actually say it is this last bit that really trips up SSD firmware design. Otherwise you would think after an iteration or two of firmware we would have a solid design but the flash technology tends to bring up some new requirements with each node that introduces more complexity.

  • wtallis 8 years ago

    There is some work on Open-Channel SSDs, that move most of the flash translation layer (FTL) to the host system. There are two major problems with this approach:

    1. Each OS that wants to use the drive needs a compatible implementation of the FTL. Consumer systems always have at least two operating systems in play (UEFI counts for these purposes). Enterprise systems are where you will actually find non-boot data-only drives.

    2. Flash memory changes. The FTL needs very different parameters depending on whether you're using Toshiba flash or Samsung flash, and even depending on whether you're using last year's Toshiba flash or the stuff they're manufacturing today.

    These aren't insurmountable problems, but they're enough to keep such products confined to a small niche. Instead, we're seeing a trend of SSDs accepting optional hints that allow them to perform the kinds of optimizations you'd expect from a fully host-managed SSD. The ATA TRIM command was just the tip of this iceberg.

    • _urga 8 years ago

      Could you provide more details on these hints? Are they ioctl calls? Assuming one is using the disk as a raw block device, without a filesystem.

      • wtallis 8 years ago

        I was referring to extensions to the command set the OS uses to interact with the drive itself. Some of these are quite like a madvise() call, but at a lower layer. Others permit the drive to expose a bit more information to the OS so that it can better optimize its IO patterns. I summarized the most recently standardized changes at [1], but there are several other features in the NVMe spec [2] that fall into this category. The extension for IO determinism has been approved for the next standard but the official spec for it hasn't been published. (I'm referring here mostly to NVMe stuff, but there are SCSI/SAS analogs to many of these features.)

        [1] https://www.anandtech.com/show/11436/nvme-13-specification-p...

        [2] http://www.nvmexpress.org/resources/specifications/

  • DSMan195276 8 years ago

    > Why don't operating systems treat SSDs as flash memory, and why doesn't the file system cooperate with the underlying hardware instead of pretending it's a disk? For home use that may even work, but in a demanding environment the extra complexity will invariably fail.

    The simple reason is that SSDs themselves expose a regular HD interface and then do a lot of the flash-memory related stuff internally. For example, without TRIM support (which early SSDs did not have) there is no 'erase' command the OS can send to an SSD.

    With that in mind, SSDs also have memory controllers on them that map the blocks the OS sees to actual SSD blocks (scattered across the memory chips). So when the OS writes to block 1 it may write to block 15 internally on the SSD, and then block 2 might write to block 4002. Combine this with caching and other various details on the SSD side, and it leaves little predictable behavior for the OS to exploit.

pxlfkr 8 years ago

Plugging SATA drives into a SAS HBA may not be optimal: "SAS/SATA expanders combined with high loads of ZFS activity have proven conclusively to be highly toxic" http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-g...

  • equalunique 8 years ago

    Interesting. This may explain a strange incident that I once encountered. One day I came home to my CSE-847 machine with SATA drives hooked to SAS expanders on an mpt device. The whole system was unresponsive and the drives were all as hot as a fresh pot of coffee. I immediately shut down the system and let the drives cool out on the concrete floor. Everything seemed to work later, but it was quite a scare. It was 12 2TB drives set up as 6 zraid2 mirrors.

mjw1007 8 years ago

One lesson here is that when reusing a previous test setup you ought to look for assumptions you made which are no longer valid.

If they'd been starting from scratch, while thinking about modern SSDs, it's quite likely they wouldn't have built an application load tester using files containing only dots.

But as it was an existing system, it didn't get the same amount of attention.

barrkel 8 years ago

I built my home system early this year using the Samsung 960 Evo 1TB M2. Actual speeds were nowhere near advertised speeds until I enabled write-back cache on the drive, which gave me pause for concern about data persistence reliability. AFAIK the Samsung drivers (as opposed to the MS drivers I originally used) just turn this on without needing to be twiddled in settings.

Just to confirm, I have seen the behaviour described herein, with write-back caching making an enormous difference with the Samsung EVO product in particular.

  • wtallis 8 years ago

    Microsoft's NVMe driver is generally regarded within the storage industry as a bad joke. The fact that it has very different cache control behavior from their SATA driver despite the two using the same check box and same help text is really inexcusable.

  • dr_ick 8 years ago

    Watch the following video that explains why the 960 EVO has poor write performance after exhausting its cache:

    https://youtu.be/RqaZTwW_X2o?t=511

    For sustained write performance you need the 960 PRO.

  • olavgg 8 years ago

    I also have a Samsung 960 Evo. Its performance is what I consider a joke, fio and pg_test_fsync make it almost look as slow as spinning SAS drives.

    For example on a 4kb sync write with 16 threads test, the 960 Evo cannot do more than 1000 iops. In comparison the Intel P4800X (Optane) does friggin 500 000 iops on the same test. That is a 500X difference.

    https://forums.servethehome.com/index.php?threads/did-some-w...

    • wtallis 8 years ago

      The 960 EVO is a consumer grade SSD with firmware tuned for bursts of I/O (through eg. the use of SLC write caching) at the expense of sustained write throughput. It doesn't have power loss protection capacitors, so it can't perform safe write caching when you're issuing synchronous writes. 4kB is much smaller than the underlying page size of its NAND flash, so performance is going to suck without write combining. You're testing it in shackles, with a workload that doesn't at all match its intended use case. That doesn't make it a joke, it just makes it the wrong kind of drive to use for stereotypical enterprise applications.

      • olavgg 8 years ago

        So what is the use case for this drive? The 960 Evo/Pro are supposed to be premium models, but a better investment would be a cheaper SSD drive with more storage. And if you rarely write that much, more ram will increase the read speed significantly.

        • basch 8 years ago

          a consumer pc.

          the pro does not have the buffer the evo does. the evo is not a premium model, it is entry level cutting edge

          • olavgg 8 years ago

            That is not very specific, a consumer PC would be fine with a 750 Evo also, maybe two of them in raid 0 for twice the sequential read & write speed. I believe for most consumers, having more SSD storage per $ is more important.

            • basch 8 years ago

              a 750 evo is a sata drive, a 950 is an nvme drive. completely different technology. if my laptop has an nvme slot why would i buy a sata drive. 2 750s is not faster than 1 950

    • barrkel 8 years ago

      It is nowhere near as slow as spinning drives, that's ludicrous. Mega IOPS are simply not required in a desktop. I'm not trying to run multiple VMs with multiple databases on this thing. In fact it's rarely writing at all.

      • olavgg 8 years ago

        Have you checked your SMART data for how much your read/write ratio is? I think you will find those results surprising, even when you think that you don't write that often.

        • wtallis 8 years ago

          It's not just a matter of how many bytes are being written; desktop workloads rarely need to do synchronous writes.

          • olavgg 8 years ago

            There are a lot of desktop applications that do synchronous writes all the time. Chrome, Firefox and Spotify are a few examples, as they use SQLite, which issues the fsync() system call. But yeah, you will be fine even with a spinning hard drive. My point is, what use case is there for paying a premium for the Samsung NVMe drives when you would be just fine with a cheaper model?

            • wtallis 8 years ago

              The IO done by applications like web browsers and Spotify may be synchronous from their perspective (i.e. write(2) syscalls instead of aio), but those applications definitely don't need the sort of hard ACID transactional guarantees that require all disk caches to be disabled and every single write command sent to the drive to be a barrier. The fact that they're running atop journaling filesystems is good enough.

            • barrkel 8 years ago

              I've had about 20 spinning drives fail on me in the past 25 years or so. The more I can keep spinning drives out of my life, the happier I am, generally.

              You're right about the volume of data written; it's nearly as much as read, likely due to continuously syncing browsers - I leave my desktop on 24/7, and run 2 browsers, all of which probably doesn't help. It doesn't add up to an argument for high IOPS though.

              I chose the Samsung drive primarily because of all the stories about premature SSD failures - all the drives I've bought have either been Intel or Samsung. I got this model because it was the 1TB option that was both SSD and either Samsung or Intel.

              I cannot tell you how many days I've wasted replacing disks in home raid arrays, or before that recovering data from backups. I'm also done with trying to shuffle data between tiny SSDs and large spinning disks based on required speed of access. I keep all my bulky files on my ZFS NAS, but I don't worry about how many apps or games I have installed - 1T has, for now, been more than enough.

pcfe 8 years ago

You could give blkreplay a go next time you decide on which disks to buy. I find the additional effort is worth it, but YMMV. Use one of the shipped loads for a quick test, but you really want to run blktrace against your current setup and feed that data to blkreplay.

have_faith 8 years ago

Great article, easy to follow considering it's far away from my normal domain.

I noticed they didn't mention any brands by name though, why is that?

  • tankenmate 8 years ago

    The BBC has a very strong product prominence policy[0] (i.e. avoid naming brands when possible); being government funded is a large driver of this policy.

    [0] http://www.bbc.co.uk/editorialguidelines/guidelines/editoria...

    EDIT: fixed policy name and added link

    • anoother 8 years ago

      It's a shame this is so selectively applied.

      See, for example:

      - The constant mention of speaking to people 'over Skype' on the News

      - Publicization of Twitter hashtags on Questiontime and other programs

      - Hours worth of Top Gear footage (and the entire Arctic Special) that were effectively Toyota Hilux advertisements

      • pbhjpbhj 8 years ago

        Many, many hours worth of prime time TV is advertising for movies/books/etc.. "So you're over here promoting your new movie" says eg Mr Norton, "Yeah, I'm getting US$20M for this movie so I appreciate the free advertising here ..." (or some bs the publicist wrote for them) says the guest, "cue trailer".

        When I was on the BBC I had to change my logo polo-shirt so as not to advertise my single-location SME business. Yet Nike, et al., logos are fine, as is advertising by sports teams, etc..

        It's all very inconsistent.

      • gaius 8 years ago

        The constant favourable coverage of Google and Apple, always talking about hipster-friendly Flickr when boring old Photobucket was doing 20x the volume... they apply their rules very selectively....

        • fjsolwmv 8 years ago

          Photobucket was for link sharing to other sites, no? Flickr, by contrast, was a destination for publishing albums and browsing, with much higher quality photos.

      • arethuza 8 years ago

        Isn't most of Top Gear about coverage of different automotive brands?

        • anoother 8 years ago

          It is. I'm referring to:

          - A running gag that, iirc, covered multiple seasons, about the 'indestructibility' of a red 90s Toyota Hilux. Performing destructive stunts and then driving away with minimal/no repairs. No other cars were involved in this segment.

          - A special, hour-long episode involving a grueling trip to the pole in a fleet of brand new, red Toyota Hiluxes.

          Whichever way you look at it, this was fantastic advertising and brand reinforcement for Toyota.

      • faceplanted 8 years ago

        I mean the hashtags have no real comparison, but those other two are a bit shady.

    • keypress 8 years ago

      It's not Government funded. There's a mandatory TV tax (for TV viewers), the fee of which is set by Government.

      • tankenmate 8 years ago

        You are right, but considering that until recently not paying your TV license was a crime, in effect it is government funded; ultima ratio regum and all that...

        • Angostura 8 years ago

          Stealing my car is a crime too. Today I learned my car is government funded.

          • tankenmate 8 years ago

            Your argument is a non sequitur. If others were required to pay for your car via legislation then you might have an argument.

          • barrkel 8 years ago

            Property rights are a government-funded right, correct.

            The roads you drive on are almost certainly subsidised from beyond the direct taxes you pay on your car and fuel too.

      • fjsolwmv 8 years ago

        "Taxes" are money paid to the government for use in government funded activities.

  • dspillett 8 years ago

    > I noticed they didn't mention any brands by name though, why is that?

    As a public service broadcaster in the UK, the BBC must be very careful about naming specific brands and products due to rules laid out in their legal remit (and fear of legal action from a party that feels unfairly disadvantaged by a competitor getting a good mention or them getting a bad one).

    This is why Blue Peter always uses "sticky backed plastic" instead of "sellotape" or "scotch tape", and people on BBC shows "vacuum" where the common parlance is "to hoover" (hoover being a brand name that got verbed like Google -> to google).

    • afandian 8 years ago

      (This is almost too petty a point to reply, but "sticky backed plastic" is a sheet of adhesive transparent film, like you'd use to cover books. I think they probably would have said "sticky tape")

      • laumars 8 years ago

        I don't know why you are being downvoted because you are correct. I'd further your post and also point out the correct term to vacuum is still "vacuum". The BBC didn't invent that term like they did "sticky backed plastic".

        The radio is often quite amusing when there are guests on there who accidentally slip a brand name. Presenters often then reply with the same rehearsed quote:

            "Other brands of [cola] also exist."

      • rjsw 8 years ago

        The main UK brand of "sticky backed plastic" was Fablon.

      • bonaldi 8 years ago

        They literally did say "sticky-backed plastic" for sellotape, which left me bemused as a nipper

        • dboreham 8 years ago

          Especially bemused because that's not what they meant. SBP is Fablon or plastic film that comes in a wide (2+ ft) roll and is primarily used to cover books.

    • Bromskloss 8 years ago

      > like Google -> to google

      Hang on! Surely, "to google" refers to performing a search specifically with Google, right?

      • grzm 8 years ago

        Just like a kleenex must be a tissue produced by Kimberly-Clark or one can only xerox on equipment manufactured by a particular company? That ship is in the process of leaving the port :)

      • pbhjpbhj 8 years ago

        Not a chance. To google means "to search online [using a search engine]".

        Have you never witnessed someone googling something using Bing?

        • Bromskloss 8 years ago

          I have never witnessed anyone using Bing at all! My father used DuckDuckGo once, although he wasn't aware of it.

          In any case, if I say "to google", I mean "search using Google". I try not to say it, though, because I think it sounds silly to use product names like that.

noir_lord 8 years ago

God damn that was well written, excellent post!

fulafel 8 years ago

Sounds like they are observing transparent data compression in the SSD controller and FTL. SandForce controllers even made a marketing point of it back in the day. It manifests as faster IO with repetitive data, along with reduced flash wear.
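A quick way to see why the dots-only test files mentioned elsewhere in the thread are a problem on a compressing controller is to compare how well such a payload compresses against random data. The Python sketch below uses zlib purely as a stand-in; whatever a given SSD controller does internally is unknown and will differ, and `compression_ratio` is a hypothetical helper.

```python
import os
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(payload)) / len(payload)


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB test payloads
    dots = b"." * size            # like a load-test file containing only dots
    random_data = os.urandom(size)  # effectively incompressible

    print(f"dots:   {compression_ratio(dots):.4f}")
    print(f"random: {compression_ratio(random_data):.4f}")
```

If a drive compresses transparently, a benchmark fed the first kind of payload is measuring something much closer to the controller than to the flash, so test data should resemble the compressibility of the real workload.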

pricechild 8 years ago

One of my favourite parts of this article is how Elliot Thomas describes himself as a "Software Engineer".

We may be writing software, but without a working knowledge of hardware it's not worth much!

  • blowski 8 years ago

    I have barely any knowledge of hardware, but I have built plenty of pieces of software that have helped people.

    • ape4 8 years ago

      Civil engineers don't understand chemistry or quantum mechanics, so how do they build a bridge?

      • pbhjpbhj 8 years ago

        They must know some chemistry surely, like weathering effects on concrete; effects of potential chemical spills, eg on roadways, metal-concrete-surfacing interactions.

        ?

    • fjsolwmv 8 years ago

      The article is about pieces of hardware
