A filesystem corruption bug breaks loose

Kernel bugs can have all kinds of unfortunate consequences, from inconvenient crashes to nasty security vulnerabilities. Some of the most feared bugs, though, are those that corrupt data in filesystems. The losses imposed on users can be severe, and the resulting problems may not be noticed for a long time, making recovery difficult. Filesystem developers, knowing that they will have to face their users in the real world, go to considerable effort to prevent this kind of bug from finding its way into a released kernel. A recent failure in that regard raises a number of interesting questions about how kernel development is done.

On November 13, Claude Heiland-Allan created a bug report about a filesystem corruption problem with the 4.19.1 kernel; other users joined in with reports of their own. Initially, the problem was thought to be in the ext4 filesystem, since that is what the affected users were using. Tracking the problem down took a few weeks, though, because few developers were able to reproduce the problem. There were some attempts at using bisection to find the commit that caused the problem, but they proved to be worse than useless, as they identified the wrong commits and caused developers to waste time on false leads.

It took until December 4 for Lukáš Krejčí to correctly bisect the problem down to a block-layer change. Commit 6ce3dd6eec, added during the 4.19 merge window, optimized the handling of requests in the multiqueue block layer. If there is no I/O scheduler in use, and if the hardware queue is not full, this patch causes new I/O requests to be placed directly into the hardware queue, short-circuiting a bunch of unnecessary processing. It's a harmless-seeming change that should make I/O go a little faster.
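In rough terms, the change amounts to a simple conditional at submission time. The sketch below is a toy, user-space model of that decision, not the kernel's actual code; all of the types and helper names (submit_request(), issue_directly(), and so on) are invented for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the 4.19 direct-issue optimization; the types and
 * helpers here are invented and do not match the kernel's API. */
struct request { int sector; int len; };
struct request_queue { bool has_elevator; bool hw_queue_full; };

static void issue_directly(struct request *rq)
{
	printf("direct issue: sector %d, length %d\n", rq->sector, rq->len);
}

static void insert_into_queues(struct request *rq)
{
	printf("queued: sector %d, length %d\n", rq->sector, rq->len);
}

static void submit_request(struct request_queue *q, struct request *rq)
{
	/* The optimization: with no I/O scheduler and room in the
	 * hardware queue, the request can bypass the software
	 * queues and go straight to the driver. */
	if (!q->has_elevator && !q->hw_queue_full)
		issue_directly(rq);
	else
		insert_into_queues(rq);
}

int main(void)
{
	struct request_queue q = { .has_elevator = false, .hw_queue_full = false };
	struct request rq = { .sector = 2048, .len = 8 };

	submit_request(&q, &rq);
	return 0;
}
```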

Things can go bad, though, if the low-level driver for the block device is unable to actually execute that request. This is most likely to happen as the result of a resource shortage — memory, perhaps, or something related to the hardware itself. In that case, the driver will return a soft failure, causing the I/O request to be requeued for another attempt later. While that request sits in the queue, the block layer may merge it with other requests for adjacent blocks, which should be fine. If, however, the low-level driver has already done some of the setup for the request, such as creating scatter/gather DMA mappings, those mappings may not be updated to match the larger, merged request. That results in only part of the request being executed by the hardware, with bad effects on the data involved.
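The sequence of events can be modeled in a few lines. The following toy program (again with invented names, and bearing no resemblance to any real driver) shows how a mapping cached on the first attempt, covering only the original request, can cause just part of a later, merged request to reach the hardware:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the corruption mechanism; all names are invented. */
struct request {
	int sectors;		/* current size of the request */
	int mapped_sectors;	/* sectors covered by the cached DMA setup */
	bool setup_done;	/* driver already built its mappings? */
};

static bool driver_queue_rq(struct request *rq)
{
	if (!rq->setup_done) {
		/* First attempt: set up the mappings, then hit a
		 * resource shortage and ask for a requeue. */
		rq->mapped_sectors = rq->sectors;
		rq->setup_done = true;
		return false;	/* soft failure: try again later */
	}
	/* Retry: the stale mapping is reused even though the
	 * request may have grown while it sat in the queue. */
	printf("hardware executes %d of %d sectors\n",
	       rq->mapped_sectors, rq->sectors);
	return true;
}

int main(void)
{
	struct request rq = { .sectors = 8 };

	driver_queue_rq(&rq);	/* fails softly; request is requeued */
	rq.sectors += 8;	/* block layer merges an adjacent request */
	driver_queue_rq(&rq);	/* only the original 8 sectors reach disk */
	return 0;
}
```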

The problem was partially addressed by an initial commit, but a second fix was required to resolve a new problem introduced by the first. Both fixes were included in the 4.20-rc6 release; they also found their way into 4.19.8. The original patch was never selected for backporting to older stable kernels, so those were not affected.

How this happened

Naturally, some developers are wondering how a problem like this could have made it into a final kernel release without having been noticed. Surely automated testing should have been able to find such a bug? Not all parts of the kernel have exhaustive automated testing regimes (to put it charitably), but the block layer is better than most. That is a function of the severity of bugs at that layer (which can often cause data loss), but also of the relative ease of testing that code. There are few hardware-specific issues to deal with, so testing can usually be done in virtual machines.

This particular case, though, turned up a hole or two in the testing regime. It required a filesystem on a device configured to use no I/O scheduler at all (a relatively rare configuration on the affected filesystems), and that device needed to run into resource limitations that would cause it to temporarily fail requests. The driver for the device also needed to store state in the request structure and use that state in subsequent retries. Finally, the multiqueue block layer must be in use, which only happens by default for SCSI devices as of 4.19. That last change is unlikely to have been picked up by many developers or testing setups, since kernel configuration files tend to be carried forward from one release to the next.
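For the curious, whether a given device is running without an I/O scheduler can be seen in its queue/scheduler sysfs file, where the active scheduler appears in brackets (for example, "[none] mq-deadline kyber"). A minimal check, assuming a device called sda:

```c
#include <stdio.h>

/* Print the I/O scheduler line for a block device; "sda" is just
 * an example device name. */
int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

	if (!f) {
		perror("open scheduler file");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}
```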

As a result of all those factors, nobody doing automated testing of the block layer reported this particular problem, and developers found themselves unable to reproduce it once it came to light. Ubuntu users who install from the kernel-ppa repository, instead, got a shiny new configuration file with their bleeding-edge kernel and were thus exposed to the problem. This sort of occurrence is one of the reasons why developers tend to like to remove configuration options; more options create more configurations that must be tested. In this case, block maintainer Jens Axboe has said that he will be writing a new test for this particular issue. He also noted that "this is the first corruption issue we've had (blk-mq or otherwise) in any kernel in the storage stack in decades as far as I can remember".

Doing better next time

There have been some suggestions that the kernel community should have done more to protect users once the problem was discovered. It is not clear that there is a whole lot that could have been done, though. Arguably there should be a mechanism to inform users that a given kernel might have a serious issue and should not be used, but the community lacks such a mechanism. This particular issue was, in fact, less visible than many since it was not discussed on the mailing lists. The only people who were aware of it were those who were watching the kernel bugzilla — or who were busy restoring their filesystems from backups.

Laura Abbott noted that the problem left Fedora users in an awkward position: they could either run a 4.19 kernel that might mangle their data or run the 4.18 kernel, which was no longer being supported. Perhaps, she said, there should be a way to respond to problems like this:

I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong.

Willy Tarreau responded that he ensures that the previous long-term stable release works on the systems he supports for just this reason. Dropping back to 4.14 is unlikely to be a pleasing alternative for many users, but it's not clear that something better will come along. Some people would surely like it if the previous release (4.18 in this case) were maintained for longer but, as stable maintainer Greg Kroah-Hartman put it, that "isn't going to happen"; the resources to do that simply are not available.

Kroah-Hartman did have one suggestion for situations like this: tell him and the other stable maintainers about it. In this case, he was only informed, by accident, shortly before the bug was tracked down and fixed. There is not much that could have been done even if he had known sooner, since nobody knew what the origin of the problem was. But keeping the stable maintainers in the loop regarding serious problems that have appeared in stable kernels can only help to get the fixes out more quickly.

One other aspect of this bug is that, depending on how one looks at it, it could be seen as resulting from either of two different underlying issues. One is that the multiqueue block layer is arguably still not sufficiently mature, so severe bugs like this one are still turning up in it. The other is that maintaining two independent block subsystems is putting a strain on the system and letting bugs get through. One's point of view may well affect how one views the prospect of the legacy block API being removed in the next merge window, which is the current plan. Ted Ts'o let it be known that this idea "is not filling me with a lot of joy and gladness". But for many others, the time to make this transition is long past and, in any case, the number of devices that can only use the multiqueue API is growing quickly.

The good news is that problems of this severity are rare and, when they do happen, they get the full attention of the developers involved. Some early adopters were burned, which is never a good thing, but the vast majority of users will never be affected by this issue. Some testing holes have been identified that will hopefully be closed in the near future. But no amount of testing will ever reveal all of the bugs in the system; occasionally a serious one will escape and bite users. With luck and effort, the number and severity of such events can be minimized, but they are not going to be entirely eliminated anytime in the near future.
