The multiqueue block layer subsystem, introduced in 2013, was a necessary step for the kernel to scale to the fastest storage devices on large systems. The implementation in current kernels is incomplete, though, in that it lacks an I/O scheduler designed to work with multiqueue devices. That gap is currently set to be closed in the 4.12 development cycle when the kernel will probably get not just one, but two new multiqueue I/O schedulers.
The lack of an I/O scheduler might seem like a fatal flaw in the multiqueue code, but the truth is that the need for a scheduler was not clearly evident at the outset. High-end drives are generally solid-state devices lacking rotational delay problems; they are thus not as sensitive to the ordering of operations. But it turns out that there is value in I/O scheduling even for the fastest devices; a scheduler can coalesce adjacent requests, reducing the overall operation count, and it can prioritize some operations over others. So the desire to add I/O scheduling for multiqueue devices has been there for a while, but the code has been lacking.
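As a concrete illustration of the coalescing idea, the following user-space C sketch merges two requests that touch adjacent sectors into one larger operation; the `io_request` structure and `try_back_merge()` helper are invented for this example and are not the block layer's actual API.

```c
/*
 * Illustrative sketch (not kernel code): how a scheduler can coalesce
 * two requests that touch adjacent sectors into a single, larger one,
 * reducing the number of commands sent to the device.
 */
#include <stdbool.h>
#include <stdio.h>

struct io_request {
	unsigned long long sector;	/* starting sector on the device */
	unsigned int nr_sectors;	/* length of the request in sectors */
};

/* Merge @next into @req if it begins exactly where @req ends. */
static bool try_back_merge(struct io_request *req, const struct io_request *next)
{
	if (req->sector + req->nr_sectors != next->sector)
		return false;
	req->nr_sectors += next->nr_sectors;
	return true;
}

int main(void)
{
	struct io_request a = { .sector = 1024, .nr_sectors = 8 };
	struct io_request b = { .sector = 1032, .nr_sectors = 8 };

	if (try_back_merge(&a, &b))
		printf("merged: sector %llu, %u sectors in one operation\n",
		       a.sector, a.nr_sectors);
	return 0;
}
```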
Things got closer in the 4.11 merge window, when the block layer gained support for I/O scheduling for multiqueue devices. The deadline I/O scheduler was ported to this mechanism as a proof of concept, but it was never seen as the real solution going forward.
When I/O scheduling support was added, the first intended user was the Budget Fair Queuing (BFQ) scheduler. BFQ has been in the works for years; it assigns an I/O budget to each process that, when combined with a bunch of heuristics, is said to produce much better I/O response, especially on slower devices. Users with rotational storage benefit from BFQ, but there are also benefits when using slower solid-state devices; as a result, there is a fair amount of interest in using BFQ on devices like mobile handsets, for example.
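The budget notion can be illustrated with a small, hypothetical C sketch: each process-owned queue is given a budget of sectors to transfer while it holds the device, and it loses its turn when that budget is exhausted. The structure and numbers below are made up for illustration and do not reflect BFQ's actual code or heuristics.

```c
/*
 * A rough, hypothetical sketch of the budget idea behind BFQ: each
 * process (queue) gets a budget of sectors it may transfer while it
 * owns the device; when the budget runs out, the device moves on.
 * This is not BFQ's actual code or data structures.
 */
#include <stdio.h>

struct proc_queue {
	const char *name;
	long budget;		/* sectors this process may still transfer */
};

/* Charge @sectors against the process's budget; return what remains. */
static long charge_budget(struct proc_queue *q, long sectors)
{
	q->budget -= sectors;
	return q->budget;
}

int main(void)
{
	struct proc_queue q = { .name = "pid 1234", .budget = 2048 };

	/* The process issues requests until its budget runs out ... */
	while (charge_budget(&q, 512) > 0)
		printf("%s: request dispatched, %ld sectors of budget left\n",
		       q.name, q.budget);

	/* ... at which point another process would be given the device. */
	printf("%s: budget exhausted, switching to the next queue\n", q.name);
	return 0;
}
```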
The idea that BFQ is an improvement over the CFQ scheduler found in mainline kernels is fairly uncontroversial, but getting BFQ merged was still a lengthy process. One of the final stumbling blocks was that it was a traditional I/O scheduler rather than a multiqueue scheduler. The block subsystem developers have a long-term goal of moving all drivers to the multiqueue model, and merging a non-multiqueue I/O scheduler did not seem like a step in that direction.
Over the last several months, BFQ developer Paolo Valente has worked to port the code to the multiqueue interface. The known problems with the port have been resolved, and block subsystem maintainer Jens Axboe has agreed to merge it for 4.12. If all goes to plan, the long wait for the BFQ I/O scheduler will finally be at an end.
The interesting thing is that BFQ will not be the only multiqueue I/O scheduler entering the mainline in 4.12; there will be another one, developed over a much shorter time period, that should also be merged then. One might well wonder why a second scheduler is needed, especially since kernel developers place a premium on general solutions that can address a wide variety of use cases. But it seems that there is, indeed, a reasonable case to be made for merging a second multiqueue I/O scheduler.
BFQ is a complex scheduler that is designed to provide good interactive response, especially on those slower devices. It has a relatively high per-operation overhead, which is justified when the I/O operations themselves are slow and expensive. This complexity may not make sense, though, in situations where I/O operations are cheap and throughput is a primary concern. When running a server workload using solid-state devices, it may be better to run a much simpler scheduler that allows for request merging and perhaps some simple policies, but which mostly stays out of the way.
The Kyber I/O scheduler, posted by Omar Sandoval, would appear to be such a beast. Kyber is intended for fast multiqueue devices and lacks much of the complexity found in BFQ; it is less than 1,000 lines of code. Its policies, while simple, would appear to be an interesting echo of the bufferbloat work done in the networking stack.
I/O requests passing through Kyber are split into two primary queues, one for synchronous requests and one for asynchronous requests — or, in other words, one for reads and one for writes. A process issuing a read request is typically unable to proceed until that request completes and the data is available, so such requests are seen as synchronous. A write operation, instead, can complete at some later time; the initiating process doesn't usually care when that write actually happens. So it is common to prioritize reads over writes, but not to the point that writes are starved.
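A minimal C sketch of that split might look like the following; the enum values and `classify()` helper are invented for illustration and are not Kyber's internal interface.

```c
/*
 * Illustrative sketch only: splitting requests into a synchronous
 * (read) domain and an asynchronous (write) domain, as described
 * above for Kyber. The enum and helper are standalone inventions.
 */
#include <stdio.h>

enum req_op { OP_READ, OP_WRITE };
enum kyber_domain { DOMAIN_SYNC, DOMAIN_ASYNC };

/*
 * Reads block the issuing process, so they go to the synchronous
 * domain; writes can complete later, so they are treated as
 * asynchronous and given lower priority.
 */
static enum kyber_domain classify(enum req_op op)
{
	return op == OP_READ ? DOMAIN_SYNC : DOMAIN_ASYNC;
}

int main(void)
{
	printf("read  -> %s\n",
	       classify(OP_READ) == DOMAIN_SYNC ? "sync queue" : "async queue");
	printf("write -> %s\n",
	       classify(OP_WRITE) == DOMAIN_SYNC ? "sync queue" : "async queue");
	return 0;
}
```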
The key to the Kyber algorithm is that the number of operations (both reads and writes) sent to the dispatch queues (the queues that feed operations directly to the device) is strictly limited, keeping those queues relatively short. If the dispatch queues are short, then the amount of time that passes while any given request waits in the queues (the per-request latency) will be relatively small. That ensures a quick completion time for higher-priority requests.
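The depth-limiting idea can be sketched as a simple admission check: a request is dispatched only if the queue feeding the device is below its limit, and is otherwise held back in the scheduler. The names and numbers here are illustrative only, not taken from the Kyber code.

```c
/*
 * A minimal sketch of the depth-limiting idea: only a fixed number of
 * requests may sit in the dispatch queue at once, so no request waits
 * behind a long backlog.
 */
#include <stdbool.h>
#include <stdio.h>

struct dispatch_queue {
	unsigned int depth;	/* requests currently queued for the device */
	unsigned int max_depth;	/* limit enforced by the scheduler */
};

/* Admit a request only if the dispatch queue is below its limit. */
static bool dispatch_request(struct dispatch_queue *dq)
{
	if (dq->depth >= dq->max_depth)
		return false;	/* hold the request back in the scheduler */
	dq->depth++;
	return true;
}

int main(void)
{
	struct dispatch_queue dq = { .depth = 0, .max_depth = 4 };

	for (int i = 0; i < 6; i++)
		printf("request %d: %s\n", i,
		       dispatch_request(&dq) ? "dispatched" : "throttled");
	return 0;
}
```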
The scheduler tunes the actual number of requests allowed into the dispatch queues by measuring the completion time of each request and adjusting the limits to achieve the desired latencies. There are two tuning knobs available to the system administrator for setting the latency targets; they default to 2ms for read requests and 10ms for writes. As always, there will be tradeoffs in setting these values; setting them too low will ensure low latencies, but at the cost of reducing the opportunities for the merging of requests, hurting throughput.
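The feedback loop might be sketched, in heavily simplified form, as follows; Kyber's real heuristics are more involved, and the `adjust_depth()` function and sample latencies below are invented for illustration.

```c
/*
 * A hypothetical sketch of the feedback loop described above: compare
 * measured completion latency against the target and nudge the allowed
 * dispatch depth up or down. This only shows the shape of the idea.
 */
#include <stdio.h>

static unsigned int adjust_depth(unsigned int depth,
				 unsigned int measured_us,
				 unsigned int target_us)
{
	if (measured_us > target_us && depth > 1)
		return depth - 1;	/* too slow: shrink the queue */
	if (measured_us < target_us)
		return depth + 1;	/* headroom: allow more merging */
	return depth;
}

int main(void)
{
	unsigned int read_depth = 16;
	/* Simulated completion latencies against the 2ms (2000us) read target. */
	unsigned int samples[] = { 1500, 2600, 2400, 1900, 1700 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		read_depth = adjust_depth(read_depth, samples[i], 2000);
		printf("latency %uus -> read queue depth %u\n",
		       samples[i], read_depth);
	}
	return 0;
}
```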
Kyber, too, has been accepted for the 4.12 merge window. So, if all goes according to plan, the 4.12 kernel will have two new options for I/O scheduling on multiqueue devices. Users concerned with interactive response and possibly using slower devices will likely opt for BFQ, while throughput-sensitive server loads are more likely to run with Kyber. Either way, an important gap in the kernel's multiqueue block I/O subsystem has been filled, clearing the path for an eventual transition of all of the kernel's block drivers to the multiqueue API.