The 7.0 scheduler regression that wasn't

One of the more significant changes in the 7.0 kernel release is to use the lazy-preemption mode by default in the CPU scheduler. The scheduler developers have wanted to reduce the number of preemption modes for years, and lazy preemption looks like a step toward that goal. But then there came a report from Salvatore Dipietro that lazy preemption caused a 50% performance regression on a PostgreSQL benchmark. Investigation showed that the situation is not actually so grave, but the episode highlights just how sensitive some workloads can be to configuration changes; there may be surprises in store for other users as well.

One of the key decisions a CPU scheduler must make is when to remove a running process from the CPU to allow another to run. Preempting processes quickly when there is higher-priority work to do can produce quicker response times and, thus, lower latency. Aggressive preemption comes with a cost, though, in terms of the overall throughput of the system. Rapid switching of tasks can lead to more scheduler overhead, worse cache utilization, and more lock contention. It is hard to find a solution that works for every workload, a fact that has made it hard to remove the variety of preemption modes from the scheduler.

The lazy-preemption mode was designed with an eye toward the needs of both latency-sensitive and throughput-driven workloads. Unlike the full-preemption or realtime modes, lazy preemption will normally allow a task to run for a while even after the need for preemption has been detected. That preemption will be deferred until the task exhausts its time slice, blocks for some other reason, or the next scheduler tick occurs. That leads to a quicker preemption than would happen with the PREEMPT_NONE mode (which only preempts a process at the end of its time slice), but still allows the task to run for a while before the preemption occurs.
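
The difference between the modes can be sketched as a toy model. This is not kernel code; it only encodes the description above, and the names Mode and deferral() are invented for illustration. It assumes that the running task does not block, and that times are given in any consistent unit.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical labels for the preemption modes discussed above."""
    NONE = "none"
    LAZY = "lazy"
    FULL = "full"


def deferral(mode, slice_left, next_tick_in):
    """How long a running task keeps the CPU after preemption is requested.

    Toy model only: assumes the task never blocks on its own.
    slice_left   -- time remaining in the task's time slice
    next_tick_in -- time until the next scheduler tick
    """
    if mode is Mode.FULL:
        # Full preemption acts on the request right away.
        return 0.0
    if mode is Mode.LAZY:
        # Lazy preemption waits for whichever comes first:
        # the end of the time slice or the next tick.
        return min(slice_left, next_tick_in)
    # PREEMPT_NONE, as described above, waits for the end of the time slice.
    return slice_left


if __name__ == "__main__":
    for mode in Mode:
        print(mode.name, deferral(mode, slice_left=3.0, next_tick_in=1.0))
```

Under this model, a task with three time units left in its slice and a tick due in one unit would lose the CPU immediately under full preemption, after one unit under lazy preemption, and after three units under PREEMPT_NONE.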

Dipietro reported that the PostgreSQL performance regression was caused by a large increase in lock contention. PostgreSQL uses user-space spinlocks for much of its concurrency control; one problem with such locks is that, if a lock holder is preempted before the lock can be released, other processes will spin on a lock that may remain held for a long time. An increase in the frequency of preemption could indeed cause this to happen; more preemptions mean more chances to sideline a process before it is able to release a contended lock.
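
A back-of-the-envelope model shows why a preempted lock holder is so expensive. The function name and the numbers below are made up for illustration and are not taken from Dipietro's report; the model assumes that each waiter occupies its own CPU and spins for the entire time the lock remains held.

```python
def wasted_spin_us(waiters, critical_left_us, holder_off_cpu_us):
    """CPU time (in microseconds) burned by spinning waiters.

    Illustrative assumption: every waiter spins on its own CPU until the
    holder has been rescheduled and has finished its critical section.
    """
    return waiters * (critical_left_us + holder_off_cpu_us)


if __name__ == "__main__":
    # Holder not preempted: waiters spin only for the rest of the section.
    print(wasted_spin_us(waiters=8, critical_left_us=1, holder_off_cpu_us=0))
    # Holder preempted for 1ms (a made-up figure) while holding the lock.
    print(wasted_spin_us(waiters=8, critical_left_us=1, holder_off_cpu_us=1000))
```

With these invented figures, a single preemption of the holder turns 8µs of spinning into 8008µs; the cost is dominated by the time the holder spends off the CPU rather than by the critical section itself.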

At first glance, that seemed to be exactly what was happening here, leading scheduler developer Peter Zijlstra to suggest that the proper fix was for PostgreSQL to use time-slice extension to protect lock holders from preemption. This feature allows a process to request that it not be preempted for a short period while it completes the execution of a critical section and releases its locks. It is a useful feature for a situation like this but, as PostgreSQL developer Andres Freund pointed out, time-slice extension was only added in the 7.0 kernel; "requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great". It would also not be a simple change, he said, so backporting any such fix to released versions of PostgreSQL was unlikely to happen.

Zijlstra, faced with the prospect of having to revert a scheduler change that had been years in the making, was clearly reluctant to do so. He suggested that anybody who updates the kernel on a system running PostgreSQL could be expected to update the database manager as well. This is the sort of forced update scenario that the kernel's regression policy is meant to avoid, but Zijlstra remarked "sometimes you have to break eggs to make cake :-)". If a revert was needed, he said, it would be "a very temporary thing". The plan is to eventually remove PREEMPT_NONE entirely, eliminating a fair amount of complexity in the scheduler.

Meanwhile, Freund was unable to reproduce the problem and had a hard time understanding how it could come about. A little while later, though, he figured it out. In his test systems, he had enabled the use of transparent huge pages (THPs), "as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure". When he disabled huge pages, the problem reported by Dipietro surfaced immediately. That revelation removed the urgency from this regression:

I don't see a reason to particularly care about the regression if that's the sole way to trigger it. Using a buffer pool of ~100GB without huge pages is not an interesting workload. With a smaller buffer pool the problem would not happen either.

He added that, even in the absence of the spinlock contention, avoiding huge pages was going to have bad performance effects.

Freund had expressed confusion about how there could be contention on the lock that Dipietro pointed out, since the critical section it protects is quite short. But, when huge pages are not in use, that section will take longer to execute. The extra pressure on the translation lookaside buffer (TLB) caused by using small pages will be a part of the problem, but a bigger part is almost certainly just the greatly increased number of page faults that will occur in that configuration. These effects will increase the execution time in the critical section, increasing the chances that a PREEMPT_LAZY kernel will take control away from a lock-holding process. That slowdown is far less likely to happen when huge pages are in use.
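
Some rough arithmetic gives a sense of the scale involved. The sketch below assumes 4KiB base pages and 2MiB huge pages (typical on x86-64; other systems may differ), uses 100GiB as a stand-in for the "~100GB" buffer pool, and models the scheduler tick as arriving at a uniformly random moment with a preemption request already pending. The tick period and critical-section durations are invented; none of these figures were measured in the reported benchmark.

```python
GIB = 1024 ** 3


def pages_needed(buffer_bytes, page_bytes):
    """Number of pages required to map a buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)


def tick_hit_probability(section_us, tick_period_us):
    """Chance that a tick lands inside a critical section.

    Simplified model: the tick arrives at a uniformly random point
    relative to the start of the section.
    """
    return min(1.0, section_us / tick_period_us)


if __name__ == "__main__":
    pool = 100 * GIB                       # stand-in for "~100GB"
    small = pages_needed(pool, 4 * 1024)   # assumed 4KiB base pages
    huge = pages_needed(pool, 2 * 1024 ** 2)  # assumed 2MiB huge pages
    print(small, huge, small // huge)

    # Assumed 1000µs tick period; section lengths are illustrative only.
    print(tick_hit_probability(1, 1000))
    print(tick_hit_probability(50, 1000))
```

Under those assumptions, the pool needs 26,214,400 small pages but only 51,200 huge pages, a factor of 512 fewer translations to fault in and to keep in the TLB. In the same simplified model, stretching a critical section from 1µs to 50µs raises the chance of a tick landing inside it from 0.1% to 5%.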

One conclusion from that diagnosis is that time-slice extension would be of little help; Freund confirmed that the performance regression happened even when user-space spinlocks were taken out of the picture. That said, he did acknowledge that the feature was worth looking into on its own merits, saying it looks "nice for performance regardless of using spinlocks".

Dipietro confirmed that enabling huge pages caused the regression to disappear. With that report, thoughts of reverting the scheduler change also seemed to disappear. That may be a bit premature, though. There are likely to be systems in the wild running under less-than-optimal configurations that will show regressions when hit with this kind of change. That prospect, in turn, may cause distributors to shy away from lazy preemption in their kernels, regardless of what the scheduler developers might like. An immediate revert might not be in the cards, but the grand plan to remove PREEMPT_NONE may have a longer path to completion than some would like.

Index entries for this article
Kernel: Preemption
Kernel: Scheduler/Preemption models