Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?


Kaisong Huang, Jiatang Zhou, Zhuoyue Zhao, Dong Xie, and Tianzheng Wang. SIGMOD'25

Say you are a database, and your job is to execute two kinds of queries (each from a different TPC benchmark):

  • High-priority New-Order queries from TPC-C (OLTP)

  • Low-priority Q2 queries from TPC-H (OLAP)

Congratulations, you are a hybrid transaction/analytical processing (HTAP) database! You would like OLTP transactions to experience low tail latency, while OLAP transactions run at high throughput. How can transactions be scheduled to achieve these goals?

A Non-Preemptive FIFO policy runs transactions to completion in the order they arrive. This is easy, but OLTP transactions suffer high tail latency when they sit in line behind a long queue of OLAP transactions.

A cooperative policy inserts yield points at specific places in the database code. At each yield point, an OLAP transaction can notice that an OLTP transaction is waiting and yield the CPU. Inserting yield points is a hassle, and it is hard to find the Goldilocks frequency: checking for high-priority transactions too often adds overhead, but checking too rarely increases tail latency.
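To make that trade-off concrete, here is a minimal sketch of a yield point, assuming a hypothetical dispatcher that sets an atomic flag when an OLTP transaction arrives (the names are illustrative, not from the paper):

```cpp
#include <atomic>

// Set by the dispatcher when a high-priority OLTP transaction arrives.
// (Hypothetical mechanism; the paper's cooperative baseline may differ.)
std::atomic<bool> oltp_waiting{false};

void yield_to_oltp() { /* hand the core to the waiting OLTP transaction */ }

// A yield point, sprinkled into hot OLAP loops.
inline void check_yield() {
    // A relaxed load keeps the common "nobody is waiting" path nearly free.
    if (oltp_waiting.load(std::memory_order_relaxed)) yield_to_oltp();
}

void olap_scan(int num_batches) {
    for (int b = 0; b < num_batches; ++b) {
        // ... process one batch of tuples ...
        check_yield();  // too often: overhead; too rarely: tail latency
    }
}
```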

A preemptive policy allows high-priority OLTP transactions to borrow a CPU core as soon as possible. In the past, the only practical way to do this involved OS context switching, which is expensive.

Fig. 2 in the paper illustrates these three policies.

Enter userspace interrupts (x86 UINTR on recent Intel CPUs), which allow preemption without involving the OS kernel.

Section 4.2 of the paper makes it clear that implementing userspace context switching on top of userspace interrupts is not trivial. The idiomatic use case for userspace interrupts is an interrupt handler that quickly saves some data and then returns to the code that was running.
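For reference, the idiomatic pattern looks roughly like the sketch below. It assumes GCC's -muintr intrinsics (x86gprintrin.h) and the out-of-tree Linux RFC syscalls for user interrupts; the syscall numbers are placeholders from that patch series and are not in mainline kernels.

```cpp
// Sketch: receiving a user interrupt entirely in userspace.
// Hypothetical build: g++ -muintr -O2 uintr_sketch.cpp
#include <x86gprintrin.h>   // _stui()/_senduipi(), requires -muintr
#include <sys/syscall.h>
#include <unistd.h>

// From the Linux UINTR RFC patch series; not in mainline, numbers may vary.
#define __NR_uintr_register_handler 449
#define __NR_uintr_create_fd        451
#define __NR_uintr_register_sender  452

volatile int got_interrupt = 0;

// Runs in userspace when the interrupt arrives; no kernel transition.
__attribute__((interrupt, target("general-regs-only", "inline-all-stringops")))
void ui_handler(struct __uintr_frame *frame, unsigned long long vector) {
    got_interrupt = 1;  // idiomatic use: quickly record something and return
}

int main() {
    if (syscall(__NR_uintr_register_handler, ui_handler, 0) < 0) return 1;
    int uintr_fd = syscall(__NR_uintr_create_fd, /*vector=*/0, /*flags=*/0);
    _stui();  // allow this thread to receive user interrupts

    // A sender thread would then do:
    //   int idx = syscall(__NR_uintr_register_sender, uintr_fd, 0);
    //   _senduipi(idx);  // delivered without entering the kernel
    while (!got_interrupt) { /* spin */ }
    return 0;
}
```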

The context switch case is not idiomatic. For each CPU core, two pthreads are created, and there are two stacks. Say the CPU core is running a low-priority (OLAP) transaction and a userspace interrupt is delivered to the core. The userspace interrupt handler is invoked, which mucks around with the CPU registers (including the stack pointer) and then returns. But it doesn't return to the code that was running the low-priority transaction; it returns to the code that runs the high-priority (OLTP) transaction. Once the high-priority transaction finishes, it calls a voluntary context switch function, which again mucks around with the CPU registers and the stack pointer in just the right manner so that it returns to the code running the low-priority transaction.
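A skeletal version of such a stack switch, assuming the x86-64 System V ABI, might look like the following. This is a sketch of the idea only; PreemptDB's real switch additionally has to cope with the interrupt frame and the tricky cases listed below.

```cpp
// Sketch: a minimal userspace stack switch (x86-64 System V ABI).
struct Context { void* rsp; };  // all we persist is a stack pointer

// Saves the callee-saved registers on the current stack, parks RSP in
// *from, adopts the stack in *to, and "returns" into whatever code last
// called swap_context on that stack.
extern "C" void swap_context(Context* from, Context* to);

asm(R"(
.text
.globl swap_context
swap_context:
    pushq %rbp
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq  %rsp, (%rdi)   # save our stack pointer into *from
    movq  (%rsi), %rsp   # adopt the other context's stack from *to
    popq  %r15
    popq  %r14
    popq  %r13
    popq  %r12
    popq  %rbx
    popq  %rbp
    ret                  # resume the other context
)");
```

Here the interrupt handler would call something like swap_context(&olap_ctx, &oltp_ctx), and the voluntary switch at the end of the OLTP transaction would call swap_context(&oltp_ctx, &olap_ctx). Because the handler's frame lives on the OLAP stack, switching back unwinds through the handler and returns to the preempted code.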

There are some nitty-gritty details to get this working correctly. Tricky cases have to be handled, such as (a sketch of one mitigation follows the list):

  • A userspace interrupt occurring in the middle of the context switch function

  • Support for database code which uses thread-local storage (e.g., the thread_local modifier in C++)

  • Avoiding deadlocks associated with a userspace interrupt arriving while a database lock is held
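One plausible way to handle the first and third cases is to mask user interrupts around critical regions using the CLUI/STUI instructions (exposed as GCC intrinsics). A sketch under that assumption; the paper's actual handling may differ:

```cpp
#include <x86gprintrin.h>  // _clui()/_stui(), requires -muintr
#include <mutex>

std::mutex page_latch;  // an illustrative database latch

void update_shared_state() {
    _clui();  // mask user interrupts: no preemption can land in here
    {
        std::lock_guard<std::mutex> guard(page_latch);
        // If we were preempted while holding this latch and the OLTP
        // transaction then tried to take it, the core would deadlock.
        // The same masking trick protects the context-switch function.
    }
    _stui();  // unmask: preemption is safe again
}
```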

As seen with xUI, userspace interrupts are cheap, but they still incur a cost. To amortize it, this paper proposes firing a single interrupt to execute a batch of high-priority transactions. Section 5 also describes a starvation avoidance mechanism to ensure that low-priority transactions eventually finish.
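A sketch of the batching idea, assuming a hypothetical queue that the dispatcher fills with pending OLTP work (the names and the batch bound are illustrative, not the paper's actual mechanism):

```cpp
#include <deque>
#include <functional>

// Filled by the dispatcher; a real implementation needs synchronization.
std::deque<std::function<void()>> oltp_queue;
constexpr int kMaxBatch = 32;  // illustrative bound to limit OLAP starvation

// Entered after a single userspace interrupt switches the core onto the
// high-priority context: drain a whole batch before switching back.
void run_oltp_batch() {
    int done = 0;
    while (!oltp_queue.empty() && done < kMaxBatch) {
        auto txn = std::move(oltp_queue.front());
        oltp_queue.pop_front();
        txn();   // run one OLTP transaction to completion
        ++done;
    }
    // voluntary context switch back to the preempted OLAP transaction here
}
```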

Note that when a low-priority transaction is preempted, it is not automatically aborted. The paper assumes the underlying database uses multi-versioning and optimistic concurrency control, so a preempted transaction can simply resume later.

Fig. 10 has the headline results. Wait represents FIFO scheduling. Cooperative represents the case where the OLAP query can occasionally yield the CPU core. PreemptDB is the work described in this paper. Tail latencies for OLTP queries are significantly reduced, while performance of OLAP queries does not change much.

It would be nice to see a comparison against a version that uses traditional OS thread synchronization rather than userspace interrupts.

The details of userspace context switching are tricky and seem orthogonal to databases. A library or OS facility that provides a robust implementation would be a useful thing to have.

The paper doesn’t mention what happens if the work associated with a single query is parallelized across multiple CPU cores. I imagine this complicates the scheduling policy.
