Guide to Lock Convoys (davekilian.com)
The best pattern I've found for inter-thread communication involves CAS on a ring buffer slot on the producer side (you can skip the CAS if there's only one producer thread), with either busy-waiting or yield/sleep on the consumer(s). If you keep the producer's and consumers' synchronization state separate, the worst of the contention can be avoided. CAS is very fast when uncontended or lightly contended. Busy-waiting suits latency-sensitive applications where threads need to coordinate with latency bounds measured in microseconds. Yield/sleep plays more fairly with other processes and with power, but then you're at the mercy of the OS timeslice size.
One other great thing about the ring buffer is built-in, sane backpressure. I think the article is good, but it's too quick to dismiss backpressure as a tool for system health.
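Something like this is what I mean, as a rough Java sketch (the names, the power-of-two sizing, and the spin/yield split are just illustrative, and it assumes a single consumer for brevity):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Toy multi-producer, single-consumer ring: producers claim a slot by CAS-ing
    // the tail (a single producer could skip the CAS), a full ring rejects the
    // publish (backpressure), and the consumer either spins or yields.
    final class MpscRing<T> {
        private final AtomicReferenceArray<T> slots;
        private final AtomicLong tail = new AtomicLong(); // next slot to claim
        private final AtomicLong head = new AtomicLong(); // next slot to consume
        private final int mask;

        MpscRing(int capacityPowerOfTwo) {
            slots = new AtomicReferenceArray<>(capacityPowerOfTwo);
            mask = capacityPowerOfTwo - 1;
        }

        // Producer side: returns false when the ring is full, pushing back on callers.
        boolean offer(T item) {
            while (true) {
                long t = tail.get();
                if (t - head.get() >= slots.length()) return false; // full: built-in backpressure
                if (tail.compareAndSet(t, t + 1)) {                 // claim slot t
                    slots.set((int) (t & mask), item);              // publish into the claimed slot
                    return true;
                }
                // CAS lost: another producer claimed slot t, retry
            }
        }

        // Consumer side: busy-wait for microsecond latency, or yield to be polite.
        T take(boolean busyWait) {
            long h = head.get();
            int idx = (int) (h & mask);
            T item;
            while ((item = slots.get(idx)) == null) {
                if (busyWait) Thread.onSpinWait();  // latency-sensitive path
                else Thread.yield();                // friendlier, but at the OS scheduler's mercy
            }
            slots.set(idx, null);                   // release the slot for reuse
            head.set(h + 1);
            return item;
        }
    }

The nice property is that offer() returning false is the backpressure signal: callers decide whether to drop, retry, or slow down, instead of the queue silently growing.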
On x86, you're quite a bit better off using fetch-and-add instead of CAS.
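In ring-buffer terms (the 'tail' counter here is just illustrative), the two ways to claim the next slot look roughly like this in Java:

    import java.util.concurrent.atomic.AtomicLong;

    // Claiming the next slot two ways, using a hypothetical shared 'tail' counter.
    public class ClaimSlot {
        static final AtomicLong tail = new AtomicLong();

        // CAS loop: may spin and re-read repeatedly under contention.
        static long claimWithCas() {
            long t;
            do {
                t = tail.get();
            } while (!tail.compareAndSet(t, t + 1));
            return t;
        }

        // Fetch-and-add: a single atomic increment (lock xadd on x86), no retry loop.
        static long claimWithFetchAdd() {
            return tail.getAndIncrement();
        }
    }

The catch is that fetch-and-add claims unconditionally, so a bounded ring needs some other way to handle the full case (e.g. detecting the overshoot after the fact) rather than checking before the claim.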
> Maybe we can’t eliminate them, but that doesn’t mean there’s nothing we can do about them.
Specifically for the request case, I recall a talk suggesting that processing requests last-in-first-out resolves this failure mode and also reduces the 99th-percentile latency. The intuition is that if a request is already going to be late, you might as well abandon it and process it later, because it's late either way, instead of doing a slow job first and making everything else slow in the process.
Maybe that's a strange metaphor for work too.
> as well as reduces the 99th percentile latency
How does it manage to do that?
When you're getting close to your limits and requests are actually waiting in the queue, I would expect FIFO to slow everything down and thereby provide backpressure, while LIFO keeps a very nice median speed but has an increasing percentage of requests that time out and retry once or even twice.
Are there significant dynamics I'm not thinking of?
Maybe if you have bursts that are just big enough for FIFO to delay >1% of total requests, but small enough for LIFO to drop <1%? But in that situation quoting a single percentile paints a misleading picture: the 99th would do better under LIFO, but at the cost of trashing the higher percentiles.
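A toy single-server simulation makes that tradeoff easier to see; all the numbers below are made up purely to create that kind of small burst on top of steady load:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.List;

    // Toy single-server queue: steady arrivals plus one burst, fixed 1ms service time.
    // Compares sojourn times (wait + service) under FIFO vs LIFO dispatch.
    public class FifoVsLifo {
        static final double SERVICE = 1.0; // ms per request

        record Req(int id, double arrival) {}

        static double[] simulate(List<Req> reqs, boolean lifo) {
            double[] sojourn = new double[reqs.size()];
            Deque<Req> queue = new ArrayDeque<>();
            double clock = 0;
            int next = 0;
            while (next < reqs.size() || !queue.isEmpty()) {
                if (queue.isEmpty()) {                        // server idle: jump to next arrival
                    clock = Math.max(clock, reqs.get(next).arrival());
                }
                while (next < reqs.size() && reqs.get(next).arrival() <= clock) {
                    queue.addLast(reqs.get(next++));          // admit everything that has arrived
                }
                Req r = lifo ? queue.pollLast() : queue.pollFirst();
                clock += SERVICE;                             // serve it
                sojourn[r.id()] = clock - r.arrival();        // queueing delay + service
            }
            Arrays.sort(sojourn);
            return sojourn;
        }

        static double pct(double[] sorted, double p) {
            int i = (int) Math.round(p * (sorted.length - 1));
            return sorted[Math.min(i, sorted.length - 1)];
        }

        public static void main(String[] args) {
            List<Req> reqs = new ArrayList<>();
            int id = 0;
            for (int i = 0; i < 2000; i++) reqs.add(new Req(id++, i * 1.2)); // steady ~83% load
            for (int i = 0; i < 8; i++) reqs.add(new Req(id++, 500.0));      // small burst at once
            reqs.sort((a, b) -> Double.compare(a.arrival(), b.arrival()));

            for (boolean lifo : new boolean[] {false, true}) {
                double[] s = simulate(reqs, lifo);
                System.out.printf("%s  p50=%.1f  p99=%.1f  p99.9=%.1f  max=%.1f%n",
                        lifo ? "LIFO" : "FIFO", pct(s, 0.50), pct(s, 0.99), pct(s, 0.999), s[s.length - 1]);
            }
        }
    }

With these invented parameters you should see roughly the picture described above: FIFO smears the burst's delay across the many requests queued behind it, inflating p99, while LIFO keeps p99 near the bare service time and concentrates the pain in the burst's own requests, which only shows up at p99.9 and the max.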
If you have a long-standing queue, switching to LIFO makes sense. It requires you to know how long each request has been waiting, however, and many servers don't even have this basic information. Every request needs to come with a timestamp and a deadline so the request processor can make rational decisions about processing or dropping it. If a service finds a request that arrived a long time ago and sat in the queue until its deadline was in the past, that's a fairly good signal to temporarily switch to LIFO processing.
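As a rough sketch of that policy (the names, and the drop-expired-then-go-LIFO behavior, are just one way to do it):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Deadline-aware dispatch: requests carry their enqueue time and deadline, and the
    // dispatcher switches to LIFO while it keeps finding already-expired work up front.
    final class DeadlineAwareDispatcher {
        record Request(Runnable work, Instant enqueuedAt, Instant deadline) {}

        private final Deque<Request> queue = new ArrayDeque<>();
        private boolean lifoMode = false;

        synchronized void submit(Runnable work, Duration budget) {
            Instant now = Instant.now();
            queue.addLast(new Request(work, now, now.plus(budget)));
        }

        synchronized Request nextOrNull() {
            while (!queue.isEmpty()) {
                Instant now = Instant.now();
                Request head = queue.peekFirst();
                if (head.deadline().isBefore(now)) {
                    // The oldest request sat past its deadline: drop it and go LIFO,
                    // so fresh requests still have a chance of meeting theirs.
                    queue.pollFirst();
                    lifoMode = true;
                    continue;
                }
                Request picked = lifoMode ? queue.pollLast() : queue.pollFirst();
                if (queue.isEmpty()) lifoMode = false; // backlog cleared: back to FIFO
                return picked;
            }
            lifoMode = false;
            return null;
        }
    }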
LIFO processing in the steady state is not a great idea because it will stochastically starve some requests for no reason.
In the steady state, the queue or stack should be nearly empty, no? It's only when shocks occur, as the article describes, that a lock convoy forms.
I know this isn't any kind of silver bullet, but if you have these kinds of time constraints, wouldn't building your solution on a real-time OS make more sense?
So if I understand this right, a one-thread-per-core request-handling architecture instead of a one-thread-per-request one should make this problem much less likely. Would switching to JVM fibers be a solution if this problem appears in a traditional threaded JVM app?
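For reference, the two setups in JVM terms look roughly like this (a sketch; note that virtual threads still block each other on a contended lock, they just stop tying up OS threads while they wait):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class RequestExecutors {
        public static void main(String[] args) {
            // (1) Thread-per-core: a small fixed pool sized to the hardware,
            //     so at most N OS threads ever contend for the lock.
            ExecutorService perCore =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            // (2) Virtual threads (JDK 21 "fibers"): cheap per-request threads
            //     multiplexed onto a small carrier pool.
            ExecutorService perRequestVirtual = Executors.newVirtualThreadPerTaskExecutor();

            perCore.shutdown();
            perRequestVirtual.shutdown();
        }
    }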
Why does the author say throughput tanks in their first convoy example? Yes, requests are being queued (delayed), but the server is still doing the same amount of work per "snap". It's not really a traffic shockwave; it's more like adding cars to the beginning of the road.
Yeah, that part confused me too; the latency goes up, but the throughput stays the same. It looks like it's just a bad example, because the later example under "Forming a Lock Convoy" shows a better understanding of the problem's dynamics. A better initial example would be fast-food or retail workers serving a single queue of customers.
Sort of reminds me of this post on Netflix Tomcat tuning https://netflixtechblog.com/tuning-tomcat-for-a-high-through...
Request spikes cause the whole system to back up, whereas tuning appropriate upper limits allows the servers to drop requests before they get overloaded and grind to a halt.
Regarding profiling, I would have thought it should be possible to get some insight using perf (e.g. perf timechart, plus some fancy flame graph that also takes off-CPU time into account)?
Wrt lock design, IIRC Linus wrote something along the lines that every time they optimize throughput by getting rid of fairness, they have to come back and add fairness later on. That doesn't mean fairness and throughput are incompatible: high-performance locking designs are often unfair in the short term to gain the performance boost from avoiding unnecessary context switches, but there is some mechanism to ensure that on a slightly longer timescale nobody is starved.
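For what it's worth, the JDK exposes exactly that knob: ReentrantLock is unfair ("barging") by default for throughput, and can be constructed in strictly FIFO-fair mode at a cost.

    import java.util.concurrent.locks.ReentrantLock;

    public class FairVsUnfair {
        // Default: unfair ("barging") lock. A thread arriving right as the lock is
        // released can grab it ahead of threads already parked in the wait queue,
        // avoiding a wake-up/context-switch round trip but risking starvation.
        static final ReentrantLock throughputLock = new ReentrantLock();

        // Fair mode: hands the lock to the longest-waiting thread. No starvation,
        // but typically much lower throughput under contention, because every
        // hand-off involves unparking a sleeping thread.
        static final ReentrantLock fairLock = new ReentrantLock(true);

        public static void main(String[] args) {
            fairLock.lock();
            try {
                // critical section
            } finally {
                fairLock.unlock();
            }
        }
    }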