Read Locks Are Not Your Friends

eventual-consistency.vercel.app

20 points by emschwartz 3 days ago · 21 comments

armon 44 minutes ago

Lock contention is a real issue for any multi-threaded system, and while a RW mutex is useful when you have a longer executing critical section, for something very short lived there is still a cache coordination cost. In many of the HashiCorp applications, we work around this by using an immutable radix tree design instead [1].

Instead of a RW mutex, you have a single writer lock. Any writer acquires the lock, makes changes, and generates a new root pointer to the tree (any update operation generates a new root, because the tree is immutable). Then we do an atomic swap from the old root to the new root. Any readers do an atomic read of the current point in time root, and perform their read operations lock free. This is safe because the tree is immutable, so readers don't need to be concerned with another thread modifying the tree concurrently, any modifications will create a new tree. This is a pattern we've standardized with a library we call MemDB [2].

This has the advantage of making reads multi-core scalable with much lower lock contention. Given we typically use Raft for distributed consensus, you only have a single writer anyways (e.g. the FSM commit thread is the only writer).

We apply this pattern to Vault, Consul, and Nomad all of which are able to scale to many dozens of cores, with largely a linear speedup in read performance.

[1] https://github.com/hashicorp/go-immutable-radix [2] https://github.com/hashicorp/go-memdb
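The single-writer / lock-free-reader pattern described above can be sketched in Rust. This is a hypothetical toy (all names invented): a full `HashMap` clone stands in for the immutable radix tree, and a mutex-guarded `Arc` stands in for the true atomic root-pointer swap, so it shows the snapshot semantics rather than the real lock-free read path.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Copy-on-write map with a single writer and snapshot readers.
// Writers serialize, build a new root, and publish it; readers grab
// the current root and then read without any further coordination,
// because a published snapshot never mutates.
struct CowMap {
    root: Mutex<Arc<HashMap<String, String>>>, // published root pointer
    writer: Mutex<()>,                         // single-writer lock
}

impl CowMap {
    fn new() -> Self {
        CowMap {
            root: Mutex::new(Arc::new(HashMap::new())),
            writer: Mutex::new(()),
        }
    }

    // Readers take a point-in-time snapshot (an Arc clone); all
    // lookups on the snapshot are then free of locking.
    fn snapshot(&self) -> Arc<HashMap<String, String>> {
        self.root.lock().unwrap().clone()
    }

    // Writers serialize on the writer lock, derive a new root from
    // the old one, and swap it in. Readers holding the old root are
    // unaffected.
    fn insert(&self, key: String, value: String) {
        let _guard = self.writer.lock().unwrap();
        let mut next = (*self.snapshot()).clone();
        next.insert(key, value);
        *self.root.lock().unwrap() = Arc::new(next);
    }
}
```

A reader that took a snapshot before a write keeps seeing the old tree; only new snapshots observe the update.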

amluto 6 hours ago

The code examples are confusing. They show the code that takes the locks, but they don’t show any of the data structures involved. The rwlock variant clones the Arc (makes sense), but the mutex variant does not (is it hidden inside inner.get?).

In any case, optimizing this well would require a lot more knowledge of what’s going on under the hood. What are the keys? Can the entire map be split into several maps? Can a reader hold the rwlock across multiple lookups? Is a data structure using something like RCU an option?

ot 5 hours ago

This is drawing broad conclusions from a specific RW mutex implementation. Other implementations adopt techniques to make the readers scale linearly in the read-mostly case by using per-core state (the drawback is that write locks need to scan it).

One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/

There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require designing the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
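The per-core reader technique can be sketched as a toy "big reader" lock. This is a hypothetical illustration, far simpler than folly::SharedMutex: each core gets its own padded reader counter so read lock/unlock touches only a local cache line, and the writer raises a flag and then scans every counter, which is exactly the "write locks need to scan it" drawback.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

const CORES: usize = 8;

// Pad each counter to its own cache line so readers on different
// cores never invalidate each other's lines.
#[repr(align(64))]
struct PaddedCounter(AtomicUsize);

struct PerCoreRwLock {
    readers: [PaddedCounter; CORES],
    writer: AtomicBool,
}

impl PerCoreRwLock {
    fn new() -> Self {
        PerCoreRwLock {
            readers: std::array::from_fn(|_| PaddedCounter(AtomicUsize::new(0))),
            writer: AtomicBool::new(false),
        }
    }

    // Reader: bump only this core's counter, then check for a writer.
    fn read_lock(&self, core: usize) {
        loop {
            self.readers[core % CORES].0.fetch_add(1, Ordering::SeqCst);
            if !self.writer.load(Ordering::SeqCst) {
                return; // no writer active: read lock held
            }
            // A writer is active: back off and wait for it to finish.
            self.readers[core % CORES].0.fetch_sub(1, Ordering::SeqCst);
            while self.writer.load(Ordering::SeqCst) {
                std::hint::spin_loop();
            }
        }
    }

    fn read_unlock(&self, core: usize) {
        self.readers[core % CORES].0.fetch_sub(1, Ordering::SeqCst);
    }

    // Writer: claim the flag, then scan every per-core counter and
    // wait for in-flight readers to drain.
    fn write_lock(&self) {
        while self.writer.swap(true, Ordering::SeqCst) {
            std::hint::spin_loop();
        }
        for r in &self.readers {
            while r.0.load(Ordering::SeqCst) != 0 {
                std::hint::spin_loop();
            }
        }
    }

    fn write_unlock(&self) {
        self.writer.store(false, Ordering::SeqCst);
    }
}
```

The trade is visible in the structure: the uncontended read path is one local atomic increment, while the write path costs a scan proportional to the core count.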

  • PaulHoule 5 hours ago

    I think it’s not unusual for reader-writer locks, even well-implemented ones, to end up in situations where so many readers are stacked up that writers never get a turn, or where one writer winds up holding up N readers, which is not so scalable as you increase N.

  • amluto 2 hours ago

    Wow, folly::SharedMutex is quite an example of design tradeoffs. I wonder what application the authors wanted it for where using a global array was better than a per-mutex array.

  • mike_hearn 5 hours ago

    Right, and if you're on the JVM you have access to things like ConcurrentHashMap which is lock free.

  • Jyaif 5 hours ago

    And a Rust equivalent of folly::SharedMutex: https://docs.rs/crossbeam-utils/latest/crossbeam_utils/sync/...

alecco 4 hours ago

It always amazes me how much these folks are willing to work and struggle just to avoid reading basic database literature.

Also Fedor Pikus has some nice cppcon talks from years ago on all this. Very low level.

_dky 6 hours ago

If the implementation is task based and a task always runs on the same virtual CPU (with slots equal to the number of CPUs, or the parallelism), I wonder if something like the below might help.

The RW lock could be implemented as an array with one entry per slot, padded so that each slot sits in its own cache line (to avoid invalidating CPU caches when different slots are read/written).

For the read lock: each task acquires the lock for its own slot.

For the write lock: acquire the locks from the leftmost slot to the rightmost. Writes can starve readers: while the writer is blocked on an in-flight reader at one slot, it is already holding the locks for all the slots to its left.

I do not know how Rust RW locks are implemented.
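The per-slot scheme sketched above could look roughly like this in Rust (all names hypothetical): one padded mutex per slot, readers lock only their own slot, and a writer acquires every slot left to right.

```rust
use std::sync::Mutex;

const SLOTS: usize = 8;

// Pad each slot's mutex to its own cache line so readers on
// different slots never share a line.
#[repr(align(64))]
struct Slot(Mutex<()>);

struct PerSlotRwLock {
    slots: [Slot; SLOTS],
}

impl PerSlotRwLock {
    fn new() -> Self {
        PerSlotRwLock {
            slots: std::array::from_fn(|_| Slot(Mutex::new(()))),
        }
    }

    // Reader: lock only the slot belonging to this task's CPU.
    fn read<R>(&self, slot: usize, f: impl FnOnce() -> R) -> R {
        let _g = self.slots[slot % SLOTS].0.lock().unwrap();
        f()
    }

    // Writer: acquire every slot left to right, excluding all
    // readers, then run the write.
    fn write<R>(&self, f: impl FnOnce() -> R) -> R {
        let guards: Vec<_> = self.slots.iter().map(|s| s.0.lock().unwrap()).collect();
        let r = f();
        drop(guards);
        r
    }
}
```

The fixed left-to-right acquisition order is what keeps concurrent writers from deadlocking against each other, at the cost of the reader starvation noted above.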

whizzter 6 hours ago

I'd be super interested in how this compares between CPU architectures; is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD CPUs?

  • hansvm 6 hours ago

    I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing.

    We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
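An A/B scheme along those lines might be sketched like this (hypothetical names; the commenter's actual system is not shown). Two buffers: readers always read the buffer named active and may see slightly stale data; the single writer fills the inactive buffer and then flips the index.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

// Double-buffered value: reads never wait on a write in progress,
// they just see the previously published buffer.
struct ABValue {
    bufs: [AtomicU64; 2],
    active: AtomicUsize, // index of the buffer readers should use
}

impl ABValue {
    fn new(initial: u64) -> Self {
        ABValue {
            bufs: [AtomicU64::new(initial), AtomicU64::new(initial)],
            active: AtomicUsize::new(0),
        }
    }

    // Reader: load whichever buffer is currently published.
    fn read(&self) -> u64 {
        self.bufs[self.active.load(Ordering::Acquire)].load(Ordering::Acquire)
    }

    // Single writer: update the inactive buffer, then flip the index.
    fn write(&self, v: u64) {
        let inactive = 1 - self.active.load(Ordering::Relaxed);
        self.bufs[inactive].store(v, Ordering::Release);
        self.active.store(inactive, Ordering::Release);
    }
}
```

A reader racing with `write` simply gets the old value, which is exactly the stale-data trade-off described above.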

  • gpderetta 5 hours ago

    The behaviour is quite typical for any MESI-style cache coherence system (i.e. most if not all of them).

    A specific microarchitecture might alleviate this a bit with lower latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.

  • PunchyHamster 5 hours ago

    Taking a read lock requires communication between cores. It just can't scale with CPU count.

sevensor 5 hours ago

Does this apply also to std::shared_mutex in C++? This is a timely article if so; I’m in the middle of doing some C++ multithreading that relies on a shared_mutex. I have some measuring to do.

  • gpderetta 5 hours ago

    mostly yes.

    • sevensor 4 hours ago

      Thanks, that’s what I was afraid of. The ping ponging described in the article seems hard to avoid regardless of what language you’re using.

Retr0id 5 hours ago

claudes love to talk about The Hardware Reality

  • stuaxo 5 hours ago

    "The performance Death Spiral" was the point I realised I was being LLMd and bailed out.

api 5 hours ago

Take a look at crates like arc_swap if you have a read-often, write-rarely lock case. You can easily implement the RCU pattern. Just be sure to read about how to use RCU properly.

Done well, this pattern gives you nearly free reads and cheap writes, sometimes cheaper than a lock.

For frequent writes a good RWLock is often better since RCU can degrade rapidly and badly under write contention.
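The read-often, write-rarely shape can be sketched with the standard library alone (hypothetical names). Readers load an `Arc` snapshot and then work entirely on that snapshot; writers build a complete replacement and publish it. arc_swap would replace the `RwLock<Arc<_>>` holder below with a lock-free atomic load/store; the brief lock here is only for illustration.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug)]
struct Config {
    timeout_ms: u64,
}

// Holder for the current value. The critical section is only long
// enough to clone or replace the Arc, never to use the data.
struct Shared {
    current: RwLock<Arc<Config>>,
}

impl Shared {
    fn new(initial: Config) -> Self {
        Shared {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    // Read path: clone the Arc, then read with no lock held.
    fn load(&self) -> Arc<Config> {
        self.current.read().unwrap().clone()
    }

    // RCU-style write: construct a whole new value and swap it in.
    // Readers holding the old Arc keep it alive until they drop it.
    fn store(&self, next: Config) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

Readers that loaded a snapshot before a store keep reading the old value unperturbed, which is why the reads are nearly free; the cost moves to writers, who must rebuild the value each time.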
