Understanding Cache line invalidation and striped locks for Concurrent HashMap Implementation

To start with a non-concurrent hashmap, the basic principle is this:

Have a indexed structure (most often an array or set of arrays) for the keys and values.
Get a hash for the key.
Reduce this to be within the size of the arrays. (Modulo does this simply enough, so if the hash value is 123439281 and there are 31 slots available, then we use 123439281 % 31 which is 9 and use that as our index).
See if there's a key there, and if so if it matches (equals).
Store the key if it's new, and the value.

The same approach can be used to find the value for a given key (or to find that there is none).

Of course the above doesn't work if there's a key in the same slot that isn't equal to the key you're concerned with, and there are different approaches to dealing with this, mainly either continuing to look in different slots until one is free, or having slots actually act as a linked list of equal-indexed pairs. I won't go into the details of this.

If you are looking to other slots it won't work once you've filled the arrays (and will be slow before that point) and if you are using linked-lists to handle collisions you will be very slow if you have many keys at the same index (the O(1) you want becomes closer and closer to O(n) as this gets worse). Either way you're going to want a mechanism to resize the internal store when the amount stored gets too large.

Okay. That's a very high-level description of a hashmap. What if we want to make it threadsafe?

It won't be threadsafe by default as e.g. two different threads writing different keys whose hash modulo down to the same value, then one thread might stomp over the other.

The simplest way to make a hashmap threadsafe is to simply have a lock that we use on all operations. That means that every thread will wait on every other thread though, so it won't have very good concurrent behaviour. With a bit of clever structuring it's not too hard to have it that we can have multiple reading threads or a single writing thread (but not both), but that still isn't great.

It's possible to create hashmaps that are safely concurrent without using locks at all (see http://www.communicraft.com/blog/details/a-lock-free-dictionary for a description of one I wrote in C#, or http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable for one in Java by Cliff Click whose basic approach I used in my C# version).

But another approach is striped locks.

Because the basis of the map is either an array for key-value pairs (and likely a cached copy of the hashcode) or a set of arrays for them, and because it is generally safe to have two threads writing and/or reading to different parts of an array at a time (there are caveats, but I'll ignore them for now) the only problems are when either two threads want the same slot, or when a resize is necessary.

And therefore the different slots could have different locks, and then only threads that are operating on the same slot would need to wait on each other.

There'd still be the problem of resizing, but that isn't insurmountable; if you need to resize obtain every one of the locks (in a set order, so that you can prevent deadlocks from happening) and then do the resize, having first checked that no other thread did the same resize in the meantime.

However, if you've a hashmap with 10,000 slots this would mean 10,000 lock objects. That's a lot of memory used, and also a resize would mean obtaining every one of those 10,000 locks.

Striped locks are somewhere in-between the single-lock approach and the lock-per-slot approach. Have an array of a certain number of locks, say 16 as a nice (binary) round number. When you need to act on a slot then obtain lock number slotIndex % 16, and do your operation. Now while threads may still end up blocking on threads doing operations on completely different slots (slot 5 and slot 21 have the same lock) they can still act concurrently to many other operations, so it's a middle-ground between the two extremes.

So that's how striped locking works, at a high level.

Now, modern day memory access is not uniform, in that it does not take the same time to access arbitrary pieces of memory because there is a level of caching (generally at least 2 levels) in the CPU. This caching has both good and bad effects.

Obviously the good effects normally outweigh the bad, or chip manufacturers wouldn't use it. If you access a piece of memory, and then access a piece of memory very close to it, the chances are that second access will be very fast because it will have been loaded into the cache on the first read. Writes are also improved.

It's already natural enough that a given piece of code is likely to want to do several operations on blocks of memory close to each other (e.g. reading two fields in the same object, or two locals in a method), which is why this sort of caching worked in the first place. And programmers further work to take advantage of this fact as much as possible in how they design their code, and collections such as hashmaps are a classic example. E.g. we might have stored keys and stored hashes in the same array so that reading one brings the other into the cache to be quickly read, and so on.

There are though times when this caching has a negative effect. In particular if two threads are going to deal with bits of memory that are close to each other at around the same time.

This doesn't come up that often, because threads are most often dealing with their own stacks or bits of heap memory pointed to by their own stacks, and only occasionally heap memory that is visible to other threads. That in itself is a big part of why CPU caches are normally a big win for performance.

However, the use of concurrent hashmaps is inherently a case where multiple threads hit neighbouring blocks of memory.

CPU caches work on the basis of "cache lines". These are blocks of code that are loaded into the cache from the RAM, or written from the cache to the RAM as a unit. (Again, while we're about to discuss a case where this is a bad thing, this is an efficient model most of the time).

Now, consider a 64-bit processor with 64-byte cache-lines. Every pointer or reference to an object is going to take up 8 bytes. If a piece of code tries to access such a reference it will mean that 64 bytes are loaded into the cache, then 8 bytes of that dealt with by the CPU. If the CPU writes to that memory, then those 8 bytes are changed on the cache, and the cache written back to the RAM. As said, this is generally good, because the odds are high that we'll also want to do the same with other bits of RAM nearby, and hence in the same cache line.

But what if another thread wants to hit the same block of memory?

If CPU0 goes to read from a value that is in the same cachline that CPU1 has just written to, it will have a stale cacheline that has been invalidated and have to read it again. If CPU0 was trying to write to it it may well not only have to read it again, but redo the operation that gave it the result to write.

Now, if that other thread had wanted to hit the exact same bit of memory, there'd have been a conflict even without caching, so things aren't that much worse than they would have been (but they are worse). But if the other thread was going to hit nearby memory it will still suffer.

This is obviously bad for our concurrent map's slots, but its even worse for its striped locks. We'd said we might have 16 locks. With 64-byte cachelines and 64-bit references that's 2 cachelines for all the locks. The odds a lock is in the same cacheline as that wanted by the other thread is 50%. With 128-byte cachelines (Itanium has those) or 32-bit references (all 32-bit code uses those) it's 100%. With lots of threads its effectively 100% that you're going to be waiting. And waiting again if there's yet another hit. And waiting.

Our attempt to prevent threads waiting on the same lock has turned into them waiting on the same cacheline.

Worse, the more cores you have using the locks, the worse this becomes. Each extra core slows down the total throughput roughly exponentially. 8 cores might take over 200 times as long to execute as 1 core would!

If however we pad out our striped locks with blank space so that there is a 56-byte gap between each one, then this doesn't happen; the locks are all on different cachelines, and operations on neighbouring locks don't affect it any more. This costs memory, and makes normal reading and writing slower (the point of caches is that it makes things faster most of the time after all), but is appropriate in cases where particularly frequent concurrent access is expected, and we're not likely to want to hit the next lock (we aren't, except for resize operations). (Another example would be striped counters; have different threads increment different integers and sum them when you want to get the tally).

This problem of threads hitting neighbouring pieces of memory (called "false-sharing" because it has a performance impact caused by shared access to the same memory even though they are actually accessing neighbouring memory rather than the same memory) will also affect the internal storage of the hashmap itself, but not as much because the map itself is likely larger and so the odds of two accesses hitting the same cacheline is less. It would also be more expensive to use padding here for the same reason; being larger the amount of padding that would involve could be huge.