One of my favorite writers on distributed systems lore is Martin Kleppmann. His lectures on the topic are, in my opinion, some of the best introductory material available.
I’ve also enjoyed reading, and more importantly, re-reading, his article “How to do distributed locking”.
As I am currently learning TLA+, it suddenly hit me that this topic was another great opportunity to write a little spec or two.
So that’s exactly what I did, following the same “failure/fix” approach I’ve taken so far, which also happens to fit well with how the problem, and its solution, are presented in the article itself.
Let’s take a look at the results…
What is a distributed lock?
To make this article a little bit more self-contained, let’s also briefly discuss what a distributed lock is.
Programmers may be familiar with the multi-threading concept of a “mutex”, which allows multiple threads to safely share access to a resource. In that context, “safely” means that only one thread can access the protected shared resource at a time.
Mutexes are often paired with condition variables, so that threads using the shared resource can wait and/or signal on certain state changes happening at the level of the shared resource.
A mutex implementation relies on the fact that threads share the same address space, using lower-level “atomic” operations on that shared address space to implement the mutual exclusion and offer it as a convenient higher-level API.
A distributed lock offers a similar abstraction to designers of distributed systems.
Since the various parts in a distributed system are by definition unable to share the same physical address space, the “distributed lock” has to be implemented differently.
And similar to a programmer using a mutex, designers of distributed systems do not want to invent or implement their own distributed lock each time they need one.
The solution is for a component of the distributed system to offer, by way of an API, the capabilities of a distributed lock to other parts of the system. This component can then offer various guarantees as part of its API, which would constrain its own implementation.
As Martin does a great job of explaining in his article, having a distributed lock offer guarantees similar to those of a mutex in a threaded context is likely to require a component that internally implements distributed consensus.
Examples of such components are Chubby and Zookeeper.
The important thing to realize is that, as a designer of a distributed system, if you think your system needs a distributed lock, you have to:
- Understand the needs of the system you are designing. As Martin explains, there is a difference between requiring a lock for efficiency, and requiring one for correctness.
- Understand the guarantees offered by the distributed locking component you are planning to use. This may require taking a critical look at the implementation of the component you are evaluating. Martin’s ultimate point in his article is that Redis does not offer the guarantees that would enable a broader system to use it as a reliable source of distributed locks for correctness.
- Understand how the lock will be used by the other parts of the system, and what invariant the lock will be used to protect. This is somewhat similar to the programmer of a threaded program needing to understand the concurrent logic that she is trying to implement. “Using a lock” in and of itself will not ensure that your concurrent program behaves as expected.
Point 3 implies that once you have determined that your system needs a distributed lock with a certain level of guarantees, and once you have satisfied yourself that the implementation/component you have chosen does fit those needs, you can then treat the distributed lock component as a kind of black box, and focus on how its use fits within the broader system.
The current article, and the specifications contained therein, are essentially about this third point, since they do not cover the implementation of the component offering the distributed lock, and they make certain assumptions about the needs of the system using it.
A broken use of a distributed lock
The first spec models the system using the distributed lock in a “broken way”, as described by Martin.
Note that “broken” here does not imply in any way the presence of some bug in the actual service offering the locks. Rather, it is the system around it that uses the locks in a way that does not protect the invariant as expected.
The gist of the problem is that the lock, once acquired, can expire, and that the client using the lock can be paused for an indefinite amount of time, which can result in the lock having expired by the time the shared resource is accessed.
This implies the following about the system:
- The existence of a lock service, and of a shared resource.
- The fact that the two are physically separate enough from each other that querying the lock service, whether to check the validity of a lock or to acquire a new one, cannot be done in the same atomic step as using the shared resource that the lock is meant to protect.
The invariant that the system wishes to protect is the following: no client shall access the shared resource without holding a current lock.
Let’s take a look at this “broken” spec in TLA+:
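Here is a minimal sketch of the first half of such a spec: the declarations, the initial state, and the invariant. The module name is mine, and the details are a simplified rendering, but the variable and operator names match the discussion below.

```
--------------------------- MODULE broken_lock ---------------------------
CONSTANT Clients   \* a set of client ids, e.g. {"c1", "c2"}

VARIABLES
    clients,       \* lock service state: client id -> "Acquired" / "NotAcquired"
    lock_checked,  \* client-side state: has this client checked that its lock is valid?
    data_written   \* observer state: <<has written?, held a valid lock while writing?>>

vars == <<clients, lock_checked, data_written>>

Init ==
    /\ clients = [c \in Clients |-> "NotAcquired"]
    /\ lock_checked = [c \in Clients |-> FALSE]
    /\ data_written = [c \in Clients |-> <<FALSE, FALSE>>]

TypeInvariant ==
    /\ clients \in [Clients -> {"Acquired", "NotAcquired"}]
    /\ lock_checked \in [Clients -> BOOLEAN]
    /\ data_written \in [Clients -> BOOLEAN \X BOOLEAN]

\* The invariant the system wants to uphold: a client that has written to the
\* shared resource did so while holding a valid lock.
Coherence ==
    \A c \in Clients : data_written[c][1] => data_written[c][2]
```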
Let’s go over the three variables and their meaning:
- clients is a function from client ids to their “lock status”, which can be either “Acquired” or “NotAcquired”. This state is meant to be held by the lock service.
- lock_checked is a function from client ids to a boolean, indicating that the client has “checked that the lock it holds is still valid”. This state is meant to be held by each client of the lock service.
- data_written is a function from client ids to a sequence of two booleans: the first indicates whether that particular client has written data to the shared resource, and the second indicates whether the client was holding a valid lock while writing. The first boolean represents some state at the level of the shared resource, while the second is something only a hypothetical “global observer” could check, and which we’ll only use to check our invariant. In other words, it does not imply that the shared resource, or a client, has the ability to actually check the state of the lock while writing to the resource.
Besides the usual Init and TypeInvariant, the intro also contains a Coherence operator, which in fact embodies the invariant the system wants to uphold: if a client has written to the shared resource, it should have done so while holding a lock.
The next state operators are the following:
- AcquireLock(id) represents a client acquiring a lock from the lock service. This can only be done while no other client holds a lock.
- ExpireLock represents the lock service invalidating a lock. This can happen at any time, and is meant to model the fact that a client holding a lock can be paused for an indefinite amount of time.
- CheckLock(id) represents a client “checking” its lock with the service, perhaps before trying to write to the shared resource. The client is then able to “note” that it has checked its lock. This represents the fact that no matter how often, or when, a client “checks its lock”, it remains a futile exercise, because it can only be done as a step separate from actually writing to the shared resource.
- WriteData(id) represents a client writing to the shared resource. This can only happen if that client has “checked its lock”, which is state the client itself has access to. However, our hypothetical global observer is able to see whether the client still holds a valid lock at the time of writing, and records that as the second boolean in data_written.
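The second half of the sketch, i.e. the next-state actions and the closing definitions, could then look like this (again a simplified rendering of the behaviors just described, continuing the module above):

```
\* A client acquires a lock, which the service only grants if nobody holds one.
AcquireLock(id) ==
    /\ \A c \in Clients : clients[c] = "NotAcquired"
    /\ clients' = [clients EXCEPT ![id] = "Acquired"]
    /\ UNCHANGED <<lock_checked, data_written>>

\* The lock service invalidates the lock at any time (e.g. the holder is paused).
ExpireLock ==
    /\ \E c \in Clients : clients[c] = "Acquired"
    /\ clients' = [c \in Clients |-> "NotAcquired"]
    /\ UNCHANGED <<lock_checked, data_written>>

\* A client checks with the service that it still holds the lock, and notes it.
CheckLock(id) ==
    /\ clients[id] = "Acquired"
    /\ lock_checked' = [lock_checked EXCEPT ![id] = TRUE]
    /\ UNCHANGED <<clients, data_written>>

\* A client that believes its lock is valid writes to the shared resource.
\* Only the global observer records whether the lock was actually still held.
WriteData(id) ==
    /\ lock_checked[id] = TRUE
    /\ data_written' = [data_written EXCEPT ![id] =
                            <<TRUE, (clients[id] = "Acquired")>>]
    /\ UNCHANGED <<clients, lock_checked>>

Next ==
    \/ \E id \in Clients : AcquireLock(id) \/ CheckLock(id) \/ WriteData(id)
    \/ ExpireLock

Spec == Init /\ [][Next]_vars

THEOREM Spec => []Coherence
===========================================================================
```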
As a brief intermezzo, it is at this point that I learned that the TLC model checker does not check what you write as the THEOREM at the bottom of the spec, hence the Coherence invariant is not actually checked “as is”. However, you can manually add an invariant in the TLC interface, where I dutifully copy/pasted the operator.
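For anyone driving TLC from a configuration file rather than the Toolbox interface, the equivalent is to list the invariants explicitly; a hypothetical broken_lock.cfg could look like this:

```
SPECIFICATION Spec
CONSTANT Clients = {c1, c2}   \* model values standing in for client ids
INVARIANT TypeInvariant
INVARIANT Coherence
```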
The result was as expected: TLC finds the following succession of states:
- The lock is checked.
- The lock is expired.
- The client writes to the shared resource, violating the Coherence invariant.
Fixing the lock
Martin’s proposed solution to this problem is for the lock service to give each client acquiring a lock a so-called “fencing token”: a monotonically increasing counter that the client submits to the shared resource, and that the shared resource uses to reject any request carrying a counter lower than the last one it has seen.
Does this reduce the distributed lock to a counter? In some ways, it does, although it may still be useful for the lock service to guarantee to clients that the lock will be held for a certain amount of time. That way, a client knows that once it has acquired a lock, it should remain available for “a while”, and that only in some worst case can the lock unexpectedly expire, as evidenced by the shared resource rejecting a request based on the fencing token.
Without this “lease” type of guarantee, clients could lose their locks at any time, making them more challenging to use (especially under load), and perhaps even resulting in a kind of livelock if a client can never hold the lock for long enough to do what it wants to do.
However, as in the first spec, we will ignore the timing component here, and simply focus on a logical invariant, which in this case will be: no client should write to the shared resource unless:
- It holds the lock, or
- It is using a token that is higher than the last writer’s.
While the first part can only be checked by our hypothetical global observer, the second part can be checked as part of the WriteData(id) operator, since that operator models a client request being processed by the shared resource, so both the incoming token and the last token used are available to be checked within a single step.
This is what the spec looks like:
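Again, what follows is a minimal sketch along those lines rather than a definitive version; the module name and the names tokens, next_token and last_written are mine, chosen to illustrate the fencing mechanism:

```
---------------------------- MODULE fixed_lock ----------------------------
EXTENDS Naturals

CONSTANT Clients   \* a set of client ids, e.g. {"c1", "c2"}

VARIABLES
    clients,       \* lock service state: client id -> "Acquired" / "NotAcquired"
    tokens,        \* client id -> fencing token received with the lock (0 = none yet)
    next_token,    \* the next fencing token the lock service will hand out
    last_written,  \* the highest token the shared resource has accepted so far
    data_written   \* observer state: <<has written?, did so legitimately?>>

vars == <<clients, tokens, next_token, last_written, data_written>>

Init ==
    /\ clients = [c \in Clients |-> "NotAcquired"]
    /\ tokens = [c \in Clients |-> 0]
    /\ next_token = 1
    /\ last_written = 0
    /\ data_written = [c \in Clients |-> <<FALSE, FALSE>>]

TypeInvariant ==
    /\ clients \in [Clients -> {"Acquired", "NotAcquired"}]
    /\ tokens \in [Clients -> Nat]
    /\ next_token \in Nat
    /\ last_written \in Nat
    /\ data_written \in [Clients -> BOOLEAN \X BOOLEAN]

\* A client that wrote either held the lock, or used a token higher than the
\* last writer's.
Coherence ==
    \A c \in Clients : data_written[c][1] => data_written[c][2]

\* Acquiring the lock now also hands out the next fencing token.
AcquireLock(id) ==
    /\ \A c \in Clients : clients[c] = "NotAcquired"
    /\ clients' = [clients EXCEPT ![id] = "Acquired"]
    /\ tokens' = [tokens EXCEPT ![id] = next_token]
    /\ next_token' = next_token + 1
    /\ UNCHANGED <<last_written, data_written>>

\* The lock service can still expire the lock at any time.
ExpireLock ==
    /\ \E c \in Clients : clients[c] = "Acquired"
    /\ clients' = [c \in Clients |-> "NotAcquired"]
    /\ UNCHANGED <<tokens, next_token, last_written, data_written>>

\* The shared resource only accepts a write whose token is higher than the last
\* one it has seen, and records that token; the global observer notes whether
\* the write satisfied the invariant.
WriteData(id) ==
    /\ tokens[id] > last_written
    /\ last_written' = tokens[id]
    /\ data_written' = [data_written EXCEPT ![id] =
            <<TRUE, (clients[id] = "Acquired" \/ tokens[id] > last_written)>>]
    /\ UNCHANGED <<clients, tokens, next_token>>

Next ==
    \/ \E id \in Clients : AcquireLock(id) \/ WriteData(id)
    \/ ExpireLock

Spec == Init /\ [][Next]_vars

THEOREM Spec => []Coherence
============================================================================
```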
It should speak for itself.
Note only that the nonsensical CheckLock has been removed, and that the spec may actually contain a problem, because the tokens go up without limit.
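One pragmatic way to deal with that when model checking is a state constraint that bounds the counter. Assuming the next_token variable from the sketch above, a hypothetical constraint could be:

```
\* Arbitrary bound, only there to keep TLC's state space finite.
TokenConstraint == next_token <= 5
```

You would then declare it as a state constraint in the TLC model (CONSTRAINT TokenConstraint in a configuration file).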
And what about modelling the Redlock algorithm, using Martin’s article as well as Salvatore’s response to it, and perhaps settling the discussion once and for all?
I’ll leave that as an exercise to the reader…