How to do distributed locking (2016)
(martin.kleppmann.com)
Please note that I replied to this blog post when it was published. You may be interested in reading the counter-arguments. The author of the original post acknowledged that at least one of his claims was wrong, but I don't know if they modified the post later. Anyway, we have gained some time perspective too: Redlock has been used massively over the last 10 years, and I have never heard of its properties failing in practice: that is, no incidents so far.
Discussed earlier: https://news.ycombinator.com/item?id=11065933
Someone in that thread mentioned "The Saddest Moment" in response to the back-and-forth:
https://www.usenix.org/system/files/login-logout_1305_micken...
(hope all is well!)
> Redlock has been used massively over the last 10 years, and I have never heard of its properties failing in practice: that is, no incidents so far.
I enjoyed these posts between yourself and Martin. Don’t be too confident though. Data races almost always go unnoticed.
Thank you. I agree about races going unnoticed. And indeed my post tried to address why the failure modes outlined in the original post are not possible (except for the monotonic timer issue, but this was fixed in Redis at some point).
I did a simple Paxos implementation a while ago and turned it into educational code: well documented, and accompanied by a talk and a workshop. May it be useful for anyone wanting to learn about distributed locking from first principles.
Code: https://github.com/danrl/skinny
Talks: https://youtu.be/nyNCSM4vGF4 and https://youtu.be/gAGGPaFDfwE
This is very interesting, thank you for this.
I am trying to implement efficient mutual exclusion in Java, without blocking and without any mutexes beyond the communication mutexes I use for thread safety. (I think message passing still needs some form of thread safety to transfer data, such as a mutex or a lock-free algorithm, since you cannot have concurrent reads and writes to the same variable without problems.)
If the solution is proven safe in the small, across threads, it should be safe across machines and in distributed systems. I might be wrong on this. But scheduling is a different solution from locks: instead of letting everything run at its own behest without external control, we run tasks, wait for them to finish, and never schedule two incompatible (mutually exclusive) things at the same time. That way we avoid the race conditions that arise when two things try to run simultaneously.
I effectively want epoll for business domain objects and infrastructure state changes. This is the space occupied by HashiCorp Consul, Redis locks, and Chubby, Google's lock service.
Imagine being able to register for arbitrary changes to business objects and chain together complicated behaviours through efficient epoll-style reactive rules, with fork/join, mutual exclusion, and retries all handled efficiently.
I've gathered that I can use scheduling, similar to an operating system scheduler, to decide when to run mutually exclusive operations.
I am inspired by the epoll API, where you register the changes you're interested in and react to them.
I can protect everything with a single communication lock and then implement the scheduling myself: the scheduler detects when a task has finished all its dependencies and then schedules the next thing to be executed.
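Here's a rough sketch in Java of what I mean (illustrative only, not production code): a single lock guards all scheduler state, and two tasks whose declared resource sets overlap are never running at the same time.

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch: a scheduler that never starts two tasks whose resource
    // sets overlap. One lock guards the scheduler state; the tasks
    // themselves run without holding it.
    final class ExclusiveScheduler {
        private final Object lock = new Object();          // single communication lock
        private final Set<String> busy = new HashSet<>();  // resources in use right now
        private final Deque<Task> waiting = new ArrayDeque<>();
        private final ExecutorService pool = Executors.newCachedThreadPool();

        record Task(Set<String> resources, Runnable body) {}

        void submit(Set<String> resources, Runnable body) {
            synchronized (lock) {
                waiting.add(new Task(resources, body));
                dispatch();
            }
        }

        // Caller must hold `lock`. Starts every waiting task whose resources
        // are all free; incompatible tasks stay queued until a running task
        // finishes and frees their resources.
        private void dispatch() {
            for (Iterator<Task> it = waiting.iterator(); it.hasNext(); ) {
                Task t = it.next();
                if (Collections.disjoint(busy, t.resources())) {
                    it.remove();
                    busy.addAll(t.resources());
                    pool.submit(() -> {
                        try {
                            t.body().run();
                        } finally {
                            synchronized (lock) {  // free resources, wake the queue
                                busy.removeAll(t.resources());
                                dispatch();
                            }
                        }
                    });
                }
            }
        }
    }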
Any thoughts or ideas on distributed and multithreaded scheduling would be greatly appreciated.
FWIW there was a reply to that post here: http://antirez.com/news/101
I also wrote about distributed locking a while ago, particularly about implementing one on Google Cloud Storage. https://www.joyfulbikeshedding.com/blog/2021-05-19-robust-di...
This is useful because cloud storage is very cheap and serverless. It certainly beats running a Redis or PostgreSQL instance.
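For a flavour of the core primitive such a lock can build on (just the primitive, not the full algorithm from the post, which also handles leases, refresh, and takeover), here is a minimal sketch using the google-cloud-storage Java client. Creating the lock object with a does-not-exist precondition is atomic, so at most one client can succeed:

    import com.google.cloud.storage.*;

    // Sketch only: atomically create the lock object. The does-not-exist
    // precondition (ifGenerationMatch=0) means at most one client's create
    // can succeed; everyone else gets HTTP 412 Precondition Failed.
    public class GcsLockSketch {
        public static boolean tryAcquire(Storage storage, String bucket, String name) {
            try {
                storage.create(
                    BlobInfo.newBuilder(BlobId.of(bucket, name)).build(),
                    new byte[0],
                    Storage.BlobTargetOption.doesNotExist());
                return true;                           // we created it: lock held
            } catch (StorageException e) {
                if (e.getCode() == 412) return false;  // someone else holds it
                throw e;
            }
        }
    }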
In my research and implementation I took care of the problems described in this article (as well as the problems I encountered in other distributed lock implementations).
Clients freezing, outright crashing, or disappearing is definitely a major problem, so timeouts are a must. But a timeout alone isn't enough, because you can't tell whether a client is merely delayed or has disappeared for good. Furthermore, some workloads have an unknown completion time, so you can't pick any sensible timeout.
I address this by encouraging clients to behave as follows:
1. Make the timeout relatively short, but regularly refresh your lease to let others know that you're still alive.
2. Your lease may at any time be taken over by another client that thinks you've disappeared. So you must regularly check whether your lease on the lock is still valid. This checking must be done way, way before your timeout expires, in order to account for the possibility that you can be arbitrarily delayed at any moment (GC pauses, scheduling pauses, network delays, etc). I recommend checking at an interval of at most 1/8 of the timeout. Smaller intervals are better, at the expense of more overhead.
3. Attempt to split up your work into a long-running, discardable preparation phase and a short-running commit phase. Verify the validity of your lock lease right before committing (in addition to periodic verification).
This is of course still not 100% foolproof. There is still a risk that two clients run at the same time, but this risk can be arbitrarily reduced by shrinking the verification interval and by optimizing for point 3. If your workload cannot tolerate any risk of two clients running at the same time, no matter how small, then pick an infinite timeout. But I think for most workloads, the benefits of not having to administer the lock (not having to manually break its lease because a client has crashed) outweigh the risk.
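To make points 1-3 concrete, here is a rough sketch in Java. The Lease interface is hypothetical, standing in for whatever store backs the lock: refresh() extends the timeout, isStillMine() checks that no other client has taken over.

    import java.time.Duration;
    import java.util.concurrent.atomic.AtomicBoolean;

    interface Lease {
        void refresh();          // point 1: keep the lease alive
        boolean isStillMine();   // point 2: detect takeover
    }

    final class LeaseWorker {
        static void runUnderLease(Lease lease, Duration timeout,
                                  Runnable prepare, Runnable commit) {
            long checkMillis = timeout.toMillis() / 8;    // verify well before expiry
            AtomicBoolean lost = new AtomicBoolean(false);
            Thread worker = Thread.currentThread();

            // Background heartbeat: refresh and verify the lease at 1/8 of the
            // timeout, and interrupt the worker if the lease was taken over.
            // (Interruption is advisory; `prepare` must be written to honor it.)
            Thread heartbeat = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    lease.refresh();
                    if (!lease.isStillMine()) {
                        lost.set(true);
                        worker.interrupt();
                        return;
                    }
                    try { Thread.sleep(checkMillis); }
                    catch (InterruptedException e) { return; }
                }
            });
            heartbeat.start();
            try {
                prepare.run();                            // point 3: long, discardable phase
                if (lost.get() || !lease.isStillMine())   // final check right before committing
                    throw new IllegalStateException("lease lost; discarding prepared work");
                commit.run();                             // point 3: short commit phase
            } finally {
                heartbeat.interrupt();
            }
        }
    }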
My algorithm has a Ruby implementation, as well as a third-party Go implementation.
Ruby: https://github.com/FooBarWidget/distributed-lock-google-clou...
Go: https://github.com/FooBarWidget/distributed-lock-google-clou...
It’s just not possible to implement a correct system using distributed locks with these types of semantics.
Every time this topic comes up, it seems people invariably divide into one of two groups. In the first group are those who trust in the odds and play the probability game. In the other, those who demand absolute guarantees. Neither group seems fully capable of understanding the other's standpoint.
Let me add my own perspective: Distributed locks are a fallacy. They can be beneficial in decreasing contention, under the assumption that "most of the time, only one actor will be active". However, by themselves, they offer no solid guarantees. The blog post addresses this point by introducing the concept of fencing tokens. These tokens have the potential to provide concrete guarantees, but they require the cooperation of downstream systems for enforcement, which isn't always possible.
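As a sketch of what fencing-token enforcement might look like on the downstream side (illustrative Java; the names are made up, this is not code from the post):

    // The lock service hands out a token that increases with every grant; the
    // resource rejects any request whose token is lower than one it has already
    // seen, so a client that resumes after a long pause cannot clobber newer writes.
    final class FencedResource {
        private long highestSeen = Long.MIN_VALUE;

        synchronized boolean write(long fencingToken, byte[] data) {
            if (fencingToken < highestSeen) {
                return false;            // stale token: a newer lock holder exists
            }
            highestSeen = fencingToken;
            store(data);                 // check and write happen atomically here
            return true;
        }

        private void store(byte[] data) { /* durable write goes here */ }
    }

The essential property is that the token comparison and the write are applied atomically; if the downstream system can't do that, fencing tokens don't save you.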
I was really surprised to see Antirez argue for the probability approach.
I don't know why you say neither group understands the other when I literally said in my comment that my approach is probabilistic.
Also, how is configuring an infinite timeout (and checking whether the client is actually gone before manually breaking the lock lease) not an absolute guarantee?
I guess it’s an absolute guarantee, but you lose liveness unless there is some other party handling those cases where the client is truly gone.
An infinite timeout works in theory, but it is impractical at scale.
Okay, but then I really don't know what point you're trying to make. Either an infinite timeout or a non-absolute mutual exclusion guarantee: we have to pick one. The right choice depends on the workload. There is no perfect solution that lets you have your cake and eat it too.
Redlock doesn't use anything probabilistic, not in the sense that the lock may randomly fail to guarantee what it promises.
While that may be true, it’s generally not possible to use Redlock to build a correct system if mutual exclusion is a strict requirement.
What would you define as a correct system?
Elaborate?