Distributed Snapshots: Chandy-Lamport Protocol
blog.fponzi.me
I found this [0] a very accessible explanation as well.
[0] https://blog.acolyer.org/2015/04/22/distributed-snapshots-de...
Unusually well-written article for distributed systems work involving TLA+. Thanks. I liked it and learned something. Bookmarked.
Noob question: do Raft and Paxos solve a different problem?
Those are about distributed consensus, making sure participants come to the same conclusion about something and nobody has the wrong answer.
Distributed snapshots are trying to do as little work as possible to get a consistent view of the distributed computation, without forcing the heavy cost of consensus on it. For example, if node A is sending a message to node B, we don't care whether we capture:
- 1: A before it sends the message, B before it receives the message
- 2: A after it has sent the message, the message itself in flight on the channel, and B before it receives the message
- 3: A after it has sent the message, B after it has received the message
No matter which of those states we restore, the computation will continue correctly.
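To make the three cases concrete, here is a rough Python sketch. It is not the Chandy-Lamport protocol itself (no markers, no channel recording), just a toy check of which combinations of recorded states form a consistent cut. The scenario is my own assumption: A and B each start with a balance of 100 and A sends 10 to B. The names `snapshot`, `consistent_cuts`, and `TRANSFER` are made up for illustration.

```python
from itertools import product

# Hypothetical scenario: A and B each start with 100, and A sends 10 to B.
TRANSFER = 10
A_STATES = {"before_send": 100, "after_send": 100 - TRANSFER}
B_STATES = {"before_recv": 100, "after_recv": 100 + TRANSFER}


def snapshot(a_state, b_state):
    """Return (a_balance, in_flight, b_balance), or None if the cut is inconsistent."""
    sent = a_state == "after_send"
    received = b_state == "after_recv"
    if received and not sent:
        # B has recorded a receive that A never recorded sending.
        return None
    # The message belongs to the channel state only while sent but not yet received.
    in_flight = TRANSFER if sent and not received else 0
    return A_STATES[a_state], in_flight, B_STATES[b_state]


def consistent_cuts():
    """Enumerate every (A state, B state) pair and keep the consistent ones."""
    cuts = {}
    for a_state, b_state in product(A_STATES, B_STATES):
        snap = snapshot(a_state, b_state)
        if snap is not None:
            cuts[(a_state, b_state)] = snap
    return cuts


if __name__ == "__main__":
    for cut, snap in consistent_cuts().items():
        print(cut, snap, "total =", sum(snap))
```

Three of the four combinations survive, matching cases 1 to 3, and in each of them the total (balances plus in-flight message) is 200. The rejected combination is A before the send with B after the receive, which would restore to a state where the 10 appears out of nowhere.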