Flexible Paxos: Quorum intersection revisited
arxiv.org
TL;DR summary (to my understanding; no sane human can ever claim to be able to summarize Paxos):
The claim is that, once a leader is elected (i.e. Q1), it is no longer necessary to attain a majority quorum for actually accepting writes (i.e. Q2). A minority quorum can accept writes, provided the minority contains at least one node that participated in the leader election. By increasing the leader election quorum to more than N/2+1 nodes (so that enough nodes have participated in the Q1 election), the cluster can then operate much faster because writes require only a minority quorum. The drawback is that it no longer tolerates N/2-1 failures, as N/2-1 failures leave too few electors to choose a new leader in Q1.
NB. the Paxos terminology uses terms like 'decide a value', but practically in clusters this is equivalent to 'accept writes' so I used that instead for easier comprehension.
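To make the quorum arithmetic above concrete, here is a minimal Python sketch. It assumes the simplest setup, where a quorum is any set of nodes of a fixed size; the function names `quorums_intersect` and `tolerated_failures` are made up for illustration and come from neither the paper nor any library. It brute-forces the intersection requirement and compares it against the size rule q1 + q2 > N.

```python
from itertools import combinations


def quorums_intersect(n: int, q1: int, q2: int) -> bool:
    """Brute force: does every q1-sized node set share a node with every q2-sized set?"""
    nodes = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, q1)
        for b in combinations(nodes, q2)
    )


def tolerated_failures(n: int, q1: int, q2: int) -> dict:
    """How many nodes may be down while each phase can still gather its quorum."""
    return {"leader_election": n - q1, "accepting_writes": n - q2}


if __name__ == "__main__":
    n = 10
    # (6, 6) is classic majority Paxos; (8, 3) trades election availability
    # for cheaper writes; (7, 3) is too small and breaks the intersection.
    for q1, q2 in [(6, 6), (8, 3), (7, 3)]:
        print(
            f"N={n} Q1={q1} Q2={q2}",
            "intersect:", quorums_intersect(n, q1, q2),
            "size rule q1+q2>N:", q1 + q2 > n,
            "tolerates:", tolerated_failures(n, q1, q2),
        )
```

With N=10, Q1=8 and Q2=3, writes keep going with up to 7 nodes down, but electing a new leader needs 8 live nodes, which is the trade-off described in the summary.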
Haven't read it carefully, so I might have missed the reference, but Barbara Liskov has described a very similar optimization for Viewstamped Replication[1].
If your summary is accurate (haven't read the paper yet), I don't think this works, because if you don't have a proper quorum, you can't know that the leader is still valid at the time of an event. A new leader might've been elected in the meantime. The only way to know is to "check in" with all the other nodes.
I recently proposed this idea (informally) and had to retract it: https://bentrask.com/?q=hash://sha256/b40971e7b30324fdda15ce...
Disclaimer: totally not an expert.
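For what it's worth, here is a toy Python sketch of how Paxos-style protocols usually deal with a stale leader: acceptors remember the highest ballot they have promised and reject writes carrying a lower one. The `Acceptor` class and `try_write` helper are hypothetical and heavily simplified (no networking, and a new leader does not adopt previously accepted values, which real Paxos requires). The point is only that if every election quorum intersects every write quorum, an old leader cannot complete a write after a new one has been elected.

```python
from itertools import combinations


class Acceptor:
    """Toy acceptor: tracks the highest ballot it has promised."""

    def __init__(self):
        self.promised = 0      # highest ballot promised so far
        self.accepted = None   # last (ballot, value) accepted, if any

    def prepare(self, ballot: int) -> bool:
        # Phase 1 (leader election): promise only to strictly higher ballots.
        if ballot > self.promised:
            self.promised = ballot
            return True
        return False

    def accept(self, ballot: int, value: str) -> bool:
        # Phase 2 (write): reject anything below the promised ballot.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def try_write(acceptors, quorum_ids, ballot, value) -> bool:
    """A write succeeds only if every acceptor in the chosen quorum accepts it."""
    results = [acceptors[i].accept(ballot, value) for i in quorum_ids]
    return all(results)


# Assumed setup: N=5, election quorum of 4, write quorum of 2 (4 + 2 > 5).
acceptors = [Acceptor() for _ in range(5)]

# Old leader wins ballot 1 with nodes 0..3.
old_elected = all([acceptors[i].prepare(1) for i in (0, 1, 2, 3)])

# Later, a new leader wins ballot 2 with nodes 1..4.
new_elected = all([acceptors[i].prepare(2) for i in (1, 2, 3, 4)])

# The old leader now tries every possible 2-node write quorum with ballot 1.
stale_results = [
    try_write(acceptors, pair, 1, "stale write")
    for pair in combinations(range(5), 2)
]

# The new leader writes with ballot 2 to a 2-node quorum.
fresh_result = try_write(acceptors, (0, 1), 2, "fresh write")

print("any stale write succeeded:", any(stale_results))
print("fresh write succeeded:", fresh_result)
```

Every 2-node write quorum contains at least one node that promised ballot 2, so the old leader is rejected without having to check in with all the other nodes.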
The nice thing about the paper is that it is transparent. It includes a TLA+ specification for the claims that they make.
In turn, you have a concrete, model-checked specification you can talk about, or use as a basis for understanding where their idea or yours holds or fails.
Note that a model can specify the wrong correctness criterion in practice. So you may have a "proof" which works, yet the proposition proven is wrong.
In general, when working on distributed systems, we usually want some kind of formal criterion of correctness. The failure modes of such systems are quite hard to get right, and hence retractions of claims are plentiful. Sadly, we have too little literature on how to onboard people onto model checkers such as TLA+, proof assistants such as Coq, Isabelle/HOL or NuPRL, and QuickCheck systems such as Haskell or Erlang QuickCheck, Clojure's core.spec (I believe), etc.
Book recommendation: Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers [0]. I did read it, but unfortunately I can't say I understood it well enough to be able to use it day-to-day.
[0] https://www.amazon.ca/Specifying-Systems-Language-Hardware-E...
It is appropriate that Lamport's tool is being used to prove the correctness of an improvement to Lamport's algorithm.
Lamport can probably claim to summarize it as well as anyone: http://research.microsoft.com/en-us/um/people/lamport/pubs/l...
How much speed improvement would such an algorithm give in practical circumstances (for instance assuming that network outages are rare)?
Good question. Results such as these can be of a form where they currently don't provide much benefit, yet they move the knowledge of Paxos-like consensus forward. That is, you explore where the system has flexibility in its very nature. In turn, this knowledge can lead to faster systems down the track, sometimes by employing a vastly different approach because of that knowledge. Good research can be indirect.
Related discussion of the 'A More Flexible Paxos' blog quoted in the paper: https://news.ycombinator.com/item?id=12292590
A friendlier story here: http://hh360.user.srcf.net/blog/2016/08/majority-agreement-i...
There's an easy to understand visualisation of distributed consensus (specifically the Raft algorithm) here: http://thesecretlivesofdata.com/raft/
Not a paper, but I liked this related article: https://news.ycombinator.com/item?id=11813180