Application-Level Consensus [pdf] (weareadaptive.com)
Jane Street uses the same approach to build their exchange [0]. As the doc says, replaying a sequence of messages in dev can be great for reproducing issues, and the replicated log also gives the system fault tolerance.
One downside is that, if all your nodes run the same application code, simply replaying the log might not help: every node will hit exactly the same bug given the same sequence of transitions.
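A minimal sketch of that failure mode (all names are hypothetical, not from the article): a deterministic state machine replayed from the same log fails identically on every replica, so replication alone doesn't save you from the bug.

```python
class MatchingEngine:
    """Toy deterministic state machine: same log in, same state (or crash) out."""

    def __init__(self):
        self.balance = 0

    def apply(self, command):
        # Deterministic transition: identical input always yields identical state.
        if command == "halve":
            # Bug: divides by zero whenever balance happens to be 0.
            self.balance = 100 // self.balance
        else:
            self.balance += int(command)


# The same replicated log is delivered to every node.
log = ["5", "-5", "halve"]

for node in ("node-a", "node-b", "node-c"):
    engine = MatchingEngine()
    try:
        for cmd in log:
            engine.apply(cmd)
        print(f"{node}: ok, balance={engine.balance}")
    except ZeroDivisionError:
        # Every replica fails on the same command, at the same point in the log.
        print(f"{node}: crashed on the same command")
```

Because the transitions are deterministic, the crash is perfectly reproducible in dev, which is exactly the upside the parent comment mentions, but it also means a hot standby replaying the same log offers no protection against this class of bug.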
[0] There's an overview of their infrastructure here: https://youtube.com/watch?v=b1e4t2k2KJY
Thanks for sharing the video, great talk btw. Brian, the speaker, actually asks the audience (around minute 20 in the video) whether anybody uses Paxos for the matching engine. What I'm talking about in the article is exactly that: we're just using another consensus algorithm (Raft), which is significantly simpler to implement than Paxos.
LMAX uses synchronous replication in their exchange: https://www.infoq.com/presentations/LMAX
What kind of latency does the consensus add? We are looking at adding fault tolerance to our matching engine but can only afford 10-15 micros.
Related to the latency question: I just watched the Jane Street video (very nice!), and he mentioned that they use operator-initiated failover and that he didn't know of anyone using a consensus-based approach, because it adds an extra hop. Does your Raft-based solution do automatic failover?
This is very interesting. I have no doubt that not having to deal with fault tolerance at the application level compensates for the effort of putting this architecture in place. And yes, in my opinion, "application-level consensus" is the perfect term for this architecture.
I agree. One place where application-level consensus is fairly common is in Elixir applications, mostly thanks to the CRDT implementation that's nicely wrapped up by Phoenix.Tracker in the phoenix_pubsub library.
This is used by the Phoenix project's Presence module to provide a distributed notion of which users are 'present', but it's also used by others to do service location using hash rings, implement a DHT, etc. I've used it for master election and failover on a few projects for small services.
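For the hash-ring style of service location mentioned above, here's a minimal consistent-hash ring sketch in Python (illustrative only; this is not the Phoenix/phoenix_pubsub API, and the node names are made up). Each node claims several virtual points on the ring, and a key is owned by the first node found walking clockwise from the key's hash.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: maps keys to nodes with minimal reshuffling."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Any stable hash works; md5 is used here purely for illustration.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1234")  # the node responsible for this key
```

The appeal for service location is that when a node joins or leaves, only the keys adjacent to its ring points move; the rest keep their owners, which plays well with the eventually consistent membership a CRDT-based tracker gives you.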
Hi there!! It looks like there were some minor mistakes in the text, and the document has been updated: http://weareadaptive.com/wp-content/uploads/2017/04/Applicat...
Try 'Edge-Coherence'.