Scalable but Wasteful, or why fast replication protocols are slow (charap.co)
The hypothesis in the article could be correct (that industry is not adopting new academic innovations because they fail in the real world). Based on my experience in this industry, though, it could just be that there isn't a super strong connection between academia and the people implementing these kinds of systems. I've had many conversations with my academically minded friends where they're astonished that we haven't jumped on some latest innovation, and I have to let them down by saying that the problem that paper was addressing is super far down our list of fires to put out. Maybe there are places where teams of top-tier engineers are free to spend 6 months every year rewriting critical core systems to use un-battle-tested new algorithms that might have 20% performance improvements, but most places I've worked would achieve the same result for far less money by spending 20% more on hardware.
Hardware vs software is a big one. You have to hit a certain scale ("web scale") before improving software is cheaper than buying more hardware. Sometimes there is low-hanging fruit, but I don't think "building a new distributed consensus system" is usually an example of it.
It seems these things usually get implemented when Google, Microsoft, Twitter, Facebook (big web company X) hit the problem. They each develop their own optimized solution before slowly open sourcing it, and the products converge and mature to the point where it's practical for other companies to adopt. (I think container orchestration is a good example: Google, Facebook, Apple(?) each had container solutions long before K8s started exploding.)
It leaves a bad taste in my mouth when an argument in that area starts using fake math to drive a point home. Top-tier engineers spend 6 months to get a 20% improvement that could be achieved by spending 20% more on hardware?
Where IS all this coming from? Why is it acceptable to pull a few random numbers out of our ass and conclude that it's always best to buy more hardware?
I know Google effectively invented HTTP/2 and HTTP/3 (as SPDY and QUIC) to shave a few percent off their hardware costs. And they had to not only implement this in Chrome and Google.com but make it a worldwide standard. So surely they must be the idiots putting top-tier engineers on such optimization efforts for 6 months (or more), instead of spending more on hardware? I don't know.
20% more hardware for Google isn't the same as for your local IT biz rocking a small data center. If Google can save 1% on hardware costs, they can feed an army of devs for 6 months.
Google is an outlier among outliers.
I do not buy the economic argument. Academia is aiming at individual excellence, whereas industry, at large, is about employee replaceability. This may be a valid economic argument in itself, but I believe there is another appeal to technical mediocrity: the reassurance it provides to management.
When I was part of academia I was under the impression we were aiming for publishing papers, which is why I left.
I've been to SIGMOD and I've seen plenty of database industry people there, so I don't think the knowledge isn't being transferred. I guess it is more about business priorities, or choosing the known unknown over the unknown unknown when it comes to support.
Yes, for that kind of improvement I think academia would be better served by trying to see whether the techniques can be incorporated into an existing, well-tested implementation.
Honestly I think the answer is simpler: people don't need better algorithms. Paxos and Raft are generally used to build service discovery and node coordination; these are not demanding workloads and are overwhelmingly read-heavy. Even the largest deployments can probably be serviced by a set of modestly sized VMs. Paxos and Raft are well-understood algorithms with a choice of battle-tested implementations, so why would anyone choose differently?
The whole section on "bin-packing Paxos/Raft is more efficient" is strange, because people don't generally bin-pack Paxos/Raft: the bin-packing orchestrators are built on top of Paxos/Raft!
People would choose differently because operating Raft/Paxos clusters at scale isn't fun. Also, scaling reads is the hard part! At this point the trade-offs are pretty well known: you either send all reads to the leader and limit the leader's overall throughput, or you pay the cost of reading from a majority.
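To make that read-path trade-off concrete, here's a toy Python sketch (hypothetical Replica objects and helpers, not any particular library's API):

    from dataclasses import dataclass

    @dataclass
    class Replica:
        # Toy replica: just the last applied log index and a key/value map.
        applied_index: int
        data: dict

    def read_from_leader(leader: Replica, key: str):
        # Simple, and linearizable if the leader holds a lease, but every read
        # burns leader CPU, so the leader caps total read throughput.
        return leader.data.get(key)

    def read_from_quorum(replicas: list, key: str):
        # Ask any majority and keep the answer from the most up-to-date responder.
        # This spreads load, but each read costs multiple RPCs, and real systems
        # need extra care (read-index, leases) to keep it linearizable.
        quorum = len(replicas) // 2 + 1
        responses = [(r.applied_index, r.data.get(key)) for r in replicas[:quorum]]
        return max(responses)[1]

    replicas = [Replica(10, {"x": "old"}), Replica(12, {"x": "new"}), Replica(11, {"x": "old"})]
    print(read_from_leader(replicas[1], "x"))  # -> new
    print(read_from_quorum(replicas, "x"))     # -> new

Either way you pay: one node's CPU becomes the ceiling, or every read fans out to a majority.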
Your opinions are very specific to using consensus in orchestrators and control planes, but the overwhelming majority of writes and reads to a Paxos or Raft cluster happen in much higher-throughput, latency-sensitive systems such as databases.
Indeed, we don't need better consensus algorithms. We need to closely examine our problems and, wherever possible, step down to weaker consistency models and protocols that aren't worse to operate than Raft/Paxos or harder for developers to program against.
Another thing which makes the Raft/Paxos vs new-consensus-algorithm comparisons complicated is caching.
If your Raft state machines are doing IO via some write-through cache (which they often are), then having specific machines do specific jobs can increase cache quality. I.e., your leader node can have a better cache for your write workload, while your follower nodes can have better caches for your read workload.
This may lead to higher throughput (yay) but then also leave you vulnerable to significant slow-downs after leader elections (boo).
What makes sense will depend on your use case, but I personally agree with the author that multiple simple Raft/Paxos groups, scheduled across nodes by some workload-aware component, might be the best of both worlds.
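A toy illustration of the caching point above (made-up class, not a real Raft implementation): if each node fronts its state machine with a small write-through LRU, the leader's cache naturally fills with the write working set, and a follower promoted by an election starts cold for exactly those keys:

    from collections import OrderedDict

    class WriteThroughCache:
        # Tiny LRU write-through cache in front of a dict-backed state machine.
        def __init__(self, backing, capacity=4):
            self.backing, self.capacity = backing, capacity
            self.lru = OrderedDict()
            self.hits = self.misses = 0

        def _touch(self, key, value):
            self.lru[key] = value
            self.lru.move_to_end(key)
            while len(self.lru) > self.capacity:
                self.lru.popitem(last=False)

        def write(self, key, value):
            self.backing[key] = value   # write through to the state machine
            self._touch(key, value)     # the writer's cache fills with write keys

        def read(self, key):
            if key in self.lru:
                self.hits += 1
                self.lru.move_to_end(key)
                return self.lru[key]
            self.misses += 1            # a freshly promoted leader starts here
            value = self.backing.get(key)
            self._touch(key, value)
            return value

    # The old leader applied every write, so re-reading a hot write key is cheap...
    leader = WriteThroughCache(backing={})
    for i in range(4):
        leader.write("k%d" % i, i)
    leader.read("k3")
    print("old leader:", leader.hits, "hits,", leader.misses, "misses")    # 1 hit, 0 misses

    # ...while a follower promoted after an election sees the same read miss.
    promoted = WriteThroughCache(backing=leader.backing)
    promoted.read("k3")
    print("new leader:", promoted.hits, "hits,", promoted.misses, "misses")  # 0 hits, 1 miss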
None of this actually matters. Consensus algorithms allow you to achieve consensus. Period. There's no requirement whatsoever on what you're getting consensus on. A consensus value could be _one_ database update, but it doesn't need to be. It can also consist of 666 database transactions across 42 different namespaces.
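Put differently, the unit of agreement is just whatever you choose to put in a log entry. A sketch with made-up field names:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Command:
        namespace: str
        op: str                      # e.g. "put" or "delete"
        key: str
        value: Optional[str] = None

    @dataclass
    class LogEntry:
        # One consensus "value": replicas agree on the batch as a whole,
        # then every state machine applies its commands in order.
        index: int
        term: int
        commands: List[Command] = field(default_factory=list)

    # A single slot in the replicated log can carry many transactions
    # across different namespaces.
    entry = LogEntry(index=42, term=7, commands=[
        Command("users", "put", "alice", "alice@example.com"),
        Command("orders", "delete", "order-123"),
    ])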
> The protocol presents a leader-less solution, where any node can become an opportunistic coordinator for an operation.
Does leader = master here? My first reaction is that this is a multi-master system but I can't quite unpack "opportunistic coordinator".
For MultiPaxos the leader is sticky to avoid additional round trips for each transaction. Basically, until a configurable timeout, all transactions are attempted at the last node to have led an accepted transaction, and only if that times out or fails is a new leader proposed.
Again, leader = master here right? As in the node that can both read & write.
So this is different from multi-master in that one node is preferred until it's not, as opposed to any node being able to accept writes at any time.
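Roughly, the client/proposer side of that "preferred until it's not" behaviour can be sketched like this (hypothetical helper names; real implementations differ in detail):

    import time

    class StickyLeaderClient:
        # Send commands to the last known leader until it fails or goes quiet
        # for too long, then fall back to proposing through any reachable node.
        def __init__(self, nodes, lease_timeout=2.0):
            self.nodes = nodes              # e.g. ["n1", "n2", "n3"]
            self.lease_timeout = lease_timeout
            self.leader = None
            self.last_ok = 0.0

        def submit(self, command, send):
            # `send(node, command)` stands in for the real RPC; it returns
            # True if that node accepted (i.e. led) the command.
            now = time.monotonic()
            if self.leader and now - self.last_ok < self.lease_timeout:
                if send(self.leader, command):      # fast path: sticky leader
                    self.last_ok = now
                    return True
            for node in self.nodes:                 # slow path: whoever accepts
                if send(node, command):             # becomes the new sticky leader
                    self.leader, self.last_ok = node, now
                    return True
            return False

    # Toy transport in which only "n2" currently accepts commands.
    client = StickyLeaderClient(["n1", "n2", "n3"])
    print(client.submit("put x=1", send=lambda node, cmd: node == "n2"))  # True
    print(client.leader)                                                  # n2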
So, why?
A leaderless protocol is more complex and requires greater IO and CPU resources per unit of work. This is the hidden cost. The question is whether the cost is reasonably offset by the higher performance (throughput) offered by leaderless consensus. The OP is arguing that the benefit is not deemed worth the cost outside of academia, and this is why, e.g., EPaxos is not adopted.
But doesn't a leaderless protocol also give you more resilience against failures? Or, in other words, can it be that the higher cost buys you not better throughput but faster reconciliation when connectivity is poor? Not in a data center but on a mobile network?
Putting my SRE hat on, I'd generally agree that a leaderless protocol would be the way to go to prevent failures but you will still have to consider the costs involved with running such a system.
I had the same thought. My OP was summarizing the article for GP. My personal [sense] is that Raft stole the limelight and EPaxos never got any blogosphere love. Few actually even knew about it.
Leader transfer is generally quick enough with Paxos in my experience.
They're fast when you limit each machine's CPU, but they're slow when you limit total CPU.
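A back-of-envelope model of that statement, with explicitly made-up costs (not the article's measurements): give every node 1 unit of CPU per second and ask what limits throughput first, the busiest node or the cluster's total CPU.

    def max_throughput(n, cpu_per_op, bottleneck_share):
        # cpu_per_op       : total CPU the protocol spends per committed op
        # bottleneck_share : fraction of that work landing on the busiest node
        per_node_limit = 1.0 / (cpu_per_op * bottleneck_share)  # busiest node saturates
        cluster_limit = n / cpu_per_op                          # all CPU in the cluster
        return min(per_node_limit, cluster_limit)

    # Made-up numbers: a leader-based protocol concentrates ~60% of a cheap op
    # on the leader; a leaderless protocol spreads a ~2x more expensive op evenly.
    for name, cost, share in [("leader-based", 1.0, 0.6), ("leaderless", 2.0, 0.2)]:
        ops = max_throughput(n=5, cpu_per_op=cost, bottleneck_share=share)
        print(name, round(ops, 2), "ops/s,", round(ops * cost, 2), "of 5 CPU units used")

With those illustrative numbers the leaderless protocol pushes more ops when each machine's CPU is the limit, but it burns roughly twice the aggregate CPU per op, so with a fixed total CPU budget the leader-based protocol comes out ahead.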
Can you not with the clickbait patterns?
Actual title - why fast replication protocols are slow