Wednesday, 8 March 2017

The ultimate Paxos (and Atomic Multicast) protocol

Many years ago, probably around 1986, Barbara Simons organized a workshop on replication at Asilomar, a California resort that had an iconic role in early systems research (it took a while, but eventually our community was too large to fit there). 

Her idea was to bring people together across areas: theory and practice, and within the practitioners, databases as well as other forms of distributed systems.  It was a good idea, and a great workshop.

In Jim Gray's talk, he pointed out that the interaction pattern we associate with the 2-phase commit protocol seemed universal, and that perhaps we were wrong to think of 2-phase commit as primarily a database construct.  If 2-phase commit is really concerned with something deeper than database transactions, he asked, what is this deeper core concept?  The essential point of his talk was that the 2-phase pattern seen in 2PC clearly allowed a distributed system to "learn" something.  From this he suggested that if we could characterize the minimal "knowledge" required for a distributed system to be correct, we could build a minimal implementation of that protocol and solve the problem once and for all.

A long time has passed.  Jim is no longer with us, and we know all about knowledge in distributed systems, and how processes learn.  Even so, Jim's question still sometimes comes up.  I mentioned this to Idit Keidar today over lunch, and her response was definitive: she convinced me that today, we really do (finally) understand the answer to Jim's question.    Her thinking centers on work she published in 2006 with Alex Shraer.

Here's what she explained.

First, she observed, distributed computing is ultimately about just one question: fault-tolerant consensus.  For her, Jim had noticed this, but was understanding it as something about the 2PC communication pattern rather than appreciating that what matters more is consensus: moving from a state in which some decision is uncertain to one in which a decision has been reached.  As Idit views the question, one can transform consensus into other forms and route the messages in all sorts of patterns, but the bottom line is that either we achieve agreement via consensus or we simply aren't building systems that can make logically sound claims about their behavior.

Next, she pointed out, agreement is trivial while failures aren't happening.  While a system remains stable, the author of a new piece of information simply needs to multicast it, and this accomplishes consensus.    If there might be many authors, you can circulate a token around them either before they send (as in the old Totem and Transis protocols) or after the messages arrive, to assign them a delivery ordering.
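To make that idea concrete, here is a toy sketch in Python (all class names are my own inventions; this is not the actual Totem or Transis wire protocol): whoever holds the token stamps its pending messages with global sequence numbers before multicasting them, and every receiver delivers in sequence order, so all receivers see one identical total order.

```python
# Toy sketch of token-based total ordering (illustrative only).

class Token:
    def __init__(self):
        self.next_seq = 0       # next global sequence number to hand out

class Sender:
    def __init__(self, name):
        self.name = name
        self.pending = []       # messages waiting for the token

    def on_token(self, token, network):
        # While holding the token, stamp each pending message with a
        # global sequence number and multicast it.
        for msg in self.pending:
            network.multicast(token.next_seq, self.name, msg)
            token.next_seq += 1
        self.pending.clear()

class Receiver:
    def __init__(self):
        self.buffer = {}        # seq -> (sender, msg)
        self.next_to_deliver = 0
        self.delivered = []

    def on_multicast(self, seq, sender, msg):
        self.buffer[seq] = (sender, msg)
        # Deliver in sequence order, with no gaps.
        while self.next_to_deliver in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_to_deliver))
            self.next_to_deliver += 1

class Network:
    def __init__(self, receivers):
        self.receivers = receivers

    def multicast(self, seq, sender, msg):
        for r in self.receivers:
            r.on_multicast(seq, sender, msg)

receivers = [Receiver(), Receiver()]
net = Network(receivers)
token = Token()
a, b = Sender("a"), Sender("b")
a.pending = ["a1", "a2"]
b.pending = ["b1"]
for s in (a, b, a):             # token circulates a -> b -> a -> ...
    s.on_token(token, net)
print(receivers[0].delivered)   # identical total order at every receiver
```

Notice that nothing here tolerates failures: if the token holder crashes, the protocol stalls, which is exactly the hard part discussed next.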

And so in Idit's view, the only hard issue is to handle failures.  Obviously, in the case of 2PC, particularly back in 1986, the field started out with a somewhat muddled understanding of this aspect: 2PC starts by stating that there is a set of processes that need to achieve agreement, but where does that set come from?  In a modern system like Derecho, the set itself is the output of a consensus layer.  2PC uses a crash failure model, but this is one of many models.  Back in 1986 people were very focused on models: crash failures, failstop failures, omission to send, omission to receive, timing errors, Byzantine models.

But these days all that really matters is to handle crash failures correctly.  Some organizations toy with Byzantine fault-tolerance, but honestly, I've never met anyone who had a service that actually came under a Byzantine attack.  Maybe the blockchain people will finally have that experience.

So let's focus on crash failures for a moment.  In light of the FLP theorem, we now know that without a perfect failure detector, a consensus protocol cannot be proved fully correct: you can show safety, but not liveness.  In practice, we solve this by forcing a seemingly faulty process to leave the system; later, if you want, it can rejoin.

So failure becomes a purely behavioral abstraction: if a majority of the system deems some process worthy of exclusion, for whatever arbitrary reason it desires, then out goes the offending process, end of story.

Where did the majority constraint come from?  Its role is to prevent logical partitioning: two disjoint subsets of the system cannot both contain a majority, so at most one side of a partition can continue to define the system.

So, we end up with a rather dynamic form of membership: the system defines its own membership, excluding processes that seem faulty, and makes progress as long as it can maintain a dynamic form of majority.  To wit: at any point in time, some epoch defines the composition of the system.  To move to a new epoch, the system needs agreement from a majority of whoever is in the active epoch.
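A minimal sketch of that epoch rule (with invented names; this is not Derecho's actual code) fits in a few lines: installing the next epoch requires votes from a majority of the current epoch's members, which is precisely what stops two disjoint partitions from both making progress.

```python
# Minimal sketch of majority-gated epoch changes (illustrative only).

class EpochManager:
    def __init__(self, members):
        self.epoch = 0
        self.members = set(members)

    def try_install(self, new_members, votes):
        # votes: the set of current-epoch members approving the change.
        quorum = len(self.members) // 2 + 1
        if len(votes & self.members) < quorum:
            return False          # no majority of the active epoch: stall
        self.epoch += 1
        self.members = set(new_members)
        return True

m = EpochManager({"p", "q", "r", "s", "t"})
# q is suspected; the other four members vote to exclude it.
assert m.try_install({"p", "r", "s", "t"}, votes={"p", "r", "s", "t"})
assert m.epoch == 1
# A two-member minority cannot install a new epoch on its own.
assert not m.try_install({"p", "r"}, votes={"p", "r"})
```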

So here's where all of this leads: Derecho is a concrete implementation of this new-age perspective on Jim's question.  Indeed, it is the ultimate story in the sense that the system is optimal in many different ways.

To appreciate this, you'll need to read the Derecho papers, but in a nutshell, the system maps consensus onto RDMA hardware in an exceptionally efficient way.  But the protocol itself runs consensus on the configuration of each epoch, and then uses a cheaper consensus protocol within an epoch (one that can assume membership is stable and that failures won't occur), leaving a very clean realization of distributed computing.

Indeed, Derecho is a constructive lower bound in every interesting dimension.  It is an optimal solution to the problem of distributed computing, using consensus-based methods.

Why do I make this claim?  Well, first, during a given epoch, the system is as quick to deliver messages as possible: one can prove that any system that delivers messages with fewer "exchanges of information" between its processes is either incorrect, or at least at risk of needing to stop because of some single fault.

Next, one can show that Derecho's agreement on the state of each membership epoch is optimal.   Here, Derecho makes use of an all-to-all pattern of information exchange, and I asked Idit if she thought the protocol could be improved.  Idit pointed again to her 2006 papers with Alex.  Without prior knowledge of which processes failed, and how many failed, she explained, this pattern of information exchange is the quickest way to reach agreement on the membership of the next epoch.

Finally, we can show that Derecho makes progress with a failure model called <>P: eventually perfect failure detection.  Idit explained to me that in the past, people actually thought it might make sense to focus on progress with weaker detection models, like <>W, but if you do so, you either end up with a fault model equivalent to <>P, or you end up with slower agreement protocols that look much more like Byzantine agreement.  So, for the style of quick progress Derecho is after, she explained, <>P is the right goal.  And Derecho is live with <>P, provided that no more than a minority of processes fail in any epoch.  Which, again, is the best one can do.
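In practice, <>P is approximated with adaptive timeouts.  Here is a small sketch of the standard idea (my own simplification, not any particular system's detector): suspect a process whose heartbeat is overdue, and double that process's timeout whenever a suspicion turns out to be wrong, so that the timeout eventually exceeds the true (unknown) delay and correct processes are never suspected again.

```python
# Sketch of an eventually-perfect (<>P) failure detector (illustrative).

class EventuallyPerfectDetector:
    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: 0.0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, p, now):
        self.last_heartbeat[p] = now
        if p in self.suspected:
            # False suspicion: back off so the mistake isn't repeated.
            self.suspected.discard(p)
            self.timeout[p] *= 2

    def check(self, now):
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected

d = EventuallyPerfectDetector(["p", "q"])
d.on_heartbeat("p", now=0.5)
suspects = d.check(now=1.2)        # q's heartbeat is overdue
assert suspects == {"q"}
d.on_heartbeat("q", now=1.5)       # q was merely slow, not crashed
assert d.suspected == set()
assert d.timeout["q"] == 2.0       # q's timeout doubled after the mistake
```

The "eventually" in <>P lives in that doubling step: after finitely many false suspicions, every slow-but-correct process stops being suspected.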

Now, Idit is a theory person, but it is interesting for me as a practitioner to also think about practicalities.  As the old adage goes: "In theory, theory and practice are the same, but in practice, they differ."

As it happens, Derecho is optimal in a communications-engineering (networking) sense. 

Given that reliable networks always include a one-to-one, acknowledgement-based layer, reliable multicast over a tree is an optimal data dissemination pattern: if you try to disseminate multicasts over an unreliable 1-to-N protocol, the cost of detecting lost packets and resending them will be very high compared to a tree of unicast transfers.  (Perhaps optical networks with optically aggregated acknowledgements could offer a slight hardware speedup, but even this isn't clear, since the optical aggregator would itself be a tree.)
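A quick back-of-the-envelope illustration of why trees win (a simplification on my part; Derecho's RDMC actually uses a more refined, pipelined variant of this pattern for large transfers): over reliable point-to-point links, the set of nodes holding the data can double every round, so N nodes are covered in about log2(N) rounds.

```python
import math

def binomial_rounds(n):
    """Rounds to reach n nodes when every current holder forwards once per round."""
    have, rounds = 1, 0
    while have < n:
        have *= 2          # each holder relays to one new node per round
        rounds += 1
    return rounds

for n in (2, 8, 1000):
    assert binomial_rounds(n) == math.ceil(math.log2(n))
print(binomial_rounds(1000))   # 10: a 1000-node group is covered in 10 rounds
```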

Now if you happen to be a cloud computing engineer, and read the above, what will jump to mind is heavy-tailed behaviors and multi-tenancy: Derecho protocols sometimes relay data, so if a relayer is very slow, everyone waits.  Paxos classically would avoid this using quorums: the system doesn't wait for the slowest process.  So how can I claim that Derecho is optimal in a practical sense if it doesn't use quorums?

There are a few answers.  A long-winded one would talk about work Fernando Pedone and his student, Parissa Jallali, did while visiting my group a few years ago.  Basically, they studied quorum behaviors in Paxos and discovered that while Paxos can briefly "outrun" a slow process, eventually work piles up: either flow control causes the protocol to pause, or a shifting pattern of just who is slow rotates the laggard into the quorum while some previous quorum member becomes slow.  Either way, Paxos basically halts until everyone catches up and is back in sync.  So quorum patterns do not evade the disruption caused by heavy-tailed behaviors.

Conversely, offloading the protocol into hardware actually can eliminate that issue, because the hardware is dedicated: the network spends 100% of its time communicating, so if you can describe a pattern of communication, then let it rip; the network is the ideal engine for moving data.  As it happens, Derecho generates deterministic data-transfer schedules, hence given adequately programmable NICs we can hand the entire sequence of block transfers to the NIC.  So we can even make "optimal" use of the network, and since a NIC never sleeps or pauses, quorum behaviors aren't needed even if end applications sometimes are a bit slow.

So a road that for me started around when Jim asked his question about 2PC seems to have reached its end: Derecho implements the ultimate Paxos protocols for atomic multicast (often called vertical Paxos) and for persisted memory (classic Paxos).  We could add an optimistic early delivery protocol with a flush, too, as in Isis and Vsync, but we decided to keep the system simple and omitted it: most people who would use that feature probably just want raw RDMC from the Derecho API, and this we do offer.

And so the Paxos problem is solved.  And you know what?  It's about time!  Idit feels that it was solved back in 2006, actually.  As for me, well, until I can write an application using the ultimate protocol, the problem remains open.  But today, I finally can.  (I don't mind at all that Derecho employs a protocol that is actually extremely similar to our first Isis protocols from 1985.)

So should we all pack our bags and go home? 

Not quite yet.  First, there are other forms of distributed consistency, like convergent (gossip) protocols, self-stabilization, and of course, Byzantine Agreement.  It would be nice to see how those fit into this picture and whether there can be a single integrated story that combines all the elements.

A second issue is the engineering complexity of modern platforms.  I'll write a whole blog posting on this sometime soon, but suffice it to say that in a data center with physical topology (racks, switches, TOR switches, failure-independence domains...), GPU clusters, NetFPGA accelerators, Intel SGX protection enclaves... it just isn't obvious how to write code for such environments.  Derecho is just part of the answer, and not the whole story.

Then beyond all of this are dimensions we have yet to tackle in Derecho itself.  For example, even if Derecho is the ultimate data center protocol, is it also the ultimate WAN version?  As it happens, I suspect that this may be true too, but more attention to the question will be needed.  Anyhow, until we have it running, I won't believe the story even if I figure it out "in theory".  After all, in theory, theory and practice are the same...  but in practice, they are enormously different.

So I wouldn't despair: very likely we are finally at the end of the road for Paxos, but it certainly isn't the end of the road for distributed systems.  Watch this space: I can promise plenty of interesting new questions, and new answers, in years to come.
