Tuesday, 28 March 2017

Is it sensible to try to `` teach'' cloud computing?

Many years ago, I was teaching a rather polished course on distributed computing, and students seemed to like it very much: we looked at distributed programming models and algorithms, and the students learned to create new protocols of their own, solving problems like atomic multicast.  Mostly I taught from my own textbook, initially created back in 1996 on a prior sabbatical.

When distributed systems suddenly expanded enormously and started to be known as cloud infrastructure systems (I would date this roughly to 2005 when AWS was introduced), it seemed like an obvious moment to modernize that class and shift it to become a cloud computing course. 

What followed became a cycle: I would revise my distributed systems textbook, update all my slides,  and then the course would feel reasonably current: I did this in 2010, then again in 2012.

But today I'm facing an interesting puzzle: as part of my sabbatical I thought I might again attempt to revamp the class, but the more I learn about the current state of the art, the less clear it is that this is even possible!  Cloud computing may have become too large a topic, and too arcane, to permit a proper treatment.  Here are a few of the more important topics:

  • Edge platforms.  Devices like iPhones and iPads were completely new back when the cloud emerged, and fairly simple.  Today, they are incredibly complex and very powerful platforms, and it would probably take a full year-long course to just cover the full range of technologies embedded into them.
  • The network itself is increasingly sophisticated, and I use the word "increasingly" with some hesitation, because after decades of continuous evolution and change, it is hard for a person who thinks of the Internet in terms of the old TCP/IP structure to fathom quite how changed the modern network actually has become.  Today's network is a computational structure: it dynamically adapts routing, is able to model each separate enterprise as a distinct domain with its own security and quality of service, it can cache huge amounts of data, and it understands mobility and adapts proactively, so that by the time your car emerges from the tunnel, the network is ready to reconnect and deliver the next bytes of your children's videos.  With the push towards P4, the network is increasingly programmable: an active entity that computes on data flows running at rates of 100Gbs or more.
  • Any serious cloud computing company operates a lot of data centers, at locations spread globally (obviously, smaller players lease capacity from data centers operated by specialists, then customize their slice with their own software layers).  Some systems are just limited-functionality cache and web service structures: simple points of presence; others are full-featured data warehouses that do extensive computation.   Thus the cloud is a heavily distributed structure, with a hierarchy.  Routing of client requests is heavily managed.
  • Within any single data center we have layers of functionality: edge systems that run from cache and are mostly stateless (but this is changing), then back-end systems that track dynamic data, and compute engines that apply iterative computational steps to the flow: data arrives, is persisted, is compressed or analyzed, this creates new meta-data artifacts that in turn are processed, and the entire infrastructure may run on tens or hundreds of thousands of machines.
  • Big data is hosted entirely in the cloud, simply because there is so much of it.  So we also have these staggeringly-large data sets of every imaginable kind, together with indices of various kinds intended to transform all that raw stuff into useful "content".
  • We have elaborate scalable tools that are more and more common: key-value stores for caching (the basic MemCacheD model), transactional ones that can support SQL queries, even more elaborate key-value based database systems. 
  • The cloud is a world of extensive virtualization, and virtualized security enclaves.  All the issues raised by multitenancy arise, and those associated with data leakage, ORAM models, and then technologies like ISGX that offer hardware remedies.
  • Within the cloud, the network itself is a complex and dynamic creation, increasingly supporting RDMA communication, with programmable network interface cards, switches and routers that can perform aspects of machine learning tasks, such as in-network reductions and aggregation.
  • There are bump-in-the-wire processors: NetFPGA and other ASIC devices, plus GPU clusters, and these are all interconnected via new and rather exotic high speed bus technologies that need to be carefully managed and controlled, but permit amazingly fast data transformations.
  • File systems and event notification buses have evolved and proliferated, so in any given category one has an endless list of major players.  For example, beyond the simple file systems like HDFS we have ones that offer strong synchronization, like Zookeeper, ones that are object oriented, like Ceph, real-time versions like Cornell's Freeze Frame (FFFS), big-data oriented ones, and the list goes on and on.  Message bus options might include Kafka, Rabbit, OpenSlice, and these are just three of a list that could extend to include 25.  There are dozens of key-value stores.  Each solution has its special feature set, advantages and disadvantages.
  • There are theories and counter-theories: CAP, BASE, FLP, you name it.  Most are actually false theories, in the sense that they do apply to some specific situation but don't generalize.  Yet developers often elevate them to the status of folk-legend: CAP is so true in the mind of the developers that it almost doesn't matter if CAP is false in any strong technical sense.
  • We argue endlessly about consistency and responsiveness and the best ways to program asynchronously.  The technical tools support some models better than others, but because there are so many tools, there is no simple answer.
  • Then in the back-end we have all the technologies of modern machine learning: neural networks and MapReduce/Hadoop and curated database systems with layers of data cleaning and automated index creation and all the functionality associated with those tasks.
I could easily go on at far greater length.  In fact I was able to attend a presentation on Amazon's latest tools for AWS and then, soon after, one for Microsoft's latest Azure offerings, and then a third one focused on the Google cloud.  Every one of these presentations reveals myriad personalities you can pick and chose from: Azure, for example, is really an Azure PaaS offering for building scalable web applications, an Azure fabric and storage infrastructure, an Azure IaaS product that provides virtual machines running various forms of Linux (yes, Microsoft is a Linux vendor these days!), a container offering based on Mesos/Docker, and then there is Azure HPC, offering medium-size clusters that run MPI supercomputing codes over Infiniband.  All of this comes with storage and compute managers and developer tools to help you build and debug your code, and endless sets of powerful tools you can use at runtime.

Yet all of this is also terribly easy to misuse.  A friend who runs IT for a big company was just telling me about their move to the cloud: a simple mistake ran up a $100k AWS bill one weekend by generating files that simply got bigger and bigger and bigger. On a single computer, you run out of space and your buggy code crashes.  In the cloud, the system actually has the capacity to store exobytes... so if you do this, it just works.  But the, on Monday the bill arrives.  Microsoft, Amazon and Google are vying for the best ways to protect your IT department against surprises, but of course you need to realize that the dashboard has those options, and enable them, much like credit card alerts from Visa or MasterCard.

So even cloud dashboards have become an elaborate topic.

Where does this story end?  The short answer is that it definitely isn't going to end, not for decades.  If anything, it will accelerate: with the trend towards Internet of Things, we'll be moving all sorts of mission-critical real-time systems into the cloud, and even larger data flows: self-driving cars, self-managed smart highways and cities and buildings, you name it. 

And you know what?  I don't see a way to teach this anymore!  I was just talking to Idit Keidar yesterday at Technion, quite possibly the world's very best distributed systems researcher, and she told be a little about her graduate class in distributed systems.  Five years ago, I might have thought to myself that wow, she should really be teaching the cloud, that students will insist on it.  Yesterday my reaction was exactly the opposite: when I resume teaching this fall at Cornell, I might just do what she's doing and just return to my real roots.  Cloud computing was a fun course to teach for a while, but the growth of the area has simply taken it completely out of the scope of what we as faculty members can possibly master and cover in any one class.

Friday, 17 March 2017

The CAP conjecture is dead. Now what?

CAP has been around now for something like 15 years.  There are some circles in which the acronym is even more famous than FLP or SAT!  But CAP was never a theorem (let's call it a "folk-theorem"), and by now we have more and more examples of systems that violate CAP. 

Yet even if CAP is false, it was also a fantastic rule of thumb that was incredibly valuable to developers tasked with building high performance distributed systems on the cloud.  If we start to teach students that CAP is false, what can replace it in this role?

A little historical context: CAP is short for "you can only have two from Consistency, Availabilty and Partition Tolerance",  This assertion was initially put forward by Eric Brewer in a PODC keynote talk he gave in 2000. The justification he offered ran along the following lines.  First, he observed that in today's increasingly globalized web systems, we invariably deploy services with high latency WAN links between them.  These are balky, so services are forced to make a choice: either respond right now based on data locally available, or await restoration of the WAN link.  A tradeoff that obviously favors availability over consistency, and already tells us that many web services will have to find a way to prevent responses that reflect stale data from being noticed by users, or if they are noticed, from causing problems.  He used the term "partition tolerance" for this kind of fault-tolerance (namely, giving a response even when some link is down).  Hence the P in CAP.

He suggested that we are favoring "A and P over C".

Then he observed that even in a single data center, if you want the highest possible levels of performance, you'll need to handle web requests on edge nodes that take their data from cache, without first talking a backend server first: again weaker consistency, but higher availability.  So again, we see the A, as a kind of synonym for rapid response.

So he looked at all the different pairing: C and A over P, C and P over A, A and P over C.  He concluded that in practice we always seem to find that by taking A and P we get the best scalability.  But CAP itself asserted that although other mixes work, you always have to pick the two you like best, and you'll erode the third.

Although the definitions of A and P are a bit strange (P seems to have a bit of A mixed in, and also seems to sometimes mean fault-tolerance, since partitions never really arise within a data center), CAP is sort of catchy.  But like many such acronyms, much depends on the details: if you try and give rigorous definitions, to turn CAP into a theorem (people have done so), you find that it only holds in some narrow situations.

The result is that as professors teaching cloud computing, we generally treat CAP as a clever acronym, but one that works mostly as a rule of thumb.  In some sense, CAP is a useful kind of folklore.

CAP took hold back in 2000 because at that time, companies like eBay and Amazon were struggling with the high costs of ACID transactions in systems that the database SQL programming model.  Scalable database performance poses issues that are much more nuanced than the ones Eric had in mind: there were puzzles of lock conflict, complexity, data pipelines with non-trivial asynchronous ordering requirements, etc.  But the bottom line is that performance of the big systems at eBay and Amazon was erratic and often strayed outside the magic 100ms target for web-service and web-page responses.  This is the point at which a human user feels that the system is "snappy" and everyone wants to be in the sub-100ms range.

So, the technology leaders at eBay began to tell their developers that it was absolutely fine to write SQL code as a starting point in web application design (a pragmatic decision: they people they were hiring mostly had extensive SQL experience from their database courses), but that once the SQL code was working, to weaken the transactions by turning them into a series of individual atomic actions.   eBay began to talk about the resulting development methodology using a new acronym: BASE, by which they meant "Basically Available, Softstate systems with Eventual Consistency."

Notice that BASE doesn't abandon consistency.  Instead, it points out to the developer that many web systems just don't need ACID guarantees to work correctly.   ACID and SQL are great for creating a quick prototype that will be robust and easy to debug, but then you "optimize" it by taking away the ACID properties, without breaking the behavior in ways that violate the specification.

Amazon embraced BASE too.  Around 2006, they decided to rebuild many of their core applications around a key-value technology called Dynamo, but SQL users found it hard to use, and by 2008 the adoption of Dynamo began to falter.  To make the transition easier, Amazon layered in a NoSQL API for Dynamo, called Dynamo-DB: now SQL code could run on Dynamo, but with weaker guarantees than for a full SQL system (for example, NoSQL lacks join operations), and Dynamo-DB happened to be especially well-matched to BASE.

So you can see from these examples why CAP would be such convenient way to motivate developers to make the switch: it more or less tells them that if they optimize their code using BASE, it won't scale properly.  Moreover, the eBay memos about BASE include step by step instructions to explain precisely how to go about doing it.

Today, fifteen years later,  we've discovered more and more ways to implement scalable, high performance cloud services, edge caches that maintain coherence even at global scale, fully scalable transactional key-value storage systems, and the list goes on.  Derecho is one example: it helps you build systems that are highly Available, Replicated and Consistent.  Call this ARC.

The cool twist is that with ARC, you get lock-free strong consistency, right in the cloud edge! This is because we end up with very fast replication at the edge, and every replica is consistent.  You do need to think hard about your update sources and patterns if you hope to avoid using locks, but for most applications that aspect is solvable because in the cloud, there is usually some sense in which any particular data item has just once real update source.  So there is an update "pipeline" and once you have consistency in the system, you end up with a wide range of new options for building strongly consistent solutions.

An example I like very much was introduced by Marcos Aguilera in Sinfonia.  In that system, you can make strongly consistent cache snapshots, and run your code on the snapshot.  At massive scale you have as many of these snapshots as needed, and transactions mostly just run at the edge.  But when you want to update the system state, the trick he suggested is this: run your transaction and keep track of what data it read and wrote (version numbers).  Now instead of just responding to the client, generate a "minitransaction" that checks that these version numbers are still valid, and then does all the writes.  Send this to the owner of the true database: it validates your updates and either commits by applying the updates, or rejects the transaction, which you can then retry.

Since so much of the heavy lifting is down when speculatively executing the transactions at the first step, the database owner has way less compute load imposed on it.  Then you can start to think about systems with sharded data that has different owners for each shard, and limits the transactions to run within a single shard at a time (the trick is to design the sharing rule cleverly).  Sinfonia and the follow-on systems had lots of these ideas.

ARC creates a world in which solutions like the Sinfonia one are easy to implement -- and the Sinfonia approach is definitely not the only such option.

I'm convinced that as we move the cloud towards the Internet of Things services, developers will need this model,  because inconsistency in a system that controls smart cars or runs the power grid can wreak havoc and maybe even would be dangerous.

So now I want to claim that ARC is the best BASE story ever!  Why?  Well, remember that BASE is about basically available, soft-state systems with eventual consistency.  I would argue that in examples like the Sinfonia one, we see all elements of the BASE story!

For example, a Sinfonia system is basically available because with enough consistent snapshots you can always do consistent read-only operations at any scale you like.  The state is soft (the snapshots aren't the main copy of the system: they are replicas of it, asynchronously being updated as the main system evolves).  And they are eventually consistent because when you do make an update, the mini-transaction validation step lets you push the update results into the main system, in a consistent way.  It might take a few tries, but eventually, should succeed.

What about the eventual consistency aspect of BASE in the case of ARC?   ARC is about direct management of complex large-scale strongly consistent replicated state.   Well, if you use ARC at the edge, back-end servers tend to be doing updates in a slightly time-lagged way: batched updates and asynchronous pipelines improve performance.  Thus they are eventually consistent, too: the edge queues an update, then responds to the end-user in a consistent way, and the back end catches up soon afterwards.

And ARC isn't the only such story:  my colleague Lorenzo Alvisi has a way to combine BASE with ACID: he calls it SALT, and it works really well.

So, feeling burned by CAP?  Why not check out ARC?  And perhaps you would like a little SALT with that, for your databases?

Sunday, 12 March 2017

Should Derecho do optimistic early delivery?

In the Isis and Vsync systems, we supported a form of early delivery that could be described as "optimistic":  messages were delivered as soon as they arrived, before the system was certain safety had been achieved, enabling the application to start processing them, asynchronously with respect to the background exchanges of acknowledgements required to learn that stability had been reached.  

For example, suppose that a client process, P, asked a process Q to take some action.  Perhaps as a side effect of this, Q updates data replicated in group G with members {Q, R, S}.  In an optimistic case, the update multicast itself might return as soon as the multicast was in process, and the delivery events could occur as soon as ordering was determined.  Latency is thus minimal.

The trouble with optimistic early delivery is that a crash can erase the multicast at the first stages of such a protocol.  Thus there might be a window of a milisecond or so during which a power failure could kill some process after it sent (or received and delivered ) such a multicast, but before it was fully delivered at other members.  Perhaps we would end up with R having delivered (and acted upon the message), and use of the crash, S never gets a copy.

This is a bad behavior: normally, one wants failure atomicity.  Accordingly, the Isis or Vsync application developer who used these optimistic protocols was told to invoke a flush primitive before taking actions that might remain visible after a crash, like giving out cash from the ATM, or sending a reply to an external client application.  The flush primitive simply waited until stability was achieved, returning after the message had become safe.  Thus if R called flush before handing out cash, either the multicast stabilized and we distribute the money, or the crash happens while flush is waiting, and R is killed,  but no money was handed out, hence nothing bad has occurred.

The idea was that optimistic delivery plus flush would be equivalent to pessimistic delivery, but that during the period before flush was called, computing overlaps with background message passing and extra work can be done.  In Vsync and Isis, the totally ordered multicast was  optimistic, whereas the "safe" primitive was pessimistic.  You may have heard of vertical Paxos protocol: this is also a pessimistic protocol.  Pessimistic protocols in effect call flush before the application ever sees the message, so that the application layer is always safe.

As a teaching device, we would ask new Isis or Vsync developers to think about the way that buffered disk writes are done in a file system.  The application writes some text, but it ends up buffered first by the file I/O library, and then in the kernel.  Seconds can elapse before the data is on the disk.  The value of buffering is speed: performance is dramatically improved, far better than if fflush is called character by character or even line by line.

Today I'm revisiting this same question in Derecho.  Our basic OrderedSend is pessimistic, and pretty fast.  We also have a raw version of send that lacks ordering guarantees or failure atomicity and runs at the highest speeds possible.  But do we need an optimistic delivery too?

The case for optimistic deliver centers on this question: are there applications that need strong properties, yet also need the small degree of overlapped computing afforded by optimism?  A pessimistic delivery rarely delays more than a few microseconds relative to an optimistic one.  So you can deliver a small multicast after, say, 12us in an optimistic mode, but might have to wait 25us to deliver it with full safety (pessimistically).

In our experiments, the performance gap is most extreme for small messages.  Our impression, so far, is that these suffer from two kinds of delay.  First, with very small multicasts, the actual time needed to report stability through the SST subsystem is proportionally larger relative to data transfer delays, so when you compare a small message delivery to a large one, the wait for stabilization simply looks larger.  And then the other issue is that because these messages need to be buffered, Derecho runs out of space and at high rates, new incoming messages are delayed waiting for an admission slot.

In contrast, optimistic delivery mode lets Derecho hand off incoming messages to the application instantly, hence the system buffers only data that is in-flight, and won't incur this admission backlog buildup.  Stabilization takes the same amount of time, but by the time flush is invoked by a handler processing message m, there is more hope that m will have fully stabilized and that flush won't delay at all.

But keep in mind that with raw mode, there is no delay and no need to use flush at all.     The question becomes this: will people who would find optimistic delivery useful be content to use raw mode instead?  If so, the demand for optimistic mode would be low, and we should then omit the extra mechanisms and extra complexity.  Or do they need stronger guarantees when failures occur?  And if the latter, just what would those applications look like, that need atomicity but also need to shave 25us or so off the delivery delay?

Let us know if you have a great example of an application that would really benefit from optimistic early delivery.  For now, we know how to support optimism within Derecho, but are learning towards not doing so!

Wednesday, 8 March 2017

The ultimate Paxos (and Atomic Multicast) protocol

Many years ago, probably around 1986, Barbara Simons organized a workshop on replication at Asilomar, a California resort that had an iconic role in early systems research (it took a while, but eventually our community was too large to fit there). 

Her idea was to bring people together across areas: theory and practice, and within the practitioners, databases as well as other forms of distributed systems.  It was a good idea, and a great workshop.

In Jim Gray's talk, he pointed out that the interaction pattern we associate with the 2-phase commit protocol was seemingly universal, and that perhaps we were wrong to think of 2-phase commit as primarily a database construct.  Instead, he then asked, if 2-phase commit is really concerned with something deeper than database transactions, what is this deeper core concept?  The essential point of his talk was that the 2-phase pattern seen in 2PC clearly allowed a distributed system to "learn" something.  And from this, he suggested, that if  we could start to appreciate the minimal "knowledge" required for a distributed system to be correct, we could build a minimal implementation of that protocol and solve the problem once and for all.

A long time has passed.  Jim is no longer with us, and we know all about knowledge in distributed systems, and how processes learn.  Even so, Jim's question still sometimes comes up.  I mentioned this to Idit Keidar today over lunch, and her response was definitive: she convinced me that today, we really do (finally) understand the answer to Jim's question.    Her thinking centers on work she published in 2006 with Alex Shraer.

Here's what she explained.

First, she observed, distributed computing is ultimately about just one question: fault-tolerant consensus.  For her, Jim had noticed this, but was understanding it as something about the 2PC communication pattern rather than appreciating that the thing that matters more is consensus: moving from a state in which some decision is uncertain to one in which a decision has been reached.  As Idit views the question, one can transform consensus into other forms, we can route the messages in all sorts of patterns, but the bottom line is that either we achieve agreement via consensus or we simply aren't building systems that can make logically sound claims about their behavior. 

Next, she pointed out, agreement is trivial while failures aren't happening.  While a system remains stable, the author of a new piece of information simply needs to multicast it, and this accomplishes consensus.    If there might be many authors, you can circulate a token around them either before they send (as in the old Totem and Transis protocols) or after the messages arrive, to assign them a delivery ordering.

And so in Idit's view, the only hard issue is to handle failures.  Obviously, in the case of 2PC, particularly back in 1986, the field started out with a somewhat muddled understanding of this aspect: 2PC starts by stating that there is a set of processes that need to achieve agreement, but where does that set come from?  In a modern system like Derecho, the set itself is the output of a consensus layer.  2PC uses a crash failure model, but this is one of many models.  Back in 1986 people were very focused on models: crash failures, failstop failures, omission to send, omission to receive, timing errors, Byzantine models.

But these days all that really matters is to handle crash failures correctly.  Some organizations toy with Byzantine fault-tolerance, but honestly, I've never met anyone who had a service that actually came under a Byzantine attack.  Maybe the blockchain people will finally have that experience.

So let's focus on crash failures for a moment.  In light of the FLP theorem, we know now that without a perfect failure detector, protocols can't be proved correct: you can show safety, but not liveness.  In practice, we solve this by forcing a seemingly faulty process to leave the system and then, if you want, it can rejoin. 

So failure becomes a purely behavioral abstraction: if the majority of the system deems some process to be worthy of excluding, for whatever arbitrary reason it desires, than out goes the offending process, end of story. 

Where did the majority constraint come from?  The role of the majority restriction is to prevent logical partitioning. 

So, we end up with a rather dynamic form of membership: the system defines its own membership, excluding processes that seem faulty, and makes progress as long as it can maintain a dynamic form of majority.  To whit: at any point in time, some epoch defines the composition of the system.  In order to move to a new epoch, the systems needs agreement by the majority of whoever is in the active epoch.

So here's where all of this leads: Derecho is a concrete implementation of this new-age perspective on Jim's question.  Indeed, it is the ultimate story in the sense that the system is optimal in many different ways.

To appreciate this, you'll need to read the Derecho papers, but in a nutshell, the system maps consensus onto RDMA hardware in an exceptionally efficient way.  But the protocol itself runs consensus on the configuration of each epoch, and then uses a cheaper consensus protocol within an epoch (one that can assume membership is stable and that failures won't occur), leaving a very clean realization of distributed computing.

Indeed, Derecho is a constructive lower bound in every interesting dimension.  It is an optimal solution to the problem of distributed computing, using consensus-based methods.

Why do I make this claim?  Well, first, during a given epoch, the system is as quick to deliver messages as is possible: one can prove that any system that delivers messages with fewer "exchanges of information" between its processes is either incorrect, or at last at risk of needing to stop because of some single fault.

Next, one can show that Derecho's agreement on the state of each membership epoch is optimal.   Here, Derecho makes use of an all-to-all pattern of information exchange, and I asked Idit if she thought the protocol could be improved.  Idit pointed again to her 2006 papers with Alex.  Without prior knowledge of which processes failed, and how many failed, she explained, this pattern of information exchange is the quickest way to reach agreement on the membership of the next epoch.

Finally, we can show that Derecho makes progress with a failure model called <>P: eventually perfect failure detection.  Idit explained to me that in the past, people actually thought it might make sense to focus on progress with weaker detection models, like <>W, but that if you do so, you either end up with a fault model equivalent to <>P, or you end up with slower agreement protocols that look much more like Byzantine agreement.  So, for the style of quick progress Derecho is after, she explains, <>P is the right goal.  And Derecho is live with <>P, provided that no more than a minority of processes fail in any epoch.  Which again, is the best one can do.

Now, Idit is a theory person, but it is interesting for me as a practitioner to also think about practicalities.  As the old adage goes: "In theory, theory and practice are the same, but in practice, they differ."

As it happens, Derecho is optimal in a communications-engineering (networking) sense. 

Given that reliable networks always have a one-to-one acknowledgement based layer, reliable multicast over a tree is an optimal data dissemination pattern: if you try and disseminate multicasts over an unreliable 1-N protocol, the cost of detecting lost packets and resending them will be very high compared to a tree of unicast transfers.  (Perhaps optical networks with optically aggregated acknowledgements could offer a slight hardware speedup, but even this isn't clear, since the optical aggregator would itself be a tree). 

Now if you happen to be a cloud computing engineer, and read the above, what will jump to mind is heavy-tailed behaviors and multi-tenancy: Derecho protocols sometimes relay data, so if a relayer is very slow, everyone waits.  Paxos classically would avoid this using quorums: the system doesn't wait for the slowest process.  So how can I claim that Derecho is optimal in a practical sense if it doesn't use quorums?

There are a few answers.  A long-winded one would talk about work Fernando Pedone and his student, Parissa Jallali, did while visiting my group a few years ago.  Basically, they studied quorum behaviors in Paxos and discovered that while Paxos can briefly "outrun" a slow process, eventually work piles up and either flow-control causes the protocol to pause, or a shifting pattern of just who is the slow process switches the slow guy into the quorum and some previous quorum member becomes slow.  Either way, Paxos basically halts until everyone catches up and is back in sync.  So quorum patterns do not evade the disruption caused by heavy tailed behaviors.

Conversely, offloading the protocol into hardware actually can eliminate that issue, because the hardware is dedicated: the network spends 100% of its time communicating, so if you can describe a pattern of communication, then let it rip, the network is the ideal engine for moving data.  As it happens, Derecho generates deterministic data transfer schedules, hence given adequately programmable NICs we can hand the entire sequence of block transfers to the NIC.  So we can even make "optimal" use of the network, and since a NIC never sleeps or pauses, quorum behaviors aren't needed even if end-applications sometimes are a bit slow.

So a road that for me started around when Jim asked his question about 2PC seems to have reached its end: Derecho implements the ultimate Paxos protocols for atomic multicast (often called vertical Paxos) and for persisted memory (classic Paxos).  We could add an optimistic early delivery protocol with a flush, too, as in Isis and Vsync, but we decided to keep the system simple and omitted it: most people who would use that feature probably just want raw RDMC from the Derecho API, and this we do offer.

And so the Paxos problem is solved.  And you know what?  Its about time!  Idit feels that it was solved back in 2006, actually.  As for me, well, until I can write an application using the ultimate protocol, the problem is open.  But today, I finally can.  (I don't mind at all that Derecho employs a protocol that is actually extremely similar to our first Isis protocols from 1985.)

So should we all pack our bags and go home? 

Not quite yet.  First, there are other forms of distributed consistency, like convergent (gossip) protocols, self-stabilization, and of course, Byzantine Agreement.  It would be nice to see how those fit into this picture and whether there can be a single integrated story that combines all the elements.

A second issue is the engineering complexity of modern platforms.  I'll write a whole blog posting on this sometime soon, but suffice it to say that in a data center with physical topology (racks, switches, TOR switches, failure-independence domains...), GPU clusters, NetFPGA accelerators, Intel SGX protection enclaves... it just isn't obvious how to write code for such environments.  Derecho is just part of the answer, and not the whole story.

Then beyond all of this are dimensions we have yet to tackle in Derecho itself.  For example, even if Derecho is the ultimate data center protocol, is it also the ultimate WAN version?  As it happens, I suspect that this may be true too, but more attention to the question will be needed.  Anyhow, until we have it running, I won't believe the story even if I figure it out "in theory".  After all, in theory, theory and practice are the same...  but in practice, they are enormously different.

So I wouldn't despair: very likely we are finally at the end of the road for Paxos, but it certainly isn't the end of the road for distributed systems.  Watch this space: I can promise plenty of interesting new questions, and new answers, in years to come.