A Few Thoughts on Distributed Computing: Transactions [3]: Replication with consistency (what problem are we solving)?

So the question now arises: what about data replication?

There are really two issues:

Data replication inside the professional database, or the DHT, or whatever the transactional server might be. This is usually completely invisible to the end-user, who just writes little transactional operations and doesn’t need to be aware of how the DHT will handle crashes or similar issues.
Data replication used in the client systems: either out on the edge (like in a smart car, which might need to have a highly available control system, replicated and continuously active even if some component crashes due to a bug or fails because of a hardware glitch), or in the outer tier of the cloud (e.g. inside the cloud datacenter, but in the layer where the client requests are first received: an example might be an information service launched on behalf of that smart car that will run while the car is actively driving, as its cloud-hosted partner for planning the route, anticipating traffic jams, checking for sales at stores you are passing that sell stuff you seem likely to buy, and so forth).

Both kinds of replication matter, and will often need strong replica consistency, but notice that the replication technology would run in different places:

The first kind of replication runs inside the transactional service itself, to keep the service alive and accessible even if something fails while it is running. The earliest work on Lamport’s famous Paxos protocol[1] was something called Viewstamped Replication and was a solution to precisely this problem: Oki and Liskov were building a database and wanted to use replication within it, and ended up with a version of Paxos (it looks just like Lamport’s later Paxos protocol in his famous theoretical treatment a few years later), but deeply integrated with the data storage layer of the database they were building.
The second form of replication runs in a highly available application, where we may be replicating the active state of the program (perhaps, data or data structures it uses as it runs) across a set of program instances that back each-other up to step in if a failure occurs.

In my view, this gets to the core of the distinction. Think back to those edge processes that might be using transactions: in a stateless model, like the one that prevails in the cloud, such a process isn’t fault-tolerant and even if it is playing an important role, like being the representative within the cloud for a self-driving car that connects periodically for updates, the thread doesn’t have any real option for keeping itself alive in the face of routine cloud stuff like elasticity events that shut nodes down (including, perhaps, the node the thread was on). The cloud does things of that kind all the time, without much consideration of application state, because in the prevailing style of coding any kind of state is supposed to be stored into a database, or a DHT.

This has actually worked very well up to now, but as I see it, the world is shifting because with the growth of Internet of Things applications, multi-user gaming, and other kinds of applications that have continuously managed online state and need to be responsive within a few milliseconds, we can no longer trust the cloud to be fast enough to meet the requirement. That self-driving car might have a fail-safe approach to handling outages in its cloud-hosted controller, but it won’t want to activate fail-safe mode unnecessarily. The same is true for smart power grid systems: they can operate as dumb systems, but you often really want high availability.

When you write a distributed program as a set of processes with some form of consistently replicated state, you can also take advantage of having multiple processes to gain extra CPU power, for example to perform actions in parallel. With modern machine-learning systems, this could matter, so as we move towards a world of active AI and ML applications, that sense real-world input and instantly react, we’ll also move towards a world with greater and greater need for replicated state and consistent coordinated actions by programs that view themselves as team players. And if that program is long-running, you need it to also tolerate failures, be able to regenerate a node lost due to a crash by integrating a replacement node into the running system, etc.

These needs argue that the future may be a world with far greater use of HPC clusters running applications coded in languages like MPI, and perhaps also far greater deployment of multi-node applications running on general platforms (elsewhere in this blog, on the RDMA discussion, I suggest that HPC with MPI is not a very general infrastructure and that we won’t soon see MPI on datacenters that use fast Ethernet and have multitenancy). For those cases, we’ll need something different – Cornell’s Derecho system is a response to this specific need.

Back on message, what does this tell us about transactions “versus” consistent replication in Paxos or using atomic multicast (some call this in-memory Paxos or RAM Paxos, but I tend to view it as a different abstraction because Paxos is focused on a stateful model of the protocol itself, whereas atomic multicast is usually finished with a message once it hands it to the application – it isn’t required to log the thing, replay the log later, etc). Derecho has both.

So in the world I expect will prevail, we’ll probably have a mix of stateless threads on the edge with more stateful, replicated, multi-node services using a replication solution to run in a fault-tolerant way on a set of nodes. Personally, I think Derecho is the best way to pull this off, replicating state in that kind of a service – much as we did in all the Isis applications years ago. But if you prefer to use JGroups, LibPaxos, Vsync, RaFT or whatever, I’m not going to try very hard to talk you out of that (basically, a team should use whatever fits best for its goals).

So our edge would not have these highly available replicated services and applications, side by side with today’s stateless threads. And my belief is that these highly available, strongly consistent edge services will sometimes need to talk to one-another (service instance to service instance, not internal to the single replicated program).

For example, suppose our highly available service is in charge of helping my self-driving car navigate the New York City traffic and hit the open roads up to Ithaca. It runs on the cloud, and my self-driving car has an autonomous onboard control system that talks to the service when connectivity is good, getting routing advice and so forth (“Hey Ken, its already 6:30 and we are coming up on a great little restaurant near the Delaware Water Gap: should I reserve a table?”). And since there might be 1M cars on the road in the US Northeast, there could be 1M instances of this helper running on various US Northeast datacenters.

Where does that transactional DHT or database enter in? Well, that helper program might be storing my routing plan (like a flight plan) and other information into a database or DHT in order to get help from various backend services that run on big data, or to share my trajectory with other route-helpers for other nearby cars, etc. Just as the little stateless threads share data via a database or DHT, so would these things.

In contrast, they use consistent replication internally, inside the application, for other purposes, like for high availability, application persistence (e.g. if a node crashes and then restarts), etc. The benefit compared to storing all the state in the database or DHT is that you get continuous realtime control with the ability to ride out failures or other configuration problems, and you also get an actual programming model for leveraging a set of nodes as part of one service. This kind of thing is hard to do with a DHT: if your lightweight stateless thread crashes, who relaunches it, and how does it track down its state? Can it be done in milliseconds or less? Not obvious.

Think about future banking systems, smart power grid, smart buildings in smart cities, and you get a longer and longer list of possible use cases fitting the pattern. Which is why I think we need both forms of consistency: replicated services as well as transactional storage layers.

[1] For fairness, I should note that this is a much-debated question: the invention of Paxos is one of those events that would normally jet people into contention for a Turing Award, although in this particular case, two of the main contenders already have Turing Awards (for other work, but you can win that award twice).

Barbara Liskov is quite insistent that she and Brian Oki invented Paxos and she points to the viewstamped replication paper, which appeared in PODC in 1988.

But my own Isis Toolkit (a data replication tool that I created years earlier than the work Oki and Liskov did) has a protocol called Gbcast in it, for group management, and it bisimulates Paxos, which is a fancy way of saying that any execution of Gbcast matches precisely with an execution of Paxos and vice versa. So in this mathematical sense, Gbcast really is Paxos, and once one realizes this, the mapping from one to the other becomes more or less evident; there is a Wikipedia article that discusses this, under the heading Gbcast Protocol).

The trouble is that without a fair amount of manipulation, Gbcast doesn’t look much like Paxos; you need to really think about it to figure out how to transform one into the other, as explained in that Wikipedia article. So while Gbcast is certainly in the class of protocols that Paxos is in, it isn't clear that one can call it Paxos. If anything, the opposite seems to be true: Gbcast shouldn't be viewed as a Paxos protocol.

This said, Gbcast definitely solves the same problem that Paxos solves. Moreover, since we're discussing chronology here, papers on Isis started to appear around 1985, and included this protocol, and I even gave invited talks on the work at MIT in that period. By 1987 all the main Isis protocols had appeared in major conferences or journals. But again, the Isis version of Paxos didn’t look much like Paxos, and the proof of correctness was way less elegant than Lamport’s proofs, so even I would have a hard time claiming that this was really the first Paxos. What I would say is that those who came later would, mostly, have seen this work.

Then, continuing with fair treatment for all, there was yet another protocol, by Larry Stockmeyer, Nancy Lynch and Cynthia Dwork. The protocol looked more like Paxos than my Gbcast protocol, and had a proof a lot like the one that Lamport later used (not identical), and it was published in the same PODC conference proceedings as the Viewstamped Replication paper, in 1988! So we have my work in 1985, then these papers which came out simultaneously in 1988, and then Paxos which circulated as a technical report starting around 1990, but didn’t get published until 1996.

So, who invented Paxos? I lean towards giving the nod to Barbara and Brian, provided that Gbcast is explicitly acknowledged:

Gbcast was the first practical protocol to solve the consensus problem in a way that bisimulates what we now think of as the Paxos protocol. It was not specified in the same manner as Paxos, nor was the proof much like the contemporary proofs. Let’s say 1985, since that was the date of the first paper on the work, and also the time period where I gave an invited talk at MIT.

Viewstamped Replication looks like Paxos, and has the famous Synod pattern in it (the core of Paxos), so let’s call this a true Paxos, in 1988. But the protocol is very deeply integrated with the transactional database they were trying to replicate, and the PODC paper lacked any proofs, so we can’t guess from that at the proof structure. I’m told that Brian Oki had the proofs in his thesis, but am also told that they didn’t look much like the Paxos proofs. So: Synod protocol, and a Paxos-style use case, but apparently the proof was not in the style that made Lamport’s Paxos work so famous.

The consensus protocol of Stockmeyer, Lynch and Dwork, also in 1988. Another close match, but you need to do some mental mapping to convert it to Paxos (like for Gbcast). Call it a close runner up.

True Paxos: let’s date this 1990, when the TR version was widely circulated for the first time. Has the Synod protocol, the full specification, and the proofs have a modern form. Definitely the winner if you don’t give full credit to Viewstamped Replication.

As noted, both Lamport and Liskov have already won the Turing Award, but the awards didn’t point to Paxos in either case (Leslie won for his overall body of contributions, starting much earlier with work on causality and time, and Barbara won for early innovations in programming languages and modularity).

A Few Thoughts on Distributed Computing

Wednesday, 7 December 2016

Transactions [3]: Replication with consistency (what problem are we solving)?

No comments:

Post a Comment