So the question now arises: what about data
replication?
There are really two
issues:
- Data replication inside the database product, or the DHT, or whatever the transactional server might be. This is usually completely invisible to the end user, who just writes small transactional operations and doesn’t need to be aware of how the DHT will handle crashes or similar issues.
- Data replication used in the client systems: either out on the edge (for example, in a smart car, which might need a highly available control system that stays replicated and continuously active even if some component crashes due to a bug or fails because of a hardware glitch), or in the outer tier of the cloud, the layer inside the datacenter where client requests are first received. An example of the latter might be an information service launched on behalf of that smart car, running while the car is actively driving as its cloud-hosted partner: planning the route, anticipating traffic jams, checking for sales at stores you are passing that sell things you seem likely to buy, and so forth.
Both kinds of replication matter, and will
often need strong replica consistency, but notice that the replication
technology would run in different places:
- The first kind of replication runs inside the transactional service itself, to keep the service alive and accessible even if something fails while it is running. The earliest precursor to Lamport’s famous Paxos protocol[1], Oki and Liskov’s Viewstamped Replication, was a solution to precisely this problem: they were building a database and wanted to use replication within it, and ended up with a version of Paxos (it looks just like the protocol in Lamport’s famous theoretical treatment a few years later), deeply integrated with the data storage layer of the database they were building.
- The second form of replication runs in a highly available application, where we may be replicating the active state of the program (perhaps data or data structures it uses as it runs) across a set of program instances that back each other up, stepping in if a failure occurs.
This has actually worked very well up to now, but as I see it, the world is shifting: with the growth of Internet of Things applications, multi-user gaming, and other kinds of applications that have continuously managed online state and need to respond within a few milliseconds, we can no longer trust the cloud to be fast enough to meet the requirement. That self-driving car might have a fail-safe approach to handling outages in its cloud-hosted controller, but it won’t want to activate fail-safe mode unnecessarily. The same is true for smart power grid systems: they can operate as dumb systems, but you often really want high availability.
When you write a distributed program as a
set of processes with some form of consistently replicated state, you can also
take advantage of having multiple processes to gain extra CPU power, for
example to perform actions in parallel.
With modern machine-learning systems, this could matter: as we move towards a world of active AI and ML applications that sense real-world input and react instantly, we’ll also move towards a world with a greater and greater need for replicated state and for consistent, coordinated actions by programs that view themselves as team players. And if such a program is long-running, you also need it to tolerate failures: to regenerate a node lost to a crash by integrating a replacement node into the running system, and so forth. A minimal sketch of this pattern follows.
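Here is a minimal sketch of the state-machine replication pattern I have in mind, written in Python; the class and method names are mine, purely for illustration, not any particular library’s API. It assumes some ordering layer (Paxos, atomic multicast) hands every replica the same updates in the same total order:

```python
# Hypothetical sketch of state-machine replication (illustrative names, not a
# real library's API). Updates are assumed to arrive in one total order.

class Replica:
    def __init__(self, replica_id, num_replicas):
        self.replica_id = replica_id
        self.num_replicas = num_replicas
        self.state = {}       # the replicated application state
        self.applied = 0      # sequence number of the last applied update

    def apply(self, seqno, key, value):
        # Applying identical updates in identical order, deterministically,
        # keeps every replica's state the same.
        assert seqno == self.applied + 1, "updates must be applied in order"
        self.state[key] = value
        self.applied = seqno

    def snapshot(self):
        # State transfer: a replacement node installs this snapshot to join
        # the running system after a crash, then catches up from the stream.
        return self.applied, dict(self.state)

    def install(self, applied, state):
        self.applied, self.state = applied, dict(state)

    def my_share(self, items):
        # The replicas can also act as a team for extra CPU power: each one
        # deterministically takes its own slice of a parallel task.
        return [x for i, x in enumerate(items)
                if i % self.num_replicas == self.replica_id]
```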
These needs argue that the future may be a world with far greater use of HPC clusters running applications coded against libraries like MPI, and perhaps also far greater deployment of multi-node applications running on general platforms (elsewhere in this blog, in the RDMA discussion, I suggest that HPC with MPI is not a very general infrastructure and that we won’t soon see MPI on datacenters that use fast Ethernet and have multitenancy). For those cases, we’ll need something different; Cornell’s Derecho system is a response to this specific need.
Back on message, what does this tell us about transactions “versus” consistent replication, whether done with Paxos or with atomic multicast? (Some call atomic multicast in-memory Paxos or RAM Paxos, but I tend to view it as a different abstraction: Paxos is focused on a stateful model of the protocol itself, whereas atomic multicast is finished with a message once it hands it to the application; it isn’t required to log the message, replay the log later, and so on.) Derecho has both.
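To make the distinction concrete, here is a hypothetical pair of interfaces in Python; the names are mine, not Derecho’s actual API. The Paxos abstraction owns a durable log that survives crashes and can be replayed; the atomic multicast abstraction is done with a message at the moment of ordered delivery:

```python
# Hypothetical interfaces contrasting the two abstractions (illustrative
# names, not any real library's API).

from abc import ABC, abstractmethod

class PaxosLog(ABC):
    @abstractmethod
    def append(self, command) -> int:
        """Reach consensus on the next slot; the command is durably logged
        before it counts as decided. Returns the slot index."""

    @abstractmethod
    def replay(self, from_slot):
        """After a crash, re-read decided commands to rebuild state: the
        protocol itself is stateful."""

class AtomicMulticast(ABC):
    @abstractmethod
    def send(self, group, message):
        """Deliver the message to all live group members in one total order."""

    @abstractmethod
    def on_deliver(self, callback):
        """The protocol's obligation ends at delivery: no log, no replay;
        persistence, if needed, is the application's job."""
```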
So in the world I expect will prevail,
we’ll probably have a mix of stateless threads on the edge with more stateful,
replicated, multi-node services using a replication solution to run in a
fault-tolerant way on a set of nodes.
Personally, I think Derecho is the best way to pull this off, replicating
state in that kind of a service – much as we did in all the Isis applications
years ago. But if you prefer to use
JGroups, LibPaxos, Vsync, Raft or whatever, I’m not going to try very hard to
talk you out of that (basically, a team should use whatever fits best for its
goals).
So our edge would now have these highly available replicated services and applications, side by side with today’s stateless threads. And my belief is that
these highly available, strongly consistent edge services will sometimes need
to talk to one another (service instance to service instance, not internal to
the single replicated program).
For example, suppose our highly available
service is in charge of helping my self-driving car navigate the New York City
traffic and hit the open roads up to Ithaca.
It runs in the cloud, and my self-driving car has an autonomous onboard control system that talks to the service when connectivity is good, getting routing advice and so forth (“Hey Ken, it’s already 6:30 and we are coming up on
a great little restaurant near the Delaware Water Gap: should I reserve a
table?”). And since there might be 1M
cars on the road in the US Northeast, there could be 1M instances of this
helper running in various US Northeast datacenters.
Where does that transactional DHT or
database enter in? Well, that helper
program might be storing my routing plan (like a flight plan) and other
information into a database or DHT in order to get help from various backend
services that run on big data, or to share my trajectory with the route-helpers for other nearby cars, etc. Just as the little stateless threads share data via a database or DHT, so would these things; the sketch below shows the pattern.
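To illustrate, here is a toy version of that sharing pattern in Python; the DHT client and its key scheme are stand-ins I invented, not any real service’s API:

```python
# Toy in-memory stand-in for a DHT (put/get/keys_with_prefix are
# illustrative names, not a real client API).
class ToyDHT:
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

    def keys_with_prefix(self, prefix):
        return [k for k in self._store if k.startswith(prefix)]

def publish_plan(dht, car_id, grid_cell, plan):
    # The helper stores the full routing plan ("flight plan") keyed by car,
    # plus an index entry keyed by coarse map cell, so helpers for nearby
    # cars can find one another.
    dht.put(f"route-plan/{car_id}", plan)
    dht.put(f"trajectory/{grid_cell}/{car_id}", plan["trajectory"])

def nearby_trajectories(dht, grid_cell):
    return [dht.get(k) for k in dht.keys_with_prefix(f"trajectory/{grid_cell}/")]

# Two helpers sharing through the DHT, just as stateless threads would:
dht = ToyDHT()
publish_plan(dht, "car-17", "cell-40.7n-74.0w",
             {"route": ["NYC", "Ithaca"], "trajectory": [(40.7, -74.0)]})
print(nearby_trajectories(dht, "cell-40.7n-74.0w"))
```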
In contrast, they use consistent replication internally, inside the application, for other purposes: high availability, application persistence (e.g. if a node crashes and then restarts), and so on. The benefit compared to
storing all the state in the database or DHT is that you get continuous
realtime control with the ability to ride out failures or other configuration
problems, and you also get an actual programming model for leveraging a set of
nodes as part of one service. This kind of thing is hard to do with a DHT: if your lightweight stateless thread crashes, who relaunches it, and how does it track down its state? Can that be done in milliseconds or less? Not obvious; the sketch below contrasts the two recovery paths.
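Here is a toy contrast of the two recovery paths, with invented names: the replicated service already holds its state in memory on every member, so failover is a local step, whereas the relaunched stateless thread must be rescheduled and then refetch its state:

```python
# Toy contrast of failover latency (all names illustrative).

def failover_in_group(members, state, failed):
    # Every surviving member already holds `state` in memory: drop the failed
    # member and let the lowest-ranked survivor take over immediately.
    survivors = [m for m in members if m != failed]
    return survivors[0], state        # no state fetch on the critical path

def relaunch_from_dht(scheduler, dht, task_id):
    # Someone must detect the crash, spawn a new thread, and reload state
    # from the DHT: each step adds latency and another thing that can fail.
    worker = scheduler.spawn(task_id)           # hypothetical scheduler call
    worker.state = dht.get(f"task-state/{task_id}")
    return worker

# In-group failover completes using purely local data:
primary, state = failover_in_group(["a", "b", "c"], {"leg": 3}, failed="a")
print(primary, state)                 # -> b {'leg': 3}
```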
Think about future banking systems, smart
power grid, smart buildings in smart cities, and you get a longer and longer
list of possible use cases fitting the pattern.
Which is why I think we need both forms of consistency: replicated
services as well as transactional storage layers.
[1] For fairness, I should note that this is a much-debated question: the invention of Paxos is one of those events that would normally vault people into contention for a Turing Award, although in this particular case, two of the main contenders already have Turing Awards (for other work, though in principle one can win the award twice).
Barbara Liskov is quite insistent
that she and Brian Oki invented Paxos, and she points to the Viewstamped Replication paper, which appeared in PODC in 1988.
But my own Isis Toolkit (a
data replication tool that I created years earlier than the work Oki and Liskov
did) has a protocol called Gbcast in it, for group management, and it
bisimulates Paxos, which is a fancy way of saying that any execution of Gbcast
matches precisely with an execution of Paxos and vice versa. So in this mathematical sense, Gbcast really
is Paxos, and once one realizes this, the mapping from one to the other becomes more or less evident (there is a Wikipedia article that discusses this, under the heading Gbcast Protocol).
The trouble is that without a fair amount of manipulation, Gbcast doesn’t look much like Paxos; you need to really think about it to figure out how to transform one into the other, as explained in that Wikipedia article. So while Gbcast is certainly in the class of protocols that Paxos is in, it isn't clear that one can call it Paxos. If anything, the opposite seems to be true: Gbcast shouldn't be viewed as a Paxos protocol.
This said, Gbcast definitely solves the same problem that Paxos solves. Moreover, since we're discussing chronology here, papers on Isis started to appear around 1985,
and included this protocol, and I even gave invited talks on the work at MIT in
that period. By 1987 all the main Isis
protocols had appeared in major conferences or journals. But again, the Isis version of Paxos didn’t
look much like Paxos, and the proof of correctness was way less elegant than
Lamport’s proofs, so even I would have a hard time claiming that this was
really the first Paxos. What I would say
is that those who came later would, mostly, have seen this work.
Then, continuing with fair treatment
for all, there was yet another protocol, by Larry Stockmeyer, Nancy Lynch and
Cynthia Dwork. The protocol looked more
like Paxos than my Gbcast protocol, and had a proof a lot like the one that
Lamport later used (not identical), and it was published in the same PODC
conference proceedings as the Viewstamped Replication paper, in 1988! So we have my work in 1985, then these papers, which came out simultaneously in 1988, and then Paxos itself, which circulated as a technical report starting around 1990 but didn’t get published until 1996.
So, who invented Paxos? I lean towards giving the nod to Barbara and
Brian, provided that Gbcast is explicitly acknowledged:
- Gbcast was the first practical protocol to solve the consensus problem in a way that bisimulates what we now think of as the Paxos protocol. It was not specified in the same manner as Paxos, nor was the proof much like the contemporary proofs. Let’s say 1985, since that was the date of the first paper on the work, and also the time period when I gave an invited talk at MIT.
- Viewstamped Replication looks like Paxos, and has the famous Synod pattern in it (the core of Paxos), so let’s call this a true Paxos, in 1988. But the protocol is very deeply integrated with the transactional database they were trying to replicate, and the PODC paper lacked any proofs, so we can’t guess from that at the proof structure. I’m told that Brian Oki had the proofs in his thesis, but am also told that they didn’t look much like the Paxos proofs. So: Synod protocol, and a Paxos-style use case, but apparently the proof was not in the style that made Lamport’s Paxos work so famous.
- The consensus protocol of Stockmeyer, Lynch and Dwork, also in 1988. Another close match, but you need to do some mental mapping to convert it to Paxos (as with Gbcast). Call it a close runner-up.
- True Paxos: let’s date this 1990, when the TR version was widely circulated for the first time. It has the Synod protocol and the full specification, and the proofs have a modern form. Definitely the winner if you don’t give full credit to Viewstamped Replication.
As noted, both Lamport and Liskov
have already won the Turing Award, but the awards didn’t point to Paxos in
either case (Leslie won for his overall body of contributions, starting much
earlier with work on causality and time, and Barbara won for early innovations
in programming languages and modularity).