A quick summary of this whole set of postings:
· Transactions, and in particular, transactional key-value stores, are a genuinely exciting technology for sharing information between tasks that are mutually anonymous. By this, I mean tasks A and B that update and share information about some sort of external reality, but where A has no direct knowledge of B and vice versa: they are not directly coordinating their actions, or running as two parts of some overarching structured application.
o The ACID properties make sense for this use case.
o The model of lightweight tasks that use the transactional key-value store as their sole persistent store is very natural for the cloud, so we see this approach in PaaS solutions.
o We often say that the tasks are stateless, meaning that each one is always launched in the original state defined by its virtual machine or its code; no state is kept from activation to activation (except in the transactional store, where state persists).
· Group replication is a great model for high availability programming. You design with a state machine style of execution in mind, although generally the rigid lock-step behavior of a state machine occurs only in replicated objects that might be just a small subset of the functionality of the overall application.
o For example, a modern application might have a load-balancer, a cache that is further subdivided into shards using a key-value model, and then each of the shards might be considered as a state machine replicated group for updates, but its members might operate independently for RPC-style read-only operations (to get read parallelism). Then perhaps the same application might even have a stateful persistent backend.
o We use virtually synchronous atomic multicast and virtually synchronous Paxos in such applications: virtual synchrony ensures consistency of the membership updates, and then the atomic multicast or Paxos protocol handles delivery of messages or state updates within the group.
· In today’s cloud we might want both functionalities, side by side.
o The virtual synchrony group mechanisms make it easy to build a scalable fault-tolerant distributed program that will be highly available, responding very rapidly to external stimuli. You would want such a program in an Internet of Things setting, to give one example.
o But if you run such a replicated program many times, separately, the program as a whole is quite similar to the tasks mentioned in the first bullet: each of these multi-component programs is like a task. And we might do that for a setting like a cloud-hosted controller for a smart car, where we want high availability and speed (hence the group structure), but also prefer to have one of these applications (replication and all) per car. So we could launch a half million of them on the same cloud, if we were interested in controlling the half million or so cars driving on the California state highway network on a typical afternoon.
o These group-structured programs might then use a transactional key-value store to share data between instances (in effect, the car-to-car dialog would be through a shared transactional store, while the single-car-control aspect would be within the group).
Here’s a picture illustrating a group-structured program like the one just mentioned:
In this example we actually see even more groups than were mentioned above: the back-end is sharded too, and then there are groups shown that might be used by the back-end to notify the cache layer when updates occur, as is needed in a so-called coherently cached, strongly consistent store. Paxos would be useful in the back-end, whereas the cache layer would be stateless and hence only needs atomic multicast (which is much faster). Applications like this are easy to build with Derecho, and they run at the full speed of your RDMA network, if you have RDMA deployed.
The point then is that replication is used here for two purposes: to ensure rapid failure recovery of the application instance (this is the role of the replicated data in the back-end store), and to scale up to handle high query rates (this is the role of the replication in the cache layer). But this form of replication is entirely focused on local state: the information needed to control one car, for example.
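To make the cache layer a little more concrete, here is a minimal single-process Python sketch of a sharded cache in which each shard is a small replicated group. It is not Derecho code: the class names (Replica, Shard, ShardedCache) are hypothetical, and the loop in Shard.update is only a stand-in for an atomic multicast that delivers every update to every member in the same order. Reads go to a single replica, which is where the read parallelism comes from.

```python
import zlib


class Replica:
    """One member of a shard. It applies updates strictly in sequence order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of updates applied so far

    def apply(self, seq, key, value):
        # State machine rule: every replica sees the same updates, same order.
        assert seq == self.applied, "update delivered out of order"
        self.data[key] = value
        self.applied += 1

    def read(self, key):
        return self.data.get(key)


class Shard:
    """A replicated group holding one slice of the key space."""

    def __init__(self, n_replicas):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.next_seq = 0
        self.next_reader = 0

    def update(self, key, value):
        # Stand-in for atomic multicast (assumption): one sequence number,
        # delivered to every member of the group.
        seq = self.next_seq
        self.next_seq += 1
        for replica in self.replicas:
            replica.apply(seq, key, value)

    def read(self, key):
        # Read-only requests are served by one replica, chosen round-robin.
        replica = self.replicas[self.next_reader % len(self.replicas)]
        self.next_reader += 1
        return replica.read(key)


class ShardedCache:
    """Maps each key to a shard, as a load balancer in front of the cache might."""

    def __init__(self, n_shards=4, n_replicas=3):
        self.shards = [Shard(n_replicas) for _ in range(n_shards)]

    def shard_for(self, key):
        # crc32 gives a hash that is stable from run to run.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def put(self, key, value):
        self.shard_for(key).update(key, value)

    def get(self, key):
        return self.shard_for(key).read(key)


cache = ShardedCache()
cache.put("speed", 62)
cache.put("lane", 2)
cache.put("speed", 58)
print(cache.get("speed"), cache.get("lane"))
```

In a real deployment the members of a shard would be separate processes, and the membership of each group would itself change over time, which is the part that virtual synchrony handles.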
I tried to make a picture with a few of these structured applications sharing information through a key-value store, but it gets a little complex if I show more than two. So here are two of them: task A on the left and task B on the right. The transactional DHT in the middle could be FaRM or Herd:
Notice the distinction: local state was replicated in each of task A and task B (handling two different vehicles), but the global state that defines the full state of the world, as it were, is stored in the transactional key-value database that they share. This matters: processes running within A or within B know the structure of the overall car-manager task in which they run. Subtasks of a single task actually know about one another and cooperate actively. In contrast, when we consider task A as a unit, and task B as a unit, the picture changes: A does not really know that B is active and they do not normally cooperate directly. Instead, they share data through the DHT: the transactional key-value subsystem.
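The sharing pattern can be sketched in a few lines of Python. This is a toy, in-memory store that uses optimistic concurrency control (versioned reads, validated at commit time); it does not model the actual FaRM or Herd interfaces, and every name in it (TransactionalStore, Transaction, Conflict, run_task, report_position, the segment key) is hypothetical. The point is only that task A and task B never refer to each other: each one runs a transaction against the shared store.

```python
class Conflict(Exception):
    """Raised at commit time if a key read by the transaction has changed."""


class TransactionalStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self._store = store
        self._reads = {}   # key -> version observed on first read
        self._writes = {}  # key -> new value, buffered until commit

    def get(self, key, default=None):
        if key in self._writes:
            return self._writes[key]
        version, value = self._store._data.get(key, (0, default))
        self._reads.setdefault(key, version)
        return value

    def put(self, key, value):
        self._writes[key] = value

    def commit(self):
        data = self._store._data
        # Validate: nothing we read may have been updated since we read it.
        for key, seen in self._reads.items():
            if data.get(key, (0, None))[0] != seen:
                raise Conflict(key)
        # Install all buffered writes together, bumping each version.
        for key, value in self._writes.items():
            version = data.get(key, (0, None))[0]
            data[key] = (version + 1, value)


def run_task(store, body, retries=5):
    """Run body(txn) as a transaction, retrying if it conflicts."""
    for _ in range(retries):
        txn = store.begin()
        result = body(txn)
        try:
            txn.commit()
            return result
        except Conflict:
            continue
    raise Conflict("gave up after repeated conflicts")


def report_position(car_id, segment):
    """Body of a task that records its car on a highway segment."""
    def body(txn):
        cars = txn.get(segment, frozenset())
        txn.put(segment, cars | {car_id})
        return len(cars)  # how many cars were already on the segment
    return body


store = TransactionalStore()
segment = "segment-12"  # hypothetical key naming a stretch of highway
others_seen_by_a = run_task(store, report_position("car-A", segment))
others_seen_by_b = run_task(store, report_position("car-B", segment))

# Two overlapping transactions on the same key: the second commit must fail.
t1, t2 = store.begin(), store.begin()
t1.put("count", t1.get("count", 0) + 1)
t2.put("count", t2.get("count", 0) + 1)
t1.commit()
try:
    t2.commit()
    conflict_detected = False
except Conflict:
    conflict_detected = True
```

Here task B learns that another car is on the same segment without ever knowing that task A exists. The last few lines show why the transactional guarantee matters: without commit-time validation, the second increment would silently overwrite the first.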
As in the case of RDMA, there is a lot more to be said here. I've posted sub-topic remarks on a few of the points that I find most interesting.