Wednesday, 7 December 2016

Transactions [4]: How does a system like Derecho fit in a world with transactional DHTs?


Derecho is Cornell’s new RDMA-based data replication solution.  It is a simple library coded in C++14 (well, maybe not that simple, but simple to use) that automatically interfaces your code to the local RDMA hardware.  It also automates tasks such as initializing new members when an application process joins an active application or restarts after a serious crash (state transfer), membership coordination and update when members join or fail, persistent data management, atomic multicast, and even RPC (though you can also interface a Derecho application to standard RPC packages like RESTful RPC or WCF).
What I find interesting is that (1) Derecho can be understood in terms of transactional consistency for atomic "one-shot" operations; (2) these consistency guarantees extend to aspects of distributed application structure that are not normally considered to need consistency; and (3) the really interesting use cases might actually need both Derecho and something like FaRM.
Derecho Primer (mostly duplicates content from the RDMA Derecho posting)
So let's get there step by step.  I'll start with a bit of a duplicate, since I discussed Derecho on the RDMA thread too.  But just in case you skipped that, this illustrates an application of the kind Derecho can help you create:




In this example, the really small white circles are processes: instances of a program, each running on some computer.  The picture thus could span tens or even thousands of nodes in a cluster or datacenter.  We’re trying to help you structure large numbers of processes into cooperating applications where each process plays a specific role within the larger picture.
This particular application has external clients that use browser technologies to talk to it: the red squiggles on the top left.  For example, they might use RESTful RPC, which encodes RPC requests as web pages (and similarly for the replies).  REST is widely supported but not very fast, which is why we also have a point-to-point RPC layer internal to Derecho.  Most applications using Derecho would use our layer internally even if they talk to the external world using REST or something similar.
The next Derecho concept is that of the top-level group.  This is the full membership of the whole application.  In the picture the top-level group has (at least) 13 members, because 13 tiny white circles are explicitly shown.  For most uses the members actually run identical code, but play distinct roles.  How can that be?  Well, membership is a kind of input, and each program in a Derecho application knows its own membership rank (1 of 13, 2 of 13, etc.).  So this is enough for them to specialize their behavior.
There is a standard way to specialize behavior, and the picture shows that as well: we associate C++ classes with the top-level group.  This example has 3 classes: LoadBalancer, CacheLayer, and BackEnd.  Each is a normal C++ class, but registered with Derecho.  There is also a notification class, used by the back end to talk to the cache layer (purple circles). 
Derecho uses little functions to define subgroup membership.  They actually are implemented using C++ lambdas.  One such lambda told Derecho that the first three members of the top-level group would be the load-balancer, and we see that at the top.  Another indicated that the cache layer is sharded; in this configuration it seems to have at least 3 shards.  Further, the shards have 3 members each, and there is a shard generator (built into Derecho) that assigned 9 members to this role.  Last, we see that the back end store is also sharded, with 2 replicas per shard, and 4 members in total.
Derecho has all sorts of options for sharding (or at least it will when we finish work on v2 of the system).   Subgroups and shards can overlap, there can be members held in reserve as warm standby processes to jump in instantly after a crash… whatever makes sense to you as the designer.
Next, Derecho has ways to talk to these groups, subgroups or shards: via atomic multicast, which can also be logged into persistent storage.  The latter case matches the specification for Paxos, so in effect, Derecho is a scalable Paxos with an in-memory or a disk execution model, and you can select between the two cases when designing the various components of your application.  The system automatically cleans up after crashes and makes sure that messages are seen in the same order by all members and so forth. 
So with about 15 lines of C++ code per class, more or less, you can instruct Derecho to set up a structure like the one seen here.  And then it just takes a line or so more to send an RPC to some specific member, to issue a multicast to a subgroup or shard, or even to query the members of a subgroup or shard.  The actual delivery of new membership reports (we call them views), multicasts, and RPCs is via upcalls to methods your class defines.  The style of coding looks like a kind of procedure call with polymorphic arguments, and in the case of queries (a multicast to a group where each invoked function returns a result object), you can iterate over the replies in a for loop.
Derecho is blazingly fast.  The API is quite simple and clean, so there isn’t a lot of infrastructure between your code and the RDMA network, and we’ve done some really interesting work to maximize performance of our RDMA sends and multicast operations.  Learn more from the Derecho papers on http://www.cs.cornell.edu/ken, and download the open-source distribution from GitHub.
Derecho plus FaRM 
Visiting at MSRC got me thinking about cases where you might want both Derecho and a transactional DHT, like FaRM or HERD.  I came up with this:
What you see here is two applications, both using Derecho (task A and task B), and then a transactional DHT sitting between them (more properly, present on the same cloud, since all of these span lots of compute nodes).

The case I'm thinking about might arise in a setting like self-driving cars.  Suppose that task A is a helper application for self-driving cars on California highway 101.  It scales out (hence needs lots of read-oriented replicas in the caching layer), has state (car positions and plans), and needs consistency.  And suppose that task B is handling California 280, which is also quite a busy throughway (so it needs its own smart highway helper), but has some overlap with 101.  In fact California might need ten thousand of these tasks just to cover all its busy highway segments.

So we have complex scaled-out applications here that need replication for high availability, fault-tolerance, and very strong consistency, built using Derecho.  The internal state of each lives within it, including persistent state.  But when these tasks share state with one another -- or with external "users" -- the need is different.  In our illustration, we use a transactional DHT to store state shared between these complex, structured applications.

My thinking here is that the DHT is a very good fit for a kind of anonymous sharing: task A wants to report on vehicles that are likely to leave highway 101 for highway 280, but on the other hand doesn't want to "talk directly" to task B about this, perhaps because the sharing model is simply very convenient this way, or perhaps to avoid interdependencies (we don't want task A jamming up if task B is very busy and slow to respond to direct queries).  Anyhow, the manager for route 1, along the coast, might periodically do a quick scan to plan ahead for cars that might be leaving 101 to join route 1 in the future, etc.  By using the DHT we can support a kind of unplanned sharing.

In contrast, the Derecho form of consistency is a great fit for the availability and fault-tolerance needs of our individual tasks -- busy, scaled-out, complex systems that need consistency.  The DHT wouldn't easily support that sort of application: a DHT like FaRM lacks any notion of a complex, long-lived application with state replicated across its member processes.  FaRM uses replication, but only for its own availability, not to replicate application roles or functionality.

So my belief is that we are heading towards a very large-scale world of demanding IoT applications that will really need both models: replication of the form a system like Derecho offers for application design and structure, and transactional DHTs for anonymous sharing.  But I guess we'll have to wait and see whether the application designers faced with building smart highways agree!
