Wednesday, 24 July 2019

In theory, asymptotic complexity matters. In practice...

Derecho matches Keidar and Shraer’s lower bounds for dynamically uniform agreement:  No Paxos protocol can  safely deliver messages with fewer "information exchange" steps.  But does this matter?

Derecho targets a variety of potential deployments and use cases.  A common use would be to replicate state within some kind of "sharded" service -- a big pool of servers but broken into smaller replicated subservices that use state machine replication in subsets of perhaps 2, 3 or 5.  A different use case would be for massive replication -- tasks like sharing a VM image, a container, or a machine-learned model over huge numbers of nodes.  In those cases the number of nodes might be large enough for asymptotic protocol complexity bounds to start to matter -- Derecho's optimality could be a winning argument.  But would an infrastructure management service really stream high rates of VM images, containers, and machine-learned models? I suspect that this could arise in future AI Systems... it wouldn't today.

All of which adds up to an interesting question: if theoretical optimality is kind of a "meh" thing, what efficiency bounds really matter for a system like Derecho?  And how close to ideal efficiency can a system like this really come?

To answer this question, let me start by arguing that 99% of Derecho can be ignored.  Derecho actually consists of a collection of subsystems: you link your C++ code to one library, but internally, that library has several distinct sets of "moving parts".  A first subsystem is concerned with moving bytes: our data plane.  The second worries about data persistency and versioning.  A third is where we implement the Paxos semantics: Derecho's control plane.  In fact it handles more than just Paxos -- Derecho's control plane is a single thread that loops through a set of predicates, testing them one by one and then taking triggered actions for any predicate that turns out to be enabled.  A fourth subsystem handles requests that query the distributed state: it runs purely on data that has become stable and is totally lock-free and asynchronous -- the other three subsystems can ignore this one entirely.  In fact the other three subsystems are as lock-free and asynchronous as we could manage, too -- this is the whole game when working with high speed hardware, because the hardware is often far faster than the software that manages it.  We like to think of the RDMA layer and the NVM storage as two additional concurrent systems, and our way of visualizing Derecho is a bit like imagining a machine with five separate moving parts that interact in a few spots, but are as independent as we could manage.

For steady state performance -- bandwidth and latency -- we can actually ignore everything except the update path and the query path.  And as it happens, Derecho's query path is just like any query-intensive read-only subsystem: it uses a ton of hashed indices to approximate one-hop access to objects it needs, and it uses RDMA if that one hop involves somehow fetching data from a remote node, or sending a computational task to that remote node.  This leads to fascinating questions, in fact: you want those paths to be lock-free, zero-copy, ideally efficient, etc.  But we can set those questions to the side for our purposes here -- results like the one by Keidar and Shraer really are about update rates.  And for this, as noted a second ago, almost nothing matters except the data-movement path used by the one subsystem concerned with that role.  Let's have a closer look.

For large transfers Derecho uses a tree-based data movement protocol that we call a binomial pipeline.  In simple terms, we build a binary tree, and over it, create a flow pattern of point to point block transfers that obtains a high level of internal concurrency, like a two-directional bucket brigade -- we call this a reliable multicast over RDMA, or "RDMC").  Just like in an actual bucket brigade, every node settles into a steady behavior, receiving one bucket of data (a "chunk" of bytes) as it sends some other bucket, more or less simultaneously.  The idea is to max-out the RDMA network bandwidth (the hardware simply can't move data more efficiently).  The actual data structure creates a hypercube "overlay" (a conceptual routing diagram that lives on our actual network, which allows any-to-any communication) of dimension d, and then wraps d binomial trees over it, and you can read about it in our DSN paper, or in the main Derecho paper.

A binary tree is the best you can hope for when using point-to-point transfers to replicate large, chunked, objects.  And indeed, when we measure RDMC, it seems to do as well as one can possibly do on RDMA, given that RDMA lacks a reliable one-to-many chunk transfer protocol.   So here we actually do have an ideal mapping of data movement to RDMA primitives.

Unfortunately, RDMC isn't very helpful for data too small to "chunk".  If we don't have enough data a binomial tree won't settle into its steady-state bucket brigade mode and we would just see a series of point-to-point copying actions.  This is still "optimal" at large-scale, but recall that often we will be replicating in a shard of size two, three or perhaps five.  We decided that Derecho needed a second protocol for small multicasts, and Sagar Jha implemented what he calls the SMC protocol.

SMC is very simple.  The sender, call it process P, has a window, and a counter.  To send a message, P places the data in a free slot in its window (each sender has a different window, so we mean "P's window"), and increments the counter (again, P's counter).  When every receiver (call them P, Q and R: this protocol actually loops data back, so P sends to itself as well as to the other shard members) has received the message, the slot is freed and P can reuse it, round-robin.  In a shard of size three where all the members send, there would be one instance of this per member: three windows, three counters, three sets of receive counters (one per sender).

SMC is quite efficient with small shards.  RDMA has a direct-remote-write feature that we can leverage (RDMC uses a TCP-like feature where the receiver needs to post a buffer before the sender transmits, but this direct write is different: here the receiver declares a region of memory into which the sender can do direct writes, without locking).

Or is it?  Here we run into a curious philosophical debate that centers on the proper semantics of Derecho's ordered_send: should an ordered_send be immediate, or delayed for purposes of buffering, like a file I/O stream?  Sagar, when he designed this layer, opted for urgency.  His reasoning was that if a developer can afford to batch messages and send big RDMC messages that carry thousands of smaller ones, this is exactly what he or she would do.  So a developer opting for SMC must be someone who prioritizes immediate sends, and wants the lowest possible latency.

So, assume that ordered_send is required to be "urgent".  Let's count the RDMA operations that will be needed to send one small object from P to itself (ordered_send loops back), Q and R.  First we need to copy the data from P to Q and R: two RDMA operations, because  reliable one-sided RDMA is a one-to-one action.  Next P increments its full-slots counter and pushes it too -- the updated counter can't be sent in the same operation that sends the data because RDMA has a memory consistency model under which a single operation that spans different cache-lines only guarantees sequential consistency on a per-cache-line basis, and we wouldn't want P or Q to see the full-slots counter increment without certainty that the data would be visible to them.  You need two distinct RDMA operations to be sure of that (each is said to be "memory fenced.")  So, two more RDMA operations are required.  In our three-member shard, we are up to four RDMA operations per SMC multicast.

But now we need acknowledgements.  P can't overwrite the slot until P, Q and R have received the data and looked at it, and to report when this has been completed, the three update their receive counters.  These counters need to be mirrored to one-another (for fault-tolerance reasons), so P must send its updated receive counter to Q and R, Q to P and R, and R to P and Q: six more RDMA operations, giving a total of ten.  In general with a shard of size N, we will see 2*(N-1) RDMA operations to send the data and count, and N*(N-1) for these receive counter reports, a total of N^2+N-2.  Asymptotically, RDMC will dominate because of the N^2 term, but N would need to be much larger than five for this to kick in.  At a scale of two to five members, we can think of N as more or less a constant, and so this entire term is like a constant.

So by this argument, sending M messages using SMC with an urgent-send semantic "must" cost us M*(N^2+N-2) RDMA operations.  Is this optimal?

Here we run into a hardware issue.  If you check the specification for the Connect X4 Mellanox device used in my group's experiments, you'll find that it can transmit 75M RDMA messages per second, and also that it has peak performance of 100Gbps (12.5GB) in each link direction.   But if your 75M messages are used to report updates to tiny little 4-byte counters, you haven't used much of the available bandwidth: 75M times 4 bytes is only 300MB/s, and as noted above, the device is is bidirection.  Since we are talking about bytes, the bidirectional speed could be as high as 25GB/s with an ideal pattern of transfers.  Oops: we're too slow by a factor of 75x!

In our TOCS paper SMC peaks at around 7.5M small messages per second, which bears out this observation.  We seem to be leaving a lot of capacity unused.  If you think about it, everything centers on the assumption that ordered_send should be as urgent as possible.  This is actually limiting performance and for applications that average out at 7.5M SMC messages per second or less, but have bursts that might be much higher, this is even inflating latency (a higher-rate burst will just fill the window and the sender will have to wait for a slot).

Suppose our sender wants fast SMC streaming and low latency, and simply wasn't able to do application-level batching (maybe the application has a few independent subsystems of its own that send SMC messages).  Well, everyone is familiar with file I/O streaming and buffering.  Why not use the same idea here?

Clearly we could have aggregated a bunch of SMC messages, and then done one RDMA transfer for the entire set of full window slots (it happens that RDMA has a so-called "scatter gather/put feature", and we can use that to transfer precisely the newly full slots even if they wrap around the window).  Now one counter update covers the full set.  Moreover, the receivers can do "batched" receives, and one counter update would then cover the full batch of receives.

An SMC window might have 1000 sender slots in it, with the cutoff for "small" messages being perhaps 100B.  Suppose we run with batches of size 250.  We'll have cut the overhead factors dramatically: for 1000 SMC messages in the urgent approach, the existing system would send 1000*10 RDMA messages for the 3-member shard: 10,000 in total.  Modified to batch 250 messages at a time, only 40 RDMA operations are needed: a clean 250x improvement.  In theory, our 7.5M SMC messages per second performance could then leap to 1.9B/second.  But here, predictions break down: With 100 byte payloads, that rate would actually be substantially over the limit we calculated earlier, 25GB/s, which limits us to 250M SMC messages per second.  Still, 250M is quite a bit faster than 7.5M and worth trying to achieve.

It might not be trivial to get from here to there, even with batching.  Optimizations at these insane data rates often aren't be nearly as simple as a pencil-and-paper calculation might suggest.  And there are also those urgency semantics issues to think about:  A bursty sender might have some gaps in its sending stream.  Were one to occur in the middle of a 250 message batch, we shouldn't leave those SMC messages dangling: some form of automatic flush has to kick in.  We should also have an API operation so that a user could explicitly force a flush.  

Interestingly, once you start to think about this, you'll realize that in this latency sense, Sagar's original SMC is probably "more optimal" than any batched solution can be.  If you have just one very urgent notification to send, not a batch, SMC is already a very low-latency protocol; arguably, given his argument that the API itself dictates that SMC should be an urgent protocol, his solution actually is "ideally efficient."  What we see above is that if you question that assumption, you can identify an inefficiency -- not that the protocol as given is inefficient under the assumptions it reflects.

Moral of the story?  The good news is that right this second, there should be a way to improve Derecho performance for small messages, if the user is a tiny bit less worried about urgency and would like to enable a batching mode (we can make it a configurable feature).  But more broadly, you can see is that although Derecho lives in a world governed in part by theory, in the extreme performance range we target and with the various hardware constraints and properties we need to keep in mind, tiny decisions can sometimes shape performance to a far greater degree.

I happen to be a performance nut (and nobody stays in my group unless they share that quirk).  Now that we are aware of this SMC performance issue, which was actually called to our attention by Joe Israelevitz when he compared his Acuerdo protocol over RDMA with our Derecho one for 100B objects and beat us hands-down,  we'll certainly tackle it.  I've outlined one example of an optimization, but it will certainly turn out that there are others too, and I bet we'll end up with a nice paper on performance, and a substantial speedup, and maybe even some deep insights.  But they probably won't be insights about protocol complexity.  At the end of the day, Derecho may be quite a bit faster for some cases, and certainly this SMC one will be such a case.  Yet the asymptotic optimality of the protocol will not really have been impacted: the system is optimal in that sense today!  It just isn't as fast as it probably should be, at least for SMC messages sent in high-rate streams!