Wednesday, 7 March 2018

Data stream computing

I've actually posted on this topic in the past, but I spend so much of my own time thinking about it that I'm going to revisit the question from my "revised and improved" perspective, a year or so down the road.

Context: with the move towards a disaggregated data center model in which we use RDMA or other forms of next-generation direct access to talk to remote memory (perhaps persistent, perhaps not), we find ourselves in a world where the underlying costs and barriers have shifted dramatically relative to the programming style familiar in the past.

Specific example: Consider 2-phase commit, or your favorite consensus protocol, or Paxos.  In the recent past, we would program these protocols by having some leader who initiates the protocol and sends out a request to the other participants: "can you commit transaction 71?" or "I vote 1", or  "I propose that we place value v into Paxos log slot 22, and this is my 3rd ballot for that slot."  Responses eventually come back, and after a round or two of these back-and-forth interactions, the protocol terminates.

In that classic way of thinking, there was a kind of implicit balance in mind between the cost of moving the data (value v in the example above) and the cost of the back-and-forth round-trips.  I'm not quite sure how to best express this sense of balance, but what one finds is that the cost of the round-trip actions (the RTT, as we say in the business...) is sort of "reasonable" relative to the bandwidth available for moving the value field, often over TCP connections or UDP.

Today, with new forms of direct remote memory access, talking to remote memory over a technology like RDMA is actually faster than talking to local memory, but only in the sense that RDMA can sustain a far higher bandwidth than local actions like using memcpy (the fastest available copying primitive on a Linux/C computing platform) to move data from one place in memory to another, assuming that the data in question isn't in the L2 cache.   As RDMA speeds up, this gap will only get larger.  So it is very cheap to move data from machine A to machine B.

In contrast, the RTT for a modern optical Ethernet is surprisingly high, when you consider quite how fast copying can be.  We see RTTs in the 1.5us to 2.5us range: quite fast compared to the 1-10ms numbers that were common for protocols running over TCP/UDP/IP, but relatively sluggish if you consider the number of bytes that could have been sent in that time, particularly if you add in the delays for scheduling a thread to notice the request coming in (when A's message arrives on B), or to send the reply (when B's response reaches A).

If A blocks sending while waiting for this RTT to complete, quite a few bytes won't be sent.  And as RDMA gets faster (today, 100Gbps, but tomorrow 200, then 400, and 1Tbps is within range now), the amount of missed transmission opportunity gets pretty substantial.

Concurrency feels like the obvious answer,  but this instinctive reaction proves to be naive.  The theory would be that if A has lots of side by side transactions with B, then while one waits, another can transmit.  But that overlooks overheads: concurrency also brings many kinds of costs, and this model is anything but a clear win!  Those costs could quickly add up: locking to ensure that the transfers won't interfere with one-another, memory contention on the DRAM chunks where the data lives, various forms of state that might be sitting around while the round-trip interactions are running (threads consume stack space, for example).  So concurrency doesn't come for free.

Batching is a second possibility:  A could collect a bunch of requests, then send them as a batch (while collecting the next batch), B could respond on the whole series of requests (one by one, but as a batch of responses), and so forth.  This amortizes costs in a nice way.

Batching makes sense on a receiver, too.  Rather than sending batches and forcing the receiver to use those batch sizes, a sender could send continuously, and the receiver could bite off a chunk of incoming records based on when it happens to be ready to run. So if we have A sending continuously, rather than having A batch 10 items at a time, it could stream steadily to B.  B might be ready for its next chunk of input at a moment when there are 11 items available; it treats these as a batch of 11.  Then at the next loop, B is perhaps ready for input when 4 new items have turned up: a batch of 4.  We still see a benefit of amortized costs, but the batch sizes are variable and determined by when B happens to be ready, rather than fixed and decided by A (the sender).

Derecho works this way.

Now why are we discussing these obvious points?  Well, the optimality of Derecho is directly tied to its use of this style of receiver-side batching.    The key "thing" about Derecho is that it runs a version of Paxos modified to make decisions as soon as possible, based on the Paxos notion of safety, using receiver-side batches.  This allows us to stream an exceptionally high rate of Paxos requests, process them with no unnecessary delays, and amortize all costs over batches of events.

In fact, the value of this method centers on the steady stream of updates.  With one update, by itself, classic Paxos is a reasonable choice of protocol.  But with a stream of updates, we need to amortize costs for efficiency.  The classic protocols run in a greedy way, and cease to be efficient because they miss this opportunity!

So now the question arises: is this a story purely about Derecho, or does the same opportunity transition to other kinds of systems that aren't doing atomic multicast (vertical Paxos) and atomic log updates (classic durable Paxos)?  For example, could a distributed file system somehow gain performance by being implemented to use the same sort of pattern?  Could a database system be optimized by leveraging this sort of transformation?

And we can pose the problem even more broadly.   As a programming question (almost a compilation question), is there a systematic programming methodology here, one that would lead the developer to an optimized solution matched to the underlying cost model mentioned earlier (the cost model in which steady flow works extremely well at high rates and with large data volumes, but RTT looks somewhat costly)?  How could one start with a specification of a file system, or a specification of Paxos, and systematically arrive at a solution like Derecho?

I find it fascinating that despite all the work that has been done on stream processing in database systems, there seems to be very little prior work on the kind of protocol optimization I've outlined.  The underlying "memory model" is really just an exaggerated NUMA one, with a cache-line memory coherence model, remote memory access, but with unusually high latencies relative to the extremely  high bandwidth.   By and large, my colleagues in the PL community view the resulting questions as being variations on what they are already used to: they have extensive experience with NUMA programming.  Yet here we can see that we actually are after solutions that can be quite different from the classic ones: Derecho doesn't look a lot like Paxos at first glance, and the optimality we achieved emerges from the Paxos protocol transformations that let us optimize for the RDMA environment.  So there was something non-trivial happening here, even though it happens over the standard NUMA model.

I don't know if we have PL readers of this blog, but I would love to hear their thoughts, if so!