A Few Thoughts on Distributed Computing: Should Derecho do optimistic early delivery?

In the Isis and Vsync systems, we supported a form of early delivery that could be described as "optimistic": messages were delivered as soon as they arrived, before the system was certain safety had been achieved, enabling the application to start processing them, asynchronously with respect to the background exchanges of acknowledgements required to learn that stability had been reached.

For example, suppose that a client process, P, asked a process Q to take some action. Perhaps as a side effect of this, Q updates data replicated in group G with members {Q, R, S}. In an optimistic case, the update multicast itself might return as soon as the multicast was in process, and the delivery events could occur as soon as ordering was determined. Latency is thus minimal.

The trouble with optimistic early delivery is that a crash can erase the multicast at the first stages of such a protocol. Thus there might be a window of a milisecond or so during which a power failure could kill some process after it sent (or received and delivered ) such a multicast, but before it was fully delivered at other members. Perhaps we would end up with R having delivered (and acted upon the message), and use of the crash, S never gets a copy.

This is a bad behavior: normally, one wants failure atomicity. Accordingly, the Isis or Vsync application developer who used these optimistic protocols was told to invoke a flush primitive before taking actions that might remain visible after a crash, like giving out cash from the ATM, or sending a reply to an external client application. The flush primitive simply waited until stability was achieved, returning after the message had become safe. Thus if R called flush before handing out cash, either the multicast stabilized and we distribute the money, or the crash happens while flush is waiting, and R is killed, but no money was handed out, hence nothing bad has occurred.

The idea was that optimistic delivery plus flush would be equivalent to pessimistic delivery, but that during the period before flush was called, computing overlaps with background message passing and extra work can be done. In Vsync and Isis, the totally ordered multicast was optimistic, whereas the "safe" primitive was pessimistic. You may have heard of vertical Paxos protocol: this is also a pessimistic protocol. Pessimistic protocols in effect call flush before the application ever sees the message, so that the application layer is always safe.

As a teaching device, we would ask new Isis or Vsync developers to think about the way that buffered disk writes are done in a file system. The application writes some text, but it ends up buffered first by the file I/O library, and then in the kernel. Seconds can elapse before the data is on the disk. The value of buffering is speed: performance is dramatically improved, far better than if fflush is called character by character or even line by line.

Today I'm revisiting this same question in Derecho. Our basic OrderedSend is pessimistic, and pretty fast. We also have a raw version of send that lacks ordering guarantees or failure atomicity and runs at the highest speeds possible. But do we need an optimistic delivery too?

The case for optimistic deliver centers on this question: are there applications that need strong properties, yet also need the small degree of overlapped computing afforded by optimism? A pessimistic delivery rarely delays more than a few microseconds relative to an optimistic one. So you can deliver a small multicast after, say, 12us in an optimistic mode, but might have to wait 25us to deliver it with full safety (pessimistically).

In our experiments, the performance gap is most extreme for small messages. Our impression, so far, is that these suffer from two kinds of delay. First, with very small multicasts, the actual time needed to report stability through the SST subsystem is proportionally larger relative to data transfer delays, so when you compare a small message delivery to a large one, the wait for stabilization simply looks larger. And then the other issue is that because these messages need to be buffered, Derecho runs out of space and at high rates, new incoming messages are delayed waiting for an admission slot.

In contrast, optimistic delivery mode lets Derecho hand off incoming messages to the application instantly, hence the system buffers only data that is in-flight, and won't incur this admission backlog buildup. Stabilization takes the same amount of time, but by the time flush is invoked by a handler processing message m, there is more hope that m will have fully stabilized and that flush won't delay at all.

But keep in mind that with raw mode, there is no delay and no need to use flush at all. The question becomes this: will people who would find optimistic delivery useful be content to use raw mode instead? If so, the demand for optimistic mode would be low, and we should then omit the extra mechanisms and extra complexity. Or do they need stronger guarantees when failures occur? And if the latter, just what would those applications look like, that need atomicity but also need to shave 25us or so off the delivery delay?

Let us know if you have a great example of an application that would really benefit from optimistic early delivery. For now, we know how to support optimism within Derecho, but are learning towards not doing so!

A Few Thoughts on Distributed Computing

Sunday, 12 March 2017

Should Derecho do optimistic early delivery?

No comments:

Post a Comment