Monday, 15 October 2018

Improving asynchronous communication in u-service systems.

I've been mostly blogging on high-level topics related to the intelligent edge, IoT, and BlockChain lately.  In my work, though, there are also a number of deeper tradeoffs that arise.  I thought I might share one, because it has huge performance implications in today's systems and yet seems not to get much "press".

Context.  To appreciate the issue I want to talk about, think about a data pipeline moving some sort of stream of bytes at very high speed from point A to point B.  If you want to be very concrete, maybe A is a video camera and B is a machine that runs some form of video analytic.  Nothing very exotic.  Could run on TCP -- or it could use RDMA in some way.  For our purposes here, you have my encouragement to think of RDMA as a hardware-accelerated version of TCP.

Now, what will limit performance?  One obvious issue is that if there is some form of chain of relaying between A and B, any delays in relaying could cause a performance hiccup.  Why might a chain arise?  Well, one obvious answer would be network routing, but modern cloud systems use what we call a micro-services (u-service) model, and for these, we deliberately break computations down into stages, and then each stage is implemented by attaching a small "function" program (a normal program written in C++ or Java or a similar language, running in a lightweight container setting like Kubanetes, and with "services" like file storage coming from services it talks to in the same data center, but mostly running on different machines).  You code these much as you might define a mouse-handler method: there is a way to associate events with various kinds of inputs or actions, and then to associate an event handler with those events, and then to provide the code for the handler.  The events themselves are mostly associated with HTML web pages, and because these support a standard called "web services", there is a way to pass arguments in, and even to get a response: a form of procedure call that runs over web pages, and invokes a function handler program coded in this particular style.

So why might a u-services model generate long asynchronous chains of actions?  Well, if you consider a typical use case, it might involve something like a person snapping a photo, which uploads into the cloud, where it is automatically analyzed to find faces, and then where the faces are automatically tagged with best matches in the social network of the camera-owner, etc.  Each of these steps would occur as a function event in the current functional cloud model, so that single action of taking a photo (on Facebook, WhatsApp, Instagram, or whatever) generated a chain.  In my work on Derecho, we are concerned with chains too.  Derecho is used to replicate data, and we often would see a pipeline of photos or videos or other objects, which then are passed through layers of our system (a chain of processing elements) before they finally reach their delivery target.

Chains of any kind can cause real issues.  If something downstream pauses for a while, huge backlogs can form, particularly if the system configures its buffers generously.  So what seems like a natural mechanism to absorb small delays turns out to sometimes cause huge delays and even total collapse!

Implications.  With the context established, we can tackle the real point, namely that for peak performance the chain has to run steadily and asynchronously: we need to see an efficient, steady, high-rate of events that flow through the system with minimal cost.  This in turn means that we want some form of buffering between the stages, to smooth out transient rate mismatches or delays.  But when we buffer large objects, like videos, we quickly fill the in-memory buffer capacity, and data will then spill to a disk.  The stored data will later need to be reread when it is finally time to process it: a double overhead that can incur considerable delay and use quite a lot of resources.  With long pipelines, these delays can be multiplied by the pipeline length.  And even worse, modern standards (like HTML), often use a text-based data representation when forwarding information, but use a binary one internally: the program will be handed a photo, but the photo was passed down the wire in an ascii form.  Those translations cost a fortune!

What alternative did we have?  Well, one option to consider is back-pressure: preventing the source from sending the photo unless there is adequate space ("credits") available in the receiver.  While we do push the delay all the way back to the camera, the benefit is that these internal overheads are avoided, so the overall system capacity increases.  The end-to-end system may perform far better even though we've seemingly reduced the stage-by-stage delay tolerance.

But wait, perhaps we can do even better.   Eliminating the ascii conversions would really pay off.  So why not trick the system into taking the original photo, storing it into a blob ("binary large object") store, and substituting a URL in the pipeline.  So now our pipeline isn't carrying much data at all -- the objects being shipped around might be just a few tens of bytes in size, instead of many megabytes.   Now the blob store can offer a more optimized photo transfer service, and the need for buffering in the pipeline itself is eliminated because these tiny objects would be so cheap to ship.

This would be a useless optimization if most stages need to touch the image data itself, unless the stage by stage protocol is hopelessly tied to ascii representations and the blob-store-transfer is super-fast (but wait... that might be true!).

However, there is even more reason to try this: In a u-services world, most stages do things like key-value storage, queuing or message-bus communication, indexing.  Quite possibly the majority of stages aren't data-touching at all, and hence wouldn't fetch the image itself.  The tradeoff here would be to include meta-data actually needed by the intermediary elements in the pipeline while storing rarely-needed bits in the blob store, for retrieval only when actually needed.  We could aspire to a "zero copy" story: one that never copies unnecessarily (and that uses RDMA for the actually needed data transfers, of course!)