Wednesday 26 June 2019

Whiteboard analysis: IoT Edge reactive path

One of my favorite papers is the one Jim Gray wrote with Pat Helland, Patrick O'Neil and Dennis Shasha, "The Dangers of Replication and a Solution," on the costs of replicating a database over a large set of servers, which they showed to be prohibitive if you don't fragment (shard) the database into smaller, independently accessed portions: mini-databases.  In some sense, this paper gave us the modern cloud, because you can view Brewer's CAP conjecture and the eBay/Amazon BASE methodologies as both flowing from Gray's original insight.

Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all of its locks before releasing any), and then terminate using a two-phase or three-phase commit.  They showed that without some mechanism to prevent lock conflicts, there is a predictable and steadily rising rate of conflicts, leading to delays and even deadlock/rollback/retry.  The overheads rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe it was N^3 in the number of servers, and T^5 in the rate of transactions.  So your single replicated database will suffer a performance collapse.  With shards, using state machine replication (implemented using Derecho!) this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's DynamoDB offers.
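
To convey the flavor of that whiteboard exercise, here is a toy back-of-the-envelope calculation in the same spirit.  The parameters and the simple birthday-style conflict model are mine, for illustration only; they are not the paper's actual analysis.

    # Toy model of lock contention under naive full replication.
    # Assumptions (mine): every update runs on every replica, each transaction
    # grabs locks_per_txn locks from a pool of lock_pool items, and two
    # concurrent transactions conflict with probability roughly
    # locks_per_txn**2 / lock_pool (a birthday-style estimate).

    def expected_conflicts(num_servers, txns_per_server, locks_per_txn, lock_pool):
        # With naive replication, every server's updates are active everywhere,
        # so the number of concurrently running transactions grows with N.
        concurrent = num_servers * txns_per_server
        pairwise_conflict_prob = locks_per_txn ** 2 / lock_pool
        pairs = concurrent * (concurrent - 1) / 2
        return pairs * pairwise_conflict_prob

    for n in (1, 2, 4, 8, 16):
        print(n, round(expected_conflicts(n, txns_per_server=50,
                                          locks_per_txn=10, lock_pool=100_000), 1))

Even this crude model shows conflicts growing quadratically in the number of servers; the paper's more careful analysis, which tracks how waits escalate into deadlocks and retries, makes the exponents worse still.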

Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan in working out the critical paths and measuring costs along them.  In his cloud database setup, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight.  But even if you don't have an identical chance to change the world, it makes sense to try and follow a similar path.

This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week.  Right now, those paths are pretty messy when you approach them this way.  Without an edge service, we would see something like this:

   IoT Sensor  --------------->  IoT Hub  --->  Function Server  ------>  Micro Service

In this example I am acting as if the function server "is" the function itself, hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps has one waiting, warm-started), and then hands the event data off to the function for handling on one of its servers.  Had I included this handoff, the picture would look more like this:


   IoT Sensor  --------------->  IoT Hub  --->  Function Server  ------>  F (Function)  --->  Micro Service

F is "your function", coded in a language like C#, F# or C++ or Python, and then encapsulated into a container of some form.  You'll want to keep these programs very small and lightweight for speed.  In particular, a function is not the place to do any serious computing, or to try and store anything.  Real work occurs in the micro service, the one you built using Derecho.  Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a section.  And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!

Even more concerning, many sensors can't connect directly to the cloud, so we end up cloning the architecture and running it twice: once within an IoT Edge system (think of that as an operating system for a small NUMA machine or cluster running close to the sensors), and again in the main cloud, to which the edge relays any events it can't handle out near the sensor device.

   IoT Sensor  --------------->  Edge Hub  --->  Edge Fcn Server  --->  F  ======>  IoT Hub  --->  Function Server  --->  CF  --->  Micro Service

Notice that now we have two user-supplied functions on the path.  The first one will have decided that the event can't be handled out at the edge and forwarded the request to the cloud, probably via a message-queuing layer that I've represented only by the double arrow (======>) rather than drawing it explicitly.  The queuing layer could have chosen to store the request and send it later, but with luck the link was up, the event was passed to the cloud immediately, didn't need to sit in an arrival queue, and was handed straight to the cloud's IoT Hub, which in turn passed it to the cloud function server, the cloud function (CF) and finally the Micro Service.
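
A sketch of the branch that first function takes, again with invented helper names: handle the event locally if the edge micro service can, otherwise enqueue it for the cloud (the ======> hop above).

    def edge_function(event: dict, edge_service, cloud_queue) -> None:
        """Edge-resident function: decide whether this event stays at the edge.

        edge_service and cloud_queue are stand-ins for whatever client objects
        the platform provides; the method names here are mine.
        """
        if edge_service.can_handle(event["kind"]):
            edge_service.submit(event)              # stays on the fast local path
        else:
            # Queue for the cloud; if the uplink is down, the queue can hold the
            # event and drain it later, at the cost of added latency.
            cloud_queue.enqueue(event, ttl_seconds=60)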

The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others.  The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.

We also have the difficulty that many sensors capture images and videos.  These are initially stored on the device itself, which has substantial capacity but limited compute power.  The big issue is that the first link, from sensor to the edge hub, will often be bandwidth-limited, so we can't upload everything.  Very likely what travels from sensor to hub is just a thumbnail and other metadata.  Then the edge function concludes that a download is needed (hopefully without too much delay), sends a download request back to the imaging device, and the device moves the image up toward the cloud.

Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store ("blob" is short for "binary large object"), which in turn is edge-aware and will mirror them to the main cloud blob store.  Thus we have a whole pathway from IoT sensor to the edge blob server, which will eventually generate another event to tell us that the data is ready.  And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving the image via the queuing service to the cloud.
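
Pulling the image path together, the exchange looks roughly like the sketch below.  The event shapes and helper objects are mine; real platforms each have their own vocabulary for this.

    def on_thumbnail_event(event, camera, blob_store):
        """Step 1: the edge function sees only a thumbnail plus metadata."""
        if looks_interesting(event["thumbnail"]):
            # Ask the device to push the full image to the edge blob store.
            camera.request_upload(event["image_id"], target=blob_store.upload_url())

    def on_blob_ready_event(event, edge_service):
        """Step 2 (F'): the blob store fires a second event once the image lands."""
        edge_service.submit({"kind": "image-ready",
                             "blob_ref": event["blob_ref"],
                             "image_id": event["image_id"]})

    def looks_interesting(thumbnail_bytes) -> bool:
        # Placeholder for whatever cheap filter runs at the edge.
        return True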

So how long will all of this take?  Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G).  So perhaps we will see a 10ms delay at that hop, plus more if the data is large.  Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays.  The big delay will probably be the handoff to the user-defined function, F.

My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100ms for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!".  We're up near 350ms and haven't done any work at all yet!

Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path).  Call this 10ms?  And then the micro service can decide what to do with this image.

Add another 75ms or so if we have to forward the request to the cloud.  So the cloud might not be able to react to a photo in less than about 500ms, today.
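
Tallying the guesses above (all rough, and taking the optimistic end of each range):

    # Back-of-the-envelope latency budget for one reactive photo event (ms).
    # These are the guesses from the text, not measurements.
    budget_ms = {
        "sensor -> edge hub -> F sees the event":      50,   # could easily be 100
        "F decides and requests the download":         50,   # 50-100
        "camera uploads the image to the blob store": 200,
        "blob store fires the 'image ready' event":    25,
        "handoff/binding to the micro service":        10,
        "forwarding to the cloud (when needed)":       75,
    }
    print(sum(budget_ms.values()), "ms")   # 410ms at the low end; the high ends push this toward 500ms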

None of this involved a Jim Gray kind of analysis of contention, backoff and retry.  If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story.  But if you were to use a database solution like MongoDB (or Cosmos DB on Azure), it seems to me that you might easily see a further 250ms right there.

What should one do about these snowballing costs?  One answer is that many of the early IoT applications simply won't care: if the goal is just to journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal.  But if the goal is to be reactive, we need to do a lot better.

I'm thinking that this is a great setting for various forms of shortcut datapaths that could be set up after the first interaction and offer direct bypass options to move IoT events or data from the source straight to the real target.  Then with RDMA in the cloud, and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.
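
One way to picture such a shortcut: the first event travels the full path, but the reply carries enough information for the source (or the edge hub) to talk to the real target directly from then on.  A hedged sketch, with the discovery handshake, lease policy and transport all left as placeholders of my own invention:

    class ShortcutCache:
        """Remembers, per event kind, a direct address for the real target."""

        def __init__(self):
            self._routes = {}   # event kind -> (address, lease_expiry)

        def record(self, kind, address, lease_expiry):
            self._routes[kind] = (address, lease_expiry)

        def send(self, event, slow_path, now):
            route = self._routes.get(event["kind"])
            if route and route[1] > now:
                address, _ = route
                return direct_send(address, event)   # bypass the hub and function server
            # Fall back to the full pipeline; its reply can install a route.
            reply = slow_path(event)
            if "direct_address" in reply:
                self.record(event["kind"], reply["direct_address"], reply["lease_expiry"])
            return reply

    def direct_send(address, event):
        raise NotImplementedError("stand-in for an RDMA or other fast-path transport")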

On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently) and contention, it is easy to see how you could trigger Jim's "naive replication" concerns.  So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns a car to "watch out for that reckless motorcycle approaching on your left!", the motorcycle will already have zoomed past...

These are exciting times to work in computer systems.  Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!

Saturday 22 June 2019

Data everywhere but only a drop to drink...

One peculiarity of the IoT revolution is that it may explode the concept of big data.

The physical world is a domain of literally infinite data -- no matter how much we might hope to capture, at the very most we see only a tiny set of samples from an ocean of inaccessible information because we had no sensor in the proper place, or we didn't sample at the proper instant, or didn't have it pointing in the right direction or focused or ready to snap the photo, or we lacked bandwidth for the upload, or had no place to store the data and had to discard it, or misclassified it as "uninteresting" because the filters used to make those decisions weren't parameterized to sense the event the photo was showing.

Meanwhile, our data-hungry machine learning algorithms currently don't deal with the real world: they operate on snapshots, often ones collected ages ago.  The puzzle will be to find a way to somehow compute on this incredible ocean of currently-inaccessible data while the data is still valuable: a real-time constraint.  Time matters because in so many settings, conditions change extremely quickly (think of a smart highway, offering services to cars that are whizzing along at 85mph).

By computing at the back-end, AI/ML researchers have baked in very unrealistic assumptions, so today's machine learning systems have become heavily skewed: they are very good at dealing with data acquired months ago and painstakingly tagged by an army of workers, and fairly good at using the resulting models to make decisions within a few tens of milliseconds, but in a sense they treat the acquisition and real-time processing of data as part of the (offline) learning side of the game.  In fact many existing systems wouldn't even work if they couldn't iterate for minutes (or longer) over data sets, and many need that data to be preprocessed in various ways, perhaps cleaned up, perhaps preloaded and cached in memory, so that a hardware accelerator can rip through the needed operations.  If a smart highway were capturing data now that we wanted to use to relearn vehicle trajectories, so that we could react to changing conditions within fractions of a second, many aspects of this standard style of computing would have to change.

To me this points to a real problem for those intent on using machine learning everywhere and as soon as possible, but also a great research opportunity.  Database and machine learning researchers need to begin to explore a new kind of system in which the data available to us is understood to be a "skim".  (I learned this term when I worked with high-performance computing teams in scientific computing settings, where data was getting big decades ago.  For example, the CERN particle accelerators capture far too much data to move it all off the sensors, so even uploading "raw" data involves deciding which portions to keep, which to sample randomly, and which to ignore completely.)

Beyond this issue of deciding what to include in the skim, there is the whole puzzle of supporting a dialog between the machine-learning infrastructure and the devices.  I mentioned examples in which one needs to predict that a photo of such and such a thing would be valuable, anticipate the timing, point the camera in the proper direction, pre-focus it (perhaps on an expected object that isn't yet in the field of view, so that auto-focus wouldn't help because the thing we want to image hasn't yet arrived), capture the image, and then process it -- all under real-time pressure.

I've always been fascinated by the emergence of new computing areas.  To me this looks like one ripe for exploration.  It wouldn't surprise me at all to see an ACM Symposium on this topic, or an ACM Transactions journal.  Even at a glance one can see all the elements: a really interesting open problem that would lend itself to a theoretical formalization, but also one that will require substantial evolution of our platforms and computing systems.  The area is clearly of high real-world importance and offers a real opportunity for impact, and a chance to build products.  And it emerges at a juncture between systems and machine learning: a trending topic even now, so that this direction would play into gradually building momentum at the main funding agencies, which rarely can pivot on a dime, but are often good at following opportunities in a more incremental, thoughtful way.

The theoretical question would run roughly as follows.  Suppose that I have a machine-learning system that lacks knowledge required to perform some task (this could be a decision or classification, or might involve some other goal, such as finding a path from A to B).  The system has access to sensors, but there is a cost associated with using them (energy, repositioning, etc.).  Finally, we have some metric for data value: a hypothesis concerning the data we are missing that tells us how useful a particular sensor input would be.  Then we can talk about which data to capture next so as to maximize value while minimizing cost.  Given a solution to this one-shot problem, we would then want to explore the continual version, where newly captured data changes these model elements; fixed points for problems that are static; and quality of tracking for cases where the underlying data is evolving.
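
One plausible way to write down the one-shot version (the notation here is mine): let S be the set of available sensing actions, C(s) the cost of action s, V(s | M) the expected value of the data that s would yield given the current model M, and lambda a weight trading value against cost.  Then at each step the system solves something like

    s^{\ast} \;=\; \arg\max_{s \in S} \bigl[\, V(s \mid M) \;-\; \lambda\, C(s) \,\bigr]

(in LaTeX notation), after which the chosen observation updates M and the problem is re-solved.  The fixed-point question asks whether this loop converges when the world is static; the tracking question asks how closely it follows a world that keeps changing.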

The practical systems-infrastructure and O/S questions center on the capabilities of the hardware and the limitations of today's Linux-based operating system infrastructure, particularly in combination with existing offloaded compute accelerators (FPGA, TPU, GPU, even RDMA).  Today's sensors run the gamut from really dumb fixed devices that don't even have storage to relatively smart sensors that can perform various tasks on the device itself, have storage and some degree of intelligence about how to report data, and so forth.  Future sensors might go further, with the ability to download logic and machine-learned models for making such decisions: I think it is very likely that we could program a device to point its camera at such and such a lane on the freeway, wait for a white vehicle moving at high speed that should arrive in the period [T0,T1], obtain a well-focused photo showing the license plate and current driver, and then report the capture accompanied by a thumbnail.  It might even be reasonable to talk about prefocusing, adjusting the spectral parameters of the imaging system, selecting from a set of available lenses, etc.
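
To make the "downloadable logic" idea concrete, the instructions pushed to such a device might look like the small task specification sketched below.  Every field and helper name here is invented for illustration; T0 and T1 stand for the arrival window mentioned above.

    # Hypothetical capture task pushed to a smart roadside camera.
    capture_task = {
        "target_lane": 3,
        "watch_window": ("T0", "T1"),                 # the [T0,T1] interval above
        "trigger": {"vehicle_color": "white", "min_speed_mph": 80},
        "camera_prep": {"prefocus_distance_m": 40, "lens": "telephoto"},
        "on_capture": {"report": ["thumbnail", "metadata"],
                       "retain_full_image": True},    # full image stays on-device
    }

    def run_capture_task(task, camera, uplink):
        """Sketch of the on-device loop that would execute the task."""
        camera.prefocus(task["camera_prep"]["prefocus_distance_m"])
        for frame in camera.frames(during=task["watch_window"]):
            if matches(frame, task["trigger"]):
                image_id = camera.store(frame)        # keep the full image locally
                uplink.report({"image_id": image_id,
                               "thumbnail": camera.thumbnail(image_id),
                               "metadata": camera.metadata(image_id)})
                break

    def matches(frame, trigger) -> bool:
        # Placeholder for the downloaded model that checks color and speed.
        return False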

Exploiting all of this will demand a new ecosystem that combines elements of machine learning on the cloud with elements of controlled logic on the sensing devices.  If one thinks about the way that we refactor software, here we seem to be looking at a larger-scale refactoring in which the machine learning platform on the cloud, with "infinite storage and compute" resources, has the role of running the compute-heavy portions of the task, but where the sensors and the other elements of the solution (things like camera motion control, dynamic focus, etc) would need to participate in a cooperative way.  Moreover, since we are dealing with entire IoT ecosystems, one has to visualize doing this at huge scale, with lots of sensors, lots of machine-learned models, and a shared infrastructure that imposes limits on communication bandwidth and latency, computing at the sensors, battery power, storage and so forth.

It would probably be wise to keep as much of the existing infrastructure as feasible.  So perhaps that smart highway will need to compute "typical patterns" of traffic flow over long time periods with today's methodologies (no time pressure there), track current vehicle trajectories over mid-term time periods using methods that work within a few seconds, and then deal with instantaneous context (a car suddenly swerves to avoid a rock that just fell from a dump truck onto the lane) as an ultra-urgent real-time learning task that splits into an instantaneous part ("watch out!"), a longer-term part ("warning: obstacle in the road 0.5 miles ahead, left lane"), and an even longer one ("at mile 22, northbound, left lane, anticipate roadway debris").  This kind of hierarchy of temporality is missing from today's machine learning systems, as far as I can tell, and the more urgent forms of learning and reaction will require new tools.  Yet we can preserve a lot of existing technology as we tackle these new tasks.

Data is everywhere... and that isn't going to change.  It is about time that we tackle the challenge of building systems that can learn to discover context, and use current context to decide what to "look more closely" at, and with adequate time to carry out that task.  This is a broad puzzle with room for everyone -- in fact you can't even consider tackling it without teams that include systems people like me as well as machine learning and vision researchers.  What a great puzzle for the next generation of researchers!