Friday, 20 December 2019

A few 10-Year Challenges for Distributed Systems and IoT

A newspaper column on "next decade" predictions got me thinking about crystal ball prognoses.   Tempting as it is to toss in my views on climate change, surveillance in China and self-driving cars, I'll focus this particular blog on computer systems topics in my area.

1. AI Sys and RT ML.  These terms relate to computer systems created to support AI/ML applications, and that often involve addressing real-time constraints.  There is a second meaning that centers on using AI tools in networks, operating systems, and database platforms.  I'm open-minded but, so far, haven't seen convincing demonstrations that this will yield big advances.  I'll focus on the first meaning here.

Although AI Sys terminology is trendy, the fact is that we are still at the very earliest stages of incorporating sensing devices into applications that leverage cloud-scale data and machine learning.  As this style of system deployment accelerates in coming years, we'll start to see genuinely smart power grids (existing grids often proclaim themselves to be "smart" but honestly, not much use is being made of ML as of yet), smart homes and offices, smart cities and highways, smart farms....  The long-term potential is enormous, but to really embrace it we need to rethink the cloud, and especially, the cloud edge where much of the reactive logic needs to run.  This is why the first of my predictions centers on the IoT edge: we'll see a new and trustworthy edge IoT architecture emerge and mature, in support of systems that combine sensors, cloud intelligence and big data.

Getting there will require more than just redesigning today's cloud edge components, but the good news is that an area begging for disruptive change can be an ideal setting for a researcher to tackle.  To give just one example: in today's IoT hub services, we use a database model to track the code revision level and parameter settings for sensors, connected and accessible or not.  The IoT hub manages secure connectivity to the sensors, pushes updates to them, and filters incoming event notifications, handing off to the function service for lightweight processing.  I really like the hub concept, and I think it represents a huge advance relative to the free-for-all that currently is seen when sensors are connected to the cloud.  Moreover, companies like Microsoft are offering strong quality of service guarantees (delay, bandwidth, and even VPN security) for connectivity to the edge.  They implement the software, then contract with ISPs and teleco's to obtain the needed properties.  From the customer's perspective what matters is that the sensors are managed in a trustworthy, robust and secure manner.

The puzzle relates to the reactive path, which is very far from satisfactory right now.  When a sensor sends some form of event to the cloud, the IoT hub operates like a windowing environment handling mouse movements or clicks: it functions as the main loop for a set of handlers that can be customized and even standardized (thus, a Canon camera could eventually have standard cloud connectivity with standard events such as "focused," "low power mode," or "image acquired").  As with a GUI, the incoming events that need user-defined processing are passed to these user-defined functions, which can customize what will happen next.
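To make the analogy concrete, here is a minimal sketch of that GUI-style dispatch model: the hub acts as the main loop, routing standardized event types to user-defined handlers.  All of the names here are hypothetical illustrations, not any real IoT hub API.

```python
# Toy model of an IoT hub as a GUI-style event dispatcher.
# Event types and handler names are invented for illustration.
from typing import Callable, Dict


class IoTHub:
    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[dict], str]] = {}

    def on(self, event_type: str, handler: Callable[[dict], str]) -> None:
        """Register a user-defined handler, like binding a GUI callback."""
        self.handlers[event_type] = handler

    def dispatch(self, event: dict) -> str:
        """Route an incoming sensor event to its handler, if any."""
        handler = self.handlers.get(event["type"])
        return handler(event) if handler else "unhandled"


hub = IoTHub()
hub.on("image_acquired", lambda ev: f"store {ev['camera']} frame")
hub.on("low_power_mode", lambda ev: "throttle uploads")

print(hub.dispatch({"type": "image_acquired", "camera": "canon-01"}))
# -> store canon-01 frame
```

As with a windowing toolkit, the hub itself stays generic; all the application-specific behavior lives in the registered callbacks.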

The core problem is with the implementation.  First, we split the path: sensor-to-cloud uploads of large objects like photos or videos follow one path, and end up with the data automatically stored into a binary large object (BLOB) store, replicated for fault-tolerance.

Meanwhile, other kinds of events, like the ones just mentioned, are handled by small fragments of logic: cloud functions.  But these aren't just lambdas written in C++ or Scala -- they are typically full programs coded in Linux and then handed to the cloud as containers, perhaps with configuration files and even their own virtualized network mapping.  As a result, the IoT Hub can't just perform the requested action -- it needs to launch the container and pass the event into it.

The IoT hub accomplishes these goals using the "function service", which manages a pool of machines and picks one on which to launch this container for this particular Canon photo acquisition event; the program then loads that event's meta-data and decides what to do.  In effect, we launch a Linux command.

Normally, launching a Linux command has an overhead of a few milliseconds.  Doing so through the IoT hub is much slower: today, this process takes as much as two seconds.  The issues are several.  First, because the IoT Hub is built on a database like SQL Server or Oracle, we have overheads associated with the way databases talk to services like the function service.  Next, the function service itself turns out to do a mediocre job of warm-starting functions -- here the delays center on caching, on binding the function ahead of time (off the critical path) to any microservices it will need to talk to, and on any synchronization the function may require.

I can't conceive of a sensible realtime use case where we can tolerate two seconds of delay -- even web page interactions are down in the 10-50ms range today, well below the 100ms level at which A/B tests show that click-through drops.  So I would anticipate a complete redesign of the IoT hub and function layer to warm-start commonly needed functions, allow them to pre-bind to any helper microservices they will interact with (binding is a potentially slow step but can occur out of the critical path), and otherwise maintain a shorter critical path from sensor to user-mediated action.  I think we could reasonably target sub-1ms delays... and need to do so!
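As a rough illustration of the warm-start idea (this is not any actual function service, and the latency numbers are just the figures discussed above), the pool-of-prewarmed-workers pattern looks something like this:

```python
# Hypothetical sketch of a function service with a warm pool: containers are
# launched and pre-bound off the critical path, so the event-to-code path
# skips the expensive cold start.  Numbers and names are illustrative only.
import collections

COLD_START_MS = 2000   # observed worst case discussed above
WARM_START_MS = 1      # the sub-millisecond target


class FunctionService:
    def __init__(self) -> None:
        # function name -> idle, pre-bound workers
        self.warm = collections.defaultdict(list)

    def prewarm(self, fn_name: str, count: int) -> None:
        # Off the critical path: launch containers and bind them to any
        # helper microservices they will need to talk to.
        self.warm[fn_name].extend(f"{fn_name}-worker-{i}" for i in range(count))

    def invoke(self, fn_name: str) -> int:
        """Return the simulated critical-path latency in milliseconds."""
        if self.warm[fn_name]:
            self.warm[fn_name].pop()
            return WARM_START_MS
        return COLD_START_MS  # fall back to launching a container


svc = FunctionService()
svc.prewarm("on_image_acquired", count=2)
print(svc.invoke("on_image_acquired"))  # warm path: 1
print(svc.invoke("on_photo_deleted"))   # cold path: 2000
```

The point of the sketch is simply that the two-second penalty is an artifact of doing launch and binding on the critical path, not an inherent cost of running user code.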

There are many other unnecessarily long delays in today's IoT infrastructures, impacting everything from photo and video upload to ML computation on incoming objects.  But none of this is inevitable, and from a commercial perspective, the value of reengineering it (in a mostly or fully compatible way) would be huge.

2. Cost-efficient sharable hardware accelerators for IoT Edge.  In prior blog postings, I've written about the puzzle of hardware for the IoT Edge (many people take that to mean "outside" the cloud, but I also mean "in the outermost tier of a data center supporting the cloud," like Azure IoT).  Here, the central question involves costs: modern ML, and especially model training, is cost-effective only because we can leverage hardware accelerators like GPU, TPU and custom FPGA to offload the computationally parallel steps into ultra-efficient hardware.  To this, add RDMA and NVM.

The current generation of hardware components evolved in back-end systems, and it is no surprise to realize that they are heavily optimized for batched, offline computing.  And this leads to the key puzzle:  today's ML accelerators are expensive devices that are cost effective only when they can be kept busy.  The big batches of work seen in the back-end enable today's accelerators to run in support of very long tasks, which keeps them busy and makes them cost-effective.  If the same devices were mostly idle, this style of accelerated ML would become extremely expensive.

In some sense, today's ML accelerators could have been at home in the old-style batch computing systems of the 1970's.  As we migrate toward a more event-driven IoT edge, we also will need to migrate machine learning (model training) and inference into real-time contexts, and this means that we'll be using hardware accelerators in settings that lack the batched pipelining that dominates in the big-data HPC-style settings where those accelerators currently reside.  To be cost-effective we will either need completely new hardware (sharable between events or between users), or novel ways to repurpose our existing hardware for use in edge settings.

It isn't obvious how to get to that point, making it a fascinating research puzzle.  As noted, edge systems are event-dominated, although we do see streams of image and video data (image-processing tasks on photo or video streams can be handled fairly well with existing GPU hardware, so that particular case can be solved cost-effectively now).  The much harder case involves singleton events: "classify this speech utterance," or "decide whether or not to retain a copy of that photo."  So the problem is to do snap analysis of an event.  And while my examples involve photos and videos, any event could require an intelligent response.  We may only have milliseconds to react, and part of that reaction may entail retraining or incrementally adjusting the ML models -- dynamic learning.

The hardware available today isn't easily sharable across scaled out event-driven systems where the events may originate in very different privacy domains, or from different users.  We lack ways to protect data inside accelerators (Intel's new SIMD instruction set offers standard protections, but a GPU or TPU or FPGA is typically operated as a single security context: it is wide-open if a task runs on behalf of me immediately after one that ran on behalf of you: the kernel I've invoked could just reach over and extract any data left behind after your task was finished).

So why not use Intel's SIMD solutions?  For classification tasks, this may be the best option, but for training, which is substantially more expensive from a computational point of view, the Intel SIMD options are currently far slower than GPU or TPU (FPGA is the cheapest of all the options, but would typically be somewhere in between the SIMD instructions and a GPU on the performance scale).

It will be interesting to watch this one play out, because we can see the end goal easily, and the market pressure is already there.  How will the hardware vendors respond?  And how will those responses force us to reshape the IoT edge software environment?

3. Solutions for the problem blockchain was supposed to solve.  I'm pretty negative about cryptocurrencies but for me, blockchain is a puzzle.  Inside the data center we've had append-only logs for ages, and the idea of securing them against tampering using entangled cryptographic signatures wasn't particularly novel back when the blockchain for Bitcoin was first proposed.  So why is a tamper-proof append-only log like Microsoft's Corfu system not a blockchain?
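The entangled-signature idea behind a tamper-proof log is simple enough to sketch in a few lines of Python (a toy illustration of the general technique, not Corfu's or any blockchain's actual implementation): each record carries a hash of its predecessor, so editing any record in place breaks every later link.

```python
# Minimal hash-chained append-only log.  Record contents are invented.
import hashlib
import json


def chain_hash(prev_hash: str, payload: dict) -> str:
    data = prev_hash + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(data.encode()).hexdigest()


def append(log: list, payload: dict) -> None:
    prev = log[-1]["hash"] if log else "genesis"
    log.append({"payload": payload, "hash": chain_hash(prev, payload)})


def verify(log: list) -> bool:
    """Recompute every link; any in-place edit breaks all later links."""
    prev = "genesis"
    for rec in log:
        if rec["hash"] != chain_hash(prev, rec["payload"]):
            return False
        prev = rec["hash"]
    return True


log = []
append(log, {"op": "put", "key": "x", "value": 1})
append(log, {"op": "put", "key": "y", "value": 2})
assert verify(log)

log[0]["payload"]["value"] = 999   # tamper with an early record
assert not verify(log)             # the entangled hashes expose it
```

Nothing here is new or exotic -- which is precisely the puzzle: this much has been standard practice inside data centers for ages.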

There are several aspects in which blockchain departs from that familiar, well-supported option.  I'll touch on them in an unusual order: practical uses first, then the more esoteric (almost "religious") considerations, which I'll tackle last.  Then I want to argue that the use cases do point to a game-changing opportunity, but that the whole story is confused by the religious zealotry around some of these secondary and, actually, much less important aspects.

First among the novel stories is the concept of a smart contract, which treats the blockchain as a database and permits the developer to place executable objects into blockchain records, with the potential of representing complex transactions like the mortgage-backed securities that triggered the 2008 meltdown.  The story goes that if we can capture the full description of the security (or whatever the contract describes), including the underlying data that should be used to value it, we end up with a tamper-proof and self-validating way to price such things, and our transactions will be far more transparent.

I see the value in the concept of a smart contract, but worry that the technology has gotten ahead of the semantics: as of the end of 2019 you can find a dozen tools for implementing smart contracts (Ethereum is the leader, but Hyperledger is popular too).  Less clear is the question of precisely how these are supposed to operate.  Today's options are a bit like the early C or Java programming languages: both omitted specifications for all sorts of things that actually turned out to matter, leaving it to the compiler-writer to make a choice.  We ended up with ambiguities that gave us today's security problems with C programs.

With blockchain and smart contracts you have even nastier risks because some blockchain implementations are prone to rollback (abort), and yet smart contracts create dependency graphs in which record A can depend on a future record B.  A smart contract won't seem so smart if this kind of ambiguity is allowed to persist... I predict that 2020 will start a decade when smart contracts with strong semantics will emerge.  But I'll go out on a limb and also predict that by the time we have such an option, there will be utter chaos in the whole domain because of these early but inadequate stories.  Smart contracts, the real kind that will be robust with strong semantics?  I bet we won't have them for another fifteen years -- and when we do get them, it will be because a company like Oracle or Microsoft steps in with a grown-up product that was thought through from bottom to top.  We saw that dynamic with Java and CORBA giving way to C# and LINQ and .NET, which in turn fed back into languages like C++.  And we will see it again, but it will take just as long!

But if you talk to people enamored with blockchain, it turns out that in fact, smart contracts are often seen as a cool curiosity.  I might have a narrow understanding of the field, but among people I'm in touch with, there is little interest in cryptocurrency and even less interest in smart contracts.  More common, as far as I can tell, is a focus on the auditability of a tamperproof ledger.

I'll offer one example that I run into frequently here at Cornell, in the context of smart farming.  You see variants of it in medical centers (especially ones with partner institutions that run their own electronic health systems), human resource management, supply chains, airports that need to track airplane maintenance, and the list goes on.  At any rate, consider farm-to-table cold-chain shipment for produce or agricultural products like cheese or processed meats.  A cup of yoghurt will start with the cow being milked, and even at that stage we might wish to track which cow we milked, how much milk she produced, the fat content, document that she was properly washed before the milking machine kicked in, that we tested for milk safety and checked her health, that the milk was promptly chilled and then stored at the proper temperature.  Later the milk is aggregated into a big batch, transported, tested again, pasteurized, homogenized, graded by fat content, cultured (and that whole list kicks in again: in properly sterile conditions, at the right temperature...).

So here's the challenge: Could we use a blockchain to capture records of these kinds in a secure and tamperproof manner, and then be in a position to audit that blockchain for various tasks, such as to confirm that the required safety steps were followed, or to look for optimization opportunities?  Could we run today's ML tools on it, treating the records as an ordered collection and mapping that collection into a form that TensorFlow or Spark/Databricks could ingest and analyze?  I see this as a fantastic challenge problem for the coming decade.
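The auditing half of the challenge is easy to picture: once we trust the chain of records, an audit is just a query over an ordered collection.  Here is a toy version (the record fields and the 6°C threshold are invented purely for illustration):

```python
# Toy cold-chain audit over an ordered collection of (already-verified)
# records.  Field names and the temperature threshold are hypothetical.
records = [
    {"step": "milking",   "temp_c": 4.0, "washed": True},
    {"step": "transport", "temp_c": 3.5, "washed": True},
    {"step": "culturing", "temp_c": 9.0, "washed": True},  # too warm
]


def audit_cold_chain(records, max_temp_c=6.0):
    """Return the steps at which the required temperature was violated."""
    return [r["step"] for r in records if r["temp_c"] > max_temp_c]


print(audit_cold_chain(records))  # -> ['culturing']
```

The hard research questions are everything around this query: establishing that the records themselves are trustworthy, and attaching trust properties to the query result.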

The task is fascinating and hard, for a lot of reasons.  One is that the domain is partly disconnected (my colleagues have created a system, Vegvisir, focused on this aspect).  A second question you can ask concerns integrity of our data capture infrastructure: can I trust that this temperature record is from the proper thermometer, correctly calibrated, etc.?  Do I have fault-tolerant redundancy?  How can we abstract from the chain of records to a trustworthy database, and what do trust-preserving queries look like?  How does one do machine learning on a trusted blockchain, and what trust properties would the model then carry?  Can a model be self-certifying, too?  What would the trust certificate look like (at a minimum, it would need to say "if you trust X and Y and Z, you can trust me for purpose A under assumption B...")?  I'm reminded of the question of self-certifying code... perhaps those ideas could be applied in this domain.

I commented that this is the problem blockchain really should be addressing.  I say this because, as far as I can tell, the whole area is bogged down in debates that have more to do with religion than with rigorous technical arguments.  To me this is at least in part because of the flawed belief that anonymity and permissionless mining are key properties that every blockchain should offer.  The former is of obvious value if you plan to do money laundering, but I'm pretty sure we wouldn't even want this property in an auditing setting.  As for the permissionless mining model, the intent was to spread the blockchain mining revenue fairly, but this has never really been true in any of the main blockchain systems: they are all quite unfair, and all the revenue goes to shadowy organizations that operate huge block-mining systems.  As such, the insistence on permissionless mining with anonymity really incarnates a kind of political opinion, much like the "copyleft" clause built into GNU licenses, which incarnated a view that software shouldn't be monetized.  Permissionless blockchain incarnates the view that blockchains are for cybercurrency, that cybercurrency transactions shouldn't be taxed or regulated, and that management of this infrastructure is a communal opportunity, but also a communal revenue source.

Turning to permissionless blockchain as it exists today, we have aspects of this dreamed-of technology, but the solutions aren't fair, and in fact demand a profoundly harmful mining model that squanders energy in the form of hugely expensive proof-of-work certifications.  My colleague, Robbert van Renesse, has become active in the area and has been doing a survey recently to also look at some of the other ideas people have floated: proof of stake (a model in which the rich get richer, but the compute load is much reduced, so they spend less to earn their profits...), proof of elapsed time (a lovely acronym, PoET, but in fact a problematic model because the model can be subverted using today's Intel SGX hardware), and all sorts of one-way functions that are slow to compute and easy to verify (the parallelizable ones can be used for proof-of-work but the sequential ones  simply reward whoever has the fastest computer, which causes them to fail on a different aspect of the permissionless blockchain mantra: they are "undemocratic", meaning that they fail to distribute the income for mining blocks in a fair manner).  The bottom line, according to Robbert, is that for now, permissionless blockchain demands computational cycles and those cycles make this pretty much the least-green technology on earth. There is some irony here, because those who promote this model generally seem to have rather green politics in other ways.  I suppose this says something about the corrupting influence of potentially vast wealth.

Meanwhile, more or less betting on the buzz, we have a whole ecosystem of companies convinced that what people really want are blockchain curation products for existing blockchain models.  These might include tools that build the blockchain for you using the more-or-less standard protocols, that back it up, clean up any garbage, index it for quick access, integrate it with databases and AI/ML.  We also have companies promoting some exceptionally complex protocols, many of which seem to have the force of standards simply because people are becoming familiar with their names.  It will take many years to even understand whether or not some of these are correct -- I have serious doubts about a few of the most famous ones!

But here's my bet for the coming decade: in 2029, we'll be seeing this market morph into a new generation of WAN database consumers, purchasing products from today's database companies.  Those customers won't really be particularly focused on whether they use blockchain or some other technology (and certainly won't insist on permissionless models with pervasive anonymity and proof of work).  They will be more interested in tamperproof audits and ML on the temporally-ordered event set.

Proof of work per-se will have long since died from resource exhaustion: the world simply doesn't have enough electrical power and cooling to support that dreadful model much longer (don't blame the inventors: the blame here falls squarely on the zealots in the cybercoin community, who took a perfectly good idea and twisted it into something harmful as part of their quest to become billionaires off the back of a pie-in-the-sky economic model).

The future WAN databases that emerge from the rubble will have sophisticated protection against tampering and the concept of trust in a record will have been elevated to a notion of a trustworthy query result, that can be checked efficiently by the skeptical end-user.  And this, I predict, will be a huge market opportunity for the first players to pull it off.  It would surprise me if those players don't turn out to include today's big database companies.

4. Leave-nothing-sensitive behind privacy.  The role of the cloud in smart settings -- the ones listed above, or others you may be thinking about -- is deeply enshrined by now:  very few smart application systems can avoid a cloud-centric computing model in which the big data and the machine intelligence is at least partly cloud-hosted.  However, for IoT uses, we also encounter privacy and security considerations that the cloud isn't terribly good at right now, with some better examples (Azure, on the whole, is excellent) and some particularly poor ones (I won't point a finger but I will comment that companies incented to place a lot of advertising often find it hard to avoid viewing every single user interaction as an invaluable asset that must be captured in perpetuity and then mined endlessly for every possible nugget of insight).

The upshot of this is that the cloud is split today between smart systems that are trying their best to spy on us, and smart systems that are just doing smart stuff to benefit us.  But I suspect that the spying will eventually need to end, at least if we hope to preserve our Western democracies.  How then can we build privacy-preserving IoT clouds?

I've written about this in the past, but in a nutshell, I favor a partnership: a style of IoT application that tries to "leave no trace behind" coupled to a cloud vendor infrastructure that promises not to deliberately spy on the end-user.  Thus for example when a voice command is given to my smart apartment, it may well need to be resolved up on the cloud, but shouldn't somehow be used to update databases about me, my private life, my friends...

I like the mental imagery of camping in a wilderness where there are some bears roaming around.  The cloud needs a model under which it can transiently step in to assist in making sense of my accent and choice of expressions, perhaps even contextualized by knowledge of me and my apartment, and yet when the task finishes, there shouldn't be anything left behind that can leak to third party apps that will rush into my empty campsite, hungry to gobble up any private data for advertising purposes (or worse, in countries like China, where the use of the Internet to spy on the population is a serious threat to personal liberties).  We need to learn to enjoy the benefits of a smart IoT edge without risk.

Can this be done?  I think so, if the cloud partner itself is cooperative.  Conversely, the problem is almost certainly not solvable if the cloud partner will see its revenue model break without all that intrusive information, and hence is hugely incented to cheat.  We should tackle the technical aspects now, and once we've enabled such a model, I might even favor asking legislative bodies to mandate privacy-preservation as a legally binding obligation on cloud vendor models.  I think this could be done in Europe, but the key is to first create the technology so that we don't end up with an unfunded and infeasible mandate.  Let's strike a blow against all those companies that want to spy on us!  Here's a chance to do that by publishing papers in top-rated venues... a win-win for researchers!

5. Applications that prioritize real-time.  Many IoT systems confront deadlines, and really have no choice except to take actions at the scheduled time.  Yet if we want to also offer guarantees, this poses a puzzle: how do we implement solutions that are always sure to provide the desired timing properties, yet are also "as consistent" as possible, or perhaps "as accurate as possible", given those constraints?

To me this is quite an appealing question because it is easy to rattle off a number of ways one might tackle such questions.  For example, consider an ML algorithm that iterates until it converges, which typically involves minimizing some sort of error estimate.  Could we replace the fixed estimate by adopting a model that permits somewhat more error if the deadline is approaching?
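One way to picture that idea is an iterative solver whose convergence tolerance loosens as the deadline approaches, so that it always returns in time.  This is a toy sketch; the linear tolerance schedule is my own invention for illustration, not a published method.

```python
# Deadline-aware iterative minimization: permit more error as the budget
# runs out.  The schedule and numbers are hypothetical.
def solve_with_deadline(grad, x0, step, budget_iters,
                        tight_tol=1e-6, loose_tol=1e-2):
    x = x0
    for i in range(budget_iters):
        # Loosen the tolerance linearly as we burn through the budget.
        frac = i / budget_iters
        tol = tight_tol + frac * (loose_tol - tight_tol)
        g = grad(x)
        if abs(g) < tol:          # "converged enough" for the time remaining
            return x, i
        x -= step * g
    return x, budget_iters        # deadline hit: return best effort


# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x, iters = solve_with_deadline(lambda x: 2 * (x - 3), x0=0.0,
                               step=0.1, budget_iters=200)
print(round(x, 2), iters)  # x lands close to 3.0, well inside the budget
```

The appeal is that the quality of the answer degrades gracefully under time pressure, rather than the answer arriving late.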

Or here's an idea: What about simply skipping some actions because it is clear we can't meet the deadline for them?  I'm reminded of work Bart Selman, a colleague of mine, did fifteen years ago.  Bart was looking at situations in which an AI system confronted an NP-complete question, but in a streaming context where variations on that question would be encountered every few seconds (he was thinking about robot motion planning, but similar issues arise in many AI tasks).  What he noticed was that heuristics for solving these constrained optimization problems sometimes converge rapidly but in other situations diverge and compute endlessly.  So his idea, very clever, was to take the quick answers but to just pull the plug on computations that take too long.  In effect, Bart argued that if the robot is faced with a motion-planning task it won't be able to solve before its next step occurs, it should take the previously-planned step and then try again.  Sooner or later the computation will converge quickly, and the overall path will be both high quality and fast.

We could do similar things in many IoT edge settings, like the smart-things cases enumerated earlier.  You might do better to have a smart grid that finds an optimized configuration setting once every few seconds, but then coasts along using old settings, than to pause to solve a very hard configuration problem 20 times per second if in doing so, you'll miss the deadline for actually using the solution.  The same is true for management of traffic flow on a highway or in a dense city.
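The budget-then-coast pattern can be sketched in a few lines (a toy simulation with invented names; not Bart's actual system): attempt the expensive optimization under a hard time budget, and on timeout reuse the last good answer rather than miss the control deadline.

```python
# Toy "coast on old settings" controller.  Names and budgets are invented.
import time


class CoastingController:
    def __init__(self, fallback):
        self.last_good = fallback

    def tick(self, optimizer, budget_s):
        """Run one control cycle; never exceed the time budget."""
        start = time.monotonic()
        result = optimizer(deadline=start + budget_s)
        if result is not None:                 # converged in time
            self.last_good = result
        return self.last_good                  # else: coast on old settings


def quick_opt(deadline):
    return "fresh-settings"                    # converges immediately


def slow_opt(deadline):
    while time.monotonic() < deadline:
        pass                                   # never converges in budget
    return None


ctl = CoastingController(fallback="default-settings")
print(ctl.tick(quick_opt, budget_s=0.01))   # -> fresh-settings
print(ctl.tick(slow_opt, budget_s=0.01))    # -> fresh-settings (coasting)
```

The control loop's period is fixed; only the freshness of the settings varies.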

For safety purposes, we will sometimes still want to maintain some form of risk envelope.  If I'm controlling a smart car in a decision-loop that runs 20 times per second, I might not run a big risk if I toss up my hands even 4 or 5 times in a row.  But we would not want to abandon active control entirely for 30 seconds, so there has to be a safety mechanism too, one that kicks in long before the car could cause an accident (or miss the next turn), forcing it into a safe mode.  I don't see any reason we couldn't do this: a self-driving car (or a self-managed smart highway) would need some form of safety monitor in any case, to deal with all sorts of possible mishaps, so having it play the role of making sure the vehicle has fresh guidance data seems like a fairly basic capability.  Then in the event of a problem, we would somehow put that car into a safe shutdown mode (it might use secondary logic to pull itself into a safety lane and halt, for example).
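Such a safety monitor amounts to a staleness watchdog.  Here is a minimal sketch (the threshold of 5 consecutive misses is an arbitrary illustration, not a real safety bound):

```python
# Toy safety monitor: tolerate a few missed control updates, but force a
# safe mode well before guidance goes stale for too long.
class SafetyMonitor:
    def __init__(self, max_consecutive_misses=5):
        self.max_misses = max_consecutive_misses
        self.misses = 0
        self.safe_mode = False

    def report_cycle(self, got_fresh_guidance: bool) -> bool:
        """Call once per control cycle; returns True if safe mode engaged."""
        self.misses = 0 if got_fresh_guidance else self.misses + 1
        if self.misses > self.max_misses:
            self.safe_mode = True   # e.g. pull into a safety lane and halt
        return self.safe_mode


mon = SafetyMonitor()
for _ in range(4):
    mon.report_cycle(False)         # a few misses in a row: still OK
assert not mon.safe_mode
for _ in range(3):
    mon.report_cycle(False)         # too many: safety logic takes over
assert mon.safe_mode
```

A real vehicle would obviously need far more than a miss counter, but the point is that staleness tracking is a basic, cheap capability to bolt onto any control loop.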

I could probably go on indefinitely, but every crystal ball eventually fogs over, so perhaps we'll call it quits here.  Have a great holiday and see you in the next decade!

Thursday, 7 November 2019

Sharable accelerators for the IoT Edge

After returning from the ACM Symposium on Operating Systems Principles (SOSP 2019, where we saw a huge number of ideas more focused on AI/ML than on classic operating systems), I wrote a blog posting that I shared a few days ago.

My core point was that we will face a kind of crisis in AI/ML computing as we try to leverage hardware accelerators.  The problem is that to leverage deep neural networks in IoT environments, both the training and the actions may need to occur under tight real-time pressures.  These IoT uses will generate huge volumes of data (images and video, voice, multi-spectral imaging data, and so forth).  Often, we need to combine information from multiple sources before we can extract knowledge (data "fusion").  Add those observations up and you find yourself looking at a class of questions that can only be addressed using hardware accelerators.  This is because the core tasks will be data-parallel: operations that can occur in a single highly parallel step on an accelerator, but that might require billions of clock-cycles on a normal general-purpose computer.

But the cost-effectiveness of today's accelerator hardware is very strongly tied to the world in which those devices have evolved.  Look at the SOSP papers and you read all sorts of results about improving the mapping of batch-style workloads into GPU and FPGA clusters (and the same insights apply to custom ASICs or TPUs).

Thus at the edge, we will find ourselves in a pinch: although the demand for cycles is similar when performing the inference or training task, these tasks are going to be event-driven: you fuse the data from a set of sources on a smart highway at the instant that you collect the data, with the intent of updating vehicle trajectory information and revising car guidance within seconds.  And the problem this poses is that the accelerator might spend most of its time waiting for work, even though when work does show up, it has magical superpowers that let it discharge the task within microseconds.
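Some back-of-envelope arithmetic makes the utilization gap stark.  The numbers below are invented purely for illustration, but the shape of the comparison is the point:

```python
# Utilization = fraction of the accelerator's time spent doing useful work.
def utilization(events_per_sec: float, secs_per_event: float) -> float:
    return events_per_sec * secs_per_event


# Batch back-end: one long training task keeps the device saturated.
batch = utilization(events_per_sec=1, secs_per_event=0.95)

# Edge: 100 events/sec, each needing 50 microseconds of accelerator time.
edge = utilization(events_per_sec=100, secs_per_event=50e-6)

print(f"batch: {batch:.0%}, edge: {edge:.2%}")  # batch: 95%, edge: 0.50%
```

At half a percent utilization, the per-event cost of a dedicated device is roughly 200 times what the same hardware costs in a saturated back-end.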

To me this immediately suggests that IoT could be profligate in its use of hardware accelerators, demanding a whole FPGA or GPU cluster for a single event.  Under any normal model of the cost of these devices, this approach would be extremely expensive.  I suppose you could make the case for some kind of bargain-basement device that might be a bit slower, less heavily provisioned with memory and otherwise crippled, and by doing that drive the costs down quite a bit.  But you'll also slow the tasks down, and that would then make ML for IoT a very feeble cousin to ML performed at the back-end on a big data analytics framework.

What would cost-effective IoT for the edge require, in terms of new hardware?  A first guess might be some form of time-sharing, to allow one device to be multiplexed between a large number of events from a single application that scales out to include many sensors (a relatively easy case), or between events from different users (a much harder one).  But I'm going to argue against this, at least if we base our reasoning on today's existing IoT accelerator options.

Multiplexing is exceptionally difficult for hardware accelerators.  Let's try and understand the root problems.  Now, in proposing this mini-deep-dive, I realize that many readers here are like me, and as such, probably view these units as black boxes.  But where we differ slightly is that over the past few years I've been forcing myself to learn more by teaching a class that looks at the roles of accelerators in modern datacenter computing.  I don't find these papers easy to read, or to teach, but along the way I've slowly gotten familiar with the models and learned about some interesting overheads.

When you look at a GPU in a datacenter, what are you likely to find under the hood?  As it happens, my colleague Weijia Song just asked this question, and arrived at a very nice high level summary.  He tells me that from a mile up, we should view a GPU computing unit as a specialized server, with its own memory (just like a normal computer, a GPU would be equipped with perhaps 12GB of DRAM), but then with processors that run a special kind of SIMD code (often written in a variant of C called CUDA) that performs block-parallel computations using blocks of GPU threads, each with its own GPU core.  L1 and L2 caching are software-managed, which is less exotic than you may think: with modern C and C++ we use annotations on variables to describe the desired consistency semantics and atomicity properties, and in fact the GPU scheme is rather similar.  So: we have a 3-level memory hierarchy, which can be understood by thinking about registers (L1), the normal kind of cache (L2) and DRAM resident on the GPU bus (like any normal DRAM, but fast to access from GPU code).

Weijia's summary is pretty close to what I had expected, although it was interesting to realize that the GPU has quite so much DRAM.  Interesting questions about how to manage that as a cache arise...

At any rate, the next thing to think about is this: when we use a GPU, we assume that it somehow has the proper logic loaded into it: the GPU program is up and ready for input.  But what is a GPU program?

It seems best to think of GPU code as a set of methods, written in CUDA, and comprising a kind of library: a library that was loaded into the GPU when we first took control over it, and that we can now issue calls into at runtime.  In effect the GPU device driver can load the address of a function into a kind of register, put the arguments into other registers, and press "run".  Later an interrupt occurs, and the caller knows the task is finished.  For big data arguments, we DMA transfer them into the GPU memory at the outset, and for big results we have a choice: we can leave them in the GPU memory for further use, or DMA them back out.
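The call pattern just described can be sketched in a few lines of Python.  This is purely an illustrative toy model, not a real driver: the class and method names (`GpuDevice`, `dma_in`, `launch`, `dma_out`) are all invented, and a real system would use something like the CUDA runtime rather than Python dictionaries.

```python
# Toy model of the GPU call pattern described above: DMA inputs onto the
# device, invoke a kernel from the preloaded "library", and either leave
# the result device-resident or DMA it back out.  (All names hypothetical.)
class GpuDevice:
    def __init__(self, kernels):
        self.kernels = kernels    # the "library", loaded once at startup
        self.dram = {}            # stands in for the GPU's onboard DRAM

    def dma_in(self, name, data):
        self.dram[name] = data    # host -> device transfer

    def launch(self, kernel, *arg_names):
        # the driver passes argument "addresses" and presses "run";
        # an interrupt (here, just the return) signals completion
        result = self.kernels[kernel](*(self.dram[a] for a in arg_names))
        self.dram["result"] = result   # big results can stay on the device
        return "result"

    def dma_out(self, name):
        return self.dram[name]    # device -> host transfer

gpu = GpuDevice({"scale": lambda xs: [2 * x for x in xs]})
gpu.dma_in("xs", [1, 2, 3])
handle = gpu.launch("scale", "xs")
print(gpu.dma_out(handle))  # -> [2, 4, 6]
```

The point of the sketch is the shape of the interaction: the expensive step (loading the kernel library) happens once, while each call is just a register setup plus possible DMA transfers.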

Now, loading that library had hardware implications and it took time: let's say 3 to 30 seconds, depending on whether or not a reboot was required.  So already we face an issue: if we wanted to share our GPU between multiple users, at a minimum we should get the users to agree on a single GPU program or set of programs at the outset, load them once, and then use them for a long time.  Otherwise, your decision to load such-and-such a tensor package will disrupt my ability to do data fusion on my smart highway data, and my clients (smart cars) will be at high risk of an accident.  After all: 3 to 30 seconds is quite a long delay!

Additionally, I should perhaps note that GPUs don't have particularly good internal security models.  If we actually do allow multiple users to share one GPU, we depend on a form of cooperation between the library modules to protect against information leaking from one user to another.  By and large sharing just isn't supported, but if GPUs really are shared, we either trust that this security is working, or we accept huge context switch overheads -- many seconds -- and encounter cold caches after each such context switch.  People are actually exploring this kind of sharing: several papers I've read recently take a single GPU, subdivide it among a few tasks, and even experiment with having that single device concurrently run distinct tasks in distinct "address spaces" (this is not the term they use, and the GPU analog of an address space is quite different from that of a general purpose machine with virtual addressing, but there is still a roughly similar construct).

But here's the thing: even if AWS offers me a steep discount for multitenancy, I'm not sure that I would want to share my smart-highway data fusion GPU.  After all, I don't know you.  Maybe you are running some kind of spyware application that uses sneaky methods to extract my data through our shared device!  AWS might promise some form of security through obscurity: "attackers would never know who they share the GPU unit with."  But would I want to trust that?

What about FPGA?  Here, the situation is analogous but a bit more flexible.  As you probably are aware, an FPGA is really a generic way to encode a chip: even the circuitry itself is reconfigurable.  An FPGA could encode a normal processor, like an ARM (although I wouldn't waste your time: most FPGA devices have a 6-core or 12-core ARM chip right onboard, so you might as well just run general purpose tasks on those).  If you want your FPGA to behave like a very stripped-down GPU, you can find a block of logic to do that.  Prefer an analog-to-digital conversion unit that also does an FFT on the incoming signal?  In most respects an FPGA could do that too (it might need a bit of additional hardware to help with the analog signals per se).  So the FPGA developer is like a chip developer, but rather than sending out the chip design to be burned into silicon (which yields an application-specific integrated circuit, or ASIC), you download your design into the FPGA, reboot it, and now your chip logic resides on this "bespoke" chip.

Like a GPU we could think of an FPGA as hosting a set of functions, but here we run into some oddities of the FPGA itself being a chip: at the end of the day, data comes in and out through the FPGA's pins, which have to be "shared" in some way if you want multiple functions on the one device.  Companies like Microsoft are exploring frameworks (they call them "shells") to own the infrastructure, so that this layer of sharing can be standardized.  But that work is nowhere near tackling security in the sense that the O/S community understands the term.  This is a kind of sharing aimed more at ensuring platform stability, not really isolation between potentially hostile competing users.

An FPGA is flexible: you can situate your new chip as a kind of filter interposed on the network (a bump-in-the-wire model), or can use it as a kind of special instruction set accessible from the host computer, and able to operate directly on host memory, or you can treat it much like a separate host side by side with your general purpose host, with its own storage, its own logic, and perhaps a DMA interface to your host memory.  But once again, there is just one set of pins for that FPGA chip.  So if multiple "users" share an FPGA, they wouldn't really be arbitrary general users of AWS.  More likely, they would be multiple tasks that all want to process the identical data in distinct ways: perhaps each incoming frame of data in an IoT setting needs to be decrypted, but then we want to run image segmentation side-by-side with some form of de-duplication, side-by-side with a lighting analysis.  Because you probably designed this entire logic path, you can safely be rather trusting about security.  And after all: given that the FPGA may have DMA access to your DRAM (perhaps via the PCIe bus, which has its own memory map, but perhaps directly via the main memory bus), you wouldn't want to time-share this unit among other users of the same cloud!
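That "same data, distinct analyses" style of sharing can be sketched abstractly: one front stage owns the pins (here, a stand-in decrypt step), and the fanned-out stages all consume the identical decrypted frame.  Everything below is invented for illustration; the stage functions are trivial stand-ins for real image-processing kernels.

```python
# Toy sketch of the FPGA sharing pattern described above: a single decrypt
# stage "owns the pins", and several analyses consume the same decrypted
# frame side by side.  (All stage functions are hypothetical stand-ins.)
def decrypt(frame):
    return frame.lower()               # stand-in for real decryption

def segment(frame):   return frame.split()          # "image segmentation"
def dedup(frame):     return sorted(set(frame.split()))
def lighting(frame):  return len(frame)             # "lighting analysis"

def pipeline(encrypted_frame, stages):
    frame = decrypt(encrypted_frame)   # single shared front stage
    return {stage.__name__: stage(frame) for stage in stages}

out = pipeline("CAR CAR TRUCK", [segment, dedup, lighting])
print(out["dedup"])   # -> ['car', 'truck']
```

The security observation in the text follows directly: because one team designed the whole path, the stages can trust one another; nothing here isolates stages written by mutually suspicious users.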

Which leads to my point: time-sharing of devices like FPGAs and GPUs is really not a very flexible concept today.  We can share within components of one application, but not between very different users who have never even met and have no intention of cooperating.   The context switch times alone tell us that you really wouldn't want these to change the code they run every time you launch a different end-user application.  You run huge security risks.  A device failure can easily trigger a hardware level host failure, and perhaps could take down the entire rack by freezing the network.  Yes, accelerators do offer incredible speedups, but these devices are not very sharable.

The immense surge in popularity of accelerators stems from big-data uses, but those are often heavily batched.  We take a million web pages, then perform some massively parallel task on them for an hour -- a task that might have run for a year without the accelerators.

But this is far from sharability in a general sense.  So, what does this tell us about the event-oriented style of computing seen at the IoT edge or in the IoT cloud (the first tier of the true cloud, where a datacenter deals with attached IoT devices)?  In a nutshell, the message is that we will be looking at an exceptionally expensive form of AI/ML unless we find ways to scale our IoT solutions and to batch the actions they are taking.

Here are two small examples to make things a bit more concrete.  Suppose that my IoT objective is to automate some single conference room in a big building.  In today's accelerator model, I may have to dedicate some parallel hardware to the task.  Unless I can multiplex these devices within my conference room application, I'll use a lot of devices per task compared to what back-end experience in the cloud may have led the developers to anticipate.  Thus my solution will be surprisingly costly, when compared with AI of the kind used to place advertising on web pages as we surf from our browsers: thousands of times more costly, because the hardware accelerators for ad placement are amortized over thousands of users at a time.
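The "thousands of times more costly" claim is really just amortization arithmetic.  A back-of-envelope version, with every number invented purely for illustration:

```python
# Back-of-envelope amortization arithmetic (all numbers hypothetical).
# An ad-placement accelerator serves thousands of users concurrently; a
# dedicated conference-room accelerator serves one room's trickle of events.
gpu_cost_per_hour = 3.00   # hypothetical hourly price of one accelerator

def cost_per_action(actions_per_hour, tenants_sharing):
    # cost of one useful action, amortized over all tenants' actions
    return gpu_cost_per_hour / (actions_per_hour * tenants_sharing)

ad_cost   = cost_per_action(actions_per_hour=10_000, tenants_sharing=1)
room_cost = cost_per_action(actions_per_hour=10, tenants_sharing=1)
print(round(room_cost / ad_cost))  # -> 1000: the "thousands of times" factor
```

The specific numbers don't matter; the point is that per-useful-action cost scales inversely with how many actions the hardware can be kept busy with.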

I suppose I would solve this by offering my conference room solution as a service, having a pool of resources adequate to handle 1000 conference rooms at a time, and hoping that I can just build and deploy more if demand surges and, some morning, 10,000 people try to launch the service.  In the cloud, where the need was for general purpose computing, AWS solved this precise puzzle by offering us various IaaS and PaaS options.  With my reliance on dedicated, private accelerators, I won't have that same option.  And if I blink and agree to use shared accelerators, aside from the 3-second to 30-second startup delay, I'll be accepting some risk of data leakage through the accelerator devices.  They were listening during that conference... and when it ends, someone else will be using that same device, which was created by some unknown FPGA or GPU vendor, and might easily have some kind of backdoor functionality.  So this could play out in my favor ("but AWS had its eye on such risks, and in fact all goes well").  Or perhaps not ("Massive security breach discovered at KACS (Ken's Awesome Conference Services): 3 years of conferences may have been covertly recorded by an unknown intruder").

In contrast, if I am automating an entire smart highway, I might have a shot at amortizing my devices, provided that I don't pretend to have security domains internal to the application that would translate into security obligations on my multi-tasked FPGA or GPU accelerator.  But true security down in that parallel hardware would be infeasible, I won't be able to dynamically reconfigure these devices at the drop of a hat, and I will still need to think about ways of corralling my IoT events into batches (without delaying them!) or I won't be able to keep the devices busy.

Now, facing this analysis, you could reach a variety of conclusions.  For me, working on Derecho, I don't see much of a problem today.  My main conclusion is simply that as we evolve Derecho into a more and more comprehensive edge IoT solution, we'll want to integrate FPGA and GPU support directly into our compute paths, and that we need to start to think about how to do batching when a Derecho-based microservice leverages those resources.  But these are early days and Derecho can be pretty successful long before questions of data-center cost and scalability enter the picture.

In fact the worry should be at places like AWS and Facebook and Azure and Google Cloud.  Those are the companies where an IoT "app" model is clearly part of the roadmap, and where these cost and security tradeoffs will really play out.

And in fact, if I were a data-center owner, this situation would definitely worry me!  On the one hand, everyone is betting big on IoT intelligence.  Yet nobody is thinking of IoT intelligence as costly in the sense that I might need to throw (in effect) 10x or 100x more specialized parallel hardware at these tasks, on a "per useful action" basis, to accomplish my goals.  For many purposes those factors of 100 could be prohibitive at real scale.

We'll need solutions within five or ten years.  They could be ideas that let users securely multiplex an FPGA or GPU, with far less context switching delay.  Or ideas that allow some form of receiver-side batching that avoids delaying actions but still manages to batch them, as we do in Derecho.  They could be ideas for totally new hardware models that would simply cost a lot less, like FPGA or GPU cores on a normal NUMA die (my worry for that case would be heat dissipation, but maybe the folks at Intel or AMD or NVIDIA would have a clever idea).
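The receiver-side batching idea can be illustrated in miniature.  The sketch below is only the shape of the idea, not Derecho's actual implementation: instead of delaying events on a timer, each accelerator pass simply takes everything that accumulated while the previous pass ran, so no individual event waits longer than one pass.

```python
# Sketch of receiver-side batching: each pass drains whatever is pending,
# so batches form naturally under load without an artificial delay.
# (Illustrative only; this is not Derecho's actual code.)
from collections import deque

def run_batched(events, process_batch):
    queue, results = deque(events), []
    while queue:
        # drain everything that is pending right now into one batch
        batch = [queue.popleft() for _ in range(len(queue))]
        results.append(process_batch(batch))  # one amortized accelerator call
    return results

# one accelerator call handles all five pending events, not five calls
print(run_batched([1, 2, 3, 4, 5], process_batch=sum))  # -> [15]
```

Under light load the batches degenerate to size one (no added latency); under heavy load they grow automatically, which is exactly what keeps an accelerator busy.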

The key point, though, remains: if we are serious about the IoT edge, we need to get serious about inventing the specialized accelerator models that the edge will require!

Tuesday, 5 November 2019

The hardest part of a cloud for IoT will be the hardware

At the recent SOSP 2019 conference up in Ottawa, I was one of the "old guard" who had expected a program heavy on core O/S papers aimed at improving performance, code quality or reliability for the O/S itself or for general applications.  How naïve of me: easily half the papers were focused on the performance of machine learning infrastructures or applications.  And even in the other half, relatively few fell into the category that I think of as classic O/S research.

Times change, though, and I'm going to make the case here that in fact the pendulum is about to swing back and in quite a dramatic way.  If we ask what AI and ML really meant in the decade or so leading up to this 2019 conference the answer leads directly to the back-end platforms that host big-data analytic tools.

Thus it was common to see a SOSP paper that started by pointing out that training a deep neural network can take many hours, during which the GPU sometimes stalls, starved for work to do.  Then the authors would develop some sort of technique for restructuring the DNN training system, the GPU would stay fully occupied, and performance would improve 10x.   What is perhaps a bit less evident is the extent to which these questions, and indeed the entire setup, center on the big-data back-end aspects of the problem.  The problem itself (as seen today) is inherently situated in that world: when you train a DNN, you generally work with vast amounts of data, and the whole game is to parallelize the task by batching the job, spreading the work over a large number of nodes that each tackle some portion of the computation, and then iterating, merging the sub-results as the DNN training system runs.  The whole setup is about as far from real-time as it gets.

So we end up with platforms like MapReduce and Hadoop (Spark), or MPI for the HPC fans in the room.  These make a lot of sense in a batched, big-data, back-end setting.

But I'm a huge believer in the cloud edge: IoT systems with real-time sensors, that both make decisions and even learn as they run, often under intense time pressure.  Take Microsoft's Farmbeats product, led by a past student of mine who was a star from the outset.  Today, Ranveer Chandra has become the Chief Scientist for Azure Global and is the owner of this product line.  Farmbeats has a whole infrastructure for capturing data (often video or photos) from cameras.  It can support drones that will scan a farm for signs of insect infestations or other problems.  It replans on the fly, and makes decisions on the fly: spray this.  Bessie seems to be favoring her right hoof, and our diagnostic suggests a possible tear in the nail: call the vet.  The tomatoes in the north corner of the field look very ripe and should be picked today.  Farmbeats even learns wind patterns on the fly, and reschedules the drones doing the search pattern to glide on the breeze as a way to save battery power.

And this is the future.  Last time I passed through Newark Airport, every monitor was switching from ads for Microsoft's Farmbeats to ads for Amazon's smart farm products, back and forth.  The game is afoot!

If you walk the path for today's existing DNN training systems and classifiers, you'll quickly discover that pretty much everything gets mapped to hardware accelerators.  Otherwise, it just wouldn't be feasible at these speeds, and this scale.  The specific hardware varies: at Google, TPUs and TPU clusters, and their own in-house RDMA solutions.  Amazon seems very fond of GPUs and very skeptical of RDMA; they seem convinced that fast datacenter TCP is the way to go.  Microsoft has been working with a mix: GPUs and FPGAs, configurable into ad-hoc clusters that treat the infrastructure like an HPC supercomputing communications network, complete with the kinds of minimal latencies that MPI assumes.  All the serious players are interested in hearing about RDMA-like solutions that don't require a separate InfiniBand network and that can be trusted not to hose the entire network.  Now that you can buy SSDs built from 3-D XPoint, nobody wants anything slower.

And you know what?  This all pays off.  Those hardware accelerators are data-parallel, and no general purpose CPU can possibly compete with them.  They would be screamingly fast even without attention to the exact circuitry used to do the computing, but in fact many accelerators can be configured to omit unneeded logic: an FPGA can be pared down until the circuits implement only the exact operations actually needed, to the exact precision desired.  If you want matrix multiplication with 5-bit numbers, there's an FPGA solution for that.  Not a gate or wire wasted... and hence no excess heat, no unneeded slowdowns, and more FPGA real estate available to make the basic computational chunk sizes larger.  That's the world we've entered, and if all these acronyms dizzy you, well, all I can do is say that you get used to them quickly!
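What 5-bit arithmetic means in practice can be shown in software, even though an FPGA would do it in bare wires.  The quantization scheme below is a made-up, simplistic one (linear mapping onto 0..31), just to make the "exact precision desired" idea concrete:

```python
# Illustration of reduced-precision arithmetic: values are quantized to
# 5 bits (0..31) before multiplying, as an FPGA pared down to 5-bit
# multipliers would compute.  (Quantization scheme invented for this sketch.)
def quantize5(x, x_max):
    return min(31, round(x / x_max * 31))   # map [0, x_max] onto 5-bit range

def matmul5(a, b, x_max=1.0):
    qa = [[quantize5(x, x_max) for x in row] for row in a]
    qb = [[quantize5(x, x_max) for x in row] for row in b]
    n, m, k = len(qa), len(qb[0]), len(qb)
    return [[sum(qa[i][t] * qb[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

# 1.0 quantizes to 31, so the dot product comes out in quantized units
print(matmul5([[1.0, 0.0]], [[1.0], [1.0]]))  # -> [[961]] (31*31 + 0*31)
```

On an FPGA, every multiplier in such a design is physically 5 bits wide, which is where the savings in gates, heat and real estate come from.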

To me this leads to the most exciting of the emerging puzzles: How will we pull this same trick off for the IoT edge? (I don't mean specifically the Azure IoT Edge, but more broadly: both remote clusters and also the edge of the main cloud datacenter).   Today's back-end solutions are more optimized for batch processing and relaxed time constraints than people probably realize, and the same hardware that seems so cost-effective for big data may seem quite a bit less efficient if you try to move it out to near the edge computing devices.

In fact it gets even more complicated.  The edge is a world with some standards.  Consider Azure IoT: the devices are actually managed by the IoT Hub, which provides a great deal of security (particularly in conjunction with IoT Sphere, a hardware security model with its own TCB).  IoT events trigger Azure functions: lightweight stateless computations that are implemented as standard programs but that run in a context where we don't have FPGA or GPU support: initializing those kinds of devices to support a particular kernel (a particular computational functionality) typically requires at least 3-5 seconds to load the function, then reboot the device.  I'm not saying it can't be done: if you know that many IoT functions will need to run a DNN classifier, I suppose you could position accelerators in the function tier and preload the DNN classifier kernel.  But the more general form of the story, where developers for a thousand IoT companies create those functions, leads to such a diversity of needs from the accelerator that it just could never be managed.

So... it seems unlikely that we'll be doing the fast stuff in the function tier.  More likely, we'll need to run our accelerated IoT logic in microservices, living behind the function tier on heavier nodes with beefier memories, stable IoT accelerators loaded with exactly the kernels the particular microservice needs, and then all of this managed for fault-tolerance with critical data replicated (and updated as the system learns: we'll be creating new ML models on the fly, like Farmbeats and its models of cow health, its maps of field conditions, and its wind models).   As a side-remark, this is the layer my Derecho project targets.

But consider the triggers: in the back-end, we ran on huge batches of events, perhaps tens of thousands at a time, in parallel.  At the edge, a single event might trigger the whole computation.  Even to support this, we will still need RDMA to get the data to the microservice instance(s) charged with processing it (via the IoT functions, which behave like small routers, customized by the developer), fast NVM like Optane, and then those FPGA, GPU or TPU devices to do the heavily parallel work.  How will we keep these devices busy enough to make such an approach cost-effective?

And how will the application developer "program" the application manager to know how these systems need to be configured: this microservice needs one GPU per host device; that one needs small FPGA clusters with 32 devices but will only use them for a few milliseconds per event, this other one needs a diverse mix, with TPU here, GPU there, FPGA on the wire...   Even "describing" the needed configuration looks really interesting, and hard.
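Even a first cut at such a description language is interesting to sketch.  Everything below is invented (the field names, the placement vocabulary, the idea of preloaded kernels as strings); it is only meant to suggest what a developer might have to express to the application manager:

```python
# Hypothetical sketch of how a developer might "describe" a microservice's
# accelerator configuration needs.  Every field name here is invented.
from dataclasses import dataclass, field

@dataclass
class AcceleratorSpec:
    kind: str                  # "gpu", "fpga", or "tpu"
    per_host: int = 1          # devices required per host node
    cluster_size: int = 1      # devices ganged into one logical cluster
    preloaded_kernels: list = field(default_factory=list)
    placement: str = "host-attached"   # or "bump-in-the-wire"

# one microservice might declare a diverse mix, per the text above:
data_fusion_needs = [
    AcceleratorSpec("gpu", per_host=1, preloaded_kernels=["dnn_classify"]),
    AcceleratorSpec("fpga", cluster_size=32, placement="bump-in-the-wire"),
]
print(data_fusion_needs[1].cluster_size)  # -> 32
```

Of course the hard part isn't the schema: it is deciding what the manager should do when two microservices declare incompatible kernel preloads for the same pool of devices.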

You see, for researchers like me, questions like those aren't a reason to worry.  They are more like a reason to celebrate: plenty to think about, lots of work to do and papers to write, and real challenges for our students to tackle.  Systems challenges.

So you know what?  I have a feeling that by 2020 or 2021, OSDI and SOSP will start to feel more like classic systems conferences again.  The problems I've outlined look more like standard O/S topics, even if the use-case that drives them is dominated by AI and ML and real-time decision-making for smart farms (or smart highways, smart homes, smart cities, smart grids).  The money will be there -- Bill Gates once remarked during a Cornell visit that the IoT revolution could easily dwarf the Internet, Web and Cloud ones (I had a chance to ask why, and his answer was basically that there will be an awful lot of IoT devices out there... so how could the revolution not be huge?).

But I do worry that our funding agencies may be slow to understand this trend.  In the United States, graduate students and researchers are basically limited to working on problems for which they can get funding.  Here we have a money-hungry and very hard question, that perhaps can only be explored in a hands-on way by people actually working at companies like Google and Microsoft and Amazon.  Will this lead the NSF and DARPA and other agencies to adopt a hands-off approach?

I'm hoping that one way or another, the funding actually will be there.  Because if groups like mine can get access to the needed resources, I bet we can show you some really cool new ideas for managing all that hardware near the IoT edge.  And I bet you'll be glad we invented them, when you jump into your smart car and tell it to take the local smart highway into the office.  Just plan ahead: you won't want to miss OSDI next year, or SOSP in 2021!

Thursday, 10 October 2019

If the world was a (temporal) database...

I thought I might share a problem we've been discussing in my Cornell PhD-level class on programming at the IoT edge.

Imagine that at some point in the future, a company buys into the idea (my idea!) that we'll really need smart highway systems to ever safely use self-driving cars. A bit like air traffic control, but scaled up.

In fact a smart highway might even offer guidance and special services to "good drivers", such as permission to drive at 85mph in a special lane for self-guided cars and superior drivers... for a fee.  Between the smart cars and the good drivers, there would be a lot of ways to earn revenue here.  And we could even make some kind of dent in the endless gridlock that one sees in places like Silicon Valley from around 6:45am until around 7pm.

So this company, call it Smart Highways Inc, sets out to create the Linux of smart highways: a new form of operating system that the operator of the highway could license to control their infrastructure.

What would it take to make a highway intelligent?  It seems to me that we would basically need to deploy a great many sensors, presumably in a pattern intended to give us some degree of redundancy for fault-tolerance, covering such things as roadway conditions, weather, vehicles on the road, and so forth.

For each vehicle we would want to know various things about it: when it entered the system (for eventual tolls, which will be the way all of this pays for itself), its current trajectory (a path through space-time annotated with speeds and changes in speed or direction), information about the vehicle itself (is it smart?  is the driver subscribed to highway guidance or driving autonomously?), and so forth.

Now we could describe a representative "app": perhaps, for the stretch of CA 101 from San Francisco to San Jose, a decision has been made to document the "worst drivers" over a one month period.  (Another revenue opportunity: this data could definitely be sold to insurance companies!)   How might we do this?  And in particular, how might we really implement our solution?

What I like about this question is that it casts light on exactly the form of Edge IoT I've been excited about.  On the one hand, there is an AI/ML aspect: automated guidance to the vehicles by the highway, and in this example, an automated judgement about the quality of driving.  One would imagine that we train an expert system to take trajectories as input and output a quality metric: a driver swerving between other cars at high speed, accelerating and turning abruptly, braking abruptly, etc: all the hallmarks of poor driving!

But if you think more about this you'll quickly realize that to judge quality of driving you need a bit more information.  A driver who swerves in front of another car with inches to spare, passes when there are oncoming vehicles, causes others to brake or swerve to avoid collisions -- that driver is far more of a hazard than a driver who swerves suddenly to avoid a pothole or some other form of debris, or one who accelerates only while passing, and passes only when nobody is anywhere nearby in the passing lane.  A driver who stays to the right "when possible" is generally considered to be a better driver than one who lingers in the left lane, if the highway isn't overly crowded.

A judgment is needed: was this abrupt action valid, or inappropriate?  Was it good driving that evaded a problem, or reckless driving that nearly caused an accident?

So in this you can see that our expert system will need expert context information.  We would want to compute the set of cars near each vehicle's trajectory, and would want to be able to query the trajectories of those cars to see if they were forced to take any kind of evasive action.  We need to synthesize metrics of "roadway state" such as crowded or light traffic, perhaps identify bunches of cars (even on a lightly driven road we might see a grouping of cars that effectively blocks all the lanes), etc.  Road surface and visibility clearly are relevant, and roadway debris.  We would need some form of composite model covering all of these considerations.

I could elaborate but I'm hoping you can already see that we are looking at a very complicated real-time database question.  What makes it interesting to me is that on the one hand, it clearly does have a great deal of relatively standard structure (like any database): a schema listing information we can collect for each vehicle (of course some "fields" may be populated for only some cars...), one for each segment of the highway, perhaps one for each driver.  When we collect a set of documentation on bad drivers of the month, we end up with a database with bad-driver records in it: one linked to vehicle and driver (after all, a few people might share one vehicle and perhaps only some of the drivers are reckless), and then a series of videos or trajectory records demonstrating some of the "all time worst" behavior by that particular driver over the past month.

But on the other hand, notice that our queries also have an interesting form of locality: they are most naturally expressed as predicates over a series of temporal events that make up a particular trajectory, or a particular set of trajectories: on this date at this time, vehicle such-and-such swerved to pass an 18-wheeler truck on the right, then dove three lanes across to the left (narrowly missing the bumper of a car as it did so), accelerated abruptly, braked just as suddenly and dove to the right...  Here, I'm describing some really bad behavior, but the behavior is best seen as a time-linked (and driver/vehicle-linked) series of events that are easily judged as "bad" when viewed as a group, and yet that actually would be fairly difficult to extract from a completely traditional database in which our data is separated into tables by category.  
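To make the "predicates over a series of temporal events" idea concrete, here is a toy sketch.  The event types, schema and thresholds are all invented for illustration; a real system would be judging video-derived trajectories with an ML model, not hand-written rules:

```python
# Toy sketch of a "bad driving" query as a predicate over a time-linked
# sequence of heterogeneous events.  (Schema and thresholds are invented.)
trajectory = [   # (time_s, event_type, details) for one vehicle
    (0.0, "pass_on_right", {"target": "18-wheeler"}),
    (1.5, "lane_change",   {"lanes": 3, "clearance_m": 0.3}),
    (2.0, "accelerate",    {"delta_mph": 25}),
    (3.0, "brake",         {"delta_mph": -30}),
]

def bad_driving_score(trajectory, window_s=5.0):
    # temporal locality is the key: abrupt maneuvers count against the
    # driver when they cluster within a short window
    abrupt = [t for t, kind, d in trajectory
              if (kind in ("accelerate", "brake") and abs(d["delta_mph"]) > 20)
              or (kind == "lane_change" and d["clearance_m"] < 1.0)]
    return sum(1 for t in abrupt if t - abrupt[0] <= window_s)

print(bad_driving_score(trajectory))  # -> 3
```

Notice that each event row carries a different payload: exactly the heterogeneity that makes this awkward to express over a conventional set of per-category tables.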

Standard databases (and even temporal ones) don't offer particularly good ways to abstract these kinds of time-related event sequences if the events themselves are from a very diverse set.  The tools are quite a bit better for very regular structures, and for time series data with identical events -- and that problem arises here, too.  For example, when computing a vehicle trajectory from identical GPS records, we are looking at a rather clean temporal database question, and some very good work has been done on this sort of thing (check out TimescaleDB, created by students of my friend and colleague, Mike Freedman!).  But the full-blown problem clearly is the very diverse version, and it is much harder.  I'm sure you are thinking about a level of indirection and so forth, and yes, this is how I might approach such a question -- but it would be hard even so.

In fact, is it a good idea to model a temporal trajectory as a database relation?  I suspect that it could be, and that representing the trajectory that way would be useful, but this particular kind of relation just lists events and their sequencing.  Think also about the issue of event type mentioned above: here we have events linked by the fact that they involve some single driver (directly or perhaps indirectly -- maybe quite indirectly).  Each individual event might well be of a different type: the data documenting "caused some other vehicle to take evasive action" might depend on the vehicle, and the action, and would be totally different from the data documenting "swerved across three lanes" or "passed a truck on its blind side."

Even explaining the relationships and causality can be tricky: Well, car A swerved in front of car B, which braked, causing C to brake, causing D to swerve and impact E.  A is at fault, but D might be blamed by E!

In fact, as we move through the world -- you or me, as drivers of our respective vehicles, or for that matter even as pedestrians trying to cross the road -- this building of self-centered, domain-specific temporal databases seems to be something we do very commonly, and yet don't model particularly well in today's computing infrastructures.  Moreover, while you and I are quite comfortable with highways that might carry cars and trucks, motorcycles, double-length trucks, police enforcement vehicles, ambulances, construction vehicles... extremely distinct "entities" that are all capable of turning up on a highway... our standard ways of building databases seem a bit overly structured for dealing with this kind of information.

Think next about IoT scaling.  If we had just one camera, aimed at one spot on our highway, we still could do some useful tasks with it: we could for example equip it with a radar-speed detector that would trigger photos and use that to automatically issue speeding tickets, as they do throughout Europe.  But the task I described above fuses information from what may be tens of thousands of devices deployed over a highway more than 100 miles long at the location I specified, and that highway could have a quarter-million vehicles on it at peak commute hours.

As a product opportunity, Smart Highways Inc is looking at a heck of a good market -- but only if they can pull off this incredible scaling challenge.  They won't simply be applying their AI/ML "driving quality" evaluation to individual drivers, using data from within the car (that easier task is the one Hari Balakrishnan's Cambridge Mobile Telematics has tackled, and even this problem has his company valued in the billions as of round A).  Smart Highways Inc is looking at the cross-product version of that problem: combining data across a huge number of sensor inputs, fusing the knowledge gained, and eventually making statements that involve observations taken at multiple locations by distinct devices.  Moreover, we would be doing this at highway scale, concurrently, for all of the highway all of the time.

In my lecture today, we'll be talking about MapReduce, or more properly, the Spark/Databricks version of Hadoop, which combines an open source version of MapReduce with extensions to maximize the quality of in-memory caching and introduces a big-data analytic ecosystem.  The aspect of interest to me is the caching mechanism:  Spark centers on a kind of cacheable query object they call a Resilient Distributed Dataset, or RDD.  An RDD describes a scalable computation designed to be applicable across a sharded dataset, which enables a form of SIMD computing at the granularity of files or tensors being processed on huge numbers of compute nodes in a datacenter.
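To make the idea concrete, here is a minimal sketch of the RDD concept in plain Python -- lazy lineage of transformations over sharded data, plus caching of the evaluated result.  This is not Spark's actual API; MiniRDD and its methods are invented purely for illustration:

```python
# A toy RDD: records a lineage of transformations lazily, evaluates them
# shard-by-shard only when asked, and caches the result for reuse.
class MiniRDD:
    def __init__(self, shards, lineage=()):
        self._shards = shards          # list of lists: one list per shard
        self._lineage = lineage        # recorded transformations (lazy)
        self._cache = None             # filled on first evaluation

    def map(self, fn):                 # no work yet: just extend the lineage
        return MiniRDD(self._shards, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._shards, self._lineage + (("filter", pred),))

    def collect(self):                 # evaluate the lineage, shard by shard
        if self._cache is None:
            out = []
            for shard in self._shards:
                data = shard
                for op, fn in self._lineage:
                    data = [fn(x) for x in data] if op == "map" \
                           else [x for x in data if fn(x)]
                out.extend(data)
            self._cache = out          # later queries reuse the cached result
        return self._cache

speeds = MiniRDD([[55, 80, 62], [91, 40]])          # two shards of readings
fast = speeds.filter(lambda mph: mph > 75).map(lambda mph: mph - 65)
print(fast.collect())   # deltas over a 65 mph limit: [15, 26]
```

In real Spark the shards live on many machines and the lineage also supports recomputing a lost shard after a failure, which is where the "Resilient" in the name comes from.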

The puzzle for my students, which we'll explore this afternoon, is whether the RDD model could be transferred out of the batched, non-real-time settings in which it normally runs (settings that are also purely functional, in the sense that Spark treats every computation as a series of read-only data transformation steps, from a static batched input set through a series of MapReduce stages to a sharded, distributed result).  So our challenge is: could a graph of RDDs and an iterative compute model express tasks like the Smart Highway ones?

RDDs are really a linkage between database models and a purely functional Lisp-style Map and Reduce functional computing model.  I've always liked them, although my friends who do database research tend to view them dimly, at best.  They often feel that all of Spark is doing something a pure database could have done far better (and perhaps more easily).  Still, people vote with their feet and for whatever reason, this RDD + computing style of coding is popular.

So... could we move RDDs to the edge?  Spark itself, clearly, wouldn't be the proper runtime: it works in a batched way, and our setting is event-driven, with intense real-time needs.  It might also entail taking actions in real-time (even pointing a camera or telling it to take a photo, or to upload one, is an action).  So Spark per se isn't quite right here.  Yet Spark's RDD model feels appropriate.  TensorFlow uses a similar model, by the way, so I'm being unfair when I treat this as somehow Spark-specific.  I just have more direct experience with Spark, and additionally, see Spark RDDs as a pretty clear match to the basic question of how one might start to express database queries over huge IoT sensor systems with streaming data flows.  TensorFlow has many uses, but I've seen far more work on using it within a single machine, to integrate a local computation with some form of GPU or TPU accelerator attached to that same host.  And again, I realize that this may be unfair to TensorFlow.  (And beyond that I don't know anything at all about Julia, yet I hear that system name quite often lately...)

Anyhow, back to RDDs.  If I'm correct, maybe someone could design an IoT Edge version of Spark, one that would actually be suitable for connecting to hundreds of thousands of sensors, and that could really perform tasks like the one outlined earlier in real-time.  Could this solve our problem?  It does need to happen in real-time: a smart highway generates far too much data per second to keep much of it, so a quick decision is needed that we should document the lousy driving of vehicle A when driver so-and-so is behind the wheel, because this person has caused a whole series of near accidents and actual ones -- sometimes, quite indirectly, yet always through his or her recklessness.  We might need to make that determination within seconds -- otherwise the documentation (the raw video and images and radar speed data) may have been discarded.

If I were new to the field, this is the problem I personally might tackle.  I've always loved problems in systems, and in my early career, systems meant databases and operating systems.  Here we have a problem of that flavor.

Today, however, scale and data rates and sheer size of data objects are transforming the game.  The kind of system needed would span entire datacenters, and we will need to use accelerators on the data path to have any chance at all of keeping up.  So we have a mix of old and new... just the kind of problem I would love to study, if I were hungry for a hugely ambitious undertaking.  And who knows... if the right student knocks on my door, I might even tackle it.

Wednesday, 24 July 2019

In theory, asymptotic complexity matters. In practice...

Derecho matches Keidar and Shraer's lower bounds for dynamically uniform agreement:  no Paxos protocol can safely deliver messages with fewer "information exchange" steps.  But does this matter?

Derecho targets a variety of potential deployments and use cases.  A common use would be to replicate state within some kind of "sharded" service -- a big pool of servers but broken into smaller replicated subservices that use state machine replication in subsets of perhaps 2, 3 or 5.  A different use case would be for massive replication -- tasks like sharing a VM image, a container, or a machine-learned model over huge numbers of nodes.  In those cases the number of nodes might be large enough for asymptotic protocol complexity bounds to start to matter -- Derecho's optimality could be a winning argument.  But would an infrastructure management service really stream high rates of VM images, containers, and machine-learned models? I suspect that this could arise in future AI Systems... it wouldn't today.

All of which adds up to an interesting question: if theoretical optimality is kind of a "meh" thing, what efficiency bounds really matter for a system like Derecho?  And how close to ideal efficiency can a system like this really come?

To answer this question, let me start by arguing that 99% of Derecho can be ignored.  Derecho actually consists of a collection of subsystems: you link your C++ code to one library, but internally, that library has several distinct sets of "moving parts".  A first subsystem is concerned with moving bytes: our data plane.  The second worries about data persistency and versioning.  A third is where we implement the Paxos semantics: Derecho's control plane.  In fact it handles more than just Paxos -- Derecho's control plane is a single thread that loops through a set of predicates, testing them one by one and then taking triggered actions for any predicate that turns out to be enabled.  A fourth subsystem handles requests that query the distributed state: it runs purely on data that has become stable and is totally lock-free and asynchronous -- the other three subsystems can ignore this one entirely.  In fact the other three subsystems are as lock-free and asynchronous as we could manage, too -- this is the whole game when working with high speed hardware, because the hardware is often far faster than the software that manages it.  We like to think of the RDMA layer and the NVM storage as two additional concurrent systems, and our way of visualizing Derecho is a bit like imagining a machine with five separate moving parts that interact in a few spots, but are as independent as we could manage.
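As a sketch of that control-plane style -- a toy in Python, not Derecho's actual C++ code; the predicates and state here are invented for illustration:

```python
# A single "thread" loops over (predicate, action) pairs, firing the action
# for any predicate that has become enabled, until nothing more is enabled.
def run_control_loop(predicates, state, max_rounds=100):
    for _ in range(max_rounds):
        fired = False
        for pred, action in predicates:
            if pred(state):
                action(state)
                fired = True
        if not fired:              # quiesce once no predicate is enabled
            break
    return state

state = {"received": 3, "delivered": 0}
rules = [
    # deliver the next message once it is known to be stably received
    (lambda s: s["delivered"] < s["received"],
     lambda s: s.update(delivered=s["delivered"] + 1)),
]
run_control_loop(rules, state)
print(state["delivered"])   # → 3
```

The appeal of this structure is that all the tricky protocol logic is sequential and single-threaded, while the data plane runs concurrently and lock-free underneath it.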

For steady state performance -- bandwidth and latency -- we can actually ignore everything except the update path and the query path.  And as it happens, Derecho's query path is just like any query-intensive read-only subsystem: it uses a ton of hashed indices to approximate one-hop access to objects it needs, and it uses RDMA if that one hop involves somehow fetching data from a remote node, or sending a computational task to that remote node.  This leads to fascinating questions, in fact: you want those paths to be lock-free, zero-copy, ideally efficient, etc.  But we can set those questions to the side for our purposes here -- results like the one by Keidar and Shraer really are about update rates.  And for this, as noted a second ago, almost nothing matters except the data-movement path used by the one subsystem concerned with that role.  Let's have a closer look.

For large transfers Derecho uses a tree-based data movement protocol that we call a binomial pipeline.  In simple terms, we build a binary tree, and over it, create a flow pattern of point to point block transfers that obtains a high level of internal concurrency, like a two-directional bucket brigade (we call this a reliable multicast over RDMA, or "RDMC").  Just like in an actual bucket brigade, every node settles into a steady behavior, receiving one bucket of data (a "chunk" of bytes) as it sends some other bucket, more or less simultaneously.  The idea is to max-out the RDMA network bandwidth (the hardware simply can't move data more efficiently).  The actual data structure creates a hypercube "overlay" (a conceptual routing diagram that lives on our actual network, which allows any-to-any communication) of dimension d, and then wraps d binomial trees over it, and you can read about it in our DSN paper, or in the main Derecho paper.

A binary tree is the best you can hope for when using point-to-point transfers to replicate large, chunked, objects.  And indeed, when we measure RDMC, it seems to do as well as one can possibly do on RDMA, given that RDMA lacks a reliable one-to-many chunk transfer protocol.   So here we actually do have an ideal mapping of data movement to RDMA primitives.
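A back-of-envelope way to see why pipelining matters -- my own rough model, not the DSN paper's analysis: count rounds of point-to-point block transfers needed to replicate an object of k chunks to n nodes.

```python
import math

def naive_rounds(n, k):
    # the sender pushes every chunk to every other node itself
    return k * (n - 1)

def pipelined_rounds(n, k):
    # once the pipeline fills (about log2(n) rounds), nodes relay chunks
    # as they receive them, so one chunk completes per round thereafter
    return k + math.ceil(math.log2(n)) - 1

print(naive_rounds(16, 100), pipelined_rounds(16, 100))   # 1500 vs 103
```

For large objects (k much bigger than log n) the pipelined scheme approaches the obvious lower bound of one round per chunk, which is the sense in which the data movement is close to ideal.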

Unfortunately, RDMC isn't very helpful for data too small to "chunk".  If we don't have enough data, a binomial tree won't settle into its steady-state bucket brigade mode and we would just see a series of point-to-point copying actions.  This is still "optimal" at large scale, but recall that often we will be replicating in a shard of size two, three or perhaps five.  We decided that Derecho needed a second protocol for small multicasts, and Sagar Jha implemented what he calls the SMC protocol.

SMC is very simple.  The sender, call it process P, has a window, and a counter.  To send a message, P places the data in a free slot in its window (each sender has a different window, so we mean "P's window"), and increments the counter (again, P's counter).  When every receiver (call them P, Q and R: this protocol actually loops data back, so P sends to itself as well as to the other shard members) has received the message, the slot is freed and P can reuse it, round-robin.  In a shard of size three where all the members send, there would be one instance of this per member: three windows, three counters, three sets of receive counters (one per sender).
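To make the bookkeeping concrete, here is a toy model of one sender's state.  The class name, window size and member names are mine for illustration; the real SMC maintains this state in RDMA-shared memory rather than Python objects:

```python
# One sender's SMC state: a round-robin window of slots plus a send counter;
# a slot is reclaimed only after every shard member has acknowledged it.
class SmcSender:
    def __init__(self, window_size, members):
        self.window = [None] * window_size
        self.sent = 0                       # P's send counter
        self.freed = 0                      # slots reclaimed so far
        self.acked = {m: 0 for m in members}

    def send(self, msg):
        slot = self.sent % len(self.window)
        if self.window[slot] is not None:
            raise RuntimeError("window full: waiting on a slow receiver")
        self.window[slot] = msg
        self.sent += 1

    def ack(self, member):
        # a message is stable once every member (sender included) has it
        self.acked[member] += 1
        stable = min(self.acked.values())
        while self.freed < stable:          # reclaim fully-acknowledged slots
            self.window[self.freed % len(self.window)] = None
            self.freed += 1

p = SmcSender(window_size=2, members=["P", "Q", "R"])
p.send("m0"); p.send("m1")    # window is now full
for m in ["P", "Q", "R"]:
    p.ack(m)                  # everyone has received m0: slot 0 is reclaimed
p.send("m2")                  # ... so the slot can be reused, round-robin
```

In a shard of size three where all the members send, there would be one such instance per member, exactly as described above.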

SMC is quite efficient with small shards.  RDMA has a direct-remote-write feature that we can leverage (RDMC uses a TCP-like feature where the receiver needs to post a buffer before the sender transmits, but this direct write is different: here the receiver declares a region of memory into which the sender can do direct writes, without locking).

Or is it?  Here we run into a curious philosophical debate that centers on the proper semantics of Derecho's ordered_send: should an ordered_send be immediate, or delayed for purposes of buffering, like a file I/O stream?  Sagar, when he designed this layer, opted for urgency.  His reasoning was that if a developer can afford to batch messages and send big RDMC messages that carry thousands of smaller ones, this is exactly what he or she would do.  So a developer opting for SMC must be someone who prioritizes immediate sends, and wants the lowest possible latency.

So, assume that ordered_send is required to be "urgent".  Let's count the RDMA operations that will be needed to send one small object from P to itself (ordered_send loops back), Q and R.  First we need to copy the data from P to Q and R: two RDMA operations, because  reliable one-sided RDMA is a one-to-one action.  Next P increments its full-slots counter and pushes it too -- the updated counter can't be sent in the same operation that sends the data because RDMA has a memory consistency model under which a single operation that spans different cache-lines only guarantees sequential consistency on a per-cache-line basis, and we wouldn't want P or Q to see the full-slots counter increment without certainty that the data would be visible to them.  You need two distinct RDMA operations to be sure of that (each is said to be "memory fenced.")  So, two more RDMA operations are required.  In our three-member shard, we are up to four RDMA operations per SMC multicast.

But now we need acknowledgements.  P can't overwrite the slot until P, Q and R have received the data and looked at it, and to report when this has been completed, the three update their receive counters.  These counters need to be mirrored to one-another (for fault-tolerance reasons), so P must send its updated receive counter to Q and R, Q to P and R, and R to P and Q: six more RDMA operations, giving a total of ten.  In general with a shard of size N, we will see 2*(N-1) RDMA operations to send the data and count, and N*(N-1) for these receive counter reports, a total of N^2+N-2.  Asymptotically, RDMC will dominate because of the N^2 term, but N would need to be much larger than five for this to kick in.  At a scale of two to five members, we can think of N as more or less a constant, and so this entire term is like a constant.
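The tally above can be written down directly -- a trivial sketch of the counting argument, not Derecho code:

```python
# RDMA operations per urgent SMC multicast in a shard of size n:
# 2*(n-1) to push the data and the full-slots counter (each memory-fenced),
# plus n*(n-1) receive-counter reports mirrored among all members.
def urgent_smc_ops(n):
    data_and_counter = 2 * (n - 1)
    ack_mirroring = n * (n - 1)
    return data_and_counter + ack_mirroring     # simplifies to n**2 + n - 2

print([urgent_smc_ops(n) for n in (2, 3, 5)])   # [4, 10, 28]
```

At n = 3 this gives the ten operations counted in the text, and for the shard sizes Derecho typically uses (two to five) the quadratic term never has a chance to dominate.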

So by this argument, sending M messages using SMC with an urgent-send semantic "must" cost us M*(N^2+N-2) RDMA operations.  Is this optimal?

Here we run into a hardware issue.  If you check the specification for the Connect X4 Mellanox device used in my group's experiments, you'll find that it can transmit 75M RDMA messages per second, and also that it has peak performance of 100Gbps (12.5GB/s) in each link direction.   But if your 75M messages are used to report updates to tiny little 4-byte counters, you haven't used much of the available bandwidth: 75M times 4 bytes is only 300MB/s, and as noted above, the device is bidirectional.  Since we are talking about bytes, the bidirectional speed could be as high as 25GB/s with an ideal pattern of transfers.  Oops: we're too slow by a factor of more than 80x!
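The arithmetic behind that mismatch, using the figures quoted above:

```python
# 75M tiny counter updates per second use only a sliver of the link's
# byte bandwidth (Connect X4 figures as quoted in the text).
msgs_per_sec = 75e6
counter_bytes = 4
used = msgs_per_sec * counter_bytes        # bytes/s of counter traffic
bidirectional_capacity = 2 * 12.5e9        # 25 GB/s across both directions

print(used / 1e6)                          # 300.0 (MB/s actually used)
print(bidirectional_capacity / used)       # headroom factor, roughly 83x
```

In other words, a message-rate-bound workload of tiny counters leaves almost all of the byte bandwidth on the table.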

In our TOCS paper SMC peaks at around 7.5M small messages per second, which bears out this observation.  We seem to be leaving a lot of capacity unused.  If you think about it, everything centers on the assumption that ordered_send should be as urgent as possible.  This is actually limiting performance, and for applications that average 7.5M SMC messages per second or less but have bursts that might be much higher, it even inflates latency (a higher-rate burst will just fill the window, and the sender will have to wait for a slot).

Suppose our sender wants fast SMC streaming and low latency, and simply wasn't able to do application-level batching (maybe the application has a few independent subsystems of its own that send SMC messages).  Well, everyone is familiar with file I/O streaming and buffering.  Why not use the same idea here?

Clearly we could have aggregated a bunch of SMC messages, and then done one RDMA transfer for the entire set of full window slots (it happens that RDMA writes support a scatter/gather feature, and we can use that to transfer precisely the newly full slots even if they wrap around the window).  Now one counter update covers the full set.  Moreover, the receivers can do "batched" receives, and one counter update would then cover the full batch of receives.

An SMC window might have 1000 sender slots in it, with the cutoff for "small" messages being perhaps 100B.  Suppose we run with batches of size 250.  We'll have cut the overhead factors dramatically: for 1000 SMC messages in the urgent approach, the existing system would send 1000*10 RDMA messages for the 3-member shard: 10,000 in total.  Modified to batch 250 messages at a time, only 40 RDMA operations are needed: a clean 250x improvement.  In theory, our 7.5M SMC messages per second performance could then leap to 1.9B/second.  But here, predictions break down: With 100 byte payloads, that rate would actually be substantially over the limit we calculated earlier, 25GB/s, which limits us to 250M SMC messages per second.  Still, 250M is quite a bit faster than 7.5M and worth trying to achieve.
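We can check that arithmetic -- pure back-of-envelope, using the per-multicast cost of ten RDMA operations derived earlier for a 3-member shard:

```python
# Urgent vs batched SMC: 3-member shard, 1000 messages, batches of 250,
# 100-byte payloads (all figures are the ones in the text).
ops_per_round = 10                            # RDMA ops per urgent multicast
urgent_ops = 1000 * ops_per_round             # 10,000
batched_ops = (1000 // 250) * ops_per_round   # one round per batch: 40
speedup = urgent_ops / batched_ops            # 250x fewer RDMA operations

wire_limit = 25e9 / 100                       # 25 GB/s over 100-byte payloads
print(urgent_ops, batched_ops, speedup)       # 10000 40 250.0
print(wire_limit)                             # 250000000.0 messages/second
```

So 7.5M messages per second times the 250x reduction would in principle be 1.875B per second, but the byte bandwidth caps us at 250M, exactly as the paragraph above concludes.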

It might not be trivial to get from here to there, even with batching.  Optimizations at these insane data rates often aren't nearly as simple as a pencil-and-paper calculation might suggest.  And there are also those urgency semantics issues to think about:  A bursty sender might have some gaps in its sending stream.  Were one to occur in the middle of a 250 message batch, we shouldn't leave those SMC messages dangling: some form of automatic flush has to kick in.  We should also have an API operation so that a user could explicitly force a flush.  

Interestingly, once you start to think about this, you'll realize that in this latency sense, Sagar's original SMC is probably "more optimal" than any batched solution can be.  If you have just one very urgent notification to send, not a batch, SMC is already a very low-latency protocol; arguably, given his argument that the API itself dictates that SMC should be an urgent protocol, his solution actually is "ideally efficient."  What we see above is that if you question that assumption, you can identify an inefficiency -- not that the protocol as given is inefficient under the assumptions it reflects.

Moral of the story?  The good news is that right this second, there should be a way to improve Derecho performance for small messages, if the user is a tiny bit less worried about urgency and would like to enable a batching mode (we can make it a configurable feature).  But more broadly, you can see that although Derecho lives in a world governed in part by theory, in the extreme performance range we target and with the various hardware constraints and properties we need to keep in mind, tiny decisions can sometimes shape performance to a far greater degree.

I happen to be a performance nut (and nobody stays in my group unless they share that quirk).  Now that we are aware of this SMC performance issue, which was actually called to our attention by Joe Israelevitz when he compared his Acuerdo protocol over RDMA with our Derecho one for 100B objects and beat us hands-down,  we'll certainly tackle it.  I've outlined one example of an optimization, but it will certainly turn out that there are others too, and I bet we'll end up with a nice paper on performance, and a substantial speedup, and maybe even some deep insights.  But they probably won't be insights about protocol complexity.  At the end of the day, Derecho may be quite a bit faster for some cases, and certainly this SMC one will be such a case.  Yet the asymptotic optimality of the protocol will not really have been impacted: the system is optimal in that sense today!  It just isn't as fast as it probably should be, at least for SMC messages sent in high-rate streams!

Wednesday, 26 June 2019

Whiteboard analysis: IoT Edge reactive path

One of my favorite papers is the one Jim Gray wrote with Pat Helland, Patrick O'Neil and Dennis Shasha, on the costs of replicating a database over a large set of servers, which they showed to be prohibitive if you don't fragment (shard) the database into smaller and independently accessed portions: mini-databases.  In some sense, this paper gave us the modern cloud, because you can view Brewer's CAP conjecture and the eBay/Amazon BASE methodologies as both flowing from Gray's original insight.

Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, and then compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all its locks before releasing any), and then terminate using a two-phase or three-phase commit.  They show that without some mechanism to prevent lock conflicts, there is a predictable and steadily increasing rate of lock conflicts leading to delay and even deadlock/rollback/retry.  The phenomenon causes overheads to rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe it was N^3 in the number of servers, and T^5 in the rate of transactions.  So your single replicated database will suffer a performance collapse.  With shards, using state machine replication (implemented using Derecho!) this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's Dynamo DB offers.
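Just to convey how sharp such a polynomial is, here is the shape of the growth, using the exponents as I recall them (treat this as an illustration of the shape, not the paper's exact formula):

```python
# Relative conflict-driven overhead, growing as N^3 in replicas and
# T^5 in transaction rate (exponents as recalled in the text).
def relative_conflict_cost(n_replicas, tx_rate):
    return (n_replicas ** 3) * (tx_rate ** 5)

base = relative_conflict_cost(1, 1)
# 10x the replicas and merely 2x the rate: 1000 * 32 = 32000x the baseline
print(relative_conflict_cost(10, 2) / base)
```

With exponents like these, even modest growth in either dimension is catastrophic, which is exactly why sharding (keeping N tiny per shard) defuses the problem.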

Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan of working out the critical paths and measuring costs along them.  In his cloud database setup, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight.  But even if you don't have an identical chance to change the world, it makes sense to try and follow a similar path.

This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week.  Right now, those paths are pretty messy, if you approach it this way.  Without an edge service, we would see something like this:

  IoT Sensor  --------------->  IoT Hub  --->  Function Server  --->  Micro Service

In this example I am acting as if the function server "is" the function itself, and hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps had one waiting, warm-started), and then hands off the event data to the function for handling on one of its servers.  Had I included this handoff the image would be more like this:

  IoT Sensor  --------------->  IoT Hub  --->  Function Server  --->  F  --->  Micro Service

F is "your function", coded in a language like C#, F# or C++ or Python, and then encapsulated into a container of some form.  You'll want to keep these programs very small and lightweight for speed.  In particular, a function is not the place to do any serious computing, or to try and store anything.  Real work occurs in the micro service, the one you built using Derecho.  Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a second.  And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!

Even more concerning, many sensors can't connect directly to the cloud, and we end up cloning the architecture and running it twice: once within an IoT Edge system (think of that as an operating system for a small NUMA machine or cluster running close to the sensors), which then relays data to the main cloud if it can't handle the events out near the sensor device.

  IoT Sensor  --------------->  Edge Hub  --->  Edge Fcn Server  --->  F  ===>  IoT Hub  --->  Function Server  --->  CF  --->  Micro Service

Notice that now we have two user-supplied functions on the path.  The first one will have decided that the event can't be handled out at the edge, and forwarded the request to the cloud, probably via a message queuing layer that I haven't actually shown, but represented using a double-arrow: ===>.  This could have chosen to store the request and send it later, but with luck the link was up and it was passed to the cloud instantly, didn't need to sit in an arrival queue, and was instantly given to the cloud's IoT Hub, which in turn finally passed it to the cloud function server, the cloud function (CF) and the Micro Service.

The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others.  The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.

We also have the difficulty that many sensors capture images and videos.  These are initially stored on the device itself, which has substantial capacity but limited compute power.  The big issue is that the first link, from sensor to the edge hub, would often be bandwidth limited.  So we can't upload everything.  Very likely what travels from sensor to hub is just a thumbnail and other meta-data.  Then the edge function concludes that a download is needed (hopefully without too much delay), sends back a download request to the imaging device, and then the device moves the image to the cloud.

Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store (short for "binary large objects"), which in turn is edge aware and will mirror them to the main cloud blob store.  Thus we have a whole pathway from IoT sensor to the edge blob server, which will eventually generate another event later to tell us that the data is ready.  And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving that image via the queuing service to the cloud.

So how long will all of this take?  Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G).  So perhaps we will see a 10ms delay at that stop, plus more if the data is large.  Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays.  The big delay will probably be the handoff to the user-defined function, F.

My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100 for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!".  We're up near 350ms and haven't done any work at all yet!

Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path).  Call this 10ms?  And then the micro service can decide what to do with this image.

Add another 75ms or so if we have to forward the request to the cloud.  So the cloud might not be able to react to a photo in less than about 500ms, today.
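Adding up the guesses in this whiteboard budget (every number is one of the estimates above, taking the optimistic end of each range; with the pessimistic ends the total passes 500ms):

```python
# Optimistic-end whiteboard latency budget for the reactive path.
budget_ms = {
    "event reaches F":           50,   # text estimates 50-100ms
    "F requests the download":   50,   # text estimates 50-100ms
    "camera uploads to blob":   200,
    "blob-ready event F'":       25,
    "handoff to micro-service":  10,
    "forward to the cloud":      75,
}
total = sum(budget_ms.values())
print(total)   # 410 (ms) at the optimistic end of every range
```

Even this optimistic tally lands around 400ms before any real work has been done, which is the whole point of the exercise.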

None of this involved a Jim Gray kind of analysis of contention and backoff and retry.  If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story.  But if you were to use a database solution like MongoDB (CosmosDB on Azure), it seems to me that you might easily see a further 250ms right there.

What should one do about these snowballing costs?  One answer is that many of the early IoT applications just won't care: if the goal is to just journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal.  But if the goal is to be reactive, we need to do a lot better.

I'm thinking that this is a great setting for various forms of shortcut datapaths, that could be set up after the first interaction and offer direct bypass options to move IoT events or data from the source directly to the real target.  Then with RDMA in the cloud, and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.

On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently), and contention, it is easy to see how you could trigger Jim's "naive replication" concerns.  So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns that a car should "watch out for that reckless motorcycle approaching on your left!" it will already have zoomed past...   

These are exciting times to work in computer systems.  Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!