Thursday, 7 November 2019

Sharable accelerators for the IoT Edge

After returning from the ACM Symposium on Operating Systems Principles (SOSP 2019, where we saw a huge number of ideas more focused on AI/ML than on classic operating systems), I wrote a blog posting that I shared a few days ago.

My core point was that we will face a kind of crisis in AI/ML computing as we try to leverage hardware accelerators.  The problem is that to leverage deep neural networks in IoT environments, both the training and the actions may need to occur under tight real-time pressures.  These IoT uses will involve huge volumes of data (images and video, voice, multi-spectral imaging data, and so forth).  Often, we need to combine information from multiple sources before we can extract knowledge (data "fusion").  Add those observations up and you find yourself looking at a class of questions that can only be addressed using hardware accelerators.  This is because the core tasks will be data-parallel: operations that can occur in a single highly parallel step on an accelerator, but that might require billions of clock-cycles on a normal general-purpose computer.

But the cost-effectiveness of today's IoT hardware is very strongly tied to the world in which those devices have evolved.  Look at the SOSP papers and you read all sorts of results about improving the mapping of batch-style workloads into GPU and FPGA clusters (and the same insights apply to custom ASICs or TPUs).

Thus at the edge, we will find ourselves in a pinch: while the demand for cycles during an inference or training task is similar to what we see at the back-end, these tasks are going to be event-driven: you fuse the data from a set of sources on a smart highway at the instant that you collect the data, with the intent of updating vehicle trajectory information and revising car guidance within seconds. And the problem this poses is that the accelerator might spend most of its time waiting for work, even though when work does show up, it has magical superpowers that let it discharge the task within microseconds.

To me this immediately suggests that IoT could be profligate in its use of hardware accelerators, demanding a whole FPGA or GPU cluster for a single event.  Under any normal model of the cost of these devices, this approach would be extremely expensive.  I suppose you could make the case for some kind of bargain-basement device that might be a bit slower, less heavily provisioned with memory and otherwise crippled, and by doing that drive the costs down quite a bit.  But you'll also slow the tasks down, and that would then make ML for IoT a very feeble cousin to ML performed at the back-end on a big data analytics framework.
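To put rough numbers on that worry, here is a back-of-the-envelope sketch in Python.  The event rates, the 50-microsecond service time and the hourly price are made-up illustrative values, not measurements, and `utilization` and `cost_per_event` are hypothetical helper names introduced only for this example.

```python
def utilization(events_per_second: float, service_time_s: float) -> float:
    """Fraction of wall-clock time the accelerator spends doing useful work."""
    return min(1.0, events_per_second * service_time_s)


def cost_per_event(hourly_price: float, events_per_second: float) -> float:
    """Hourly device price spread over the events handled in that hour."""
    return hourly_price / (events_per_second * 3600.0)


# Assumed values, for illustration only.
HOURLY_PRICE = 3.0        # dollars per hour for one dedicated accelerator
SERVICE_TIME_S = 50e-6    # accelerator time needed per event
EDGE_RATE = 10            # events per second from a small edge deployment
BATCH_RATE = 20_000       # events per second from a saturated back-end job

edge_util = utilization(EDGE_RATE, SERVICE_TIME_S)
batch_util = utilization(BATCH_RATE, SERVICE_TIME_S)
cost_ratio = cost_per_event(HOURLY_PRICE, EDGE_RATE) / cost_per_event(HOURLY_PRICE, BATCH_RATE)

print(f"edge utilization:  {edge_util:.4%}")
print(f"batch utilization: {batch_util:.4%}")
print(f"edge events cost about {cost_ratio:.0f}x more per event")
```

Under these assumptions the edge device sits idle more than 99.9% of the time, and because the hourly price is the same either way, the per-event cost gap is simply the ratio of the two event rates.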

What would cost-effective IoT for the edge require, in terms of new hardware?  A first guess might be some form of time-sharing, to allow one device to be multiplexed between a large number of events from a single application that scales out to include many sensors (a relatively easy case), or between events from different users (a much harder one).  But I'm going to argue against this, at least if we base our reasoning on today's existing IoT accelerator options.

Multiplexing is exceptionally difficult for hardware accelerators.  Let's try and understand the root problems.  Now, in proposing this mini-deep-dive, I realize that many readers here are like me, and as such, probably view these units as black boxes.  But where we differ slightly is that over the past few years I've been forcing myself to learn more by teaching a class that looks at the roles of accelerators in modern datacenter computing.  I don't find these papers easy to read, or to teach, but along the way I've slowly gotten familiar with the models and learned about some interesting overheads.

When you look at a GPU in a datacenter, what are you likely to find under the hood?  As it happens, my colleague Weijia Song just asked this question, and arrived at a very nice high level summary.  He tells me that from a mile up, we should view a GPU computing unit as a specialized server, with its own memory (just like a normal computer, a GPU would be equipped with perhaps 12GB of DRAM), but then with processors that run a special kind of SIMD code (often written in a variant of C called CUDA) that performs block-parallel computations using blocks of GPU threads, each with its own GPU core.  L1 and L2 caching are software-managed, which is less exotic than you may think: with modern C and C++ we use annotations on variables to describe the desired consistency semantics and atomicity properties, and in fact the GPU scheme is rather similar.  So: we have a 3-level memory hierarchy, which can be understood by thinking about registers (L1), the normal kind of cache (L2) and DRAM resident on the GPU bus (like any normal DRAM, but fast to access from GPU code).

Weijia's summary is pretty close to what I had expected, although it was interesting to realize that the GPU has quite so much DRAM.  Interesting questions about how to manage that as a cache arise...

At any rate, the next thing to think about is this: when we use a GPU, we assume that it somehow has the proper logic loaded into it: the GPU program is up and ready for input.  But what is a GPU program?

It seems best to think of GPU code as a set of methods, written in CUDA, and comprising a kind of library: a library that was loaded into the GPU when we first took control over it, and that we can now issue calls into at runtime.  In effect the GPU device driver can load the address of a function into a kind of register, put the arguments into other registers, and press "run".  Later an interrupt occurs, and the caller knows the task is finished.  For big data arguments, we DMA transfer them into the GPU memory at the outset, and for big results we have a choice: we can leave them in the GPU memory for further use, or DMA them back out.

Now, loading that library had hardware implications and it took time: let's say 3 seconds to 30 seconds depending on whether or not a reboot was required.  So already we face an issue: if we wanted to share our GPU between multiple users, at a minimum we should get the users to agree on a single GPU program or set of programs at the outset, load them once, and then use them for a long time.  Otherwise, your decision to load such-and-such a tensor package will disrupt my ability to do data fusion on my smart highway data, and my clients (smart cars) will be at high risk of an accident.  After all: 3 to 30 seconds is quite a long delay!
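A toy model makes the disruption visible.  The Python sketch below assumes a device that can hold only one loaded library at a time and pays a fixed reload delay whenever a request needs a different one; `SharedAccelerator`, the library names and the 1-millisecond service time are all hypothetical, and the 3-second reload is just the low end of the range mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SharedAccelerator:
    """Toy device that holds one loaded kernel library at a time."""
    reload_s: float                  # delay to load a different library
    service_s: float                 # time per request once the library is resident
    resident: Optional[str] = None   # library currently loaded, if any

    def run(self, library: str) -> float:
        """Run one request and return the latency it experienced."""
        latency = self.service_s
        if library != self.resident:
            # A different library is loaded (or none): pay the reload delay.
            latency += self.reload_s
            self.resident = library
        return latency


def total_latency(requests: List[str], reload_s: float = 3.0,
                  service_s: float = 0.001) -> float:
    """Sum of per-request latencies for a sequence of library names."""
    device = SharedAccelerator(reload_s=reload_s, service_s=service_s)
    return sum(device.run(lib) for lib in requests)


# Two tenants, five requests each: strictly alternating versus grouped.
requests_interleaved = ["fusion", "tensor"] * 5
requests_grouped = ["fusion"] * 5 + ["tensor"] * 5

print(f"interleaved: {total_latency(requests_interleaved):.2f} s")
print(f"grouped:     {total_latency(requests_grouped):.2f} s")
```

With alternating tenants every request pays the reload, so ten 1-millisecond requests take about 30 seconds in total; grouping them by library pays it only twice.  The useful work is identical in both cases, which is why agreeing on the loaded programs up front matters so much.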

Additionally, I should perhaps note that GPUs don't have particularly good internal security models.  If we actually do allow multiple users to share one GPU, we depend on a form of cooperation between the library modules to protect against information leaking across from one user to another.  By and large sharing just isn't supported, but if GPUs really are shared, we either trust that this security is working, or we would have huge context switch overheads -- many seconds -- and would encounter cold caches after each such context switch occurs.  People are actually exploring this kind of sharing: several papers I've read recently take a single GPU, subdivide it among a few tasks, and even experiment with having that single device concurrently running distinct tasks in distinct "address spaces" (this is not the term they use, and the GPU analog of an address space is quite different from a general purpose machine with a virtual address space, but there is still a roughly similar construct).

But here's the thing: even if AWS offers me a steep discount for multitenancy, I'm not sure that I would want to share my smart-highway data fusion GPU.  After all, I don't know you.  Maybe you are running some kind of spyware application that uses sneaky methods to extract my data through our shared device!  AWS might promise some form of security through obscurity: "attackers would never know who they share the GPU unit with."  But would I want to trust that?

What about FPGA?  Here, the situation is analogous but a bit more flexible.  As you probably are aware, FPGA is really a generic way to encode a chip:  even the circuitry itself is reconfigurable.  An FPGA could encode a normal processor, like an ARM (although I wouldn't waste your time: most FPGA devices have a 6-core or 12-core ARM chip right onboard, so you might as well just run general purpose tasks on those).  If you want your FPGA to behave like a very stripped-down GPU, you can find a block of logic to do that.  Prefer an analog-to-digital conversion unit that also does an FFT on the incoming signal?  In most respects an FPGA could do that too (it might need a bit of additional hardware to help with the analog signals per se).  So the FPGA developer is like a chip developer, but rather than sending out the chip design to be burned into silicon (which yields an application-specific integrated circuit or ASIC), you download your design into the FPGA, reboot it, and now your chip logic resides on this "bespoke" chip.

Like a GPU we could think of an FPGA as hosting a set of functions, but here we run into some oddities of the FPGA itself being a chip: at the end of the day, data comes in and out through the FPGA's pins, which have to be "shared" in some way if you want multiple functions on the one device.  Companies like Microsoft are exploring frameworks (they call them "shells") to own the infrastructure, so that this layer of sharing can be standardized.  But that work is nowhere near tackling security in the sense that the O/S community understands the term.  This is a kind of sharing aimed more at ensuring platform stability, not really isolation between potentially hostile competing users.

An FPGA is flexible: you can situate your new chip as a kind of filter interposed on the network (a bump-in-the-wire model), or can use it as a kind of special instruction set accessible from the host computer, and able to operate directly on host memory, or you can treat it much like a separate host side by side with your general purpose host, with its own storage, its own logic, and perhaps a DMA interface to your host memory.  But once again, there is just one set of pins for that FPGA chip.  So if multiple "users" share an FPGA, they wouldn't really be arbitrary general users of AWS.  More likely, they would be multiple tasks that all want to process the identical data in distinct ways: perhaps each incoming frame of data in an IoT setting needs to be decrypted, but then we want to run image segmentation side-by-side with some form of de-duplication, side-by-side with a lighting analysis.  Because you probably designed this entire logic path, you can safely be rather trusting about security.  And after all: given that the FPGA may have DMA access to your DRAM (perhaps via the PCIe bus, which has its own memory map, but perhaps directly via the main memory bus), you wouldn't want to time-share this unit among other users of the same cloud!

Which leads to my point: time-sharing of devices like FPGAs and GPUs is really not a very flexible concept today.  We can share within components of one application, but not between very different users who have never even met and have no intention of cooperating.   The context switch times alone tell us that you really wouldn't want these to change the code they run every time you launch a different end-user application.  You run huge security risks.  A device failure can easily trigger a hardware level host failure, and perhaps could take down the entire rack by freezing the network.  Yes, accelerators do offer incredible speedups, but these devices are not very sharable.

The immense surge in popularity of accelerators stems from big-data uses, but those are often very batched.  We take a million web pages, then perform some massively parallel task on them for an hour.  It might have run for a year without the accelerators.

But this is far from sharability in a general sense.  So, what is this telling us about the event-oriented style of computing seen in the IoT Edge or on the IoT Cloud (the first tier of the true cloud, where a datacenter deals with attached IoT devices)?  In a nutshell, the message is that we will be looking at an exceptionally expensive form of AI/ML unless we find ways to scale our IoT solutions and to batch the actions they are taking.

Here are two small examples to make things a bit more concrete.  Suppose that my IoT objective is to automate some single conference room in a big building.  In today's accelerator model, I may have to dedicate some parallel hardware to the task.  Unless I can multiplex these devices within my conference room application, I'll use a lot of devices per task compared to what back-end experience in the cloud may have led the developers to anticipate.  Thus my solution will be surprisingly costly, when compared with AI of the kind used to place advertising on web pages as we surf from our browsers: thousands of times more costly, because the hardware accelerators for ad placement are amortized over thousands of users at a time.

I suppose I would solve this by offering my conference room solution as a service, having a pool of resources adequate to handle 1000 conference rooms at a time, and hoping that I can just build and deploy more if demand surges and some morning, 10,000 people try to launch the service.  In the cloud, where the need was for general purpose computing, AWS solved this precise puzzle by offering us various IaaS and PaaS options.  With my reliance on dedicated, private, accelerators, I won't have that same option.  And if I blink and agree to use shared accelerators, aside from the 3-second to 30-second startup delay, I'll be accepting some risk of data leakage through the accelerator devices.  They were listening during that conference... and when it ends, someone else will be using that same device, which was created by some unknown FPGA or GPU vendor, and might easily have some kind of backdoor functionality.  So this could play out in my favor ("but AWS had its eye on such risks, and in fact all goes well.")  Or perhaps not ("Massive security breach discovered at KACS (Ken's awesome conference services): 3 years of conferences may have been covertly recorded by an unknown intruder.").

In contrast, if I am automating an entire smart highway, I might have a shot at amortizing my devices provided that I don't pretend to have security domains internal to the application, that might correspond to security obligations on my multi-tasked FPGA or GPU accelerator.  But true security down in that parallel hardware would be infeasible, I won't be able to dynamically reconfigure these devices at the drop of a pin, and I will still need to think about ways of corralling my IoT events into batches (without delaying them!) or I won't be able to keep the devices busy.

Now, facing this analysis, you could reach a variety of conclusions.  For me, working on Derecho, I don't see much of a problem today.  My main conclusion is simply that as we evolve Derecho into a more and more comprehensive edge IoT solution, we'll want to integrate FPGA and GPU support directly into our compute paths, and that we need to start to think about how to do batching when a Derecho-based microservice leverages those resources.  But these are early days and Derecho can be pretty successful long before questions of data-center cost and scalability enter the picture.

In fact the worry should be at places like AWS and Facebook and Azure and Google Cloud.  Those are the companies where an IoT "app" model is clearly part of the roadmap, and where these cost and security tradeoffs will really play out.

And in fact, if I were a data-center owner, this situation would definitely worry me!  On the one hand, everyone is betting big on IoT intelligence.  Yet nobody is thinking of IoT intelligence as costly in the sense that I might need to throw (in effect) 10x or 100x more specialized parallel hardware at these tasks, on a "per useful action" basis, to accomplish my goals.  For many purposes those factors of 100 could be prohibitive at real scale.

We'll need solutions within five or ten years.  They could be ideas that let users securely multiplex FPGA or GPU, and with a lot less context switching delay.  Or ideas that allow them to do some form of receiver-side batching that avoids delaying actions but still manages to batch them, as we do in Derecho.  They could be ideas for totally new hardware models that would simply cost a lot less, like FPGA or GPU cores on a normal NUMA die (my worry for that case would be heat dissipation, but maybe the folks at Intel or AMD or NVIDIA would have a clever idea).
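To make the batching idea a little more concrete, here is a minimal Python sketch of one way to batch without ever holding an event back: whenever the device becomes free, it takes whatever has already arrived and never waits for more.  This is my own illustrative simulation, not Derecho's actual code; `opportunistic_batches`, the launch and per-item costs, and the arrival times are all assumptions.

```python
from typing import List, Tuple


def opportunistic_batches(arrivals: List[float], launch_cost: float,
                          per_item_cost: float, max_batch: int = 64
                          ) -> List[Tuple[float, float, List[float]]]:
    """Group events into batches without delaying any event to wait for others.

    arrivals must be sorted arrival times in seconds.  Returns a list of
    (start_time, finish_time, arrival_times_in_batch) tuples.
    """
    batches = []
    now = 0.0
    i = 0
    n = len(arrivals)
    while i < n:
        # If the device is idle and nothing is queued, jump to the next arrival.
        now = max(now, arrivals[i])
        # Take everything that has already arrived, up to the batch limit.
        j = i
        while j < n and arrivals[j] <= now and j - i < max_batch:
            j += 1
        finish = now + launch_cost + per_item_cost * (j - i)
        batches.append((now, finish, arrivals[i:j]))
        now = finish
        i = j
    return batches


# Assumed workload: a small burst, then a straggler.
arrivals = [0.0, 0.001, 0.002, 0.003, 0.010]
batches = opportunistic_batches(arrivals, launch_cost=0.004, per_item_cost=0.0005)
sizes = [len(items) for _, _, items in batches]
print(sizes)
```

The first event runs alone because nothing else has arrived yet; the three events that show up while the device is busy are then handled together as one batch, and the straggler again runs alone.  Batches form only as a side effect of the device being occupied, so a lightly loaded system behaves like an unbatched one and a heavily loaded one amortizes the launch cost automatically.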

The key point, though, remains: if we are serious about the IoT edge, we need to get serious about inventing the specialized accelerator models that the edge will require!

Tuesday, 5 November 2019

The hardest part of a cloud for IoT will be the hardware

At the recent SOSP 2019 conference up in Ottawa, I was one of the "old guard" that had expected a program heavy on core O/S papers aimed at improving performance, code quality or reliability for the O/S itself or for general applications.  How naïve of me: Easily half the papers were focused on performance of machine learning infrastructures or applications.  And even in the other half, relatively few fell into the category that I think of as classic O/S research.

Times change, though, and I'm going to make the case here that in fact the pendulum is about to swing back and in quite a dramatic way.  If we ask what AI and ML really meant in the decade or so leading up to this 2019 conference the answer leads directly to the back-end platforms that host big-data analytic tools.

Thus it was common to see a SOSP paper that started by pointing out that training a deep neural network can take many hours, during which the GPU sometimes is stalled, starved for work to do.  Then the authors would develop some sort of technique for restructuring the DNN training system, the GPU would stay fully occupied, and performance would improve 10x.   What perhaps is a bit less evident is the extent to which these questions and indeed the entire setup center on the big-data back-end aspects of the problem.  The problem (as seen today) itself is inherently situated in that world: when you train a DNN, you generally work with vast amounts of data, and the whole game is to parallelize the task by batching the job, spreading the work over large number of nodes that each tackle some portion of the computation, and then iterating, merging the sub-results as the DNN training system runs.  The whole setup is about as far from real-time as it gets.

So we end up with platforms like MapReduce and Hadoop (Spark), or MPI for the HPC fans in the rooms.  These make a lot of sense in a batched, big-data, back-end setting.

But I'm a huge believer in the cloud edge: IoT systems with real-time sensors, that both make decisions and even learn as they run, often under intense time pressure.  Take Microsoft's Farmbeats product, led by a past student of mine who was a star from the outset.  Today, Ranveer Chandra has become the Chief Scientist for Azure Global and is the owner of this product line.  Farmbeats has a whole infrastructure for capturing data (often video or photos) from cameras.  It can support drones that will scan a farm for signs of insect infestations or other problems.  It replans on the fly, and makes decisions on the fly: spray this.  Bessie seems to be favoring her right hoof, and our diagnostic suggests a possible tear in the nail: call the vet.  The tomatoes in the north corner of the field look very ripe and should be picked today.  Farmbeats even learns wind patterns on the fly, and reschedules the drones doing the search pattern to glide on the breeze as a way to save battery power.

And this is the future.  Last time I passed through Newark Airport, every monitor was switching from ads for Microsoft's Farmbeats to ads for Amazon's smart farm products, back and forth.  The game is afoot!

If you walk the path for today's existing DNN training systems and classifiers, you'll quickly discover that pretty much everything gets mapped to hardware accelerators.  Otherwise, it just wouldn't be feasible at these speeds, and this scale.  The specific hardware varies: At Google, TPUs and TPU clusters, and their own in-house RDMA solutions.  Amazon seems very fond of GPU and very skeptical of RDMA; they seem convinced that fast datacenter TCP is the way to go.  Microsoft has been working with a mix: GPUs and FPGAs, configurable into ad-hoc clusters that treat the infrastructure like an HPC supercomputing communications network, complete with the kinds of minimal latencies that MPI assumes.  All the serious players are interested in hearing about RDMA-like solutions that don't require a separate InfiniBand network and that can be trusted not to hose the entire network.  Now that you can buy SSDs built from 3-D XPoint, nobody wants anything slower.

And you know what?  This all pays off.  Those hardware accelerators are data-parallel and no general purpose CPU can possibly compete with them.  They would be screamingly fast even without attention to the exact circuitry used to do the computing, but in fact many accelerators can be configured to omit unneeded logic: an FPGA can be pared down until the CPUs implement only the exact operations actually needed, to the exact precision desired.  If you want matrix multiply with 5-bit numbers, there's an FPGA solution for that.  Not a gate or wire wasted... and hence no excess heat, no unneeded slowdowns, and more FPGA real-estate available to make the basic computational chunk sizes larger.  That's the world we've entered, and if all these acronyms dizzy you, well, all I can do is to say that you get used to them quickly!

To me this leads to the most exciting of the emerging puzzles: How will we pull this same trick off for the IoT edge? (I don't mean specifically the Azure IoT Edge, but more broadly: both remote clusters and also the edge of the main cloud datacenter).   Today's back-end solutions are more optimized for batch processing and relaxed time constraints than people probably realize, and the same hardware that seems so cost-effective for big data may seem quite a bit less efficient if you try to move it out to near the edge computing devices.

In fact it gets even more complicated.  The edge is a world with some standards.  Consider Azure IoT: the devices are actually managed by the IoT Hub, which provides a great deal of security (particularly in conjunction with IoT Sphere, a hardware security model with its own TCB).  IoT events trigger Azure functions: lightweight stateless computations that are implemented as standard programs but that run in a context where we don't have FPGA or GPU support: initializing those kinds of devices to support a particular kernel (a particular computational functionality) typically requires at least 3-5 seconds to load the function, then reboot the device.  I'm not saying it can't be done: if you know that many IoT functions will need to run a DNN classifier, I suppose you could position accelerators in the function tier and preload the DNN classifier kernel.  But the more general form of the story, where developers for a thousand IoT companies create those functions, leads to such a diversity of needs from the accelerator that it just could never be managed.

So... it seems unlikely that we'll be doing the fast stuff in the function tier.  More likely, we'll need to run our accelerated IoT logic in microservices, living behind the function tier on heavier nodes with beefier memories, stable IoT accelerators loaded with exactly the kernels the particular microservice needs, and then all of this managed for fault-tolerance with critical data replicated (and updated as the system learns: we'll be creating new ML models on the fly, like Farmbeats and its models of cow health, its maps of field conditions, and its wind models).   As a side-remark, this is the layer my Derecho project targets.

But consider the triggers: in the back-end, we ran on huge batches of events, perhaps tens of thousands at a time, in parallel.  At the edge, a single event might trigger the whole computation.  Even to support this, we will still need RDMA to get the data to the microservice instance(s) charged with processing it (via the IoT functions, which behave like small routers, customized by the developer), fast NVM like Optane, and then those FPGA, GPU or TPU devices to do the heavily parallel work.  How will we keep these devices busy enough to make such an approach cost-effective?

And how will the application developer "program" the application manager to know how these systems need to be configured: this microservice needs one GPU per host device; that one needs small FPGA clusters with 32 devices but will only use them for a few milliseconds per event, this other one needs a diverse mix, with TPU here, GPU there, FPGA on the wire...   Even "describing" the needed configuration looks really interesting, and hard.
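Just to illustrate what "describing" such a configuration might look like, here is a purely hypothetical sketch using Python dataclasses.  No existing application manager accepts this schema: `Accel`, `AcceleratorNeed`, `MicroserviceSpec` and every field, kernel name and number below are invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Accel(Enum):
    GPU = "gpu"
    FPGA = "fpga"
    TPU = "tpu"


@dataclass(frozen=True)
class AcceleratorNeed:
    kind: Accel
    count: int                 # number of devices requested
    kernel: str                # kernel to preload (hypothetical name)
    busy_ms_per_event: float   # expected device time per event


@dataclass
class MicroserviceSpec:
    name: str
    needs: List[AcceleratorNeed] = field(default_factory=list)

    def devices_by_kind(self) -> Dict[Accel, int]:
        """Total devices requested, grouped by accelerator type."""
        totals: Dict[Accel, int] = {}
        for need in self.needs:
            totals[need.kind] = totals.get(need.kind, 0) + need.count
        return totals


# One GPU per host; a 32-device FPGA cluster used for a few ms per event; a mix.
classifier = MicroserviceSpec("classifier", [
    AcceleratorNeed(Accel.GPU, 1, "dnn_classifier", 2.0),
])
fusion = MicroserviceSpec("fusion", [
    AcceleratorNeed(Accel.FPGA, 32, "sensor_fusion", 3.0),
])
mixed = MicroserviceSpec("mixed", [
    AcceleratorNeed(Accel.TPU, 1, "model_update", 5.0),
    AcceleratorNeed(Accel.GPU, 2, "segmentation", 4.0),
    AcceleratorNeed(Accel.FPGA, 1, "decrypt_on_wire", 0.1),
])

for spec in (classifier, fusion, mixed):
    print(spec.name, {k.value: v for k, v in spec.devices_by_kind().items()})
```

Even this toy version leaves the hard parts unanswered: it says nothing about placement (FPGA on the wire versus on the host), about how long a kernel may stay resident, or about what the manager should do when two services want the same device.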

You see, for researchers like me, questions like those aren't a reason to worry.  They are more like a reason to celebrate: plenty to think about, lots of work to do and papers to write, and real challenges for our students to tackle.  Systems challenges.

So you know what?  I have a feeling that by 2020 or 2021, OSDI and SOSP will start to feel more like classic systems conferences again.  The problems I've outlined look more like standard O/S topics, even if the use-case that drives them is dominated by AI and ML and real-time decision-making for smart farms (or smart highways, smart homes, smart cities, smart grids).  The money will be there -- Bill Gates once remarked during a Cornell visit that the IoT revolution could easily dwarf the Internet and Web and Cloud ones (I had a chance to ask why, and his answer was basically that there will be an awful lot of IoT devices out there... so how could the revolution not be huge?)

But I do worry that our funding agencies may be slow to understand this trend.  In the United States, graduate students and researchers are basically limited to working on problems for which they can get funding.  Here we have a money-hungry and very hard question, that perhaps can only be explored in a hands-on way by people actually working at companies like Google and Microsoft and Amazon.  Will this lead the NSF and DARPA and other agencies to adopt a hands-off approach?

I'm hoping that one way or another, the funding actually will be there.  Because if groups like mine can get access to the needed resources, I bet we can show you some really cool new ideas for managing all that hardware near the IoT edge.  And I bet you'll be glad we invented them, when you jump into your smart car and tell it to take the local smart highway into the office.  Just plan ahead: you won't want to miss OSDI next year, or SOSP in 2021!