Thursday 7 November 2019

Sharable accelerators for the IoT Edge

After returning from the ACM Symposium on Operating Systems Principles (SOSP 2019), where we saw a huge number of ideas focused more on AI/ML than on classic operating systems, I wrote a blog posting that I shared a few days ago.

My core point was that we will face a kind of crisis in AI/ML computing as we try to leverage hardware accelerators.  The problem is that to leverage deep neural networks in IoT environments, both the training and the actions may need to occur under tight real-time pressures.  These IoT uses will involve huge volumes of data (images and video, voice, multi-spectral imaging data, and so forth).  Often, we need to combine information from multiple sources before we can extract knowledge (data "fusion").  Add those observations up and you find yourself looking at a class of questions that can only be addressed using hardware accelerators.  This is because the core tasks will be data-parallel: operations that can occur in a single highly parallel step on an accelerator, but that might require billions of clock-cycles on a normal general-purpose computer.

But the cost-effectiveness of today's accelerator hardware is very strongly tied to the world in which those devices have evolved.  Look at the SOSP papers and you'll find all sorts of results about improving the mapping of batch-style workloads onto GPU and FPGA clusters (and the same insights apply to custom ASICs or TPUs).

Thus at the edge, we will find ourselves in a pinch: although the demand for cycles during an inference or training task is similar, these tasks are going to be event-driven: you fuse the data from a set of sources on a smart highway at the instant that you collect the data, with the intent of updating vehicle trajectory information and revising car guidance within seconds.  And the problem this poses is that the accelerator might spend most of its time waiting for work, even though when work does show up, it has magical superpowers that let it discharge the task within microseconds.

To me this immediately suggests that IoT could be profligate in its use of hardware accelerators, demanding a whole FPGA or GPU cluster for a single event.  Under any normal model of the cost of these devices, this approach would be extremely expensive.  I suppose you could make the case for some kind of bargain-basement device that might be a bit slower, less heavily provisioned with memory and otherwise crippled, and by doing that drive the costs down quite a bit.  But you'll also slow the tasks down, and that would then make ML for IoT a very feeble cousin to ML performed at the back-end on a big data analytics framework.

What would cost-effective IoT for the edge require, in terms of new hardware?  A first guess might be some form of time-sharing, to allow one device to be multiplexed between a large number of events from a single application that scales out to include many sensors (a relatively easy case), or between events from different users (a much harder one).  But I'm going to argue against this, at least if we base our reasoning on today's existing IoT accelerator options.

Multiplexing is exceptionally difficult for hardware accelerators.  Let's try to understand the root problems.  Now, in proposing this mini-deep-dive, I realize that many readers here are like me, and as such, probably view these units as black boxes.  But where we differ slightly is that over the past few years I've been forcing myself to learn more by teaching a class that looks at the roles of accelerators in modern datacenter computing.  I don't find these papers easy to read, or to teach, but along the way I've slowly gotten familiar with the models and learned about some interesting overheads.

When you look at a GPU in a datacenter, what are you likely to find under the hood?  As it happens, my colleague Weijia Song just asked this question, and arrived at a very nice high level summary.  He tells me that from a mile up, we should view a GPU computing unit as a specialized server, with its own memory (just like a normal computer, a GPU would be equipped with perhaps 12GB of DRAM), but then with processors that run a special kind of SIMD code (often written in a variant of C called CUDA) that performs block-parallel computations using blocks of GPU threads, each with its own GPU core.  L1 and L2 caching are software-managed, which is less exotic than you may think: with modern C and C++ we use annotations on variables to describe the desired consistency semantics and atomicity properties, and in fact the GPU scheme is rather similar.  So: we have a 3-level memory hierarchy, which can be understood by thinking about registers (L1), the normal kind of cache (L2) and DRAM resident on the GPU bus (like any normal DRAM, but fast to access from GPU code).
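To make that block-parallel picture a bit more concrete, here is a minimal CUDA sketch of my own (an illustration, not something taken from Weijia's notes or any real system): each thread handles one element, threads are grouped into blocks, and the __shared__ array stands in for the software-managed on-chip cache.  The kernel name and the sizes are arbitrary.

```cuda
// Illustrative only: a data-parallel scale-and-reduce kernel.
// Assumes it is launched with 256 threads per block.
__global__ void scale_and_sum(const float *in, float *block_sums,
                              float factor, int n) {
    __shared__ float cache[256];           // software-managed on-chip memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread processes one element: the "single highly parallel step".
    float v = (tid < n) ? in[tid] * factor : 0.0f;
    cache[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction within the block, entirely in the shared-memory "cache".
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];  // one partial sum per block
}
```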

Weijia's summary is pretty close to what I had expected, although it was interesting to realize that the GPU has quite so much DRAM.  Interesting questions about how to manage that as a cache arise...

At any rate, the next thing to think about is this: when we use a GPU, we assume that it somehow has the proper logic loaded into it: the GPU program is up and ready for input.  But what is a GPU program?

It seems best to think of GPU code as a set of methods, written in CUDA, and comprising a kind of library: a library that was loaded into the GPU when we first took control over it, and that we can now issue calls into at runtime.  In effect the GPU device driver can load the address of a function into a kind of register, put the arguments into other registers, and press "run".  Later an interrupt occurs, and the caller knows the task is finished.  For big data arguments, we DMA transfer them into the GPU memory at the outset, and for big results we have a choice: we can leave them in the GPU memory for further use, or DMA them back out.
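Here is the host-side half of that story, again just a rough sketch under my own assumptions: stage the big arguments into GPU memory, invoke one of the preloaded functions, and then either copy the result back out or leave it resident for the next call.  It assumes the scale_and_sum kernel from the earlier sketch is compiled in the same file; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Forward declaration of the illustrative kernel defined earlier (same file).
__global__ void scale_and_sum(const float *in, float *block_sums,
                              float factor, int n);

void run_once(const float *host_data, float *host_result, int n) {
    float *dev_in = nullptr, *dev_out = nullptr;
    int blocks = (n + 255) / 256;

    cudaMalloc(&dev_in,  n * sizeof(float));
    cudaMalloc(&dev_out, blocks * sizeof(float));

    // DMA the big arguments into GPU memory at the outset.
    cudaMemcpy(dev_in, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // The "press run" step: call into the library that was loaded earlier.
    scale_and_sum<<<blocks, 256>>>(dev_in, dev_out, 2.0f, n);

    // Either leave dev_out resident for a later kernel, or copy it back out.
    cudaMemcpy(host_result, dev_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
}
```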

Now, loading that library had hardware implications and it took time: let's say 3 seconds to 30 seconds depending on whether or not a reboot was required.  So already we face an issue: if we wanted to share our GPU between multiple users, at a minimum we should get the users to agree on a single GPU program or set of programs at the outset, load them once, and then use them for a long time.  Otherwise, your decision to load such-and-such a tensor package will disrupt my ability to do data fusion on my smart highway data, and my clients (smart cars) will be at high risk of an accident.  After all: 3 to 30 seconds is quite a long delay!

Additionally, I should perhaps note that GPUs don't have particularly good internal security models.  If we actually do allow multiple users to share one GPU, we depend on a form of cooperation between the library modules to protect against information leaking across from one user to another.  By and large sharing just isn't supported, but if GPUs really are shared, we either trust that this cooperative security is working, or we accept huge context-switch overheads -- many seconds -- and cold caches after each such context switch occurs.  People are actually exploring this kind of sharing: several papers I've read recently take a single GPU, subdivide it among a few tasks, and even experiment with having that single device concurrently running distinct tasks in distinct "address spaces" (this is not the term they use, and the GPU analog of an address space is quite different from a general purpose machine with a virtual address space, but there is still a roughly similar construct).

But here's the thing: even if AWS offers me a steep discount for multitenancy, I'm not sure that I would want to share my smart-highway data fusion GPU.  After all, I don't know you.  Maybe you are running some kind of spyware application that uses sneaky methods to extract my data through our shared device!  AWS might promise some form of security through obscurity: "attackers would never know who they share the GPU unit with."  But would I want to trust that?

What about FPGA?  Here, the situation is analogous but a bit more flexible.  As you probably are aware, FPGA is really a generic way to encode a chip: even the circuitry itself is reconfigurable.  An FPGA could encode a normal processor, like an ARM (although I wouldn't waste your time: most FPGA devices have a 6-core or 12-core ARM chip right onboard, so you might as well just run general-purpose tasks on those).  If you want your FPGA to behave like a very stripped-down GPU, you can find a block of logic to do that.  Prefer an analog-to-digital conversion unit that also does an FFT on the incoming signal?  In most respects an FPGA could do that too (it might need a bit of additional hardware to help with the analog signals per se).  So the FPGA developer is like a chip developer, but rather than sending out the chip design to be burned into silicon (which yields an application-specific integrated circuit or ASIC), you download your design into the FPGA, reboot it, and now your chip logic resides on this "bespoke" chip.

Like a GPU, we could think of an FPGA as hosting a set of functions, but here we run into some oddities of the FPGA itself being a chip: at the end of the day, data comes in and out through the FPGA's pins, which have to be "shared" in some way if you want multiple functions on the one device.  Companies like Microsoft are exploring frameworks (they call them "shells") to own the infrastructure, so that this layer of sharing can be standardized.  But that work is nowhere near tackling security in the sense that the O/S community understands the term.  This is a kind of sharing aimed more at ensuring platform stability than at isolation between potentially hostile competing users.

An FPGA is flexible: you can situate your new chip as a kind of filter interposed on the network (a bump-in-the-wire model), or you can use it as a kind of special instruction set accessible from the host computer and able to operate directly on host memory, or you can treat it much like a separate host sitting side by side with your general purpose host, with its own storage, its own logic, and perhaps a DMA interface to your host memory.  But once again, there is just one set of pins for that FPGA chip.  So if multiple "users" share an FPGA, they wouldn't really be arbitrary general users of AWS.  More likely, they would be multiple tasks that all want to process the identical data in distinct ways: perhaps each incoming frame of data in an IoT setting needs to be decrypted, but then we want to run image segmentation side-by-side with some form of de-duplication, side-by-side with a lighting analysis.  Because you probably designed this entire logic path, you can safely be rather trusting about security.  And after all: given that the FPGA may have DMA access to your DRAM (perhaps via the PCIe bus, which has its own memory map, but perhaps directly via the main memory bus), you wouldn't want to time-share this unit among other users of the same cloud!

Which leads to my point: time-sharing of devices like FPGAs and GPUs is really not a very flexible concept today.  We can share within components of one application, but not between very different users who have never even met and have no intention of cooperating.  The context-switch times alone tell us that you really wouldn't want these devices to change the code they run every time you launch a different end-user application, and doing so would run huge security risks.  A device failure can easily trigger a hardware-level host failure, and perhaps could take down the entire rack by freezing the network.  Yes, accelerators do offer incredible speedups, but these devices are not very sharable.

The immense surge in popularity of accelerators stems from big-data uses, but those are often heavily batched.  We take a million web pages, then perform some massively parallel task on them for an hour, a task that might have run for a year without the accelerators.

But this is far from sharability in a general sense.  So, what is this telling us about the event-oriented style of computing seen in the IoT Edge or on the IoT Cloud (the first tier of the true cloud, where a datacenter deals with attached IoT devices)?  In a nutshell, the message is that we will be looking at an exceptionally expensive form of AI/ML unless we find ways to scale our IoT solutions and to batch the actions they are taking.

Here are two small examples to make things a bit more concrete.  Suppose that my IoT objective is to automate some single conference room in a big building.  In today's accelerator model, I may have to dedicate some parallel hardware to the task.  Unless I can multiplex these devices within my conference room application, I'll use a lot of devices per task compared to what back-end experience in the cloud may have led the developers to anticipate.  Thus my solution will be surprisingly costly, when compared with AI of the kind used to place advertising on web pages as we surf from our browsers: thousands of times more costly, because the hardware accelerators for ad placement are amortized over thousands of users at a time.

I suppose I would solve this by offering my conference room solution as a service, having a pool of resources adequate to handle 1000 conference rooms at a time, and hoping that I can just build and deploy more if demand surges and, some morning, 10,000 people try to launch the service.  In the cloud, where the need was for general purpose computing, AWS solved this precise puzzle by offering us various IaaS and PaaS options.  With my reliance on dedicated, private accelerators, I won't have that same option.  And if I blink and agree to use shared accelerators, aside from the 3-second to 30-second startup delay, I'll be accepting some risk of data leakage through the accelerator devices.  They were listening during that conference... and when it ends, someone else will be using that same device, which was created by some unknown FPGA or GPU vendor, and might easily have some kind of backdoor functionality.  So this could play out in my favor ("but AWS had its eye on such risks, and in fact all goes well").  Or perhaps not ("Massive security breach discovered at KACS (Ken's awesome conference services): 3 years of conferences may have been covertly recorded by an unknown intruder").

In contrast, if I am automating an entire smart highway, I might have a shot at amortizing my devices, provided that I don't pretend to have security domains internal to the application that would correspond to security obligations on my multi-tasked FPGA or GPU accelerator.  But true security down in that parallel hardware would be infeasible, I won't be able to dynamically reconfigure these devices at the drop of a pin, and I will still need to think about ways of corralling my IoT events into batches (without delaying them!) or I won't be able to keep the devices busy.

Now, facing this analysis, you could reach a variety of conclusions.  For me, working on Derecho, I don't see much of a problem today.  My main conclusion is simply that as we evolve Derecho into a more and more comprehensive edge IoT solution, we'll want to integrate FPGA and GPU support directly into our compute paths, and that we need to start to think about how to do batching when a Derecho-based microservice leverages those resources.  But these are early days and Derecho can be pretty successful long before questions of data-center cost and scalability enter the picture.

In fact the worry should be at places like AWS and Facebook and Azure and Google Cloud.  Those are the companies where an IoT "app" model is clearly part of the roadmap, and where these cost and security tradeoffs will really play out.

And in fact, if I were a data-center owner, this situation would definitely worry me!  On the one hand, everyone is betting big on IoT intelligence.  Yet nobody is thinking of IoT intelligence as costly in the sense that I might need to throw (in effect) 10x or 100x more specialized parallel hardware at these tasks, on a "per useful action" basis, to accomplish my goals.  For many purposes those factors of 100 could be prohibitive at real scale.

We'll need solutions within five or ten years.  They could be ideas that let users securely multiplex FPGAs or GPUs, with far less context-switching delay.  Or ideas that allow them to do some form of receiver-side batching that avoids delaying actions but still manages to batch them, as we do in Derecho.  They could be ideas for totally new hardware models that would simply cost a lot less, like FPGA or GPU cores on a normal NUMA die (my worry for that case would be heat dissipation, but maybe the folks at Intel or AMD or NVIDIA would have a clever idea).
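To illustrate the receiver-side batching idea, here is a rough host-side sketch.  I want to be clear that everything in it (the Event type, the try_pop interface, the 2 ms window, the process_batch_on_gpu call) is a made-up placeholder, not Derecho's actual mechanism: the point is just that events accumulate until a batch fills or a short deadline expires, and then the whole batch goes to the accelerator in a single call, so the added latency stays bounded.

```cpp
#include <chrono>
#include <vector>

struct Event { /* decoded sensor payload */ };

// Placeholder: one accelerator invocation covering a whole batch of events.
void process_batch_on_gpu(const std::vector<Event> &batch);

template <typename Source>
void batching_loop(Source &incoming) {
    using clock = std::chrono::steady_clock;
    const auto window = std::chrono::milliseconds(2);  // bound on added delay
    const std::size_t max_batch = 64;

    std::vector<Event> batch;
    auto deadline = clock::now() + window;

    for (;;) {
        Event e;
        if (incoming.try_pop(e, deadline))      // hypothetical: wait until deadline
            batch.push_back(e);

        bool due = clock::now() >= deadline;
        if (!batch.empty() && (batch.size() >= max_batch || due)) {
            process_batch_on_gpu(batch);        // amortize one accelerator call
            batch.clear();
        }
        if (due || batch.empty())
            deadline = clock::now() + window;   // restart the batching window
    }
}
```

The design choice being sketched is simply that the window, not the arrival rate, bounds how long any single event can wait, while a full batch still triggers an immediate dispatch.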

The key point, though, remains: if we are serious about the IoT edge, we need to get serious about inventing the specialized accelerator models that the edge will require!
