At the recent SOSP 2019 conference up in Ottawa, I was one of the "old guard" who had expected a program heavy on core O/S papers aimed at improving performance, code quality or reliability for the O/S itself or for general applications. How naïve of me: Easily half the papers were focused on the performance of machine learning infrastructures or applications. And even in the other half, relatively few fell into the category that I think of as classic O/S research.
Times change, though, and I'm going to make the case here that in fact the pendulum is about to swing back, and in quite a dramatic way. If we ask what AI and ML really meant in the decade or so leading up to this 2019 conference, the answer leads directly to the back-end platforms that host big-data analytic tools.
Thus it was common to see a SOSP paper that started by pointing out that training a deep neural network can take many hours, during which the GPU is sometimes stalled, starved for work to do. Then the authors would develop some technique for restructuring the DNN training system, the GPU would stay fully occupied, and performance would improve 10x. What is perhaps a bit less evident is the extent to which these questions, and indeed the entire setup, center on the big-data back-end aspects of the problem. The problem as framed today is inherently situated in that world: when you train a DNN, you generally work with vast amounts of data, and the whole game is to parallelize the task by batching the job, spreading the work over a large number of nodes that each tackle some portion of the computation, and then iterating, merging the sub-results as the DNN training system runs. The whole setup is about as far from real-time as it gets.
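To make that batching-and-merging pattern concrete, here is a minimal sketch of synchronous data-parallel training in plain Python/NumPy. The worker count, the toy least-squares model, and the in-process loop are all illustrative assumptions of mine; a real training system would run the workers on GPUs or TPUs and merge gradients with an allreduce.

```python
# Sketch of synchronous data-parallel training: shard work across "workers",
# compute per-shard gradients, then merge (average) them before one shared update.
# Model, sizes, and learning rate are toy values chosen purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
num_workers, batch_per_worker, dim = 8, 128, 32
w = np.zeros(dim)                      # shared model parameters
w_true = rng.normal(size=dim)          # hidden "ground truth" generating the toy data

for step in range(100):
    grads = []
    for _ in range(num_workers):       # in a real system these run in parallel
        X = rng.normal(size=(batch_per_worker, dim))
        y = X @ w_true
        err = X @ w - y                # least-squares residual on this worker's shard
        grads.append(X.T @ err / batch_per_worker)
    merged = np.mean(grads, axis=0)    # the "merge the sub-results" step (an allreduce)
    w -= 0.1 * merged                  # one synchronized update of the shared model
```

The point of the sketch is simply that everything here is batched and throughput-oriented: no single event matters, only the aggregate.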
So we end up with platforms like MapReduce and Hadoop (or Spark), or MPI for the HPC fans in the room. These make a lot of sense in a batched, big-data, back-end setting.
But I'm a huge believer in the cloud edge: IoT systems with real-time sensors that both make decisions and even learn as they run, often under intense time pressure. Take Microsoft's Farmbeats product, led by a past student of mine who was a star from the outset. Today, Ranveer Chandra has become the Chief Scientist for Azure Global and is the owner of this product line. Farmbeats has a whole infrastructure for capturing data (often video or photos) from cameras. It can support drones that will scan a farm for signs of insect infestations or other problems. It replans on the fly, and makes decisions on the fly: spray this. Bessie seems to be favoring her right hoof, and our diagnostic suggests a possible tear in the nail: call the vet. The tomatoes in the north corner of the field look very ripe and should be picked today. Farmbeats even learns wind patterns on the fly, and reschedules the drones flying the search pattern to glide on the breeze as a way to save battery power.
And this is the future. Last time I passed through Newark Airport, every monitor was switching from ads for Microsoft's Farmbeats to ads for Amazon's smart farm products, back and forth. The game is afoot!
If you walk the path for today's existing DNN training systems and classifiers, you'll quickly discover that pretty much everything gets mapped to hardware accelerators. Otherwise, it just wouldn't be feasible at these speeds and at this scale. The specific hardware varies: at Google, TPUs and TPU clusters, plus their own in-house RDMA solutions. Amazon seems very fond of GPUs and very skeptical of RDMA; they seem convinced that fast datacenter TCP is the way to go. Microsoft has been working with a mix: GPUs and FPGAs, configurable into ad-hoc clusters that treat the infrastructure like an HPC supercomputing communications network, complete with the kinds of minimal latencies that MPI assumes. All the serious players are interested in hearing about RDMA-like solutions that don't require a separate InfiniBand network and that can be trusted not to hose the entire network. And now that you can buy SSDs built from 3D XPoint, nobody wants anything slower.
And you know what? This all pays off. Those hardware accelerators are data-parallel, and no general-purpose CPU can possibly compete with them. They would be screamingly fast even without attention to the exact circuitry used to do the computing, but in fact many accelerators can be configured to omit unneeded logic: an FPGA can be pared down until its circuits implement only the exact operations actually needed, to the exact precision desired. If you want matrix multiplication with 5-bit numbers, there's an FPGA solution for that. Not a gate or wire wasted... and hence no excess heat, no unneeded slowdowns, and more FPGA real-estate available to make the basic computational chunk sizes larger. That's the world we've entered, and if all these acronyms make you dizzy, well, all I can do is to say that you get used to them quickly!
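To give a feel for what "the exact precision desired" can mean, here is a small Python/NumPy sketch that emulates a 5-bit quantized matrix multiply of the kind a trimmed-down FPGA kernel might implement natively. The bit width and the per-tensor scaling scheme are illustrative assumptions, not any vendor's actual design.

```python
# Emulate a signed 5-bit quantized matrix multiply: narrow-integer multiply-accumulate,
# then rescale back to floats. Bit width and scaling are illustrative assumptions.
import numpy as np

BITS = 5
QMAX = 2 ** (BITS - 1) - 1                 # signed 5-bit range is [-16, 15]

def quantize(x: np.ndarray):
    """Map a float tensor to signed 5-bit integers plus one per-tensor scale."""
    scale = np.max(np.abs(x)) / QMAX
    q = np.clip(np.round(x / scale), -QMAX - 1, QMAX).astype(np.int32)
    return q, scale

def qmatmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply in the narrow integer domain (what the pared-down gates would do)."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    return (qa @ qb) * (sa * sb)           # integer MACs, one rescale at the end

rng = np.random.default_rng(1)
a, b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(np.max(np.abs(qmatmul(a, b) - a @ b)))   # quantization error vs. full precision
```

On an actual FPGA the multiply-accumulate units would be synthesized at exactly this width, which is where the savings in gates, heat and real-estate come from.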
To me this leads to the most exciting of the emerging puzzles: How will we pull this same trick off for the IoT edge? (I don't mean specifically the Azure IoT Edge, but more broadly: both remote clusters and the edge of the main cloud datacenter.) Today's back-end solutions are more heavily optimized for batch processing and relaxed time constraints than people probably realize, and the same hardware that seems so cost-effective for big data may look quite a bit less efficient once you try to move it out near the edge devices.
In fact it gets even more complicated. The edge is a world with some standards. Consider Azure IoT: the devices are actually managed by the IoT Hub, which provides a great deal of security (particularly in conjunction with Azure Sphere, a hardware-based security platform with its own TCB). IoT events trigger Azure Functions: lightweight, stateless computations that are implemented as standard programs but that run in a context where we don't have FPGA or GPU support. Initializing those kinds of devices to support a particular kernel (a particular computational functionality) typically requires at least 3-5 seconds to load the kernel and then reboot the device. I'm not saying it can't be done: if you know that many IoT functions will need to run a DNN classifier, I suppose you could position accelerators in the function tier and preload the DNN classifier kernel. But the more general form of the story, where developers at a thousand IoT companies create those functions, leads to such a diversity of accelerator needs that it just could never be managed.
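For concreteness, here is roughly what such a lightweight, stateless function looks like in the Azure Functions Python model, triggered by IoT Hub events routed through its Event Hub-compatible endpoint. The downstream microservice URL and the payload format are hypothetical, and error handling is omitted.

```python
# A minimal sketch of a stateless IoT-triggered function: parse the device event,
# then hand the heavy (accelerated) work to a microservice behind the function tier.
# The microservice URL and the JSON payload shape are hypothetical.
import json
import urllib.request

import azure.functions as func

MICROSERVICE_URL = "http://inference-tier.internal/classify"   # hypothetical endpoint

def main(event: func.EventHubEvent) -> None:
    telemetry = json.loads(event.get_body().decode("utf-8"))
    request = urllib.request.Request(
        MICROSERVICE_URL,
        data=json.dumps(telemetry).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=2.0)   # fire-and-forget for the sketch
```

Notice that nothing here touches a GPU or an FPGA: the function is just a small, customizable router, which is exactly why the acceleration has to live somewhere else.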
So... it seems unlikely that we'll be doing the fast stuff in the function tier. More likely, we'll need to run our accelerated IoT logic in microservices living behind the function tier: heavier nodes with beefier memories, stable accelerators loaded with exactly the kernels the particular microservice needs, and all of it managed for fault-tolerance, with critical data replicated (and updated as the system learns: we'll be creating new ML models on the fly, like Farmbeats and its models of cow health, its maps of field conditions, and its wind models). As a side remark, this is the layer my Derecho project targets.
But consider the triggers: in the back-end, we ran on huge batches of events, perhaps tens of thousands at a time, in parallel. At the edge, a single event might trigger the whole computation. Even to support this, we will still need RDMA to get the data to the microservice instance(s) charged with processing it (via the IoT functions, which behave like small routers, customized by the developer), fast NVM like Optane, and then those FPGA, GPU or TPU devices to do the highly parallel work. How will we keep these devices busy enough to make such an approach cost-effective?
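One plausible answer, sketched here purely as an illustration, is for the microservice to micro-batch: hold incoming events for a few milliseconds so that the accelerator sees a batch rather than a single item. The queue, the timing budget, and the run_on_accelerator placeholder are all my own assumptions.

```python
# Sketch of micro-batching in the microservice tier: gather events briefly so the
# accelerator runs on a batch instead of one item at a time. All parameters are
# illustrative assumptions, and run_on_accelerator is a placeholder.
import queue
import threading
import time

events: "queue.Queue[dict]" = queue.Queue()
MAX_WAIT_MS = 5            # latency budget for forming a batch
MAX_BATCH = 64             # sized to the accelerator's sweet spot

def run_on_accelerator(batch):
    print(f"dispatching batch of {len(batch)} events")   # stand-in for the kernel call

def batching_loop():
    while True:
        batch = [events.get()]                            # block until at least one event
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(events.get(timeout=remaining))
            except queue.Empty:
                break
        run_on_accelerator(batch)

threading.Thread(target=batching_loop, daemon=True).start()
```

The function tier would simply call events.put(...) for each incoming item; the trade-off is a few milliseconds of added latency in exchange for keeping the expensive hardware busy.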
And how will the application developer "program" the application manager to know how these systems need to be configured? This microservice needs one GPU per host device; that one needs small FPGA clusters with 32 devices, but will only use them for a few milliseconds per event; this other one needs a diverse mix, with a TPU here, a GPU there, an FPGA on the wire... Even "describing" the needed configuration looks really interesting, and hard.
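Even a first cut at such a description quickly turns into a small specification language. Here is one possible shape for it, sketched as Python dataclasses that roughly mirror the three hypothetical microservices above; the schema and every field name are invented.

```python
# Sketch of a declarative accelerator-requirement spec for microservices.
# The schema and all field names are invented, purely to illustrate the problem.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AcceleratorRequirement:
    kind: str                                  # "GPU", "FPGA", "TPU", ...
    count: int = 1
    per_host: bool = False                     # one device per host vs. a shared pool
    kernel: Optional[str] = None               # preloaded kernel/bitstream, if any
    max_busy_ms_per_event: Optional[float] = None   # hint for packing/sharing devices

@dataclass
class MicroserviceSpec:
    name: str
    requirements: List[AcceleratorRequirement] = field(default_factory=list)

specs = [
    MicroserviceSpec("classifier", [AcceleratorRequirement("GPU", per_host=True)]),
    MicroserviceSpec("burst-fpga", [AcceleratorRequirement("FPGA", count=32,
                                                           max_busy_ms_per_event=3.0)]),
    MicroserviceSpec("mixed-pipeline", [AcceleratorRequirement("TPU"),
                                        AcceleratorRequirement("GPU"),
                                        AcceleratorRequirement("FPGA",
                                                               kernel="on-the-wire-filter")]),
]
```

And the declaration is the easy part: the manager then has to place, share, preload, and fail over all of that hardware, which is where the real systems research lies.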
You see, for researchers like me, questions like those aren't a reason to worry. They are more like a reason to celebrate: plenty to think about, lots of work to do and papers to write, and real challenges for our students to tackle. Systems challenges.
So you know what? I have a feeling that by 2020 or 2021, OSDI and SOSP will start to feel more like classic systems conferences again. The problems I've outlined look more like standard O/S topics, even if the use-case that drives them is dominated by AI and ML and real-time decision-making for smart farms (or smart highways, smart homes, smart cities, smart grids). The money will be there -- Bill Gates once remarked during a Cornell visit that the IoT revolution could easily dwarf the Internet, Web and Cloud ones. (I had a chance to ask why, and his answer was basically that there will be an awful lot of IoT devices out there... so how could the revolution not be huge?)
But I do worry that our funding agencies may be slow to understand this trend. In the United States, graduate students and researchers are basically limited to working on problems for which they can get funding. Here we have a money-hungry and very hard set of questions, ones that perhaps can only be explored in a hands-on way by people actually working at companies like Google and Microsoft and Amazon. Will this lead the NSF and DARPA and other agencies to adopt a hands-off approach?
I'm hoping that one way or another, the funding actually will be there. Because if groups like mine can get access to the needed resources, I bet we can show you some really cool new ideas for managing all that hardware near the IoT edge. And I bet you'll be glad we invented them, when you jump into your smart car and tell it to take the local smart highway into the office. Just plan ahead: you won't want to miss OSDI next year, or SOSP in 2021!