- The multi-core computer itself.
- The Network Interface Cards (NICs) attached to it. Companies like Mellanox are dropping substantial amounts of computing power right into the NIC, although not every NIC will be fully capable. But at least some NICs are going to have substantial multicore chips right onboard in the NIC, with non-trivial amounts of memory and the ability to run a full operating system like Linux or QNX (I'm listing two examples that have widely used "embedded" versions, designed to run in a little co-processor -- these are situations where you wouldn't expect to find a display or a keyboard).
- The storage subsystem. We're seeing a huge growth in the amount of data that can be held by a NAND (aka SSD or flash) disk. Rotational disks are even larger. These devices tend to have onboard control units, to manage the medium, and many have spare capacity. Using that spare capacity makes sense: because of the size, it is less and less practical for the host computer to access the full amount of data, even for tasks like checking file system integrity and making backups. Moreover, modern machine-learning applications often want various forms of data "sketches" or "skims" -- random samples, or statistical objects that describe the data in concise ways. Since customers might find it useful to be able to ask the storage device itself to compute these kinds of things, or to shuffle data around for efficient access, create backups, and so on, manufacturers are playing with the idea of augmenting the control units with extra cores that could be programmed by pushing logic right into the lower layers of the storage hierarchy. Then a machine learning system could use that feature to "send" its sketch algorithm down into the storage layer, at which point it would have an augmented storage system that has a native ability to compute sketches, and similarly for other low-level tasks that might be useful.
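As a concrete example of the kind of "sketch" that might be pushed down into a storage controller, here is a minimal reservoir-sampling routine in Python. To be clear, this is my own illustration, not anything a vendor ships today: the function and the scenario of running it on the controller's spare cores are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Maintain a uniform random sample of k items over a data stream
    (classic Algorithm R).  A storage controller with spare cores could
    run something like this over blocks as they stream past, so the
    host never needs to read the full dataset just to get a sample."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Host-side view: ask the "device" for a 100-item sample of a million records.
sample = reservoir_sample(range(1_000_000), 100, seed=42)
```

The point of pushing this down a layer is that only the 100-item result crosses the bus, not the million records.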
- Attached co-processors such as NetFPGA devices, GPU clusters, systolic array units.
- The visualization hardware. Of course we've been familiar with graphical co-processors for a long time, but the sophistication available within the display device is growing by leaps and bounds.
- Security hardware. Here the state of the art would be technologies like Intel SGX (Software Guard Extensions), which can create secure enclaves: the customer can ship an encrypted virtual image to a cloud or some other less-trusted data center operator, where computation will occur remotely, in standard cloud-computing style. But with SGX, the cloud operator can't peer into the executing enclave, even though the cloud is providing the compute cycles and storage for the customer's job. Think of SGX as a way to create a virtually private cloud that requires no trust in the cloud owner/operator at all! Technologies like SGX have substantial programmability.
- Other types of peripherals. Not only do your line printer and your WiFi router have computers in them, it is entirely possible that your refrigerator does too, and your radio, and maybe your microwave oven. The oven itself almost definitely does. I point this out not so much because the average computer will offload computing into the oven, but more to evoke the image of something that is also the case inside your friendly neighborhood cloud computing data center: the switches and routers are programmable and could host fairly substantial programs, the power supplies and air conditioners are programmable, the wireless adaptor on the computer itself is a software radio, meaning that it is programmable and can dynamically adapt the kinds of signals it uses for communication, and the list just goes on and on.
The puzzle is that today, none of this is at all easy to do. Look at some cutting-edge task such as the navigation system of a smart car (sure, I'm not wild about this idea, but I'm a pragmatist too: other people love the concept and even if it stumbles, some amazing technologies will get created along the way), and two kinds of questions arise:
- What tasks need to be carried out?
- Where should they run?
I could break this down in concrete ways for specific examples, but rather than do that, I've been trying to distill my own thinking into a few design principles. Here's what I've come up with; please jump in and share your own thoughts!
- The first question to ask is: what's the killer reason to do this? If it will require a semi-heroic effort to move your data visualization pipeline into a GPU cluster or something, that barrier to actually creating the new code has got to be part of the cost-benefit analysis. So there needs to be an incredibly strong advantage to be gained by moving the code to that place.
- How hard is it to create this code? Without wanting to disparage the people who invented GPU co-processors and NetFPGA, I'll just observe that these kinds of devices are remarkably difficult to program. Normal code can't just be cross-compiled to run on them, and designing the code that actually can run on them is often more "like" hardware design than software design. You often need to do the development and debugging in simulators because the real hardware can be so difficult to even access. Then there is a whole magical incantation required to load your program into the unit, and to get the unit to start to process a data stream. So while each of these steps may be solvable, and might even be easy for an expert with years of experience, we aren't yet in a world where the average student could make a sudden decision to try such a thing out, just to see how it will work, and have much hope of success.
- Will they compose properly? While visiting at Microsoft last fall, I was chatting with someone who knows a lot about RDMA, and a lot about storage systems. This guy pointed out that when an RDMA transfer completes and the application is notified, while you do know that software in the end-point node will see the new data, you do not know for sure that hardware such as disks or visualization pipelines would correctly see it: they can easily have their own caches or pipelines and those could have stale data in them, from activity that was underway just as the RDMA was performed. You can call these concurrency bugs: problems caused by operating the various components side by side, in parallel, but without having any kind of synchronization barriers available. For example, the RDMA transfer layer currently lacks a way to just tell the visualization pipeline: "New data was just DMA'ed into address range X-Y". Often, short of doing a device reset, the devices we deal with just don't have a nice way to flush caches and pipelines, and a reset might leave the unit with a cold cache: a very costly way to ensure consistency. So the insight here is that until barriers and flush mechanisms standardize, when you shift computation around you run into this huge risk that the benefit will be swamped by buggy behaviors, and that the very blunt-edged options for flushing those concurrent pipelines and caches will steal all your performance opportunity!
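To make the stale-cache hazard concrete, here is a toy Python model. Everything in it is invented for illustration (real devices expose nothing this clean): a "device" caches host memory, an RDMA write updates host memory behind its back, and only an explicit invalidate -- the barrier that I'm arguing is missing -- brings the two back in sync.

```python
class DeviceWithCache:
    """Toy model of a peripheral (say, a visualization pipeline) that
    caches reads from host memory.  All names here are hypothetical."""
    def __init__(self, host_mem):
        self.host_mem = host_mem
        self.cache = {}

    def read(self, addr):
        if addr not in self.cache:       # cold miss: fetch from host memory
            self.cache[addr] = self.host_mem[addr]
        return self.cache[addr]          # hit: may return stale data!

    def invalidate(self, lo, hi):
        """The barrier RDMA lacks today: 'new data was just DMA'ed
        into address range lo..hi, drop your cached copies.'"""
        for addr in list(self.cache):
            if lo <= addr <= hi:
                del self.cache[addr]

host_mem = {0: "old"}
dev = DeviceWithCache(host_mem)
dev.read(0)                  # device caches "old"
host_mem[0] = "new"          # RDMA transfer completes; host software sees "new"
stale = dev.read(0)          # device still sees "old" -- the concurrency bug
dev.invalidate(0, 0)         # the targeted flush we wish existed
fresh = dev.read(0)          # now the device sees "new"
```

The alternative in today's hardware is a full device reset, which empties the whole cache: correct, but it throws away every warm entry to fix one stale one.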
- How stable and self-managed will the resulting solution be? The world has little tolerance for fragile technologies that easily break or crash or might sometimes malfunction. So if you believe that at the end of the day, people in our business are supposed to produce product-quality technologies, you want to be asking what the mundane "events" your solution will experience might look like (I have in mind things like data corruption caused by static or other effects in a self-driving car, or software crashes caused by Heisenbugs). Are these technologies capable of dusting themselves off and restarting into a sensible state? Can they be combined cleanly with other components?
- Do these devices fit into today's highly virtualized multi-tenancy settings? For example, it is very easy for MPI to leverage RDMA because MPI typically runs on bare metal: the endpoint system owns some set of cores and the associated memory, and there is no major concern about contention. Security worries are mostly not relevant, and everything is scheduled in nearly simultaneous ways. Move that same computing style into a cloud setting and suddenly MPI would have to cope with erratic delays, unpredictable scheduling, competition for the devices, QoS shaping actions by the runtime system, paging and other forms of virtualization, enterprise VLAN infrastructures to support private cloud computing models, etc. Suddenly, what was easy in the HPC world (a very, very, expensive world) becomes quite hard.
In terms of actual benefits, the kinds of things that come to mind are these:
- Special kinds of hardware accelerators. This is the obvious benefit. You also gain access to hardware functionality that the device API might not normally expose: perhaps because it could be misused, or perhaps because the API is somehow locked down by standards. These functionalities can seem like surreal superpowers when you consider how fast a specialized unit can be, compared to a general purpose program doing the same thing on a standard architecture. So we have these faster-than-light technology options, but getting to them involves crossing the threshold to some other weird dimension where everyone speaks an alien dialect and nothing looks at all familiar... (Sorry for the geeky analogy!)
- Low latency. If an RDMA NIC sees some interesting event and can react, right in the NIC, you obviously get the lowest possible delays. By the time that same event has worked its way up to where the application on the end node can see it, quite a lot of time will have elapsed.
- Concurrency. While concurrency creates problems, like the ones listed above, by offloading a task into a co-processor we also insulate that task from scheduling delays and other disruptions.
- Secret knowledge. The network knows its own topology, and knows about overall loads and QoS traffic shaping policies currently in effect. One could draw on that information to optimize the next layer (for example, if Derecho were running in a NIC, it could design its data plane to optimize the flow relative to data center topology objectives. At the end-user layer, that sort of topology data is generally not available, because data center operators worry that users could take advantage of it to game the scheduler in disruptive ways). When designing systems to scale, this kind of information could really be a big win.
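As a small illustration of how that secret knowledge could shape a data plane, here is a hypothetical rack-aware relay selection in Python. The node-to-rack map is exactly the topology information a NIC-resident service might see but an end-user process would not; the function and names are mine, not part of any real system.

```python
def rack_aware_relays(members, rack_of):
    """Pick one relay per rack.  The sender then forwards each message
    once per rack, and relays fan out locally over cheap intra-rack
    links.  'rack_of' is a hypothetical node -> rack map that only the
    network layer would actually know."""
    relays = {}
    for node in members:
        relays.setdefault(rack_of[node], node)   # first member seen per rack
    return relays

members = ["a1", "a2", "b1", "b2", "c1"]
rack_of = {"a1": "rackA", "a2": "rackA",
           "b1": "rackB", "b2": "rackB", "c1": "rackC"}
plan = rack_aware_relays(members, rack_of)
# With topology knowledge: 3 cross-rack sends instead of 5 unicasts.
```

A group communication layer running in the NIC could make exactly this kind of choice; the same code running at user level would have no `rack_of` map to consult.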
- Fault-isolation. If you are building a highly robust subsystem that will play some form of critical role, by shifting out of the very chaotic and perhaps hostile end-user world, you may be in a position to protect your logic from many kinds of failures or attacks.
- Security. Beyond fault-isolation, there might be functionality we would want to implement that somehow "needs" to look at data flows from multiple users or from distinct security domains. If we move that functionality into a co-processor and vet it carefully for security flaws, we might feel less at risk than if the same functionality were running in the OS layer, where just by allowing it to see that cross-user information potentially breaks a protection boundary.
I'll stop here, because (as with many emerging opportunities), the real answers are completely unknown. Someday operating system textbooks will distill the lessons learned into pithy aphorisms, but for now, the bottom line is that we'll need engineering experience to begin to really appreciate the tradeoffs. I have no doubt at all that the opportunities we are looking at here are absolutely incredible, for some kinds of applications and some kinds of products. The puzzle is to sort out the big wins from the cases we might be wiser just not pursuing, namely the ones where the benefits will be eaten up by some form of overhead, or where creating the solution is just impossibly difficult, or where the result will work amazingly well, but only in the laboratory -- where it will be too fragile or too difficult to configure, or to reconfigure when something fails or needs to adapt. Maybe we will need a completely new kind of operating system for "basement programming." But even if such steps are required before we can fully leverage the opportunity, some of those opportunities will be incredible, and it will be a blast figuring out which ones offer the biggest payoff for the least effort!