Tuesday, 6 December 2016

RDMA [1]: A short history of remote DMA networking


Cool factoid: RDMA was co-invented by Amazon’s current CTO, Werner Vogels!

RDMA is the modern commercial form of an idea that originated in work by Thorsten von Eicken, Anindya Basu, Vineet Buch and Werner Vogels, who wrote an SOSP paper in 1995 showing that you could build a parallel supercomputer from normal servers: disable virtualization (which slows memory access down), pin a dedicated set of cores to the application, and map the network device right into user space.

This work was done at Cornell, up the hall from me -- in fact at that time, Werner Vogels worked in my group.  A home run for Cornell!  But one we don't get nearly enough credit for.
You can learn a lot about RDMA from that paper even today, but beyond the technical content, one thing to appreciate is that RDMA, at its birth, was already aimed squarely at high performance parallel computing: this is what Von Eicken's research was focused on, and the early examples were closely matched to key steps in HPC computing.  This set the stage for 20 years of ever-deeper integration of RDMA into HPC computing tools, and for a co-evolution of RDMA with HPC.  One side-effect is that right from the start, RDMA overlooked many issues that are viewed as must-haves in today's general purpose, multi-tenant datacenter or computing cluster.  Even now, RDMA is catching up with the complex mix of puzzles that arise when you try to migrate this idea into a cloud datacenter.
So if you are wondering why RDMA today seems to be so strongly tied to Infiniband (the favorite networking technology of the HPC cluster designers) and to MPI (the most popular HPC programming library), the answer is that they were joined at the hip from birth!  In many ways, HPC and RDMA have been two sides of one story, while RDMA on general purpose platforms was pretty much a non-starter from the beginning.  In fact, RDMA actually does work on general purpose platforms, but it lacks a number of features needed if we're ever to see really broad adoption.
So since this is a little history lesson, let’s linger on the distant past.
The U-Net architecture was not something you could have easily thrown together at that time: the central component was a reprogrammed network interface card (NIC) configured to continuously poll a lock-free queue shared with the user-mode program.  To send or receive data, the user code could simply enqueue an asynchronous send or receive request, and the NIC would instantly notice the new request and perform the I/O without any further software involvement at all.  On completion, U-Net notified the sender and receiver by placing a completion record on a separate I/O queue.
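To make the mechanism concrete, here is a minimal sketch in C of the shared-ring idea (not U-Net's actual data structures, and with hypothetical names): the host posts a descriptor into a lock-free ring in pinned memory, and the NIC firmware, polling that ring, picks it up and later posts a completion record to a second queue.

```c
/* A minimal sketch (hypothetical names, not U-Net's real layout) of the
 * shared-queue idea: application and NIC share lock-free rings in pinned
 * memory; the host enqueues descriptors, the NIC firmware polls for them. */
#include <stdint.h>
#include <stdatomic.h>

#define RING_SLOTS 256

struct send_desc {              /* one asynchronous send request             */
    uint64_t buf_addr;          /* pinned address of the payload             */
    uint32_t length;            /* bytes to transmit                         */
    uint32_t dest;              /* hypothetical endpoint identifier          */
};

struct completion {             /* posted by the NIC when a request finishes */
    uint64_t wr_id;             /* which request completed                   */
    int      status;            /* 0 = success                               */
};

struct ring {
    _Atomic uint32_t head;      /* written by the producer (host)            */
    _Atomic uint32_t tail;      /* advanced by the consumer (NIC firmware)   */
    struct send_desc slots[RING_SLOTS];
};

/* Host side: post a send without any system call; the NIC firmware,
 * continuously polling `head`, will notice the new descriptor.             */
static int post_send(struct ring *sq, struct send_desc d)
{
    uint32_t h = atomic_load_explicit(&sq->head, memory_order_relaxed);
    if (h - atomic_load_explicit(&sq->tail, memory_order_acquire) == RING_SLOTS)
        return -1;              /* ring full                                 */
    sq->slots[h % RING_SLOTS] = d;
    atomic_store_explicit(&sq->head, h + 1, memory_order_release);
    return 0;
}
```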
I actually was in the building when Werner and Thorsten got this reprogrammed firmware running (as noted, both were at Cornell in those days).  Seen from up the hallway, they did something astonishingly difficult: sometimes I would leave work late in the day and then come back after a good night of sleep the next day to find them still at it -- for easily two weeks they worked nearly day and night to pull off that first success story.  They used a NIC that wasn't really designed to be reprogrammed, and their first firmware versions destroyed at least a few of the devices before they got things running.  Expensive bugs.
But they got it working, and the payoff was blazing speed: RPC over U-Net was amazingly fast, letting you send a request and receive the reply in just a few tens of microseconds (now down to as little as 3 or 4us, and likely to drop even further).  I've already pointed out that Von Eicken's SOSP paper was focused on parallel supercomputing, and to get those numbers, part of the game was that his applications were coded as a leader plus dedicated worker programs that did nothing but wait for messages from the leader, with absolutely nothing else running on the machines.  For this model, the paper made the case that special-purpose parallel computing hardware could actually be outperformed by commodity computers using modified commodity network devices.  Most of today's fastest machines use the approach the paper sketched out.
As noted above, one legacy of RDMA's origins is that even today, RDMA is most widely used via a popular HPC programming tool called the Message Passing Interface (MPI).  MPI wraps IB in a friendly set of software constructs: a reliable multicast, shared memory data structures, and so forth.  Internally, MPI talks to the hardware through a layer called IB Verbs.  Verbs is a very low-level, driver-oriented API: it requires the MPI library code to pin memory pages, "register" them with the device (this lets the device precompute the physical addresses of the pages that will be used for I/O), and then, when it comes time to do a send or receive, to construct I/O request records that often include as many as 15 parameters.
The whole API is documented in much the same style as for a hardware interface.  The community that works with this technology leaves it to the individual vendors to document the libraries that implement the Verbs API.  Some vendors offer additional proprietary APIs, but in my limited experience, none is particularly easy to use.
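To give a feel for what Verbs programming looks like, here is a hedged sketch in C.  It assumes a protection domain pd and an already-connected queue pair qp have been set up elsewhere (the setup itself takes dozens of calls, omitted here), and it registers a buffer and posts a single send:

```c
/* Hedged sketch of the Verbs workflow: register memory, build a work
 * request, post it.  Assumes `pd` and `qp` were created and connected
 * elsewhere; error handling is minimal. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int send_buffer(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len)
{
    /* Step 1: pin and register the memory so the NIC can DMA it.
     * (A real program would cache this and call ibv_dereg_mr() later.) */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Step 2: describe the data with a scatter/gather element. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* Step 3: build the work request -- this is where the long list of
     * parameters mentioned above shows up. */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;                 /* caller-chosen completion tag   */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;       /* two-sided send                 */
    wr.send_flags = IBV_SEND_SIGNALED; /* ask for a completion record    */

    /* Step 4: hand it to the NIC; the completion is later reaped from a
     * completion queue with ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```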
So from this, it may sound as if you should just use MPI on datacenter systems, or on your cluster up the hall.  Would this make sense?  It might, were it not for some MPI limitations (I want to underscore that none of these are MPI defects: they really reflect ways in which HPC computing differs from general-purpose computing). 
In a nutshell:
  • MPI is heavily optimized for so-called “bare metal” settings (no virtualization).  However, I do want to repeat something from my RDMA intro: the RDMA NIC itself is now starting to evolve to understand that host memory might actually be virtualized and paged.  So while this has been an issue until recently (I'm writing this in 2017), the story is shifting.  The future version of this caveat might be more along the lines of: keep in mind that with virtualization and, especially, with page faults, RDMA might be slower than when the target is a pinned, memory-mapped DRAM region.
  • MPI can be used with RDMA over RoCE, but in that mode it is known to run much slower than on Infiniband.  As of 2017, at least.   The reason this matters is that a datacenter, cloud or cluster probably runs Ethernet, and nobody is going to listen if you try to talk the IT people into switching over to Infiniband.  So today, if you run MPI in a shared datacenter on RDMA, you’ll be working with one hand tied behind your back, and you may hit issues that aren’t easily resolved, since the great majority of MPI experience is on Infiniband.  Thus even early cloud-computing HPC offerings turn out to use Infiniband in the associated clusters (and more than just that: they also schedule the tasks to co-locate them close to one another, minimizing latency; launch them simultaneously, minimizing startup skew; and are smart about how memory is allocated, to avoid NUMA effects).  So it isn't a trivial thing to just assume that MPI could run on multi-tenant RoCE systems with all the complexities that doing so would introduce.  It won't happen overnight, and maybe it won't ever work as well as MPI works on an HPC cluster today.
  • MPI is gang-scheduled.  What this means is that when you code an MPI application, it is designed to always have N program instances running at the same time, and as noted, all started simultaneously.  The MPI system requires that all of them be launched when the program starts: you can’t add programs while your application is running, or migrate them to different nodes, or keep running if one fails, and it won't start any of them executing until all the replicas are ready to roll.  Now, these clones will all start up in an identical state, but they won’t have identical roles: the way MPI works is to designate one clone as the head node or leader; the others are configured as its workers.  The leader tells the workers what to do, and they act at its direction.  Of course this can send them off doing long computations and talking to one another for extended periods without the leader getting involved, but the basic model always puts the leader in a special, privileged role.  (A minimal sketch of this leader/worker pattern appears right after this list.)
  • This matters because it is not obvious how a gang-scheduled model could be used for building an elastic cloud computing service.  So you can definitely have a cloud service (like a weather app) that uses an HPC system as a source of data (like a massively parallel weather simulation).  But you would not want to think of a cloud system that has MPI programs directly serving web requests because, basically, that makes no sense: it ignores the MPI model.  Cloud companies are beginning to offer compute-cloud products, like Azure Compute from Microsoft, and these really do use the HPC model, complete with MPI, RDMA on Infiniband, etc.  But they configure these products as HPC clusters that are easily accessible within the cloud – they don’t try to pretend that the HPC cluster “is” the cloud in the sense of receiving browser requests directly from clients.
  • MPI is not fault-tolerant or dynamic: the set of N member programs is static.   In contrast, applications in general datacenter/cloud settings need to be elastic (they need to add members or scale back dynamically, responding to varying levels of external demand).  This is obviously true for the outer (sometimes called stateless) tier of the cloud, but it is also true for the services that the cloud uses internally.
  • MPI is often integrated with the HPC scheduler, which selects the leader in a topology-aware way.  As a result, MPI use of RDMA benefits from a compute model with little contention for resources such as top-of-rack routers, and MPI can launch parallel tasks with minimal skew.
  • Based on what I see on the web, it seems as if MPI startup is slow: seconds to minutes (a big part of the delay is just the time needed to load the associated programs, data files, and perhaps entire VM images onto the various nodes your job will run on).  Since MPI applications often run for many minutes or even hours, this is probably not a major issue in HPC settings.  The delay may have as much to do with HPC cluster data transfer protocols and the way that HPC schedulers do their scheduling as with MPI itself, but because MPI is gang-scheduled, with the leader telling the workers what to do, MPI can’t start computing until all N nodes are up and running.
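As promised above, here is a minimal sketch in plain MPI C (not drawn from any particular HPC application) of the static, gang-scheduled leader/worker structure: the world size N is decided at launch, rank 0 plays the leader, and nothing runs until every rank has started.

```c
/* Minimal sketch of the static, gang-scheduled MPI pattern: N copies of
 * this program are launched together (e.g. "mpirun -np N ./a.out"); the
 * membership never changes, and if any rank dies the whole job aborts. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);      /* in effect, blocks until all N ranks exist */

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* N is fixed for the whole run */

    if (rank == 0) {
        /* Leader: hand a distinct (hypothetical) task to each worker. */
        for (int w = 1; w < nprocs; w++) {
            int task = w * 10;
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
    } else {
        /* Worker: wait for the leader's instructions, then compute. */
        int task;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d of %d working on task %d\n", rank, nprocs, task);
    }

    MPI_Finalize();
    return 0;
}
```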
So would it make sense to use MPI in general settings?  Maybe, for some purposes, but probably not for the most obvious use cases.  For example, say you wanted to build a utility to make massive numbers of replicas of a file (an easy, obvious use case).  On MPI you would need a way to launch the thing on the nodes where the replicas will go (not a hard thing, but MPI won’t help with that step).  Startup might be slow.  Once the file is read in on the leader node, the replication step would be fast, but perhaps not as fast as MPI on an HPC cluster, since your cloud setting would expose MPI to running on RoCE, might have virtualization or multitenancy, and would definitely experience irregular scheduling delays.  If something were to crash, MPI would shut down the whole application.
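For concreteness, the replication utility might look something like the following hedged sketch (file names are made up and error handling is omitted): rank 0 reads the file, MPI_Bcast pushes the bytes to every rank, and each rank writes a local copy.

```c
/* Hedged sketch: replicate one file to every node of an MPI job.
 * Assumes the file fits in memory; paths are hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *data = NULL;

    if (rank == 0) {                          /* leader reads the source file */
        FILE *f = fopen("input.dat", "rb");
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fseek(f, 0, SEEK_SET);
        data = malloc(size);
        fread(data, 1, size, f);
        fclose(f);
    }

    /* Everyone learns the size, then receives the bytes. */
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        data = malloc(size);
    MPI_Bcast(data, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    FILE *out = fopen("replica.dat", "wb");   /* each rank writes its copy */
    fwrite(data, 1, size, out);
    fclose(out);

    free(data);
    MPI_Finalize();
    return 0;
}
```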
Add all of this up, and there are good reasons to doubt that using MPI this way really makes sense.  Cornell’s Derecho steps into that gap.  So we can’t trivially use MPI on datacenter systems, but this doesn’t mean that we should give up on the potential of RDMA in the datacenter.
