Cool factoid: RDMA was co-invented by
Amazon’s current CTO, Werner Vogels!
RDMA is the modern commercial form of an
idea that originated in work by Thorsten von Eicken, Anindya Basu, Vineet Buch and
Werner Vogels, who wrote an SOSP paper in 1995 showing that you could build a
parallel supercomputer by running parallel computing code on normal servers:
disable virtualization (which slows memory access down), pin a dedicated
set of cores to the application, and then map the network device right into
user space.
This work was done at Cornell, up the hall from me -- in fact at that time, Werner Vogels worked in my group. A home run for Cornell! But one we don't get nearly enough credit for.
You can learn a lot about RDMA from that
paper even today, but beyond the technical content, one thing to appreciate is that RDMA, at its birth, was aimed squarely at high performance
parallel computing: this is what von Eicken's research focused on, and the early examples were matched directly to key steps in HPC applications. That turned out to
set the stage for 20 years of ever-deeper integration of RDMA into HPC
tools, and for a co-evolution of RDMA with HPC. One side-effect is that right from the start, RDMA overlooked
many issues that are viewed as must-have considerations in today's general
purpose, multi-tenant datacenter or computing cluster. Even now, RDMA is catching up with the complex mix of puzzles that arise when you try to migrate this idea into a cloud datacenter.
So if you are wondering why RDMA today
seems to be so strongly tied to Infiniband (the favorite networking technology
of the HPC cluster designers) and to MPI (the most popular HPC programming
library), the answer is that they were joined at the hip from birth! In many ways, HPC and RDMA have been two
sides of one story, while RDMA on general purpose platforms was pretty much
a non-starter from the beginning. In
fact, RDMA actually does work on general purpose platforms, but it lacks a
number of features needed if we’re ever to see really broad adoption.
So since this is a little history lesson,
let’s linger on the distant past.
The U-Net architecture was not something
you could have easily thrown together at that time: the central component is a
reprogrammed network interface card (NIC) configured to continuously poll a
lock-free queue shared with the user-mode program. To send or receive
data, the user code could simply enqueue an asynchronous send or receive
request, and the NIC would instantly notice the new request and perform the I/O
without any software involvement at all.
On completion, U-Net notified the sender and receiver by putting a
completion record on a separate I/O queue.
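To make this concrete, here is a tiny sketch of the kind of shared-queue structure involved. The names, field layout and queue depth are all invented for illustration -- the real U-Net descriptors differed in detail -- but the essential idea is the same: the application and the NIC share lock-free queues in pinned memory, the application publishes a request, and the busy-polling NIC picks it up with no kernel involvement.

```c
/* Hypothetical sketch of the U-Net style shared-queue idea.  The structure
 * names and fields are invented for illustration; this is not the actual
 * U-Net interface.  The user process and the NIC both map this memory. */

#include <stdint.h>
#include <stdatomic.h>

#define QUEUE_DEPTH 256

struct send_desc {            /* one asynchronous send request */
    uint64_t buf_addr;        /* address of the pinned data buffer */
    uint32_t length;          /* bytes to transmit */
    uint32_t dest_endpoint;   /* which remote endpoint to send to */
};

struct completion_rec {       /* posted by the NIC when an op finishes */
    uint32_t desc_index;      /* which descriptor completed */
    uint32_t status;          /* 0 = success */
};

struct shared_queues {        /* mapped into both user space and the NIC */
    struct send_desc      send_q[QUEUE_DEPTH];
    struct completion_rec comp_q[QUEUE_DEPTH];
    _Atomic uint32_t send_head, send_tail;  /* user advances head, NIC advances tail */
    _Atomic uint32_t comp_head, comp_tail;  /* NIC advances head, user advances tail */
};

/* User-side enqueue: no system call, no kernel involvement.  The NIC,
 * busy-polling send_head, notices the new descriptor and performs the I/O. */
static int post_send(struct shared_queues *q, uint64_t buf, uint32_t len, uint32_t dest)
{
    uint32_t head = atomic_load(&q->send_head);
    if (head - atomic_load(&q->send_tail) == QUEUE_DEPTH)
        return -1;                           /* queue full */
    struct send_desc *d = &q->send_q[head % QUEUE_DEPTH];
    d->buf_addr = buf;
    d->length = len;
    d->dest_endpoint = dest;
    atomic_store(&q->send_head, head + 1);   /* publish: the NIC will see this */
    return 0;
}
```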
I actually was in the building when Werner
and Thorsten got this reprogrammed firmware running (as noted, both were at Cornell in
those days). Seen from up the hallway,
they did something astonishingly difficult: sometimes I would leave work late in the day, get a good night of sleep, and come back the next morning to find them still at it. For easily two weeks they worked nearly day and night to pull off that first success story. They used a NIC that wasn't really designed to be reprogrammed, and their first firmware
versions destroyed at least a few of the devices before they got things
running. Expensive bugs.
But they got it working, and gained blazing
speed: RPC over U-Net was amazingly
fast, letting you send a request and receive the reply in just a few tens of
microseconds (now down to as little as 3 or 4us, and likely to drop even
further). I've already pointed out that von Eicken’s SOSP paper was
focused on parallel supercomputing, and to get those numbers, part of the game
was that he coded his applications as a leader plus a set of dedicated workers,
waiting for messages from the leader, with absolutely
nothing else running on the machines.
But for this model, the paper made the case that special purpose
parallel computing hardware architectures could actually be outperformed by
commodity computers using modified commodity network devices. Most of today’s fastest machines use the
approach the paper sketched out.
As noted above, one legacy of RDMA’s
origins is that even today, RDMA is most widely used via a popular HPC
programming tool called the Message Passing Interface (MPI). MPI wraps IB in a friendly set of software
constructs, like a reliable multicast, shared memory data structures, etc. Internally, MPI talks to the hardware through
a layer called IB Verbs. Verbs is a very
low-level, driver-oriented API: it requires the MPI library code to pin memory
pages, “register” them with the device (this lets the device precompute the
physical addresses of the pages that will be used for I/O), and then, when
it comes time to do a send or receive, to construct I/O request
records that often include as many as 15 parameters.
The whole API is documented in much the
same style as for a hardware interface.
The community that works with this technology leaves it to the
individual vendors to document the libraries that implement the Verbs API. Some vendors offer additional proprietary
APIs, but in my limited experience, none is particularly easy to use.
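To give a flavor of what MPI (or any other Verbs client) has to do under the covers, here is a rough sketch of a single one-sided RDMA write using libibverbs calls. It leaves out nearly everything that makes real Verbs code painful -- device discovery, queue pair creation, connection establishment, completion-queue polling, error handling -- and simply assumes a protection domain and an already-connected queue pair, plus a remote address and rkey obtained out of band:

```c
/* A rough, stripped-down sketch of a Verbs-level RDMA write.  Assumes "pd"
 * is an allocated protection domain and "qp" is a queue pair that has
 * already been connected to the remote peer. */

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Step 1: pin and register the buffer so the NIC can precompute its
       physical addresses.  The returned memory region carries the keys. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    /* Step 2: build the scatter/gather entry and the work request.  This is
       where the "15 parameters" flavor of the API shows up. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* registered remote buffer */
    wr.wr.rdma.rkey        = rkey;                /* remote region's access key */

    /* Step 3: hand the request to the NIC; the completion shows up later on
       the completion queue associated with this QP. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Even this fragment hints at why MPI users are happy to let the library handle the Verbs layer on their behalf.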
So from this, it may sound as if you should
just use MPI on datacenter systems, or on your cluster up the hall. Would this make sense? It might, were it not for some MPI limitations
(I want to underscore that none of these are MPI defects: they really reflect
ways in which HPC computing differs from general-purpose computing).
In a nutshell:
- MPI is heavily optimized for so-called “bare metal” settings (no virtualization). However, I do want to repeat something from my RDMA intro: the RDMA NIC itself is now starting to evolve to understand that host memory actually might be virtualized and paged. So while this has been an issue until recently (I'm writing this in 2017), the story is shifting. The future version of this caveat might be more along the lines of "keep in mind that with virtualization and especially with page faults, RDMA obviously might be slower than when the target is a memory-mapped DRAM region."
- MPI can be used with RDMA over RoCE, but in that mode is known to run much slower than when it runs on Infiniband. As of 2017, at least. The reason this matters is that a datacenter, cloud or cluster probably runs Ethernet, and nobody is going to listen if you try to talk the IT people into switching over to Infiniband. So today, if you run MPI in a shared datacenter on RDMA, you’ll be working with one hand tied behind your back, and it may have issues that aren’t easily resolved, since the great majority of MPI experience is on Infiniband. Thus even early cloud-computing HPC offerings turn out to use Infiniband in the associated clusters (and more than just that: they also schedule the tasks to co-locate them close to one another, minimizing latency; launch them simultaneously, minimizing startup skew; and are smart about how memory is allocated, to avoid NUMA effects). So it isn't a trivial thing to just imagine that MPI could run on multi-tenant RoCE systems with all the complexities that doing so would introduce. It won't happen overnight, and maybe it won't ever really work as well as MPI works on an HPC cluster today.
- MPI is gang-scheduled. What this means is that when you code an MPI application, it will be designed to always have N program instances running at the same time, and as noted, all started simultaneously. The MPI system requires that all of them be launched when the program starts: you can’t add programs while your application is running, or migrate them to different nodes, or keep running if one fails, and it won't start any of them executing until all the replicas are ready to roll. Now, these clones will all start up in an identical state, but they won’t have identical roles: the way MPI works is to designate one clone as the head node or leader; the others are configured as its workers (there is a small sketch of this pattern just after this list). The leader tells the workers what to do, and they act at its direction. Of course this can send them off doing long computations and talking to one another for extended periods without the leader getting involved, but the basic model always puts the leader in a special privileged role.
- This matters because it is not obvious how a gang-scheduled model could be used for building an elastic cloud computing service. So you can definitely have a cloud service (like a weather app) that uses an HPC system as a source of data (like a massively parallel weather simulation). But you would not want to think of a cloud system that has MPI programs directly serving web requests because, basically, that makes no sense: it ignores the MPI model. Cloud companies are beginning to offer compute-cloud products, like Azure Compute from Microsoft, and these really do use the HPC model, complete with MPI, RDMA on Infiniband, etc. But they configure these products as HPC clusters that are easily accessible within the cloud – they don’t try to pretend that the HPC cluster “is” the cloud in the sense of receiving browser requests directly from clients.
- MPI is not fault-tolerant or dynamic: the set of N member programs is static. In contrast, in general datacenter/cloud settings, applications need to be elastic (they need to add members or scale back dynamically, responding to varying levels of external demand). This is obviously true for the outer (sometimes called stateless) tier of the cloud, but also true for services that the cloud uses.
- MPI is often integrated with the HPC scheduler, which selects the leader in a topology-aware way. As a result, MPI use of RDMA benefits from a compute model with little contention for resources such as top-of-rack routers, and MPI can launch parallel tasks with minimal skew.
- Based on what I see on the web, it seems as if MPI startup is slow: seconds to minutes (a big delay is just the time needed to load the associated programs, data files, and perhaps entire VM images on the various nodes your job will run). Since MPI applications often run for many minutes or even hours, this is probably not a major issue in HPC settings. The issue may have as much to do with HPC cluster data transfer protocols and the way that HPC schedulers do their scheduling as with MPI itself: MPI is gang-scheduled, meaning that the leader tells the workers what to do, so until all N nodes are up and running, MPI can’t start computing.
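Here is the gang-scheduled leader/worker pattern mentioned above, in its most minimal form. All N copies of the program are launched together (for example with mpirun -np N), rank 0 acts as the leader and the rest as workers; the "work" itself is just a placeholder:

```c
/* Minimal MPI leader/worker sketch.  All N instances start simultaneously;
 * rank 0 is the leader, every other rank waits for its direction. */

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Leader: hand each worker a task id, then collect results. */
        for (int w = 1; w < nprocs; w++)
            MPI_Send(&w, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w < nprocs; w++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("worker %d returned %d\n", w, result);
        }
    } else {
        /* Worker: wait for direction from the leader, compute, reply. */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task * task;   /* stand-in for the real computation */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```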
So would it make
sense to use MPI in general settings?
Maybe, for some purposes, but probably not for the most obvious use
cases. For example, say you wanted to
build a utility to make massive numbers of replicas of a file (an easy,
obvious use case). With MPI you would
need a way to launch the thing on the nodes where the replicas will go (not
a hard thing, but MPI won’t help with that step).
Startup might be slow. Once the file is read in on the master node, the
replication step would be fast, but perhaps not as fast as MPI on an HPC
cluster, since your cloud setting would expose MPI to running on RoCE, might
have virtualization or multitenancy, and definitely would experience irregular
scheduling delays. If something were to
crash, MPI would shut down the whole application.
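Just to make the thought experiment concrete, here is roughly what the replication step might look like if you did build it over MPI -- a minimal sketch that assumes the file fits comfortably in memory (and under the 2 GB limit implied by MPI's int counts), and that mpirun has already launched one copy of the program on every node that should hold a replica:

```c
/* Minimal sketch of a file-replication utility over MPI.  Rank 0 reads the
 * source file and broadcasts it; every other rank writes a local copy. */

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    long size = 0;
    char *data = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* leader reads the source file */
        FILE *f = fopen(argv[1], "rb");
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fseek(f, 0, SEEK_SET);
        data = malloc(size);
        fread(data, 1, size, f);
        fclose(f);
    }

    /* Everyone learns the size, then receives the bytes. */
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        data = malloc(size);
    MPI_Bcast(data, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank != 0) {                       /* workers write their replicas */
        FILE *f = fopen(argv[1], "wb");
        fwrite(data, 1, size, f);
        fclose(f);
    }

    free(data);
    MPI_Finalize();
    return 0;
}
```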
Add all of this up,
and there are good reasons to doubt that using MPI this way really makes
sense. Cornell’s Derecho steps into
that gap. So we can’t trivially use MPI
on datacenter systems, but this doesn’t mean that we should give up on the
potential of RDMA in the datacenter.