A Few Thoughts on Distributed Computing: RDMA[4]: Is my [fill-in-your-OS-here] system RDMA-ready?

So maybe I have you convinced and you are ready to rush out and use RDMA on AWS or Microsoft Azure or Google Cloud? And considering that RDMA was actually co-invented by the CTO of Amazon, for sure the AWS version of RDMA is the best of breed, right?

Well, it isn’t so simple…

Not so fast…

Before you rush out to build an RDMA solution, here are some things to be aware of:

There aren’t a lot of vendors yet, so prices for RDMA switches and routers are quite high compared to the commodity, volume pricing we see for Ethernet devices, where dozens if not hundreds of vendors offer a huge variety of products.
RDMA doesn’t work over routed WAN networks, or wireless, although someday it will. My personal bet is that SoftRoCE will work quite well over WAN UDP, but that the key to it will center on coding: a trick used in the Cornell Smoke and Mirrors File System that seems perfectly matched to the SoftRoCE WAN use case. But at any rate, this isn't there yet.
AWS and Azure and Google Cloud are starting to use RDMA internally to implement trusted platform services, but none of them exposes the RDMA API to third-party customers. So your cloud vendor is probably using RDMA, carefully, but even so, you won’t have a way to access RDMA even if the hardware supports it! This is because of a lack of confidence that RDMA really is ready for the full chaos of multi-tenant systems running at maximum load.
Linux and Windows will probably both need to evolve. For RDMA to work well, the pages used for RDMA should be pinned into memory (not subject to being paged in and out, although there has been work on getting the NIC to understand paging and to patiently wait for a page to be pulled in when a page-fault occurs), and registered with the hardware NIC (although again, there has been work on eliminating this step). Furthermore, if an application is running a thread on core k, that thread will get much better performance when touching memory associated with core k on the server. Thus to leverage the full speed of RDMA it is important to allocate a page of memory associated with your current core, fill it fast, and then do the same on the receive side. With multitenancy on the cloud (many users sharing one server), the pool of pinned, registered pages would be managed across the set of active processes or tasks. But the good news is: these are pretty simple things to implement.
Right now the RDMA programming API is dreadful, very low level. You will end up doing a mix of socket programming and RDMA programming, with standard Linux sockets running over IP used to negotiate the setup of your RDMA qpairs and so forth. In effect, RDMA ends up looking like a bypass mode for an application that uses IP for its initialization. Solution? Use Derecho, don’t use RDMA directly.
Cloud vendors go to a lot of trouble to isolate different users so that when they share one machine, they won’t bang into one another. But this guarantee wouldn’t currently extend to RDMA: the different users are talking to the same NIC, so if my program sets up a single transfer of 10GB, your program may wait while the NIC sends it. Fair bandwidth sharing for RDMA NICs on multi-tenant machines is an open research question. But there is no reason that RDMA on RoCE with DCQCN can't ultimately solve all of these problems.
Your RDMA code probably won’t port between Linux and Windows. You can build RDMA Linux code on Verbs, but the Windows library that corresponds to Verbs (something called Network Direct) isn’t identical to Verbs and porting from one to the other is quite hard. Moreover, Network Direct is really more of a specification than a piece of software. The actual Network Direct implementation needs to come from the RDMA vendor, and companies haven’t treated end-user direct access to their Network Direct modules as a priority. So for some vendors the module may not even exist, or if it does, it may not be well documented. I myself have given up on Network Direct, after trying to get it to work for three months. It definitely does work for people internal to Microsoft, but for me as an outsider, this particular technology has been impossibly frustrating. My bet is that we'll see Verbs (horrible though it can be) spread to become the universal standard. Maybe there will be a Verbs v2 that cleans things up, eventually?
RDMA is asynchronous, with a completion mechanism that notifies the sender or receiver when transfers finish (and optionally will deliver an interrupt). On Linux this isn’t the most natural model because Linux user-level interrupts are modelled after signals, and Linux can lose a signal if two occur at the same time. The upshot is that many developers end up pinning a thread that loops forever, watching for completions. But this can be quite expensive (cores aren’t cheap). Of course your polling thread could sleep now and then, but Linux sleep has a minimum delay of 1ms and once a thread sleeps, it may take far more than 1ms to wake up. Switching between polling and interrupts is complex. So there is an issue of load for RDMA applications. In Derecho we use a hybrid scheme and as the user, you don't see this complexity, but had we not adopted it, those polling loads would be a real problem.
It isn’t totally clear whether you should use reliable bound qpairs (an RDMA mode a bit like TCP, but where the receiver should ideally post a receive buffer before the sender posts its send, or at least, do so concurrently with the send), or use unreliable datagrams (the recent FaSST paper claims that these are more reliable than one would expect, but FaSST was evaluated in a setting very different from a large multitenant datacenter with a complex TOR load). Worst of all is the RDMA multicast mode. I can confidently warn you not to bother with this one. (But don’t despair! See remarks about Derecho, later).
Your friends may tell you that full-bisection-bandwidth is coming soon at the TOR. Your friends are wrong: the TOR layer seems fairly far from offering full-bisection routing except in specialized settings, like HPC clusters. But in our experiments with an overloaded TOR in Derecho, we actually found that our protocols share bandwidth very fairly. So I think this is less of an issue than it is sometimes portrayed as being. Maybe the more interesting puzzle is that cloud vendors want to hide the network topology from you, to prevent certain kinds of cheating: people who know the network layout and the locations of their cloud tasks can game the scheduler or even mount security attacks. Yet optimal use of RDMA definitely involves knowing topology, and vendors like Mellanox and Intel have good tools for exposing topology. To me the answer is this: a trusted layer like Derecho perhaps should shape itself to understand topology, and should be given access to the secret topology maps. Meanwhile, less trusted applications using Derecho would not gain access to this information. But we will have to see how it all plays out.
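The 10GB example from the item on multi-tenant NIC sharing can be turned into a small back-of-the-envelope model. This is a toy, not a description of how any real NIC schedules work: it assumes a single link of fixed speed, compares sending each transfer whole in arrival order against interleaving fixed-size chunks round-robin, and ignores every protocol overhead. The function names, the 100 Gb/s link speed, and the 1 MB chunk size are all assumptions made for illustration.

```python
def fifo_finish_times(transfers, link_bytes_per_s):
    """Each transfer is (name, size_bytes) and is sent whole, in arrival order."""
    now = 0.0
    done = {}
    for name, size in transfers:
        now += size / link_bytes_per_s
        done[name] = now
    return done


def round_robin_finish_times(transfers, link_bytes_per_s, chunk_bytes):
    """Interleave the transfers, sending at most chunk_bytes per turn."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    remaining = dict(transfers)  # name -> bytes still to send
    now = 0.0
    done = {}
    while remaining:
        for name in list(remaining):
            sent = min(chunk_bytes, remaining[name])
            now += sent / link_bytes_per_s
            remaining[name] -= sent
            if remaining[name] == 0:
                done[name] = now
                del remaining[name]
    return done


if __name__ == "__main__":
    link = 12.5e9  # assumed 100 Gb/s link, expressed in bytes per second
    jobs = [("my_10GB_transfer", 10 * 10**9), ("your_1MB_message", 10**6)]
    print("fifo:       ", fifo_finish_times(jobs, link))
    print("round robin:", round_robin_finish_times(jobs, link, chunk_bytes=10**6))
```

In this model the small message waits behind the whole large transfer under FIFO, while under round-robin it finishes after only a couple of chunk times and the large transfer finishes at essentially the same moment as before. Whether real hardware can offer that kind of sharing across tenants is exactly the open question.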
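The polling-versus-interrupts tradeoff from the item on completions can also be sketched in a few lines. This is not Derecho's actual scheme, just one plausible shape for a hybrid: spin for a short window, since a completion that is about to arrive is cheapest to catch by polling, then fall back to a blocking wait so the core is not burned forever. A `queue.Queue` stands in for the completion queue here; in Verbs terms the spin phase would correspond roughly to calling `ibv_poll_cq` in a loop and the blocking phase to waiting on a completion channel. The function name and the default timings are assumptions.

```python
import queue
import threading
import time


def wait_for_completion(completions, spin_seconds=0.0005, block_timeout=1.0):
    """Return (item, mode), where mode says how the completion was found.

    Spins on the queue for spin_seconds, then blocks for up to block_timeout.
    Raises queue.Empty if nothing arrives before the blocking wait times out.
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        try:
            return completions.get_nowait(), "polled"
        except queue.Empty:
            pass  # busy-wait: lowest latency, but it occupies the core
    # Nothing showed up during the spin window, so give the core back.
    return completions.get(timeout=block_timeout), "blocked"


if __name__ == "__main__":
    cq = queue.Queue()

    # A completion that is already there is picked up by polling.
    cq.put("completion-1")
    print(wait_for_completion(cq))

    # A completion that arrives late is picked up by the blocking wait.
    threading.Timer(0.05, cq.put, args=("completion-2",)).start()
    print(wait_for_completion(cq, spin_seconds=0.001))
```

The interesting tuning question is the length of the spin window: too short and every completion pays the wake-up cost, too long and you are back to dedicating a core to polling.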


Tuesday, 6 December 2016
