A Few Thoughts on Distributed Computing: RDMA[2]: Cutting through the noise, what does RDMA actually do?

At first glance, RDMA looks like a hardware implementation of the IP protocol suite:

We are all familiar with the TCP sockets model. In fact, you'll use TCP sockets to bootstrap your RDMA application, unless you program with MPI or Derecho, which do that stuff for you. This said, there are definitely some pure RDMA connection management tools, including very clean libraries that make no use of TCP at all. The issue is just that they aren't very integrated into the software libraries we are all familiar with, so we tend not to use them just because we are in the "habit" of doing things the usual way. Anyhow, startup is a minor step.
RDMA builds hardware managed data-passing connections. To set them up, both endpoints need to recite a little magic incantation. A common approach is to just have a TCP connection between the programs, and then to use that to exchange the data needed to create the RDMA connection. You can then close the TCP connection, although many applications just leave them open: sometimes it is helpful to have them handy, for example if you might close and later reopen an RDMA connection.
We call the RDMA paths “connected qpairs” because the endpoints are accessed via a pair of queues (rings) mapped into the end-user address space and manipulated by lock-free put/remove operations. On each endpoint, one of the queues is for asynchronous sends/receives, the other is for completions. There is a way to use RDMA endpoints without connections too (called unreliable RDMA datagrams, or RDMA UD), but I’m not convinced that the RDMA UD model would work well at genuinely large scale, at least for the kinds of use cases we see in Derecho (some people are arguing that UD does have a natural fit with key-value storage at large scale). We have very strong evidence from the HPC community that the connected form of RDMA can run at massive scale, at least on IB networks.
The model is to start an I/O operation and then do other stuff, and then later check for completion of your request (even if you don’t care about the outcome, the request buffer itself will need to be recycled, and you may also need to free up the memory used in the send operation for reuse). To monitor for completion you there are two main options: you can either “pin” a thread to poll for completions, looping non-stop to check for new ones, or use interrupts. The issue is that looping will notice completions (more or less) instantly, yet use up an entire core, which is a huge cost to pay on a server that could be putting that core to work doing other kinds of work. But interrupts have a high cost: probably 10,000 to 25,000 CPU cycles, which would be about 4us. Now, 4us may not sound like much, but RDMA transfers occur so quickly that you could easily end up taking continuous interrupts, which would really slow your computer down and might turn out to be even more expensive that using up a core. (There is an RDMA feature where a sequence of operations is queued up… they all individually get moved to the completion queue, but you only get an interrupt once for the whole batch, rather than one by one; this can reduce interrupt costs). Fancier software systems switch back and forth between modes: they poll while doing stuff for which ultra-low latency is of vital importance, but switch to interrupts when nothing urgent is happening.
One more option: After setting up a connected qpair, the endpoints can define memory windows and share these with their remote peer; if you do this, the remote peer can directly read or write memory within the window. These operations occur in a lock-free manner with no local CPU involvement. Thus in this model, no completions occur at all on the remote side, although the initiating side would still see a completion event.
RDMA transfers over connected qpairs are reliable: the hardware handles flow control, and data is delivered in the order sent, without loss or corruption. Should a problem occur, the completion will report it. But don’t use your intuition from IP here: in the Internet, packets are often dropped for various reasons, whereas RDMA is incredibly reliable, similar to the backplane of your computer. At the layer where RDMA is running, errors occur only if the hardware is starting to break, or if an endpoint crashes and resets its connections. Indeed, the NIC-to-NIC protocol uses positive acks for flow control, but lacks nacks or retransmissions: why bother to check, given that they just aren’t needed?
As mentioned earlier, there are also unreliable RDMA operations: datagrams (like UDP) and even an unreliable RDMA multicast. There is much debate about how frequently loss occurs with RDMA UD mode, but it certainly can occur and would be undetected, so if you decide to use this mode, you also need to add your own end-to-end acknowledgements, negative acknowledgements and retransmission logic (plus flow control, etc). I commented on the welcome page that these kinds of issues have long been a problem over Internet UDP and UDP multicast, which is why I myself am not a big believer in UD. But read about the FaSST system from CMU to get a deeper picture on this question (keep in mind that FaSST is a DHT and has mostly small RDMA transfers: key-value pairs, but with small keys and small values; the conclusions the FaSST team reaches wouldn’t necessarily apply in other situations, and in fact they don’t even seem to apply to other RDMA DHTs like FaRM, which run at a much larger scale than FaSST).
There is also a pure-software implementation of RDMA, called SoftRoCE. I'll discuss this more later, but as of this writing (February 2017) it was still under development and a little buggy. In fact my understanding is that the main team adding it to Linux had a working version but felt that it wasn't designed in the ideal way, and ultimately did a full rewrite, which was being debugged in a beta form as I wrote this blog entry. Apparently it was mostly stable, but not entirely so, but the big worry in early 2017 is that it was strangely slow. Once stable and running at the proper speeds, this may be a far faster data path than normal datacenter TCP, for a number of reasons. Primarily, it eliminates all copying, whereas TCP in a datacenter normally has to copy all the data into the kernel, and does the actual network IO in the kernel. So TCP has a lot of copying, and actually a lot of packet fragmentation and reassembly too. In theory, SoftRoCE can eliminate all of that. I've her conflicting stories about whether SoftRoCE will work well in WAN settings. We may just have to wait and see, but in fact I don't see any obvious reasons it can't be done. So sooner or later, I would guess that we'll have this capability, which is very interesting from a software scalability perspective. (The core technical issue is that even with very high speed WAN links, running at 40Gbps or 100Gbps, there can be UDP packet loss, and since SoftRoCE is like RoCE and designed not to do packet retransmissions, those loses cause connections to break. But there are coding tricks that can let a remote endpoint recover a lost UDP packet, such as the idea used in the Smoke and Mirrors File System, which is a bit like a RAID system but running on a WAN. So I'm thinking that the SMFS idea, plus SoftRoCE as implemented for a datacenter platform, on a "sufficiently" reliable WAN link running at 100Gbps... it just might work).

As you can see from this summary, the RDMA hardware API is surprisingly broad. My research group has tended to use it mostly in the connected qpair mode, where we work with block send/receive operations in which both endpoints post requests and the receiver takes interrupts, or in the one-sided write model, where the receive side opens a memory window and the writer just writes directly within it, without any locking.

This raises a point: RDMA is lock-free, although the hardware does offer certain kinds of synchronized I/O. The model is a bit like lock-free programming in a language like C++ with concurrent threads, and it takes some getting used to.

An RDMA NIC talks directly to the host memory unit, and it does this in chunks of memory that fit within cache-lines. This results in guaranteed cache line atomicity, meaning that data will be read or written atomically provided that it fits entirely within whatever the size of the cache line happens to be – for example, 64 bytes on an Intel or ARM platform. Obviously, most variables we work with in languages like C++ do fit in a cache line, provided that the compiler aligns them properly, which is normally the case. Earlier, I noted that some vendors also support a remote compare and swap operation, and an atomic increment operation. Those are cache-line atomic as well. On the other hand, one thing you can’t do is to increment or decrement a semaphore remotely, or notify to a condition variable (as of this writing, in 2017).

Thus if you have a thread on machine A and you want it to wait for an RDMA event that machine B will initiate, the best option is to either poll, or have B send an RDMA message (A would need to post a receive so that it can actually accept that message). Then the handler for the message could signal the semaphore. There is also a way for B to request that an interrupt occur on A, at which point the interrupt handler could signal the semaphore.

It is important to be aware that RDMA doesn’t do anything special to flush the CPU cache. If you are doing purely memory-oriented computing, you won’t need to worry about this: most modern architectures wire the cache to the MMU bus, and will either invalidate any modified cache entries or will refresh their contents. But the issue could arise if you get fancy and try to use RDMA into a memory region that will then be used to do a DMA transfer to SSD storage or 3D XPoint memory, or perhaps that is mapped into a display. The hardware for that other device might have its own cache, or some other stack that could have stale data within it. Thus when your RDMA incoming data arrives you could get a completion notification and be able to access that data programmatically, yet if you were to immediately display it onto a console (perhaps as part of a game, or for a kind of television application) there might sometimes be glitches: little chunks of stale memory that show up as discontinuities on the display, or that cause the SSD write to be incorrect. Right now, the best bet might be to do a reset to the device controller of that other device, but down the line, better barrier mechanisms are certainly going to become available.

Another way to notice inconsistency is to write a complex object that needs to be updated by modifying several fields or copying a longer chunk of memory. If operations span multiple cache lines, RDMA remains cache-line atomic for individual cache lines, but a reader could see some values from before the update and some from after, if those values live in different cache lines. However, there is still a guarantee that RDMA operations occur in the order they were posted, and also that any single RDMA transfer updates memory from low-order addresses to high-order ones. My group takes advantage of this by having a version number that sits above the data we are updating. So, first we update the data, and only then do we increment the version number. If a remote reader sees that the version number is larger, it can safely access the data, knowing that because the data is in lower memory addresses, it will see the whole object. With an additional acknowledgement field, the reader can signal when it is done. FaRM does something similar using a kind of timestamp, and is able to guarantee full transactional semantics with serializability.

I should perhaps explain that FaRM adds these timestamps on its own: it "annotates" your data structures with its added timestamps. Nothing is done magically here: a timestamp in this use is just an extra data field, the size of an integer, with various stuff packed into it by FaRMs update algorithm. So when you read that FaRM "extends" a cache line with timestamps, they mean that for each chunk of data the cache works with, they are taking over part of that space for this timestamp, and then your data lives in the space left over. Then FaRM automates the logic needed to use its timestamps in order to support its transactional model.

Finally, there are vendor-specific add-on features. The ones I like best are the Mellanox cross-channel synchronization options (part of a package they call Core Direct). This allows the developer to post a series of RDMA operations in which operation A will enable operation B. For example, A could receive a block of data, and then B could relay that same block to some other destination. Without cross-channel, it wouldn’t be safe to post B until after A completes, hence the very fastest A-B relay operation would be limited by the delay for the host software to see that A has completed and then to post B. With cross-channel, the NIC itself enables B when A completes, and no end-host software activity is required at all. Other vendors have proposed interesting alternatives to RDMA Verbs (none of which has really caught on), reliable RDMA multicast, and the list goes on.

In fact there are tons of new add-on features, at least from Mellanox. The company is adding all sorts of in-network tricks to support big data and machine learning applications, and we'll see more and more of the work done in the network, with less and less end-host software involvement. The trend is really exciting and my guess is that Mellanox is on a path to become another one of those giant technology companies: Google owned search, and Microsoft owned our desktops. And Mellanox seems to be positioned to own the network. But they have competition, so some other vendor could surge to take a lead: this is very competitive space.

Lets look at the specific case I described: the Mellanox cross-channel feature. If you can get access to this feature (a big if because Mellanox hasn’t made a huge effort to ensure that its user-level libraries, drivers and OS modules all will actually expose this feature, or to document its use), each node in an RDMA application can queue up a partially-ordered batch of requests. Requests on any single queue will be done in the order you post them, but you can also tell one request to wait until some other request on a different queue has completed (and this will mean that subsequent requests on its queue wait too, because of the FIFO ordering requirement).

Plus if you have a bunch of nodes in some application and they want to collaborate to do something fancy (like routing data down a tree), it should be easy to see how the partial ordering can extend across node boundaries: if your request A waits for me to send you the green chunk of data, and your request B will send that chunk to node X, then the node X transfer is really dependent on my send. So here we have X and Y and Z all coordinated.

Why is this so cool? Because it suggests that we can offload entire data movement protocols directly into the RDMA NIC (and there are some router functionalities emerging that are interesting too, like in-network data reduction). You can literally design complex protocols that push almost all the work into the NICs and then just wait for the whole thing to finish! No end-host work has to occur and no CPU involvement is needed, so potentially this will work like a charm even if the end nodes are very heavily loaded multi-tenant machines experiencing long scheduling delays.

So when I say that we will also see machine-learning displaced into the network, what I mean is that just as we can start to think of "compiling" protocols into the NIC, we can also think about compiling basic operations that arise at low levels in machine learning so that the NIC and switch will do much of the heavy lifting. The kinds of operations in question tend to be aggregation ones: binned sums (histograms), min and max, shuffle exchange.

But it will be a while before we see cloud computing systems using features like this. First, we need to see wide deployment of RDMA via RoCE (as noted, RDMA over IB is never going to be a big player in datacenter settings, but the bulk of today's machine learning systems live in cloud settings, not on HPC platforms). Next, vendors like Mellanox would need to actually promise to expose these features so that we can write protocols that can safely depend on them; right now, you could find an RDMA NIC from Mellanox that does support the features in hardware, yet your protocol would get exceptions if it tried to use them because the drivers for those very same devices might lack the needed functionality. Cloud vendors would be faced with a tough choice: be Mellanox-specific, or require their other RDMA suppliers to mimic Mellanox functionality, without overstepping the Mellanox patents. And this cuts in the other direction too: although Mellanox is currently the market leader, you could easily imagine that Intel might invent cool features of its own that could be equally amazing.

Right now, core-direct isn't actually working on the Connect 4X, as far as my students and I can tell (right now being early 2017). So this illustrates the issue: most likely, Mellanox just doesn’t have customers using the core-direct stuff yet, even though it was introduced years ago, and so doesn’t routinely test that their drivers and software libraries still support the needed functionality. And other vendors haven’t even offered this sort of feature, at all.

Just the same, there are a few vendors that offer IB products with at least the basics: Mellanox is the current market leader, but QLogic (an Intel division) is used in many HPC datacenters, and with its recent acquisition by Intel has become a company to keep one eye on. All the main vendors are thinking about ways to package the technology to be more appealing to the general non-HPC computing community. As we’ll see, those have failed, up to now. But they include ideas such as the Intel iWarp technology line (which was not successful), solutions that run TCP over RDMA, and in fact this is the main motivation for the various proposals to use layers other than Verbs as the RDMA API.

The main issue, then, is that RDMA has really been a purely IB technology. For all its speed, the most dominant use case has been HPC over MPI over IB. And datacenters just don’t use IB: they use Ethernet. But when RDMA over RoCE stabilizes, and it will very soon… watch out!

A Few Thoughts on Distributed Computing

Tuesday, 6 December 2016

RDMA[2]: Cutting through the noise, what does RDMA actually do?

1 comment: