So this is starting to work now and I thought I might share a quick summary of the experience! A big thanks to Weijia Song and Sagar Jha, who worked with Cornell ECE undergraduate Alex Katz to make this happen. It was, as it turned out, remarkably hard.
For those lacking context, let me start with a summary of why this topic is of interest; after that, readers in that group may want to tune out, because I'll go back to being highly technical. As you may know, RDMA is a technology that permits one computer in a data center or cluster (close-by, densely packed machines) to do direct memory-to-memory data transfers to another. RDMA comes in various modes: you can do messaging over RDMA (one machine sends, the other receives), in which case it behaves much like TCP, waiting to send until the receiver is ready, then transmitting losslessly. The big win is that the transfer runs at DMA speeds, which are extremely high: 100Gbps or more. In contrast, if you code a little TCP-based transfer on the identical computers, with the same network card, you tend to see a fifth of what RDMA can achieve, or less, because of all the copying (from user space into the kernel, then the kernel-to-kernel transfer over the network in IP packets, then copying back up on the receiving side). Copying takes time, and it ends up being 4/5ths of the total path. RDMA eliminates all of that, at least for large transfers.
For small transfers, people work with "one-sided" RDMA, in which the sender and receiver prenegotiate a connection but also share a memory region: B grants A permission to read or write within some chunk of B's local memory. A then basically ends up thinking of B's memory as mapped into its own address space, although the data transfer isn't as transparent as just reaching into mapped memory -- it has to occur via special read and write commands. Still, the model is one of shared memory, reliably updated or read remotely.
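To make the one-sided model concrete, here is a minimal sketch of the A-side write at the verbs level. It assumes the queue pair is already connected, and that B registered its buffer with remote-write permission (ibv_reg_mr with IBV_ACCESS_REMOTE_WRITE) and shipped the buffer's address and rkey to A out of band; the function name and error handling are illustrative, not from Derecho:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* A-side sketch: write `len` bytes from `local_buf` directly into the
 * remote region that B registered earlier.  `remote_addr` and
 * `remote_rkey` were learned over the out-of-band (TCP) channel. */
static int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      void *local_buf, uint32_t len,
                      uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* one-sided: B's CPU never sees it */
        .send_flags = IBV_SEND_SIGNALED, /* ask the NIC to report completion */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```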
RDMA originated on InfiniBand (IB) networks, which are popular in today's supercomputer systems. Most supercomputer communication runs over a library called the Message Passing Interface (MPI), via an implementation such as MVAPICH, and those implementations are highly tuned to leverage RDMA on IB. Our work on the Cornell Derecho system uses this same technology.
Now, continuing the back story: RDMA in this form dates to work done long ago at Cornell by Thorsten von Eicken (who went on to lead GoToMyPC.com) and Werner Vogels (now CTO at Amazon). RDMA took off in industry, and some think it killed off the older style of parallel supercomputing, replacing it with less costly clusters of commodity machines connected by RDMA-capable IB networks. Anyhow, right from the start the leaders pushed for a TCP-like model. In fact you actually start by creating a normal TCP connection, then put an RDMA connection in place "next to it", using TCP to pass control messages until the RDMA layer is up and running. RDMA itself involves a firmware program that runs in the NIC (these NICs are like little co-processors; in fact they often run a full Linux O/S and are coded in C or C++). The interaction between the end host and the NIC is through shared "queue" data structures, which always come in groups: there is a queue used for send/receive/read/write operations (called "verbs" for reasons lost in the mists of time), and then a separate completion queue where the NIC reports on operations that have finished.
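For the curious, here's roughly what that queue grouping looks like in code: the completion queue is created first, then the queue pair is bound to it at creation time. A hedged sketch, with arbitrary capacities and error handling mostly omitted:

```c
#include <infiniband/verbs.h>

/* Sketch: one completion queue shared by the send and receive sides of
 * a reliable-connected (TCP-like) queue pair. */
static struct ibv_qp *make_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 256 /* entries */, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,          /* NIC reports finished sends here   */
        .recv_cq = cq,          /* ...and finished receives here too */
        .qp_type = IBV_QPT_RC,  /* reliable connected mode           */
        .cap = {
            .max_send_wr  = 128,   /* how many sends may be queued    */
            .max_recv_wr  = 128,   /* how many receives may be posted */
            .max_send_sge = 4,     /* scatter-gather entries per op   */
            .max_recv_sge = 4,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```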
Verbs, the RDMA API, defines a huge number of specialization options. Data can be "inline" (carried inside the operation itself) or external (via a pointer and a length), or one can specify a vector of locations (scatter-gather). Individual operations can be selected for completion reporting. There is a way to multiplex and demultiplex more than one logical connection over a single RDMA connection, using "keys". There are "tags" that can carry an additional few bytes of information, such as object length data (useful in cases where a big transfer is broken down into little sub-transfers).
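Here is a sketch of how a few of those options combine: a two-element scatter-gather send, posted inline (which the NIC only permits for small payloads, up to the QP's max_inline_data) and flagged for completion reporting, followed by a spin on the completion queue. Names are illustrative:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: send a small header+body pair as one operation, then wait
 * for the NIC to report that it finished.  Error handling omitted. */
static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_mr *mr,
                         void *hdr, uint32_t hdr_len,
                         void *body, uint32_t body_len)
{
    struct ibv_sge sges[2] = {   /* scatter-gather: two source regions */
        { .addr = (uintptr_t)hdr,  .length = hdr_len,  .lkey = mr->lkey },
        { .addr = (uintptr_t)body, .length = body_len, .lkey = mr->lkey },
    };
    struct ibv_send_wr wr = {
        .wr_id      = 42,
        .sg_list    = sges,
        .num_sge    = 2,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED   /* we want a completion entry  */
                    | IBV_SEND_INLINE,    /* NIC copies data at post time */
    };
    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                 /* spin until the NIC reports */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```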
Then there are ways to use RDMA verbs for unreliable, datagram-like communication. Unlike the normal case, where the sender can be certain that messages got through, in order and reliably, with unreliable datagrams some form of end-to-end acknowledgement and retransmission would be needed in the application. This mode also includes an unreliable multicast feature.
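At the verbs level, the datagram flavor mostly changes the queue pair type (IBV_QPT_UD instead of IBV_QPT_RC) and makes each send name its destination explicitly. Another hedged sketch, with names illustrative:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: one unreliable-datagram send.  `ah` is an address handle from
 * ibv_create_ah(); the peer's QP number and qkey are learned out of band.
 * Ordering, loss detection and retransmission are now the application's
 * problem, as noted above. */
static int ud_send(struct ibv_qp *ud_qp, struct ibv_ah *ah,
                   uint32_t peer_qpn, uint32_t peer_qkey,
                   struct ibv_sge *payload)
{
    struct ibv_send_wr wr = {
        .sg_list    = payload,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.ud.ah          = ah;
    wr.wr.ud.remote_qpn  = peer_qpn;
    wr.wr.ud.remote_qkey = peer_qkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(ud_qp, &wr, &bad);
}
```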
RDMA is a troublesome technology in normal optical Ethernet clusters and data centers, and many customers just run it on a second, side-by-side IB network to avoid the hassle. But there is an RDMA-on-Ethernet standard called RoCE, and slowly the industry is learning to deploy it without destabilizing the normal TCP/IP traffic that the same network would presumably carry. RoCE hasn't been trivial to use in this way: one reads about storms of "priority pause frames", and there are at least two congestion-control schemes (DCQCN and TIMELY), both in active use but neither fully mature. Microsoft has just started to run RoCE+DCQCN on its big Azure clusters, although the company limits the use of RDMA to its own internal infrastructure tools. Google uses it too. In contrast, end-user applications on HPC systems depend on RDMA via MPI (Azure HPC permits this too). So there are ways to access RDMA on most platforms, but not directly. Direct use is limited to people who build new infrastructure tools, as we did with our Derecho work.
Against this backdrop, there is a question of who will win in the RoCE market. Intel seemed likely to become a big player when it purchased QLogic, and Mellanox, the dominant company on the HPC side, is the obvious incumbent, with maybe 80% of the market.
But RDMA has technical limits that worry some companies. One concern is the Verbs API itself, which is sort of weird, since it requires these helper TCP connections (you won't find this documented in any clear way, but I challenge you to use RDMA Verbs without having an n * n array of TCP connections handy... CMU seems to have done this in their HERD/FaSST work, but I'm not aware of any other project that pulled it off). A second is that every RDMA transfer actually occurs as a parallel transfer over some number of "lanes" of data. Each lane is a pair of single-bit connections, one for data travelling in each direction, and at the high end of the range, lanes typically run at speeds of 14Gbps. Thus when you read that the ConnectX-4 from Mellanox runs at 100Gbps, you can assume that it employs 7 or 8 lanes to do this. In fact the rates are quoted as one-way rates, so you should double them in the case of Derecho, which is aggressive about using connections in a fully bidirectional way. By scaling the number of lanes, the industry is now starting to deploy 200Gbps solutions, with 400Gbps within sight and 1Tbps expected by 2025 or so.
Now, at these rates it isn't efficient to send a single bit at a time, so instead, RDMA devices normally send frames of data, generally 84 bytes (this is the frame size for 100GE). Thus if you have 8 lanes, a single RDMA transfer might actually carry 512 bytes, with 1024 or more likely by the time the technology reaches the higher rates just mentioned.
The issue is that not every application wants to move objects in such large blocks. After all, with direct RDMA writes, a typical operation might access just a few bytes or a single word: perhaps 32 or 64 bits. That kind of access pattern will be incredibly inefficient over RDMA.
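A back-of-the-envelope calculation, using the figures above, shows both the lane count and the waste:

```latex
% lanes needed for 100Gbps at 14Gbps per lane:
\left\lceil \frac{100\,\text{Gbps}}{14\,\text{Gbps/lane}} \right\rceil = 8 \text{ lanes}

% fraction of a 512-byte minimum transfer used by a 64-bit (8-byte) write:
\text{efficiency} = \frac{8\ \text{useful bytes}}{512\ \text{bytes on the wire}} \approx 1.6\%
```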
In Derecho we work around this by designing the system around a style that we think of as "steady flow": we want a stream of updates to transit any given path, never pausing even briefly, because that gives us ways to aggregate traffic. You can read about how we do this in our paper (it was under review at TOCS, but we are now revising it for resubmission, so the version I pointed to is going to change).
Intel has led a group of vendors arguing that this entire direction is ultimately unworkable: we will get to 1Tbps RDMA this way (and Derecho will let you access that speed), but most applications will find that they get a tiny fraction of the full performance, because their transfer patterns send entire RDMA frames just to read or write a byte or two at a time. In such cases the majority of each RDMA transfer is simply wasted: the NICs are exchanging perhaps 8 or 16 bytes of useful information, yet every transfer is of full size, 512 or 1024 bytes.
One answer is for everyone to use Derecho. But LibFabrics, which is Intel's story, aims at a different concept: Intel wants more flexibility in how the hardware multiplexes large numbers of concurrent remote I/O operations, so that these huge frames can carry much more useful traffic.
At the same time, LibFabrics breaks with the RDMA Verbs model and tries to mimic the normal Linux sockets API more explicitly. One can love or hate sockets, but at least they are a widely adopted standard that all of us know how to work with. Lots of software runs through TCP or UDP over sockets, hence all of that software can (with some effort) also run on LibFabrics. Yet RDMA is still available as a LibFabrics mapping: there is a way to run directly on Mellanox hardware, and in theory it maps to the identical RDMA Verbs we would have used in the first place.
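To give the flavor, here's a minimal sketch of opening a LibFabrics endpoint. The provider string is what selects the mapping: "verbs" asks for the RDMA path, "sockets" for the kernel TCP stack. The version number and attribute choices are illustrative, and error checks are omitted:

```c
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Sketch: discover a provider and open a connected message endpoint. */
static struct fid_ep *open_ep(const char *prov /* "verbs" or "sockets" */)
{
    struct fi_info *hints = fi_allocinfo();
    hints->caps                   = FI_MSG;      /* send/recv messaging */
    hints->ep_attr->type          = FI_EP_MSG;   /* reliable, connected */
    hints->fabric_attr->prov_name = strdup(prov);

    struct fi_info *info = NULL;
    fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);

    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    struct fid_ep     *ep     = NULL;
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    fi_freeinfo(hints);
    return ep;   /* still needs binding to completion queues, etc. */
}
```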
So with this in mind, we bit the bullet and ported Derecho to LibFabrics. Our main software distribution will now actually offer both options, and we'll use some form of configuration manager (we're thinking about Getopt or its sibling, the slightly fancier GetPot). Then you'll be able to tell us which NIC to select, which mode to run in, and so forth.
Derecho is looking pretty good on LibFabrics right now. We're seeing identical performance for RDMA versus LibFabrics (on the identical NIC and network, obviously), provided that the transfer size is large. For small operations, LibFabrics seems slower. This is clearly at odds with a direct mapping to RDMA Verbs (otherwise we would have seen identical performance in that case too), so we'll need to understand the root cause more deeply. We've also tested Derecho over LibFabrics mapped to TCP, still on the same network but using the kernel stack, and it seems to be about 4x slower than with RDMA, again in the large-objects case. Derecho over TCP with small objects is very slow, so there is definitely a serious puzzle there.
One very cool option will be to look at Derecho on WAN links over TCP. RDMA can't run on a WAN network, but we definitely can run LibFabrics this way.
Summary: this is work in progress, but the progress is very nice, even if some mysteries remain to be resolved. Watch this blog for updates! I'm sure we'll have lots of them in the months to come.
(Added May 30): Here's data from an experiment run by Weijia Song, Alex Katz and Sagar Jha for a 3-member Derecho shard or group, first with 1 sender and all three members receiving and then with all sending and all receiving, using our atomic multicast configuration of ordered_send (compare with Zookeeper's ZAB protocol, without checkpointing).
What you see here is, first, that LibFabrics didn't introduce overhead when we map down to RDMA. To see this, compare Derecho on RoCE and IB against the Derecho baseline (which used Verbs directly): the two left data sets versus the one on the right. The axis is in gigabytes per second, and the color coding indicates the message size: 100MB, 1MB and 10KB.
Next, look at the middle three cases. These were ones in which we used the same NICs but told LibFabrics not to use RDMA, and instead to run over the native TCP support for those NICs (TCP works perfectly well on these NICs, but doesn't leverage RDMA). Of course performance drops, by roughly the 4x factor mentioned above, but what is impressive is that if you were to go to our Derecho paper and compare this with Zookeeper or LibPaxos, we are still close to 100x faster than those libraries! And that would be a true apples-to-apples comparison, since ZAB and LibPaxos run on TCP or UDP, not RDMA; when we compared Derecho against them previously, it was a bit unfair to them -- they had no way to leverage RDMA.
Why is Derecho so much faster than classic Paxos? Clearly the world has already reached data rates at which streaming data and doing receiver-side batching is far preferable to running a classic two-phase exchange for each action. Nobody had noticed this, but Derecho happens to land in this new regime, because our development methodology led us to that formulation of the protocols. And here we can see that it pays off even on a TCP infrastructure!
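To illustrate the contrast, here is a toy sketch (emphatically not Derecho's actual protocol, and all names here are made up): in a per-action two-phase pattern, every update costs a round of control messages, whereas a streaming receiver can deliver whatever has arrived and then acknowledge the whole batch with a single cumulative report:

```c
#include <stdint.h>
#include <stdio.h>

#define BATCH 64   /* arbitrary batch size for the illustration */

static uint64_t stable = 0;   /* highest contiguous seqno acknowledged */

/* Toy receiver step: deliver every update up to `highest_received`,
 * then emit one cumulative acknowledgement covering the whole batch. */
static void receiver_step(uint64_t highest_received)
{
    for (uint64_t s = stable + 1; s <= highest_received; s++)
        /* deliver update s to the application */ ;
    stable = highest_received;   /* one control message acks them all */
    printf("ack through %llu\n", (unsigned long long)stable);
}

int main(void)
{
    /* 4 control messages acknowledge 256 updates; a per-update
     * two-phase exchange would have needed hundreds of round trips. */
    for (uint64_t hi = BATCH; hi <= 4 * BATCH; hi += BATCH)
        receiver_step(hi);
    return 0;
}
```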
Sorry for all the exclamation points, but after something like 35 years of working on this kind of thing, I'm thrilled to see these numbers and to realize what a great opportunity this represents.
Comments:

Fantastic...
We're pretty excited about this. It was quite a pain to pull off. All the APIs look sort of similar, yet none maps one-to-one to the Verbs APIs, so each time there seemed to be a simple next step, it would blow up as soon as the guys tried it. But at this point we are up and running, except for the puzzle of very small transfers seeming abnormally slow. Could be something we still need to understand about LibFabrics lurking in that case (I find it hard to imagine that Intel wouldn't have optimized the heck out of their small-message case).
Aha! Weijia and Sagar tracked down the slow performance issue for small objects. Looks like we will be 4x slower across the board (which is pretty amazing, because it means we will actually still be 100x faster than comparable TCP/UDP-based libraries like Zookeeper or LibPaxos, in a pure apples-to-apples comparison).
I've updated the blog posting with actual data (two graphs) and some discussion, at the bottom.