Tuesday, 6 December 2016

RDMA[3]: The stubborn persistence of Ethernet


RDMA on Infiniband versus RDMA on Fast Ethernet (RoCE)

A few high-level takeaways:
  • RDMA hasn’t worked well on Ethernet up to now; it really has been an Infiniband technology.
  • Datacenter computing systems are deeply invested in optical Ethernet and this isn’t likely to change.  Nor would their owners consent to double-wiring (e.g. to IB side by side with Ethernet).
  • RoCE is the RDMA standard that runs on Ethernet.  RoCE v1 was non-routed, so you could only use it in a rack at a time, and it worked perfectly well at that scale.  Indistinguishable from IB.
  • But RoCE v2, which supports routing and uses priority pause frames (PPF) for flow control, is another matter.  There have been many reports of instability, PPF storms, difficulty tuning the routers and NICs.  Ad-hoc ways of generating pause frames are often unstable.  If you talk to the RDMA vendors, they insist that these issues should not occur, but my students have run into such problems on a few platforms, and when I talk to cloud computing vendors at conferences, they tell me they've had major issues with RoCE v2 PPF as well.  So I'm skeptical that this is really a viable option.
  • There is a major breakthrough in this space: Data Center Quantum Congestion Notification (DCQCN).  It controls the generation of pause frames in a way that manages rates on a per-stream basis and seems to work quite well.  RoCE v2 + DCQCN looks like a winning combination.   Interestingly, DCQCN actually is very similar to RDMA on IB, which is one reason to believe that with DCQCN, we have a story that can perform and scale just as well as RDMA on IB.  On the other hand, multi-tenant datacenter operators need some things that HPC systems simply don't need, like virtualization both at the node level (with virtual memory, maybe paging, security, perhaps even strong security from Intel SGX or a similar model), and even at the network level (Enterprise VLAN models).  Then there may be issues of traffic shaping to support priorities (quality of service).  So even if RoCE v2 + DCQCN is rock solid, it isn't quite game over ... not yet. 
For those of us who remember the ancient past, Ethernet has been the technology that never gives up.  Every time the technology press predicted that Ethernet was about to be killed off by some superior wave of networking technology, Ethernet would battle back.  For example, twenty years ago, token ring networking was widely expected to displace Ethernet.  Didn’t happen.  Infiniband was supposed to kill Ethernet too.  Not even close.  The most recent player has been SDN.
In fact, Ethernet expanded enormously over the period we’ve been discussing, responding to a variety of datacenter communication needs that Infiniband more or less ignores.
Ethernet has been the winner (again and again) for the rich world of datacenter computing, for lots of reasons.  One is simply that the installed base already uses Ethernet and upgrading in-place is cheap and easy, whereas switching to something genuinely different can be very disruptive (in the pejorative sense).
A second point is that Ethernet works well in a world dominated by sharing: end-host multitenancy and communication within virtual local area networks (VLANs), creating the illusion of a private network for each customer account.  Cloud operators are finding that they can manage these VLANs in ways that provide fairness for customers, and are getting better and better at quality of service and priority support, so that gold customers really get gold service.  There is a growing push to secure VLANs with cryptographic keys, and the Ethernet vendors are out in front offering these kinds of solutions; RDMA vendors haven’t been quick to respond, perhaps because HPC systems aren’t shared and hence just don’t need such features.
There was a brief period when the style of routing used in switched and routed Ethernet seemed like a performance-limiting obstacle, and the press confidently predicted that routers would simply not be able to keep up.  Had this happened, Ethernet would have died. But guess what?  A technique called fast IP prefix lookup suddenly emerged; it permitted Ethernet routers to offer blazing speed fully competitive with circuit-switched routers, like the ones used by IB. 
And this in turn has enabled Ethernet to exactly track the optical link speeds used by IB: not so surprising because everyone uses the cutting edge optical-silicon chip sets. 
The big puzzle has always been flow control.    IB switches and routers use a credit scheme: link by link as a packet is routed through the network, each upstream device gives the downstream one permission to send a certain number of bytes.  Thus with reliable fabrics, no packet is ever lost (unless something crashes): in effect, when device A actually sends packet X to device B, space is reserved in B for the bytes.  IB is FIFO, so if there is no space in B, A can't forward.  Thus if client C is trying to send to device A, but A is waiting for credits to send some prior packet Y to B, the backlog pushes right to C, who waits.

Now, one big difference with Ethernet is that in a classic Ethernet, routers and switches never wait.  Instead, if someone lacks space, they drop packets.  Since RDMA is required to be reliable, this is a problem for RDMA: a genuine packet loss causes the RDMA connection to break -- no retransmissions ever occur.  Instead, RDMA on RoCE allows an element that is approaching its limits to send a pause request back to the sender: a PPF notification. The sender pauses, and this behavior repeats until the upstream device has more capacity. 

Theoretically, the PPF packet is acting much like a credit scheme.  In practice, the sending of PPF notifications seems to be a tricky problem in RoCE networks, and as noted earlier, people who really work with RoCE v2 are broadly skeptical that the vendors can actually get this to work.  DCQCN, mentioned in a prior posting, disables the normal PPF mechanisms and replaces them with a kind of end-to-end flow credit solution that in many ways mimics the IB approach.  This works, but it not yet widely deployed.

So why not dump Ethernet and use IB?  The issue is that it isn't just about RDMA. IB lacks all these other good features of Ethernet, and hence faces huge barriers to uptake in datacenters.  In some ways it comes down to a kind of inertia: Ethernet offers competitive speed, is more or less universally deployed, is far cheaper (the market for Ethernet routers is perhaps 100x larger than that for IB), and is supported by a deep set of tools that let cloud operators shape their offerings in ways that multitenancy really demands.  IB lacks those tools, so why would anyone want to move to IB?  

In side by side experiments that pit HPC on IB against HPC on RoCE v2 with the PPF algorithm working properly, a further issue arises: HPC systems actually turn out to run about 10% faster on IB than on fast Ethernet (more on this in a moment), so this does keep the HPC world fixated on IB. Yet the HPC world is a very small one compared to the general datacenter world.  So even in cloud offerings of HPC, we tend to see a rejection of RoCE for the HPC workloads, and a rejection of IB for the non-HPC workloads.

What stands out in this story is that the main barrier to adoption, and it’s really a major one, is that IB has not figured out how to deal with network virtualization or multitenancy: by and large, IB runs with non-virtualized endpoint computers: they host either one or several HPC applications, each with its own memory allocation and dedicated cores, each running on “bare metal”, by which we simply mean, without any use of virtual memory.   This is because HPC systems care about speed, and virtualization introduces delay for address translation, for paging, and can also have scheduling delay because of contention for the server cores.  Thus a typical IB deployment doesn’t worry much about virtualization, sharing, scheduling delays, and in fact the schedulers used to launch jobs on HPC machines even have tweaks built into them to optimize layout in ways that match perfectly with the way that MPI protocols use IB.  So you could probably make a case that IB and MPI have co-evolved, a bit like the way that datacenter TCP and Ethernet have co-evolved.  Each specializes in a different application.  Neither breaks into the other’s space, at least easily, because without a huge effort the last 10% or so of performance is so hard to snare. 
I mentioned TCP as a partner for Ethernet in part to get back on topic: RDMA is about moving bytes, and TCP is about moving bytes, and the end-user tends to depend on TCP today, for almost everything.  There has been an amazing amount of work on optimizing TCP to run closer to the raw network speed.  This used to be a lot easier, back in the days when the raw network was a lot slower than the CPU clock and costs like copying bytes from user space to kernel (to form IP packets) weren’t a big concern.  These days, copying is out of the question: a single core on a modern server can only copy at about 13.5Gb/s, whereas the network might be running at 100Gb/s (with 200Gb/s already on the drawing board and projections of 1TB/s by the end of the decade).  But to some degree the kernel TCP stack has adapted to this constraint.  For example, rather than copying data into the kernel, the kernel can build the packet header but use scatter gather to reach directly into user space for the packet body, and similarly on the receiver side.  With such tricks, datacenter TCP stacks can achieve respectable speeds.
On the other hand, the very fastest TCP solutions can be hard to use: the application may need to enable various special features to tell the TCP protocol about its data movement plans, and it may need to place data into memory in a particularly efficient layout and perhaps even send fixed large blocks (which seems natural, but keep in mind that by definition TCP is a byte stream protocol, not a block protocol, so in fact a block transition approach is not nearly as simple to set up as you might expect).  It becomes important to make sure that the receivers are waiting for incoming data before it is sent (again, this sounds obvious, but in fact standard TCP uses kernel buffering and has no such requirement at all: data is copied to the kernel, sent to the kernel of the receiving machine, then copied up to user space.  So ensuring that the receiver is already waiting, is receiving into memory blocks that are actually mapped into DRAM, and that the receiver itself hasn’t been swapped out – none of this is completely straightforward).  Some of the very fastest schemes also require using multiple TCP sessions side by side.
Thus most applications can easily use TCP to send data at perhaps 2Gbp/s on a modern datacenter network, but to get anywhere near 100Gbp/s is quite a different matter.  This requires more-or-less heroic effort by the application designer and increasingly complex TCP stacks in the kernel, all working in concert, and even then, TCP rarely gets anywhere close to 100Gbp/s.  In fact, there are all sorts of products aimed at helping the designer do this.
RoCE: We are not impressed (yet)
So with all this background, we’ve seen that RDMA is basically an IB technology and that Ethernet is basically dominant on the datacenter side.  But now we get to the good part: a series of developments, some dating back ten years or so, but others much more recent.  Broadly, I’m pointing to something called RDMA over Converged Ethernet, or RoCE (pronounced “rocky”).   The convergence term isn’t important here, so just think of this as modern fast Ethernet, same technology you are already using in your clusters and datacenters. 
The basic idea of RoCE is to shoehorn RDMA onto Ethernet so that tools like MPI could potentially run side by side with the incredibly diverse mix of things we see in the cloud.  RoCE offers the same RDMA verbs API, but routes the data over standard fast Ethernet. 
As I've already mentioned, there has been a lot of skepticism about RoCE.  The first generation of RoCE (v1) worked only in a single rack of nodes at a time, and couldn’t route through the top of rack (TOR) network.  Then RoCE v2 came out, but didn’t work very well: v2 routes through the TOR layer, but depends on a priority pause frames, intended to let the router push back and slow down a data flow that might overwhelm it.  The first RoCE v2 routers and NICs were pretty unstable and prone to sending excessive rates of pause frames, which can slow RoCE down to a crawl.
Even when perfectly configured (a black art!), IB seems to be about 10% faster than RoCE v2 if you put them side by side and expose them to the same load.  With contending traffic in a datacenter that has an oversubscribed TOR layer, MPI over RoCE can be drastically slower than MPI on IB in an HPC supercomputer, even if the servers use the same processor speeds and have similar amounts of DRAM.  This has been an obstacle to cloud HPC computing, and a boon to the massive scientific supercomputer centers: nobody will migrate applications like weather predictions to the cloud unless they will run at a performant level.
But the game has changed.
The first driver of change centers on new demand:  there has been a wave of exciting operating systems paper showing that RDMA could be extremely useful in settings remote from HPC.  In particular, the FaRM and Herd papers, from Microsoft Research and CMU, shook up the field by showing that clever use of RDMA could support a massive distributed shared memory model, and that software transactional memory ideas can be applied to make these DHTs strongly consistent as well as incredibly fast and scalable.  Small-scale replication (usually 3 replicas) provides fault-tolerance.  Then there were a series of papers showing that the operating system itself could be split into a more of a data-plane / control-plane model, facilitating programming with RDMA technologies.  Finally, very pragmatically, many studies have noted that datacenters are limited by the huge amount of copying they do over their networks: they replicate programs and virtual machine images, photos and videos, you name it.  All this copying is slow, and if RDMA could speed it up, the datacenter would be used more effectively.  Everyone wants a piece of that amazing speed. 

So this has created a wave of demand for RDMA, except not on IB, and not in a form where one user could easily disrupt some other user.  The demand is for a new style of RDMA, on Ethernet, and with fair traffic apportionment. For any developer who pays for his or her cloud resources, waiting for data to shuffle around at perhaps 250Mb/s is not really fun.  And with today’s big move towards machine learning and AI applications that use a MapReduce computing style, where data is shuffled at the end of each iteration, the time spent shuffling and reducing is a critical bottleneck. 
Another driver of change has been a recognition that RoCE v2 may not be the only option for managing contention between flows in TOR routing.  Datacenter QCN (standing for Quantized Congestion Notification) runs in a quantized scheduling loop and periodically computes a flow rate for each flow, employing a form of end-to-end credit scheme that seems to work extremely well.  People who are deeply involved tell me that at the end of the day, DCQCN makes RDMA on RoCE v2 behave almost exactly like RDMA on IB, with scalability, multi-tenancy and similar performance.  Mellanox is about to roll out DCQCN on its cutting-edge Connext-X4 100Gbps RoCE networking routers and NICs.
I'm focused on DCQCN because of its potential for commercial adoption, but I should point out that another algorithm called TIMELY yields a second new data point in the same space, with similar success. to DCQCN.  So there are two distinct results now both of which suggest that the flow control issue is solvable by making RDMA behave more like TCP.  In fact the key difference is simply that with TCP, packet loss is used to signal overload.  PPF tried to mimic this with "pause sending" messages, but it is very hard to get an in-transit send to pause before loss occurs.  But with DCQCN and TIMELY we move to something much more like an end-to-end credit-based sending approach, which is very much the behavior of IB with these chained credit-to-send patterns in which an upstream resource exhaustion cascades to eventually block the sender -- before any bytes can be lost.  Thus at the end of the day, it shouldn't astonish us that these methods can work on Ethernet -- on RoCE. 

We are left with all the other Ethernet questions: traffic shaping for QoS, virtualization and paging, enterprise VLAN security isolation, etc.  But there is no obvious reason -- perhaps no reason at all -- that RDMA can't live in a world with such requirements, even if HPC systems are simpler.  Talk to the folks developing RDMA technologies at Mellanox and Intel and they will tell you exactly how they are handling every single one of them.  NICs are more and more powerful and one can offload more and more smart stuff into them.

RDMA on RoCE can work.  It will work, and soon!  And it won't ignore even a single one of the critical requirements that arise in enterprise cloud settings.
Honestly, I'm surprised by the vehemence of those who believe that RoCE flow control can never work, can never scale, and hence that RDMA will always be a pure IB technology.  This is one of those situations where the writing is suddenly on the wall...  People refusing to see that are just willfully closing their eyes.  Perhaps the issue is that if you are way out on the limb with your colleagues, having made those kinds of claims for years now, it is a bit awkward to accept that well, maybe RDMA on RoCE has some hope after all?  And that, actually, um, technically, um, well, yes, maybe it does work after all?  But whatever the reason, this phenomenon is very noticeable. 
Interestingly, with flow control schemes like DCQCN we actually have some possibility that RDMA on RoCE could even outperform RDMA on IB – after all, we noted that RDMA on IB runs into the problem senders just send, and credits just get exhausted.  DCQCN creates a new opportunity for genuine traffic sharing.  Quite possibly, these algorithms could ultimately let a RoCE system run closer to its margins, shape loads in ways that match desired profiles specified by the cloud owner, insulate users from disrupting one-another, and could even be secured in interesting ways.  Ultimately, such steps could easily make RoCE both cheaper and faster than IB.
Thus there is suddenly a possibility that RDMA could really be used widely in standard datacenter networks, without needing to deploy a secondary IB network side by side with the fast Ethernet (or to abandon Ethernet entirely in favor of IB).  Given the immense infrastructure investment in fast Ethernet, the appeal of this sort of incremental deployment path is nearly irresistible.



[1] My students sometimes find this confusing, so here’s an example.  Suppose that flow A contracts for Infiniband bandwidth of 100MB/s, and flow B contracts to send 50MB/s.  But what are these numbers, precisely? 
Both Ethernet and IB actually send data in optical frames that are generally fairly small: roughly, 40-1500 bytes of payload, plus various headers in the case of 100Gbps Ethernet, and up to 4096 bytes in the case of Infiniband.  Thus a rate of 100MB/s is really a contract to send perhaps 10,000 completely full frames per second in the case of Ethernet, or 5,000 per second for Infiniband, on these particular network configurations. 
Now think about the router: will it see precisely these rates, or will these be average rates?  In fact for many reasons, they will be average. 
For example, to send data the network interface card (NIC) must fetch it from the host memory, and that operation has unpredictable speed because the DMA transfer engine in the NIC has to do address translation and might even need to fetch the mapping of source data pages from the OS page table. 
There could also be delays, particularly if the NIC finishes all the work currently on its send queue, or if the receiver is a little late in posting a receive request.  Thus bandwidth reservation, on IB, is really an approximation: the router sets aside resources that should be adequate, but then has to deal with rate variability that can be fairly significant at runtime.  If the flow from B is temporarily delayed, by allowing A to send a burst that is temporarily a bit above 100MB/s, the router might buy some slack time to handle a burst from B a bit later. 
Obviously, the bandwidth reservation is still helpful: the router at least has some estimate of how much data will transit on each flow, on average, but at any given instant, there could be fairly significant variations in data rate! 
In contrast, Ethernet routers lack reservation schemes, but often keep statistics: they measure the data rates and know the speeds of the flows that are currently most active. In effect, an Ethernet router might dynamically learn pretty much the same thing we get from IB reservations.  Anyhow, how would we be sure, in advance, that A plans to send 100MB/s, and that B plans to send 50MB/s?  In many situations, at best, the original reservations are just approximations. 
Thus both technologies ultimately depend on a mix of permission to send messages (sent explicitly in some way, but perhaps at a low level), flow control back-off requests (explicit or implicit), etc.   This is why Verbs has so many parameters: to run at peak speeds on IB, the developer often literally needs to fine-tune the behavior of the NICs and the routers!

No comments:

Post a Comment