RDMA on Infiniband versus RDMA on Fast Ethernet (RoCE)
A few high-level takeaways:
- RDMA hasn’t worked well on Ethernet up to now; it really has been an Infiniband technology.
- Datacenter computing systems are deeply invested in optical Ethernet and this isn’t likely to change. Nor would their owners consent to double-wiring (e.g. running IB side by side with Ethernet).
- RoCE is the RDMA standard that runs on Ethernet. RoCE v1 was not routable, so you could only use it within a single rack, and at that scale it worked perfectly well: indistinguishable from IB.
- But RoCE v2, which supports routing and uses priority pause frames (PPF) for flow control, is another matter. There have been many reports of instability, PPF storms, difficulty tuning the routers and NICs. Ad-hoc ways of generating pause frames are often unstable. If you talk to the RDMA vendors, they insist that these issues should not occur, but my students have run into such problems on a few platforms, and when I talk to cloud computing vendors at conferences, they tell me they've had major issues with RoCE v2 PPF as well. So I'm skeptical that this is really a viable option.
- There is a major breakthrough in this space: Data Center Quantized Congestion Notification (DCQCN). It controls the generation of pause frames in a way that manages rates on a per-stream basis and seems to work quite well. RoCE v2 + DCQCN looks like a winning combination. Interestingly, DCQCN ends up behaving very much like the credit-based flow control used by RDMA on IB, which is one reason to believe that with DCQCN, we have a story that can perform and scale just as well as RDMA on IB. On the other hand, multi-tenant datacenter operators need some things that HPC systems simply don't need, like virtualization both at the node level (with virtual memory, maybe paging, security, perhaps even strong security from Intel SGX or a similar model), and even at the network level (enterprise VLAN models). Then there may be issues of traffic shaping to support priorities (quality of service). So even if RoCE v2 + DCQCN is rock solid, it isn't quite game over ... not yet.
For those of us who remember the ancient
past, Ethernet has been the technology that never gives up. Every time the technology press predicted
that Ethernet was about to be killed off by some superior wave of networking
technology, Ethernet would battle back.
For example, twenty years ago, token ring networking was widely expected
to displace Ethernet. Didn’t
happen. Infiniband was supposed to kill
Ethernet too. Not even close. The most recent player has been SDN.
In fact, Ethernet expanded enormously over
the period we’ve been discussing, responding to a variety of datacenter
communication needs that Infiniband more or less ignores.
Ethernet has been the winner (again and
again) for the rich world of datacenter computing, for lots of reasons. One is simply that the installed base already
uses Ethernet and upgrading in-place is cheap and easy, whereas switching to
something genuinely different can be very disruptive (in the pejorative
sense).
A second point is that Ethernet works well
in a world dominated by sharing: end-host multitenancy
and communication within virtual local
area networks (VLANs), creating the illusion of a private network for each
customer account. Cloud operators are finding that they can manage these VLANs in ways that provide fairness for customers, and are getting better and better at quality of service and priority support, so that gold customers really get gold service. There is a growing
push to secure VLANs with cryptographic keys, and the Ethernet vendors are out
in front offering these kinds of solutions; RDMA vendors haven’t been quick to
respond, perhaps because HPC systems aren’t shared and hence just don’t need
such features.
There was a brief period when the style of
routing used in switched and routed Ethernet seemed like a performance-limiting
obstacle, and the press confidently predicted that routers would simply not be
able to keep up. Had this happened,
Ethernet would have died. But guess what?
A technique called fast IP prefix lookup suddenly emerged; it permitted
Ethernet routers to offer blazing speed fully competitive with circuit-switched
routers, like the ones used by IB.
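To give a concrete sense of what a prefix lookup involves, here is a tiny sketch of my own (not taken from any router implementation) of longest-prefix matching with a binary trie over the address bits. Real routers use far more compact data structures and do this in hardware, but the principle is the same: walk the bits, remember the longest prefix seen so far.

```python
# Minimal longest-prefix-match sketch: a binary trie over IPv4 address bits.
# Illustrative only; hardware routers use compressed tries or TCAMs.
import ipaddress

class PrefixTrie:
    def __init__(self):
        self.root = {}  # each node: {'0': child, '1': child, 'hop': next_hop}

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr):
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node, best = self.root, self.root.get("hop")
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("hop", best)  # remember the longest match seen so far
        return best

trie = PrefixTrie()
trie.insert("10.0.0.0/8", "core-router")
trie.insert("10.1.0.0/16", "tor-switch-7")
print(trie.lookup("10.1.3.4"))   # tor-switch-7 (longest matching prefix wins)
print(trie.lookup("10.9.9.9"))   # core-router
```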
And this in turn has enabled Ethernet to
exactly track the optical link speeds used by IB: not so surprising because
everyone uses the cutting edge optical-silicon chip sets.
The big puzzle has always been flow
control. IB switches and routers use a credit scheme: link by link as a packet is routed through the network, each downstream device gives the upstream one permission to send a certain number of bytes. Thus with reliable fabrics, no packet is ever lost (unless something crashes): in effect, when device A actually sends packet X to device B, space has already been reserved in B for the bytes. IB links are FIFO, so if there is no space in B, A can't forward. Thus if client C is trying to send to device A, but A is waiting for credits to send some prior packet Y to B, the backlog pushes right back to C, who waits.
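To make the credit cascade concrete, here is a little simulation of my own (purely illustrative, not anything from the IB specification): each device advertises one credit per free buffer slot to its upstream neighbor, nothing is ever dropped, and when the downstream device stalls, the blockage propagates hop by hop back to the original sender.

```python
# Hop-by-hop credit flow control, simplified: sender C -> switch A -> switch B.
# A device may forward a packet only if the next hop has a free buffer slot
# (a "credit"). Nothing is ever dropped; when B stalls, A's buffers fill,
# and the blockage cascades back until the original sender C must wait.
from collections import deque

class Device:
    def __init__(self, name, buffer_slots):
        self.name = name
        self.queue = deque()
        self.free_slots = buffer_slots       # credits advertised to the upstream hop

    def can_accept(self):
        return self.free_slots > 0

    def accept(self, pkt):
        self.free_slots -= 1
        self.queue.append(pkt)

    def forward_to(self, downstream):
        """Send the head-of-line packet if (and only if) we hold a credit for it."""
        if self.queue and downstream.can_accept():
            downstream.accept(self.queue.popleft())
            self.free_slots += 1             # our slot is free again: re-grant the credit
            return True
        return False                         # no credit downstream: backpressure

A = Device("A", buffer_slots=2)
B = Device("B", buffer_slots=1)              # B's own downstream is stalled, so B never drains

sent = 0
for step in range(6):
    if A.can_accept():
        A.accept(f"pkt{sent}")
        sent += 1
    else:
        print(f"step {step}: backlog has cascaded all the way to the sender C")
    A.forward_to(B)
```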
Now, one big difference with Ethernet is that in a classic Ethernet, routers and switches never wait. Instead, if someone lacks space, they drop packets. Since RDMA is required to be reliable, this is a problem for RDMA: a genuine packet loss causes the RDMA connection to break -- no retransmissions ever occur. Instead, RDMA on RoCE allows an element that is approaching its limits to send a pause request back toward the sender: a PPF notification. The upstream device pauses, and this behavior repeats hop by hop until the congested device has capacity again.
Theoretically, the PPF packet is acting much like a credit scheme. In practice, the sending of PPF notifications seems to be a tricky problem in RoCE networks, and as noted earlier, people who really work with RoCE v2 are broadly skeptical that the vendors can actually get this to work. DCQCN, mentioned in a prior posting, disables the normal PPF mechanisms and replaces them with a kind of end-to-end flow credit solution that in many ways mimics the IB approach. This works, but it is not yet widely deployed.
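To see why tuning this is delicate, here is a toy sketch (my own, with made-up watermark values) of the kind of logic a switch ingress queue uses: send a pause once a high threshold is crossed, send a resume once the queue drains below a low one. Set the thresholds too high and the buffer overflows anyway; too low, or too close together, and the network drowns in pause/resume traffic.

```python
# Toy PFC-style pause/resume logic for one switch ingress queue.
# Illustrative thresholds only; real switches tune these per port and priority,
# and must account for packets already in flight on the wire.
class IngressQueue:
    def __init__(self, capacity, xoff, xon):
        self.capacity = capacity   # total buffer (packets)
        self.xoff = xoff           # high watermark: ask upstream to pause
        self.xon = xon             # low watermark: tell upstream to resume
        self.depth = 0
        self.paused_upstream = False

    def enqueue(self):
        if self.depth >= self.capacity:
            return "DROP"          # the very thing the pause frames are meant to prevent
        self.depth += 1
        if not self.paused_upstream and self.depth >= self.xoff:
            self.paused_upstream = True
            return "SEND_PAUSE"    # pause frame toward the upstream hop
        return "OK"

    def dequeue(self):
        if self.depth > 0:
            self.depth -= 1
        if self.paused_upstream and self.depth <= self.xon:
            self.paused_upstream = False
            return "SEND_RESUME"
        return "OK"

q = IngressQueue(capacity=16, xoff=12, xon=4)
events = [q.enqueue() for _ in range(14)] + [q.dequeue() for _ in range(10)]
print([e for e in events if e != "OK"])   # ['SEND_PAUSE', 'SEND_RESUME'] if the thresholds hold
```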
So why not dump Ethernet and use IB? The issue is that it isn't just about RDMA. IB lacks all these other good features of Ethernet, and hence faces huge barriers to uptake in datacenters. In some ways it comes down to a kind of inertia: Ethernet offers competitive speed, is more or less universally deployed, is far cheaper (the market for Ethernet routers is perhaps 100x larger than that for IB), and is supported by a deep set of tools that let cloud operators shape their offerings in ways that multitenancy really demands. IB lacks those tools, so why would anyone want to move to IB?
In side by side experiments that pit HPC on IB against HPC on RoCE v2 with the PPF algorithm working properly, a further issue arises: HPC systems actually turn out to run about 10% faster on IB than on fast Ethernet (more on this in a moment), so this does keep the HPC world fixated on IB. Yet the HPC world is a very small one compared to the general datacenter world. So even in cloud offerings of HPC, we tend to see a rejection of RoCE for the HPC workloads, and a rejection of IB for the non-HPC workloads.
What stands out in this story is that the main barrier to adoption, and it’s really a major one, is that IB has not figured out how to deal with network virtualization or multitenancy: by and large, IB runs with non-virtualized endpoint computers: they host either one or several HPC applications, each with its own memory allocation and dedicated cores, each running on “bare metal”, by which we simply mean, without any use of virtual memory. This is because HPC systems care about speed, and virtualization introduces delay for address translation, for paging, and can also have scheduling delay because of contention for the server cores. Thus a typical IB deployment doesn’t worry much about virtualization, sharing, scheduling delays, and in fact the schedulers used to launch jobs on HPC machines even have tweaks built into them to optimize layout in ways that match perfectly with the way that MPI protocols use IB. So you could probably make a case that IB and MPI have co-evolved, a bit like the way that datacenter TCP and Ethernet have co-evolved. Each specializes in a different application. Neither breaks into the other’s space, at least easily, because without a huge effort the last 10% or so of performance is so hard to snare.
I mentioned TCP as a partner for Ethernet
in part to get back on topic: RDMA is about moving bytes, and TCP is about
moving bytes, and the end-user tends to depend on TCP today, for almost
everything. There has been an amazing
amount of work on optimizing TCP to run closer to the raw network speed. This used to be a lot easier, back in the
days when the raw network was a lot slower than the CPU clock and costs like
copying bytes from user space to kernel (to form IP packets) weren’t a big
concern. These days, copying is out of
the question: a single core on a modern server can only copy at about 13.5Gb/s,
whereas the network might be running at 100Gb/s (with 200Gb/s already on the
drawing board and projections of 1 Tb/s by the end of the decade). But to some degree the kernel TCP stack has
adapted to this constraint. For example,
rather than copying data into the kernel, the kernel can build the packet
header but use scatter gather to reach directly into user space for the packet
body, and similarly on the receiver side.
With such tricks, datacenter TCP stacks can achieve respectable speeds.
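As a user-space analogue of the scatter-gather trick, here is a small sketch (mine, not kernel code): sendmsg() hands the kernel a list of separate buffers, so a protocol header and an application payload go out as one stream write without first being copied into a single contiguous buffer, and recv_into() on the other side lands bytes directly in an application buffer.

```python
# User-space analogue of scatter-gather I/O: sendmsg() takes a list of buffers,
# so header and payload need not be copied into one contiguous block first.
# (Inside the kernel, the same idea lets the TCP stack point packet descriptors
# at user pages instead of copying them.)
import socket
import struct

def send_record(sock, payload: bytes):
    # Tiny length-prefixed framing: 4-byte header + payload, sent as two buffers.
    header = struct.pack("!I", len(payload))
    sock.sendmsg([header, payload])          # scatter-gather: no b"".join() copy

def recv_record(sock):
    header = sock.recv(4, socket.MSG_WAITALL)
    (length,) = struct.unpack("!I", header)
    buf = bytearray(length)
    view, received = memoryview(buf), 0
    while received < length:                 # recv_into writes directly into our buffer
        received += sock.recv_into(view[received:])
    return bytes(buf)

if __name__ == "__main__":
    a, b = socket.socketpair()               # stand-in for a real TCP connection
    send_record(a, b"hello, zero-ish-copy world")
    print(recv_record(b))
```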
On the other hand, the very fastest TCP
solutions can be hard to use: the application may need to enable various
special features to tell the TCP protocol about its data movement plans, and it
may need to place data into memory in a particularly efficient layout and
perhaps even send fixed large blocks (which seems natural, but keep in mind
that by definition TCP is a byte stream protocol, not a block protocol, so in
fact a block transmission approach is not nearly as simple to set up as you might
expect). It becomes important to make
sure that the receivers are waiting for incoming data before it is sent (again,
this sounds obvious, but in fact standard TCP uses kernel buffering and has no
such requirement at all: data is copied to the kernel, sent to the kernel of
the receiving machine, then copied up to user space. So ensuring that the receiver is already
waiting, is receiving into memory blocks that are actually mapped into DRAM,
and that the receiver itself hasn’t been swapped out – none of this is
completely straightforward). Some of the
very fastest schemes also require using multiple TCP sessions side by side.
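And as a toy illustration of the multiple-sessions trick, here is a sketch (my own; the chunking and degree of parallelism are made up) that stripes one large buffer across several parallel stream connections and reassembles it on the far side.

```python
# Striping one large buffer across several parallel streams.
# Toy sketch: the number of streams and the chunking are made up; real
# high-throughput transfer tools tune these carefully.
import socket
import threading

STREAMS = 4
data = bytes(range(256)) * 4096              # ~1 MB of test data
chunk = (len(data) + STREAMS - 1) // STREAMS

pairs = [socket.socketpair() for _ in range(STREAMS)]   # stand-ins for TCP connections
received = [b""] * STREAMS

def sender(i):
    lo = i * chunk
    pairs[i][0].sendall(data[lo:lo + chunk])
    pairs[i][0].shutdown(socket.SHUT_WR)     # signal end of this stripe

def receiver(i):
    parts = []
    while True:
        b = pairs[i][1].recv(65536)
        if not b:
            break
        parts.append(b)
    received[i] = b"".join(parts)

threads = [threading.Thread(target=f, args=(i,))
           for i in range(STREAMS) for f in (sender, receiver)]
for t in threads: t.start()
for t in threads: t.join()

assert b"".join(received) == data
print("reassembled", len(data), "bytes over", STREAMS, "parallel streams")
```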
Thus most applications can easily use TCP
to send data at perhaps 2 Gb/s on a modern datacenter network, but to get
anywhere near 100 Gb/s is quite a different matter. This requires more-or-less heroic effort by
the application designer and increasingly complex TCP stacks in the kernel, all
working in concert, and even then, TCP rarely gets anywhere close to
100 Gb/s. In fact, there are all sorts
of products aimed at helping the designer do this.
RoCE: We are not impressed (yet)
So with all this background, we’ve seen
that RDMA is basically an IB technology and that Ethernet is basically dominant
on the datacenter side. But now we get
to the good part: a series of developments, some dating back ten years or so,
but others much more recent. Broadly,
I’m pointing to something called RDMA over Converged Ethernet, or RoCE
(pronounced “rocky”). The convergence
term isn’t important here, so just think of this as modern fast Ethernet, same
technology you are already using in your clusters and datacenters.
The basic idea of RoCE is to shoehorn RDMA
onto Ethernet so that tools like MPI could potentially run side by side with
the incredibly diverse mix of things we see in the cloud. RoCE offers the same RDMA verbs API, but
routes the data over standard fast Ethernet.
As I've already mentioned, there has been a lot of skepticism about RoCE. The first generation of RoCE (v1) worked only
in a single rack of nodes at a time, and couldn’t route through the top of rack
(TOR) network. Then RoCE v2 came out, but
didn’t work very well: v2 routes through the TOR layer, but depends on
priority pause frames, intended to let the router push back and slow down a
data flow that might overwhelm it. The
first RoCE v2 routers and NICs were pretty unstable and prone to sending
excessive rates of pause frames, which can slow RoCE down to a crawl.
Even when
perfectly configured (a black art!), IB seems to be about 10% faster than
RoCE v2 if you put them side by side and expose them to the same load. With contending traffic in a datacenter that
has an oversubscribed TOR layer, MPI over RoCE can be drastically slower than MPI on IB in an HPC supercomputer,
even if the servers use the same processor speeds and have similar amounts of
DRAM. This has been an obstacle to cloud
HPC computing, and a boon to the massive scientific supercomputer centers:
nobody will migrate applications like weather prediction to the cloud unless
they will run there with competitive performance.
But the game has changed.
The first driver
of change centers on new demand: there
has been a wave of exciting operating systems papers showing that RDMA could be
extremely useful in settings remote from HPC.
In particular, the FaRM and Herd papers, from Microsoft Research and
CMU, shook up the field by showing that clever use of RDMA could support a
massive distributed shared memory model, and that software transactional memory
ideas can be applied to make the resulting key-value stores strongly consistent as well as
incredibly fast and scalable.
Small-scale replication (usually 3 replicas) provides fault-tolerance. Then there was a series of papers showing
that the operating system itself could be split into more of a data-plane /
control-plane model, facilitating programming with RDMA technologies. Finally, very pragmatically, many studies
have noted that datacenters are limited by the huge amount of copying they do
over their networks: they replicate programs and virtual machine images, photos
and videos, you name it. All this
copying is slow, and if RDMA could speed it up, the datacenter would be used
more effectively. Everyone wants a piece
of that amazing speed.
So this has created a wave of demand for RDMA, except not on IB, and not in a form where one user could easily disrupt some other user. The demand is for a new style of RDMA, on Ethernet, and with fair traffic apportionment. For any developer who pays for his or her cloud resources, waiting for data to shuffle around at perhaps 250Mb/s is not really fun. And with today’s big move towards machine learning and AI applications that use a MapReduce computing style, where data is shuffled at the end of each iteration, the time spent shuffling and reducing is a critical bottleneck.
Another driver of
change has been a recognition that priority pause frames may not be the only option for
managing contention between flows in TOR routing. Datacenter QCN (standing for Quantized
Congestion Notification) runs in a quantized scheduling loop and periodically
computes a flow rate for each flow, employing a form of end-to-end credit scheme that seems to work
extremely well. People who are deeply involved tell me that at the end of the day, DCQCN makes RDMA on RoCE v2 behave almost exactly like RDMA on IB, with scalability, multi-tenancy and similar performance. Mellanox is about to
roll out DCQCN on its cutting-edge ConnectX-4 100Gbps RoCE networking routers
and NICs.
I'm focused on DCQCN because of its potential for commercial adoption, but I should point out that another algorithm called TIMELY yields a second new data point in the same space, with success similar to DCQCN's. So there are now two distinct results, both of which suggest that the flow control issue is solvable by making RDMA behave more like TCP. In fact the key difference is simply that with TCP, packet loss is used to signal overload. PPF tried to mimic this with "pause sending" messages, but it is very hard to get an in-flight transfer to pause before loss occurs. With DCQCN and TIMELY we move to something much more like an end-to-end, credit-based pacing of senders, which is very much the behavior of IB with its chained credit-to-send pattern, in which downstream resource exhaustion cascades back upstream to eventually block the sender -- before any bytes can be lost. Thus at the end of the day, it shouldn't astonish us that these methods can work on Ethernet -- on RoCE.
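To give a feel for the style of control loop involved, here is a deliberately simplified, DCQCN-flavored sketch of my own. The constants and update rules are illustrative only -- the published DCQCN algorithm maintains separate target and current rates, timers, and several increase phases -- but the essential idea is visible: the sender's rate is adjusted smoothly from congestion feedback rather than by pausing the link.

```python
# Highly simplified, DCQCN-flavored sender rate control.
# On each feedback interval the sender learns whether its packets were
# ECN-marked (congestion seen at a switch queue). Marked -> multiplicative
# decrease scaled by a smoothed congestion estimate alpha; unmarked ->
# gentle additive increase. All constants here are illustrative only.
LINE_RATE_GBPS = 100.0
G = 1 / 16                # smoothing gain for the congestion estimate
ADD_STEP_GBPS = 2.0       # additive increase per quiet interval

def update(rate, alpha, ecn_marked):
    alpha = (1 - G) * alpha + G * (1.0 if ecn_marked else 0.0)
    if ecn_marked:
        rate = rate * (1 - alpha / 2)             # cut harder the more persistent the marks
    else:
        rate = min(LINE_RATE_GBPS, rate + ADD_STEP_GBPS)
    return rate, alpha

rate, alpha = LINE_RATE_GBPS, 0.0
feedback = [True] * 5 + [False] * 10 + [True] * 2   # a burst of congestion, then recovery
for marked in feedback:
    rate, alpha = update(rate, alpha, marked)
    print(f"marked={marked!s:5}  rate={rate:6.1f} Gb/s  alpha={alpha:.3f}")
```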
We are left with all the other Ethernet questions: traffic shaping for QoS, virtualization and paging, enterprise VLAN security isolation, etc. But there is no obvious reason -- perhaps no reason at all -- that RDMA can't live in a world with such requirements, even if HPC systems are simpler. Talk to the folks developing RDMA technologies at Mellanox and Intel and they will tell you exactly how they are handling every single one of them. NICs are more and more powerful and one can offload more and more smart stuff into them.
RDMA on RoCE can work. It will work, and soon! And it won't ignore even a single one of the critical requirements that arise in enterprise cloud settings.
Honestly, I'm surprised by the vehemence of those who believe that RoCE flow control can never work, can never scale, and hence that RDMA will always be a pure IB technology. This is one of those situations where the writing is suddenly on the wall... People refusing to see that are just willfully closing their eyes. Perhaps the issue is that if you are way out on a limb with your colleagues, having made those kinds of claims for years now, it is a bit awkward to accept that well, maybe RDMA on RoCE has some hope after all? And that, actually, um, technically, um, well, yes, maybe it does work after all? But whatever the reason, this phenomenon is very noticeable.
Interestingly, with flow control schemes like DCQCN we
actually have some possibility that RDMA on RoCE could even outperform RDMA on IB – after all, we
noted that with RDMA on IB, senders simply send until the credits run out, and the resulting backlog stalls everything upstream.
DCQCN creates a new opportunity for genuine traffic sharing. Quite possibly, these algorithms could ultimately let a RoCE system run closer to its margins, shape loads in ways that match desired profiles specified by the cloud owner, insulate users from disrupting one another, and could even be secured in interesting ways. Ultimately, such steps could easily make RoCE both cheaper and faster than IB.
Thus there is
suddenly a possibility that RDMA could really be used widely in standard
datacenter networks, without needing to deploy a secondary IB network side by
side with the fast Ethernet (or to abandon Ethernet entirely in favor of
IB). Given the immense infrastructure
investment in fast Ethernet, the appeal of this sort of incremental deployment
path is nearly irresistible.
[1] My students sometimes find this confusing, so here’s an
example. Suppose that flow A contracts
for Infiniband bandwidth of 100MB/s, and flow B contracts to send 50MB/s. But what are these numbers, precisely?
Both Ethernet and IB actually send data in optical
frames that are generally fairly small: roughly, 40-1500 bytes of payload, plus
various headers in the case of 100Gbps Ethernet, and up to 4096 bytes in the
case of Infiniband. Thus a rate of
100MB/s is really a contract to send roughly 67,000 completely full frames per
second in the case of Ethernet, or about 24,000 per second for Infiniband, on these
particular network configurations.
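For completeness, here is the back-of-the-envelope arithmetic behind those frame-rate figures, using the payload sizes mentioned above and ignoring header overhead:

```python
# Back-of-the-envelope frame rates for a 100 MB/s flow, ignoring header overhead.
rate_bytes_per_sec = 100 * 10**6      # 100 MB/s
ethernet_payload = 1500               # bytes per full Ethernet frame
infiniband_payload = 4096             # bytes per full IB frame (4096-byte MTU)

print(f"Ethernet:   {rate_bytes_per_sec / ethernet_payload:,.0f} frames/s")    # ~66,667
print(f"Infiniband: {rate_bytes_per_sec / infiniband_payload:,.0f} frames/s")  # ~24,414
```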
Now
think about the router: will it see precisely
these rates, or will these be average
rates? In fact for many reasons, they
will be average.
For example, to send
data the network interface card (NIC) must fetch it from the host memory, and
that operation has unpredictable speed because the DMA transfer engine in the
NIC has to do address translation and might even need to fetch the mapping of
source data pages from the OS page table.
There could also be delays, particularly if the NIC finishes all the
work currently on its send queue, or if the receiver is a little late in
posting a receive request. Thus
bandwidth reservation, on IB, is really an approximation: the router sets aside
resources that should be adequate, but then has to deal with rate variability
that can be fairly significant at runtime.
If the flow from B is temporarily delayed, by allowing A to send a burst
that is temporarily a bit above 100MB/s, the router might buy some slack time
to handle a burst from B a bit later.
Obviously, the bandwidth reservation is still helpful: the router at least has
some estimate of how much data will transit on each flow, on average, but at
any given instant, there could be fairly significant variations in data
rate!
In contrast, Ethernet routers lack
reservation schemes, but often keep statistics: they measure the data rates and
know the speeds of the flows that are currently most active. In effect, an
Ethernet router might dynamically learn pretty much the same thing we get from
IB reservations. Anyhow, how would we be
sure, in advance, that A plans to send 100MB/s, and that B plans to send
50MB/s? In many situations, at best, the
original reservations are just approximations.
Thus both technologies ultimately depend on a mix of permission to send
messages (sent explicitly in some way, but perhaps at a low level), flow
control back-off requests (explicit or implicit), etc. This is why Verbs has so many parameters: to
run at peak speeds on IB, the developer often literally needs to fine-tune the
behavior of the NICs and the routers!