Tuesday, 6 December 2016

RDMA: One Ring to Rule Them All


Lately I’m fascinated by Remote Direct Memory Access (RDMA): a networking technology that has been around for a while but didn’t work well in datacenters (up to now, RDMA has mostly been used in HPC clusters, on Infiniband networks). But RDMA is suddenly working much better on datacenter network hardware (fast Ethernet, via RoCE), and it will have a disruptive impact.

Over my career, I’ve seen a number of really disruptive changes, where computing shifted quickly in some way that really shook things up.  You’ve surely seen this too: often it starts with an unnoticed pent-up demand.  Then a technology shift suddenly occurs that enables a response to the demand, customers flock to the hot new thing, and we see a kind of non-linear response.  The trigger could be new hardware or a new generation of software, or it could occur simply because a tipping point is reached and everyone suddenly realizes it is time to adopt a new model.
How would we recognize disruption when it happens?  Let's review some past examples:
  • Early computers had small address spaces and we struggled with memory limitations.  Then there was a revolution when computers supporting dramatically larger memories and paging suddenly became available.  But not only did this eliminate a technical challenge, it also enabled a huge surge forward in terms of the applications computers could support: disruptive change.
  • The move to GUI-oriented computing with windows, high resolution displays, color, mouse input.   While seemingly incremental, taken together this changed the way we think about computers and ultimately gave us tablets and immersive VR (and who knows what will come next).  So again, we saw a disruptive change.
  • I'll break out the emergence of hardware graphics accelerators separately: game-changing for HCI, and because GPU computing is so fast for any kind of matrix task, game-changing for ML too.
  • The move from autonomous computing to the client-server model.  This was interesting because it happened twice.  There was an early jump, but client-server infrastructure back in the 1980’s wasn’t ready for big time users, and the first adopters really suffered  from missing technical components.  Ten years later, that first wave of client-server platforms gave way to a second, much better set of options, and at that point, everything shifted.  Microsoft was the first to get the model right, and that was the first of its really huge growth "events".
  • The switch from slow dialup connections over telephone networks to hard-wired networking via Ethernet, then the wide-area Internet, which gave us the Web.  Then the web itself became integrated deeply with client-server computing, and everything moved to web standards: disruptive change that transformed the whole industry.  Bill Gates almost missed this one, but a memo written by one of his SVPs during a visit to Cornell turns out to have galvanized him, and he pivoted the whole company: Microsoft's second big growth event.
  • The replacement of specialized parallel supercomputing architectures by massive clusters of inexpensive commodity computers with fast interconnects and specialized software (MPI).
  • The iPhone and then Android: genuinely successful mobile devices with integrated telephony.
  • The shift to cloud computing model.  This really brought two revolutions: an end-user shift (the sudden dependency on Google, Facebook, Amazon, etc) as well as a computing service side shift (elasticity, virtualization, multitenant resource sharing, outsourced computing and data services). I could write pages and pages on this event.  It had all sorts of sub-revolutions: virtualization, big data, global scaling... What an amazing thing to watch!  On the downside, it became quite a lot harder to do academic research on computer systems: we used to be able to own a complete "system" right in the computer lab.  These days, people like me need friends in high places who are willing to give us access.  Without that access, we can't validate our work.
  • The astonishing successes of big data computing and machine learning, and more recently, unsupervised deep learning and SMT solvers.
I'm sure you could add lots of items to my list, but that would get old.  Instead, let's stick with this list and try to see why each of these revolutions had a disruptive, unexpected aspect to it.  One aspect that leaps out is that even where hardware wasn't the actual catalyst, hardware invariably seems to play a huge role.  Only some of the items listed above were directly triggered by hardware changes, but at the end of the day, every one of them forced immense hardware changes.

A second observation is this: if you compare the state of the world before each revolution to the state of the world after it, all had this feeling of throwing out all sorts of assumptions or limitations, and offering a new story that just works incredibly well compared to the stories it pushed to the side.  And of course, each new development also brings its own new assumptions, limitations and issues.  You can easily be blindsided when this happens: if you happen to keep working on the old stuff, the new generation of companies and thought leaders won't find your work very exciting.  Over the years this has happened to me at least three times.

The opposite is also true.  When disruptive change occurs, the new leaders are often a bit surprised to find themselves in that role, and very often are young, technically adept people who may have very little history from the prior generation of technology.  They often are simply unaware of the prior knowledge base and may ignore past insights, rediscover ("reinvent") them, or invent completely new and amazing ideas that we missed in the past.  But don't assume that the new crowd will be eager for dialog.  Generally speaking, they made a kind of choice to work on fun new stuff and not to work on dusty old boring stuff, and if you start to explain that ten years ago, you discovered this amazing thing about consistency... you'll find that you've lost your audience.  So adapting to a new world often involves learning to think and talk in new ways.

Further, what you thought you know, what you believe you have figured out and deeply understand, may actually not be the answer in the new world.  Learning to think the way the new crowd is thinking may force you to accept the inadequacy of your own past insights when transported to this new environment.  It takes humility to survive technology revolutions that directly impact your area of research and innovation, because the very first thing that happens is that they make you dated and irrelevant.  And you actually have to accept that this is so before you can hope to catch up and contribute again.
In fact, one hallmark of disruptive change is that the stuff we thought we knew, for sure, is called into question.  Mark Twain (Samuel Clemens) is often credited with the line that “it ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.”  This is twice as true today: when disruption happens, “what you knew just ain’t so no-more” (I’m guessing at how he would have phrased it).  Revolutions force us to open our minds to new opportunities.  The agenda changes: even if some of your top agenda items aren’t magically solved by the new thing, don’t assume that they will still be at the top of your agenda a few years into the new world.  Yes, some will live on, but others might just vanish, or be transformed into a completely different set of challenges.
We’re about to see something along those lines, centered on the sudden adoption of RDMA in datacenter environments.  But the things you know about RDMA from looking at it in the past just won't be true for this new datacenter RDMA.  The new datacenter RDMA almost deserves its own acronym (DC-RDMA would be the obvious one).  So don't tell me about yesterday's failures, because those stories are about yesterday’s RDMA.  This new RDMA is a new beast.
RDMA is already pervasive in HPC settings, but that’s a different world, perhaps more different than many people appreciate.  HPC has its own hardware (in particular, Infiniband networks, which are dominant on HPC clusters), and its own software platforms (MPI, Matlab, Linpack, BLAS, etc).  HPC use of RDMA is mostly hidden within these packages: you write code in MPI, or use something written using MPI, and the MPI system itself worries about the RDMA.
The new RDMA I’m intrigued by will be a different technology: a game changer that will be as disruptive in its way as the other seismic events I listed above.  But the story is complicated, so I’ve decided to break it down into smaller chunks.  This post is sort of an introduction: a "landing page".  But I'm also posting a few more threads focused on sub-topics.
Just in case you don't want to read the other postings, here's...
Ken's RDMA cheat sheet:
  • A new kind of datacenter RDMA is coming.  Right now datacenters and other non-HPC cluster computing systems waste a ton of time in inefficient protocols that do things like copying data from user space to kernel buffers and back, fragmenting and retransmitting and acking data, and when a job is launched on multiple nodes (like with MapReduce/Hadoop), the system will often end up sending the same bytes multiple times, perhaps even on the same network links.  RDMA (with the proper software layered above it) can eliminate these overheads and slash costs.  It offers insane speedups, and we’ll be able to replicate data almost for free. 
  • You won’t use RDMA directly.  Just like on HPC platforms, you’ll use it via a software library.  My bet: Cornell’s open-source Derecho library will be a vital enabler for many projects.  My secret agenda is for Derecho to become a universal solution for RDMA communication on general purpose datacenters, playing a role analogous to the one MPI plays on HPC systems.  Derecho is free, but we are well aware that even free stuff has costs (costs in terms of learning something new, getting it installed and running, etc.).  By the time you've worked through a few of my blog entries, perhaps you'll be thinking that those costs are worth it.

It might also be good to summarize a few technical points that matter:

  • RDMA runs on asynchronous message queues (structured as lock-free rings).  The NIC directly monitors these queues and carries out direct DMA transfers from sender to receiver, and it also supports some fancier functionality like one-sided reads and writes, unreliable datagrams, and unreliable multicast.  Some NICs also offer remote atomic operations such as test-and-set.
  • RDMA is starting to work well over fast Ethernet (RoCE, pronounced like "Rocky"; if I could embed music, it would need to be the theme song from the Stallone movie -- Gonna Fly Now).  For routed topologies a technology called DCQCN is the key (it has a cousin named TIMELY that is equally interesting, but seems not to have much commercial traction as of now).  DCQCN stands for Data Center Quantized Congestion Notification, and it is a new way of doing flow-oriented congestion control.  Honestly, it works a lot like TCP flow control, although people will scream and tear their hair out because I said that.  The basic idea is to layer a new congestion scheme over RoCE v2, which otherwise relies on a feature called the priority pause frame to keep the network lossless.  Rather than leaning on standard pause frames, DCQCN generates explicit congestion notifications that make senders slow down, and this works far better.  A game changer that enables true RDMA scalability on RoCE.
  • While RDMA on RoCE with DCQCN is a breakthrough, these are still early days.  DCQCN is a huge step forward, but datacenters will need ways to ensure fair bandwidth apportionment, deal with enterprise VLAN and security, etc.  I think all such considerations should be solvable over time, and so I am betting on RDMA now when there is a good chance for real impact.
  • One reason RDMA has worked on HPC systems using MPI but not on general purpose systems is that the RDMA hardware isn’t good at virtualization.  This may or may not be solvable (there has been recent work showing that the NIC can actually handle page faults and virtualization, but on the other hand, page faults and other such events clearly pose an issue if you want super-high data rates).  However this story plays out, though, we clearly will need to evolve the operating system.  Either we can integrate paging with RDMA, as in this recent work, or the memory region RDMA uses can be pinned and managed by some sort of user-level rdma_malloc, as a separate pool (actually, a set of pools, because you ideally want each core using RDMA pages local to that core).  So we do need some evolution in how memory is managed on Linux and maybe Azure, but technically this is a feasible thing to do.  Once that happens, the RDMA NIC won’t need to know that the end-host uses virtualization to support multi-tenancy, and we can have RDMA even in heavily virtualized environments.  Moreover, this same change would benefit all sorts of DMA transfer systems.
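The queue-based model in the first bullet above is easy to caricature in a few lines.  Here is a toy Python sketch of a queue pair: the application posts work requests to a send queue, a simulated "NIC" performs one-sided writes into remote memory, and the application later polls a completion queue.  The class and method names are invented for illustration; the real verbs API (ibv_post_send, ibv_poll_cq, and friends) has the same shape but a very different surface.

```python
from collections import deque

class ToyQueuePair:
    """Toy model of an RDMA queue pair.  Illustrative only: real
    RDMA verbs follow this asynchronous shape, but this is not
    the actual API."""

    def __init__(self):
        self.send_queue = deque()        # work requests posted by the app
        self.completion_queue = deque()  # completions produced by the "NIC"

    def post_write(self, wr_id, remote_addr, data):
        # Posting is asynchronous: it merely enqueues a descriptor.
        self.send_queue.append((wr_id, remote_addr, data))

    def nic_process(self, remote_memory):
        # Simulate the NIC: carry out each one-sided write directly
        # into the remote memory region, with no remote CPU involved,
        # then deposit a completion for the sender to discover.
        while self.send_queue:
            wr_id, addr, data = self.send_queue.popleft()
            remote_memory[addr:addr + len(data)] = data
            self.completion_queue.append(wr_id)

    def poll_cq(self):
        # The application learns about completions by polling.
        return self.completion_queue.popleft() if self.completion_queue else None

remote = bytearray(16)   # stand-in for a registered remote memory region
qp = ToyQueuePair()
qp.post_write(wr_id=1, remote_addr=4, data=b"ping")
qp.nic_process(remote)
assert qp.poll_cq() == 1 and bytes(remote[4:8]) == b"ping"
```

The key property the sketch captures is that the sender's CPU never blocks on the transfer, and the receiver's CPU is never involved at all: that is where the RDMA speedups come from.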
But there are also some technical questions that don’t yet have such quick and easy answers:
  • FaRM is the most successful RDMA-based distributed shared memory system (technically, a transactional DHT, but you can think of keys as memory addresses and the values as the data sitting at those addresses).  FaRM gets amazing speed at small scale, but scaling up was much harder than the developers expected: at first, when they deployed on a much larger setup, they exhausted all sorts of resources in the Mellanox ConnectX-3 NICs they were using.  To work around these limits they ultimately disabled virtual memory, configured the operating system to use fat (1GB) pages, and modified FaRM itself to be very careful about how many connections it had open.  On the other hand, the ConnectX-4 NICs seem to have far larger memory pools for those resources.  Moral of this story?  The first products to jump on the bandwagon may have challenges scaling up, but with time and insight into hardware limitations, they should be able to do it.
  • FaRM isn’t the only famous transactional DHT: another widely cited project is called HERD, which is basically the CMU competitor to FaRM.  Whereas FaRM uses connected qpairs, HERD is unusual in that it works with RDMA unreliable datagrams (UD), and it claims big speedups.  But HERD’s claims are kind of specific to its pattern of data transfers (keys and values are very small in HERD), and they haven’t experimented with RDMA UD at genuinely large scale.  So I’m a bit skeptical, for now, about the merits of RDMA UD.  My own work has taken a different approach: we view bound qpairs as a costly resource (at least, active bound qpairs: sometimes we may leave connections idle).  We think we scale quite well this way.  On the other hand, my work isn’t aimed at building a massive transactional DHT, so the setting differs too.
  • The FaRM story suggests that there may be a future problem in multitenancy situations: it seems at least plausible that a tenant using more than its fair share of NIC resources could harm performance for everyone.  However, we’ll need to experiment on real hardware to understand how big an issue this really is.  In fact there are many limited NIC resources, and we’ll just have to learn to work within those limits.
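The user-level rdma_malloc idea from the earlier list is also easy to sketch: give each core its own pool of pinned, registered pages, so allocation never crosses cores.  Everything here (the RdmaPool class, rdma_malloc, the constants) is invented for illustration; a real allocator would pin pages through the OS and register them with the NIC rather than just handing out bytearrays.

```python
class RdmaPool:
    """Sketch of a user-level 'rdma_malloc'-style allocator: one
    free list of pinned, NIC-registered pages per core, so a core
    allocates from its own pool without cross-core locking.  The
    names and sizes are illustrative, not a real implementation."""

    PAGE_SIZE = 4096

    def __init__(self, num_cores, pages_per_core):
        # In reality each page would be pinned and registered with
        # the NIC up front; here, plain buffers stand in for them.
        self.free_pages = [
            [bytearray(self.PAGE_SIZE) for _ in range(pages_per_core)]
            for _ in range(num_cores)
        ]

    def rdma_malloc(self, core):
        pool = self.free_pages[core]
        if not pool:
            raise MemoryError("per-core pinned pool exhausted")
        return pool.pop()

    def rdma_free(self, core, page):
        self.free_pages[core].append(page)

pool = RdmaPool(num_cores=2, pages_per_core=8)
page = pool.rdma_malloc(core=0)
assert len(page) == RdmaPool.PAGE_SIZE
pool.rdma_free(0, page)
```

Because the pools are pre-pinned and core-local, the NIC never sees a page fault and the hot allocation path never takes a lock, which is exactly the property you want at RDMA data rates.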
Because the issues seen in multitenant settings differ from the ones seen on HPC, my belief is that RDMA on RoCE will ultimately branch off from the IB version of RDMA and evolve into a completely different technology fairly quickly.  This is why I suggested above that you think of the disruptive thing as DC-RDMA: a new name for a new thing.  A further thing to keep in mind is that today, RDMA on RoCE v2 isn’t using DCQCN, and RoCE v2 without DCQCN is difficult to parameterize correctly (a nice way to word things; some people I talk to insist that the correct wording would be “unstable and at best, still sort of slow compared to IB”).  People may already be forming negative impressions about RoCE for this reason.  And my message is: yes, they may be absolutely correct.  But this will change soon!
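Since I claimed above that DCQCN works a lot like TCP flow control, here is a deliberately crude caricature of the idea in Python: cut the sending rate multiplicatively when a congestion notification (derived from ECN marks) arrives, and recover additively otherwise.  The real DCQCN algorithm is considerably more elaborate (rate targets, byte counters, timers, hyper-increase stages), and every constant below is made up purely for illustration.

```python
def adjust_rate(rate, congestion_notified,
                min_rate=1.0, line_rate=100.0,
                decrease_factor=0.5, increase_step=5.0):
    """Crude DCQCN-flavored rate control (illustrative constants only):
    multiplicative decrease on a congestion notification, additive
    increase otherwise, clamped between a floor and the line rate."""
    if congestion_notified:
        return max(min_rate, rate * decrease_factor)
    return min(line_rate, rate + increase_step)

# A burst of congestion notifications drives the rate down fast...
rate = 100.0
for _ in range(3):
    rate = adjust_rate(rate, congestion_notified=True)
assert rate == 12.5
# ...and notification-free rounds recover it gradually.
for _ in range(4):
    rate = adjust_rate(rate, congestion_notified=False)
assert rate == 32.5
```

The multiplicative-decrease / additive-increase shape is why the comparison to TCP is fair, even though DCQCN operates on per-flow sending rates at the NIC rather than on a window in the kernel.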
