Tuesday, 27 November 2018

So what's the story about "disaggregated" cloud computing?

The hardware community has suddenly gone wild over a new idea they call "disaggregated" cloud computing infrastructures.  I needed to talk about this in my graduate class, so I've been reading some of the papers and asking around.

If you come at the topic from distributed/cloud computing, as I do, you may have been using "disaggregation" to refer to a much older trend: the decoupling of control software from the hardware that performs various data- and compute-intensive tasks that would otherwise burden a general-purpose processor.  That is a legitimate use of the term, but as it turns out, the architecture folks are focused on a different scenario: for them, disaggregation is a new way of building data-center compute nodes in which you group elements with similar functionality, using some form of micro-controller with minimal capabilities to manage the components that handle storage, memory, FPGAs, GPUs, TPUs, and so forth.  Recent work on quantum computing uses a disaggregated model too: the quantum device is like an experimental apparatus that a control program carefully configures into a specific state, then "runs", and then collects output from.

The argument is that disaggregation reduces wasted power by supporting a cleaner form of multitenancy.  For example, consider DRAM.  The “d” stands for dynamic, meaning that power is expended to keep every bit continuously refreshed.  This is a costly, heat-intensive activity, and yet your DRAM may not even be close to fully used (studies suggest that 40-50% utilization would be pretty good).  If all the DRAM were in a shared rack, bin-packing could easily give you >95% utilization on the active modules, and you could power down any unused modules.
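
To make the bin-packing claim concrete, here is a tiny sketch in C (the 64 GB module size and the per-tenant demands are numbers I made up for illustration, not from any study): first-fit-decreasing packing of tenant DRAM demands into pooled modules, where any module that ends up unused would simply stay powered down.

```c
#include <stdio.h>
#include <stdlib.h>

#define MODULE_GB 64   /* hypothetical capacity of one pooled DRAM module */

static int cmp_desc(const void *a, const void *b) {
    return *(const int *)b - *(const int *)a;
}

int main(void) {
    /* Illustrative per-tenant DRAM demands, in GB. */
    int demand[] = { 40, 34, 30, 24, 20, 16, 12, 8, 4, 2 };
    int n = sizeof(demand) / sizeof(demand[0]);
    int free_gb[64];               /* free space left in each powered-on module */
    int modules = 0, used = 0;

    qsort(demand, n, sizeof(int), cmp_desc);    /* first-fit-decreasing */

    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; j < modules; j++)           /* first module with enough room */
            if (free_gb[j] >= demand[i]) break;
        if (j == modules)                       /* none fits: power on another module */
            free_gb[modules++] = MODULE_GB;
        free_gb[j] -= demand[i];
        used += demand[i];
    }

    printf("%d GB packed into %d modules: %.0f%% utilization\n",
           used, modules, 100.0 * used / (modules * MODULE_GB));
    return 0;
}
```

With these made-up numbers the packing lands at 99% utilization on three modules; the point is the mechanism, not the specific figure.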

So disaggregation as a term really can refer to pretty much any situation in which some resources are controlled from programs running "elsewhere": you end up with a distributed application that is conceptually a single thing, but has multiple distinct moving parts, with different subsystems specialized for different roles and often using specialized hardware in those roles.

At this point I'm seeing the idea as a spectrum, so perhaps it makes sense to start with the older style of control-plane/data-plane separation, say a few words about the value of such a model, and then look more closely at this extreme new form and ask how it fits into the picture.

Control/data disaggregation is really a pretty common idea and dates back quite far: even the earliest IBM mainframes had dedicated I/O coprocessors to operate disks and tapes, and talked about "I/O channels".   I could easily list five or ten other examples that have popped up during the past decade or two: very few devices lack some form of general-purpose on-board computer, and we can view any such system as a form of disaggregated hardware.  But the one that interests me most is the connection to Mach, at the stage when the project was publishing very actively at SOSP and was run from CMU, with various people heading it: Rick Rashid (who ultimately left to launch Microsoft Research), then Brian Bershad, and then Tom Anderson.

Mach was an amazing system, but what I find fascinating is that it happens to also have anticipated many aspects of this new push into disaggregation.  As you perhaps will recall (from all those Mach papers you've read...) the basic idea was to have a single micro-kernel that would host various O/S presentation frameworks: a few versions of Linux, perhaps versions of legacy IBM operating systems, etc.  Internally, one of those frameworks consisted of a set of processes sharing DLLs, some acting as servers and others as applications.  Memory was organized into segments with read/write/execute permissions, and all the underlying interactions centered on an ultra-fast form of RPC in which a single CPU core could run an application thread in segment A, then briefly suspend that thread and activate a thread context in segment B (perhaps A was an application and B the file server, for example), run in B for a short period, then return to A.  The puzzle was to handle permissions (the file server can see the raw disk, but A shouldn't be allowed to, and conversely, the file server can DMA from a specific set of memory areas in A, but shouldn't be able to scan A's memory for passwords).  Then because all of this yielded a really huge address space, Mach ran into page table fragmentation issues that were ultimately solved with a novel software inverted page table.  And capabilities were used for the low-level message passing primitives.  Really cool stuff!

As you might expect, all of these elements had costs: the messaging layer, the segmentation model, context switching, etc.  Mach innovated by optimizing those steps, but keep these kinds of costs in mind, because disaggregation will turn out to run into similar issues.

Which brings us to today!  The modern version of the story was triggered by the emergence of RDMA as an option in general purpose data centers.  As you'll know if you've followed these postings, I'm an RDMA nut, but it isn't just me.  RDMA is one of those rare cases where a technology became very mature in a different setting (namely high-performance computing, where it serves as the main communications backbone for packages like the MPI library, and runs on InfiniBand networks), and yet somehow went mostly unnoticed by the general computing community.... until recently, when people suddenly "discovered" it and went bonkers.

As I've discussed previously, one reason that RDMA matured on InfiniBand and yet wasn't used on Ethernet centered on RDMA's seeming need for a credit-based I/O model, in which a sender NIC transfers data only if the receiver has buffer space waiting to receive it.   It is only recently that the emergence of RoCE allowed RDMA to jump to more standard datacenter environments.  Today, we do have RDMA in a relatively stable form within a set of racks sharing a TOR switch, assuming that the TOR switch has full bisection bandwidth.  But RDMA isn't yet equally mature across routers, particularly in a Clos/spine model where the core network is often oversubscribed and hence at risk of congestion.  There are competing schemes for deploying RDMA over RoCE v2 with PFC at data-center-wide scale (DCQCN and TIMELY), but work is still underway to understand how those can be combined with mechanisms like DiffServ or enterprise VLAN routing to avoid unwanted interactions between standard TCP/IP and the RDMA layer.  It may be a few years before we see "mature" datacenter-scale deployments in which RDMA runs side-by-side with TCP/IP in a non-disruptive way over RoCE v2 (in fact, by then we may be talking about RoCE v5)!  Until then, RDMA remains a bit delicate, and must be used with great care.

So why is RDMA even relevant to disaggregation?  There is a positive aspect to the answer, but it comes with a cautionary sidebar.  The big win is that RDMA has a simple API visible directly to the end-user program (a model built around "queue pairs" and "I/O verbs").  These operations are lock-free, and with the most current hardware they permit reliable DMA data transfers from machine to machine at speeds of up to 200Gbps and latencies as low as .75us.  The technology comes in several rather similar forms: Mellanox leads on the pure RDMA model (the company is reportedly being acquired and may soon carry the Xilinx name, by the way), while Intel is pushing a variant it calls Omni-Path, and other companies have other innovations.   These latencies are 10x to 100x better than what you see with TCP, and the data rates are easily 4x better than the fastest existing datacenter TCP solutions.  As RDMA pushes up the performance curve to 400Gbps and beyond, the gaps will get even larger.  So RDMA is the coming thing, even if it hasn't quite become commonplace today.
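
To give a feel for the verbs model, here is a minimal libibverbs sketch of posting a one-sided RDMA WRITE and spinning for its completion.  It assumes the queue pair has already been created and moved to the ready-to-send state, the memory region registered, and the peer's buffer address and rkey exchanged out of band; the helper name post_rdma_write is mine, not part of the API.

```c
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE and busy-poll the completion queue.
 * All of the setup (device open, PD, CQ, QP creation, the INIT->RTR->RTS
 * transitions, ibv_reg_mr, and the out-of-band exchange of remote_addr
 * and rkey) is assumed to have happened already and is omitted here. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *local_buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no receiver CPU involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;   /* busy-poll: this is where the sub-microsecond latencies come from */
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA WRITE failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}
```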

RDMA is also reliable in its point-to-point configuration, enabling it to replace TCP in many applications, and it offers relatively low latency remote memory access with cache-line atomicity.  Now, it should probably be underscored that with latencies of .75us at the low end and numbers more like 3us for more common devices, RDMA offers the most NUMA of NUMA memory models, but even so, the technology is very fast compared to anything we had previously.
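
The remote atomics go through the same work-request interface.  The atomic verbs operate on 8-byte words; here is a sketch of a remote compare-and-swap, assuming the same already-connected queue pair and registered memory region as in the sketch above.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Remote atomic compare-and-swap: if the 8-byte word at remote_addr equals
 * `expect`, replace it with `desired`; the previous remote value is written
 * into the 8-byte local buffer behind `mr`.  Assumes an RC queue pair that
 * is already in the RTS state, as in the earlier sketch. */
static int post_remote_cas(struct ibv_qp *qp, struct ibv_mr *mr,
                           uint64_t *result_buf,
                           uint64_t remote_addr, uint32_t rkey,
                           uint64_t expect, uint64_t desired)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,   /* old remote value lands here */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;   /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = expect;
    wr.wr.atomic.swap        = desired;
    return ibv_post_send(qp, &wr, &bad_wr);   /* completion polled as before */
}
```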

The cautions are the obvious ones.  First, RDMA on RoCE is really guaranteed to offer those numbers only if the endpoints live under a TOR switch with full bisection bandwidth.  Beyond that, we need to traverse a router that might experience congestion, and might even drop packets.  In these cases, RDMA is on slightly weaker ground.

A second caution is simply that applications can't really tolerate long memory-access delays, so this NUMA aspect becomes a conceptual barrier to casual styles of programming.   You absolutely can't view RDMA-accessible data-center memory as a gigantic form of local memory.  On the contrary, the developer needs to be acutely aware of potential delays, and hence needs full control over memory layout.  Today, we lack easy-to-use tools for that form of control, so massive RDMA applications are the domain of specialists with a great deal of arcane hardware insight.

So how does this story relate to disaggregation?  A first comment is that RDMA is just one of several new hardware options that force us to take control logic out of the main data path.  There are other examples: with the next generation of NVRAM storage (like Intel's Optane), you can DMA directly from a video camera into a persistent storage card, and with RDMA, you'll be able to do so even if the video isn't on the same machine as the place you plan to store the data.  With FPGA accelerators, we often pursue bump-in-the-wire placements: data streams at wire rates into a memory unit associated with the FPGA, the FPGA transforms it, and the general-purpose code may never need to touch the high-volume stream at all.  So the RDMA story points to a broader lack of abstractions covering a variety of data-path scenarios.  Moreover, our second caution expands into a broader statement: each of these RDMA-like technologies brings its own constraints and its own best-case use scenarios.

Meanwhile, the disaggregation folks are taking the story even further.  They start by arguing that we can reduce the power footprint of rack-scale computing systems if we build processors with minimal per-core memory and treat that memory more like a cache than like full-scale main memory.  Because DRAM runs quite hot and its heat dissipation is a function of its size, this can slash both the power draw and the heat generated.  Then, if the processor doesn't really need much memory, we're golden, and if it does, we can start to think about ways to intelligently manage the on-board cache, paging data in and out from remote memories hosted in racks dedicated to memory resources and shared by the processor racks (yes, this raises all sorts of protection issues, but that's the idea).  We end up with racks of processors, racks of memory units, racks of storage, racks of FPGAs, and so on.  Each rack achieves a much higher density of the corresponding devices, which brings some efficiencies too, and we also potentially gain multi-tenancy benefits if some applications are memory-intensive while others need relatively little DRAM, some use FPGAs and some don't, etc.

The main issue -- the obvious one -- is latency.  To appreciate the point, consider standard NUMA machines.  The bottom line, as you'll instantly discover should you ever run into it, is that on a NUMA machine, DRAM is split into a few modules, each supporting a couple of local cores, interconnected by a bus.  The issue is that although a NUMA machine can offer transparent memory sharing across DRAM modules, the costs are wildly different when a thread accesses the closest DRAM than when it reaches across the bus to a "remote" DRAM module (in quotes because this is a DRAM module on the same motherboard).  At best, you'll see a 2x delay: your local DRAM might run with 65ns memory-fetch latency, but a "remote" access could cost 125ns.  These delays balloon if there are also cache-coherency or locking delays: in such cases the NUMA hardware enforces the consistency or locking model, but it runs backplane protocols to do so, and those are costly.  NUMA latencies cause havoc.

Not long ago, researchers at MIT wrote a whole series of papers on optimizing Linux to run at a reasonable speed on NUMA platforms.

The cross-product of RDMA with NUMA-ness makes life even stranger.  Even on a single server, an RDMA transfer in or out will run in slow motion if the RDMA NIC happens to be far from the DRAM module the transfer touches.  You may even get timeouts, and I suspect there are cases where it would be faster to copy the data into the DRAM module closest to the NIC first, and only then run the RDMA transfer to or from that memory!
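
One obvious mitigation is to find out which NUMA node the NIC sits on and stage RDMA buffers there.  A small Linux/libnuma sketch of that idea follows; the device name mlx5_0 and the 1 GB buffer size are placeholders for whatever your system actually has.

```c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>            /* link with -lnuma */

/* Look up the NUMA node of an RDMA NIC (Linux exposes it in sysfs) and
 * allocate the RDMA staging buffer on that node, so transfers never have
 * to cross the inter-socket interconnect. */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    int nic_node = 0;
    FILE *f = fopen("/sys/class/infiniband/mlx5_0/device/numa_node", "r");
    if (f) {
        if (fscanf(f, "%d", &nic_node) != 1 || nic_node < 0)
            nic_node = 0;    /* some platforms report -1: fall back to node 0 */
        fclose(f);
    }

    size_t len = 1UL << 30;                        /* 1 GB staging buffer */
    void *buf = numa_alloc_onnode(len, nic_node);  /* backed by NIC-local DRAM */
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("staging buffer placed on NUMA node %d; register it with ibv_reg_mr()\n",
           nic_node);

    numa_free(buf, len);
    return 0;
}
```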

This is why those of us who teach courses on modern NUMA and OS architectures are forced to stress that writing shared-memory multicore code can be a terrible mistake from a performance point of view.  At best, you need to take full control: you'll have to pin your threads to a set of cores attached to a single DRAM unit (this helps: you'll end up with faster code, but it will be highly architecture-specific, and on some other NUMA system it might perform poorly).
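
Here is roughly what "taking full control" looks like on Linux with libnuma: restrict the thread to one node's cores and keep its working set in that node's DRAM.  Node 0 is an arbitrary choice, which is exactly the kind of architecture-specific hard-coding I'm complaining about.

```c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>            /* link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    int node = 0;                        /* arbitrary, architecture-specific choice */
    if (numa_run_on_node(node) != 0) {   /* schedule this thread only on node 0's cores */
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(node);            /* prefer node 0 for future page allocations */

    /* Threads created from here on inherit the CPU mask, and the working
     * set stays in node-local DRAM -- the ~65ns case rather than the
     * ~125ns cross-socket case described above. */
    size_t len = 256UL << 20;            /* 256 MB working set */
    char *buf = malloc(len);
    if (!buf)
        return 1;
    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;                      /* touch pages so they are actually placed */

    printf("running on NUMA node %d with node-local memory\n", node);
    free(buf);
    return 0;
}
```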

In fact NUMA is probably best viewed as a hardware feature ideal for virtualized systems with standard single-threaded programs (and this is true both for genuine virtualization and for containers). Non-specialists should just avoid the model!

So here we have this new disaggregation trend, pushing for the most extreme form of NUMA-ness imaginable.  With a standard Intel server, local DRAM is accessible with a latency of about 50-75ns.  A non-local access jumps to 125ns, while RDMA could easily exceed 750ns, and 1500ns would not be unusual even for a nearby node reachable in a single TOR-switch hop.  (Optical networking could help, driving this down to perhaps 100ns, but that is still far in the future and involves many assumptions.)

There may be an answer, but it isn't obvious people will like it: we will need a new programming style that systematically separates data flows from control logic, so that this kind of out-of-band coding becomes easier to pull off.  Why a new programming style?   Well, today's prevailing style of code centers on a program that "owns" the data and holds it in locally malloc-ed memory: I might write code to read frames of video, then ask my GPU to segment the data -- and the data transits through my DRAM (in fact the GPU would often use my DRAM as its memory source).  Similarly for FPGAs.  In a disaggregated environment, we would give the GPU or the FPGA memory of its own, then wire it directly to the data source, and control all of this from an out-of-band control program running on a completely different processor.  The abstractions needed for that style of computing are simply lacking in most of today's programming languages and systems.  You do see elements of the story -- for example, TensorFlow can be told to run portions of its dataflow graph on a designated TPU cluster -- but I would argue that we haven't yet found the right OS and programming-language coding styles to make this easy at data-center scale.

Worse, we need to focus on applications that are inherently latency-tolerant: otherwise, these latency barriers will translate into a slowdown for the end user.  To avoid that, we need a way of coding that promotes asynchronous flows and doesn't require a lot of hopping around, RPCs, or locking.  And I'll simply say that we lack that style of coding now -- we haven't really needed it, and work on this model hasn't really taken off.  So this is a solvable problem, for sure, but the solution isn't there yet.

Beyond all this, we have the issues that Mach ran into.  If we disaggregate, all of those same questions will arise: RPC on the critical path, context switch delays, mismatch of the hardware MMU models with the new virtualized disaggregated process model, permissions, you name it.  Today's hardware is nothing like the hardware when Mach was created, and I bet there are all sorts of cool new options.  But it won't be trivial to figure them all out.

I have colleagues who work in this area, and I think the early signs are very exciting.  It looks entirely plausible that the answers are going to be there, perhaps in five years and certainly in ten.  The only issue is that the industry seems to think disaggregation is already happening, and yet those answers simply aren't in hand today.

It seems to me that we also risk an outcome where researchers solve the problem, and yet the intended end-users just don't like the solution.  The other side of the TensorFlow story seems to center on the popularity of just using languages everyone already knows, like Python or Java, as the basis for whatever it is we plan to do.  But Python has no real concept of resource location: TensorFlow takes Python and bolts on the concept of execution in some specific place, whereas here we are talking about flows, execution contexts, memory management, you name it.  So you run some risk that smart young researchers will demonstrate amazing potential, but that the resulting model won't work for legacy applications (unless a miracle occurs and someone finds a way to start from a popular framework like TensorFlow and "compile" it down to the needed abstractions).  Can such a thing be done?  Perhaps so!  But when?  That strikes me as a puzzle.

So I'll stop here, but I will say that disaggregation seems like a cool and really happening opportunity: definitely one of the next big deals.   Researchers are wise to jump in.  But people headed into industry, who are likely to be technology users rather than inventors, need to appreciate that it is still too early to say what will be successful, and when, and definitely too early for them to get excited.  That was the bottom-line message in the lecture I just gave on this topic!

2 comments:

  1. This is one of the best blog articles I have read recently: it connects the pros and cons of NUMA-ness with today's cloud/micro-service computing and ML frameworks very nicely. And the writing is so good that I was able to read the whole article and understand it quickly -- exciting to read again and again.

  2. As mentioned in the blog, I think the main problem is creating a simple and efficient language to program it -- languages such as CUDA and frameworks such as TensorFlow are still difficult to program. It would be fantastic to have a language as approachable as Java for programming all the different hardware accelerators in distributed cloud/big-data/ML systems.

