Tuesday, 27 November 2018

So what's the story about "disaggregated" cloud computing?

The hardware community has suddenly gone wild over a new idea they call "disaggregated" cloud computing infrastructures.  I needed to talk about this in my graduate class, so I've been reading some of the papers and asking around.

If you come at the topic from distributed/cloud computing, as I do, you may have been using disaggregation to refer to a much older trend, namely the "decoupling" of control software from the hardware that performs various data and compute-intensive tasks that would otherwise burden a general purpose processor.  And this is definitely a legitimate use of the term, but as it turns out, the architecture folks are focused on a different scenario: for them, disaggregation relates to a new way of building your data center compute nodes in which you group elements with similar functionalities, using some form of micro-controller with minimal capabilities to manage components that handle storage, memory, FPGA devices, GPU, TPU, etc.  Recent work on quantum computing uses a disaggregated model too: the quantum device is like an experimental apparatus that a control program carefully configures into a specific state, then "runs", and then collects output from.

The argument is that disaggregation reduces wasted power by supporting a cleaner form of multitenancy.  For example, consider DRAM.  The “d” stands for dynamic, meaning that power is expended to keep every bit continuously refreshed.  This is a costly, heat-intensive activity, and yet your DRAM may not even be fully used (studies suggest that 40-50% would be pretty good).  If all the DRAM was in some shared rack, bin-packing could easily give you >95% utilization on the active modules, plus you could power down any unused modules.

So disaggregation as a term really can refer to pretty much any situation in which some resources are controlled from programs running "elsewhere", and where you end up with a distributed application that conceptually is a single thing, but has multiple distinct moving parts, with different subsystems specialized for different roles and often using specialized hardware as part of those roles.

At this point I'm seeing the idea as a spectrum, so perhaps it would make sense to start with the older style of control-plane/data-plane separation, say a few words about the value of such a model, and then we can look more closely at this extreme form and ask how it fits into the picture.

Control/data disaggregation is really pretty common idea and dates back quite far: even the earliest IBM mainframes had dedicated I/O coprocessors to operate disks and tapes, and talked about "I/O channels".   I could easily list five or ten other examples that have popped up during the past decade or two: very few devices lack some form of general purpose on-board computer, and we can view any such system as a form of disaggregated hardware. But the one that interests me most is the connection to Mach, at the stage when the project was publishing very actively at SOSP and was run from CMU, with various people heading it: Rick Rashid (who ultimately left to launch Microsoft Research), then Brian Bershad, and then Tom Anderson.

Mach was an amazing system, but what I find fascinating is that it happens to also have anticipated many aspects of this new push into disaggregation.  As you perhaps will recall (from all those Mach papers you've read...) the basic idea was to have a single micro-kernel that would host various O/S presentation frameworks: a few versions of Linux, perhaps versions of legacy IBM operating systems, etc.  Internally, one of those frameworks consisted of a set of processes sharing DLLs, some acting as servers and others as applications.  Memory was organized into segments with read/write/execute permissions, and all the underlying interactions centered on an ultra-fast form of RPC in which a single CPU core could run an application thread in segment A, then briefly suspend that thread and activate a thread context in segment B (perhaps A was an application and B is the file server, for example), run in B for a short period, then return to A.  The puzzle was to handle permissions (the file server can see the raw disk, but A shouldn't be allowed to, and conversely, the file server can DMA from a specific set of memory areas in A, but shouldn't be able to scan A's memory for passwords).  Then because all of this yielded a really huge address space, Mach ran into page table fragmentation issues that were ultimately solved with a novel software inverted page table.  And capabilities were used for the low-level message passing primitives.  Really cool stuff!

As you might expect, all of these elements had costs: the messaging layer, the segmentation model, context switching, etc.  Mach innovated by optimizing those steps, but keep these kinds of costs in mind, because disaggregation will turn out to run into similar issues.

Which brings us to today!  The modern version of the story was triggered by the emergence of RDMA as an option in general purpose data centers.  As you'll know if you've followed these postings, I'm an RDMA nut, but it isn't just me.  RDMA is one of those rare cases where a technology became very mature in a different setting (namely high-performance computing, where it serves as the main communications backbone for packages like the MPI library, and runs on InfiniBand networks), and yet somehow went mostly unnoticed by the general computing community.... until recently, when people suddenly "discovered" it and went bonkers.

As I've discussed previously, one reason that RDMA matured on Infiniband and yet wasn't used on Ethernet centered on RDMA's seeming need for a special style of credit-based I/O model, in which a sender NIC would only transfer data if the receiver had space waiting to receive the data.   It is only recently that the emergence of RoCE allowed RDMA to jump to more standard datacenter environments.  Today, we do have RDMA in a relatively stable form within a set of racks sharing a TOR switch, assuming that the TOR switch has full bisection bandwidth.  But RDMA isn't yet equally mature across routers, particularly in a COS/Spine model where the core network would often be oversubscribed and hence at risk of congestion.  There are several competing schemes for deploying RDMA over RoCE v2 with PPF at data-center-wide scale (DCQCN and TIMELY), but work is still underway to understand how those can be combined with Diffsrv or Enterprise VLAN routing) to avoid unwanted interactions between standard TCP/IP and the RDMA layer.  It may be a few years before we see "mature" datacenter-scale deployments in which RDMA runs side-by-side with TCP/IP in a non-disruptive way over RoCE v2 (in fact, by then we may be talking about RoCE v5)!  Until then, RDMA remains a bit delicate, and must be used with great care.

So why is RDMA even relevant to disaggregation?  There is a positive aspect to the answer, but it comes with a cautionary sidebar. The big win is that RDMA has a simple API visible directly to the end-user program (a model called "qpairs" that focuses on "I/O verbs".  These are lock-free, and with the most current hardware, permit reliable DMA data transfers from machine to machine at speeds of up to 200Gbps, and latencies as low as .75us.  The technology comes in several forms, all rather similar: Mellanox leads on the pure RDMA model (the company has been acquired and will soon be renamed Xylinx, by the way), while Intel is pushing a variant they call OMNI-Path, and other companies have other innovations.   These latencies are 10x to 100x better than what you see with TCP, and the data rates are easily 4x better than the fastest existing datacenter TCP solutions.  As RDMA pushes up the performance curve to 400Gbps and beyond, the gaps will get even larger.  So RDMA is the coming thing, even it hasn't quite become commonplace today.

RDMA is also reliable in its point-to-point configuration, enabling it to replace TCP in many applications, and also offers a way to support relatively low latency remote memory access with cache-line atomicity.  Now, it should probably be underscored that with latencies of .75us at the low end to numbers more like 3us for more common devices, RDMA offers the most NUMA of NUMA memory models, but even so, the technology is very fast compared to anything we had previously.

The cautions are the obvious ones.  First, RDMA on RoCE is really guaranteed to offer those numbers only if the endpoints live under a TOR switch that can support full bidirectional communication.  Beyond that, we need to traverse a router that might experience congestion, and might even drop packets.  In these cases, RDMA is on slightly weaker ground.

A second caution is simply that applications can't really handle long memory-access delays, so this NUMA thing becomes a conceptual barrier to casual styles of programming.   You absolutely can't view RDMA data-center memory as a gigantic form of local memory.  On the contrary, the developer needs to be acutely aware of potential delays, and hence will need to have full control over memory layout. Today, we lack really easily-used tools for that form of control, so massive RDMA applications are the domain of specialists who have a great deal of arcane hardware insight.

So how does this story relate to disaggregation?  A first comment is that RDMA is just one of several new hardware options that force us to take control logic out of the main data path.  There are a few other examples: with the next generation of NVRAM storage (like Intel's Optane), you can DMA directly from a video camera into a persistent storage card, and with RDMA, you'll be able to do so even if the video isn't on the same machine as the place you plan to store the data.  With FPGA accelerators, we often pursue bump-in-the-wire placements: data streams into a memory unit associated to the FPGA at wire rates, it does something, and then we might not even need to touch the main high-volume stream from general purpose code, at all.  So the RDMA story is pointing to a broader lack of abstractions covering a variety of data-path scenarios.  Moreover, our second caution expands to a broader statement: these different RDMA-like technologies may each bring different constraints and different best-case use scenarios.

Meanwhile, the disaggregation folks are taking the story even further: they start by arguing that we can reduce the power footprint of rack-scale computing systems if we build processors with minimal per-core memory and treat that memory more like a cache than as a full-scale memory.  Because memory runs quite hot but the heat dissipation is a function of the memory size, this can slash the power need and heat generation.  Then if the processor didn't really need much memory, we are golden, and if it did, we can start to think about ways to intelligently manage the on-board cache, paging data in and out from remote memories hosted in racks dedicated to memory resources and shared by the processor racks (yes, this raises all sorts of issues of protection, but that's the idea).  We end up with racks of processors, racks of memory units, racks of storage, racks of FPGA, etc.  Each rack will achieve a much higher density of the corresponding devices, which brings some efficiencies too, and we also potentially have multi-tenancy benefits if some applications are memory-intensive but others need relatively little DRAM, some use FPGA but some don't, etc.

The main issue -- the obvious one -- is latency.  To appreciate the point, consider standard NUMA machines.  The bottom line, as you'll instantly discover should you ever run into it, is that on a NUMA machine, DRAM is split into a few modules, each supporting a couple of local cores, but interconnected by a bus.  The issue is that although a NUMA machine is capable of offering transparent memory sharing across DRAM modules, the costs are wildly different when a thread accesses the closest DRAM than when it reaches across the bus to access a "remote" DRAM module (in quotes because this is a DRAM module on the same motherboard).  At the best, you'll see a 2x delay: your local DRAM might run with 65ns memory fetch speeds, but a "remote" access could cost 125ns.  These delays will balloon if there are also cache-coherency or locking delays: in such cases the NUMA hardware enforces the consistency or locking model, but it runs backplane protocols to do so, and those are costly.  NUMA latencies cause havoc.

Not long ago, researchers at MIT wrote a whole series of papers on optimizing Linux to run at a reasonable speed on NUMA platforms.

The cross-product of RDMA with NUMA-ness makes life even stranger.  Even when we consider a single server, an RDMA transfer in or out will run in slow motion if the RDMA NIC happens to be far from the DRAM module the transfer touches. You may even get timeouts, and I wonder if there aren't even cases where it would be faster to just copy the data to the local DRAM first, and then use RDMA purely from and to the memory unit closest to the NIC!

This is why those of us who teach courses related to modern NUMA and OS architectures are forced to stress that writing shared memory multicore code can be a terrible mistake from a performance point of view.  At best, you need to take full control: you'll have to pin your threads to a set of cores to some single DRAM unit (this helps: you'll end up with faster code.  But... it will be highly architecture-specific and on some other NUMA system, it might perform poorly).

In fact NUMA is probably best viewed as a hardware feature ideal for virtualized systems with standard single-threaded programs (and this is true both for genuine virtualization and for containers). Non-specialists should just avoid the model!

So here we have this new disaggregation trend, pushing for the most extreme form of NUMA-ness imaginable.  With a standard Intel server, local DRAM will be accessible with a latency of about 50-75ns.  Non-local jumps to 125ns, but RDMA could easily exceed 750ns, and 1500ns might not be unusual even for a nearby node, accessible with just one TOR switch hop.  (Optical networking could help, driving this down to perhaps 100ns, but that is still far in the future and involves many assumptions.)

There may be an answer, but it isn't obvious people will like it: we will need a new programming style that systematically separates data flows from control, so that this new style of out of band coding would be easier to pull off.  Why a new programming style?   Well, today's prevailing style of code centers on a program that "owns" the data and holds it in locally malloc-ed memory: I might write code to read frames of video, then ask my GPU to segment the data -- and the data transits through my DRAM (in fact the GPU would often use my DRAM as its memory source).  Similarly for FPGA.  In a disaggregated environment, we would give the GPU or the FPGA memory of its own, then wire it directly to the data source, and control all of this from an out of band control program running on a completely different processor.  The needed abstractions for that style of computing are simply lacking in most of today's programming languages and systems.  You do see elements of the story -- for example, Tensor Flow can be told to run portions of its graph on a designated TPU cluster -- but I would argue that we haven't yet found the right OS and programming language coding styles to make this easy at data-center scale.

Worse, we need to focus on applications that would be inherently latency-tolerant: latency barriers will translate to a slowdown for the end user.  To avoid that, we need a way of coding that promotes asynchronous flows and doesn't require a lot of hopping around, RPCs or locking.  And I'll simply say that we lack that style of coding now -- we haven't really needed it, and the work on this model hasn't really taken off.  So this is a solvable problem, for sure, but it isn't there yet.

Beyond all this, we have the issues that Mach ran into.  If we disaggregate, all of those same questions will arise: RPC on the critical path, context switch delays, mismatch of the hardware MMU models with the new virtualized disaggregated process model, permissions, you name it.  Today's hardware is nothing like the hardware when Mach was created, and I bet there are all sorts of cool new options.  But it won't be trivial to figure them all out.

I have colleagues who work in this area, and I think the early signs are very exciting.  It looks entirely plausible that the answers are going to be there, perhaps in five years and certainly in ten.  The only issue is that the industry seems to think disaggregation is already happening, and yet those answers simply aren't in hand today.

It seems to me that we also risk an outcome where researchers solve the problem, and yet the intended end-users just don't like the solution.  The other side of the Tensor Flow story seems to center on the popularity of just using languages everyone already knows, like Python or Java, as the basis for whatever it is we plan to do.  But Python has no real concept of resource locations: Tensor Flow takes Python and bolts on the concept of execution in some specific place, but here we are talking about flows, execution contexts, memory management, you name it.  So you run some risk that smart young researchers will demonstrate amazing potential, but that the resulting model won't work for legacy applications (unless a miracle occurs and someone finds a way to start with a popular language like Tensor Flow and "compile" it to the needed abstractions).  Can such a thing be done?  Perhaps so!  But when?  That strikes me a puzzle.

So I'll stop here, but I will say that disaggregation seems like a cool and really happening opportunity: definitely one of the next big deals.   Researchers are wise to jump in.  But people headed into the industry, though, who are likely to be technology users rather than inventors, need to appreciate that it is still too early to speculate about what might be successful, and when, and definitely too early for them to get excited: the bottom line message in the lecture I just gave on this topic!

Friday, 16 November 2018

Forgotten lessons and their relevance to the cloud

While giving a lecture in my graduate course on the modern cloud, and the introduction of sensors and hardware accelerators into machine-learning platforms, I had a sudden sense of deja-vu.

In today's cloud computing systems, there is a tremendous arms-race underway to deploy hardware as a way to compute more cost-effectively, more data faster, or simply to offload very repetitious tasks into specialized subsystems that are highly optimized for those tasks.  My course covers quite a bit of this work: we look at RDMA, new memory options, FPGA, GPU and TPU clusters, challenges of dealing with NUMA architectures and their costly memory coherence models, and similar topics.  The focus is nominally on the software gluing all of this together ("the future cloud operating system") but honestly, since we don't really know what that will be, the class is more of a survey of the current landscape with a broad agenda of applying it to emerging IoT uses that bring new demands into the cloud edge.

So why would this give me a sense of deja-vu?  Well, grant me a moment for a second tangent and then I'll link my two lines of thought into a single point.  Perhaps you recall the early days of client-server computing, or the early web.  Both technologies took off explosively, only to suddenly sag a few years later as a broader wave of adoption revealed deficiencies.

If you take client-server as your main example, we had an early period of disruptive change that you can literally track to a specific set of papers: Hank Levy's first papers on the Vax Cluster architecture when he was still working with DEC.  (It probably didn't hurt that Hank was the main author: in systems, very few people are as good as Hank at writing papers on topics of that kind.)  And in a few tens of pages, Hank upended the mainframe mindset and introduced us to this other vision: clustered computing systems in which lots of components somehow collaborate to perform scalable tasks, and it was a revelation.  Meanwhile, mechanisms like RPC were suddenly becoming common (CORBA was in its early stages), so all of this was accessible.  For people accustomed to file transfer models and batch computing, it was a glimpse of the promised land.

But the pathway to the promised land turned out to be kind of challenging.  DEC, the early beneficiary of this excitement, got overwhelmed and sort of bogged down: rather than being a hardware company selling infinite numbers of VAX clusters (which would have made them the first global titan of the industry), they somehow got dragged further and further into a morass of unworkable software that needed a total rethinking.  Hank's papers were crystal clear and brilliant, but a true client-server infrastructure needed 1000x more software components, and not everyone can function at the level Hank's paper more or less set as the bar.  So, much of the DEC infrastructure was incomplete and buggy, and for developers, this translated to a frustrating experience: a fast on-ramp followed by a very bumpy, erratic experience.  The ultimate customers felt burned and many abandoned DEC for Sun Microsystems, where Bill Joy managed to put together a client-server "V2" that was somewhat more coherent and complete.  Finally, Microsoft swept in and did a really professional job, but by then DEC had vanished entirely, and Sun was struggling with its own issues of overreach.

I could repeat this story using examples from the web, but you can see where I'm going: early technologies, especially disruptive, revolutionary ones, often take on a frenetic life of their own that can get far ahead of the real technical needs.  The vendor then becomes completely overwhelmed and unless they somehow can paper over the issues, collapses.

Back in that period, a wonderful little book came out on this: Crossing the Chasm, by Geoffrey Moore.  It talked about how technologies often have a bumpy adoption curve.  Moore talks about an adoption curve over time.  The first bump is associated with the early adopters (the kind of people who live to be the first to use a new technology, back before it even becomes stable).  But conservative organizations prefer to be "first to be last" as David Bakken says.  They hold back waiting for the technology to mature and hoping that they can avoid the pain but also not miss the actual surge of mainstream adoption.  Meanwhile, the pool of early adopters dries up and some of them wander off for the next even newer thing, so you see the adoption curve sag, perhaps for years.  Wired writes articles about the "failure of client server" (well, back then it would have been ComputerWorld).

Finally, for the lucky few, the really sustainable successes, you see a second surge in adoption and this one would typically play out over a much longer period, without sagging in that same way, or at least not for many years.  So we see a kind of S-curve, but with a bump in the early period.

All of which leads me back to today's cloud and this craze for new accelerators.  When you consider any one of them, you quickly discover that they are extremely hard devices to program. FPGA pools in Microsoft's setting, for example, are clearly going be expert-only technologies (I'm thinking about the series of papers associated with Catapult).  It is easy to see why a specialized cloud micro-service might benefit, particularly because the FPGA performance to power-cost ratio is quite attractive.  Just the same, though, creation of an FPGA is really an experts-only undertaking.  Anyhow, a broken FPGA could be quite disruptive to the data center.  So we may see use of these pools by subsystems doing things like feature ranking for Bing search, crypto for the Microsoft Azure VPC, or data compression and other similar tasks in Cosmos.  But I don't imagine that my students here at Cornell will be creating new services with new FPGA accelerators anytime soon.

GPU continues to be a domain where CUDA programming dominates every other option.  This is awesome for the world's CUDA specialists, and because they are good at packaging their solutions in libraries we can call from the attached general purpose machine, we end up with great specialized accelerators for graphics, vision, and similar tasks.  In my class we actually do read about a generalized tool for leveraging GPU: a language invented at MSR called Dandelion. The real programming language was easy: C# with LINQ, about as popular a technology as you could name.  Then they mapped the LINQ queries to GPU, if available.  I loved that idea... but Dandelion work stalled several years ago without really taking off in a big way.

TPU is perhaps easier to use: With Google's Tensor Flow, the compiler does the hard work (like with Dandelion), but the language is just Python.  To identify the objects a TPU could compute on, the whole model focuses on creating functions that have vectors or matrices or higher-dimensional tensors as their values.  This really works well and is very popular, particular on a NUMA machine with an attached TPU accelerator, particularly for Google's heavy lifting in their ML subsystems.  But it is hard to see Tensor Flow as a general purpose language or even as a general purpose technology.

And the same goes with research in my own area.  When I look at Derecho or Microsoft's FaRM or other RDMA technology, I find it hard not to recognize that we are creating specialist solutions, using RDMA in sophisticated ways, and supporting extensible models that are probably best viewed as forms of PaaS infrastructures even if you tend to treat them as libraries.  They are sensational tools for what they do.  But they aren't "general purpose".  (For distributed computing, general purpose might lead you to an RPC package like the OMG's IDL-based solutions, or to REST, or perhaps to Microsoft's WCF).

So where does this leave us?  Many people who look at the modern cloud are predicting that the cloud operating system will need to change in dramatic ways.  But if you believe that difficulty of use and fragility and lack of tools makes the accelerators "off limits" except for a handful of specialists, and that the pre-built PaaS services will ultimately dominate, than what's wrong with today's micro-service models?  As I see it, not much: they are well-supported, scale nicely (although some of the function-server solutions really need to work on their startup delays!), and there are more and more recipes to guide new users from problem statement to a workable, scalable, high performance solution.  These recipes often talk to pre-built microservices and sure, those use hardware accelerators, but the real user is shielded from their complexity.  And this is a good thing, because otherwise, we would be facing a new instance of that same client-server issue.

Looking at this as a research area, we can reach some conclusions about how one should approach research on the modern cloud infrastructure.

A first observation is that the cloud has evolved into a world of specialized elastic micro-services and that the older style of "rent a pile of Linux machines and customize them" is fading slowly into the background.  This makes a lot of sense, because it isn't easy to end up with a robust, elastic solution.  Using a pre-designed and highly optimized microservice benefits everyone: the cloud vendor gets better performance from the data center and better multi-tenancy behavior, and the user doesn't have to reinvent these very suble mechanisms.

A second is that specialized acceleration solutions will probably live mostly within the specialized microservices that they were created to support.  Sure, Azure will support pools of FPGAs.  But those will exist mostly to speed up things like Cosmos or Bing, simply because using them is extremely complex, and misusing them can disrupt the entire cloud fabric.  This can also make up for the dreadful state of the supporting tools for most if not all cloud-scale elastic mechanisms.  Like early client-server computing, home-brew use of technologies like DHTs, FPGA and GPU and TPU accelerators, RDMA, Optane memory -- none of that makes a lot of sense right now.  You could perhaps pull it off, but more likely, the larger market will reject such things... except when they result in ultra-fast, ultra-cheap micro-services that they can treat as black boxes.

A third observation is that as researchers, if we hope to be impactful, we shouldn't fight this wave.  Take my own work on Derecho.  Understanding that Derecho will be used mostly to build new microservices helps me understand how to shape the APIs to look natural to the people who would be likely to use it.  Understanding that those microservices might be used mostly from Azure's function server or Amazon's AWS Lambda, tells me what a typical critical path would look like, and can focus me on ensuring that this particular path is very well-supported, leverages RDMA at every hop if RDMA is available, lets me add auto-configuration logic to Derecho based on the environment one would find at runtime, etc.

We should also be looking at the next generation of applications and by doing so, should try to understand and abstract by recognizing their needs and likely data access and computational patterns.  On this, I'll point to work like the new paper on Ray from OSDI: a specialized microservice for a style of computing common in gradient-descent model training, or Tensor Flow: ultimately, a specialized microservice for leveraging TPUs, or Spark: a specialized microservice to improve the scheduling and caching of Hadoop jobs.  Each technology is exquisitely matched to context, and none can simply be yanked out and used elsewhere.  For example, you would be unwise to try and build a new Paxos service using Tensor Flow: it might work, but it wouldn't make a ton of sense.  You might manage to publish a paper, but it is hard to imagine such a thing having broad impact.  Spark is just not an edge caching solution: it really makes sense only in the big-data repositories where the DataBricks product line lives.  And so forth.