Sunday, 5 November 2017

Disaggregating data centers

One of the hot new buzz-phrases in the systems community is "disaggregation", as in "the future data center will be disaggregated".

Right now, we tend to build data centers by creating huge rack-mounted collections of computers that often run the same operating systems we use on the desktop: versions of Linux, perhaps virtualized (either with true VMs or via container models, like Docker on Mesos).  The core challenge centers on guaranteeing good performance while sharing resources in ways that optimize power and other costs.  The resulting warehouse-size systems can easily contain tens or hundreds of thousands of racked machines.

Researchers pressing for a disaggregated model generally start by pointing out that we really have a lot of unused resource in these settings.  For example, there might be large numbers of machines equipped with GPU coprocessors that aren't currently running any applications that use them (or ones that are FPGA-capable, but don't actually need FPGA support right now), machines with lots of spare DRAM or SSD space, etc.  You can make a long list of such resources: other forms of ASICs (such as for high-speed SHA3 hashing), accelerators for tasks like DFFT computation, DRAM, unused cores, NICs that actually have computers built into them and hence could be used as network-layer coprocessors, switching and routers with programmable capabilities, massive storage systems with ARM or similar compute power down near the medium, etc.

So, the proposal goes, why not shift towards a greater degree of sharing, where your spare resources could be available to my machine?

With RDMA and RDMA-enabled persistent memory (PMEM) standards starting to work properly in settings that long-ago standardized around Ethernet and IP (RDMA on Infiniband is just not a hugely popular technology in data centers, although Azure HPC offers it), it makes sense to talk about sharing these spare resources in more ambitious ways.  My spare DRAM could be loaned out to you to expand your in-memory cache.  Your unused FPGA could become part of my 1000-FPGA deep neural network service.  There are more and more papers about doing these kinds of things; my class this fall has been reading and discussing a bunch of them.  In our work on Derecho, we've targeted  opportunities in this space.  Indeed, Derecho could be viewed as an example of the kind of support disaggregated computing might require.

But deciding when disaggregation would be beneficial is a hard problem.  A remote resource will never have the sorts of ultra-low latency possible when a task running locally accesses a local resource, and studies of RDMA in large settings show surprisingly large variability in latency, beyond which one sees all sorts of congestion and contention-related throughput impacts.  So the disaggregated data center is generally going to be a NUMA environment: lots of space resources may be within reach, but unless care is taken to share primarily with nearby nodes  (perhaps just one hop away), you would need to anticipate  widely varying latency and some variability in throughput: orders of magnitude differences depending on where something lives.  Get the balance right and you can pull off some really nifty tricks.  But if you accidentally do something that is out of balance, the performance benefit can quickly evaporate.

Even with variability of this kind, a core issue centers on the shifting ratio of latency to bandwidth.  In a standard computer, talking to a local resource, you'll generally get very high bandwidth (via DMA) coupled with low latency.  You can do almost as well when running in a single rack, talking to a nearby machine over a single hop via a single TOR switch.  But when an application reaches out across multiple hops in a data-center network, the latency, and variability of latency, grow sharply. 

The bandwidth story is more complex.  Although modern data centers have oversubscribed leaf and  spine switches and routers, many applications are compute intensive, and it would be rare to see long periods with heavy contention.  Thus, even if latency becomes high, bandwidth might not be lost.

In our experiments with RDMA we've noticed that you often can get better bandwidth when talking to a remote resource than to local ones.  This is because with two machines connected by a fast optical link, two memory modules can cooperate to transfer data "in parallel", with one sending and one receiving.  The two NICs will need to get their DMA engines synchronized, but this works well, hence "all" their capacity can be dedicated (provided that the NICs have all the needed data for doing the transfer in cache). 

In contrast, on a single machine, a single memory unit is asked to read data from one place and write it to some other place.  This can potentially limit the operation to half its potential speed, simply because with one core and one memory unit, it is possible to "overload" one or both.

We thus confront a new balancing act: a need to work with applications that run over hardware with relatively high latency coupled to very high bandwidth.  In my view, this shifting ration of latency to bandwidth is a game-changer: things that worked well in classic systems will need to be modified in order to take advantage of it.  The patterns of communication that work best in a disaggregated world are ones dominated by asynchronous flows: high-rate pipelines that move data steadily with minimal delays: no locking, no round-trip requests where a sender has to pause waiting for responses. 

As you  know from these blogs, almost all of my work is on distributed protocols.  Mapping these observations to my area, you'll quickly realize that in modern settings, you really don't want to run 2-phase commit or similar protocols.  Traditional versions of Paxos or BFT are a performance disaster.  In fact, this is what led us to design Derecho in the way we ultimately settled upon: the key challenge was to adapt Paxos to run well in this new setting.  It wasn't easy and it isn't obvious how to do this in general, for some arbitrary distributed algorithm or protocol.

So the puzzle, and I won't pretend to have any really amazing ideas for how to solve it, centers on the generalized version of this question.  Beyond the protocols and algorithms, how does Linux need to evolve for this new world?  What would the ideal O/S look like, for a disaggregated world?

As a concrete example, think about a future database system.  Today's database systems mostly work on a read, compute, output model: you start by issuing a potentially complex query.  Perhaps, it asks  the system to make a list of open orders that have been pending for more than six weeks and involve a supplier enrolled in a special "prioritized orders" program, or some other fairly complex condition. 

The system reads the relations containing the data, performs a sequence of joins and selects and projects, and eventually out comes the answer.  With small data sets, this works like a charm.

But suppose that the database contains petabytes of data spread over thousands of storage systems.
Just reading the data could saturate your NIC for days!  In such a situation, you may need to find new ways to run your query.  You would think about disaggregation if the underlying hardware turns out to be reasonably smart.  For example, many modern storage devices have a local processor, run Linux, and can run a program on your behalf.

Clearly you want the database operations to be pushed right into the storage fabric, and would want to harness all that compute power.  You might want to precompute various indices or data samples, so that when the query is issued, the work has partly been done in advance, and the remaining work can happen down in the layer that hosts the data.

But of course I'm a systems person, so my system-centric question, posed in this context, is this: what will you want as an O/S for that environment?  And more to the point: how will that O/S differ from todays’s cloud systems?

If you are unsure how to think of such a question, consider this: if you wanted to implement a disaggregated database today, how would you even convince the storage layer to run these little add-on programs?  In a purely mechanical sense: how does one run a program on a modern SSD storage unit?  I bet you have absolutely no idea.

If you wanted to do everything by hand, logging into the storage servers one by one is an option.  Today, it might be the only option, and the environment you would find inside them could be pretty arcane.  Each vendor has its own weird variation on the concept. 

But we could change this.  We could create a new kind of operating system just for such tasks.  The power of an operating system is that it can impose standardization and increase productivity through portability, easy access paths, good debugging support, proper integration of the needed tools, virtualization for sharing, etc.  You would lack all of those features, today, if you set out to build this kind of database solution.

I mentioned at the start that data centers are optimized to minimize wasted resources.  One would need to keep an eye on several metrics in solving this database challenge using a disaggregated model.  For example, if DRAM on my machine gets loaned out to your machine, did the total performance of the data center as a whole improve, or did your performance improve at my expense?  Did the overall power budget just change in some way that the data center owner might need to think about?  How did this shift in resources impact the owner's ability to manage (to schedule) resources, so as to match profiles of intended level of resources for each class of job?

Then there are lower level questions of what the needed technical enablers would be.  I happen to believe that Derecho could become part of the answer -- this is one of the motivating examples we thought about when we developed the system, and one of my main reasons for writing these thoughts down in this blog.  But we definitely don't have the full story in hand yet, and honestly, it isn't obvious to me how to proceed!  There are a tremendous number of choices, options, and because the vendors don't have any standards for disaggregated computing, a total lack of. existing structure!  It’s going to be the Wild West, for a while... but of course, as a researcher, that isn't necessarily a bad news story...