Tuesday 28 November 2017

Dercho discussion thread

 As requested by Scott...  Scott, how about reposting your questions here?
===
Hi Ken.

Thanks very much for the info. I wasn't familiar with libfabric. Since I don't have access to RDMA hardware right now, I think I will wait for Derecho over libfabric. How can I know when it is available?

In the meantime, a question: you mentioned in one of your blog posts that it would be difficult to use languages other than C++ with the Derecho replicated object API. Could you comment on why that would be? E.g., relative to Python or Java, say.

To share one thought: the OSGi (Java) concept of a service... or rather a remote service... is conceptually similar. OSGi services are plain ol' object instances that are managed by a local broker (a.k.a. the service registry) and accessed through interfaces.

One thing that makes OSGi services different from ordinary Java objects is support for dynamics... i.e., OSGi services can come and go at runtime.

Which brings me to another question: is it possible for replicated object instances to be created and destroyed at runtime within a Derecho process group?

BTW, if there is some Derecho mailing list that is more appropriate than your blog for such questions, please just send me there.

Scott

Friday 24 November 2017

Can distributed systems have predictable delays?

I ran into a discussion thread about “deterministic networking” and got interested...

Here's the puzzle: suppose a system requires the lowest possible network latencies, the highest possible bandwidth, and steady timing.  What obstacles prevent end-to-end timing from being steady and predictable, and how might we overcome them?  What will be the cost?

To set context, it helps to realize that modern computing systems are fast because of asynchronous pipelines at every level.  The CPU runs in a pipelined way, doing branch prediction, speculative prefetch, and often executing multiple instructions concurrently, and all of this is asynchronous in the sense that each separate aspect runs independently of the others, as much as it can. Now and then barriers arise that prevent concurrency, of course.  Those unexpected barriers make the delays associated with such pipelines unpredictable.

Same for the memory unit.  It runs asynchronously from the application, populating cache entries, updating or invalidating them, etc.  There is also a cache of memory translation data (the TLB), managed in an inverted lookup format that depends on parallel lookup hardware for fast address translations.  With a cache hit, memory performance might be 20x better than with a miss.
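
A quick way to see the effect is to walk the same array twice, once in order and once in a shuffled order. This is a minimal sketch, not a rigorous benchmark, and the exact ratio will vary with the machine, but the random pass typically loses badly to the sequential one because of cache and TLB misses:

```cpp
// Sketch: sequential (cache-friendly) vs shuffled (cache-hostile) access.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;            // ~16M elements, far larger than any cache
    std::vector<std::uint64_t> data(N, 1);

    std::vector<std::size_t> order(N);
    std::iota(order.begin(), order.end(), 0);

    auto run = [&](const std::vector<std::size_t>& idx) {
        auto t0 = std::chrono::steady_clock::now();
        std::uint64_t sum = 0;
        for (std::size_t i : idx) sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%llu  %.1f ms\n", (unsigned long long)sum,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    run(order);                                // sequential: prefetcher and cache help

    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    run(order);                                // shuffled: mostly cache and TLB misses
}
```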

Or consider reading input from a file.  The disk is asynchronous, with a buffer of pending operations in between your application and the file system, then another in between the file system buffer area and the disk driver, and perhaps another inside the disk itself (a rotating disk or SSD often has a battery-backed DRAM cache to mask hardware access delays, and to soak up DMA transfers from main memory, which ideally occur at a higher speed than the SSD can really run at).  If everything goes just right, your performance could be 100x better than if luck runs against you, for a particular operation.

NUMA effects cause a lot of asynchronous behavior and nondeterminism.  Platforms hide the phenomenon by providing memory management software that will assign memory from a DRAM area close to your core in response to a malloc, but if you share data inside a multithreaded application whose threads run on different cores, beware: cross-core accesses can be very slow.  Simon Peter researched this, and ended up arguing that you should invariably make extra copies of your objects and use memcpy to rapidly move the bytes in these cross-core cases, so that each thread always accesses a local object.  His Barrelfish operating system is built around that model.  Researchers at MIT sped up Linux locking this way, too, using “sloppy counters” and replicating inode structures.  In fact, locking is a morass of unpredictable delays.
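
To make the "keep it local" advice concrete, here is a tiny sketch (not the actual Linux or Barrelfish code, and the shard count of 64 is arbitrary) of a sloppy, sharded counter: each thread increments its own cache-line-padded slot, and a reader sums the slots, trading an exact instantaneous count for the elimination of cross-core cache-line bouncing.

```cpp
// Sketch of a "sloppy" (per-thread sharded) counter.
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct ShardedCounter {
    static constexpr std::size_t kShards = 64;
    struct alignas(64) Slot { std::atomic<std::uint64_t> v{0}; };   // one cache line per slot

    void add(std::size_t tid, std::uint64_t n) {
        slots[tid % kShards].v.fetch_add(n, std::memory_order_relaxed);
    }
    std::uint64_t read() const {                 // approximate ("sloppy") while writers run
        std::uint64_t s = 0;
        for (const auto& slot : slots) s += slot.v.load(std::memory_order_relaxed);
        return s;
    }
    std::array<Slot, kShards> slots;
};

int main() {
    ShardedCounter c;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < 8; ++t)
        workers.emplace_back([&, t] { for (int i = 0; i < 1'000'000; ++i) c.add(t, 1); });
    for (auto& w : workers) w.join();
    std::printf("total = %llu\n", (unsigned long long)c.read());
}
```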

Languages like C# and Java can hide some NUMA effects, but on the other hand, they also do garbage collection and other memory-compaction operations, which can be costly and hard to anticipate or control.  You can eliminate garbage collection by using C++, but the standard libraries sometimes do copying in unexpected ways too, so you can’t just assume that everything is roses even with these low-level languages.  These days, even C can have strange library-level delays.
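
Two mundane examples of the kind of hidden work I mean, in plain C++ (a toy sketch): a vector that reallocates and moves its contents as it grows, and a heap-backed string passed by value.

```cpp
// Easy-to-miss copying in "low level" C++.
#include <cstdio>
#include <string>
#include <vector>

std::size_t count_bytes(std::string s) {              // by value: copies, may allocate
    return s.size();
}
std::size_t count_bytes_ref(const std::string& s) {   // by reference: no copy
    return s.size();
}

int main() {
    std::vector<std::string> log;
    // Without reserve(), each growth step reallocates and moves/copies every element.
    log.reserve(10'000);
    for (int i = 0; i < 10'000; ++i)
        log.emplace_back("event " + std::to_string(i));

    std::size_t total = 0;
    for (const auto& s : log) total += count_bytes_ref(s);   // prefer the by-reference version
    std::printf("%zu bytes\n", total);
}
```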

The case I worry about is when activities with different priorities share a lock: it is easy to create a priority inversion in which urgent action A is stuck waiting for thread B, which holds a lock but has very low priority and has been preempted by medium-priority thread C.  This seems arcane as I describe it, but if you think of a NIC or an interrupt handler as a thread, that could be A.  Your user code is thread B. And the garbage collector is thread C.  Voila: major problems are lurking if A and B share a lock.  Locks make concurrent coding easier, but wow, do they ever mess up timing!
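
Here is the structure of that scenario as a sketch. The thread roles exist only in the comments; actually reproducing the stall requires real-time scheduling (e.g. SCHED_FIFO) and pinning everything to one core, which I have not done here, so the sleep in B stands in for "preempted by C."

```cpp
// Structural sketch of priority inversion: A waits on a lock held by B while C runs.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex shared_lock;   // the lock that A and B contend for

void low_priority_B() {                    // e.g. ordinary user code
    std::lock_guard<std::mutex> g(shared_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // stand-in for being preempted by C
}
void medium_priority_C() {                 // CPU-bound work that would starve B
    auto end = std::chrono::steady_clock::now() + std::chrono::milliseconds(100);
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}
void high_priority_A() {                   // e.g. a NIC or interrupt-level handler
    auto t0 = std::chrono::steady_clock::now();
    std::lock_guard<std::mutex> g(shared_lock);       // blocks until B releases
    auto waited = std::chrono::steady_clock::now() - t0;
    std::printf("A waited %lld ms for a low-priority lock holder\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(waited).count());
}

int main() {
    std::thread b(low_priority_B);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));    // let B grab the lock first
    std::thread c(medium_priority_C), a(high_priority_A);
    a.join(); b.join(); c.join();
}
```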

In distributed systems, any form of remote interaction brings sources of nondeterministic timing.  First, on the remote side, we need some kind of task to wake up and run, so you have a scheduling effect.  But in addition to this, most data-center networks use oversubscribed Clos topologies, with spine and leaf routers or switches that can’t actually sustain a full all-to-all data rate.  They congest and drop packets, or send “slow down” messages, or mark traffic to warn the endpoints about excessive load.  So any kind of cross-traffic effect can cause variable performance.

Few of us have a way to control how our application tasks are laid out across the nodes of the data center.  So network latency can introduce nondeterminism simply because every different launch of your application may end up with nodes P and Q at different distances from each other, measured in network hops.  In fact routing can also change, and tricks involving routing are surprisingly common.  Refreshing IP address leases and other kinds of dynamic host bindings can also glitch otherwise rock-steady performance.

If you use VPN security, encryption and decryption occur at the endpoints, adding delay.  There are cryptographic schemes that have constant cost, but many run faster or slower depending on the data you hand them, so more non-determinism creeps in at that step.
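
A tiny illustration of data-dependent timing, using byte comparison rather than a full cipher (this is just a sketch of the idea; real crypto libraries ship vetted constant-time primitives):

```cpp
// Data-dependent vs constant-time comparison.
#include <cstddef>
#include <cstdint>

// Leaky: running time depends on where the first mismatch occurs.
bool equal_leaky(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] != b[i]) return false;    // early exit reveals the mismatch position
    return true;
}

// Constant-time: accumulate differences, decide only at the end.
bool equal_constant_time(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
    std::uint8_t diff = 0;
    for (std::size_t i = 0; i < n; ++i) diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);
    return diff == 0;
}

int main() {
    const std::uint8_t secret[4] = {1, 2, 3, 4}, guess[4] = {1, 2, 9, 9};
    return (equal_leaky(secret, guess, 4) == equal_constant_time(secret, guess, 4)) ? 0 : 1;
}
```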

Failure handling and any kind of system management activity can cause performance to vary in unexpected ways.  This would include automated data replication, copying, compaction and similar tasks on your file storage subsystem.

So... could we create a new operating system with genuinely deterministic timing?  Cool research question!  If I find the right PhD student, maybe I’ll let you know (but don’t hold your breath:  with research, timing can be unpredictable).

Sunday 5 November 2017

Disaggregating data centers

One of the hot new buzz-phrases in the systems community is "disaggregation", as in "the future data center will be disaggregated".

Right now, we tend to build data centers by creating huge rack-mounted collections of computers that often run the same operating systems we use on the desktop: versions of Linux, perhaps virtualized (either with true VMs or via container models, like Docker on Mesos).  The core challenge centers on guaranteeing good performance while sharing resources in ways that optimize power and other costs.  The resulting warehouse-size systems can easily contain tens or hundreds of thousands of racked machines.

Researchers pressing for a disaggregated model generally start by pointing out that we really have a lot of unused resources in these settings.  For example, there might be large numbers of machines equipped with GPU coprocessors that aren't currently running any applications that use them (or ones that are FPGA-capable, but don't actually need FPGA support right now), machines with lots of spare DRAM or SSD space, etc.  You can make a long list of such resources: other forms of ASICs (such as for high-speed SHA3 hashing), accelerators for tasks like FFT computation, DRAM, unused cores, NICs that actually have computers built into them and hence could be used as network-layer coprocessors, switches and routers with programmable capabilities, massive storage systems with ARM or similar compute power down near the medium, etc.

So, the proposal goes, why not shift towards a greater degree of sharing, where your spare resources could be available to my machine?

With RDMA and RDMA-enabled persistent memory (PMEM) standards starting to work properly in settings that long ago standardized around Ethernet and IP (RDMA on InfiniBand is just not a hugely popular technology in data centers, although Azure HPC offers it), it makes sense to talk about sharing these spare resources in more ambitious ways.  My spare DRAM could be loaned out to you to expand your in-memory cache.  Your unused FPGA could become part of my 1000-FPGA deep neural network service.  There are more and more papers about doing these kinds of things; my class this fall has been reading and discussing a bunch of them.  In our work on Derecho, we've targeted opportunities in this space.  Indeed, Derecho could be viewed as an example of the kind of support disaggregated computing might require.

But deciding when disaggregation would be beneficial is a hard problem.  A remote resource will never have the sorts of ultra-low latency possible when a task running locally accesses a local resource, and studies of RDMA in large settings show surprisingly large variability in latency, beyond which one sees all sorts of congestion and contention-related throughput impacts.  So the disaggregated data center is generally going to be a NUMA environment: lots of spare resources may be within reach, but unless care is taken to share primarily with nearby nodes (perhaps just one hop away), you would need to anticipate widely varying latency and some variability in throughput: orders of magnitude differences depending on where something lives.  Get the balance right and you can pull off some really nifty tricks.  But if you accidentally do something that is out of balance, the performance benefit can quickly evaporate.

Even with variability of this kind, a core issue centers on the shifting ratio of latency to bandwidth.  In a standard computer, talking to a local resource, you'll generally get very high bandwidth (via DMA) coupled with low latency.  You can do almost as well when running in a single rack, talking to a nearby machine over a single hop via a single TOR switch.  But when an application reaches out across multiple hops in a data-center network, the latency, and variability of latency, grow sharply. 

The bandwidth story is more complex.  Although modern data centers have oversubscribed leaf and spine switches and routers, many applications are compute-intensive, and it would be rare to see long periods of heavy contention.  Thus, even if latency becomes high, bandwidth might not be lost.

In our experiments with RDMA we've noticed that you often can get better bandwidth when talking to a remote resource than to a local one.  This is because with two machines connected by a fast optical link, two memory modules can cooperate to transfer data "in parallel", with one sending and one receiving.  The two NICs will need to get their DMA engines synchronized, but this works well, hence "all" their capacity can be dedicated to the transfer (provided that the NICs have all the needed data for doing the transfer in cache).

In contrast, on a single machine, a single memory unit is asked to read data from one place and write it to another.  This can limit the operation to half its potential speed, simply because with one core and one memory unit it is easy to "overload" one or both.
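
You can see the read-plus-write cost of a local copy with nothing fancier than memcpy. This sketch times one big copy; the 1 GiB buffer size is an assumption you may need to shrink for your machine, and the "traffic" line is the point: the memory system moves roughly twice the payload.

```cpp
// Sketch: local copy bandwidth, payload vs total memory traffic.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const std::size_t N = std::size_t(1) << 30;        // 1 GiB payload (needs ~2 GiB of RAM)
    std::vector<char> src(N, 1), dst(N, 0);

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), N);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // N bytes of payload moved, but ~2N bytes of memory traffic (read + write).
    std::printf("copied %.2f GiB in %.3f s -> %.2f GiB/s payload, ~%.2f GiB/s traffic\n",
                N / double(1 << 30), secs, (N / secs) / (1 << 30), (2.0 * N / secs) / (1 << 30));
}
```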

We thus confront a new balancing act: a need to work with applications that run over hardware with relatively high latency coupled to very high bandwidth.  In my view, this shifting ratio of latency to bandwidth is a game-changer: things that worked well in classic systems will need to be modified in order to take advantage of it.  The patterns of communication that work best in a disaggregated world are ones dominated by asynchronous flows: high-rate pipelines that move data steadily with minimal delays, no locking, and no round-trip requests where a sender has to pause waiting for responses.
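
A back-of-the-envelope calculation shows why. The numbers below (a 10 microsecond RTT, a 100 Gb/s link, 8 KB messages) are assumptions, not measurements, but the shape of the result is robust: a pattern that waits out one round trip per message leaves most of the link idle, while a pipeline with many messages in flight can run at close to wire speed.

```cpp
// Back-of-the-envelope: stop-and-wait vs pipelined throughput (assumed numbers).
#include <cstdio>

int main() {
    const double rtt_us    = 10.0;      // assumed one-hop datacenter round trip
    const double link_gbps = 100.0;     // assumed NIC/link speed
    const double msg_bytes = 8192.0;

    double wire_time_us = (msg_bytes * 8.0) / (link_gbps * 1e3);   // link_gbps*1e3 = bits per us
    // Stop-and-wait: one message per RTT plus its wire time.
    double sw_gbps = (msg_bytes * 8.0) / ((rtt_us + wire_time_us) * 1e3);
    // Pipelined: with enough messages in flight, the wire time dominates.
    double pipe_gbps = link_gbps;

    std::printf("stop-and-wait: %.2f Gb/s   pipelined: ~%.0f Gb/s (%.0fx)\n",
                sw_gbps, pipe_gbps, pipe_gbps / sw_gbps);
}
```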

As you  know from these blogs, almost all of my work is on distributed protocols.  Mapping these observations to my area, you'll quickly realize that in modern settings, you really don't want to run 2-phase commit or similar protocols.  Traditional versions of Paxos or BFT are a performance disaster.  In fact, this is what led us to design Derecho in the way we ultimately settled upon: the key challenge was to adapt Paxos to run well in this new setting.  It wasn't easy and it isn't obvious how to do this in general, for some arbitrary distributed algorithm or protocol.

So the puzzle, and I won't pretend to have any really amazing ideas for how to solve it, centers on the generalized version of this question.  Beyond the protocols and algorithms, how does Linux need to evolve for this new world?  What would the ideal O/S look like, for a disaggregated world?

As a concrete example, think about a future database system.  Today's database systems mostly work on a read, compute, output model: you start by issuing a potentially complex query.  Perhaps it asks the system to make a list of open orders that have been pending for more than six weeks and involve a supplier enrolled in a special "prioritized orders" program, or some other fairly complex condition.

The system reads the relations containing the data, performs a sequence of joins and selects and projects, and eventually out comes the answer.  With small data sets, this works like a charm.

But suppose that the database contains petabytes of data spread over thousands of storage systems.
Just reading the data could saturate your NIC for days!  In such a situation, you may need to find new ways to run your query.  You would think about disaggregation if the underlying hardware turns out to be reasonably smart.  For example, many modern storage devices have a local processor, run Linux, and can run a program on your behalf.
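
The arithmetic is sobering even with generous assumptions (a single 100 Gb/s NIC, no protocol overhead, no contention):

```cpp
// Back-of-the-envelope: time to pull 1 PB through one 100 Gb/s NIC.
#include <cstdio>

int main() {
    const double petabyte_bits  = 1e15 * 8.0;   // 1 PB of payload
    const double nic_bits_per_s = 100e9;        // assumed 100 Gb/s NIC
    double hours = petabyte_bits / nic_bits_per_s / 3600.0;
    std::printf("%.1f hours per PB\n", hours);  // ~22 hours; several PB -> days
}
```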

Clearly you want the database operations to be pushed right into the storage fabric, and would want to harness all that compute power.  You might want to precompute various indices or data samples, so that when the query is issued, the work has partly been done in advance, and the remaining work can happen down in the layer that hosts the data.
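
As a thought experiment, the sketch below shows what "pushing the predicate to the data" might look like. Every name in it (StorageShard, scan, Order) is invented for illustration; real smart-storage interfaces are vendor-specific, which is exactly the standardization gap this post is about.

```cpp
// Hypothetical sketch of predicate push-down to smart storage.
#include <chrono>
#include <functional>
#include <string>
#include <vector>

struct Order {
    std::string supplier;
    std::chrono::system_clock::time_point placed;
    bool open;
};

// Imagine one of these per storage device, with the filter running *on* the device.
class StorageShard {
public:
    explicit StorageShard(std::vector<Order> local) : local_(std::move(local)) {}

    // Ship the predicate down; only matching rows come back over the network.
    std::vector<Order> scan(const std::function<bool(const Order&)>& pred) const {
        std::vector<Order> out;
        for (const auto& o : local_) if (pred(o)) out.push_back(o);
        return out;
    }
private:
    std::vector<Order> local_;
};

// Coordinator: fan out to all shards, gather only the (small) filtered results.
std::vector<Order> stale_priority_orders(const std::vector<StorageShard>& shards) {
    const auto cutoff = std::chrono::system_clock::now() - std::chrono::hours(24 * 42);  // six weeks
    std::vector<Order> result;
    for (const auto& s : shards) {
        auto part = s.scan([&](const Order& o) {
            return o.open && o.placed < cutoff && o.supplier == "prioritized";
        });
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}

int main() {
    std::vector<StorageShard> shards{ StorageShard({}), StorageShard({}) };  // placeholder shards
    auto hits = stale_priority_orders(shards);
    return static_cast<int>(hits.size());
}
```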

But of course I'm a systems person, so my system-centric question, posed in this context, is this: what will you want as an O/S for that environment?  And more to the point: how will that O/S differ from today's cloud systems?

If you are unsure how to think of such a question, consider this: if you wanted to implement a disaggregated database today, how would you even convince the storage layer to run these little add-on programs?  In a purely mechanical sense: how does one run a program on a modern SSD storage unit?  I bet you have absolutely no idea.

If you wanted to do everything by hand, logging into the storage servers one by one is an option.  Today, it might be the only option, and the environment you would find inside them could be pretty arcane.  Each vendor has its own weird variation on the concept. 

But we could change this.  We could create a new kind of operating system just for such tasks.  The power of an operating system is that it can impose standardization and increase productivity through portability, easy access paths, good debugging support, proper integration of the needed tools, virtualization for sharing, etc.  You would lack all of those features, today, if you set out to build this kind of database solution.

I mentioned at the start that data centers are optimized to minimize wasted resources.  One would need to keep an eye on several metrics in solving this database challenge with a disaggregated model.  For example, if DRAM on my machine gets loaned out to your machine, did the total performance of the data center as a whole improve, or did your performance improve at my expense?  Did the overall power budget just change in some way that the data center owner might need to think about?  How did this shift in resources affect the owner's ability to manage (to schedule) resources, so as to match the intended resource profile for each class of job?

Then there are lower-level questions about what the needed technical enablers would be.  I happen to believe that Derecho could become part of the answer -- this is one of the motivating examples we thought about when we developed the system, and one of my main reasons for writing these thoughts down in this blog.  But we definitely don't have the full story in hand yet, and honestly, it isn't obvious to me how to proceed!  There are a tremendous number of choices and options and, because the vendors don't have any standards for disaggregated computing, a total lack of existing structure!  It's going to be the Wild West for a while... but of course, as a researcher, that isn't necessarily a bad news story...