Tuesday, 6 December 2016

Heavy tails and skewed WAN latencies

Researchers in systems often focus on raw performance, and it isn’t uncommon to test under clean-room conditions: idle machines, dedicated networks, applications that work with just the right memory regions to get the maximum possible performance from their memory accesses, workloads that just happen to be optimal for the new algorithms.  But the bigger challenge is often to have predictable and good performance in the real world, which is never quite so clean or simple.
While wandering the world on sabbatical, I’ve taken some time to track down development teams that own products at a few major companies, and to ask about performance challenges at large scale. I was very struck by the consistent mention of heavy tailed loads, skewed performance, and similar issues. 

For example, when I visited Google I had a chance to talk to the group that owns Chubby, Google's locking service (and later, at Cambridge University, to Flaviu Junqueria, who did Zookeeper with Ben Reed).  I thought both of them would be ready to drop everything and switch instantly to Derecho (Cornell's new platform).  But in fact they both said that while Derecho's raw speed is amazing, we'll also need to deal with heavy tail latencies and be able to demonstrate stability in WAN deployments.  It turns out that for technologies like these, you can't really compete against them with a point solution: the owners want global-scale stories that work well under stress, deployed to a dozen data centers worldwide, and with some components malfunctioning.

This is interesting because while we all know that most modern computing occurs in multi-tenant datacenters, where machines and other resources need to be shared among potentially large numbers of concurrent users, we don’t always think through the implications:  when you launch ten copies of a program to perform a task in parallel, they run all at once in a clean-room setting.  In a multi-tenant environment, a few of those programs might start late, or run slowly, and hence could finish the job far later than the others. If those stragglers slow the whole system down, it won't matter that without slow nodes it would have blazed.  All anyone will see is that your amazing technology runs slowly.
What causes this kind of skew?  Here are some factors people have mentioned recently:
  • A single core can have variable speed depending on cache miss rates, quality of branch prediction, and whether the memory the program is accessing is from a block that resides close to the core: on a modern multi-core machine, there are strong NUMA effects that can be quite visible if your thread runs on a core that isn’t the same one where the memory was allocated.
  • Lock contention can also be a major factor, as can false sharing.    The issue is similar: accessing a lock generally means accessing a variable shared among multiple threads, and perhaps among multiple cores.  When your thread tries to obtain a lock, the logic of the locking algorithm will need to issue a compare-and-swap or an atomic-increment or a similar instruction.  So, if the memory is remote from the core you are running on, that memory cell (that “cache line”) has to be fetched and this can stall the machine and in fact can even stall other threads on other cores. False sharing occurs if threads access different variables, but they happen to have been allocated into the same memory page or cache line, causing it to ping-pong back and forth between the cores that run those threads.
In fact those first two points have nothing to do with multi-tenancy: they are just the reality of concurrency on NUMA multicore computers.  I actually teach graduate courses at Cornell where in the first week, we get the students to build a single-threaded solution to a problem like Conway’s famous “Game of Life” (a cellular automata visualization), and then to try and understand how changing it to be multi-threaded impacts performance.  The quick answer is that adding threads can kill performance, and moving the multi-threaded version to a true multi-core setting is the worst of all.  In our classes, most people conclude that the best use of a multi-core server is to run VMs, with lots of programs but allocating them one core each!  Certainly, getting a speedup on a multi-core NUMA platform isn’t a simple thing.  You need to know your stuff, and it can take a lot of effort.
Beyond this we get true issues of multi-tenancy and sharing:
  • Paging delays or scheduling delays on a shared processor.  With cloud computing and virtualization, we often see many VMs mapped to one server, and we may even see enough memory in use to force some pages out.  Schedulers aren’t always as smart as we might wish, and can easily wander into modes of scheduling that put some applications at a huge disadvantage.  For example, some scheduling algorithms that have known bias towards applications that run in very short bursts before issuing an I/O system call or some other blocking action, and others that are biased to favor long-running jobs. 
  • In a modern cloud, the entire rack of computers you are mapped to might be running slowly.  First, it is rare for the network layer at the top of the data center to have anything close to the full NxN bandwidth (we call this full bisection bandwidth) that could theoretically be needed if programs interact at random.  Next, some communication patterns generate huge amounts of data, and this can overload the TOR layer in ways that put small applications at a disadvantage.  Then beyond all of those considerations, your application could end up scattered widely, with high latency between its components.  And beyond all of that, cloud systems are big enough so that something is invariably in the process of crashing.  Add it all up, and you have a lot of issues.
  • When an application runs on multiple datacenters, we add the extra complication of slow WAN links, which will commonly be heavily loaded with high delays, even if they offer great throughput.  WAN links are also prone to slowdowns for reasons not often seen inside a datacenter, such as a burst of solar energy that causes high noise rates on a satellite link or even on a ground-based one.  We never see issues like this inside the datacenter, but WAN systems experience them all the time (to say nothing of water seeping into a connection or a router, goats chewing on wires, backhoes).

To illustrate my point, here’s a true story.  I had a chance to sit down for a few hours with the current owner of the Google chubby locking service.  My plan was to awe him with the astonishing speed of Derecho, Cornell’s new library, and maybe convince him to just switch all of Chubby to Derecho.  In fact he agreed that Derecho is amazing and way faster than he thought was possible for an atomic multicast system or a version of Paxos. 
Just the same, he said that he wouldn’t switch to Derecho because in his environment, it might not work nearly as well.  Which led to yet another discussion of handle heavy-tailed latencies, particularly in WAN settings.
It turns out that there is a reason that the Chubby group would think hard about erratic WAN latency.  Google makes heavy use of Chubby, and often deploys it worldwide: some kind of stateful file or lock is shared and every datacenter in the global network has a copy for mostly read-only use.  They have a concept of a coherent cached copy, and to update it, the update site has to invalidate all of those cached replicas before it can change the value and release the new version.
But when an update is issued, the problem arises of how to invalidate all those remotely cached copies.  For Chubby, this is the big issue: when the system just sends out invalidate messages, 90% get through quickly, but 10% might be delayed for seconds, minutes… even days.  Meanwhile, since the lock withdrawal hasn’t finished, the update has to wait.  Thus a write could literally take minutes or longer. 
There are other options that center on encouraging applications to adopt weaker forms of consistency, but when the need is for strong guarantees, heavy tails dominate (meaning distributions of delay in which a lot of the requests finish quickly, but enough finish very slowly to add up to a significant percentage even so: the “heavy” aspect of the “tail of the distribution”).  Skew means nearly the same thing: it estimates the delay from when the first application process learns that the cached copy is being invalidated to when the last one learns this.
Leases can help, but then the skew turns out to cause a different kind of problem.  If cached copies have bounded validity time, then they have to be refetched periodically, so our cache invalidation will have bounded delay, but now local read access could become slow (since leases are on a per-object basis, every cache in the world will need to periodically refresh every leased object).  So load might be absurdly high.  Worse, if a particular refresh happens to experience high latency, the associated datacenter may get stuck: At Google, some of these are global variables that really, really matter for important applications.  So with leases we just push the problem from one place to another!
What works best?  Some ideas:
  • Use a model in which updates occur at preplanned times, not frequent.  To issue an update you queue up your request and it waits until the update mode starts up.  Then it runs rapidly to completion.  Meanwhile, the caches learned yesterday about a planned update for today, so they automatically invalidate their versions of those objects, or perhaps shift into a mode offering a subset of the functionality (they do have a copy that was valid until 10am, if you are ok with weaker consistency).  After the updates finish, they are aggressively pushed to the caches that held copies, which don’t have to ask for data in advance.
  • Fragment your data, so that each item has a primary update location, and move that primary location as needed to track the sun, or the moon, or the tides.  Updates will be fast at the primary site.  Then stream the new values to remote sites.  Remote sites offer weak consistency mode (using their local cached copies) or strong consistency (in which all work vectors via the current primary site for that data).
  • Use a globally ordered multicast to send your update to the entire world-wide set of replicas, perhaps well in advance of the time when the update should be done.  Then all can do the operation, in unison, at the proper time.   We plan to experiment with Derecho in a WAN to see how well we can make this work.
  • Google Spanner adopts this approach, but uses a method that employs super-synchronized clocks and carefully measured worst-case latency bounds.  Spanner is a database system in which transactions are broadcast globally using a very simple but very reliable method.  On arrival, transactions queue up until enough time has elapsed to ensure that the system has every potentially conflicting transaction for a given temporal window.  Then they sort the operations into a standard order and execute them.  This method won’t work if a transaction shows up late, but only at some sites:  it can’t tolerate heavy tailed latencies or high skew.  The paper talks about various techniques used by Spanner to make sure that if a problem could cause this sort of delayed delivery, the network management layer would sense it as a form of global failure. 
What about using quorums?  I'll end by pointing out the puzzle here.  Some of my friends in industry are enamored of quorum protocols such as Paxos: the idea is that because the protocol only waits for responses from a majority of the replicas of a replicated object, progress occurs even if a minority of replicas are running slowly. 
This would be true were it not for backlogs.  In practice, any real Paxos protocol runs over TCP.  And TCP is a FIFO, lossless protocol.  Thus if some Paxos leader sends out requests A, B, C... and is able to really quickly zip up to request Z because acceptors P and Q were quick to respond, if we think about acceptor R, a backlog of unprocessed work will be developing in the channel from the leader to R!  Committing A and B didn't somehow magically remove them from the TCP channel. 
Parissa Jallali studied this with Fernando Pedone at Lugano and was able to show that in fact, Paxos becomes very bursty in situations where this occurs.  The reason is that with multi-tenancy, it won't be long before P stutters for some reason, and now the leader needs responses from Q and R to make progress.  But R has to catch up with all these old requests (recall that TCP is FIFO, so R is not able to see new requests until it finishes dealing with old ones, even though the old protocols long ago committed).  In fact the leader will spend a while discarding old replies from R, to requests that are no longer in the first phase of Paxos.  Parissa showed that each time this happens, the Paxos service pauses.  Log merge, needed when reading (learning) the Paxos log state, also becomes slow in precisely the same way: backlogs make it bursty.
The alternative, used in Derecho, is to be aggressive about declaring faults and to reconfigure our entire group when a process lags: Derecho would drop R, then later R could rejoin and Derecho could drop P.  But this involves reconfiguring the group to update membership, and doing a state transfer.  Which is better: bursty behavior with stable membership but quorums updates, or bursty behavior with frequent membership updates?
I don't have a pat answer here.  My systems obviously use membership reconfiguration: I like to quote Fellini, who titled one of his movies "E il Nave Va": the ship sailed on.  The point being that in a big system, you want to make progress, not get stuck waiting for a slow node.  But how best to carry out that advice in practical protocols... not obvious!

1 comment:

  1. "... Run single core vms." Counterintuitive but true! At least sometimes.