A Few Thoughts on Distributed Computing: September 2017

Monday, 18 September 2017

What can we learn about specifications from ZooKeeper's asynchronous mode and its unsafe ForceSync=no option?

At the time of this writing, ZooKeeper is surely the world's most widely used data replication tool. You run it on a few machines, normally 3, and it offers a file system API with very strong guarantees to the user. In fact, if configured to do so, ZooKeeper implements the Paxos specification: Leslie Lamport's formalism of the properties required for correct state machine replication. (My post-doc, Weijia Song, points out that actually, the ZooKeeper Atomic Broadcast, ZAB, isn't necessarily a true Paxos protocol, but the issue he raises is very subtle, so I'll set that to the side. For our purposes here, ZAB is close enough.)

In a recent blog posting, we discussed some of the missing aspects of that very specification. As a result, when I read the ZooKeeper documentation, I was intrigued to realize that the documentation more or less urges that the system be configured to violate Paxos! In fact the document is short, and easy to read, so have a look if you are skeptical.

You'll learn about all sorts of parameters that represent ZooKeeper's response to those missing specification elements, such as how to deal with disks that fill up completely, or avoiding inconsistency in the list of servers running the ZooKeeper service.

And then, in the middle of the same document, you run into a fascinating option: there is a small section called "Unsafe configuration options" that explains that "The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does."  Then we read about an option called ForceSync: "If this option is set to no, ZooKeeper will not require updates to be synced to the media." There is no discussion of risks at all.

Some people know about this but think of it in terms of a broader approach to "using ZooKeeper asynchronously". Used asynchronously, ZooKeeper lets you start a series of operations but either ignore their termination, or at least not wait one by one. Of course flow control always kicks in eventually, to prevent congestion, but you end up with a stream of requests. In this mode it is nearly universal that you would also set ForceSync=no.

So how safe are such actions?

Elsewhere, on the ZooKeeper blog, Flavio Junqueira writes that the system would work perfectly well if this option is used, and that it can offer big speedups. He comments that for safety, there are several options: "You could consider using write barriers, or battery-backed raid SSD". The write barrier remark relates to a Linux system call, "fsync". A battery-backed raid SSD is a type of SSD storage with a DRAM cache that can hold pending writes in memory, but with battery backup so that if power fails, the pending writes will definitely complete. Behind the DRAM is a set of SSD storage units arranged to handle transfers in parallel, so that the aggregate bandwidth might be enough to keep up with the DRAM transfer rates.

On StackOverflow and elsewhere, you can easily find threads encouraging you to configure ZooKeeper with ForceSync=no, and assuring the reader that nobody has ever observed any bad consequences.

In effect, there is very little discussion of risks, except in the sense of "yes, you should definitely use this feature, but remember to also do these other things...."

So what's the issue, and why is it interesting?

At the core of any Paxos implementation is the transaction log where Paxos stores its state. In Derecho, this takes the form of replicated data residing in the replicated C++ objects defined by the developer. In classic Paxos, it was a list of log entries associated with the "acceptor role". Most people understand this to have been an append-only disk file, but my colleague and friend Robbert van Renesse, a Paxos expert, questions that assumption. He thinks that Leslie was deliberately vague about where the logs live, with the intent that it could equally well be used as an in-memory atomic multicast. Derecho does exactly that: it has one protocol, with two configuration options, and you get to pick. Durable storage on disk gives you a durable Paxos, and in-memory storage, a form of atomic multicast with total ordering and fault-tolerance.

The same is true in ZooKeeper, in which performance centers on the speed of the ZooKeeper transaction log. You need to tell it where you want the log to reside. Some popular options include placing it in RamDisk (in memory), or on a real disk, or perhaps an SSD. Above you saw recommendations that it be on a battery-backed raid SSD.

The problem is that if you just put the log on a normal disk or even a normal SSD disk, you get Paxos guarantees of durability... but you also see a heck of a big slowdown. Partly this is because DMA to an SSD is quite slow compared to copying in memory. But the bigger issue is that each time you do an SSD write, if you actually wait for the write to fully complete ("a forced sync"), you pay a full millisecond just waiting.

Even with concurrency this limits the typical SSD configuration of ZooKeeper to about 1000 write operations per second.

Early in the ZooKeeper story, the developers ran into this issue, and added a new option: ForceSync=no. With it, ZooKeeper "on its own" ceases to be a true Paxos log, because it will build a backlog of in-memory updates queued up to be written to disk, and won't actually carry out those updates instantly. But it gains hugely in performance: 50,000 writes per second become completely feasible. A 50x speedup... at what cost?
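The arithmetic behind those numbers fits in a few lines of Python. This is only a back-of-the-envelope sketch: the one-millisecond sync latency and the 50,000 writes per second are the figures quoted above, not measurements, and the function name is made up for the illustration.

```python
# Back-of-the-envelope model. The figures come from the text, not from a benchmark.
SYNC_LATENCY_S = 0.001  # assumption: one forced sync costs about a millisecond


def writes_per_second(sync_latency_s: float) -> float:
    """Throughput when every write waits for its own forced sync."""
    return 1.0 / sync_latency_s


forced = writes_per_second(SYNC_LATENCY_S)  # about 1000 writes per second
relaxed = 50_000.0                          # figure quoted for ForceSync=no
speedup = relaxed / forced                  # about 50x

print(f"forced sync: {forced:.0f}/s, relaxed: {relaxed:.0f}/s, speedup: {speedup:.0f}x")
```

The point of the model is simply that the sync latency, not the data transfer, sets the ceiling when each write waits for the media.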

This is where those comments about battery-backed SSDs and write barriers enter the picture. And this is the puzzle: in fact, you can use ZooKeeper safely in this mode, at no cost and no risk at all. But it depends on your perception of cost, and of risk.

Let's start by setting ForceSync=no but ignoring the helpful advice. ZooKeeper will be buggy. But to be bitten by this particular bug, two things have to happen. First, you need to have a service that crashes and develops amnesia about a batch of committed transactions (updates) that were pending at the time of the crash. And second, someone or something needs to notice.

The point about "someone noticing" is the key to why so many applications get away with setting ForceSync=no, and yet pay no attention to Flavio's advice. Think about the sequence of events for an application using ZooKeeper. Some application is about to complete something important, like launching the rocket ship. So it writes to the ZooKeeper log "... two, one, ignition!" and presses the launch button.

Exactly as this occurs, the power goes out, and on recovery, the system has no record that the button was about to get pushed. So we have an inconsistency that Paxos normally doesn't permit: Lamport requires that Paxos must never forget a committed transaction, meaning that once the application is told the commit has occurred, Paxos has an obligation to not lose it.

But this is not a likely failure sequence! The amnesia part, sure, that really is likely. A bit like with a normal Linux file system: if a program crashes before calling fsync, the last bytes it wrote could easily be lost (maybe even the last few thousand). We know that, and learn to call fsync. But someone actually caring, about that specific operation, yet neglecting to manually call fsync? Seems very unlikely...

So here we have ZooKeeper acting... like Linux file systems normally act! In fact, you can manually call fsync anytime you like in ZooKeeper, so if you do need it, there it is. That's the write-barrier approach.
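For readers who haven't used the pattern, here is roughly what "learning to call fsync" looks like for an ordinary file, sketched in Python. It shows the general Linux file-system habit and not ZooKeeper's own API. The helper name, file name, and records are made up for the illustration.

```python
import os
import tempfile


def append_record(path: str, record: bytes, durable: bool) -> None:
    """Append one record; with durable=True, block until it reaches the device."""
    with open(path, "ab") as f:
        f.write(record)
        if durable:
            f.flush()             # push Python's user-space buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push its cached data to the media


log_path = os.path.join(tempfile.mkdtemp(), "toy.log")

# Fast path: returns quickly, but the bytes may be lost if power fails now.
append_record(log_path, b"three\n", durable=False)

# Barrier: does not return until the file's data has been handed to the device.
append_record(log_path, b"two, one, ignition!\n", durable=True)
```

Only the second call pays the sync latency. Since fsync covers the whole file, it also makes the earlier unsynced record safe, which is why it works as a barrier.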

The battery-backed raid SSD option is less common.

So who is wrong: Leslie, for including this rule in the specification?   The good user, who learns to call fsync when necessary? Or the bad user, for recklessly breaking the properties of Paxos, all for a lousy 50x or 100x speedup?

As a builder, I have to feel sympathy for any developer who wants the speed. And I honestly question that Paxos specification. Maybe the requirement really is too strong! Couldn't it be reexpressed in terms of fsync: "no committed request will ever be lost if fsync was invoked, and completed after the commit?"
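To make that weaker rule concrete, here is a toy model in Python. It is a sketch under heavy assumptions: a single in-memory object stands in for the whole replicated service, and the class and method names are invented for this post, not taken from ZooKeeper or Derecho.

```python
class ToyLog:
    """Toy model of a log that acknowledges commits before they are durable."""

    def __init__(self):
        self.durable = []  # entries that survive a crash
        self.pending = []  # entries committed in memory only

    def commit(self, update):
        # The caller is told "committed" right away, as with ForceSync=no.
        self.pending.append(update)

    def fsync(self):
        # Barrier: every commit acknowledged so far becomes durable.
        self.durable.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        # Amnesia: commits that were never synced are gone.
        self.pending.clear()
        return list(self.durable)


log = ToyLog()
log.commit("fuel tanks full")
log.fsync()                        # invoked and completed after the commit
log.commit("two, one, ignition!")  # acknowledged, but never synced
recovered = log.crash_and_recover()
print(recovered)
```

The first update satisfies the reworded property and survives. The second was committed without a later fsync, so the reworded property makes no promise about it, and the toy log forgets it.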

In fact the interesting issue here is that when ForceSync=no, ZooKeeper simply imposes an extra obligation on the behavior of the developer (use fsync, or confirm that you have the right kind of specialized SSD). As we discussed in that prior blog entry, Paxos already imposes obligations on its users, and doesn't express those either. Why is this different?

Yet I also understand Leslie. I've asked him about this, and he thinks developers will just get it wrong. They want the speed, because otherwise they look bad, so they flip this switch for extra speed in a situation where it really isn't appropriate.

Here in a college town, with students who definitely drive unsafely and fast, I get it.

How many of the developers who push ZooKeeper's insane speed button actually know what they are doing, and think about when to manually call fsync?  Yet on the other hand, how many of their applications would break if the ZooKeeper storage subsystem were to slow down by 50x or 100x?

So you tell me: should systems follow the ZooKeeper lead?

Seriously: what do you think? Should Derecho support ForceSync=no?

Thursday, 7 September 2017

Inadequacy of the Paxos specification, and what we can learn from the issue

In a blog posting ten days ago, I discussed the issue of specifications that omit coverage for cases that actually arise in real systems. Since then two colleagues who follow the blog asked for examples to illustrate the issues, so I thought I might say a few more words on this, focusing on the classic specification of Paxos: Leslie Lamport's solution to the State Machine Replication problem (also sometimes called Consensus, or Uniform Agreement).

The traditional specification of Paxos has the following elements:

A specified set of participants.
An assignment of roles {leader, acceptor, learner} to the participants. Each can have more than one role, and we often think of external clients as adding two additional roles: {command-initiator, command-consumer}.

A leader runs a protocol for putting new commands into the Paxos log.
An acceptor holds a Paxos log (an ordered list of slots, each of which can be empty, or can hold a command), and information about which commands are known to have committed. Any given acceptor might have gaps: slots for which it lacks the committed command (on top of this, there is also a somewhat subtle failure case in which a slot will permanently be left empty, so that every acceptor would have a gap at that spot).
A learner runs a protocol for computing the full list of committed commands and their ordering. Any single acceptor might have gaps in its log, so the learner does this by merging logs in order to fill those gaps.
Paxos really says nothing at all about command initiators and consumers, except that the initiator waits for a response to the request that the command be posted (in doing so, Paxos eliminates what we sometimes refer to as a causal order obligation, in which a system is expected to track and respect ordering on pending asynchronous actions).

A rule for what it means for a command to be valid (non-triviality).
Agreement on ordering.
Durability: in any state where a Paxos service responds to "learn" requests, it needs to return the entire list of previously ordered commands.
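The learner's gap-filling merge described above can be sketched in Python. This is a simplification under a stated assumption: each acceptor log here contains only commands it already knows to be committed, so the sketch skips the quorum and ballot reasoning a real learner needs. The type alias and function name are made up for the illustration.

```python
from typing import Dict, List, Optional

# Assumed representation: slot number -> command known to be committed.
AcceptorLog = Dict[int, str]


def learn(acceptor_logs: List[AcceptorLog]) -> List[Optional[str]]:
    """Merge committed entries from several acceptors into one ordered list."""
    merged: Dict[int, str] = {}
    for log in acceptor_logs:
        for slot, command in log.items():
            # Agreement: acceptors must never disagree on a committed slot.
            if merged.setdefault(slot, command) != command:
                raise ValueError(f"conflicting commands in slot {slot}")
    last = max(merged, default=-1)
    # A slot that no acceptor can fill stays None (the permanently empty case).
    return [merged.get(slot) for slot in range(last + 1)]


logs = [
    {0: "a", 2: "c"},          # this acceptor is missing slot 1
    {0: "a", 1: "b"},          # this one is missing slot 2
    {1: "b", 2: "c", 4: "e"},  # nobody has slot 3
]
result = learn(logs)
print(result)
```

No single acceptor holds the full sequence, yet the merged list is complete except for the one slot that every acceptor lacks.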

Many authors just focus on the final three properties (non-triviality, agreement on ordering, and durability). Yet notice that from this set of five elements, we can easily discern a whole series of additional, unspecified aspects:

There is a Paxos reconfiguration protocol, but it doesn't introduce additional specification elements, except to change the set of participants. Yet there are several aspects one would normally wish to see addressed:

Malkhi has convincingly argued that one should terminate a membership epoch before starting actions in the next epoch. This avoids having a command that was initiated in epoch k commit much later, perhaps in epoch k+1. While Lamport has often said that this isn't necessarily a bad thing, Malkhi's point is that a delayed commit can be confusing, because many systems operate in a configuration-sensitive way.
A new member of a Paxos acceptor group should probably not have an initially empty log. A means of performing state transfer to it is thus required. Simply copying some existing log to the joiner is incorrect, because by doing so, a command that previously lacked a quorum and hence was uncommitted (Paxos has a scenario in which it leaves a slot "empty") can become committed because duplicating a log effectively duplicates a vote for that command.
When Paxos restarts after all its members crash, the protocol doesn't specify the rule for figuring out the proper initial configuration for the restarted service (this matters with dynamic membership).
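The state-transfer hazard is easy to see by counting votes. The Python sketch below is deliberately simplified: it ignores ballot numbers and treats "a majority of current acceptors hold the command" as the commit rule. The acceptor names, the slot number, and the helper function are all hypothetical.

```python
def has_quorum(logs: dict, slot: int, command: str) -> bool:
    """True if a majority of the current acceptors hold `command` in `slot`."""
    votes = sum(1 for log in logs.values() if log.get(slot) == command)
    return votes > len(logs) // 2


# Three acceptors; only A accepted "x" in slot 7, so "x" lacks a quorum.
acceptors = {"A": {7: "x"}, "B": {}, "C": {}}
before = has_quorum(acceptors, 7, "x")

# C fails and is replaced by D, which naively copies A's log.
del acceptors["C"]
acceptors["D"] = dict(acceptors["A"])
after = has_quorum(acceptors, 7, "x")

print(before, after)
```

Nobody voted again, yet the copied log counts as a second vote and the command now appears committed. A state-transfer rule has to prevent exactly this.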

The specification says nothing about flow-control, yet if a Paxos protocol has no flow control mechanism at all, a quorum of acceptors could advance unboundedly far into the future relative to a stalled or failed acceptor. This might mean that the faulty acceptor has no feasible way to catch up later, and in effect, would decrease the resilience of the service: rather than having N members, we would have to think of it as having just N-1.
The specification says nothing about sizes of objects (size of commands), yet acceptors will presumably have bounded amounts of memory, and might not be able to accept arbitrarily large objects. Solving this isn't necessarily hard (one could have commands that point to external objects), but then the objects would be less replicated than the commands, and one has to ask whether that somehow violates the durability property.
The specification says nothing about fairness or other quality of service properties, such as timeliness of response. Yet real systems need fairness when many clients share a Paxos service and all the clients want a fair-share of access. In fact, one can then ask what specific notion of fairness is desired: should it be round-robin (like in Derecho)? Or should some clients be allowed to send 2x more commands than others, because they happen to be busier? Should a slower client be "delayed" by activity of a faster client?
I mentioned that Paxos seemingly excludes situations where clients might issue a series of requests, without waiting for replies. In practice, this is common (clients often "stream" requests). We would want the client ordering to be respected by the system: if a client sends request A, then B, Paxos should preserve the A happened before B relationship.
At Google, the Paxos owner (in fact the leader of the team responsible for the Chubby service) pointed out to me that his big concern is with wide area deployments of Paxos, which introduce a whole set of issues Lamport never considered:

Proper specification of a hierarchical Paxos. Guerraoui and Pedone and Quema all have looked at this question, primarily in connection with Ring-Paxos protocols.
Heavy tailed behaviors. Chubby struggles to deal with very delayed data that can be caused by overloaded or damaged WAN links.
One way to get around "late" data is to design systems that assume they have correct and current data, for example using locks or leases. Chubby does this, but it turns out that when one does so, getting WAN service instances to release their read locks so that the data can be updated, or preventing them from renewing those leases, can be very slow. This might violate a specification that requires fast normal read behavior, but also requires a fast way to be able to update "rarely changed" configuration parameters or program versions or other forms of WAN data that does change now and then.

Beyond these points, one encounters a further concern. The Paxos specification is ideally suited to verifying the correctness of Paxos, with respect to its own "promises". But the specification doesn't tell us anything at all about correct use of Paxos, or required behavior of the application using the service.

For example, suppose that Paxos isn't the real repository for data but is playing an intermediary role: the program using the Paxos service might itself be a replicated database, or some other form of replicated service that wants updates delivered in a deterministic order.

Do we expect that such a service would always be able to use the identical order to the Paxos log? What does this imply about the specification the client service must respect?
On restart from crashes, we now have two forms of state: state in the durable Paxos logs, and state in the database service replicas. How should these be reconciled?
We won't want the Paxos state to grow without bounds, so we will need to truncate the Paxos logs. There is a truncate protocol in Paxos, but what obligations fall upon the client service that wishes to make use of that truncate command? How does truncation interplay with failure cases that might arise?

Believe it or not, I could actually go on and list even more issues! My point, though, is that when we use Lamport's core specification, we are really working from an inadequate specification that omits major, very important aspects of the real service we intend to build and use.

But notice how hard it would be to check a Paxos specification for adequacy. We would need a fairly elaborate (and adequate) specification of the environment, and of our larger goals. Otherwise, questions such as flow-control or other aspects of bounding resource consumption, or of client state, could never even be posed. So there is a sense of chicken and egg: to understand if a Paxos specification is adequate, we really need an adequate specification of the setting where Paxos will be used, so that we can study the questions it poses about the Paxos service per se.

I'll stop on that point, but there is more one can say about adequacy of specifications. A good topic for some other posting down the road...