At the time of this writing, ZooKeeper is surely the world's most widely used data replication tool. You run it on a few machines, normally 3, and it offers a file system API with very strong guarantees to the user. In fact, if configured to do so, ZooKeeper implements the Paxos specification: Leslie Lamport's formalism of the properties required for correct state machine replication. (My post-doc, Weijia Song, points out that actually, the ZooKeeper Atomic Broadcast, ZAB, isn't necessarily a true Paxos protocol, but the issue he raises is very subtle, so I'll set that to the side. For our purposes here, ZAB is close enough.)
In a recent blog posting, we discussed some of the missing aspects of that very specification. As a result, when I read the ZooKeeper documentation, I was intrigued to realize that the documentation more or less urges that the system be configured to violate Paxos! In fact the document is short, and easy to read, so have a look if you are skeptical.
You'll learn about all sorts of parameters that represent ZooKeeper's response to those missing specification elements, such as how to deal with disks that fill up completely, or avoiding inconsistency in the list of servers running the ZooKeeper service.
And then, in the middle of the same document, you run into a fascinating option: there is a small section called "Unsafe configuration options" that explains that "The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does." Then we read about an option called ForceSync: "If this option is set to no, ZooKeeper will not require updates to be synced to the media." There is no discussion of risks at all.
Some people know about this but think of it in terms of a broader approach to "using ZooKeeper asynchronously". Used asynchronously, ZooKeeper lets you start a series of operations but either ignore their termination, or at least not wait for them one by one. Of course flow control always kicks in eventually, to prevent congestion, but you end up with a stream of requests. In this mode it is nearly universal that you would also set ForceSync=no.
So how safe are such actions?
Elsewhere, on the ZooKeeper blog, Flavio Junqueira writes that the system would work perfectly well if this option is used, and that it can offer big speedups. He comments that for safety, there are several options: "You could consider using write barriers, or battery-backed raid SSD". The write barrier remark relates to a Linux system call, "fsync". A battery-backed raid SSD is a type of SSD storage with a DRAM cache that can hold pending writes in memory (in DRAM), but with battery backup so that if power fails, the pending writes will definitely complete. Then behind the DRAM is a set of SSD storage units arranged to handle transfers in parallel, so that the aggregate bandwidth might be enough to keep up with the DRAM transfer rates.
On StackOverflow and elsewhere, you can easily find threads encouraging you to configure ZooKeeper with ForceSync=no, and assuring the reader that nobody has ever observed any bad consequences.
In effect, there is very little discussion of risks, except in the sense of "yes, you should definitely use this feature, but remember to also do these other things...."
So what's the issue, and why is it interesting?
At the core of any Paxos implementation is the transaction log where Paxos stores its state. In Derecho, this takes the form of replicated data residing in the replicated C++ objects defined by the developer. In classic Paxos, it was a list of log entries associated with the "acceptor role". Most people understand this to have been an append-only disk file, but my colleague and friend Robbert van Renesse, a Paxos expert, questions that assumption. He thinks that Leslie was deliberately vague about where the logs live, with the intent that it could equally well be used as an in-memory atomic multicast. Derecho does exactly that: it has one protocol, with two configuration options, and you get to pick. Durable storage on disk gives you a durable Paxos, and in-memory storage, a form of atomic multicast with total ordering and fault-tolerance.
The same is true in ZooKeeper, in which performance centers on the speed of the ZooKeeper transaction log. You need to tell it where you want the log to reside. Some popular options include placing it in RamDisk (in memory), or on a real disk, or perhaps an SSD. Above you saw recommendations that it be on a battery-backed raid SSD.
The problem is that if you just put the log on a normal disk or even a normal SSD, you get Paxos guarantees of durability... but you also see a heck of a big slowdown. Partly this is because DMA to an SSD is quite slow compared to copying in memory. But the bigger issue is that each time you do an SSD write, if you actually wait for the write to fully complete ("a forced sync"), you pay a full millisecond just waiting.
Even with concurrency this limits the typical SSD configuration of ZooKeeper to about 1000 write operations per second.
Early in the ZooKeeper story, the developers ran into this issue, and added a new option: ForceSync=no. With it, ZooKeeper "on its own" ceases to be a true Paxos log, because it will build a backlog of in-memory updates queued up to be written to disk, and won't actually carry out those updates instantly. But it gains hugely in performance: 50,000 writes per second become completely feasible. A 50x speedup... at what cost?
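The arithmetic behind those numbers is easy to check. Here is a back-of-envelope sketch in Python that uses only the figures quoted above (about one millisecond per forced sync, and 50,000 writes per second without it). The function name is mine, and the model is deliberately crude: it assumes every write waits for its own sync.

```python
# Back-of-envelope estimate using the figures quoted in the text.
SYNC_LATENCY_S = 1e-3            # ~1 ms spent waiting on each forced sync
UNSYNCED_WRITES_PER_S = 50_000   # throughput quoted for ForceSync=no


def synced_writes_per_second(sync_latency_s: float) -> float:
    """Upper bound on writes/s if every write waits for one forced sync."""
    return 1.0 / sync_latency_s


synced = synced_writes_per_second(SYNC_LATENCY_S)
speedup = UNSYNCED_WRITES_PER_S / synced

print(f"with forced sync: about {synced:.0f} writes/s")
print(f"speedup without it: about {speedup:.0f}x")
```

Real deployments batch and pipeline, so treat this as an order-of-magnitude estimate rather than a benchmark.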
This is where those comments about battery-backed SSDs and write barriers enter the picture. And this is the puzzle: in fact, you can use ZooKeeper safely in this mode, at no cost and no risk at all. But it depends on your perception of cost, and of risk.
Let's start by setting ForceSync=no but ignoring the helpful advice. ZooKeeper will be buggy. But to be bitten by this particular bug, two things have to happen. First, you need to have a service that crashes and develops amnesia about a batch of committed transactions (updates) that were pending at the time of the crash. And second, someone or something needs to notice.
The point about "someone noticing" is the key to why so many applications get away with setting ForceSync=no, and yet pay no attention to Flavio's advice. Think about the sequence of events for an application using ZooKeeper. Some application is about to do something important, like launching the rocket ship. So it writes to the ZooKeeper log "... two, one, ignition!" and presses the launch button.
Exactly as this occurs, the power goes out, and on recovery, the system has no record that the button was about to get pushed. So we have an inconsistency that Paxos normally doesn't permit: Lamport requires that Paxos must never forget a committed transaction, meaning that once the application is told the commit has occurred, Paxos has an obligation to not lose it.
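To make the failure sequence concrete, here is a toy model in Python. It is not ZooKeeper code: ToyLog, its methods, and the run helper are all names I invented for illustration, and it models a single log rather than a replicated service (so think of the power failure as hitting every replica at once). The only thing it captures is the difference between acknowledging a commit before or after the update reaches stable storage.

```python
class ToyLog:
    """Toy model of a transaction log (hypothetical, not ZooKeeper's code)."""

    def __init__(self, force_sync: bool):
        self.force_sync = force_sync
        self.durable = []   # entries that have reached stable storage
        self.pending = []   # entries acknowledged but still only in memory

    def commit(self, entry: str) -> str:
        """Append an entry and acknowledge it to the caller."""
        self.pending.append(entry)
        if self.force_sync:
            self.sync()     # wait for the media before acknowledging
        return "committed"

    def sync(self) -> None:
        """Flush the in-memory backlog to stable storage."""
        self.durable.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self) -> list:
        """Power loss: anything not yet durable is forgotten."""
        self.pending.clear()
        return list(self.durable)


def run(force_sync: bool, explicit_sync: bool = False) -> list:
    """Commit the countdown, optionally sync by hand, then lose power."""
    log = ToyLog(force_sync)
    log.commit("... two, one, ignition!")
    if explicit_sync:
        log.sync()          # the application asks for durability itself
    # The launch button is pressed here, and then the power goes out.
    return log.crash_and_recover()


print("force_sync=True:           ", run(force_sync=True))
print("force_sync=False:          ", run(force_sync=False))
print("force_sync=False + sync(): ", run(force_sync=False, explicit_sync=True))
```

Only the middle case comes back from the crash with no record of the countdown, even though the application was told the update had committed.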
But this is not a likely failure sequence! The amnesia part, sure, that really is likely. A bit like with a normal Linux file system: if a program crashes before calling fsync, the last bytes it wrote could easily be lost (maybe even the last few thousand). We know that, and learn to call fsync. But someone actually caring, about that specific operation, yet neglecting to manually call fsync? Seems very unlikely...
So here we have ZooKeeper acting... like Linux file systems normally act! In fact, you can manually call fsync anytime you like in ZooKeeper, so if you do need it, there it is. That's the write-barrier approach.
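For readers who haven't had to do this by hand, here is a minimal sketch of the ordinary Linux file system habit being described, written in Python. It shows only the file system analogy, not a ZooKeeper API; the helper name durable_append and the file name launch.log are made up for the example.

```python
import os
import tempfile


def durable_append(path: str, data: bytes) -> None:
    """Append data and block until the kernel reports it has been synced."""
    with open(path, "ab") as f:
        f.write(data)          # lands in Python's user-space buffer
        f.flush()              # hands the bytes to the kernel page cache
        os.fsync(f.fileno())   # asks the kernel to push them to the device


with tempfile.TemporaryDirectory() as d:
    log_path = os.path.join(d, "launch.log")   # hypothetical file name
    durable_append(log_path, b"... two, one, ignition!\n")
    with open(log_path, "rb") as f:
        contents = f.read()

print(contents)
```

Without the flush and fsync calls, the write would usually still show up on disk a little later. The point is that a program that skips them has no guarantee the bytes survive a power failure in the meantime.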
The battery-backed raid SSD option is less common.
So who is wrong: Leslie, for including this rule in the specification? The good user, who learns to call fsync when necessary? Or the bad user, for recklessly breaking the properties of Paxos, all for a lousy 50x or 100x speedup?
As a builder, I have to feel sympathy for any developer who wants the speed. And I honestly question that Paxos specification. Maybe the requirement really is too strong! Couldn't it be reexpressed in terms of fsync: "no committed request will ever be lost if fsync was invoked, and completed, after the commit"?
In fact the interesting issue here is that when ForceSync=no, ZooKeeper simply imposes an extra obligation on the behavior of the developer (use fsync, or confirm that you have the right kind of specialized SSD). As we discussed in that prior blog entry, Paxos already imposes obligations on its users, and doesn't express those either. Why is this different?
Yet I also understand Leslie. I've asked him about this, and he thinks developers will just get it wrong. They want the speed, because otherwise they look bad, so they flip this switch, but they do so in a situation where it really isn't appropriate.
Here in a college town, with students who definitely drive unsafely and fast, I get it.
How many of the developers who push ZooKeeper's insane speed button actually know what they are doing, and think about when to manually call fsync? Yet on the other hand, how many of their applications would break if the ZooKeeper storage subsystem were to slow down by 50x or 100x?
So you tell me: should systems follow the ZooKeeper lead?
Seriously: what do you think? Should Derecho support ForceSync=no?