Saturday 13 June 2020

That's impossible!

Distributed computing, like many mature areas of computer science, has its share of impossibility results.  I was lucky to be in the field before many of them were discovered, and because I'm basically a software builder, my research group built all sorts of distributed systems and tools.  Later, though, the theory community showed that some of those things were impossible, which leaves a puzzle: How can anything solve an impossible problem?  

Today, I thought it might be fun to explore an example.  I want to go deeply enough into the question to shed light on the limitations of what is feasible in computer networks and cloud datacenters.  At the same time, we'll also discover limitations on what can be modelled and proved, and even some limitations on the subtle ways that choices of wording (in English) can shape perceptions about what can and cannot be done.  

When we talk about consistency in a distributed system, people generally map the work to their favorite model. For fans of database models, this might be transactional serializability and the ACID properties: Atomicity, Consistency, Isolation and Durability.  Distributed systems researchers would point to State Machine Replication or Linearizability.  The systems I created also worry about dynamic changes to the pool of servers, and have a consistency mechanism called virtual synchrony: membership changes occur in atomic steps.  During a period when membership is stable (an epoch), we use state machine replication among the members.  Protocols like Paxos solve this part of the problem.

If you take a big step back and abstract, all of these forms of consistency boil down to forms of fault-tolerant consensus: we have some set of processes, they propose various kinds of actions, and they vote on what action to do next.  Consistency is the property that they all decide the same thing.  Thus, anything we can say about fault-tolerant consensus sheds light on fundamental limitations that apply to databases, distributed systems, and all sorts of other settings.
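To make the abstraction concrete, here is a minimal sketch of the consensus interface: each process proposes a value, and every process must then decide the same value.  The names and the single-round majority tally are my own invention for illustration; this is not a fault-tolerant protocol, just a picture of the agreement property that such protocols must preserve.

```python
# Minimal sketch of the consensus abstraction (illustrative only; "propose"
# and "decide" are generic names, not taken from any particular system).

from collections import Counter

class ToyConsensus:
    """Collects one proposal per process; every caller then decides the same value."""

    def __init__(self, num_processes):
        self.num_processes = num_processes
        self.proposals = {}          # process id -> proposed value
        self.decision = None

    def propose(self, pid, value):
        self.proposals[pid] = value

    def decide(self):
        # Agreement: once computed, every caller sees the identical decision.
        # (A real protocol must also survive crashes mid-round, which is
        # exactly where FLP bites; this toy assumes all proposals arrived.)
        if self.decision is None:
            if len(self.proposals) < self.num_processes:
                raise RuntimeError("toy version: waits for every proposal")
            tally = Counter(self.proposals.values())
            self.decision = tally.most_common(1)[0][0]
        return self.decision

# Usage: three processes propose, and all observe the same decision.
c = ToyConsensus(3)
c.propose("p", 1); c.propose("q", 1); c.propose("r", 0)
assert c.decide() == c.decide() == 1
```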

Any mathematically-based discussion starts with a model.  For example, what does it mean to say that a failure has occurred?  The most obvious choice is to say that well, some component halts -- it crashes suddenly and without doing anything horrible right as it collapses.  But it is easy to come up with other failure models.  

For example, consider timing.  If a processor has a local clock that fails (drifts from the truth, or perhaps starts to show nonsense values), sometimes the remainder of the system stays healthy, even including the process on the machine with the faulty clock.  Yet that could cause the process running on that machine to "lie" about what time some event occurred, or to miss a deadline.  We tend to wrap all of these up, and call them timing faults.  

Communication faults are tricky to model too.  You may not be aware of this, but on Linux a process can send a message that the operating system or network then drops.  Should we blame the process itself?  The network?  And should we say that the message wasn't sent, or wasn't received?  Worse, a network could split into parts that can't talk to each other: network "partitioning" failures.  

Then we get the "hit by a cosmic ray" issues.  Things like that (or more mundane problems like a power fluctuation) can cause the computer memory to flip bits.  As a result, a process could experience some form of data corruption.  And this doesn't even consider the case where the process is hijacked by a virus.  We tend to lump all such issues into what we call "malicious" failure models, but even within the malicious models, there is a range that includes whether or not one allows collusion, as opposed to a strictly isolated form of nefarious misbehavior: A virus that can infect one process might be able to infect every process running the same code, and then mount a coordinated attack on something else. In contrast, that bit flip is just a case of flaky hardware and would only impact a single process at a time.

There is a lot of work that explores this range of behaviors.  In fact one model, called the BAR model of failures, starts with these cases and then goes further by introducing incentives: are the participants out to cause chaos (a so-called Byzantine case)?  Or are they altruistic?  Purely rational?  Crash failures are then layered in, giving a cross-product that you can study to yield an abundance of impossibility results for tasks like reaching agreement or electing a leader.  

For our purposes today, the result I want to discuss is one from a paper by Fischer, Lynch and Paterson called Impossibility of Distributed Consensus with One Faulty Process.  We often refer to this just as the FLP impossibility result, and it is arguably the most famous of all such results.   As the title suggests, the paper seemingly shows that agreement (consensus) is actually not possible if a system is at risk of crash failures.  On the face of it, the FLP assumptions about the network and the possible failures are very mild -- and the result seemingly implies that databases, Cornell's amazing new Derecho system, dozens of other Paxos-based systems, and Blockchain solutions for Bitcoin (all of which can solve consensus) are "doing something impossible."  Indeed, FLP seems to imply that the very idea of consistency as a goal is hopeless: if we accept the title, don't waste your time building a database!  But as we will see, a lot centers on the details of what the words in that title actually mean.  The result is real and valid.  It does apply to all of those kinds of systems.  But it doesn't mean we should pack up and head home.

The paper came out in 1985, which was an important year for me: it just happened to be the year when I wrote my first paper about a distributed system my students and I had created, one that implemented atomic multicast and Paxos and could be used to solve all sorts of practical agreement problems.  We called it Isis (as it turned out, an unfortunate choice).  Isis was released to the public, open source, and by 1987 or so it had a surprisingly wide uptake.  The system was ultimately used in the New York Stock Exchange to run the trading floor, was adopted in the French Air Traffic Control System, where it is still used for many purposes, and even found its way into the Oracle database product, which launches Isis every time the Oracle cluster management system boots -- and these are just a few examples.  

As you can guess, right from the start I was getting questions about FLP, including in panel sessions at major conferences where this oddity was debated.  My mother (an academic) has a saying that academic debates become heated precisely because the issues really don't matter.  The big conference in systems, SOSP, was infamous in that time period for fireworks, very much validating my mother's point.  In retrospect, I would have to say that relatively few SOSP attendees cared in a deep sense about FLP.  But they were curious to understand the result:  The paper is short and easy to read, but surprisingly hard to map to any simple intuition about what it "means".  And let's face it: they also enjoyed lively panel debates.

Isis users didn't fuss too much about FLP: if they knew about it at all, they perceived it as an oddity.  For a decade, all those traders on the floor of the NYSE happily traded stocks.  Eventually the parts of the system that used Isis were phased out during a routine upgrade, but not because it had caused any issues -- in fact there wasn't a single disruption to trading during that entire ten-year period (crashes did occur, but Isis orchestrated self-healing recovery in every case).  

By now, the French ATC system has been safely guiding airplanes since 1995 and will probably not be upgraded before 2025: a 30 year run!  The designers of those platforms, and others, liked Isis both as a practical tool and because of its consistency-preserving consensus mechanisms.  Isis created a world of state machine replication elements, which enabled self-managed, self-healing applications.  Moreover, just as Isis itself could be described easily using the mathematics of state machine replication, those applications could also be proved correct and safe, even when components failed.  

For example, one obviously wants to be sure that an Air Traffic Control System guarantees that each plane and each airport runway will have a single designated controller who is responsible for managing it.  Isis allowed the French system to obtain this property.  Each flight should have a single active flight plan; any change to the flight plan subsumes the prior one.  Isis was central to their correctness proof for this property, too (one can think of an ATC system as centering on a log of flight plan versions, in which each change is a durable append to the log, and the current flight plan is always the version closest to the end of the log).  
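That parenthetical is easy to make concrete.  Here is a minimal sketch of an append-only log of flight plan versions, in which the current plan is simply the version closest to the end of the log.  This is my own illustration, with invented names; the real French system is of course far more elaborate.

```python
# Illustrative sketch: each flight plan change is a durable append, and the
# current plan is always the most recently appended version.  (Hypothetical
# structure, not the actual ATC code.)

class FlightPlanLog:
    def __init__(self, flight_id):
        self.flight_id = flight_id
        self.versions = []           # append-only list of flight plan versions

    def append(self, plan):
        # In the real system this append would itself be a replicated,
        # consensus-backed operation, so every controller sees the same log.
        self.versions.append(plan)

    def current(self):
        # The active flight plan subsumes all prior ones.
        return self.versions[-1] if self.versions else None

log = FlightPlanLog("AF123")
log.append({"route": "LFPG->LFBO", "altitude": 35000})
log.append({"route": "LFPG->LFBO", "altitude": 37000})   # amended plan
assert log.current()["altitude"] == 37000
```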

ATC systems never generate huge loads, and for this reason it was also possible to quantify the system's performance profile, and even to show that performance was stable across a wide range of stress tests and failure sequences (one does this by designing tests that carefully measure delays and bandwidth even as failures are injected to mimic scenarios believed possible during actual operation).  This enabled the developers to convince skeptics that if the workstation used by some controller were to fail, someone else would take over the role within 30 seconds.   During certification, the French used red-teams that were free to pore over the protocols, the code, and the way the system used Isis.  Then they would often demand proofs for challenging scenarios they would construct.  Sometimes the best response included a mathematical argument, but more often, these red-teams wanted to see experimental responses: an experiment that would mimic the case they worried about, and ride it out successfully.  Over time, the system evolved to have an enormous range of unit tests and integration tests.  Isis passed every test... and yet, FLP sits there.   

Curiously, the one time I can remember this coming up, the red-team dismissed the FLP work, feeling that it made assumptions that didn't apply in an ATC setting (as we will see, FLP is posed in a very abstracted way, and assumes a very general model of the network).  Yet I still felt that I needed a really cogent answer.  Suppose that someone really were to challenge Isis in this way.  What would be the best possible way to respond and explain how Isis relates to the FLP result?  

In fact this was an actual worry for me.  Those customers trusted me, and under my guidance, we were using the Isis system for applications where it genuinely matters that the solution have safe, consistent, self-healing behavior, and within seconds too!   We had correctness proofs for the system (really, proofs of safety, not liveness, but the performance work had the flavor of a liveness claim).   FLP proclaimed this combination of guarantees to be impossible... how can that be?  Indeed, there were follow-on papers that appeared soon after FLP, pointing out that Isis and its version of Paxos were clearly subject to the FLP impossibility result.   I had no choice but to figure out the proper rebuttal.  

A great deal of the explanation centers on the FLP paper's approach to modelling the problem.  Earlier, I said a few words about their system model, but didn't mention the network, and I didn't explain how they define asynchronous execution.  So let's tackle those aspects first.  The FLP paper assumes a network that mimics several aspects of TCP.  Processes are permitted to send each other messages, and FLP assumes that these are sent over reliable network channels that never drop, corrupt or duplicate packets.  I mentioned earlier that Linux, and the network itself, can both drop messages.  But in fact if you use TCP, the TCP protocol itself compensates.  On the face of it, FLP seems to assume exactly what TCP guarantees in a standard networked system, like a cloud data center.  TCP, as you probably know, obtains this behavior by sequencing data, and then using explicit acknowledgments or complaints ("negative" acknowledgements) to trigger retransmissions and rate control.  Duplicates can be filtered out because TCP has already seen those bytes.  
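To illustrate the mechanism (a simplified sketch of my own, not the real TCP implementation, which works on byte ranges and is far more subtle), here is the receiver-side bookkeeping that sequencing enables: duplicates are filtered because the receiver has already seen those sequence numbers, and a gap triggers a negative acknowledgement asking for retransmission.

```python
# Simplified sketch of sequence-number bookkeeping on the receiver side.

class ReliableReceiver:
    def __init__(self):
        self.next_expected = 0       # next sequence number we want
        self.buffered = {}           # out-of-order segments held back
        self.delivered = []          # in-order data handed to the application

    def on_segment(self, seqno, data):
        if seqno < self.next_expected or seqno in self.buffered:
            return "duplicate: drop"             # already saw those bytes
        self.buffered[seqno] = data
        while self.next_expected in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
        if self.buffered:
            # A gap remains: ask the sender to retransmit the missing segment.
            return f"nack: resend {self.next_expected}"
        return f"ack: {self.next_expected}"

r = ReliableReceiver()
r.on_segment(0, "a")
print(r.on_segment(2, "c"))   # "nack: resend 1"  (segment 1 was lost)
print(r.on_segment(1, "b"))   # "ack: 3" -- the application sees a, b, c in order
```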

On closer study, the FLP network model departs from TCP by allowing out-of-order delivery.  Moreover, this matters: their proof requires this property, because it involves delaying some messages and allowing others to skip past the delayed ones (we'll see how they use this feature in a moment).  For a while, it seemed plausible that this could be the key.  Perhaps state machine replication is achievable in practice because we (mostly) run on TCP, which delivers messages in order.  However, in the end, it turned out that this particular aspect of the FLP model was unimportant.  FLP applies even to a protocol built over TCP or RDMA.

Another puzzle relates to the way the FLP model defines asynchronous behavior.  In the FLP description of processes, there are no clocks and indeed nothing even faintly resembling a clock: a set of processes could run the protocol for a billion years before achieving agreement, and this would be "just fine."  Obviously, an air traffic control system wouldn't be happy with billion year delays, so real systems like Isis and Derecho have timers built in: if some process isn't responsive, the others vote the silent one out.  To avoid a partitioning event (where our ATC system might split in half, with two subsets that each believe the other to have crashed, implying that two controllers would think themselves responsible for the same plane), we just require that in any such vote, a majority of the system has to "survive" long enough to vote in favor of the new membership.  The majority rule eliminates the risk of split-brain behaviors, which could threaten safety.   
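Here is a minimal sketch of that majority rule (my own illustration, with invented names): a proposed new membership is installed only if a majority of the current membership votes for it, which is what rules out two disjoint halves each installing a view of their own.

```python
# Sketch of majority-based membership change (illustrative; real group
# membership protocols, such as the ones in Isis or Derecho, do much more).

def try_install_view(current_members, votes_in_favor, proposed_members):
    """Install a new membership (view) only with a majority of the old one."""
    quorum = len(current_members) // 2 + 1
    if len(votes_in_favor & set(current_members)) >= quorum:
        return proposed_members          # new epoch begins with these members
    return None                          # no quorum: refuse, to avoid split brain

members = ["p1", "p2", "p3", "p4", "p5"]
# A 2-node minority that lost contact with the rest cannot form its own view:
assert try_install_view(members, {"p4", "p5"}, ["p4", "p5"]) is None
# The 3-node majority can vote the silent members out:
assert try_install_view(members, {"p1", "p2", "p3"}, ["p1", "p2", "p3"]) == ["p1", "p2", "p3"]
```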

These points, it turns out, are a bit more complicated than the one involving TCP.  The first thing to appreciate is that rapid responsiveness is very important in an ATC system.  When a pilot is approaching a landing strip, the ATC system is under a timed obligation to tell the pilot what to do: should plane A land first, or will plane B land first?  Can plane A climb 2,500 feet to avoid that thunderhead?  

Failures can threaten those quick responses, and this means that an ATC system may be in a hurry to kick out an unresponsive component.  Yet if a pilot was cleared to land, and then the original controller's computer froze and a new controller took over, the new controller should know about that prior clearance.  This turns out to be the same requirement as the FLP rule that if any process decides v, then every process decides v.  This tells us that even though ATC systems are biased towards availability, the most fundamental aspect of the FLP way of defining consensus still applies.

What about clocks?  Real-world control systems like ATC platforms make heavy use of time, and take precautions to protect against errors that could be introduced by machines with faulty clocks.  FLP had no clocks... does this somehow take us out of the FLP scope?  One can imagine that it might: clocks enable timed actions, and we use timeouts for fault detection: an ATC platform will give up on any tardy process and treat it as faulty, simply because we don't want the entire system to end up waiting for some very overloaded machine to catch up.   Doesn't FLP itself need a way to model the detection of failures?  And because timeouts are a form of non-determinism, how can we reconcile this use of time with the deterministic state machine model for the protocol itself?

As it turns out, FLP does have a way to handle non-determinism.  In the state machine formalism, processes are state machines, but FLP allows states to have null transition edges.  That is, if a process (call it p) is in some state s, FLP models p as having a set of possible next states that we could reach from s: perhaps, states s' or s'' (loop-back edges are fine, by the way, so one of these could just be s itself).  Each such transition is described by an edge in a state transition graph, and these edges are labeled by a kind of pattern that will unambiguously tell us which transition to take.    Thus, when a message is available, we could either consume that message, or take a null transition: a non-deterministic event. 
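To make the formalism a bit more concrete, here is a tiny sketch (my own illustration, not code from the FLP paper) of a process whose next input, on each step, is either a message or a null, with that choice made entirely outside the process itself:

```python
# Illustrative sketch of a process as a state machine whose next input is
# either a message or a null (None).  The process itself is deterministic:
# its state is fully determined by the sequence of inputs it receives.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = "waiting"
        self.inbox = []

    def step(self, event):
        if event is None:
            # Null transition: nothing arrived on this step.  A real protocol
            # might interpret enough of these as "timeout: suspect a failure."
            self.state = "waiting"
        else:
            self.inbox.append(event)
            self.state = f"received {len(self.inbox)} messages"
        return self.state

p = Process("p")
print(p.step(None))            # 'waiting' -- a null transition
print(p.step("vote from q"))   # 'received 1 messages'
```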

Given that the behavior of a deterministic state machine is fully determined by its sequence of inputs, you can see that the real question centers on the decision as to whether the next input will be a message or a null.  FLP gives the network this power: they model the network as an active entity that makes its own decisions.  In general, the network has a set of messages ready to deliver, and decides which to deliver, in what order, and whether or not to deliver a null to some process rather than a message.  Thus, in the FLP model, timeouts are viewed as a network "decision" to deliver a null message in a state where the protocol may have been waiting for something else (perhaps p is waiting for a message from q, but instead gets a null, and interprets this to mean that a timeout has occurred).
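A small sketch of that adversarial network (again my own illustration): at each step it chooses either to deliver one of the pending messages, in whatever order it likes, or to hand some process a null; its only obligation is that every pending message is delivered eventually.

```python
import random

# Sketch of the FLP-style network: an active entity that decides, step by
# step, whether some process receives a message or a null.  (Illustrative:
# here the choices are random, whereas the FLP proof makes them with
# exquisite care.)

class AdversarialNetwork:
    def __init__(self, processes):
        self.processes = processes       # list of process ids
        self.pending = []                # (destination, message) pairs

    def send(self, dest, message):
        self.pending.append((dest, message))

    def step(self):
        # The network's choice: deliver a pending message (any one it likes),
        # or deliver a null to an arbitrary process instead.
        if self.pending and random.random() < 0.5:
            dest, message = self.pending.pop(random.randrange(len(self.pending)))
            return dest, message
        return random.choice(self.processes), None   # a null delivery

net = AdversarialNetwork(["p", "q", "r"])
net.send("q", "vote=1")
for _ in range(5):
    print(net.step())    # e.g. ('r', None) ... ('q', 'vote=1') ...
```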

FLP's network is unusually powerful.  In effect, it is able to scrutinize each and every message, selectively deciding when to deliver it.  There is an obligation to eventually deliver every message... but no bound on how long the network can first delay it.  And here, it turns out, is the crux of why FLP concludes that consensus is impossible, even though protocols based on the Paxos model solve consensus billions of times every day.  Real systems always timeout when a process has been unresponsive for more than a few seconds.  But FLP doesn't require the network to behave that way.  The network controls null transitions, yet it might not do so in a way that has anything to do with timeouts.  Here, then, is another possible candidate for explaining why FLP doesn't preclude building your favorite consensus-based technology: FLP's way of using nulls has no connection at all to timeouts, and doesn't actually model the way that timeouts are used in real systems.  And let me add: this is a very credible argument.  Many people invested years thinking about this exact argument, or ideas somewhat like this.

There is an old adage about big systems that if something can happen, it will happen.   The FLP authors would argue that their network simply schedules messages, that nulls are examples of places where a timeout could have occurred, and that because even TCP may have to retransmit a few times before a message gets through, that delayed delivery is a real and valid thing.  Thus any schedule FLP invents could arise in a real system, particularly if you imagine that a network link was somehow broken, then later repaired.   But as we will see (and one can easily turn the observation into a rigorous proof), the FLP network really has impossible superpowers, because the particular messages that end up being delayed this way, and the particular moments when a null transition occurs, need to be chosen with exquisite care.  Yes, each individual event could occur in the wild, but every one of them would be so unlikely that any sequence of them would be of zero probability: the probability of a sequence of unlikely events is the product of their probabilities, and with small probabilities, we approach zero exponentially quickly.    Yet just this kind of zero-probability sequence is the core of the FLP proof.

To see all of this in action, let's walk through a scenario that the FLP proof constructs.  Imagine that we have built a system that must vote on something, 0 or 1, and that we happen to be in a scenario with a near-tie.  Further, assume that we were given a tie-breaking rule.  Normally, there actually won't be a tie, but a failure could result in a tie, and then we would use the tie-breaker.  For example, perhaps we have 11 processes, 5 vote 0 and 6 vote 1.  The winner of this vote should be 1, but only if all of the 6 votes for 1 are tabulated.  If one crashes, we have a 5:5 tie, and the tie-breaking rule might award the win to 0.
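A sketch of that decision rule (illustrative only; the FLP argument applies to any fault-tolerant decision rule, not just this one): count the votes that actually arrive, use the majority when there is one, and fall back to the tie-breaker otherwise.

```python
# Illustrative tally: majority wins, with a deterministic tie-breaker.

def decide(received_votes, tie_breaker=0):
    ones = sum(1 for v in received_votes if v == 1)
    zeros = len(received_votes) - ones
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return tie_breaker        # a crash turned a near-win into a tie

all_votes = [0]*5 + [1]*6
assert decide(all_votes) == 1          # all 11 votes tabulated: 1 wins
assert decide(all_votes[:-1]) == 0     # one "1" voter crashed: 5:5 tie, tie-breaker picks 0
```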

FLP sets up this scenario, and the proof centers on delaying messages to confuse the protocol about whether to use its normal rule, or the tie-breaking rule.  They show that any fault tolerant protocol that can switch over and use its tie-breaking rule would require a few state-machine transitions during which, presumably, it would be internally reconfiguring to switch modes.  Then, just at this moment, FLP delivers the delayed vote, selecting some other message and delaying it instead.  In effect, just as the vote itself was a multi-step procedure, they are playing with the idea that switching to tie-breaking mode would also be a form of consensus: another multi-step procedure.

In a formal analysis they actually treat both of these as transitions from a state with two possible outcomes (0 or 1) to a state with just one possible decision -- in the paper's terminology, from a "bivalent" state to a "univalent" one.  And what they prove is that in any such state, a network can selectively delay some messages, and selectively deliver other messages (or nulls), and force the system back to a two-outcome condition.  Then they let the delayed messages through and attack again, in the identical way.  Thus, any decision is indefinitely delayed.

Interestingly, they can do this without any failures at all: they prove that any fault tolerant consensus protocol has a run in which no process fails, and yet even though nothing fails, no decision is ever reached!  But the key is that the network must have that superpower enabling it to selectively delay just the right messages at just the right instant, while also knowing just when to deliver the delayed messages, so that the network obligation to eventually deliver every message is respected.

From this, we can see that the FLP authors actually meant something peculiar by their use of the word "impossibility."  In normal conversation, impossible means just what it sounds like.  If I tell you that it is impossible for a protocol that can only decide 0 or 1 to decide -1, we would all agree on this.  And given what I've described, if I claimed that it is impossible to build a network that has the properties assumed by FLP, you would probably agree with me too.  The FLP authors would disagree: for them, "exponentially convergent to 0 probability" still leaves that tiny, ultra-unlikely bad case.  No matter how unlikely, if a system could exhibit some behavior, the FLP model considers it possible for that behavior to occur. 

Conversely, in FLP, impossible also has a meaning... but not the meaning I might instinctively assume.  Think about the classic definition of correctness for a protocol, as Dijkstra first framed it: a correct algorithm is one that guarantees safety (nothing bad happens, using a predicate to define bad behavior) and liveness (something good eventually happens).  The FLP definition of impossibility centers on liveness: if you really dig deep, FLP is telling us that if we have a safe consensus protocol for some asynchronous distributed system, it cannot also be a live protocol.  A person could be forgiven for assuming from the paper's title that there are no correct (safe) consensus protocols, but this is not what the paper actually does.  In fact it does the opposite: it assumes we have a safe protocol, and then shows that the protocol cannot also guarantee liveness, by pointing out that the zero-probability schedule discussed above is capable of delaying decisions indefinitely.

We arrive at a peculiar insight.  On the one hand, ATC systems and other similar distributed applications need to make quick progress: we don't want an ATC system to delay for a billion years before deciding who can land on runway 2-E.   This leads them to reconfigure if a process seems unresponsive, dropping that slow process to remain highly available.  At the same time, they do require safety guarantees, and those stem from the safety properties of consensus.  Due to the risk of network partitioning, we know this approach can't guarantee liveness, but we accept that because the frequency of network partitioning is very low -- data center power loss is a more common problem.

Then, on the other hand, we have FLP.  One could brush it to the side and say that well, FLP isn't relevant here.  But is that really correct?  FLP doesn't consider the network partitioning scenario we knew about (and that already precluded liveness, given our availability goals).  Yet FLP seems to be warning that even if we could somehow eliminate this exposure to a crash due to partitioning, there actually is another "unstoppable" scenario, involving a peculiar network behavior, that no consensus protocol can defend against.  

But this leads to another insight: Reverting to Dijkstra, those of us who deliver safety-critical code to ATC organizations might actually want to prove that our systems are safe and live.  FLP teaches us that if you wish to prove that a system like Derecho will always make progress, you'll need to introduce some extra assumptions beyond the ones FLP employs.  Without extra assumptions, we can't create such proofs because if we could, we would have violated FLP.  How might those assumptions look?  Think of them as a list: "if there are no network partitions, and if the network has no way to selectively delay messages... ".  What would that list of conditions need to include?

There has been wonderful work on this question too: in a famous paper, Chandra and Toueg (later joined by Hadzilacos) described a very basic two-phase commit protocol for consensus, and worked out the weakest assumptions one can make that would still allow it to guarantee liveness: something they called the <>W failure detection oracle.  The plain-English name for <>W is "eventually weak", and in plain English, <>W is a failure detector that can make mistakes, but where eventually, some non-faulty process is recognized as healthy by all the other non-faulty processes.  This state needs to be sustained for long enough for the consensus protocol to terminate, and <>W expresses that by just saying "and we stay in this state forever".  

More recent work by Idit Keidar and her students, notably Alex Shraer, showed that in real systems, one can generally assume a stronger failure detector called <>P.  This failure detector can make mistakes for a while, but eventually settles into a state in which its failure detections are accurate for long enough to allow consensus to complete.  

In a real system, there is actually a way to create a <>P guarantee.  The trick centers on making sure that if any process suspects q of having failed, q really has crashed.  How do we solve that aspect?  The real but somewhat silly answer is that real systems reboot, reimage or replace malfunctioning components.  James Hamilton came up with this little adage, and we refer to it as "Hamilton's three R's."  In effect, we unplug q.  Voila!  That process has definitely failed.
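Here is a minimal sketch of that idea (my own illustration, with invented names): a heartbeat-based detector that, once it suspects q, makes the suspicion true by fencing q out. The "perfect" part of <>P is manufactured rather than merely observed.

```python
import time

# Illustrative heartbeat-based failure detector in the spirit of <>P.
# Once q is suspected, we "unplug" it (Hamilton's reboot/reimage/replace),
# so the suspicion is guaranteed to be correct.  Names are hypothetical.

class FencingFailureDetector:
    def __init__(self, timeout_seconds, fence):
        self.timeout = timeout_seconds
        self.fence = fence               # callback that forcibly removes a process
        self.last_heartbeat = {}
        self.suspected = set()

    def heartbeat(self, pid):
        if pid not in self.suspected:    # a fenced process stays failed
            self.last_heartbeat[pid] = time.monotonic()

    def check(self):
        now = time.monotonic()
        for pid, seen in self.last_heartbeat.items():
            if pid not in self.suspected and now - seen > self.timeout:
                self.suspected.add(pid)
                self.fence(pid)          # make the suspicion come true
        return set(self.suspected)

detector = FencingFailureDetector(2.0, fence=lambda pid: print(f"unplugging {pid}"))
detector.heartbeat("q")
# ... after two seconds of silence, check() would suspect and fence "q".
```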

In a setting where we can assume <>P, FLP would actually not apply.  Now, Derecho and similar systems can be shown to be both safe and live.   Of course, events that violate our assumptions can still crash the system -- network partitioning, loss of datacenter power -- but our proof would hold as long as those conditions for progress are maintained.   

If you really ponder the point, with a <>P solution based on Hamilton's three R's, FLP becomes a network partitioning attack: rather than delaying messages, it kills so many processes that Derecho would shut down, and Paxos would stop accepting new updates or letting applications learn (read) the committed part of the Paxos log.

All of this discussion... just to address one impossibility result.  In fact we have many others, and they come with oddities too.  For example, there are many such results for Byzantine Agreement, where we have a set of processes, and some subset of them can fail in any way you like, including malicious behavior crafted to disrupt the system.  But the Byzantine model normally is explored in networks with perfect clocks and perfect synchronization -- and we don't have such networks.  Moreover, with Byzantine solutions, if the number of faulty components reaches and then passes the assumed limit, all bets are off.

Let's generalize this to the broader question I posed at the start.  If distributed computing theory is full of these kinds of pitfalls and peculiar results, what does that tell us about the discipline as a science?    To me, the moral of the story is simply that theory motivated by real systems can shed valuable light, but one has to view these results with a degree of sophistication and caution: they might not mean quite what you would assume, even if the paper you are studying has an especially provocative title.  The mathematics teaches deep things, but it often isn't trivial to relate those insights back to your actual system.

But this also makes it very hard to teach this material, especially to students who view distributed computing as an interesting curiosity, but not necessarily as a topic they really want to wrestle with.  While a comprehensive course would certainly delve deeply into the theory, the subtle nature of the result makes it very hard to include FLP in a more practical course, like my spring class on cloud computing.  For very practical students, when they hear that FLP says that consensus is impossible, there can be a tendency to jump wholeheartedly into Brewer's CAP model, with its "you can have two out of three" mantra. It happens that CAP is not a theorem and is often not even true, yet it can seem like a very simple and appealing rule of thumb.  CAP tells my students to just assume that consistency is hard and promises them (perhaps, falsely) that inconsistency is far easier.

I did spend part of one lecture on FLP this spring.  I worried that the course might otherwise be unbalanced -- that I needed to at least expose the students to a few of the most important theoretical results.  In that lecture, my main message was that one shouldn't take every paper's title at face value.  I had the students propose definitions of "impossible" and then surprised them with the FLP meaning of impossibility.  And I do think they understood the point: most later got a quiz question on this right.  A good tradeoff: after all, FLP really is a lovely bit of mathematics, and at least they heard about it.