Saturday, 13 June 2020

That's impossible!

Distributed computing, like many mature areas of computer science, has its share of impossibility results.  I was lucky to be in the field before many of them were discovered, and because I'm basically a software builder, my research group built all sorts of distributed systems and tools.  Later, though, the theory community showed that some of those things were impossible, which leaves a puzzle: How can anything solve an impossible problem?  

Today, I thought it might be fun to explore an example.  I want to go deeply enough into the question to shed light on the limitations of what is feasible in computer networks and cloud datacenters.  At the same time, we'll also discover limitations on what can be modelled and proved, and even some limitations on the subtle ways that choices of wording (in English) can shape perceptions about what can and cannot be done.  

When we talk about consistency in a distributed system, people generally map the idea to their favorite model. For fans of database models, this might be transactional serializability and the ACID properties: Atomicity, Consistency, Isolation and Durability.  Distributed systems researchers would point to State Machine Replication or Linearizability.  The systems I created also worry about dynamic changes to the pool of servers, and have a consistency mechanism called virtual synchrony: membership changes occur in atomic steps.  During a period when membership is stable (an epoch), we use state machine replication among the members.  Protocols like Paxos solve this part of the problem.

If you take a big step back and abstract, all of these forms of consistency boil down to forms of fault-tolerant consensus: we have some set of processes, and they propose various kinds of actions, and vote on what action to do next.  Consistency is the property that they all decide the same thing.  Thus, anything we can say about fault-tolerant consensus sheds light on fundamental limitations that apply to databases, distributed systems, and all sorts of other settings.
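
To make the target concrete, here is a minimal sketch of what a single-shot consensus object asks for, written as a hypothetical C++ interface (the names are mine, purely for illustration, and do not come from Isis, Derecho or any other system):

    #include <cstdint>
    #include <optional>

    // A minimal single-shot consensus object, sketched for illustration.
    // Each process calls propose() once with its own value; decide()
    // eventually reports the agreed value.  The safety properties are:
    //   Agreement: no two processes decide different values.
    //   Validity:  the decided value was proposed by some process.
    // Liveness -- the guarantee that decide() eventually stops returning
    // "empty" -- is exactly the part that FLP puts in question.
    class Consensus {
    public:
        virtual ~Consensus() = default;
        virtual void propose(uint64_t value) = 0;
        virtual std::optional<uint64_t> decide() const = 0;  // empty until decided
    };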

Any mathematically-based discussion starts with a model.  For example, what does it mean to say that a failure has occurred?  The most obvious choice is to say that well, some component halts -- it crashes suddenly and without doing anything horrible right as it collapses.  But it is easy to come up with other failure models.  

For example, consider timing.  If a processor has a local clock that fails (drifts from the truth, or perhaps starts to show nonsense values), sometimes the remainder of the system stays healthy, even including the process on the machine with the faulty clock.  Yet that could cause the process running on that machine to "lie" about what time some event occurred, or to miss a deadline.  We tend to wrap all of these up, and call them timing faults.  

Communication faults are tricky to model too.  You may not be aware of this, but on Linux a process can send a message and then have the operating system or network drop it.  Should we blame the process itself?  The network?  And should we say that the message wasn't sent, or wasn't received?  Worse, a network could split into parts that can't talk to each other: network "partitioning" failures.  

Then we get the "hit by a cosmic ray" issues.  Things like that (or more mundane problems like a power fluctuation) can cause the computer memory to flip bits.  As a result, a process could experience some form of data corruption.  And this doesn't even consider the case where the process is hijacked by a virus.  We tend to lump all such issues into what we call "malicious" failure models, but even within the malicious models, there is a range that includes whether or not one allows collusion, as opposed to a strictly isolated form of nefarious misbehavior: A virus that can infect one process might be able to infect every process running the same code, and then mount a coordinated attack on something else. In contrast, that bit flip is just a case of flakey hardware and would only impact a single process at a time.

There is a lot of work that explores this range of behaviors.  In fact one model, called the BAR model of failures, starts with these cases and then goes further by introducing incentives: are the participants out to cause chaos (a so-called byzantine case)?  Or are they altruistic?  Purely rational?  Crash failures are then layered in, giving a cross-product that you can study to yield an abundance of impossibility results for tasks like reaching agreement or electing a leader.  

For our purposes today, the result I want to discuss is one from a paper by Fischer, Lynch and Paterson called Impossibility of Distributed Consensus with One Faulty Process.  We often refer to this just as the FLP impossibility result, and it is arguably the most famous of all such results.   As the title suggests, the paper seemingly shows that agreement (consensus) is actually not possible if a system is at risk of crash failures.  On the face of it, the FLP assumptions about the network and the possible failures are very mild -- and the result seemingly implies that databases, Cornell's amazing new Derecho system, dozens of other Paxos-based systems, and Blockchain solutions for Bitcoin (all of which can solve consensus) are "doing something impossible."  Indeed, FLP seems to imply that the very idea of consistency as a goal is hopeless: if we accept the title, don't waste your time building a database!  But as we will see, a lot centers on the details of what the words in that title actually mean.  The result is real and valid.  It does apply to all of those kinds of systems.  But it doesn't mean we should pack up and head home.

The paper came out in 1985, which was an important year for me: it just happened to be the year when I wrote my first paper about a distributed system my students and I had created, one that implemented atomic multicast and Paxos and could be used to solve all sorts of practical agreement problems.  We called it Isis (as it turned out, an unfortunate choice).  Isis was released to the public, open source, and by 1987 or so it had a surprisingly wide uptake.  The system was ultimately used in the New York Stock Exchange to run the trading floor, was adopted in the French Air Traffic Control System, where it is still used for many purposes, and even found its way into the Oracle database product, which launches Isis every time the Oracle cluster management system boots -- and these are just a few examples.  

As you can guess, right from the start I was getting questions about FLP, including in panel sessions at major conferences where this oddity was debated.  My mother (an academic) has a saying that academic debates become heated precisely because the issues really don't matter.  The big conference in systems, SOSP, was infamous in that time period for fireworks, very much validating my mother's point.  In retrospect, I would have to say that relatively few SOSP attendees cared in a deep sense about FLP.  But they were curious to understand the result:  The paper is short and easy to read, but surprisingly hard to map to any simple intuition about what it "means".  And let's face it: they also enjoyed lively panel debates.

Isis users didn't fuss too much about FLP: if they knew about it at all, they perceived it as an oddity.  For a decade, all those traders on the floor of the NYSE happily traded stocks.  Eventually the parts of the system that used Isis were phased out during a routine upgrade, but not because it had caused any issues -- in fact there wasn't a single disruption to trading during that entire ten-year period (crashes did occur, but Isis orchestrated self-healing recovery in every case).  

By now, the French ATC system has been safely guiding airplanes since 1995 and will probably not be upgraded before 2025: a 30 year run!  The designers of those platforms, and others, liked Isis both as a practical tool and because of its consistency-preserving consensus mechanisms.  Isis created a world of state machine replication elements, which enabled self-managed, self-healing applications.  Moreover, just as Isis itself could be described easily using the mathematics of state machine replication, those applications could also be proved correct and safe, even when components failed.  

For example, one obviously wants to be sure that an Air Traffic Control System guarantees that each plane and each airport runway will have a single designated controller who is responsible for managing it.  Isis allowed the French system to obtain this property.  Each flight should have a single active flight plan; any change to the flight plan subsumes the prior one.  Isis was central to their correctness proof for this property, too (one can think of an ATC system as centering on a log of flight plan versions, in which each change is a durable append to the log, and the current flight plan is always the version closest to the end of the log).  

ATC systems never generate huge loads, and for this reason it was also possible to quantify the system's performance profile, and even to show that performance was stable across a wide range of stress tests and failure sequences (one does this by designing tests that carefully measure delays and bandwidth even as failures are injected to mimic scenarios believed possible during actual operation).  This enabled the developers to convince skeptics that if the workstation used by some controller were to fail, someone else would take over the role within 30 seconds.   During certification, the French used red-teams that were free to pore over the protocols, the code, and the way the system used it.  Then they would often demand proofs for challenging scenarios they would construct.  Sometimes the best response included a mathematical argument, but more often, these red-teams wanted to see experimental responses: an experiment that would mimic the case they worried about, and ride it out successfully.  Over time, the system evolved to have an enormous range of unit tests and integration tests.  Isis passed every test... and yet, FLP sits there.   

Curiously, the one time I can remember this coming up, the red-team dismissed the FLP work, feeling that it made assumptions that didn't apply in an ATC setting (as we will see, FLP is posed in a very abstracted way, and assumes a very general model of the network).  Yet I still felt that I needed a really cogent answer.  Suppose that someone really were to challenge Isis in this way.  What would be the best possible way to respond and explain how Isis relates to the FLP result?  

In fact this was an actual worry for me.  Those customers trusted me, and under my guidance, we were using the Isis system for applications where it genuinely matters that the solution have safe, consistent, self-healing behavior, and within seconds too!   We had correctness proofs for the system (really, proofs of safety, not liveness, but the performance work had the flavor of a liveness claim).   FLP proclaimed this combination of guarantees to be impossible... how can that be?  Indeed, there were follow-on papers that appeared soon after FLP, pointing out that Isis and its version of Paxos were clearly subject to the FLP impossibility result.   I had no choice but to figure out the proper rebuttal.  

A great deal of the explanation centers on the FLP paper's approach to modelling the problem.  Earlier, I said a few words about their system model, but didn't mention the network, and I didn't explain how they define asynchronous execution.  So let's tackle those aspects first.  The FLP paper assumes a network that mimics several aspects of TCP.  Processes are permitted to send each other messages, and FLP assumes that these are sent over reliable network channels that never drop, corrupt or duplicate packets.  I mentioned earlier that Linux, and the network itself, can both drop messages.  But in fact if you use TCP, the TCP protocol itself compensates.  On the face of it, FLP seems to assume exactly what TCP guarantees in a standard networked system, like a cloud data center.  TCP, as you probably know, obtains this behavior by sequencing data, and then using explicit acknowledgments or complaints ("negative" acknowledgements) to trigger retransmissions and rate control.  Duplicates can be filtered out because TCP has already seen those bytes.  

On closer study, the FLP network model departs from TCP by allowing out-of-order delivery.  Moreover, this matters: their proof requires this property, because it involves delaying some messages and allowing others to skip past the delayed ones (we'll see how they use this feature in a moment).  For a while, it seemed plausible that this could be the key.  Perhaps state machine replication is achievable in practice because we run on TCP (mostly), which delivers messages in order.  However, in the end it turned out that this particular aspect of the FLP model was unimportant.  FLP applies even to a protocol built over TCP or RDMA.

Another puzzle relates to the way the FLP model defines asynchronous behavior.  In the FLP description of processes, there are no clocks and indeed nothing even faintly resembling a clock: a set of processes could run the protocol for a billion years before achieving agreement, and this would be "just fine."  Obviously, an air traffic control system wouldn't be happy with billion year delays, so real systems like Isis and Derecho have timers built in: if some process isn't responsive, the others vote the silent one out.  To avoid a partitioning event (where our ATC system might split in half, with two subsets that each believe the other to have crashed, implying that two controllers would think themselves responsible for the same plane), we just require that in any such vote, a majority of the system has to "survive" long enough to vote in favor of the new membership.  The majority rule eliminates the risk of split-brain behaviors, which could threaten safety.   
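
A minimal sketch of the majority rule (my own illustration, not code from Isis or Derecho): a proposed new membership can only be installed if it retains a strict majority of the previous epoch's members, which is what prevents two disjoint halves from each continuing on its own.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using ProcessId = int;

    // Returns true if the proposed membership may be installed as the next
    // epoch.  Because it must contain a strict majority of the current
    // members, two disjoint partitions can never both pass this test, so
    // split-brain behavior is ruled out.
    bool may_install(const std::vector<ProcessId>& current,
                     const std::vector<ProcessId>& proposed) {
        std::size_t survivors = 0;
        for (ProcessId p : current)
            if (std::find(proposed.begin(), proposed.end(), p) != proposed.end())
                ++survivors;
        return 2 * survivors > current.size();   // strict majority of the old view
    }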

These points, it turns out, are a bit more complicated than the one involving TCP.  The first thing to appreciate is that rapid responsiveness is very important in an ATC system.  When a pilot is approaching a landing strip, the ATC system is under a timed obligation to tell the pilot what to do: should plane A land first, or will plane B land first?  Can plane A climb 2,500 feet to avoid that thunderhead?  

Failures can threaten those quick responses, and this means that an ATC system may be in a hurry to kick out an unresponsive component.  Yet if a pilot was cleared to land, but the original controller's computer then freezes and a new controller takes over, the new controller should know about that prior clearance.  This turns out to be the same requirement as the FLP rule that if any process decides v, then every process decides v.  This tells us that even though ATC systems are biased towards availability, the most fundamental aspect of the FLP way of defining consensus still applies.

What about clocks?  Real-world control systems like ATC platforms make heavy use of time, and take precautions to protect against errors that could be introduced by machines with faulty clocks.  FLP had no clocks... does this somehow take us out of the FLP scope?  One can imagine that it might: clocks enable timed actions, and we use timeouts for fault detection: an ATC platform will give up on any tardy process and treat it as faulty, simply because we don't want the entire system to end up waiting for some very overloaded machine to catch up.   Doesn't FLP itself need a way to model the detection of failures?  And because timeouts are a form of non-determinism, how can we reconcile this use of time with the deterministic state machine model for the protocol itself?

As it turns out, FLP does have a way to handle non-determinism.  In the state machine formalism, processes are state machines, but FLP allows states to have null transition edges.  That is, if a process (call it p) is in some state s, FLP models p as having a set of possible next states that we could reach from s: perhaps, states s' or s'' (loop-back edges are fine, by the way, so one of these could just be s itself).  Each such transition is described by an edge in a state transition graph, and these edges are labeled by a kind of pattern that unambiguously tells us which transition to take.  Thus, when a message is available, we could either consume that message, or take a null transition: a non-deterministic event. 

Given that the behavior of a deterministic state machine is fully determined by its sequence of inputs, you can see that the real question centers on the decision as to whether the next input will be a message or a null.  FLP gives the network this power: they model the network as an active entity that makes its own decisions.  In general, the network has a set of messages ready to deliver, and decides which to deliver, in what order, and whether or not to deliver a null to some process rather than a message.  Thus, in the FLP model, timeouts are viewed as a network "decision" to deliver a null message in a state where the protocol may have been waiting for something else (perhaps p is waiting for a message from q, but instead gets a null, and interprets this to mean that a timeout has occurred).
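
The flavor of the model can be captured in a few lines.  This is just my own toy rendering of the formalism, not code from the FLP paper: at each step the network, not the process, chooses whether the next event is a message delivery or a null.

    #include <deque>
    #include <optional>
    #include <string>

    // Toy rendering of an FLP-style step: a deterministic process that is
    // driven entirely by the sequence of events the network hands it.
    struct Process {
        int state = 0;
        void step(const std::optional<std::string>& event) {
            if (event)
                state += 1;   // placeholder for "consume the message and react"
            // A null event is a legal transition too: the process takes the
            // null edge defined for its current state (here, it simply waits).
        }
    };

    // The adversarial scheduler from the proof: it may deliver any pending
    // message, in any order, or hand the process a null instead.  Its only
    // obligation is that every message is *eventually* delivered.
    std::optional<std::string> network_choice(std::deque<std::string>& pending,
                                              bool deliver_now) {
        if (deliver_now && !pending.empty()) {
            std::string m = pending.front();
            pending.pop_front();
            return m;                  // deliver a real message
        }
        return std::nullopt;           // deliver a null transition instead
    }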

FLP's network is unusually powerful.  In effect, it is able to scrutinize each and every message, selectively deciding when to deliver it.  There is an obligation to eventually deliver every message... but no bound on how long the network can first delay it.  And here, it turns out, is the crux of why FLP concludes that consensus is impossible, even though protocols based on the Paxos model solve consensus billions of times every day.  Real systems always time out when a process has been unresponsive for more than a few seconds.  But FLP doesn't require the network to behave that way.  The network controls null transitions, yet it might not do so in a way that has anything to do with timeouts.  Here, then, is another possible candidate for explaining why FLP doesn't preclude building your favorite consensus-based technology: FLP's way of using nulls has no connection at all to timeouts, and doesn't actually model the way that timeouts are used in real systems.  And let me add: this is a very credible argument.  Many people invested years thinking about this exact argument, or ideas somewhat like it.

There is an old adage about big systems that if something can happen, it will happen.   The FLP authors would argue that their network simply schedules messages, that nulls are examples of places where a timeout could have occurred, and that because even TCP may have to retransmit a few times before a message gets through, that delayed delivery is a real and valid thing.  Thus any schedule FLP invents could arise in a real system, particularly if you imagine that a network link was somehow broken, then later repaired.   But as we will see (and one can easily turn the observation into a rigorous proof), the FLP network really has impossible superpowers, because the particular messages that end up being delayed this way, and the particular moments when a null transition occurs, need to be chosen with exquisite care.  Yes, each individual event could occur in the wild, but every one of them would be so unlikely that any sequence of them would be of zero probability: the probability of a sequence of unlikely events is the product of their probabilities, and with small probabilities, we approach zero exponentially quickly.    Yet just this kind of zero-probability sequence is the core of the FLP proof.
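
To make the arithmetic explicit (a back-of-the-envelope bound of my own, not a formula from the FLP paper): if the adversarial schedule depends on k specific unlikely events, each occurring with probability at most p < 1, then

    P(\text{schedule}) \;=\; \prod_{i=1}^{k} p_i \;\le\; p^{k} \;\longrightarrow\; 0 \quad \text{as } k \to \infty.

With p = 0.01, for example, a schedule needing just ten such events already has probability at most 10^-20.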

To see all of this in action, let's walk through a scenario that the FLP proof constructs.  Imagine that we have built a system that must vote on something, 0 or 1, and that we happen to be in a scenario with a near-tie.  Further, assume that we were given a tie-breaking rule.  Normally, there actually won't be a tie, but a failure could result in a tie, and then we would use the tie-breaker.  For example, perhaps we have 11 processes, 5 vote 0 and 6 vote 1.  The winner of this vote should be 1, but only if all of the 6 votes for 1 are tabulated.  If one crashes, we have a 5:5 tie, and the tie-breaking rule might award the win to 0.
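
For concreteness, the tally rule in this scenario might look like the following (my own illustration of the example, not anyone's protocol): with all 11 votes tabulated, 1 wins 6:5, but if one of the six 1-votes is lost to a crash, the 5:5 tie-breaker awards the decision to 0.

    #include <cstddef>
    #include <vector>

    // Tallies a 0/1 vote.  With a clear majority the winner is obvious; if
    // a crash removes one vote and produces a tie, the tie-breaking rule
    // (here: decide 0) is applied instead.
    int decide(const std::vector<int>& votes) {
        std::size_t ones = 0;
        for (int v : votes)
            if (v == 1) ++ones;
        std::size_t zeros = votes.size() - ones;
        if (ones == zeros) return 0;          // tie-breaking rule
        return ones > zeros ? 1 : 0;          // normal majority rule
    }
    // decide({0,0,0,0,0,1,1,1,1,1,1}) == 1, but if one of the 1-votes is
    // never tabulated, decide({0,0,0,0,0,1,1,1,1,1}) == 0 via the tie-breaker.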

FLP sets up this scenario, and the proof centers on delaying messages to confuse the protocol about whether to use its normal rule, or the tie-breaking rule.  They show that any fault tolerant protocol that can switch over and use its tie-breaking rule would require a few state-machine transitions during which, presumably, it would be internally reconfiguring to switch modes.  Then, just at this moment, FLP delivers the delayed vote, selecting some other message and delaying it instead.  In effect, just as the vote itself was a multi-step procedure, they are playing with the idea that switching to tie-breaking mode would also be a form of consensus: another multi-step procedure.

In a formal analysis they actually treat both of these as transitions from a state with two possible outcomes (0 or 1) to a state with just one decision.  And what they prove is that in any such state, a network can selectively delay some messages, and selectively deliver other messages (or nulls), and force the system back to a two-outcome condition.  Then they let the delayed messages through and attack again, in the identical way.  Thus, any decision is indefinitely delayed.

Interestingly, they can do this without any failures at all: they prove that any fault tolerant consensus protocol has a run in which no process fails, and yet even though nothing fails, no decision is ever reached!  But the key is that the network must have that superpower enabling it to selectively delay just the right messages at just the right instant, while also knowing just when to deliver the delayed messages, so that the network obligation to eventually deliver every message is respected.

From this, we can see that the FLP authors actually meant something peculiar by their use of the word "impossibility."  In normal conversation, impossible means just what it sounds like.  If I tell you that it is impossible for a protocol that can only decide 0 or 1 to decide -1, we would all agree on this.  But given what I've described, if I claimed that it is impossible to build a network that has the properties assumed by FLP, you would probably agree with me.  The FLP authors would disagree: for them, "exponentially convergent to 0 probability" still leaves that tiny, ultra-unlikely bad case.  No matter how unlikely, if a system could exhibit some behavior, the FLP model would consider that it is possible for that behavior to occur. 

Conversely, in FLP, impossible also has a meaning... but not the meaning I might instinctively assume.  Think about the classic definition of correctness for a protocol, as Dijkstra first framed it: a correct algorithm is one that guarantees safety (nothing bad happens, using a predicate to define bad behavior), and liveness (something good eventually happens).  The FLP definition of impossibility centers on liveness: if you really dig deep, FLP is telling us that if we have a safe consensus protocol for some asynchronous distributed system, it cannot also be a live protocol.  A person could be forgiven for assuming from the paper's title that there are no correct (safe) consensus protocols, but this is not what the paper actually shows.  In fact it does the opposite: it assumes we have a safe protocol, and then shows that the protocol cannot also guarantee liveness, by pointing out that the zero-probability schedule discussed above is capable of delaying decisions indefinitely.

We arrive at a peculiar insight.  On the one hand, ATC systems and other similar distributed applications need to make quick progress: we don't want an ATC system to delay for a billion years before deciding who can land on runway 2-E.   This leads them to reconfigure if a process seems unresponsive, dropping that slow process to remain highly available.  At the same time, they do require safety guarantees, and those stem from the safety properties of consensus.  Due to the risk of network partitioning, we know this approach can't guarantee liveness, but we accept that because the frequency of network partitioning is very low -- data center power loss is a more common problem.

Then, on the other hand, we have FLP.  One could brush it to the side and say that well, FLP isn't relevant here.  But is that really correct?  FLP doesn't consider the network partitioning scenario we knew about (and that already precluded liveness, given our availability goals).  Yet FLP seems to be warning that even if we could somehow eliminate this exposure to a crash due to partitioning, there actually is another "unstoppable" scenario, involving a peculiar network behavior, that no consensus protocol can defend against.  

But this leads to another insight: Reverting to Dijkstra, those of us who deliver safety-critical code to ATC organizations might actually want to prove that our systems are safe and live.  FLP teaches us that if you wish to prove that a system like Derecho will always make progress, you'll need to introduce some extra assumptions beyond the ones FLP employs.  Without extra assumptions, we can't create such proofs because if we could, we would have violated FLP.  How might those assumptions look?  Think of them as a list: "if there are no network partitions, and if the network has no way to selectively delay messages... ".  What would that list of conditions need to include?

There has been wonderful work on this question too: in a famous paper, Chandra and Toueg (later joined by Hadzilacos) described a very basic two-phase commit protocol for consensus, and worked out the weakest assumptions one can make that would still allow it to guarantee liveness: something they called the <>W failure detection oracle.  The plain-English name for <>W is "eventually weak", and in plain English, <>W is a failure detector that can make mistakes, but where eventually, some non-faulty process is recognized as healthy by all the other non-faulty processes.  This state needs to be sustained for long enough for the consensus protocol to terminate, and <>W expresses that by just saying "and we stay in this state forever".  
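
In code terms, a failure detector of this kind is just an oracle that each process can query; the <>W guarantee is a temporal property of its outputs, not of its interface.  Here is a hypothetical sketch of my own, not the Chandra-Toueg formalism itself:

    #include <set>

    using ProcessId = int;

    // A failure-detector oracle.  Each process asks which peers it currently
    // suspects of having crashed.  For <>W ("eventually weak"), the detector
    // may suspect anyone, wrongly, for an arbitrary period -- but eventually
    // some correct process stops being suspected by every correct process,
    // and stays unsuspected long enough for consensus to terminate.
    class FailureDetector {
    public:
        virtual ~FailureDetector() = default;
        virtual std::set<ProcessId> suspects() const = 0;
    };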

More recent work by Idit Keidar and her students, notably Alex Shraer, showed that in real systems, one can generally assume a stronger failure detector called <>P.  This failure detector can make mistakes for a while, but eventually settles into a state in which it makes perfect failure discoveries for long enough to allow consensus to complete.  

In a real system, there is actually a way to create a <>P guarantee.  The trick centers on making sure that if any process suspects q of having failed, q really has crashed.  How do we solve that aspect?  The real but somewhat silly answer is that real systems reboot, reimage or replace malfunctioning components.  James Hamilton came up with this little adage, and we refer to it as "Hamilton's three R's."  In effect, we unplug q.  Voila!  That process has definitely failed.

In a setting where we can assume <>P, FLP would actually not apply.  Now, Derecho and similar systems can be shown to be both safe and live.   Of course, events that violate our assumptions can still crash the system -- network partitioning, loss of datacenter power -- but our proof would hold as long as those conditions for progress are maintained.   

If you really ponder the point, with a <>P solution based on Hamilton's three R's, FLP becomes a network partitioning attack: rather than delaying messages, it kills so many processes that Derecho would shut down, and Paxos would stop accepting new updates or letting applications learn (read) the committed part of the Paxos log.

All of this discussion... just to address one impossibility result.  In fact we have many others, and they come with oddities too.  For example, there are many such results for Byzantine Agreement, where we have a set of processes, and some subset of them can fail in any way you like, including malicious behavior crafted to disrupt the system.  But the Byzantine model normally is explored in networks with perfect clocks and perfect synchronization -- and we don't have such networks.  Moreover, with Byzantine solutions, if the number of faulty components reaches and then passes the assumed limit, all bets are off.

Let's generalize this to the broader question I posed at the start.  If distributed computing theory is full of these kinds of pitfalls and peculiar results, what does that tell us about the discipline as a science?    To me, the moral of the story is simply that theory motivated by real systems can shed valuable light but that one has to view these results with a degree of sophistication and caution: they might not mean quite what you would assume, even if the paper you are studying has an especially provocative title.  The mathematics teaches deep things, but it often isn't trivial to relate those insights back to your actual system.

But this also makes it very hard to teach this material, especially to students who view distributed computing as an interesting curiosity, but not necessarily as a topic they really want to wrestle with.  While a comprehensive course would certainly delve deeply into the theory, the subtle nature of the result makes it very hard to include FLP in a more practical course, like my spring class on cloud computing.  For very practical students, when they hear that FLP says that consensus is impossible, there can be a tendency to jump wholeheartedly into Brewer's CAP model, with its "you can have two out of three" mantra. It happens that CAP is not a theorem and is often not even true, yet it can seem like a very simple and appealing rule of thumb.  CAP tells my students to just assume that consistency is hard and promises them (perhaps falsely) that inconsistency is far easier.

I did spend part of one lecture on FLP this spring.  I worried that the course might otherwise be unbalanced -- that I needed to at least expose the students to a few of the most important theoretical results.  In that lecture, my main message was that one shouldn't take every paper's title at face value.  I got them to propose definitions of "impossible" and then surprised them with the FLP meaning of impossibility.  And I do think they understood the point: most later got a quiz question on this right.  A good tradeoff: after all, FLP really is a lovely bit of mathematics, and at least they heard about it.

Sunday, 20 January 2019

Derecho status update

As we swing into 2019 mode, I wanted to share a quick Derecho status report and point to some goals for the coming months.

First, our ACM Transactions on Computer Systems paper will be appearing sometime soon, which should give nice visibility for the work, and also the validation that comes from a tough peer review process.  The paper has hugely improved through the pushback our reviewers provided, so it was a challenge but, I think, worth it.  The system itself works really well!

Next, we are starting to focus on a stronger integration with Azure IoT, where Derecho could be used either as a tool for creating new micro-services with strong fault tolerance and consistency guarantees, or as an ultra fast RDMA-capable object store.  Microsoft has been supportive of this and Derecho should be available from their third party portal, still as a free and open source technology.

But that portal isn’t available yet.  So right now, use Derecho via the  v0.9 release of the system, which will be available by February 1 (we are saving v1.0 for later in the year, after we have a reasonable amount of end user experience).  As of today, we still have one or two bugs we want to fix before doing that release.

Some key points:

  • We are urging people to use the Ubuntu Linux version, because this interoperates between normal Linux environments and Azure (the Microsoft cloud).  On our release site (download here), you can find the source code as well as some VM images (a container and a true VM) with the library preinstalled.  But in fact Derecho should work on any Linux-compatible system.
  • Right now, Derecho has only been tested within a single cluster or cloud (Azure, including Azure IoT, AWS, etc.).  We have some limited experience with virtualization, and with RoCE as opposed to pure InfiniBand.
  • The easiest path to using Derecho is via the new key-value store (a usage sketch follows this list).  In this store, both keys and values can be any serializable object type you like, and we offer a wide range of features: Put and Get, but also a conditional put, which checks that the version you are writing was based on the most current version of the underlying object (useful for atomic replace, as in Zookeeper), plus a watch operation that works just like topic-based pub-sub or DDS.  Objects can be stateless, stateful but not versioned, or versioned and persistent with strong consistency and extremely accurate temporal indexing.  On top of this we will eventually support pub-sub (think of Kafka or OpenSplice), file systems (HDFS, Ceph), and maybe even a genuine Zookeeper look-alike.  The only caution is that the watch feature isn’t designed to support huge numbers of watched topics.  So if you would have more than 50 or 100 active topics, consider using Dr. Multicast to squeeze that set down.
  • The full system can only be used directly from our templated library API in C++, but you can easily build a wired-down library with no templated methods and then load it from Java or Python or whatever.
  • Derecho runs on RDMA, Omni-Path, and even on normal TCP with no special hardware help at all.  You just configure it via a configuration file, to tell the system how to set itself up.  We use LibFabric for this mapping to the underlying hardware.
  • Right now, all the Derecho group members need to be on hardware with identical endian and byte alignment policies, but clients not in the group can use RESTful RPC, the OMG DDS stack, WCF or JNI to issue RPCs to Derecho group members, which can then relay the request as appropriate.  
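
To give a flavor of how the key-value store mentioned above is intended to be used, here is a small usage sketch.  The type and method names below are placeholders of my own invention, not the actual Derecho API; consult the release documentation for the real templated interfaces.

    #include <string>

    struct SensorReading { double value; };     // any serializable type will do

    // Stand-in for the real store, with illustrative method names only.
    template <typename K, typename V>
    struct KeyValueStore {
        void put(const K&, const V&) {}
        V get(const K&) const { return V{}; }
        bool conditional_put(const K&, const V&, int /*expected_version*/) { return true; }
        template <typename F> void watch(const K&, F&&) {}
    };

    void example(KeyValueStore<std::string, SensorReading>& store) {
        store.put("sensor/42", SensorReading{21.5});
        // Conditional put: succeeds only if version 7 is still the current
        // version of the object -- an atomic replace in the Zookeeper style.
        store.conditional_put("sensor/42", SensorReading{22.0}, /*expected_version=*/7);
        // Watch: behaves like a topic-based pub-sub subscription on this key.
        store.watch("sensor/42", [](const SensorReading&) { /* react to updates */ });
    }
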
Later this year we will extend the system in various ways.  The API should stay stable, but the new  features would include:
  • Hierarchically structured WAN layer that does read-only content mirroring for the object store.
  • A form of ultra fast and scalable LAN and WAN BlockChain support.
  • Machine checked correctness proofs, and a reference-version of the core Derecho protocols both in a high level form, and as proved and then re-extracted from those proofs in C or C++.
  • External client access to our API via RDMA, supporting point-to-point send and query.
  • Integration with Matt Milano’s mobile code language, MixT, allowing a client to send code to data residing in the object store.

Monday, 31 July 2017

Why is it so hard to mask failures?

When we talk about fault tolerant distributed computing, using the state machine replication approach, it may seem obvious that a system of this kind should be capable of completely masking failures.  In fact, however, this is not the case.  Our ability to hide failures is really very limited.

When developers use state machine replication techniques (SMR), the usual approach is to replace components of systems or distributed services with groups of N members, and then use some sort of library that delivers the same inputs to each, in the same order.  If the replicated component is deterministic, and if care is taken to initialize each component in a manner properly synchronized with respect to its peers, this is enough to guarantee that the copies will remain synchronized.  Thus, we have an N-replica group that seemingly will tolerate N-1 faults.
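
The essence of the approach fits in a few lines.  This is an illustrative sketch of my own, not any particular library: a deterministic component whose state is purely a function of the ordered commands applied to it, so that feeding every replica the identical command sequence keeps all N replicas identical.

    #include <cstdint>
    #include <string>
    #include <vector>

    // A deterministic state machine: its state depends only on the ordered
    // sequence of commands applied since initialization.
    struct Counter {
        int64_t value = 0;
        void apply(const std::string& cmd) {
            if (cmd == "inc") ++value;
            else if (cmd == "dec") --value;
        }
    };

    // What the SMR library provides is exactly this: the same totally
    // ordered log delivered to every member of the replica group.
    Counter replay(const std::vector<std::string>& ordered_log) {
        Counter c;
        for (const auto& cmd : ordered_log) c.apply(cmd);
        return c;
    }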

Unfortunately, theory is one thing and reality can be quite a different matter.  When people first began to experiment with SMR in the 1990's, developers quickly noticed that because software bugs are a major cause of failure, perfect replication will replicate many kinds of faults!  Over time, a more nuanced approach emerged, in which the various replicas are proactively shut down and restarted in an uncoordinated way, so that on average there would still be N copies, but at any instant in time there might be N-1 copies, with one copy shutting down or rejoining.  The trick is to transfer the current state of the group to the recovering member, and this is solved using the virtual synchrony model, in which group membership advances through a series of epochs, reported via view upcall notifications, with state transfers performed during epoch transitions.
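
In the virtual synchrony style, the state transfer hangs off the view upcall.  The sketch below is illustrative only (the types and the send_state helper are hypothetical, not the Isis or Derecho API): when a new epoch is reported, a surviving member ships the current replica state to any member that is rejoining after a restart.

    #include <vector>

    using ProcessId = int;

    struct View {                        // one membership epoch
        int view_id = 0;
        std::vector<ProcessId> members;
        std::vector<ProcessId> joiners;  // members absent from the prior view
    };

    struct Replica {
        std::vector<char> state;         // opaque service state (e.g., a checkpoint)

        // View upcall: invoked each time the membership advances to a new
        // epoch.  In a real system one surviving member would be chosen to
        // perform the transfer; here every survivor offers it, for brevity.
        void on_view(const View& v, ProcessId self) {
            for (ProcessId joiner : v.joiners)
                if (joiner != self)
                    send_state(joiner, state);
        }

        void send_state(ProcessId /*dest*/, const std::vector<char>&) {}  // hypothetical stub
    };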

The benefit of this sort of staggered restart is to overcome so-called Heisenbugs.  The term refers to bugs that are hard to pin down: bugs that cause non-deterministic behavior (in which case the replicas might diverge), or that seem to shift around when the developer tries to isolate them.

A common form of Heisenbug involves situations where a thread damages a data structure, but the damage won't be noticed until much later, at which point any of a number of other threads could try to access the structure and crash.  Thus the failure, when it occurs, is associated with logic remote from the true bug, and may jump about depending on scheduling order.  If the developer realizes that the root cause is the earlier damage to the data structure, it generally isn't too hard to fix the problem.  But if the developer is tricked into thinking the bug manifested in the code that triggered the crash, any attempts to modify that logic will probably just make things worse! 

The reason that staggered restart overcomes Heisenbugs is that a restarting program will load its initial state from some form of checkpoint, hence we end up with N copies, each using different operations to reach the same coordinated state as the other N-1.  If the data-structure corruption problem isn't a common thing, the joining process is unlikely to have corrupted the same data structure as the others did.  With proactive restart, all N copies may be in equivalent yet rather different states.  We can take this form of diversity even further by load-balancing read requests across our N copies: each will see different read operations, and this will be a further source of execution diversity, without mutating states in ways that could cause the N replicas to diverge.

With such steps, it isn't hard to build an ultra-resilient SMR service that can remain alive even through extremely disruptive failure episodes.  But can such a service "mask" failures?

The answer is yes and no.

On the "yes" side we find work by Robert Surton at Cornell, who created a clever little TCP fail-over solution called TCP-R.  Using this protocol, a TCP connection can seamlessly roll from one machine (the failed server) to another.  The cleverness arises because of the way that TCP itself handles acknowledgements: in Surton's approach, a service member accepts a TCP connection, reads the request (this assumes that the request size is smaller than the TCP window size, in bytes), replicates the received request using an SMR multicast, and only then allows TCP to acknowledge the bytes  comprising the request. 

If a failure disrupts the sequence, TCP-R allows a backup to take control over the TCP session and to reread the same bytes from the window.   Thus the service is guaranteed to read the request at least once.  A de-duplication step ensures that a request that happens to be read twice won't cause multiple state updates.

Replies back to the end user are handled in a similar way.   The service member about to send the reply first notifies the other members, using a multicast, and only then sends the reply.  If a failure occurs, one of the other members claims the TCP endpoint and finishes the interrupted send.

With TCP-R the end-user's experience is of a fully masked failure: the application sends its request, and the service definitely gets the request (unless all N members crash simultaneously, which will break the TCP session).

Lacking TCP-R, the situation is quite a bit more complex.  In effect, the end-user would need to send the request, but also be prepared to re-send it if the endpoint fails without responding.  For read-only requests, the service can just perform the request multiple times if it shows up multiple times, but updates are more complex.  For these, the service would need logic to deduplicate requests: if the same request shows up twice, it should resend the original reply and not mutate the service state by performing the identical operation a second time.  TCP-R masks the service failure from the perspective of the client, although the service itself still needs this form of deduplication logic.
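
A sketch of that deduplication logic, again purely illustrative: the service remembers the reply associated with each request id it has already executed, and if the same id shows up again it replays the stored reply instead of mutating its state a second time.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // De-duplication table for an SMR service.  Each client request carries
    // a unique id; a repeated id (a client retry, or a redundant relay from
    // another replica) gets the cached reply and causes no second update.
    class DedupService {
    public:
        std::string handle(uint64_t request_id, const std::string& request) {
            auto it = replies_.find(request_id);
            if (it != replies_.end())
                return it->second;                  // duplicate: replay the old reply
            std::string reply = execute(request);   // mutates service state exactly once
            replies_[request_id] = reply;
            return reply;
        }
    private:
        std::string execute(const std::string& request) { return "ok:" + request; }
        std::unordered_map<uint64_t, std::string> replies_;
    };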

On the "no" side of the coin, we need to consider the much broader range of situations that can arise in systems that use SMR for fault-tolerance.  In particular, suppose that one SMR-replicated service somehow interacts with a second SMR-replicated service.  Should all N members of the first replica group repeat the same request to the M members of the second group?   Doing so is clearly the most general solution, but runs into the difficulty that bytes will be transferred N times to each member: a high cost if we are simply trying to mask a rare event!

Eric Cooper studied this question in his PhD thesis on a system called Circus, at Berkeley in the 1990's.  Basically, he explored the range of options, from sending one request from group A to group B but reissuing the request if the sender in A or the receiver in B failed, all the way to the full N x M approach in which every member of A multicasts every request to every member of B, and the members of B thus receive N copies and must discard N-1 of them in the usual case.  (TCP-R can be understood as an instance of the first approach, but with the client-side logic hidden under TCP itself, so that only the server has to be aware of the risk of redundancy, and so that it only arises when a genuine failure occurs.)

Cooper pointed out that even with high costs for redundancy, the N x M approach can often outperform any scheme that waits to sense a failure before retrying.  His insight was that because detecting a failure can be slow (often 30s or more), proactively sending multiple copies of each request will generally produce extra load on the receivers, but with the benefit of ensuring a snappy response because at least some receiver will act promptly and send the desired reply with minimal delay.

Cooper's solution, Circus, is best understood as a design pattern: a methodology that the application developer is expected to implement.  It involves a multicast from group A to group B, code in group B to remember recent responses to requests from A, and logic to de-duplicate the request stream, so that whether B receives a request 1 time or N times, it behaves identically and responds to A in the same manner.

In Derecho, we don't currently offer any special help for this heavily redundant approach, but all the needed functionality is available and the design pattern shouldn't be hard to instantiate.  But in fact, many Derecho users are more likely to use a non-fault-tolerant approach when building a data processing pipeline.  More specifically, while Derecho users would often use replication to make the processing elements of the pipeline fault-tolerant, they might decide not to pay the overhead of making the handoff of requests, stage by stage in the pipeline, ultra-reliable.

The reason for this compromise is that in the IoT settings where a "smart memory service" might be used, most sensors are rather redundant, sending photo after photo of the same car on the highway, or location after location for the cat in the apartment.  The service receives this heavily duplicative input and will actually start by sorting out the good content and discarding the replicated data.  Thus we suspect that most Derecho users will be more concerned with ensuring that the service itself is highly available, and less concerned with ensuring that every single message sent to the service is processed.

Indeed, in most IoT settings, freshness of data is more important than perfect processing of each and every data point.  Thus, if some camera creates photo X of a vehicle on the highway, and then photo Y, and X is somehow lost because of a failure that occurs exactly as it is being sent, it often would make more sense not to fuss about X and just focus on processing request Y instead.

Microsoft has a system, Cosmos, in which a pipeline of processing is done on images and videos.  It manages without fault-tolerant handoff between stages because failures are rare and, if some object is missing, there is always a simple recipe to create it again from scratch.  Facebook apparently does this too.  Both systems need to be extra careful with the original copies of photos and videos, but computed artifacts can always be regenerated.  Thus, perfect fault tolerance isn't really needed!

Of course, one can easily imagine systems in which each piece of data sent to the service is individually of vital importance, and for those, an open question remains: is it really necessary to hand-code Cooper's Circus design pattern?  Or could there be a really nice way to package the Circus concept, for example by using a higher level language to describe services and compiling them down to SMR replicas that talk to one-another redundantly? 

I view this as an open research topic for Derecho, and one we may actually tackle in coming years.  Until then, Derecho can certainly support a high quality of adaptation after crashes, but won't seamlessly hide crashes from the developer who works with the technology.  But on the other hand, neither does any other technology of which I'm aware!