A Few Thoughts on Distributed Computing: consistency

Showing posts with label consistency. Show all posts

Saturday, 13 June 2020

That's impossible!

Distributed computing, like many mature areas of computer science, has its share of impossibility results. I was lucky to be in the field before many of them were discovered, and because I'm basically a software builder, my research group built all sorts of distributed systems and tools. Later, though, the theory community showed that some of those things were impossible, which leaves a puzzle: How can anything solve an impossible problem?

Today, I thought it might be fun to explore an example. I want to go deeply enough into the question to shed light on the limitations of what is feasible in computer networks and cloud datacenters. At the same time, we'll also discover limitations on what can be modelled and proved, and even some limitations on the subtle ways that choices of wording (in English) can shape perceptions about what can and cannot be done.

When we talk about consistency in a distributed system, people generally map the work to their favorite model. For fans of database models, this might be transactional serializability and the ACID properties: Atomicity, Correctness, Isolation and Durability. Distributed systems researchers would point to State Machine Replication or Linearizability. The systems I created also worry about dynamic changes to the pool of servers, and have a consistency mechanism called virtual synchrony: membership changes in atomic steps. During a period when membership is stable (an epoch), we use state machine replication within the members. Protocols like Paxos solve this part of the problem.

If you take a big step back and abstract, all of these forms of consistency boil down to forms of fault-tolerant consensus: we have some set of processes, and they propose various kinds of actions, and vote on what action to do next. Consistency is the property that that they all decide the same thing. Thus, anything we can say about fault-tolerant consensus sheds light on fundamental limitations that apply to databases, distributed systems, and all sorts of other settings.

Any mathematically-based discussion starts with a model. For example, what does it mean to say that a failure has occurred? The most obvious choice is to say that well, some component halts -- it crashes suddenly and without doing anything horrible right as it collapses. But it is easy to come up with other failure models.

For example, consider timing. If a processor has a local clock that fails (drifts from the truth, or perhaps starts to show nonsense values), sometimes the remainder of the system stays healthy, even including the process on the machine with the faulty clock. Yet that could cause the process running on that machine to "lie" about what time some event occurred, or to miss a deadline. We tend to wrap all of these up, and call them timing faults.

Communication faults are tricky to model too. You may not be aware of this, but on Linux a process could send a message, but then the operating system or network could drop it. Should we blame the process itself? The network? And should we say that the message wasn't sent, or wasn't received? Worse, a network could split into parts that can't talk to each other: network "partitioning" failures.

Then we get the "hit by a cosmic ray" issues. Things like that (or more mundane problems like a power fluctuation) can cause the computer memory to flip bits. As a result, a process could experience some form of data corruption. And this doesn't even consider the case where the process is hijacked by a virus. We tend to lump all such issues into what we call "malicious" failure models, but even within the malicious models, there is a range that includes whether or not one allows collusion, as opposed to a strictly isolated form of nefarious misbehavior: A virus that can infect one process might be able to infect every process running the same code, and then mount a coordinated attack on something else. In contrast, that bit flip is just a case of flakey hardware and would only impact a single process at a time.

There is a lot of work that explores this range of behaviors. In fact one model, called the BAR model of failures, starts with these cases and then goes further by introducing incentives: are the participants out to cause chaos (a so-called byzantine case)? Or are they altruistic? Purely rational? Crash failures are then layered in, giving a cross-product that you can study to yield an abundance of impossibility results for tasks like reaching agreement or electing a leader.

For our purposes today, the result I want to discuss is one from a paper by Fischer, Lynch and Paterson called Impossibility of Consensus With One Faulty Process. We often refer to this just as the FLP impossibility result, and it is arguably the most famous of all such results. As the title suggests, the paper seemingly shows that agreement (consensus) is actually not possible if a system is at risk of crash failures. On the face of it, the FLP assumptions about the network and the possible failures are very mild -- and the result seemingly implies that databases, Cornell's amazing new Derecho system, dozens of other Paxos-based systems, Blockchain solutions for Bitcoin (all of which can solve consensus) are "doing something impossible." Indeed, FLP seems to imply that the very idea of consistency as a goal is hopeless: if we accept the title, don't waste your time building a database! But as we will see, a lot centers on the details of what the words in that title actually mean. The result is a real, valid, one. It does apply to all of those kinds of systems. But it doesn't mean we should pack up and head home.

The paper came out in 1985, which was an important year for me: it just happened to be the year when I wrote my first paper about a distributed system my students and I had created, one that implemented atomic multicast and Paxos and could be used to solve all sorts of practical agreement problems. We called it Isis (as it turned out, an unfortunate choice). Isis was released to the public, open source, and by 1987 or so it had a surprisingly wide uptake. The system was ultimately used in the New York Stock Exchange to run the trading floor, was adopted in the French Air Traffic Control System where still is used for many purposes, and even into the Oracle database product, which launches Isis every time the Oracle cluster management system boots -- and this is just a few examples.

As you can guess, right from the start I was getting questions about FLP, including in panel sessions at major conferences where this oddity was debated. My mother (an academic) has a saying that academic debates become heated precisely because the issues really don't matter. The big conference in systems, SOSP, was infamous in that time period for fireworks, very much validating my mother's point. In retrospect, I would have to say that relatively few SOSP attendees cared in a deep sense about FLP. But they were curious to understand the result: The paper is short and easy to read, but surprisingly hard to map to any simple intuition about what it "means". And let's face it: they also enjoyed lively panel debates.

Isis users didn't fuss too much about FLP: if they knew about it at all, they perceived it as an oddity. For a decade, all those traders on the floor of the NYSE happily traded stocks. Eventually the parts of the system that used Isis were phased out during a routine upgrade, but not because it had caused any issues -- in fact there wasn't a single disruption to trading during that entire ten-year period (crashes did occur, but Isis orchestrated self-healing recovery in every case).

By now, the French ATC system has been safely guiding airplanes since 1995 and will probably not be upgraded before 2025: a 30 year run! The designers of those platforms, and others, liked Isis both as a practical tool and because of its consistency-preserving consensus mechanisms. Isis created a world of state machine replication elements, which enabled self-managed, self-healing applications. Moreover, just as Isis itself could be described easily using the mathematics of state machine replication, those applications could also be proved correct and safe, even when components failed.

For example, one obviously wants to be sure that an Air Traffic Control System guarantees that each plane and each airport runway will have a single designated controller who is responsible for managing it. Isis allowed the French system to obtain this property. Each flight should have a single active flight plan; any change to the flight plan subsumes the prior one. Isis was central to their correctness proof for this property, too (one can think of an ATC system as centering on a log of flight plan versions, in which each change is a durable append to the log, and the current flight plan is always the version closest to the end of the log).

ATC systems never generate huge loads, and for this reason it was also possible to quantify the system's performance profile, and even to show that performance was stable across a wide range of stress tests and failure sequences (one does this by designing tests that carefully measure delays and bandwidth even as failures are injected to mimic scenarios believed possible during actual operation). This enabled the developers to convince skeptics that if the workstation used by some controller were to fail, someone else would take over the role within 30 seconds. During certification, the French used red-teams that were free to pour over the protocols, the code, and the way the system used it. Then they would often demand proofs for challenging scenarios they would construct. Sometimes the best response included a mathematical argument, but more often, these red-teams wanted to see experimental responses: an experiment that would mimic the case they worried about, and ride it out successfully. Over time, the system evolved to have an enormous range of unit tests and integration tests. Isis passed every test... and yet, FLP sits there.

Curiously, the one time I can remember this coming up, the red-team dismissed the FLP work, feeling that it made assumptions that didn't apply in an ATC setting (as we will see, FLP is posed in a very abstracted way, and assumes a very general model of the network). Yet I still felt that I needed a really cogent answer. Suppose that someone really were to challenge Isis in this way. What would be the best possible way to respond and explain how Isis relates to the FLP result?

In fact this was an actual worry for me. Those customers trusted me, and under my guidance, we were using the Isis system for applications where it genuinely matters that the solution have safe, consistent, self-healing behavior, and within seconds too! We had correctness proofs for the system (really, proofs of safety, not liveness, but the performance work had the flavor of a liveness claim). FLP proclaimed this combination of guarantees to be impossible... how can that be? Indeed, there were follow-on papers that appeared soon after FLP, pointing out that Isis and its version of Paxos were clearly subject to the FLP impossibility result. I had no choice but to figure out the proper rebuttal.

A great deal of the explanation centers on the FLP paper's approach to modelling the problem. Earlier, I said a few words about their system model, but didn't mention the network, and I didn't explain how they define asynchronous execution. So let's tackle those aspects first. The FLP paper assumes a network that mimics several aspects of TCP. Processes are permitted to send each other messages, and FLP assumes that these are sent over reliable network channels that never drop, corrupt or duplicate packets. I mentioned earlier that Linux, and the network itself, can both drop messages. But in fact if you use TCP, the TCP protocol itself compensates. On the face of it, FLP seems to assume exactly what TCP guarantees in a standard networked system, like a cloud data center. TCP, as you probably know, obtains this behavior by sequencing data, and then using explicit acknowledgments or complaints ("negative" acknowledgements) to trigger retransmissions and rate control. Duplicates can be filtered out because TCP has already seen those bytes.

On closer study, the FLP network model departs from TCP by allowing out-of-order delivery. Moreover, this matters: their proof requires this property, because it involves delaying some messages and allowing others to skip past the delayed ones (we'll see how they use this feature in a moment). For a while, it seemed plausible that this could be the key. Perhaps state machine replication seems to be very possible because we run on TCP (mostly), and it delivers messages in order. However in the end, it turned out that this particular aspect of the FLP model was unimportant. FLP applies even to a protocol built over TCP or RDMA.

Another puzzle relates to the way the FLP model defines asynchronous behavior. In the FLP description of processes, there are no clocks and indeed nothing even faintly resembling a clock: a set of processes could run the protocol for a billion years before achieving agreement, and this would be "just fine." Obviously, an air traffic control system wouldn't be happy with billion year delays, so real systems like Isis and Derecho have timers built in: if some process isn't responsive, the others vote the silent one out. To avoid a partitioning event (where our ATC system might split in half, with two subsets that each believe the other to have crashed, implying that two controllers would think themselves responsible for the same plane), we just require that in any such vote, a majority of the system has to "survive" long enough to vote in favor of the new membership. The majority rule eliminates the risk of split-brain behaviors, which could threaten safety.

These points, it turns out, are a bit more complicated than the one involving TCP. The first thing to appreciate is that rapid responsiveness is very important in an ATC system. When a pilot is approaching a landing strip, the ATC system is under a timed obligation to tell the pilot what to do: should plane A land first, or will plane B land first? Can plane A to climb 2500 feet to avoid that thunderhead?

Failures can threaten those quick responses, and this means that an ATC system may be in a hurry to kick out an unresponsive component. Yet if a pilot was cleared to land, but then the original controller's computer freezes and a new controller takes over, he or she should know about that prior clearance. This turns out to be the same requirement as the FLP rule that if any process decides v, then every process decides v. This tells us that even though ATC systems are biased towards availability, the most fundamental aspect of the FLP way of defining consensus still applies.

What about clocks? Real-world control systems like ATC platforms make heavy use of time, and take precautions to protect against errors that could be introduced by machines with faulty clocks. FLP had no clocks... does this somehow take us out of the FLP scope? One can imagine that it might: clocks enable timed actions, and we use timeout for fault detection: an ATC platform will give up on any tardy process and treat it as faulty, simply because we don't want the entire system to end up waiting for some very overloaded machine to catch up. Doesn't FLP itself need a way to model the detection of failures? And because timeouts are a form of non-determinism, how can we reconcile this use of time with the deterministic state machine model for the protocol itself?

As it turns out, FLP does have a way to handle non-determinism. In the state machine formalism, processes are state machines, but FLP allows states to have null transition edges. That is, if a process (call it p) is in some state s, FLP models p as having a set of possible next states that we could reach from s: perhaps, states s' or s'' (loop-back edges are fine, by the way, so one these could just be s itself). Each such transition is described by an edge in a state transition graph, and these edges are labeled by a kind of pattern that will unambiguously tell us which transition to take. Thus, when a message is available, we could either consume that message, or take a null transition: a non-deterministic event.

Given that the behavior of a deterministic state machine is fully determined by its sequence of inputs, you can see that the real question centers on the decision as to whether the next input will be a message or a null. FLP gives the network this power: they model the network as an active entity that makes its own decisions. In general, the network has a set of messages ready to deliver, and decides which to deliver, in what order, and whether or not to deliver a null to some process rather than a message. Thus, in the FLP model, timeouts are viewed as a network "decision" to deliver a null message in a state where the protocol may have been waiting for something else (perhaps p is waiting for a message from q, but instead gets a null, and interprets this to mean that a timeout has occurred).

FLP's network is unusually powerful. In effect, it is able to scrutinize each and every message, selectively deciding when to deliver it. There is an obligation to eventually deliver every message... but no bound on how long the network can first delay it. And here, it turns out, is the crux of why FLP concludes that consensus is impossible, even though protocols based on the Paxos model solve consensus billions of times every day. Real systems always timeout when a process has been unresponsive for more than a few seconds. But FLP doesn't require the network to behave that way. The network controls null transitions, yet it might not do so in a way that has anything to do with timeouts. Here, then, is another possible candidate for explaining why FLP doesn't preclude building your favorite consensus-based technology: FLP's way of using nulls has no connection at all to timeouts, and doesn't actually model the way that timeouts are used in real systems. And let me add: this is a very credible argument. Many people invested years thinking about this exact argument, or ideas somewhat like this.

There is an old adage about big systems that if something can happen, it will happen. The FLP authors would argue that their network simply schedules messages, that nulls are examples of places where a timeout could have occurred, and that because even TCP may have to retransmit a few times before a message gets through, that delayed delivery is a real and valid thing. Thus any schedule FLP invents could arise in a real system, particularly if you imagine that a network link was somehow broken, then later repaired. But as we will see (and one can easily turn the observation into a rigorous proof), the FLP network really has impossible superpowers, because the particular messages that end up being delayed this way, and the particular moments when a null transition occurs, need to be chosen with exquisite care. Yes, each individual event could occur in the wild, but every one of them would be so unlikely that any sequence of them would be of zero probability: the probability of a sequence of unlikely events is the product of their probabilities, and with small probabilities, we approach zero exponentially quickly. Yet just this kind of zero-probability sequence is the core of the FLP proof.

To see all of this in action, let's walk through a scenario that the FLP proof constructs. Imagine that we have built a system that must vote on something, 0/1 and that we happen to be in a scenario with a near-tie. Further, assume that we were given a tie-breaking rule. Normally, there actually won't be a tie, but a failure could result in a tie, and then we would use the tie-breaker. For example, perhaps we have 11 processes, 5 vote 0 and 6 vote 1. The winner of this vote should be 1, but only if all of the 6 votes for 1 are tabulated. If one crashes, we have a 5:5 tie, and the tie-breaking rule might award the win to 0.

FLP sets up this scenario, and the proof centers on delaying messages to confuse the protocol about whether to use its normal rule, or the tie-breaking rule. They show that any fault tolerant protocol that can switch over and use its tie-breaking rule would require a few state-machine transitions during which, presumably, it would be internally reconfiguring to switch modes. Then, just at this moment, FLP delivers the delayed vote, selecting some other message and delaying it instead. In effect, just as the vote itself was a multi-step procedure, they are playing with the idea that switching to tie-breaking mode would also be a form of consensus: another multi-step procedure.

In a formal analysis they actually treat both of these as transitions from a state with two possible outcomes (0 or 1) to a state with just one decision. And what they prove is that in any such state, a network can selectively delay some messages, and selectively deliver other messages (or nulls), and force the system back to a two-outcome condition. Then they let the delayed messages through and attack again, in the identical way. Thus, any decision is indefinitely delayed.

Interestingly, they can do this without any failures at all: they prove that any fault tolerant consensus protocol has a run in which no process fails, and yet even though nothing fails, no decision is ever reached! But the key is that the network must have that superpower enabling it to selectively delay just the right messages at just the right instant, while also knowing just when to deliver the delayed messages, so that the network obligation to eventually deliver every message is respected.

From this, we can see that FLP authors actually meant something peculiar by their use of the word "impossibility." In normal conversation, impossible means just what it sounds like. If I tell you that it is impossible for a protocol that can only decide 0 or 1 to decide -1, we would all agree on this. But given what I've described, if I claimed that it is impossible to build a network that has the properties assumed by FLP, you would probably agree with me. The FLP authors would disagree: for them, "exponentially convergent to 0 probability" still leaves that tiny, ultra-unlikely bad case. No matter how unlikely, if a system could exhibit some behavior, the FLP model would consider that it is possible for that behavior to occur.

Conversely, in FLP, impossible also has a meaning... but not the meaning I might instinctively assume. Think about the classic definition of correctness for a protocol, as Dijkstra first did: a correct algorithm is one that guarantees safety (nothing bad happens, using a predicate to define bad behavior), and liveness (something good eventually happens). The FLP definition of impossibility centers on liveness: if you really dig deep, FLP is telling us that if we have a safe consensus protocol for some asynchronous distributed system, it cannot also be a live protocol. A person could be forgiven for assuming from the paper's title that that there are no correct (safe) consensus protocols, but this is not what the paper actually does. In fact it does the opposite: it assumes we have a safe protocol, and then shows that the protocol cannot also guarantee liveness, by pointing out that the zero-probability schedule discussed above is capable of delaying decisions indefinitely.

We arrive at a peculiar insight. On the one hand, ATC systems and other similar distributed applications need to make quick progress: we don't want an ATC system to delay for a billion years before deciding who can land on runway 2-E. This leads them to reconfigure if a process seems unresponsive, dropping that slow process to remain highly available. At the same time, they do require safety guarantees, and those stem from the safety properties of consensus. Due to the risk of network partitioning, we know this approach can't guarantee liveness, but we accept that because the frequency of network partitioning is very low -- data center power loss is a more common problem.

Then, on the other hand, we have FLP. One could brush it to the side and say that well, FLP isn't relevant here. But is that really correct? FLP doesn't consider the network partitioning scenario we knew about (and that already precluded liveness, given our availability goals). Yet FLP seems to be warning that even if we could somehow eliminate this exposure to a crash due to partitioning, there is actually is another "unstoppable" scenario, involving a peculiar network behavior, that no consensus protocol can defend against.

But this leads to another insight: Reverting to Dijkstra, those of us who deliver safety-critical code to ATC organizations might actually want to prove that our systems are safe and live. FLP teaches us that if you wish to prove that a system like Derecho will always make progress, you'll need to introduce some extra assumptions beyond the ones FLP employs. Without extra assumptions, we can't create such proofs because if we could, we would have violated FLP. How might those assumptions look? Think of them as a list: "if there are no network partitions, and if the network has no way to selectively delay messages... ". What would that list of conditions need to include?

There has been wonderful work on this question too: in a famous paper, Chandra and Toueg (later joined by Hadzilacos) described a very basic two-phase commit protocol for consensus, and worked out the weakest assumptions one can make that would still allow it to guarantee liveness: something they called the <>W failure detection oracle. The plain-English name for <>W is "eventually weak", and in plain English, <>W is a failure detector that can make mistakes, but where eventually, some non-faulty process is recognized as healthy by all the other non-faulty processes. This state needs to be sustained for long enough for the consensus protocol to terminate, and <>W expresses that by just saying "and we stay in this state forever".

More recent work by Idit Keidar and her systems, notably Alex Shraer, showed that in real systems, one can generally assume a stronger failure detector called <>P. This failure detector can make mistakes for a while, but eventually settles into a state in which it makes perfect failure discoveries for long enough to allow consensus to complete.

In a real system, there is actually a way to create a <>P guarantee. The trick centers on making sure that if any process suspects q of having failed, q really has crashed. How do we solve that aspect? The real but somewhat silly answer is that real systems reboot, reimage or replace malfunctioning components. James Hamilton came up with this little adage, and we refer to it as "Hamilton's three R's." In effect, we unplug q. Voila! That process has definitely failed.

In a setting where we can assume <>P, FLP would actually not apply. Now, Derecho and similar systems can be shown to be both safe and live. Of course, events that violate our assumptions can still crash the system -- network partitioning, loss of datacenter power -- but our proof would hold as long as those conditions for progress are maintained.

If you really ponder the point, with a <>P solution based on Hamilton's three R's, FLP becomes a network partitioning attack: rather than delaying messages, it kills so many processes that Derecho would shut down, and Paxos would stop accepting new updates or letting applications learn (read) the committed part of the Paxos log.

All of this discussion... just to address one impossibility result. In fact we have many others, and they come with oddities too. For example, there are many such results for Byzantine Agreement, where we have a set of processes, and some subset of them can fail in any way you like, including malicious behavior crafted to disrupt the system. But the Byzantine model normally is explored in networks with perfect clocks and perfect synchronization -- and we don't have such networks. Moreover, with Byzantine solutions, if the number of faulty components reaches and then passes the assumed limit, all bets are off.

Let's generalize this to the broader question I posed at the start. If distributed computing theory is full of these kinds of pitfalls and peculiar results, what does that tell us about the discipline as a science? To me, the moral of the story is simply that theory motivated by real systems can shed valuable light but that one has to view these results with a degree of sophistication and caution: they might not mean quite what you would assume, even if the paper you are studying has an especially provocative title. The mathematics teaches deep things, but it often isn't trivial to relate those insights back to your actual system.

But this also makes it very hard to teach this material, especially to students who distributed computing as an interesting curiosity, but not necessarily as a topic they really want to wrestle with. While a comprehensive course would certainly delve deeply into the theory, the subtle nature of the result makes it very hard to include FLP in a more practical course, like my spring class on cloud computing. For very practical students, when they hear that FLP says that consensus is impossible, there can be a tendency to jump wholeheartedly into Brewer's CAP model, with its "you can have two out of three" mantra. It happens that CAP is not a theorem and is often not even true, yet it can seem like a very simple and appealing rule of thumb. CAP tells my students to just assume that consistency is hard and promises them (perhaps, falsely) that inconsistency is far easier.

I did spend part of one lecture on FLP this spring. I worried that the course might otherwise be unbalanced -- that I needed to at least expose the students to a few of the most important theoretical results. In that lecture, my main message was that one shouldn't take every paper's title at face value. I get them to propose definitions of "impossible" and then surprised them with the FLP meaning of impossibility. And I do think they understood the point: most later got a quiz question on this right. A good tradeoff: after all, FLP really is a lovely bit of mathematics, and at least they heard about it.

Wednesday, 26 June 2019

Whiteboard analysis: IoT Edge reactive path

One of my favorite papers is the one Jim Gray wrote with Pat Helland, Patrick O'Neil and Dennis Shasha, on the costs of replicating a database over a large set of servers, which they showed to be prohibitive if you don't fragment (shard) the database into smaller and independentally accessed portions: mini-databases. In some sense, this paper gave us the modern cloud, because you can view Brewer's CAP conjecture and the eBay/Amazon BASE methodologies as both flowing from Gray's original insight.

Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, and then compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all its locks before releasing any), and then terminates using a two-phase or three-phase commit. They show that without some mechanism to prevent lock conflicts, there is a predictable and steadily increasing rate of lock conflicts leading to delay and even deadlock/rollback/retry. The phenomenon causes overheads to rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe it was N^3 in the number of servers, and T^5 in the rate of transactions. So your single replicated database will have a perform collapse. With shards, using state machine replication (implemented using Derecho!) this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's Dynamo DB offers.

Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan in working out the critical paths and measuring costs along them. In his cloud database setup, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight. But even if you don't have an identical chance to change the world, it makes sense to try and follow a similar path.

This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week. Right now, those paths are pretty messy, if you approach it this way. Without an edge service, we would see something like this:

IoT IoT Function Micro
Sensor ---------------> Hub ---> Server ------> Service

In this example I am acting as if the function server "is" the function itself, and hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps had one waiting, warm-started), and then hands off the event data to the function for handling on one of its servers. Had I included this handoff the image would be more like this:

IoT IoT Function Function Micro

Sensor ---------------> Hub ---> Server ------> F -----> Service

F is "your function", coded in a language like C#, F# or C++ or Python, and then encapsulated into a container of some form. You'll want to keep these programs very small and lightweight for speed. In particular, a function is not the place to do any serious computing, or to try and store anything. Real work occurs in the micro service, the one you built using Derecho. Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a section. And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!

Even more concerning, many sensors can't connect directly to the cloud, and we end up cloning the architecture and running it twice: within an IoT Edge system (think of that as an operating system for a small NUMA machine or a cluster, running close to the sensors, and then relaying data to the main cloud if it can't handle the events out near the sensor device).

IoT Edge Edge Fcn IoT Function Micro

Sensor ---------------> Hub ---> Server -> F======> Hub ---> Server -> CF -> Service

Notice that now we have two user-supplied functions on the path. The first one will have decided that the event can't be handled out at the edge, and forwarded the request to the cloud, probably via a message queuing layer that I haven't actually shown, but represented using a double-arrow: ===>. This could have chosen to store the request and send it later, but with luck the link was up and it was passed to the cloud instantly, didn't need to sit in an arrival queue, and was instantly given to the cloud's IoT Hub, which in turn finally passed it to the cloud function server, the cloud function (CF) and the Micro Service.

The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others. The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.

We also have the difficulty that many sensors capture images and videos. These are initially stored on the device itself, which has substantial capacity but limited compute power. The big issue is that the first link, from sensor to the edge hub, would often be bandwidth limited. So we can't upload everything. Very likely what travels from sensor to hub is just a thumbnail and other meta-data. Then the edge function concludes that a download is needed (hopefully without too much delay), sends back a download request to the imaging device, and then the device moves the image to the cloud.

Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store (short for "binary large objects"), which in turn is edge aware ands will mirror them to the main cloud blob store. Thus we have a whole pathway from IoT sensor to the edge blob server, which will eventually generate another event later to tell us that the data is ready. And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving that image via the queuing service to the cloud.

So how long will all of this take? Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G). So perhaps we will see a 10ms delay at that stop, plus more if the data is large. Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays. The big delay will probably be the handoff to the user-defined function, F.

My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100 for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!". We're up near 350ms and haven't done any work at all yet!

Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path). Call this 10ms? And then the micro service can decide what to do with this image.

Add another 75ms or so if we have to forward the request to the cloud. So the cloud might not be able to react to a photo in less than about 500ms, today.

None of this involved a Jim Gray kind of analysis of contention and backoff and retry. If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story. But if you were to use a database solution like MongoDB (CosmosDB on Azure), it seems to me that you might easily see a further 250ms right there.

What should one do about these snowballing costs? One answer is that many of the early IoT applications just won't care: if the goal is to just journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal. But if the goal is to be reactive, we need to do a lot better.

I'm thinking that this is a great setting for various forms of shortcut datapaths, that could be set up after the first interaction and offer direct bypass options to move IoT events or data from the source directly to the real target. Then with RDMA in the cloud, and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.

On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently), and contention, it is easy to see how you could trigger Jim's "naive replication" concerns. So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns that a car should "watch out for that reckless motorcycle approaching on your left!" it will already have zoomed past...

These are exciting times to work in computer systems. Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!

Sunday, 20 January 2019

Derecho status update

As we swing into 2019 mode, I wanted to share a quick Derecho status report and point to some goals for the coming months.

First, our ACM Transactions on Computer Sysms paper will be appearing sometime soon, which should give nice visibility for the work, and also the validation that comes from a tough peer-to-peer reviewing process. The paper has hugely improved through the pushback our reviews provided, so it was a challenge but, I think, worth it. The system itself works really well!

Next, we are starting to focus on a stronger integration with Azure IoT, where Derecho could be used either as a tool for creating new micro-services with strong fault tolerance and consistency guarantees, or as an ultra fast RDMA-capable object store. Microsoft has been supportive of this and Derecho should be available from their third party portal, still as a free and open source technology.

But that portal isn’t available yet. So right now, use Derecho via the v0.9 release of the system, which will be available by February 1 (we are saving v1.0 for later in the year, after we have a reasonable amount of end user experience). As of today, we still have one or two bugs we want to fix before doing that release.

Some key points:

We are urging people to use the Ubuntu Linux version, because this interoperates between normal Linux environments and Azure (the Microsoft cloud). On our release site (download here), you can find the source code but also some VMs (a container and then a true VM) with the library preinstalled. But in fact Derecho should work on any Linux-compatible system.
Right now, Derecho has only been tested in a single cluster, cloud (Azure, including Azure IoT, AWS, etc). We have some limited experience with virtualization, and with ROCE as opposed to pure Infiniband.
The easiest path to using Derecho is via the new key-value store. In this both keys and values can be any serializable object type you like, and we offer a wide range of features: Put and Get, but also a conditional put, which checks that the version you are writing was based on the most current version of the underlying object (useful for atomic replace, like in Zookeeper), plus a watch operation that works just like a topic based pub-sub or DDS. Objects can be stateless, stateful but not versioned, or versioned and persistent with strong consistency and extremely accurate temporal indexing. On this we will eventually support pub-sub (think of Kafka or OpenSplice), file systems (HDFS, Ceph), and maybe even a genuine Zookeeper look-alike. The only caution is that the watch feature isn’t designed to support huge numbers of watched topics. So if you would have more than 50 or 100 active topics, consider using Dr. Multicast to squeeze that set down.
The full system can only be used directly from our templates library API in C++, but you can easily build a wired-down library with no templated methods and then load it from Java or Python or whatever.
Runs on RDMA, OMNIPath, and even on normal TCP with no special hardware help at all. You just configure it via a configuration file, to tell the system how to set itself up. We use LibFabrics for this mapping to the underlying hardware.
Right now, all the Derecho group members need to be on hardware with identical endian and byte alignment policies, but clients not in the group can use RESTful RPC, the OMG DDS stack, WCF or JNI to issue RPCs to Derecho group members, which can then relay the request as appropriate.

Later this year we will extend the system in various ways. The API should stay stable, but the new features would include:

Hierarchically structured WAN layer that does read-only content mirroring for the object store.
A form of ultra fast and scalable LAN and WAN BlockChain support.
Machine checked correctness proofs, and a reference-version of the core Derecho protocols both in a high level form, and as proved and then re-extracted from those proofs in C or C++.
External client access to our API via RDMA, supporting point-to-point send and query.
Integration with Matt Milano’s mobile code language, MixT, allowing a client to send code to data residing in the object store.

Friday, 17 March 2017

The CAP conjecture is dead. Now what?

CAP has been around now for something like 15 years. There are some circles in which the acronym is even more famous than FLP or SAT! But CAP was never a theorem (let's call it a "folk-theorem"), and by now we have more and more examples of systems that violate CAP.

Yet even if CAP is false, it was also a fantastic rule of thumb that was incredibly valuable to developers tasked with building high performance distributed systems on the cloud. If we start to teach students that CAP is false, what can replace it in this role?

A little historical context: CAP is short for "you can only have two from Consistency, Availabilty and Partition Tolerance", This assertion was initially put forward by Eric Brewer in a PODC keynote talk he gave in 2000. The justification he offered ran along the following lines.  First, he observed that in today's increasingly globalized web systems, we invariably deploy services with high latency WAN links between them. These are balky, so services are forced to make a choice: either respond right now based on data locally available, or await restoration of the WAN link. A tradeoff that obviously favors availability over consistency, and already tells us that many web services will have to find a way to prevent responses that reflect stale data from being noticed by users, or if they are noticed, from causing problems. He used the term "partition tolerance" for this kind of fault-tolerance (namely, giving a response even when some link is down). Hence the P in CAP.

He suggested that we are favoring "A and P over C".

Then he observed that even in a single data center, if you want the highest possible levels of performance, you'll need to handle web requests on edge nodes that take their data from cache, without first talking a backend server first: again weaker consistency, but higher availability. So again, we see the A, as a kind of synonym for rapid response.

So he looked at all the different pairing: C and A over P, C and P over A, A and P over C. He concluded that in practice we always seem to find that by taking A and P we get the best scalability. But CAP itself asserted that although other mixes work, you always have to pick the two you like best, and you'll erode the third.

Although the definitions of A and P are a bit strange (P seems to have a bit of A mixed in, and also seems to sometimes mean fault-tolerance, since partitions never really arise within a data center), CAP is sort of catchy.  But like many such acronyms, much depends on the details: if you try and give rigorous definitions, to turn CAP into a theorem (people have done so), you find that it only holds in some narrow situations.

The result is that as professors teaching cloud computing, we generally treat CAP as a clever acronym, but one that works mostly as a rule of thumb. In some sense, CAP is a useful kind of folklore.

CAP took hold back in 2000 because at that time, companies like eBay and Amazon were struggling with the high costs of ACID transactions in systems that the database SQL programming model. Scalable database performance poses issues that are much more nuanced than the ones Eric had in mind: there were puzzles of lock conflict, complexity, data pipelines with non-trivial asynchronous ordering requirements, etc. But the bottom line is that performance of the big systems at eBay and Amazon was erratic and often strayed outside the magic 100ms target for web-service and web-page responses. This is the point at which a human user feels that the system is "snappy" and everyone wants to be in the sub-100ms range.

So, the technology leaders at eBay began to tell their developers that it was absolutely fine to write SQL code as a starting point in web application design (a pragmatic decision: they people they were hiring mostly had extensive SQL experience from their database courses), but that once the SQL code was working, to weaken the transactions by turning them into a series of individual atomic actions. eBay began to talk about the resulting development methodology using a new acronym: BASE, by which they meant "Basically Available, Softstate systems with Eventual Consistency."

Notice that BASE doesn't abandon consistency. Instead, it points out to the developer that many web systems just don't need ACID guarantees to work correctly.   ACID and SQL are great for creating a quick prototype that will be robust and easy to debug, but then you "optimize" it by taking away the ACID properties, without breaking the behavior in ways that violate the specification.

Amazon embraced BASE too. Around 2006, they decided to rebuild many of their core applications around a key-value technology called Dynamo, but SQL users found it hard to use, and by 2008 the adoption of Dynamo began to falter. To make the transition easier, Amazon layered in a NoSQL API for Dynamo, called Dynamo-DB: now SQL code could run on Dynamo, but with weaker guarantees than for a full SQL system (for example, NoSQL lacks join operations), and Dynamo-DB happened to be especially well-matched to BASE.

So you can see from these examples why CAP would be such convenient way to motivate developers to make the switch: it more or less tells them that if they optimize their code using BASE, it won't scale properly. Moreover, the eBay memos about BASE include step by step instructions to explain precisely how to go about doing it.

Today, fifteen years later, we've discovered more and more ways to implement scalable, high performance cloud services, edge caches that maintain coherence even at global scale, fully scalable transactional key-value storage systems, and the list goes on. Derecho is one example: it helps you build systems that are highly Available, Replicated and Consistent. Call this ARC.

The cool twist is that with ARC, you get lock-free strong consistency, right in the cloud edge! This is because we end up with very fast replication at the edge, and every replica is consistent. You do need to think hard about your update sources and patterns if you hope to avoid using locks, but for most applications that aspect is solvable because in the cloud, there is usually some sense in which any particular data item has just once real update source. So there is an update "pipeline" and once you have consistency in the system, you end up with a wide range of new options for building strongly consistent solutions.

An example I like very much was introduced by Marcos Aguilera in Sinfonia. In that system, you can make strongly consistent cache snapshots, and run your code on the snapshot. At massive scale you have as many of these snapshots as needed, and transactions mostly just run at the edge. But when you want to update the system state, the trick he suggested is this: run your transaction and keep track of what data it read and wrote (version numbers). Now instead of just responding to the client, generate a "minitransaction" that checks that these version numbers are still valid, and then does all the writes. Send this to the owner of the true database: it validates your updates and either commits by applying the updates, or rejects the transaction, which you can then retry.

Since so much of the heavy lifting is down when speculatively executing the transactions at the first step, the database owner has way less compute load imposed on it. Then you can start to think about systems with sharded data that has different owners for each shard, and limits the transactions to run within a single shard at a time (the trick is to design the sharing rule cleverly). Sinfonia and the follow-on systems had lots of these ideas.

ARC creates a world in which solutions like the Sinfonia one are easy to implement -- and the Sinfonia approach is definitely not the only such option.

I'm convinced that as we move the cloud towards the Internet of Things services, developers will need this model, because inconsistency in a system that controls smart cars or runs the power grid can wreak havoc and maybe even would be dangerous.

So now I want to claim that ARC is the best BASE story ever! Why? Well, remember that BASE is about basically available, soft-state systems with eventual consistency. I would argue that in examples like the Sinfonia one, we see all elements of the BASE story!

For example, a Sinfonia system is basically available because with enough consistent snapshots you can always do consistent read-only operations at any scale you like. The state is soft (the snapshots aren't the main copy of the system: they are replicas of it, asynchronously being updated as the main system evolves). And they are eventually consistent because when you do make an update, the mini-transaction validation step lets you push the update results into the main system, in a consistent way. It might take a few tries, but eventually, should succeed.

What about the eventual consistency aspect of BASE in the case of ARC?   ARC is about direct management of complex large-scale strongly consistent replicated state. Well, if you use ARC at the edge, back-end servers tend to be doing updates in a slightly time-lagged way: batched updates and asynchronous pipelines improve performance. Thus they are eventually consistent, too: the edge queues an update, then responds to the end-user in a consistent way, and the back end catches up soon afterwards.

And ARC isn't the only such story: my colleague Lorenzo Alvisi has a way to combine BASE with ACID: he calls it SALT, and it works really well.

So, feeling burned by CAP?  Why not check out ARC? And perhaps you would like a little SALT with that, for your databases?