
Saturday, 13 June 2020

That's impossible!

Distributed computing, like many mature areas of computer science, has its share of impossibility results.  I was lucky to be in the field before many of them were discovered, and because I'm basically a software builder, my research group built all sorts of distributed systems and tools.  Later, though, the theory community showed that some of those things were impossible, which leaves a puzzle: How can anything solve an impossible problem?  

Today, I thought it might be fun to explore an example.  I want to go deeply enough into the question to shed light on the limitations of what is feasible in computer networks and cloud datacenters.  At the same time, we'll also discover limitations on what can be modelled and proved, and even some limitations on the subtle ways that choices of wording (in English) can shape perceptions about what can and cannot be done.  

When we talk about consistency in a distributed system, people generally map the work to their favorite model. For fans of database models, this might be transactional serializability and the ACID properties: Atomicity, Consistency, Isolation and Durability.  Distributed systems researchers would point to State Machine Replication or Linearizability.  The systems I created also worry about dynamic changes to the pool of servers, and have a consistency mechanism called virtual synchrony: membership changes in atomic steps.  During a period when membership is stable (an epoch), we use state machine replication within the members.  Protocols like Paxos solve this part of the problem.

If you take a big step back and abstract, all of these forms of consistency boil down to forms of fault-tolerant consensus: we have some set of processes, and they propose various kinds of actions, and vote on what action to do next.  Consistency is the property that they all decide the same thing.  Thus, anything we can say about fault-tolerant consensus sheds light on fundamental limitations that apply to databases, distributed systems, and all sorts of other settings.
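
To make that abstraction concrete, here is a tiny sketch, in Python, of the interface all of these systems are effectively implementing.  The names are invented (this is not any real system's API); the comments spell out what "deciding the same thing" entails.

    # Minimal sketch of the fault-tolerant consensus abstraction (illustrative only).
    class Consensus:
        """Each process proposes a value; all correct processes must decide the same value."""

        def propose(self, value):
            """Submit this process's proposed action (a commit/abort vote, a membership change, ...)."""
            raise NotImplementedError

        def decide(self):
            """Block until a decision is reached, then return it.

            Safety properties every implementation must provide:
              * Agreement: no two processes decide different values.
              * Validity:  the decided value was proposed by some process.
            Liveness property (the one FLP is really about):
              * Termination: every correct process eventually decides.
            """
            raise NotImplementedError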

Any mathematically-based discussion starts with a model.  For example, what does it mean to say that a failure has occurred?  The most obvious choice is to say that well, some component halts -- it crashes suddenly and without doing anything horrible right as it collapses.  But it is easy to come up with other failure models.  

For example, consider timing.  If a processor has a local clock that fails (drifts from the truth, or perhaps starts to show nonsense values), sometimes the remainder of the system stays healthy, even including the process on the machine with the faulty clock.  Yet that could cause the process running on that machine to "lie" about what time some event occurred, or to miss a deadline.  We tend to wrap all of these up, and call them timing faults.  

Communication faults are tricky to model too.  You may not be aware of this, but on Linux a process could send a message, but then the operating system or network could drop it.  Should we blame the process itself?  The network?  And should we say that the message wasn't sent, or wasn't received?  Worse, a network could split into parts that can't talk to each other: network "partitioning" failures.  

Then we get the "hit by a cosmic ray" issues.  Things like that (or more mundane problems like a power fluctuation) can cause the computer memory to flip bits.  As a result, a process could experience some form of data corruption.  And this doesn't even consider the case where the process is hijacked by a virus.  We tend to lump all such issues into what we call "malicious" failure models, but even within the malicious models, there is a range that includes whether or not one allows collusion, as opposed to a strictly isolated form of nefarious misbehavior: A virus that can infect one process might be able to infect every process running the same code, and then mount a coordinated attack on something else. In contrast, that bit flip is just a case of flakey hardware and would only impact a single process at a time.

There is a lot of work that explores this range of behaviors.  In fact one model, called the BAR model of failures, starts with these cases and then goes further by introducing incentives: are the participants out to cause chaos (a so-called byzantine case)?  Or are they altruistic?  Purely rational?  Crash failures are then layered in, giving a cross-product that you can study to yield an abundance of impossibility results for tasks like reaching agreement or electing a leader.  

For our purposes today, the result I want to discuss is one from a paper by Fischer, Lynch and Paterson called Impossibility of Distributed Consensus with One Faulty Process.  We often refer to this just as the FLP impossibility result, and it is arguably the most famous of all such results.   As the title suggests, the paper seemingly shows that agreement (consensus) is actually not possible if a system is at risk of crash failures.  On the face of it, the FLP assumptions about the network and the possible failures are very mild -- and the result seemingly implies that databases, Cornell's amazing new Derecho system, dozens of other Paxos-based systems, and Blockchain solutions for Bitcoin (all of which can solve consensus) are "doing something impossible."  Indeed, FLP seems to imply that the very idea of consistency as a goal is hopeless: if we accept the title, don't waste your time building a database!  But as we will see, a lot centers on the details of what the words in that title actually mean.  The result is a real, valid one.  It does apply to all of those kinds of systems.  But it doesn't mean we should pack up and head home.

The paper came out in 1985, which was an important year for me: it just happened to be the year when I wrote my first paper about a distributed system my students and I had created, one that implemented atomic multicast and Paxos and could be used to solve all sorts of practical agreement problems.  We called it Isis (as it turned out, an unfortunate choice).  Isis was released to the public, open source, and by 1987 or so it had a surprisingly wide uptake.  The system was ultimately used in the New York Stock Exchange to run the trading floor, was adopted in the French Air Traffic Control System where it is still used for many purposes, and was even incorporated into the Oracle database product, which launches Isis every time the Oracle cluster management system boots -- and these are just a few examples.  

As you can guess, right from the start I was getting questions about FLP, including in panel sessions at major conferences where this oddity was debated.  My mother (an academic) has a saying that academic debates become heated precisely because the issues really don't matter.  The big conference in systems, SOSP, was infamous in that time period for fireworks, very much validating my mother's point.  In retrospect, I would have to say that relatively few SOSP attendees cared in a deep sense about FLP.  But they were curious to understand the result:  The paper is short and easy to read, but surprisingly hard to map to any simple intuition about what it "means".  And let's face it: they also enjoyed lively panel debates.

Isis users didn't fuss too much about FLP: if they knew about it at all, they perceived it as an oddity.  For a decade, all those traders on the floor of the NYSE happily traded stocks.  Eventually the parts of the system that used Isis were phased out during a routine upgrade, but not because it had caused any issues -- in fact there wasn't a single disruption to trading during that entire ten-year period (crashes did occur, but Isis orchestrated self-healing recovery in every case).  

By now, the French ATC system has been safely guiding airplanes since 1995 and will probably not be upgraded before 2025: a 30 year run!  The designers of those platforms, and others, liked Isis both as a practical tool and because of its consistency-preserving consensus mechanisms.  Isis created a world of state machine replication elements, which enabled self-managed, self-healing applications.  Moreover, just as Isis itself could be described easily using the mathematics of state machine replication, those applications could also be proved correct and safe, even when components failed.  

For example, one obviously wants to be sure that an Air Traffic Control System guarantees that each plane and each airport runway will have a single designated controller who is responsible for managing it.  Isis allowed the French system to obtain this property.  Each flight should have a single active flight plan; any change to the flight plan subsumes the prior one.  Isis was central to their correctness proof for this property, too (one can think of an ATC system as centering on a log of flight plan versions, in which each change is a durable append to the log, and the current flight plan is always the version closest to the end of the log).  
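
As a toy illustration of that log-of-versions idea (the code is mine, purely for illustration, and has nothing to do with the real French implementation):

    # Toy sketch of a flight-plan version log: every change is a durable append,
    # and the "current" plan is whatever sits closest to the end of the log.
    class FlightPlanLog:
        def __init__(self):
            self._versions = []          # append-only; durable in a real system

        def append(self, plan):
            """Record a new flight plan version; it subsumes all earlier ones."""
            self._versions.append(plan)

        def current(self):
            """The active flight plan is the version closest to the end of the log."""
            return self._versions[-1] if self._versions else None

    log = FlightPlanLog()
    log.append({"flight": "AF123", "route": "LFPG->LFBO", "level": 350})
    log.append({"flight": "AF123", "route": "LFPG->LFBO", "level": 330})
    assert log.current()["level"] == 330     # the amended plan subsumes the original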

ATC systems never generate huge loads, and for this reason it was also possible to quantify the system's performance profile, and even to show that performance was stable across a wide range of stress tests and failure sequences (one does this by designing tests that carefully measure delays and bandwidth even as failures are injected to mimic scenarios believed possible during actual operation).  This enabled the developers to convince skeptics that if the workstation used by some controller were to fail, someone else would take over the role within 30 seconds.   During certification, the French used red-teams that were free to pore over the protocols, the code, and the way the system used them.  Then they would often demand proofs for challenging scenarios they would construct.  Sometimes the best response included a mathematical argument, but more often, these red-teams wanted to see experimental responses: an experiment that would mimic the case they worried about, and ride it out successfully.  Over time, the system evolved to have an enormous range of unit tests and integration tests.  Isis passed every test... and yet, FLP sits there.   

Curiously, the one time I can remember this coming up, the red-team dismissed the FLP work, feeling that it made assumptions that didn't apply in an ATC setting (as we will see, FLP is posed in a very abstracted way, and assumes a very general model of the network).  Yet I still felt that I needed a really cogent answer.  Suppose that someone really were to challenge Isis in this way.  What would be the best possible way to respond and explain how Isis relates to the FLP result?  

In fact this was an actual worry for me.  Those customers trusted me, and under my guidance, we were using the Isis system for applications where it genuinely matters that the solution have safe, consistent, self-healing behavior, and within seconds too!   We had correctness proofs for the system (really, proofs of safety, not liveness, but the performance work had the flavor of a liveness claim).   FLP proclaimed this combination of guarantees to be impossible... how can that be?  Indeed, there were follow-on papers that appeared soon after FLP, pointing out that Isis and its version of Paxos were clearly subject to the FLP impossibility result.   I had no choice but to figure out the proper rebuttal.  

A great deal of the explanation centers on the FLP paper's approach to modelling the problem.  Earlier, I said a few words about their system model, but didn't mention the network, and I didn't explain how they define asynchronous execution.  So let's tackle those aspects first.  The FLP paper assumes a network that mimics several aspects of TCP.  Processes are permitted to send each other messages, and FLP assumes that these are sent over reliable network channels that never drop, corrupt or duplicate packets.  I mentioned earlier that Linux, and the network itself, can both drop messages.  But in fact if you use TCP, the TCP protocol itself compensates.  On the face of it, FLP seems to assume exactly what TCP guarantees in a standard networked system, like a cloud data center.  TCP, as you probably know, obtains this behavior by sequencing data, and then using explicit acknowledgments or complaints ("negative" acknowledgements) to trigger retransmissions and rate control.  Duplicates can be filtered out because TCP has already seen those bytes.  

On closer study, the FLP network model departs from TCP by allowing out-of-order delivery.  Moreover, this matters: their proof requires this property, because it involves delaying some messages and allowing others to skip past the delayed ones (we'll see how they use this feature in a moment).  For a while, it seemed plausible that this could be the key.  Perhaps state machine replication seems to be very possible because we run on TCP (mostly), and it delivers messages in order.  However in the end, it turned out that this particular aspect of the FLP model was unimportant.  FLP applies even to a protocol built over TCP or RDMA.

Another puzzle relates to the way the FLP model defines asynchronous behavior.  In the FLP description of processes, there are no clocks and indeed nothing even faintly resembling a clock: a set of processes could run the protocol for a billion years before achieving agreement, and this would be "just fine."  Obviously, an air traffic control system wouldn't be happy with billion year delays, so real systems like Isis and Derecho have timers built in: if some process isn't responsive, the others vote the silent one out.  To avoid a partitioning event (where our ATC system might split in half, with two subsets that each believe the other to have crashed, implying that two controllers would think themselves responsible for the same plane), we just require that in any such vote, a majority of the system has to "survive" long enough to vote in favor of the new membership.  The majority rule eliminates the risk of split-brain behaviors, which could threaten safety.   
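
Here is a minimal sketch of that majority rule (invented names, none of the real protocol machinery): a proposed new membership is installed only if a majority of the current membership endorses it, and because two disjoint partitions cannot both contain a majority of the same view, split-brain is ruled out.

    # Sketch: a membership change is installed only if a majority of the
    # *current* view approves it.
    def try_install_new_view(current_view, proposed_view, approvals):
        """current_view / proposed_view: sets of process ids; approvals: ids that voted yes."""
        quorum = len(current_view) // 2 + 1
        voters = approvals & current_view          # only current members get a vote
        if len(voters) >= quorum:
            return proposed_view                   # new epoch begins
        return None                                # blocked: better to wait than to risk split-brain

    view = {"p1", "p2", "p3", "p4", "p5"}
    # p5 is unresponsive, so the others vote to drop it:
    print(try_install_new_view(view, view - {"p5"}, {"p1", "p2", "p3", "p4"}))
    # A minority partition cannot evict the majority:
    print(try_install_new_view(view, {"p4", "p5"}, {"p4", "p5"}))   # -> None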

These points, it turns out, are a bit more complicated than the one involving TCP.  The first thing to appreciate is that rapid responsiveness is very important in an ATC system.  When a pilot is approaching a landing strip, the ATC system is under a timed obligation to tell the pilot what to do: should plane A land first, or will plane B land first?  Can plane A climb 2,500 feet to avoid that thunderhead?  

Failures can threaten those quick responses, and this means that an ATC system may be in a hurry to kick out an unresponsive component.  Yet if a pilot was cleared to land, but then the original controller's computer freezes and a new controller takes over, he or she should know about that prior clearance.  This turns out to be the same requirement as the FLP rule that if any process decides v, then every process decides v.  This tells us that even though ATC systems are biased towards availability, the most fundamental aspect of the FLP way of defining consensus still applies.

What about clocks?  Real-world control systems like ATC platforms make heavy use of time, and take precautions to protect against errors that could be introduced by machines with faulty clocks.  FLP had no clocks... does this somehow take us out of the FLP scope?  One can imagine that it might: clocks enable timed actions, and we use timeouts for fault detection: an ATC platform will give up on any tardy process and treat it as faulty, simply because we don't want the entire system to end up waiting for some very overloaded machine to catch up.   Doesn't FLP itself need a way to model the detection of failures?  And because timeouts are a form of non-determinism, how can we reconcile this use of time with the deterministic state machine model for the protocol itself?

As it turns out, FLP does have a way to handle non-determinism.  In the state machine formalism, processes are state machines, but FLP allows states to have null transition edges.  That is, if a process (call it p) is in some state s, FLP models p as having a set of possible next states that we could reach from s: perhaps, states s' or s'' (loop-back edges are fine, by the way, so one of these could just be s itself).  Each such transition is described by an edge in a state transition graph, and these edges are labeled by a kind of pattern that will unambiguously tell us which transition to take.    Thus, when a message is available, we could either consume that message, or take a null transition: a non-deterministic event. 

Given that the behavior of a deterministic state machine is fully determined by its sequence of inputs, you can see that the real question centers on the decision as to whether the next input will be a message or a null.  FLP gives the network this power: they model the network as an active entity that makes its own decisions.  In general, the network has a set of messages ready to deliver, and decides which to deliver, in what order, and whether or not to deliver a null to some process rather than a message.  Thus, in the FLP model, timeouts are viewed as a network "decision" to deliver a null message in a state where the protocol may have been waiting for something else (perhaps p is waiting for a message from q, but instead gets a null, and interprets this to mean that a timeout has occurred).
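
A little sketch may help to visualize this (it captures the flavor of the model, not the paper's actual formalism): each process deterministically consumes whatever input it is handed, and the network decides, step by step, whether that input is one of the pending messages or a null.

    import random

    # Flavor of the FLP execution model: processes consume inputs; an *active*
    # network decides which pending message to deliver next, or whether to hand
    # a process a null instead (which the process can only read as "maybe a timeout").
    def run(inboxes, pending, steps=10, seed=0):
        rng = random.Random(seed)
        for _ in range(steps):
            p = rng.choice(sorted(inboxes))              # which process takes the next step
            deliverable = [m for m in pending if m[0] == p]
            if deliverable and rng.random() < 0.7:
                dest, msg = deliverable[0]
                pending.remove((dest, msg))              # the network chose *this* message
                inboxes[p].append(msg)
            else:
                inboxes[p].append(None)                  # a null transition

    inboxes = {"p": [], "q": []}
    run(inboxes, [("p", "vote=1 from q"), ("q", "vote=0 from p")])
    print(inboxes["p"])     # p's input sequence: some mix of messages and nulls
    # The FLP proof replaces this random scheduler with an adversarial one that
    # picks messages and nulls with exquisite care.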

FLP's network is unusually powerful.  In effect, it is able to scrutinize each and every message, selectively deciding when to deliver it.  There is an obligation to eventually deliver every message... but no bound on how long the network can first delay it.  And here, it turns out, is the crux of why FLP concludes that consensus is impossible, even though protocols based on the Paxos model solve consensus billions of times every day.  Real systems always timeout when a process has been unresponsive for more than a few seconds.  But FLP doesn't require the network to behave that way.  The network controls null transitions, yet it might not do so in a way that has anything to do with timeouts.  Here, then, is another possible candidate for explaining why FLP doesn't preclude building your favorite consensus-based technology: FLP's way of using nulls has no connection at all to timeouts, and doesn't actually model the way that timeouts are used in real systems.  And let me add: this is a very credible argument.  Many people invested years thinking about this exact argument, or ideas somewhat like this.

There is an old adage about big systems that if something can happen, it will happen.   The FLP authors would argue that their network simply schedules messages, that nulls are examples of places where a timeout could have occurred, and that because even TCP may have to retransmit a few times before a message gets through, that delayed delivery is a real and valid thing.  Thus any schedule FLP invents could arise in a real system, particularly if you imagine that a network link was somehow broken, then later repaired.   But as we will see (and one can easily turn the observation into a rigorous proof), the FLP network really has impossible superpowers, because the particular messages that end up being delayed this way, and the particular moments when a null transition occurs, need to be chosen with exquisite care.  Yes, each individual event could occur in the wild, but every one of them would be so unlikely that any sequence of them would be of zero probability: the probability of a sequence of unlikely events is the product of their probabilities, and with small probabilities, we approach zero exponentially quickly.    Yet just this kind of zero-probability sequence is the core of the FLP proof.

To see all of this in action, let's walk through a scenario that the FLP proof constructs.  Imagine that we have built a system that must vote on something, 0 or 1, and that we happen to be in a scenario with a near-tie.  Further, assume that we were given a tie-breaking rule.  Normally, there actually won't be a tie, but a failure could result in a tie, and then we would use the tie-breaker.  For example, perhaps we have 11 processes, 5 vote 0 and 6 vote 1.  The winner of this vote should be 1, but only if all of the 6 votes for 1 are tabulated.  If one crashes, we have a 5:5 tie, and the tie-breaking rule might award the win to 0.
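
In toy form (same numbers, code invented purely to illustrate the two rules):

    # 11 processes: 5 vote 0, 6 vote 1.  With every vote tabulated the winner is 1;
    # lose one of the 1-votes and the tie-breaking rule hands the win to 0.
    def tally(votes, tie_break=0):
        zeros, ones = votes.count(0), votes.count(1)
        if zeros == ones:
            return tie_break
        return 0 if zeros > ones else 1

    all_votes = [0] * 5 + [1] * 6
    print(tally(all_votes))          # -> 1  (normal rule: all votes arrived)
    print(tally(all_votes[:-1]))     # -> 0  (one 1-vote missing: the tie-breaker fires)

Everything hinges on whether that last vote is merely delayed or truly lost to a crash -- and in an asynchronous system, no process can tell the difference.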

FLP sets up this scenario, and the proof centers on delaying messages to confuse the protocol about whether to use its normal rule, or the tie-breaking rule.  They show that any fault tolerant protocol that can switch over and use its tie-breaking rule would require a few state-machine transitions during which, presumably, it would be internally reconfiguring to switch modes.  Then, just at this moment, FLP delivers the delayed vote, selecting some other message and delaying it instead.  In effect, just as the vote itself was a multi-step procedure, they are playing with the idea that switching to tie-breaking mode would also be a form of consensus: another multi-step procedure.

In a formal analysis they actually treat both of these as transitions from a state with two possible outcomes (0 or 1) to a state with just one decision.  And what they prove is that in any such state, a network can selectively delay some messages, and selectively deliver other messages (or nulls), and force the system back to a two-outcome condition.  Then they let the delayed messages through and attack again, in the identical way.  Thus, any decision is indefinitely delayed.

Interestingly, they can do this without any failures at all: they prove that any fault tolerant consensus protocol has a run in which no process fails, and yet even though nothing fails, no decision is ever reached!  But the key is that the network must have that superpower enabling it to selectively delay just the right messages at just the right instant, while also knowing just when to deliver the delayed messages, so that the network obligation to eventually deliver every message is respected.

From this, we can see that FLP authors actually meant something peculiar by their use of the word "impossibility."  In normal conversation, impossible means just what it sounds like.  If I tell you that it is impossible for a protocol that can only decide 0 or 1 to decide -1, we would all agree on this.  But given what I've described, if I claimed that it is impossible to build a network that has the properties assumed by FLP, you would probably agree with me.  The FLP authors would disagree: for them, "exponentially convergent to 0 probability" still leaves that tiny, ultra-unlikely bad case.  No matter how unlikely, if a system could exhibit some behavior, the FLP model would consider that it is possible for that behavior to occur. 

Conversely, in FLP, impossible also has a meaning... but not the meaning I might instinctively assume.  Think about the classic definition of correctness for a protocol, as Dijkstra first did: a correct algorithm is one that guarantees safety (nothing bad happens, using a predicate to define bad behavior), and liveness (something good eventually happens).  The FLP definition of impossibility centers on liveness: if you really dig deep, FLP is telling us that if we have a safe consensus protocol for some asynchronous distributed system, it cannot also be a live protocol.  A person could be forgiven for assuming from the paper's title that there are no correct (safe) consensus protocols, but this is not what the paper actually does.  In fact it does the opposite: it assumes we have a safe protocol, and then shows that the protocol cannot also guarantee liveness, by pointing out that the zero-probability schedule discussed above is capable of delaying decisions indefinitely.

We arrive at a peculiar insight.  On the one hand, ATC systems and other similar distributed applications need to make quick progress: we don't want an ATC system to delay for a billion years before deciding who can land on runway 2-E.   This leads them to reconfigure if a process seems unresponsive, dropping that slow process to remain highly available.  At the same time, they do require safety guarantees, and those stem from the safety properties of consensus.  Due to the risk of network partitioning, we know this approach can't guarantee liveness, but we accept that because the frequency of network partitioning is very low -- data center power loss is a more common problem.

Then, on the other hand, we have FLP.  One could brush it to the side and say that well, FLP isn't relevant here.  But is that really correct?  FLP doesn't consider the network partitioning scenario we knew about (and that already precluded liveness, given our availability goals).  Yet FLP seems to be warning that even if we could somehow eliminate this exposure to a crash due to partitioning, there actually is another "unstoppable" scenario, involving a peculiar network behavior, that no consensus protocol can defend against.  

But this leads to another insight: Reverting to Dijkstra, those of us who deliver safety-critical code to ATC organizations might actually want to prove that our systems are safe and live.  FLP teaches us that if you wish to prove that a system like Derecho will always make progress, you'll need to introduce some extra assumptions beyond the ones FLP employs.  Without extra assumptions, we can't create such proofs because if we could, we would have violated FLP.  How might those assumptions look?  Think of them as a list: "if there are no network partitions, and if the network has no way to selectively delay messages... ".  What would that list of conditions need to include?

There has been wonderful work on this question too: in a famous paper, Chandra and Toueg (later joined by Hadzilacos) described a very basic two-phase commit protocol for consensus, and worked out the weakest assumptions one can make that would still allow it to guarantee liveness: something they called the <>W failure detection oracle.  The plain-English name for <>W is "eventually weak", and in plain English, <>W is a failure detector that can make mistakes, but where eventually, some non-faulty process is recognized as healthy by all the other non-faulty processes.  This state needs to be sustained for long enough for the consensus protocol to terminate, and <>W expresses that by just saying "and we stay in this state forever".  

More recent work by Idit Keidar and her students, notably Alex Shraer, showed that in real systems, one can generally assume a stronger failure detector called <>P.  This failure detector can make mistakes for a while, but eventually settles into a state in which it makes perfect failure discoveries for long enough to allow consensus to complete.  
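
To give a feel for what <>P looks like in a real system, here is a minimal timeout-based sketch (invented names, not the API of anything real): it may wrongly suspect a process that is merely slow, but once the system settles down and messages arrive within the timeout, its suspicions become accurate -- and that is all the liveness arguments need.

    import time

    # Minimal timeout-based failure detector in the spirit of <>P: suspicions can
    # be wrong for a while, but become (and stay) accurate once the system settles.
    class EventuallyPerfectDetector:
        def __init__(self, members, timeout=2.0):
            self.timeout = timeout
            self.last_heard = {p: time.monotonic() for p in members}

        def heartbeat(self, p):
            """Call whenever any message (or explicit heartbeat) arrives from p."""
            self.last_heard[p] = time.monotonic()

        def suspects(self):
            """The processes we currently believe have failed."""
            now = time.monotonic()
            return {p for p, t in self.last_heard.items() if now - t > self.timeout}

    fd = EventuallyPerfectDetector({"p", "q", "r"}, timeout=0.1)
    time.sleep(0.2)
    fd.heartbeat("p")        # p is chatty; q and r have gone silent
    print(fd.suspects())     # -> {'q', 'r'}: possibly false suspicions if they are merely slow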

In a real system, there is actually a way to create a <>P guarantee.  The trick centers on making sure that if any process suspects q of having failed, q really has crashed.  How do we solve that aspect?  The real but somewhat silly answer is that real systems reboot, reimage or replace malfunctioning components.  James Hamilton came up with this little adage, and we refer to it as "Hamilton's three R's."  In effect, we unplug q.  Voila!  That process has definitely failed.

In a setting where we can assume <>P, FLP would actually not apply.  Now, Derecho and similar systems can be shown to be both safe and live.   Of course, events that violate our assumptions can still crash the system -- network partitioning, loss of datacenter power -- but our proof would hold as long as those conditions for progress are maintained.   

If you really ponder the point, with a <>P solution based on Hamilton's three R's, FLP becomes a network partitioning attack: rather than delaying messages, it kills so many processes that Derecho would shut down, and Paxos would stop accepting new updates or letting applications learn (read) the committed part of the Paxos log.

All of this discussion... just to address one impossibility result.  In fact we have many others, and they come with oddities too.  For example, there are many such results for Byzantine Agreement, where we have a set of processes, and some subset of them can fail in any way you like, including malicious behavior crafted to disrupt the system.  But the Byzantine model normally is explored in networks with perfect clocks and perfect synchronization -- and we don't have such networks.  Moreover, with Byzantine solutions, if the number of faulty components reaches and then passes the assumed limit, all bets are off.

Let's generalize this to the broader question I posed at the start.  If distributed computing theory is full of these kinds of pitfalls and peculiar results, what does that tell us about the discipline as a science?    To me, the moral of the story is simply that theory motivated by real systems can shed valuable light but that one has to view these results with a degree of sophistication and caution: they might not mean quite what you would assume, even if the paper you are studying has an especially provocative title.  The mathematics teaches deep things, but it often isn't trivial to relate those insights back to your actual system.

But this also makes it very hard to teach this material, especially to students who see distributed computing as an interesting curiosity, but not necessarily as a topic they really want to wrestle with.  While a comprehensive course would certainly delve deeply into the theory, the subtle nature of the result makes it very hard to include FLP in a more practical course, like my spring class on cloud computing.  For very practical students, when they hear that FLP says that consensus is impossible, there can be a tendency to jump wholeheartedly into Brewer's CAP model, with its "you can have two out of three" mantra. It happens that CAP is not a theorem and is often not even true, yet it can seem like a very simple and appealing rule of thumb.  CAP tells my students to just assume that consistency is hard and promises them (perhaps, falsely) that inconsistency is far easier.

I did spend part of one lecture on FLP this spring.  I worried that the course might otherwise be unbalanced -- that I needed to at least expose the students to a few of the most important theoretical results.  In that lecture, my main message was that one shouldn't take every paper's title at face value.  I got them to propose definitions of "impossible" and then surprised them with the FLP meaning of impossibility.  And I do think they understood the point: most later got a quiz question on this right.  A good tradeoff: after all, FLP really is a lovely bit of mathematics, and at least they heard about it.

Tuesday, 19 December 2017

Data concentrators: Follow the bytes

In my last posting, I discussed the need for cloud-hosted data concentrators in support of fog computing.  I thought it might be interesting to dive a bit deeper on this topic.  A number of questions arise, but I want to suggest that they center on understanding the "critical path" for data.  In computer systems parlance, the term just means "the portion of the system that limits performance."  Once we understand which elements make up the critical path, we can then ask further questions about how one might implement an architecture to support this form of data concentrator, with various cloud-like goals: resource sharing, scalability, and so forth. 

To set the context, let me give an example of how this form of reasoning can be valuable.  Consider Eric Brewer's famous CAP conjecture for web servers (and the associated BASE methodology).  It may be surprising to realize that before Eric actually put this forward, many web server systems were built as classic 3-tier database structures, in which every web page was basically the output of a database query, and the database server was thus on the critical path even for read-only web transactions.  The database queries ultimately became the bottleneck: nobody could figure out how to get linear scaling without having the database concurrency mechanism dead center on the critical path and ultimately, overloaded by the load.

At that time we saw some very clever workarounds that came close to solving the problem.  For example, with snapshot isolation, it was shown that you could approximate serializability by running transactions slightly in the past, on consistent snapshots.  There was a scheme for gradually narrowing the snapshot window, to make it as current as practical and yet ensure that the data in the snapshot was committed and consistent.  So by doing a read off this window, you could avoid consulting the back-end database more frequently than required and end up generating your web response in a consistent way, but totally locally, just talking to the cache.  Yet even snapshot isolation frustrated some vendors.  They wanted far more scalability, and snapshot isolation still had some overheads that involved queries to the backend database.  It seemed clear that consistency would be at odds with eliminating this round-trip to the back end.
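
A toy multiversion cache captures the flavor of that trick (my own illustration, not the actual mechanism in any of those products): each key keeps its committed versions, and a read-only request is answered as of a snapshot time slightly in the past, entirely from the cache.

    # Toy multiversion cache: each key keeps (commit_time, value) pairs, and a
    # read-only web request runs against a snapshot slightly in the past, so it
    # sees only committed, mutually consistent data without touching the backend.
    class SnapshotCache:
        def __init__(self):
            self.versions = {}             # key -> list of (commit_ts, value)

        def apply_committed(self, key, commit_ts, value):
            self.versions.setdefault(key, []).append((commit_ts, value))

        def read(self, key, snapshot_ts):
            """Latest version committed at or before snapshot_ts."""
            past = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
            return max(past)[1] if past else None

    cache = SnapshotCache()
    cache.apply_committed("item:42:price", commit_ts=100, value="$10")
    cache.apply_committed("item:42:price", commit_ts=105, value="$12")
    print(cache.read("item:42:price", snapshot_ts=103))   # -> "$10", consistent as of t=103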

CAP originated from Eric's decision to study this exact issue: he observed that  responsiveness for web page generation was overwhelmingly more critical than any other aspect of the system, and that database queries are slow due to their consistency mechanisms -- mechanisms that do guarantee that the answer is correct, but involve locking and other delays.  This led him to ask whether web servers actually needed consistency, and to speculate about how much data can be served out of cached content.  He ended up arguing for a "soft state" model, in which pretty much every web request is handled as much as possible from cached content, and his company, Inktomi, was born.  CAP then gave rise to a development methodology (the BASE approach), resulted in scalable key-value caching architectures (MemCached, Cassandra, etc), argued for NoSQL database-like models such as in DynamoDB, and so forth.

Honestly, my crowd was up in arms at the time.  Very few "strong consistency" researchers liked CAP or its conclusion that we should just toss consistency out the window.  I could point to colleagues who rail against CAP, even though CAP itself is 15 years old.  And how can one really dispute Eric's deeper point: if these applications don't actually need consistency, why are we insisting that they pay for a property that they don't want, and that isn't all that cheap either?

Notice how, from a single core insight about the critical path, we end up with a fertile debate and a whole area of research that played out over more than a decade.  You may not love CAP, yet it was actually extremely impactful. 

CAP isn't the only such story.  Jim Gray once wrote a lovely little essay, "The Dangers of Replication and a Solution," in which he did a paper-and-pencil analysis (jointly with colleagues including Pat Helland and Dennis Shasha) and showed that if you just toss state machine replication into a database, the resulting system will probably slow down as n^5, where n is the number of replicas.  Since you probably wanted a speedup, not a slowdown, this is a classic example of how critical path analysis leads to recognition that a design is just very, very wrong!  And Jim's point isn't even the same one Eric was making.  Eric was just saying that if you want your first tier to respond to queries without sometimes pausing to fetch fresh data from back-end databases, you need to relax consistency.

Ultimately, both examples illustrate variations on the famous End to End principle: you should only solve a problem if you actually think you have that problem, particularly if the solution might be costly!  And I think it also points to a trap that the research community often falls into: we have a definite tendency to favor stronger guarantees even when the application doesn't need guarantees.  We are almost irresistibly drawn to making strong promises, no matter who we are making them to!  And face it: we tend to brush costs to the side, even when those costs are a really big deal.

So, armed with this powerful principle, how might we tackle the whole data concentrator question?  Perhaps the best place to start is with a recap of the problem statement itself, which arises in fog/edge computing, where we want a cloud system to interact closely with sensors deployed out in the real world.  The core of that story centered on whether we should expect the sensors themselves to be super-smart.  We argued that probably, they will be rather dumb, for a number of reasons: (1) power limits; (2) impracticality of sending them large machine-learned models, and of updating those in real-time; (3) least-common denominator, given that many vendors are competing in the IoT space and companies will end up viewing all these devices as plug-and-play options; (4) lack of peer-to-peer connectivity, when sensors would need to talk to one-another in real-time to understand a situation.  In the previous blog I didn't flesh out all 4 of these points, but hopefully they are pretty self-evident (I suppose that if you already bet your whole company on brilliant sensors, you might disagree with one or more of them, but at least for me, they are pretty clear).

Specialized sensors for one-time uses might be fancier (sensors for UAVs, for example, that do various kinds of information analysis and data fusion while in flight).  But any kind of general purpose system can't bet on those specialized features. We end up with a split: a military surveillance system might use very smart sensors, operating autonomously in an environment where connectivity to the ground was terrible in any case.  But a smart highway might ignore those super-fancy features because it wants to be compatible with a wide range of vendors and anyhow, wants to work with a very dynamically-updated knowledge model.  So, across the broad spectrum of products, a system that wants to have lots of options will end up forced towards that least-common denominator perspective.

From this, we arrived at various initial conclusions, centering on the need to do quite a bit of analysis on the cloud: you can probably do basic forms of image analysis on a camera, but matching your video or photo to a massive "model" of activity along Highway 101 near Redwood City -- that's going to be a cloud-computing task.  Many people believe that speech understanding is basically a cloud-scale puzzle, at least for the current decade or so (the best solutions seem to need deep neural networks implemented by massive FPGA or ASIC arrays).  Most forms of machine learning need big TPU or GPU clusters to crunch the data. 

So most modern roads lead to the cloud.  I'll grant that someday this might change, as we learn to miniaturize the hardware and figure out how to pack all the data in those models into tiny persistent storage units, but it isn't going to happen overnight.  And again, I'm not opposed to fancy sensors for UAVs.  I'm just saying that if some of your UAVs come from IBM, some from Microsoft and some from Boeing, and you want to work with all three options, you won't be able to leverage the fanciest features of each.

The data concentrator concept follows more or less directly from the observations above.  Today's cloud is built around Eric's CAP infrastructure, with banks of web servers that rush to serve up web pages using cached data spread over key-value stores.  Most of the deep learning and adaptation occurs as a back-end task, more or less offline.  It isn't unusual for edge models to be hours out of date.  And it is very hard to keep them within seconds of current state.  But if you plan to control a smart highway, or even a smart home, seconds are too high a latency.  "Watch out for that rusty pipe!" isn't a very helpful warning, if you (or your smart car) won't get that warning until 10s too late.

So what would a data concentrator do, and how does the answer to such a question reveal the most likely critical paths?  Basically, data concentrators give rise to a new kind of tier-one system.  Today's first tier of the cloud runs Eric Brewer's architecture: we have vast armies of very lightweight, stateless, web-page generating engines.  A data concentrator would be a new form of first tier, one that focuses on stateful responses using machine-learned models and customized using machine-learning code, much as we customize a web server to generate the various web pages needed for media, direct sales, advertising, etc.  Over time, I'm betting we would code these in a high level language, like TensorFlow, although today C++ might give much better performance.

Back to basics.  You know the business adage about following the money.   For a fog system where we care mostly about performance and real-time responses, the "costly" thing is slowdowns or delays.  So let's follow the data.  The whole reason for building data concentrators is that our sensors are kind of dumb, but the volumes of data are huge, hence we'll need a system element to filter and intelligently compress the data: I like to think of the resulting systems as "smart memory", meaning that we might treat them like file systems, and yet they are smart about how they store the stuff they receive. A smart memory is just one form of data concentrator; I can think of other forms, too; maybe a future blog could focus on that.

Via its file system personality, a smart-memory system could be used just like any storage infrastructure, supporting normal stuff like Spark/Databricks, TimescaleDB, you name it.  But in its edge-of-the-cloud personality, it offers the chance to do tasks like image or voice classification using big machine-learned models, real-time reaction to urgent events (like mufflers falling off on the highway), and perhaps even simple learning tasks ("Ken's car is 2" to the left of what was predicted").

Where would the highest data pressures arise?  If we have sensors capturing (and perhaps storing) video or voice, the individual data links back to the data collector tier of the cloud won't be terribly burdened: the core Internet has a lot of capacity, and even a dumb camera can probably do basic tasks like recognizing that the highway is totally empty.  But each of these collectors will be a concentrator of streams, capturing data from lots of sources in parallel.  So the highest demand may be on that very first leg inside the cloud: we should be asking what happens to a data stream as it arrives in a single data collector.

Without knowing the application, this may seem like an impossible goal, but consider this: we actually know a lot about modern applications, and they share a lot of properties.  One is the use of accelerators: if the data is video, the first steps will be best fitted to GPU or TPU or FPGA hardware, which can do tasks like parallel image processing, evaluating large Bayesian or Neural Network models, and so forth.  Right now, it is surprisingly hard to actually load data from an incoming TCP connection directly into an accelerator of these kinds: most likely it will need to be copied many times before it ever reaches GPU memory and we can turn that hardware loose.  And we've already concluded that only a limited amount of that kind of work can occur on the sensor itself.

So a first insight more or less leaps at us: for this data collection model to work, we have to evolve the architecture to directly shunt data from incoming TCP connections to GMEM or TMEM or through a bump-in-the-wire FPGA transformation.  Today's operating systems don't make this easy, so we've identified an O/S research need.  Lacking O/S research, we've identified a likely hot-spot that could be the performance-limiting step of the architecture.  That was basically how CAP emerged, and how Jim Gray's database scaling work advanced: both were paper-and-pencil analyses that led to pinch-points (in the case of CAP, the locking or cache-coherency delays needed to guarantee consistency for read-mostly workloads; in the case of database updates, the concurrency control and conflicts that caused aborts).

Distilled to its bare bones, the insight is: data concentrators need ways to move data from where it is captured to where it will be processed that minimize delay and maximize throughput -- and lack those options today.  Without them, we will see a lot of copying, and it will be very costly.

A second such issue involves comparing data that is captured in multiple ways.  If we believe in a sensor-rich IoT environment, we'll have many sensors in a position to hear a single utterance, or to see a single event, and different sensors may have differing qualities of signal.  So we might wish to combine sensor inputs to strengthen a signal, enrich a set of 2-D visual perspectives into a 3-D one, triangulate to locate objects in space, select the best quality image from a set of candidates, etc.

Again, we have a simple insight: cross-node interactions will also be a pressure point.  The basic observation leads to essentially the same findings as before.

Another form of cross node interaction arises because most data concentration systems form some kind of pipeline, with stages of processing.  If you know about Microsoft's Cosmos technology, this is what I have in mind.  Hadoop jobs (MapReduce) are another good example: the output of stage one becomes input to stage two.

So for such systems, performance is often dominated by flow of data between the stages.  For example, suppose that one stage is a GPU computation that segments video images, and a second stage guesses at what the segmented objects might be ("car", "truck", "rusty pipe"), and a third one matches those against a knowledge model ("Ken's car", "Diane's truck"), then compares the details ("is Ken on the same trajectory as predicted?").  We might see some form of fanout: there could be one element in the first stage, but there could be two or even many second or third stage components, examining very different information and with distinct purposes.  Each might send events or log data into storage.
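
A skeleton of that staged structure might look something like this (pure-Python stand-ins for what would really be GPU/TPU kernels and large learned models):

    # Skeleton of a staged pipeline with fanout: stage one "segments" a frame,
    # stage two labels the segments, and two independent third-stage consumers
    # (a knowledge-model matcher and an event logger) each receive the output.
    def segment(frame):
        return [{"frame": frame, "box": i} for i in range(2)]       # stand-in for GPU segmentation

    def label(segments):
        return [dict(s, label="car") for s in segments]             # stand-in for classification

    def match_knowledge_model(objs):
        return [f"frame {o['frame']}: {o['label']} matches predicted trajectory" for o in objs]

    event_log = []
    for frame in ("t0", "t1"):
        objs = label(segment(frame))          # stage 1 -> stage 2 handoff
        print(match_knowledge_model(objs))    # fanout: third-stage consumer A
        event_log.extend(objs)                # fanout: third-stage consumer B
    # Performance is dominated by these handoffs: how the bytes move between
    # stages, and whether a consumer must wait (a barrier) for its producer.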

So this then leads to scrutiny of architectural support for that form of structure, that form of handoffs, and attention to the requirements.  As one example: if we want our infrastructure to be highly available, does that imply that handoff needs to be reliable on an event-by-event basis?  The answers may differ case by case: data from a smart highway is very redundant because cars remain on it for long periods; if a glitch somehow drops one event, a follow-on event will surely have the same content.  So in an end-to-end sense, we wouldn't need reliability for that specific handoff, especially if we could gain speed by relaxing guarantees.  Eric would like that sort of thing.  Conversely, if an event is one of a kind ("Oh my goodness, Diane's muffler is falling off!  There it goes!!") you might want to view that as a critical event that must not be dropped, and that has many urgent consequences.

The tradeoffs between data throughput and latency are intriguing too.  Generally, one wants to keep an architecture busy, but a system can be busy in good or bad ways.  Further, goodness is very situation-specific: for warnings about roadway trash, a quick warning is really important.  For aggregated steady state data processing, longer pipelines and batch processing pay off.  And these are clearly in tension.

The pithy version would be along these lines: improved support for data pipelines will be important, and is lacking in today's systems.  Lacking progress, not only will we see a fair amount of inefficient copying, but we could end up with barrier delays much as in Hadoop (where stage two needs to sit waiting for a set of results from stage one, and any single delayed result can snowball to delay the entire computation).

I won't drag this out -- for a blog, this is already a rather long essay!  But hopefully you can see my core point, which I'll summarize.  Basically, we need to visualize a scalable data concentrator system in terms of the data flows it can support and process, and optimize to make those flows as efficient as possible.  Modern O/S structures seem ill-matched to this, although it may not be very hard to improve them until the fit is much better.  But in doing so, it jumps out that low latency and high throughput may well be in conflict here.  Following the bytes down these different paths might lead to very different conclusions about the ideal structure to employ!

I love puzzles like this one, and will surely have more to say about this one, someday.  But jump in if you want to share a perspective of your own!

Monday, 24 April 2017

Will smart memory need an ACID model?

This is really part II of my posting on smart memory from ten days ago.

Let's imagine that we've created the ultimate data warehouse, using Derecho.  This warehouse hosts terabytes of persistent memory.  It can absorb updates at a staggering rate (hundreds of gigabits per second), has tens of thousands of processors that crunch the data down and then store it as collections of higher level "knowledge models", and is always hungry for more.  

We used to think of data warehouses as repositories for nuggets of data, so perhaps we could call these "muggets": model nuggets that can now be queried.  Hopefully this term isn't painfully cute.

Next, we imagine some collection of machine learning applications that consume these muggets and learn from them, or compute some sort of response to them.  For example if the muggets represent knowledge collected on the local highway, the queries might be posed by smart cars trying to optimize their driving plans.  "Is it ok with everyone if I shift to the passing lane for a moment?"  "Does anyone know if there are obstacles on the other side of this big truck?"  "What's the driving history for this motorcycle approaching me: is this guy some sort of a daredevil who might swoop in front of me with inches to spare?"  "Is that refrigerator coming loose, the one strapped to the pickup truck way up ahead?"

If you drive enough, you realize that answers to such questions would be very important to a smart car!  Honestly, I could use help with such things now and then too, and my car is pretty dumb.

We clearly want to answer the queries with strong consistency (for a smart car, a quick but incorrect answer might not be helpful!), but also very rapidly, even if this entails using slightly stale data.  In Derecho, we have a new way to do this that adapts the FFFS snapshot approach described in our SOCC paper to run on what we call version vectors, which are how Derecho stores volatile and persistent data. Details will be forthcoming shortly, I promise.

Here's my question: Derecho's kind of data warehouse currently can't support the full ACID database style of computation, because at present, Derecho only has read-only consistent queries against its temporally precise, causally consistent snapshots.  So we have strong consistency for updates, which are totally ordered atomic actions against sets of version vectors, and strong consistency for read-only queries, but not read/write queries, where an application might read the current state of the data warehouse, compute on it, and then update it.  Is this a bad thing?
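
Here is a cartoon of that model (emphatically not Derecho's real API): updates are totally ordered appends to versioned data, queries are read-only and evaluated as of a time t slightly in the past, and there is deliberately no read-the-current-state-then-write-back operation.

    # Cartoon of the model: totally ordered updates, temporally precise
    # read-only queries, and no read/write transaction.
    class VersionedObject:
        def __init__(self):
            self.versions = []                    # (time, value) pairs, in update order

        def update(self, t, value):               # strong consistency: one total order of updates
            self.versions.append((t, value))

        def query_at(self, t):                    # read-only, against the state as of time t
            past = [v for (ts, v) in self.versions if ts <= t]
            return past[-1] if past else None

    lane_speed = VersionedObject()
    lane_speed.update(t=1000, value=65)
    lane_speed.update(t=1100, value=48)           # congestion detected
    print(lane_speed.query_at(t=1050))            # -> 65: consistent, but ~100ms in the past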

I've been turning the question over as I bike around on the bumpy, potholed roads here in Sebastopol, where we are finishing up my sabbatical (visiting my daughter, but I've also dropped in down in the valley and at Berkeley now and then).  Honestly, cyclists could use a little smart warehouse help too, around here!  I'm getting really paranoid about fast downhills with oak trees: the shadows invariably conceal massive threats!  But I digress...

The argument that Derecho's time-lagged model suffices is roughly as follows: ACID databases are hard to scale, as Jim Gray observed in his lovely paper "The Dangers of Replication and a Solution".  Basically, the standard model slows down as n^5 (n is the number of nodes running the system).  This observation gave us CAP and BASE and ultimately, today's wonderfully scalable key-value stores and NoSQL databases.  But those have very weak consistency guarantees.

Our Derecho warehouse, sketched above (fleshed out in the TOCS paper we are just about to submit) gets a little further. Derecho can work quite well for that smart highway or similar purposes, especially if we keep the latency low enough.  Sure, queries will only be able to access the state as of perhaps 100ms in the past, because the incoming database pipeline is busy computing on the more current state.  But this isn't so terrible.

So the question we are left with is this: for machine learning in IoT settings, or similar online systems, are there compelling use cases that actually need the full ACID model?  Or can machine learning systems always manage with a big "nearly synchronous" warehouse, running with strong consistency but 100ms or so lagged relative to the current state of the world?  Is there some important class of control systems or applications that would be ruled out by that limitation?  I really want a proof, if so: show me that omitting the fully ACID behavior will be a huge mistake, and convince me with mathematics, not handwaving... (facts, not alt-facts).

I'm leaning towards the "Derecho would suffice" answer.  But very curious to hear other thoughts...

Friday, 17 March 2017

The CAP conjecture is dead. Now what?

CAP has been around now for something like 15 years.  There are some circles in which the acronym is even more famous than FLP or SAT!  But CAP was never a theorem (let's call it a "folk-theorem"), and by now we have more and more examples of systems that violate CAP. 

Yet even if CAP is false, it was also a fantastic rule of thumb that was incredibly valuable to developers tasked with building high performance distributed systems on the cloud.  If we start to teach students that CAP is false, what can replace it in this role?

A little historical context: CAP is short for "you can only have two from Consistency, Availability and Partition Tolerance".  This assertion was initially put forward by Eric Brewer in a PODC keynote talk he gave in 2000. The justification he offered ran along the following lines.  First, he observed that in today's increasingly globalized web systems, we invariably deploy services with high latency WAN links between them.  These are balky, so services are forced to make a choice: either respond right now based on data locally available, or await restoration of the WAN link.  A tradeoff that obviously favors availability over consistency, and already tells us that many web services will have to find a way to prevent responses that reflect stale data from being noticed by users, or if they are noticed, from causing problems.  He used the term "partition tolerance" for this kind of fault-tolerance (namely, giving a response even when some link is down).  Hence the P in CAP.

He suggested that we are favoring "A and P over C".

Then he observed that even in a single data center, if you want the highest possible levels of performance, you'll need to handle web requests on edge nodes that take their data from cache, without talking to a backend server first: again weaker consistency, but higher availability.  So again, we see the A, as a kind of synonym for rapid response.

So he looked at all the different pairings: C and A over P, C and P over A, A and P over C.  He concluded that in practice we always seem to find that by taking A and P we get the best scalability.  But CAP itself asserted that although other mixes work, you always have to pick the two you like best, and you'll erode the third.

Although the definitions of A and P are a bit strange (P seems to have a bit of A mixed in, and also seems to sometimes mean fault-tolerance, since partitions never really arise within a data center), CAP is sort of catchy.  But like many such acronyms, much depends on the details: if you try and give rigorous definitions, to turn CAP into a theorem (people have done so), you find that it only holds in some narrow situations.

The result is that as professors teaching cloud computing, we generally treat CAP as a clever acronym, but one that works mostly as a rule of thumb.  In some sense, CAP is a useful kind of folklore.

CAP took hold back in 2000 because at that time, companies like eBay and Amazon were struggling with the high costs of ACID transactions in systems built around the SQL database programming model.  Scalable database performance poses issues that are much more nuanced than the ones Eric had in mind: there were puzzles of lock conflict, complexity, data pipelines with non-trivial asynchronous ordering requirements, etc.  But the bottom line is that performance of the big systems at eBay and Amazon was erratic and often strayed outside the magic 100ms target for web-service and web-page responses.  This is the point at which a human user feels that the system is "snappy" and everyone wants to be in the sub-100ms range.

So, the technology leaders at eBay began to tell their developers that it was absolutely fine to write SQL code as a starting point in web application design (a pragmatic decision: the people they were hiring mostly had extensive SQL experience from their database courses), but that once the SQL code was working, to weaken the transactions by turning them into a series of individual atomic actions.   eBay began to talk about the resulting development methodology using a new acronym: BASE, by which they meant "Basically Available, Soft-state systems with Eventual Consistency."

Notice that BASE doesn't abandon consistency.  Instead, it points out to the developer that many web systems just don't need ACID guarantees to work correctly.   ACID and SQL are great for creating a quick prototype that will be robust and easy to debug, but then you "optimize" it by taking away the ACID properties, without breaking the behavior in ways that violate the specification.
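
In caricature (every name here is invented, and the "database" is a stand-in), the refactoring looks something like this: the ACID prototype does everything in one transaction, while the BASE version performs the essential writes as individual atomic steps and defers the rest to an asynchronous queue.

    import queue
    from contextlib import contextmanager

    class TinyDB:                                  # stand-in for a real database
        def __init__(self):
            self.tables = {"orders": [], "seller_stats": []}
            self.inventory = {"widget": 10}

        @contextmanager
        def transaction(self):                     # stand-in for a real ACID transaction
            yield

    def place_order_acid(db, order):
        with db.transaction():                     # one all-or-nothing unit
            db.tables["orders"].append(order)
            db.inventory[order["item"]] -= 1
            db.tables["seller_stats"].append(order)      # analytics updated in-line

    def place_order_base(db, work_queue, order):
        db.tables["orders"].append(order)                # atomic step 1: the essential write
        db.inventory[order["item"]] -= 1                 # atomic step 2: may briefly lag step 1
        work_queue.put(("update_seller_stats", order))   # step 3: applied later, eventually consistent

    db, q = TinyDB(), queue.Queue()
    place_order_acid(db, {"item": "widget", "buyer": "alice"})
    place_order_base(db, q, {"item": "widget", "buyer": "bob"})

In this little sketch the specification still holds -- orders are never lost -- but the seller statistics are now allowed to be briefly stale.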

Amazon embraced BASE too.  Around 2006, they decided to rebuild many of their core applications around a key-value technology called Dynamo, but SQL users found it hard to use, and by 2008 the adoption of Dynamo began to falter.  To make the transition easier, Amazon layered in a NoSQL API for Dynamo, called Dynamo-DB: now SQL code could run on Dynamo, but with weaker guarantees than for a full SQL system (for example, NoSQL lacks join operations), and Dynamo-DB happened to be especially well-matched to BASE.

So you can see from these examples why CAP would be such a convenient way to motivate developers to make the switch: it more or less tells them that unless they optimize their code using BASE, it won't scale properly.  Moreover, the eBay memos about BASE include step by step instructions to explain precisely how to go about doing it.

Today, fifteen years later, we've discovered more and more ways to implement scalable, high performance cloud services: edge caches that maintain coherence even at global scale, fully scalable transactional key-value storage systems, and the list goes on.  Derecho is one example: it helps you build systems that are highly Available, Replicated and Consistent.  Call this ARC.

The cool twist is that with ARC, you get lock-free strong consistency, right in the cloud edge! This is because we end up with very fast replication at the edge, and every replica is consistent.  You do need to think hard about your update sources and patterns if you hope to avoid using locks, but for most applications that aspect is solvable because in the cloud, there is usually some sense in which any particular data item has just one real update source.  So there is an update "pipeline" and once you have consistency in the system, you end up with a wide range of new options for building strongly consistent solutions.

An example I like very much was introduced by Marcos Aguilera in Sinfonia.  In that system, you can make strongly consistent cache snapshots, and run your code on the snapshot.  At massive scale you have as many of these snapshots as needed, and transactions mostly just run at the edge.  But when you want to update the system state, the trick he suggested is this: run your transaction and keep track of what data it read and wrote (version numbers).  Now instead of just responding to the client, generate a "minitransaction" that checks that these version numbers are still valid, and then does all the writes.  Send this to the owner of the true database: it validates your updates and either commits by applying the updates, or rejects the transaction, which you can then retry.
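
Here is a sketch of that commit step (heavily simplified, and not the actual Sinfonia protocol or API): the edge records the versions it read while speculating, and the data's owner applies the writes only if those versions are still current.

    # Sketch of the minitransaction idea: speculative work happens at the edge
    # against a snapshot; the owner commits only if the versions read are still
    # current, otherwise it rejects and the edge retries.
    class Owner:
        def __init__(self, data):
            self.data = {k: (0, v) for k, v in data.items()}     # key -> (version, value)

        def snapshot(self):
            return dict(self.data)

        def commit_mini(self, read_set, writes):
            """read_set: key -> version observed; writes: key -> new value."""
            if any(self.data[k][0] != ver for k, ver in read_set.items()):
                return False                                     # stale read: reject, caller retries
            for k, v in writes.items():
                old_ver, _ = self.data.get(k, (0, None))
                self.data[k] = (old_ver + 1, v)
            return True

    owner = Owner({"balance": 100})
    ver, val = owner.snapshot()["balance"]                       # speculative read at the edge
    read_set, writes = {"balance": ver}, {"balance": val + 25}   # ... and the computed update
    print(owner.commit_mini(read_set, writes))                   # -> True  (versions still valid)
    print(owner.commit_mini(read_set, writes))                   # -> False (stale: must re-run)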

Since so much of the heavy lifting is done when speculatively executing the transactions at the first step, the database owner has way less compute load imposed on it.  Then you can start to think about systems with sharded data that has different owners for each shard, and limits the transactions to run within a single shard at a time (the trick is to design the sharding rule cleverly).  Sinfonia and the follow-on systems had lots of these ideas.

ARC creates a world in which solutions like the Sinfonia one are easy to implement -- and the Sinfonia approach is definitely not the only such option.

I'm convinced that as we move the cloud towards Internet of Things services, developers will need this model, because inconsistency in a system that controls smart cars or runs the power grid can wreak havoc and might even be dangerous.

So now I want to claim that ARC is the best BASE story ever!  Why?  Well, remember that BASE is about basically available, soft-state systems with eventual consistency.  I would argue that in examples like the Sinfonia one, we see all elements of the BASE story!

For example, a Sinfonia system is basically available because with enough consistent snapshots you can always do consistent read-only operations at any scale you like.  The state is soft (the snapshots aren't the main copy of the system: they are replicas of it, asynchronously being updated as the main system evolves).  And they are eventually consistent because when you do make an update, the mini-transaction validation step lets you push the update results into the main system, in a consistent way.  It might take a few tries, but eventually it should succeed.

What about the eventual consistency aspect of BASE in the case of ARC?   ARC is about direct management of complex large-scale strongly consistent replicated state.   Well, if you use ARC at the edge, back-end servers tend to be doing updates in a slightly time-lagged way: batched updates and asynchronous pipelines improve performance.  Thus they are eventually consistent, too: the edge queues an update, then responds to the end-user in a consistent way, and the back end catches up soon afterwards.

And ARC isn't the only such story:  my colleague Lorenzo Alvisi has a way to combine BASE with ACID: he calls it SALT, and it works really well.

So, feeling burned by CAP?  Why not check out ARC?  And perhaps you would like a little SALT with that, for your databases?