A Few Thoughts on Distributed Computing: August 2017

Tuesday, 29 August 2017

The adequacy of specifications

I had coffee with a visitor yesterday: Professor Eva Kuhn from the Technical University in Vienna. Our conversation was focused on the power, and limitations, of new methods for creating correct solutions to distributed computing problems.

As you probably know, for many years I've been in dialog with a number of programming language researchers (notably Bob Constable, Mooly Sagiv and Noam Rimsky, although the full list would be much longer and would include Leslie Lamport, Nancy Lynch, Jay Lorch and many others), all specialists on a new way of generating provably correct distributed computing systems and protocols from formal specifications. I'm not an expert on this topic myself -- my close collaborator Robbert van Renesse is far more knowledgeable than me, and actually works with such tools himself. Call me an enthusiast.

The methodology they've worked with is roughly as follows. Using temporal logic or a similar abstract mathematical representation, one creates a description of the problem in terms of sets (the members might be understood as computing nodes, processes, variables, the network itself -- the designer gets to decide), operations upon them, and statements about them. In this way we can model a distributed collection of computers, exchanges of messages between processes running upon them, and even crashes.

Next, one specifies a desired behavior, such as uniform agreement (the abstract problem underlying consensus protocols such as Paxos, including the versions we built into Derecho). In some sense this step describes a "goal", but not the method for achieving the goal. Think of the specification as a set of constraints, assumptions, invariants: properties that characterize a correct solution. Ideally, the behavior would have a safety component ("the solution only does good things, in the following sense...") and a liveness component ("provided that the following assumptions hold, the solution always terminates"). But some protocols aren't always live, and sometimes the conditions under which they are live are difficult to pin down, so this second aspect isn't necessarily feasible.
At any rate, in the penultimate step, one specifies the desired protocol itself, but still in a logic formalism. This is done by expressing the behavior of the protocol as a sequence of operations in the underlying framework -- for example, as a set of temporal logic "actions" on the underlying sets, carefully limited so that any "distributed" operation is performed using our abstracted versions of primitives that make sense in real networks. For example, the standard Paxos protocol, with its leader-based rounds used to construct ballots and contend for slots in the Paxos log, would be transformed into a series of set operations that match the behavior of the leader, the local copies of the logs at each of the acceptor processes, and so forth. The lowest level operations would be formal versions of message send and receive primitives.

With this in hand, the next step is to use a theorem prover like NuPRL (Constable's system) or Ivy (the one used at Tel Aviv University) to prove that the abstractly-specified protocol implements the abstractly-formalized behavioral specification. For example, in the case of Paxos, such a proof would show that for any feasible mix of roles (leader, acceptor, etc), and for every reachable protocol state, the correctness invariants hold. We would also want to show that from any initial state, some decision states are always reachable, but of course this is not always possible. For one thing, the FLP impossibility prevents us from creating a protocol in which a decision is guaranteed to occur within finite time, but in fact there are similar cases in which progress can't occur, namely situations that might involve a crash of a majority of the machines, or a partitioning of the network. For example, if you were to launch a consensus algorithm in a setting where more than half the nodes have crashed, clearly it wouldn't be able to reach consensus. As mentioned above, in the ideal case you actually specify assumptions under which progress can occur ("we assume that at least a majority of the processes are operational and remain operational throughout the execution of the protocol"), but it isn't always as easy as you might expect to specify those assumptions. So sometimes, this aspect is not addressed.

These NuPRL or Ivy (or TLA+, Dafny, Coq, Larch...) proofs are semi-manual. There is a proof-checking step, typically done using a technique called model checking (basically, the system enumerates reachable states and verifies that the invariants hold for all such states). Then there are higher level theorems that can offer short-cuts: general proof cases that were previously proved and checked that essentially represent forms of meta-reasoning, covering a class of behaviors. These allow the checker to avoid repeatedly model-checking basic steps and to instead model-check the pattern, and then look for instances of that pattern. Finally, when the checker stalls, a human can intervene and help it find the proof (or perhaps modify the protocol or the target environment specification, to make it easier to establish the desired property).

The meta-proofs (the higher level ones) are very interesting and necessary, because they potentially address cases that cannot be model-checked by exhaustively enumerating every reachable state. For example, suppose that some protocol can generate unbounded runs of states, but a proof exists that every reachable state satisfies a given property. With that "meta" proof in hand, we don't need to examine every reachable state in order to know that the property always holds, and this might let us make a true statement about the protocol that references the meta-property. In contrast, had we tried to examine every reachable state, the checker would have failed, since the runs are unbounded, hence the number of states needing to be checked is unbounded.
In practice, the power of a tool like Ivy or NuPRL centers on this kind of meta-reasoning, and the sophistication of the tactics used to match proof requirements to the library of meta-proofs. Someday, these tactics may approach the sophistication of a human logician, and in fact NuPRL has already been used to semi-automatically solve some deep open problems in logic! But there are many cases it would not yet be able to tackle.

When checkers fail, they sometimes can exhibit a counterexample: perhaps, a run in which two processes enter "decision" states but hold different outcome values, violating consensus. But they can also simply fail to find a proof. This is because a powerful enough specification language can encode undecidable questions, hence one simply cannot automate the entire proof task: there are guaranteed to be some problems on which a prover can neither conclude that a statement is true, nor false. Moreover, for any finite amount of resources (space, computer time) there are problems that would take much more space, or much more time, to model check. So for several reasons, the checker might stall. If it succeeds, though, the result is that we have a proof for our abstract protocol specification: we can show that it solves the problem, as specified.

An interesting point, tangential to what I really want to say here, is that one could perhaps extract minimum conditions for progress (a form of weakest precondition), in the manner common in PL correctness proofs of a few decades ago. This would automate extraction of assumptions needed to prove "conditional liveness": if the precondition holds, then progress will occur. To me that would be amazing, and would automate a step that Chandra and Toueg first proposed for consensus nearly 20 years ago (the famous <>W oracle). If you love theory and languages, do consider tackling this challenge!

But staying "on message", the last step with a prover such as Ivy or NuPRL is to extract a provably correct program in C or O'CaML from the proof itself. That is, given a specification of the problem, and a specification of the protocol, and a proof that the given protocol solves the problem, these tools let you push a button and automatically extract the corresponding code in an executable form, after which you simply compile and run it! Ideally, the compiler used should be proved correct (there is a proved compiler for a subset of C), and the operating system itself be proved correct (like SEL4), and the hardware verified too (using tools like Verilog) and the foundry that made the chips... well, one goes as far as one can. In fact any chain of technologies has its limitations. But let's not fuss over the obvious.

Professor Kuhn works with people developing safety critical systems for self-driving cars, control of potentially dangerous equipment, and so forth, and for her, this new methodology was fascinating. In fact she isn't the first person I've talked to about this: Chris Hobbs, at QNX, and his colleague Peter Shook, are also extremely interested in the potential of such tools. By taking the programmer out of the loop and extracting code directly from a proof. we seemingly eliminate a weakness in many kinds of control systems, where one historically would hire a team to design a solution, but then would depend on people to write the required code, after which you use a method such as the "B method" to try an prove the code correct. The resulting human-generated proofs are only as good as the human team carrying them out, and the worry always exists that they might have failed to check some important case. With a computer-generated proof, there are computable "checking" methods that can quickly establish whether or not every case was model-checked.
And yet even with automated extraction, issues remain. Can every solvable problem be tackled in this way? Over coffee, Professor Kuhn and I talked about two concerns.
First, there are solvable problems for which a model-checker might run low on space or time and throw up its hands, and hence that could be solved in principle, but in fact might not be amenable to formal treatment even so. So one can imagine situations in which there is some important task, like controlling the flaps on an airplane wing, a good solution that our engineering colleagues want to implement, and a hand-built proof of correctness that "seems" adequate, but for which there isn't any feasible fully-machine checked counterpart, because our computers might not be powerful enough, or perhaps because our techniques for generating model-checked proofs might not have the needed meta-proof methods at hand.

A related issue was raised by Hobbs and Shook at QNX: If a proof (and for that matter, the corresponding C program) is not "natural" in appearance, a human skeptic might not be easily convinced that it is correct. Thus there are also model-checked proofs that might be rejected by a team tasked with confirming that the airplane is safe to fly: the computer says "yes, here is the proof" but the skeptic in the team might worry that if no human can read and fully understand that proof, it could be flawed in some systematic way and yet this would be overlooked.
That concern leads directly to the second question Professor Kuhn and I found intriguing. Our conversation ended on a related form of impasse: suppose that the original problem specification itself is somehow inadequate? For example, one could imagine a specification for behavior of flaps on an airplane wing that leaves out some important case that can actually arise, and hence fails to pin down the correct behavior in that case. Such a specification leaves undefined the correct behavior for some actual situations that might arise in the real world, although perhaps it covers "most" cases and hence might seem complete and useful: you could still generate code, look at examples, trace the logic, and see that for those examples the code behaves properly. The lurking danger would be that because there were unspecified but real cases, examples could also exist in which the behavior would be different than intended, because the specification neglected to cover them.
Thus we would want to be sure that the specification itself was adequate in the sense of fully specifying every case that could really arise in practice. This, though, seems to be a somewhat less fully solved problem. Indeed, it seems to pose an impossible requirement!

I've been shown some work that looks at specifications, and the question is a classic one: large communities are hard at work on the topic, and have been for decades. The issue is that the "adequacy of a specification" problem itself seems to be undecideable. The intuition is sort of obvious: if one could decide all possible adequacy situations, what would stop me from encoding some simple undecideable problem (such as the halting problem), and then using the test of adequacy to solve the undecideable problem? Seemingly, nothing stops me from doing this. Thus adequacy-testing should itself be undecideable. I bet someone has proved this, although I don't think I've ever seen such a paper. I need to read more of the PL literature.

At any rate, what we see here is that on the one hand, our field seems to be at the edge of a major advance (automated synthesis of provably correct solutions to safety-critical systems problems), but on the other hand, is also approaching a natural and final form of impossibility: the risk that for important cases, it may turn out to be impossible to verify that the specification itself is "complete".

There is reason for hope. There are many situations in which if one is given a program and must answer a question about it, the task cannot be solved because the given program might encode an undecideable task. Yet when we use computational methods to generate programs, we can often prove that our methods only generate sound programs.
Could we solve the adequacy problem in this manner? For example, is there a way to generate specifications through some kind of dialog with the developer, that would only generate adequate specifications with full coverage of the problem domain? This, I think, should be possible. The resulting solution might not be able to specify every problem you can dream up, but for the things it could specify, we would know that the specification doesn't overlook any possible scenarios.
Perhaps the practical class of critical control systems doesn't actually require the ability to specify control tasks that encode undecideable questions. If so, then perhaps every needed critical control solution can be created. This would finesse the issue: even though there may be ways to encode undecideability into the same formalism, our more limited use of the methodology would never attempt to do the impossible.
My hope is that we are rapidly heading towards this kind of happy outcome... with all the progress on proofs using NuPRL an Ivy (and other systems like TLA+, Dafny, Larch and Coq), we'll know pretty soon.

Thursday, 17 August 2017

The systems-area obsession with peak performance

In systems, there has always been a completely understandable focus on peak performance. For me personally, Butler Lampson's early papers on "Hints for Operating System Design" (which basically argued that we need to unclutter the critical path), the famous End to End paper by Saltzer, Reed, and Clark (which argued for taking unnecessary functionality out of the network), and the Birrell and Nelson paper on the performance of Firefly RPC (which argued for taking unnecessary mechanism out of the remote procedure call path) were awe-inspiring classics: papers you reread decades later, and that still amaze.

In fact for people who get pleasure from programming, there is a natural tendency to build systems and evaluate them, and obviously any such task centers on a peak performance perspective. Optimizing peak performance is fun, and honestly, can be addicting: there is such a direct feedback when you succeed. It is very rare to see a first version of a system that can't be sped up by a factor of 10 or more even just by doing basic optimizations, and in some cases, we end up with speedups of 100 or 1000-fold, or even more. What a thrill!
Yet there is a sense in which our love for speed worries me: I wonder if this classical way of thinking about systems might be fading as a pure form of innovation in the eyes of the field as a whole. The core issue is that existing systems (here I mean the mainstream workhorses: the operating system, the network layer, perhaps the compiler) all work pretty well. Of course, speeding them up is a worthy endeavor, but it may no longer matter enough to be a justifiable goal in its own right. Research on performance is just not a compelling story, if you focus on this particular layer.
Why should this matter? In fact, it might not matter at all, were the systems community itself aligned with these larger external forces that shape the way we are perceived by other communities and by computer science as a discipline. But right now, I suspect, there is a substantial disconnect: people like me are addicted to speed (hmm... that doesn't sound quite right), while people who hang out at conferences like NPS and KDD don't really spend much time worrying about the performance of the systems components my crowd focuses upon, like the latest version of Linux running on the latest multicore hardware platform.
As I write this blog entry, this same dynamic is evident even within my own research group. For us, it plays out as a tension between telling the Derecho story as a story about a new concept ("smart memory") and telling it as a story about raw speed ("fastest Paxos and Atomic Multicast, ever!").
It seems to me that the broader field tends to value impact more than what might be called "narrow" metrics, such as the speed of the Linux I/O path. Invent a new way of doing things, and you can change the world in interesting ways. So the puzzle that emerges is this: if the systems community has started to drift relative to the broader computer science community as a whole, don't we then run some risk of becoming marginalized, by virtue of over-emphasizing aspects that the broader computer science community views as unimportant, while actually rejecting innovations that the broader community might be thrilled to hear about?

Take Spark, a recent home run story from Berkeley. If you think back, the first research papers on Spark told a story (primarily) about huge speedups for MapReduce/Hadoop, which obtained by smarter in-memory caching of files (they call them RDD: Resilient Distributed Data objects) and smarter task scheduling, so that computations would tend to exhibit affinity relative to the cached content. Years later, it seems clear that the more significant aspect of Spark -- the more impactful innovation -- was that it created a longer term computing "model" in which data loaded into the Databricks system (the new name for Spark) lives in memory, is transformed through various stages of computation, and where the end-user has a powerful new experience of data mining with vastly better performance because these RDDs remain resident in memory as long as there is enough space, if they continue to be used now and then. Systems people may think of this as a story of performance... but NIPS and KDD people perceive this as a paradigm shift. As a systems person, it seems to me that our community in fact accepted the right papers, but for the wrong reason, and in fact that the early advising of the Spark inventors (by their faculty mentors) may even have misunderstood the real contribution and steered them down the less vital path. Of course, the enthusiasm for Spark quickly reset the focus, and today, the Databricks company that offers commercial support for the Spark platform focuses on high-productivity data mining with blazing performance, rather than portraying the story as "speeding up your Hadoop jobs."

It isn't a realistic thing to wish for, but I'll wish for it anyhow: as a field, it seems to me that we need to try and pivot, and to embrace change in terms of what styles of computing really matter. The world has changed drastically in the past decade or two: what matters most, right now, is machine learning. This is partly because systems work pretty well. Disruption comes from big reaches, not small 10x optimizations to things that already worked fairly well.

I don't know anything more about the future than anyone else. My focus, in Derecho, is on "smart memory," but will this ever become a recognized field of research, one that other people would work on? Does the world really need smart memory services for IoT applications? I hope so, but of course that question will be answered by other people, not by me. And so one can easily understand why my students love the raw speed story: For them, fast replication is a more well-defined systems topic, with an obvious and specific role in existing systems. People use replication solutions. So it makes sense for them to gravitate towards speed records.

Indeed, for them, viewing machine learning as the real goal, and performance as just one dimension, makes systems research feel secondary to machine learning research. Nobody wants to feel like the plumber or the electrician: we all want to build the house itself. Yet perhaps this is the new reality for systems researchers.
Will such a pivot be feasible? Perhaps not: the systems addiction to speed runs deep. But at the same time, when I visit colleagues in industry, I find them deeply embedded into groups that are doing important practical tasks that often center on a machine learning objective. So it seems to me that if we don't evolve in this necessary way, we'll simply fade in importance to the broader field. We just have to try, even if we might not succeed.

Thursday, 10 August 2017

Zero-copy computing

Last night, one of my group members tossed together a simple experiment on Derecho as the first step towards a much fancier experiment he needs to run. To his surprise, the performance was a fifth of what we've been seeing in our experiments that will go into the ACM TOCS submission we plan to send out any day now.

What could cause a 5x slowdown?

As it turns out, the issue is easy to understand and points to a deeper and rather interesting puzzle. RDMA is blazingly fast, as you know if you've read my earlier postings on the topic. Basically, transfers occurs entirely in hardware: the NIC on machine A talks to the memory unit on machine A and grabs chunks of data, which zip straight over the wire to the NIC on machine B, which stores the data directly into the memory unit on B. This gives a rate of data transfer that can be much higher than any single-core memcpy operation could approach: even if there is a hardware instruction for copying blocks of memory, that instruction still will operate by a series of fetch and store operations and will need to talk to the memory unit twice for each cache-line-sized chunk of data.

So RDMA tends to run 2x faster than copying. If you look at an end-to-end pipeline, you'll generally see that RDMA delays are sharply higher than the latency of interacting with local DRAM, but the data transfer speed for a big transfer can maintain this 2x benefit, from data on A all the way to the receive buffer on B.

This is what my student ran into last night. In his case, by using Derecho to send long null-terminated strings he benefitted because strings are easy to check for correct content ("Hello world, this is update 227!") but very costly to create and transmit. Derecho was running 5x too slowly because it spent all its time waiting for his expensive string creation code to run, and for Derecho's orderedsend to marshall the objects before sending. Yet our instinct tells us that those should be viewed as fast operations. Well, instinct has ceased to be correct: RDMA is a new world.

In fact you've probably thought about the following question. You are driving on highway 101 in a fancy Tesla sports car with one of those insane speed buttons. Elon Musk has really pushed the limits and the car can reach 2/3rs of the speed of light. Yet your commute from Menlo Park to South San Jose takes exactly as long as it took back when you were driving your old Subaru Imprezza! The key insight isn't a very deep one: even with a supercar, the "barriers" on the highway will still limit you to roughly the same total commute time, if those barriers are frequent enough.

In modern operating systems and languages, these kinds of barriers are pervasive.

Today's most widely used standard systems maintain data in various typed data structures: strings, classes defined by developers, etc. Even an object as simple as a string may require byte by byte copying, just to find the null terminating character, and by itself, this will be a further 2.5x slower than RDMA. So send a string, and your end-to-end numbers might be 5x slower than the best we get from Derecho with byte arrays of known size. Send a class that needs complex marshaling and the costs go even higher (if the fields can be copied directly from memory, scatter-gather is an option, but otherwise we would often need to first copy the data into a send buffer, then send it, then free the buffer: a sequence that could push us even beyond that 5x slowdown).

What you can see here is that at speeds of 100Gbps or higher, copying is a devastingly slow operation! Yet in fact, modern operating systems copy like crazy:

They copy data from user space into kernel space prior to doing I/O, and back later.
Modern languages are very relaxed about creating clones of objects.
Other than C++ 14, every method call copies arguments onto the stack, item by item.
Garbage collectors copy and compact all over the place.

This suggests that it may be time for someone to create a zero-copy operating system, and perhaps also to start looking carefully at what it would take to do zero-copy versions of modern programming language frameworks like Python, Go, C# and so forth (C++ is pretty good in this way). Otherwise, as RDMA pushes towards 400 Gbps (200Gbps is already becoming fairly common), and 1Tbps within a decade, we'll find that our RDMA path seems nearly instant... but that the applications just can't benefit!