In a hurry? Here’s everything you need to know – you can skip the body of the discussion.
Vendors are starting to create safety critical computing systems that have large code bases, such as control systems for self-driving cars. The important word is “large”, and we’ll see what I mean by that shortly.
The most powerful proof methodologies are extremely effective, but even so, are unable to deal with large, complex applications.
- There are highly effective ways to prove correctness for small components, such as the implementation of a data structure, or even an atomic multicast or Paxos protocol. But they simply don’t work for systems with black-box dependencies on large, complex components like commercial database or message-bus products.
- “Small” typically means hundreds of lines of code, although some proved-correct compilers exist (for subsets of C or similar languages). There is a microkernel operating system called seL4 that has been fully proved: it has thousands of lines of code. This isn’t all that small, but there is a qualification: seL4 was “codesigned” with its proof. Both were developed at the same time, and the code was often written in ways that lend themselves to being formalized and proved, which might not be how the developer would have implemented those methods had the main goal been speed or OS extensibility.
- Codesign is a common need: if you try to write the code first and only then prove it, your task will be much more difficult. So we see codesign almost everywhere today. In my area, distributed computing, this style of work is seen in Microsoft’s IronFleet Paxos implementation: the team spent months developing the code and proof together, creating a single Dafny program (a verification-aware language with Java-like syntax) in which the Paxos code is marked up with various assumptions (preconditions), derived inferences (postconditions), invariants, etc. (I’m singling out the Dafny-based work because I like the Dafny markup approach, but this is one of a few notable recent projects that use formal methods.)
- Weird but true: even after proving IronFleet correct, the authors had no idea whether it would work! The Paxos specification omits all sorts of practical aspects that an implementation needs to deal with, like opening files, so those aspects had never been proved correct. It worked on the first try, but presumably it ran on some kind of preexisting framework that had already been debugged in some other context.
- It turns out that there are many ways to prove things correct. The biggest successes have used model-checking (a form of automated proof that centers on checking that reachable states satisfy a set of invariants). Basically, the developer specifies a set of logic tests that define safety, and the checker simulates execution of the code to see if it can find code paths that lead to violations of the invariants. To avoid having an unbounded search space, help from the developer is needed at various steps. One common task is to introduce intermediary properties. For example, a developer might show that if such-and-such a series of cases are checked successfully, they actually cover all reachable states.
- At Cornell, my colleagues have pioneered a second class of provably correct protocols using constructive logics (also called proof-refinement logics), employing a novel kind of logic-driven interactive development tool (an IDE). The exciting aspect of this work is that at each stage, the proof has a corresponding executable program. For example, were you to use one of these tools to prove that a composite integer N can be factored by testing the numbers from 2 to √N as possible factors, an artifact of the proof would be a program that actually computes factorizations of a given input. There have also been proofs of Paxos that use this approach.
- So… we have at least two ways to produce a provably correct implementation of a protocol such as Paxos. The puzzle is that given this component, it is not clear what to do with it: is a proved component somehow different from one that was built in the usual way? For example, one could imagine a Paxos replicated “type” that the user would see as an extensible programming language type trait, and could then use to leverage the underlying guarantees. But there has been little attention to such issues, hence the application designer working on a safety-critical system has little help, even when using a proved-correct component such as Paxos.
- There are even ways in which a protocol might break when used in a seemingly acceptable manner. Protocols like Paxos often make assumptions about how they will be used, and about the environment in which they will run. These assumptions rarely enter into the proof, hence are outside the scope of the formal work. Yet they matter for correctness!
- Examples include compiler optimizations (normally disabled in a proof), flow-control, message fragmentation (for networks that require that big objects be sent as a sequence of small ones), the bootstrapping sequence (for startup), mechanisms used for garbage collection of the log, the protocol used by a stateful application to resynchronize its state with the protocol state after a restart, correct interaction with the network hardware, correct “management” of the logged data, mechanisms for dealing with unanticipated faults, and so forth.
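To make the model-checking idea concrete, here is a minimal sketch of an explicit-state checker: breadth-first search over reachable states, testing an invariant at each one. The toy two-process protocol and its mutual-exclusion invariant are invented for illustration; real checkers such as TLC or SPIN add the developer-guided abstractions mentioned above.

```python
from collections import deque

def model_check(initial, step, invariant):
    """Breadth-first search of reachable states, checking an invariant.

    initial: iterable of start states; step: state -> successor states;
    invariant: state -> bool. Returns a violating state, or None if
    every reachable state satisfies the invariant.
    """
    seen = set(initial)
    frontier = deque(seen)
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            return s
        for t in step(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return None

# Toy protocol: two processes, each either 'idle' or 'critical'.
# A process may enter the critical section only when both are idle.
def step(state):
    a, b = state
    succs = []
    if a == "idle" and b == "idle":
        succs += [("critical", b), (a, "critical")]
    if a == "critical":
        succs.append(("idle", b))
    if b == "critical":
        succs.append((a, "idle"))
    return succs

mutual_exclusion = lambda s: s != ("critical", "critical")
assert model_check([("idle", "idle")], step, mutual_exclusion) is None
```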
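The constructive-logic example is also easy to visualize: the executable artifact extracted from such a factoring proof would behave roughly like this trial-division sketch (a hand-written illustration, not the output of any proof tool):

```python
def factor(n: int) -> int:
    """Return a nontrivial factor of composite n by trial division.

    Mirrors the proof sketch in the text: test candidate divisors up
    to sqrt(n). We test 2 first, then odd candidates.
    """
    assert n > 3, "precondition: n must be a composite integer > 3"
    if n % 2 == 0:
        return 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d
        d += 2
    # No divisor found: n was prime, so the precondition was violated.
    raise ValueError(f"{n} is prime, not composite")

f = factor(91)
assert 91 % f == 0 and 1 < f < 91  # postcondition: a nontrivial factor
```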
What about really large systems? Here, the prevailing approach is quite different: existing clean-room technologies are amazingly effective for creating robust applications, including ones that work at massive scale. But they do not use formally checked proofs.
- Such methods have taken specification of requirements to a “high art” (more below). This makes possible chalkboard proofs for many aspects of correct behavior.
- Translated into a methodology, one arrives at a process that can work well even at large scale, even if distinct organizations build components that must later be integrated.
- But such methods encounter limitations tied to a mixture of social considerations with other aspects of the software life-cycle. The core issue here is that code is created by individuals and by companies, but individuals change jobs, corporate visions evolve (and companies may be acquired, or fail), etc. When such things happen, stumbles occur.
Setting the limitations of the two approaches aside, there is also a clear need to improve our options for combining them. Today’s large safety-critical systems often can only be approached by fragmenting them into (1) non-critical portions that can be designed not to pose risks from a safety perspective, and (2) a safety kernel that guarantees that the critical functionalities of the system will be correct. In effect we have one large system, but think of it as a core that provides safety, on which a set of non-safety-critical components run. Complex applications are then layered over the resulting trusted elements in ways that preserve the core safety properties of the “trust kernel” while bringing in desired higher-level functionality, without breaking its guarantees.
Finally, we need to find ways to overcome the limitations cited earlier. These are technical limitations, and they can render some designs infeasible. The trick is to be clever about the design so that such problems are avoided where possible, and then to cleverly develop work-arounds where we can’t avoid encountering a limitation.
Add all of this up, and there does seem to be a path enabling the creation of large safety-critical systems. But it isn’t one that can be followed casually. Not every design will turn out to be feasible, and our completed systems won’t always be perfect: they may embody tradeoffs, and they may include engineering tricks that circumvent what would otherwise be serious limitations. When we have a choice, we would prefer not to use such methods, but when there is no choice…
The safety process and its limitations
Let’s revisit the same points, but more deliberately, and shifting emphasis from system components like a Paxos protocol used for data replication to larger-scale applications that might be created by combining precreated components with additional application logic. Two broad families of approaches stand out:
- For small things (like code to control the wing flaps on an airplane, or the brakes on a train), the best practice starts with special operating systems and compilers that have been proved correct. These operating systems tend to be absolutely minimal, and the compilers generally focus on languages with very small feature sets and carefully specified semantics, so don’t think of this as a methodology aimed at C++17 programming on Microsoft Azure. More typical would be a C subset running on a microkernel operating system such as QNX or seL4. Given such a trusted runtime environment, the developer creates an application, and then uses the semantically strong properties of the target setting as a tool when proving the control code correct as well. Indeed, it is increasingly common not to limit attention to the operating system and the compiler: the hardware itself is often a partner in the solution. Thus, we might build a safe braking system for trains that involves application logic for combining a set of inputs concerning track state and locomotive status, and then might run it in a TMR (triple modular redundancy) configuration, with three replicas. A hardware voting module could then take the outputs and combine them: if 2 out of 3 replicas agree that the train can advance another mile, the brakes remain disengaged. This kind of thinking can be carried quite deeply into the hardware. But the resulting methodology (obviously) can’t scale up. For example, we don’t build hospital ICU monitoring systems this way, or air traffic control systems: they are too large and have too many moving parts.
- Large systems are better approached using a clean-room methodology. When we use big existing technology components as black boxes within an even larger application (like a stock exchange trading floor solution, or an air traffic control platform, or the infrastructure for a smart power grid), prebuilt technology components will be deeply integrated into the system architecture. The application is more like an orchestra playing a concert: you get to write the music, but the musicians will play it on your behalf, and using instruments that you didn’t build. Here, the best practice centers on a very rigid style of technology procurement, integration and testing; I’ll say much more about it momentarily.
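The TMR configuration described in the first bullet boils down, in software terms, to a majority vote over replica outputs. A minimal sketch, with invented signal names; a real system would do this voting in hardware:

```python
from collections import Counter

def vote(outputs):
    """Majority vote over replica outputs (2-of-3 for TMR).

    Returns the majority value; if no value reaches a majority, the
    only safe answer is to fail safe (here: engage the brakes).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    return "ENGAGE_BRAKES"  # fail-safe default on total disagreement

# One faulty replica is outvoted: the train may advance another mile.
assert vote(["advance", "advance", "halt"]) == "advance"
# No majority at all: fail safe.
assert vote(["advance", "halt", "reverse"]) == "ENGAGE_BRAKES"
```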
Both styles of critical system development can be quite expensive, in many senses. At a NASA workshop on critical systems validation, one speaker remarked that the safety-critical systems community used to think about “pounds of proof” per line of code, although today we rarely would print such a proof on paper. Then there is a human cost (generating those proofs takes a huge effort) and a time cost too. Clean-room development is expensive too, and for similar reasons. Moreover, precisely because the artifact is the outcome of a fairly complex process, the validated system will often be rather inflexible: any change, for example to port it to a new setting or add a feature, could require that the whole proof be redone. This introduces a different kind of cost: a high cost of ownership, and a high cost of life-cycle events.
Whereas the first style of system would be common in avionics systems or train braking systems, the second style is more aligned with today’s high-productivity IDEs, and would be more typical of an air traffic control system, or a large-scale train routing system. And this is intriguing, because it highlights two sides of one question: we want safety for these transport systems, and for other similar critical computing applications, and yet the prevailing development technologies don’t really integrate with one another. Could we do better?
Some initial observations:
- It isn’t possible to create all-encompassing formal models or correctness proofs at genuinely large scale, but we certainly can create models covering important aspects of key components. Such a model will tell us how some component works from the perspective of our application, or perhaps even will limit itself to some aspects of behavior. By design, these focused models won’t include every detail of the API, the system configuration, the implementation, etc.
- If a model abstracts a large component such as a database, or a publish-subscribe message bus, it will probably be functionality-oriented. For example, a publish-subscribe message bus model might specify that if the bus is running under normal conditions (which would need to be specified: such-and-such a network, load levels of no more than some limit, no recent crashes or new ones underway, etc) and a bounded size message is sent by the application, it will be delivered reliably and in-order with an end-to-end latency of between 5 and 25ms.
- The application may end up with many dependencies upon key assumptions. For example, taking our message-bus scenario a bit further, if we deliver a message bus with a specification of this kind, a great many elements of the application may be incorrect unless the message-bus product in fact matches this focused specification. There are implications for procurement. One obvious impact is that when selecting a message bus, the team is likely to consider only widely used mainstream products whose capabilities far exceed the given requirements. We may even try to require that the vendor sign an agreement specifying that such-and-such a quality control standard is in use, and that these requirements are tested as part of the process: there are ISO and IEEE specifications for development and testing (of course, verifying that the standard is genuinely applied is quite a different matter). Thus we lack formally machine-checked proofs, but use social engineering to compensate. A vendor unable to comply would not be selected as a supplier.
- Our safety-critical setting will use the message bus in specific ways, hence we’ll want to be sure that the technology is explicitly tested for a workload matching our intended use, by having the vendor run an agreed-upon list of functionality and correctness tests prior to delivery. This won’t be a one-time event: products evolve over time, hence testing becomes a life-long commitment to retest each patch and each major new version in the same way.
- As the saying goes, “trust but verify”. We’ll need to verify that the delivered technology works as promised. In practice, this entails running additional acceptance tests.
- Once the product passes vendor and acceptance tests, the next step will be to integrate the received component into our system, and carry out additional system integration tests. Hopefully those will work flawlessly, but just in case, there needs to be a way to report defects or new requirements back up the supply chain to the organization that can best respond.
- Eventually, the overall system emerges: it will be the result of successful integration of a whole pipeline of components, each individually specified, tested, retested, integrated into the system that will use it, and then again tested, etc.
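A focused specification like the message-bus model above lends itself to automated acceptance testing. A hypothetical sketch: given a trace of send/delivery records, check the in-order clause and the 5–25 ms latency clause. The trace format here is invented for illustration, not from any real product:

```python
def check_bus_spec(events, min_ms=5.0, max_ms=25.0):
    """Check a trace of (seq, send_ms, deliver_ms) records against the
    focused spec from the text: in-order delivery, and end-to-end
    latency within [min_ms, max_ms]. Returns a list of violations.
    """
    violations = []
    last_seq = -1
    for seq, sent, delivered in sorted(events, key=lambda e: e[2]):
        if seq != last_seq + 1:
            violations.append(f"out of order: got seq {seq} after {last_seq}")
        last_seq = seq
        latency = delivered - sent
        if not (min_ms <= latency <= max_ms):
            violations.append(f"seq {seq}: latency {latency:.1f} ms out of bounds")
    return violations

trace = [(0, 0.0, 7.0), (1, 1.0, 9.5), (2, 2.0, 40.0)]
assert check_bus_spec(trace) == ["seq 2: latency 38.0 ms out of bounds"]
```

In a real acceptance process, the same checks would run against instrumented traces captured under the workload agreed upon with the vendor.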
But how well do such techniques really work?
To answer such a question, we should perhaps remind ourselves of an old adage, which I first heard from IBM’s database guru, Bruce Nelson. Bruce was on a panel concerned with reliability, and had just presented some data gathered at IBM about bug reports over the lifetime of a major OLTP product. He pointed out that even though the bug rate dropped for a while, it later had surges of higher rates, often triggered by patches that were intended to fix problems: the essence of his talk was to ask why fixing bugs seemed to trigger new issues.
Bruce started with a premise, namely that all complex software embodies a mix of Bohrbugs and Heisenbugs that escaped notice during development.
As Bruce saw it, freshly released software will have the worst rates of both kinds of bugs. But the ones he called Bohrbugs (boring… but also solid and predictable, like Bohr’s model of the atom) are dull: reproducible, easy to pin down and kill. Precisely because of this, Bruce used to say that in a well managed product group, you can anticipate that the rate of Bohrbugs will drop over time.
The other category includes the Heisenbugs: they embody a form of inherent uncertainty, like Heisenberg’s uncertainty principle. These are tricky bugs that may be very tough to fix. Bruce was convinced that software tends to be full of them, concealed behind bandaids. Thus the working system might embody flaws that were hidden by developers (who may have thought they were fixing the issue; some developers probably deliberately conceal bugs they can’t fix, but others may genuinely misunderstand the bug, attribute it to some completely unrelated part of the system, and then modify that part in such a way that the bug is no longer obvious).
What causes a Heisenbug? Bruce’s study suggested that these were most often the end-result of some form of concurrency problem (hence hard to reproduce). For example, in complex concurrent code, it isn’t uncommon that some minor error might damage a data structure. Suppose, however, that the damage isn’t instantly noticed. For example, a B-tree could be left in an inconsistent state by thread A running at time T, but A itself might not touch the bad links and hence would be unaware that it left a pointer that points to freed memory. The crash won’t occur until some other thread B touches the structure at time T’, much later in the execution. Worse, suppose that B doesn’t even crash, itself, but simply updates something using this bad pointer. In effect, a mistake in A causes thread B to damage something that thread C is dependent upon. And C is the visible “culprit” when the crash finally occurs.
Thus with a Heisenbug, one issue is that the consequences of the mistake might be noticed in a thread totally different from the one that caused the problem. Further, anything that might change the execution could impact the thread that will “notice” the problem: in one run C is first to hit the issue, in another, the whole problem doesn’t occur because the data structure gets garbage collected before anyone notices the problem.
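The B-tree scenario can be illustrated with a deterministic miniature. This is sequential Python rather than real concurrent code, so the failure is reproducible here even though a true Heisenbug would not be; the point is only to show the damage surfacing far from its cause:

```python
# Thread A corrupts a linked structure but never touches the bad link;
# the failure surfaces much later, in code that had nothing to do with
# the original mistake.

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = Node(1, Node(2, Node(3)))

def thread_a_unlink_second():
    # Bug: A means to unlink the second node but forgets to splice,
    # leaving head.next pointing at a node A considers "freed".
    victim = head.next
    victim.value = None          # simulate freed memory
    # missing: head.next = victim.next

def thread_c_sum():
    # C just walks the list; it is the visible "culprit" when the
    # damaged node is finally touched.
    total, n = 0, head
    while n is not None:
        total += n.value         # raises TypeError at the damaged node
        n = n.next
    return total

thread_a_unlink_second()
try:
    thread_c_sum()
except TypeError:
    print("crash observed in thread C, caused by thread A")
```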
Bruce explained that this is why he had become skeptical that we can ever really fix such issues; more often, he said, someone finds an elaborate workaround so that the bug can still occur without the harm being quite so noticeable. Further, this, he felt, was why life-cycle patches and functionality upgrades demand special attention: in his view, it is in the nature of software that each such event will trigger a wave of new problems.
The life and times of a critical system
Why is life-cycle a special issue when building critical systems? A core issue is that non-critical products evolve through a series of releases, which can mean that (1) the race to introduce features is more dominant than the push for ultra-high reliability, and (2) the upgrade to a completely new version may even happen before the prior version is fully mature. Vendors will issue patches to correct major bugs, but minor issues are often folded into the next version, so that rather than ever settling into a stable state with no known bugs, there is a continuous progression of major versions, each fixing a set of issues with the prior version but introducing new features and functionality that often bring issues of their own. Even so, there is a presumption that users will promptly upgrade each time a new version is released.
Safety critical systems are different: switching a working mission-critical computing system like an air traffic control application to use a new version of one of its component technologies can be very costly, because such a change would often force a complete revalidation. This creates a dilemma, because as time passes, it gets harder and harder to find the resources to revalidate. We end up with our critical computing system forced to run using components with known bugs that can’t be fixed.
The problem isn’t a new one. NASA, whose aeronautics mission includes research on flight safety within the United States (space exploration isn’t their only assigned job), long ago began to talk about integrated modular avionics, meaning a methodology for building modular critical systems (for avionic systems, in their case) whose components can be removed and replaced, plug-and-play, with a validation methodology for just the upgraded element. But so far as I know, they never felt that the problem was solved, and continue to face the same upgrade issues we’ve touched upon above.
The core problem is that even if a new module passes some sort of compliance test that sufficed for the old module, the technologies used inside it may have changed in a dramatic way, and perhaps those changes create new safety issues. For example, there were a series of aircraft fires associated with lithium-ion batteries in 2015: it turned out that if those were dropped during installation, they could crack in ways that permitted flammable gases to build up. But those batteries were themselves upgrades of a prior battery technology that wasn’t at risk of this kind of issue. Thus the tests to replace the old battery technology with the new one wouldn’t necessarily have included drop-resistance and fire-resistance scenarios: the new technology improved on the old in some ways, but created additional questions not seen with the prior technology.
Here's another scenario that illustrates a different aspect of concern. Suppose that a smart power grid solution depends on a message bus from a vendor, call them fly-by-wire pubsub, or FBW.com. Further, let’s assume that FBW.com had the very best technology for publish-subscribe in 2010 when our competition for a vendor began.
Well, competitions take time: by the time FBW.com is under contract it could easily be 2012, and they probably didn’t deliver a hardened and heavily tested version of their product until 2015.
In the life-cycle story one might expect from the above, years of happy use ensue, with occasional patches and improvements. But it is just as likely that one day, FBW ceases operations. Perhaps they are acquired, or go bankrupt, or the company strategy evolves. There are dozens of reasons that FBW could change in ways that would make them an unacceptable technology source for our project.
So now the power grid is faced with a serious problem: the last delivered version was fine, but what will I do if new features are needed, or if the system needs to be moved to new hardware, or if a problem is noticed as the load on the smart grid platform grows over time? Quite possibly, the 2012 contract with FBW is useless: the original entity might not exist anymore! In effect, by selecting FBW.com as a provider we opted for the very best as of 2010, but that isn’t much help today.
Even if FBW is a healthy company ten years later, their market may have evolved in ways that could force them to evolve. We noted that the pressure to roll out new features becomes a continuous cycle of evolution without long periods of stability. Thus FBW might have shipped us version 1.0 for the original smart grid application, but could by now be up to version 4.5 and might be less and less capable of supporting 1.0, no matter what the 2012 contract seemed to require.
What are my options? One would be to renegotiate the contract, paying FBW extra money to test the new code to the same quality level as the old code. However, this would be quite expensive and many government agencies are loath to send extra money to a company that is unable to deliver on prior commitments. In fact FBW itself might refuse such a contract, since its business model has changed. Indeed, it may be obvious to them and to me that the new implementation has ceased to be suitable for safety-critical uses: the evolution of products generally responds to mass-market requirements, not to special needs of niche users. Guarantees one could count on in 2010 might be abandoned by 2016!
Is there a magic remedy to the FBW conundrum? One answer is to limit procurement to large companies, but no company is immune to stories such as this. A second is to internalize the issue: the FBW solution in 2015 could have been open source, so that the power grid project would be in a position to do its own in-house maintenance. But this is problematic because many safety-critical infrastructure operators aren’t technology houses: they run smart grids or smart highways or do air traffic control, so expecting them to also maintain an ongoing software team isn’t necessarily realistic. More common would be to outsource the whole issue to an integration vendor, picking a company with a long history of such roles, but this might be less than fully satisfactory depending on the state of the FBW component at the moment FBW walks away from the product.
Best practice in 2017 and beyond
No technology is completely free of issues, so I hope that you won’t see the remarks above as some form of bleakly negative condemnation. In fact, they add up to two reasonably good stories: one that works best at small scale, and one that works fairly well at large scale. It shouldn’t surprise anyone that both methods also have limitations.
The key, it seems to me, is that as we move forward with a new generation of mission-critical systems that use complex technologies in sensitive settings, we do so with a deft hand and an appreciation of these limitations.
Consider self-driving cars. Few applications could be more risky: not only are the passengers in the car at risk, but so are people in other cars, and even bystanders on the street. Further, the software used in different cars will in general not be created by any single vendor: there are already at least a half dozen major projects underway. Thus as a car proceeds down the highway, often at very high speeds and with very small spacing relative to other cars, we have a very subtle distributed consensus problem in which the participants might not be running the identical protocol, or any protocol at all.
To me this problem statement is manifestly unsafe: if we build self-driving cars to operate fully autonomously, we will never manage to create a safe system. We could end up with a solution that has better average driving safety statistics than human drivers can achieve, but there will be situations under which the ensemble will fail catastrophically.
But the situation would change if we introduced smart highways: a form of air traffic control system that could take charge of the roadway. In a situation where no car drives itself without explicit permission from the smart highway controller, many of the most extreme worries that arise with fully autonomous systems vanish. We can then take a next step and require fail-safe behavior: that in all cases, for any k component failures, the system can move itself into a safe mode. For values of k like 1 or 2, this should be feasible. Safe mode could be a complete shutdown, or a situation in which a car pulls to the side and requests service.
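One way to read the fail-safe requirement as code: the controller tracks failed components, and as soon as anything fails, elects a safe mode while the design tolerance still guarantees it can. A hypothetical sketch; the mode and component names are invented for illustration:

```python
SAFE_MODES = {"PULL_OVER_AND_REQUEST_SERVICE", "EMERGENCY_SHUTDOWN"}

def next_mode(failed_components, k=2):
    """Decide the system mode given the set of failed components.

    The design guarantee from the text: for up to k concurrent
    failures, the system can still move itself into a safe mode. So
    once anything fails, we enter a safe mode while that is assured.
    """
    if not failed_components:
        return "NORMAL_OPERATION"
    if len(failed_components) <= k:
        return "PULL_OVER_AND_REQUEST_SERVICE"  # within design tolerance
    return "EMERGENCY_SHUTDOWN"  # beyond tolerance: complete shutdown

assert next_mode(set()) == "NORMAL_OPERATION"
assert next_mode({"lidar"}) in SAFE_MODES
```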
Building the smart highway controller itself would then become a problem similar to that of building any other safety-critical solution. My instinct would be to take our two best-of-breed options, but then try to go even further by using the small elements within the larger systems in a very careful, “enlightened” manner – so this last step is the opportunity for innovation, because it is a step we don’t often look at closely even when using clean-room methods at large-scale. It won’t give us a universal answer covering all possible applications, and won’t address the life-cycle issues of clean-room technology development, but there is at least an opportunity for meaningful progress.
What would it mean to build a large system that gains safety from properties of a small system embodied within it? The answer would need to run something like this:
- We factor out the safety critical aspects of a large system, so that the technology required for safe operation is much smaller than the full footprint of the large system as a whole.
- We use strong methods (formal machine-checked proofs) for the technologies providing core safety. Techniques like synthetic diversity and replication are often a way to overcome lingering distrust in the hardware, compiler, O/S and other elements: if we replicate critical components and one instance suffers a mishap, the system can outvote it and move on.
- The remainder of the system is created using clean-room technologies. Indeed, it might fail in some unexpected ways, perhaps due to latent Heisenbugs, or perhaps due to life-cycle upgrades that introduce new issues. But because the larger system isn’t safety critical, the safety of the end users isn’t put at risk by such events.
Mapping this back to a smart highway, the safety critical aspect might be a subsystem managing automobile driving plans, a bit like flight plans. We could adopt the view that these advance through a series of versions, which are themselves read-only (so to modify version 7, the system creates version 8, etc). Questions arise of how to persist these versions, how to share them, and so forth. But as we progress down such a design path, we end up with smaller and smaller functional elements that are more and more amenable to formal correctness methods. Then we can layer in fancier functionality, such as software to recommend hotels and restaurants and make reservations at a higher level, in ways that do not put the core safety of the solution at risk.
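The read-only-versions idea is straightforward to make concrete: a driving plan is an immutable value, and "modifying" version 7 means constructing version 8. A minimal sketch with invented fields, not a proposal for a real plan format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)          # frozen: each version is read-only
class DrivingPlan:
    version: int
    waypoints: Tuple[str, ...]

def revise(plan: DrivingPlan, new_waypoints) -> DrivingPlan:
    """Modification never mutates: it constructs the next version."""
    return DrivingPlan(plan.version + 1, tuple(new_waypoints))

v7 = DrivingPlan(7, ("exit 12", "route 79 north"))
v8 = revise(v7, ("exit 12", "route 13 east"))
assert v7.waypoints == ("exit 12", "route 79 north")  # v7 unchanged
assert v8.version == 8
```

Because versions are never modified in place, persistence and sharing reduce to replicating immutable records, which is exactly the kind of small functional element that formal correctness methods handle well.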