Plot Spoilers
In a hurry?
Here’s everything you need to know – you can skip the body of the
discussion.
Vendors are starting to create safety critical computing systems that have large code bases, such as control systems for self-driving cars. The important word is “large”, and we’ll see what I mean by that shortly.
The most powerful proof methodologies are extremely effective on small components, but even so, they are unable to deal with large, complex applications.
- There are highly effective ways to prove correctness for small components, such as the implementation of a data structure, or even an atomic multicast or Paxos protocol. But they simply don’t work for systems with black box dependencies on large, complex components like commercial database or message bus products.
- “Small” typically means hundreds of lines of code, although some proved-correct compilers exist (for subsets of C or similar languages). There is a microkernel operating system called seL4 that has been fully proved: it has thousands of lines of code. This isn’t all that small, but there is a slight qualification: the seL4 system was “codesigned” with its proof: both were developed at the same time, and the code was often written in ways that lend themselves to being formalized and proved, which might not be the ways the developer would have implemented those methods had the main goal been speed or OS extensibility.
- Codesign is a common need: if you try to write the code first, then prove it, your task will be much more difficult. So we see codesign almost everywhere today. In my area, distributed computing, this style of work is seen in Microsoft’s IronFleet Paxos implementation: the team spent months of work developing the code and proof together, creating a single Dafny program (Dafny is a verification-oriented language from Microsoft Research) in which the Paxos code is marked up with various assumptions (preconditions), derived inferences (postconditions), invariants, etc. (I’m singling out the Dafny-based work because I like the Dafny markup approach, but this is one of a few notable recent projects that use formal methods.)
- Weird but true: even after proving IronFleet correct, the authors had no idea if it would work! The Paxos specification omits all sorts of practical aspects that an implementation needs to deal with, like opening files, so those aspects had never been proved correct. It worked on the first try, though presumably the surrounding framework on which it ran had already been debugged in some other context.
- It turns out that there are many ways to prove things correct. The biggest successes have used model-checking (a form of automated proof that centers on checking that reachable states satisfy a set of invariants). Basically, the developer specifies a set of logic tests that define safety, and the checker simulates execution of the code to see whether it can find code paths that lead to violations of the invariants. To avoid an unbounded search space, help from the developer is needed at various steps. One common task is to introduce intermediary properties: for example, a developer might show that if such-and-such a series of cases are checked successfully, they actually cover all reachable states. (A minimal sketch of this style of checking appears just after this list.)
- At Cornell, my colleagues have pioneered a second class of provably correct protocols using constructive logics (also called proof-refinement logics), employing a novel kind of logic-driven interactive development tool (an IDE). The exciting aspect of this work is that at each stage, the proof has a corresponding executable program. For example, were you to use one of these tools to prove that a composite integer N can be factored by testing candidate divisors from 2 up to √N, an artifact of the proof would be a program that actually computes factorizations of a given input. There have also been proofs of Paxos that use this approach.
- So… we have at least two ways to produce a provably correct implementation of a protocol such as Paxos. The puzzle is that given this component, it is not clear what to do with it: is a proved component somehow different from one that was built in the usual way? For example, one could imagine a Paxos replicated “type” that the user would see as an extensible programming language type trait, and could then use to leverage the underlying guarantees. But there has been little attention to such issues, hence the application designer working on a safety-critical system has little help, even when using a proved-correct component such as Paxos.
- Sometimes a protocol can break even when used in a seemingly acceptable way. Protocols like Paxos often make assumptions about how they will be used, and about the environment in which they will run. These assumptions rarely enter into the proof, hence are outside the scope of the formal work. Yet they matter for correctness!
- Examples include compiler optimizations (normally disabled in a proof), flow-control, message fragmentation (for networks that require that big objects be sent as a sequence of small ones), the bootstrapping sequence (for startup), mechanisms used for garbage collection of the log, the protocol used by a stateful application to resynchronize its state with the protocol state after a restart, correct interaction with the network hardware, correct “management” of the logged data, mechanisms for dealing with unanticipated faults, and so forth.
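To make the model-checking bullet above a little more concrete, here is a minimal sketch in Python. It is purely illustrative (a toy, not IronFleet or any real tool): it enumerates the reachable states of a two-process mutual-exclusion protocol by breadth-first search and checks a safety invariant in each state, which is the basic shape of the task a model checker automates.

```python
from collections import deque

# Toy protocol: two processes, each cycling idle -> trying -> critical -> idle.
# A process may enter "critical" only if the other process is not already there.
# Safety invariant: the two processes are never in "critical" at the same time.

def successors(state):
    """Enumerate every state reachable in one step from `state`."""
    for i in (0, 1):
        me, other = state[i], state[1 - i]
        if me == "idle":
            nxt = "trying"
        elif me == "trying" and other != "critical":
            nxt = "critical"
        elif me == "critical":
            nxt = "idle"
        else:
            continue                      # blocked: wait for the other process
        new = list(state)
        new[i] = nxt
        yield tuple(new)

def invariant(state):
    return state != ("critical", "critical")

def check(initial):
    """Breadth-first search over the reachable state space, checking the invariant."""
    seen = {initial}
    frontier = deque([(initial, [initial])])
    while frontier:
        state, path = frontier.popleft()
        if not invariant(state):
            return False, path            # counterexample: a path to a bad state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return True, len(seen)

ok, info = check(("idle", "idle"))
if ok:
    print("invariant holds over", info, "reachable states")
else:
    print("counterexample path:", info)
```

Real tools work over far richer logics and rely on developer-supplied intermediary properties to keep the search bounded, but the underlying question (can any reachable state violate an invariant?) is the same.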
What about really large systems? Here, the prevailing approach is quite
different: existing clean-room
technologies are amazingly effective for creating robust applications,
including ones that work at massive scale.
But they do not use formally checked proofs.
- Such methods have taken specification of requirements to a “high art” (more below). This makes possible chalkboard proofs for many aspects of correct behavior.
- Translated into a methodology, one arrives at a process that can work well even at large scale, even if distinct organizations build components that must later be integrated.
- But such methods encounter limitations tied to a mixture of social considerations with other aspects of the software life-cycle. The core issue here is that code is created by individuals and by companies, but individuals change jobs, corporate visions evolve (and companies may be acquired, or fail), etc. When such things happen, stumbles occur.
Setting the limitations of the two approaches to the side, there is also a clear need to improve our options for combining them. Today’s large safety-critical systems often can only be approached by fragmenting them into (1) non-critical portions that can be designed so that they pose no risks from a safety perspective, and (2) a safety kernel that guarantees that the critical functionality of the system will be correct. In effect we still have one large system, but we think of it as a core that provides safety, on which a set of non-safety-critical components run. Complex applications are then layered over this trusted core, or “trust kernel,” in ways that preserve its safety properties while bringing in the desired higher-level functionality, without breaking its guarantees.
Finally, we need to find ways to overcome the limitations cited earlier. These are technical limitations, and they can render some designs infeasible. The trick is to be clever about the design so that such problems are avoided where possible, and to develop work-arounds where a limitation can’t be avoided.
Add all of this up, and there does seem to
be a path enabling the creation of large safety-critical systems. But it isn’t one that can be followed
casually. Not every design will turn out
to be feasible, and our completed systems won’t always be perfect: they may
embody tradeoffs, and they may include engineering tricks that circumvent what
would otherwise be serious limitations.
When we have a choice, we would prefer not to use such methods, but when
there is no choice…
The safety process and its limitations
Let’s revisit the same points, but more
deliberately, and shifting emphasis from system components like a Paxos
protocol used for data replication to larger-scale applications that might be
created by combining precreated components with additional application
logic. Two broad families of approaches
stand out:
- For small things (like code to control the wing flaps on an airplane, or the brakes on a train), the best practice starts with special operating systems and compilers that have been proved correct. These operating systems tend to be absolutely minimal, and the compilers generally focus on languages with very small feature sets and carefully specified semantics, so don’t think of this as a methodology aimed at C++17 programming on Microsoft Azure. More typical would be a C subset running on a microkernel operating system such as QNX or seL4. Given such a trusted runtime environment, the developer creates an application, and then uses the semantically strong properties of the target setting as a tool when proving the control code correct as well. Indeed, it is increasingly common not to limit attention to the operating system and the compiler: the hardware itself is often a partner in the solution. Thus, we might build a safe braking system for trains that involves application logic for combining a set of inputs concerning track state and locomotive status, and then might run it in a TMR configuration, with three replicas. A hardware voting module could then take the outputs and combine them: if 2 out of 3 replicas agree that the train can advance another mile, the brakes remain disengaged (a small voting sketch appears just after this list). This kind of thinking can be carried quite deeply into the hardware. But the resulting methodology (obviously) can’t scale up. For example, we don’t build hospital ICU monitoring systems this way, or air traffic control systems: they are too large and have too many moving parts.
- Large systems are better approached using a clean-room methodology. When we use big existing technology components as black boxes within an even larger application (like a stock exchange trading floor solution, or an air traffic control platform, or the infrastructure for a smart power grid), prebuilt technology components will be deeply integrated into the system architecture. The application is more like an orchestra playing a concert: you get to write the music, but the musicians will play it on your behalf, and using instruments that you didn’t build. Here, the best practice centers on a very rigid style of technology procurement, integration and testing; I’ll say much more about it momentarily.
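To illustrate the TMR voting step mentioned in the first bullet, here is a minimal Python sketch. It is purely illustrative, with hypothetical names and thresholds: in a real train or avionics system the replicas would run on physically separate processors and the vote would be taken in hardware.

```python
from collections import Counter

def vote(outputs):
    """2-of-3 majority vote over replica outputs; with no majority, fail to the safe state."""
    winner, count = Counter(outputs).most_common(1)[0]
    return winner if count >= 2 else "ENGAGE_BRAKES"

def braking_decision(track_clear_miles, speed_mph):
    """Application logic, run identically on each replica (a hypothetical rule)."""
    return "RELEASE_BRAKES" if track_clear_miles >= 1.0 and speed_mph <= 80 else "ENGAGE_BRAKES"

# Each replica reads its own sensors; here the third replica's sensor has glitched.
replica_inputs = [(1.4, 62.0), (1.4, 62.1), (0.0, 62.0)]
outputs = [braking_decision(miles, speed) for miles, speed in replica_inputs]
print(outputs)        # ['RELEASE_BRAKES', 'RELEASE_BRAKES', 'ENGAGE_BRAKES']
print(vote(outputs))  # RELEASE_BRAKES: the faulty replica is outvoted
```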
Both styles of critical system development
can be quite expensive, in many senses.
At a NASA workshop on critical systems validation one speaker remarked
that the safety critical systems community used to think about “pounds of
proof” per line of code, although today we rarely would print such a proof on
paper. Then there is a human cost (generating
those proofs takes a huge effort) and a time cost. Cleanroom development is expensive too, and for similar reasons. Moreover, precisely because the artifact is the outcome of a fairly complex process, the validated system will often be rather inflexible: any change, for example to port it to a new setting or add a feature, could require that the whole proof be
redone. This introduces a different kind
of cost: a high cost of ownership, and a high cost of lifecycle events.
Whereas the first style of system would be
common in avionics systems or train braking systems, the second style of system
is more aligned with today’s high-productivity IDEs, and would be more typical
of an air traffic control system, or a large-scale train routing system. And this is intriguing, because it highlights
two sides of one question: we want safety for these transport systems, or for
other similar critical computing applications, and yet the prevailing
development technologies don’t really integrate with one another. Could we do better?
Some initial observations:
- It isn’t possible to create all-encompassing formal models or correctness proofs at genuinely large scale, but we certainly can create models covering important aspects of key components. Such a model will tell us how some component works from the perspective of our application, or perhaps even will limit itself to some aspects of behavior. By design, these focused models won’t include every detail of the API, the system configuration, the implementation, etc.
- If a model abstracts a large component such as a database, or a publish-subscribe message bus, it will probably be functionality-oriented. For example, a publish-subscribe message bus model might specify that if the bus is running under normal conditions (which would need to be specified: such-and-such a network, load levels of no more than some limit, no recent crashes or new ones underway, etc.) and a bounded-size message is sent by the application, it will be delivered reliably and in order with an end-to-end latency of between 5 and 25ms. (A sketch of how such a focused specification might be turned into an executable acceptance test appears just after this list.)
- The application may end up with many dependencies upon key assumptions. For example, taking our message bus scenario a bit further, if we deliver a message bus with a specification of this kind, a great many elements of the application may be incorrect unless the message-bus product in fact matches this focused specification. There are implications for procurement. One obvious impact is that when selecting a message bus, the team is likely to consider only widely used mainstream products that far exceed the given requirements. We may even try to require that the vendor sign an agreement specifying that such-and-such a quality control standard is in use, and that these requirements are tested as part of the process: there are ISO and IEEE specifications for development and testing (of course, verifying that the standard is genuinely applied is quite a different matter). Thus we lack formally machine-checked proofs, but use social engineering to compensate. A vendor unable to comply would not be selected as a supplier.
- Our safety-critical setting will use the message bus in specific ways, hence we’ll want to be sure that the technology is explicitly tested for a workload matching our intended use, by having the vendor run an agreed-upon list of functionality and correctness tests prior to delivery. This won’t be a one-time event: products evolve over time, hence testing becomes a life-long commitment to retest each patch and each major new version in the same way.
- As the saying goes, “trust but verify”. We’ll need to verify that the delivered technology works as promised. In practice, this entails running additional acceptance tests.
- Once the product passes vendor and acceptance tests, the next step will be to integrate the received component into our system, and carry out additional system integration tests. Hopefully those will work flawlessly, but just in case, there needs to be a way to report defects or new requirements back down the supply chain to the organization that can best respond.
- Eventually, the overall system emerges: it will be the result of successful integration of a whole pipeline of components, each individually specified, tested, retested, integrated into the system that will use it, and then again tested, etc.
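As an illustration of how a focused specification like the message-bus example above could be turned into an executable acceptance test, here is a hedged Python sketch. The `bus` interface, topic name, and numeric limits are assumptions standing in for whatever the procurement contract actually pins down, and a real test campaign would also cover load levels, failure injection, and many repeated runs.

```python
import time

# Focused specification under test (numbers taken from the example above):
# under normal conditions, a bounded-size message is delivered reliably,
# in order, with an end-to-end latency of at most 25 ms.
MAX_MSG_BYTES = 64 * 1024        # assumed bound; the real contract would pin this down
MAX_LATENCY_MS = 25.0

def acceptance_test(bus, n_messages=1000):
    """Publish sequenced messages and check delivery, ordering, and latency.

    `bus` is assumed to expose publish(topic, payload) and subscribe(topic),
    the latter returning a blocking iterator of received payloads; this
    interface is hypothetical, standing in for the vendor's real API.
    """
    sub = bus.subscribe("grid/telemetry")
    received = []
    for seq in range(n_messages):
        payload = str(seq).encode().ljust(1024, b".")   # well under MAX_MSG_BYTES
        assert len(payload) <= MAX_MSG_BYTES
        sent_at = time.monotonic()
        bus.publish("grid/telemetry", payload)
        echoed = next(sub)                              # blocks until the bus delivers
        latency_ms = (time.monotonic() - sent_at) * 1000.0
        assert latency_ms <= MAX_LATENCY_MS, f"message {seq}: {latency_ms:.1f} ms exceeds contract"
        received.append(int(echoed.split(b".")[0]))

    assert received == list(range(n_messages)), "messages lost, duplicated, or out of order"
    print(f"{n_messages} messages delivered in order, each within {MAX_LATENCY_MS} ms")
```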
But how well do such techniques really
work?
To answer such a question, we should
perhaps remind ourselves of an old adage, which I first heard from IBM’s
database guru, Bruce Nelson. Bruce was
on a panel concerned with reliability, and had just presented some data
gathered at IBM about bug reports over the lifetime of a major OLTP
product. He pointed out that even though
the bug rate dropped for a while, it later had surges of higher rates, often
triggered by patches that were intended to fix problems: the essence of his
talk was to ask why fixing bugs seemed to trigger new issues.
Bruce started with a premise, namely that
all complex software embodies a mix of Bohrbugs and Heisenbugs that escaped
notice during development.
As Bruce saw it, freshly released software
will have the worst rates of both kinds of bugs. But the ones he called Bohrbugs (boring, but also solid and deterministic, like the electron orbits in Bohr’s model of the atom) are dull: reproducible, easy to pin down and kill. Precisely because
of this, Bruce used to say that in a well managed product group, you can
anticipate that the rate of Bohrbugs will drop over time.
The other category includes the Heisenbugs:
they embody a form of inherent uncertainty, like Heisenberg’s uncertainty principle for elementary particles. These are tricky
bugs that may be very tough to fix.
Bruce was convinced that software tends to be full of them, but with
bandaids that conceal them. Thus the
working system might embody flaws that were hidden by developers (who may have
thought they were fixing the issue; some developers probably deliberately
conceal bugs they can’t fix, but others may genuinely misunderstand the bug,
attribute it to some completely unrelated part of the system, but then modify
that part of the system in such a way that the bug no longer is obvious).
What causes a Heisenbug? Bruce’s study suggested that these were most
often the end-result of some form of concurrency problem (hence hard to
reproduce). For example, in complex
concurrent code, it isn’t uncommon that some minor error might damage a data
structure. Suppose, however, that the
damage isn’t instantly noticed. For
example, a B-tree could be left in an inconsistent state by thread A running at
time T, but A itself might not touch the bad links and hence would be unaware
that it left a pointer that points to freed memory. The crash won’t occur until some other thread
B touches the structure at time T’, much later in the execution. Worse, suppose that B doesn’t even crash,
itself, but simply updates something using this bad pointer. In effect, a mistake in A causes thread B to
damage something that thread C is dependent upon. And C is the visible “culprit” when the crash
finally occurs.
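Here is a contrived Python sketch of that pattern, in the same spirit though not the same mechanism (Python has no dangling pointers, so the “damage” here is two shared structures that briefly disagree). The thread that corrupts the state runs to completion quietly; the exception, when it appears at all, surfaces in the other thread, whose stack trace is what the bug report will show.

```python
import threading, time, random

inventory = {f"part{i}": i for i in range(100)}
index = list(inventory)      # a second structure that must stay consistent with the dict
lock = threading.Lock()      # a lock exists... but thread A never takes it

def thread_a_update():
    """The real culprit: dict and index are updated non-atomically and without
    the lock, leaving a brief window in which the two structures disagree."""
    for _ in range(90):
        key = random.choice(index)
        del inventory[key]   # step 1: dict updated
        time.sleep(0)        # the scheduler may switch threads right here
        index.remove(key)    # step 2: index updated a moment later

def thread_b_report():
    """The visible victim: it reads carefully, and even takes the lock, yet its
    stack trace is the one shown if the KeyError fires (the lock is a band-aid,
    since thread A ignores it)."""
    for _ in range(1000):
        with lock:
            for key in list(index):
                _ = inventory[key]    # crash surfaces HERE, in thread B

a = threading.Thread(target=thread_a_update)
b = threading.Thread(target=thread_b_report)
a.start(); b.start(); a.join(); b.join()
# Some runs crash in thread B, others don't: the defining signature of a Heisenbug.
```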
Thus with a Heisenbug, one issue is that
the consequences of the mistake might be noticed in a thread totally different
from the one that caused the problem.
Further, anything that might change the execution could impact the
thread that will “notice” the problem: in one run C is first to hit the issue,
in another, the whole problem doesn’t occur because the data structure gets
garbage collected before anyone notices the problem.
Bruce explained this is why he had become
skeptical that we can ever really fix such issues; more often, he explained,
someone finds an elaborate workaround so that the bug can occur without the
harm being quite so noticeable. Further,
this, he felt, was why life-cycle patches and functionality upgrades demand
special attention: in his view, it is in the nature of software that each such
event will trigger a wave of new problems.
The life and times of a critical system
Why is life-cycle a special issue when
building critical systems? A core issue
is that non-critical products evolve through a series of releases, which can
mean that (1) the race to introduce features is more dominant than the push for
ultra-high reliability, and (2) the upgrade to a completely new version may
even happen before the prior version is fully mature. Vendors will issue patches to correct major
bugs, but minor issues are often folded into the next version, so that rather
than ever settling into a stable state with no known bugs, there is a continuous progression of major versions, each fixing a set of issues
with the prior version but introducing new features and functionality that
often bring issues of their own. Even
so, there is a presumption that users will promptly upgrade each time a new
version is released.
Safety critical systems are different:
switching a working mission-critical computing system like an air traffic
control application to use a new version of one of its component technologies
can be very costly, because such a change would often force a complete
revalidation. This creates a dilemma,
because as time passes, it gets harder and harder to find the resources to
revalidate. We end up with our critical
computing system forced to run using components with known bugs that can’t be
fixed.
The problem isn’t a new one. NASA, whose mandate includes aeronautics and aviation safety research within the United States (space exploration isn’t their only assigned job), long ago began to talk about integrated modular avionics, meaning a methodology for building critical systems (avionics, in their case) out of modules that can be removed and replaced, plug-and-play, with a validation methodology covering just the upgraded element. But so far as
I know, they never felt that the problem was solved, and continue to have the
same issues of upgrading that we’ve touched upon above.
The core problem is that even if a new
module passes some sort of compliance test that sufficed for the old module,
the technologies used inside it may have changed in a dramatic way, and perhaps
those changes create new safety issues.
For example, there were a series of aircraft fires associated with
lithium-ion batteries in 2015: it turned out that if those were dropped during
installation, they could crack in ways that permitted flammable gases to build
up. But those batteries were themselves
upgrades of a prior battery technology that wasn’t at risk of this kind of
issue. Thus the tests to replace the old
battery technology with the new one wouldn’t necessarily have included
drop-resistance and fire-resistance scenarios: the new technology improved on
the old in some ways, but created additional questions not seen with the prior
technology.
Here's another scenario that illustrates a
different aspect of concern. Suppose
that a smart power grid solution depends on a message bus from a vendor, call
them fly-by-wire pubsub, or FBW.com.
Further, let’s assume that FBW.com had the very best technology for
publish-subscribe in 2010 when our competition for a vendor began.
Well, competitions take time: by the time
FBW.com is under contract it could easily be 2012, and they probably didn’t
deliver a hardened and heavily tested version of their product until 2015.
In the life-cycle story one might expect
from the above, years of happy use ensue, with occasional patches and
improvements. But it is just as likely
that one day, FBW ceases operations.
Perhaps they are acquired, or go bankrupt, or the company strategy might
evolve. There are dozens of reasons that
FBW could change in ways that would make them an unacceptable technology source
for our project.
So now the power grid is faced with a
serious problem: the last delivered version was fine, but what will I do if new
features are needed, or if the system needs to be moved to new hardware, or if
a problem is noticed as the load on the smart grid platform grows over
time? Quite possibly, the 2012 contract
with FBW is useless: the original entity might not exist anymore! In effect, by selecting FBW.com as a provider
we opted for the very best as of 2010, but that isn’t much help today.
Even if FBW is a healthy company ten years
later, their market may have evolved in ways that force their product to change. We noted that the pressure to
roll out new features becomes a continuous cycle of evolution without long
periods of stability. Thus FBW might
have shipped us version 1.0 for the original smart grid application, but could
by now be up to version 4.5 and might be less and less capable of supporting
1.0, no matter what the 2012 contract seemed to require.
What are my options? One would be to renegotiate the contract,
paying FBW extra money to test the new code to the same quality level as the
old code. However, this would be quite
expensive and many government agencies are loath to send extra money to a
company that is unable to deliver on prior commitments. In fact FBW itself might refuse such a
contract, since its business model has changed.
Indeed, it may be obvious to them and to me that the new implementation
has ceased to be suitable for safety-critical uses: the evolution of products
generally responds to mass-market requirements, not to special needs of niche
users. Guarantees one could count on in
2010 might be abandoned by 2016!
Is there a magic remedy to the FBW
conundrum? One answer is to limit procurement to large companies, but no company is immune to stories such as
this. A second is to internalize the
issue: the FBW solution in 2015 could have been open source, so that the power
grid project would be in a position to do its own in-house maintenance. But this is problematic because many
safety-critical infrastructure operators aren’t technology houses: they run
smart grids or smart highways or do air traffic control, so expecting them to
also have an ongoing software team isn’t necessarily realistic. More common would be to outsource the whole
issue to an integration vendor, picking a company with a long history of such
roles, but this might be less than fully satisfactory depending on the state of
the FBW component at the moment that FBW walks away from the product.
Best practice in 2017 and beyond
No technology is completely free of issues,
so I hope that you won’t see the remarks above as some form of bleakly negative
condemnation. In fact, they add up to
two reasonably good stories: one that works best at small scale, and one that
works fairly well at large scale. It
shouldn’t surprise anyone that both methods also have limitations.
The key, it seems to me, is that as we move
forward with a new generation of mission-critical systems that use complex
technologies in sensitive settings, we do so with a deft hand and an
appreciation of these limitations.
Consider self-driving cars. Few applications could be more risky: not only
are the passengers in the car at risk, but so are people in other cars, and
even bystanders on the street. Further,
the software used in different cars will in general not be created by any
single vendor: there are already at least a half dozen major projects
underway. Thus as a car proceeds down
the highway, often at very high speeds and with very small spacing relative to
other cars, we have a very subtle distributed consensus problem in which the
participants might not be running the identical protocol, or any protocol at
all.
To me this problem statement is manifestly
unsafe: if we build self-driving cars to operate fully autonomously, we will
never manage to create a safe system. We
could end up with a solution that has better average driving safety statistics
than human drivers can achieve, but there will be situations under which the
ensemble will fail catastrophically.
But the situation would change if we
introduced smart highways: a form of
air traffic control system that could take charge of the roadway. In a situation where no car drives itself
without explicit permission from the smart highway controller, many of the most
extreme worries that arise with fully autonomous systems vanish. We can then take a next step and require
fail-safe behavior: that in all cases, for any k component failures, the system can move itself into a safe
mode. For values of k like 1 or 2, this should be feasible. Safe mode could be a complete shutdown, or a
situation in which a car pulls to the side and requests service.
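Here is a hedged Python sketch of that fail-safe rule. The names, timeouts, and value of k are hypothetical; the point is just the shape of the logic: movement permissions are only issued while at most k components are suspected of failure, and a car that lacks a fresh permission defaults to pulling over.

```python
import time

K_TOLERATED_FAILURES = 2          # the "k" in the text; assumed value
HEARTBEAT_TIMEOUT_S = 0.5         # assumed failure-detection timeout
PERMISSION_LIFETIME_S = 1.0       # permissions expire quickly, so silence is safe

class HighwaySegmentController:
    """Grants short-lived movement permissions only while the segment is healthy."""
    def __init__(self, component_ids):
        self.last_heartbeat = {c: time.monotonic() for c in component_ids}

    def record_heartbeat(self, component_id):
        self.last_heartbeat[component_id] = time.monotonic()

    def suspected_failures(self):
        now = time.monotonic()
        return [c for c, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT_S]

    def grant_permission(self, car_id, requested_plan):
        if len(self.suspected_failures()) > K_TOLERATED_FAILURES:
            return None                          # safe mode: withhold all permissions
        return {"car": car_id, "plan": requested_plan,
                "expires": time.monotonic() + PERMISSION_LIFETIME_S}

def car_control_step(permission):
    """On the car side, the fail-safe default is to pull over and request service."""
    if permission is None or time.monotonic() > permission["expires"]:
        return "PULL_OVER_AND_REQUEST_SERVICE"
    return "FOLLOW_PLAN"
```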
Building the smart highway controller
itself would then become a problem similar to that of building any other
safety-critical solution. My instinct
would be to take our two best-of-breed options, but then try to go even further
by using the small elements within the larger systems in a very careful,
“enlightened” manner – so this last step is the opportunity for innovation,
because it is a step we don’t often look at closely even when using clean-room
methods at large-scale. It won’t give
us a universal answer covering all possible applications, and won’t address the
life-cycle issues of clean-room technology development, but there is at least
an opportunity for meaningful progress.
What would it mean to build a large system
that gains safety from properties of a small system embodied within it? The answer would need to run something like
this:
- We factor out the safety critical aspects of a large system, so that the technology required for safe operation is much smaller than the full footprint of the large system as a whole.
- We use strong methods (formal machine-checked proofs) for the technologies providing core safety. Techniques like synthetic diversity and replication offer a way to overcome lingering distrust of the hardware, compiler, O/S and other elements: if we replicate critical components and one instance suffers a mishap, the system can outvote it and move on.
- The remainder of the system is created using clean-room technologies. Indeed, it might fail in some unexpected ways, perhaps due to latent Heisenbugs, or perhaps due to life-cycle upgrades that introduce new issues. But because this part of the system isn’t safety critical, the safety of the end users isn’t put at risk by such events.
Mapping this back to a smart highway, the
safety critical aspect might be a subsystem managing automobile driving plans,
a bit like flight plans. We could adopt
the view that these advance through a series of versions, which are themselves
read-only (so to modify version 7, the system creates version 8, etc). Questions arise of how to persist these
versions, how to share them, and so forth.
But as we progress down such a design path, we end up with smaller and
smaller functional elements that are more and more amenable to formal
correctness methods. Then we can layer
in fancier functionality, such as software to recommend hotels and restaurants
and make reservations at a higher level, in ways that do not put the core
safety of the solution at risk.
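As a hedged sketch of the versioned, read-only driving plans just described (names and structure are illustrative, not a proposed design): each modification produces a new immutable version, so readers always see a consistent plan, and the safety-critical kernel only has to validate the transition from version N to version N+1.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)               # frozen: a plan version can never be modified in place
class DrivingPlan:
    version: int
    waypoints: Tuple[str, ...]        # e.g. road segments the car is cleared to enter

class PlanStore:
    """Keeps every version; "modifying" version N means appending version N+1."""
    def __init__(self, initial_waypoints):
        self._versions = [DrivingPlan(1, tuple(initial_waypoints))]

    def current(self) -> DrivingPlan:
        return self._versions[-1]

    def propose_update(self, new_waypoints, safety_check):
        """The safety kernel only has to approve the old -> new transition."""
        old = self.current()
        new = DrivingPlan(old.version + 1, tuple(new_waypoints))
        if not safety_check(old, new):
            raise ValueError("update rejected by safety kernel")
        self._versions.append(new)
        return new

# A deliberately simple, hypothetical safety rule: consecutive plans must overlap,
# so a car is never handed a plan disconnected from wherever it already is.
def plans_overlap(old: DrivingPlan, new: DrivingPlan) -> bool:
    return bool(set(old.waypoints) & set(new.waypoints))

store = PlanStore(["I-90 exit 40", "route 13 north"])
store.propose_update(["route 13 north", "campus loop"], plans_overlap)
print(store.current())   # DrivingPlan(version=2, waypoints=('route 13 north', 'campus loop'))
```

The fancier functionality mentioned above (hotel and restaurant recommendations, reservations) would live outside this kernel, able to read plan versions but never to bypass the safety check when proposing changes.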