Sunday, 5 November 2017

Disaggregating data centers

One of the hot new buzz-phrases in the systems community is "disaggregation", as in "the future data center will be disaggregated".

Right now, we tend to build data centers by creating huge rack-mounted collections of computers that often run the same operating systems we use on the desktop: versions of Linux, perhaps virtualized (either with true VMs or via container models, like Docker on Mesos).  The core challenge centers on guaranteeing good performance while sharing resources in ways that optimize power and other costs.  The resulting warehouse-size systems can easily contain tens or hundreds of thousands of racked machines.

Researchers pressing for a disaggregated model generally start by pointing out that we really have a lot of unused resource in these settings.  For example, there might be large numbers of machines equipped with GPU coprocessors that aren't currently running any applications that use them (or ones that are FPGA-capable, but don't actually need FPGA support right now), machines with lots of spare DRAM or SSD space, etc.  You can make a long list of such resources: other forms of ASICs (such as for high-speed SHA3 hashing), accelerators for tasks like DFFT computation, DRAM, unused cores, NICs that actually have computers built into them and hence could be used as network-layer coprocessors, switching and routers with programmable capabilities, massive storage systems with ARM or similar compute power down near the medium, etc.

So, the proposal goes, why not shift towards a greater degree of sharing, where your spare resources could be available to my machine?

With RDMA and RDMA-enabled persistent memory (PMEM) standards starting to work properly in settings that long-ago standardized around Ethernet and IP (RDMA on Infiniband is just not a hugely popular technology in data centers, although Azure HPC offers it), it makes sense to talk about sharing these spare resources in more ambitious ways.  My spare DRAM could be loaned out to you to expand your in-memory cache.  Your unused FPGA could become part of my 1000-FPGA deep neural network service.  There are more and more papers about doing these kinds of things; my class this fall has been reading and discussing a bunch of them.  In our work on Derecho, we've targeted  opportunities in this space.  Indeed, Derecho could be viewed as an example of the kind of support disaggregated computing might require.

But deciding when disaggregation would be beneficial is a hard problem.  A remote resource will never have the sorts of ultra-low latency possible when a task running locally accesses a local resource, and studies of RDMA in large settings show surprisingly large variability in latency, beyond which one sees all sorts of congestion and contention-related throughput impacts.  So the disaggregated data center is generally going to be a NUMA environment: lots of space resources may be within reach, but unless care is taken to share primarily with nearby nodes  (perhaps just one hop away), you would need to anticipate  widely varying latency and some variability in throughput: orders of magnitude differences depending on where something lives.  Get the balance right and you can pull off some really nifty tricks.  But if you accidentally do something that is out of balance, the performance benefit can quickly evaporate.

Even with variability of this kind, a core issue centers on the shifting ratio of latency to bandwidth.  In a standard computer, talking to a local resource, you'll generally get very high bandwidth (via DMA) coupled with low latency.  You can do almost as well when running in a single rack, talking to a nearby machine over a single hop via a single TOR switch.  But when an application reaches out across multiple hops in a data-center network, the latency, and variability of latency, grow sharply. 

The bandwidth story is more complex.  Although modern data centers have oversubscribed leaf and  spine switches and routers, many applications are compute intensive, and it would be rare to see long periods with heavy contention.  Thus, even if latency becomes high, bandwidth might not be lost.

In our experiments with RDMA we've noticed that you often can get better bandwidth when talking to a remote resource than to local ones.  This is because with two machines connected by a fast optical link, two memory modules can cooperate to transfer data "in parallel", with one sending and one receiving.  The two NICs will need to get their DMA engines synchronized, but this works well, hence "all" their capacity can be dedicated (provided that the NICs have all the needed data for doing the transfer in cache). 

In contrast, on a single machine, a single memory unit is asked to read data from one place and write it to some other place.  This can potentially limit the operation to half its potential speed, simply because with one core and one memory unit, it is possible to "overload" one or both.

We thus confront a new balancing act: a need to work with applications that run over hardware with relatively high latency coupled to very high bandwidth.  In my view, this shifting ration of latency to bandwidth is a game-changer: things that worked well in classic systems will need to be modified in order to take advantage of it.  The patterns of communication that work best in a disaggregated world are ones dominated by asynchronous flows: high-rate pipelines that move data steadily with minimal delays: no locking, no round-trip requests where a sender has to pause waiting for responses. 

As you  know from these blogs, almost all of my work is on distributed protocols.  Mapping these observations to my area, you'll quickly realize that in modern settings, you really don't want to run 2-phase commit or similar protocols.  Traditional versions of Paxos or BFT are a performance disaster.  In fact, this is what led us to design Derecho in the way we ultimately settled upon: the key challenge was to adapt Paxos to run well in this new setting.  It wasn't easy and it isn't obvious how to do this in general, for some arbitrary distributed algorithm or protocol.

So the puzzle, and I won't pretend to have any really amazing ideas for how to solve it, centers on the generalized version of this question.  Beyond the protocols and algorithms, how does Linux need to evolve for this new world?  What would the ideal O/S look like, for a disaggregated world?

As a concrete example, think about a future database system.  Today's database systems mostly work on a read, compute, output model: you start by issuing a potentially complex query.  Perhaps, it asks  the system to make a list of open orders that have been pending for more than six weeks and involve a supplier enrolled in a special "prioritized orders" program, or some other fairly complex condition. 

The system reads the relations containing the data, performs a sequence of joins and selects and projects, and eventually out comes the answer.  With small data sets, this works like a charm.

But suppose that the database contains petabytes of data spread over thousands of storage systems.
Just reading the data could saturate your NIC for days!  In such a situation, you may need to find new ways to run your query.  You would think about disaggregation if the underlying hardware turns out to be reasonably smart.  For example, many modern storage devices have a local processor, run Linux, and can run a program on your behalf.

Clearly you want the database operations to be pushed right into the storage fabric, and would want to harness all that compute power.  You might want to precompute various indices or data samples, so that when the query is issued, the work has partly been done in advance, and the remaining work can happen down in the layer that hosts the data.

But of course I'm a systems person, so my system-centric question, posed in this context, is this: what will you want as an O/S for that environment?  And more to the point: how will that O/S differ from todays’s cloud systems?

If you are unsure how to think of such a question, consider this: if you wanted to implement a disaggregated database today, how would you even convince the storage layer to run these little add-on programs?  In a purely mechanical sense: how does one run a program on a modern SSD storage unit?  I bet you have absolutely no idea.

If you wanted to do everything by hand, logging into the storage servers one by one is an option.  Today, it might be the only option, and the environment you would find inside them could be pretty arcane.  Each vendor has its own weird variation on the concept. 

But we could change this.  We could create a new kind of operating system just for such tasks.  The power of an operating system is that it can impose standardization and increase productivity through portability, easy access paths, good debugging support, proper integration of the needed tools, virtualization for sharing, etc.  You would lack all of those features, today, if you set out to build this kind of database solution.

I mentioned at the start that data centers are optimized to minimize wasted resources.  One would need to keep an eye on several metrics in solving this database challenge using a disaggregated model.  For example, if DRAM on my machine gets loaned out to your machine, did the total performance of the data center as a whole improve, or did your performance improve at my expense?  Did the overall power budget just change in some way that the data center owner might need to think about?  How did this shift in resources impact the owner's ability to manage (to schedule) resources, so as to match profiles of intended level of resources for each class of job?

Then there are lower level questions of what the needed technical enablers would be.  I happen to believe that Derecho could become part of the answer -- this is one of the motivating examples we thought about when we developed the system, and one of my main reasons for writing these thoughts down in this blog.  But we definitely don't have the full story in hand yet, and honestly, it isn't obvious to me how to proceed!  There are a tremendous number of choices, options, and because the vendors don't have any standards for disaggregated computing, a total lack of. existing structure!  It’s going to be the Wild West, for a while... but of course, as a researcher, that isn't necessarily a bad news story...

Thursday, 12 October 2017

Deriving assumptions and constraints

In the past few blog entries, I focused on issues of "specification": to what extent are we able to specify the behavior our systems guarantee, the behavior we expect clients to respect, or other kinds of assumptions we might be making about the setting, peak loads, and so forth?

Today I thought I might ask a closely related question: could formal theorem-proving tools, like the Ivy prover used by my friends at Tel Aviv University, or the NuPRL system that Bob Constable's group here at Cornell created, be used to "extract" a specification, including any assumptions or constraints the solution imposes on applications that use it? 

I should mention that while these two theorem proving systems use high level logic languages, not all such systems are quite so mathematical in approach.  If you prefer Java as your coding language, Microsoft's Dafny would be another system where you could pose my question.  In fact, Dafny really embodies the main idea behind this blog posting, but approaches it the question in a slightly different way.  In Dafny, when you specify a component of a system, you encode any preconditions or invariants as assertions in a special notation that goes into the documentation for the associated module and methods.  Later, when compiling your code, Dafny requires you to prove that the method actually respects the constraints that you expressed this way.  So Dafny doesn't automate specification, per se, but it does enforce that a method must respect its specification, and it can sometimes automatically prove that this property holds.  If so, you won't need to do anything more at all.  Further, if your users later violate an assumption, Dafny would refuse to compile their code, ensuring that your module will be used only as intended.

The problem of automatically figuring out what these specifications should include looks super hard, but is fascinating to me: an ideal topic for research, if you have the math and type-theory background (or are willing to learn: nobody is born knowing that stuff...)

So here's the setting.  Suppose I have a method for computing some function, F(x), such as "it computes x/2".  Now, assume that I decide to run it purely with integer arithmetic.  When will method F() be correct?  Well, for this simple example, it will work for even values of x.  Had I been using Dafny, it might refuse to compile my code without some help, such as "assert even(x)" as a precondition (the definition of even(x) could be something like x%2 == 0, for an integer x).

So Dafny wouldn't necessarily compile code with an error in it, but it often wouldn't know why the code isn't correct.  The question becomes: could a theorem prover have figured this out for me?  With this very simple example, the answer is "sort of."  Take a theorem prover, and tell it that F is a "lambda" that maps from x to x/2.  Now define the argument x to be of type integer, and tell it that the arithmetic operator is of type (integer, integer) => integer.

If you tell NuPRL or Ivy to "verify" this code, it should fail, just as Dafny should fail, giving an example such as "x=1", and meaning "with 1 as its input, F(1) gives an undefined result".  Another possibility, for situations where the prover understands that integer division rounds down, would to try and verify the invariant that F(X)*2 = X.  This should fail, with the example F(1)=0.

So now we know that our function doesn't "work" as intended.  What we really want is for the system to suggest "Have you considered requiring that X must be even?"

So this would be an automated derivation of an assumption, or of a constraint on the inputs.  (For this blog posting, we can assume that assumption and constraint are synonyms, although there are some PL communities that use the terms in slightly different ways.)

How hard is to automate the inference of assumptions?  With current programming languages, it can be very difficult, for several reasons.  A core issue is specificity and naturalness.  Notice first that because false implies anything, there is a very simple universal constraint, but it renders the problem totally useless.  This rules out any kind of simple way of solving the problem.  So we want a non-trivial solution.

Next, consider that specifications can involve multiple kinds of formalisms: multiple areas of mathematics.  For example, over the course of my career I've worked with systems that have real-time guarantees,  protocols in which you typically specify a convergence guarantee,  more classic temporal logic properties, such as consistency for state-machine replication.  Years ago, one of my students (Kryzs Ostrowski) made the point that there aren't any programming languages in which all three of these "models" would easily fit.  He argued for a different form of type checking, but setting that to the side, the bottom line is that if you code in a language where temporal logic is relatively easy to express (like Dafny), you probably won't be able to specify real-time properties, or probabilistic convergence properties.  Each style of property has a different most natural style of formalism, and they don't necessary compose in obvious ways.

The same goes for having a system automatically infer specifications: the odds are that any given formalism will be expressive enough to capture and perhaps infer at least some kinds of specifications in its own formalism, but that it won't even be able to "understand" properties that are outside its normal scope of coverage.

This makes specification of the automatic-specification-derivation question itself tricky.  A really good derivation of assumptions would factor out each distinct element, so that we would have a small list of them, each "speaking" about a distinct aspect of the runtime environment.  But a theorem prover, lacking a human notion of what it means to be distinct and natural, is much more likely to extract some single messy condition for safety, and perhaps a second messy condition for liveness ("total" correctness can be viewed as having two aspects: "nothing bad happens", and "if condition X holds, then something good will eventually happen"). 

Of course, if English was useful for describing this goal, it would have been much easier.  We basically want a normal human reader of the specification to be given a list of preconditions (requirements the caller must respect), each separately intuitive and sensible: "The program will work if (1) The inputs are even numbers.  (2) No input is negative.  (3) No input exceeds 2^16-1..."  So getting from some single condition for correctness, if we had one, into a cleanly refactored form that feels right to human users will be a puzzle.  Then we would want to specify post-conditions, again in an easily understood form, and invariants.

But while these easy cases are probably very feasible, any general approach to extracting a specification purely in temporal logic is guaranteed to be hard.   Taking a favorite classic paper: I would love to see someone explore extracting the specification of uniform agreement (consensus).  In principle, doing so should  "derive" the famous Chandra and Toueg <>W conditions: consensus with N>2 members, of which at most one is faulty, can be achieved if the failure detection input has the property that eventually, it permanently stabilizes in a state where some correct process is never again suspected of being faulty by any correct process.  A really powerful system might even be able to give a result for the more general case of N members and T failures, and derive that N > 2T+1.

If you follow this blog, you've noticed that I'm obsessed with our new Derecho system.  For a system like Derecho, what could a tool of this kind derive?  We think we know the actual specification (we understand our own protocols well), but it would be fun to see a surprising "angle" that we haven't considered.  And you can't rule out such a thing happening: Derecho is a complex engineered piece of software, and even if we think we understand the conditions for progress and safety, perhaps there are aspects of the environment that Derecho depends upon, and that we haven't thought much about.  For example, what are we assuming about the network layer (RDMA, in the case of Derecho)?  What are we assuming about the SSD storage units?  Are there any timing requirements?

Monday, 18 September 2017

What we can learn about specifications from ZooKeeper's asynchronous mode, and its unsafe ForceSync=no option?

At the time of this writing, ZooKeeper is surely the world's most widely used data replication tool.  You run it on a few machines, normally 3, and it offers a file system API with very strong guarantees to the user.  In fact, if configured to do so, ZooKeeper implements the Paxos specification: Leslie Lamport's formalism of the properties required for correct state machine replication.  (My post-doc, Weijia Song, points out that actually, the Zookeeper Atomic Broadcast, ZAB, isn't necessarily a true Paxos protocol, but the issue he raises is very subtle, so I'll set that to the side.  For our purposes here, ZAB is close enough).

In a recent blog posting, we discussed some of the missing aspects of that very specification.  As a result, when I read the ZooKeeper documentation, I was intrigued to realize that the documentation more or less urges that the system be configured to violate Paxos!  In fact the document is short, and easy to read, so have a look if you are skeptical. 

You'll learn about all sorts of parameters that represent ZooKeeper's response to those missing specification elements, such as how to deal with disks that fill up completely, or avoiding inconsistency in the list of servers running the ZooKeeper service.

And then, in the middle of the same document, you run into a fascinating option: there is a small section called "Unsafe configuration options" that explains that "The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does."  Then we read about an option called ForceSync: "If this option is set to no, ZooKeeper will not require updates to be synced to the media." There is no discussion of risks at all.

Some people know about this but think of it in terms of a broader approach to "using Zookeeper asynchronously".  Used asynchronously, Zookeeper lets you start a series of operations but either ignore their termination, or at least not wait one by one.  Of course flow control always kicks in eventually, to prevent congestion, but you end up with a stream of requests.  In this mode it is nearly universal that you would also set ForceSync=no.

So how safe are such actions?

Elsewhere, on the ZooKeeper blog, Flavio Junquera writes that the system would perfectly well if this option is used, and that it can offer big speedups.  He comments that for safety, there are several options: "You could consider using write barriers, or battery-backed raid SSD".  The write barrier remark relates to a Linux system call, "fsync".  A battery-backed raid SSD is a type of SSD storage with a DRAM cache that can hold pending writes in memory (in DRAM), but with battery backup so that if power fails, the pending writes will definitely complete.  Then behind the DRAM are a set of SSD storage units arranged to handle transfers in parallel, so that the aggregate bandwidth might be enough to keep up with the DRAM transfer rates.

On StackOverflow and elsewhere, you can easily find threads encouraging you to configure ZooKeeper with ForceWrites=no, and assuring the reader that nobody has ever observed any bad consequences.

In effect, there is very little discussion of risks, except in the sense of "yes, you should definitely use this feature, but remember to also do these other things...."

So what's the issue, and why is it interesting?

At the core of any Paxos implementation is the transaction log where Paxos stores its state.  In Derecho, this takes the form of replicated data residing in the replicated C++ objects defined by the developer.  In classic Paxos, it was a list of log entries associated with the "acceptor role".  Most people understand this to have been an append-only disk file, but my colleague and friend Robbert van Renesse, a Paxos expert, questions that assumption.  He thinks that Leslie was deliberately vague about where the logs live, with the intent that it could equally well be used as an in-memory atomic multicast.  Derecho does exactly that: it has one protocol, with two configuration options, and you get to pick.  Durable storage on disk gives you a durable Paxos, and in-memory storage, a form of atomic multicast with total ordering and fault-tolerance.

The same is true in ZooKeeper, in which performance centers on the speed of the ZooKeeper transaction log.  You need to tell it where you want the log to reside.  Some popular options include placing it in RamDisk (in memory), or on a real disk, or perhaps an SSD.  Above you saw recommendations that it be on a battery-backed raid SSD.

The problem is that if you just put the log on a normal disk or even a normal SSD disk, you get Paxos guarantees of durability... but you also see a heck of a big slowdown.  Partly this is because DMA to an SSD is quite slow compared to copying in memory.  But the bigger issue is that each time you do an SSD write, if you actually wait for the write to fully complete ("a forced sync"), you pay a full millisecond just waiting.

Even with concurrency this limits the typical SSD configuration of ZooKeeper to about 1000 write operations per second.  

Early in the ZooKeeper story, the developers ran into this issue, and added a new option: ForceSync=no.  With it, ZooKeeper "on its own" ceases to be a true Paxos log, because it will build a backlog of in-memory updates queued up to be written to disk, and won't actually carry out those updates instantly.  But it gains hugely in performance: 50,000 writes per second become completely feasible.  A 50x speedup... at what cost?

This is where those comments about battery-backed SSDs and write barriers enter the picture.  And this is the puzzle: in fact, you can use ZooKeeper safely in this mode, at no cost and no risk at all.  But it depends on your perception of cost, and of risk.

Lets start by setting ForceWrites=no but ignoring the helpful advice.  ZooKeeper will be buggy.  But, to be bit by this particular bug two things have to happen.  First, you need to have a service that crashes and develops amnesia about a batch of committed transactions (updates) that were pending at the time of the crash.  And second, someone or something needs to notice.

The point about "someone noticing" is the key to why so many applications get away with setting ForceSync=no, and yet pay no attention to Flavio's advice.  Think about the sequence of events for an application using ZooKeeper.  Some application is about not to complete something important, like launching the rocket ship.  So it writes to the ZookKeeper log "... two, one, ignition!" and presses the launch button.

Exactly as this occurs, the power goes out, and on recovery, the system has no record that the button was about to get pushed.  So we have an inconsistency that Paxos normally doesn't permit: Lamport requires that Paxos must never forget a committed transaction, meaning that once the application is told the commit has occurred, Paxos has an obligation to not lose it.

But this is not a likely failure sequence!  The amnesia part, sure, that really is likely.  A bit like with a normal Linux file system: if a program crashes before calling fsync, the last bytes it wrote could easily be lost (maybe even the last few thousand).  We know that, and learn to call fsync.  But someone actually caring, about that specific operation, yet neglecting to manually call fsync?  Seems very unlikely...

So here we have ZooKeeper acting... like Linux file systems normally act!  In fact, you can manually call fsync anytime you like in ZooKeeper, so if you do need it, there it is.  That's the write-barrier approach.

The battery-backed raid SSD option is less common.

So who is wrong: Leslie, for including this rule in the specification?   The good user, who learns to call fsync when necessary?  Or the bad user, for recklessly breaking the properties of Paxos, all for a lousy 50x or 100x speedup?

As a builder, I have to feel sympathy for any developer who wants the speed.  And I honestly question that Paxos specification.  Maybe the requirement really is too strong!  Couldn't it be reexpressed in terms of fsync: "no committed request will ever be lost if fsync was invoked, and completed after the commit?"

In fact the interesting issue here is that when ForceSync=no, ZooKeeper simply imposes an extra obligation on the behavior of the developer (use fsync, or confirm that you have the right kind of specialized SSD).  As we discussed in that prior blog entry, Paxos already imposes obligations on its users, and doesn't express those either.  Why is this different?

Yet I also understand Leslie.  I've asked him about this, and he thinks developers will just get it wrong.  They want the speed, because otherwise they look bad, so they flip this switch, but do something for extra speed in a situation where it really isn't appropriate.

Here in a college town, with students who definitely drive unsafely and fast, I get it.

How many of the developers who push ZooKeeper's insane speed button actually know what they are doing, and think about when to manually call fsync?  Yet on the other hand, how many of their applications would break if the ZooKeeper storage subsystem were to slow down by 50x or 100x?

So you tell me: should systems follow the ZooKeeper lead?

Seriously: what do you think?  Should Derecho support ForceSync=no?

Thursday, 7 September 2017

Inadequacy of the Paxos specification, and what we can learn from the issue

In a blog ten days ago, I discussed the issue of specifications that omit coverage for cases that actually arise in real systems.  Since then two colleagues who follow the blog asked for examples to illustrate the issues, so I thought I might say a few more words on this, focusing on the classic specification of Paxos: Leslie Lamport's solution to the State Machine Replication problem (also sometimes called Consensus, or Uniform Agreement).

The traditional specification of Paxos has the following elements:
  • A specified set of participants.
  • An assignment of roles {leader, acceptor, learner} to the participants.  Each can have more than one role, and we often think of external clients as adding two additional roles: {command-initiator, command-consumer}. 
    • A leader runs a protocol for putting new commands into the Paxos log.
    • An acceptor holds a Paxos log (an ordered list of slots, each of which can be empty, or can hold a command), and information about which commands are known to have committed.  Any given acceptor might have gaps: slots for which it lacks the committed command (on top of this, there is also a somewhat subtle failure case in which a slot will permanently be left empty, so that every acceptor would have a gap at that spot).
    • A learner runs a protocol for computing the full list of committed commands and their ordering.  Any single acceptor might have gaps in its log, so the learner does this by merging logs in order to fill those gaps.
    • Paxos really says nothing at all about command initiators and consumers, except that the initiator waits for a response to the request that the command be posted (in doing so, Paxos eliminates what we sometimes refer to as a causal order obligation, in which a system is expected to track and respect ordering on pending asynchronous actions).
  • A rule for what it means for a command to be valid (non-triviality). 
  • Agreement on ordering.
  • Durability: in any state where a Paxos service responds to "learn" requests, it needs to return the entire list of previously ordered commands. 
Many authors just focus on the three bolded properties.  Yet notice that from this set of five elements, we can easily discern a whole series of additional, unspecified aspects:
  • There is a Paxos reconfiguration protocol, but it doesn't introduce additional specification elements, except to change the set of participants.  Yet there are several aspects one would normally wish to see addressed:
    • Malkhi has convincingly argued that one should terminate a membership epoch before starting actions in the next epoch.  This avoids having a command that was initiated in epoch k commit much later, perhaps in epoch k+1.  While Lamport has often said that this isn't necessarily a bad thing, Malkhi's point is that a delayed commit can be confusing, because many systems operate in a configuration-sensitive way.
    • A new member of a Paxos acceptor group should probably not have an initially empty log.  A means of performing state transfer to it is thus required.  Simply copying some existing log to the joiner is incorrect, because by doing so, a command that previously lacked a quorum and hence was uncommitted (Paxos has a scenario in which it leaves a slot "empty") can become committed because duplicating a log effectively duplicates a vote for that command.
    • When Paxos restarts after all its members crash, the protocol doesn't specify the rule for figuring out the proper initial configuration for the restarted service (this matters with dynamic membership).
  • The specification says nothing about flow-control, yet if a Paxos protocol has no flow control mechanism at all, a quorum of acceptors could advance unboundedly far into the future relative to a stalled or failed acceptor.  This might mean that the faulty acceptor has no feasible way to catch up later, and in effect, would decrease the resilience of the service: rather than having N members, we would have to think of it as having just N-1.
  • The specification says nothing about sizes of objects (size of commands), yet acceptors will presumably have bounded amounts of memory, and might not be able to accept arbitrarily large objects.  Solving this isn't necessarily hard (one could have commands that point to external objects), but then the objects would be less replicated than the commands, and one has to ask whether that somehow violates the durability property.
  • The specification says nothing about fairness or other quality of service properties, such as timeliness of response.  Yet real systems need fairness when many clients share a Paxos service and all the clients want a fair-share of access.  In fact, one can then ask what specific notion of fairness is desired: should it be round-robin (like in Derecho)?  Or should some clients be allowed to send 2x more commands than others, because they happen to be busier?  Should a slower client be "delayed" by activity of a faster client?
  • I mentioned that Paxos seemingly excludes situations where clients might issue a series of requests, without waiting for replies.  In practice, this is common (clients often "stream" requests).  We would want the client ordering to be respected by the system: if a client sends request A, then B, Paxos should preserve the A happened before B relationship.
  • At Google, the Paxos owner (in fact the leader of the team responsible for the Chubby service) pointed out to me that his big issue is concerned with wide area deployments of Paxos, which introduces a whole set of issues Lamport never considered:
    • Proper specification of a hierarchical Paxos.  Guerraoui and Pedone and Quema all have looked at this question, primarily in connection with Ring-Paxos protocols.
    • Heavy tailed behaviors.  Chubby struggles to deal with very delayed data that can be caused by overloaded or damaged WAN links.
    • One way to get around "late" data is to design systems that assume they have correct and current data, for example using locks or leases.  Chubby does this, but it turns out that when one does so, getting WAN service instances to release their read locks so that the data can be updated, or preventing them from renewing those leases, can be very slow.  This might violate a specification that requires fast normal read behavior, but also requires a fast way to be able to update "rarely changed" configuration parameters or program versions or other forms of WAN data that does change now and then.
Beyond these points, one encounters a further concern.  The Paxos specification is ideally suited to verifying the correctness of Paxos, with respect to its own "promises".  But the specification doesn't tell us anything at all about correct use of Paxos, or required behavior of the application using the service. 

For example, suppose that Paxos isn't the real repository for data but is playing an intermediary role: the program using the Paxos service might itself be a replicated database, or some other form of replicated service that wants updates delivered in a deterministic order.
  • Do we expect that such a service would always be able to use the identical order to the Paxos log?  What does this imply about the specification the client service must respect?
  • On restart from crashes, we now have two forms of state: state in the durable Paxos logs, and state in the database service replicas.  How should these be reconciled?
  • We won't want the Paxos state to grow without bounds, so we will need to truncate the Paxos logs.  There is a truncate protocol in Paxos, but what obligations fall upon the client service that wishes to make use of that truncate command?  How does truncation interplay with failure cases that might arise?
Believe or not, I could actually go on and list even more issues!  My point, though, is that when we use Lamport's core specification, we are really working from an inadequate specification that omits major, very important aspects of the real service we intend to build and use. 

But notice how hard it would be to check a Paxos specification for adequacy.  We would need a fairly elaborate (and adequate) specification of the environment, and of our larger goals.  Otherwise, questions such as flow-control or other aspects of bounding resource consumption, or of client state, could never even be posed.  So there is sense of chicken and egg: to understand if a Paxos specification is adequate, we really need an adequate specification of the setting where Paxos will be used, so that we can study the questions it poses about the Paxos service per-se.

I'll stop on that point, but there is more one can say about adequacy of specifications.  A good topic for some other posting down the road...

Tuesday, 29 August 2017

The adequacy of specifications

I had coffee with a visitor yesterday: Professor Eva Kuhn from the Technical University in Vienna.  Our conversation was focused on the power, and limitations, of new methods for creating correct solutions to distributed computing problems.

As you probably know, for many years I've been in dialog with a number of programming language researchers (notably Bob Constable, Mooly Sagiv and Noam Rimsky, although the full list would be  much longer and would include Leslie Lamport, Nancy Lynch, Jay Lorch and many others), all specialists on a new way of generating provably correct distributed computing systems and protocols from formal specifications.  I'm not an expert on this topic myself -- my close collaborator Robbert van Renesse is far more knowledgeable than me, and actually works with such tools himself.  Call me an enthusiast.

The methodology they've worked with is roughly as follows.  Using temporal logic or a similar abstract mathematical representation, one creates a description of the problem in terms of sets (the members might be understood as computing nodes, processes, variables, the network itself -- the designer gets to decide), operations upon them, and statements about them.  In this way we can model a distributed collection of computers, exchanges of messages between processes running upon them, and even crashes. 

Next, one specifies a desired behavior, such as uniform agreement (the abstract problem underlying consensus protocols such as Paxos, including the versions we built into Derecho).  In some sense this step describes a "goal", but not the method for achieving the goal.  Think of the specification as a set of constraints, assumptions, invariants: properties that characterize a correct solution.  Ideally, the behavior would have a safety component ("the solution only does good things, in the following sense...") and a liveness component ("provided that the following assumptions hold, the solution always terminates").  But some protocols aren't always live, and sometimes the conditions under which they are live are difficult to pin down, so this second aspect isn't necessarily feasible.
At any rate, in the penultimate step, one specifies the desired protocol itself, but still in a logic formalism.  This is done by expressing the behavior of the protocol as a sequence of operations in the underlying framework -- for example, as a set of temporal logic "actions" on the underlying sets, carefully limited so that any "distributed" operation is performed using our abstracted versions of primitives that make sense in real networks.  For example, the standard Paxos protocol, with its leader-based rounds used to construct ballots and contend for slots in the Paxos log, would be transformed into a series of set operations that match the behavior of the leader, the local copies of the logs at each of the acceptor processes, and so forth.  The lowest level operations would be formal versions of message send and receive primitives.

With this in hand, the next step is to use a theorem prover like NuPRL (Constable's system) or Ivy (the one used at Tel Aviv University) to prove that the abstractly-specified protocol implements the abstractly-formalized behavioral specification.  For example, in the case of Paxos, such a proof would show that for any feasible mix of roles (leader, acceptor, etc), and for every reachable protocol state, the correctness invariants hold.  We would also want to show that from any initial state, some decision states are always reachable, but of course this is not always possible.  For one thing, the FLP impossibility prevents us from creating a protocol in which a decision is guaranteed to occur within finite time, but in fact there are similar cases in which progress can't occur, namely situations that might involve a crash of a majority of the machines, or a partitioning of the network.   For example, if you were to launch a consensus algorithm in a setting where more than half the nodes have crashed, clearly it wouldn't be able to reach consensus.  As mentioned above, in the ideal case you actually specify assumptions under which progress can occur ("we assume that at least a majority of the processes are operational and remain operational throughout the execution of the protocol"), but it isn't always as easy as you might expect to specify those assumptions.  So sometimes, this aspect is not addressed.

These NuPRL or Ivy (or TLA+, Dafny, Coq, Larch...) proofs are semi-manual. There is a proof-checking step, typically done using a technique called model checking (basically, the system enumerates reachable states and verifies that the invariants hold for all such states).  Then there are higher level theorems that can offer short-cuts: general proof cases that were previously proved and checked that essentially represent forms of meta-reasoning, covering a class of behaviors.  These allow the checker to avoid repeatedly model-checking basic steps and to instead model-check the pattern, and then look for instances of that pattern.  Finally, when the checker stalls, a human can intervene and help it find the proof (or perhaps modify the protocol or the target environment specification, to make it easier to establish the desired property). 

The meta-proofs (the higher level ones) are very interesting and necessary, because they potentially address cases that cannot be model-checked by exhaustively enumerating every reachable state.  For example, suppose that some protocol can generate unbounded runs of states, but a proof exists that every reachable state satisfies a given property.  With that "meta" proof in hand, we don't need to examine every reachable state in order to know that the property always holds, and this might let us make a true statement about the protocol that references the meta-property. In contrast, had we tried to examine every reachable state, the checker would have failed, since the runs are unbounded, hence the number of states needing to be checked is unbounded. 
In practice, the power of a tool like Ivy or NuPRL centers on this kind of meta-reasoning, and the sophistication of the tactics used to match proof requirements to the library of meta-proofs.  Someday, these tactics may approach the sophistication of a human logician, and in fact NuPRL has already been used to semi-automatically solve some deep open problems in logic!  But there are many cases it would not yet be able to tackle.

When checkers fail, they sometimes can exhibit a counterexample: perhaps, a run in which two processes enter "decision" states but hold different outcome values, violating consensus.  But they can also simply fail to find a proof.  This is because a powerful enough specification language can encode undecidable questions, hence one simply cannot automate the entire proof task: there are guaranteed to be some problems on which a prover can neither conclude that a statement is true, nor false.  Moreover, for any finite amount of resources (space, computer time) there are problems that would take much more space, or much more time, to model check.  So for several reasons, the checker might stall.  If it succeeds, though, the result is that we have a proof for our abstract protocol specification: we can show that it solves the problem, as specified.

An interesting point, tangential to what I really want to say here, is that one could perhaps extract minimum conditions for progress (a form of weakest precondition), in the manner common in PL correctness proofs of a few decades ago.  This would automate extraction of assumptions needed to prove "conditional liveness": if the precondition holds, then progress will occur. To me that would be amazing, and would automate a step that Chandra and Toueg first proposed for consensus nearly 20 years ago (the famous <>W oracle).  If you love theory and languages, do consider tackling this challenge!

But staying "on message", the last step with a prover such as Ivy or NuPRL is to extract a provably correct program in C or O'CaML from the proof itself.  That is, given a specification of the problem, and a specification of the protocol, and a proof that the given protocol solves the problem, these tools let you push a button and automatically extract the corresponding code in an executable form, after which you simply compile and run it!  Ideally, the compiler used should be proved correct (there is a proved compiler for a subset of C), and the operating system itself be proved correct (like SEL4), and the hardware verified too (using tools like Verilog) and the foundry that made the chips... well, one goes as far as one can.  In fact any chain of technologies has its limitations.  But let's not fuss over the obvious.

Professor Kuhn works with people developing safety critical systems for self-driving cars, control of potentially dangerous equipment, and so forth, and for her, this new methodology was fascinating.  In fact she isn't the first person I've talked to about this: Chris Hobbs, at QNX, and his colleague Peter Shook, are also extremely interested in the potential of such tools.  By taking the programmer out of the loop and extracting code directly from a proof. we seemingly eliminate a weakness in many kinds of control systems, where one historically would hire a team to design a solution, but then would depend on people to write the required code, after which you use a method such as the "B method" to try an prove the code correct. The resulting human-generated proofs are only as good as the human team carrying them out, and the worry always exists that they might have failed to check some important case.  With a computer-generated proof, there are computable "checking" methods that can quickly establish whether or not every case was model-checked.
And yet even with automated extraction, issues remain.  Can every solvable problem be tackled in this way?  Over coffee, Professor Kuhn and I talked about two concerns. 
First, there are solvable problems for which a model-checker might run low on space or time and throw up its hands, and hence that could be solved in principle, but in fact might not be amenable to formal treatment even so.  So one can imagine situations in which there is some important task, like controlling the flaps on an airplane wing, a good solution that our engineering colleagues want to implement, and a hand-built proof of correctness that "seems" adequate, but for which there isn't any feasible fully-machine checked counterpart, because our computers might not be powerful enough, or perhaps because our techniques for generating model-checked proofs might not have the needed meta-proof methods at hand.

A related issue was raised by Hobbs and Shook at QNX: If a proof (and for that matter, the corresponding C program) is not "natural" in appearance, a human skeptic might not be easily convinced that it is correct.   Thus there are also model-checked proofs that might be rejected by a team tasked with confirming that the airplane is safe to fly: the computer says "yes, here is the proof" but the skeptic in the team might worry that if no human can read and fully understand that proof, it could be flawed in some systematic way and yet this would be overlooked.
That concern leads directly to the second question Professor Kuhn and I found intriguing.  Our conversation ended on a related form of impasse: suppose that the original problem specification itself is somehow inadequate?  For example, one could imagine a specification for behavior of flaps on an airplane wing that leaves out some important case that can actually arise, and hence fails to pin down the correct behavior in that case.  Such a specification leaves undefined the correct behavior for some actual situations that might arise in the real world, although perhaps it covers "most" cases and hence might seem complete and useful: you could still generate code, look at examples, trace the logic, and see that for those examples the code behaves properly.  The lurking danger would be that because there were unspecified but real cases, examples could also exist in which the behavior would be different than intended, because the specification neglected to cover them.
Thus we would want to be sure that the specification itself was adequate in the sense of fully specifying every case that could really arise in practice.  This, though, seems to be a somewhat less fully solved problem.  Indeed, it seems to pose an impossible requirement!

I've been shown some work that looks at specifications, and the question is a classic one: large communities are hard at work on the topic, and have been for decades.  The issue is that the "adequacy of a specification" problem itself seems to be undecideable.  The intuition is sort of obvious: if one could decide all possible adequacy situations, what would stop me from encoding some simple undecideable problem (such as the halting problem), and then using the test of adequacy to solve the undecideable problem?  Seemingly, nothing stops me from doing this.  Thus adequacy-testing should itself be undecideable.  I bet someone has proved this, although I don't think I've ever seen such a paper.  I need to read more of the PL literature.

At any rate, what we see here is that on the one hand, our field seems to be at the edge of a major advance (automated synthesis of provably correct solutions to safety-critical systems problems), but on the other hand, is also approaching a natural and final form of impossibility: the risk that for important cases, it may turn out to be impossible to verify that the specification itself is "complete".

There is reason for hope.  There are many situations in which if one is given a program and must answer a question about it, the task cannot be solved because the given program might encode an undecideable task.  Yet when we use computational methods to generate programs, we can often prove that our methods only generate sound programs. 
Could we solve the adequacy problem in this manner?  For example, is there a way to generate specifications through some kind of dialog with the developer, that would only generate adequate specifications with full coverage of the problem domain?  This, I think, should be possible.  The resulting solution might not be able to specify every problem you can dream up, but for the things it could specify, we would know that the specification doesn't overlook any possible scenarios.
Perhaps the practical class of critical control systems doesn't actually require the ability to specify control tasks that encode undecideable questions.  If so, then perhaps every needed critical control solution can be created.  This would finesse the issue: even though there may be ways to encode undecideability into the same formalism, our more limited use of the methodology would never attempt to do the impossible.
My hope is that we are rapidly heading towards this kind of happy outcome... with all the progress on proofs using NuPRL an Ivy (and other systems like TLA+, Dafny, Larch and Coq), we'll know pretty soon.

Thursday, 17 August 2017

The systems-area obsession with peak performance

In systems, there has always been a completely understandable focus on peak performance.  For me personally, Butler Lampson's early papers on "Hints for Operating System Design" (which basically argued that we need to unclutter the critical path), the famous End to End paper by Saltzer, Reed, and Clark (which argued for taking unnecessary functionality out of the network), and the Birrell and Nelson paper on the performance of Firefly RPC (which argued for taking unnecessary mechanism out of the remote procedure call path) were awe-inspiring classics: papers you reread decades later, and that still amaze.

In fact for people who get pleasure from programming, there is a natural tendency to build systems and evaluate them, and obviously any such task centers on a peak performance perspective.  Optimizing peak performance is fun, and honestly, can be addicting: there is such a direct feedback when you succeed.  It is very rare to see a first version of a system that can't be sped up by a factor of 10 or more even just by doing basic optimizations, and in some cases, we end up with speedups of 100 or 1000-fold, or even more.  What a thrill!
Yet there is a sense in which our love for speed worries me: I wonder if this classical way of thinking about systems might be fading as a pure form of innovation in the eyes of the field as a whole.  The core issue is that existing systems (here I mean the mainstream workhorses: the operating system, the network layer, perhaps the compiler) all work pretty well.  Of course, speeding them up is a worthy endeavor, but it may no longer matter enough to be a justifiable goal in its own right.  Research on performance is just not a compelling story, if you focus on this particular layer.
Why should this matter?  In fact, it might not matter at all, were the systems community itself aligned with these larger external forces that shape the way we are perceived by other communities and by computer science as a discipline.  But right now, I suspect, there is a substantial disconnect: people like me are addicted to speed (hmm... that doesn't sound quite right), while people who hang out at conferences like NPS and KDD don't really spend much time worrying about the performance of the systems components my crowd focuses upon, like the latest version of Linux running on the latest multicore hardware platform.
As I write this blog entry, this same dynamic is evident even within my own research group.  For us, it plays out as a tension between telling the Derecho story as a story about a new concept ("smart memory") and telling it as a story about raw speed ("fastest Paxos and Atomic Multicast, ever!"). 
It seems to me that the broader field tends to value  impact more than what might be called "narrow" metrics, such as the speed of the Linux I/O path.  Invent a new way of doing things, and you can change the world in interesting ways.  So the puzzle that emerges is this: if the systems community has started to drift relative to the broader computer science community as a whole, don't we then run some risk of becoming marginalized, by virtue of over-emphasizing aspects that the broader computer science community views as unimportant, while actually rejecting innovations that the broader community might be thrilled to hear about?

Take Spark, a recent home run story from Berkeley.  If you think back, the first research papers on Spark told a story (primarily) about huge speedups for MapReduce/Hadoop, which  obtained by smarter in-memory caching of files (they call them RDD: Resilient Distributed Data objects) and smarter task scheduling, so that computations would tend to exhibit affinity relative to the cached content.  Years later, it seems clear that the more significant aspect of Spark -- the more impactful innovation --  was that it created a longer term computing "model" in which data loaded into the Databricks system (the new name for Spark) lives in memory, is transformed through various stages of computation, and where the end-user has a powerful new experience of data mining with vastly better performance because these RDDs remain resident in memory as long as there is enough space, if they continue to be used now and then.  Systems people may think of this as a story of performance... but NIPS and KDD people perceive this as a paradigm shift.  As a systems person, it seems to me that our community in fact accepted the right papers, but for the wrong reason, and in fact that the early advising of the Spark inventors (by their faculty mentors) may even have misunderstood the real contribution and steered them down the less vital path.  Of course, the enthusiasm for Spark quickly reset the focus, and today, the Databricks company that offers commercial support for the Spark platform focuses on high-productivity data mining with blazing performance, rather than portraying the story as "speeding up your Hadoop jobs."

It isn't a realistic thing to wish for, but I'll wish for it anyhow: as a field, it seems to me that we need to try and pivot, and to embrace change in terms of what styles of computing really matter.  The world has changed drastically in the past decade or two: what matters most, right now, is machine learning.  This is partly because systems work pretty well.  Disruption comes from big reaches, not small 10x optimizations to things that already worked fairly well.

I don't know anything more about the future than anyone else.  My focus, in Derecho, is on "smart memory," but will this ever become a recognized field of research, one that other people would work on?  Does the world really need smart memory services for IoT applications?  I hope so, but of course that question will be answered by other people, not by me.    And so one can easily understand why my students love the raw speed story: For them, fast replication is a more well-defined systems topic, with an obvious and specific role in existing systems.  People use replication solutions.  So it makes sense for them to gravitate towards speed records.

Indeed, for them, viewing machine learning as the real goal, and performance as just one dimension, makes systems research feel secondary to machine learning research.  Nobody wants to feel like the plumber or the electrician: we all want to build the house itself.  Yet perhaps this is the new reality for systems researchers.
Will such a pivot be feasible?  Perhaps not: the systems addiction to speed runs deep. But at the same time, when I visit colleagues in industry, I find them deeply embedded into groups that are doing important practical tasks that often center on a machine learning objective.  So it seems to me that if we don't evolve in this necessary way, we'll simply fade in importance to the broader field.  We just have to try, even if we might not succeed.

Thursday, 10 August 2017

Zero-copy computing

Last night, one of my group members tossed together a simple experiment on Derecho as the first step towards a much fancier experiment he needs to run.  To his surprise, the performance was a fifth of what we've been seeing in our experiments that will go into the ACM TOCS submission we plan to send out any day now.

What could cause a 5x slowdown?

As it turns out, the issue is easy to understand and points to a deeper and rather interesting puzzle.  RDMA is blazingly fast, as you know if you've read my earlier postings on the topic.  Basically, transfers occurs entirely in hardware: the NIC on machine A talks to the memory unit on machine A and grabs chunks of data, which zip straight over the wire to the NIC on machine B, which stores the data directly into the memory unit on B.  This gives a rate of data transfer that can be much higher than any single-core memcpy operation could approach: even if there is a hardware instruction for copying blocks of memory, that instruction still will operate by a series of fetch and store operations and will need to talk to the memory unit twice for each cache-line-sized chunk of data.

So RDMA tends to run 2x faster than copying.  If you look at an end-to-end pipeline, you'll generally see that RDMA delays are sharply higher than the latency of interacting with local DRAM, but the data transfer speed for a big transfer can maintain this 2x benefit, from data on A all the way to the receive buffer on B.

This is what my student ran into last night.  In his case, by using Derecho to send long null-terminated strings he benefitted because strings are easy to check for correct content ("Hello world, this is update 227!") but very costly to create and transmit.  Derecho was running 5x too slowly because it spent all its time waiting for his expensive string creation code to run, and for Derecho's orderedsend to marshall the objects before sending.  Yet our instinct tells us that those should be viewed as fast operations.  Well, instinct has ceased to be correct: RDMA is a new world.

In fact you've probably thought about the following question.  You are driving on highway 101 in a fancy Tesla sports car with one of those insane speed buttons.  Elon Musk has really pushed the limits and the car can reach 2/3rs of the speed of light.  Yet your commute from Menlo Park to South San Jose takes exactly as long as it took back when you were driving your old Subaru Imprezza!  The key insight isn't a very deep one: even with a supercar, the "barriers" on the highway will still limit you to roughly the same total commute time, if those barriers are frequent enough. 

In modern operating systems and languages, these kinds of barriers are pervasive.

Today's most widely used standard systems maintain data in various typed data structures: strings, classes defined by developers, etc.  Even an object as simple as a string may require byte by byte copying, just to find the null terminating character, and by itself, this will be a further 2.5x slower than RDMA.  So send a string, and your end-to-end numbers might be 5x slower than the best we get from Derecho with byte arrays of known size.  Send a class that needs complex marshaling and the costs go even higher (if the fields can be copied directly from memory, scatter-gather is an option, but otherwise we would often need to first copy the data into a send buffer, then send it, then free the buffer: a sequence that could push us even beyond that 5x slowdown).

What you can see here is that at speeds of 100Gbps or higher, copying is a devastingly slow operation!  Yet in fact, modern operating systems copy like crazy:
  • They copy data from user space into kernel space prior to doing I/O, and back later.
  • Modern languages are very relaxed about creating clones of objects.
  • Other than C++ 14, every method call copies arguments onto the stack, item by item.
  • Garbage collectors copy and compact all over the place.
This suggests that it may be time for someone to create a zero-copy operating system, and perhaps also to start looking carefully at what it would take to do zero-copy versions of modern programming language frameworks like Python, Go, C# and so forth (C++ is pretty good in this way).  Otherwise, as RDMA pushes towards 400 Gbps (200Gbps is already becoming fairly common), and 1Tbps within a decade, we'll find that our RDMA path seems nearly instant... but that the applications just can't benefit!

Monday, 31 July 2017

Why is it so hard to mask failures?

When we talk about fault tolerant distributed computing, using the state machine replication approach, it may seem obvious that a system of this kind should be capable of completely masking failures.  In fact, however, this is not the case.  Our ability to hide failures is really very limited.

When developers use state machine replication techniques (SMR), the usual approach is to replace components of systems or distributed services with groups of N members, and then use some sort of library that delivers the same inputs to each, in the same order.  If the replicated component is deterministic, and if care is taken to initialize each component in a manner properly synchronized with respect to its peers, this is enough to guarantee that the copies will remain synchronized.  Thus, we have an N-replica group that seemingly will tolerate N-1 faults.

Unfortunately, theory is one thing and reality can be quite a different matter.  When people first began to experiment with SMR in the 1990's, developers quickly noticed that because software bugs are a major cause of failure, perfect replication will replicate many kinds of faults!  Over time, a more nuanced approach emerged, in which the various replicas are proactively shut down and restarted in an uncoordinated way, so that on average there would still be N copies, but at any instant in time there might be N-1 copies, with one copy shutting down or rejoining.  The trick is to transfer the current state of the group to the recovering member, and is solved using the virtual synchrony model, in which group membership advances through a series of epochs, reported via view upcall notifications, with state transfers performed during epoch transitions.

The benefit of this sort of staggered restart is to overcome so-called Heisenbugs.  The term refers to bugs that are hard to pin down: they could cause non-deterministic behavior (in which case the replicas might diverge), or bugs that seem to shift around when the developer tries to isolate them.
A common form of Heisenbug involves situations where a thread damages a data structure, but the damage won't be noticed until much later, at which point any of a number of other threads could try to access the structure and crash.  Thus the failure, when it occurs, is associated with logic remote from the true bug, and may jump about depending on scheduling order.  If the developer realizes that the root cause is the earlier damage to the data structure, it generally isn't too hard to fix the problem.  But if the developer is tricked into thinking the bug manifested in the code that triggered the crash, any attempts to modify that logic will probably just make things worse! 

The reason that staggered restart overcomes Heisenbugs is that a restarting program will load its initial state from some form of checkpoint, hence we end up with N copies, each using different operations to reach the same coordinated state as the other N-1.  If the data-structure corruption problem isn't a common thing, this joining process is unlikely to have corrupted the same data structure as did the others.  With proactive restart, all N copies may be in equivalent yet rather different states.  We can take this form of diversity even further by load-balancing read-requests across our N copies: each will see different read operations and this will be a further source of execution diversity, without mutating states in ways that can cause the N replicas to diverse.
With such steps, it isn't hard to build an ultra-resilient SMR service, that can remain alive even through extremely disruptive failure episodes. But can such a service "mask" failures?

The answer is yes and no.

On the "yes" side we find work by Robert Surton at Cornell, who created a clever little TCP fail-over solution called TCP-R.  Using this protocol, a TCP connection can seamlessly roll from one machine (the failed server) to another.  The cleverness arises because of the way that TCP itself handles acknowledgements: in Surton's approach, a service member accepts a TCP connection, reads the request (this assumes that the request size is smaller than the TCP window size, in bytes), replicates the received request using an SMR multicast, and only then allows TCP to acknowledge the bytes  comprising the request. 

If a failure disrupts the sequence, TCP-R allows a backup to take control over the TCP session and to reread the same bytes from the window.   Thus the service is guaranteed to read the request at least once.  A de-duplication step ensures that a request that happens to be read twice won't cause multiple state updates.

Replies back to the end user are handled in a similar way.   The service member about to send the reply first notifies the other members, using a multicast, and only then sends the reply.  If a failure occurs, one of the other members claims the TCP endpoint and finishes the interrupted send.
With TCP-R the end-user's experience is of a fully masked failure: the application sends its request, and the service definitely gets the request (unless all N members crash simultaneously, which will break the TCP session).

Lacking TCP-R, the situation is quite a bit more complex.  In effect, the end-user would need to send the request, but also be prepared to re-send it if the endpoint fails without responding.  For read-only requests, the service can just perform the request multiple times if it shows up multiple times, but updates are more complex.  For these, the service would need logic to deduplicate requests: if the same request shows up twice, it should resend the original reply and not mutate the service state by performing the identical operation a second time.  TCP-R masks the service failure from the perspective of the client, although the service itself still needs this form of deduplication logic.

On the "no" side of the coin, we need to consider the much broader range of situations that can arise in systems that use SMR for fault-tolerance.  In particular, suppose that one SMR-replicated service somehow interacts with a second SMR-replicated service.  Should all N members of the first replica group repeat the same request to the M members of the second group?   Doing so is clearly the most general solution, but runs into the difficulty that bytes will be transferred N times to each member: a high cost if we are simply trying to mask a rare event!

Eric Cooper studied this question in his PhD thesis on a system called Circus, at Berkeley in the 1990's.  Basically, he explored the range of options from sending one request from group A to group B, but reissuing the request if the sender in A failed or the receiver in B, all the way to the full N x M approach in which every member of A multicasts every request to every member of B, and the members of B thus receive N copies and must discard N-1 of them in the usual case.  (TCP-R can be understood as an instance of the first approach, but with the client-side logic hidden under TCP itself, so that only the server has to be aware of the risk of redundancy, and so that it only arises when a genuine failure occurs.)

Cooper pointed out that even with high costs for redundancy, the N x M approach can often outperform any scheme that waits to sense a failure before retrying.  His insight was that because detecting a failure can be slow (often 30s or more), proactively sending multiple copies of each request will generally produce extra load on the receivers, but with the benefit of ensuring a snappy response because at least some receiver will act promptly and send the desired reply with minimal delay.

Cooper's solution, Circus, is best understood as a design pattern: a methodology that the application developer is expected to implement.  It involves a multicast from group A to group B, code in group B to remember recent responses to requests from A, and logic to de-duplicate the request stream, so that whether B receives a request 1 time or N times, it behaves identically and responds to A in the same manner.

In Derecho, we don't currently offer any special help for this heavily redundant approach, but all the needed functionality is available and the design pattern shouldn't be hard to instantiate.  But in fact, many Derecho users are more likely to use a non-fault-tolerant approach when building a data processing pipeline.  More specifically, while Derecho users would often use replication to make the processing elements of the pipeline fault-tolerant, they might decide not to pay the overhead of making the handoff of requests, stage by stage in the pipeline, ultra-reliable.

The reason for this compromise is  that in the IoT settings where a "smart memory service" might be used, most sensors are rather redundant, sending photo after photo of the same car on the highway, or location after location for the cat in the apartment.  The service receives this heavily duplicative input and will actually start by sorting out the good content and discarding the replicated data.  Thus we suspect that most Derecho users will be more concerned with ensuring that the service itself is highly available, and less concerned with ensuring that every single message sent to the service is processed. 

Indeed, in most IoT settings, freshness of data is more important that perfect processing of each and every data point.  Thus, if some camera creates photo X of a vehicle on the highway, and then photo Y, and X is somehow lost because of a failure that occurs exactly as it is being sent, it often would make more sense to not fuss about X and just focus on processing request Y instead.

Microsoft has a system, Cosmos, in which a pipeline of processing is done on images and videos.  It manages without fault-tolerant handoff between stages because failures are rare and, if some object is missing, there is always a simple recipe to create it again from scratch.  Facebook apparently does this too.  Both systems need to be extra careful with the original copies of photos and videos, but computed artifacts can always be regenerated.  Thus, perfect fault tolerance isn't really needed!

Of course, one can easily imagine systems in which each piece of data sent to the service is individually of vital importance, and for those, an open question remains: is it really necessary to hand-code Cooper's Circus design pattern?  Or could there be a really nice way to package the Circus concept, for example by using a higher level language to describe services and compiling them down to SMR replicas that talk to one-another redundantly? 

I view this as an open research topic for Derecho, and one we may actually tackle in coming years.  Until then, Derecho can certainly support a high quality of adaptation after crashes, but won't seamlessly hide crashes from the developer who works with the technology.  But on the other hand, neither does any other technology of which I'm aware!

Saturday, 15 July 2017

How far could a secure Internet get us?

There is a standard refrain among those who teach and do research on security: it holds that the fundamental flaw in security originates with an early decision by the community that created the Internet.  In that early design, Internet routing occurs by a form of peering that can often occur anonymously.  Endpoint systems are not authenticated, and packets generally aren't encrypted or authenticated either.  As a consequence, packets are routed permissively even if they contain what are almost certainly false origin addresses, or have content that seems dubious in obvious ways (like a TCP connection packet that makes no sense on this particular link).

This story then holds that the original sin of the Internet design community has given us layer upon layer of infrastructure that cannot be secured because it resides on a flawed and easily compromised foundation.

As it turns out, there is little doubt that we actually can build an Internet that corrects that deep flaw, and that can be relatively compatible with the existing Internet.  Enterprises like military systems and security-conscious companies do it all the time.   Let me sketch out how this is done (it isn't rocket science, and you won't find many surprises).  But then let's ask if it would really be a game-changer for the security of higher level applications.

To appreciate the core challenge, it helps to start by understanding that everything you think you know about the network is really just an abstraction layered over something more basic.  For example, even my advanced students tend to think of Internet addresses as identifiers, as if my computer's true name were and your computer's true name was  In fact, these addresses are better understood as routing data: useful for deciding what the next hop should be as a packet progresses hop-by-hop through the network, but lacking any deep connection to identity. 

In fact, a single computer might have many network addresses (one for each of its network interfaces).  It might host multiple virtual machines, and those could each have virtual network addresses of their own: with virtualization, any computer can function as a kind of Internet and host large numbers of computers that run within it, complete with virtual networks, virtual name spaces, services of various kinds, you name it.  A single computer might move about, so that over short periods of times it takes on a sequence of addresses: the old addresses can cease to work, and traffic needs to be redirected to the new ones.  Addresses can be mapped by network address translators.

Once we free ourselves from this false equivalence of network address to identity, you need to ask what the real root of identity should be, in a modern system.  Here, the proper focus is on hardware security tokens combined with human users who authenticate themselves in the usual ways.  The hardware could be a built-in component of the computer, or it could be some form of plug-in.  But the key point is that when we associate authentication with these unforgeable hardware elements, used in tandem, we arrive at a much stronger notion of endpoint identity.  The deep roots of that form of identity reside in key pairs: the identity defines some form of private key, with which information can be authenticated by public components that only have access to the corresponding public key.

This then is our bootstrap opportunity: we live in a vast world of devices: computers, routers, IoT components like video cameras and smart televisions and smart cars, and each element can be understood as either being anonymous (if it lacks the ability to authenticate itself), or capable of proving that it has "authorized access" to some private key.  With that proof, we can consult a vendor-published registry and from that registry, can learn about this endpoint device.  A device can also be registered in secondary registries: when I bring a new router into my smart home, I could register my router as "Ken's router, at his summer cottage on Cayuga Lake".  And now there would be a trustworthy chain of reasoning that would let you convince yourself that certain messages were indeed sent by, or countersigned by, my router.

Sounds familiar?  It should, if you've ever brought your laptop from home to work.  Some companies won't allow you to connect the laptop at all (fearing viruses that your machine might carry), but those that do usually require precisely this sort of authentication and registration.

Given strong authentication, a second opportunity arises whenever we focus on an island of infrastructure that has a coherent owner.  Here on Cayuga Lake, my network provider is part of a broader system owned and controlled by a company that controls a fairly large regional network.  This ISP is paid for its services, and at least in theory, has complete control of every device that can connect directly to it, every router and switch it operates internally, and every network link used to glue them all together.  One can understand the ISP as a kind of military hierarchy: at the core we find a "master and commander" who has absolute control and is the sole source of permissions.  Below the master and commander are a collection of subordinates, to whom restrictive roles have been granted, and way down at the bottom are the common seaman who might permit packets originating in my router to sail the seas of this regional network -- or could block them.

As it happens, today's ISPs are somewhat relaxed about authentication, so one opportunity to secure the network would start by simply enforcing authentication when devices are connected to an ISP.  Today when I buy a new iPhone, I can simply start to use it in my home.  If I was trying to build a much more secure network, at a minimum my router might pop up a screen requiring me to fill in details: who owns this iPhone (and it might want proof: type in your password, your RSA code number, and hold your thumb on the fingerprint reader...)  Perhaps, that regional ISP would do the same and require a second level of device registration.  Military and other nationally critical infrastructure networks work precisely in this way: if you were to take a laptop into a major power plant and connect it to the Ethernet jack in the waiting room while waiting for your appointment with human resources, either it won't connect, or it will give you some form of very limited guest connectivity. 

Think back to the last time you took your laptop to work.  I bet that something along the lines just described happened then, too.  Any organization that takes security at all seriously is careful to track the devices connected to it, and to limit their networking "power".  A military system won't allow you to connect your own machine, at all.  But if you work at Cornell, like I do, you might be able to get permission -- except that your machine will end up registered for use in a specific context, such as from my office in the Computer Science building.  If I were to carry it across the street and connect it to a wall jack in the ECE department, I would be back to square zero.

With enclaves that use authentication, one can take further steps.  Without too much added cost, packets can be cryptographically signed or fully encrypted at the endpoint, yielding various forms of virtual private networks: subnetworks within which communication is safe, but with strong protection against traffic from the outside leaking in, or against intruders managing to read or tamper with data.  Such systems can also filter or block traffic that might try to escape the secure perimeter.

I worked at Microsoft for a few months in 2016, and they adopted this approach.  I could only connect to their system via a secured VPN, and the security perimeter it enforced when I worked from home was very carefully controlled and monitored.  I could continue projects from work while at home, but I could never have wandered the virtual Microsoft network with impunity from outside the office.  In my office, I had somewhat more relaxed access to internal software tools and projects.

This, then, gives some sense of what a secure Internet would look like.  But how secure can such a system really be?

As we push from the lowest layers of hardware and software up to higher levels of abstraction, the numbers of elements of a modern system increase exponentially.  There are dozens of operating systems and each has hundreds of variants and patch levels.  So the very first layer already has a level of diversity measureable in the thousands.  Then there are hundreds of programming tools and languages and services that can run on them, configurable in hundreds of ways, to say nothing of all the management options.  Layer by layer, we bring in a surreal degree of diversity simply by virtue of the choices made by each system designer and vendor.

In settings like military systems, or power grids, a major effort is invested to keep control over the forms of diversity that are actually present in the deployed system.   Individual users aren't permitted to install their own applications, patches are applied in a coordinated way, and monitoring tools are used to notice unexpected behavior that could point to an intrusion.  In contrast, networks used in other settings need to deal with enormous levels of diversity and individual customization.  Like it or not, the network provider simply won't know what the network is being used to do.

It seems to me that this diversity accounts for the insecurity of modern systems, to a far greater degree than the "original sin" of endpoint anonymity and unauthenticated peering.  While the insecurity of the network certainly makes it easier for attackers to mount denial of service attacks or to route spam emails with falsified origin data, those are just the a small aspect of the larger problem.  And that larger problem centers on the exceptionally large attack surface that modern systems offer: hundreds of millions if not billions of lines of code, riddled with bugs, and configured in every imaginable way (just yesterday I was horrified to discover that my home router has an administrative login and password, both set to the defaults.  Now I'm wondering about my television, and my internet box...). 

Fixing the original sin of the Internet genuinely would help in some ways, especially as we move to deploy an Internet of Things with more and more devices playing a diversity of roles in homes, cars, highways and other settings.  We should take that step.  But it is an illusion to imagine that true software security can be achieved by hardening the network itself, because the extreme diversity of uses would overwhelm any systematic attempt to impose security standards and ensure that they are respected.  Attempting to do so would probably create some very frustrated users, and yet would at best raise the bar slightly for one facet of a complex, multi-faceted problem.

More interesting, I think, is to think about diversity as a constructive tool for protecting large, complex systems.  Fred Schneider and I once coauthored an editorial on this topic, and a short paper, and I continue to believe that this was one of our better efforts.  Synthetic diversity, in particular, is a remarkable tool for combatting the wily attacker, who often has a surprisingly limited arsenal of off the shelf exploits and might be blocked by even small "surprises".

The basic idea is simple: just as a compiler takes source code and then can create a variety of executables (depending on the target instruction set, operating system, optimization level selected, etc), we can also "compile" programs to expose various forms of artificial diversity.  We can force the memory layout to look somewhat random (simply by adding random padding to objects allocated dynamically).  We can vary the stack layout and the order in which inputs are delivered if a program receives data from multiple sources.  We can potentially compile one program several ways, and pick the version that will be running today at random within the resulting set.  We can randomize the numbering used for operating systems calls.

Such steps diversity and obfuscate the attackable surface.  The attacker who was using an exploit that overruns an input buffer in a way that triggers a system call to download and then execute a bot will now run into an unpredictable memory location, a buffer that might not overflow, and in any case the system call to read from the network socket might have been remapped to some normally-unassigned code.  These are steps that really can pay off.

VPN security offers opportunities too.  Obviously, surfing the web requires that there be no barriers.  But for an application like a medical system that manages patient records or interacts with a lab, there is no real reason to also be able to surface the web, and there is no reason that random web programs should be able to communicate to that application, either.  VPNs can offer this sort of protection, and if we could deploy them more easily, they could wrap sensitive code in layers of armor.

So count me on the side of those who believe that Internet security can and should be a priority, particularly in critical infrastructure and Internet of Things scenarios.   It really is about time to make Internet authentication a readily available and widely standard function, and to deploy VPN technologies much more actively.  But doing so won't solve the whole problem.  We should also make more of an effort to synthetically diversify the systems we deploy widely, so that the attacker will encounter a bewildering variety of "versions" of any particular software.   

If you can't build a single impregnable castle, the next best thing is to populate a huge city with every imaginable variation on the castle theme.  Put police checkpoints on the roads leading to the city water pumping and power generating systems.  Post signs in a few languages, including some you made up just for this purpose.   Good luck to that hacker: he might break into one system, but doing so won't get him very far before we root him out...

Sunday, 11 June 2017

Moving AI to the edge

In today's data centers, a dominant paradigm shapes machine learning and AI systems: massive amounts of data are collected, cleaned, and stored into some form of database or collection of files.  Then machine learning tools are used to train a model on the observed data, and finally the resulting model is used for a while in the edge application.  All of this takes time, so the edge applications operate using stale data, at least to some degree.

In online AI/Ml systems, like smart highways controlling smart cars, smart homes, or the smart power grid, a pervasive need for instant responsiveness based on the most current data is a distinguishing characteristic: today's standard cloud systems can definitely react to new events extremely rapidly (100ms or less is the usual goal), but because the edge runs on cached data -- in this situation, cached models -- and these platforms can't update their models promptly, they will continue to use a stale model long after something fundamental has changed, invalidating it. 

So why would we care about stale models?  The term model, as used by the ML community, refers to any concise representation of knowledge.  For example, on a highway, knowledge of a particular truck's behavior (its apparent route, history of speeds and lane changes, perhaps any observations of risks such as a tire that could be shredding, or a piece of cargo that might not be properly tied down) are all part of the model.  In a continuous learning setting, the model shapes behavior for all the smart cars in the vicinity.  In a smart power grid, the model is our estimate of the state of the grid; if the grid starts to show an oscillatory imbalance or signs of a shortage of power, the model can evolve in milliseconds, and the grid control algorithms need to adjust accordingly.  Yesterday's model, or even the one from ten seconds ago, might not be acceptable.

What will it take to move AI to the edge?
  • The AI/ML community will need to figure out what aspects of their problems really need to run on the edge, and are incorrectly situated on the standard backend today.  Coding at the edge won't be the same as coding for the backend, so work will be required.  This suggests that many applications will need to be split into online and offline aspects, with the online parts kept as slim as possible.  The offline components will be easier to implement because they run in a more standard way.
  • We need programming tools and platforms for edge-oriented AI/ML tools.  I'm betting that Derecho can be the basis of such a solution, but I also think it will take a while to reach that point.  We'll need to understand what edge-hosted AI/ML tools will actually look like: how will they represent, store, and access machine-learned models?  How big are these models, and what data rates arise?  Where are the low-latency paths, and how low does  latency need to be?
  • We may need to integrate hardware accelerators into the infrastructure: if a system does vision, it probably wants to use GPU accelerators for image segmentation, tagging and for operations such as rotation, alignment, debluring, 3-D scene reconstruction, etc.  FPGA components offer a second rapid model, more focused on searching HTML text or other "byte stream" objects.  FPGA accelerators are also useful for doing quick cryptographic operations.  There could easily be other kinds of ASICs too: DFFT units, quantum thinkers, you name it. 
  • All of these pieces need to synchronize properly and be easy to program in a "correct" way...
I'm fascinated by this whole area.  If it interests you too,  send me a note:  perhaps we can find ways to team up!

Monday, 29 May 2017

Byzantine clients

Although I personally don't work on Byzantine block chains, we happen to be in the midst of a frenzy of research and hiring and startups centered on this model.  As you probably know, a block chain is just a sequence (totally ordered) of records shared among some (perhaps very large) population of participating institutions.  There is a rule for adding the next block to the end of the chain, and the chain itself is fully replicated.  Cryptographic protection is used to avoid risk of a record being corrupted or modified after it is first recorded.  A Byzantine block chain uses Byzantine agreement to put the blocks into order, and to force the servers to vote on the contents of every block.  This is believed to yield ultra robust services, and in this particular use-case, ultra robust block chains.

Financial institutions are very excited about this model, and many have already started to use a digital contracts language called hyper-ledger that seems to express a wide range of forms of contracts (including ones that might have elements that can only be filled in later, or that reference information defined in prior blocks), and one can definitely use block chains to support forms of cyber currency, like BitCoin (but there are many others by now).  In fact block chains can even represent actual transfers of money: this is like what the SWIFT international banking standard does, using text messages and paper ledgers to represent the transaction chain.

Modern block chain protocols that employ a Byzantine model work like this: we have a set of N participants, and within that set, we assume that at most T might be arbitrarily faulty: they could collude, might violate the protocol, certainly could lie or cheat in other ways.  Their nefarious objective might be to bring the system to a halt, to confuse non-faulty participants about the contents of the blockchain or the order in which the blocks were appended, to roll-back a block (as a way to undo a transaction, which might let them double-spend a digital coin or renege on a contract obligation), etc.  But T is always assumed to be less than N/3.  

Of course, this leads to a rich class of protocols, very well-studied, although (ironically), more often in a fully synchronous network than in an asynchronous one.  But the so-called PRACTI protocols created by Miguel Castro with Barbara Liskov work in real-world networks, and most block chain systems use some variation of them.  They basically build on consensus (the same problem solved by Lamport's Paxos protocol).  The non-compromised participants simply outvote any Byzantine ones.

What strikes me as being of interest here is that most studies have shown that in the real world, Byzantine faults almost never arise!   And when they do, they almost never involve faulty servers. Conversely when attackers do gain control, they usually compromise all the instances of any given thing that they were able to attack successfully.  So T=0, or T=N.

There have been a great many practical studies on this question.  I would trace this back to Jim Gray, who once wrote a lovely paper on "Why Do Computers Stop and What Can Be Done About It?".  Jim was working at Tandem Computers at that time, on systems designed to tolerate hardware and software problems.  Yet they crashed even so.  His approach was to collect a lot of data and then sift through it.

Basically, he found that human errors, software bugs and design errors were a much bigger problem then hardware failures.  Jim never saw any signs of Byzantine faults (well, any fault is Byzantine.  But I mean malicious behaviors, crafted to compromise a system).

More recent studies confirm this.  At Yahoo, for example,  Ben Reed examined data from a great many Zookeeper failures, and reported his findings ina WIPS talk at SOSP in 2007.  None were Byzantine (in the malicious sense).

At QNX Chris Hobbs has customers who worry that very small chips might experience higher rates of oddities that would best be modeled as Byzantine.  To find out, he irradiated some chips in a nuclear reactor, and looked at the resulting crashes to see what all the bit flips did to the code running on them (and I know of NASA studies of this kind, too).  In  fact, things do fail.  But mostly, by crashing.  The main issue turns out to be undetected data corruption, because  most of the surface area on the chips in our computers is used for SRAM, caching, and DRAM  storage.  Data protected by checksums and the like remains safe, but these other forms are somewhat prone to undetected bit flips.  But most data isn't in active use, in any case: most computer memory just has random stuff in it, or zeros, and the longest lived active form of memory is to hold the code of the OS and the application programs.  Bit-flips will eventually corrupt instructions that do matter, but corrupted instructions mostly trigger faults.  So, it turns out that the primary effect of radiation is to raise the rate of sudden crashes.

Back when Jim did his first study, Bruce Nelson built on it by suggesting that the software bugs he was seeing fall into two cases: Bohrbugs (deterministic and easily reproduced, like Bohr's model of the atom: an easy target to fix) and Heisenbugs (wherever you look, the bug skitters off to somewhere else).  Bruce showed that in a given version of a program, Bohrbugs are quickly eliminated, but patches and upgrades often introduce new ones, creating an endless cycle.  Meanwhile, the long-lived bugs fell into the Heisenbug category, often originating from data structure damage "early" in a run that didn't cause a crash until much later, or from concurrency issues sensitive to thread schedule ordering.  I guess that the QNX study just adds new elements to that stubborn class of Heisenbugs.

So, we don't have Byzantine faults in servers, but we do have a definite issue with crashes caused by bugs, concurrency mistakes, and environmental conditions that trigger hardware malfunctions. There isn't anything wrong with using Byzantine agreement in the server set, if you like.   But it probably won't actually make the service more robust or more secure.

Bugs can be fought in many ways.  My own preferred approach is simpler than running Byzantine agreement.  With Derecho or other similar systems, you just run N state machine replicas doing whatever the original program happened to do, but start them at different times and use state transfer to initialize them from the running system (the basis of virtual synchrony with dynamic group membership).

Could a poison pill kill them all at once?  Theoretically, of course (this is also a risk for a Byzantine version).  In practice, no.  By now we have thirty years of experience showing that in process group systems, replicas won't exhibit simultaneous crashes, leaving ample time to restart any that do crash, which will manifest as occasional one-time events.

Nancy Leveson invented a methodology called N-Version programming; her hypothesis was that if most failures were simply due to software bugs, then that by creating multiple versions of the most important services, you could mask the bugs because coding errors would probably not be shared among the set.  This works, too, although she was surprised at how often all N versions were buggy in the same way.  Apparently, coders often make the same mistakes, especially when they are given specifications that are confusing in some specific respect.

Fred Schneider and others looked at synthetic ways of "diversifying" programs, so that a single piece of code could be used to create the versions: this method automatically generates a bunch of different versions from one source file, with the same input/output behavior.  You get N versions too, often find that concurrency problems won't manifest in the identical way across the replicas, and they also are less prone to security compromises!

Dawson Engler pioneered automated tools for finding bugs and even for inferring specifications.  His debugging tools are amazing, and with them, bug-free code is an actual possibility.

Servers just shouldn't fail in weird, arbitrary ways anymore.  There is lots of stuff to worry about, but Byzantine compromise of a set of servers shouldn't be high on the radar.

Moreover, with strong cloud firewalls, Intel SGX, and layers and layers of monitoring, I would bet that the percentage of compromises that manage to take control of between and N/2-1 replicas (but not more), or. even of attacks that look Byzantine is even lower.

But what we do see  (including Reed's 2007 study), are correct services that come under attack from some form of malicious client.  For an attacker, it is far easier to hijack a client application than to penetrate a server, so clients that try to use their legitimate  connectivity to the server for evil ends are a genuine threat.  In fact with open source, many attackers just  get the client source code, hack it, then run the compromised code.  Clients that simply send bad data without meaning to are a problem too.

The client is usually the evil-doer.  Yet the Byzantine model blames the server, and ignore the clients.

It seems to me that much more could be done to characterize the class of client attacks that a robust service could potentially repel.  Options include monitoring client behavior and blacklisting any clients that are clearly malfunctioning, capturing data redundantly and somehow comparing values, so that a deliberately incorrect input can be flagged as suspicious or even suppressed entirely (obviously, this can only be done rarely, and for extreme situations), or filtering client input to protect against really odd data.  Gun Sirer and Fred Schneider once showed that a client could even include a cryptographic  proof that input strings really came from what the client typed, without tampering.

Manuel Costa and Miguel Castro came up with a great system they called Vigilante, a few years ago.  If a server was compromised by bad client input, it detected the attack,  isolated the cause and spread the word instantly.  Other servers could then adjust their firewalls to protect themselves, dynamically.  This is the sort of thing we need.

So here's the real puzzle.  If your bank plans to move my accounts to a block chain, I'm ok with that, but don't assume that BFT on the block chain secures the solutions.  You need to also come up without a way to protect the server against buggy clients, compromised clients, and clients that just upload bad data.  The "bad dudes" are out there, not inside the data center.  Develop a plan to to keep them there!