Thursday, 7 September 2017

Inadequacy of the Paxos specification, and what we can learn from the issue

In a blog ten days ago, I discussed the issue of specifications that omit coverage for cases that actually arise in real systems.  Since then two colleagues who follow the blog asked for examples to illustrate the issues, so I thought I might say a few more words on this, focusing on the classic specification of Paxos: Leslie Lamport's solution to the State Machine Replication problem (also sometimes called Consensus, or Uniform Agreement).

The traditional specification of Paxos has the following elements:
  • A specified set of participants.
  • An assignment of roles {leader, acceptor, learner} to the participants.  Each can have more than one role, and we often think of external clients as adding two additional roles: {command-initiator, command-consumer}. 
    • A leader runs a protocol for putting new commands into the Paxos log.
    • An acceptor holds a Paxos log (an ordered list of slots, each of which can be empty, or can hold a command), and information about which commands are known to have committed.  Any given acceptor might have gaps: slots for which it lacks the committed command (on top of this, there is also a somewhat subtle failure case in which a slot will permanently be left empty, so that every acceptor would have a gap at that spot).
    • A learner runs a protocol for computing the full list of committed commands and their ordering.  Any single acceptor might have gaps in its log, so the learner does this by merging logs in order to fill those gaps.
    • Paxos really says nothing at all about command initiators and consumers, except that the initiator waits for a response to the request that the command be posted (in doing so, Paxos eliminates what we sometimes refer to as a causal order obligation, in which a system is expected to track and respect ordering on pending asynchronous actions).
  • A rule for what it means for a command to be valid (non-triviality). 
  • Agreement on ordering.
  • Durability: in any state where a Paxos service responds to "learn" requests, it needs to return the entire list of previously ordered commands. 
Many authors just focus on the three bolded properties.  Yet notice that from this set of five elements, we can easily discern a whole series of additional, unspecified aspects:
  • There is a Paxos reconfiguration protocol, but it doesn't introduce additional specification elements, except to change the set of participants.  Yet there are several aspects one would normally wish to see addressed:
    • Malkhi has convincingly argued that one should terminate a membership epoch before starting actions in the next epoch.  This avoids having a command that was initiated in epoch k commit much later, perhaps in epoch k+1.  While Lamport has often said that this isn't necessarily a bad thing, Malkhi's point is that a delayed commit can be confusing, because many systems operate in a configuration-sensitive way.
    • A new member of a Paxos acceptor group should probably not have an initially empty log.  A means of performing state transfer to it is thus required.  Simply copying some existing log to the joiner is incorrect, because by doing so, a command that previously lacked a quorum and hence was uncommitted (Paxos has a scenario in which it leaves a slot "empty") can become committed because duplicating a log effectively duplicates a vote for that command.
    • When Paxos restarts after all its members crash, the protocol doesn't specify the rule for figuring out the proper initial configuration for the restarted service (this matters with dynamic membership).
  • The specification says nothing about flow-control, yet if a Paxos protocol has no flow control mechanism at all, a quorum of acceptors could advance unboundedly far into the future relative to a stalled or failed acceptor.  This might mean that the faulty acceptor has no feasible way to catch up later, and in effect, would decrease the resilience of the service: rather than having N members, we would have to think of it as having just N-1.
  • The specification says nothing about sizes of objects (size of commands), yet acceptors will presumably have bounded amounts of memory, and might not be able to accept arbitrarily large objects.  Solving this isn't necessarily hard (one could have commands that point to external objects), but then the objects would be less replicated than the commands, and one has to ask whether that somehow violates the durability property.
  • The specification says nothing about fairness or other quality of service properties, such as timeliness of response.  Yet real systems need fairness when many clients share a Paxos service and all the clients want a fair-share of access.  In fact, one can then ask what specific notion of fairness is desired: should it be round-robin (like in Derecho)?  Or should some clients be allowed to send 2x more commands than others, because they happen to be busier?  Should a slower client be "delayed" by activity of a faster client?
  • I mentioned that Paxos seemingly excludes situations where clients might issue a series of requests, without waiting for replies.  In practice, this is common (clients often "stream" requests).  We would want the client ordering to be respected by the system: if a client sends request A, then B, Paxos should preserve the A happened before B relationship.
  • At Google, the Paxos owner (in fact the leader of the team responsible for the Chubby service) pointed out to me that his big issue is concerned with wide area deployments of Paxos, which introduces a whole set of issues Lamport never considered:
    • Proper specification of a hierarchical Paxos.  Guerraoui and Pedone and Quema all have looked at this question, primarily in connection with Ring-Paxos protocols.
    • Heavy tailed behaviors.  Chubby struggles to deal with very delayed data that can be caused by overloaded or damaged WAN links.
    • One way to get around "late" data is to design systems that assume they have correct and current data, for example using locks or leases.  Chubby does this, but it turns out that when one does so, getting WAN service instances to release their read locks so that the data can be updated, or preventing them from renewing those leases, can be very slow.  This might violate a specification that requires fast normal read behavior, but also requires a fast way to be able to update "rarely changed" configuration parameters or program versions or other forms of WAN data that does change now and then.
Beyond these points, one encounters a further concern.  The Paxos specification is ideally suited to verifying the correctness of Paxos, with respect to its own "promises".  But the specification doesn't tell us anything at all about correct use of Paxos, or required behavior of the application using the service. 

For example, suppose that Paxos isn't the real repository for data but is playing an intermediary role: the program using the Paxos service might itself be a replicated database, or some other form of replicated service that wants updates delivered in a deterministic order.
  • Do we expect that such a service would always be able to use the identical order to the Paxos log?  What does this imply about the specification the client service must respect?
  • On restart from crashes, we now have two forms of state: state in the durable Paxos logs, and state in the database service replicas.  How should these be reconciled?
  • We won't want the Paxos state to grow without bounds, so we will need to truncate the Paxos logs.  There is a truncate protocol in Paxos, but what obligations fall upon the client service that wishes to make use of that truncate command?  How does truncation interplay with failure cases that might arise?
Believe or not, I could actually go on and list even more issues!  My point, though, is that when we use Lamport's core specification, we are really working from an inadequate specification that omits major, very important aspects of the real service we intend to build and use. 

But notice how hard it would be to check a Paxos specification for adequacy.  We would need a fairly elaborate (and adequate) specification of the environment, and of our larger goals.  Otherwise, questions such as flow-control or other aspects of bounding resource consumption, or of client state, could never even be posed.  So there is sense of chicken and egg: to understand if a Paxos specification is adequate, we really need an adequate specification of the setting where Paxos will be used, so that we can study the questions it poses about the Paxos service per-se.

I'll stop on that point, but there is more one can say about adequacy of specifications.  A good topic for some other posting down the road...


  1. Having only dabbled at the edges, this may be totally off-base: Is a completely specifiable system somehow trivial? (Or, am I translating my 70s computability learnings to a domain where they're not relevant?)



    1. Yes, in a way you are correct. If you can fully specify a problem, and can prove that the resulting problem statement is solvable in what is called a " constructive" way, the code can be extracted from the proof. So in this sense, creating the specification and convincing yourself that the specification describes a good solution to your goal, leaves a surprisingly straightforward last step.