Friday, 17 March 2017

The CAP conjecture is dead. Now what?

CAP has been around now for something like 15 years.  There are some circles in which the acronym is even more famous than FLP or SAT!  But CAP was never a theorem (let's call it a "folk-theorem"), and by now we have more and more examples of systems that violate CAP. 

Yet even if CAP is false, it was also a fantastic rule of thumb that was incredibly valuable to developers tasked with building high performance distributed systems on the cloud.  If we start to teach students that CAP is false, what can replace it in this role?

A little historical context: CAP is short for "you can only have two from Consistency, Availabilty and Partition Tolerance",  This assertion was initially put forward by Eric Brewer in a PODC keynote talk he gave in 2000. The justification he offered ran along the following lines.  First, he observed that in today's increasingly globalized web systems, we invariably deploy services with high latency WAN links between them.  These are balky, so services are forced to make a choice: either respond right now based on data locally available, or await restoration of the WAN link.  A tradeoff that obviously favors availability over consistency, and already tells us that many web services will have to find a way to prevent responses that reflect stale data from being noticed by users, or if they are noticed, from causing problems.  He used the term "partition tolerance" for this kind of fault-tolerance (namely, giving a response even when some link is down).  Hence the P in CAP.

He suggested that we are favoring "A and P over C".

Then he observed that even in a single data center, if you want the highest possible levels of performance, you'll need to handle web requests on edge nodes that take their data from cache, without first talking a backend server first: again weaker consistency, but higher availability.  So again, we see the A, as a kind of synonym for rapid response.

So he looked at all the different pairing: C and A over P, C and P over A, A and P over C.  He concluded that in practice we always seem to find that by taking A and P we get the best scalability.  But CAP itself asserted that although other mixes work, you always have to pick the two you like best, and you'll erode the third.

Although the definitions of A and P are a bit strange (P seems to have a bit of A mixed in, and also seems to sometimes mean fault-tolerance, since partitions never really arise within a data center), CAP is sort of catchy.  But like many such acronyms, much depends on the details: if you try and give rigorous definitions, to turn CAP into a theorem (people have done so), you find that it only holds in some narrow situations.

The result is that as professors teaching cloud computing, we generally treat CAP as a clever acronym, but one that works mostly as a rule of thumb.  In some sense, CAP is a useful kind of folklore.

CAP took hold back in 2000 because at that time, companies like eBay and Amazon were struggling with the high costs of ACID transactions in systems that the database SQL programming model.  Scalable database performance poses issues that are much more nuanced than the ones Eric had in mind: there were puzzles of lock conflict, complexity, data pipelines with non-trivial asynchronous ordering requirements, etc.  But the bottom line is that performance of the big systems at eBay and Amazon was erratic and often strayed outside the magic 100ms target for web-service and web-page responses.  This is the point at which a human user feels that the system is "snappy" and everyone wants to be in the sub-100ms range.

So, the technology leaders at eBay began to tell their developers that it was absolutely fine to write SQL code as a starting point in web application design (a pragmatic decision: they people they were hiring mostly had extensive SQL experience from their database courses), but that once the SQL code was working, to weaken the transactions by turning them into a series of individual atomic actions.   eBay began to talk about the resulting development methodology using a new acronym: BASE, by which they meant "Basically Available, Softstate systems with Eventual Consistency."

Notice that BASE doesn't abandon consistency.  Instead, it points out to the developer that many web systems just don't need ACID guarantees to work correctly.   ACID and SQL are great for creating a quick prototype that will be robust and easy to debug, but then you "optimize" it by taking away the ACID properties, without breaking the behavior in ways that violate the specification.

Amazon embraced BASE too.  Around 2006, they decided to rebuild many of their core applications around a key-value technology called Dynamo, but SQL users found it hard to use, and by 2008 the adoption of Dynamo began to falter.  To make the transition easier, Amazon layered in a NoSQL API for Dynamo, called Dynamo-DB: now SQL code could run on Dynamo, but with weaker guarantees than for a full SQL system (for example, NoSQL lacks join operations), and Dynamo-DB happened to be especially well-matched to BASE.

So you can see from these examples why CAP would be such convenient way to motivate developers to make the switch: it more or less tells them that if they optimize their code using BASE, it won't scale properly.  Moreover, the eBay memos about BASE include step by step instructions to explain precisely how to go about doing it.

Today, fifteen years later,  we've discovered more and more ways to implement scalable, high performance cloud services, edge caches that maintain coherence even at global scale, fully scalable transactional key-value storage systems, and the list goes on.  Derecho is one example: it helps you build systems that are highly Available, Replicated and Consistent.  Call this ARC.

The cool twist is that with ARC, you get lock-free strong consistency, right in the cloud edge! This is because we end up with very fast replication at the edge, and every replica is consistent.  You do need to think hard about your update sources and patterns if you hope to avoid using locks, but for most applications that aspect is solvable because in the cloud, there is usually some sense in which any particular data item has just once real update source.  So there is an update "pipeline" and once you have consistency in the system, you end up with a wide range of new options for building strongly consistent solutions.

An example I like very much was introduced by Marcos Aguilera in Sinfonia.  In that system, you can make strongly consistent cache snapshots, and run your code on the snapshot.  At massive scale you have as many of these snapshots as needed, and transactions mostly just run at the edge.  But when you want to update the system state, the trick he suggested is this: run your transaction and keep track of what data it read and wrote (version numbers).  Now instead of just responding to the client, generate a "minitransaction" that checks that these version numbers are still valid, and then does all the writes.  Send this to the owner of the true database: it validates your updates and either commits by applying the updates, or rejects the transaction, which you can then retry.

Since so much of the heavy lifting is down when speculatively executing the transactions at the first step, the database owner has way less compute load imposed on it.  Then you can start to think about systems with sharded data that has different owners for each shard, and limits the transactions to run within a single shard at a time (the trick is to design the sharing rule cleverly).  Sinfonia and the follow-on systems had lots of these ideas.

ARC creates a world in which solutions like the Sinfonia one are easy to implement -- and the Sinfonia approach is definitely not the only such option.

I'm convinced that as we move the cloud towards the Internet of Things services, developers will need this model,  because inconsistency in a system that controls smart cars or runs the power grid can wreak havoc and maybe even would be dangerous.

So now I want to claim that ARC is the best BASE story ever!  Why?  Well, remember that BASE is about basically available, soft-state systems with eventual consistency.  I would argue that in examples like the Sinfonia one, we see all elements of the BASE story!

For example, a Sinfonia system is basically available because with enough consistent snapshots you can always do consistent read-only operations at any scale you like.  The state is soft (the snapshots aren't the main copy of the system: they are replicas of it, asynchronously being updated as the main system evolves).  And they are eventually consistent because when you do make an update, the mini-transaction validation step lets you push the update results into the main system, in a consistent way.  It might take a few tries, but eventually, should succeed.

What about the eventual consistency aspect of BASE in the case of ARC?   ARC is about direct management of complex large-scale strongly consistent replicated state.   Well, if you use ARC at the edge, back-end servers tend to be doing updates in a slightly time-lagged way: batched updates and asynchronous pipelines improve performance.  Thus they are eventually consistent, too: the edge queues an update, then responds to the end-user in a consistent way, and the back end catches up soon afterwards.

And ARC isn't the only such story:  my colleague Lorenzo Alvisi has a way to combine BASE with ACID: he calls it SALT, and it works really well.

So, feeling burned by CAP?  Why not check out ARC?  And perhaps you would like a little SALT with that, for your databases?

No comments:

Post a Comment