Sunday, 11 June 2017

Moving AI to the edge

In today's data centers, a dominant paradigm shapes machine learning and AI systems: massive amounts of data are collected, cleaned, and stored in some form of database or collection of files.  Then machine learning tools are used to train a model on the observed data, and finally the resulting model is used, for a while, by the edge application.  All of this takes time, so the edge applications operate on stale data, at least to some degree.

In online AI/ML systems, like smart highways controlling smart cars, smart homes, or the smart power grid, a pervasive need for instant responsiveness based on the most current data is a distinguishing characteristic.  Today's standard cloud systems can definitely react to new events extremely rapidly (100ms or less is the usual goal), but because the edge runs on cached data -- in this situation, cached models -- and these platforms can't update their models promptly, they will continue to use a stale model long after something fundamental has changed, invalidating it. 

So why would we care about stale models?  The term model, as used by the ML community, refers to any concise representation of knowledge.  For example, on a highway, knowledge of a particular truck's behavior (its apparent route, its history of speeds and lane changes, perhaps observations of risks such as a tire that could be shredding, or a piece of cargo that might not be properly tied down) is all part of the model.  In a continuous learning setting, the model shapes behavior for all the smart cars in the vicinity.  In a smart power grid, the model is our estimate of the state of the grid; if the grid starts to show an oscillatory imbalance or signs of a shortage of power, the model can evolve in milliseconds, and the grid control algorithms need to adjust accordingly.  Yesterday's model, or even the one from ten seconds ago, might not be acceptable.
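
To make "model" concrete in the highway example, here is a minimal C++ sketch (the type and field names are mine, purely illustrative, not taken from any real system) of a per-vehicle model that is folded forward on every new observation, so a query always sees the freshest estimate rather than the output of last night's batch training run:

    #include <algorithm>
    #include <chrono>

    // Hypothetical per-vehicle model: smoothed estimates that are refreshed
    // in O(1) on every sensor observation -- no offline retraining pass.
    struct VehicleModel {
        double est_speed = 0.0;        // meters/second, exponentially weighted
        double lane_change_rate = 0.0; // lane changes per observation, smoothed
        double risk_score = 0.0;       // 0 = benign, 1 = stay far away
        std::chrono::steady_clock::time_point last_update =
            std::chrono::steady_clock::now();

        void observe(double speed, bool changed_lane, double hazard_signal) {
            constexpr double alpha = 0.2;  // smoothing factor (an assumption)
            est_speed = alpha * speed + (1 - alpha) * est_speed;
            lane_change_rate = alpha * (changed_lane ? 1.0 : 0.0)
                             + (1 - alpha) * lane_change_rate;
            risk_score = std::max(risk_score * 0.99, hazard_signal);
            last_update = std::chrono::steady_clock::now();
        }
    };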

What will it take to move AI to the edge?
  • The AI/ML community will need to figure out what aspects of their problems really need to run on the edge, and are incorrectly situated on the standard backend today.  Coding at the edge won't be the same as coding for the backend, so work will be required.  This suggests that many applications will need to be split into online and offline aspects, with the online parts kept as slim as possible.  The offline components will be easier to implement because they run in a more standard way.
  • We need programming tools and platforms for edge-oriented AI/ML tools.  I'm betting that Derecho can be the basis of such a solution, but I also think it will take a while to reach that point.  We'll need to understand what edge-hosted AI/ML tools will actually look like: how will they represent, store, and access machine-learned models?  How big are these models, and what data rates arise?  Where are the low-latency paths, and how low does  latency need to be?
  • We may need to integrate hardware accelerators into the infrastructure: if a system does vision, it probably wants to use GPU accelerators for image segmentation and tagging, and for operations such as rotation, alignment, deblurring, 3-D scene reconstruction, etc.  FPGA components offer a second form of acceleration, more focused on searching HTML text or other "byte stream" objects.  FPGA accelerators are also useful for doing quick cryptographic operations.  There could easily be other kinds of ASICs too: DFFT units, quantum thinkers, you name it. 
  • All of these pieces need to synchronize properly and be easy to program in a "correct" way...
I'm fascinated by this whole area.  If it interests you too,  send me a note:  perhaps we can find ways to team up!

Monday, 29 May 2017

Byzantine clients

Although I personally don't work on Byzantine block chains, we happen to be in the midst of a frenzy of research and hiring and startups centered on this model.  As you probably know, a block chain is just a sequence (totally ordered) of records shared among some (perhaps very large) population of participating institutions.  There is a rule for adding the next block to the end of the chain, and the chain itself is fully replicated.  Cryptographic protection is used to avoid risk of a record being corrupted or modified after it is first recorded.  A Byzantine block chain uses Byzantine agreement to put the blocks into order, and to force the servers to vote on the contents of every block.  This is believed to yield ultra robust services, and in this particular use-case, ultra robust block chains.
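
To make the data structure concrete, here is a minimal C++ sketch of a hash-chained ledger (my own toy, not any production format): each block carries the hash of its predecessor, so altering an old record breaks every later link.  A real system would use a cryptographic hash such as SHA-256; std::hash below is only a stand-in.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy hash-chained ledger.  std::hash is NOT cryptographic; it just
    // stands in for something like SHA-256 in this sketch.
    struct Block {
        std::size_t prev_hash;   // hash of the previous block
        std::string payload;     // the record: a transaction, contract, ...
        std::size_t hash() const {
            return std::hash<std::string>{}(std::to_string(prev_hash) + payload);
        }
    };

    struct Chain {
        std::vector<Block> blocks;
        void append(std::string payload) {
            std::size_t prev = blocks.empty() ? 0 : blocks.back().hash();
            blocks.push_back(Block{prev, std::move(payload)});
        }
        // True only if no historical block was altered after being appended.
        bool verify() const {
            for (std::size_t i = 1; i < blocks.size(); ++i)
                if (blocks[i].prev_hash != blocks[i - 1].hash()) return false;
            return true;
        }
    };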

Financial institutions are very excited about this model, and many have already started to use a digital contracts framework called Hyperledger that seems able to express a wide range of forms of contracts (including ones that might have elements that can only be filled in later, or that reference information defined in prior blocks).  One can definitely use block chains to support forms of cyber currency, like Bitcoin (but there are many others by now).  In fact block chains can even represent actual transfers of money: this is akin to what the SWIFT international banking standard does, using text messages and paper ledgers to represent the transaction chain.

Modern block chain protocols that employ a Byzantine model work like this: we have a set of N participants, and within that set, we assume that at most T might be arbitrarily faulty: they could collude, might violate the protocol, certainly could lie or cheat in other ways.  Their nefarious objective might be to bring the system to a halt, to confuse non-faulty participants about the contents of the blockchain or the order in which the blocks were appended, to roll-back a block (as a way to undo a transaction, which might let them double-spend a digital coin or renege on a contract obligation), etc.  But T is always assumed to be less than N/3.  
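
The arithmetic behind that bound is worth spelling out.  A tiny sketch, assuming the standard quorum sizes used by PBFT-style protocols: with N = 3T + 1 replicas and quorums of 2T + 1, any two quorums intersect in at least T + 1 replicas, i.e., in at least one correct replica, which is what prevents the faulty ones from getting two conflicting decisions accepted.

    // Tolerating T arbitrary (Byzantine) faults takes N >= 3T + 1 replicas;
    // each decision requires 2T + 1 matching votes, so any two quorums share
    // at least T + 1 members -- at least one of which must be correct.
    constexpr int replicas_needed(int t) { return 3 * t + 1; }
    constexpr int quorum_size(int t)     { return 2 * t + 1; }

    static_assert(replicas_needed(1) == 4, "one fault: four replicas");
    static_assert(quorum_size(1) == 3,     "and quorums of three");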

Of course, this leads to a rich class of protocols, very well studied, although (ironically) more often in a fully synchronous network model than in an asynchronous one.  But the PBFT (Practical Byzantine Fault Tolerance) protocol created by Miguel Castro with Barbara Liskov works in real-world networks, and most block chain systems use some variation of it.  These protocols basically build on consensus (the same problem solved by Lamport's Paxos protocol).  The non-compromised participants simply outvote any Byzantine ones.

What strikes me as being of interest here is that most studies have shown that in the real world, Byzantine faults almost never arise!   And when they do, they almost never involve faulty servers. Conversely when attackers do gain control, they usually compromise all the instances of any given thing that they were able to attack successfully.  So T=0, or T=N.

There have been a great many practical studies on this question.  I would trace this back to Jim Gray, who once wrote a lovely paper on "Why Do Computers Stop and What Can Be Done About It?".  Jim was working at Tandem Computers at that time, on systems designed to tolerate hardware and software problems.  Yet they crashed even so.  His approach was to collect a lot of data and then sift through it.

Basically, he found that human errors, software bugs and design errors were a much bigger problem than hardware failures.  Jim never saw any signs of Byzantine faults (well, strictly speaking any fault is Byzantine; what I mean here is malicious behavior, crafted to compromise a system).

More recent studies confirm this.  At Yahoo, for example, Ben Reed examined data from a great many Zookeeper failures, and reported his findings in a WIPS talk at SOSP in 2007.  None were Byzantine (in the malicious sense).

At QNX, Chris Hobbs has customers who worry that very small chips might experience higher rates of oddities that would best be modeled as Byzantine.  To find out, he irradiated some chips in a nuclear reactor, and looked at the resulting crashes to see what all the bit flips did to the code running on them (and I know of NASA studies of this kind, too).  In fact, things do fail, but mostly by crashing.  The main residual issue turns out to be undetected data corruption, because most of the surface area on the chips in our computers is devoted to SRAM caches and DRAM storage.  Data protected by checksums and the like remains safe, but unprotected memory is somewhat prone to undetected bit flips.  Most data isn't in active use in any case: much of computer memory just holds random stuff, or zeros, and the longest-lived actively used memory holds the code of the OS and the application programs.  Bit flips will eventually corrupt instructions that do matter, but corrupted instructions mostly trigger faults.  So it turns out that the primary effect of radiation is to raise the rate of sudden crashes.

Back when Jim did his first study, Bruce Nelson built on it by suggesting that the software bugs he was seeing fell into two categories: Bohrbugs (deterministic and easily reproduced, like Bohr's model of the atom: an easy target to fix) and Heisenbugs (wherever you look, the bug skitters off to somewhere else).  Bruce showed that in a given version of a program, Bohrbugs are quickly eliminated, but patches and upgrades often introduce new ones, creating an endless cycle.  Meanwhile, the long-lived bugs fell into the Heisenbug category, often originating from data structure damage "early" in a run that didn't cause a crash until much later, or from concurrency issues sensitive to thread schedule ordering.  I guess that the QNX study just adds new elements to that stubborn class of Heisenbugs.

So, we don't have Byzantine faults in servers, but we do have a definite issue with crashes caused by bugs, concurrency mistakes, and environmental conditions that trigger hardware malfunctions. There isn't anything wrong with using Byzantine agreement in the server set, if you like.   But it probably won't actually make the service more robust or more secure.

Bugs can be fought in many ways.  My own preferred approach is simpler than running Byzantine agreement.  With Derecho or other similar systems, you just run N state machine replicas doing whatever the original program happened to do, but start them at different times and use state transfer to initialize them from the running system (the basis of virtual synchrony with dynamic group membership).
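
A rough sketch of the shape of that idea (not Derecho's actual API): every replica applies the same totally ordered updates to a deterministic state machine, and a replica started later is initialized by copying the state of a live member rather than replaying history from time zero.

    #include <map>
    #include <string>

    // Deterministic state machine; every replica applies the same ordered log.
    struct Replica {
        std::map<std::string, std::string> state;

        void apply(const std::string& key, const std::string& value) {
            state[key] = value;  // must be deterministic so replicas stay identical
        }

        // The "state transfer": a late-joining replica copies a checkpoint
        // from a running member instead of replaying the whole history.
        void initialize_from(const Replica& live_member) {
            state = live_member.state;
        }
    };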

Could a poison pill kill them all at once?  Theoretically, of course (this is also a risk for a Byzantine version).  In practice, no.  By now we have thirty years of experience showing that in process group systems, replicas won't exhibit simultaneous crashes, leaving ample time to restart any that do crash, which will manifest as occasional one-time events.

Nancy Leveson (with John Knight) famously evaluated a methodology called N-version programming; the hypothesis behind it was that if most failures are simply due to software bugs, then by creating multiple versions of the most important services, you could mask the bugs, because coding errors would probably not be shared among the set.  This works, too, although she was surprised at how often all N versions were buggy in the same way.  Apparently, coders often make the same mistakes, especially when they are given specifications that are confusing in some specific respect.

Fred Schneider and others looked at synthetic ways of "diversifying" programs, so that a single piece of code could be used to create the versions: the method automatically generates a bunch of different versions from one source file, all with the same input/output behavior.  You get N versions too, and often find that concurrency problems won't manifest in an identical way across the replicas; the diversified replicas are also less prone to security compromises!

Dawson Engler pioneered automated tools for finding bugs and even for inferring specifications.  His debugging tools are amazing, and with them, bug-free code is an actual possibility.

Servers just shouldn't fail in weird, arbitrary ways anymore.  There is lots of stuff to worry about, but Byzantine compromise of a set of servers shouldn't be high on the radar.

Moreover, with strong cloud firewalls, Intel SGX, and layers and layers of monitoring, I would bet that the percentage of compromises that manage to take control of some replicas but not all of them (say, up to N/2-1), or even of attacks that genuinely look Byzantine, is lower still.

But what we do see (including in Reed's 2007 study) is correct services coming under attack from some form of malicious client.  For an attacker, it is far easier to hijack a client application than to penetrate a server, so clients that try to use their legitimate connectivity to the server for evil ends are a genuine threat.  In fact, with open source, many attackers just get the client source code, hack it, then run the compromised code.  Clients that simply send bad data without meaning to are a problem too.

The client is usually the evil-doer.  Yet the Byzantine model suspects the servers, and ignores the clients.

It seems to me that much more could be done to characterize the class of client attacks that a robust service could potentially repel.  Options include monitoring client behavior and blacklisting any clients that are clearly malfunctioning, capturing data redundantly and somehow comparing values, so that a deliberately incorrect input can be flagged as suspicious or even suppressed entirely (obviously, this can only be done rarely, and for extreme situations), or filtering client input to protect against really odd data.  Gun Sirer and Fred Schneider once showed that a client could even include a cryptographic  proof that input strings really came from what the client typed, without tampering.
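
As an illustration of the monitor-and-blacklist option (my own sketch, not a description of any particular system), a service could keep a per-client count of implausible inputs and cut off clients that cross a threshold:

    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    // Sketch: reject clients whose inputs repeatedly fail sanity checks.
    class ClientGuard {
        std::unordered_map<std::string, int> strikes_;
        std::unordered_set<std::string> blacklist_;
        static constexpr int kMaxStrikes = 3;  // arbitrary threshold

    public:
        // 'plausible' would wrap schema checks, range checks, or comparison
        // against redundantly captured data.
        bool admit(const std::string& client_id, bool plausible) {
            if (blacklist_.count(client_id)) return false;
            if (!plausible && ++strikes_[client_id] >= kMaxStrikes)
                blacklist_.insert(client_id);
            return plausible;
        }
    };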

Manuel Costa and Miguel Castro came up with a great system they called Vigilante, a few years ago.  If a server was compromised by bad client input, it detected the attack,  isolated the cause and spread the word instantly.  Other servers could then adjust their firewalls to protect themselves, dynamically.  This is the sort of thing we need.

So here's the real puzzle.  If your bank plans to move my accounts to a block chain, I'm ok with that, but don't assume that BFT on the block chain secures the solution.  You also need to come up with a way to protect the server against buggy clients, compromised clients, and clients that just upload bad data.  The "bad dudes" are out there, not inside the data center.  Develop a plan to keep them there!

Friday, 26 May 2017

More thoughts on the learnability of implicit protocols

While driving from Sebastopol (in Sonoma County, north of San Francisco) to Silicon Valley (which is south of San Francisco), one has a lot of time to think about how human-operated cars coordinate and implicitly communicate.  So I thought I might follow up on the posting I did a few weeks ago concerning the learnability of implicit protocols and see if I can't make the question a bit more concrete.

The best way to simplify a problem is to think of a base case, so I started by asking myself about the simplest model I could: vehicles modeled as points and a highway modeled as a straight line.

Interactions between vehicles would basically define a graph: any given car gets a chance to interact with the car in front of it, and the one behind it, and the protocol (whatever it might be) is like a state machine on these graphs.

So what behaviors would constitute a protocol in such a situation?  Well, we presumably would have car A in state Sa, and car B in state Sb, and then some "communication" takes place.  Not too many options: they either draw closer to one another at some speed, move apart from one another at some speed (I mean the speed of closing or separating, though probably the joint speed and direction of movement needs to be specified too), or perhaps one or the other brakes sharply or accelerates sharply.  So here we have a "vocabulary" for the cars, with which they communicate.
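
In code, that vocabulary might be nothing more than a small enum plus a reaction rule: car A is in some state, observes one of these signals from B, and transitions.  A toy sketch (all of the names are mine):

    // Toy vocabulary for two point-cars on a line.
    enum class Signal { CloseIn, FallBack, BrakeHard, Accelerate };
    enum class State  { Cruising, Yielding, Tailgating, BackingOff };

    // The "protocol" is just a transition rule: given my current state and
    // the signal I observe from the other car, pick my next state.
    State react(State mine, Signal observed) {
        switch (observed) {
            case Signal::CloseIn:    return State::Yielding;
            case Signal::BrakeHard:  return State::BackingOff;
            case Signal::Accelerate: return State::Cruising;
            case Signal::FallBack:   return mine;  // no change needed
        }
        return mine;
    }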

A protocol, then, would be a rule by which A reacts to inputs of this form from B, and vice versa, since any action by B on A is also perceived by B itself (that is, even when B probes A, B itself may be forced to take some reactive action too: the human version is that you are trying to shift from some horrendously bad radio station to a different, slightly less horrible alternative when you look up and notice that you've gotten much closer to the car in front of you than you expected.  So in this case it was your own darn fault, but just the same, you react!)

So to learn a protocol, we might imagine a series of "tests" by which vehicle B, approaching A from behind, experiments to investigate A's reaction to various inputs, recording these into a model that over time would converge towards B's approximation of A's algorithm.  A similarly is learning B's behavior (and A might elicit behaviors from B.  For example, sometimes a car comes up on my tail very close, and especially if we are already basically moving at the fastest possible speed, it annoys me enough that I gradually slow down, rather than speeding up, as he or she probably intended for me to do -- there is an optimal distance at which to follow me, if you are trying to do so.  Similarly for most drivers.  Drive 3x further back and I might ignore you completely.  Drive at half that distance and my behavior departs from the norm: I slow down, which is suboptimal for both of us.  My way of punishing your pushy behavior!).

If you could figure out how two cars can learn a protocol, then you could ask if this generalizes to situations that form trains of cars.  So here, B catches up with a train: not just A, but perhaps X-Y-Z-A, with A in the rear and some form of agreed-upon protocol in use by X,Y,Z and A.  B gets to join the train if it can learn the protocol and participate properly.  Failing to do so leaves B abandoned, or results in an accident, etc.

Optimal behavior, of course, maximizes speed and minimizes the risk of accidents.  I'm starting to see how one could add a kind of utility function here: B wants to learn A's protocol quickly and "efficiently", with as few tests as possible.  Then B wants to use the knowledge gained to achieve this highest possible speed "jointly" with A.  That defines a paired protocol, and the generalization becomes a train protocol.

Another observation: Suppose that B has a small set of models in a kind of grab-bag of models that cars commonly use.  Now B's job of learning the model is potentially easier: if A is using one of the standard models, it may only take a very small number of tests to figure that out.  Moreover, having a good guess might make some models "learnable" that would actually not be learnable with a small number of tests otherwise.  This would fit well with the experience I've had in New Jersey, Belgium, Tel Aviv and California: each country has its own norms for how closely cars tend to space themselves on high speed throughways and in other situations, so while you can feel disoriented at first, in fact this only adds up to four models.  If someone handed you those four models, I bet you could figure out which one applies with very few experiments.
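
One way to picture the grab-bag idea in code: keep a likelihood score for each of the handful of candidate models, update the scores after every probe, and commit once one candidate dominates.  A sketch (the candidate "models" here are just preferred following distances, purely illustrative):

    #include <cstddef>
    #include <vector>

    // Each candidate model predicts the gap (in meters) the car ahead keeps.
    struct CandidateModel { const char* name; double predicted_gap; };

    // Score candidates by how well they predicted the observed gaps; with a
    // small candidate set, a few observations usually single one out.
    std::size_t best_model(const std::vector<CandidateModel>& models,
                           const std::vector<double>& observed_gaps) {
        std::size_t best = 0;
        double best_score = -1e300;
        for (std::size_t i = 0; i < models.size(); ++i) {
            double score = 0;
            for (double gap : observed_gaps) {
                double err = gap - models[i].predicted_gap;
                score += -err * err;  // Gaussian log-likelihood, up to a constant
            }
            if (score > best_score) { best_score = score; best = i; }
        }
        return best;
    }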

Unless, of course, you accidentally end up behind a New Jersey driver who is learning the rules of the road in Tel Aviv for the first time.  I think I've encountered a few such cases too.

So that would be a further interesting case to consider: roads with mixtures of models: subsets of cars, with each subset using a distinct model.  In fact my story of New York City cab drivers, from last time, would fall into that category.  The cabs figure out how to form school-of-fish packs that flow around obstacles and other cars, no matter what those other cars might be doing.  Meanwhile, there could easily be other protocols: New York public transit buses, for example, have their own rules (they basically are bigger than you, and can do whatever they like, and that is precisely how they behave).

Conversely, the New Jersey driver also illustrates a complication: the implicit protocols of interest to me are cooperative behaviors that maximize utility for those participating in the protocol, and yield higher utility than could be gained from a selfish behavior.  But any vehicle also will have an autonomous set of behaviors: a protocol too, but one that it engages in purely for its own mysterious goals.  And these selfish behaviors might not be at all optimal: I recently watched a teenage driver in the car behind me putting on makeup while talking on a cell phone and apparently texting as well.  So this particular driver wasn't maximizing any normal concept of utility!  And all of us have experienced drivers in the grips of road-rage, driving hyper aggressively, or teenage guys hot-dogging on motorcycles or weaving through traffic in their cars.  Their sense of utility is clearly very warped towards speed and has a very diminished negative utility for things like accidents, traffic violations and tickets, or sudden death.  So even as we learn to cooperate with vehicles prepared to cooperate, and need to figure out the protocols they are using, the environment is subject to heavy noise from these kinds of loners, inattentive drivers, reckless drivers, and the list goes on.

One thing that strikes me is that the cab driver protocol in New York, once you know it, is a good example of a very easily discovered policy.  Despite a huge level of non-compliant vehicles, anyone who knows the cab protocol and drives using it will quickly fall into synchrony with a pack of others doing so.  So the level of positive feedback must be massive, if we had the proper metric, relative to the level of noise. 

Interestingly, there is clearly a pack-like behavior even for a line of cars driving on a single lane.  This would be a very good case to start by understanding.

Of course the real-world school-of-fish behaviors involve (1) shifting lanes, which is the easy case, and (2) ignoring lane boundaries, which is a harder case.  Given that even the cars-on-a-string case taxes my imaginative capabilities, I think I'll leave the full 2-dimensional generalization to the reader!

Saturday, 13 May 2017

Smart Memory: How tight will the timing loop need to be?

Smart memory systems that integrate scalable data capture, computation and storage with machine-learning technologies represent an exciting opportunity area for research and product development.  But where should a smart memory live in today's data center computing "stack"?

Cornell's work on Derecho has been focused on two design points.  One centers on Derecho itself, which is offered as a C++ software library.  Derecho is blazingly fast, both in terms of data capture and replication/storage throughput and in terms of latency, but it can only be used by C++ programmers: unlike other C++ libraries that can easily be loaded by languages like Java or Python, modern C++ has many features that don't currently have direct mappings into what those languages can support (variadic template expansion, constant-expression evaluation, conditional compile-time logic), hence it will be a while before this kind of very modern C++ library is accessible from other languages. 
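
To give a flavor of why the bindings are hard, here is the kind of modern C++ such a library leans on (a generic example of my own, not Derecho code): variadic expansion plus compile-time branching means the generated code depends on the argument types, so there is no single entry point for a JNI- or ctypes-style binding to latch onto.

    #include <iostream>
    #include <string>
    #include <type_traits>

    template <typename... Args>
    void publish(const Args&... args) {
        auto emit = [](const auto& a) {
            // 'if constexpr' selects code at compile time, per argument type.
            if constexpr (std::is_arithmetic_v<std::decay_t<decltype(a)>>)
                std::cout << "number: " << a << '\n';
            else
                std::cout << "object: " << a << '\n';
        };
        (emit(args), ...);  // C++17 fold expression over the parameter pack
    }

    int main() {
        publish(42, std::string("hello"), 3.14);
    }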

The other main design point has simply been a file system API: POSIX plus snapshots.  Derecho incorporates an idea from our Freeze Frame File System in this respect: the file system snapshots are generated lazily (so there is no "overhead" for having a snapshot and no need to preplan them or anything of that kind), and our snapshots are temporally precise and causally consistent. 

Today, Derecho v1 (the version on GitHub) lacks this integration, but fairly soon we will upload a Derecho v2 that saves data into a novel file system that applications can access via these normal file system APIs.  One can think of the snapshots as read-only views frozen in time, with very high accuracy.  Of course the output of programs that compute on a snapshot would be files too, but those wouldn't be written into the original snapshot: they get written to some other read/write file system.
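
The intended usage pattern, roughly: an analysis job opens files under a snapshot whose name encodes the time of interest, using ordinary POSIX-style calls, and writes its results elsewhere.  The path convention below is hypothetical, purely to illustrate the idea.

    #include <fstream>
    #include <string>

    // Hypothetical layout: /mnt/warehouse/snapshots/<time-in-ms>/... exposes a
    // read-only, temporally precise view; results go to a normal writable tree.
    int main() {
        std::string t = "1496150400000";  // some instant, in milliseconds
        std::ifstream in("/mnt/warehouse/snapshots/" + t + "/highway/segment42.dat");
        std::ofstream out("/mnt/results/segment42.analysis");

        std::string line;
        while (std::getline(in, line)) {
            // ... an off-the-shelf ML tool's preprocessing would go here ...
            out << line.size() << '\n';  // placeholder computation
        }
    }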

This leads to a model in which one builds a specialized data capture service to receive incoming sensor data (which could include video streams or other large objects), compute upon the data (compress, deduplicate, segment/tag, discard uninteresting data, etc).  The high-value content needs to be replicated and indexed.  Concurrently, machine learning algorithms would run on snapshots of the file representation, enabling the user to leverage today's powerful "off the shelf" solutions, running them on a series of temporal snapshots at fine-grained resolution: our snapshots can be as precise as you like, so there is no problem accessing data even at sub-millisecond resolution (of course in the limit, the quality of your GPS time sources does impose limitations).

The puzzle is whether this will work well enough to cover the large class of real-time applications that will need to respond continuously as events occur.  In the anticipated Derecho smart-warehouse the performance-critical step is roughly this: data is received, processed, stored and then the machine learning applications can process it, at which point they can initiate action.  The key lag is the delay between receipt of data and stability of the file system snapshot: at least a few milliseconds, although this will depend on the size of the data and the computed artifacts.  One can imagine that this path could be quite fast, but even so, the round-trip cycle time before an action can occur in the real-world can obviously easily grow to 10's of ms or more.  We plan to do experiments to see how good a round-trip responsiveness we can guarantee.  

The alternative would move the machine learning into the C++ service, but this requires coding new ML applications directly in C++.  I'm assuming that doing so would be a large undertaking, just considering how much existing technology is out there, and might have to be duplicated.  The benefit of doing so, though, would be that we could then imagine response times measured in milliseconds, not tens or hundreds of milliseconds.

I would be curious to hear about known real-time data warehousing use cases, and the round-trip response times they require. 

Monday, 24 April 2017

Will smart memory need an ACID model?

This is really part II of my posting on smart memory from ten days ago.

Let's imagine that we've created the ultimate data warehouse, using Derecho.  This warehouse hosts terabytes of persistent memory.  It can absorb updates at a staggering rate (hundreds of gigabits per second), has tens of thousands of processors that crunch the data down and then store it as collections of higher-level "knowledge models", and is always hungry for more.  

We used to think of data warehouses as repositories for nuggets of data, so perhaps we could call these "muggets": model nuggets that can now be queried.  Hopefully this term isn't painfully cute.

Next, we imagine some collection of machine learning applications that consume these muggets and learn from them, or compute some sort of response to them.  For example if the muggets represent knowledge collected on the local highway, the queries might be posed by smart cars trying to optimize their driving plans.  "Is it ok with everyone if I shift to the passing lane for a moment?"  "Does anyone know if there are obstacles on the other side of this big truck?"  "What's the driving history for this motorcycle approaching me: is this guy some sort of a daredevil who might swoop in front of me with inches to spare?"  "Is that refrigerator coming loose, the one strapped to the pickup truck way up ahead?"

If you drive enough, you realize that answers to such questions would be very important to a smart car!  Honestly, I could use help with such things now and then too, and my car is pretty dumb.

We clearly want to answer the queries with strong consistency (for a smart car, a quick but incorrect answer might not be helpful!), but also very rapidly, even if this entails using slightly stale data.  In Dercho, we have a new way to do this that adapts the FFFS snapshot approach described in our SOCC paper to run in what we call version vectors, which is how Derecho stores volatile and persistent data. Details will be forthcoming shortly, I promise.

Here's my question: Derecho's kind of data warehouse currently can't support the full ACID database style of computation, because at present, Derecho only offers read-only consistent queries against its temporally precise, causally consistent snapshots.  So we have strong consistency for updates, which are totally ordered atomic actions against sets of version vectors, and strong consistency for read-only queries, but not for read/write queries, where an application might read the current state of the data warehouse, compute on it, and then update it.  Is this a bad thing?
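
To pin down the distinction, here is a strawman of the interface (my own sketch, not Derecho's API): updates append new, totally ordered versions, and queries read a consistent state as of some time t, but there is deliberately no primitive that reads, computes, and writes back atomically.

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <string>

    // Strawman "version vector": each key maps time -> value.
    class VersionedStore {
        std::map<std::string, std::map<uint64_t, double>> versions_;

    public:
        // Totally ordered update (in Derecho this ordering would come from
        // the atomic multicast; here it is just an append).
        void update(const std::string& key, uint64_t t, double value) {
            versions_[key][t] = value;
        }

        // Read-only, temporally consistent query: value of key as of time t.
        double query(const std::string& key, uint64_t t) const {
            const auto& history = versions_.at(key);
            auto it = history.upper_bound(t);
            return (it == history.begin()) ? 0.0 : std::prev(it)->second;
        }

        // Deliberately absent: a read-modify-write transaction that locks the
        // current state, computes, and updates it atomically -- the ACID case.
    };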

I've been turning the question over as I bike around on the bumpy, potholed roads here in Sebastopol, where we are finishing up my sabbatical (visiting my daughter, but I've also dropped in down in the valley and at Berkeley now and then).  Honestly, cyclists could use a little smart warehouse help too, around here!  I'm getting really paranoid about fast downhills with oak trees: the shadows invariably conceal massive threats!  But I digress...

The argument that Derecho's time-lagged model suffices is roughly as follows: ACID databases are hard to scale, as Jim Gray observed in his lovely paper "The Dangers of Replication and a Solution".  Basically, the standard model slows down as n^5 (n is the number of nodes running the system).  This observation gave us CAP and BASE and, ultimately, today's wonderfully scalable key-value stores and NoSQL databases.  But those have very weak consistency guarantees.

Our Derecho warehouse, sketched above (fleshed out in the TOCS paper we are just about to submit) gets a little further. Derecho can work quite well for that smart highway or similar purposes, especially if we keep the latency low enough.  Sure, queries will only be able to access the state as of perhaps 100ms in the past, because the incoming database pipeline is busy computing on the more current state.  But this isn't so terrible.

So the question we are left with is this: for machine learning in IoT settings, or similar online systems, are there compelling use cases that actually need the full ACID model?  Or can machine learning systems always manage with a big "nearly synchronous" warehouse, running with strong consistency but lagged 100ms or so relative to the current state of the world?  Is there some important class of control systems or applications that would be ruled out by that limitation?  I really want a proof, if so: show me that omitting the full ACID behavior would be a huge mistake, and convince me with mathematics, not handwaving... (facts, not alt-facts).

I'm leaning towards the "Derecho would suffice" answer.  But very curious to hear other thoughts...

Saturday, 15 April 2017

Will smart memory be the next big thing?

We're fast approaching a new era of online machine learning and online machine-learned behavior: platforms that will capture real-time data streams at high data rates, process the information instantly, and then act on the output of the processing stage.  A good example would be a smart highway that tells cars what to expect up around the next curve, or a real-time television feed integrated with social networking tools (like a world-cup soccer broadcast that lets the viewers control camera angles while their friends join from a remote location).

If you ask how such systems will need to be structured, part of the story is familiar: as cloud-hosted services capturing data from sensors of various kinds (I'm including video cameras here), crunching on it, then initiating actions. 

A great example of such a service is Microsoft's Cosmos data farm.  This isn't a research platform but there have been talks on it at various forums.  The company organized a large number of storage and compute nodes (hundreds of thousands) into a data warehouse that absorbs incoming objects,  then replicates them onto a few SSD storage units (generally three replicas per object, with the replication pattern fairly randomized to smooth loads, but in such a way that the replicas are on fault-independent machines: ones that live in different parts of the warehouse and are therefore unlikely to crash simultaneously).

Once the data is securely stored, Cosmos computes on it: it might compress or resize an image or video, or it could deduplicate, or run an image segmentation program.  This yields intermediary results, which it also replicates, stores, and then might further process: perhaps, given the segmented image, it could run a face recognition program to automatically tag people in a photo.  Eventually, the useful data is served back to the front-end for actions (like the smart highway that tells cars what to do).  Cold but valuable data is stored to a massive backend storage system, like Microsoft's Pelican shingled storage server.  Unneeded intermediary data is deleted to make room for new inputs.
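
The placement step deserves a sketch of its own (purely illustrative, not Cosmos internals): randomize which machines hold the copies, but constrain them to distinct failure domains, such as different racks or power units, so that no single event takes out all three.

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <set>
    #include <vector>

    struct Node { int id; int failure_domain; };  // e.g., rack or power unit

    // Pick 'copies' nodes at random, all in distinct failure domains.
    std::vector<Node> place_replicas(std::vector<Node> nodes,
                                     std::size_t copies, std::mt19937& rng) {
        std::vector<Node> chosen;
        std::set<int> used_domains;
        std::shuffle(nodes.begin(), nodes.end(), rng);
        for (const Node& n : nodes) {
            if (used_domains.insert(n.failure_domain).second) {
                chosen.push_back(n);
                if (chosen.size() == copies) break;
            }
        }
        return chosen;  // shorter than 'copies' only if we run out of domains
    }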

Thus Cosmos has a lot in common with Spark, the famous data processing platform, except on a much larger scale, and with an emphasis on a pipeline of transformations rather than on MapReduce.

If we step way back, we can start to perceive Cosmos as an example of a smart memory system: it remembers things, and also can think about them (process them), and could potentially query them too, although as far as I know Cosmos and Spark have limited interactive database query functionality.  But you could easily imagine a massive storage system of this kind with an SQL front-end, and with some form of internal schema, dynamically managed secondary index structures, etc.

With such a capability, a program could interrogate the memory even as new data is received and stored into it.  With Cornell's Derecho system, the data capture and storage steps can be an asynchronous pipeline that would still guarantee consistency.  Then, because Derecho stores data into version vectors, queries can run asynchronously too, by accessing specific versions or data at specific times.  It seems to me that the temporal style of indexing is particularly powerful.

The interesting mix here is massive parallelism, and massive amounts of storage, with strong consistency... and it is especially interesting that in Derecho, the data is all moved asynchronously using RDMA transfers.  Nobody has to wait for anything, and queries can be done in parallel.

Tomorrow's machine learning systems will surely need this kind of smart memory, so for those of us working in systems, I would say that smart memory architectures jump out as a very exciting next topic to explore.  How should such a system be organized, and what compute model should it support?  As you've gathered, I think Cosmos is already pretty much on the right track (and that Derecho can support this, even better than Cosmos does).  How can we query it?  (Again, my comment about SQL wasn't haphazard: I would bet that we want this thing to look like a database.)  How efficiently can it use the next generation of memory hardware: 3-D XPoint, phase-change storage, high-density RAID-style SSD, RDMA and GenZ communication fabrics?

Check my web page in a few years: I'll let you know what we come up with at Cornell...

Wednesday, 5 April 2017

Implicit protocols and dynamic protocol discovery

During the past ten days I've driven in the area around Tel Aviv, then Brussels, then New Jersey/New York City.  The rules of the road are widely different in each of these areas.

Beyond the actual driving laws, there is an interesting kind of social networking at play, and it seems to suggest a possible research topic... one outside of my immediate area (anyhow, Derecho has me busy!), so I'll share it here in the hope that someone else might find the thought useful.

To set the stage, let me describe a few examples of what I'll call "implicit protocols" that drivers employ in these different areas.  Even before giving examples, I should explain what I mean with this term.  I have in mind situations where some actual communication occurs between drivers: we might look at each other and I might nod, meaning that I'm letting you pull in front of me on a crowded street.  You could drift slightly towards my lane, and I might acknowledge that by slowing my car, just a hair, and this is enough for you to realize that you have my permission to shift lanes.  The list goes on, and becomes pretty long.  So, an implicit driving protocol is (1) a behavioral rule, familiar to the people who drive in the region, and (2) one that isn't automatically used; it has to somehow be requested, and acknowledged.

So here are some examples:

Example 1: In Israel, as far as I can tell, drivers almost never look behind themselves when shifting lanes (they do check in other situations; I'm talking here specifically about lane-to-lane movement).  They check behind them normally, but once they decide that a lane change is needed, the decision of when to make it apparently is based on the 180-degree region sideways and in front.  Once you get used to this you can easily understand intentions, but if you don't realize that they never check behind themselves, Israeli drivers often seem to be cutting you off, because they will shift into your lane with inches to spare.  The key point is that you, as the driver coming up from the rear, were supposed to realize "long ago" that the driver up ahead was planning to shift into your lane, and to leave room for that.   If a driver seems to intend to change lanes and then doesn't (even with just inches to spare, as I said), you as the driver behind them are allowed to get impatient and honk the horn: "get on with it!".  And then conversely, if they shift lanes and nearly crash into you, they will be angry at you if they think the rules made it quite clear that this is what they were doing.  "You idiot, are you driving with your eyes closed?!!  Anyone could see what I was doing!"  They don't actually say stuff like that, by the way.  I just mean that they are thinking it.

In fact the first time you drive in Tel Aviv you feel as if people break the law whenever they find it convenient to do so.  But later you realize they actually have social rules for when to defy the laws, and even the city police respect those.  Break the rules in a way that they don't normally consider appropriate, do something nobody should ever do, and I promise you: every driver for blocks around will lean on his or her horn instantly, and you just might find that the local policeman wants to chat, too!  (There is also a whole protocol around weird rude behavior, like stopping your car in a way that blocks the entire street and then hanging out and smoking a cigarette because you are waiting for your cousin to come down from the fifth floor, but she's elderly and slow and it might take five minutes, during which every single car will be honking non-stop...  because after all, if you don't wait for her right at that spot, well, you might have to drive around the block, or she might have to stand there...  but let's not even go there!)

Example 2: In Belgium, the whole country uses a rule called "priorité à droite" (priority to the right).  You may not know it by name, but you are probably familiar with the priority-to-the-left rule, and a good way to understand it is to start by thinking about a round-about (a traffic circle).  Suppose someone wants to get into the circle, and you are driving around the circle.  Priority to the left is the usual rule: the driver in the circle has the right to keep driving, and the one who wants to enter has to wait for a free slot.  As you can imagine, the traffic circle deadlocks with priority to the right ("à droite" in French), because new cars can enter in priority over cars wishing to exit.  (Fun fact: for historical reasons rooted deep in the past, France has two major traffic circles that actually use priorité à droite, namely the ones at the Etoile and at Place Victor Hugo.  I've seen them get deadlocked and stay that way for hours, literally hours.  The same cars just sit there... forever.  Very interesting for me as a computer scientist!)

I'll toss out an example 2a, but this is a minor one: they have a ton of traffic circles, in fact, all over Europe.  And there often are two lanes.  How will they be used?  Well, the rule is to use the inner lane if you will exit two or more exits from here, but be in the outer lane if the next exit is where you will leave the circle.  All the drivers let you do this, if you know the rule and follow it.  But tourists often get it wrong, and cause fender-benders.

But enough about traffic circles.  Let's get to the real example 2.  So, here's the actual issue for Belgium.  It turns out that in Belgium, a great many small streets lack any kind of stop sign or traffic lights on the corners.  So you are driving on a street, maybe even a nice large road, 2 lanes in each direction, and in comes this dinky little street from the right with no markings on the intersection, and in fact the little thing could almost be a driveway, it looks so dubious as far as streetness goes. 

In the US, without question, you would assume that you on the main road have priority over the little road coming in from the right.  And actually, legally, you do have priority: we use a priority to the left rule in this situation, like on normal traffic circles.  So when an intersection is unmarked, the car to the left gets to drive directly through, and the car to the right must stop, check that the left is clear, and only then can advance into the intersection.

In Belgium, the rule is actually the opposite: dinky or not, the driver on that little alley has priority over the vehicles on the main 2-lane road if the intersection is unmarked, unless they tell you that you are on a "route prioritaire", which literally means "a road with priority over incoming traffic".  So everyone on the main road is guaranteed to tap the brakes at every intersection.  As the driver behind such a car you could crash right into them by not realizing they are forced to do this, and for that matter if you were that clueless, you could easily drive through without checking to the right, causing a major accident.  As it happens, this rule also gets used in the south of Europe quite often, so do make sure to understand it if you plan to drive there.

Where is the implicit protocol?  Partly, of course, the protocol is the legal rule.  But beyond that there is also this question of whether you are too far into the intersection to slow down and yield.  So there is still an exchange of context between you and the other driver that takes place: a form of communication that might be eye to eye, or might relate to how you drive your car.  But information is conveyed and helps you know if the guy coming in is ok with you passing even though he has priority, or if he is unaware of you, or deeply committed, in which case, slam on your brakes my friend!

Example 3:  In the area around New York and especially in the city itself, cars collaborate "school of fish" style to maximize throughput on streets that happen to have slightly coordinated traffic lights, such as the main north/south avenues that run for miles from the top of Manhattan to the bottom.  So on these, all the drivers are implicitly in agreement that the plan is to get as far as possible before stopping for the next red light, and with this in mind, they need to shift lanes and otherwise coordinate much like fish in a school of fish.  The car density might be very high, and yet the cars try to flow around obstacles, which are unfortunately common in New York City: gaping holes in the pavement where work on a steam pipe is underway, perhaps with fencing around the hole, perhaps just a big hole; steel plates that look unstable, maybe a homeless person pushing a shopping cart of stuff across the street right at that moment.  The cars just all flow, seamlessly. 

When you understand this it feels like a kind of dancing: a ballet of cars, and especially of taxis, because there are additional complications: the taxi drivers all drive "correctly", but people from out of town on these roads are treated kind of like potholes: the taxis are very wary around them because they know that those drivers are clueless and might do stupid (or inefficient) things.  But there is a form of implicit communication here too: very quickly, I find that I'm in sync with the taxis and other drivers around me: I'm not in a yellow cab, yet the school of cabs accepts me.   It isn't an eye-to-eye thing: it just comes down to the rules you are following. 

I could definitely extend this list: Switzerland has its own style (absurdly, excessively polite and respectful of the rules, plus: "tout ce qui n'est pas interdit est obligatoire", meaning "if it isn't illegal, it is obligatory").  California has its own style.  The style in Mexico can be rather creative. 

The computer science puzzle:  Ok, hopefully we're on the same page now.  As you can see, by driving in all of these different settings in a short amount of time, I've had to repeatedly adapt my meta-driving-protocol: I've had to shift to different implicit protocols, so that I would behave as expected from the point of view of all those others on the road.  Make the shift and you drive more safely and efficiently; fail to make it, and horns blare at you from all directions!

But how am I doing this?  And how did I figure out the rules in Tel Aviv, which is a new environment for me?  Or other such places?

I'll postulate that first of all, humans have explicit and implicit ways to communicate.  We talk, but we also communicate eye-to-eye through glances and little nods, and through behavior, like when your car drifts just a little to the right or left while in motion in a way that "makes sense" under the local implicit driving rules.  So we send information through a dozen little "tells", as a poker player might express the idea, and this is often a two-way thing or even an n-way thing, as in the case of schools of drivers heading south on 2nd Avenue in New York.

Next, the rules that matter on roadways are discoverable: a person who drives well and doesn't get stressed easily can learn these rules very quickly.  How?  I have no idea: I'm not adept at languages (I know like five words of Hebrew by now, a bit less than two words per month for the past three months: not impressive!)  But somehow these implicit protocols are so evident that they are just totally obvious once you settle into the situation, and you develop hypotheses and quickly validate them.

The means of dynamic discovery is also incredibly robust: we can learn these rules even though perhaps half or more of the drivers aren't using them at all, or not using them properly.  We can learn which cars "get it" and which ones don't, and then we as a school of cars can all cooperate to shun the ones that are clueless.  In fact you can figure out that the guy driving that SUV is clueless, and even though you are way up ahead of me, I'll realize this too, and will route around the SUV too, because I've learned something from you.  And yet I can't see you, don't know you.  Somehow you communicate this knowledge through your behavior!  And I somehow deduce it from that behavior.

Why does all this matter?  Here's why.  First, I think this represents a kind of exciting topic for research in social networks.  I bet that if you plan to come to do a PhD at Cornell and knock on Jon Kleinberg's door, you could even talk him into working on this with you.  (The topic is more his kind of thing than mine).  So do that!  We always need more brilliant PhD students.

And the second reason is that self-driving cars will never be safe until we understand this aspect of human behavior.  Obviously, if you follow these blog postings of mine, you'll know that I've become very worried about self-driving cars (check out that tag if you missed my prior comments on this).  But setting that stuff to the side, the situation we have is that Uber and others want to create self-driving taxis that will be safe in New York, and in California too, and Tel Aviv, and Brussels.  Well, good luck to them on this if they don't figure this stuff out!  I myself will try to opt out for as long as possible.

And in case you are wondering, I'm still driving standard shifts, as much as possible.  Of course I can drive an automatic.  But I just don't trust those automatic shifting systems...

Tuesday, 28 March 2017

Is it sensible to try to "teach" cloud computing?

Many years ago, I was teaching a rather polished course on distributed computing, and students seemed to like it very much: we looked at distributed programming models and algorithms, and the students learned to create new protocols of their own, solving problems like atomic multicast.  Mostly I taught from my own textbook, initially created back in 1996 on a prior sabbatical.

When distributed systems suddenly expanded enormously and started to be known as cloud infrastructure systems (I would date this roughly to 2005 when AWS was introduced), it seemed like an obvious moment to modernize that class and shift it to become a cloud computing course. 

What followed became a cycle: I would revise my distributed systems textbook, update all my slides,  and then the course would feel reasonably current: I did this in 2010, then again in 2012.

But today I'm facing an interesting puzzle: as part of my sabbatical I thought I might again attempt to revamp the class, but the more I learn about the current state of the art, the less clear it is that this is even possible!  Cloud computing may have become too large a topic, and too arcane, to permit a proper treatment.  Here are a few of the more important topics:

  • Edge platforms.  Devices like iPhones and iPads were completely new back when the cloud emerged, and fairly simple.  Today, they are incredibly complex and very powerful platforms, and it would probably take a full year-long course to just cover the full range of technologies embedded into them.
  • The network itself is increasingly sophisticated, and I use the word "increasingly" with some hesitation, because after decades of continuous evolution and change, it is hard for a person who thinks of the Internet in terms of the old TCP/IP structure to fathom quite how changed the modern network actually has become.  Today's network is a computational structure: it dynamically adapts routing, is able to model each separate enterprise as a distinct domain with its own security and quality of service, it can cache huge amounts of data, and it understands mobility and adapts proactively, so that by the time your car emerges from the tunnel, the network is ready to reconnect and deliver the next bytes of your children's videos.  With the push towards P4, the network is increasingly programmable: an active entity that computes on data flows running at rates of 100Gbs or more.
  • Any serious cloud computing company operates a lot of data centers, at locations spread globally (obviously, smaller players lease capacity from data centers operated by specialists, then customize their slice with their own software layers).  Some systems are just limited-functionality cache and web service structures: simple points of presence; others are full-featured data warehouses that do extensive computation.   Thus the cloud is a heavily distributed structure, with a hierarchy.  Routing of client requests is heavily managed.
  • Within any single data center we have layers of functionality: edge systems that run from cache and are mostly stateless (but this is changing), then back-end systems that track dynamic data, and compute engines that apply iterative computational steps to the flow: data arrives, is persisted, is compressed or analyzed, this creates new meta-data artifacts that in turn are processed, and the entire infrastructure may run on tens or hundreds of thousands of machines.
  • Big data is hosted entirely in the cloud, simply because there is so much of it.  So we also have these staggeringly-large data sets of every imaginable kind, together with indices of various kinds intended to transform all that raw stuff into useful "content".
  • We have elaborate scalable tools that are more and more common: key-value stores for caching (the basic MemCacheD model), transactional ones that can support SQL queries, even more elaborate key-value based database systems. 
  • The cloud is a world of extensive virtualization, and of virtualized security enclaves.  All the issues raised by multitenancy arise, along with those associated with data leakage, ORAM models, and then technologies like Intel SGX that offer hardware remedies.
  • Within the cloud, the network itself is a complex and dynamic creation, increasingly supporting RDMA communication, with programmable network interface cards, switches and routers that can perform aspects of machine learning tasks, such as in-network reductions and aggregation.
  • There are bump-in-the-wire processors: NetFPGA and other ASIC devices, plus GPU clusters, and these are all interconnected via new and rather exotic high speed bus technologies that need to be carefully managed and controlled, but permit amazingly fast data transformations.
  • File systems and event notification buses have evolved and proliferated, so in any given category one has an endless list of major players.  For example, beyond simple file systems like HDFS we have ones that offer strong synchronization, like Zookeeper, ones that are object oriented, like Ceph, real-time versions like Cornell's Freeze Frame (FFFS), big-data oriented ones, and the list goes on and on.  Message bus options might include Kafka, RabbitMQ, or OpenSplice, and those are just three from a list that could easily extend to 25.  There are dozens of key-value stores.  Each solution has its special feature set, advantages and disadvantages.
  • There are theories and counter-theories: CAP, BASE, FLP, you name it.  Most are actually false theories, in the sense that they do apply to some specific situation but don't generalize.  Yet developers often elevate them to the status of folk-legend: CAP is so true in the mind of the developers that it almost doesn't matter if CAP is false in any strong technical sense.
  • We argue endlessly about consistency and responsiveness and the best ways to program asynchronously.  The technical tools support some models better than others, but because there are so many tools, there is no simple answer.
  • Then in the back-end we have all the technologies of modern machine learning: neural networks and MapReduce/Hadoop and curated database systems with layers of data cleaning and automated index creation and all the functionality associated with those tasks.
I could easily go on at far greater length.  In fact I was able to attend a presentation on Amazon's latest tools for AWS and then, soon after, one for Microsoft's latest Azure offerings, and then a third one focused on the Google cloud.  Every one of these presentations reveals myriad personalities you can pick and choose from: Azure, for example, is really an Azure PaaS offering for building scalable web applications, an Azure fabric and storage infrastructure, an Azure IaaS product that provides virtual machines running various forms of Linux (yes, Microsoft is a Linux vendor these days!), a container offering based on Mesos/Docker, and then there is Azure HPC, offering medium-size clusters that run MPI supercomputing codes over Infiniband.  All of this comes with storage and compute managers and developer tools to help you build and debug your code, and endless sets of powerful tools you can use at runtime.

Yet all of this is also terribly easy to misuse.  A friend who runs IT for a big company was just telling me about their move to the cloud: a simple mistake ran up a $100k AWS bill one weekend by generating files that simply got bigger and bigger and bigger.  On a single computer, you run out of space and your buggy code crashes.  In the cloud, the system actually has the capacity to store exabytes... so if you do this, it just works.  But then, on Monday, the bill arrives.  Microsoft, Amazon and Google are vying to offer the best ways to protect your IT department against surprises, but of course you need to realize that the dashboard has those options, and enable them, much like credit card alerts from Visa or MasterCard.

So even cloud dashboards have become an elaborate topic.

Where does this story end?  The short answer is that it definitely isn't going to end, not for decades.  If anything, it will accelerate: with the trend towards Internet of Things, we'll be moving all sorts of mission-critical real-time systems into the cloud, and even larger data flows: self-driving cars, self-managed smart highways and cities and buildings, you name it. 

And you know what?  I don't see a way to teach this anymore!  I was just talking to Idit Keidar yesterday at the Technion, quite possibly the world's very best distributed systems researcher, and she told me a little about her graduate class in distributed systems.  Five years ago, I might have thought to myself: wow, she should really be teaching the cloud, the students will insist on it.  Yesterday my reaction was exactly the opposite: when I resume teaching this fall at Cornell, I might just do what she's doing and return to my real roots.  Cloud computing was a fun course to teach for a while, but the growth of the area has simply taken it beyond what we as faculty members can possibly master and cover in any one class.




Friday, 17 March 2017

The CAP conjecture is dead. Now what?

CAP has been around now for something like 15 years.  There are some circles in which the acronym is even more famous than FLP or SAT!  But CAP was never a theorem (let's call it a "folk-theorem"), and by now we have more and more examples of systems that violate CAP. 

Yet even if CAP is false, it was also a fantastic rule of thumb that was incredibly valuable to developers tasked with building high performance distributed systems on the cloud.  If we start to teach students that CAP is false, what can replace it in this role?

A little historical context: CAP is short for "you can only have two from Consistency, Availability and Partition Tolerance."  The assertion was initially put forward by Eric Brewer in a PODC keynote talk he gave in 2000.  The justification he offered ran along the following lines.  First, he observed that today's increasingly globalized web systems invariably deploy services with high-latency WAN links between them.  These links are balky, so services are forced to make a choice: either respond right now based on locally available data, or await restoration of the WAN link.  This tradeoff obviously favors availability over consistency, and already tells us that many web services will have to find a way to prevent responses that reflect stale data from being noticed by users, or if they are noticed, from causing problems.  He used the term "partition tolerance" for this kind of fault-tolerance (namely, giving a response even when some link is down).  Hence the P in CAP.

He suggested that we are favoring "A and P over C".

Then he observed that even in a single data center, if you want the highest possible levels of performance, you'll need to handle web requests on edge nodes that take their data from cache, without first talking to a backend server: again weaker consistency, but higher availability.  So again we see the A, here as a kind of synonym for rapid response.

So he looked at all the different pairings: C and A over P, C and P over A, A and P over C.  He concluded that in practice we always seem to find that taking A and P gives the best scalability.  But CAP itself asserted that although other mixes work, you always have to pick the two you value most, and you'll erode the third.

Although the definitions of A and P are a bit strange (P seems to have a bit of A mixed in, and also sometimes seems to mean fault-tolerance, since partitions never really arise within a data center), CAP is sort of catchy.  But like many such acronyms, much depends on the details: if you try to give rigorous definitions, to turn CAP into a theorem (people have done so), you find that it holds only in some narrow situations.

The result is that as professors teaching cloud computing, we generally treat CAP as a clever acronym, but one that works mostly as a rule of thumb.  In some sense, CAP is a useful kind of folklore.

CAP took hold back in 2000 because at that time, companies like eBay and Amazon were struggling with the high costs of ACID transactions in systems built around the database SQL programming model.  Scalable database performance poses issues that are much more nuanced than the ones Eric had in mind: there were puzzles of lock conflict, complexity, data pipelines with non-trivial asynchronous ordering requirements, etc.  But the bottom line is that the performance of the big systems at eBay and Amazon was erratic and often strayed outside the magic 100ms target for web-service and web-page responses.  This is the point at which a human user feels that the system is "snappy", and everyone wants to be in the sub-100ms range.

So, the technology leaders at eBay began to tell their developers that it was absolutely fine to write SQL code as a starting point in web application design (a pragmatic decision: the people they were hiring mostly had extensive SQL experience from their database courses), but that once the SQL code was working, they should weaken the transactions by turning them into a series of individual atomic actions.  eBay began to talk about the resulting development methodology using a new acronym: BASE, by which they meant "Basically Available, Soft-state systems with Eventual consistency."

Notice that BASE doesn't abandon consistency.  Instead, it points out to the developer that many web systems just don't need ACID guarantees to work correctly.   ACID and SQL are great for creating a quick prototype that will be robust and easy to debug, but then you "optimize" it by taking away the ACID properties, without breaking the behavior in ways that violate the specification.
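To make the recipe concrete, here is a tiny single-process sketch (in C++, with invented names, and nothing like eBay's actual systems) contrasting the ACID starting point with the BASE-style decomposition into individual atomic actions, where the second step is applied later, eventually:

    // Hypothetical sketch of the BASE recipe described above (invented names).
    // The ACID version does both updates in one all-or-nothing step; the BASE
    // version performs one atomic step now and queues the rest to be applied
    // asynchronously, giving eventual consistency.
    #include <functional>
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>

    std::map<std::string, int> accounts = {{"buyer", 100}, {"seller", 0}};
    std::queue<std::function<void()>> deferred;   // applied later by a background worker

    void purchaseACID(int price) {                // one all-or-nothing transaction
        accounts["buyer"]  -= price;
        accounts["seller"] += price;
    }

    void purchaseBASE(int price) {                // series of individual atomic actions
        accounts["buyer"] -= price;               // atomic step #1, done right away
        deferred.push([price] { accounts["seller"] += price; });  // step #2, eventually
    }

    int main() {
        purchaseBASE(30);
        // ... the user already got a response; later, a background task catches up:
        while (!deferred.empty()) { deferred.front()(); deferred.pop(); }
        std::cout << "buyer=" << accounts["buyer"]
                  << " seller=" << accounts["seller"] << "\n";   // buyer=70 seller=30
    }

The point of the decomposition is that each individual step remains atomic and the end state matches the ACID version; what you give up is the guarantee that a reader sees both updates at the same instant.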

Amazon embraced BASE too.  Around 2006, they decided to rebuild many of their core applications around a key-value technology called Dynamo, but SQL-trained developers found it hard to use, and by 2008 the adoption of Dynamo began to falter.  To make the transition easier, Amazon layered a NoSQL API over Dynamo, called DynamoDB: developers could now write nearly SQL-style code against Dynamo, but with weaker guarantees than a full SQL system offers (for example, NoSQL lacks join operations), and DynamoDB happened to be especially well matched to BASE.

So you can see from these examples why CAP would be such a convenient way to motivate developers to make the switch: it more or less tells them that unless they optimize their code along BASE lines, it won't scale properly.  Moreover, the eBay memos about BASE include step-by-step instructions explaining precisely how to go about doing it.

Today, fifteen years later,  we've discovered more and more ways to implement scalable, high performance cloud services, edge caches that maintain coherence even at global scale, fully scalable transactional key-value storage systems, and the list goes on.  Derecho is one example: it helps you build systems that are highly Available, Replicated and Consistent.  Call this ARC.

The cool twist is that with ARC, you get lock-free strong consistency, right at the cloud edge!  This is because we end up with very fast replication at the edge, and every replica is consistent.  You do need to think hard about your update sources and patterns if you hope to avoid using locks, but for most applications that aspect is solvable because in the cloud there is usually some sense in which any particular data item has just one real update source.  So there is an update "pipeline", and once you have consistency in the system, you end up with a wide range of new options for building strongly consistent solutions.

An example I like very much was introduced by Marcos Aguilera in Sinfonia.  In that system, you can make strongly consistent cache snapshots, and run your code on the snapshot.  At massive scale you have as many of these snapshots as needed, and transactions mostly just run at the edge.  But when you want to update the system state, the trick he suggested is this: run your transaction and keep track of what data it read and wrote (version numbers).  Now instead of just responding to the client, generate a "minitransaction" that checks that these version numbers are still valid, and then does all the writes.  Send this to the owner of the true database: it validates your updates and either commits by applying the updates, or rejects the transaction, which you can then retry.

Since so much of the heavy lifting is done when speculatively executing the transaction in the first step, the database owner has far less compute load imposed on it.  Then you can start to think about systems that shard the data, with a different owner for each shard, and limit each transaction to run within a single shard at a time (the trick is to design the sharding rule cleverly).  Sinfonia and the follow-on systems had lots of these ideas.
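Here is a minimal single-process sketch of the version-validation idea (invented names, not Sinfonia's real API): the client executes speculatively against a snapshot, recording the versions it read, and the owner commits only if those versions are still current.

    // Sketch of a Sinfonia-style mini-transaction (invented names, single-process
    // simulation).  The client runs speculatively against a snapshot, recording
    // the versions it read; the owner commits only if those versions are current.
    #include <iostream>
    #include <map>
    #include <string>

    struct Versioned { std::string value; long version = 0; };
    using Store = std::map<std::string, Versioned>;

    struct MiniTransaction {
        std::map<std::string, long> readSet;          // key -> version observed
        std::map<std::string, std::string> writeSet;  // key -> new value
    };

    // The "owner" of the authoritative data validates and applies atomically.
    bool tryCommit(Store& owner, const MiniTransaction& t) {
        for (auto& [key, ver] : t.readSet)
            if (owner[key].version != ver) return false;   // stale read: reject
        for (auto& [key, val] : t.writeSet) {
            owner[key].value = val;
            owner[key].version++;
        }
        return true;
    }

    int main() {
        Store owner = {{"x", {"10", 1}}};
        Store snapshot = owner;                 // consistent snapshot at the edge

        // Speculative execution at the edge: read x, compute, prepare a write.
        MiniTransaction t;
        t.readSet["x"] = snapshot["x"].version;
        t.writeSet["x"] = "11";

        std::cout << (tryCommit(owner, t) ? "committed" : "retry") << "\n";
    }

If the owner's copy of x moved on since the snapshot was taken, tryCommit rejects and the edge simply retries against a fresher snapshot.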

ARC creates a world in which solutions like the Sinfonia one are easy to implement -- and the Sinfonia approach is definitely not the only such option.

I'm convinced that as we move the cloud towards Internet of Things services, developers will need this model, because inconsistency in a system that controls smart cars or runs the power grid can wreak havoc and could even be dangerous.

So now I want to claim that ARC is the best BASE story ever!  Why?  Well, remember that BASE is about basically available, soft-state systems with eventual consistency.  I would argue that in examples like the Sinfonia one, we see all elements of the BASE story!

For example, a Sinfonia system is basically available because with enough consistent snapshots you can always do consistent read-only operations at any scale you like.  The state is soft (the snapshots aren't the main copy of the system: they are replicas, asynchronously updated as the main system evolves).  And they are eventually consistent because when you do make an update, the mini-transaction validation step lets you push the update results into the main system in a consistent way.  It might take a few tries, but eventually it should succeed.

What about the eventual consistency aspect of BASE in the case of ARC?   ARC is about direct management of complex large-scale strongly consistent replicated state.   Well, if you use ARC at the edge, back-end servers tend to be doing updates in a slightly time-lagged way: batched updates and asynchronous pipelines improve performance.  Thus they are eventually consistent, too: the edge queues an update, then responds to the end-user in a consistent way, and the back end catches up soon afterwards.

And ARC isn't the only such story:  my colleague Lorenzo Alvisi has a way to combine BASE with ACID: he calls it SALT, and it works really well.

So, feeling burned by CAP?  Why not check out ARC?  And perhaps you would like a little SALT with that, for your databases?

Sunday, 12 March 2017

Should Derecho do optimistic early delivery?

In the Isis and Vsync systems, we supported a form of early delivery that could be described as "optimistic":  messages were delivered as soon as they arrived, before the system was certain safety had been achieved, enabling the application to start processing them, asynchronously with respect to the background exchanges of acknowledgements required to learn that stability had been reached.  

For example, suppose that a client process, P, asks a process Q to take some action.  Perhaps as a side effect, Q updates data replicated in group G, with members {Q, R, S}.  In an optimistic design, the update multicast itself might return as soon as the multicast was underway, and the delivery events could occur as soon as the ordering was determined.  Latency is thus minimal.

The trouble with optimistic early delivery is that a crash can erase the multicast during the first stages of such a protocol.  Thus there might be a window of a millisecond or so during which a power failure could kill some process after it sent (or received and delivered) such a multicast, but before it was fully delivered at the other members.  Perhaps we would end up with R having delivered (and acted upon) the message, while because of the crash, S never gets a copy.

This is bad behavior: normally, one wants failure atomicity.  Accordingly, the Isis or Vsync application developer who used these optimistic protocols was told to invoke a flush primitive before taking actions that might remain visible after a crash, like giving out cash from an ATM, or sending a reply to an external client application.  The flush primitive simply waited until stability was achieved, returning after the message had become safe.  Thus if R called flush before handing out cash, either the multicast stabilized and the money was distributed, or the crash happened while flush was waiting and R was killed, but no money had been handed out, hence nothing bad occurred.

The idea was that optimistic delivery plus flush would be equivalent to pessimistic delivery, but that during the period before flush was called, computation overlaps with background message passing and extra work can get done.  In Vsync and Isis, the totally ordered multicast was optimistic, whereas the "safe" primitive was pessimistic.  You may have heard of the vertical Paxos protocol: it, too, is pessimistic.  Pessimistic protocols in effect call flush before the application ever sees the message, so the application layer is always safe.
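To illustrate just the programming pattern (the Group class below is a stand-in I invented, not the Isis, Vsync or Derecho API), here is a sketch of a handler that acts on an optimistically delivered message right away but calls flush before taking the externally visible step:

    // Pattern sketch (invented API): process an optimistically delivered message
    // immediately, but call flush() before any externally visible action, so that
    // the action happens only once the multicast is known to be stable.
    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    class Group {                       // stand-in for a group communication layer
        std::atomic<bool> stable_{false};
    public:
        void backgroundStabilize() {    // pretend the acks arrive ~1ms later
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            stable_ = true;
        }
        void flush() {                  // block until the delivered message is stable
            while (!stable_) std::this_thread::yield();
        }
    };

    void onWithdrawal(Group& g, int amount) {
        // Optimistic delivery: we can update replicated state right away...
        std::cout << "debiting account by " << amount << "\n";
        // ...but before an irreversible, externally visible step, wait for stability.
        g.flush();
        std::cout << "dispensing cash\n";
    }

    int main() {
        Group g;
        std::thread acks(&Group::backgroundStabilize, &g);
        onWithdrawal(g, 100);
        acks.join();
    }

The work done between delivery and the flush call is the "overlap" the optimistic scheme buys you; a pessimistic scheme would, in effect, run flush before onWithdrawal was ever invoked.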

As a teaching device, we would ask new Isis or Vsync developers to think about the way that buffered disk writes are done in a file system.  The application writes some text, but it ends up buffered first by the file I/O library, and then in the kernel.  Seconds can elapse before the data is on the disk.  The value of buffering is speed: performance is dramatically improved, far better than if fflush is called character by character or even line by line.

Today I'm revisiting this same question in Derecho.  Our basic OrderedSend is pessimistic, and pretty fast.  We also have a raw version of send that lacks ordering guarantees or failure atomicity and runs at the highest speeds possible.  But do we need an optimistic delivery too?

The case for optimistic delivery centers on this question: are there applications that need strong properties, yet also need the small degree of overlapped computing afforded by optimism?  A pessimistic delivery rarely delays more than a few microseconds relative to an optimistic one.  So you can deliver a small multicast after, say, 12us in an optimistic mode, but might have to wait 25us to deliver it with full safety (pessimistically).

In our experiments, the performance gap is most extreme for small messages.  Our impression, so far, is that these suffer from two kinds of delay.  First, with very small multicasts, the time needed to report stability through the SST subsystem is proportionally larger relative to the data transfer delays, so when you compare a small message delivery to a large one, the wait for stabilization simply looks larger.  The other issue is that because these messages need to be buffered, at high rates Derecho runs out of space and new incoming messages are delayed waiting for an admission slot.

In contrast, optimistic delivery mode lets Derecho hand off incoming messages to the application instantly, hence the system buffers only data that is in-flight, and won't incur this admission backlog buildup.  Stabilization takes the same amount of time, but by the time flush is invoked by a handler processing message m, there is more hope that m will have fully stabilized and that flush won't delay at all.

But keep in mind that with raw mode, there is no delay and no need to use flush at all.     The question becomes this: will people who would find optimistic delivery useful be content to use raw mode instead?  If so, the demand for optimistic mode would be low, and we should then omit the extra mechanisms and extra complexity.  Or do they need stronger guarantees when failures occur?  And if the latter, just what would those applications look like, that need atomicity but also need to shave 25us or so off the delivery delay?

Let us know if you have a great example of an application that would really benefit from optimistic early delivery.  For now, we know how to support optimism within Derecho, but are leaning towards not doing so!


Wednesday, 8 March 2017

The ultimate Paxos (and Atomic Multicast) protocol

Many years ago, probably around 1986, Barbara Simons organized a workshop on replication at Asilomar, a California resort that had an iconic role in early systems research (it took a while, but eventually our community was too large to fit there). 

Her idea was to bring people together across areas: theory and practice, and within the practitioners, databases as well as other forms of distributed systems.  It was a good idea, and a great workshop.

In Jim Gray's talk, he pointed out that the interaction pattern we associate with the 2-phase commit protocol was seemingly universal, and that perhaps we were wrong to think of 2-phase commit as primarily a database construct.  Instead, he asked: if 2-phase commit is really concerned with something deeper than database transactions, what is this deeper core concept?  The essential point of his talk was that the 2-phase pattern seen in 2PC clearly allows a distributed system to "learn" something.  From this he suggested that if we could pin down the minimal "knowledge" required for a distributed system to be correct, we could build a minimal implementation of that protocol and solve the problem once and for all.

A long time has passed.  Jim is no longer with us, and we know all about knowledge in distributed systems, and how processes learn.  Even so, Jim's question still sometimes comes up.  I mentioned this to Idit Keidar today over lunch, and her response was definitive: she convinced me that today, we really do (finally) understand the answer to Jim's question.    Her thinking centers on work she published in 2006 with Alex Shraer.

Here's what she explained.

First, she observed, distributed computing is ultimately about just one question: fault-tolerant consensus.  In her view, Jim had noticed this, but understood it as something about the 2PC communication pattern rather than appreciating that what matters most is consensus: moving from a state in which some decision is uncertain to one in which a decision has been reached.  As Idit views the question, one can transform consensus into other forms and route the messages in all sorts of patterns, but the bottom line is that either we achieve agreement via consensus or we simply aren't building systems that can make logically sound claims about their behavior.

Next, she pointed out, agreement is trivial while failures aren't happening.  While a system remains stable, the author of a new piece of information simply needs to multicast it, and this accomplishes consensus.    If there might be many authors, you can circulate a token around them either before they send (as in the old Totem and Transis protocols) or after the messages arrive, to assign them a delivery ordering.
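Just to underscore how easy the failure-free case is, here is a toy sketch of the "order after arrival" variant (invented names, nothing like the real Totem or Transis protocols): whoever holds the token stamps each incoming message with the next sequence number, and every member delivers in stamp order.

    // Toy sketch of failure-free ordered delivery via a sequencer token (invented
    // names): the token holder stamps each message with the next sequence number,
    // and every replica delivers in stamp order, yielding a total order.
    #include <iostream>
    #include <map>
    #include <string>

    struct Sequencer {                 // the current token holder
        long next = 0;
        long stamp() { return next++; }
    };

    int main() {
        Sequencer token;
        // Messages arrive at the token holder in some arbitrary interleaving...
        std::map<long, std::string> delivery;            // seq# -> message
        delivery[token.stamp()] = "update from P";
        delivery[token.stamp()] = "update from Q";
        delivery[token.stamp()] = "update from P";

        // ...and every member delivers them in the same (total) order.
        for (auto& [seq, msg] : delivery)
            std::cout << seq << ": " << msg << "\n";
    }

Nothing here tolerates a crash of the token holder, of course, which is exactly her point: the ordering mechanics are trivial; the failure handling is the hard part.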

And so in Idit's view, the only hard issue is to handle failures.  Obviously, in the case of 2PC, particularly back in 1986, the field started out with a somewhat muddled understanding of this aspect: 2PC starts by stating that there is a set of processes that need to achieve agreement, but where does that set come from?  In a modern system like Derecho, the set itself is the output of a consensus layer.  2PC uses a crash failure model, but this is one of many models.  Back in 1986 people were very focused on models: crash failures, failstop failures, omission to send, omission to receive, timing errors, Byzantine models.

But these days all that really matters is to handle crash failures correctly.  Some organizations toy with Byzantine fault-tolerance, but honestly, I've never met anyone who had a service that actually came under a Byzantine attack.  Maybe the blockchain people will finally have that experience.

So let's focus on crash failures for a moment.  In light of the FLP theorem, we know now that without a perfect failure detector, protocols can't be proved correct: you can show safety, but not liveness.  In practice, we solve this by forcing a seemingly faulty process to leave the system and then, if you want, it can rejoin. 

So failure becomes a purely behavioral abstraction: if a majority of the system deems some process worthy of exclusion, for whatever arbitrary reason it desires, then out goes the offending process, end of story. 

Where did the majority constraint come from?  The role of the majority restriction is to prevent logical partitioning. 

So, we end up with a rather dynamic form of membership: the system defines its own membership, excluding processes that seem faulty, and makes progress as long as it can maintain a dynamic form of majority.  To wit: at any point in time, some epoch defines the composition of the system.  In order to move to a new epoch, the system needs agreement from a majority of whoever is in the active epoch.
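Here's a bare-bones sketch of that rule, with invented types (Derecho's actual membership protocol is far more involved): a proposed new epoch can be installed only with acknowledgements from a majority of the current epoch's members.

    // Sketch of the dynamic-membership rule described above (invented types):
    // epoch k+1 can be installed only if a majority of epoch k's members agree.
    #include <iostream>
    #include <set>
    #include <string>

    struct Epoch {
        int number;
        std::set<std::string> members;
    };

    bool canInstall(const Epoch& current, const std::set<std::string>& acks) {
        // acks = members of the current epoch that approved the proposed change
        return acks.size() > current.members.size() / 2;
    }

    int main() {
        Epoch e5{5, {"A", "B", "C", "D", "E"}};
        std::set<std::string> acks = {"A", "B", "C"};       // 3 of 5: a majority
        Epoch e6{6, {"A", "B", "C", "D"}};                  // proposed: drop E

        if (canInstall(e5, acks))
            std::cout << "install epoch " << e6.number << "\n";
    }

The majority requirement is what rules out two disjoint subsets each installing their own next epoch, which is exactly the logical-partitioning hazard mentioned above.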

So here's where all of this leads: Derecho is a concrete implementation of this new-age perspective on Jim's question.  Indeed, it is the ultimate story in the sense that the system is optimal in many different ways.

To appreciate this, you'll need to read the Derecho papers, but in a nutshell, the system maps consensus onto RDMA hardware in an exceptionally efficient way.  But the protocol itself runs consensus on the configuration of each epoch, and then uses a cheaper consensus protocol within an epoch (one that can assume membership is stable and that failures won't occur), leaving a very clean realization of distributed computing.

Indeed, Derecho is a constructive lower bound in every interesting dimension.  It is an optimal solution to the problem of distributed computing, using consensus-based methods.

Why do I make this claim?  Well, first, during a given epoch, the system is as quick to deliver messages as is possible: one can prove that any system that delivers messages with fewer "exchanges of information" between its processes is either incorrect, or at least at risk of needing to stop because of some single fault.

Next, one can show that Derecho's agreement on the state of each membership epoch is optimal.   Here, Derecho makes use of an all-to-all pattern of information exchange, and I asked Idit if she thought the protocol could be improved.  Idit pointed again to her 2006 papers with Alex.  Without prior knowledge of which processes failed, and how many failed, she explained, this pattern of information exchange is the quickest way to reach agreement on the membership of the next epoch.

Finally, we can show that Derecho makes progress with a failure model called <>P: eventually perfect failure detection.  Idit explained to me that in the past, people actually thought it might make sense to focus on progress with weaker detection models, like <>W, but that if you do so, you either end up with a fault model equivalent to <>P, or you end up with slower agreement protocols that look much more like Byzantine agreement.  So, for the style of quick progress Derecho is after, she explains, <>P is the right goal.  And Derecho is live with <>P, provided that no more than a minority of processes fail in any epoch.  Which again, is the best one can do.

Now, Idit is a theory person, but it is interesting for me as a practitioner to also think about practicalities.  As the old adage goes: "In theory, theory and practice are the same, but in practice, they differ."

As it happens, Derecho is optimal in a communications-engineering (networking) sense. 

Given that reliable networks always have a one-to-one acknowledgement based layer, reliable multicast over a tree is an optimal data dissemination pattern: if you try and disseminate multicasts over an unreliable 1-N protocol, the cost of detecting lost packets and resending them will be very high compared to a tree of unicast transfers.  (Perhaps optical networks with optically aggregated acknowledgements could offer a slight hardware speedup, but even this isn't clear, since the optical aggregator would itself be a tree). 
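As a sketch of what dissemination over a tree of reliable unicast transfers looks like (a deliberate simplification; Derecho's RDMC layer uses more refined block schedules), node i simply forwards each block to nodes 2i+1 and 2i+2:

    // Sketch of relaying a multicast over a binary tree of reliable unicast
    // transfers (a simplification of the idea in the text, not Derecho's RDMC
    // code).  Node i forwards the block to nodes 2i+1 and 2i+2.
    #include <iostream>
    #include <string>
    #include <vector>

    void relay(std::vector<std::string>& received, int me, const std::string& block) {
        received[me] = block;                         // the "unicast" arrived here
        int left = 2 * me + 1, right = 2 * me + 2;
        if (left  < (int)received.size()) relay(received, left,  block);
        if (right < (int)received.size()) relay(received, right, block);
    }

    int main() {
        std::vector<std::string> received(7);         // a 7-member group
        relay(received, 0, "block #1");               // the sender is node 0
        for (int i = 0; i < (int)received.size(); i++)
            std::cout << "node " << i << " got " << received[i] << "\n";
    }

Each hop is an ordinary acknowledged point-to-point transfer, so loss detection and retransmission come for free from the unicast layer, rather than having to be reinvented for an unreliable 1-N protocol.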

Now if you happen to be a cloud computing engineer, and read the above, what will jump to mind is heavy-tailed behaviors and multi-tenancy: Derecho protocols sometimes relay data, so if a relayer is very slow, everyone waits.  Paxos classically would avoid this using quorums: the system doesn't wait for the slowest process.  So how can I claim that Derecho is optimal in a practical sense if it doesn't use quorums?

There are a few answers.  A long-winded one would talk about work Fernando Pedone and his student, Parissa Jallali, did while visiting my group a few years ago.  Basically, they studied quorum behaviors in Paxos and discovered that while Paxos can briefly "outrun" a slow process, eventually work piles up and either flow control causes the protocol to pause, or a shifting pattern of just who is the slow process moves the slow one into the quorum and some previous quorum member becomes slow.  Either way, Paxos basically halts until everyone catches up and is back in sync.  So quorum patterns do not evade the disruption caused by heavy-tailed behaviors.

Conversely, offloading the protocol into hardware actually can eliminate that issue, because the hardware is dedicated: the network spends 100% of its time communicating, so if you can describe a pattern of communication, then let it rip, the network is the ideal engine for moving data.  As it happens, Derecho generates deterministic data transfer schedules, hence given adequately programmable NICs we can hand the entire sequence of block transfers to the NIC.  So we can even make "optimal" use of the network, and since a NIC never sleeps or pauses, quorum behaviors aren't needed even if end-applications sometimes are a bit slow.

So a road that for me started around when Jim asked his question about 2PC seems to have reached its end: Derecho implements the ultimate Paxos protocols for atomic multicast (often called vertical Paxos) and for persisted memory (classic Paxos).  We could add an optimistic early delivery protocol with a flush, too, as in Isis and Vsync, but we decided to keep the system simple and omitted it: most people who would use that feature probably just want raw RDMC from the Derecho API, and this we do offer.

And so the Paxos problem is solved.  And you know what?  It's about time!  Idit feels that it was solved back in 2006, actually.  As for me, well, until I can write an application using the ultimate protocol, the problem is open.  But today, I finally can.  (I don't mind at all that Derecho employs a protocol that is actually extremely similar to our first Isis protocols from 1985.)

So should we all pack our bags and go home? 

Not quite yet.  First, there are other forms of distributed consistency, like convergent (gossip) protocols, self-stabilization, and of course, Byzantine Agreement.  It would be nice to see how those fit into this picture and whether there can be a single integrated story that combines all the elements.

A second issue is the engineering complexity of modern platforms.  I'll write a whole blog posting on this sometime soon, but suffice it to say that in a data center with physical topology (racks, switches, TOR switches, failure-independence domains...), GPU clusters, NetFPGA accelerators, Intel SGX protection enclaves... it just isn't obvious how to write code for such environments.  Derecho is just part of the answer, and not the whole story.

Then beyond all of this are dimensions we have yet to tackle in Derecho itself.  For example, even if Derecho is the ultimate data center protocol, is it also the ultimate WAN version?  As it happens, I suspect that this may be true too, but more attention to the question will be needed.  Anyhow, until we have it running, I won't believe the story even if I figure it out "in theory".  After all, in theory, theory and practice are the same...  but in practice, they are enormously different.

So I wouldn't despair: very likely we are finally at the end of the road for Paxos, but it certainly isn't the end of the road for distributed systems.  Watch this space: I can promise plenty of interesting new questions, and new answers, in years to come.

Friday, 24 February 2017

On systematic errors in complex computer programs

Society as a whole, including many computing professionals, seems to assume that mistakes made by unsupervised deep learning and other forms of machine-learning will be like Heisenbugs: rare, random, and controllable through a more professional code development and testing process. This belief underlies a growing tendency to trust computing systems that embody unknown (indeed, unknowable) machine-learned behaviors.

Why is this a concern?  

In many parts of computer science, there is a belief in "ultimate correctness": a perception that if we merely formalize our goals and our algorithms and successfully carry out a proof that our code accomplishes the goals, then we will have created a correct artifact.  If the goals covered a set of safety objectives, then the artifact should be safe to use, up to the quality of our specification.  Highly professional development practices and extensive testing strengthen our confidence; model-checking or other forms of machine-checked proofs that run on the real software carry this confidence to the point at which little more can be said, if you trust the specification.

Yet we also know how unrealistic such a perspective can be.  Typical computing systems depend upon tens or hundreds of millions of lines of complex code, spread over multiple modules, perhaps including firmware programs downloaded into the hardware.  All of this might then be deployed onto the cloud, and hence dependent on the Internet, and on the cloud data-center's thousands of computers. The normal behavior of such programs resides somewhere in the cross-product of the individual states of the component elements: an impossibly large space of possible configurations. 

While we can certainly debug systems through thorough testing, software is never even close to flawless: by some estimates, a bug might lurk in every few tens or hundreds of lines of logic.  The compiler and the operating system and various helper programs are probably buggy too.  Testing helps us gain confidence that these latent bugs are rarely exercised under production conditions, not that the code is genuinely correct.

Beyond what is testable we enter the realm of Heisen-behaviors: irreproducible oddities that can be impossible to explain: perhaps caused by unexpected scheduling effects, or by cosmic rays that flip bits, or by tiny hardware glitches.  The hardware itself is often better modelled as being probabilistic than deterministic: 1 + 1 certainly should equal 2, but perhaps sometimes the answer comes out as 0, or as 3.  Obviously, frequent problems of that sort would make a machine sufficiently unreliable that we would probably repair or replace it.  But an event happening perhaps once a week?  Such problems often pass unnoticed.

Thus, ultimate correctness is elusive; we know this, and we've accommodated to the limitations of computing systems by doing our best to fully specify systems, holding semi-adversarial code reviews aimed at finding design bugs, employing clean-room development practices, then red-team testing, then acceptance testing, then integration testing.  The process works surprisingly well, although patches and other upgrades commonly introduce new problems, and adapting a stable existing system to operate on new hardware or in a new setting can reveal surprising unnoticed issues that were concealed or even compensated for by the prior pattern of use, or by mechanisms that evolved specifically for that purpose.

The puzzle with deep learning and other forms of unsupervised or semi-supervised training is that we create systems that lack a true specification.  Instead, they have a self-defined behavioral goal: reinforcement learning trains a system to respond to situations resembling ones it has seen before by repeating whatever response dominated in the training data.  In effect: "when the traffic light turns yellow, drive very fast."

Thus we have a kind of autonomously-learned specification, and because the specification is extracted automatically by training against a data set, the learned model is inherently shaped by the content of the data set.

Train such a system on a language sample in which plurals always end in "s", and it won't realize that "cattle" and "calamari" are plural.  Train it on images in which all the terrorists have dark hair and complexions, and the system will learn that anyone with dark hair or skin is a potential threat.  Teach it to drive in California, where every intersection either has stop signs on one or both streets, or has a traffic signal, and it won't understand how to drive in Europe, where many regions use a "priority to the right" model, whereby incoming traffic (even from a small street) has priority over any traffic from the left (even if from a major road). 
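To see just how systematic the resulting errors are, here is a toy version of the plural example (an invented illustration, not any real ML pipeline): a rule "learned" from a sample in which every plural ends in "s" misclassifies "cattle" and "calamari" every single time, never just occasionally.

    // Toy illustration of systematic error from a biased training set (invented
    // example): every plural in the "training data" ended in 's', so the learned
    // rule fails on irregular plurals deterministically, not randomly.
    #include <iostream>
    #include <string>
    #include <vector>

    bool learnedIsPlural(const std::string& w) {   // the over-generalized "model"
        return !w.empty() && w.back() == 's';
    }

    int main() {
        std::vector<std::string> tests = {"dogs", "tables", "cattle", "calamari"};
        for (auto& w : tests)
            std::cout << w << " -> " << (learnedIsPlural(w) ? "plural" : "singular")
                      << "\n";   // "cattle" and "calamari" come out wrong, always
    }

Unlike a Heisenbug, this mistake cannot be averaged away or caught by rerunning the test: the error is baked into the generalization itself.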

Machine learning systems trained in this way conflate correlation with causation.  In contrast, human learning teases out causal explanations from examples.  The resulting knowledge is different from a knowledge model learned by training today's machine learning technologies, no matter how impressive the machine learning system's ability to do pattern matching.

Human knowledge also understands time, and understands that behavior must evolve over time.  Stephen Jay Gould often wrote about being diagnosed as a young adult with a rare abdominal cancer that was normally fatal.  Medical statistics of the period gave him a life expectancy of no more than a few months, perhaps a year at best.  But as it happened, a new treatment proved to be a true magic bullet: he was cured.  The large-population statistics were based on prior treatments, and hence not predictive of the outcomes for those who received the new one.  The story resonated in Gould's case because in his academic life he studied "punctuated equilibria": situations in which a population that has been relatively static suddenly evolves in dramatic ways, often because of some significant change in the environment.  Which is precisely the point.

Those who fail to learn from the past are doomed to repeat it.  But those who fail to appreciate that the past may not predict the future are also doomed.  Genuine wisdom comes not from raw knowledge, but also from the ability to reason about novel situations in robust ways.

Machine learning  systems tend to learn a single set of models at a time.  They squeeze everything into a limited collection of models, which blurs information if the system lacks a needed category: "drives on the left", or "uses social networking apps".  Humans create models, revise models, and are constantly on the lookout for exceptions.  "Is  that really a  pile of  leaves, or has the cheetah realized it can hide in a pile of leaves?  It never did that before.  Clever cheetah!"   Such insights once were of life-or-death importance.

Today, a new element enters the mix: systematic error in which a system is programmed to learn a pattern, but overgeneralizes and consequently behaves incorrectly every time a situation arises that exercises the erroneous generalization.  Systematic error is counterintuitive, and perhaps this explains our seeming inability to recognize the risk: viewing artificially intelligent systems as mirrors of ourselves, we are blind to the idea that actually, they can exhibit bizarre and very non-random misbehavior.  Indeed, it is in the nature of this form of machine learning to misbehave in unintuitive  ways!

My concern is this: while we've learned to create robust solutions from somewhat unreliable components, little of what we know about reliability extends to this new world of machine-learned components that can embody systematic error, model inadequacies, or an inability to adapt as conditions evolve.  This exposes us to a wide range of failure modes never seen before, ones that will challenge the industry and the computer science community to overcome.  We lack systematic ways to recognize and respond to these new kinds of systematic flaws.

Systematic error also creates new and worrying attack surfaces that hackers and others might exploit.  Knowing how a machine learning system is trained, a terrorist might circulate photoshopped images of himself or herself with very pale makeup and light brown or blond hair, so as to bind biometrics that are harder to change (fingerprints, corneal patterns) to interpretations suggesting "not a threat".  Knowing how a self-driving car makes decisions, a hacker might trick it into driving into a pylon.

Welcome to the new world of complex systems with inadequate artificial intelligences.  The public places far too much confidence in these systems, and our research community has been far too complacent.  We need to open our eyes to the risks, and to teach the public about them, too.