Saturday, 15 July 2017

How far could a secure Internet get us?

There is a standard refrain among those who teach and do research on security: it holds that the fundamental flaw in security originates with an early decision by the community that created the Internet.  In that early design, Internet routing rests on a form of peering that often happens anonymously.  Endpoint systems are not authenticated, and packets generally aren't encrypted or authenticated either.  As a consequence, packets are routed permissively even if they carry what are almost certainly false origin addresses, or content that seems dubious in obvious ways (like a TCP connection packet that makes no sense on this particular link).

This story then holds that the original sin of the Internet design community has given us layer upon layer of infrastructure that cannot be secured because it resides on a flawed and easily compromised foundation.

As it turns out, there is little doubt that we actually could build an Internet that corrects that deep flaw, and that would remain reasonably compatible with the existing one.  Military systems and security-conscious companies do it all the time.  Let me sketch out how this is done (it isn't rocket science, and you won't find many surprises).  But then let's ask if it would really be a game-changer for the security of higher-level applications.

To appreciate the core challenge, it helps to start by understanding that everything you think you know about the network is really just an abstraction layered over something more basic.  For example, even my advanced students tend to think of Internet addresses as identifiers, as if my computer's true name were 128.84.16.001 and your computer's true name were 167.88.194.023.  In fact, these addresses are better understood as routing data: useful for deciding what the next hop should be as a packet progresses hop-by-hop through the network, but lacking any deep connection to identity.

In fact, a single computer might have many network addresses (one for each of its network interfaces).  It might host multiple virtual machines, and those could each have virtual network addresses of their own: with virtualization, any computer can function as a kind of Internet and host large numbers of computers that run within it, complete with virtual networks, virtual name spaces, services of various kinds, you name it.  A single computer might move about, so that over short periods of time it takes on a sequence of addresses: the old addresses can cease to work, and traffic needs to be redirected to the new ones.  Addresses can be rewritten by network address translators.

Once we free ourselves from this false equivalence of network address to identity, we need to ask what the real root of identity should be in a modern system.  Here, the proper focus is on hardware security tokens combined with human users who authenticate themselves in the usual ways.  The hardware could be a built-in component of the computer, or it could be some form of plug-in.  But the key point is that when we associate authentication with these unforgeable hardware elements, used in tandem, we arrive at a much stronger notion of endpoint identity.  The deep roots of that form of identity reside in key pairs: the identity is anchored in some form of private key, used to sign information that other components can then verify with nothing more than the corresponding public key.

This, then, is our bootstrap opportunity.  We live in a vast world of devices (computers, routers, IoT components like video cameras, smart televisions and smart cars), and each element can be understood as either anonymous (if it lacks the ability to authenticate itself) or capable of proving that it has "authorized access" to some private key.  With that proof, we can consult a vendor-published registry and learn about this endpoint device.  A device can also be registered in secondary registries: when I bring a new router into my smart home, I could register it as "Ken's router, at his summer cottage on Cayuga Lake".  And now there would be a trustworthy chain of reasoning that would let you convince yourself that certain messages were indeed sent by, or countersigned by, my router.
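To make that chain of reasoning concrete, here is a minimal sketch in Python, assuming the third-party cryptography package.  The in-memory dictionary stands in for a vendor-published registry, and every name and message in it is invented for illustration; in a real device the private key would live inside the hardware token and never be visible to software.

```python
# Sketch: hardware-rooted identity as a key pair plus a registry lookup.
# The registry here is just a dictionary standing in for a vendor-published
# service; in practice the private key stays inside the hardware token.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

device_key = ed25519.Ed25519PrivateKey.generate()
public_bytes = device_key.public_key().public_bytes(
    serialization.Encoding.Raw, serialization.PublicFormat.Raw)

# Registry: public key -> what we know about the registered device.
registry = {public_bytes: "Ken's router, at his summer cottage on Cayuga Lake"}

# The device signs a message; anyone holding the registry can check it.
message = b"route advertisement for 10.0.0.0/24"
signature = device_key.sign(message)

claimed_key = ed25519.Ed25519PublicKey.from_public_bytes(public_bytes)
try:
    claimed_key.verify(signature, message)
    print("Message verifiably came from:", registry[public_bytes])
except InvalidSignature:
    print("Signature check failed: treat the sender as anonymous")
```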

Sound familiar?  It should, if you've ever brought your laptop from home to work.  Some companies won't allow you to connect the laptop at all (fearing viruses that your machine might carry), but those that do usually require precisely this sort of authentication and registration.

Given strong authentication, a second opportunity arises whenever we focus on an island of infrastructure that has a coherent owner.  Here on Cayuga Lake, my network provider is part of a broader system owned and operated by a company that runs a fairly large regional network.  This ISP is paid for its services and, at least in theory, has complete control of every device that can connect directly to it, every router and switch it operates internally, and every network link used to glue them all together.  One can understand the ISP as a kind of military hierarchy: at the core we find a "master and commander" who has absolute control and is the sole source of permissions.  Below the master and commander are a collection of subordinates, to whom restrictive roles have been granted, and way down at the bottom are the common seamen who might permit packets originating in my router to sail the seas of this regional network -- or could block them.

As it happens, today's ISPs are somewhat relaxed about authentication, so one way to secure the network would start by simply enforcing authentication when devices are connected to an ISP.  Today when I buy a new iPhone, I can simply start to use it in my home.  If I were trying to build a much more secure network, at a minimum my router might pop up a screen requiring me to fill in details: who owns this iPhone (and it might want proof: type in your password, type in your RSA code number, and hold your thumb on the fingerprint reader...).  Perhaps that regional ISP would do the same and require a second level of device registration.  Military and other nationally critical infrastructure networks work precisely this way: if you were to take a laptop into a major power plant and connect it to the Ethernet jack in the waiting room while waiting for your appointment with human resources, either it won't connect, or it will give you some form of very limited guest connectivity.

Think back to the last time you took your laptop to work.  I bet that something along the lines just described happened then, too.  Any organization that takes security at all seriously is careful to track the devices connected to it, and to limit their networking "power".  A military system won't allow you to connect your own machine, at all.  But if you work at Cornell, like I do, you might be able to get permission -- except that your machine will end up registered for use in a specific context, such as from my office in the Computer Science building.  If I were to carry it across the street and connect it to a wall jack in the ECE department, I would be back to square zero.

With enclaves that use authentication, one can take further steps.  Without too much added cost, packets can be cryptographically signed or fully encrypted at the endpoint, yielding various forms of virtual private networks: subnetworks within which communication is safe, but with strong protection against traffic from the outside leaking in, or against intruders managing to read or tamper with data.  Such systems can also filter or block traffic that might try to escape the secure perimeter.
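The building block beneath a VPN of this kind is authenticated encryption of each packet.  Below is a minimal sketch using the ChaCha20-Poly1305 construction from the Python cryptography package; a real VPN (IPsec, WireGuard and the like) adds key exchange, replay protection and tunnel management on top, and every value shown here is invented for illustration.

```python
# Sketch: authenticated encryption of a packet payload at the endpoint,
# the core primitive a VPN is built on.  A real VPN also handles key
# exchange, replay windows, and tunnel management.
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = ChaCha20Poly1305.generate_key()      # shared by both ends of the tunnel
aead = ChaCha20Poly1305(key)

def seal(payload: bytes, header: bytes):
    nonce = os.urandom(12)                 # must never repeat for a given key
    return nonce, aead.encrypt(nonce, payload, header)

def open_packet(nonce: bytes, ciphertext: bytes, header: bytes) -> bytes:
    # Raises InvalidTag if either the payload or the header was tampered with.
    return aead.decrypt(nonce, ciphertext, header)

header = b"src=10.0.0.17 dst=10.0.0.42"    # authenticated but sent in the clear
nonce, ct = seal(b"patient record #1234", header)
print(open_packet(nonce, ct, header))      # b'patient record #1234'
```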

I worked at Microsoft for a few months in 2016, and they adopted this approach.  I could only connect to their system via a secured VPN, and the security perimeter it enforced when I worked from home was very carefully controlled and monitored.  I could continue projects from work while at home, but I could never have wandered the virtual Microsoft network with impunity from outside the office.  In my office, I had somewhat more relaxed access to internal software tools and projects.

This, then, gives some sense of what a secure Internet would look like.  But how secure can such a system really be?

As we push from the lowest layers of hardware and software up to higher levels of abstraction, the number of elements in a modern system increases exponentially.  There are dozens of operating systems, and each has hundreds of variants and patch levels, so the very first layer already has a level of diversity measurable in the thousands.  Then there are hundreds of programming tools and languages and services that can run on them, configurable in hundreds of ways, to say nothing of all the management options.  Layer by layer, we bring in a surreal degree of diversity simply by virtue of the choices made by each system designer and vendor.

In settings like military systems, or power grids, a major effort is invested to keep control over the forms of diversity that are actually present in the deployed system.   Individual users aren't permitted to install their own applications, patches are applied in a coordinated way, and monitoring tools are used to notice unexpected behavior that could point to an intrusion.  In contrast, networks used in other settings need to deal with enormous levels of diversity and individual customization.  Like it or not, the network provider simply won't know what the network is being used to do.

It seems to me that this diversity accounts for the insecurity of modern systems to a far greater degree than the "original sin" of endpoint anonymity and unauthenticated peering.  While the insecurity of the network certainly makes it easier for attackers to mount denial-of-service attacks or to send spam emails with falsified origin data, those are just a small aspect of the larger problem.  And that larger problem centers on the exceptionally large attack surface that modern systems offer: hundreds of millions if not billions of lines of code, riddled with bugs, and configured in every imaginable way (just yesterday I was horrified to discover that my home router has an administrative login and password, both set to the defaults.  Now I'm wondering about my television, and my Internet box...).

Fixing the original sin of the Internet genuinely would help in some ways, especially as we move to deploy an Internet of Things with more and more devices playing a diversity of roles in homes, cars, highways and other settings.  We should take that step.  But it is an illusion to imagine that true software security can be achieved by hardening the network itself, because the extreme diversity of uses would overwhelm any systematic attempt to impose security standards and ensure that they are respected.  Attempting to do so would probably create some very frustrated users, and yet would at best raise the bar slightly for one facet of a complex, multi-faceted problem.

More interesting, I think, is to treat diversity as a constructive tool for protecting large, complex systems.  Fred Schneider and I once coauthored an editorial on this topic, and a short paper, and I continue to believe that this was one of our better efforts.  Synthetic diversity, in particular, is a remarkable tool for combating the wily attacker, who often has a surprisingly limited arsenal of off-the-shelf exploits and might be blocked by even small "surprises".

The basic idea is simple: just as a compiler takes source code and can create a variety of executables (depending on the target instruction set, operating system, optimization level selected, etc.), we can also "compile" programs to exhibit various forms of artificial diversity.  We can force the memory layout to look somewhat random (simply by adding random padding to objects allocated dynamically).  We can vary the stack layout, and the order in which inputs are delivered if a program receives data from multiple sources.  We can compile one program several ways and pick, at random, the version that will be running today from the resulting set.  We can randomize the numbering used for operating system calls.
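Here is a toy illustration of two of these ideas: randomized padding between the fields of an allocated object, and a per-process shuffle of a call-dispatch table.  Real implementations do this in the compiler, linker or loader; the sketch below just mimics the effect, and every size and name in it is made up.

```python
# Toy illustration of synthetic diversity.  Real systems do this in the
# compiler, linker, or loader; here we just mimic two of the ideas in Python.
import random

rng = random.SystemRandom()

# 1. Randomized padding: objects get a random amount of dead space, so the
#    relative offsets an exploit relies on differ from process to process.
def diversified_layout(field_sizes):
    offset, layout = 0, {}
    for name, size in field_sizes:
        offset += rng.randrange(0, 16)      # random pad before each field
        layout[name] = offset
        offset += size
    return layout

print(diversified_layout([("buf", 64), ("saved_ptr", 8), ("flags", 4)]))

# 2. Randomized "system call" numbering: the dispatch table is shuffled at
#    startup, so a hard-coded call number from an off-the-shelf exploit
#    almost certainly lands on an unassigned slot.
services = ["read", "write", "open", "connect", "exec"]
numbers = rng.sample(range(1000), k=len(services))
dispatch = dict(zip(numbers, services))

def syscall(number):
    return dispatch.get(number, "trap: unassigned call number")

print(syscall(numbers[services.index("connect")]))   # legitimate caller
print(syscall(59))                                    # attacker's guess
```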

Such steps diversify and obfuscate the attack surface.  The attacker who was using an exploit that overruns an input buffer in a way that triggers a system call to download and then execute a bot will now run into an unpredictable memory location, a buffer that might not overflow, and in any case a system call to read from the network socket that might have been remapped to some normally-unassigned code number.  These are steps that really can pay off.

VPN security offers opportunities too.  Obviously, surfing the web requires that there be no barriers.  But for an application like a medical system that manages patient records or interacts with a lab, there is no real reason it should also be able to surf the web, and there is no reason that random web programs should be able to communicate with that application, either.  VPNs can offer this sort of protection, and if we could deploy them more easily, they could wrap sensitive code in layers of armor.

So count me on the side of those who believe that Internet security can and should be a priority, particularly in critical infrastructure and Internet of Things scenarios.  It really is about time to make Internet authentication a readily available, widely standardized function, and to deploy VPN technologies much more actively.  But doing so won't solve the whole problem.  We should also make more of an effort to synthetically diversify the systems we deploy widely, so that the attacker will encounter a bewildering variety of "versions" of any particular piece of software.

If you can't build a single impregnable castle, the next best thing is to populate a huge city with every imaginable variation on the castle theme.  Put police checkpoints on the roads leading to the city water pumping and power generating systems.  Post signs in a few languages, including some you made up just for this purpose.   Good luck to that hacker: he might break into one system, but doing so won't get him very far before we root him out...

Sunday, 11 June 2017

Moving AI to the edge

In today's data centers, a dominant paradigm shapes machine learning and AI systems: massive amounts of data are collected, cleaned, and stored into some form of database or collection of files.  Then machine learning tools are used to train a model on the observed data, and finally the resulting model is used for a while in the edge application.  All of this takes time, so the edge applications operate using stale data, at least to some degree.

In online AI/ML systems, like smart highways controlling smart cars, smart homes, or the smart power grid, the distinguishing characteristic is a pervasive need for instant responsiveness based on the most current data.  Today's standard cloud systems can definitely react to new events extremely rapidly (100ms or less is the usual goal), but because the edge runs on cached data -- in this situation, cached models -- and these platforms can't update their models promptly, they will continue to use a stale model long after something fundamental has changed and invalidated it.

So why would we care about stale models?  The term model, as used by the ML community, refers to any concise representation of knowledge.  For example, on a highway, knowledge of a particular truck's behavior (its apparent route, its history of speeds and lane changes, perhaps observations of risks such as a tire that could be shredding or a piece of cargo that might not be properly tied down) is all part of the model.  In a continuous learning setting, the model shapes behavior for all the smart cars in the vicinity.  In a smart power grid, the model is our estimate of the state of the grid; if the grid starts to show an oscillatory imbalance or signs of a shortage of power, the model can evolve in milliseconds, and the grid control algorithms need to adjust accordingly.  Yesterday's model, or even the one from ten seconds ago, might not be acceptable.

What will it take to move AI to the edge?
  • The AI/ML community will need to figure out what aspects of their problems really need to run on the edge, and are incorrectly situated on the standard backend today.  Coding at the edge won't be the same as coding for the backend, so work will be required.  This suggests that many applications will need to be split into online and offline aspects, with the online parts kept as slim as possible.  The offline components will be easier to implement because they run in a more standard way.
  • We need programming tools and platforms for edge-oriented AI/ML.  I'm betting that Derecho can be the basis of such a solution, but I also think it will take a while to reach that point.  We'll need to understand what edge-hosted AI/ML tools will actually look like: how will they represent, store, and access machine-learned models?  How big are these models, and what data rates arise?  Where are the low-latency paths, and how low does latency need to be?
  • We may need to integrate hardware accelerators into the infrastructure: if a system does vision, it probably wants to use GPU accelerators for image segmentation, tagging, and operations such as rotation, alignment, deblurring, 3-D scene reconstruction, etc.  FPGA components offer a second form of acceleration, better suited to searching HTML text or other "byte stream" objects.  FPGA accelerators are also useful for doing quick cryptographic operations.  There could easily be other kinds of ASICs too: DFFT units, quantum thinkers, you name it.
  • All of these pieces need to synchronize properly and be easy to program in a "correct" way...
I'm fascinated by this whole area.  If it interests you too,  send me a note:  perhaps we can find ways to team up!

Monday, 29 May 2017

Byzantine clients

Although I personally don't work on Byzantine block chains, we happen to be in the midst of a frenzy of research and hiring and startups centered on this model.  As you probably know, a block chain is just a sequence (totally ordered) of records shared among some (perhaps very large) population of participating institutions.  There is a rule for adding the next block to the end of the chain, and the chain itself is fully replicated.  Cryptographic protection is used to avoid the risk of a record being corrupted or modified after it is first recorded.  A Byzantine block chain uses Byzantine agreement to put the blocks into order, and to force the servers to vote on the contents of every block.  This is believed to yield ultra-robust services, and in this particular use case, ultra-robust block chains.
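The tamper-evidence part is easy to see in code.  Here is a minimal sketch in Python using SHA-256 from the standard library: each block records the hash of its predecessor, so altering any earlier record breaks every later link.  The agreement protocol that decides which block comes next (the Byzantine part) is deliberately not shown.

```python
# Minimal sketch of a block chain's tamper evidence: each block carries the
# hash of its predecessor, so altering any record breaks every later link.
# The agreement protocol that decides block order is not shown here.
import hashlib, json

def make_block(prev_hash: str, records: list) -> dict:
    body = json.dumps({"prev": prev_hash, "records": records}, sort_keys=True)
    return {"prev": prev_hash, "records": records,
            "hash": hashlib.sha256(body.encode()).hexdigest()}

def verify_chain(chain: list) -> bool:
    for i, block in enumerate(chain):
        body = json.dumps({"prev": block["prev"], "records": block["records"]},
                          sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != block["hash"]:
            return False
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block("0" * 64, ["ledger opened"])
chain = [genesis, make_block(genesis["hash"], ["A pays B 10"])]
print(verify_chain(chain))                     # True
chain[0]["records"] = ["ledger opened", "A pays A 1000000"]
print(verify_chain(chain))                     # False: tampering detected
```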

Financial institutions are very excited about this model, and many have already started to use a digital contracts platform called Hyperledger that seems to express a wide range of forms of contracts (including ones that have elements that can only be filled in later, or that reference information defined in prior blocks).  One can definitely use block chains to support forms of cyber currency, like Bitcoin (but there are many others by now).  In fact block chains can even represent actual transfers of money: this is like what the SWIFT international banking standard does, using text messages and paper ledgers to represent the transaction chain.

Modern block chain protocols that employ a Byzantine model work like this: we have a set of N participants, and within that set, we assume that at most T might be arbitrarily faulty: they could collude, might violate the protocol, and certainly could lie or cheat in other ways.  Their nefarious objective might be to bring the system to a halt, to confuse non-faulty participants about the contents of the blockchain or the order in which the blocks were appended, to roll back a block (as a way to undo a transaction, which might let them double-spend a digital coin or renege on a contract obligation), etc.  But T is always assumed to be less than N/3.
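A quick back-of-envelope check explains where the N/3 bound comes from: a quorum has to be reachable even when T replicas stay silent, and any two quorums have to overlap in at least one correct replica.  The small sketch below just grinds through those two constraints; it illustrates the arithmetic, not any particular protocol.

```python
# Back-of-envelope check of the T < N/3 bound.  A quorum of size q must be
# reachable even if T replicas stay silent (q <= N - T), and any two quorums
# must overlap in at least one correct replica (2q - N >= T + 1).  Both hold
# only when N >= 3T + 1.
def smallest_workable_quorum(n: int, t: int):
    for q in range(1, n + 1):
        if q <= n - t and 2 * q - n >= t + 1:
            return q
    return None   # no quorum size works: too many faulty replicas

for n, t in [(4, 1), (7, 2), (10, 3), (6, 2)]:
    q = smallest_workable_quorum(n, t)
    print(f"N={n}, T={t}: quorum size {q}" if q else f"N={n}, T={t}: impossible")
```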

Of course, this leads to a rich class of protocols, very well studied, although (ironically) more often in a fully synchronous network than in an asynchronous one.  But the PBFT ("Practical Byzantine Fault Tolerance") protocol created by Miguel Castro with Barbara Liskov works in real-world networks, and most block chain systems use some variation of it.  It basically builds on consensus (the same problem solved by Lamport's Paxos protocol).  The non-compromised participants simply outvote any Byzantine ones.

What strikes me as being of interest here is that most studies have shown that in the real world, Byzantine faults almost never arise!   And when they do, they almost never involve faulty servers. Conversely when attackers do gain control, they usually compromise all the instances of any given thing that they were able to attack successfully.  So T=0, or T=N.

There have been a great many practical studies on this question.  I would trace this back to Jim Gray, who once wrote a lovely paper on "Why Do Computers Stop and What Can Be Done About It?".  Jim was working at Tandem Computers at that time, on systems designed to tolerate hardware and software problems.  Yet they crashed even so.  His approach was to collect a lot of data and then sift through it.

Basically, he found that human errors, software bugs and design errors were a much bigger problem than hardware failures.  Jim never saw any signs of Byzantine faults (well, technically any fault is Byzantine, but I mean malicious behaviors, crafted to compromise a system).

More recent studies confirm this.  At Yahoo, for example, Ben Reed examined data from a great many Zookeeper failures, and reported his findings in a WIPS talk at SOSP in 2007.  None were Byzantine (in the malicious sense).

At QNX, Chris Hobbs has customers who worry that very small chips might experience higher rates of oddities that would best be modeled as Byzantine.  To find out, he irradiated some chips in a nuclear reactor and looked at the resulting crashes to see what all the bit flips did to the code running on them (I know of NASA studies of this kind, too).  In fact, things do fail, but mostly by crashing.  The main remaining issue turns out to be undetected data corruption, because most of the surface area on the chips in our computers is devoted to SRAM, caches, and DRAM storage.  Data protected by checksums and the like remains safe, but these other forms of memory are somewhat prone to undetected bit flips.  Most data isn't in active use in any case: most computer memory just holds random stuff, or zeros, and the longest-lived active use of memory is to hold the code of the OS and the application programs.  Bit flips will eventually corrupt instructions that do matter, but corrupted instructions mostly trigger faults.  So it turns out that the primary effect of radiation is to raise the rate of sudden crashes.
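The checksum point is easy to demonstrate.  The toy experiment below flips a random bit in two copies of the same data, one guarded by a SHA-256 digest and one not; the protected copy's corruption is caught, while the other copy is just silently wrong.  The data and the choice of hash are arbitrary.

```python
# Tiny illustration of the point about checksums: a random bit flip in a
# protected block is detected; the same flip in unprotected memory is silent.
import hashlib, random

rng = random.Random(42)

def flip_random_bit(data: bytearray) -> None:
    i = rng.randrange(len(data))
    data[i] ^= 1 << rng.randrange(8)

protected = bytearray(b"ledger balance = 1000")
checksum = hashlib.sha256(protected).digest()
unprotected = bytearray(b"ledger balance = 1000")

flip_random_bit(protected)
flip_random_bit(unprotected)

if hashlib.sha256(protected).digest() != checksum:
    print("protected block: corruption detected, discard or re-fetch")
print("unprotected block now reads:", bytes(unprotected))  # silently wrong
```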

Back when Jim did his first study, Bruce Nelson built on it by suggesting that the software bugs he was seeing fell into two classes: Bohrbugs (deterministic and easily reproduced, like Bohr's model of the atom: an easy target to fix) and Heisenbugs (wherever you look, the bug skitters off to somewhere else).  Bruce showed that in a given version of a program, Bohrbugs are quickly eliminated, but patches and upgrades often introduce new ones, creating an endless cycle.  Meanwhile, the long-lived bugs fell into the Heisenbug category, often originating from data structure damage "early" in a run that didn't cause a crash until much later, or from concurrency issues sensitive to thread scheduling order.  I guess that the QNX study just adds new elements to that stubborn class of Heisenbugs.

So, we don't have Byzantine faults in servers, but we do have a definite issue with crashes caused by bugs, concurrency mistakes, and environmental conditions that trigger hardware malfunctions. There isn't anything wrong with using Byzantine agreement in the server set, if you like.   But it probably won't actually make the service more robust or more secure.

Bugs can be fought in many ways.  My own preferred approach is simpler than running Byzantine agreement.  With Derecho or other similar systems, you just run N state machine replicas doing whatever the original program happened to do, but start them at different times and use state transfer to initialize them from the running system (the basis of virtual synchrony with dynamic group membership).
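A toy version of that pattern is sketched below.  It is not Derecho's API, just an illustration of the idea: every replica applies the same totally ordered updates, and a replica that starts late is initialized by copying a peer's current state rather than replaying history.

```python
# Toy version of the pattern (not Derecho's API): N replicas apply the same
# totally ordered updates; a replica started later joins via state transfer.
class CounterReplica:
    def __init__(self, state=0, applied=0):
        self.state, self.applied = state, applied

    def apply(self, update):
        self.state += update
        self.applied += 1

    def state_transfer(self):
        # A joining replica copies the current state instead of replaying
        # the full history from the beginning.
        return CounterReplica(self.state, self.applied)

log = [5, -2, 7, 1]                 # the agreed-upon total order of updates
a, b = CounterReplica(), CounterReplica()
for u in log[:3]:
    a.apply(u)
    b.apply(u)

c = a.state_transfer()              # replica started "at a different time"
for u in log[3:]:
    for replica in (a, b, c):
        replica.apply(u)

print(a.state == b.state == c.state)   # True: all replicas agree
```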

Could a poison pill kill them all at once?  Theoretically, of course (this is also a risk for a Byzantine version).  In practice, no.  By now we have thirty years of experience showing that in process group systems, replicas won't exhibit simultaneous crashes, leaving ample time to restart any that do crash, which will manifest as occasional one-time events.

Nancy Leveson studied a methodology called N-version programming; the hypothesis was that if most failures are simply due to software bugs, then by creating multiple versions of the most important services, you could mask the bugs, because coding errors would probably not be shared among the set.  This works, too, although she was surprised at how often all N versions were buggy in the same way.  Apparently, coders often make the same mistakes, especially when they are given specifications that are confusing in some specific respect.

Fred Schneider and others looked at synthetic ways of "diversifying" programs, so that a single piece of code could be used to create the versions: this method automatically generates a bunch of different versions from one source file, all with the same input/output behavior.  You get N versions, concurrency problems often won't manifest in the identical way across the replicas, and the replicas are also less prone to security compromises!

Dawson Engler pioneered automated tools for finding bugs and even for inferring specifications.  His debugging tools are amazing, and with them, bug-free code is an actual possibility.

Servers just shouldn't fail in weird, arbitrary ways anymore.  There is lots of stuff to worry about, but Byzantine compromise of a set of servers shouldn't be high on the radar.

Moreover, with strong cloud firewalls, Intel SGX, and layers and layers of monitoring, I would bet that the percentage of compromises that manage to take control of anywhere between 1 and N/2-1 replicas (but not more), or even of attacks that look Byzantine at all, is lower still.

But what we do see (including in Reed's 2007 study) are correct services that come under attack from some form of malicious client.  For an attacker, it is far easier to hijack a client application than to penetrate a server, so clients that try to use their legitimate connectivity to the server for evil ends are a genuine threat.  In fact, with open source, many attackers just get the client source code, hack it, then run the compromised code.  Clients that simply send bad data without meaning to are a problem too.

The client is usually the evil-doer.  Yet the Byzantine model blames the server, and ignores the clients.

It seems to me that much more could be done to characterize the class of client attacks that a robust service could potentially repel.  Options include monitoring client behavior and blacklisting any clients that are clearly malfunctioning; capturing data redundantly and somehow comparing values, so that a deliberately incorrect input can be flagged as suspicious or even suppressed entirely (obviously, this can only be done rarely, and for extreme situations); or filtering client input to protect against really odd data.  Gün Sirer and Fred Schneider once showed that a client could even include a cryptographic proof that input strings really came from what the client typed, without tampering.
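As a small illustration of the filtering-and-blacklisting option, here is a sketch of a server-side check that rejects readings outside a plausible range and blacklists a client that keeps sending them.  The sensor bounds and the three-strikes policy are invented purely for the example.

```python
# Sketch of server-side sanity checking of client input: out-of-range
# readings are rejected, and a client that keeps sending them is blacklisted.
# The bounds and the three-strikes rule are made-up illustrative policy.
from collections import defaultdict

VALID_RANGE = (-40.0, 85.0)          # plausible range for a temperature sensor
STRIKES_BEFORE_BLACKLIST = 3

strikes = defaultdict(int)
blacklist = set()

def accept_reading(client_id: str, value: float) -> bool:
    if client_id in blacklist:
        return False
    if not (VALID_RANGE[0] <= value <= VALID_RANGE[1]):
        strikes[client_id] += 1
        if strikes[client_id] >= STRIKES_BEFORE_BLACKLIST:
            blacklist.add(client_id)
        return False
    return True

for v in [22.5, 9999.0, -500.0, 12345.0, 23.1]:
    print("hacked-client", v, "->", accept_reading("hacked-client", v))
print("blacklisted:", blacklist)
```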

Manuel Costa and Miguel Castro came up with a great system they called Vigilante, a few years ago.  If a server was compromised by bad client input, Vigilante detected the attack, isolated the cause, and spread the word instantly.  Other servers could then adjust their firewalls to protect themselves, dynamically.  This is the sort of thing we need.

So here's the real puzzle.  If your bank plans to move my accounts to a block chain, I'm ok with that, but don't assume that BFT on the block chain secures the solution.  You also need to come up with a way to protect the server against buggy clients, compromised clients, and clients that just upload bad data.  The "bad dudes" are out there, not inside the data center.  Develop a plan to keep them there!

Friday, 26 May 2017

More thoughts on the learnability of implicit protocols

While driving from Sebastopol (in Sonoma valley, north of San Francisco) to Silicon Valley (which is south of San Francisco), one has a lot of time to think about how human-operated cars coordinate and implicitly communicate.  So I thought I might follow up on the posting I did a few weeks ago concerning the learnability of implicit protocols and see if I can't make the question a bit more concrete.

The best way to simplify a problem is to think of a base case, so I started by asking myself about the simplest model I could: vehicles modeled as points and a highway modeled as a straight line.

Interactions between vehicles would basically define a graph: any given car gets a chance to interact with the car in front of it, and the one behind it, and the protocol (whatever it might be) is like a state machine on these graphs.

So what behaviors would constitute a protocol in such a situation?  Well, we presumably would have car A in state Sa, and car B in state Sb, and then some "communication" takes place.  Not too many options: they either draw closer to one another at some speed, move apart from one another at some speed (I mean the speed of closing or separating, though probably the joint speed and direction of movement need to be specified too), or perhaps one or the other brakes sharply or accelerates sharply.  So here we have a "vocabulary" for the cars, with which they communicate.

A protocol, then, would be a rule by which A reacts to inputs of this form from B, and vice versa, since any action by B on A is also perceived by B itself (that is, even when B probes A, B itself may be forced to take some reactive action too: the human version is that you are trying to shift from some horrendously bad radio station to a different, slightly less horrible alternative when you look up and notice that you've gotten much closer to the car in front of you than you expected.  So in this case it was your own darn fault, but just the same, you react!)

So to learn a protocol, we might imagine a series of "tests" by which vehicle B, approaching A from behind, experiments to investigate A's reaction to various inputs, recording these into a model that over time would converge towards B's approximation of A's algorithm.  A similarly is learning B's behavior (and A might elicit behaviors from B.  For example, sometimes a car comes up on my tail very close, and especially if we are already basically moving at the fastest possible speed, it annoys me enough that I gradually slow down, rather than speeding up, as he or she probably intended for me to do -- there is an optimal distance at which to follow me, if you are trying to do so.  Similarly for most drivers.  Drive 3x further back and I might ignore you completely.  Drive at half that distance and my behavior departs from the norm: I slow down, which is suboptimal for both of us.  My way of punishing your pushy behavior!).

If you could figure out how two cars can learn a protocol, then you could ask if this generalizes to situations that form trains of cars.  So here, B catches up with a train: not just A, but perhaps X-Y-Z-A, with A in the rear and some form of agreed-upon protocol in use by X,Y,Z and A.  B gets to join the train if it can learn the protocol and participate properly.  Failing to do so leaves B abandoned, or results in an accident, etc.

Optimal behavior, of course, maximizes speed and minimizes the risk of accidents.  I'm starting to see how one could add a kind of utility function here: B wants to learn A's protocol quickly and "efficiently", with as few tests as possible.  Then B wants to use the knowledge gained to achieve this highest possible speed "jointly" with A.  That defines a paired protocol, and the generalization becomes a train protocol.

Another observation: suppose that B carries a small grab-bag of models that cars commonly use.  Now B's job of learning the model is potentially easier: if A is using one of the standard models, it may only take a very small number of tests to figure that out.  Moreover, having a good guess might make some models "learnable" that would actually not be learnable with a small number of tests otherwise.  This would fit well with the experience I've had in New Jersey, Belgium, Tel Aviv and California: each place has its own norms for how closely cars tend to space themselves on high-speed throughways and in other situations, so while you can feel disoriented at first, in fact this only adds up to four models.  If someone handed you those four models, I bet you could figure out which one applies with very few experiments.
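A sketch of that identification step: given a handful of candidate following-distance norms and a few observed gaps, score each candidate and pick the best fit.  The gap statistics attached to each region below are invented purely for illustration.

```python
# Sketch of picking among a small grab-bag of candidate "norms" with only a
# few observations.  The regional gap norms (in car lengths) are invented
# purely for illustration.
candidate_models = {                # (typical following gap, typical spread)
    "New Jersey": (1.5, 0.5),
    "Belgium":    (2.5, 0.7),
    "Tel Aviv":   (1.0, 0.4),
    "California": (3.0, 1.0),
}

def best_fit(observed_gaps):
    def score(params):
        mean, spread = params
        # Lower is better: normalized distance of the observations from the norm.
        return sum(abs(g - mean) / spread for g in observed_gaps)
    return min(candidate_models, key=lambda name: score(candidate_models[name]))

probes = [2.8, 3.3, 2.9]            # three gentle "tests" of the car ahead
print(best_fit(probes))             # -> California
```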

Unless, of course, you accidentally end up behind a New Jersey driver who is learning the rules of the road in Tel Aviv for the first time.  I think I've encountered a few such cases too.

So that would be a further interesting case to consider: roads with mixtures of models: subsets of cars, with each subset using a distinct model.  In fact my story of New York City cab drivers, from last time, would fall into that category.  The cabs figure out how to form school-of-fish packs that flow around obstacles and other cars, no matter what those other cars might be doing.  Meanwhile, there could easily be other protocols: New York public transit buses, for example, have their own rules (they basically are bigger than you, and can do whatever they like, and that is precisely how they behave).

Conversely, the New Jersey driver also illustrates a complication: the implicit protocols of interest to me are cooperative behaviors that maximize utility for those participating in the protocol, and yield higher utility than could be gained from a selfish behavior.  But any vehicle also will have an autonomous set of behaviors: a protocol too, but one that it engages in purely for its own mysterious goals.  And these selfish behaviors might not be at all optimal: I recently watched a teenage driver in the car behind me putting on makeup while talking on a cell phone and apparently texting as well.  So this particular driver wasn't maximizing any normal concept of utility!  And all of us have experienced drivers in the grips of road-rage, driving hyper aggressively, or teenage guys hot-dogging on motorcycles or weaving through traffic in their cars.  Their sense of utility is clearly very warped towards speed and has a very diminished negative utility for things like accidents, traffic violations and tickets, or sudden death.  So even as we learn to cooperate with vehicles prepared to cooperate, and need to figure out the protocols they are using, the environment is subject to heavy noise from these kinds of loners, inattentive drivers, reckless drivers, and the list goes on.

One thing that strikes me is that the cab driver protocol in New York, once you know it, is a good example of a very easily discovered policy.  Despite a huge number of non-compliant vehicles, anyone who knows the cab protocol and drives using it will quickly fall into synchrony with a pack of others doing so.  So with the proper metric, the level of positive feedback must be massive relative to the level of noise.

Interestingly, there is clearly a pack-like behavior even for a line of cars driving in a single lane.  This would be a very good case to understand first.

Of course the real-world school-of-fish behaviors involve (1) shifting lanes, which is the easy case, and (2) ignoring lane boundaries, which is a harder case.  Given that even the cars-on-a-line case taxes my imaginative capabilities, I think I'll leave the full 2-dimensional generalization to the reader!

Saturday, 13 May 2017

Smart Memory: How tight will the timing loop need to be?

Smart memory systems that integrate scalable data capture, computation and storage with machine-learning technologies represent an exciting opportunity area for research and product development.  But where should a smart memory live in today's data center computing "stack"?

Cornell's work on Derecho has been focused on two design points.  One centers on Derecho itself, which is offered as a C++ software library.  Derecho is blazingly fast in terms of both data capture and replication/storage speed and latency, but it can only be used by C++ programmers: unlike other C++ libraries that can easily be loaded by languages like Java or Python, modern C++ has many features (variadic template expansion, constant-expression evaluation, conditional compile-time logic) that don't currently have direct mappings into what those languages can support, so it will be a while before this kind of very modern C++ library is accessible from other languages.

The other main design point has simply been a file system API: POSIX plus snapshots.  Derecho incorporates an idea from our Freeze Frame File System in this respect: the file system snapshots are generated lazily (so there is no "overhead" for having a snapshot and no need to preplan them or anything of that kind), and our snapshots are temporally precise and causally consistent. 

Today, Derecho v1 (the version on GitHub) lacks this integration, but fairly soon we will upload a Derecho v2 that saves data into a novel file system that applications can access via these normal file system APIs.  One can think of these as read-only snapshots frozen in time, with very high accuracy.  Of course the programs that read them produce output files too, but those wouldn't be written into the original snapshot: they get written to some other read/write file system.

This leads to a model in which one builds a specialized data-capture service that receives incoming sensor data (which could include video streams or other large objects) and computes on it (compressing, deduplicating, segmenting/tagging, discarding uninteresting data, etc.).  The high-value content needs to be replicated and indexed.  Concurrently, machine learning algorithms would run on snapshots of the file representation, enabling the user to leverage today's powerful "off the shelf" solutions, running them on a series of temporal snapshots at fine-grained resolution: our snapshots can be as precise as you like, so there is no problem accessing data even at sub-millisecond resolution (of course, in the limit, the quality of your GPS time sources does impose limitations).
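To give a flavor of what that workflow might look like from the analysis side, here is a purely hypothetical sketch (not Derecho's or FFFS's actual interface): suppose each temporal snapshot surfaces as a read-only directory named by its timestamp, mounted at a made-up path.  Off-the-shelf code could then simply walk each snapshot and write its results to a separate read/write file system.

```python
# Hypothetical sketch only: assumes snapshots appear as read-only directories
# named by timestamp under a made-up mount point.  This is not Derecho's API.
import pathlib

SNAPSHOT_ROOT = pathlib.Path("/mnt/warehouse/snapshots")   # assumed mount point
RESULTS = pathlib.Path("/mnt/results")                     # ordinary read/write FS

def process_snapshot(snap_dir: pathlib.Path) -> None:
    # Stand-in for real analysis: just total up the captured bytes.
    total = sum(f.stat().st_size for f in snap_dir.rglob("*") if f.is_file())
    out = RESULTS / f"{snap_dir.name}.summary"
    out.write_text(f"snapshot {snap_dir.name}: {total} bytes captured\n")

for snap in sorted(SNAPSHOT_ROOT.iterdir()):   # e.g. 2017-05-13T10:15:02.250
    if snap.is_dir():
        process_snapshot(snap)
```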

The puzzle is whether this will work well enough to cover the large class of real-time applications that will need to respond continuously as events occur.  In the anticipated Derecho smart warehouse, the performance-critical path is roughly this: data is received, processed, and stored, and only then can the machine learning applications examine it and initiate action.  The key lag is the delay between receipt of data and stability of the file system snapshot: at least a few milliseconds, although this will depend on the size of the data and the computed artifacts.  One can imagine that this path could be quite fast, but even so, the round-trip cycle time before an action can occur in the real world could easily grow to tens of milliseconds or more.  We plan to do experiments to see how tight a round-trip response time we can guarantee.

The alternative would move the machine learning into the C++ service, but this requires coding new ML applications directly in C++.  I'm assuming that doing so would be a large undertaking, just considering how much existing technology is out there, and might have to be duplicated.  The benefit of doing so, though, would be that we could then imagine response times measured in milliseconds, not tens or hundreds of milliseconds.

I would be curious to hear about known real-time data warehousing use cases, and the round-trip response times they require. 

Monday, 24 April 2017

Will smart memory need an ACID model?

This is really part II of my posting on smart memory from ten days ago.

Let's imagine that we've created the ultimate data warehouse, using Derecho.  This warehouse hosts terabytes of persistent memory.  It can absorb updates at a staggering rate (hundreds of gigabits per second), has tens of thousands of processors that crunch the data down and then store it as collections of higher-level "knowledge models", and is always hungry for more.

We used to think of data warehouses as repositories for nuggets of data, so perhaps we could call these "muggets": models that can now be queried.  Hopefully this term isn't painfully cute.

Next, we imagine some collection of machine learning applications that consume these muggets and learn from them, or compute some sort of response to them.  For example if the muggets represent knowledge collected on the local highway, the queries might be posed by smart cars trying to optimize their driving plans.  "Is it ok with everyone if I shift to the passing lane for a moment?"  "Does anyone know if there are obstacles on the other side of this big truck?"  "What's the driving history for this motorcycle approaching me: is this guy some sort of a daredevil who might swoop in front of me with inches to spare?"  "Is that refrigerator coming loose, the one strapped to the pickup truck way up ahead?"

If you drive enough, you realize that answers to such questions would be very important to a smart car!  Honestly, I could use help with such things now and then too, and my car is pretty dumb.

We clearly want to answer the queries with strong consistency (for a smart car, a quick but incorrect answer might not be helpful!), but also very rapidly, even if this entails using slightly stale data.  In Derecho, we have a new way to do this that adapts the FFFS snapshot approach described in our SOCC paper to run against what we call version vectors, which are how Derecho stores volatile and persistent data.  Details will be forthcoming shortly, I promise.

Here's my question: Derecho's kind of data warehouse currently can't support the full ACID database style of computation, because at present, Derecho only offers read-only consistent queries against its temporally precise, causally consistent snapshots.  So we have strong consistency for updates, which are totally ordered atomic actions against sets of version vectors, and strong consistency for read-only queries, but not for read/write transactions, in which an application reads the current state of the data warehouse, computes on it, and then updates it.  Is this a bad thing?
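To make the read-only side of that model concrete, here is an illustrative sketch (again, not Derecho's actual API): updates append timestamped versions to a version vector, and a query at time t returns the newest version no later than t, so queries never block the update path.

```python
# Illustrative sketch (not Derecho's API): a version vector stores each
# update as a (time, value) pair; a read-only query at time t returns the
# newest version no later than t, so queries never block the update path.
import bisect

class VersionVector:
    def __init__(self):
        self.times, self.values = [], []

    def update(self, t: float, value) -> None:
        # Updates arrive in total order, so timestamps increase monotonically.
        self.times.append(t)
        self.values.append(value)

    def query(self, t: float):
        i = bisect.bisect_right(self.times, t)
        return self.values[i - 1] if i else None

v = VersionVector()
v.update(10.0, "grid nominal")
v.update(10.1, "oscillation detected")
print(v.query(10.05))   # 'grid nominal'  -- a consistent, slightly stale view
print(v.query(10.20))   # 'oscillation detected'
```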

I've been turning the question over as I bike around on the bumpy, potholed roads here in Sebastopol, where we are finishing up my sabbatical (visiting my daughter, but I've also dropped in down in the valley and at Berkeley now and then).  Honestly, cyclists could use a little smart warehouse help too, around here!  I'm getting really paranoid about fast downhills with oak trees: the shadows invariably conceal massive threats!  But I digress...

The argument that Derecho's time-lagged model suffices is roughly as follows: ACID databases are hard to scale, as Jim Gray observed in his lovely paper "The Dangers of Replication and a Solution".  Basically, the standard model slows down as n^5 (n is the number of nodes running the system).  This observation gave us CAP and BASE and, ultimately, today's wonderfully scalable key-value stores and noSQL databases.  But those have very weak consistency guarantees.

Our Derecho warehouse, sketched above (fleshed out in the TOCS paper we are just about to submit) gets a little further. Derecho can work quite well for that smart highway or similar purposes, especially if we keep the latency low enough.  Sure, queries will only be able to access the state as of perhaps 100ms in the past, because the incoming database pipeline is busy computing on the more current state.  But this isn't so terrible.

So the question we are left with is this: for machine learning in IoT settings, or similar online systems, are there compelling use cases that actually need the full ACID model?  Or can machine learning systems always manage with a big "nearly synchronous" warehouse, running with strong consistency but lagging 100ms or so behind the current state of the world?  Is there some important class of control systems or applications that would be ruled out by that limitation?  I really want a proof, if so: show me that omitting the full ACID behavior would be a huge mistake, and convince me with mathematics, not handwaving... (facts, not alt-facts).

I'm leaning towards the "Derecho would suffice" answer.  But very curious to hear other thoughts...

Saturday, 15 April 2017

Will smart memory be the next big thing?

We're fast approaching a new era of online machine learning and online machine-learned behavior: platforms that will capture real-time data streams at high data rates, process the information instantly, and then act on the output of the processing stage.  A good example would be a smart highway that tells cars what to expect up around the next curve, or a real-time television feed integrated with social networking tools (like a World Cup soccer broadcast that lets the viewers control camera angles while their friends join from a remote location).

If you ask how such systems will need to be structured, part of the story is familiar: as cloud-hosted services capturing data from sensors of various kinds (I'm including video cameras here), crunching on it, then initiating actions. 

A great example of such a service is Microsoft's Cosmos data farm.  This isn't a research platform but there have been talks on it at various forums.  The company organized a large number of storage and compute nodes (hundreds of thousands) into a data warehouse that absorbs incoming objects,  then replicates them onto a few SSD storage units (generally three replicas per object, with the replication pattern fairly randomized to smooth loads, but in such a way that the replicas are on fault-independent machines: ones that live in different parts of the warehouse and are therefore unlikely to crash simultaneously).

Once the data is securely stored, Cosmos computes on it: it might compress or resize an image or video, or it could deduplicate, or run an image segmentation program.  This yields intermediary results, which it also replicates, stores, and then might further process: perhaps, given the segmented image, it could run a face recognition program to automatically tag people in a photo.  Eventually, the useful data is served back to the front-end for actions (like the smart highway that tells cars what to do).  Cold but valuable data is stored to a massive backend storage system, like Microsoft's Pelican shingled storage server.  Unneeded intermediary data is deleted to make room for new inputs.
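Here is a small sketch of the placement rule just described: each object gets three replicas, chosen at random but constrained to land in distinct fault domains.  The rack names and the flat two-level layout are invented; a real warehouse has richer failure domains (power, cooling, network).

```python
# Sketch of the placement rule described above: three replicas per object,
# chosen at random but required to land in distinct fault domains (here,
# made-up rack names stand in for different parts of the warehouse).
import random

rng = random.Random()

racks = {
    "rack-A": ["a1", "a2", "a3"],
    "rack-B": ["b1", "b2", "b3"],
    "rack-C": ["c1", "c2", "c3"],
    "rack-D": ["d1", "d2", "d3"],
}

def place_replicas(object_id: str, copies: int = 3):
    chosen_racks = rng.sample(list(racks), k=copies)     # distinct fault domains
    return {rack: rng.choice(racks[rack]) for rack in chosen_racks}

print(place_replicas("video-frame-000017"))
# e.g. {'rack-B': 'b3', 'rack-D': 'd1', 'rack-A': 'a2'}
```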

Thus Cosmos has a lot in common with Spark, the famous data processing platform, except on a much larger scale, and with an emphasis on a pipeline of transformations rather than on MapReduce.

If we step way back, we can start to perceive Cosmos as an example of a smart memory system: it remembers things, and also can think about them (process them), and could potentially query them too, although as far as I know Cosmos and Spark have limited interactive database query functionality.  But you could easily imagine a massive storage system of this kind with an SQL front-end, and with some form of internal schema, dynamically managed secondary index structures, etc.

With such a capability, a program could interrogate the memory even as new data is received and stored into it.  With Cornell's Derecho system, the data capture and storage steps can be an asynchronous pipeline that would still guarantee consistency.  Then, because Derecho stores data into version vectors, queries can run asynchronously too, by accessing specific versions or data at specific times.  It seems to me that the temporal style of indexing is particularly powerful.

The interesting mix here is massive parallelism, and massive amounts of storage, with strong consistency... and it is especially interesting that in Derecho, the data is all moved asynchronously using RDMA transfers.  Nobody has to wait for anything, and queries can be done in parallel.

Tomorrow's machine learning systems will surely need this kind of smart memory, so for those of us working in systems, I would say that smart memory architectures jump out as a very exciting next topic to explore.  How should such a system be organized, and what compute model should it support?  As you've gathered, I think Cosmos is already pretty much on the right track (and that Derecho can support this, even better than Cosmos does).  How can we query it?  (Again, my comment about SQL wasn't haphazard: I would bet that we want this thing to look like a database.)  How efficiently can it use the next generation of memory hardware: 3-D XPoint, phase-change storage, high-density RAID-style SSD, RDMA and GenZ communication fabrics?

Check my web page in a few years: I'll let you know what we come up with at Cornell...