Monday, 31 July 2017

Why is it so hard to mask failures?

When we talk about fault tolerant distributed computing, using the state machine replication approach, it may seem obvious that a system of this kind should be capable of completely masking failures.  In fact, however, this is not the case.  Our ability to hide failures is really very limited.

When developers use state machine replication techniques (SMR), the usual approach is to replace components of systems or distributed services with groups of N members, and then use some sort of library that delivers the same inputs to each, in the same order.  If the replicated component is deterministic, and if care is taken to initialize each component in a manner properly synchronized with respect to its peers, this is enough to guarantee that the copies will remain synchronized.  Thus, we have an N-replica group that seemingly will tolerate N-1 faults.

Unfortunately, theory is one thing and reality can be quite a different matter.  When people first began to experiment with SMR in the 1990's, developers quickly noticed that because software bugs are a major cause of failure, perfect replication will replicate many kinds of faults!  Over time, a more nuanced approach emerged, in which the various replicas are proactively shut down and restarted in an uncoordinated way, so that on average there would still be N copies, but at any instant in time there might be N-1 copies, with one copy shutting down or rejoining.  The trick is to transfer the current state of the group to the recovering member, and is solved using the virtual synchrony model, in which group membership advances through a series of epochs, reported via view upcall notifications, with state transfers performed during epoch transitions.

The benefit of this sort of staggered restart is to overcome so-called Heisenbugs.  The term refers to bugs that are hard to pin down: they could cause non-deterministic behavior (in which case the replicas might diverge), or bugs that seem to shift around when the developer tries to isolate them.
A common form of Heisenbug involves situations where a thread damages a data structure, but the damage won't be noticed until much later, at which point any of a number of other threads could try to access the structure and crash.  Thus the failure, when it occurs, is associated with logic remote from the true bug, and may jump about depending on scheduling order.  If the developer realizes that the root cause is the earlier damage to the data structure, it generally isn't too hard to fix the problem.  But if the developer is tricked into thinking the bug manifested in the code that triggered the crash, any attempts to modify that logic will probably just make things worse! 

The reason that staggered restart overcomes Heisenbugs is that a restarting program will load its initial state from some form of checkpoint, hence we end up with N copies, each using different operations to reach the same coordinated state as the other N-1.  If the data-structure corruption problem isn't a common thing, this joining process is unlikely to have corrupted the same data structure as did the others.  With proactive restart, all N copies may be in equivalent yet rather different states.  We can take this form of diversity even further by load-balancing read-requests across our N copies: each will see different read operations and this will be a further source of execution diversity, without mutating states in ways that can cause the N replicas to diverse.
With such steps, it isn't hard to build an ultra-resilient SMR service, that can remain alive even through extremely disruptive failure episodes. But can such a service "mask" failures?

The answer is yes and no.

On the "yes" side we find work by Robert Surton at Cornell, who created a clever little TCP fail-over solution called TCP-R.  Using this protocol, a TCP connection can seamlessly roll from one machine (the failed server) to another.  The cleverness arises because of the way that TCP itself handles acknowledgements: in Surton's approach, a service member accepts a TCP connection, reads the request (this assumes that the request size is smaller than the TCP window size, in bytes), replicates the received request using an SMR multicast, and only then allows TCP to acknowledge the bytes  comprising the request. 

If a failure disrupts the sequence, TCP-R allows a backup to take control over the TCP session and to reread the same bytes from the window.   Thus the service is guaranteed to read the request at least once.  A de-duplication step ensures that a request that happens to be read twice won't cause multiple state updates.

Replies back to the end user are handled in a similar way.   The service member about to send the reply first notifies the other members, using a multicast, and only then sends the reply.  If a failure occurs, one of the other members claims the TCP endpoint and finishes the interrupted send.
With TCP-R the end-user's experience is of a fully masked failure: the application sends its request, and the service definitely gets the request (unless all N members crash simultaneously, which will break the TCP session).

Lacking TCP-R, the situation is quite a bit more complex.  In effect, the end-user would need to send the request, but also be prepared to re-send it if the endpoint fails without responding.  For read-only requests, the service can just perform the request multiple times if it shows up multiple times, but updates are more complex.  For these, the service would need logic to deduplicate requests: if the same request shows up twice, it should resend the original reply and not mutate the service state by performing the identical operation a second time.  TCP-R masks the service failure from the perspective of the client, although the service itself still needs this form of deduplication logic.

On the "no" side of the coin, we need to consider the much broader range of situations that can arise in systems that use SMR for fault-tolerance.  In particular, suppose that one SMR-replicated service somehow interacts with a second SMR-replicated service.  Should all N members of the first replica group repeat the same request to the M members of the second group?   Doing so is clearly the most general solution, but runs into the difficulty that bytes will be transferred N times to each member: a high cost if we are simply trying to mask a rare event!

Eric Cooper studied this question in his PhD thesis on a system called Circus, at Berkeley in the 1990's.  Basically, he explored the range of options from sending one request from group A to group B, but reissuing the request if the sender in A failed or the receiver in B, all the way to the full N x M approach in which every member of A multicasts every request to every member of B, and the members of B thus receive N copies and must discard N-1 of them in the usual case.  (TCP-R can be understood as an instance of the first approach, but with the client-side logic hidden under TCP itself, so that only the server has to be aware of the risk of redundancy, and so that it only arises when a genuine failure occurs.)

Cooper pointed out that even with high costs for redundancy, the N x M approach can often outperform any scheme that waits to sense a failure before retrying.  His insight was that because detecting a failure can be slow (often 30s or more), proactively sending multiple copies of each request will generally produce extra load on the receivers, but with the benefit of ensuring a snappy response because at least some receiver will act promptly and send the desired reply with minimal delay.

Cooper's solution, Circus, is best understood as a design pattern: a methodology that the application developer is expected to implement.  It involves a multicast from group A to group B, code in group B to remember recent responses to requests from A, and logic to de-duplicate the request stream, so that whether B receives a request 1 time or N times, it behaves identically and responds to A in the same manner.

In Derecho, we don't currently offer any special help for this heavily redundant approach, but all the needed functionality is available and the design pattern shouldn't be hard to instantiate.  But in fact, many Derecho users are more likely to use a non-fault-tolerant approach when building a data processing pipeline.  More specifically, while Derecho users would often use replication to make the processing elements of the pipeline fault-tolerant, they might decide not to pay the overhead of making the handoff of requests, stage by stage in the pipeline, ultra-reliable.

The reason for this compromise is  that in the IoT settings where a "smart memory service" might be used, most sensors are rather redundant, sending photo after photo of the same car on the highway, or location after location for the cat in the apartment.  The service receives this heavily duplicative input and will actually start by sorting out the good content and discarding the replicated data.  Thus we suspect that most Derecho users will be more concerned with ensuring that the service itself is highly available, and less concerned with ensuring that every single message sent to the service is processed. 

Indeed, in most IoT settings, freshness of data is more important that perfect processing of each and every data point.  Thus, if some camera creates photo X of a vehicle on the highway, and then photo Y, and X is somehow lost because of a failure that occurs exactly as it is being sent, it often would make more sense to not fuss about X and just focus on processing request Y instead.

Microsoft has a system, Cosmos, in which a pipeline of processing is done on images and videos.  It manages without fault-tolerant handoff between stages because failures are rare and, if some object is missing, there is always a simple recipe to create it again from scratch.  Facebook apparently does this too.  Both systems need to be extra careful with the original copies of photos and videos, but computed artifacts can always be regenerated.  Thus, perfect fault tolerance isn't really needed!

Of course, one can easily imagine systems in which each piece of data sent to the service is individually of vital importance, and for those, an open question remains: is it really necessary to hand-code Cooper's Circus design pattern?  Or could there be a really nice way to package the Circus concept, for example by using a higher level language to describe services and compiling them down to SMR replicas that talk to one-another redundantly? 

I view this as an open research topic for Derecho, and one we may actually tackle in coming years.  Until then, Derecho can certainly support a high quality of adaptation after crashes, but won't seamlessly hide crashes from the developer who works with the technology.  But on the other hand, neither does any other technology of which I'm aware!

Saturday, 15 July 2017

How far could a secure Internet get us?

There is a standard refrain among those who teach and do research on security: it holds that the fundamental flaw in security originates with an early decision by the community that created the Internet.  In that early design, Internet routing occurs by a form of peering that can often occur anonymously.  Endpoint systems are not authenticated, and packets generally aren't encrypted or authenticated either.  As a consequence, packets are routed permissively even if they contain what are almost certainly false origin addresses, or have content that seems dubious in obvious ways (like a TCP connection packet that makes no sense on this particular link).

This story then holds that the original sin of the Internet design community has given us layer upon layer of infrastructure that cannot be secured because it resides on a flawed and easily compromised foundation.

As it turns out, there is little doubt that we actually can build an Internet that corrects that deep flaw, and that can be relatively compatible with the existing Internet.  Enterprises like military systems and security-conscious companies do it all the time.   Let me sketch out how this is done (it isn't rocket science, and you won't find many surprises).  But then let's ask if it would really be a game-changer for the security of higher level applications.

To appreciate the core challenge, it helps to start by understanding that everything you think you know about the network is really just an abstraction layered over something more basic.  For example, even my advanced students tend to think of Internet addresses as identifiers, as if my computer's true name were and your computer's true name was  In fact, these addresses are better understood as routing data: useful for deciding what the next hop should be as a packet progresses hop-by-hop through the network, but lacking any deep connection to identity. 

In fact, a single computer might have many network addresses (one for each of its network interfaces).  It might host multiple virtual machines, and those could each have virtual network addresses of their own: with virtualization, any computer can function as a kind of Internet and host large numbers of computers that run within it, complete with virtual networks, virtual name spaces, services of various kinds, you name it.  A single computer might move about, so that over short periods of times it takes on a sequence of addresses: the old addresses can cease to work, and traffic needs to be redirected to the new ones.  Addresses can be mapped by network address translators.

Once we free ourselves from this false equivalence of network address to identity, you need to ask what the real root of identity should be, in a modern system.  Here, the proper focus is on hardware security tokens combined with human users who authenticate themselves in the usual ways.  The hardware could be a built-in component of the computer, or it could be some form of plug-in.  But the key point is that when we associate authentication with these unforgeable hardware elements, used in tandem, we arrive at a much stronger notion of endpoint identity.  The deep roots of that form of identity reside in key pairs: the identity defines some form of private key, with which information can be authenticated by public components that only have access to the corresponding public key.

This then is our bootstrap opportunity: we live in a vast world of devices: computers, routers, IoT components like video cameras and smart televisions and smart cars, and each element can be understood as either being anonymous (if it lacks the ability to authenticate itself), or capable of proving that it has "authorized access" to some private key.  With that proof, we can consult a vendor-published registry and from that registry, can learn about this endpoint device.  A device can also be registered in secondary registries: when I bring a new router into my smart home, I could register my router as "Ken's router, at his summer cottage on Cayuga Lake".  And now there would be a trustworthy chain of reasoning that would let you convince yourself that certain messages were indeed sent by, or countersigned by, my router.

Sounds familiar?  It should, if you've ever brought your laptop from home to work.  Some companies won't allow you to connect the laptop at all (fearing viruses that your machine might carry), but those that do usually require precisely this sort of authentication and registration.

Given strong authentication, a second opportunity arises whenever we focus on an island of infrastructure that has a coherent owner.  Here on Cayuga Lake, my network provider is part of a broader system owned and controlled by a company that controls a fairly large regional network.  This ISP is paid for its services, and at least in theory, has complete control of every device that can connect directly to it, every router and switch it operates internally, and every network link used to glue them all together.  One can understand the ISP as a kind of military hierarchy: at the core we find a "master and commander" who has absolute control and is the sole source of permissions.  Below the master and commander are a collection of subordinates, to whom restrictive roles have been granted, and way down at the bottom are the common seaman who might permit packets originating in my router to sail the seas of this regional network -- or could block them.

As it happens, today's ISPs are somewhat relaxed about authentication, so one opportunity to secure the network would start by simply enforcing authentication when devices are connected to an ISP.  Today when I buy a new iPhone, I can simply start to use it in my home.  If I was trying to build a much more secure network, at a minimum my router might pop up a screen requiring me to fill in details: who owns this iPhone (and it might want proof: type in your password, your RSA code number, and hold your thumb on the fingerprint reader...)  Perhaps, that regional ISP would do the same and require a second level of device registration.  Military and other nationally critical infrastructure networks work precisely in this way: if you were to take a laptop into a major power plant and connect it to the Ethernet jack in the waiting room while waiting for your appointment with human resources, either it won't connect, or it will give you some form of very limited guest connectivity. 

Think back to the last time you took your laptop to work.  I bet that something along the lines just described happened then, too.  Any organization that takes security at all seriously is careful to track the devices connected to it, and to limit their networking "power".  A military system won't allow you to connect your own machine, at all.  But if you work at Cornell, like I do, you might be able to get permission -- except that your machine will end up registered for use in a specific context, such as from my office in the Computer Science building.  If I were to carry it across the street and connect it to a wall jack in the ECE department, I would be back to square zero.

With enclaves that use authentication, one can take further steps.  Without too much added cost, packets can be cryptographically signed or fully encrypted at the endpoint, yielding various forms of virtual private networks: subnetworks within which communication is safe, but with strong protection against traffic from the outside leaking in, or against intruders managing to read or tamper with data.  Such systems can also filter or block traffic that might try to escape the secure perimeter.

I worked at Microsoft for a few months in 2016, and they adopted this approach.  I could only connect to their system via a secured VPN, and the security perimeter it enforced when I worked from home was very carefully controlled and monitored.  I could continue projects from work while at home, but I could never have wandered the virtual Microsoft network with impunity from outside the office.  In my office, I had somewhat more relaxed access to internal software tools and projects.

This, then, gives some sense of what a secure Internet would look like.  But how secure can such a system really be?

As we push from the lowest layers of hardware and software up to higher levels of abstraction, the numbers of elements of a modern system increase exponentially.  There are dozens of operating systems and each has hundreds of variants and patch levels.  So the very first layer already has a level of diversity measureable in the thousands.  Then there are hundreds of programming tools and languages and services that can run on them, configurable in hundreds of ways, to say nothing of all the management options.  Layer by layer, we bring in a surreal degree of diversity simply by virtue of the choices made by each system designer and vendor.

In settings like military systems, or power grids, a major effort is invested to keep control over the forms of diversity that are actually present in the deployed system.   Individual users aren't permitted to install their own applications, patches are applied in a coordinated way, and monitoring tools are used to notice unexpected behavior that could point to an intrusion.  In contrast, networks used in other settings need to deal with enormous levels of diversity and individual customization.  Like it or not, the network provider simply won't know what the network is being used to do.

It seems to me that this diversity accounts for the insecurity of modern systems, to a far greater degree than the "original sin" of endpoint anonymity and unauthenticated peering.  While the insecurity of the network certainly makes it easier for attackers to mount denial of service attacks or to route spam emails with falsified origin data, those are just the a small aspect of the larger problem.  And that larger problem centers on the exceptionally large attack surface that modern systems offer: hundreds of millions if not billions of lines of code, riddled with bugs, and configured in every imaginable way (just yesterday I was horrified to discover that my home router has an administrative login and password, both set to the defaults.  Now I'm wondering about my television, and my internet box...). 

Fixing the original sin of the Internet genuinely would help in some ways, especially as we move to deploy an Internet of Things with more and more devices playing a diversity of roles in homes, cars, highways and other settings.  We should take that step.  But it is an illusion to imagine that true software security can be achieved by hardening the network itself, because the extreme diversity of uses would overwhelm any systematic attempt to impose security standards and ensure that they are respected.  Attempting to do so would probably create some very frustrated users, and yet would at best raise the bar slightly for one facet of a complex, multi-faceted problem.

More interesting, I think, is to think about diversity as a constructive tool for protecting large, complex systems.  Fred Schneider and I once coauthored an editorial on this topic, and a short paper, and I continue to believe that this was one of our better efforts.  Synthetic diversity, in particular, is a remarkable tool for combatting the wily attacker, who often has a surprisingly limited arsenal of off the shelf exploits and might be blocked by even small "surprises".

The basic idea is simple: just as a compiler takes source code and then can create a variety of executables (depending on the target instruction set, operating system, optimization level selected, etc), we can also "compile" programs to expose various forms of artificial diversity.  We can force the memory layout to look somewhat random (simply by adding random padding to objects allocated dynamically).  We can vary the stack layout and the order in which inputs are delivered if a program receives data from multiple sources.  We can potentially compile one program several ways, and pick the version that will be running today at random within the resulting set.  We can randomize the numbering used for operating systems calls.

Such steps diversity and obfuscate the attackable surface.  The attacker who was using an exploit that overruns an input buffer in a way that triggers a system call to download and then execute a bot will now run into an unpredictable memory location, a buffer that might not overflow, and in any case the system call to read from the network socket might have been remapped to some normally-unassigned code.  These are steps that really can pay off.

VPN security offers opportunities too.  Obviously, surfing the web requires that there be no barriers.  But for an application like a medical system that manages patient records or interacts with a lab, there is no real reason to also be able to surface the web, and there is no reason that random web programs should be able to communicate to that application, either.  VPNs can offer this sort of protection, and if we could deploy them more easily, they could wrap sensitive code in layers of armor.

So count me on the side of those who believe that Internet security can and should be a priority, particularly in critical infrastructure and Internet of Things scenarios.   It really is about time to make Internet authentication a readily available and widely standard function, and to deploy VPN technologies much more actively.  But doing so won't solve the whole problem.  We should also make more of an effort to synthetically diversify the systems we deploy widely, so that the attacker will encounter a bewildering variety of "versions" of any particular software.   

If you can't build a single impregnable castle, the next best thing is to populate a huge city with every imaginable variation on the castle theme.  Put police checkpoints on the roads leading to the city water pumping and power generating systems.  Post signs in a few languages, including some you made up just for this purpose.   Good luck to that hacker: he might break into one system, but doing so won't get him very far before we root him out...