
Friday, 21 August 2020

Knowledge logics (and love)

 Alice and Bob, two star-crossed lovers, had planned to meet for lunch in downtown Palo Alto if the weather was good, this to be determined by Alice, who would email Bob, who then would confirm (Bob being a late riser, there was some risk of his not actually seeing an email in time).  But they never actually did meet, and indeed, Alice never once sent an email -- despite the fact that Northern California had a near-record run of three months without a single cloudy day!  When those emails didn't arrive, Bob realized that he couldn't imagine life without her.

By now you've figured out the puzzle, if you tried.  The core problem is uncertainty: When Alice emails Bob  "I just checked, nice weather again (no surprise...).  How about Vegan Burger at 12:15?", she has no certainty Bob will see the email in time to meet her.  Bob, who woke up early that morning, replies instantly: "Can't wait!".  But did Alice see his reply?  Alice replies to Bob: "It will be so nice to see you again...".  And so it goes... with neither ever reaching a state of certainty.   So, realizing this, Alice didn't waste the bits on a pointless, unending exchange.  Pankaj Mehra wins the prize for being first to post the correct answer!  

In fact the deeper puzzle is precisely the opposite of the one I posed.  The question that we really need to ask is "Why, then, do lovers ever manage to meet for lunch?  And how can distributed systems reach agreement, on anything?"

Yoram Moses and Joe Halpern formalized this scenario using what came to be known as a "knowledge logic."  You start with a normal logic, but have a set of participants (Alice, Bob, ...) and a way for them to exchange messages, which are generally treated as reliable, but with unbounded delays.  If predicate P is true for Alice, we say that Alice "knows" P: K(P)Alice.  Now we can write K(K(P)Alice)Bob to denote that Bob knows that Alice knows P.  For example, Bob learns that Alice knows that the weather is fine. The problem is that at the moment Bob learns this (transitions from not knowing to knowing), Alice does not know that Bob knows that Alice knows that the weather is fine.  When Bob emails his confirmation, the same issue arises but with their roles reversed: now Alice knows that Bob knows that the weather is nice, but Bob doesn't know that Alice saw the confirmation, hence for him, the situation is actually unchanged.  This knowledge asymmetry is the root of their dilemma.

Last week I suggested that anyone puzzled by my puzzle watch The Princess Bride.  Early in the movie there is a fantastic scene structured around this exact question.  The criminal Vizzini -- who happens to be the smartest man in the world (Wallace Shawn) -- engages in a duel to the death with Westley, the current incarnation of the Dread Pirate Roberts (Cary Elwes).   The duel centers on deciding which of two cups of wine has been poisoned.  Vizzini debates himself until, after some number of back-and-forth ripostes, he suddenly concludes "but Westley certainly isn't smart enough to have realized this!"  He seizes one of the cups and drinks it, while Westley drinks the other.  Vizzini dies... but was his logic wrong?

The logic of knowledge is quite elaborate, and can express all sorts of statements about distributed systems and multi-party reasoning.  It also comes with some theorems, including the following.  To get rid of all the names, let's say that Bob has "K1" knowledge of P if Bob knows that someone (or some set of people) knows P.  So if Bob knows that Alice knows the weather, he has K1 knowledge (Alice has K0 knowledge).  Once Alice reads Bob's first acknowledgement, she has K2 knowledge, and so forth.
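To make the K-level bookkeeping concrete, here is a toy sketch (my own illustration, not anything drawn from the Halpern-Moses formalism itself): each delivered message raises the recipient's knowledge depth about the weather by one, and no finite exchange ever reaches the "for all n" property that the lunch plan implicitly requires.

  #include <iostream>

  // Toy model: depth = the largest n for which this party has Kn knowledge of
  // the fact "the weather is fine".  Each delivered message carries the
  // sender's current depth; receiving it gives the recipient depth + 1.
  struct Party {
      const char* name;
      int depth = -1;   // -1: knows nothing yet
  };

  // Deliver one message from 'from' to 'to' (assumed to arrive -- the whole
  // dilemma is that neither side can be certain it will).
  void deliver(const Party& from, Party& to) {
      to.depth = from.depth + 1;
      std::cout << to.name << " now has K" << to.depth << " knowledge\n";
  }

  int main() {
      Party alice{"Alice"}, bob{"Bob"};
      alice.depth = 0;         // Alice checks the sky: K0 for Alice
      deliver(alice, bob);     // Bob reads her email: K1 for Bob
      deliver(bob, alice);     // Alice reads his ack: K2 for Alice
      deliver(alice, bob);     // ... and so on: after k deliveries the depth
      deliver(bob, alice);     //     is k, but common knowledge needs all n.
  }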

The bizarre aspect of our lovers' dilemma is that both Alice and Bob actually do know that the weather will be fine -- and both of them quickly learn that both of them know this to be true.  Both, that is, transition through a K1 knowledge state. Yet the particular (peculiar, one might say) setup is such that the weather being fine is not enough.  To actually meet, both must be sure that both will be there.  This is the aspect over which the plan stumbles.

K* knowledge, also called "Common Knowledge", is defined to be a level of knowledge that implies Kn for all values of n.  Yoram and Joe proved in their papers that common knowledge cannot be achieved in a symmetric system where all the participants have equal trust in one-another and communicate via messages.  Alice and Bob, in effect, agreed to require common knowledge, not realizing that with uncertain communication delays, this wouldn't be possible.

A curious feature of common knowledge is that if we simply introduce a special participant who is trusted to disseminate common knowledge, a system can simply accept the statements of this omnipotent participant and achieve things that, lacking such a participant, would be infeasible.

Here's a famous example that I often use when teaching the idea.  Alice and Bob, now a couple with children, have organized a birthday party for their daughter Carol.  The kids come in from the yard, where they have been playing and on their way in, they wash up... but poorly.  None washes their face, and as a result all have mud on their faces, but nowhere else.  So Bob announces "in our house, you need to be clean to sit at the table.  So if you want cake, you need a clean face!".  No child loves to wash twice (and children are like cats where water on their faces is concerned!).  Not one budges.  Bob, increasingly frustrated, repeats this a few times.  Dessert is in peril!

Meanwhile, in a different reality, Bob said something slightly different: when the kids came in, his first remark was "Wow, I see a dirty face!"  All of the children hear him clearly, and of course, young children place absolute trust in adults.  In this situation, with N children, all march to the sink the N'th time Bob repeats the rule.

The oddity in all of these stories is that everyone seemingly knows the answer, yet they behave very differently depending on how common knowledge is handled.  Alice knows the weather is fine. When she gets Bob's reply, he knows this too, and she is sure he knows  -- and if she takes the next step and replies to his reply, he knows that she knows this. Clearly, this should be adequate for a nice veggie burger special!  

Or consider the children: every child sees mud on every face (except of course, their own), so on Bob's very first announcement, why didn't all of them head for the sink?  But in fact, in version two, they do head for the sink.  

In reality two, Bob's introduction of common knowledge seems to be of vital importance.  But common knowledge is endlessly elusive!  You can construct an induction proof that reduces the N-child case to the N-1 child case, and the crux of the proof turns out to be similar to the dilemma of Bob and Alice: with 2 children, perhaps Carol and Doug, you can play this out.  In reality one, Carol is hoping against hope that her face is clean (and similarly, Doug).  She sees mud on Doug's face, he sees mud on hers, so neither budges.  But Carol is hoping Doug sees a clean face, and Doug is hoping she is looking at his clean face.  Lacking certainty that at least one child has a dirty face, neither moves.  When Bob introduces that common knowledge, it changes the game, and both can deduce that their own face, tragically, must be muddy.  How can we avoid making the mistake of requiring Common Knowledge in a problem statement?
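Here is a toy simulation of the standard puzzle under Bob's public announcement (my own illustration, not from the formal literature): a child who sees m muddy faces can only conclude "my face is muddy" in round m+1, so with N muddy children nothing happens until round N, when everyone moves at once.

  #include <iostream>
  #include <vector>

  // Muddy-children puzzle after the announcement "at least one face is dirty".
  // A child who sees m muddy faces reasons: "either m or m+1 faces are muddy;
  // if it were only m, those children would all have stepped forward by round
  // m".  So a muddy child steps forward in round m+1.
  int main() {
      const int N = 4;                       // number of children, all muddy
      std::vector<bool> muddy(N, true);
      std::vector<bool> stepped(N, false);

      for (int round = 1; round <= N; ++round) {
          bool anyoneMoved = false;
          for (int i = 0; i < N; ++i) {
              if (stepped[i]) continue;
              int seen = 0;                  // muddy faces child i can see
              for (int j = 0; j < N; ++j)
                  if (j != i && muddy[j]) ++seen;
              if (muddy[i] && round == seen + 1) {
                  stepped[i] = true;
                  anyoneMoved = true;
              }
          }
          std::cout << "Round " << round << ": "
                    << (anyoneMoved ? "children step forward\n" : "nobody moves\n");
      }
  }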

We often build systems that need to cooperate, elect leaders and so forth.  We certainly use a logic of knowledge when designing them; the resulting built-in facts are, in effect, common knowledge.  But the dynamic form of knowledge associated with new events is the more interesting case.  Why is it that we can solve problems like implementing state machine replication using virtually synchronous Paxos (the consistency model and protocol employed in Derecho)?  

Do we think of Derecho as a trusted source of common knowledge, or is it simply that these kinds of problems don't actually need common knowledge?

For me, the really fascinating insight is that when a system manages its own membership, creating epochs of stable membership, state machine replication seems to be a complete story for everything we ever need to do in distributed systems: if we need consistency, this model offers the answers; if we don't need consistency -- well, then use TCP and you should be happy!  Moreover, as far as I can tell, these state machine replication protocols use the membership as a form of common knowledge.  This suggests that a system like Derecho is actually a two-level knowledge structure, with one structure subordinate to and trusting the other, and that higher level structure permitted to hand common knowledge statements to the lower one!

This, however, is not something I have ever seen treated in a paper.  Perhaps someone has looked at it, but if so, it didn't cross my desk!

The other side of this story, equally fascinating, is that K1 knowledge is sufficient both for self-managed membership and for tasks such as ordered multicast delivery with safety in the event of a crash: In the Derecho paper I linked to earlier, we discuss this and present our reduction of the protocols to a knowledge logic and our implementation over 1-sided RDMA.  So, in this two-level formulation, we seem to have a K1 protocol at each level, but with one level viewed as a source of common knowledge by the other level!

Back in the real world, there is a third aspect to the same insight.  If you reflect on it, Bob and Alice's dilemma "made sense" and does seem to capture the way we describe situations and problems.  Yet the K1 style of solution used in real systems also seems relevant: in real life, Bob and Alice would never have gotten into such a jam -- Alice would have emailed that the weather is fine, Bob would have acknowledged.  His iPhone would show a little check mark: sent.  Then a second one: read by Alice (or more accurately, it popped up on her screen and she swiped it away).   Thus in the real world, Alice and Bob make a kind of leap of faith and begin to treat the weather as common knowledge.

This idea of a leap of faith is really about probabilistic reasoning: we simply decide, at some point, that the likelihood of a problem is so low that it can be dismissed.  Logicians (including both Joe and Yoram) have looked at this issue, and shown that it has a great deal of relevance to AI.  But I'm not actually convinced that Derecho involves such a leap of faith.

If anyone knows of work on this question, I would love to hear about it!  And in the meanwhile, stay safe and enjoy the rest of your summer, happy in the knowledge that it all worked out for Alice and Bob.

Wednesday, 24 July 2019

In theory, asymptotic complexity matters. In practice...

Derecho matches Keidar and Shraer’s lower bounds for dynamically uniform agreement:  No Paxos protocol can  safely deliver messages with fewer "information exchange" steps.  But does this matter?

Derecho targets a variety of potential deployments and use cases.  A common use would be to replicate state within some kind of "sharded" service -- a big pool of servers but broken into smaller replicated subservices that use state machine replication in subsets of perhaps 2, 3 or 5.  A different use case would be for massive replication -- tasks like sharing a VM image, a container, or a machine-learned model over huge numbers of nodes.  In those cases the number of nodes might be large enough for asymptotic protocol complexity bounds to start to matter -- Derecho's optimality could be a winning argument.  But would an infrastructure management service really stream high rates of VM images, containers, and machine-learned models? I suspect that this could arise in future AI Systems... it wouldn't today.

All of which adds up to an interesting question: if theoretical optimality is kind of a "meh" thing, what efficiency bounds really matter for a system like Derecho?  And how close to ideal efficiency can a system like this really come?

To answer this question, let me start by arguing that 99% of Derecho can be ignored.  Derecho actually consists of a collection of subsystems: you link your C++ code to one library, but internally, that library has several distinct sets of "moving parts".  A first subsystem is concerned with moving bytes: our data plane.  The second worries about data persistency and versioning.  A third is where we implement the Paxos semantics: Derecho's control plane.  In fact it handles more than just Paxos -- Derecho's control plane is a single thread that loops through a set of predicates, testing them one by one and then taking triggered actions for any predicate that turns out to be enabled.  A fourth subsystem handles requests that query the distributed state: it runs purely on data that has become stable and is totally lock-free and asynchronous -- the other three subsystems can ignore this one entirely.  In fact the other three subsystems are as lock-free and asynchronous as we could manage, too -- this is the whole game when working with high speed hardware, because the hardware is often far faster than the software that manages it.  We like to think of the RDMA layer and the NVM storage as two additional concurrent systems, and our way of visualizing Derecho is a bit like imagining a machine with five separate moving parts that interact in a few spots, but are as independent as we could manage.
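To make the predicate/trigger idea concrete, here is a minimal sketch of that pattern.  It is purely illustrative -- Derecho's actual control plane evaluates protocol-specific conditions over shared replicated state, not the invented rules shown here.

  #include <cstdio>
  #include <functional>
  #include <vector>

  // A predicate-driven control loop: one thread repeatedly evaluates the
  // registered predicates and fires the associated action for any predicate
  // that is currently enabled.
  struct Rule {
      std::function<bool()> predicate;   // e.g. "all receivers reported slot k"
      std::function<void()> action;      // e.g. "deliver message k"
  };

  void control_loop(const std::vector<Rule>& rules, const bool& shutdown) {
      while (!shutdown) {
          for (const auto& r : rules)
              if (r.predicate())
                  r.action();
          // A real system would also poll hardware completion queues here.
      }
  }

  int main() {
      bool shutdown = false;
      int reportedCount = 0;             // pretend: acknowledgements seen so far
      std::vector<Rule> rules = {
          // "Deliver" once all three members have reported receipt of message 0.
          { [&] { return reportedCount == 3; },
            [&] { std::puts("predicate enabled: deliver message 0"); shutdown = true; } },
          // Stand-in for background work that changes the monitored state.
          { [&] { return reportedCount < 3; },
            [&] { ++reportedCount; } },
      };
      control_loop(rules, shutdown);
  }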

For steady state performance -- bandwidth and latency -- we can actually ignore everything except the update path and the query path.  And as it happens, Derecho's query path is just like any query-intensive read-only subsystem: it uses a ton of hashed indices to approximate one-hop access to objects it needs, and it uses RDMA if that one hop involves somehow fetching data from a remote node, or sending a computational task to that remote node.  This leads to fascinating questions, in fact: you want those paths to be lock-free, zero-copy, ideally efficient, etc.  But we can set those questions to the side for our purposes here -- results like the one by Keidar and Shraer really are about update rates.  And for this, as noted a second ago, almost nothing matters except the data-movement path used by the one subsystem concerned with that role.  Let's have a closer look.

For large transfers Derecho uses a tree-based data movement protocol that we call a binomial pipeline.  In simple terms, we build a binary tree, and over it, create a flow pattern of point to point block transfers that obtains a high level of internal concurrency, like a two-directional bucket brigade (we call this a reliable multicast over RDMA, or "RDMC").  Just like in an actual bucket brigade, every node settles into a steady behavior, receiving one bucket of data (a "chunk" of bytes) as it sends some other bucket, more or less simultaneously.  The idea is to max out the RDMA network bandwidth (the hardware simply can't move data more efficiently).  The actual data structure creates a hypercube "overlay" (a conceptual routing diagram that lives on our actual network, which allows any-to-any communication) of dimension d, and then wraps d binomial trees over it; you can read about it in our DSN paper, or in the main Derecho paper.
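To convey why the bucket-brigade pattern matters, here is a toy simulation of a purely linear relay chain.  This is a deliberate simplification, not the binomial pipeline itself (for that, see the DSN paper): the point is just that with pipelined chunk transfers, B chunks reach every relay after roughly N+B steps rather than the N*B steps a store-and-forward relay would need.

  #include <iostream>

  // Linear "bucket brigade": node 0 is the sender, nodes 1..N-1 relay.  In each
  // step, a node forwards the chunk it just received while receiving the next
  // one from its predecessor.  Chunk c reaches node j at step j + c, so B
  // chunks reach the last relay after N + B - 2 steps, versus (N-1)*B steps
  // if each relay waited for the whole object before forwarding it.
  int main() {
      const int N = 4;      // nodes in the chain (node 0 is the source)
      const int B = 6;      // number of chunks
      for (int step = 0;; ++step) {
          bool done = true;
          std::cout << "step " << step << ":";
          for (int j = 1; j < N; ++j) {
              int have = step - j;            // highest chunk index node j holds
              if (have < B - 1) done = false;
              std::cout << "  node" << j << " has "
                        << (have < 0 ? 0 : have + 1) << "/" << B << " chunks";
          }
          std::cout << "\n";
          if (done) break;                    // finishes at step N + B - 2
      }
  }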

A binary tree is the best you can hope for when using point-to-point transfers to replicate large, chunked, objects.  And indeed, when we measure RDMC, it seems to do as well as one can possibly do on RDMA, given that RDMA lacks a reliable one-to-many chunk transfer protocol.   So here we actually do have an ideal mapping of data movement to RDMA primitives.

Unfortunately, RDMC isn't very helpful for data too small to "chunk".  If we don't have enough data a binomial tree won't settle into its steady-state bucket brigade mode and we would just see a series of point-to-point copying actions.  This is still "optimal" at large-scale, but recall that often we will be replicating in a shard of size two, three or perhaps five.  We decided that Derecho needed a second protocol for small multicasts, and Sagar Jha implemented what he calls the SMC protocol.

SMC is very simple.  The sender, call it process P, has a window, and a counter.  To send a message, P places the data in a free slot in its window (each sender has a different window, so we mean "P's window"), and increments the counter (again, P's counter).  When every receiver (call them P, Q and R: this protocol actually loops data back, so P sends to itself as well as to the other shard members) has received the message, the slot is freed and P can reuse it, round-robin.  In a shard of size three where all the members send, there would be one instance of this per member: three windows, three counters, three sets of receive counters (one per sender).
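Here is a skeletal sketch of that bookkeeping, stripped of all the RDMA mechanics.  The names and structure below are my own simplification for illustration; in the real protocol the window and the counters live in RDMA-registered memory and are pushed to the other members with one-sided writes.

  #include <algorithm>
  #include <array>
  #include <cstdint>
  #include <optional>
  #include <string>

  constexpr int WINDOW_SLOTS = 4;
  constexpr int N = 3;                        // shard members: P, Q, R

  struct SenderState {
      std::array<std::string, WINDOW_SLOTS> window;   // P's send window
      uint64_t sent = 0;                              // P's full-slots counter
      std::array<uint64_t, N> received{};             // receive counter per member
                                                      // (mirrored from P, Q and R)

      // The oldest slot can be reused only once every member has received it.
      uint64_t min_received() const {
          uint64_t m = received[0];
          for (int i = 1; i < N; ++i) m = std::min(m, received[i]);
          return m;
      }

      // Returns the slot used, or nullopt if the window is full (sender waits).
      std::optional<int> send(const std::string& msg) {
          if (sent - min_received() >= WINDOW_SLOTS)
              return std::nullopt;                            // no free slot yet
          int slot = static_cast<int>(sent % WINDOW_SLOTS);   // round-robin reuse
          window[slot] = msg;                                 // 1. place data in the slot
          ++sent;                                             // 2. bump the full-slots counter
          // In SMC, steps 1 and 2 would each be a separate one-sided RDMA write
          // to every other member, so the counter never "overtakes" the data.
          return slot;
      }

      // Called when member 'who' reports having consumed message number 'count'.
      void ack(int who, uint64_t count) { received[who] = count; }
  };

  int main() {
      SenderState p;
      p.send("hello");                          // goes into slot 0
      p.ack(0, 1); p.ack(1, 1); p.ack(2, 1);    // P, Q and R all received message 0
      p.send("world");                          // slot 1; slot 0 is reusable again
  }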

SMC is quite efficient with small shards.  RDMA has a direct-remote-write feature that we can leverage (RDMC uses a TCP-like feature where the receiver needs to post a buffer before the sender transmits, but this direct write is different: here the receiver declares a region of memory into which the sender can do direct writes, without locking).

Or is it?  Here we run into a curious philosophical debate that centers on the proper semantics of Derecho's ordered_send: should an ordered_send be immediate, or delayed for purposes of buffering, like a file I/O stream?  Sagar, when he designed this layer, opted for urgency.  His reasoning was that if a developer can afford to batch messages and send big RDMC messages that carry thousands of smaller ones, this is exactly what he or she would do.  So a developer opting for SMC must be someone who prioritizes immediate sends, and wants the lowest possible latency.

So, assume that ordered_send is required to be "urgent".  Let's count the RDMA operations that will be needed to send one small object from P to itself (ordered_send loops back), Q and R.  First we need to copy the data from P to Q and R: two RDMA operations, because reliable one-sided RDMA is a one-to-one action.  Next P increments its full-slots counter and pushes it too -- the updated counter can't be sent in the same operation that sends the data because RDMA has a memory consistency model under which a single operation that spans different cache-lines only guarantees sequential consistency on a per-cache-line basis, and we wouldn't want Q or R to see the full-slots counter increment without certainty that the data would be visible to them.  You need two distinct RDMA operations to be sure of that (each is said to be "memory fenced.")  So, two more RDMA operations are required.  In our three-member shard, we are up to four RDMA operations per SMC multicast.

But now we need acknowledgements.  P can't overwrite the slot until P, Q and R have received the data and looked at it, and to report when this has been completed, the three update their receive counters.  These counters need to be mirrored to one-another (for fault-tolerance reasons), so P must send its updated receive counter to Q and R, Q to P and R, and R to P and Q: six more RDMA operations, giving a total of ten.  In general with a shard of size N, we will see 2*(N-1) RDMA operations to send the data and count, and N*(N-1) for these receive counter reports, a total of N^2+N-2.  Asymptotically, RDMC will dominate because of the N^2 term, but N would need to be much larger than five for this to kick in.  At a scale of two to five members, we can think of N as more or less a constant, and so this entire term is like a constant.
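Spelling out that count as a formula (N is the shard size; these are the same numbers as above):

  \text{ops per SMC multicast} \;=\; \underbrace{(N-1)}_{\text{data writes}} + \underbrace{(N-1)}_{\text{counter pushes}} + \underbrace{N(N-1)}_{\text{receive-counter reports}} \;=\; N^2 + N - 2, \qquad N = 3 \;\Rightarrow\; 2 + 2 + 6 = 10.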

So by this argument, sending M messages using SMC with an urgent-send semantic "must" cost us M*(N^2+N-2) RDMA operations.  Is this optimal?

Here we run into a hardware issue.  If you check the specification for the Mellanox ConnectX-4 device used in my group's experiments, you'll find that it can transmit 75M RDMA messages per second, and also that it has a peak performance of 100Gbps (12.5GB/s) in each link direction.   But if your 75M messages are used to report updates to tiny little 4-byte counters, you haven't used much of the available bandwidth: 75M times 4 bytes is only 300MB/s, and as noted above, the device is bidirectional.  Since we are talking about bytes, the bidirectional speed could be as high as 25GB/s with an ideal pattern of transfers.  Oops: we're too slow by a factor of 75x!

In our TOCS paper SMC peaks at around 7.5M small messages per second, which bears out this observation.  We seem to be leaving a lot of capacity unused.  If you think about it, everything centers on the assumption that ordered_send should be as urgent as possible.  This actually limits performance, and for applications that average 7.5M SMC messages per second or less but have bursts that might be much higher, it even inflates latency (a higher-rate burst will just fill the window and the sender will have to wait for a slot).

Suppose our sender wants fast SMC streaming and low latency, and simply wasn't able to do application-level batching (maybe the application has a few independent subsystems of its own that send SMC messages).  Well, everyone is familiar with file I/O streaming and buffering.  Why not use the same idea here?

Clearly we could have aggregated a bunch of SMC messages, and then done one RDMA transfer for the entire set of full window slots (it happens that RDMA has a so-called "scatter gather/put feature", and we can use that to transfer precisely the newly full slots even if they wrap around the window).  Now one counter update covers the full set.  Moreover, the receivers can do "batched" receives, and one counter update would then cover the full batch of receives.

An SMC window might have 1000 sender slots in it, with the cutoff for "small" messages being perhaps 100B.  Suppose we run with batches of size 250.  We'll have cut the overhead factors dramatically: for 1000 SMC messages in the urgent approach, the existing system would send 1000*10 RDMA messages for the 3-member shard: 10,000 in total.  Modified to batch 250 messages at a time, only 40 RDMA operations are needed: a clean 250x improvement.  In theory, our 7.5M SMC messages per second performance could then leap to 1.9B/second.  But here, predictions break down: With 100 byte payloads, that rate would actually be substantially over the limit we calculated earlier, 25GB/s, which limits us to 250M SMC messages per second.  Still, 250M is quite a bit faster than 7.5M and worth trying to achieve.
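A quick back-of-the-envelope check of those numbers (the same figures as in the text, just recomputed; nothing here is a measurement):

  #include <cstdint>
  #include <iostream>

  int main() {
      const int N = 3;                                   // shard size
      const uint64_t opsPerMulticast = N * N + N - 2;    // 10 RDMA ops, urgent SMC
      const uint64_t messages = 1000;
      const uint64_t batch = 250;

      uint64_t urgentOps  = messages * opsPerMulticast;            // 10,000
      // A whole batch still costs the same 10 ops as one urgent multicast:
      // one scatter-gather write per remote receiver, one counter push each,
      // and the six receive-counter reports.
      uint64_t batchedOps = (messages / batch) * opsPerMulticast;  // 4 batches -> 40

      std::cout << "urgent: "  << urgentOps  << " ops, batched: " << batchedOps
                << " ops (" << urgentOps / batchedOps << "x fewer)\n";

      double projected    = 7.5e6 * batch;     // 1.875e9 msgs/s if batching scaled perfectly
      double bandwidthCap = 25e9 / 100.0;      // 100-byte payloads at 25 GB/s: 250M msgs/s
      std::cout << "projected " << projected << " msgs/s, bandwidth cap "
                << bandwidthCap << " msgs/s\n";
  }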

It might not be trivial to get from here to there, even with batching.  Optimizations at these insane data rates often aren't nearly as simple as a pencil-and-paper calculation might suggest.  And there are also those urgency semantics issues to think about:  A bursty sender might have some gaps in its sending stream.  Were one to occur in the middle of a 250 message batch, we shouldn't leave those SMC messages dangling: some form of automatic flush has to kick in.  We should also have an API operation so that a user could explicitly force a flush.

Interestingly, once you start to think about this, you'll realize that in this latency sense, Sagar's original SMC is probably "more optimal" than any batched solution can be.  If you have just one very urgent notification to send, not a batch, SMC is already a very low-latency protocol; arguably, given his argument that the API itself dictates that SMC should be an urgent protocol, his solution actually is "ideally efficient."  What we see above is that if you question that assumption, you can identify an inefficiency -- not that the protocol as given is inefficient under the assumptions it reflects.

Moral of the story?  The good news is that right this second, there should be a way to improve Derecho performance for small messages, if the user is a tiny bit less worried about urgency and would like to enable a batching mode (we can make it a configurable feature).  But more broadly, you can see that although Derecho lives in a world governed in part by theory, in the extreme performance range we target and with the various hardware constraints and properties we need to keep in mind, tiny decisions can sometimes shape performance to a far greater degree.

I happen to be a performance nut (and nobody stays in my group unless they share that quirk).  Now that we are aware of this SMC performance issue, which was actually called to our attention by Joe Israelevitz when he compared his Acuerdo protocol over RDMA with our Derecho one for 100B objects and beat us hands-down,  we'll certainly tackle it.  I've outlined one example of an optimization, but it will certainly turn out that there are others too, and I bet we'll end up with a nice paper on performance, and a substantial speedup, and maybe even some deep insights.  But they probably won't be insights about protocol complexity.  At the end of the day, Derecho may be quite a bit faster for some cases, and certainly this SMC one will be such a case.  Yet the asymptotic optimality of the protocol will not really have been impacted: the system is optimal in that sense today!  It just isn't as fast as it probably should be, at least for SMC messages sent in high-rate streams!

Wednesday, 26 June 2019

Whiteboard analysis: IoT Edge reactive path

One of my favorite papers is the one Jim Gray wrote with Pat Helland, Patrick O'Neil and Dennis Shasha, on the costs of replicating a database over a large set of servers, which they showed to be prohibitive if you don't fragment (shard) the database into smaller and independently accessed portions: mini-databases.  In some sense, this paper gave us the modern cloud, because you can view Brewer's CAP conjecture and the eBay/Amazon BASE methodologies as both flowing from Gray's original insight.

Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, and then compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all its locks before releasing any), and then terminates using a two-phase or three-phase commit.  They show that without some mechanism to prevent lock conflicts, there is a predictable and steadily increasing rate of lock conflicts leading to delay and even deadlock/rollback/retry.  The phenomenon causes overheads to rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe it was N^3 in the number of servers, and T^5 in the rate of transactions.  So your single replicated database will have a performance collapse.  With shards, using state machine replication (implemented using Derecho!) this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's Dynamo DB offers.

Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan of working out the critical paths and measuring costs along them.  In his cloud database setup, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight.  But even if you don't have an identical chance to change the world, it makes sense to try and follow a similar path.

This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week.  Right now, those paths are pretty messy, if you approach it this way.  Without an edge service, we would see something like this:

  IoT                    IoT       Function         Micro
 Sensor  ------------->  Hub  ---> Server  ------>  Service

In this example I am acting as if the function server "is" the function itself, and hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps had one waiting, warm-started), and then hands off the event data to the function for handling on one of its servers.  Had I included this handoff the image would be more like this:


  IoT                    IoT       Function      Function     Micro
 Sensor  ------------->  Hub  ---> Server ------>    F  ----->  Service

F is "your function", coded in a language like C#, F# or C++ or Python, and then encapsulated into a container of some form.  You'll want to keep these programs very small and lightweight for speed.  In particular, a function is not the place to do any serious computing, or to try and store anything.  Real work occurs in the micro service, the one you built using Derecho.  Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a second.  And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!

Even more concerning, many sensors can't connect directly to the cloud, and we end up cloning the architecture and running it twice: once within an IoT Edge system (think of that as an operating system for a small NUMA machine or a cluster running close to the sensors), which relays data to the main cloud if it can't handle the events out near the sensor device, and once again in the cloud itself.

  IoT                    Edge      Edge Fcn           IoT      Function          Micro
 Sensor  ------------->  Hub  ---> Server -> F ======> Hub ---> Server -> CF ->  Service

Notice that now we have two user-supplied functions on the path.  The first one will have decided that the event can't be handled out at the edge, and forwarded the request to the cloud, probably via a message queuing layer that I haven't actually shown, but represented using a double-arrow: ===>.  This could have chosen to store the request and send it later, but with luck the link was up and it was passed to the cloud instantly, didn't need to sit in an arrival queue, and was instantly given to the cloud's IoT Hub, which in turn finally passed it to the cloud function server, the cloud function (CF) and the Micro Service.

The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others.  The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.

We also have the difficulty that many sensors capture images and videos.  These are initially stored on the device itself, which has substantial capacity but limited compute power.  The big issue is that the first link, from sensor to the edge hub, would often be bandwidth limited.  So we can't upload everything.  Very likely what travels from sensor to hub is just a thumbnail and other meta-data.  Then the edge function concludes that a download is needed (hopefully without too much delay), sends back a download request to the imaging device, and then the device moves the image to the cloud.

Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store (short for "binary large objects"), which in turn is edge-aware and will mirror them to the main cloud blob store.  Thus we have a whole pathway from IoT sensor to the edge blob server, which will eventually generate another event later to tell us that the data is ready.  And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving that image via the queuing service to the cloud.

So how long will all of this take?  Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G).  So perhaps we will see a 10ms delay at that stop, plus more if the data is large.  Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays.  The big delay will probably be the handoff to the user-defined function, F.

My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100 for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!".  We're up near 350ms and haven't done any work at all yet!

Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path).  Call this 10ms?  And then the micro service can decide what to do with this image.

Add another 75ms or so if we have to forward the request to the cloud.  So the cloud might not be able to react to a photo in less than about 500ms, today.
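Tallying the rough estimates above (these are the guesses from this post, not measurements; the per-step figures are midpoints of the ranges I quoted):

  #include <iostream>
  #include <utility>
  #include <vector>

  // Rough reactive-path latency budget for the photo scenario.
  int main() {
      std::vector<std::pair<const char*, double>> budget = {
          {"sensor -> hub -> edge function F sees the event", 75},   // 50-100 ms
          {"F requests a download from the camera",           75},   // 50-100 ms
          {"camera uploads the image to the blob server",    200},
          {"blob server raises the 'image ready' event",      25},
          {"handoff/binding to the micro-service",            10},
          {"forwarding the request to the cloud",             75},
      };
      double total = 0;
      for (auto& [step, ms] : budget) {
          std::cout << step << ": ~" << ms << " ms\n";
          total += ms;
      }
      std::cout << "total: ~" << total << " ms\n";   // ~460 ms, i.e. the ~500 ms figure above
  }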

None of this involved a Jim Gray kind of analysis of contention and backoff and retry.  If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story.  But if you were to use a database solution like MongoDB (CosmosDB on Azure), it seems to me that you might easily see a further 250ms right there.

What should one do about these snowballing costs?  One answer is that many of the early IoT applications just won't care: if the goal is to just journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal.  But if the goal is to be reactive, we need to do a lot better.

I'm thinking that this is a great setting for various forms of shortcut datapaths, that could be set up after the first interaction and offer direct bypass options to move IoT events or data from the source directly to the real target.  Then with RDMA in the cloud, and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.

On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently), and contention, it is easy to see how you could trigger Jim's "naive replication" concerns.  So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns that a car should "watch out for that reckless motorcycle approaching on your left!" it will already have zoomed past...   

These are exciting times to work in computer systems.  Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!

Sunday, 20 January 2019

Derecho status update

As we swing into 2019 mode, I wanted to share a quick Derecho status report and point to some goals for the coming months.

First, our ACM Transactions on Computer Systems paper will be appearing sometime soon, which should give nice visibility for the work, and also the validation that comes from a tough peer-review process.  The paper has hugely improved through the pushback our reviewers provided, so it was a challenge but, I think, worth it.  The system itself works really well!

Next, we are starting to focus on a stronger integration with Azure IoT, where Derecho could be used either as a tool for creating new micro-services with strong fault tolerance and consistency guarantees, or as an ultra fast RDMA-capable object store.  Microsoft has been supportive of this and Derecho should be available from their third party portal, still as a free and open source technology.

But that portal isn’t available yet.  So right now, use Derecho via the  v0.9 release of the system, which will be available by February 1 (we are saving v1.0 for later in the year, after we have a reasonable amount of end user experience).  As of today, we still have one or two bugs we want to fix before doing that release.

Some key points:

  • We are urging people to use the Ubuntu Linux version, because this interoperates between normal Linux environments and Azure (the Microsoft cloud).  On our release site (download here), you can find the source code but also some VMs (a container and then a true VM) with the library preinstalled.  But in fact Derecho should work on any Linux-compatible system.
  • Right now, Derecho has only been tested within a single cluster or cloud (Azure, including Azure IoT, AWS, etc.).  We have some limited experience with virtualization, and with ROCE as opposed to pure Infiniband.
  • The easiest path to using Derecho is via the new key-value store.  In this, both keys and values can be any serializable object type you like, and we offer a wide range of features: Put and Get, but also a conditional put, which checks that the version you are writing was based on the most current version of the underlying object (useful for atomic replace, like in Zookeeper), plus a watch operation that works just like a topic-based pub-sub or DDS (a purely illustrative usage sketch appears just after this list).  Objects can be stateless, stateful but not versioned, or versioned and persistent with strong consistency and extremely accurate temporal indexing.  On this we will eventually support pub-sub (think of Kafka or OpenSplice), file systems (HDFS, Ceph), and maybe even a genuine Zookeeper look-alike.  The only caution is that the watch feature isn't designed to support huge numbers of watched topics.  So if you would have more than 50 or 100 active topics, consider using Dr. Multicast to squeeze that set down.
  • The full system can only be used directly from our templated library API in C++, but you can easily build a wired-down library with no templated methods and then load it from Java or Python or whatever.
  • Runs on RDMA, OMNIPath, and even on normal TCP with no special hardware help at all.  You just configure it via a configuration file, to tell the system how to set itself up.  We use LibFabrics for this mapping to the underlying hardware.
  • Right now, all the Derecho group members need to be on hardware with identical endian and byte alignment policies, but clients not in the group can use RESTful RPC, the OMG DDS stack, WCF or JNI to issue RPCs to Derecho group members, which can then relay the request as appropriate.  
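To give a feel for the style of this API, here is a purely illustrative sketch.  The class and method names below are invented for exposition and are not Derecho's actual templated interface; this is a toy, single-process, in-memory stand-in, where the real system replicates these operations across a shard.

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  template <typename K, typename V>
  class KVSketch {                     // hypothetical name, for illustration only
  public:
      struct Versioned { V value; uint64_t version; };

      void put(const K& key, const V& value) {
          auto& slot = store_[key];
          slot.value = value;
          ++slot.version;
          notify(key, slot);
      }

      std::optional<Versioned> get(const K& key) const {
          auto it = store_.find(key);
          if (it == store_.end()) return std::nullopt;
          return it->second;
      }

      // Atomic replace, Zookeeper-style: succeeds only if the caller saw the
      // current version of the object.
      bool conditional_put(const K& key, const V& value, uint64_t expectedVersion) {
          auto it = store_.find(key);
          if (it == store_.end() || it->second.version != expectedVersion) return false;
          it->second.value = value;
          ++it->second.version;
          notify(key, it->second);
          return true;
      }

      // Topic-style watch: the callback fires on every subsequent update of 'key'.
      void watch(const K& key, std::function<void(const Versioned&)> cb) {
          watchers_[key].push_back(std::move(cb));
      }

  private:
      void notify(const K& key, const Versioned& v) {
          for (auto& cb : watchers_[key]) cb(v);
      }
      std::map<K, Versioned> store_;
      std::map<K, std::vector<std::function<void(const Versioned&)>>> watchers_;
  };

  int main() {
      KVSketch<std::string, std::string> kv;
      kv.watch("weather", [](const auto& v) { /* react to updates */ });
      kv.put("weather", "sunny");
      auto cur = kv.get("weather");                          // version 1
      kv.conditional_put("weather", "cloudy", cur->version); // atomic replace
  }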
Later this year we will extend the system in various ways.  The API should stay stable, but the new  features would include:
  • Hierarchically structured WAN layer that does read-only content mirroring for the object store.
  • A form of ultra fast and scalable LAN and WAN BlockChain support.
  • Machine checked correctness proofs, and a reference-version of the core Derecho protocols both in a high level form, and as proved and then re-extracted from those proofs in C or C++.
  • External client access to our API via RDMA, supporting point-to-point send and query.
  • Integration with Matt Milano’s mobile code language, MixT, allowing a client to send code to data residing in the object store.

Tuesday, 27 November 2018

So what's the story about "disaggregated" cloud computing?

The hardware community has suddenly gone wild over a new idea they call "disaggregated" cloud computing infrastructures.  I needed to talk about this in my graduate class, so I've been reading some of the papers and asking around.

If you come at the topic from distributed/cloud computing, as I do, you may have been using disaggregation to refer to a much older trend, namely the "decoupling" of control software from the hardware that performs various data and compute-intensive tasks that would otherwise burden a general purpose processor.  And this is definitely a legitimate use of the term, but as it turns out, the architecture folks are focused on a different scenario: for them, disaggregation relates to a new way of building your data center compute nodes in which you group elements with similar functionalities, using some form of micro-controller with minimal capabilities to manage components that handle storage, memory, FPGA devices, GPU, TPU, etc.  Recent work on quantum computing uses a disaggregated model too: the quantum device is like an experimental apparatus that a control program carefully configures into a specific state, then "runs", and then collects output from.

The argument is that disaggregation reduces wasted power by supporting a cleaner form of multitenancy.  For example, consider DRAM.  The “d” stands for dynamic, meaning that power is expended to keep every bit continuously refreshed.  This is a costly, heat-intensive activity, and yet your DRAM may not even be fully used (studies suggest that 40-50% would be pretty good).  If all the DRAM was in some shared rack, bin-packing could easily give you >95% utilization on the active modules, plus you could power down any unused modules.

So disaggregation as a term really can refer to pretty much any situation in which some resources are controlled from programs running "elsewhere", and where you end up with a distributed application that conceptually is a single thing, but has multiple distinct moving parts, with different subsystems specialized for different roles and often using specialized hardware as part of those roles.

At this point I'm seeing the idea as a spectrum, so perhaps it would make sense to start with the older style of control-plane/data-plane separation, say a few words about the value of such a model, and then we can look more closely at this extreme form and ask how it fits into the picture.

Control/data disaggregation is really a pretty common idea and dates back quite far: even the earliest IBM mainframes had dedicated I/O coprocessors to operate disks and tapes, and talked about "I/O channels".   I could easily list five or ten other examples that have popped up during the past decade or two: very few devices lack some form of general purpose on-board computer, and we can view any such system as a form of disaggregated hardware. But the one that interests me most is the connection to Mach, at the stage when the project was publishing very actively at SOSP and was run from CMU, with various people heading it: Rick Rashid (who ultimately left to launch Microsoft Research), then Brian Bershad, and then Tom Anderson.

Mach was an amazing system, but what I find fascinating is that it happens to also have anticipated many aspects of this new push into disaggregation.  As you perhaps will recall (from all those Mach papers you've read...) the basic idea was to have a single micro-kernel that would host various O/S presentation frameworks: a few versions of Linux, perhaps versions of legacy IBM operating systems, etc.  Internally, one of those frameworks consisted of a set of processes sharing DLLs, some acting as servers and others as applications.  Memory was organized into segments with read/write/execute permissions, and all the underlying interactions centered on an ultra-fast form of RPC in which a single CPU core could run an application thread in segment A, then briefly suspend that thread and activate a thread context in segment B (perhaps A was an application and B is the file server, for example), run in B for a short period, then return to A.  The puzzle was to handle permissions (the file server can see the raw disk, but A shouldn't be allowed to, and conversely, the file server can DMA from a specific set of memory areas in A, but shouldn't be able to scan A's memory for passwords).  Then because all of this yielded a really huge address space, Mach ran into page table fragmentation issues that were ultimately solved with a novel software inverted page table.  And capabilities were used for the low-level message passing primitives.  Really cool stuff!

As you might expect, all of these elements had costs: the messaging layer, the segmentation model, context switching, etc.  Mach innovated by optimizing those steps, but keep these kinds of costs in mind, because disaggregation will turn out to run into similar issues.

Which brings us to today!  The modern version of the story was triggered by the emergence of RDMA as an option in general purpose data centers.  As you'll know if you've followed these postings, I'm an RDMA nut, but it isn't just me.  RDMA is one of those rare cases where a technology became very mature in a different setting (namely high-performance computing, where it serves as the main communications backbone for packages like the MPI library, and runs on InfiniBand networks), and yet somehow went mostly unnoticed by the general computing community.... until recently, when people suddenly "discovered" it and went bonkers.

As I've discussed previously, one reason that RDMA matured on Infiniband and yet wasn't used on Ethernet centered on RDMA's seeming need for a special style of credit-based I/O model, in which a sender NIC would only transfer data if the receiver had space waiting to receive the data.   It is only recently that the emergence of RoCE allowed RDMA to jump to more standard datacenter environments.  Today, we do have RDMA in a relatively stable form within a set of racks sharing a TOR switch, assuming that the TOR switch has full bisection bandwidth.  But RDMA isn't yet equally mature across routers, particularly in a Clos/spine model where the core network would often be oversubscribed and hence at risk of congestion.  There are several competing schemes for deploying RDMA over RoCE v2 with PFC at data-center-wide scale (DCQCN and TIMELY), but work is still underway to understand how those can be combined with DiffServ or enterprise VLAN routing to avoid unwanted interactions between standard TCP/IP and the RDMA layer.  It may be a few years before we see "mature" datacenter-scale deployments in which RDMA runs side-by-side with TCP/IP in a non-disruptive way over RoCE v2 (in fact, by then we may be talking about RoCE v5)!  Until then, RDMA remains a bit delicate, and must be used with great care.

So why is RDMA even relevant to disaggregation?  There is a positive aspect to the answer, but it comes with a cautionary sidebar. The big win is that RDMA has a simple API visible directly to the end-user program (a model called "qpairs" that focuses on "I/O verbs").  These are lock-free, and with the most current hardware, permit reliable DMA data transfers from machine to machine at speeds of up to 200Gbps, and latencies as low as .75us.  The technology comes in several forms, all rather similar: Mellanox leads on the pure RDMA model (the company has been acquired and will soon be renamed Xilinx, by the way), while Intel is pushing a variant they call OMNI-Path, and other companies have other innovations.   These latencies are 10x to 100x better than what you see with TCP, and the data rates are easily 4x better than the fastest existing datacenter TCP solutions.  As RDMA pushes up the performance curve to 400Gbps and beyond, the gaps will get even larger.  So RDMA is the coming thing, even if it hasn't quite become commonplace today.

RDMA is also reliable in its point-to-point configuration, enabling it to replace TCP in many applications, and also offers a way to support relatively low latency remote memory access with cache-line atomicity.  Now, it should probably be underscored that with latencies of .75us at the low end to numbers more like 3us for more common devices, RDMA offers the most NUMA of NUMA memory models, but even so, the technology is very fast compared to anything we had previously.

The cautions are the obvious ones.  First, RDMA on RoCE is really guaranteed to offer those numbers only if the endpoints live under a TOR switch that can support full bidirectional communication.  Beyond that, we need to traverse a router that might experience congestion, and might even drop packets.  In these cases, RDMA is on slightly weaker ground.

A second caution is simply that applications can't really handle long memory-access delays, so this NUMA thing becomes a conceptual barrier to casual styles of programming.   You absolutely can't view RDMA data-center memory as a gigantic form of local memory.  On the contrary, the developer needs to be acutely aware of potential delays, and hence will need to have full control over memory layout. Today, we lack really easily-used tools for that form of control, so massive RDMA applications are the domain of specialists who have a great deal of arcane hardware insight.

So how does this story relate to disaggregation?  A first comment is that RDMA is just one of several new hardware options that force us to take control logic out of the main data path.  There are a few other examples: with the next generation of NVRAM storage (like Intel's Optane), you can DMA directly from a video camera into a persistent storage card, and with RDMA, you'll be able to do so even if the video isn't on the same machine as the place you plan to store the data.  With FPGA accelerators, we often pursue bump-in-the-wire placements: data streams into a memory unit associated to the FPGA at wire rates, it does something, and then we might not even need to touch the main high-volume stream from general purpose code, at all.  So the RDMA story is pointing to a broader lack of abstractions covering a variety of data-path scenarios.  Moreover, our second caution expands to a broader statement: these different RDMA-like technologies may each bring different constraints and different best-case use scenarios.

Meanwhile, the disaggregation folks are taking the story even further: they start by arguing that we can reduce the power footprint of rack-scale computing systems if we build processors with minimal per-core memory and treat that memory more like a cache than as a full-scale memory.  Because memory runs quite hot but the heat dissipation is a function of the memory size, this can slash the power need and heat generation.  Then if the processor didn't really need much memory, we are golden, and if it did, we can start to think about ways to intelligently manage the on-board cache, paging data in and out from remote memories hosted in racks dedicated to memory resources and shared by the processor racks (yes, this raises all sorts of issues of protection, but that's the idea).  We end up with racks of processors, racks of memory units, racks of storage, racks of FPGA, etc.  Each rack will achieve a much higher density of the corresponding devices, which brings some efficiencies too, and we also potentially have multi-tenancy benefits if some applications are memory-intensive but others need relatively little DRAM, some use FPGA but some don't, etc.

The main issue -- the obvious one -- is latency.  To appreciate the point, consider standard NUMA machines.  The bottom line, as you'll instantly discover should you ever run into it, is that on a NUMA machine, DRAM is split into a few modules, each supporting a couple of local cores, but interconnected by a bus.  The issue is that although a NUMA machine is capable of offering transparent memory sharing across DRAM modules, the costs are wildly different when a thread accesses the closest DRAM than when it reaches across the bus to access a "remote" DRAM module (in quotes because this is a DRAM module on the same motherboard).  At the best, you'll see a 2x delay: your local DRAM might run with 65ns memory fetch speeds, but a "remote" access could cost 125ns.  These delays will balloon if there are also cache-coherency or locking delays: in such cases the NUMA hardware enforces the consistency or locking model, but it runs backplane protocols to do so, and those are costly.  NUMA latencies cause havoc.

Not long ago, researchers at MIT wrote a whole series of papers on optimizing Linux to run at a reasonable speed on NUMA platforms.

The cross-product of RDMA with NUMA-ness makes life even stranger.  Even when we consider a single server, an RDMA transfer in or out will run in slow motion if the RDMA NIC happens to be far from the DRAM module the transfer touches. You may even get timeouts, and I wonder if there aren't even cases where it would be faster to just copy the data to the local DRAM first, and then use RDMA purely from and to the memory unit closest to the NIC!

This is why those of us who teach courses related to modern NUMA and OS architectures are forced to stress that writing shared memory multicore code can be a terrible mistake from a performance point of view.  At best, you need to take full control: you'll have to pin your threads to a set of cores close to a single DRAM unit (this helps: you'll end up with faster code.  But... it will be highly architecture-specific, and on some other NUMA system, it might perform poorly).
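As one concrete illustration of what "taking full control" looks like on Linux (a minimal sketch; the core number and buffer size below are arbitrary choices, and real placement decisions are workload- and machine-specific):

  // Pin the calling thread to one core and place its buffer on that core's
  // NUMA node.  Build with:  g++ pin.cpp -o pin -lnuma -lpthread
  #include <numa.h>          // libnuma: numa_available, numa_node_of_cpu, numa_alloc_onnode
  #include <pthread.h>
  #include <sched.h>
  #include <cstdio>

  int main() {
      const int core = 2;                        // a core near the DRAM we want to use

      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(core, &mask);
      pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

      if (numa_available() < 0) {
          std::printf("no NUMA support on this machine\n");
          return 1;
      }
      int node = numa_node_of_cpu(core);              // DRAM module local to that core
      void* buf = numa_alloc_onnode(1 << 20, node);   // 1 MB allocated on that node
      std::printf("pinned to core %d, buffer on NUMA node %d\n", core, node);
      numa_free(buf, 1 << 20);
      return 0;
  }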

In fact NUMA is probably best viewed as a hardware feature ideal for virtualized systems with standard single-threaded programs (and this is true both for genuine virtualization and for containers). Non-specialists should just avoid the model!

So here we have this new disaggregation trend, pushing for the most extreme form of NUMA-ness imaginable.  With a standard Intel server, local DRAM will be accessible with a latency of about 50-75ns.  Non-local jumps to 125ns, but RDMA could easily exceed 750ns, and 1500ns might not be unusual even for a nearby node, accessible with just one TOR switch hop.  (Optical networking could help, driving this down to perhaps 100ns, but that is still far in the future and involves many assumptions.)

There may be an answer, but it isn't obvious people will like it: we will need a new programming style that systematically separates data flows from control, so that this new style of out of band coding would be easier to pull off.  Why a new programming style?   Well, today's prevailing style of code centers on a program that "owns" the data and holds it in locally malloc-ed memory: I might write code to read frames of video, then ask my GPU to segment the data -- and the data transits through my DRAM (in fact the GPU would often use my DRAM as its memory source).  Similarly for FPGA.  In a disaggregated environment, we would give the GPU or the FPGA memory of its own, then wire it directly to the data source, and control all of this from an out of band control program running on a completely different processor.  The needed abstractions for that style of computing are simply lacking in most of today's programming languages and systems.  You do see elements of the story -- for example, Tensor Flow can be told to run portions of its graph on a designated TPU cluster -- but I would argue that we haven't yet found the right OS and programming language coding styles to make this easy at data-center scale.

Worse, we need to focus on applications that would be inherently latency-tolerant: latency barriers will translate to a slowdown for the end user.  To avoid that, we need a way of coding that promotes asynchronous flows and doesn't require a lot of hopping around, RPCs or locking.  And I'll simply say that we lack that style of coding now -- we haven't really needed it, and the work on this model hasn't really taken off.  So this is a solvable problem, for sure, but it isn't there yet.

Beyond all this, we have the issues that Mach ran into.  If we disaggregate, all of those same questions will arise: RPC on the critical path, context switch delays, mismatch of the hardware MMU models with the new virtualized disaggregated process model, permissions, you name it.  Today's hardware is nothing like the hardware when Mach was created, and I bet there are all sorts of cool new options.  But it won't be trivial to figure them all out.

I have colleagues who work in this area, and I think the early signs are very exciting.  It looks entirely plausible that the answers are going to be there, perhaps in five years and certainly in ten.  The only issue is that the industry seems to think disaggregation is already happening, and yet those answers simply aren't in hand today.

It seems to me that we also risk an outcome where researchers solve the problem, and yet the intended end-users just don't like the solution.  The other side of the Tensor Flow story seems to center on the popularity of just using languages everyone already knows, like Python or Java, as the basis for whatever it is we plan to do.  But Python has no real concept of resource locations: Tensor Flow takes Python and bolts on the concept of execution in some specific place, but here we are talking about flows, execution contexts, memory management, you name it.  So you run some risk that smart young researchers will demonstrate amazing potential, but that the resulting model won't work for legacy applications (unless a miracle occurs and someone finds a way to start with a popular framework like Tensor Flow and "compile" it to the needed abstractions).  Can such a thing be done?  Perhaps so!  But when?  That strikes me as a puzzle.

So I'll stop here, but I will say that disaggregation looks like a cool opportunity that really is happening: definitely one of the next big deals.   Researchers are wise to jump in.  People headed into industry, though, who are likely to be technology users rather than inventors, need to appreciate that it is still too early to say what will be successful, and when -- and definitely too early to get carried away.  That was the bottom-line message of the lecture I just gave on this topic!

Thursday, 9 August 2018

Magical Thinking and the Logical Foundations of BlockChains

During the past few years, I've been exposed to an unrelenting drumbeat for BlockChains.  The level of enthusiasm for this model, and the commercial mania around it, have all the elements of a "market bubble".  Just yesterday I saw a quote from a BlockChain/CyberCoin billionaire who believes that "BlockChain will replace the Internet."  Really?  But search for that phrase and you'll actually find that this guy is saying something many people believe.  Rational or not, there is a huge community totally convinced that the future will be a giant BlockChain.

The BlockChain buzz was evident at the recent conference I attended, where one speaker told us about a Berkeley spin-off based on BlockChain: Oasis, which just landed $45M in first-round "seed" funding.  Just think about that for a moment: how can such a number be justified?  I'm a skeptic.

Oasis apparently plans to build an infrastructure that uses BlockChain as its storage solution, secure from the ground up.  Presumably this is one reason Oasis needed so much cash: most companies these days just run on a public cloud like Azure or Amazon (or any of the others).  Building from the ground up will be expensive, but it would let Oasis avoid a pitfall its competitors share: because those competitors depend on existing datacenters, they are exposed to whatever security flaws those cloud platforms embody.

But can one build a secure data center from the ground up?  Let's focus purely on storage, since this seems to be the essence of the Oasis plan.  Could one build a new kind of secure data center that (only) provides secure storage using BlockChain, built from the ground up?

The first step is to reject the permissionless BlockChain model, which is too weak to guarantee freedom from rollbacks even years after a transaction seemingly commits: permissionless BlockChain systems with anonymous servers are insecure by design.  We want a minimal BlockChain solution, but if you take minimal to mean "anonymous, globally replicated, permissionless", my answer is that "it can't be done: it is impossible to create a trustworthy platform with that mix of properties and technologies."

Fortunately, the permissioned model avoids this risk. Moreover, it seems reasonable to assume that if the goal is to offer a data-center BlockChain product, the data center operator can control which machines operate the solution.  This moves us from the BitCoin style of anonymous, amorphous, fully decentralized BlockChain to a more standard model of append-only files used by customers like banks.

Next, we should perhaps reject a standard cloud that offers encrypted append-only files. This is an interesting step in the analysis, because a BlockChain really is just a secure append-only file, no matter what anyone might tell you: it is secured using SHA-256 or similar block hashes, with proof-of-work if the system is permissionless, and with the signatures entangled to prevent tampering.  Any file system could play that role, if you code the logic needed to generate records formatted this way, with the required chain of attestation.  Amazon and Azure and other cloud companies already offer extremely secure storage layers, including BlockChain services. But as noted, those depend on other elements of the respective datacenter systems.  So out with the old, and in with the new!
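To make the "just an append-only file" point concrete, here is a minimal sketch of a hash-chained log: SHA-256 links each record to its predecessor, so tampering anywhere breaks the chain.  This is purely illustrative; it omits signatures, replication, and everything else a real product would need:

```python
# A minimal hash-chained append-only log: each record carries the hash of its
# predecessor, so any tampering invalidates every later hash.  Illustrative only.
import hashlib, json

def _digest(prev_hash, payload):
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def append(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"prev": prev_hash, "payload": payload,
                  "hash": _digest(prev_hash, payload)})

def verify(chain):
    prev_hash = "0" * 64
    for block in chain:
        if block["prev"] != prev_hash or block["hash"] != _digest(prev_hash, block["payload"]):
            return False
        prev_hash = block["hash"]
    return True

log = []
append(log, {"txn": "Alice pays Bob 10"})
append(log, {"txn": "Bob pays Carol 3"})
print(verify(log))                                   # True
log[0]["payload"]["txn"] = "Alice pays Mallory 10"   # tamper with an old record
print(verify(log))                                   # False: the chain is broken
```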

Now, without knowing anything about the proprietary protocols Oasis is presumably designing, I can't say how they plan to guarantee correctness.  But I can tell you how I would do it.  I would use a form of Paxos, and because I would want extreme speed, I would go with the version of durable Paxos that we implemented in Cornell's Derecho system.  If I were chief architect at Oasis, I might want to build my own software (hence not use Derecho), but I would certainly adopt the Paxos specification and prove that my software implements it.
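For what it's worth, here is a hedged sketch of how such a log could sit on top of a Paxos-style replicated state machine.  The ReplicatedLog class below is a stand-in for whatever totally ordered, durable delivery layer the real system would provide; the names are mine and this is not Derecho's actual API:

```python
# A sketch of layering a hash-chained log over state machine replication.
# ReplicatedLog is hypothetical: it stands in for a Paxos-style layer that
# delivers every command durably and in one total order on every replica.
import hashlib, json

class ReplicatedLog:
    def __init__(self, apply):
        self.apply = apply
    def propose(self, command):
        # A real implementation would block until a quorum has durably
        # logged the command, then deliver it everywhere in order.
        self.apply(command)

class BlockChainService:
    """Every replica applies the same commands in the same order, so every
    replica builds an identical hash-chained log."""
    def __init__(self):
        self.chain = []
        self.log = ReplicatedLog(self._apply)
    def submit(self, payload):
        self.log.propose(payload)
    def _apply(self, payload):
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
        self.chain.append({"prev": prev, "payload": payload,
                           "hash": hashlib.sha256(body.encode()).hexdigest()})

svc = BlockChainService()
svc.submit({"txn": "Alice pays Bob 10"})
svc.submit({"txn": "Bob pays Carol 3"})
print(len(svc.chain), svc.chain[-1]["hash"][:12])
```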

Of course being a Derecho zealot, I'm going to make the case that using Derecho might be the smart move, even if you might have preferred to roll your own.

First, I should note that by using Paxos, Derecho is able to leverage decades of work on proofs of correctness -- Derecho was implemented by fusing a proved-correct version of Paxos with a proved-correct version of the virtual synchrony membership-management model and a new reliable (but not atomic) multicast layer.  Then all of the data-movement steps were mapped efficiently to the available storage (SSD or 3-D XPoint) and communication technologies (RDMA or TCP).  Finally, Derecho uses specialized optimizations for code paths that turn out to be performance-critical.

Next, it is worth noting that Derecho is open source, hence can benefit from community contributions over time, and also that the system is extremely fast -- the fastest such system ever created.  Further, it actually achieves minimal bounds for Paxos (Derecho's protocols are "optimal" in a formal sense).  Of course one can always argue about constants, but given the speed of the system, there is an obvious case to be made that the constants show no sign of being unreasonably large.  So Derecho is a genuinely interesting option!  Heck, maybe my students and I should try to raise $50M or so and jump into the commercial space!

But now we run into the first of a series of issues: the protocols and proofs we are leveraging were all created in the usual way: on paper, by hand.  The mapping to modern networking didn't modify the underlying logic, but it was done by hand too, and hence also needed to be proved by hand.  The same can be said about our dozens of optimizations.

Is there really a sensible way to argue that all of these hand-written proofs should be accepted as somehow "more trustworthy" than Amazon AWS or Azure?  After all, those companies are serious about specifications too, and both have invested hundreds of millions of dollars in their testing and Q/A processes.  Derecho is remarkably robust, and we would point to all those proofs as part of the reason, but even so, we do find bugs in our logic.

Now, if I had the money, one option might be to harden Derecho.  Aggressive testing, deep quality assurance and better documentation would be a breeze.   Starting with such a strong prototype, we could quickly end up with a professional-quality tool.  In fact I actually hope to do this, in the next year or so, if Derecho users are able to chip in to help with the costs.

But perhaps you still wouldn't be satisfied.  And indeed, today's state of the art goes much further: the very best approach is to run only code that has been extracted from a machine-checked proof.  In effect, we compile the proof into the running system and take hand-written code off the critical path entirely.

This, though, turns out to be infeasible for software as elaborate as Derecho, and would be even harder for whatever Oasis really plans to build.  The issue here is that as of today, the provers can't handle the style of code that we needed to use in order to create Derecho, and any full data-center infrastructure would have 10x more such code, and perhaps far more than just 10x.

Today's best provers actually can handle automated extraction of Paxos code from correctness proofs (this has been done using NuPRL here at Cornell, with IOA and Larch in Nancy Lynch's group at MIT, and with Dafny in Microsoft's IronFleet project).  The resulting solutions are very robust, in the sense that the logical proof structures are rock solid -- for what they cover.  Unfortunately, the resulting code is incredibly slow compared to Derecho.  Moreover, infrastructure aspects are much harder to formalize than data replication and consistency, so in some sense Paxos is the "easy" part of the story.  Can we specify the full system?  Is the specification itself bullet-proof?  What about bootstrap, and the other management tasks?  These steps are much harder than many people would expect.  In some cases they may not even be possible!

This same pattern is evident in many projects that have formalized aspects of operating systems or runtime environments.  At MIT, Nick Zeldovich famously proved a very simple file system correct a few years ago.  It ran on Linux, but there is a proved-correct microkernel called seL4, and while it doesn't offer everything Linux does, seL4 probably has enough in it to support Nick's provably correct file system.

Then you could perhaps use the proved-correct C compiler to compile the thing (C++ is considered too complex for existing provers).  Even better would be to build the whole thing in a language like Rust or Dafny, where proof is much more central to the coding and compilation model.  With such help, you may actually manage to create a proved solution -- for parts of your full system.

But without exception, you'll end up with code that is slow compared to the best possible, and you will have solved just a portion of the entire datacenter infrastructure.  We are decades away from having data-center-scale, provably correct, ultra-performant software systems.  Perhaps we'll never get there.


Moreover, even if we did manage to overcome these barriers, we would run into further questions.

One big issue is the hardware.  Think about any hardware component in the entire data center.  The routers.  Printers. Digital telephone systems.  Storage devices.

$45M may seem like a huge amount of money, but in fact it is a tiny drop in the bucket when you consider that companies like Intel spend billions on their VLSI chip fabrication factories.  So there is simply no question that Oasis will end up using components some other company created.

The problem is that these components tend to be software-defined, and this is becoming more and more of a standard story: Almost all hardware components have general-purpose, highly capable processors within them.  An entire separate computer, with its own memory, network interfaces and code.

Indeed, if you needed one CPU-and-DRAM unit just to operate the hardware, why not include two on the same board, or chip?  Who would notice?  And if you do this, why not drop malicious code into the second unit?  Your printer company may not even realize that it is selling compromised devices: I've heard of network chips that had entire secondary networking infrastructures built in, operated by stealth operating systems.  You can monitor your network as closely as you like; you would never notice the stealth network, or its control logic!

So you can perhaps make the higher layers provably secure, but this issue of trust in the hardware limits the degree to which the resulting system can be secure.

If you run a secure solution on compromised hardware, all bets are off.  The situation improves a bit if you can trust the hardware: Intel has an approach called SGX that helps, and perhaps Oasis plans to leverage it, but if so, they will face performance challenges.  Sadly, SGX is quite slow.

But suppose they pull all of this off: a ground-up datacenter solution, minimal trust in the hardware, offering a BlockChain storage layer to customers.  Now we run into a new puzzle: how to draw the boundary between the trusted storage solution and the customer's application.

The problem here has to do with composing a trusted application with a trusted storage layer through some less-trusted layer of intermediary logic, such as the runtime associated with the programming language.  Modern applications are coded in standard languages like Java, Python, C++ and Ruby.  They use databases from big vendors like Oracle, web servers like Apache, and so forth... and each of these brings its own millions of lines of logic and its own runtime environment.  Those presumably have flaws, hence an attacker could potentially compromise the application by compromising some element of the compiler or runtime, even if other aspects of the datacenter BlockChain storage solution were magically iron-clad.

And this is why I'm a skeptic: I don't see how such a picture can be secured, not today, and probably not in my lifetime.

I actually do believe in security, but I think that in modern systems we get much more protection from diversity and multiple layers of monitoring than we do from automated techniques like verification.  We should definitely do verification where we can -- for systems like Derecho, it offers a path that can slash bug rates and greatly improve confidence that the system keeps the promises it can legitimately make.  But in my view we oversell the power of verification and proofs if we go further and allow people to believe that we have discovered a magic way to carry the idea to its limits and "prove the whole thing," whatever that thing may be.  BlockChains don't change this reality.

The Oasis folks will presumably read this blog, and I should emphasize that it isn't a personal criticism.  I'm a huge fan of the Berkeley security team and have long been impressed by their accomplishments.  But even so, and without being hostile in any way at all, I do think we all share a need to be frank with the public about what technology can, and cannot, accomplish.

BlockChains are being oversold as a trivial way to get perfect security -- not by Oasis, but by the popular technology press, which seems to have a very dim understanding of the concept.  The press seems to think BlockChains are somehow "more" than secure file systems.  Yet if anything, they are "less"!

Perhaps we are seeing a superposition of two elements.  Clearly, we do have a community that was unaware of the concept of a tamper-proof, append-only log protected using cryptography.  All sorts of folks who work in government, health care, and manufacturing feel that this is a revelation, and that it offers an answer to all their security worries.  Quite possibly some were genuinely unaware that we can use file systems in secure ways -- I'm glad they know now, and if BlockChain helps them conceive of this, I'm all for it.

Then we have hype, driven by cryptocurrency valuations in the stratosphere, a technology press endlessly eager for the next big thing, and investors keen to make a killing.  In this marketplace, I see a lot of value that companies can bring to the table, particularly ones with exciting ideas for packaging this stuff to be useful, and some of this value is very real: HyperLedger and Ethereum and related tools are genuinely powerful.

But just the same, we need to stop claiming that BlockChain is a revolutionary invention.

My worry is that by overselling the same old file systems to a naïve community of infrastructure owners, we may end up with systems that are actually less secure than the ones they replace.  After all, a platform like Azure is remarkably secure today, managed professionally by a company obsessed with security (and nobody is paying me to say this, by the way!), and it deploys such a diversity of technologies that compromising it in a major way would be quite hard.  If some hospital were to abandon that and jump to BlockChain as its file system, an illusion of security would have replaced much stronger real security.  True, today's reality has its limits, but I would be far more comfortable seeing medical records hosted on Azure or AWS than abandoning these professional solutions in favor of a storage product that uses BlockChain as a magic wand to solve all problems.

But clearly, the current climate (especially the technology press) is prone to magical thinking, and a bit weak on just what BlockChains are, and how they work, and what their logical foundations turn out to be.  And in light of that, I suppose that the amazingly high first round of Oasis investment makes a kind of sense.  A bubble?  Definitely.  And yet all valuations are measures of market sentiment.  So perhaps not an unreasonable bubble, given the modern business climate and the craze that BlockChain has engendered.