Monday, 15 October 2018

Improving asynchronous communication in u-service systems.

I've been mostly blogging on high-level topics related to the intelligent edge, IoT, and BlockChain lately.  In my work, though, there are also a number of deeper tradeoffs that arise.  I thought I might share one, because it has huge performance implications in today's systems and yet seems not to get much "press".

Context.  To appreciate the issue I want to talk about, think about a data pipeline moving some sort of stream of bytes at very high speed from point A to point B.  If you want to be very concrete, maybe A is a video camera and B is a machine that runs some form of video analytic.  Nothing very exotic.  Could run on TCP -- or it could use RDMA in some way.  For our purposes here, you have my encouragement to think of RDMA as a hardware-accelerated version of TCP.

Now, what will limit performance?  One obvious issue is that if there is some form of chain of relaying between A and B, any delays in relaying could cause a performance hiccup.  Why might a chain arise?  Network routing is one obvious answer, but modern cloud systems use what we call a micro-services (u-services) model, in which we deliberately break computations down into stages, each implemented by attaching a small "function" program: a normal program written in C++ or Java or a similar language, running in a lightweight container setting like Kubernetes, with "services" such as file storage coming from other services in the same data center, mostly running on different machines.  You code these much as you might define a mouse-handler method: there is a way to associate events with various kinds of inputs or actions, a way to attach an event handler to those events, and a way to provide the code for the handler.  The events themselves are mostly associated with web requests, and because these support a standard called "web services", there is a way to pass arguments in and even to get a response back: a form of procedure call that runs over the web and invokes a function handler coded in this particular style.
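To make the handler style concrete, here is a minimal sketch, loosely modeled on the Python handler signature used by platforms like AWS Lambda.  The event fields, the handler name, and the thumbnail rule are all hypothetical stand-ins, not any platform's actual API.

```python
# A minimal sketch of the "function" style: the platform invokes the handler
# once per event, passing the event's arguments, and the return value becomes
# the response of the web-service-style call.

def handle_photo_uploaded(event, context=None):
    """Invoked by the platform each time a photo-uploaded event fires."""
    photo_id = event["photo_id"]
    size_bytes = event["size_bytes"]
    # A real handler would call out to storage or ML services here; this
    # stub just computes a derived record and returns it as the response.
    action = "thumbnail-requested" if size_bytes > 1_000_000 else "stored"
    return {"photo_id": photo_id, "action": action}

print(handle_photo_uploaded({"photo_id": "p1", "size_bytes": 2_000_000}))
```

The developer writes only this function; everything else (routing the event, launching a container, returning the response) is the platform's job.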

So why might a u-services model generate long asynchronous chains of actions?  Consider a typical use case: a person snaps a photo, which uploads into the cloud, where it is automatically analyzed to find faces, and then the faces are automatically tagged with best matches in the social network of the camera-owner, and so forth.  Each of these steps occurs as a function event in the current functional cloud model, so the single action of taking a photo (on Facebook, WhatsApp, Instagram, or whatever) generates a chain.  In my work on Derecho, we are concerned with chains too.  Derecho is used to replicate data, and we often see a pipeline of photos or videos or other objects, which are passed through layers of our system (a chain of processing elements) before they finally reach their delivery target.

Chains of any kind can cause real issues.  If something downstream pauses for a while, huge backlogs can form, particularly if the system configures its buffers generously.  So what seems like a natural mechanism to absorb small delays turns out to sometimes cause huge delays and even total collapse!
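A toy simulation makes the point: a producer and consumer that run at the same rate, plus a generous buffer, turn a brief downstream pause into a permanent backlog.  All the numbers here are made up for illustration.

```python
# A toy discrete-time model: one item arrives per tick; the consumer drains
# one item per tick unless it is paused.  A generous buffer absorbs the
# pause without dropping anything -- but look at what happens afterwards.

def simulate(ticks, pause_start, pause_len):
    backlog, history = 0, []
    for t in range(ticks):
        backlog += 1                                  # one new item arrives
        paused = pause_start <= t < pause_start + pause_len
        if not paused and backlog > 0:
            backlog -= 1                              # consumer drains one item
        history.append(backlog)
    return history

h = simulate(ticks=100, pause_start=10, pause_len=20)
# The backlog climbs to 20 during the pause -- and never drains afterwards,
# because the consumer's rate merely matches the producer's.  Every later
# item now carries a 20-tick delay.
print(max(h), h[-1])
```

Notice that unless the consumer can temporarily run faster than the producer, the delay stored up in the buffer is never recovered: "absorbing" the hiccup just made it permanent.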

Implications.  With the context established, we can tackle the real point: for peak performance, the chain has to run steadily and asynchronously.  We need an efficient, steady, high rate of events flowing through the system at minimal cost.  This in turn means we want some form of buffering between the stages, to smooth out transient rate mismatches or delays.  But when we buffer large objects, like videos, we quickly fill the in-memory buffer capacity, and data will then spill to disk.  The stored data will later need to be reread when it is finally time to process it: a double overhead that can incur considerable delay and use quite a lot of resources.  With long pipelines, these delays are multiplied by the pipeline length.  Even worse, modern web standards often use a text-based data representation when forwarding information, but a binary one internally: the program will be handed a photo, but the photo was passed down the wire in ASCII form.  Those translations cost a fortune!
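The translation cost is easy to quantify.  Base64, the usual way binary data rides inside a text-based protocol, inflates every payload by a third, and both ends pay CPU on every hop to convert.

```python
import base64

# Base64 turns every 3 raw bytes into 4 ASCII characters, so a binary
# payload grows by a factor of 4/3 on the wire -- and encoding/decoding
# costs CPU at each stage that forwards it.

raw = bytes(range(256)) * 4096           # ~1 MB stand-in for a photo
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))            # the encoded form is ~33% larger
assert len(encoded) == 4 * ((len(raw) + 2) // 3)
assert base64.b64decode(encoded) == raw  # the round trip is lossless, but
                                         # the conversions are pure overhead
```

Multiply that by every stage in a long pipeline and the waste adds up quickly.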

What alternative do we have?  One option to consider is back-pressure: preventing the source from sending the photo unless there is adequate space ("credits") available at the receiver.  While this pushes the delay all the way back to the camera, the benefit is that the internal overheads are avoided, so overall system capacity increases.  The end-to-end system may perform far better even though we've seemingly reduced the stage-by-stage delay tolerance.
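The credit scheme fits in a few lines.  This sketch is illustrative only (the class and names are invented, not any particular system's API): the receiver grants a fixed number of credits, the sender may transmit only while it holds one, and each consumed message returns a credit.

```python
from collections import deque

# Credit-based back-pressure: the receiver promises space for N messages.
# When credits run out, the *sender* blocks -- the delay lands at the source
# instead of accumulating as buffered data inside the pipeline.

class CreditLink:
    def __init__(self, credits):
        self.credits = credits            # receiver-granted send permissions
        self.in_flight = deque()

    def try_send(self, msg):
        if self.credits == 0:
            return False                  # no credit: source must wait
        self.credits -= 1
        self.in_flight.append(msg)
        return True

    def receive(self):
        msg = self.in_flight.popleft()
        self.credits += 1                 # consuming a message frees a slot
        return msg

link = CreditLink(credits=2)
assert link.try_send("photo-1")
assert link.try_send("photo-2")
assert not link.try_send("photo-3")      # receiver is full: sender blocks
link.receive()
assert link.try_send("photo-3")          # credit returned, send proceeds
```

The design choice here is that backlog becomes visible at the producer immediately, rather than silently piling up (and spilling to disk) somewhere mid-pipeline.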

But wait, perhaps we can do even better.  Eliminating the ASCII conversions would really pay off.  So why not trick the system into taking the original photo, storing it into a blob ("binary large object") store, and substituting a URL for it in the pipeline?  Now our pipeline isn't carrying much data at all: the objects being shipped around might be just a few tens of bytes in size, instead of many megabytes.  The blob store can offer a more optimized photo transfer service, and the need for buffering in the pipeline itself is eliminated, because these tiny objects are so cheap to ship.
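Here is a sketch of the substitution, with a plain dict standing in for a real blob service like S3 or Azure Blob storage.  The blob:// URL scheme and function names are invented for illustration.

```python
import uuid

# Pass-by-reference: stash the big payload in a blob store and ship only a
# small handle through the pipeline.  Data-touching stages dereference the
# handle; everything else just forwards a few tens of bytes.

BLOB_STORE = {}     # stand-in for a real blob service

def put_blob(data: bytes) -> str:
    url = f"blob://store/{uuid.uuid4().hex}"   # hypothetical URL scheme
    BLOB_STORE[url] = data
    return url

def get_blob(url: str) -> bytes:
    return BLOB_STORE[url]

photo = b"\xff\xd8" + b"\x00" * 5_000_000      # ~5 MB stand-in for a JPEG
handle = put_blob(photo)

assert len(handle) < 100                       # the pipeline carries this...
assert get_blob(handle) == photo               # ...and fetches only on demand
```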

This would be a useless optimization if most stages needed to touch the image data itself, unless the stage-by-stage protocol is hopelessly tied to ASCII representations and the blob-store transfer is super-fast (but wait... that might be true!).

However, there is even more reason to try this: in a u-services world, most stages do things like key-value storage, queuing or message-bus communication, and indexing.  Quite possibly the majority of stages aren't data-touching at all, and hence would never fetch the image itself.  The tradeoff is to carry the metadata the intermediary pipeline elements actually need, while parking the rarely-touched bits in the blob store for retrieval on demand.  We could aspire to a "zero copy" story: one that never copies unnecessarily (and that uses RDMA for the data transfers that do have to occur, of course!)

Tuesday, 25 September 2018

Blockchain verification and the dangers of branching temporal logic models (very technical)

Many years ago, I worked with some colleagues on an ill-fated topic: we tried to write down a logical statement concerning the guarantees provided by atomic multicast systems that manage their own membership.  Today, we know how to do that, courtesy of Lamport’s Paxos specification and the proof methods he introduced.

But those were the Wild West days, and that particular project occurred before the Paxos specification was created.  Moreover, our atomic multicast (which actually could be configured as a Paxos protocol), also included some optional features Leslie omitted from Paxos, for good reason.  Those centered on a form of optimistic early delivery combined with barrier synchronization (analogous to a memory fence).

Our puzzle centered on expressions describing "all the possible" future behaviors that could arise with this rather complex optimistic form of early delivery.  The problem was that the set of future scenarios grew exponentially as new members joined and existing members failed (or exited voluntarily).  Our logic needed to "account" for all of these possibilities.  In fact, the logic itself had a flaw, but even had we managed to write it down properly, we still would have had a semantics in which statements can only be interpreted within the "space" of future scenarios.  Since it is intractable to do model checking in an exponentially growing state space, such statements are often undecidable: they have a true or false value, and yet no decision procedure can be created that will terminate in bounded time.

A total mess.  We ultimately abandoned the effort, tails between our legs, and came to view it as an embarrassment.  The less said about it, the better!

Except... I sometimes tell the story, as a cautionary tale.

Not every technology suffers from the issue we encountered.  Lamport's way of formalizing Paxos avoids the issue by avoiding speculative "early delivery", which was the real issue in our early work.  This is one reason that Lamport's Paxos specification was such a success.

Transactional database models also have a better way of handling such problems: when we ask whether a database system is in a serializable state, the rule is to start by erasing all the uncommitted transactions, at which point serializability is defined as a property of the committed state.  This approach accepts that transactions could glimpse inconsistent states while executing: it isn't a problem so long as those transactions can't commit.  Moreover, it erases all the events that depend on future outcomes, neatly avoiding the whole issue our unfortunate effort ran up against.

Which brings us to BlockChain.  I'm intrigued by the recent work that seeks to put a kind of transactional database behavior "into" the BlockChain, by incorporating SQL-like statements into the transactions themselves, but then reevaluating them as the BlockChain steadily grows.

To appreciate why this poses the same problem I struggled with twenty years ago, think about a smart contract that says something along the following lines: "John agrees to sell me his horse, Bucky, for the sum of $1,500, and has accepted a $150 deposit.  If I haven't completed the purchase within a week, John agrees to return the deposit.  But in the meanwhile, John can continue to try and sell Bucky.  If he finds another buyer, he can cancel my transaction, but in that case must both return the deposit and also pay me an additional $100, to compensate me for my trouble."

The world is full of contracts like these.  Smart contracts can express things like rules for computing interest that depend on global interest rates.  We probably all remember 2008, when the world financial system melted down over issues with mortgage-backed securities split into interest and principal.  The claim is that the expressive power of smart contracts is a good thing, because smart contracts can be analyzed by tools (compiler-style tools), and hence it should be possible to automatically identify risks.  Risk managers would then have robust ways to compute their risk and hedge against it, and those hedging contracts (insurance deals) could also be carefully evaluated, and so on.  Back in 2008, it wasn't the mortgage-backed securities that collapsed, per se.  It was the insurance companies that insured them, but didn't hedge the "secondary" risk properly.

So... how does one attach a "formal meaning" to a smart contract?  Let's go back to John's sale of Bucky.  Notice that this contract depends on how things play out into the future.  For the coming week, the BlockChain will grow, and each new block added to the chain could bring events relevant to the contract.  John could decide to cancel the contract and sign a new contract with Sally (perhaps she is ready to pay more -- enough to more than compensate John for the $100 he'll forfeit).  I could show up with the remaining $1,350, and head home with Bucky.  A week could pass, and John would have to return my $150 deposit.  And it gets worse: John could sign with Sally, but then Sally might cancel her deal, and perhaps John would then want to reinstate his deal with me.

Much as in that early work I tried to do, John's smart contract with me has a meaning that can depend on a branching future state: some large (maybe exponential) space of possibilities, each leading to its own proper interpretation of the correct "thing to do".  Should John hand Bucky over to me, or not?  Do I owe him $1,350, or does he owe me $150, or should it be $250?

Without much trouble, we can design sequences of smart contracts in which, to know the proper outcome of my contract, I need to figure out the outcome of Sally's contract (and this then becomes an induction, because Sally's contract may depend on the outcome of Carl's contract).  This is precisely how my early work failed: you end up with scenarios that can be arbitrarily prolonged, and the total space of scenarios grows exponentially in the length of the future chain, because each new event depends on its own future outcomes.
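The blow-up is easy to make concrete.  Suppose each new block can carry one of four contract-relevant events; then the space of distinct futures a verifier must reason about after d more blocks is 4^d.  The event names below are invented for illustration.

```python
from itertools import product

# Each future block can carry one of these events (a deliberately tiny
# alphabet); a future is a sequence of such events, one per block.

EVENTS = ["buyer_pays", "seller_cancels", "deadline_passes", "nothing"]

def futures(depth):
    """Every distinct sequence of events the next `depth` blocks could carry."""
    return list(product(EVENTS, repeat=depth))

assert len(futures(1)) == 4
assert len(futures(5)) == 4 ** 5          # 1,024 scenarios after 5 blocks
# After 10 more blocks there are 4**10 = 1,048,576 futures; a model checker
# evaluating the contract must, in effect, reason about all of them.
assert 4 ** 10 == 1_048_576
```

And since each scenario can itself spawn contracts with their own future references, the depth of the induction is unbounded: this is exactly the state-space explosion that sank my early work.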

Beyond all of which we have the issue of rollbacks: even if you accept the common wisdom and adopt the view that a BlockChain prefix has magically "committed" once it has been extended by six or more blocks, we still run into the problem that the suffix is unstable.  So we could have one suffix in which Sally's transaction finalizes, but it might then rollback, aborting that outcome and perhaps replacing it with one in which Sally cancels her purchase.

Should it trouble us that smart contracts on BlockChains might not have a tractable meaning -- a reduction to temporal logic -- if they include future references?  For that matter, even without future references, some of the aspects just mentioned would still arise.   Is this bad?

I think so: it seems to me that in computing, if we've learned one thing over the decades, it is the overarching importance of rigorous semantics.  If BlockChains with smart contracts can't be reduced to a stable logical framework in which proofs can be carried out without solving infeasible problems (and evaluating logical formulas within an exponentially growing state space is a well-known infeasible problem), then we are looking at a profoundly untrustworthy framework.

So beware, all of you rabid BlockChain investors!  If you are betting big on smart contracts, you owe it to yourselves to figure out a way to reduce the statements those contracts make to a stable, computationally feasible form.  You know what they say: those who fail to learn from the past are doomed to repeat it.  If you don't find a robust and tractable semantics for your technology, then someday, twenty years from now, you too will be writing little blog postings about how your work once took a hopelessly wrong turn... and that Professor Birman's sad story of his unfortunate foray into the theory of branching future executions should have warned you!

Thursday, 13 September 2018

Will HPC survive the cloud?

I just got back from an HPC workshop, where a lot of the discussion focused on the impact of cloud computing on HPC.  Here are a few of the main take-aways.
  • First, to get this up in front, HPC is alive and well.  A whole slew of amazing new computers are about to be powered up, operating at speeds that just defy human understanding.  So HPC isn't about to collapse and die tomorrow.  (Ten years out, though, is a more complex question).
  • Some of the really big financial drivers for HPC are things that genuinely need massive compute infrastructures: tasks like weather prediction, scientific computing from experiments like the LIGO gravitational-wave observatory, modelling the air flow around a supersonic jet.
  • But more and more HPC tasks have an embarrassingly parallel structure and really map down to huge numbers of subtasks that don't need massive computers to perform.  One person estimated that 90 to 95% of the workload on today's biggest computers consists of vast numbers of smaller jobs that run as a batch, but could easily be performed on smaller machines if they had the right hardware.
  • And several speakers put pictures of big cloud computing data centers up, and pointed out that no matter how exciting those new HPC systems will be, even a small cloud data center has vastly more compute power in it, and vastly more storage capacity.
  • On top of this, we have the runaway success of Microsoft's Azure HPC, which has become a genuinely hot cloud platform -- the demand far exceeds what people had expected, based on industry articles I've followed over the past few years.  Azure HPC offers smallish clusters that might have, say, 48 machines, but those machines would then be running the same "bare metal" platforms you see on massive HPC supercomputers.  And nothing stops Azure HPC from ramping up and starting to offer larger and larger configurations.  Rather than run MPI over RoCE, Microsoft just puts a second network infrastructure on their Azure HPC clusters, using InfiniBand for MPI and treating the standard ethernet as a control network for general TCP/IP uses.

So this is the looming threat to the HPC community: not so much that HPC might suddenly lose steam, but rather that we could see some non-trivial percentage of HPC jobs migrate towards platforms like Azure HPC.  And in fact one speaker at the workshop was the head of computing for a large research university, who told us about a consortium being formed to promote just that transition.  He explained that while really big HPC still needs the big data centers, like the U. Texas XSEDE systems, most campus needs could be adequately served with smaller resources.  This makes it appealing for universities to rent, rather than own, and by forming consortia, they gain the bargaining power to make financially compelling deals with big cloud HPC operators like Microsoft (and not just Microsoft -- he pointed out that as a buyer shopping around, he was getting bids from quite a few cloud providers).

The issue this raises is that it redirects money that would in the past have flowed to the HPC data centers towards those big providers.  Imagine a world in which, say five years from now, 30% of today's HPC has moved to cloud solutions.  The loss of that income base could make it very hard for the big data centers to continue to invest and upgrade.  Meanwhile, all that cash flowing to the data center owners would incent them to explore more and more ambitious cloud-hosted HPC products, trying to find the sweet spot that maximizes income without overstretching them. 

The second issue I'm seeing relates to my new favorite topic: the intelligent, reactive cloud edge.  Recall from my past few blog postings that I've been fascinated by the evolution of the first tier of the cloud: machines inside the data center, but that are on the front line, running services that directly handle incoming data from IoT devices, real-time uses like smart cars or smart homes, or other time-critical, highly demanding, applications.  Part of my interest is that I'm really not fond of just working on web servers, and these intelligent IoT applications need the mix of fault-tolerance and consistency that my group specializes in: they seem like the right home for our Derecho technology and the file system that runs over it, Freeze Frame.

But this has an HPC ramification too: if companies like Microsoft want Azure HPC to be a player in their cloud applications, they will invest to strengthen the options for using HPC solutions as part of real-time edge applications.  We'll see a growing range of HPC platforms that can tie deeply right into the Azure IoT Edge, for example, and HPC could start to perform demanding tasks under real-time pressure.

Right now, those standards aren't at all mature -- HPC systems are casual about endlessly slow startup (I did one experiment with MPI and was shocked to realize that multi-minute delays are totally common between when a job "starts" and when the full configuration of the job is actually up and ready to run my application).  We could talk about why this is the case: they do silly things like pulling the container images one by one on the nodes as they launch, and sometimes actually pull DLLs in one by one as needed too, so the causes are totally mundane.  Derecho (or even its RDMC component) could be "life transforming" for this kind of thing!  But the real point is that it can be fixed.

So imagine that over a five year period, the Azure edge, and similar systems from Amazon and other providers, start to really do a great job of integrating HPC into the cloud.  The rich and extensive tool set the HPC community has developed suddenly becomes available to cloud application creators, for use in real-time situations, and it becomes easy to capture data and "farm it out" to HPC with no significant delays at all (I mean milliseconds, whereas today, that might be minutes...).  Wow, what an exciting thing this could yield!!!

For example, in the electric power grid settings I've worked on, one could do micro-predictions of wind patterns or sunshine patterns and use that to anticipate the power output from wind farms or solar farms.   You could adjust the wind turbines dynamically to maximize their productivity. Someday, with enough knowledge of the communities connected to the grid, we could even predict the exact power output from city-scale rooftop solar deployments.  Just that one use case could be transformative!

Then you can imagine all kinds of image processing and data fusion tasks that would be feasible today in offline settings, but way out of reach for real-time applications.  Suddenly they could become HPC subtasks in this hybrid cloud built as a fast reactive edge with HPC clusters as a module available to the edge applications.  HPC could become a major player in the cloud ecosystem.

This is the bigger threat to the traditional HPC community, as I see it: a threat of explosive innovation that could win by just being more exciting, faster growing, lucrative, and massive in scale.  It wouldn't take long before HPC on the cloud would be the hot setting for young researchers to tackle, and HPC on traditional supercomputers would begin to starve simply because it would look more and more like a legacy world.

At the workshop, we actually had one speaker who made the case that HPC supercomputers were in a "race" to own time-critical (real-time) HPC compute tasks.  But there were many speakers, myself included, who argued that no, the race is already over -- the cloud won before the HPC community even knew that the opportunity existed.  Today, the real race is the race to be a player in this new thing: the intelligent IoT-based edge.  And HPC as a component of that story clearly has a very bright future.  

Wednesday, 22 August 2018

Inventing the intelligent, active, edge

At the recent Microsoft faculty research summit, I was thrilled (and also somewhat relieved) to see that the company decided to run a whole workshop emphasizing systems (including networking), and the ways that systems can support machine learning.  It seemed clear that Microsoft has decided to become a major player in what they call "intelligent edge" computing and is urging all of us to jump on board.

These terms may be new, so I thought I might summarize the trend, based on my current understanding of it.  A good place to start is with a little mini-tutorial on what cloud computing infrastructures have been doing up to now, because the intelligent active edge really builds on the current architecture (over time, it will be more and more differentiated, but today, the overlap is substantial).

So: Today, we have a cloud dominated by a style of computing that prevailed in the 2000-2010 web server and services period.  In a first draft of this essay I wrote a page or so about this topic but it got heavy on jargon and I felt that it was taking too long to get to the point.  So I'll get there very quickly:
  • Many people think of the cloud as a way to rent Linux containers, but the bigger and more exciting trend focuses on elastic platforms that are event driven, connected by various ways to pass objects from layer to layer, and customizable by providing little event handlers: "functions".
  • Examples of platforms like this include Amazon Lambda, Microsoft Azure Functions, Google TensorFlow, and Spark/Databricks RDDs.
  • The connections tend to be via some form of queuing service (Amazon SQS, Apache Kafka, Azure Service Queues, IBM's MQSeries, OpenSplice, etc).  Big objects are often just stored into a large file system (S3, Google GFS, Hadoop HDFS, etc).
  • Everything is sharded from start to finish.  Data shows up on HTTP connections (web requests to web services), but programmable edge routers (like Amazon Route 53) extract keys and use standard distributed hashing schemes to vector the requests into "shards", within which they may additionally load-balance.
  • We cache everything in sight, using DHTs like Amazon Dynamo, Cassandra, Microsoft FaRM.
  • The long-term storage layers are increasingly smart, like Azure Cosmos-DB.  They may do things like deduplication, compression, image segmentation and tagging, creation of secondary objects, etc.  Often they are backed by massive long-term storage layers like Azure Pelican.
  • Then of course we also have standard ways to talk to databases, pre-computed files, other kinds of servers and services, back-end systems that can run MapReduce (Hadoop) or do big-data tasks, etc.
  • The heavy lifting is hardware accelerated using GPU, TPU, FPGA and similar technologies, and as much as possible, we move data using RDMA and store it into memory-mapped non-volatile memory units (SSD or the newer 3D-XPoint NVMs like Optane).
Whew!  I hope you are still with me...
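The key-extraction-and-sharding step in that list can be sketched in a few lines.  Real edge routers use consistent hashing so shards can be added without remapping every key; this simple modulo version shows only the routing idea, with made-up key names.

```python
import hashlib

# Key-based sharding: hash the key extracted from the request, then map the
# hash onto one of N shards.  Within a shard, the system can additionally
# load-balance across replicas.

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    # SHA-256 gives a stable, well-mixed hash (Python's built-in hash() is
    # randomized per process, so it would route inconsistently across nodes).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same key always routes to the same shard...
assert shard_for("user:alice") == shard_for("user:alice")
# ...and every key lands on a valid shard.
assert all(0 <= shard_for(f"user:{i}") < NUM_SHARDS for i in range(1000))
```

The reason production routers prefer consistent hashing over this modulo scheme is that changing NUM_SHARDS here would remap nearly every key, invalidating all the caches downstream.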

The nice thing about this complex but rather "standard" structure is that the developer simply writes a few event handlers for new web requests, and most of the rest is automated by the AWS Lambda, Google TensorFlow, or Azure Functions environment.  Learning to work in this model is a bit of a challenge, because there is a near-total lack of textbooks (my friend Kishore Kumar is thinking of writing one), and because the technologies are still evolving at an insane pace.

The big players have done what they can to make these things a little easier to use.  One common approach is to publish a whole suite of case-study "demos" with nice little pictures showing the approach, like you would find here for Azure, or here for AWS.  In the best cut-and-paste fashion, the developer selects a design template similar to what he or she has in mind, downloads the prebuilt demo, then customizes it by replacing the little functions with new event handlers of his or her own design.  These can be coded in any language that feels right (Python is popular, but you typically get a choice of as many as 40 options, including JavaScript), and will run in a little containerized VM with very fast startup -- often 1ms or less to launch for a new request.

This is the opposite of what we teach in our undergraduate classes, but for the modern cloud is probably the only feasible way to master the enormous complexity of the infrastructures.

So... with this out of the way, what's the excitement about the intelligent edge (aka active edge, reactive edge, IoT edge...)?

The key insight to start with is that the standard cloud isn't a great fit for the emerging world of live machine-learning solutions like support for self-driving cars, smart homes and power grids and farms, you name it.  First, if you own a huge number of IoT devices, it can be an enormous headache to register them and set them up (provisioning), securely monitor them, capture data privately (and repel attacks, which can happen at many layers).  Next, there is an intense real-time puzzle here: to control self-driving cars or drones or power grids, we need millisecond reaction times plus accurate, consistent data.  The existing cloud is more focused on end-to-end web page stuff where consistency can be weak and hence the fast reactions can use stale data.  So CAP is out the window here.  Next, we run into issues of how to program all of this.  And if you solve all of this in the cloud, you run into the question of what to do if your farmer happens to have poor connectivity back to the cloud.

So the exciting story about Azure IoT Edge was that Microsoft seems to have tackled all of this, and has a really coherent end-to-end vision that touches on every element of the puzzle.  This would be endless if I shared everything I learned, but I'll summarize a few big points:
  • They have a clean security and provisioning solution, and a concept of IoT life cycle with monitoring, visualization of "broken stuff", ways to push updates, etc.
  • They work with ISPs to arrange for proper connectivity and adequate bandwidth, so that applications can safely assume that the first-hop networking won't be a huge barrier to success.
  • They have a concept of an Azure IoT "hub" that can run as a kind of point-of-presence.  Apparently it will eventually even be able to have a disconnected mode.  Many companies operate huge numbers of PoP clusters and a smaller number of cloud data centers, so here Azure IoT Edge is taking that to the next level and helping customers set those up wherever they like.  You could imagine a van driving to a farm somewhere with a small rack of computers in it and running Azure IoT "hub" in the van, with a connection while the van is back at the home office, but temporarily disconnected for the couple of hours the company is working at that particular farm.
  • Then the outer level of Azure itself would also run the Azure IoT Edge framework (same APIs), but now with proper connectivity to the full cloud.  And the framework has a strong emphasis on topics dear to me, like real-time, ways of offering consistency, and replication for parallelism or fault-tolerance.  I'm looking at porting Derecho into this setting so that we can be part of this story, as a 3rd-party (open source!) add-on.  They have a plan to offer a "marketplace" for such solutions.
As a big believer in moving machine learning to the edge, this is the kind of enabler I've been hoping someone would build.  Right now, we've lacked anything even close, although people are cobbling solutions together on top of Amazon AWS Lambda (which perhaps actually is close, although to me it has less of a good story around the IoT devices themselves), or Google TensorFlow (which is more of a back-end story, but has some of the same features).  As much as I love Spark/Databricks RDDs, I can't see how that could be an IoT story anytime soon.

So I plan to dive deep on this stuff, and will share what I learn in the coming year or so!  Stay tuned...

Thursday, 9 August 2018

Magical Thinking and the Logical Foundations of BlockChains

During the past few years, I've been exposed to an unrelenting drumbeat for BlockChains.  The level of enthusiasm for this model, and the commercial mania around it, have all the elements of a "market bubble".  Just yesterday I saw a quote from a BlockChain/CyberCoin billionaire who believes that "BlockChain will replace the Internet."  Really?  But search for that phrase and you'll actually find that this guy is saying something many people believe.  Rational or not, there is a huge community totally convinced that the future will be a giant BlockChain.

The BlockChain buzz was evident at the recent conference I attended, where one speaker told us about a Berkeley spin-off based on BlockChain: Oasis, which just landed $45M in first-round "seed" funding.  Just think about that for a moment: how can such a number be justified?  I'm a skeptic.

Oasis apparently plans to build a secure infrastructure using BlockChain as the storage solution, but entirely secure from the ground up.  Presumably this is one reason Oasis needed so much cash: most companies these days just run on a public cloud like Azure or Amazon (or any of the others).  Building from the ground up will be expensive, but would let Oasis avoid a pitfall other startups share: because its competitors depend on existing datacenters, they are exposed to whatever security flaws those cloud platforms embody.

But can one build a secure data center from the ground up?  Let's focus purely on storage, since this seems to be the essence of the Oasis plan.  Could one build a new kind of secure data center that (only) provides secure storage using BlockChain, built from the ground up?

The first step is to reject the permissionless BlockChain model, which is too weak to guarantee freedom from rollbacks even years after a transaction seemingly commits: permissionless BlockChain systems with anonymous servers are insecure by design.  We want a minimal BlockChain solution, but if you take minimal to mean "anonymous, globally replicated, permissionless", my answer is that "it can't be done: it is impossible to create a trustworthy platform with that mix of properties and technologies."

Fortunately, the permissioned model avoids this risk. Moreover, it seems reasonable to assume that if the goal is to offer a data-center BlockChain product, the data center operator can control which machines can operate the solution.  This moves us from the BitCoin style of anonymous, amorphic, fully decentralized BlockChain to a more standard model of append-only files used by customers like banks.

Next, we should perhaps reject a standard cloud that offers encrypted append-only files. This is an interesting step in the analysis because a BlockChain really is just a secure append-only file, no matter what anyone might tell you (secured using SHA 256 or similar block hashes, with proof-of-work if the system is permissionless, and with the signatures entangled to prevent tampering).  Any file system could play that role, if you code the needed logic to generate records formatted this way and with the required chain of attestation.  Amazon and Azure and other cloud companies already offer extremely secure storage layers, including BlockChain services. But as noted, these do depend on other elements of the respective datacenter systems.  So out with the old, and in with the new!
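To make the "a BlockChain really is just a secure append-only file" point concrete, here is a minimal sketch in Python.  It is purely illustrative: a real system would add signatures, attestation, and (in permissionless settings) proof-of-work, but the core entanglement of records via hashes is just this:

```python
import hashlib
import json

def make_block(prev_hash, payload):
    """One log entry: its hash covers both the payload and the hash of
    the previous block, entangling the records into a chain."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return {"prev": prev_hash, "payload": payload,
            "hash": hashlib.sha256(body.encode()).hexdigest()}

def append(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # genesis value
    chain.append(make_block(prev_hash, payload))

def verify(chain):
    """Recompute every hash and check every prev-hash link."""
    prev_hash = "0" * 64
    for block in chain:
        body = json.dumps({"prev": block["prev"], "payload": block["payload"]},
                          sort_keys=True)
        if block["prev"] != prev_hash:
            return False
        if hashlib.sha256(body.encode()).hexdigest() != block["hash"]:
            return False
        prev_hash = block["hash"]
    return True

chain = []
append(chain, "Alice pays Bob $10")
append(chain, "Bob pays Carol $5")
assert verify(chain)

# Tampering with any stable block breaks the recomputed hashes.
chain[0]["payload"] = "Alice pays Mallory $10"
assert not verify(chain)
```

Any append-only file system could host exactly this structure, which is the sense in which the cloud vendors' existing storage layers already cover the core of the BlockChain story.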

Now, without knowing anything about the proprietary protocols that Oasis is presumably designing,  I can't say anything about how they plan to guarantee correctness.  But I can tell you how I would do it.  I would use a form of Paxos, and because I would want extreme speed, would go with the version of durable Paxos that we implemented in Cornell's Derecho system.  If I were chief architect at Oasis, I might want to build my own software (hence not use Derecho), but I would certainly adopt the Paxos specification, and prove that my software implements it.

Of course being a Derecho zealot, I'm going to make the case that using Derecho might be the smart move, even if you might have preferred to roll your own.

First, I should note that by using Paxos, Derecho is able to leverage decades of work on proofs of correctness -- Derecho was implemented by fusing a proved-correct version of Paxos with a proved-correct version of the virtual synchrony membership management model and a new reliable (but not atomic) multicast layer.  Then all of the data movement steps were mapped to the available storage (SSD or 3-D XPoint) and communications technologies (RDMA or TCP) in an efficient way.  Finally, Derecho uses specialized optimizations for code paths that turn out to be performance-critical.

Next, it is worth noting that Derecho is open source, hence can benefit from community contributions over time, and also that the system is extremely fast -- the fastest such system ever created.  Further, it actually achieves minimal bounds for Paxos (Derecho's protocols are "optimal" in a formal sense).  Of course one can talk about smaller constants and so forth, but given the speed of the system, there is an obvious case to be made that the constants aren't showing any sign of being unreasonably large.  So Derecho is a genuinely interesting option!  Heck, maybe my students and I should try and raise $50M or so and jump into the commercial space!

But now we run into the first of a series of issues: the protocols and proofs we are leveraging were all created in the usual way: on paper, by hand.  The mapping to modern networking and storage didn't modify the underlying logic, yet it was carried out by hand too, and hence also needed to be proved correct by hand.  The same can be said about our dozens of optimizations.

Is there really a sensible way to argue that all of these hand-written proofs should be accepted as somehow being "more trustworthy" than Amazon AWS or Azure?  After all, those companies are serious about specifications too, and further, both have invested hundreds of millions of dollars in their testing and Q/A processes.  Derecho is remarkably robust, and we would point to all those proofs as part of the reason, but even so, we do find bugs in our logic.

Now, if I had the money, one option might be to harden Derecho.  Aggressive testing, deep quality assurance and better documentation would be a breeze.   Starting with such a strong prototype, we could quickly end up with a professional-quality tool.  In fact I actually hope to do this, in the next year or so, if Derecho users are able to chip in to help with the costs.

But perhaps you still wouldn't be satisfied.  And indeed, today's state of the art goes much further: the very best approach is to only run code extracted from a machine-checked proof.  In effect, we compile the proof into the running system and take the developers entirely off the development path.

This, though, turns out to be infeasible for software as elaborate as Derecho, and would be even harder for whatever Oasis really plans to build.  The issue here is that as of today, the provers can't handle the style of code that we needed to use in order to create Derecho, and any full data-center infrastructure would have 10x more such code, and perhaps far more than just 10x.

Today's best provers actually can handle automated extraction of Paxos code from correctness proofs (this has been done using NuPRL here at Cornell, or IOA and Larch in Nancy Lynch's group at MIT, or IronFleet, using Dafny at Microsoft).  The resulting solutions are very robust in the sense that the logical proof structures that result are rock solid -- for what they cover.  Unfortunately, however, the resulting code is incredibly slow compared to Derecho.  Moreover, infrastructure aspects are much harder to formalize than data replication and consistency, so in some sense, Paxos is the "easy" part of the story.  Can we specify the system, fully?  Is the specification itself bullet-proof?  Bootstrap?  Other management tasks?  These steps are much harder than many people would expect.  They may not even be possible, in some cases!

This same pattern is evident in many projects that have formalized aspects of operating systems or runtime environments.  At MIT, Nick Zeldovich famously proved a very simple file system correct a few years ago.  It ran on Linux, but there is also a proved-correct microkernel, seL4, and while seL4 doesn't offer everything Linux does, it probably has enough in it to support Nick's provably correct file system.

Then you could perhaps use the proved version of the C compiler (CompCert) to compile the thing (C++ is considered to be too complex for existing provers).  Even better would be to just build the whole thing in a language like Rust or Dafny, where proof is much more central to the coding and compilation model.  With such help, you may actually manage to create a proved solution, for parts of your full system.

But without exception, you'll end up with slow code compared to the best possible, and will have solved just a portion of the entire datacenter infrastructure.  We are decades from having data-center scale, provably correct, ultra-performant software systems.  Perhaps we'll never get there.

Moreover, even if we did manage to overcome these barriers, we would run into further questions.

One big issue is the hardware.  Think about any hardware component in the entire data center.  The routers.  Printers. Digital telephone systems.  Storage devices.

$45M may seem like a huge amount of money, but in fact it is a tiny drop in the bucket when you consider that companies like Intel spend billions on their VLSI chip fabrication factories.  So there is simply no question that Oasis will end up using components some other company created.

The problem is that these components tend to be software-defined, and this is becoming more and more of a standard story: Almost all hardware components have general-purpose, highly capable processors within them.  An entire separate computer, with its own memory, network interfaces and code.

Indeed, if you needed one CPU and DRAM unit just to operate the hardware, why not include two on the same board, or chip?  Who would notice?  And if you do this, why not drop malicious code into the second unit?  Your printer company may not even realize that it is selling compromised devices: I've heard of network chips that had entire secondary networking infrastructures built in, operated by entire stealth operating systems.  You can monitor your network as closely as you like.  You would never notice the stealth network, and its control logic!

So you can perhaps make the higher layers provably secure, but this issue of trust in the hardware limits the degree that the resulting system could be secure.

If you run a secure solution on compromised hardware, all bets are off.  The situation is a bit better if you trust the hardware: Intel has an approach called SGX that can do a bit better, and perhaps Oasis plans to leverage it, but if so, they will face performance challenges.  Sadly, SGX is quite slow.

But suppose they pull all of this off: a ground-up datacenter solution, minimal trust in the hardware, offering a BlockChain storage layer to customers.  Now we run into a new puzzle: how to draw the boundary between the trusted storage solution and the customer's application.

The problem here has to do with composing a trusted application with a trusted storage layer through some form of less-trusted layer of intermediary logic, like the runtime associated with the programming language.  Modern applications are coded in standard languages like Java, Python, C++, Ruby.  They use databases from big vendors like Oracle, web servers like Apache, and so forth... and each of these brings its own millions of lines of logic and its own runtime environment.  Those presumably have flaws, hence an attacker could potentially compromise the application by compromising some element of the compiler or runtime, even if other aspects of the datacenter BlockChain storage solution magically were iron-clad.

And this is why I'm a skeptic: I don't see how such a picture can be secured, not today, and probably not in my lifetime.

I actually do believe in security, but I think that in modern systems, we get much more protection from diversity and multiple layers of monitoring than we do from automated techniques like verification.  We should definitely do verification where we can -- for systems like Derecho, it offers a path that can slash bug rates and greatly improve confidence in the correctness of the system relative to the promises it can legitimately make.  But to me we oversell the power of verification and proofs if we go further and allow people to believe that we have discovered a magic way to carry this idea to the limits and "prove the whole thing", whatever that thing may be.  BlockChains don't change this reality.

The Oasis folks will presumably read this blog, and I should emphasize that this isn't a personal criticism.  I'm a huge fan of the Berkeley security team and have been amazed by their accomplishments.  But even so, and without being hostile in any way at all, I do think we all share a need to be frank with the public about what technology can, and cannot, accomplish.

BlockChains are being oversold as a trivial way to get perfect security -- not by Oasis, but by the popular technology press, which seems to have a very dim understanding of the concept.  The press seems to think BlockChains are somehow "more" than secure file systems.  Yet if anything, they are "less"!

Perhaps we are seeing a superposition of two elements.  Clearly, we do have a community that was unaware of the concept of a tamper-proof append-only log protected using cryptography.  All sorts of folks who work in government, health care, manufacturing clearly feel that this is a revelation, and offers an answer to all their security worries.  And quite possibly some were actually not aware that we can use file systems in secure ways -- I'm glad they know, now, and if BlockChain helps them conceive of this, I'm all for it.

Then we have hype, driven by cryptocurrency valuations in the stratosphere, a technology press endlessly eager for the next big thing, and investors keen to make a killing.  In this marketplace, I see a lot of value that companies can bring to the table, particularly ones with exciting ideas for packaging this stuff to be useful, and some of this value is very real: HyperLedger and Ethereum and related tools are genuinely powerful.

But just the same, we need to stop claiming that BlockChain is a revolutionary invention.

My worry is that by overselling the same old file systems to a naïve community of infrastructure owners, we may end up with systems that are actually less secure than what they replace.  After all, a platform like Azure is actually remarkably secure today, managed professionally by a company obsessed with security (and nobody is paying me to say this, by the way!), and has such a diversity of technologies deployed that compromising it in a major way would be quite hard.  If some hospital were to abandon that and jump to BlockChain as its file system, the illusion of security may have replaced much stronger real security.  True, today's reality has its limits, but in fact  I would be far more comfortable seeing medical records hosted on Azure or AWS, than abandoning these professional solutions in favor of a storage product that uses BlockChain as a magic wand to solve all problems.

But clearly, the current climate (especially the technology press) is prone to magical thinking, and a bit weak on just what BlockChains are, and how they work, and what their logical foundations turn out to be.  And in light of that, I suppose that the amazingly high first round of Oasis investment makes a kind of sense.  A bubble?  Definitely.  And yet all valuations are measures of market sentiment.  So perhaps not an unreasonable bubble, given the modern business climate and the craze that BlockChain has engendered.

Wednesday, 11 July 2018

Why we need a "Bell's test" for BlockChains

Last fall, Scott Aaronson visited Cornell and gave a series of talks on quantum computing.  I asked him about quantum-encrypted fiber-optic communication: how can users be sure that the technology actually uses entanglement, and isn't just some form of standard communication link set up to mimic the API to a quantum one?

Background: Quantum cryptographic methods basically mix classic communication with a quantum source of completely secure "noise".  They use a normal PKI for endpoint authentication, but introduce a quantum device that sends entangled photons to the two endpoints.  By measuring some unknown property (usually, polarization), the endpoints extract a genuinely random sequence of 0 and 1 bits that both observe identically (entanglement).  Any intruder who attempts to spy on the system would disrupt the entanglement.  The shared sequence of random bits can then be used as the basis for end-to-end symmetric cryptography.

A vendor offering a product in this space needs to do much more than to provide an optical fiber that carries entangled photons.  One issue is that the endpoints need to synchronize to ensure that they perform the same test on the same photons.  This isn't easy at high data rates.  To work around the limit, you would probably use quantum entanglement to create a symmetric key shared by the endpoints, then employ that symmetric key as input to DES or some other high-speed symmetric cryptographic solution.
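The bootstrapping step can be sketched as follows.  This is purely illustrative Python: the "measurements" are simulated by a shared random bit string, and I condense the bits into a 256-bit key via SHA-256 as a stand-in for whatever key-derivation function a real product would feed into DES or another symmetric cipher:

```python
import hashlib
import secrets

# Simulate the entangled source: for each photon pair, both endpoints
# observe the SAME genuinely random bit (perfect correlation).
shared_bits = [secrets.randbelow(2) for _ in range(512)]

def derive_key(observed_bits):
    """Each endpoint independently condenses its observed bits into a
    fixed-size symmetric key (SHA-256 here, as an illustrative KDF)."""
    bit_string = "".join(str(b) for b in observed_bits)
    return hashlib.sha256(bit_string.encode()).digest()

key_at_a = derive_key(shared_bits)   # endpoint A's view
key_at_b = derive_key(shared_bits)   # endpoint B: same photons, same bits

# Both ends now hold an identical 256-bit key for bulk symmetric crypto.
assert key_at_a == key_at_b
```

Notice that everything downstream of the measurement is ordinary classical code, which is exactly why a cheating vendor has so many places to hide.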

But suppose we  don't trust the vendor.  Could the hardware be designed to "cheat"?

Scott's answer: there are many ways to cheat.   For example, notice that the scheme outlined above starts with a completely unknown property: entangled photons with totally random polarization.  One could instead generate an entangled sequence with known polarization.

The user will have been fooled into using a key that the evil-doer generated (and hence, knows).  The user's secrets will actually be out in the open, for anyone (at least, anyone who knows the sequence) to read.

In fact, why even bother to entangle the photons? Why not just take a laser, polarize the output (again in a known way), and then beam the resulting (non-random, non-entangled) output through a half-silvered mirror, so that both endpoints can see the same signal.  A naïve user would measure polarizations, extract the same sequence from each end, and think that the device was working flawlessly.

Beyond this, one can imagine endpoint hardware that genuinely goes to all the trouble of extracting random data from quantum-entangled photons, but then ignores the random input and substitutes some form of pre-computed key that the evil-doer computed months ago, and stored in a table for later use.  Here, the buyer can actually go to the trouble of verifying the entanglement, confirm that the device is genuinely intrusion-tolerant, and so forth.   Yet we would have zero security, because the broken endpoint logic ignores the quantum-random input.

Bell's Theorem.  Setting such cases to the side, Scott also pointed out that for the entangled photons on the fiber-optic cable, there actually is a good way to test that the device is working.   He explained that in the lab, physicists test such technologies by running a "Bell's Inequality" experiment.

As you may know, Bell's Theorem was proposed by John Stewart Bell as a way to test one of the theories about quantum entanglement -- some believed at the time that "hidden variables" were at the root of entanglement, and were actively exploring ways that such variables could arise.  Bell showed that true entanglement could be distinguished from a hidden-variable system using a series of measurements that would yield different results for the two cases.  Scott's point was that one could run a Bell's inequality experiment on the fiber.  It would give unambiguous evidence that the photons emerging from our fiber are genuinely entangled.
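For intuition, the arithmetic behind a CHSH-style Bell test can be worked through numerically.  A sketch: E(a, b) = -cos(a - b) is the textbook quantum prediction for a maximally entangled (singlet) pair measured at analyzer angles a and b, while any local hidden-variable model must satisfy |S| <= 2:

```python
import math

def E(a, b):
    """Quantum-mechanical correlation of polarization measurements at
    analyzer angles a and b, for a singlet (maximally entangled) pair."""
    return -math.cos(a - b)

# Standard CHSH angle choices that maximize the quantum violation.
a, a2 = 0.0, math.pi / 2
b, b2 = math.pi / 4, 3 * math.pi / 4

S = abs(E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2))

# Local hidden-variable models obey |S| <= 2; genuine entanglement
# reaches 2*sqrt(2), the so-called Tsirelson bound.
assert S > 2
assert abs(S - 2 * math.sqrt(2)) < 1e-9
```

An experimental run on the fiber estimates each E(·,·) from measured coincidence counts rather than computing it from a formula; seeing S climb above 2 is the unambiguous evidence of entanglement.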

But a Bell's test would only cover the technology to the endpoints of the fiber carrying the entangled photons, and we could only run such a test with "unrestricted" access to the medium.  Very few products could possibly be deconstructed in this way.

Bottom line?  Clearly, it is vitally important that quantum encrypted communications technology come from a fully trusted vendor.  A compromised vendor could sell an undetectably flawed technology.

Why is this relevant to BlockChains?  A BlockChain technology is only as secure as the cryptographic package used in its block-entanglement step.  Suppose, for example, that I created a cryptographic package called SHA 256, but instead of using the actual SHA 256 algorithm, implemented it in some trivial and insecure way.  As long as that package produces a random-looking hash of the input, of the proper length, one that varies for different inputs, you might be fooled.

What's the risk?  Well, if I could trick you into using my product, it would look like a BlockChain and seem secure.  Yet suppose that the chain included block X that has a transaction I find "awkward".  If my fake hashing system lacks a cryptographic property called "strong collision resistance", I could substitute block Y for X, modifying the stable body of the chain, and you wouldn't be able to prove that this tampering had occurred.  Obviously this defeats the whole point.
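Here is a toy illustration of the risk, in Python.  The "fake hash" is deliberately absurd (it just sums the bytes), but any function lacking strong collision resistance admits the same attack: find a block Y that hashes like X, and the substitution is invisible to anyone checking with the broken hash:

```python
import hashlib

def fake_hash(data: bytes) -> str:
    # Looks hash-like (fixed-width hex output) but is trivially breakable:
    # any permutation of the same bytes collides.
    return format(sum(data) % (2 ** 32), "08x")

block_x = b"pay Bob $100"
block_y = b"pay $100 Bob"   # same bytes, reordered: a meaningful collision

# Under the fake hash, substituting Y for X leaves the chain "intact"...
assert fake_hash(block_x) == fake_hash(block_y)

# ...but a genuine, independent SHA 256 exposes the substitution at once.
assert hashlib.sha256(block_x).hexdigest() != hashlib.sha256(block_y).hexdigest()
```

Real attacks on weak hashes are subtler than byte reordering, but the consequence is the same: the stable body of the chain can be rewritten without detection.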

Now, if you were to check the output against a different, more trusted SHA 256 hash solution, the values would differ.  Yet how many people audit their BlockChain products using a technology totally independent of anything the BlockChain vendor provided?  In this example, even using the SHA 256 code provided by your vendor is a mistake: the vendor's SHA 256 code is broken.

Moreover, there are other ways that one could potentially trick a user.  A SHA 256 hash computed on just a portion of the transaction record could look completely valid (would be valid, for that portion of the block), and yet would leave room for tampering with any bytes not covered by the hash.  Your audit would thus need to really understand the BlockChain data structure, which isn't as simple as you might expect.  Many BlockChain vendors use fairly complex data structures, and it isn't totally trivial to extract just the chain of blocks and hashes to audit that the hash actually covers the data in the block.  Any vendor-supplied code you use for this step, at all, would expose you to a risk that when you go to audit the chain, the vendor tool covers up any tampering.

My point?  This is a genuine risk.  An immense number of companies are jumping to use BlockChain for diverse mission-critical purposes.  These companies are relying on the guarantee that once the blocks in the chain become stable, nobody could tamper with the record.  Yet what we see here is that a BlockChain is only as good as the vendor and the cryptographic package, and that the chain can only be trusted if you have some way to independently test its integrity.  And you had better really run that test, often.

My advice to anyone working with BlockChain is to hire a trusted independent consultant to build a BlockChain test program that would audit the actual chain daily, using completely independent technology.  If the vendor hashes with SHA 256, your auditing company should find a very trustworthy independent SHA 256 implementation and base the audit on that.  If the chain uses some other hashing method, the same approach applies -- this works for any standard.
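Such an audit can be quite small once you know the chain's layout.  A sketch, assuming a deliberately simple format in which each block records its payload, the previous block's hash, and its own hash (vendor formats are more elaborate, as noted above), with Python's hashlib standing in for the independent implementation:

```python
import hashlib

def make_chain(payloads):
    """Build a reference chain in the assumed simple format:
    (payload_bytes, prev_hash_hex, stored_hash_hex)."""
    chain, prev = [], "0" * 64
    for p in payloads:
        h = hashlib.sha256(prev.encode() + p).hexdigest()
        chain.append((p, prev, h))
        prev = h
    return chain

def audit(chain):
    """Walk the chain with our own SHA 256, never trusting vendor code."""
    prev = "0" * 64                      # agreed-upon genesis value
    for payload, prev_hash, stored_hash in chain:
        if prev_hash != prev:
            return False                 # broken link
        recomputed = hashlib.sha256(prev_hash.encode() + payload).hexdigest()
        if recomputed != stored_hash:    # hash must cover the WHOLE block
            return False
        prev = stored_hash
    return True

good = make_chain([b"txn 1", b"txn 2", b"txn 3"])
assert audit(good)

tampered = list(good)
p, prev, h = tampered[1]
tampered[1] = (b"txn 2'", prev, h)      # vendor tooling might hide this
assert not audit(tampered)
```

The essential properties are that the auditor recomputes every hash over the full block contents, and that none of the code on the audit path comes from the vendor being audited.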

What if your vendor is offering a non-standard BlockChain that runs super-fast by virtue of using a new but proprietary hashing technology, or a new but non-standard secret data structure?  My advice is simple: if the vendor won't supply you with enough detail to let you audit the chain, don't trust it.

Thursday, 28 June 2018

When open source is the right model

At DSN, I found myself in conversation with some entrepreneurs who were curious to know why in an era when people are making billions on relatively small ideas, we aren't adopting a more mercenary IP stance with Derecho.  For them, our focus on papers at DSN and TOCS and open software that really works was a strange choice, given that we do have a shot at products and startups that could be pretty lucrative.

Here are some dimensions of the question worth pondering.
  • Academic research is judged by impact, meaning broad adoption, lots of citations, etc.  
  • We started our project with DARPA MRC funding.  DARPA insisted that we use open source licensing from the start, but we can pretend it didn’t and still reach the same conclusion.
  • Publicly funded research should benefit the taxpayers who wrote the checks.  For a system like Derecho, this means the system needs to be useful, adopted by US companies that can leverage our ideas in their products, and help them hire lots of people in high-paying jobs.  Derecho should enable high value applications that would not have been possible without it.   
Should Derecho be patented and licensed for a fee?
  • Patents don’t protect mathematical proofs or theorems (or algorithms, or protocols).  Patents protect artifacts.  I often end up in debate with theory people who find this frustrating.  Yet it is the way the system works.  Patents simply cannot be used to protect conceptual aspects, even those fundamental to engineering artifacts that use those concepts in basic ways.  They protect the actual realization: the physical embodiment, the actual lines of code.  
  • Thus my group could perhaps patent Derecho itself, the actual software, through Cornell (this ownership assignment is defined under the US Bayh-Dole act).  But we cannot pursue a patent on state machine replication, the 1980’s model (the theory) underlying Derecho. Our patent would be narrow, and would not stop you from creating your own system, Tornado, with similar algorithms inspired directly by our papers.   Sure, we could work with a lawyer to arrive at tricky patent-claim wording that a naive reader might believe to cover optimal state machine replication.  Yet even if the USPTO were to allow the resulting claims, no judge would uphold the broad interpretation and rule in our favor against Tornado, because patents are defined to cover artifacts, not mathematical theories.  Software patent claims must be interpreted as statements about Derecho's embodiment of the principle, not the principle itself.  This is just how patents work.
  • Wait, am I some kind of expert on patents?  How would I know this stuff about patent law?  Without belaboring the point, yes, I actually am an expert on software IP and patents, although I am not an IP lawyer.  I got my expertise by fighting lawsuits starting in the 1990’s, most recently was the lead expert witness in a case with a lot of money at stake ($Bs), and I’ve worked with some of the country’s best legal IP talent.  My side in these cases never lost, not once.  I also helped Cornell develop its software IP policies.
  • Can open source still be monetized?  Sure.  Just think about Linux.  RedHat and other companies add high value, yet Linux itself remains open and free.  Or DataBricks (Spark).  Nothing stops us from someday following that path.
So, why should this imply that Derecho should be free, open source?
  • There are software systems that nobody wants to keep secret.  Thousands of people know every line of the Linux kernel source code, maybe even tens of thousands, and this is good because it enables Linux to play a universal role: the most standard “device driver” available to the industry, taking the whole machine to be the device.  We need this form of universal standard, and we learned decades ago that without standards, we end up with a Tower of Babel: components that simply don’t interoperate.  The key enabler is open source.
  • That same issue has denied us a standard, universal solution for state machine replication.  We want this to change, and for Derecho to be the standard.
  • There are already a ton of open source, free, software libraries for group communication.  Linux even has a basic block replication layer, built right in.  You need to value an artifact by first assessing the value of other comparable things, then asking what the value-add of your  new technology is, then using the two to arrive at a fair market value and sales proposition.  
  • But this suggests that because Derecho competes with viable options (“incumbents”) that have a dollar value of zero, then even if it has a high differentiated value as a much better tool, the highest possible monetary value for it would need to be quite low, or the market would reject it.  So yes, we could perhaps charge $5 for a license, but we would be foolish to try and charge $500K.  
  • You might still be able to construct a logic for valuing Derecho very high and then licensing it just once.  The broader market would reject the offering, but some single company might consider taking an exclusive license.  So you could protect Derecho with a patent, then sell it.  The system would end up as a proprietary product owned fully by the buyer: your US tax dollars hard at work on behalf of that one lucky buyer.  But now what happens?  In fact, someone else could simply create a new free version.  Examples?  Think about HDFS, Hadoop and Zookeeper (created to mimic GFS, MapReduce and Chubby, all proprietary).  To me the writing is on the wall: if Derecho isn’t offered for free, the market will reject it in favor of less performant free software, and then if the speed issue becomes a problem, ultimately someone else would build a Derecho clone, offering it as free software to fill the gap.  They would find this fairly easy, given our detailed papers, and it would be entirely legal: recall that you can’t patent protocols.
Conclusion? To maximize impact, Derecho needs to be open source.