A Few Thoughts on Distributed Computing: Magical Thinking and the Logical Foundations of BlockChains

During the past few years, I've been exposed to an unrelenting drumbeat for BlockChains. The level of enthusiasm for this model, and the commercial mania around it, have all the elements of a "market bubble". Just yesterday I saw a quote from a BlockChain/CyberCoin billionaire who believes that "BlockChain will replace the Internet." Really? But search for that phrase and you'll actually find that this guy is saying something many people believe. Rational or not, there is a huge community totally convinced that the future will be a giant BlockChain.

The BlockChain buzz was evident at the recent conference I attended, where one speaker told us about a Berkeley spin-off based on BlockChain: Oasis, which just landed $45M in first-round "seed" funding. Just think about that for a moment: how can such a number be justified? I'm a skeptic.

Oasis apparently plans to build a secure infrastructure using BlockChain as the storage solution, but entirely secure from the ground up. Presumably this is one reason Oasis needed so much cash: most companies these days just run on a public cloud like Azure or Amazon (or any of the others). Building from the ground up will be expensive, but would let Oasis avoid a pitfall other startups share: because its competitors depend on existing datacenters, they are exposed to whatever security flaws those cloud platforms embody.

But can one build a secure data center from the ground up? Let's focus purely on storage, since this seems to be the essence of the Oasis plan. Could one build a new kind of secure data center that (only) provides secure storage using BlockChain, built from the ground up?

The first step is to reject the permissionless BlockChain model, which is too weak to guarantee freedom from rollbacks even years after a transaction seemingly commits: permissionless BlockChain systems with anonymous servers are insecure by design. We want a minimal BlockChain solution, but if you take minimal to mean "anonymous, globally replicated, permissionless", my answer is that "it can't be done: it is impossible to create a trustworthy platform with that mix of properties and technologies."

Fortunately, the permissioned model avoids this risk. Moreover, it seems reasonable to assume that if the goal is to offer a data-center BlockChain product, the data center operator can control which machines can operate the solution. This moves us from the BitCoin style of anonymous, amorphic, fully decentralized BlockChain to a more standard model of append-only files used by customers like banks.

Next, we should perhaps reject a standard cloud that offers encrypted append-only files. This is an interesting step in the analysis because a block chain really is just a secure append-only file, no matter what anyone might tell you (secured using SHA 256 hashes or similar block hashes, with proof-of-work if the system is permissionless, and then with the signatures entangled to prevent tampering). Any file system could play that role, if you code the needed logic to generate records formatted this way and with the required chain of attestation. Amazon and Azure and other cloud companies already offer extremely secure storage layers, including BlockChain services. But as noted, they do depend on other elements of the respective datacenter systems. So out with the old, and in with the new!

Now, without knowing anything about the proprietary protocols that Oasis is presumably designing, I can't say anything about how they plan to guarantee correctness. But I can tell you how I would do it. I would use a form of Paxos, and because I would want extreme speed, would go with the version of durable Paxos that we implemented in Cornell's Derecho system. If I were chief architect at Oasis, I might want to build my own software (hence not use Derecho), but I would certainly adopt the Paxos specification, and prove that my software implements it.

Of course being a Derecho zealot, I'm going to make the case that using Derecho might be the smart move, even if you might have preferred to roll your own.

First, I should note that by using Paxos, Derecho is able to leverage decades of work on proofs of correctness -- Derecho was implemented by fusing a proved-correct version of Paxos integrated with a proved-correct version of the virtual synchrony membership management model and a new reliable (but not atomic) multicast layer. Then all of the data movement steps were mapped to the available storage (SSD or 3-D XPoint) and communications technologies (RDMA or TCP) in an efficient way. Finally, Derecho uses specialized optimizations for code paths that turn out to be performance-critical.

Next, it is worth noting that Derecho is open source, hence can benefit from community contributions over time, and also that the system is extremely fast -- the fast such system ever created. Further, it actually achieves minimal bounds for Paxos (Derecho's protocols are "optimal" in a formal sense). Of course one can talk about smaller constants and so forth, but given the speed of the system, there is an obvious case to be made that the constants aren't showing any sign of being unreasonably large. So Derecho is a genuinely interesting option! Heck, maybe my students and I should try and raise $50M or so and jump into the commercial space!

But now we run into the first of a series of issues: the protocols and proofs we are leveraging were all created in the usual way: on paper, by hand. The mapping didn't modify the underlying logic, yet it required to adapt them to modern networking is by hand, too, and hence also needed to be proved by hand. The same can be said about our dozens of optimizations.

Is there really a sensible way to argue that all of these hand-written proofs be accepted as somehow being "more trustworthy" than Amazon AWS or Azure? After all, those are companies are serious about specifications too, and further, both have invested hundreds of millions of dollars in their testing and Q/A process. Derecho is remarkably robust, and we would point to all those proofs as part of the reason, but even so, we do find bugs in our logic.

Now, if I had the money, one option might be to harden Derecho. Aggressive testing, deep quality assurance and better documentation would be a breeze. Starting with such a strong prototype, we could quickly end up with a professional-quality tool. In fact I actually hope to do this, in the next year or so, if Derecho users are able to chip in to help with the costs.

But perhaps you still wouldn't be satisfied. And indeed, today's state of the art goes much further: the very best approach is to only run code extracted from a machine-checked proof. In effect, we compile the proof into the running system and take the developers entirely off the development path.

This, though, turns out to be infeasible for software as elaborate as Derecho, and would be even harder for whatever Oasis really plans to build. The issue here is that as of today, the provers can't handle the style of code that we needed to use in order to create Derecho, and any full data-center infrastructure would have 10x more such code, and perhaps far more than just 10x.

Today's best provers actually can handle automated extraction of Paxos code from correctness proofs (this has been done using NuPRL here at Cornell, or IOA and Larch in Nancy Lynch's group at MIT, or IronFleet, using Dafny at Microsoft). The resulting solutions are very robust in the sense that the logical proof structures that result are rock solid -- for what they cover. Unfortunately, however, the resulting code is incredibly slow compared to Derecho. Moreover, infrastructure aspects are much harder to formalize than data replication and consistency, so in some sense, Paxos is the "easy" part of the story. Can we specify the system, fully? Is the specification itself bullet-proof? Bootstrap? Other management tasks? These steps are much harder than many people would expect. They may not even be possible, in some cases!

This same pattern is evident in many projects that have formalized aspects of operating systems or runtime environments. At MIT, Nick Zeldovich famously proved a very simple file system correct a few years ago. It ran on Linux, but there is a Linux u-kernel called SEL4 that has been proved correct, and while it doesn't cover all of Linux, SEL4 probably has enough stuff in it to support Nick's provably correct file system.

Then you could perhaps use the proved version of the C compiler to compile the thing (C++ is considered to be too complex for existing provers). Even better would be to just build the whole thing in a language like RUST or Dafny, where proof is much more central to the coding and compilation model. With such help, you may actually manage to create a proved solution, for parts of your full system.

But without exception, you'll end up with slow code compared to the best possible, and will have solved just a portion of the entire datacenter infrastructure. We are decades from having data-center scale, provably correct , ultra-performant software systems. Perhaps we'll never get there.

Moreover, even if we did manage to overcome these barriers, we would run into further questions.

One big issue is the hardware. Think about any hardware component in the entire data center. The routers. Printers. Digital telephone systems. Storage devices.

$45M may seem like a huge amount of money, but in fact it is a tiny drop in the bucket when you consider that companies like Intel spend billions on their VLSI chip fabrication factories. So there is simply no question that Oasis will end up using components some other company created.

The problem is that these components tend to be software-defined, and this is becoming more and more of a standard story: Almost all hardware components have general-purpose, highly capable processors within them. An entire separate computer, with its own memory, network interfaces and code.

Indeed, if you needed one CPU and DRAM unit just to operate the hardware, why not include two on the same board, or chip? Who would notice? And if you do this, why not drop malicious code into the second unit? Your printer company may not even realize that it is selling compromised devices: I've heard of network chips that had entire secondary networking infrastructures built in, operated by entire stealth operating systems. You can monitor your network as closely as you like. You would never notice the stealth network, and its control logic!

So you can perhaps make the higher layers provably secure, but this issue of trust in the hardware limits the degree that the resulting system could be secure.

If you run a secure solution on compromised hardware, all bets are off. The situation is a bit better if you trust the hardware: Intel has an approach called SGX that can do a bit better, and perhaps Oasis plans to leverage it, but if so, they will face performance challenges. Sadly, SGX is quite slow.

But suppose they pull all of this off: a ground-up datacenter solution, minimal trust in the hardware, offering a BlockChain storage layer to customers. Now we run into a new puzzle: the issue arises of how to draw the boundary between the trusted storage solution and the customer's application.

The problem here has to do with composing a trusted application with a trusted storage layer through some form of less-trusted layer of intermediary logic, like the runtime associated with the programming language. Modern applications are coded in standard languages like Java, Python, C++, Ruby. They use databases from big vendors like Oracle, web servers like Apache, and so forth... and each of brings its own millions of lines of logic and its own runtime environment. Those presumably have flaws, hence an attacker could potentially compromise the application by compromising some element of the compiler or runtime, even if other aspects of the datacenter BlockChain storage solution magically were iron-clad.

And this is why I'm a skeptic: I don't see how such a picture can be secured, not today, and probably not in my lifetime.

I actually do believe in security, but I think that in modern systems, we get much more protection from diversity and multiple layers of monitoring than we do from automated techniques like verification. We should definitely do verification where we can -- for systems like Derecho, it offers a path that can slash bug rates and greatly improve confidence in the correctness of the system relative to the promises it can legitimately make. But to me we oversell the power of verification and proofs if we go further and allow people to believe that we have discovered a magic way to carry this idea to the limits and "prove the whole thing", whatever that thing may be. BlockChains don't change this reality.

The Oasis folks will presumably read this blog, and I should emphasis that it isn't a personal criticism. I'm a huge fan of the Berkeley security team and have been amazed by their accomplishments. Amazing work has been done. But even so, and without being hostile in any way at all, I do think we all share a need to be frank with the public about what technology can, and cannot, accomplish.

BlockChains are being oversold as a trivial way to get perfect security -- not by Oasis, but by the popular technology press, which seems to have a very dim understanding of the concept. The press seems to think BlockChains are somehow "more" than secure file systems. Yet if anything, they are "less"!

Perhaps we are seeing a superposition of two elements. Clearly, we do have a community that was unaware of the concept of a tamper-proof append-only log protected using cryptography. All sorts of folks who work in government, health care, manufacturing clearly feel that this is a revelation, and offers an answer to all their security worries. And quite possibly some were actually not aware that we can use file systems in secure ways -- I'm glad they know, now, and if BlockChain helps them conceive of this, I'm all for it.

Then we have hype, driven by cryptocurrency valuations in the stratosphere, a technology press endlessly eager for the next big thing, and investors keen to make a killing. In this marketplace, I see a lot of value that companies can bring to the table, particularly ones with exciting ideas for packaging this stuff to be useful, and some of this value is very real: HyperLedger and Ethereum and related tools are genuinely powerful.

But just the same, we need to stop claiming that BlockChain is a revolutionary invention.

My worry is that by overselling the same old file systems to a naïve community of infrastructure owners, we may end up with systems that are actually less secure than what they replace. After all, a platform like Azure is actually remarkably secure today, managed professionally by a company obsessed with security (and nobody is paying me to say this, by the way!), and has such a diversity of technologies deployed that compromising it in a major way would be quite hard. If some hospital were to abandon that and jump to BlockChain as its file system, the illusion of security may have replaced much stronger real security. True, today's reality has its limits, but in fact I would be far more comfortable seeing medical records hosted on Azure or AWS, than abandoning these professional solutions in favor of a storage product that uses BlockChain as a magic wand to solve all problems.

But clearly, the current climate (especially the technology press) is prone to magical thinking, and a bit weak on just what BlockChains are, and how they work, and what their logical foundations turn out to be. And in light of that, I suppose that the amazingly high first round of Oasis investment makes a kind of sense. A bubble? Definitely. And yet all valuations are measures of market sentiment. So perhaps not an unreasonable bubble, given the modern business climate and the craze that BlockChain has engendered.

A Few Thoughts on Distributed Computing

Thursday, 9 August 2018

Magical Thinking and the Logical Foundations of BlockChains

No comments:

Post a Comment