Tuesday, 27 November 2018

So what's the story about "disaggregated" cloud computing?

The hardware community has suddenly gone wild over a new idea they call "disaggregated" cloud computing infrastructures.  I needed to talk about this in my graduate class, so I've been reading some of the papers and asking around.

If you come at the topic from distributed/cloud computing, as I do, you may have been using disaggregation to refer to a much older trend, namely the "decoupling" of control software from the hardware that performs various data and compute-intensive tasks that would otherwise burden a general purpose processor.  And this is definitely a legitimate use of the term, but as it turns out, the architecture folks are focused on a different scenario: for them, disaggregation relates to a new way of building your data center compute nodes in which you group elements with similar functionalities, using some form of micro-controller with minimal capabilities to manage components that handle storage, memory, FPGA devices, GPU, TPU, etc.  Recent work on quantum computing uses a disaggregated model too: the quantum device is like an experimental apparatus that a control program carefully configures into a specific state, then "runs", and then collects output from.

The argument is that disaggregation reduces wasted power by supporting a cleaner form of multitenancy.  For example, consider DRAM.  The “d” stands for dynamic, meaning that power is expended to keep every bit continuously refreshed.  This is a costly, heat-intensive activity, and yet your DRAM may not even be fully used (studies suggest that 40-50% would be pretty good).  If all the DRAM was in some shared rack, bin-packing could easily give you >95% utilization on the active modules, plus you could power down any unused modules.

So disaggregation as a term really can refer to pretty much any situation in which some resources are controlled from programs running "elsewhere", and where you end up with a distributed application that conceptually is a single thing, but has multiple distinct moving parts, with different subsystems specialized for different roles and often using specialized hardware as part of those roles.

At this point I'm seeing the idea as a spectrum, so perhaps it would make sense to start with the older style of control-plane/data-plane separation, say a few words about the value of such a model, and then we can look more closely at this extreme form and ask how it fits into the picture.

Control/data disaggregation is really pretty common idea and dates back quite far: even the earliest IBM mainframes had dedicated I/O coprocessors to operate disks and tapes, and talked about "I/O channels".   I could easily list five or ten other examples that have popped up during the past decade or two: very few devices lack some form of general purpose on-board computer, and we can view any such system as a form of disaggregated hardware. But the one that interests me most is the connection to Mach, at the stage when the project was publishing very actively at SOSP and was run from CMU, with various people heading it: Rick Rashid (who ultimately left to launch Microsoft Research), then Brian Bershad, and then Tom Anderson.

Mach was an amazing system, but what I find fascinating is that it happens to also have anticipated many aspects of this new push into disaggregation.  As you perhaps will recall (from all those Mach papers you've read...) the basic idea was to have a single micro-kernel that would host various O/S presentation frameworks: a few versions of Linux, perhaps versions of legacy IBM operating systems, etc.  Internally, one of those frameworks consisted of a set of processes sharing DLLs, some acting as servers and others as applications.  Memory was organized into segments with read/write/execute permissions, and all the underlying interactions centered on an ultra-fast form of RPC in which a single CPU core could run an application thread in segment A, then briefly suspend that thread and activate a thread context in segment B (perhaps A was an application and B is the file server, for example), run in B for a short period, then return to A.  The puzzle was to handle permissions (the file server can see the raw disk, but A shouldn't be allowed to, and conversely, the file server can DMA from a specific set of memory areas in A, but shouldn't be able to scan A's memory for passwords).  Then because all of this yielded a really huge address space, Mach ran into page table fragmentation issues that were ultimately solved with a novel software inverted page table.  And capabilities were used for the low-level message passing primitives.  Really cool stuff!

As you might expect, all of these elements had costs: the messaging layer, the segmentation model, context switching, etc.  Mach innovated by optimizing those steps, but keep these kinds of costs in mind, because disaggregation will turn out to run into similar issues.

Which brings us to today!  The modern version of the story was triggered by the emergence of RDMA as an option in general purpose data centers.  As you'll know if you've followed these postings, I'm an RDMA nut, but it isn't just me.  RDMA is one of those rare cases where a technology became very mature in a different setting (namely high-performance computing, where it serves as the main communications backbone for packages like the MPI library, and runs on InfiniBand networks), and yet somehow went mostly unnoticed by the general computing community.... until recently, when people suddenly "discovered" it and went bonkers.

As I've discussed previously, one reason that RDMA matured on Infiniband and yet wasn't used on Ethernet centered on RDMA's seeming need for a special style of credit-based I/O model, in which a sender NIC would only transfer data if the receiver had space waiting to receive the data.   It is only recently that the emergence of RoCE allowed RDMA to jump to more standard datacenter environments.  Today, we do have RDMA in a relatively stable form within a set of racks sharing a TOR switch, assuming that the TOR switch has full bisection bandwidth.  But RDMA isn't yet equally mature across routers, particularly in a COS/Spine model where the core network would often be oversubscribed and hence at risk of congestion.  There are several competing schemes for deploying RDMA over RoCE v2 with PPF at data-center-wide scale (DCQCN and TIMELY), but work is still underway to understand how those can be combined with Diffsrv or Enterprise VLAN routing) to avoid unwanted interactions between standard TCP/IP and the RDMA layer.  It may be a few years before we see "mature" datacenter-scale deployments in which RDMA runs side-by-side with TCP/IP in a non-disruptive way over RoCE v2 (in fact, by then we may be talking about RoCE v5)!  Until then, RDMA remains a bit delicate, and must be used with great care.

So why is RDMA even relevant to disaggregation?  There is a positive aspect to the answer, but it comes with a cautionary sidebar. The big win is that RDMA has a simple API visible directly to the end-user program (a model called "qpairs" that focuses on "I/O verbs".  These are lock-free, and with the most current hardware, permit reliable DMA data transfers from machine to machine at speeds of up to 200Gbps, and latencies as low as .75us.  The technology comes in several forms, all rather similar: Mellanox leads on the pure RDMA model (the company has been acquired and will soon be renamed Xylinx, by the way), while Intel is pushing a variant they call OMNI-Path, and other companies have other innovations.   These latencies are 10x to 100x better than what you see with TCP, and the data rates are easily 4x better than the fastest existing datacenter TCP solutions.  As RDMA pushes up the performance curve to 400Gbps and beyond, the gaps will get even larger.  So RDMA is the coming thing, even it hasn't quite become commonplace today.

RDMA is also reliable in its point-to-point configuration, enabling it to replace TCP in many applications, and also offers a way to support relatively low latency remote memory access with cache-line atomicity.  Now, it should probably be underscored that with latencies of .75us at the low end to numbers more like 3us for more common devices, RDMA offers the most NUMA of NUMA memory models, but even so, the technology is very fast compared to anything we had previously.

The cautions are the obvious ones.  First, RDMA on RoCE is really guaranteed to offer those numbers only if the endpoints live under a TOR switch that can support full bidirectional communication.  Beyond that, we need to traverse a router that might experience congestion, and might even drop packets.  In these cases, RDMA is on slightly weaker ground.

A second caution is simply that applications can't really handle long memory-access delays, so this NUMA thing becomes a conceptual barrier to casual styles of programming.   You absolutely can't view RDMA data-center memory as a gigantic form of local memory.  On the contrary, the developer needs to be acutely aware of potential delays, and hence will need to have full control over memory layout. Today, we lack really easily-used tools for that form of control, so massive RDMA applications are the domain of specialists who have a great deal of arcane hardware insight.

So how does this story relate to disaggregation?  A first comment is that RDMA is just one of several new hardware options that force us to take control logic out of the main data path.  There are a few other examples: with the next generation of NVRAM storage (like Intel's Optane), you can DMA directly from a video camera into a persistent storage card, and with RDMA, you'll be able to do so even if the video isn't on the same machine as the place you plan to store the data.  With FPGA accelerators, we often pursue bump-in-the-wire placements: data streams into a memory unit associated to the FPGA at wire rates, it does something, and then we might not even need to touch the main high-volume stream from general purpose code, at all.  So the RDMA story is pointing to a broader lack of abstractions covering a variety of data-path scenarios.  Moreover, our second caution expands to a broader statement: these different RDMA-like technologies may each bring different constraints and different best-case use scenarios.

Meanwhile, the disaggregation folks are taking the story even further: they start by arguing that we can reduce the power footprint of rack-scale computing systems if we build processors with minimal per-core memory and treat that memory more like a cache than as a full-scale memory.  Because memory runs quite hot but the heat dissipation is a function of the memory size, this can slash the power need and heat generation.  Then if the processor didn't really need much memory, we are golden, and if it did, we can start to think about ways to intelligently manage the on-board cache, paging data in and out from remote memories hosted in racks dedicated to memory resources and shared by the processor racks (yes, this raises all sorts of issues of protection, but that's the idea).  We end up with racks of processors, racks of memory units, racks of storage, racks of FPGA, etc.  Each rack will achieve a much higher density of the corresponding devices, which brings some efficiencies too, and we also potentially have multi-tenancy benefits if some applications are memory-intensive but others need relatively little DRAM, some use FPGA but some don't, etc.

The main issue -- the obvious one -- is latency.  To appreciate the point, consider standard NUMA machines.  The bottom line, as you'll instantly discover should you ever run into it, is that on a NUMA machine, DRAM is split into a few modules, each supporting a couple of local cores, but interconnected by a bus.  The issue is that although a NUMA machine is capable of offering transparent memory sharing across DRAM modules, the costs are wildly different when a thread accesses the closest DRAM than when it reaches across the bus to access a "remote" DRAM module (in quotes because this is a DRAM module on the same motherboard).  At the best, you'll see a 2x delay: your local DRAM might run with 65ns memory fetch speeds, but a "remote" access could cost 125ns.  These delays will balloon if there are also cache-coherency or locking delays: in such cases the NUMA hardware enforces the consistency or locking model, but it runs backplane protocols to do so, and those are costly.  NUMA latencies cause havoc.

Not long ago, researchers at MIT wrote a whole series of papers on optimizing Linux to run at a reasonable speed on NUMA platforms.

The cross-product of RDMA with NUMA-ness makes life even stranger.  Even when we consider a single server, an RDMA transfer in or out will run in slow motion if the RDMA NIC happens to be far from the DRAM module the transfer touches. You may even get timeouts, and I wonder if there aren't even cases where it would be faster to just copy the data to the local DRAM first, and then use RDMA purely from and to the memory unit closest to the NIC!

This is why those of us who teach courses related to modern NUMA and OS architectures are forced to stress that writing shared memory multicore code can be a terrible mistake from a performance point of view.  At best, you need to take full control: you'll have to pin your threads to a set of cores to some single DRAM unit (this helps: you'll end up with faster code.  But... it will be highly architecture-specific and on some other NUMA system, it might perform poorly).

In fact NUMA is probably best viewed as a hardware feature ideal for virtualized systems with standard single-threaded programs (and this is true both for genuine virtualization and for containers). Non-specialists should just avoid the model!

So here we have this new disaggregation trend, pushing for the most extreme form of NUMA-ness imaginable.  With a standard Intel server, local DRAM will be accessible with a latency of about 50-75ns.  Non-local jumps to 125ns, but RDMA could easily exceed 750ns, and 1500ns might not be unusual even for a nearby node, accessible with just one TOR switch hop.  (Optical networking could help, driving this down to perhaps 100ns, but that is still far in the future and involves many assumptions.)

There may be an answer, but it isn't obvious people will like it: we will need a new programming style that systematically separates data flows from control, so that this new style of out of band coding would be easier to pull off.  Why a new programming style?   Well, today's prevailing style of code centers on a program that "owns" the data and holds it in locally malloc-ed memory: I might write code to read frames of video, then ask my GPU to segment the data -- and the data transits through my DRAM (in fact the GPU would often use my DRAM as its memory source).  Similarly for FPGA.  In a disaggregated environment, we would give the GPU or the FPGA memory of its own, then wire it directly to the data source, and control all of this from an out of band control program running on a completely different processor.  The needed abstractions for that style of computing are simply lacking in most of today's programming languages and systems.  You do see elements of the story -- for example, Tensor Flow can be told to run portions of its graph on a designated TPU cluster -- but I would argue that we haven't yet found the right OS and programming language coding styles to make this easy at data-center scale.

Worse, we need to focus on applications that would be inherently latency-tolerant: latency barriers will translate to a slowdown for the end user.  To avoid that, we need a way of coding that promotes asynchronous flows and doesn't require a lot of hopping around, RPCs or locking.  And I'll simply say that we lack that style of coding now -- we haven't really needed it, and the work on this model hasn't really taken off.  So this is a solvable problem, for sure, but it isn't there yet.

Beyond all this, we have the issues that Mach ran into.  If we disaggregate, all of those same questions will arise: RPC on the critical path, context switch delays, mismatch of the hardware MMU models with the new virtualized disaggregated process model, permissions, you name it.  Today's hardware is nothing like the hardware when Mach was created, and I bet there are all sorts of cool new options.  But it won't be trivial to figure them all out.

I have colleagues who work in this area, and I think the early signs are very exciting.  It looks entirely plausible that the answers are going to be there, perhaps in five years and certainly in ten.  The only issue is that the industry seems to think disaggregation is already happening, and yet those answers simply aren't in hand today.

It seems to me that we also risk an outcome where researchers solve the problem, and yet the intended end-users just don't like the solution.  The other side of the Tensor Flow story seems to center on the popularity of just using languages everyone already knows, like Python or Java, as the basis for whatever it is we plan to do.  But Python has no real concept of resource locations: Tensor Flow takes Python and bolts on the concept of execution in some specific place, but here we are talking about flows, execution contexts, memory management, you name it.  So you run some risk that smart young researchers will demonstrate amazing potential, but that the resulting model won't work for legacy applications (unless a miracle occurs and someone finds a way to start with a popular language like Tensor Flow and "compile" it to the needed abstractions).  Can such a thing be done?  Perhaps so!  But when?  That strikes me a puzzle.

So I'll stop here, but I will say that disaggregation seems like a cool and really happening opportunity: definitely one of the next big deals.   Researchers are wise to jump in.  But people headed into the industry, though, who are likely to be technology users rather than inventors, need to appreciate that it is still too early to speculate about what might be successful, and when, and definitely too early for them to get excited: the bottom line message in the lecture I just gave on this topic!

Friday, 16 November 2018

Forgotten lessons and their relevance to the cloud

While giving a lecture in my graduate course on the modern cloud, and the introduction of sensors and hardware accelerators into machine-learning platforms, I had a sudden sense of deja-vu.

In today's cloud computing systems, there is a tremendous arms-race underway to deploy hardware as a way to compute more cost-effectively, more data faster, or simply to offload very repetitious tasks into specialized subsystems that are highly optimized for those tasks.  My course covers quite a bit of this work: we look at RDMA, new memory options, FPGA, GPU and TPU clusters, challenges of dealing with NUMA architectures and their costly memory coherence models, and similar topics.  The focus is nominally on the software gluing all of this together ("the future cloud operating system") but honestly, since we don't really know what that will be, the class is more of a survey of the current landscape with a broad agenda of applying it to emerging IoT uses that bring new demands into the cloud edge.

So why would this give me a sense of deja-vu?  Well, grant me a moment for a second tangent and then I'll link my two lines of thought into a single point.  Perhaps you recall the early days of client-server computing, or the early web.  Both technologies took off explosively, only to suddenly sag a few years later as a broader wave of adoption revealed deficiencies.

If you take client-server as your main example, we had an early period of disruptive change that you can literally track to a specific set of papers: Hank Levy's first papers on the Vax Cluster architecture when he was still working with DEC.  (It probably didn't hurt that Hank was the main author: in systems, very few people are as good as Hank at writing papers on topics of that kind.)  And in a few tens of pages, Hank upended the mainframe mindset and introduced us to this other vision: clustered computing systems in which lots of components somehow collaborate to perform scalable tasks, and it was a revelation.  Meanwhile, mechanisms like RPC were suddenly becoming common (CORBA was in its early stages), so all of this was accessible.  For people accustomed to file transfer models and batch computing, it was a glimpse of the promised land.

But the pathway to the promised land turned out to be kind of challenging.  DEC, the early beneficiary of this excitement, got overwhelmed and sort of bogged down: rather than being a hardware company selling infinite numbers of VAX clusters (which would have made them the first global titan of the industry), they somehow got dragged further and further into a morass of unworkable software that needed a total rethinking.  Hank's papers were crystal clear and brilliant, but a true client-server infrastructure needed 1000x more software components, and not everyone can function at the level Hank's paper more or less set as the bar.  So, much of the DEC infrastructure was incomplete and buggy, and for developers, this translated to a frustrating experience: a fast on-ramp followed by a very bumpy, erratic experience.  The ultimate customers felt burned and many abandoned DEC for Sun Microsystems, where Bill Joy managed to put together a client-server "V2" that was somewhat more coherent and complete.  Finally, Microsoft swept in and did a really professional job, but by then DEC had vanished entirely, and Sun was struggling with its own issues of overreach.

I could repeat this story using examples from the web, but you can see where I'm going: early technologies, especially disruptive, revolutionary ones, often take on a frenetic life of their own that can get far ahead of the real technical needs.  The vendor then becomes completely overwhelmed and unless they somehow can paper over the issues, collapses.

Back in that period, a wonderful little book came out on this: Crossing the Chasm, by Geoffrey Moore.  It talked about how technologies often have a bumpy adoption curve.  Moore talks about an adoption curve over time.  The first bump is associated with the early adopters (the kind of people who live to be the first to use a new technology, back before it even becomes stable).  But conservative organizations prefer to be "first to be last" as David Bakken says.  They hold back waiting for the technology to mature and hoping that they can avoid the pain but also not miss the actual surge of mainstream adoption.  Meanwhile, the pool of early adopters dries up and some of them wander off for the next even newer thing, so you see the adoption curve sag, perhaps for years.  Wired writes articles about the "failure of client server" (well, back then it would have been ComputerWorld).

Finally, for the lucky few, the really sustainable successes, you see a second surge in adoption and this one would typically play out over a much longer period, without sagging in that same way, or at least not for many years.  So we see a kind of S-curve, but with a bump in the early period.

All of which leads me back to today's cloud and this craze for new accelerators.  When you consider any one of them, you quickly discover that they are extremely hard devices to program. FPGA pools in Microsoft's setting, for example, are clearly going be expert-only technologies (I'm thinking about the series of papers associated with Catapult).  It is easy to see why a specialized cloud micro-service might benefit, particularly because the FPGA performance to power-cost ratio is quite attractive.  Just the same, though, creation of an FPGA is really an experts-only undertaking.  Anyhow, a broken FPGA could be quite disruptive to the data center.  So we may see use of these pools by subsystems doing things like feature ranking for Bing search, crypto for the Microsoft Azure VPC, or data compression and other similar tasks in Cosmos.  But I don't imagine that my students here at Cornell will be creating new services with new FPGA accelerators anytime soon.

GPU continues to be a domain where CUDA programming dominates every other option.  This is awesome for the world's CUDA specialists, and because they are good at packaging their solutions in libraries we can call from the attached general purpose machine, we end up with great specialized accelerators for graphics, vision, and similar tasks.  In my class we actually do read about a generalized tool for leveraging GPU: a language invented at MSR called Dandelion. The real programming language was easy: C# with LINQ, about as popular a technology as you could name.  Then they mapped the LINQ queries to GPU, if available.  I loved that idea... but Dandelion work stalled several years ago without really taking off in a big way.

TPU is perhaps easier to use: With Google's Tensor Flow, the compiler does the hard work (like with Dandelion), but the language is just Python.  To identify the objects a TPU could compute on, the whole model focuses on creating functions that have vectors or matrices or higher-dimensional tensors as their values.  This really works well and is very popular, particular on a NUMA machine with an attached TPU accelerator, particularly for Google's heavy lifting in their ML subsystems.  But it is hard to see Tensor Flow as a general purpose language or even as a general purpose technology.

And the same goes with research in my own area.  When I look at Derecho or Microsoft's FaRM or other RDMA technology, I find it hard not to recognize that we are creating specialist solutions, using RDMA in sophisticated ways, and supporting extensible models that are probably best viewed as forms of PaaS infrastructures even if you tend to treat them as libraries.  They are sensational tools for what they do.  But they aren't "general purpose".  (For distributed computing, general purpose might lead you to an RPC package like the OMG's IDL-based solutions, or to REST, or perhaps to Microsoft's WCF).

So where does this leave us?  Many people who look at the modern cloud are predicting that the cloud operating system will need to change in dramatic ways.  But if you believe that difficulty of use and fragility and lack of tools makes the accelerators "off limits" except for a handful of specialists, and that the pre-built PaaS services will ultimately dominate, than what's wrong with today's micro-service models?  As I see it, not much: they are well-supported, scale nicely (although some of the function-server solutions really need to work on their startup delays!), and there are more and more recipes to guide new users from problem statement to a workable, scalable, high performance solution.  These recipes often talk to pre-built microservices and sure, those use hardware accelerators, but the real user is shielded from their complexity.  And this is a good thing, because otherwise, we would be facing a new instance of that same client-server issue.

Looking at this as a research area, we can reach some conclusions about how one should approach research on the modern cloud infrastructure.

A first observation is that the cloud has evolved into a world of specialized elastic micro-services and that the older style of "rent a pile of Linux machines and customize them" is fading slowly into the background.  This makes a lot of sense, because it isn't easy to end up with a robust, elastic solution.  Using a pre-designed and highly optimized microservice benefits everyone: the cloud vendor gets better performance from the data center and better multi-tenancy behavior, and the user doesn't have to reinvent these very suble mechanisms.

A second is that specialized acceleration solutions will probably live mostly within the specialized microservices that they were created to support.  Sure, Azure will support pools of FPGAs.  But those will exist mostly to speed up things like Cosmos or Bing, simply because using them is extremely complex, and misusing them can disrupt the entire cloud fabric.  This can also make up for the dreadful state of the supporting tools for most if not all cloud-scale elastic mechanisms.  Like early client-server computing, home-brew use of technologies like DHTs, FPGA and GPU and TPU accelerators, RDMA, Optane memory -- none of that makes a lot of sense right now.  You could perhaps pull it off, but more likely, the larger market will reject such things... except when they result in ultra-fast, ultra-cheap micro-services that they can treat as black boxes.

A third observation is that as researchers, if we hope to be impactful, we shouldn't fight this wave.  Take my own work on Derecho.  Understanding that Derecho will be used mostly to build new microservices helps me understand how to shape the APIs to look natural to the people who would be likely to use it.  Understanding that those microservices might be used mostly from Azure's function server or Amazon's AWS Lambda, tells me what a typical critical path would look like, and can focus me on ensuring that this particular path is very well-supported, leverages RDMA at every hop if RDMA is available, lets me add auto-configuration logic to Derecho based on the environment one would find at runtime, etc.

We should also be looking at the next generation of applications and by doing so, should try to understand and abstract by recognizing their needs and likely data access and computational patterns.  On this, I'll point to work like the new paper on Ray from OSDI: a specialized microservice for a style of computing common in gradient-descent model training, or Tensor Flow: ultimately, a specialized microservice for leveraging TPUs, or Spark: a specialized microservice to improve the scheduling and caching of Hadoop jobs.  Each technology is exquisitely matched to context, and none can simply be yanked out and used elsewhere.  For example, you would be unwise to try and build a new Paxos service using Tensor Flow: it might work, but it wouldn't make a ton of sense.  You might manage to publish a paper, but it is hard to imagine such a thing having broad impact.  Spark is just not an edge caching solution: it really makes sense only in the big-data repositories where the DataBricks product line lives.  And so forth.

Monday, 15 October 2018

Improving asynchronous communication in u-service systems.

I've been mostly blogging on high-level topics related to the intelligent edge, IoT, and BlockChain lately.  In my work, though, there are also a number of deeper tradeoffs that arise.  I thought I might share one, because it has huge performance implications in today's systems and yet seems not to get much "press".

Context.  To appreciate the issue I want to talk about, think about a data pipeline moving some sort of stream of bytes at very high speed from point A to point B.  If you want to be very concrete, maybe A is a video camera and B is a machine that runs some form of video analytic.  Nothing very exotic.  Could run on TCP -- or it could use RDMA in some way.  For our purposes here, you have my encouragement to think of RDMA as a hardware-accelerated version of TCP.

Now, what will limit performance?  One obvious issue is that if there is some form of chain of relaying between A and B, any delays in relaying could cause a performance hiccup.  Why might a chain arise?  Well, one obvious answer would be network routing, but modern cloud systems use what we call a micro-services (u-service) model, and for these, we deliberately break computations down into stages, and then each stage is implemented by attaching a small "function" program (a normal program written in C++ or Java or a similar language, running in a lightweight container setting like Kubanetes, and with "services" like file storage coming from services it talks to in the same data center, but mostly running on different machines).  You code these much as you might define a mouse-handler method: there is a way to associate events with various kinds of inputs or actions, and then to associate an event handler with those events, and then to provide the code for the handler.  The events themselves are mostly associated with HTML web pages, and because these support a standard called "web services", there is a way to pass arguments in, and even to get a response: a form of procedure call that runs over web pages, and invokes a function handler program coded in this particular style.

So why might a u-services model generate long asynchronous chains of actions?  Well, if you consider a typical use case, it might involve something like a person snapping a photo, which uploads into the cloud, where it is automatically analyzed to find faces, and then where the faces are automatically tagged with best matches in the social network of the camera-owner, etc.  Each of these steps would occur as a function event in the current functional cloud model, so that single action of taking a photo (on Facebook, WhatsApp, Instagram, or whatever) generated a chain.  In my work on Derecho, we are concerned with chains too.  Derecho is used to replicate data, and we often would see a pipeline of photos or videos or other objects, which then are passed through layers of our system (a chain of processing elements) before they finally reach their delivery target.

Chains of any kind can cause real issues.  If something downstream pauses for a while, huge backlogs can form, particularly if the system configures its buffers generously.  So what seems like a natural mechanism to absorb small delays turns out to sometimes cause huge delays and even total collapse!

Implications.  With the context established, we can tackle the real point, namely that for peak performance the chain has to run steadily and asynchronously: we need to see an efficient, steady, high-rate of events that flow through the system with minimal cost.  This in turn means that we want some form of buffering between the stages, to smooth out transient rate mismatches or delays.  But when we buffer large objects, like videos, we quickly fill the in-memory buffer capacity, and data will then spill to a disk.  The stored data will later need to be reread when it is finally time to process it: a double overhead that can incur considerable delay and use quite a lot of resources.  With long pipelines, these delays can be multiplied by the pipeline length.  And even worse, modern standards (like HTML), often use a text-based data representation when forwarding information, but use a binary one internally: the program will be handed a photo, but the photo was passed down the wire in an ascii form.  Those translations cost a fortune!

What alternative did we have?  Well, one option to consider is back-pressure: preventing the source from sending the photo unless there is adequate space ("credits") available in the receiver.  While we do push the delay all the way back to the camera, the benefit is that these internal overheads are avoided, so the overall system capacity increases.  The end-to-end system may perform far better even though we've seemingly reduced the stage-by-stage delay tolerance.

But wait, perhaps we can do even better.   Eliminating the ascii conversions would really pay off.  So why not trick the system into taking the original photo, storing it into a blob ("binary large object") store, and substituting a URL in the pipeline.  So now our pipeline isn't carrying much data at all -- the objects being shipped around might be just a few tens of bytes in size, instead of many megabytes.   Now the blob store can offer a more optimized photo transfer service, and the need for buffering in the pipeline itself is eliminated because these tiny objects would be so cheap to ship.

This would be a useless optimization if most stages need to touch the image data itself, unless the stage by stage protocol is hopelessly tied to ascii representations and the blob-store-transfer is super-fast (but wait... that might be true!).

However, there is even more reason to try this: In a u-services world, most stages do things like key-value storage, queuing or message-bus communication, indexing.  Quite possibly the majority of stages aren't data-touching at all, and hence wouldn't fetch the image itself.  The tradeoff here would be to include meta-data actually needed by the intermediary elements in the pipeline while storing rarely-needed bits in the blob store, for retrieval only when actually needed.  We could aspire to a "zero copy" story: one that never copies unnecessarily (and that uses RDMA for the actually needed data transfers, of course!)

Tuesday, 25 September 2018

Blockchain verification and the dangers of branching temporal logic models (very technical)

Many years ago, I worked with some colleagues on an ill-fated topic: we tried to write down a logical statement concerning the guarantees provided by atomic multicast systems that manage their own membership.  Today, we know how to do that, courtesy of Lamport’s Paxos specification and the proof methods he introduced.

But those were the Wild West days, and that particular project occurred before the Paxos specification was created.  Moreover, our atomic multicast (which actually could be configured as a Paxos protocol), also included some optional features Leslie omitted from Paxos, for good reason.  Those centered on a form of optimistic early delivery combined with barrier synchronization (analogous to a memory fence).

Our puzzle centered on expressions describing "all the possible" future behaviors that could arise with this rather complex optimistic form of early delivery.  The problem was that the set of future scenarios grew exponentially as new members join and existing members failed (or exited voluntarily).  Our logic needed to "account" for all of these possibilities.  In fact, the logic itself had a flaw, but even had we managed to write it down properly, we still would have had a semantic in which statements can only be interpreted within the “space” of future scenarios.  Since it is intractable to do model checking in an exponentially growing state space, such statements are often undecideable: they can have a true or false value and yet no decision procedure can be created that will terminate in bounded time.

A total mess.  We ultimately abandoned the effort, tails between our legs, and came to view it as an embarrassment.  The less said about it, the better!

Except... I sometimes tell the story, as a cautionary tale.

Not every technology suffers from the issue we encountered.  Lamport's way of formalizing Paxos avoids the issue by avoiding speculative "early delivery", which was the real issue in our early work.  This is one reason that Lamport's Paxos specification was such a success.

Transactional database models also have a better way of handling such problems: when we ask whether a database system is in a serializable state, the rule is to start by erasing all the uncommitted transactions, at which point serializability is defined as a property of the committed state.  This approach accepts that transactions could glimpse inconsistent states while executing: it isn't a problem so long as those transactions can't commit.  Moreover, it erases all the events that depend on future outcomes, neatly avoiding the whole issue our unfortunate effort ran up against.

Which brings us to BlockChain.  I'm intrigued by the recent work that seeks to put a kind of transactional database behavior "into" the BlockChain, by incorporating SQL-like statements into the transactions themselves, but then reevaluating them as the BlockChain steadily grows.

To appreciate why this poses the same problem I struggled with twenty years ago, think about a smart contract that says something along the following lines: "John agrees to sell me his horse, Bucky, for the sum of $1,500, and has accepted a $150 deposit.  If I haven't completed the purchase within a week, John agrees to return the deposit.  But in the meanwhile, John can continue to try and sell Bucky.  If he finds another buyer, he can cancel my transaction, but in that case must both return the deposit and also pay me an addition $100, to compensate me for my trouble."

The world is full of contracts like these.   Smart contracts can express things like rules for computing interest that depend on global interest rates.   We probably all remember 2008, when the world financial system melted down over issues with mortgage-backed  securities split into interest and principle.  The claim is that the expressive power of smart contracts is a good thing, because smart contracts can be analyzed by tools (compiler-style tools), and hence it should be possible to automatically identify risks. Risk managers would then have robust ways to compute their risk and hedge against it, and those hedging contracts (insurance deals) could also be carefully evaluated, etc.  Back in 2008, it wasn't the mortgage-backed securities that collapsed, per-se.  It was the insurance companies that insured them, but didn't hedge the "secondary" risk properly.

So... how does one attach a "formal meaning" to a smart contract?  Let's go back to John's sale of Bucky.  Notice that this contract depends on how things play out into the future.  For the coming week, the BlockChain will grow, and each new block added to the chain could bring events relevant to the contract.  John could decide to cancel the contract and sign a new contract with Sally (perhaps she is ready to pay more -- enough to more than compensate John for the $100 he'll forfeit).   I could show up with the remaining $1350, and head home with Bucky.  A week could pass, and John would have to return my $150 deposit.  And it gets worse: John could sign with Sally, but then Sally might cancel her deal, and perhaps John would then want to reinstate his deal with me.

Much as in that early work I tried to do, John's smart contract with me has a meaning that can depend on a branching future state: some large (maybe exponential) space of possibilities, each leading to its own proper interpretation of the correct "thing to do".  Should John hand Bucky over to me, or not?  Do I owe him $1,350, or does he owe me $150, or should it be $250?

Without much trouble, we can design sequences of smart contracts in which to know the proper outcome for my contract, I need to figure out the outcome of Sally's contract (and this then becomes an induction, because Sally's contract may depend on the outcome of Carl's contract).  This is precisely how my early work failed: you end up with scenarios that can be arbitrarily prolonged, and the total space of scenarios grows exponentially in the length of the future chain,  because of an endlessly longer sequence of new events that each depends on its own future outcomes.

Beyond all of which we have the issue of rollbacks: even if you accept the common wisdom and adopt the view that a BlockChain prefix has magically "committed" once it has been extended by six or more blocks, we still run into the problem that the suffix is unstable.  So we could have one suffix in which Sally's transaction finalizes, but it might then rollback, aborting that outcome and perhaps replacing it with one in which Sally cancels her purchase.

Should it trouble us that smart contracts on BlockChains might not have a tractable meaning -- a reduction to temporal logic -- if they include future references?  For that matter, even without future references, some of the aspects just mentioned would still arise.   Is this bad?

I think so: it seems to me that in computing, if we're learned one thing over the decades, it is the overarching importance of rigorous semantics.  If BlockChains with smart contracts can't be reduced to a stable logical framework in which proofs can be carried out without solving infeasible problems (evaluating logical formulas within exponentially growing state spaces is a well-known infeasible problem), then we are looking at a profoundly untrustworthy framework.

So beware, all of you rabid BlockChain investors!  If you are betting big on smart contracts, you owe it to yourselves to figure out a way to reduce the statements those contracts make to a stable, computationally feasible form.  You know what they say: those who fail to learn from the past are doomed to repeat it.  If you don't find a robust and tractable semantics for your technology, then someday, twenty years from you, too will be writing little blog postings about how your work once took a hopelessly wrong turn...  and that Professor Birman's sad story of his unfortunate foray into the theory of branching future executions should have warned you!

Thursday, 13 September 2018

Will HPC survive the cloud?

I just got back a from an HPC workshop, where a lot of the discussion was focused on the impact of cloud computing on HPC.  Here are a few of the main take-aways.
  • First, to get this up in front, HPC is alive and well.  A whole slew of amazing new computers are about to be powered up, operating at speeds that just defy human understanding.  So HPC isn't about to collapse and die tomorrow.  (Ten years out, though, is a more complex question).
  • Some of the really big financial drivers for HPC are things that genuinely need massive compute infrastructures: tasks like weather prediction, scientific computing from experiments like the LIGO gravitational-wave observatory, modelling the air flow around a supersonic jet.
  • But more and more HPC tasks have an embarrassingly parallel structure and really map down to huge numbers of subtasks that don't need massive computers to perform.  One person estimated that 90 to 95% of the workload on today's biggest computers consists of vast numbers of smaller jobs that run as a batch, but could easily be performed on smaller machines if they had the right hardware.
  • And several speakers put pictures of big cloud computing data centers up, and pointed out that no matter how exciting those new HPC systems will be, even a small cloud data center has vastly more compute power in it, and vastly more storage capacity.
  • On top of this, we have the runaway success of Microsoft's Azure HPC, which has become a genuinely hot cloud platform -- the demand far exceeds what people had expected, based on industry articles I've followed over the past few years.  Azure HPC offers smallish clusters that might have, say, 48 machines, but those machines would then be running the same "bare metal" platforms you see on massive HPC supercomputers.  And nothing stops Azure HPC from ramping up and starting to offer larger and larger configurations.  Rather than run MPI over RoCE, Microsoft just puts a second network infrastructure on their Azure HPC clusters, using InfiniBand for MPI and treating the standard ethernet as a control network for general TCP/IP uses.

So this is the looming threat to the HPC community: not so much that HPC might suddenly loose its steam, but rather that we could see some non-trivial percentage of the HPC jobs migrate towards platforms like Azure HPC.  And in fact one speaker at the workshop was the head of computing for a large research university, who told us about a consortium being formed to promote just that transition.  What he explained was that while really big HPC still needs the big data centers, like the U. Texas XSEDE systems, most of the campus needs could be adequately served with smaller resources.  This makes it appealing for universities to rent, rather than own, and by forming consortia, they could have the bargaining power to make financially compelling deals with big cloud HPC operators like Microsoft (and not just Microsoft -- he pointed out that as a buyer shopping around, he was getting bids from quite a few cloud providers).

The issue this raises is that it redirects money that would in the past have flowed to the HPC data centers towards those big providers.  Imagine a world in which, say five years from now, 30% of today's HPC has moved to cloud solutions.  The loss of that income base could make it very hard for the big data centers to continue to invest and upgrade.  Meanwhile, all that cash flowing to the data center owners would incent them to explore more and more ambitious cloud-hosted HPC products, trying to find the sweet spot that maximizes income without overstretching them. 

The second issue I'm seeing relates to my new favorite topic: the intelligent, reactive cloud edge.  Recall from my past few blog postings that I've been fascinated by the evolution of the first tier of the cloud: machines inside the data center, but that are on the front line, running services that directly handle incoming data from IoT devices, real-time uses like smart cars or smart homes, or other time-critical, highly demanding, applications.  Part of my interest is that I'm really not fond of just working on web servers, and these intelligent IoT applications need the mix of fault-tolerance and consistency that my group specializes in: they seem like the right home for our Derecho technology and the file system that runs over it, Freeze Frame.

But this has an HPC ramification too: if companies like Microsoft want Azure HPC to be a player in their cloud applications, they will invest to strengthen the options for using HPC solutions as part of real-time edge applications.  We'll see a growing range of HPC platforms that can tie deeply right into the Azure IoT Edge, for example, and HPC could start to perform demanding tasks under real-time pressure.

Right now, those standards aren't at all mature -- HPC systems are casual about endlessly slow startup (I did one experiment with MPI and was shocked to realize that multi-minute delays are totally common between when a job "starts" and when the full configuration of the job is actually up and ready to run my application).  We could talk about why this is the case: they do silly things like pulling the container images one by one on the nodes as they launch, and sometimes actually pull DLLs in one by one as needed too, so the causes are totally mundane.  Derecho (or even its RDMC component) could be "life transforming" for this kind of thing!  But the real point is that it can be fixed.

So imagine that over a five year period, the Azure edge, and similar systems from Amazon and other providers, start to really do a great job of integrating HPC into the cloud.  The rich and extensive tool set the HPC community has developed suddenly becomes available to cloud application creators, for use in real-time situations, and it becomes easy to capture data and "farm it out" to HPC with no significant delays at all (I mean milliseconds, whereas today, that might be minutes...).  Wow, what an exciting thing this could yield!!!

For example, in the electric power grid settings I've worked on, one could do micro-predictions of wind patterns or sunshine patterns and use that to anticipate the power output from wind farms or solar farms.   You could adjust the wind turbines dynamically to maximize their productivity. Someday, with enough knowledge of the communities connected to the grid, we could even predict the exact power output from city-scale rooftop solar deployments.  Just that one use case could be transformative!

Then you can imagine all kinds of image processing and data fusion tasks that would be feasible today in offline settings, but way out of reach for real-time applications.  Suddenly they could become HPC subtasks in this hybrid cloud built as a fast reactive edge with HPC clusters as a module available to the edge applications.  HPC could become a major player in the cloud ecosystem.

This is the bigger threat to the traditional HPC community, as I see it: a threat of explosive innovation that could win by just being more exciting, faster growing, lucrative, and massive in scale.  It wouldn't take long before HPC on the cloud would be the hot setting for young researchers to tackle, and HPC on traditional supercomputers would begin to starve simply because it would look more and more like a legacy world.

At the workshop, we actually had one speaker who made the case that HPC supercomputers were in a "race" to own time-critical (real-time) HPC compute tasks.  But there were many speakers, myself included, who argued that no, the race is already over -- the cloud won before the HPC community even knew that the opportunity existed.  Today, the real race is the race to be a player in this new thing: the intelligent IoT-based edge.  And HPC as a component of that story clearly has a very bright future.  

Wednesday, 22 August 2018

Inventing the intelligent, active, edge

At the recent Microsoft faculty research summit, I was thrilled (and also somewhat relieved) to see that the company decided to run a whole workshop emphasizing systems (including networking), and the ways that systems can support machine learning.  It seemed clear that Microsoft has decided to become a major player in what they call "intelligent edge" computing and is urging all of us to jump on board.

These terms may be new, so I thought I might summarize the trend, based on my current understanding of it.  A good place to start is with a little mini-tutorial on what cloud computing infrastructures have been doing up to now, because the intelligent active edge really builds on the current architecture (over time, it will be more and more differentiated, but today, the overlap is substantial).

So: Today, we have a cloud dominated by a style of computing that prevailed in the 2000-2010 web server and services period.  In a first draft of this essay I wrote a page or so about this topic but it got heavy on jargon and I felt that it was taking too long to get to the point.  So I'll get there very quickly:
  • Many people think of the cloud as a way to rent Linux containers, but the bigger and more exciting trend focuses on elastic platforms that are event driven, connected by various ways to pass objects from layer to layer, and customizable by providing little event handlers: "functions".
  • Examples of platforms like this include Amazon Lambda, Microsoft Azure Functions, Google Tensor Flow, Spark/DataBricks RDDs.
  • The connections tend to be via some form of queuing service (Amazon SQS, Apache Kafka, Azure Service Queues, IBM's MQSeries, OpenSplice, etc).  Big objects are often just stored into a large file system (S3, Google GFS, Hadoop HDFS, etc).
  • Everything is sharded from start to finish.  Data shows up on HTTP connections (web requests to web services), but programmable edge routers (like Amazon Route 53) extract keys and use standard distributed hashing schemes to vector the requests into "shards", within which they may additionally load-balance.
  • We cache everything in sight, using DHTs like Amazon Dynamo, Cassandra, Microsoft FaRM.
  • The long-term storage layers are increasingly smart, like Azure Cosmos-DB.  They may do things like deduplication, compression, image segmentation and tagging, creation of secondary objects, etc.  Often they are backed by massive long-term storage layers like Azure Pelican.
  • Then of course we also have standard ways to talk to databases, pre-computed files, other kinds of servers and services, back-end systems that can run MapReduce (Hadoop) or do big-data tasks, etc.
  • The heavy lifting is hardware accelerated using GPU, TPU, FPGA and similar technologies, and as much as possible, we move data using RDMA and store it into memory-mapped non-volatile memory units (SSD or the newer 3D-XPoint NVMs like Optane).
Whew!  I hope you are still with me...

The nice thing about this complex but rather "standard" structure is that the developer simply writes a few event handlers for new web requests and most of the rest is automated by the AWS Lambda, Google Tensor Flow or Azure Functions environment.  Learning to work in this model is a bit of a challenge because there is a near total lack of textbooks (my friend Kishore Kumar is thinking of writing one), and because the technologies are still evolving at an insane pace.

The big players have done what they can to make these things a little easier to use.  One common approach is to publish a whole suite of case-study "demos" with nice little pictures showing the approach, like you would find here for Azure, or here for AWS.  In the best cut-and-paste fashion, the developer just selects a design template similar to what he or she has in mind, downloads the prebuilt demo, then customizes it to solve their own special problem by replacing the little functions with new event handlers of his or her own design, coded in any language that feels right (Python is popular, but you typically get a choice of as many as 40 popular options including JavaScript), and that will run in a little containerized VM with very fast startup -- often 1ms or less to launch for a new request.

This is the opposite of what we teach in our undergraduate classes, but for the modern cloud is probably the only feasible way to master the enormous complexity of the infrastructures.

So... with this out of the way, what's the excitement about the intelligent edge (aka active edge, reactive edge, IoT edge...)?

The key insight to start with is that the standard cloud isn't a great fit for the emerging world of live machine-learning solutions like support for self-driving cars, smart homes and power grids and farms, you name it.  First, if you own a huge number of IoT devices, it can be an enormous headache to register them and set them up (provisioning), securely monitor them, capture data privately (and repel attacks, which can happen at many layers).  Next, there is an intense real-time puzzle here: to control self-driving cars or drones or power grids, we need millisecond reaction times plus accurate, consistent data.  The existing cloud is more focused on end-to-end web page stuff where consistency can be weak and hence the fast reactions can use stale data.  So CAP is out the window here.  Next, we run into issues of how to program all of this.  And if you solve all of this in the cloud, you run into the question of what to do if your farmer happens to have poor connectivity back to the cloud.

So the exciting story about Azure IoT Edge was that Microsoft seems to have tackled all of this, and has a really coherent end-to-end vision that touches on every element of the puzzle.  This would be endless if I shared everything I learned, but I'll summarize a few big points:
  • They have a clean security and provisioning solution, and a concept of IoT life cycle with monitoring, visualization of "broken stuff", ways to push updates, etc.
  • They work with ISPs to arrange for proper connectivity and adequate bandwidth, so that applications can safely assume that the first-hop networking won't be a huge barrier to success.
  • They have a concept of an Azure IoT "hub" that can run as a kind of point-of-presence.  Apparently it will eventually even be able to have a disconnected mode.  Many companies operate huge numbers of PoP clusters and a smaller number of cloud data centers, so here Azure IoT Edge is taking that to the next level and helping customers set those up wherever they like.  You could imagine a van driving to a farm somewhere with a small rack of computers in it and running Azure IoT "hub" in the van, with a connection while the van is back at the home office, but temporarily disconnected for the couple of hours the company is working at that particular farm.
  • Then the outer level of Azure itself would also run the Azure IoT edge framework (same APIs) but now with proper connectivity to the full cloud.  And the framework has a strong emphasis o topics dear to me like real-time, ways of offering consistency, replication for parallelism or fault-tolerance, etc.  I'm looking at porting Derecho into this setting so that we can be part of this story, as a 3rd party (open source!) add-on.  They have a plan to offer a "marketplace" for such solutions.
As a big believer in moving machine learning to the edge, this is the kind of enabler I've been hoping someone would build - right now, we've lacked anything even close, although people are cobbling solutions together on top of Amazon AWS Lambda (which perhaps actually is close, although to me has less of a good story around the IoT devices themselves), or Google Tensor Flow (which is more of a back-end story, but has some of the same features).  As much as I love Spark/Databricks RDDs, I can't see how that could be an IoT story anytime soon.

So I plan to dive deep on this stuff, and will share what I learn in the coming year or so!  Stay tuned...

Thursday, 9 August 2018

Magical Thinking and the Logical Foundations of BlockChains

During the past few years, I've been exposed to an unrelenting drumbeat for BlockChains.  The level of enthusiasm for this model, and the commercial mania around it, have all the elements of a "market bubble".  Just yesterday I saw a quote from a BlockChain/CyberCoin billionaire who believes that "BlockChain will replace the Internet."  Really?  But search for that phrase and you'll actually find that this guy is saying something many people believe.  Rational or not, there is a huge community totally convinced that the future will be a giant BlockChain.

The BlockChain buzz was evident at the recent conference I attended, where one speaker told us about a Berkeley spin-off based on BlockChain: Oasis, which just landed $45M in first-round "seed" funding.  Just think about that for a moment: how can such a number be justified?  I'm a skeptic.

Oasis apparently plans to build a secure infrastructure using BlockChain as the storage solution, but entirely secure from the ground up.  Presumably this is one reason Oasis needed so much cash: most companies these days just run on a public cloud like Azure or Amazon (or any of the others).  Building from the ground up will be expensive, but would let Oasis avoid a pitfall other startups share: because its competitors depend on existing datacenters, they are exposed to whatever security flaws those cloud platforms embody.

But can one build a secure data center from the ground up?  Let's focus purely on storage, since this seems to be the essence of the Oasis plan.  Could one build a new kind of secure data center that (only) provides secure storage using BlockChain, built from the ground up?

The first step is to reject the permissionless BlockChain model, which is too weak to guarantee freedom from rollbacks even years after a transaction seemingly commits: permissionless BlockChain systems with anonymous servers are insecure by design.  We want a minimal BlockChain solution, but if you take minimal to mean "anonymous, globally replicated, permissionless", my answer is that "it can't be done: it is impossible to create a trustworthy platform with that mix of properties and technologies."

Fortunately, the permissioned model avoids this risk. Moreover, it seems reasonable to assume that if the goal is to offer a data-center BlockChain product, the data center operator can control which machines can operate the solution.  This moves us from the BitCoin style of anonymous, amorphic, fully decentralized BlockChain to a more standard model of append-only files used by customers like banks.

Next, we should perhaps reject a standard cloud that offers encrypted append-only files. This is an interesting step in the analysis because a block chain really is just a secure append-only file, no matter what anyone might tell you (secured using SHA 256 hashes or similar block hashes, with proof-of-work if the system is permissionless, and then with the signatures entangled to prevent tampering).  Any file system could play that role, if you code the needed logic to generate records formatted this way and with the required chain of attestation.  Amazon and Azure and other cloud companies already offer extremely secure storage layers, including BlockChain services. But as noted, they do depend on other elements of the respective datacenter systems.  So out with the old, and in with the new!

Now, without knowing anything about the proprietary protocols that Oasis is presumably designing,  I can't say anything about how they plan to guarantee correctness.  But I can tell you how I would do it.  I would use a form of Paxos, and because I would want extreme speed, would go with the version of durable Paxos that we implemented in Cornell's Derecho system.  If I were chief architect at Oasis, I might want to build my own software (hence not use Derecho), but I would certainly adopt the Paxos specification, and prove that my software implements it.

Of course being a Derecho zealot, I'm going to make the case that using Derecho might be the smart move, even if you might have preferred to roll your own.

First, I should note that by using Paxos, Derecho is able to leverage decades of work on  proofs of correctness -- Derecho was implemented by fusing a  proved-correct version of Paxos  integrated with a proved-correct version of the virtual synchrony membership management model and a new reliable (but not atomic) multicast layer.  Then all of the data movement steps were mapped to the available storage (SSD or 3-D XPoint) and communications technologies (RDMA or TCP) in an efficient way.  Finally, Derecho uses specialized optimizations for code paths that turn out to be performance-critical.

Next, it is worth noting that Derecho is open source, hence can benefit from community contributions over time, and also that the system is extremely fast -- the fast such system ever created.  Further, it actually achieves minimal bounds for Paxos (Derecho's protocols are "optimal" in a formal sense).  Of course one can talk about smaller constants and so forth, but given the speed of the system, there is an obvious case to be made that the constants aren't showing any sign of being unreasonably large.  So Derecho is a genuinely interesting option!  Heck, maybe my students and I should try and raise $50M or so and jump into the commercial space!  

But now we run into the first of a series of issues: the protocols and proofs we are leveraging were all created in the usual way: on paper, by hand.  The mapping didn't modify the underlying logic, yet it required to adapt them to modern networking is by hand, too, and hence also needed to be proved by hand.  The same can be said about our dozens of optimizations.

Is there really a sensible way to argue that all of these hand-written proofs be accepted as somehow being "more trustworthy" than Amazon AWS or Azure?  After all, those are companies are serious about specifications too, and further, both have invested hundreds of millions of dollars in their testing and Q/A process.  Derecho is remarkably robust, and we would point to all those proofs as part of the reason, but even so, we do find bugs in our logic.

Now, if I had the money, one option might be to harden Derecho.  Aggressive testing, deep quality assurance and better documentation would be a breeze.   Starting with such a strong prototype, we could quickly end up with a professional-quality tool.  In fact I actually hope to do this, in the next year or so, if Derecho users are able to chip in to help with the costs.

But perhaps you still wouldn't be satisfied.  And indeed, today's state of the art goes much further: the very best approach is to only run code extracted from a machine-checked proof.  In effect, we compile the proof into the running system and take the developers entirely off the development path.

This, though, turns out to be infeasible for software as elaborate as Derecho, and would be even harder for whatever Oasis really plans to build.  The issue here is that as of today, the provers can't handle the style of code that we needed to use in order to create Derecho, and any full data-center infrastructure would have 10x more such code, and perhaps far more than just 10x.

Today's best provers actually can handle automated extraction of Paxos code from correctness proofs (this has been done using NuPRL here at Cornell, or IOA and Larch in Nancy Lynch's group at MIT, or IronFleet, using Dafny at Microsoft).  The resulting solutions are very robust in the sense that the logical proof structures that result are rock solid -- for what they cover.  Unfortunately, however, the resulting code is incredibly slow compared to Derecho.  Moreover, infrastructure aspects are much harder to formalize than data replication and consistency, so in some sense, Paxos is the "easy" part of the story.  Can we specify the system, fully?  Is the specification itself bullet-proof?  Bootstrap?  Other management tasks?  These steps are much harder than many people would expect.  They may not even be possible, in some cases!

This same pattern is evident in many projects that have formalized aspects of operating systems or runtime environments.  At MIT, Nick Zeldovich famously proved a very simple file system correct a few years ago.  It ran on Linux, but there is a Linux u-kernel called SEL4 that has been proved correct, and while it doesn't cover all of Linux, SEL4 probably has enough stuff in it to support Nick's provably correct file system.

Then you could perhaps use the proved version of the C compiler to compile the thing (C++ is considered to be too complex for existing provers).  Even better would be to just build the whole thing in a language like RUST or Dafny, where proof is much more central to the coding and compilation model.  With such help, you may actually manage to create a proved solution, for parts of your full system.  

But without exception, you'll end up with slow code compared to the best possible, and will have solved just a portion of the entire datacenter infrastructure.  We are decades from having data-center scale, provably correct , ultra-performant software systems.  Perhaps we'll never get there.


Moreover, even if we did manage to overcome these barriers, we would run into further questions.

One big issue is the hardware.  Think about any hardware component in the entire data center.  The routers.  Printers. Digital telephone systems.  Storage devices.

$45M may seem like a huge amount of money, but in fact it is a tiny drop in the bucket when you consider that companies like Intel spend billions on their VLSI chip fabrication factories.  So there is simply no question that Oasis will end up using components some other company created.

The problem is that these components tend to be software-defined, and this is becoming more and more of a standard story: Almost all hardware components have general-purpose, highly capable processors within them.  An entire separate computer, with its own memory, network interfaces and code.

Indeed, if you needed one CPU and DRAM unit just to operate the hardware, why not include two on the same board, or chip?  Who would notice?  And if you do this, why not drop malicious code into the second unit?  Your printer company may not even realize that it is selling compromised devices: I've heard of network chips that had entire secondary networking infrastructures built in, operated by entire stealth operating systems.  You can monitor your network as closely as you like.  You would never notice the stealth network, and its control logic!

So you can perhaps make the higher layers provably secure, but this issue of trust in the hardware limits the degree that the resulting system could be secure.

If you run a secure solution on compromised hardware, all bets are off.  The situation is a bit better if you trust the hardware: Intel has an approach called SGX that can do a bit better, and perhaps Oasis plans to leverage it, but if so, they will face performance challenges.  Sadly, SGX is quite slow.

But suppose they pull all of this off: a ground-up datacenter solution, minimal trust in the hardware, offering a BlockChain storage layer to customers.  Now we run into a new puzzle: the issue arises of how to draw the boundary between the trusted storage solution and the customer's application.

The problem here has to do with composing a trusted application with a trusted storage layer through some form of less-trusted layer of intermediary logic, like the runtime associated with the programming language.  Modern applications are coded in standard languages like Java, Python, C++, Ruby.  They use databases from big vendors like Oracle, web servers like Apache, and so forth... and each of brings its own millions of lines of logic and its own runtime environment.  Those presumably have flaws, hence an attacker could potentially compromise the application by compromising some element of the compiler or runtime, even if other aspects of the datacenter BlockChain storage solution magically were iron-clad.

And this is why I'm a skeptic: I don't see how such a picture can be secured, not today, and probably not in my lifetime.

I actually do believe in security, but I think that in modern systems, we get much more protection from diversity and multiple layers of monitoring than we do from automated techniques like verification.  We should definitely do verification where we can -- for systems like Derecho, it offers a path that can slash bug rates and greatly improve confidence in the correctness of the system relative to the promises it can legitimately make.  But to me we oversell the power of verification and proofs if we go further and allow people to believe that we have discovered a magic way to carry this idea to the limits and "prove the whole thing", whatever that thing may be.  BlockChains don't change this reality.

The Oasis folks will presumably read this blog, and I should emphasis that it isn't a personal criticism.  I'm a huge fan of the Berkeley security team and have been amazed by their accomplishments.  Amazing work has been done.  But even so, and without being hostile in any way at all, I do think we all share a need to be frank with the public about what technology can, and cannot, accomplish.

BlockChains are being oversold as a trivial way to get perfect security -- not by Oasis, but by the popular technology press, which seems to have a very dim understanding of the concept.  The press seems to think BlockChains are somehow "more" than secure file systems.  Yet if anything, they are "less"!

Perhaps we are seeing a superposition of two elements.  Clearly, we do have a community that was unaware of the concept of a tamper-proof append-only log protected using cryptography.  All sorts of folks who work in government, health care, manufacturing clearly feel that this is a revelation, and offers an answer to all their security worries.  And quite possibly some were actually not aware that we can use file systems in secure ways -- I'm glad they know, now, and if BlockChain helps them conceive of this, I'm all for it.

Then we have hype, driven by cryptocurrency valuations in the stratosphere, a technology press endlessly eager for the next big thing, and investors keen to make a killing.  In this marketplace, I see a lot of value that companies can bring to the table, particularly ones with exciting ideas for packaging this stuff to be useful, and some of this value is very real: HyperLedger and Ethereum and related tools are genuinely powerful.

But just the same, we need to stop claiming that BlockChain is a revolutionary invention.

My worry is that by overselling the same old file systems to a na├»ve community of infrastructure owners, we may end up with systems that are actually less secure than what they replace.  After all, a platform like Azure is actually remarkably secure today, managed professionally by a company obsessed with security (and nobody is paying me to say this, by the way!), and has such a diversity of technologies deployed that compromising it in a major way would be quite hard.  If some hospital were to abandon that and jump to BlockChain as its file system, the illusion of security may have replaced much stronger real security.  True, today's reality has its limits, but in fact  I would be far more comfortable seeing medical records hosted on Azure or AWS, than abandoning these professional solutions in favor of a storage product that uses BlockChain as a magic wand to solve all problems.

But clearly, the current climate (especially the technology press) is prone to magical thinking, and a bit weak on just what BlockChains are, and how they work, and what their logical foundations turn out to be.  And in light of that, I suppose that the amazingly high first round of Oasis investment makes a kind of sense.  A bubble?  Definitely.  And yet all valuations are measures of market sentiment.  So perhaps not an unreasonable bubble, given the modern business climate and the craze that BlockChain has engendered.