Wednesday 28 December 2016

Next generation memory technologies

During the past few months I've had time to talk to Dave Patterson (Berkeley), Dave Cohen (Intel) and Pankaj Mehra (Sandisk/WD) about the next generation memory hierarchy. This blog seems like a good way to organize what I'm learning and to jot down a few thoughts.

First, we're about to see another major technology dislocation, similar to that associated with RDMA (and actually it will include an RDMA dimension). Here's why I say this:

There are a LOT of a new technologies in the pipeline, including 3-D XPoint memory, much larger SSD (NAND) storage systems with battery-backed DRAM caches, phase-change memory, and probably more. For those unfamiliar with the terms, normal computer memories are usually referred to as SRAM (the super fast kind used for registers) and DRAM (main memory storage).
They have very different properties. 3-D XPoint is a persistent memory technology that can run nearly as fast as DRAM when evaluated purely in terms of throughput, and with only a very small access delay (a few clock cycles more than for DRAM). It offers full addressability and byte-by-byte atomicity, and it can keep up with RDMA, so we should be able to direct RDMA data streams directly into 3-D XPoint persistent storage. The issue for 3-D XPoint is likely to be its initial price: until volume becomes very large, this will be a fairly costly form of storage, more similar in price to DRAM memory than to SSD/NAND flash.
Battery backed flash (SSD/NAND) has been around for a while. It takes an SSD disk and then puts a large DRAM cache in front of the disk, with a control processor and a battery to keep the unit alive even if the host machine crashes. The idea is that once you've written to the DRAM cache (which you can do at full speeds and even using RDMA), you won't have to worry about persistency because if power is lost or the host crashes, the little control device will finish any pending transfers. But because NAND has lower transfer rates, you run into the issue that at RDMA speeds you can easily fill the entire cache before the system would have time to DMA the data onto the back-end flash storage system, so any high speed solution will need to either limit total transfer rates or stripe data across a bank of these units.
Rotational disk is quite slow by the standards of either of these technologies, but with new coding techniques ("shingling") offers some benefits that neither of the others possesses: it has really massive capacity, and doesn't suffer from "write leveling", a notorious problem for flash (if you rewrite a block too many times, the block degrades and can no longer record the data).

I didn't list phase-change memory here because I don't know as much about it, but people tell me that it has huge potential as well. I mentioned that 3-D XPoint will initially be fairly expensive (unless companies like Intel decide to price it low to push volumes up quickly). In contrast, SSD and rotational disk remain quite cheap, with SSD relatively more expensive than rotational disk, but quite a bit cheaper than 3-D XPoint. On the other hand, if you put a battery-backed cache and an ARM co-processor on your SSD store, it becomes a much more costly storage solution than it would be if you had one battery-backed DRAM cache for the whole system and shared it across your fleet of SSD disks.

We need to think of these memory technologies in the context of other technologies, such as:

RDMA, mentioned above.
NUMA hardware: with modern multicore systems, a memory unit can be close to a core or further away. It is very important to ensure that your memory blocks are allocated from nearby memory; otherwise every access can incur a huge penalty.
NetFPGA coprocessors, and GPU coprocessors: increasingly popular and able to run at line data rates but only if the pipeline is very carefully crafted.
Other forms of data capture and data visualization hardware that might use DRAM as their main way to talk to the host (your display already does this, but we'll see more and more use of 3-D visualization technologies, novel forms of video recording devices and new interactive technologies).
Various forms of caches and pipelines: each technology tends to have its own tricks for keeping its compute engine busy, and those tricks invariably center on pipelining, prefetching, data prediction, etc. If we move data around, we potentially move data from or into memory being computed upon by some device, and then either we might see the "wrong" data (e.g. if a specialized compute engine signals that it has finished, but its results are still in a pipeline being written to the output memory), or we might start the specialized compute engine up prematurely. In fact we even see this with our battery-backed DRAM/NAND mix, or with 3-D XPoint, where is takes a few clock cycles before data written to the device has really become persistent within it.

Some of these technologies are block oriented: the main processor cache operates in terms of blocks that we call "cache lines", an SSD would normally be structured into blocks of data, and a rotating disk invariably is. Others are byte addressable, like 3-D Xpoint and DRAM.

So we have quite a range here of speeds, price points, and addressability models. More complications; to the extent that these technologies have pipelines or caches or other internal memory structures, there isn't any standard way to flush them. Thus when I receive an RDMA data transfer into my machine and the RDMA NIC interrupts to say "data received", it isn't always obvious when it would be safe to use that data, and how. RDMA vendors ensure that CPU access to the received data will operate correctly: I can do any kind of computation on it that I want, or initiate a new outbound RDMA. But I want to turn around and ask my GPU to compute on the received data, it isn't obvious how to ensure that the GPU subsystem won't have cached some form of data of its own, that the RDMA transfer overwrote and hence invalidated, and if I can't flush those caches, I have no good way to be sure the GPU will "see" my new bytes and not the old ones!

A further puzzle is that in a rack-scale setting with NUMA machines, memory management potentially becomes very complex. RDMA makes it pretty much as cheap to pull in data from other machines as from my own: after all, the RDMA network runs as fast or faster than the DMA transfer from an SSD or a rotational disk. So the entire cluster potentially shares a huge amount of persisted memory, subject to just small latencies for accessing the first byte, provided that my transfers are large. With byte addressing, local DRAM is vastly faster than remote 3-D XPoint accessed over RDMA: even if the access will be handled by hardware, we can't eliminate those microseconds of setup delay.

Here's another thought: suppose that we shift perspective and ask ourselves what the application itself is likely to be doing with all of this persisted storage. I like to use two examples to shape my thinking. The first is that of self-driving cars sharing a huge world model that they read and update (the world model would include descriptions of the cars and their "plans" but also of everything they encounter and might need to chat about: other vehicles that aren't using the same self-driving technologies, pedestrians, transient road conditions like black ice or potholes, etc). And my second example is that of a bank logging financial transactions.

Here one instantly notices that there would be a need to keep fresh data easily accessible at low delays, but that older data could either be garbage collected or archived. So beyond the need to manage memory even for immediate uses, there is a broader need to think about "long term data curation". In the case of the bank, protection becomes an issue too: both to protect secrets and also to prevent tampering.

I don't have a magic answer here (nobody does) but I do think the picture highlights a huge opportunity for research in the operating systems and storage communities. Honestly, this looks to me like a situation in which we need a whole need operating system specialized just on these questions: there is an issue of what the proper abstractions should be, and how one would program against them, and how the technologies enable doing so (or prevent us from doing so).

Then we should ask about new opportunities. The self-driving car example points to a deep integration of storage with machine-learning applications. What are the access patterns that arise in such cases, and how does this shape our caching needs? What about replication? We clearly will need to replicate data, both for fault-tolerance and for speedy local access in the case of byte-addressing data where low-latency is key. Get such questions right, and suddenly we open the door to new kinds of medical devices that capture, process and display imaging as the operation advances; get them wrong and the same mix of hardware might be useless in the operating room!

I'll stop here: my blog entries are always way too long. But what an exciting time to do research in systems... I'm jealous of the young researchers who will get a chance to really tackle these open questions. And I'm looking forward to reading your SOSP papers once you answer them!

Wednesday 7 December 2016

The Internet of Things Needs a Realtime Cloud

Cloud computing systems ignore time, and this is a mistake.

Context: Temporal Data Mining is Hard Today

Everyone is talking about the Internet of Things (IoT), but fast, scalable IoT solutions can be difficult to create. IoT applications often require real-time analysis on data streams, but most software is designed to just read data from existing files. As a result, the developer may be forced to create entirely new applications that read data continuously. Developers who prefer to just build scripts that build new solutions from existing analytic tools would find this frustrating and time-consuming.

In fact, this problem has been troublesome to computing professionals for decades. There is an enormous variety of prebuilt software that can do almost any imaginable computation provided that the input is in files (for example tables represented in comma-separated value format, or files that list a series of values from some kind of sensor, or files written one per input, such as photos captured from an image sensor). In contrast, there is little prebuilt software for incremental data feeds (settings in which programs need to run continuously and accept one new input at a time).

This is why so many developers prefer to store data into files, then carry out any needed analysis on the files. But here we run into another issue: in today’s systems, the delay (latency) of the store-then-compute style of analysis is often very high. As an answer to this problem, my team recently created a system we call the Freeze-Frame File System. FFFS is able to accept streams of updates while fulfilling “temporal reads” on demand. By running analytics on the files that FFFS captures, we can get the best of both worlds: high productivity through reuse of existing code, and ultra-low latency. The key is that FFFS bridges between a model of continuously writing data into files and one of performing analysis on snapshots representing instants in time (much as if it was a file system backup system creating a backup after every single file update operation).

The developer sees a standard file system.

Snapshots are available and will materialize past file state at any time desired.

The snapshots don’t need to be planned ahead of time, and look to the application like file system directories (folders), with the snapshot time embedded in the name.

Applications don’t need any modifications at all, and can run on these snapshots with extremely low delay between when data is captured and when the analysis runs.

The file system uses RDMA data transfers for extremely high speed.

The result is that even though the FFFS looks like standard file system, and supports a standard style of program that just stores data into files and then launches analytics that use those files as input, performance is often better than that of a custom data streaming solution. This gives developers the best of two worlds: they can leverage existing file-oriented data analytics, but gain the performance of a custom-built data streaming solution!

Cornell’s Freeze Frame File System in Action

To illustrate the kind of computing I have in mind, here’s an example that arises in the smart grid. We simulated a wave propagating through a very simple mesh network, and generated 100 10x10 image streams, as if each cell in the mesh were monitored by a distinct camera, including a timestamp for each of these tiny images. We streamed the images in a time-synchronized manner to a set of data collectors over TCP, using a separate connection for each stream (mean RTT was about 10ms). Each data collector simply writes incoming data to files. Finally, to create a movie we extracted data for the time appropriate to each frame trusting the file system to read the proper data, fused the data, then repeated. In Figures 1-3 we see representative output.

Example of a real-time IoT application. The full animations are available here.

The image on the left used the widely popular HDFS system to store the files. HDFS has a built-in snapshot feature; we triggered snapshots once every 100 milliseconds (HDFS snapshots must be planned in advance and requested at the exact time the snapshot should cover). Then we rendered the data included in that snapshot to create each frame of the movie. Even from the single frame shown, you should be able to instantly see that this version is of low quality: HDFS is oblivious to timestamps and hence often mixes frames from different times.

In the middle case, we ran the same application but now saved the files into FFFS, configured to assume that each update occurred at the time the data reached the data storage node. Then we used FFFS snapshots to make the movie. You can immediately see the improvement: this is because FFFS understands both time and data consistency (if you run the animation, however, you will see that this version isn't perfect either). Another nice feature is that FFFS snapshots don’t need to be preplanned: you just ask for data at any desired time.

Finally, on the right, we again used FFFS, but this time configured it to extract time directly from the original image by providing a datatype-specific plug-in. (Some programming may be needed to create a new plug-in the format of your sensor data is one we haven’t worked with previously). Now the movie is perfect, although if data showed up very late (in particular, after we made the snapshots that were combined into the movie frames), obviously we could have seen glitches here too. Still, this is getting quite good.

With this image in mind, now imagine that rather than making a movie, the plan was to run an advanced smart-grid data analytic on the captured data. If a file system can’t support making a movie, as in Figure 1, it clearly wouldn’t work well for other time-sensitive computations, either. In effect, the HDFS style of file system fights against the IoT application. FFFS, by understanding time, offers a far superior environment for doing these kinds of smart analytics.

Why the Internet of Things Needs a Real-time Cloud

I hope you’ll check out our paper on FFFS and maybe even download and use the system itself. We believe it opens the door to a whole new style of data capture and computing, and also enables temporal forensics, where you use FFFS to create an archive of data, and then mine on the data later to explain things that happened unexpectedly and leave you puzzled: you could look back in time to see when the issue first arose. But rather than limit myself to FFFS here, I want to offer a totally different thought: maybe it is time for the cloud to start to embrace time, just like we did in FFFS, but in other parts of the cloud too.

Here are some reasons time could matter, a lot, in the future cloud. First, the world is gravitating to a wider and wider range of applications like the smart grid. Smart cars are an example: they will have autonomous onboard controllers but will probably often be paired with cloud-hosted services that run on behalf of the car and help it pick a route, point out nearby restaurants and book tables, notice if a road suddenly gets blocked and replan your travels, book you into a hotel if the weather becomes dangerously bad. And I think these will be highly available cloud-hosted real-time applications, for lots of reasons, not least being that they need to be super responsive.

Banking and other investment trading systems will operate more and more in this mode, and so will the systems controlling smart homes and smart cities and smart factories, and the list really goes on and on.

But once you imagine a whole world of time-based computing, with deadlines and scheduling and priorities, data replicated at high speeds for availability if a fault occurs, parallel computing in the loop so that machine learning can keep up with a rapidly changing external world, you aren’t talking about today’s asynchronous cloud anymore. CAP (the claimed tradeoff that weakens consistency to favor availability and partition tolerance) will be pushed to the side in favor of solutions like FFFS with strong consistency and strong temporal guarantees.

I do see hard puzzles: we need to scale this stuff up and make it handle georeplication too. We also need to integrate FFFS with the data analytic tools in systems like Microsoft Azure, Google and Amazon. So we researchers have our work cut out for us. But I can also see why you’ll definitely want FFFS not just now, but long into the future, and all these other goodies that really need to surround it in a grown-up real-time cloud!

Transactions [4]: How does a system like Derecho fit in a world with transactional DHTs?

Derecho is Cornell’s new RDMA-based data replication solution. It is a simple library coded in C++ 14 (well, maybe not that simple, but simple to use) and it automatically interfaces your code to the local RDMA hardware. It also automates such tasks as initializing new members when an application process joins an active application or restarts after a serious crash (state transfer), membership coordination and update when members join or fail, persistent data management, atomic multicast, and even RPC (but we also allow you to interface to a Derecho application with standard RPC packages like RESTful RPC or WCF).

What I find interesting is that (1) Derecho can be understood in terms of transactional consistency for atomic "one-shot" operations. (2) In fact these consistency guarantees extend to aspects of distributed application structure that are not normally considered to need consistency, (3) The really interesting use cases might actually need both Derecho and something like FaRM.

Derecho Primer (mostly duplicates content from the RDMA Derecho posting)

So let's get there step by first. I'll start with a bit of a duplicate, since I discussed Derecho on the RDMA thread too. But just in case you skipped that, this illustrates an application of the kind Derecho can help you create:

In this example, the really small white circles are processes: instances of a program, each running on some computer. The picture thus could span tens or even thousands of nodes in a cluster or datacenter. We’re trying to help you structure large numbers of processes into cooperating applications where each process plays a specific role within the larger picture.

This particular application has external clients that use browser technologies to talk to it: the red squiggles on the top left. For example they might use RESTful RPC, which encodes RPC requests as web pages (and similarly for the replies). REST is widely support but not very fast, which is why we also have a point-to-point RPC layer internal to Derecho. Most applications using Derecho would use our layer internally even if they talk to the external world using REST or something similar.

The next Derecho concept is that of the top level group. This is the full membership of the whole application. In the picture we can see that the top-level group has (at least) 13 members, because there are 13 explicitly shown tiny little white circles. For most uses the members are actually running identical code, but playing distinct roles. How can that be? Well, membership is a kind of input, and each program in a Derecho application knows its respective membership rank (1 of 13, 2 of 13, etc). So this is enough for them to specialize their behavior.

There is a standard way to specialize behavior, and the picture shows that as well: we associate C++ classes with the top-level group. This example has 3 classes: LoadBalancer, CacheLayer, and BackEnd. Each is a normal C++ class, but registered with Derecho. There is also a notification class, used by the back end to talk to the cache layer (purple circles).

Derecho uses little functions to define subgroup membership. They actually are implemented using C++ lambdas. One such lambda told Derecho that the first three members of the top-level group would be the load-balancer, and we see that at the top. Another indicated that the cache layer is sharded; in this configuration it seems to have at least 3 shards. Further, the shards have 3 members each, and there is a shard generator (built into Derecho) that assigned 9 members to this role. Last, we see that the back end store is also sharded, with 2 replicas per shard, and 4 members in total.

Derecho has all sorts of options for sharding (or at least it will when we finish work on v2 of the system). Shards can overlap, there can be members held in reserve as warm standby processes to jump in instantly after a crash, etc. Subgroups and shards can overlap… whatever makes sense to you as the designer.

Next, Derecho has ways to talk to these groups, subgroups or shards: via atomic multicast, which can also be logged into persistent storage. The latter case matches the specification for Paxos, so in effect, Derecho is a scalable Paxos with an in-memory or a disk execution model, and you can select between the two cases when designing the various components of your application. The system automatically cleans up after crashes and makes sure that messages are seen in the same order by all members and so forth.

So with about 15 lines of C++ code per class, more or less, you can instruct Derecho to set up a structure like the one seen here. And then it just takes a line or so more to send an RPC to some specific member, or issue a multicast to a subgroup or shard, or even to query the members of a subgroup or shard. The actual delivery of new membership reports (we call them views), multicasts and RPCs is via upcall to methods your class defines; the style of coding looks like a kind of procedure call with polymorophic arguments, and in the case of queries (a multicast to a group where each invoked function returns a result object), you can iterate over the replies in a for loop.

Derecho is blazingly fast. The API is quite simple and clean, so there isn’t a lot of infrastructure between your code and the RDMA network, and we’ve done some really interesting work to maximize performance of our RDMA sends and multicast operations. Learn more from the Derecho papers on http://www.cs.cornell.edu/ken, and download the open-source distribution from GitHub.

Derecho plus FaRM

Visiting at MSRC got me thinking about cases where you might want both Derecho and a transactional DHT, like FaRM or HERD. I came up with this:

What you see here is two applications, both using Derecho (task A and task B), and then a transactional DHT sitting between them (more properly, present on the same cloud, since all of these span lots of compute nodes).

The case I'm thinking about might arise in a setting like self-driving cars. Suppose that task A is a helper application for self-driving cars on California highway 101. It scales out (hence needs lots of read-oriented replicas in the caching layer), has state (car positions and plans), and needs consistency. And suppose that task B is handling California 280, which is also quite a busy throughway (so it needs its own smart highway helper), but has some overlap with 101. In fact California might need ten thousand of these tasks just to cover all its busy highway segments.

So we have complex scaled-out applications here, that need replication for high availability, fault-tolerance, and very strong consistency, built using Derecho. The internal state of each lives within it, including persistency. But when these tasks share state between tasks -- with external "users", the need is different. In our illustration, we use a transactional DHT to store state shared between these complex, structured applications.

My thinking here is that the DHT is a very good fit for a kind of anonymous sharing: task A wants to report on vehicles that are likely to leave highway 101 for highway 280, but on the other hand doesn't want to "talk directly" to task B about this, perhaps because the sharing model is simply very convenient this way, or perhaps to avoid interdependencies (we don't want task A jamming up if task B is very busy and slow to respond to direct queries). Anyhow, the manager for route 1, along the coast, might periodically do a quick scan to plan ahead for cars that might be leaving 101 to join route 1 in the future, etc. By using the DHT we can support a kind of unplanned sharing.

In contrast the Derecho form of consistency is a great fit for the availability and fault-tolerance needs of our individual tasks -- busy, scaled-out, complex systems, that need consistency. The DHT wouldn't easily support that sort of application: a DHT like FaRM lacks any notion of a complex, long-lived application with state replicated across its member processes. FaRM uses replication, but only for its own availability, not to replication application roles or functionality.

So my belief is that we are heading towards a very large-scale world of demanding IoT applications that will really need both models: replication in the form a system like Derecho offers for application design and structure, and transactional DHTs for anonymous sharing. But I guess we'll have to wait and see whether the application designers faced with building smart highways agree!

Transactions [3]: Replication with consistency (what problem are we solving)?

So the question now arises: what about data replication?

There are really two issues:

Data replication inside the professional database, or the DHT, or whatever the transactional server might be. This is usually completely invisible to the end-user, who just writes little transactional operations and doesn’t need to be aware of how the DHT will handle crashes or similar issues.
Data replication used in the client systems: either out on the edge (like in a smart car, which might need to have a highly available control system, replicated and continuously active even if some component crashes due to a bug or fails because of a hardware glitch), or in the outer tier of the cloud (e.g. inside the cloud datacenter, but in the layer where the client requests are first received: an example might be an information service launched on behalf of that smart car that will run while the car is actively driving, as its cloud-hosted partner for planning the route, anticipating traffic jams, checking for sales at stores you are passing that sell stuff you seem likely to buy, and so forth).

Both kinds of replication matter, and will often need strong replica consistency, but notice that the replication technology would run in different places:

The first kind of replication runs inside the transactional service itself, to keep the service alive and accessible even if something fails while it is running. The earliest work on Lamport’s famous Paxos protocol[1] was something called Viewstamped Replication and was a solution to precisely this problem: Oki and Liskov were building a database and wanted to use replication within it, and ended up with a version of Paxos (it looks just like Lamport’s later Paxos protocol in his famous theoretical treatment a few years later), but deeply integrated with the data storage layer of the database they were building.
The second form of replication runs in a highly available application, where we may be replicating the active state of the program (perhaps, data or data structures it uses as it runs) across a set of program instances that back each-other up to step in if a failure occurs.

In my view, this gets to the core of the distinction. Think back to those edge processes that might be using transactions: in a stateless model, like the one that prevails in the cloud, such a process isn’t fault-tolerant and even if it is playing an important role, like being the representative within the cloud for a self-driving car that connects periodically for updates, the thread doesn’t have any real option for keeping itself alive in the face of routine cloud stuff like elasticity events that shut nodes down (including, perhaps, the node the thread was on). The cloud does things of that kind all the time, without much consideration of application state, because in the prevailing style of coding any kind of state is supposed to be stored into a database, or a DHT.

This has actually worked very well up to now, but as I see it, the world is shifting because with the growth of Internet of Things applications, multi-user gaming, and other kinds of applications that have continuously managed online state and need to be responsive within a few milliseconds, we can no longer trust the cloud to be fast enough to meet the requirement. That self-driving car might have a fail-safe approach to handling outages in its cloud-hosted controller, but it won’t want to activate fail-safe mode unnecessarily. The same is true for smart power grid systems: they can operate as dumb systems, but you often really want high availability.

When you write a distributed program as a set of processes with some form of consistently replicated state, you can also take advantage of having multiple processes to gain extra CPU power, for example to perform actions in parallel. With modern machine-learning systems, this could matter, so as we move towards a world of active AI and ML applications, that sense real-world input and instantly react, we’ll also move towards a world with greater and greater need for replicated state and consistent coordinated actions by programs that view themselves as team players. And if that program is long-running, you need it to also tolerate failures, be able to regenerate a node lost due to a crash by integrating a replacement node into the running system, etc.

These needs argue that the future may be a world with far greater use of HPC clusters running applications coded in languages like MPI, and perhaps also far greater deployment of multi-node applications running on general platforms (elsewhere in this blog, on the RDMA discussion, I suggest that HPC with MPI is not a very general infrastructure and that we won’t soon see MPI on datacenters that use fast Ethernet and have multitenancy). For those cases, we’ll need something different – Cornell’s Derecho system is a response to this specific need.

Back on message, what does this tell us about transactions “versus” consistent replication in Paxos or using atomic multicast (some call this in-memory Paxos or RAM Paxos, but I tend to view it as a different abstraction because Paxos is focused on a stateful model of the protocol itself, whereas atomic multicast is usually finished with a message once it hands it to the application – it isn’t required to log the thing, replay the log later, etc). Derecho has both.

So in the world I expect will prevail, we’ll probably have a mix of stateless threads on the edge with more stateful, replicated, multi-node services using a replication solution to run in a fault-tolerant way on a set of nodes. Personally, I think Derecho is the best way to pull this off, replicating state in that kind of a service – much as we did in all the Isis applications years ago. But if you prefer to use JGroups, LibPaxos, Vsync, RaFT or whatever, I’m not going to try very hard to talk you out of that (basically, a team should use whatever fits best for its goals).

So our edge would not have these highly available replicated services and applications, side by side with today’s stateless threads. And my belief is that these highly available, strongly consistent edge services will sometimes need to talk to one-another (service instance to service instance, not internal to the single replicated program).

For example, suppose our highly available service is in charge of helping my self-driving car navigate the New York City traffic and hit the open roads up to Ithaca. It runs on the cloud, and my self-driving car has an autonomous onboard control system that talks to the service when connectivity is good, getting routing advice and so forth (“Hey Ken, its already 6:30 and we are coming up on a great little restaurant near the Delaware Water Gap: should I reserve a table?”). And since there might be 1M cars on the road in the US Northeast, there could be 1M instances of this helper running on various US Northeast datacenters.

Where does that transactional DHT or database enter in? Well, that helper program might be storing my routing plan (like a flight plan) and other information into a database or DHT in order to get help from various backend services that run on big data, or to share my trajectory with other route-helpers for other nearby cars, etc. Just as the little stateless threads share data via a database or DHT, so would these things.

In contrast, they use consistent replication internally, inside the application, for other purposes, like for high availability, application persistence (e.g. if a node crashes and then restarts), etc. The benefit compared to storing all the state in the database or DHT is that you get continuous realtime control with the ability to ride out failures or other configuration problems, and you also get an actual programming model for leveraging a set of nodes as part of one service. This kind of thing is hard to do with a DHT: if your lightweight stateless thread crashes, who relaunches it, and how does it track down its state? Can it be done in milliseconds or less? Not obvious.

Think about future banking systems, smart power grid, smart buildings in smart cities, and you get a longer and longer list of possible use cases fitting the pattern. Which is why I think we need both forms of consistency: replicated services as well as transactional storage layers.

[1] For fairness, I should note that this is a much-debated question: the invention of Paxos is one of those events that would normally jet people into contention for a Turing Award, although in this particular case, two of the main contenders already have Turing Awards (for other work, but you can win that award twice).

Barbara Liskov is quite insistent that she and Brian Oki invented Paxos and she points to the viewstamped replication paper, which appeared in PODC in 1988.

But my own Isis Toolkit (a data replication tool that I created years earlier than the work Oki and Liskov did) has a protocol called Gbcast in it, for group management, and it bisimulates Paxos, which is a fancy way of saying that any execution of Gbcast matches precisely with an execution of Paxos and vice versa. So in this mathematical sense, Gbcast really is Paxos, and once one realizes this, the mapping from one to the other becomes more or less evident; there is a Wikipedia article that discusses this, under the heading Gbcast Protocol).

The trouble is that without a fair amount of manipulation, Gbcast doesn’t look much like Paxos; you need to really think about it to figure out how to transform one into the other, as explained in that Wikipedia article. So while Gbcast is certainly in the class of protocols that Paxos is in, it isn't clear that one can call it Paxos. If anything, the opposite seems to be true: Gbcast shouldn't be viewed as a Paxos protocol.

This said, Gbcast definitely solves the same problem that Paxos solves. Moreover, since we're discussing chronology here, papers on Isis started to appear around 1985, and included this protocol, and I even gave invited talks on the work at MIT in that period. By 1987 all the main Isis protocols had appeared in major conferences or journals. But again, the Isis version of Paxos didn’t look much like Paxos, and the proof of correctness was way less elegant than Lamport’s proofs, so even I would have a hard time claiming that this was really the first Paxos. What I would say is that those who came later would, mostly, have seen this work.

Then, continuing with fair treatment for all, there was yet another protocol, by Larry Stockmeyer, Nancy Lynch and Cynthia Dwork. The protocol looked more like Paxos than my Gbcast protocol, and had a proof a lot like the one that Lamport later used (not identical), and it was published in the same PODC conference proceedings as the Viewstamped Replication paper, in 1988! So we have my work in 1985, then these papers which came out simultaneously in 1988, and then Paxos which circulated as a technical report starting around 1990, but didn’t get published until 1996.

So, who invented Paxos? I lean towards giving the nod to Barbara and Brian, provided that Gbcast is explicitly acknowledged:

Gbcast was the first practical protocol to solve the consensus problem in a way that bisimulates what we now think of as the Paxos protocol. It was not specified in the same manner as Paxos, nor was the proof much like the contemporary proofs. Let’s say 1985, since that was the date of the first paper on the work, and also the time period where I gave an invited talk at MIT.

Viewstamped Replication looks like Paxos, and has the famous Synod pattern in it (the core of Paxos), so let’s call this a true Paxos, in 1988. But the protocol is very deeply integrated with the transactional database they were trying to replicate, and the PODC paper lacked any proofs, so we can’t guess from that at the proof structure. I’m told that Brian Oki had the proofs in his thesis, but am also told that they didn’t look much like the Paxos proofs. So: Synod protocol, and a Paxos-style use case, but apparently the proof was not in the style that made Lamport’s Paxos work so famous.

The consensus protocol of Stockmeyer, Lynch and Dwork, also in 1988. Another close match, but you need to do some mental mapping to convert it to Paxos (like for Gbcast). Call it a close runner up.

True Paxos: let’s date this 1990, when the TR version was widely circulated for the first time. Has the Synod protocol, the full specification, and the proofs have a modern form. Definitely the winner if you don’t give full credit to Viewstamped Replication.

As noted, both Lamport and Liskov have already won the Turing Award, but the awards didn’t point to Paxos in either case (Leslie won for his overall body of contributions, starting much earlier with work on causality and time, and Barbara won for early innovations in programming languages and modularity).

Transactions [2]: What about NoSQL?

My quick introduction to transactions may seem harsh to people who love NoSQL, because I more or less brush this approach to the side. What's my problem?

To get the lay of the land straight, here's a quick explanation of what NoSQL is about. The idea is mostly associated with key-value storage (also called DHTs, because we hash the keys and implement a distributed hash table): systems that offer a way to do put/get operations with keys of your own design, and values that fit your programming model. We'll talk about transactional key-value stores separately, but the NoSQL community focuses on non-transactional ones.

Given a DHT but no transactions, what do transactions add to the mix? The ACID properties are atomicity, consistency, isolation and durability. Well, if we treat each put or get as a single operation, atomicity comes down to the policy for handling conflicting updates on a single key. Here, it turns out that cloud systems often solve the problem with a finesse: many don't allow updates and operate in an insert-only model, with automatic garbage collection after a specified delay. Thus, for many DHTs, there can't be conflicting updates at the level of a single put: end of story!

Even when there could be updates, if our key maps to a specific node and that node does operations one by one, concurrent updates would still be done in some sequentialized order.

Let's jump past consistency and isolation. What about durability? Well, most DHTs do worry about failures, typically by replicating each data item on the node to which the data maps, but also onto the next k nodes along the key-value mapping space. This way, if k or fewer nodes happen to crash, data won't be lost. But most commonly, the updates are done by routing the request to the first node in the list of k+1, which then forwards it along the side of the DHT structure (normally, a ring). So if the first node performs operations in some deterministic order and a small amount of care is taken to forward data in a way that preserves order, the replicas will be consistent with the primary node.

In a nutshell, we've already summarized the NoSQL concept. Now if you know me at all, you'll know that I really worry when someone expresses a "problem" by describing a "solution". In some sense, this is what happens with NoSQL: given a key-value system that is almost always able to offer ACID guarantees with no extra effort at all, NoSQL systems often stop there and say "look, this is pretty close to transactional, don't fret about it."

Historically, the NoSQL movement originated when transactional database systems started to hit serious scalability limits (read the wonderful little paper by Jim Gray, Pat Helland, Patrick O'Neil and Dennis Shasha on this if you haven't yet seen it: they discuss "The Dangers of Database Replication, and a Solution"). Eric Brewer put Jim's point into a very pithy form and proposed a principle he calls CAP, claiming that consistency, availability and partition tolerance are deeply at odds, so much so that you can only get 2 out of the 3. Eric recommended giving up on consistency: just say no to acid!

And this turns out to work, at least sometimes. The eBay folks, and Werner Vogels at Amazon, came up with BASE: a methodology for coding with a NoSQL DHT. Basically, go ahead and design a transactional system, just as you were taught to do at Cornell (if you opted for some other school, well, hopefully they did a good job too...). But now, get rid of the transaction begin and commit. Go ahead: just do it!

Well, without transactions, you'll see a mostly ACID execution for individual put or get operations (mostly, not always), and of course multi-operation sequences can overlap, or be disrupted by a crash. So the next step is to just pause and think hard about the surprises you might encounter. The whole point of BASE is that for many web applications, such as shopping on Amazon or picking a movie from Netflix, or bidding on eBay auctions, those surprises won't be so hard to anticipate, or to fix.

So, having thought deeply and identified all the possible oddities... just fix your code to not crash when surprises happen -- instead, just do what you can to hide them from the user. End of story! Your system will scale, and nobody will be the wiser about any nasty little inconsistencies that might arise. In some systems, one also adds a background fixer-upper to check for oddities in the database and repair them (you need to come up with a plan for that too).

Don't get me wrong: BASE works incredibly well for most web systems, and you can scale a NoSQL DHT pretty much without limits. So if you are able to follow this route, you will get super performance, amazing scalability, and your users will see the magic sub-100ms response times the web uses as its main design point. Indeed, since a NoSQL system is potentially inconsistent by design, you can toss caching in anywhere you like, for further speedups. NoSQL storage systems and the BASE methodology were a home run for the web, without any question.

The issue arises if you work on mission-critical applications with strong safety needs, like medical care systems for use in a hospital, or self-driving cars. Bankers tend to think money is a life-or-death matter, so they might put financial applications on the list. Sales people get upset if the customer relationship data is messed up. So when these kinds of critical computing uses are layered over the NoSQL/BASE model, we have an issue: inconsistency might not be so easy to sweep under the rug.

For example:

Doctor Smith noticed that patient Sally Jones was having a reaction to a medication, and took her off it. But the BASE/NoSQL system misfiled the update and now the nursing station just gave her another dose. Oops!
The self-driving car you are in just registered its plan to turn left up on Main Street. But the car planning database dropped that update, and the bus on Main Street believes that the intersection will be completely clear. Nasty fender-bender...
The smart power grid just configured itself for an unusually low demand period, but that was a mistake; actually the database system was in a forgetful mood, and neglected to log ten thousand kilowatt hours of demand. Blackout!

The point being this: not every application can manage with weak consistency. CAP is fine, for web browsing (the proof being that people use it very successfully, Eric Brewer's company Inktomi was an immense success, and CAP seems like a religion in some communities). But CAP isn't a theorem and in fact there are cases where we need more.

What should we do about this? One option is to mix ACID with BASE: you get SALT, my colleague Lorenzo Alvisi's cool new system. The basic idea is to study the transactions in isolation, offline, and form groupings: if transactions of flavors A, B and C are all in use at the same time, perhaps we need a costly form of transactional concurrency control, but if we are only running class A transactions, we might be able to manage safely using a NoSQL model, and perhaps when B and C run but not A, there is some small trick that suffices to give full ACID properties, etc. Lorenzo gets remarkably far this way. I should have come up with this idea myself!

Another option is to build a transactional DHT. There are several exciting research demonstrations of this concept by now: FaRM from Microsoft (I visited with the FaRM group for a few months in fall of 2016), HERD from CMU, and there are others too. For example, my colleague Gun Sirer has an amazing multi-dimensional DHT called Hyperdex that scales incredibly well and has full ACID properties. In fact all of these solutions work well, and if your application matches their guarantees, they can work so well that you are left puzzled by CAP: is it even more false than we thought? Could it be that we really can have all three properties, and cheaply? I think that more and more, the answer is yes!

For me, the bottom line is this: NoSQL and BASE aren't a terrible idea, if your application didn't need ACID in any case. Eliminating unneeded synchronization delays, without breaking correctness, is a very useful form of optimization. But the whole point is that if you didn't need ACID, then in some sense, you aren't sacrificing anything when you optimize this way.

On the other hand, if you actually do need strong properties, you would be a dangerous fool to assume that CAP is a theorem and that the right answer is to sweep inconsistency under the rug. That self-driving car, or the patient in that ICU -- they might notice, and could be harmed. NoSQL is fine, for those who don't need the full ACID and SQL model. But if you do need the model, use one of these transactional DHTs. They really work, and they scale extremely well too.

Some thoughts about anonymity in the BitCoin BlockChain protocol

BitCoin has spawned a whole series of information revolutions: some concerned with digital currency, some more focused around anonymity, and some concerned with the potential for cryptographic block chains. I thought I might say a few words on the topic. But I should maybe remark first that I’m not a world expert on this topic, although I’ve read a great deal and talked with people who are doing cutting edge research. So take these as a few thoughts from someone on the periphery.

I didn't give much thought to BitCoin for a long time: obviously, I was aware of it, and read about arrests of the folks running Silk Road, which was a dark web site (meaning you accessed it over cryptographically tunneled links) hosting a kind of Amazon.com of illicit substances, prostitution, and apparently even murder for hire. I won’t summarize the story, but I definitely recommend reading about it. Anyhow, Silk Road transactions were all denominated in BitCoin, because the currency was reasonably widely available, and offers anonymity.

But I always viewed BitCoin as an interesting application built using a technically uninteresting protocol: we've worked with replicated append-only persistent logging for ages (after all, this is what Paxos is about), and Byzantine versions too (consider the work that Castro and Liskov did at MIT on practical Byzantine replication). Thus the fact that BitCoin had such a protocol within it was kind of unimportant to me.

But sometimes, a technology takes hold and you are sort of forced to begin to think about it. In my crowd, people talk about BitCoin quite a lot these days, and are getting interested in BlockChain technology. So I keep finding myself pulled into discussions about the technology. The topic I think is worth discussing here isn’t specific to the way BitCoin works, but relates to its BlockChain.

We can understand Bitcoin in terms of distinct aspects:

BitCoin is a digital currency (or commodity) and people buy and sell them, and use them in transactions. So one dimension is concerned with BitCoin in this role as currency.
BitCoin centers on a style of agreement protocol that builds an append-only log of transactions: the BlockChain. The specific BitCoin log has the difficulty that because the protocol itself is capable of rollback (meaning that sometime after block A is appending to the BlockChain, subsequent events can replace A with some other block B, invalidating transactions contained in A), there is never absolute certainty in the actual final state of the BlockChain. In fact, although rollbacks of more than 3 blocks at the tail of the BlockChain are very rare, an omniscient adversary could construct a situation that would roll the entire BitCoin BlockChain back to the first block[1].
The protocol is designed for anonymous participants and is intended to tolerate some level of Byzantine mischief, but the precise conditions for progress are hard to pin down, in part because arbitrary rollback is a part of the model.

The protocol has an implied assumption that the network is fast enough to fully replicate updates (via gossip broadcast) within a short amount of time, which would normally be seconds when computers are active. Of course when a computer reconnects after being shut down for a while, it takes longer because it will need to catch up. This assumption isn’t really stated either.

The conditions under which a particular BitCoin transaction block has become fully stable (would never roll back) are somewhat fuzzy, but because a rollback-free BlockChain prefix is strong enough to achieve consensus, cannot be weaker than the bounds for progress in a consensus (uniform agreement) system operating with a partially synchronous network. This question was studied formally and results by Lynch, Keidar and others would seemingly apply.

In fact it may be possible to show that in a rollback-free Blockchain prefix, created by the Bitcoin protocol, there is a sense in which the protocol runs as a series of epochs, with the (anonymous) members of the present epoch effectively voting in the (anonymous) members that will comprise the next epoch. I’ve wanted to look at the problem carefully, but haven’t had the required time (anyhow, the proof, if I am correct, might really tax my formal skills). If you can prove this, I’m happy for you to take credit, but if you didn’t formulate the problem prior to reading this blog entry, I would appreciate credit for “proposing the problem”!)

The solution makes heavy use of anonymity. The participating endpoint computers that hold Bitcoins and mine for new Blockchain extensions are all named by cryptographic keys, and can create new names for themselves as often as they like. The proof-of-work aspect of the Blockchain protocol prevents what are called Sybil attacks, in which some computer hijacks a system by pretending to be an enormous number of computers operating in concert. Without anonymity, and fear of Byzantine behavior, BitCoin’s, proof of work would not be needed.

The power of the BlockChain

The modern view is that the concept of a BlockChain be taken as a separate entity: in this modern perspective, BitCoin needs a BlockChain with certain additional special properties, and implements one using its own probabilistic protocol.

Viewed in this more modern way, a BlockChain is simply:

An append-only log of records.
The records are ordered, which is evident from the first requirement, but also include a cryptographic signature that somehow witnesses the prior blocks. Thus the integrity of the BlockChain can easily be checked by scanning it front to back or back to front, confirming that each record correctly countersigns the prior ones.
The content of each of the blocks in the BlockChain is similarly protected: each block has a cryptographic signature spanning the information it holds. Various schemes can be used: a hash over the records, a Merkle tree of signatures, etc. But the upshot is that a valid BlockChain has a form of cryptographic proof of integrity built into it. An intruder who seeks to tamper with the chain would need to rewrite the entire suffix starting with the modified record, and the deployment would often make this prohibitively difficult (for example, in BitCoin, the whole BlockChain is fully replicated to all participants).

Described in this way, a BlockChain could be implemented in many ways -- BitCoin implements it as a protocol between what it portrays as anonymous, Byzantine participants, but actually you could store your BlockChain just as easily on a standard cloud computing framework like Azure, and it could store all sorts of data -- not just BitCoin transactions. In fact, and I'll expand on this below, you might be wise to do this (to use a more standard way of storing your BlockChain). I say this because many of the purported properties of the BitCoin protocol simply do not hold for the protocol BitCoin implements. The protocol is just wrong. In fact I'm not sure it even deserves to be called "wrong" because to be wrong, it would need a clear specification, so that I could say "in such and such a situation, it violates its specification". But BitCoin doesn't even have a real specification for its BlockChain: this is a case of a solution without a problem statement. So how can it be wrong? It isn't even wrong: wrong would be better!

Let's try and break things down and tackle them step by step.

First, just for clarity, what I am pointing out is this: obviously, the actual records within a BlockChain could record digital currency transactions, which is the only use made of them by BitCoin, but you can also create infrastructure to store far more elaborate information into the chain, and could then store them into any kind of database that won't lose them or permit tampering. There are a number of new commercial products that focus on this idea: they define higher level languages for encoding digital contracts and then store them into some form of highly reliable storage system.

Where BlockChain systems depart from this is that they implement the BlockChain as a highly replicated structure: every single BitCoin participant ends up with a full copy and sees every update to it (which are always in the forms of appends: new records that should extend the length of the chain). So we have the abstraction of a record recording transactions, and then we have the abstraction of an append-only log with cryptographic tamperproofing, and finally we have a way to implement such a log, over what turns out to be a gossip protocol. These are protocols in which system participants share full information with one-another: A contacts B, and then A sends B whatever A knows that B lacks, and vice versa. Gossip can be very robust, and BitCoin benefits from that. A further advantage is that gossip can operate without tracking the full system membership -- transitive coverage of the full set is sufficient. However, lets return to this below, because it isn't as simple as it sounds.

The digital contract languages can be quite elaborate: a digital contract could refer to variables defined by prior records (for example, in BitCoin, each coin actually has a fractional value defined by the transaction that created it), or even in a future record (“Edward agrees to sell Ken 1 to 5 sheep at a price of 75 euros per animal, contract to be consummated by December 15 of 2016, in default of which Ken would pay Edward a 15 euro cancellation fee.”). So here we see references to future payments, conditional outcomes, etc, all of which would be evaluated as a function of the state of the BlockChain, and could evolve as the chain grows longer.

Portions of such a contract could be concealed by further layers of cryptography. For example, a digital BlockChain service could log a record on behalf of Edward and Ken without knowing its contents. Later, either party could demonstrate to an impartial judge that the record was logged and then (if the two parties share the decryption key) could unseal the hidden content, revealing to the judge that the contract included such-and-such terms.

This concept, however, touches upon a problem of multiparty commit: extending the BlockChain with such a record in a manner that neither party can later repudiate requires a protocol enabling us to prove (1) that both parties desired to commit this particular record, (2) the record itself was not tampered with, (3) misbehavior by the parties cannot somehow cause the entire system to fail, or render a record inaccessible relative to the original access guarantees. Such problems can be solved, but they need to be carefully specified and the solutions proved correct.

Notice that absolutely nothing in the above requires that a BlockChain be anonymous. In fact, a BlockChain can be operated on well known servers by a company, perhaps a bank, that is completely open about the identifies of every party. BlockChain is a concept orthogonal to anonymity.

In fact, many banks are becoming interested in serving in this role: offering BlockChain services to their clients, for a fee, just as banks offer safe-deposit boxes. And with cryptographic sealing, a BlockChain record can be a kind of digital safe-deposit box, holding something on behalf of the customer that the bank itself has no way to “see”, because it holds encrypted data and lacks the encryption keys, nor would it have any way to guess them or require the customer to produce them.

This said, the same community that created BitCoin has been extremely interested in a kind of anonymous federation in which the user could define his/her own notion of trust (for example, I trust Citizen’s Bank of Ithaca, and it trusts the Alternatives Federal Credit Union and the Cornell Federal Credit Union, and those credit unions trust the National Association of Credit Unions…) and then to define transactions over these trust sets. The problem quickly becomes very interesting (when a professor uses the term “interesting” that normally means “a topic needing a great deal of research”).

I’m slightly skeptical (when a professor uses the term “slightly skeptical” he or she means “don’t agree with, at all”) that banks would ever engage in anonymous transactions, especially where financial contracts are involved. So my belief is that the world of anonymous BlockChain transactions will be a non-banking world: some form of global barter community that might use cryptocurrencies like BitCoin or Ethereum and transact through next-generation anonymous BlockChain protocols in which the participants are fully self-defined and autonomous. But meanwhile, the banking community might begin to offer digital safe-deposit boxes, implementing them in a completely distinct manner.

Banking with BlockChains

What then might a future bank wish to do with BlockChains?

Hold digital contracts on behalf of the bank itself and its customers (which to the bank, would not be anonymous entities, because they would pay for the service).

Deploy the solution in a highly fault-tolerant and secured manner, protected against tampering.

A banking BlockChain should have zero probability of rollback, so these protocols will need to be more like Paxos: protocols that guarantee agreement on ordering and on durability, with stability (in a formal sense, a logical property is stable if it once it holds, it holds forever).

Guarantee that the solution is compliant with the relevant financial records custody requirements. These rules can be quite complex: transactions subject to audit or required for tax compliance may need to be held for N years but then provably destroyed once N years have elapsed, and banks may be required to track and disclose certain kinds of transactions to the relevant authorities. The law probably will need some time to catch up with the technology in this rapidly evolving area, but it seems clear that it will be a vibrant are of future growth for the industry.

There are fascinating questions that arise when a bank has multiple BlockChains in its multiple branches, or when it transacts with other banks. For example, suppose that the BlockChain for branch A of a bank records the transaction mentioned above (“Edward agrees to sell Ken…”) but the payment by Ken to Edward is recorded into a different BlockChain. This seems to create a fault-tolerance threat: if the second chain was a different bank branch, could an earthquake or a flood somehow render that chain inaccessible and void the transaction, or at least make it impossible to validate? What about cross-bank events? What happens if a bank later fails?

It seems to me that there are some very interesting theory questions here, and it would be fun to try and pose them and develop a rigorous theory of BlockChains for banking.

The existing BitCoin community might not be enthusiastic about such work, because of their long history of working with BitCoin and its strong assumptions about anonymity, and its use of protocols that can roll back. So financial cryptography may simply need to follow its own path.

My own take on the problems stated earlier are that they suggest a need for cross-BlockChain protocols that provably witness information, so that BlockChain A can learn information from BlockChain B in a manner that A can safely record into its own chain, with no risk of later repudiation. This would let Ken’s payment to Edward be logged by BlockChain B in the Trumansburg branch of Tompkins County Trust, but then would let Edward query the Citizen’s Bank of Ithaca to learn that yes, he has been paid and should hand over the sheep: BlockChain A would be the one operated by the Citizen’s Bank, and it would run a protocol by which it learned of the payment from chain B in a safe and secure way. With a bit of work to flesh out the details (for example, does B proactively report the payment to A, or should it wait for A to inquire?), this can certainly be made to work.

There will always be an element of trust. For example, how can we be confident that Ken really paid the bill at bank B? How can we be confident that Edward actually handed over the sheep? The interface of real-world events to computational events and data records clearly needs attention.

I do not believe in full replication among banks. So while in this example, A learns something from B, in general, A’s records would live entirely within A. We do need to ask what the rules would be for performing operations that require replication, or that require cross-bank protocols. But in general, each BlockChain should be understood as an autonomous system holding private data, and interacting with other systems only under the overarching control of those rules.

Unlike BlockChains with anonymous Byzantine participants, where proof of work is also a protection against a denial of service attack in the form of a flood of transactions that overloads the system, financial BlockChain systems wouldn’t really need any form of proof of work, because they are operated by trusted servers running trusted code (at least, code justifying the same level of trust as we accord to the operating system, the database system, and the various banking application programs). We might still use Elliptic Curve cryptographic systems to make our BlockChains tamperproof, but the entire “social infrastructure” BitCoin and much of the BlockChain world seeks to build is rendered unnecessary in a banking setting.

Indeed, there is absolutely no reason that banking BlockChains would need to run in slow motion. They could potentially log any rate of transactions desired.

Ken’s take on all this stuff

Clearly there is a lot one could do in this space; if I wasn’t busy with Derecho, I might move into it. But I’m far more drawn to the banking style of BlockChain than to the anonymous Byzantine style that prevails in the field today. My reasons are simple: I honestly think that BitCoin and its cousins are ill-specified and in some ways, provably broken. How can one ever trust a currency if the records can potentially be invalidated years from now simply because Virgin Space starts a tourism service to Mars? To me the answer is obvious: we can’t. Not “it really never happens, don’t worry about it” but “no.” And once one rejects anonymity and Byzantine behavior and so forth – rejects, in some sense, the political agenda that the Satoshi Nakamoto manifesto set forth at the outset, we’re left with a fairly standard, recognizable form of distributed computing service, with replication for fault-tolerance and high availability, and with strongly consistent cross-site protocols. This class of questions is solidly in my area of interest.

[1] To trigger this, you need a partitioned situation in which a subgroup of BitCoin miners operates in total isolation. For example, perhaps you bring BitCoin mining software with you on your one-way trip to Mars and plan to mine for coins to while away the rest of your life there. A communications breakdown cuts you off from Earth, and blocks you from reporting your discovery of alien supercomputers that use a mysterious technology to solve elliptic curve cryptography problems. Using this technology, you create a blockchain far longer than the one on Earth. Now, miraculously, the new Virgin Mars Shuttle shows up to rescue you and drop off the first of the Mars tourists, and you are able to merge your BlockChain with the one on Earth. But yours is twice as long, so the entire Earth BlockChain rolls back, invalidating all the transactions that occurred during your years of isolation. (Valid coins that were created before your departure and then spent in the now-rolled-back transactions revert to their earlier owners, who get to spend them again, but the bad news is that coins minted during your absence become invalid, as do coins received through transactions in the rolled-back portion of the BlockChain).

A BitCoin backgrounder written by a guy who knows nothing about Bitcoin...

There is a saying that those who can, do, and those who can't, teach. In this particular area, I'm definitely not an expert. But even so, I thought I might jot down things I've learned about BitCoin from people who do seem to know their stuff. Ask one of them if you have a hard question.

So first of all, BitCoin is intended as a kind of anonymous currency: money, used to buy things, but managed in a way that completely hides the identity of the two (or more) parties to each transaction. At the end of the day, of course, you may prefer actual currency. So the basic model is that you get BitCoins by exchanging actual money (or other goods) for BitCoins, and this is also how you cash out. BitCoin is one of many digital currencies these days, but is probably the most widely used.
The anonymity aspect is very important to the BitCoin community. Banks don't like this because they have a legal obligation to know their customers and to be able to explain who they received money from, and who it was transferred to. Thus part of the BitCoin ecosystem is a new kind of banking model, independent of classic banking.
For the purist, BitCoin is not actually considered to be a currency, at least not yet: BitCoins are classified as commodities by the federal tax authorities in the US. Theoretically, you are supposed to declare how much you paid for your BitCoins, and how much you sold them for, and pay capital gains tax on the difference, and then for products with BitCoin pricing, there is a way to declare and pay taxes on barter transactions. I don’t know how many of the people who transact using Bitcoins actually do this.
The protocol was invented by Satoshi Nakamura, seemingly an assumed name (it turns out that there is a real Satoshi Nakamura, but he doesn’t seem likely to be the inventor and perhaps is better seen as a victim of identity theft, in this context). In fact there is a long history of mathematicians banding together and writing things under assumed names (Bourbaki, for example). Given its relatively sophisticated choice of cryptographic system (elliptic curve cryptography, fundamental to BitCoin, produces large keys and is fairly slow, hence is not widely used in today’s cryptographic systems. However, it was a great choice for BitCoin, which needed a way to construct a computationally hard puzzle, and also needs immunity to scenarios in which a future quantum computing breakthrough could render RSA ineffective. But this level of knowledge is typical of students who do well in a modern cryptography course. In fact, the image of students doing this makes some sense: the actual BitCoin manifesto (the origin document for this whole field) is poorly written, making it unlikely that the authors were academic scholars. There would be a very easy way to prove that you were one of the original inventors, if so inclined: you could simply reveal that you hold one of the original BitCoins, by spending it: the inventors generated several billion dollars worth of BitCoins early in the blockchain. But there may be reasons that they wouldn’t wish to spend those. First, having it become known that you were one of the wealthiest people in the world through your holdings of BitCoin could make you and your family a target for criminals. Further, the whole point of the original manifesto is political: it puts the protocol forward as a tool for breaking down the capitalist establishment. A person deeply committed to this view would very likely have destroyed the original BitCoins. Having done so, it would actually become very difficult to prove that one had created the system. Further, given this philosophy, “coming out” to claim the associated fame would be a betrayal of principle.

A bit more technical:

A BitCoin is actually an entry in a shared ledger, called the BitCoin blockchain. These days we consider the BitCoin blockchain to be one instance of a more general technology – blockchains, in this sense, are an enabler for other things too.

Everything in BitCoin is anonymous. A coin is just a randomly generated unique identifier. The endpoints that transact BitCoins are themselves identified by random unique ids (so don’t misplace your digital identity, because this system lacks an id recovery tool). You can create new ids as needed, and can have many ids, and can discard ones you no longer plan to use.

In some sense, an endpoint id identifies a component running the BitCoin protocol, and the BitCoin ids represent the form of data that these protocols talk about. There is no ledger that connects your real-world identity to your endpoint ids, and possession of the BitCoin id is already proof that you “own” that coin.

A BitCoin can only been spent once. Each transaction generates new coins, even if the transaction didn’t fragment one coin into multiple subcoins or combine multiple coin fragments into a larger coin (each coin has an id, but also has a value, which is denominated in fractions of the official BitCoin currency. 1 BitCoin was worth $733 the day I wrote this Blog posting, about double the value from a year ago. In fact, BitCoin values have been fairly volatile: it peaked at $1000 in 2014, about 2 years ago, yet was almost devoid of value if you look back to 2013 or earlier, and dipped to $250 early in 2016.

The particular Bitcoin implementation has some interesting properties:

It assumes a network layer that uses a form of anonymous gossip to rapidly propagate messages: any machine using the BitCoin software is supposed to be more or less current in the sense of having the same broadcast messages as all the other machines, within a delay of a few minutes (so: not microseconds, but not years either).

The participants are assumed to be Byzantine: they will violate the protocol properties if it is in their own self-interest to do so, and this could mean lying in arbitrary ways.

BitCoin depends upon public key cryptography: it uses double SHA256 for mining/block creation; and a protocol called Elliptic Curve Digital Signatures (ECDSA) for private/pub keys and signatures. ECDSA is slow, but quite strong compared to other popular methods, like RSA.

The BitCoin protocol uses a form of consensus to fully replicate the Blockchain: each new broadcast proposes an extension to the Blockchain, which needs to countersign the prior end of the Blockchain and then reports some BitcCin transactions, and is signed with a form of cryptographic seal that actually solves a computationally hard puzzle (specifically, the block includes a nonce value that must be set so that the hash of the block has a given number of leading zeros in it. The BitCoin ledger comes with a dynamic rule for deciding how many zeros are required: once every 2016 blocks, the rule is recomputed. This value is actually picked so that blocks will be created approximately once every 10 minutes. You can estimate the total hashing performance of the entire BitCoin network from the rate: as of the end of 2016, the network was computing approximately 2 billion billion hashes per second (the computers in the network are mostly equipped with special hardware – your desktop wouldn’t be able to compute hashes quickly enough to ever manage to publish a new block before some other hardware-accelerated system would beat it to the punch).

By publishing an extension to the blockchain, a computer earns money in the form of fractional BitCoins (specifically, a tax on the transactions in the block, plus a reward for finding the hash). The reward is gradually diminishing, but the belief is that the tax is enough of an incentive to keep the network running even when the reward goes to zero.

There can definitely be races in which two or more Blockchain extensions get proposed simultaneously. The rule is that any participant will adopt the longest valid Blockchain it knows about, even if this means rolling back from what it previously thought was the suffix of the Blockchain (e.g. some machine X might initially believe that A is the last block, but then learn about an extension B, C and since that extension is longer, it would accept the extension even though this causes X to throw A away).

Experience shows that rollbacks are pretty rare. In fact the longer that BitCoin has been running, the less frequent the rollbacks have become and the shorter the average rollback. Just the same, if your friend sells you a pack of chewing gum for a fraction of a BitCoin, don’t be upset if she waits until a few minutes have passed (until your transaction has been published, and then a few more blocks have been published that extend the chain beyond the one with your transaction in it) before handing over the pack of gum.

Various factors limit the number of transactions per block in BitCoin. As of late 2016 the average block contained about 2000 transactions, up from 500 two years earlier. But this means that the entire global BitCoin network is actually logging just 288,000 transactions per day. That number gives a glimpse into the limitations of the model.

And it isn’t necessarily perfect:

The transaction rate is basically limited and is way too low for genuine global use of BitCoin in any kind of serious way.

My colleagues Ittay Eyal and Gun Sirer showed that BitCoin is actually unfair, in the sense that a cartel of miners can earn more than its fair share of the BitCoin rewards, even though the official BitCoin protocol precludes cartel-like behavior (to cheat, the cartel uses a modified broadcast protocol and only shares its updates within the cartel members, reporting a few blocks at a time to the outside world rather than publishing each block as it is minted). In effect, the cartel cheats by splitting the work of mining (which is not unusual), but also as soon as some member finds an extension, reorienting the cartel to mine the extension. Meanwhile, non-cartel members are wasting time mining the original chain. By the time the cartel publishes its extension (maybe 3 blocks all released at once), the rest of the world has wasted time and only ended up with perhaps 2 blocks worth of extension, which then get rolled back. So the cartel gets rewarded for keeping its discoveries secret. There result is basically a kind of game theory analysis of the standard protocol and illustrates the sense in which the protocol really isn’t flawless.

As it turns out, Eyal and Sirer also have a protocol to fix the problem (called BitCoin NG), and it has some additional benefits too, notably that it allows more than one extension to the Blockchain at a time and by doing that, gets rid of the limitation on how many transactions can be logged per hour. But nobody knows whether it will gain wide acceptance.