Thursday, 9 August 2018

Magical Thinking and the Logical Foundations of BlockChains

During the past few years, I've been exposed to an unrelenting drumbeat for BlockChains.  The level of enthusiasm for this model, and the commercial mania around it, have all the elements of a "market bubble".  Just yesterday I saw a quote from a BlockChain/CyberCoin billionaire who believes that "BlockChain will replace the Internet."  Really?  But search for that phrase and you'll actually find that this guy is saying something many people believe.  Rational or not, there is a huge community totally convinced that the future will be a giant BlockChain.

The BlockChain buzz was evident at the recent conference I attended, where one speaker told us about a Berkeley spin-of based on BlockChain: Oasis, which just landed $45M in first-round "seed" funding.  Just think about that for a moment: how can such a number be justified?  I'm a skeptic.

Oasis apparently plans to build a secure infrastructure using BlockChain as the storage solution, but entirely secure from the ground up.  Presumably this is one reason Oasis needed so much cash: most companies these days just run on a public cloud like Azure or Amazon (or any of the others).  Building from the ground up will be expensive, but would let Oasis avoid a pitfall other startups share: because its competitors depend on existing datacenters, they are exposed to whatever security flaws those cloud platforms embody.

But can one build a secure data center from the ground up?  Let's focus purely on storage, since this seems to be the essence of the Oasis plan.  Could one build a new kind of secure data center that (only) provides secure storage using BlockChain, built from the ground up?

The first step is to reject the permissionless BlockChain model, which is too weak to guarantee freedom from rollbacks even years after a transaction seemingly commits: permissionless BlockChain systems with anonymous servers are insecure by design.  We want a minimal BlockChain solution, but if you take minimal to mean "anonymous, globally replicated, permissionless", my answer is that "it can't be done: it is impossible to create a trustworthy platform with that mix of properties and technologies."

Fortunately, the permissioned model avoids this risk. Moreover, it seems reasonable to assume that if the goal is to offer a data-center BlockChain product, the data center operator can control which machines can operate the solution.  This moves us from the BitCoin style of anonymous, amorphic, fully decentralized BlockChain to a more standard model of append-only files used by customers like banks.

Next, we should perhaps reject a standard cloud that offers encrypted append-only files. This is an interesting step in the analysis because a block chain really is just a secure append-only file, no matter what anyone might tell you (secured using SHA 256 hashes or similar block hashes, with proof-of-work if the system is permissionless, and then with the signatures entangled to prevent tampering).  Any file system could play that role, if you code the needed logic to generate records formatted this way and with the required chain of attestation.  Amazon and Azure and other cloud companies already offer extremely secure storage layers, including BlockChain services. But as noted, they do depend on other elements of the respective datacenter systems.  So out with the old, and in with the new!

Now, without knowing anything about the proprietary protocols that Oasis is presumably designing,  I can't say anything about how they plan to guarantee correctness.  But I can tell you how I would do it.  I would use a form of Paxos, and because I would want extreme speed, would go with the version of durable Paxos that we implemented in Cornell's Derecho system.  If I were chief architect at Oasis, I might want to build my own software (hence not use Derecho), but I would certainly adopt the Paxos specification, and prove that my software implements it.

Of course being a Derecho zealot, I'm going to make the case that using Derecho might be the smart move, even so.

First, I should note that by using Paxos, Derecho is able to leverage decades of work on  proofs of correctness -- Derecho is actually implemented by fusing a  proved-correct version of Paxos  integrated with a proved-correct version of the virtual synchrony membership management model and a new reliable (but not atomic) multicast layer.  Then all of the data movement steps are mapped to the available storage (SSD or 3-D XPoint) and communications technologies (RDMA or TCP) in an efficient way.  Finally, Derecho uses specialized optimizations for code paths that turn out to be performance-critical.

Next, it is worth noting that Derecho is open source, hence can benefit from community contributions over time, and also extremely fast -- the fastest such system ever created.  Further, it actually achieves minimal bounds for Paxos (Derecho's protocols are "optimal" in a formal sense).  Of course one can talk about smaller constants and so forth, but given the speed of the system, there is an obvious case to be made that the constants really aren't showing any sign of being unreasonably large.  So Derecho is quite an interesting option!  Heck, maybe my students and I should try and raise $50M or so and jump into the same commercial space!

But now we run into the first of a series of issues: the protocols and proofs we are leveraging were all created in the usual way: on paper, by hand.  The mapping didn't modify the underlying logic, yet it required to adapt them to modern networking is by hand, too, and hence also needed to be proved by hand.  The same can be said about our dozens of optimizations.

Is there really a sensible way to argue that all of these hand-written proofs be accepted as somehow being "more trustworthy" than Amazon AWS or Azure?  After all, those are companies are serious about specifications too, and further, both have invested hundreds of millions of dollars in their testing and Q/A process.  Derecho is remarkably robust, and we would point to all those proofs as part of the reason, but even so, we do find bugs in our logic.

Now, if I had the money, one option might be to harden Derecho.  Aggressive testing, deep quality assurance and better documentation would be a breeze.   Starting with such a strong prototype, we could quickly end up with a professional-quality tool.  In fact I actually hope to do this, in the next year or so, if Derecho users are able to chip in to help with the costs.

But perhaps you still wouldn't be satisfied.  And indeed, today's state of the art goes much further: the very best approach is to only run code extracted from a machine-checked proof.  In effect, we compile the proof into the running system and take the developers entirely off the development path.

This, though, turns out to be infeasible for software as elaborate as Derecho, and would be even harder for whatever Oasis really plans to build.  The issue here is that as of today, the provers can't handle the style of code that we needed to use in order to create Derecho, and any full data-center infrastructure would have 10x more such code, and perhaps far more than just 10x.

Today's best provers actually can handle automated extraction of Paxos code from correctness proofs (this has been done using NuPRL here at Cornell, or IOA and Larch in Nancy Lynch's group at MIT, or IronFleet, using Dafny at Microsoft).  The resulting solutions are very robust in the sense of the logical proof structures that result.  Unfortunately, however, the resulting code is incredibly slow compared to Derecho.  Moreover, infrastructure aspects are much harder to formalize than data replication and consistency, so in some sense, Paxos is the "easy" part of the story.  Can we specify the system, fully?  Is the specification itself bullet-proof?  These steps are much harder than many people would expect.  They may not even be possible, in some cases!

This same pattern is evident in many projects that have formalized aspects of operating systems or runtime environments.  At MIT, Nick Zeldovich famously proved a very simple file system correct a few years ago.  It ran on Linux, but there is a Linux u-kernel called SEL4 that has been proved correct, and while it doesn't cover all of Linux, SEL4 probably has enough stuff in it to support Nick's provably correct file system.

Then you could perhaps used the proved version of the C compiler to compile the thing (C++ is considered to be too complex for existing provers).  Even better would be to just build the whole thing in a language like RUST or Dafny, where proof is much more central to the coding and compilation model.  With such help, you may actually manage to create a proved solution, for parts of your full system.  

But without exception, you'll end up with slow code compared to the best possible, and will have solved just a portion of the entire datacenter infrastructure.  We are decades from having data-center scale, provably correct , ultra-performant software systems.  Perhaps we'll never get there.

Moreover, even if we did manage to do so, we would run into further questions.

One big issue is the hardware.  Think about any hardware component in the entire data center.  The routers.  Printers. Digital telephone systems.  Storage devices.

$45M may seem like a huge amount of money, but in fact it is a tiny drop in the bucket when you consider that companies like Intel spend billions on their VLSI chip fabrication factories.  So there is simply no question that Oasis will end up using components some other company created.

The problem is that these components tend to be software-defined, and this is becoming more and more of a standard story: Almost all hardware components have general-purpose, highly capable processors within them.  An entire separate computer, with its own memory, network interfaces and code.

Indeed, if you needed one just to operate the hardware, why not include two?  And if so, why not drop malicious code into the second unit?  Your printer company may not even realize that it is selling compromised devices: I've heard of network chips that had entire secondary networking infrastructures built in, operated by entire stealth operating systems.

So you can perhaps make the higher layers provably secure, but this issue of trust in the hardware limits the degree that the resulting system could be secure.

If you run a secure solution on compromised hardware, all bets are off.  The situation is a bit better if you trust the hardware: Intel has an approach called SGX that can do a bit better, and perhaps Oasis plans to leverage it, but if so, they will face performance challenges.  Sadly, SGX is quite slow.

But suppose they pull all of this off: a ground-up datacenter solution, minimal trust in the hardware, offering a BlockChain storage layer to customers.  Now we run into a new issue: the issue arises of how to draw the boundary between the trusted storage solution and the customer's application.

The problem here has to do with composing a trusted application with a trusted storage layer through some form of less-trusted layer of intermediary logic, like the runtime associated with the programming language.  Modern applications are coded in standard languages like Java, Python, C++, Ruby.  They use databases from big vendors like Oracle, web servers like Apache, and so forth... and each of brings its own millions of lines of logic and its own runtime environment.  Those presumably have flaws, hence an attacker could potentially compromise the application by compromising some element of the compiler or runtime, even if other aspects of the datacenter BlockChain storage solution magically were iron-clad.

And this is why I'm a skeptic: I don't see how such a picture can be secured, not today, and probably not in my lifetime.

I actually do believe in security, but I think that in modern systems, we get much more protection from diversity and multiple layers of monitoring than we do from automated techniques like verification.  We should definitely do verification where we can -- for systems like Derecho, it offers a path that can slash bug rates and greatly improve confidence in the correctness of the system relative to the promises it can legitimately make.  But to me we oversell the power of verification and proofs if we go further and allow people to believe that we have discovered a magic way to carry this idea to the limits and "prove the whole thing", whatever that thing may be.  BlockChains don't change this reality.

The Oasis folks will presumably read this blog, and I should emphasis that it isn't a personal criticism.  I'm a huge fan of the Berkeley security team and have been amazed by their accomplishments.  Amazing work has been done.  But even so, and without being hostile in any way at all, I do think we all share a need to be frank with the public about what technology can, and cannot, accomplish.

BlockChains are being oversold as a trivial way to get perfect security -- not by Oasis, but by the popular technology press, which seems to have a very dim understanding of the concept.  The press seems to think BlockChains are somehow "more" than secure file systems.  Yet if anything, they are "less"!

Perhaps we are seeing a superposition of two elements.  Clearly, we do have a community that was unaware of the concept of a tamper-proof append-only log protected using cryptography.  All sorts of folks who work in government, health care, manufacturing clearly feel that this is a revelation, and offers an answer to all their security worries.  And quite possibly some were actually not aware that we can use file systems in secure ways -- I'm glad they know, now, and if BlockChain helps them conceive of this, I'm all for it.

Then we have hype, driven by cryptocurrency valuations in the stratosphere, a technology press endlessly eager for the next big thing, and investors keen to make a killing.  In this marketplace, I see a lot of value that companies can bring to the table, particularly ones with exciting ideas for packaging this stuff to be useful, and some of this value is very real: HyperLedger and Ethereum and related tools are genuinely powerful.

But just the same, we need to stop claiming that BlockChain is a revolutionary invention.

My worry is that by overselling the same old file systems to a naïve community of infrastructure owners, we may end up with systems that are actually less secure than what they replace.  After all, a platform like Azure is actually remarkably secure today, managed professionally by a company obsessed with security (and nobody is paying me to say this, by the way!), and has such a diversity of technologies deployed that compromising it in a major way would be quite hard.  If some hospital were to abandon that and jump to BlockChain as its file system, the illusion of security may have replaced much stronger real security.  True, today's reality has its limits, but in fact  I would be far more comfortable seeing medical records hosted on Azure or AWS, than abandoning these professional solutions in favor of a storage product that uses BlockChain as a magic wand to solve all problems.

But clearly, the current climate (especially the technology press) is prone to magical thinking, and a bit weak on just what BlockChains are, and how they work, and what their logical foundations turn out to be.  And in light of that, I suppose that the amazingly high first round of Oasis investment makes a kind of sense.  A bubble?  Definitely.  And yet all valuations are measures of market sentiment.  So perhaps not an unreasonable bubble, given the modern business climate and the craze that BlockChain has engendered.

Wednesday, 11 July 2018

Why we need a "Bell's test" for BlockChains

Last fall, Scott Aaronson visited Cornell and gave a series of talks on quantum computing.  I asked him about quantum-encrypted fiber-optic communication: how can users be sure that the technology actually uses entanglement, and isn't just some form of standard communication link set up to mimic the API to a quantum one?

Background: Quantum cryptographic methods basically mix classic communication with a quantum source of completely secure "noise".  They use a normal PKI for endpoint authentication, but introduce a quantum device that sends entangled photos to the two endpoints.  By measuring some unknown property (usually, polarization), the endpoints extract a genuinely random sequence of 0 and 1 bits that both observe identically (entanglement),  Any intruder who attempts to spy on the system would disrupt the entanglement.  The shared sequence of random bits can then be used as the basis for end-to-end symmetric cryptography.

A vendor offering a product in this space needs to do much more than to provide an optical fiber that carries entangled photons.  One issue is that the endpoints need to synchronize to ensure that they perform the same test on the same photons.  This isn't easy at high data rates.  To work around the limit, you would probably use quantum entanglement to create a symmetric key shared by the endpoints, then employ that symmetric key as input to DES or some other high-speed symmetric cryptographic solution.

But suppose we  don't trust the vendor.  Could the hardware be designed to "cheat"?

Scott's answer:  Thus there are many ways to cheat.   For example, notice that the scheme outlined above starts with a completely unknown property: entangled photos with totally random polarization.  One could instead generate an entangled sequence with known polarization.

The user will have been fooled into using a key that the evil-doer generated (and hence, knows).  The user's secrets will actually be out in the open, for anyone (at least, anyone who knows the sequence) to read.

In fact, why even bother to entangle the photons? Why not just take a laser, polarize the output (again in a known way), and then beam the resulting (non-random, non-entangled) output through a half-silvered mirror, so that both endpoints can see the same signal.  A naïve user would measure polarizations, extract the same sequence from each end, and think that the device was working flawlessly.

Beyond this, one can imagine endpoint hardware that genuinely goes to all the trouble of extracting  random data from quantum-entangled photons, but then ignores the random input and substitutes some form of pre-computed key that the evil-doer computed months ago, and stored in a table for later use.  Here, the buyer can actually go to the trouble of verifying the entanglement, confirm that the device is genuinely intrusion-tolerance, and so forth.   Yet we would have zero security, because the broken endpoint logic ignores the quantum-random input.

Bell's Theorem.  Setting such cases to the side, Scott also pointed out that for the entangled photons on the fiber-optic cable, there actually is a good way to test that the device is working.   He explained that in the lab, physicists test such technologies by running a "Bell's Inequality" experiment.

As you may know, Bell's Theorem was proposed by John Stewart Bell as a way to test one of the theories about quantum entanglement -- some believed at the time that "hidden variables" were at the root of entanglement, and were actively exploring ways that such variables could arise.  Bell showed that true entanglement could be distinguished from a hidden variable system using a series of measurements that would yield different results for the two cases.  Scott's point was that could run a Bell's inequality experiment on the fiber.  It would give unambiguous evidence that the photons emerging from our fiber are genuinely entangled.

But a Bell's test would only cover the technology to the endpoints of the fiber carrying the entangled photons, and we could only run such a test with "unrestricted" access to the medium.  Very few products could possibly be deconstructed in this way.

Bottom line?  Clearly, it is vitally important that quantum encrypted communications technology be from a full-trusted vendor.  A compromised vendor could sell an undetectably flawed technology.

Why is this relevant to BlockChains?  A BlockChain technology is only as secure as the  cryptographic package used in its block-entanglement step.  Suppose, for example, that I created a cryptographic package called SHA 256, but instead of using the actual SHA 256 algorithm, implemented it in some trivial and insecure way.  As long as that package produces an random-looking hash of the input, of the proper length, one that varies for different inputs, you might be fooled.

What's the risk?  Well, if I could trick you into using my product, it would look like a BlockChain and seem secure.  Yet suppose that the chain included block X that has a transaction I find "awkward".  If my fake hashing system lacks a cryptographic property called "strong collision resistance", I could substitute block Y for X, modifying the stable body of the chain, and you wouldn't be able to prove that this tampering had occurred.  Obviously this defeats the whole point.

Now, if you were to check the output against a different, more trusted SHA 256 hash solution the values would differ.  Yet how many people audit their BlockChain products using a technology totally independent of anything the BlockChain vendor provided?  In this example, even using the SHA 256 code provided by your vendor is a mistake: the SHA 256 code is broken.

Moreover, there are other ways that one could potentially trick a user.  A SHA 256 hash computed on just a portion of the transaction record could look completely valid (would be valid, for that portion of the block), and yet would leave room for tampering with any bytes not covered by the hash.  Your audit would thus need to really understand the BlockChain data structure, which isn't as simple as you might expect.  Many BlockChain vendors use fairly complex data structures, and it isn't totally trivial to extract just the chain of blocks and hashes to audit that the hash actually covers the data in the block.  Any vendor-supplied code you use for this step, at all, would expose you to a risk that when you go to audit the chain, the vendor tool covers up any tampering.

My point?  This is a genuine risk.  An immense number of companies are jumping to use BlockChain for diverse mission-critical purposes.  These companies are relying on the guarantee that once the blocks in the chain become stable, nobody could tamper with the record.  Yet what we see here is that a BlockChain is only as good as the vendor and the cryptographic package, and that the chain can only be trusted if you have some way to independently test its integrity.  And you had better really run that test, often.

My advice to anyone working with BlockChain is to hire a trusted independent consultant to build a BlockChain test program that would audit the actual chain daily, using completely independent technology.  If the vendor is using AES 256 for hashing, your auditing company should find a very trustworthy AES 256 and base the audit on that.  If the chain uses some other hashing method, same goes for it -- this can work for any standards.

What if your vendor is offering a non-standard BlockChain that runs super-fast by virtue of using a new but proprietary hashing technology, or a new but non-standard secret data structure?  My advice is simple: if the vendor won't supply you with enough detail to let you audit the chain, don't trust it.

Thursday, 28 June 2018

When open source is the right model

At DSN, I found myself in conversation with some entrepreneurs who were curious to know why in an era when people are making billions on relatively small ideas, we aren't adopting a more mercenary IP stance with Derecho.  For them, our focus on papers at DSN and TOCS and open software that really works was a strange choice, given that we do have a shot at products and startups that could be pretty lucrative.

Here are some dimensions of the question worth pondering.
  • Academic research is judged by impact, meaning broad adoption, lots of citations, etc.  
  • We started our project with DARPA MRC funding.  DARPA insisted that we use open source licensing from the start, but we can pretend it didn’t and still reach the same conclusion.
  • Publicly funded research should benefit the taxpayers who wrote the checks.  For a system like Derecho, this means the system needs to be useful, adopted by US companies that can leverage our ideas in their products, and help them hire lots of people in high-paying jobs.  Derecho should enable high value applications that would not have been possible without it.   
Should Derecho be patented and licensed for a fee?
  • Patents don’t protect mathematical proofs or theorems (or algorithms, or protocols).  Patents protect artifacts.  I often end up in debate with theory people who find this frustrating.  Yet it is the way the system works.  Patents simply cannot be used to protect conceptual aspects, even those fundamental to engineering artifacts that use those concepts in basic ways.  They protect the actual realization: the physical embodiment, the actual lines of code.  
  • Thus my group could perhaps patent Derecho itself, the actual software, through Cornell (this ownership assignment is defined under the US Bayh-Dole act).  But we cannot pursue a patent on state machine replication, the 1980’s model (the theory) underlying Derecho. Our patent would be narrow, and would not stop you from creating your own system, Tornado, with similar algorithms inspired directly by our papers.   Sure, we could work with a lawyer to arrive at tricky patent-claim wording that a naive reader might believe to cover optimal state machine replication.  Yet even if the USPTO were to allow the resulting claims no judge would uphold the broad interpretation and rule in our favor against Tornado, because patents are defined to cover artifacts, not mathematical theories.  Software patent claims must be interpreted as statements about Derecho embodiment of the principle, not the principle itself.  This is just how patents work.
  • Wait, am I some kind of expert on patents?  How would I know this stuff about patent law?  Without belaboring the point, yes, I actually am an expert on software IP and patents, although I am not an IP lawyer.  I got my expertise by fighting lawsuits starting in the 1990’s, most recently was the lead expert witness in a case with a lot of money at stake ($Bs), and I’ve worked with some of the country’s best legal IP talent.  My side in these cases never lost, not once.  I also helped Cornell develop its software IP policies.
  • Can open source still be monetized?  Sure.  Just think about Linux.  RedHat and other companies add high value, yet Linux itself remains open and free.  Or DataBricks (Spark).  Nothing stops us from someday following that path.
So, why should this imply that Derecho should be free, open source?
  • There are software systems that nobody wants to keep secret.  Thousands of people know every line of the Linux kernel source code, maybe even tens of thousands, and this is good because it enables Linux to play a universal role: the most standard “device driver” available to the industry, taking the whole machine to be the device.  We need this form of universal standard, and we learned decades ago that without standards, we end up with a Tower of Babel: components that simply don’t interoperate.  The key enabler is open source.
  • That same issue has denied us a standard, universal solution for state machine replication.  We want this to change, and for Derecho to be the standard.
  • There are already a ton of open source, free, software libraries for group communication.  Linux even has a basic block replication layer, built right in.  You need to value an artifact by first assessing the value of other comparable things, then asking what the value-add of your  new technology is, then using the two to arrive at a fair market value and sales proposition.  
  • But this suggests that because Derecho competes with viable options (“incumbents”) that have a dollar value of zero, then even if it has a high differentiated value as a much better tool, the highest possible monetary value for it would need to be quite low, or the market would reject it.  So yes, we could perhaps charge $5 for a license, but we would be foolish to try and charge $500K.  
  • You might still be able to construct a logic for valuing Derecho very high and then licensing is just once.  The broader market would reject the offering, but some single company might consider taking an exclusive license.  So you could protect Derecho with a patent, then sell it.  The system would end up as a proprietary product owned fully by the buyer: your US tax dollars hard at work on behalf of that one lucky buyer.  But now what’s happens?  In fact,   someone else could simply create a new free version.  Examples?  Think about HDFS and Hadoop and Zookeeper (created to mimic GFS, MapReduce and Chubby, all proprietary).  To me the writing is on the wall: if Derecho isn’t offered for free, the market will  reject it in favor of less performent free software, and then if the speed issue becomes a problem, ultimately someone else would build a Derecho clone, offering it as free software to fill the gap.  They would find this fairly easy, given our detailed papers, and it would be legal: recall that you can’t patent protocols.  This would be totally legal.
Conclusion? To maximize impact, Derecho needs to be open source.

Thursday, 14 June 2018

If RDMA is just 4x faster, will that doom adoption?

I’m staring at our latest Derecho numbers and am thinking about their implications.

With additional tuning, by now Weijia and Sagar have Derecho on LibFabrics over TCP running just 4x slower than Derecho mapped to RDMA on the same NIC.  I guess this number makes sense, since on our clusters the memcpy speed was about 3.75GB for uncached objects, while RDMA transfers run at a peak rate of 14GB: just about a 4x improvement.  So what we are seeing is that with a memcpy in the critical path from user to kernel, and one more memcpy from receiver kernel up to the user (TCP lives in the kernel), throughout drops to the speed of the copy operation.  In Derecho, adequately deep pipelines can absorb this effect.  Derecho is far slower if the window size is limited to too small a value.

Latency measurements are hard to carry out accurately, at these speeds, but once we get them, I’m sure we will see that latency is higher with TCP: rising from about 1.5us for small objects by some factor associated with memcpy delay and thread scheduling: LibFabrics has extra threads that RDMA avoids.  Weijia’s preliminary results suggest that one-way latency rises by tens of microseconds for small objects and hundreds of microseconds for large ones.  But today’s cloud applications live happily with much higher latency due to multitasking and scheduling overheads, so this may not be a primary concern for many users.

Moreover, to the extent that memcpy is the culprit, data center hardware vendors could speed up memcpy if they really set out to do so.  They could extend the DRAM itself to do the transfer, and offer that functionality via a special instruction.  That would probably double memcpy speeds.  But the demand hasn’t been there.  One could also modify TCP to do some form of scatter-gather with a DMA operation directly from the user-space object, avoiding the memcpy.  Of course this would break the semantics of the standard synchronous TCP send, but you could offer the fast path purely with asynchronous I/O, in which case the semantics wouldn’t need to change.

Derecho is very tolerant of high latency.  Systems like Tensor Flow are too, as would be any event-stream system using a queuing service (Kafka, SQS) to interconnect its components.  But some applications would care more, namely those dominated by RPC.  Are those still common?

Leading to my question: if RDMA only gives us 4x speedup on datacenter networks, and latency increases but only by fractions of a millisecond, will operators adopt the technology, given the complexity of deployment?  As you know, if you’ve read this blog, Microsoft and Google are gaining experience with a datacenter RDMA, using DCQCN or TIMELY for congestion control and various tricks to run RDMA on Converged Ethernet (RoCE) with flow isolation.  They are succeeding, but finding it fairly hard to pull off.   Normally, a technology that is this hard to manage and brings less than a 10x benefit doesn’t make it.

One case for RDMA might be based on CPU loads.  TCP will peg a CPU doing all this copying, so with Derecho over TCP we see the typical 12-core NUMA node acting more like an 11-core machine, and maybe even like a 10-core machine since there are other threads involved, too. As the datacenter owner, this 10-15% tax on your compute resources could be a big issue, giving RDMA that magic 10x benefit not in one year, but probably over a four or five year period, considering that it was 4x faster in the first place.

A second point is that not every project has all-star developers to do the tuning.  So our 4x may translate to a 20x difference for MPI, or a 25x for Oracle.  That would drive towards RDMA adoption after all. What Weijia and Sagar did came down to adjusting the SSTMC window size in Derecho.  But you need to figure out that for Derecho in this modem the window size is the dominating factor.  Who knows what it would be for MVAPICH (MPI), or Oracle, or Tensor Flow? RDMA takes that puzzle, and that datacenter tax, off the table.

My cards are still on RDMA.  But these results over TCP definitely came as a surprise!

Monday, 28 May 2018

Derecho on LibFabrics versus Derecho on RDMA: Hot off the wire...

A while back, I mentioned that we were trying to port Derecho to also run over LibFabrics, so that people could use the system on a TCP-only network, or one with the Intel OMNI-Path/IODirect technology, which are based on the LibFabrics API rather than using RDMA Verbs.

So this is starting to work now and I thought I might share a quick summary of the experience!  A big thanks to Weijia Song and Sagar Jha, who worked with Cornell ECE undergraduate Alex Katz to make this happen.  It was, as it turned out, remarkably hard.

For those lacking context, let me start with a summary of why this topic is of interest, but then that group of people may want to tune out because I'll go back to being highly technical.  As you may know, RDMA is a technology for permitting one computer in a data-center or cluster (close-by, densely packed machines) to do direct memory-to-memory data transfers.  RDMA comes in various modes: you can do messaging over RDMA (one machine sends, the other receives), and in this case it will behave much like TCP, waiting to send until the receiver is ready, then transmitting losslessly.  The big win is that the transfer will run at DMA speeds, which are extremely high: 100Gbps or more.  In contrast, if you code a little TCP-based transfer on the identical computers, same network card, you tend to see performance 1/5th what RDMA can achieve or less, because of all the copying (from user space to kernel, then the kernel to kernel transfer over the network in IP packets, then copying up).  Copying takes time, and this ends up being 4/5ths of the total path.  RDMA eliminates all of that, at least for large transfer.

For small transfers, people work with "one-sided" RDMA in which the sender and receiver prenegotiate a connection but also share a memory region: B grants A permission to read or write within some chunk of B's local memory.  Then A basically ends up thinking of B's memory as mapped into A's own computer, although the data transfer isn't as transparent as just reaching into a mapped memory -- it has to occur via special read and write commands.  Still, the model would be one of shared memory, reliably updated or read remotely.

RDMA originated on InfiniBand (IB) networks, popular in today's supercomputer systems.  Most supercomputer communication runs over a library called the Message Passing Interface, MPI, via an implementation like MVAPICH, and versions like that are highly tuned to leverage RDMA on IB.  Our work on the Cornell Derecho system uses this same technology.

Now, continuing the back story, RDMA in this form dates to work done long ago at Cornell by Thorsten Von Eicken (went on to lead GoToMyPC.com) and Werner Vogels (now CTO at Amazon).  RDMA took off in industry, and some think it killed off the older style of parallel supercomputing, replacing it with less costly clusters of commodity machines connected with RDMA networks on IB.  Anyhow, right from the start the leaders pushed for a TCP-like model.  In fact you actually start by creating a TCP connection, a normal one, then put an RDMA connection in place "next to it", using TCP for passing control messages until the RDMA layer is up and running.  RDMA itself involves a firmware program that runs in the NIC (so these NICs are like little co-processors, and in fact they often have a full Linux O/S on them, and are coded in C or C++).  The interaction between end-host and NIC is through shared "queue" data structures, which are always grouped: there is a queue used for send/receive/read/write operations (called "verbs" for reasons lost in the mists of time), and then a separate completion queue where the NIC reports on operations that have finished.

Verbs, the RDMA API, define a huge number of specialization options.  Data can be "inline" (inside the operation) or external (via a pointer, and a length), or one can specify a vector of locations (scatter-gather).  Operations can be selected for completion reporting.  There is a way to multiplex and demultiplex more than one logical connection over a single RDMA connection, using "keys".  There are "tags" that can carry an additional few bytes of information such as object length data (useful in cases where a big transfer is broken down into little sub-transfers).

Then there are ways to use RDMA verbs for unreliable datagram-like communication.  Unlike the normal case, where the sender will be certain that messages got through, in order, reliably, with unreliable datagrams some form of end-to-end acknowledgement and retransmission would be needed in the application.  And this mode also includes an unreliable multicast feature.

RDMA is a troublesome technology in normal optical ethernet clusters and data centers, and many customers just run it on a second side-by-side IB network to avoid the hassle.  But there is an RDMA on ethernet standard called RoCE, and slowly the industry is learning to deploy it without destabilizing normal TCP/IP traffic that the same network would presumably carry.  RoCE hasn't been trivial to use in this way: one reads about storms of "priority pause frames" (PPFs), and there are at least two flow control concepts (DCQCN and TIMELY), both in active use, but neither fully mature.  Microsoft has just started to run RoCE+DCQCN on its big Azure clusters, although the company is limiting use of RDMA to just its own internal infrastructure tools.  Google uses it too.  In contrast, any end-user application on an HPC system depends on RDMA via MPI (Azure HPC permits this too).  So there are ways to access RDMA, but not directly, on most platforms.  Direct use is limited to people who build new infrastructure tools, like we did with our Derecho work.

Against this backdrop, there is a question of who will win in the RoCE market.  Intel seemed likely to become a big player when they purchased QLogic, and Mellanox, the dominate company on the HPC side, is the obvious incumbent, with maybe 80% of market share.

But RDMA has technical limits that worry some companies.  One concern is the Verbs API itself, which is sort of weird, since it requires these helper TCP connections (you won't find this documented in any clear way, but I challenge you to use RDMA Verbs without having an n * n array of TCP connections handy... CMU seems to have done this in their HeRD/FaSST system, but I'm not aware of any other project that pulled it off).  A second is that every RDMA transfer actually occurs as a parallel transfer of some number of "lanes" of data.  Each lane is a pair of single-bit connections, one for data travelling in each direction, and at the high-end of the range, lanes typically run at speeds of 14Gbps.  Thus when you read that the Connect-X4 from Mellanox runs at 100Gbp/s, you can assume that they employ 7 or 8 lanes to do this.  In fact the rates are quoted as one-way rates so you should double them in the case of Derecho, which is aggressive about using connections in a full bidirectional way.  By scaling the number of lanes, the industry is now starting to deploy 200Gbps solutions with 400 within sight and 1Tbps expected by 2025 or so.

Now, at these rates it isn't efficient to send a single bit at a time, so instead, RDMA devices normally send frames of data, generally 84 bytes (this is the frame size for 100GE).  Thus if you have 8 lanes, a single RDMA transfer might actually carry 512 bytes, with 1024 or more likely by the time the technology reaches the higher rates just mentioned.

The issue is that not every application wants to move objects in blocks of such large size.  After all, with direct RDMA writes, a typical operation might access just a few bytes or a word: perhaps 32 or 64 bits.  This will be incredibly inefficient over RDMA.

In Derecho we work around that by designing the system around a style that we think of as "steady flow".  We want a stream of updates to transit any given path, never pausing even briefly, because we want ways to aggregate traffic.  You can read about how we do this in our paper (it was under review by TOCS, but now we are revising it for resubmission, so the version I pointed to is going to change).

Intel has led a group of vendors arguing that this entire direction is ultimately unworkable: we will get to 1Tbps RDMA this way (and Derecho will let you access that speed), but most applications will find that they get a tiny fraction of the full performance because of patterns of transfers that send entire RDMA frames just to read or write a byte or two at a time -- in such cases the majority of the RDMA transfer is just wasted, because the NICs are exchanging perhaps 8 or 16 bytes of useful information and yet every RDMA transfer is of full size: 512 or 1024 bytes.

One answer is for everyone to use Derecho.  But LibFabrics, which is Intel's story, aims at a different concept: they want more flexibility about how the hardware could multiplex large numbers of concurrent remote I/O operations so that these huge frames can carry much more useful traffic.

At the same time, LibFabrics breaks with the RDMA Verbs model and tries to more explicitly mimic the normal Linux sockets API.  One can love or hate sockets, but at least they are a widely adopted standard that all of us know how to work with.  Lots of software runs through TCP or UDP over sockets, hence all of that software can (with some effort) also run on LibFabrics.  Yet RDMA is still available as a LibFabrics mapping: there is a way to run directly on Mellanox and theoretically, it maps to the identical RDMA Verbs we would have used in the first place.

So with this in mind, we bit the bullet and ported Derecho to LibFabrics.  Now our main software distribution actually will have both options, and we'll use some form of configuration manager (we're thinking about Getopt or its sibling, the slightly fancier Getpot).  Then you'll be able to tell us which NIC to select, which mode to run in, etc.

Derecho it looking pretty good on LibFabrics right now.  We're seeing identical performance for RDMA versus LibFabrics (on the identical NIC and network, obviously), provided that the transfer size is large.  For small operations, LibFabrics seems slower.  This is clearly at odds with a direct mapping to RDMA verbs (otherwise we would have seen identical performance for this case too), so we'll need to understand the root cause more deeply.  We've also tested Derecho over LibFabrics mapped to TCP, still on the same network but using the kernel stack, and it seems to be about 4x slower than with RDMA, again in the large objects case.  Derecho over TCP with small objects is very slow, so there is definitely a serious puzzle there.

One very cool option will be to look at Derecho on WAN links over TCP.  RDMA can't run on a WAN network, but we definitely can run LibFabrics this way.

Summary: work in progress, but work revealing very nice progress, even if some mysteries remain to be resolved.  Watch this blog for updates!  I'm sure we'll have lots of them in months to come.

(Added May 30): Here's data from an experiment run by Weijia Song, Alex Katz and Sagar Jha for a 3-member Derecho shard or group, first with 1 sender and all three members receiving and then with all sending and all receiving, using our atomic multicast configuration of ordered_send (compare with Zookeeper's ZAB protocol, without checkpointing).  

What you see here is that (1) LibFabrics didn't introduce overhead when we map down to RDMA.  To see this compare Derecho ROCE and IB against the Derecho Baseline (which used Verbs directly).  So, two left data sets versus the one on the right.  Axis is in gigabytes-per-second and the color coding is for the message size: 100MB, 1MB and 10KB.

Next look at the middle three cases.  These were ones in which we used the same NICs but told LibFabrics not to use RDMA and instead to run on the native TCP for those NICs (TCP works perfectly well on these NICs, but doesn't leverage RDMA).  So of course performance drops, by roughly the factor of 4x mentioned above, but what is impressive is that if you were to go to our Derecho paper and compare this with Zookeeper or LibPaxos, we are still close to 100x faster than those libraries!  That would be a true apples-to-apples comparison, since ZAB and LibPaxos run on TCP or UDP, not RDMA, so when we compared Derecho against them previously, it was a bit unfair to them -- they had no way to leverage RDMA.  

Why is Derecho so much faster than classic Paxos?  Clearly the world already had reached a data rate in which streaming data and doing receiver-side batching is far preferable to the classic 2-phase commit for each action.  Nobody had noticed this, but Derecho happens to live in this new configuration because our development methodology led us to that formulation of the protocols.  And here we can see that it pays off even on a TCP infrastructure!

Sorry for all the explanation points, but after something like 35 years of working on this kind of thing, I'm thrilled to see these numbers and to realize what a great opportunity this represents.





Wednesday, 16 May 2018

Type-checking Schrödinger's Cat

Computer scientists have long needed to deal with an issue that  triggered schisms within the physics community starting in the early 1900’s and continuing until the present.  

We all know Schrödinger's thought experiment: he proposed that one visualize placing a cat in a box with a flask of poison gas, doomed if a uranium atom happens to decay during the five minutes it is in the box, but free to live another (eight? nine?) lives otherwise.   His suggestion was that the entire apparatus, cat included, enters a quantum superposition state.  Opening the box and observing the cat causes the state to collapse: you'll either see a happy living cat, or a feline tragedy.   

There is no need to actually perform the experiment: you can never actually observe a superposition state, and doing the experiment once to see whether the cat survives would just be cruel; it wouldn't answer any questions.  The challenge is to explain how to apply quantum mechanics to the situation such a setup creates.  Does quantum mechanisms operate only at small scale, or does it genuinely have large-scale implications?  In the early 1900's, it was common to assume that the quantum world was somehow disconnected from the world of larger material objects.  The experiment sets up a scenario in which, at least at first glance, a small-scale phenomenon shapes a large-scale outcome.

The physics community debated the meaning of this strange scenario endlessly, but were unable to resolve their views: one group held that indeed, the cat ends up in a superposition state resolved only by opening the box.  A second community insisted that the outcome was "observed" when the uranium either did or did not decay.  Yet another community questioned whether the problem was properly posed.  Ultimately, exhaustion set in, at which point they agreed that it might be best to stop arguing.  

One reason for not arguing this question to a conclusion is that irrespective of the interpretation of the experiment, Schrödinger's equation for predicting the evolution of state in a system of particles works quite well, so the debate centers not on the physics, per se, but rather on the best way to explain the physics.  But while the cat story was intended to elevate the model to a more intuitive level, it only left everyone baffled.  So the decision to not argue about it was really an agreement that whatever the experiment illustrates, it definitely doesn't help explain the physics.

In fact, the many-worlds model resolves this puzzle, for those willing to accept that there are an infinity of worlds, all equally real.  But let's not go there, and instead focus on the statement of the puzzle itself.

As a computer scientist, it strikes me that a key part of the confusion is this: Can statements about quantum mechanics be made about cats?  I'm putting statements in italics because I want to emphasize that I'm asking a syntax question, not a semantic one.  Can the phrase "quantum superposition" be used as a modifier for "cat", as we might use "angry cat" or "happy cat"?  Is it correct to talk about "quantum superposition of a cat"?

One of my favorite courses to teach is Cornell’s CS2110, our second programming class, where we introduce abstractions and object oriented computing.  To me, there is nothing more foundational about computing than the shift in perspective that occurs when we define a new data type, instantiate it, and endow it with operations that have semantic meaning only with respect to the new abstraction.

With CS2110 in mind, let me pose a slightly different version of the cat question.  Suppose that I define a new data type, a “foo”, and a true/false property “shimmer.”  Given a foo object, you can always ask whether or not it shimmers.  But now let's introduce a second unrelated data type, one we'll call "bar".  If bar has no shimmer property, you simply can't ask if a bar object shimmers.  

Back to animals.  If one pets a cat, it may or may not purr.  You can pet a dog, but it will never purr.  What would it even mean for a dog to purr?  Asking "did that dog purr?" is thus a nonsense question.  Without defining what purring means, for dogs, the words simply don't fit together into a meaningful question.

So... armed with this very basic insight about abstract data types, let's revisit our physics puzzle. Subatomic particles can be placed into superposition states But is the term "superposition" even relevant to a cat? 

The cat in the box scenario conflates a “perceptual” abstraction (the cat) with a model related to the wave/particle duality (superposition). Superposition as properly defined is a specific phenomenon seen in a totally different setting: one involving elementary particles, as we define and model them. You can't simply take such a term and apply it willy-nilly.

This now leads to a second and surprisingly subtle question:  What in fact, is a cat?

You might be tempted to say something about atoms, molecules, cells.   This would be like saying something about computer memories and bits and how Java compiles to basic operations, if we were talking about data types.  But just as memory locations change  over time, a cat's atoms and molecules and cells are constantly changing.  Schrödinger's cat started life as a single fertilized cell.  It acquired most of its current mass as it grew to become an adult.  So, defining a cat in terms of its constituent parts seem wrong.

Is there a cat definition that transcends the “moving parts”?

As a computer scientist, an appealing answer is to suggest that catness centers on information.   This view would define the living cat as a biological program that implements all the various processes by which a cat comes into existence, grows, and also sustains itself.  In this perspective, the cat is a distributed program, admittedly one more complex than anything we know how to create today.  

This programming perspective can be used to talk about many kinds of things in our perceived world.   Even inanimate objects like stones are, in some sense, as much about information as they are about their constituent atoms and molecules and crystals. 

In fact an information perspective can even be used to assign a kind of discrete existence to things that have no physical embodiment.  We often talk about properties such as love, anger, happiness.  A cat purrs to express contentment.  Yet there is no physical object that corresponds to this  emotional state that causes the cat to purr: contentment, hence purring,  is a state resulting from a form of information-based biological computation, performed by the cat (I should almost say that the computation is executed on a biological machine that evolves over time under control of that same biological computation). 

If a cat is actually best understood by thinking about information in this sense, and Schrödinger's box is a form of information, what does it really mean to say that we have placed a cat in the box?  Obviously, at some high level of abstraction, you open the box, pick up Fluffy, and gently place her in the box.  You close it, wait a few minutes, then reopen the box and remove Fluffy.  But as you can see, if we really try to pin down the semantics of each of these steps it gets somewhat tricky.  Each one of them involves non-trivial definitions.  Some of these seem extremely difficult to really define in a full, logical way -- perhaps it isn't even possible in some cases.

Think now about the process as it was understood by late 19th century and early 20'th century particle physicists.   They started out on a path towards the basic structure of the universe.  This involved an effort to understand the composition of atoms, which they eventually understood to be atoms composed of protons, neutrons and electrons.  Today, of course, we know that protons, neutrons and electrons are themselves structures composed of more elementary entities called quarks, and that quarks in turn may arise from mathematical structures that modern string-theorists like to call m-branes.  But let's focus just on elementary particles like the ones making up atoms.  At normal energies, this is a reasonable thing to do (you don't see protons and neutrons and electrons behaving like a soup of quarks unless you force them to collide, which requires extremely high energies).  

So we have atoms, and we have elementary particles within them.  We can model these abstractly, and to first approximation, they reside on something foundational and ultimately, are "atomic" in the original Greek sense of the term: indivisible, irreducible, fundamental.  

And, returning to our point, as it turns out we can and should talk about superposition for these elementary particles.  In this statement, the term "superposition" is applicable to our way of abstracting and modelling the behavior of individual particles.  Indeed, we are really forced to define terms like superposition and entanglement to develop a sensible physical description of particles like neutrons and protons and electrons.  Lacking these terms and their associated physical laws, we end up with an incomplete mathematical characterization.

The terminology, viewed this way, is meaningful in the context where it was first proposed, and is part of the mathematical physics that works so incredibly well at predicting the evolution of state for quantum systems.  However, the terminology isn't absolute.  We can't necessarily translate physics statements about this particular model of elementary particles to totally different settings.

In particular, when we take a term such as quantum superposition and apply it to Fluffy, we arrive at a conundrum: Is it, in fact, meaningful to make such a statement about a cat in a box?   Yes, we can write the statement easily enough: "the cat in the box is in a superposition state."  Yet one must question the well-formedness of the resulting statement itself.   

This issue is especially striking when we consider an information-based perspective on catness, because such a definition of what it means to be a cat, or a box, really transcends the physical realities of the constituent components of the cat and the box, and exists in a different semantic context.  Arguably, each level of abstraction in a physical system needs its own set of definitions, meanings, and physical laws.  While these layers of meaning obviously do emerge from more basic ones, including the layer at which the original quantum superposition abstraction was first recognized, it isn't always the case that a property or physical law from one layer carries over to another higher one.  Rather, each higher layer is "emergent" and as such, could be quite distinct from the layers below it.  My foo data type, for example, might be very remote from the Java JVM instruction set or the data values used to represent a particular foo instance.

Of particular note is that Fluffy is really the same cat, even though she grew from a kitten into an adult cat over the past few years.  Fluffy's atoms are mostly different from when she was small, yet she is still the same cat.  So here we have an emergent notion of Fluffy that actually departs from the subatomic components of Fluffy in two senses: first, Fluffy endures even through her constituent atoms may be entirely different.  And second, the behavior of Fluffy's atoms might not actually be evident at the higher level where Fluffy herself really exists.  There is a disconnect between the layers of reality.

This brings me to a recently published book about quantum physics:  What is Real?  The Unfinished Quest for the Meaning of Quantum Physics, by Adam Becker (I recommend it highly).  Becker describes a problem similar to the one I've outlined, and talks about the struggle Neils Bohr went through, trying to explain why measurement of quantum phenomena was (in his eyes) deeply problematic.  For him, the cat in the box puzzle was simply an incorrectly posed problem.  I think we would say today that his objection comes down to a “type checking” error.  

When you try to talk about quantum superposition states in relation to cats, you are combining two abstractions from completely different type systems.  And this simply isn’t well founded, a point that Bohr apparently fought to express to colleagues (who often were totally unable to appreciate his view).  Bohr perhaps didn't even arrive at the observation that the cat is really an information-based abstraction that sustains itself, a biological machine.  His focus was on the cat as a huge assemblage of particles.  But if you take this next step and think of the cat as a form of information you arrive at a further insight: the concept of superposition applicable to elementary particles might (conceivably) apply to assemblages of them.  But how it could it possibly apply to a purely informational abstraction?

Even in the case of a cat viewed as a massive collection of particles, there is endless debate about whether such a collection is so entangled that we can think of it as a massive cat-particle, or is in fact a collection of independent particles, which would potentially have independent superposition states.  Becker discusses this at length... unfortunately he doesn't use the language of computer science, because for me our perspective on this is far clearer.  Yet he does a good job, in a more intuitive way of explaining things.  Certainly his writing is far better than mine... I could never write a book for his audience. 

Getting back to Bohr, I find in his views an attempt to express precisely this issue of type inconsistency.  Schrödinger's experiment simply doesn't type-check.

Monday, 7 May 2018

Is the universe a big asynchronous distributed system?

Now that summer is approaching, I'll have a chance to do some reading.  One topic I've been fascinated by, for a few years now, centers on the way that research on quantum computing is shedding light on the way the universe itself is "programmed". I'm fascinated by something that for me, was a surprise: a connection to distributed computing.  I'm not one to write a whole book on this topic (I once tried, but then gave up).  I'll summarize sort of briefly.

The place to start is with a famous real experiment by John Wheeler, the Princeton quantum mechanics professor who passed away in 2008 without a Nobel prize. Seeking to debunk the "spooky action at a distance" (also called "Copenhagen") model of quantum mechanisms, Wheeler proposed a strange twist on those famous one and two slit interference experiments. His version worked like this:

You'll need a source of charged particles, very focused and able to run at low power (for example, one electron per millisecond). Then a beam splitter for that kind of beam, with a 50% probability of sending your beam "up" and a 50% probability of sending it "down". Mirrors reflect the beam back towards a narrow slit. And then you put a particle detector on the far side of the slit. A digital camera sensor will do the trick. You can find images of this, or of his two-slit version online (sometimes it is called a "delayed choice" experiment, but as we'll see, delay isn't the point).

Last, put a little detector on one arm of the experiment. If enabled, it will sense the passing electrons and report those by blinking an LED. But the detector should have an on-off switch. Initially, turn it off. The cool thing is that because our particle is charged, we can actually sense it without disturbing it, so the detector seemingly observes the particle without changing any property of the particle. (The "work" of turning the LED on or off is done by the detector hardware, not the particle beam. And you can even arrange the detector to be far enough from the path of the particle so that it will already have reached the camera and impacted the LCD screen before the "which way?" detection even occurs.)

Each single electron that passes through our system will actually hit just one pixel on the LCD screen of the camera on the far side of the slit. This is because a single electron delivers a single quantum of energy to a single LCD pixel: the image for just one electron would be just one very dim spot somewhere on the screen.  But if we run the experiment billions of times, we build up an image.

In this sense the image is a histogram revealing the underlying probabilities.  Each single run of the system (each electron) was a kind of probe of the probability density function of the system. Any given single electron picked an outcome at random among the possible outcomes, in accordance with the probabilities of each possible path and outcome. Thus we are visualizing the probability function -- the single electron itself wasn't smeared out at all. To get interference we really had no option except to run the experiment until billions of data points had been gathered. The interference pattern is the cumulative record of billions of individual events.

Well, lets run our experiment. What happens? As you might guess, this form of experiment reveals interference patterns: the electron has a 50-50 probability of taking the up path versus the down path. But this is true only with the LED that tracks the actual electron path turned off. If you turn on the LED detector... the pattern collapses to a bright dot! A very curious finding.

Even stranger: turn the detector on, but encase the LED output in a black bag so that no information about it escapes... the pattern reappears.

In effect, overtly "knowing" which way the electron went eliminates the diffraction pattern. Not sensing it at all, or destroying the data after sensing it, restore the diffraction pattern.

Sounds like magical nonsense? Yes, definitely. Yet this is a real experiment.  For me, it definitely deserved a Nobel prize.

Now, there are a few ways to make sense of such a finding. One is to consider relativistic time perspectives. If you do this, you quickly discover that concepts like "before" and "after" had no real meaning. Brian Greene's first book on modern physics discusses this and makes the point that causality is meaningful, but that non-causal concepts of time are ultimately subjective. Einstein was the first to realize this, and it underlies his theories of relativity. So was that sensor active when the particle passed? Yes, or no, depending on how you measured time. Not a well-posed question.

But people have done versions of this experiment in which the detector, and that decision, are physically quite far away. Using a speed of light argument, this version of the experiment potentially eliminates any relativistic question of interpretation.  Your decision genuinely, provably, occurs after the particles hit the LCD detector.  And seemingly, it lets you retroactively change the past. Cool, huh? And you thought that scenarios like this were the stuff of terrible TV serials.   Now you learn that they are just the stuff of terrible blogs!  But keep in mind: this is real science. 

Back to interpretations.  The second interpretation is the Copenhagen one. In this, the observation made by the which-way sensor collapses the quantum superposition that characterizes the system state, once the beam splitter has done its thing. The sense in which it is a spooky event is that seemingly, you can delay and decide whether or not to activate the which-way sensor until after the particle already hit the LCD pixel. Cool, huh? But utter nonsense if you believe that in reality, information cannot move backwards in time (faster than the speed of light). So people who adopt this view are tacitly accepting that quantum observations actually can move faster than the speed of light.

A third, more modern interpretation is called the many-worlds model. This model views the world as a giant quantum superposition. The superpositions are really non-quantum classical world-lines that follow Newtonian rules. So in a single world-line the behavior of our electron was fully determined by various factors: the specific electron hit the half-silvered mirror just when this molecule of aluminum dioxide was oriented just so, and hence it reflected up, deterministically. There was no probability involved at that step.

But a many-worlds model assumes that because we live at the tip of a causal cone stretching 13.8B years into the past, we only "know" a tiny amount about the universe in a causal sense of actual past observations. So in this model, each interaction between two elementary particles is an observation, and lives on into the causal future of both particles: an entanglement, but without any quantum fuzz. Today, we sit at the tip of this cone of past observations, but now think of all the unobserved state in the universe. Our world-line observed something (the state of the mirror) that was genuinely unpredictable. After the observation it was fully known.  Before observation, 50-50.

The many-worlds model holds that if both outcomes were equally probable, then this means that in some world-line the electron bounced up, and in others, it bounced down. The world lines were indistinguishable until that happened, but then each "learned" a different past.

So in this model, the arrow of time is concerned with increasing knowledge of the past. Any single world-line accumulates knowledge in a steady way.

Critically, no world-line ever collapses, vanishes, reverses course and changes its path. We just learn more and more as events progress.  The new knowledge rules out certain possibilities. Prior to hitting the mirror, it was genuinely possible that the electron might bounce up, and might bounce down. But after hitting the mirror, any world-line in which we "know" the state of the electron is one in which we have eliminated all uncertainty about that mirror interaction. By turning on the electron path detector, we eliminated all the uncertainty, and this is why the diffraction pattern vanished: with no uncertainty, our electron beam gives a sharp, tightly focused spot on the detector. 

How did this all explain our Wheeler experiment? Well, with the sensor turned off, an analysis of the probabilities of the various outcomes does give rise to an intersection not of the electron with itself, but rather of the two "branches" of the probabilistic state of the system, leading to the interference effect that we see as a pattern on our eventual LDC screen, built up one pixel at a time.  You can compute this pattern using Schrodinger's equation. With the sensor turned on, the probabilistic analysis is changed: if information from the sensor can reach the LDC detector, we eliminate uncertainty in a way that leaves us with a clean focused spot, or a clean set of lines (depending on how the slit is set up).

If you are big believer in quantum computing, you'll be happiest with this third model, although many people in that field tell me that they prefer not to think of it in terms of real-world interpretations ("shut up and compute" is the common refrain). No need to worry about the meaning of all those world-lines.  Are those other world-lines "real?" Well, in what sense is our reality objectively more real, or less real? Any objective attempt to define reality either ends up with an unmeasurable property, or concludes that any world-line that has some non-zero probability of arising is as real as any other.

In a similar vein, it is wise to avoid speculation about free will, and about whether or not such a model leaves room for religion.

I myself am a believer in the many-worlds interpretation. It eliminates spooky faster-than-light action at a distance and other oddities. We are left with something elegant and simple: causality, nothing more. Lacking knowledge of the current state, some things seem possible, with various probabilities. Then we learn things, and that rules out other things. Viola.

Now, how does this lead us towards a perspective relevant to computing, and distributed systems at that?

Well, we have these deterministic world-lines that are all about interactions of elementary particles (or maybe m-branes in the modern 11-dimensional string theories -- whatever your favorite most elementary model may be). Let me ask a small question: do you believe that all of this is governed by mathematics? Sure, mathematics we haven't fully learned yet, and without question, the mathematics may be very foreign to our normal mathematical systems. But do you believe that ultimately, the physical world is governed by physical law?

I myself do believe this: I have yet to hear of a physical situation that wasn't ultimately subject to a scientific explanation. So we are then looking at model of the universe in which strict mathematical laws govern the evolution of these world-lines from their initial state, back at the big bang when time began, up to the present, 13.8B years later.

There were a set of possible initial conditions, and each initial state gives rise to a world-line. But there are a seemingly infinite number of possible initial conditions consistent with the 13.8B years of observations prior to me typing these characters into this blog-page. I am uncertain as to those unobserved states, and hence unable to predict the actual future: in some sense, I am all of the me's in all of these identical (up to now) world-lines. Then time progresses, events occur (interactions between elementary particles), and more knowledge is learned. My experience is that I reside in the world lines consistent with my personal past. But other versions of me experience the other possible outcomes. And so it goes, branching infinitely.

How does the universe compute these next states? Here, we approach the distributed systems question. The first answer is that it does so by applying the (only partially known to us) physical laws to the states of the world-lines, in a Newtonian style, computing each next state from the current state. As it turns out, this deterministic world-line model is actually fully reversible, so you can run time backwards and forwards (obviously, if you run a world-line backwards the system forgets its future, and regains uncertainty, but you can do this if you wish -- quantum computers depend on this property and wouldn't work, at all, without it).

So what does it mean to have a single world-line and to apply the "next" events to it? Seemingly, the proper model requires that we combine two models: one to model space, and one to model particles, both quantized. So we think of space as a mesh of 3-D (or perhaps 10-D) locations, plus time if you want to reintroduce clocks -- I'll leave them out for a moment, but Brian Greene prefers to include them, adopting the view that events sweep forward at the speed of light, such that any particle is moving at speed c, as a vector in space+time. In Brian's explanation, a photon moves at speed c and experiences no movement in time at all: it experiences its entire existence as a single event in time, spread over space. An electron in motion moves through space at some speed, and then experiences time at whatever rate will give us speed c for its space-time vector. Cool idea (Einstein's, actually).

So we have this mesh of spatial locations. What does space "know?" Well, it knows of the local gravitational gradient (the proper term is the "geodesic"), it knows of nearby particles and their contribution to electromagnetic fields, and in fact it knows of other forces too: the weak and strong force, the Higgs field, etc. So space is a record of fields. And then the particles seemingly know where they are (which space-time location they are in), and where they are heading (their vector of movement), plus other properties like mass, spin, etc.

The distributed computation is one that takes each element of space-time from its state "now" to its state one interaction into the future. This involves interactions with the particles in the vicinity, and also interactions with the neighboring space-time locations. The particles, similarly, interact with space-time (for example, our electron might be pushed by a local magnetic field, causing its trajectory to change, and it would learn the prevailing EM field by interrogating the space-time location), and also with one-another (when particles collide). Perhaps the collisions can be fully expressed as interactions mediated through space-time -- that would be nice.

Very much to my taste, this particular model has a dynamic form of group membership! First, the presence of mass causes space-time itself to stretch: new space-time elements form. This would generate the geodesic mentioned earlier. For example, near a heavy mass, like as we approach the event horizon of a black hold, space-time curves: more spatial volume forms. Indeed, the infall of matter towards the singularity at the core of the black hole seemingly would cause an unbounded expansion of space-time, but entirely within the event horizon, where we can't actually see this happening. And then there is a second aspect, which is that as the most basic particles or m-branes combine or break apart, the population of particles seemingly changes too: perhaps, a neutron bangs into a proton, and a spray of other particles is emitted. As I said, the universal computation seems to track dynamic group membership!

Back to the real world (but here's a little Matrix-like puzzle, referring to the movie: if the universe is a superposition of world-lines and the world-lines are deterministic and governed by physics, and the physics ultimately has all the details filled in -- even if we don't know how the thing works, the universe itself obvious does -- does anyone need to run the computation? After all: there is a Taylor-series expansion of pi, and with it, I could compute the 2^2^2^2^2^...'th digit of pi in any radix you like. Perhaps I tell you that such and such a digit is 7, base 10. But that digit has never actually been computed by anyone except me. Would you trust me? No, but at the same time, you do have a way to validate the claim. The digit definitely has a defined value: my claim is either true, or false: it can't wiggle around and change value. So in a similar sense, the mathematics, given the initial conditions, predicts this moment, 13.8B years down the road. Do we care whether or not the computation has actually occurred? Is it meaningful to claim that we have some existence outside of this definition? And if so, how would you make that claim rigorous)?

At MIT, Scott Aaronson (who has since moved to Austin) and his colleague Seth Lloyd speculated about this -- two stars in the slowly growing field of quantum computing. Seth wrote book in which he argues that even the universe wouldn't be able to solve unsolvable problems -- we computer scientists do have Russell's Paradox to wrestle with, not to mention basic questions of complexity.  If a problem takes unbounded time and resources to compute, it may not be solvable even if there is an algorithm for performing the computation. In this sense all the digits of pi are computable, but some digits cannot be "reached" in the 13.8B years the universe has had to compute them: they are much further out there.

Seth's point is interesting, because it raises real questions about what the distributed computation that comprises the universe might be up to. Taking this point about pi a bit further: pi itself has no finite representation. Yet pi certainly arises in physics, along with many other transcendental constants. How can the universe actually carry out such computations in bounded time?

A few ideas: it could be computing symbolically. The actual mathematics of the universe might be expressed in a different mathematics than we normally use, in which those constants vanish into some other representation (think of the way a transformation to polar coordinates eliminates the need to talk about pi... although such a transformation also introduces a problem, namely that the (x,y,z) coordinate (0,0,0) has no single representation in a polar coordinate system: it has vector length 0, but the associated angle is undefined). Anyhow, perhaps there is a mathematics in which these constants don't arise in their unbounded-length form. Or perhaps the universe only needs to carry out the computation until it has enough digits to unambiguously determine the next state.

Traditional models of singularities pose issues too, once you start to think in terms of tractable mathematics (Seth's point is that if a problem can't be solved computationally, the universe must not be solving it -- either we have the physics wrong, or it has a clever work-around!) A singularity is infinitely small and infinitely dense: clearly not a state amenable to a state-by-state computation. Again, various ideas have been floated. Perhaps the singularity is just not reachable from within the universe: the particle falling into it needs an unbounded amount of time to get there, as perceived from outside (in fact from the outside, the particle remains smeared over the event horizon and the associated information mixes with that of other in-falling particles, and this is all we can know). Still, for the universe to be doing this calculation, we would need to figure out what it "does" for space-time locations closer and closer to the event horizon, and very likely that model needs to account for the behavior inside the black hole and at the singularity, too. Otherwise, the mathematics would be incomplete. Seth argues against such views: for him, the thing that makes a universe possible is that it is self-consistent and has a complete specification of initial state and the physical laws by which that state evolves.

Here at Cornell, my colleague Paul Ginsparg always has the answers to such questions, up to the event horizon. Beyond that, though... as I said, perhaps we shouldn't view the questions or answers as part of this universe. Yet Paul points out that an object passing the event horizon wouldn't notice anything unusual, including the increasingly extreme curvature of space-time. Whatever the mathematics are that govern the universe, they should still work in there.

And here's yet one more puzzle.  If computing the next state involves some form of computation over the entire universe, the associated physical model has to be wrong: such a computation couldn't be feasible.  Yet this seems to imply that much of contemporary physics is incorrect, because these unbounded integrals do arise (for example, contemporary models of quantum uncertainty assign a non-zero probability to finding a particular particle anywhere in the entire universe).

Instead, some form of boundedness must enter the picture. At a minimum, the computation shouldn't need to reach out to more of the universe than we have had time to interact with: an electron on my computer screen has interacted (at most) with particles within a causal cone stretching just 13.B years into the past, and in fact with just a tiny subset of those. This tells us that computing the next state is a finite problem, even if it is a bit large by computational standards. But it also tells us that as time elapses, it gets harder and harder for the universe to compute its own next state. Does it make any sense at all to imagine that physics works this way, or does the universe have some way to bound the computational task so that it would be of manageable scale?

Which leads to another insight.  Armed with our many-worlds hypothesis, suppose that all of us track down the URL of the "ANU quantum random number project."  As you might surmise, this project  uses quantum noise from outer space to generate what seem to be totally random numbers.  Now use those numbers as the basis for lottery ticket purchases in the next PowerBall drawing.  You'll win, in some world-line.  In fact, do it 50 times in succession.  If the game isn't rigged and you don't get arrested, kidnapped or killed, you'll win 50 times -- in some world-line.  But not in any world-line you are likely to actually be living in.

In fact there was an old science fiction story along these same lines: some very elderly fellow had lived an increasingly implausible life, surviving one catastrophe after another.  After all: if survival isn't impossible, but just very unlikely, then by definition there must be some possibility of surviving.  And in that world-line, you do indeed survive, and so forth.

Do low-probability events "really" ever occur?  Or does the universe somehow save itself the trouble and never compute them?  Could it be there is even some hidden natural law that cuts off absurdly low probability sequences?  Brian Greene discusses this too, in a chapter that asks whether the initial conditions of the universe would have allowed a kind of Petit Prince scenario: could the initial universe have started with a single planet in it, and a single person standing on that planet, fully aware and healthy, air to breath, but nothing else in the universe at all?  Extreme interpretations of the multi-verse theory seem to say that yes, such an initial condition would be possible.  But Greene argues that there is a sense in which all of these examples violate the second law of Thermodynamics: the law that says that entropy always increases (obviously, we can expend energy and create local order, but in doing so, we create even more entropy elsewhere).  So perhaps this law about entropy has a generalization that prunes low probability world-lines.

One could make the case that pruning low probability world-lines might be necessary because the total computational complexity of computing the universe would be too high if the universe were to compute every possible pathway, including all of these extremely obscure scenarios.  Pruning really obscure corner cases could be the key to computational tractability, in some very real sense (after all: many problems are infeasible in their general form, yet we manage to solve them computationally because we really only run into an easier subset that wouldn't normally include the very hard instances).  Is there a world-line out there where I actually won lotteries 50 times in a row, using those ANU quantum numbers?   Such a sequence would violate this generalized entropy principle.  So, maybe not, after all.  Sigh.  But on the positive side, that wouldn't have been a very fair way to win the lottery.  It would be nice if a natural law actually prevented it!

A fun topic. I'm always hungry for new books about it -- the people I've mentioned mostly have written such books (it seems to be a rite of passage for physicists). With summer arriving soon, let me know if  you have any to recommend!