Monday 28 May 2018

Derecho on LibFabrics versus Derecho on RDMA: Hot off the wire...

A while back, I mentioned that we were trying to port Derecho to also run over LibFabrics, so that people could use the system on a TCP-only network, or one with the Intel OMNI-Path/IODirect technology, which are based on the LibFabrics API rather than using RDMA Verbs.

So this is starting to work now and I thought I might share a quick summary of the experience!  A big thanks to Weijia Song and Sagar Jha, who worked with Cornell ECE undergraduate Alex Katz to make this happen.  It was, as it turned out, remarkably hard.

For those lacking context, let me start with a summary of why this topic is of interest, but then that group of people may want to tune out because I'll go back to being highly technical.  As you may know, RDMA is a technology for permitting one computer in a data-center or cluster (close-by, densely packed machines) to do direct memory-to-memory data transfers.  RDMA comes in various modes: you can do messaging over RDMA (one machine sends, the other receives), and in this case it will behave much like TCP, waiting to send until the receiver is ready, then transmitting losslessly.  The big win is that the transfer will run at DMA speeds, which are extremely high: 100Gbps or more.  In contrast, if you code a little TCP-based transfer on the identical computers, same network card, you tend to see performance 1/5th of what RDMA can achieve, or less, because of all the copying (from user space to kernel, then the kernel-to-kernel transfer over the network in IP packets, then copying back up to user space on the receiving side).  Copying takes time, and this ends up being 4/5ths of the total path.  RDMA eliminates all of that, at least for large transfers.

For small transfers, people work with "one-sided" RDMA in which the sender and receiver prenegotiate a connection but also share a memory region: B grants A permission to read or write within some chunk of B's local memory.  Then A basically ends up thinking of B's memory as mapped into A's own computer, although the data transfer isn't as transparent as just reaching into a mapped memory -- it has to occur via special read and write commands.  Still, the model would be one of shared memory, reliably updated or read remotely.
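To make the one-sided model a bit more concrete, here is a minimal sketch in the verbs API (not Derecho's actual code): it assumes the connection setup has already happened, so the queue pair is connected and B's buffer address and access key (the "rkey") were exchanged out of band.

```cpp
#include <infiniband/verbs.h>
#include <cstring>

// Sketch: B has granted A access to a region of B's memory; A writes into it
// remotely.  Assumes pd (protection domain), qp (connected queue pair), cq
// (completion queue), and the remote address + rkey were exchanged out of
// band (typically over that helper TCP connection).
void rdma_write_example(ibv_pd* pd, ibv_qp* qp, ibv_cq* cq,
                        uint64_t remote_addr, uint32_t remote_rkey) {
    static char local_buf[4096];
    std::strcpy(local_buf, "hello from A");

    // Register local memory so the NIC is allowed to DMA out of it.
    ibv_mr* mr = ibv_reg_mr(pd, local_buf, sizeof(local_buf),
                            IBV_ACCESS_LOCAL_WRITE);

    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = sizeof(local_buf);
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{};
    wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided: B's CPU is not involved
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a completion record
    wr.wr.rdma.remote_addr = remote_addr;         // inside B's granted region
    wr.wr.rdma.rkey        = remote_rkey;

    ibv_send_wr* bad_wr = nullptr;
    ibv_post_send(qp, &wr, &bad_wr);

    // Spin on the completion queue until the NIC reports the write finished.
    ibv_wc wc{};
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* busy wait */ }

    ibv_dereg_mr(mr);
}
```

Notice that B's CPU never executes a single instruction here: the bytes land in B's memory via B's NIC alone.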

RDMA originated on InfiniBand (IB) networks, popular in today's supercomputer systems.  Most supercomputer communication runs over a library called the Message Passing Interface, MPI, via an implementation like MVAPICH, and versions like that are highly tuned to leverage RDMA on IB.  Our work on the Cornell Derecho system uses this same technology.

Now, continuing the back story, RDMA in this form dates to work done long ago at Cornell by Thorsten von Eicken (who went on to lead GoToMyPC.com) and Werner Vogels (now CTO at Amazon).  RDMA took off in industry, and some think it killed off the older style of parallel supercomputing, replacing it with less costly clusters of commodity machines connected with RDMA networks on IB.  Anyhow, right from the start the leaders pushed for a TCP-like model.  In fact you actually start by creating a TCP connection, a normal one, then put an RDMA connection in place "next to it", using TCP for passing control messages until the RDMA layer is up and running.  RDMA itself involves a firmware program that runs in the NIC (so these NICs are like little co-processors, and in fact they often have a full Linux O/S on them, and are coded in C or C++).  The interaction between end-host and NIC is through shared "queue" data structures, which are always grouped: there is a queue used for send/receive/read/write operations (called "verbs" for reasons lost in the mists of time), and then a separate completion queue where the NIC reports on operations that have finished.

Verbs, the RDMA API, defines a huge number of specialization options.  Data can be "inline" (inside the operation) or external (via a pointer and a length), or one can specify a vector of locations (scatter-gather).  Operations can be selected for completion reporting.  There is a way to multiplex and demultiplex more than one logical connection over a single RDMA connection, using "keys".  There are "tags" that can carry an additional few bytes of information such as object length data (useful in cases where a big transfer is broken down into little sub-transfers).
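Here is roughly how a couple of those options show up when posting an operation; this is a sketch with illustrative names, and it assumes the memory regions involved were registered earlier.

```cpp
#include <infiniband/verbs.h>

// Sketch: one send verb carrying a header and a payload from two separate
// buffers (scatter-gather), with completion reporting requested.  For very
// small messages the IBV_SEND_INLINE flag asks the NIC to copy the bytes into
// the work request itself, avoiding a separate DMA fetch from host memory.
void post_scatter_gather_send(ibv_qp* qp, ibv_mr* hdr_mr, ibv_mr* body_mr,
                              void* hdr, uint32_t hdr_len,
                              void* body, uint32_t body_len) {
    ibv_sge sge[2] = {};
    sge[0].addr   = reinterpret_cast<uint64_t>(hdr);
    sge[0].length = hdr_len;
    sge[0].lkey   = hdr_mr->lkey;
    sge[1].addr   = reinterpret_cast<uint64_t>(body);
    sge[1].length = body_len;
    sge[1].lkey   = body_mr->lkey;

    ibv_send_wr wr{};
    wr.opcode     = IBV_WR_SEND;           // two-sided: the receiver must have posted a recv
    wr.sg_list    = sge;
    wr.num_sge    = 2;
    wr.send_flags = IBV_SEND_SIGNALED;     // report this one in the completion queue
    if (hdr_len + body_len <= 64)          // tiny message: let the NIC take it inline
        wr.send_flags |= IBV_SEND_INLINE;  // (64 is a typical, device-dependent limit)

    ibv_send_wr* bad_wr = nullptr;
    ibv_post_send(qp, &wr, &bad_wr);
}
```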

Then there are ways to use RDMA verbs for unreliable datagram-like communication.  Unlike the normal case, where the sender will be certain that messages got through, in order, reliably, with unreliable datagrams some form of end-to-end acknowledgement and retransmission would be needed in the application.  And this mode also includes an unreliable multicast feature.

RDMA is a troublesome technology in normal optical ethernet clusters and data centers, and many customers just run it on a second side-by-side IB network to avoid the hassle.  But there is an RDMA on ethernet standard called RoCE, and slowly the industry is learning to deploy it without destabilizing normal TCP/IP traffic that the same network would presumably carry.  RoCE hasn't been trivial to use in this way: one reads about storms of "priority pause frames" (PPFs), and there are at least two flow control concepts (DCQCN and TIMELY), both in active use, but neither fully mature.  Microsoft has just started to run RoCE+DCQCN on its big Azure clusters, although the company is limiting use of RDMA to just its own internal infrastructure tools.  Google uses it too.  In contrast, any end-user application on an HPC system depends on RDMA via MPI (Azure HPC permits this too).  So there are ways to access RDMA, but not directly, on most platforms.  Direct use is limited to people who build new infrastructure tools, like we did with our Derecho work.

Against this backdrop, there is a question of who will win in the RoCE market.  Intel seemed likely to become a big player when they purchased QLogic, and Mellanox, the dominant company on the HPC side, is the obvious incumbent, with maybe 80% of market share.

But RDMA has technical limits that worry some companies.  One concern is the Verbs API itself, which is sort of weird, since it requires these helper TCP connections (you won't find this documented in any clear way, but I challenge you to use RDMA Verbs without having an n * n array of TCP connections handy... CMU seems to have done this in their HeRD/FaSST system, but I'm not aware of any other project that pulled it off).  A second is that every RDMA transfer actually occurs as a parallel transfer of some number of "lanes" of data.  Each lane is a pair of single-bit connections, one for data travelling in each direction, and at the high-end of the range, lanes typically run at speeds of 14Gbps.  Thus when you read that the Connect-X4 from Mellanox runs at 100Gbps, you can assume that they employ 7 or 8 lanes to do this.  In fact the rates are quoted as one-way rates so you should double them in the case of Derecho, which is aggressive about using connections in a full bidirectional way.  By scaling the number of lanes, the industry is now starting to deploy 200Gbps solutions, with 400Gbps within sight and 1Tbps expected by 2025 or so.

Now, at these rates it isn't efficient to send a single bit at a time, so instead, RDMA devices normally send frames of data, generally 84 bytes (this is the frame size for 100GE).  Thus if you have 8 lanes, a single RDMA transfer might actually carry 512 bytes, with 1024 or more likely by the time the technology reaches the higher rates just mentioned.

The issue is that not every application wants to move objects in blocks of such large size.  After all, with direct RDMA writes, a typical operation might access just a few bytes or a word: perhaps 32 or 64 bits.  This will be incredibly inefficient over RDMA.

In Derecho we work around that by designing the system around a style that we think of as "steady flow".  We want a stream of updates to transit any given path, never pausing even briefly, because we want ways to aggregate traffic.  You can read about how we do this in our paper (it was under review by TOCS, but now we are revising it for resubmission, so the version I pointed to is going to change).

Intel has led a group of vendors arguing that this entire direction is ultimately unworkable: we will get to 1Tbps RDMA this way (and Derecho will let you access that speed), but most applications will find that they get a tiny fraction of the full performance because of patterns of transfers that send entire RDMA frames just to read or write a byte or two at a time -- in such cases the majority of the RDMA transfer is just wasted, because the NICs are exchanging perhaps 8 or 16 bytes of useful information and yet every RDMA transfer is of full size: 512 or 1024 bytes.
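Just to put numbers on that waste argument, here is a back-of-the-envelope calculation using the frame and payload sizes mentioned above (illustrative figures from the discussion, not measurements):

```cpp
#include <cstdio>

// What fraction of a fixed-size RDMA frame is useful data when the
// application only wants to move a few bytes?
int main() {
    const int frames[]   = {512, 1024};   // frame sizes discussed above
    const int payloads[] = {8, 16};       // bytes of useful information
    for (int frame : frames)
        for (int payload : payloads)
            std::printf("frame %4dB carrying %2dB of useful data -> %.1f%% efficiency\n",
                        frame, payload, 100.0 * payload / frame);
}
```

At 8 useful bytes per 512-byte frame, over 98% of the transfer is wasted, and the picture only gets worse as frames grow.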

One answer is for everyone to use Derecho.  But LibFabrics, which is Intel's story, aims at a different concept: they want more flexibility about how the hardware could multiplex large numbers of concurrent remote I/O operations so that these huge frames can carry much more useful traffic.

At the same time, LibFabrics breaks with the RDMA Verbs model and tries to more explicitly mimic the normal Linux sockets API.  One can love or hate sockets, but at least they are a widely adopted standard that all of us know how to work with.  Lots of software runs through TCP or UDP over sockets, hence all of that software can (with some effort) also run on LibFabrics.  Yet RDMA is still available as a LibFabrics mapping: there is a way to run directly on Mellanox and theoretically, it maps to the identical RDMA Verbs we would have used in the first place.
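For readers who haven't seen it, here is a rough sketch of what the LibFabrics (libfabric) setup path looks like.  This is a generic illustration, not our actual Derecho code, and error handling, connection management and cleanup are omitted.  The interesting point is the provider hint: the same code can be aimed at the verbs provider (RDMA) or at a sockets/TCP provider.

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cq.h>
#include <cstring>

// Sketch of opening a connection-oriented libfabric endpoint.  A real program
// would also bind an event queue for fi_connect()/fi_accept() and check every
// return code; this just shows the shape of the API.
bool open_endpoint(const char* node, const char* service, const char* provider) {
    fi_info* hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_MSG;                  // connection-oriented, socket-like
    hints->caps = FI_MSG | FI_RMA;                     // messaging plus one-sided reads/writes
    hints->fabric_attr->prov_name = strdup(provider);  // e.g. "verbs", "sockets", "tcp"

    fi_info* info = nullptr;
    if (fi_getinfo(FI_VERSION(1, 5), node, service, 0, hints, &info) != 0)
        return false;

    fid_fabric* fabric = nullptr;
    fid_domain* domain = nullptr;
    fid_ep*     ep     = nullptr;
    fid_cq*     cq     = nullptr;
    fi_cq_attr  cq_attr{};
    cq_attr.format = FI_CQ_FORMAT_CONTEXT;

    fi_fabric(info->fabric_attr, &fabric, nullptr);
    fi_domain(fabric, info, &domain, nullptr);
    fi_endpoint(domain, info, &ep, nullptr);
    fi_cq_open(domain, &cq_attr, &cq, nullptr);        // completions, much like the verbs CQ
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    fi_enable(ep);

    // ...then fi_connect()/fi_accept(), fi_send()/fi_recv()/fi_write()...
    return true;
}
```

Change the provider hint and, in principle, nothing else needs to change, which is exactly what made the port attractive.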

So with this in mind, we bit the bullet and ported Derecho to LibFabrics.  Now our main software distribution will actually have both options, and we'll use some form of configuration manager (we're thinking about getopt or its slightly fancier sibling, GetPot).  Then you'll be able to tell us which NIC to select, which mode to run in, etc.
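Nothing is final yet, but to make the idea concrete, here is the flavor of thing we have in mind, sketched with plain getopt_long.  The option names (--provider, --nic, --port) and defaults are hypothetical, not a committed Derecho interface.

```cpp
#include <getopt.h>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical command-line configuration for choosing the transport.  The
// real Derecho configuration mechanism (getopt, GetPot, or a config file) is
// still being decided; this just illustrates the kind of knobs we expect.
int main(int argc, char** argv) {
    std::string provider = "verbs";   // "verbs" (RDMA), "tcp", "sockets", ...
    std::string nic      = "mlx5_0";  // which device to bind to
    int         port     = 23580;

    static option long_opts[] = {
        {"provider", required_argument, nullptr, 'p'},
        {"nic",      required_argument, nullptr, 'n'},
        {"port",     required_argument, nullptr, 'o'},
        {nullptr, 0, nullptr, 0}
    };

    int c;
    while ((c = getopt_long(argc, argv, "p:n:o:", long_opts, nullptr)) != -1) {
        switch (c) {
            case 'p': provider = optarg; break;
            case 'n': nic      = optarg; break;
            case 'o': port     = std::atoi(optarg); break;
            default:
                std::fprintf(stderr, "usage: %s [--provider P] [--nic N] [--port X]\n", argv[0]);
                return 1;
        }
    }
    std::printf("transport=%s nic=%s port=%d\n", provider.c_str(), nic.c_str(), port);
    return 0;
}
```

Usage would then be something like ./derecho_app --provider tcp --nic eth0 (again, hypothetical).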

Derecho is looking pretty good on LibFabrics right now.  We're seeing identical performance for RDMA versus LibFabrics (on the identical NIC and network, obviously), provided that the transfer size is large.  For small operations, LibFabrics seems slower.  This is clearly at odds with a direct mapping to RDMA verbs (otherwise we would have seen identical performance for this case too), so we'll need to understand the root cause more deeply.  We've also tested Derecho over LibFabrics mapped to TCP, still on the same network but using the kernel stack, and it seems to be about 4x slower than with RDMA, again in the large-objects case.  Derecho over TCP with small objects is very slow, so there is definitely a serious puzzle there.

One very cool option will be to look at Derecho on WAN links over TCP.  RDMA can't run on a WAN network, but we definitely can run LibFabrics this way.

Summary: work in progress, but work revealing very nice progress, even if some mysteries remain to be resolved.  Watch this blog for updates!  I'm sure we'll have lots of them in months to come.

(Added May 30): Here's data from an experiment run by Weijia Song, Alex Katz and Sagar Jha for a 3-member Derecho shard or group, first with 1 sender and all three members receiving and then with all sending and all receiving, using our atomic multicast configuration of ordered_send (compare with Zookeeper's ZAB protocol, without checkpointing).  

What you see here is that (1) LibFabrics didn't introduce overhead when we map down to RDMA.  To see this, compare Derecho RoCE and IB against the Derecho Baseline (which used Verbs directly): the two left data sets versus the one on the right.  The axis is in gigabytes per second and the color coding is for the message size: 100MB, 1MB and 10KB.

Next look at the middle three cases.  These were ones in which we used the same NICs but told LibFabrics not to use RDMA and instead to run on the native TCP for those NICs (TCP works perfectly well on these NICs, but doesn't leverage RDMA).  So of course performance drops, by roughly the factor of 4x mentioned above, but what is impressive is that if you were to go to our Derecho paper and compare this with Zookeeper or LibPaxos, we are still close to 100x faster than those libraries!  That would be a true apples-to-apples comparison, since ZAB and LibPaxos run on TCP or UDP, not RDMA, so when we compared Derecho against them previously, it was a bit unfair to them -- they had no way to leverage RDMA.  

Why is Derecho so much faster than classic Paxos?  Clearly the world already had reached a data rate in which streaming data and doing receiver-side batching is far preferable to the classic 2-phase commit for each action.  Nobody had noticed this, but Derecho happens to live in this new configuration because our development methodology led us to that formulation of the protocols.  And here we can see that it pays off even on a TCP infrastructure!

Sorry for all the exclamation points, but after something like 35 years of working on this kind of thing, I'm thrilled to see these numbers and to realize what a great opportunity this represents.

Wednesday 16 May 2018

Type-checking Schrödinger's Cat

Computer scientists have long needed to deal with an issue that  triggered schisms within the physics community starting in the early 1900’s and continuing until the present.  

We all know Schrödinger's thought experiment: he proposed that one visualize placing a cat in a box with a flask of poison gas, doomed if a uranium atom happens to decay during the five minutes it is in the box, but free to live another (eight? nine?) lives otherwise.   His suggestion was that the entire apparatus, cat included, enters a quantum superposition state.  Opening the box and observing the cat causes the state to collapse: you'll either see a happy living cat, or a feline tragedy.   

There is no need to actually perform the experiment: you can never actually observe a superposition state, and doing the experiment once to see whether the cat survives would just be cruel; it wouldn't answer any questions.  The challenge is to explain how to apply quantum mechanics to the situation such a setup creates.  Does quantum mechanics operate only at small scales, or does it genuinely have large-scale implications?  In the early 1900's, it was common to assume that the quantum world was somehow disconnected from the world of larger material objects.  The experiment sets up a scenario in which, at least at first glance, a small-scale phenomenon shapes a large-scale outcome.

The physics community debated the meaning of this strange scenario endlessly, but were unable to resolve their views: one group held that indeed, the cat ends up in a superposition state resolved only by opening the box.  A second community insisted that the outcome was "observed" when the uranium either did or did not decay.  Yet another community questioned whether the problem was properly posed.  Ultimately, exhaustion set in, at which point they agreed that it might be best to stop arguing.  

One reason for not arguing this question to a conclusion is that irrespective of the interpretation of the experiment, Schrödinger's equation for predicting the evolution of state in a system of particles works quite well, so the debate centers not on the physics, per se, but rather on the best way to explain the physics.  But while the cat story was intended to elevate the model to a more intuitive level, it only left everyone baffled.  So the decision to not argue about it was really an agreement that whatever the experiment illustrates, it definitely doesn't help explain the physics.

In fact, the many-worlds model resolves this puzzle, for those willing to accept that there are an infinity of worlds, all equally real.  But let's not go there, and instead focus on the statement of the puzzle itself.

As a computer scientist, it strikes me that a key part of the confusion is this: Can statements about quantum mechanics be made about cats?  I'm putting statements in italics because I want to emphasize that I'm asking a syntax question, not a semantic one.  Can the phrase "quantum superposition" be used as a modifier for "cat", as we might use "angry cat" or "happy cat"?  Is it correct to talk about "quantum superposition of a cat"?

One of my favorite courses to teach is Cornell’s CS2110, our second programming class, where we introduce abstractions and object oriented computing.  To me, there is nothing more foundational about computing than the shift in perspective that occurs when we define a new data type, instantiate it, and endow it with operations that have semantic meaning only with respect to the new abstraction.

With CS2110 in mind, let me pose a slightly different version of the cat question.  Suppose that I define a new data type, a “foo”, and a true/false property “shimmer.”  Given a foo object, you can always ask whether or not it shimmers.  But now let's introduce a second unrelated data type, one we'll call "bar".  If bar has no shimmer property, you simply can't ask if a bar object shimmers.  
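In code (CS2110 uses Java, but the point is identical in C++), the compiler enforces exactly this: the question "does it shimmer?" type-checks for a foo and is rejected outright for a bar.

```cpp
#include <iostream>

// A "foo" is defined with a shimmer property; a "bar" is not.
struct Foo {
    bool shimmers() const { return true; }  // the property exists, so the question makes sense
};

struct Bar {
    // no shimmers() here: the property was never defined for this type
};

int main() {
    Foo f;
    Bar b;
    std::cout << f.shimmers() << "\n";    // fine: a well-typed question
    // std::cout << b.shimmers() << "\n"; // compile error: the question isn't even well formed
}
```

The commented-out line is the whole point: asking whether a bar shimmers isn't false, it simply doesn't compile.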

Back to animals.  If one pets a cat, it may or may not purr.  You can pet a dog, but it will never purr.  What would it even mean for a dog to purr?  Asking "did that dog purr?" is thus a nonsense question.  Without defining what purring means, for dogs, the words simply don't fit together into a meaningful question.

So... armed with this very basic insight about abstract data types, let's revisit our physics puzzle.  Subatomic particles can be placed into superposition states.  But is the term "superposition" even relevant to a cat?

The cat in the box scenario conflates a “perceptual” abstraction (the cat) with a model related to the wave/particle duality (superposition). Superposition as properly defined is a specific phenomenon seen in a totally different setting: one involving elementary particles, as we define and model them. You can't simply take such a term and apply it willy-nilly.

This now leads to a second and surprisingly subtle question:  What, in fact, is a cat?

You might be tempted to say something about atoms, molecules, cells.   This would be like saying something about computer memories and bits and how Java compiles to basic operations, if we were talking about data types.  But just as memory locations change over time, a cat's atoms and molecules and cells are constantly changing.  Schrödinger's cat started life as a single fertilized cell.  It acquired most of its current mass as it grew to become an adult.  So, defining a cat in terms of its constituent parts seems wrong.

Is there a cat definition that transcends the “moving parts”?

As a computer scientist, an appealing answer is to suggest that catness centers on information.   This view would define the living cat as a biological program that implements all the various processes by which a cat comes into existence, grows, and also sustains itself.  In this perspective, the cat is a distributed program, admittedly one more complex than anything we know how to create today.  

This programming perspective can be used to talk about many kinds of things in our perceived world.   Even inanimate objects like stones are, in some sense, as much about information as they are about their constituent atoms and molecules and crystals. 

In fact an information perspective can even be used to assign a kind of discrete existence to things that have no physical embodiment.  We often talk about properties such as love, anger, happiness.  A cat purrs to express contentment.  Yet there is no physical object that corresponds to this  emotional state that causes the cat to purr: contentment, hence purring,  is a state resulting from a form of information-based biological computation, performed by the cat (I should almost say that the computation is executed on a biological machine that evolves over time under control of that same biological computation). 

If a cat is actually best understood by thinking about information in this sense, and Schrödinger's box is a form of information, what does it really mean to say that we have placed a cat in the box?  Obviously, at some high level of abstraction, you open the box, pick up Fluffy, and gently place her in the box.  You close it, wait a few minutes, then reopen the box and remove Fluffy.  But as you can see, if we really try to pin down the semantics of each of these steps it gets somewhat tricky.  Each one of them involves non-trivial definitions.  Some of these seem extremely difficult to really define in a full, logical way -- perhaps it isn't even possible in some cases.

Think now about the process as it was understood by late 19th and early 20th century particle physicists.   They started out on a path towards the basic structure of the universe.  This involved an effort to understand the composition of atoms, which they eventually understood to be composed of protons, neutrons and electrons.  Today, of course, we know that protons, neutrons and electrons are themselves structures composed of more elementary entities called quarks, and that quarks in turn may arise from mathematical structures that modern string-theorists like to call m-branes.  But let's focus just on elementary particles like the ones making up atoms.  At normal energies, this is a reasonable thing to do (you don't see protons and neutrons and electrons behaving like a soup of quarks unless you force them to collide, which requires extremely high energies).  

So we have atoms, and we have elementary particles within them.  We can model these abstractly, and to first approximation, they reside on something foundational and ultimately, are "atomic" in the original Greek sense of the term: indivisible, irreducible, fundamental.  

And, returning to our point, as it turns out we can and should talk about superposition for these elementary particles.  In this statement, the term "superposition" is applicable to our way of abstracting and modelling the behavior of individual particles.  Indeed, we are really forced to define terms like superposition and entanglement to develop a sensible physical description of particles like neutrons and protons and electrons.  Lacking these terms and their associated physical laws, we end up with an incomplete mathematical characterization.

The terminology, viewed this way, is meaningful in the context where it was first proposed, and is part of the mathematical physics that works so incredibly well at predicting the evolution of state for quantum systems.  However, the terminology isn't absolute.  We can't necessarily translate physics statements about this particular model of elementary particles to totally different settings.

In particular, when we take a term such as quantum superposition and apply it to Fluffy, we arrive at a conundrum: Is it, in fact, meaningful to make such a statement about a cat in a box?   Yes, we can write the statement easily enough: "the cat in the box is in a superposition state."  Yet one must question the well-formedness of the resulting statement itself.   

This issue is especially striking when we consider an information-based perspective on catness, because such a definition of what it means to be a cat, or a box, really transcends the physical realities of the constituent components of the cat and the box, and exists in a different semantic context.  Arguably, each level of abstraction in a physical system needs its own set of definitions, meanings, and physical laws.  While these layers of meaning obviously do emerge from more basic ones, including the layer at which the original quantum superposition abstraction was first recognized, it isn't always the case that a property or physical law from one layer carries over to another higher one.  Rather, each higher layer is "emergent" and as such, could be quite distinct from the layers below it.  My foo data type, for example, might be very remote from the Java JVM instruction set or the data values used to represent a particular foo instance.

Of particular note is that Fluffy is really the same cat, even though she grew from a kitten into an adult cat over the past few years.  Fluffy's atoms are mostly different from when she was small, yet she is still the same cat.  So here we have an emergent notion of Fluffy that actually departs from the subatomic components of Fluffy in two senses: first, Fluffy endures even though her constituent atoms may be entirely different.  And second, the behavior of Fluffy's atoms might not actually be evident at the higher level where Fluffy herself really exists.  There is a disconnect between the layers of reality.

This brings me to a recently published book about quantum physics:  What is Real?  The Unfinished Quest for the Meaning of Quantum Physics, by Adam Becker (I recommend it highly).  Becker describes a problem similar to the one I've outlined, and talks about the struggle Niels Bohr went through, trying to explain why measurement of quantum phenomena was (in his eyes) deeply problematic.  For him, the cat in the box puzzle was simply an incorrectly posed problem.  I think we would say today that his objection comes down to a "type checking" error.  

When you try to talk about quantum superposition states in relation to cats, you are combining two abstractions from completely different type systems.  And this simply isn’t well founded, a point that Bohr apparently fought to express to colleagues (who often were totally unable to appreciate his view).  Bohr perhaps didn't even arrive at the observation that the cat is really an information-based abstraction that sustains itself, a biological machine.  His focus was on the cat as a huge assemblage of particles.  But if you take this next step and think of the cat as a form of information you arrive at a further insight: the concept of superposition applicable to elementary particles might (conceivably) apply to assemblages of them.  But how could it possibly apply to a purely informational abstraction?

Even in the case of a cat viewed as a massive collection of particles, there is endless debate about whether such a collection is so entangled that we can think of it as a massive cat-particle, or is in fact a collection of independent particles, which would potentially have independent superposition states.  Becker discusses this at length... unfortunately he doesn't use the language of computer science, because for me our perspective on this is far clearer.  Yet he does a good job of explaining things in a more intuitive way.  Certainly his writing is far better than mine... I could never write a book for his audience. 

Getting back to Bohr, I find in his views an attempt to express precisely this issue of type inconsistency.  Schrödinger's experiment simply doesn't type-check.

Monday 7 May 2018

Is the universe a big asynchronous distributed system?

Now that summer is approaching, I'll have a chance to do some reading.  One topic I've been fascinated by, for a few years now, centers on the way that research on quantum computing is shedding light on the way the universe itself is "programmed". I'm fascinated by something that for me, was a surprise: a connection to distributed computing.  I'm not one to write a whole book on this topic (I once tried, but then gave up).  I'll summarize sort of briefly.

The place to start is with a famous real experiment by John Wheeler, the Princeton quantum mechanics professor who passed away in 2008 without a Nobel prize. Seeking to debunk the "spooky action at a distance" (also called "Copenhagen") model of quantum mechanics, Wheeler proposed a strange twist on those famous one and two slit interference experiments. His version worked like this:

You'll need a source of charged particles, very focused and able to run at low power (for example, one electron per millisecond). Then a beam splitter for that kind of beam, with a 50% probability of sending your beam "up" and a 50% probability of sending it "down". Mirrors reflect the beam back towards a narrow slit. And then you put a particle detector on the far side of the slit. A digital camera sensor will do the trick. You can find images of this, or of his two-slit version online (sometimes it is called a "delayed choice" experiment, but as we'll see, delay isn't the point).

Last, put a little detector on one arm of the experiment. If enabled, it will sense the passing electrons and report those by blinking an LED. But the detector should have an on-off switch. Initially, turn it off. The cool thing is that because our particle is charged, we can actually sense it without disturbing it, so the detector seemingly observes the particle without changing any property of the particle. (The "work" of turning the LED on or off is done by the detector hardware, not the particle beam. And you can even arrange the detector to be far enough from the path of the particle so that it will already have reached the camera and impacted the LCD screen before the "which way?" detection even occurs.)

Each single electron that passes through our system will actually hit just one pixel on the LCD screen of the camera on the far side of the slit. This is because a single electron delivers a single quantum of energy to a single LCD pixel: the image for just one electron would be just one very dim spot somewhere on the screen.  But if we run the experiment billions of times, we build up an image.

In this sense the image is a histogram revealing the underlying probabilities.  Each single run of the system (each electron) was a kind of probe of the probability density function of the system. Any given single electron picked an outcome at random among the possible outcomes, in accordance with the probabilities of each possible path and outcome. Thus we are visualizing the probability function -- the single electron itself wasn't smeared out at all. To get interference we really had no option except to run the experiment until billions of data points had been gathered. The interference pattern is the cumulative record of billions of individual events.

Well, let's run our experiment. What happens? As you might guess, this form of experiment reveals interference patterns: the electron has a 50-50 probability of taking the up path versus the down path. But this is true only with the LED that tracks the actual electron path turned off. If you turn on the LED detector... the pattern collapses to a bright dot! A very curious finding.

Even stranger: turn the detector on, but encase the LED output in a black bag so that no information about it escapes... the pattern reappears.

In effect, overtly "knowing" which way the electron went eliminates the diffraction pattern. Not sensing it at all, or destroying the data after sensing it, restores the diffraction pattern.
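If you like to see the "histogram of a probability density" idea concretely, here is a toy numerical sketch of my own (an illustration of the statistics, not of Wheeler's actual optics): each simulated electron lands on a single pixel, drawn from a distribution that keeps the cross term between the two path amplitudes when the which-way detector is off, and drops it when the detector is on.

```cpp
#include <complex>
#include <cstdio>
#include <random>
#include <vector>

constexpr double PI = 3.14159265358979323846;

// Toy model of building the pattern one electron at a time.  With the
// which-way detector off, the two path amplitudes add and the cross term
// produces fringes; with it on, probabilities (not amplitudes) add and the
// fringes vanish.  The phase model is schematic.
int main() {
    const int bins = 60, electrons = 200000;
    std::vector<int> hist_off(bins, 0), hist_on(bins, 0);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    auto intensity = [](double x, bool detector_on) {
        std::complex<double> up   = std::polar(1.0,  8.0 * PI * x);  // "up" path amplitude
        std::complex<double> down = std::polar(1.0, -8.0 * PI * x);  // "down" path amplitude
        return detector_on ? std::norm(up) + std::norm(down)         // which-way known: no fringes
                           : std::norm(up + down);                   // amplitudes interfere
    };

    for (int i = 0; i < electrons; ++i) {
        for (bool on : {false, true}) {
            // Rejection sampling: pick one landing position per simulated electron.
            double x;
            do { x = uni(rng); } while (uni(rng) * 4.0 > intensity(x, on));
            (on ? hist_on : hist_off)[static_cast<int>(x * bins)]++;
        }
    }

    const char* bar = "########################################";
    for (int b = 0; b < bins; ++b)   // crude ASCII rendering of the two "camera images"
        std::printf("%2d | %-40.*s | %-20.*s\n",
                    b, hist_off[b] / 200, bar, hist_on[b] / 200, bar);
}
```

In the output, the left histogram (detector off) shows fringes and the right one (detector on) is flat: the same collapse the experiment shows, emerging in this toy purely from whether the cross term is present.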

Sounds like magical nonsense? Yes, definitely. Yet this is a real experiment.  For me, it definitely deserved a Nobel prize.

Now, there are a few ways to make sense of such a finding. One is to consider relativistic time perspectives. If you do this, you quickly discover that concepts like "before" and "after" have no real meaning. Brian Greene's first book on modern physics discusses this and makes the point that causality is meaningful, but that non-causal concepts of time are ultimately subjective. Einstein was the first to realize this, and it underlies his theories of relativity. So was that sensor active when the particle passed? Yes, or no, depending on how you measured time. Not a well-posed question.

But people have done versions of this experiment in which the detector, and that decision, are physically quite far away. Using a speed of light argument, this version of the experiment potentially eliminates any relativistic question of interpretation.  Your decision genuinely, provably, occurs after the particles hit the LCD detector.  And seemingly, it lets you retroactively change the past. Cool, huh? And you thought that scenarios like this were the stuff of terrible TV serials.   Now you learn that they are just the stuff of terrible blogs!  But keep in mind: this is real science. 

Back to interpretations.  The second interpretation is the Copenhagen one. In this, the observation made by the which-way sensor collapses the quantum superposition that characterizes the system state, once the beam splitter has done its thing. The sense in which it is a spooky event is that seemingly, you can delay and decide whether or not to activate the which-way sensor until after the particle already hit the LCD pixel. Cool, huh? But utter nonsense if you believe that in reality, information cannot move backwards in time (faster than the speed of light). So people who adopt this view are tacitly accepting that quantum observations actually can move faster than the speed of light.

A third, more modern interpretation is called the many-worlds model. This model views the world as a giant quantum superposition. The superpositions are really non-quantum classical world-lines that follow Newtonian rules. So in a single world-line the behavior of our electron was fully determined by various factors: the specific electron hit the half-silvered mirror just when this molecule of aluminum dioxide was oriented just so, and hence it reflected up, deterministically. There was no probability involved at that step.

But a many-worlds model assumes that because we live at the tip of a causal cone stretching 13.8B years into the past, we only "know" a tiny amount about the universe in a causal sense of actual past observations. So in this model, each interaction between two elementary particles is an observation, and lives on into the causal future of both particles: an entanglement, but without any quantum fuzz. Today, we sit at the tip of this cone of past observations, but now think of all the unobserved state in the universe. Our world-line observed something (the state of the mirror) that was genuinely unpredictable. After the observation it was fully known.  Before observation, 50-50.

The many-worlds model holds that if both outcomes were equally probable, then this means that in some world-line the electron bounced up, and in others, it bounced down. The world lines were indistinguishable until that happened, but then each "learned" a different past.

So in this model, the arrow of time is concerned with increasing knowledge of the past. Any single world-line accumulates knowledge in a steady way.

Critically, no world-line ever collapses, vanishes, reverses course and changes its path. We just learn more and more as events progress.  The new knowledge rules out certain possibilities. Prior to hitting the mirror, it was genuinely possible that the electron might bounce up, and might bounce down. But after hitting the mirror, any world-line in which we "know" the state of the electron is one in which we have eliminated all uncertainty about that mirror interaction. By turning on the electron path detector, we eliminated all the uncertainty, and this is why the diffraction pattern vanished: with no uncertainty, our electron beam gives a sharp, tightly focused spot on the detector. 

How does all this explain our Wheeler experiment? Well, with the sensor turned off, an analysis of the probabilities of the various outcomes does give rise to an intersection not of the electron with itself, but rather of the two "branches" of the probabilistic state of the system, leading to the interference effect that we see as a pattern on our eventual LCD screen, built up one pixel at a time.  You can compute this pattern using Schrödinger's equation. With the sensor turned on, the probabilistic analysis is changed: if information from the sensor can reach the LCD detector, we eliminate uncertainty in a way that leaves us with a clean focused spot, or a clean set of lines (depending on how the slit is set up).

If you are a big believer in quantum computing, you'll be happiest with this third model, although many people in that field tell me that they prefer not to think of it in terms of real-world interpretations ("shut up and compute" is the common refrain). No need to worry about the meaning of all those world-lines.  Are those other world-lines "real?" Well, in what sense is our reality objectively more real, or less real? Any objective attempt to define reality either ends up with an unmeasurable property, or concludes that any world-line that has some non-zero probability of arising is as real as any other.

In a similar vein, it is wise to avoid speculation about free will, and about whether or not such a model leaves room for religion.

I myself am a believer in the many-worlds interpretation. It eliminates spooky faster-than-light action at a distance and other oddities. We are left with something elegant and simple: causality, nothing more. Lacking knowledge of the current state, some things seem possible, with various probabilities. Then we learn things, and that rules out other things. Voilà.

Now, how does this lead us towards a perspective relevant to computing, and distributed systems at that?

Well, we have these deterministic world-lines that are all about interactions of elementary particles (or maybe m-branes in the modern 11-dimensional string theories -- whatever your favorite most elementary model may be). Let me ask a small question: do you believe that all of this is governed by mathematics? Sure, mathematics we haven't fully learned yet, and without question, the mathematics may be very foreign to our normal mathematical systems. But do you believe that ultimately, the physical world is governed by physical law?

I myself do believe this: I have yet to hear of a physical situation that wasn't ultimately subject to a scientific explanation. So we are then looking at a model of the universe in which strict mathematical laws govern the evolution of these world-lines from their initial state, back at the big bang when time began, up to the present, 13.8B years later.

There were a set of possible initial conditions, and each initial state gives rise to a world-line. But there are a seemingly infinite number of possible initial conditions consistent with the 13.8B years of observations prior to me typing these characters into this blog-page. I am uncertain as to those unobserved states, and hence unable to predict the actual future: in some sense, I am all of the me's in all of these identical (up to now) world-lines. Then time progresses, events occur (interactions between elementary particles), and more knowledge is learned. My experience is that I reside in the world lines consistent with my personal past. But other versions of me experience the other possible outcomes. And so it goes, branching infinitely.

How does the universe compute these next states? Here, we approach the distributed systems question. The first answer is that it does so by applying the (only partially known to us) physical laws to the states of the world-lines, in a Newtonian style, computing each next state from the current state. As it turns out, this deterministic world-line model is actually fully reversible, so you can run time backwards and forwards (obviously, if you run a world-line backwards the system forgets its future, and regains uncertainty, but you can do this if you wish -- quantum computers depend on this property and wouldn't work, at all, without it).

So what does it mean to have a single world-line and to apply the "next" events to it? Seemingly, the proper model requires that we combine two models: one to model space, and one to model particles, both quantized. So we think of space as a mesh of 3-D (or perhaps 10-D) locations, plus time if you want to reintroduce clocks -- I'll leave them out for a moment, but Brian Greene prefers to include them, adopting the view that events sweep forward at the speed of light, such that any particle is moving at speed c, as a vector in space+time. In Brian's explanation, a photon moves at speed c and experiences no movement in time at all: it experiences its entire existence as a single event in time, spread over space. An electron in motion moves through space at some speed, and then experiences time at whatever rate will give us speed c for its space-time vector. Cool idea (Einstein's, actually).

So we have this mesh of spatial locations. What does space "know?" Well, it knows of the local gravitational gradient (the proper term is the "geodesic"), it knows of nearby particles and their contribution to electromagnetic fields, and in fact it knows of other forces too: the weak and strong force, the Higgs field, etc. So space is a record of fields. And then the particles seemingly know where they are (which space-time location they are in), and where they are heading (their vector of movement), plus other properties like mass, spin, etc.

The distributed computation is one that takes each element of space-time from its state "now" to its state one interaction into the future. This involves interactions with the particles in the vicinity, and also interactions with the neighboring space-time locations. The particles, similarly, interact with space-time (for example, our electron might be pushed by a local magnetic field, causing its trajectory to change, and it would learn the prevailing EM field by interrogating the space-time location), and also with one-another (when particles collide). Perhaps the collisions can be fully expressed as interactions mediated through space-time -- that would be nice.

Very much to my taste, this particular model has a dynamic form of group membership! First, the presence of mass causes space-time itself to stretch: new space-time elements form. This would generate the geodesic mentioned earlier. For example, near a heavy mass, such as when we approach the event horizon of a black hole, space-time curves: more spatial volume forms. Indeed, the infall of matter towards the singularity at the core of the black hole seemingly would cause an unbounded expansion of space-time, but entirely within the event horizon, where we can't actually see this happening. And then there is a second aspect, which is that as the most basic particles or m-branes combine or break apart, the population of particles seemingly changes too: perhaps, a neutron bangs into a proton, and a spray of other particles is emitted. As I said, the universal computation seems to track dynamic group membership!

Back to the real world (but here's a little Matrix-like puzzle, referring to the movie: if the universe is a superposition of world-lines and the world-lines are deterministic and governed by physics, and the physics ultimately has all the details filled in -- even if we don't know how the thing works, the universe itself obviously does -- does anyone need to run the computation? After all: there is a Taylor-series expansion of pi, and with it, I could compute the 2^2^2^2^2^...'th digit of pi in any radix you like. Perhaps I tell you that such and such a digit is 7, base 10. But that digit has never actually been computed by anyone except me. Would you trust me? No, but at the same time, you do have a way to validate the claim. The digit definitely has a defined value: my claim is either true, or false: it can't wiggle around and change value. So in a similar sense, the mathematics, given the initial conditions, predicts this moment, 13.8B years down the road. Do we care whether or not the computation has actually occurred? Is it meaningful to claim that we have some existence outside of this definition? And if so, how would you make that claim rigorous?)

At MIT, Scott Aaronson (who has since moved to Austin) and his colleague Seth Lloyd speculated about this -- two stars in the slowly growing field of quantum computing. Seth wrote a book in which he argues that even the universe wouldn't be able to solve unsolvable problems -- we computer scientists do have Russell's Paradox to wrestle with, not to mention basic questions of complexity.  If a problem takes unbounded time and resources to compute, it may not be solvable even if there is an algorithm for performing the computation. In this sense all the digits of pi are computable, but some digits cannot be "reached" in the 13.8B years the universe has had to compute them: they are much further out there.

Seth's point is interesting, because it raises real questions about what the distributed computation that comprises the universe might be up to. Taking this point about pi a bit further: pi itself has no finite representation. Yet pi certainly arises in physics, along with many other transcendental constants. How can the universe actually carry out such computations in bounded time?

A few ideas: it could be computing symbolically. The actual mathematics of the universe might be expressed in a different mathematics than we normally use, in which those constants vanish into some other representation (think of the way a transformation to polar coordinates eliminates the need to talk about pi... although such a transformation also introduces a problem, namely that the (x,y,z) coordinate (0,0,0) has no single representation in a polar coordinate system: it has vector length 0, but the associated angle is undefined). Anyhow, perhaps there is a mathematics in which these constants don't arise in their unbounded-length form. Or perhaps the universe only needs to carry out the computation until it has enough digits to unambiguously determine the next state.

Traditional models of singularities pose issues too, once you start to think in terms of tractable mathematics (Seth's point is that if a problem can't be solved computationally, the universe must not be solving it -- either we have the physics wrong, or it has a clever work-around!) A singularity is infinitely small and infinitely dense: clearly not a state amenable to a state-by-state computation. Again, various ideas have been floated. Perhaps the singularity is just not reachable from within the universe: the particle falling into it needs an unbounded amount of time to get there, as perceived from outside (in fact from the outside, the particle remains smeared over the event horizon and the associated information mixes with that of other in-falling particles, and this is all we can know). Still, for the universe to be doing this calculation, we would need to figure out what it "does" for space-time locations closer and closer to the event horizon, and very likely that model needs to account for the behavior inside the black hole and at the singularity, too. Otherwise, the mathematics would be incomplete. Seth argues against such views: for him, the thing that makes a universe possible is that it is self-consistent and has a complete specification of initial state and the physical laws by which that state evolves.

Here at Cornell, my colleague Paul Ginsparg always has the answers to such questions, up to the event horizon. Beyond that, though... as I said, perhaps we shouldn't view the questions or answers as part of this universe. Yet Paul points out that an object passing the event horizon wouldn't notice anything unusual, including the increasingly extreme curvature of space-time. Whatever the mathematics are that govern the universe, they should still work in there.

And here's yet one more puzzle.  If computing the next state involves some form of computation over the entire universe, the associated physical model has to be wrong: such a computation couldn't be feasible.  Yet this seems to imply that much of contemporary physics is incorrect, because these unbounded integrals do arise (for example, contemporary models of quantum uncertainty assign a non-zero probability to finding a particular particle anywhere in the entire universe).

Instead, some form of boundedness must enter the picture. At a minimum, the computation shouldn't need to reach out to more of the universe than we have had time to interact with: an electron on my computer screen has interacted (at most) with particles within a causal cone stretching just 13.8B years into the past, and in fact with just a tiny subset of those. This tells us that computing the next state is a finite problem, even if it is a bit large by computational standards. But it also tells us that as time elapses, it gets harder and harder for the universe to compute its own next state. Does it make any sense at all to imagine that physics works this way, or does the universe have some way to bound the computational task so that it would be of manageable scale?

Which leads to another insight.  Armed with our many-worlds hypothesis, suppose that all of us track down the URL of the "ANU quantum random number project."  As you might surmise, this project  uses quantum noise from outer space to generate what seem to be totally random numbers.  Now use those numbers as the basis for lottery ticket purchases in the next PowerBall drawing.  You'll win, in some world-line.  In fact, do it 50 times in succession.  If the game isn't rigged and you don't get arrested, kidnapped or killed, you'll win 50 times -- in some world-line.  But not in any world-line you are likely to actually be living in.

In fact there was an old science fiction story along these same lines: some very elderly fellow had lived an increasingly implausible life, surviving one catastrophe after another.  After all: if survival isn't impossible, but just very unlikely, then by definition there must be some possibility of surviving.  And in that world-line, you do indeed survive, and so forth.

Do low-probability events "really" ever occur?  Or does the universe somehow save itself the trouble and never compute them?  Could it be that there is even some hidden natural law that cuts off absurdly low probability sequences?  Brian Greene discusses this too, in a chapter that asks whether the initial conditions of the universe would have allowed a kind of Petit Prince scenario: could the initial universe have started with a single planet in it, and a single person standing on that planet, fully aware and healthy, air to breathe, but nothing else in the universe at all?  Extreme interpretations of the multi-verse theory seem to say that yes, such an initial condition would be possible.  But Greene argues that there is a sense in which all of these examples violate the second law of Thermodynamics: the law that says that entropy always increases (obviously, we can expend energy and create local order, but in doing so, we create even more entropy elsewhere).  So perhaps this law about entropy has a generalization that prunes low probability world-lines.

One could make the case that pruning low probability world-lines might be necessary because the total computational complexity of computing the universe would be too high if the universe were to compute every possible pathway, including all of these extremely obscure scenarios.  Pruning really obscure corner cases could be the key to computational tractability, in some very real sense (after all: many problems are infeasible in their general form, yet we manage to solve them computationally because we really only run into an easier subset that wouldn't normally include the very hard instances).  Is there a world-line out there where I actually won lotteries 50 times in a row, using those ANU quantum numbers?   Such a sequence would violate this generalized entropy principle.  So, maybe not, after all.  Sigh.  But on the positive side, that wouldn't have been a very fair way to win the lottery.  It would be nice if a natural law actually prevented it!

A fun topic. I'm always hungry for new books about it -- the people I've mentioned mostly have written such books (it seems to be a rite of passage for physicists). With summer arriving soon, let me know if  you have any to recommend!