Thursday, 14 June 2018

If RDMA is just 4x faster, will that doom adoption?

I’m staring at our latest Derecho numbers and am thinking about their implications.

With additional tuning, by now Weijia and Sagar have Derecho on LibFabrics over TCP running just 4x slower than Derecho mapped to RDMA on the same NIC.  I guess this number makes sense, since on our clusters the memcpy speed was about 3.75GB for uncached objects, while RDMA transfers run at a peak rate of 14GB: just about a 4x improvement.  So what we are seeing is that with a memcpy in the critical path from user to kernel, and one more memcpy from receiver kernel up to the user (TCP lives in the kernel), throughout drops to the speed of the copy operation.  In Derecho, adequately deep pipelines can absorb this effect.  Derecho is far slower if the window size is limited to too small a value.

Latency measurements are hard to carry out accurately, at these speeds, but once we get them, I’m sure we will see that latency is higher with TCP: rising from about 1.5us for small objects by some factor associated with memcpy delay and thread scheduling: LibFabrics has extra threads that RDMA avoids.  Weijia’s preliminary results suggest that one-way latency rises by tens of microseconds for small objects and hundreds of microseconds for large ones.  But today’s cloud applications live happily with much higher latency due to multitasking and scheduling overheads, so this may not be a primary concern for many users.

Moreover, to the extent that memcpy is the culprit, data center hardware vendors could speed up memcpy if they really set out to do so.  They could extend the DRAM itself to do the transfer, and offer that functionality via a special instruction.  That would probably double memcpy speeds.  But the demand hasn’t been there.  One could also modify TCP to do some form of scatter-gather with a DMA operation directly from the user-space object, avoiding the memcpy.  Of course this would break the semantics of the standard synchronous TCP send, but you could offer the fast path purely with asynchronous I/O, in which case the semantics wouldn’t need to change.

Derecho is very tolerant of high latency.  Systems like Tensor Flow are too, as would be any event-stream system using a queuing service (Kafka, SQS) to interconnect its components.  But some applications would care more, namely those dominated by RPC.  Are those still common?

Leading to my question: if RDMA only gives us 4x speedup on datacenter networks, and latency increases but only by fractions of a millisecond, will operators adopt the technology, given the complexity of deployment?  As you know, if you’ve read this blog, Microsoft and Google are gaining experience with a datacenter RDMA, using DCQCN or TIMELY for congestion control and various tricks to run RDMA on Converged Ethernet (RoCE) with flow isolation.  They are succeeding, but finding it fairly hard to pull off.   Normally, a technology that is this hard to manage and brings less than a 10x benefit doesn’t make it.

One case for RDMA might be based on CPU loads.  TCP will peg a CPU doing all this copying, so with Derecho over TCP we see the typical 12-core NUMA node acting more like an 11-core machine, and maybe even like a 10-core machine since there are other threads involved, too. As the datacenter owner, this 10-15% tax on your compute resources could be a big issue, giving RDMA that magic 10x benefit not in one year, but probably over a four or five year period, considering that it was 4x faster in the first place.

A second point is that not every project has all-star developers to do the tuning.  So our 4x may translate to a 20x difference for MPI, or a 25x for Oracle.  That would drive towards RDMA adoption after all. What Weijia and Sagar did came down to adjusting the SSTMC window size in Derecho.  But you need to figure out that for Derecho in this modem the window size is the dominating factor.  Who knows what it would be for MVAPICH (MPI), or Oracle, or Tensor Flow? RDMA takes that puzzle, and that datacenter tax, off the table.

My cards are still on RDMA.  But these results over TCP definitely came as a surprise!

Monday, 28 May 2018

Derecho on LibFabrics versus Derecho on RDMA: Hot off the wire...

A while back, I mentioned that we were trying to port Derecho to also run over LibFabrics, so that people could use the system on a TCP-only network, or one with the Intel OMNI-Path/IODirect technology, which are based on the LibFabrics API rather than using RDMA Verbs.

So this is starting to work now and I thought I might share a quick summary of the experience!  A big thanks to Weijia Song and Sagar Jha, who worked with Cornell ECE undergraduate Alex Katz to make this happen.  It was, as it turned out, remarkably hard.

For those lacking context, let me start with a summary of why this topic is of interest, but then that group of people may want to tune out because I'll go back to being highly technical.  As you may know, RDMA is a technology for permitting one computer in a data-center or cluster (close-by, densely packed machines) to do direct memory-to-memory data transfers.  RDMA comes in various modes: you can do messaging over RDMA (one machine sends, the other receives), and in this case it will behave much like TCP, waiting to send until the receiver is ready, then transmitting losslessly.  The big win is that the transfer will run at DMA speeds, which are extremely high: 100Gbps or more.  In contrast, if you code a little TCP-based transfer on the identical computers, same network card, you tend to see performance 1/5th what RDMA can achieve or less, because of all the copying (from user space to kernel, then the kernel to kernel transfer over the network in IP packets, then copying up).  Copying takes time, and this ends up being 4/5ths of the total path.  RDMA eliminates all of that, at least for large transfer.

For small transfers, people work with "one-sided" RDMA in which the sender and receiver prenegotiate a connection but also share a memory region: B grants A permission to read or write within some chunk of B's local memory.  Then A basically ends up thinking of B's memory as mapped into A's own computer, although the data transfer isn't as transparent as just reaching into a mapped memory -- it has to occur via special read and write commands.  Still, the model would be one of shared memory, reliably updated or read remotely.

RDMA originated on InfiniBand (IB) networks, popular in today's supercomputer systems.  Most supercomputer communication runs over a library called the Message Passing Interface, MPI, via an implementation like MVAPICH, and versions like that are highly tuned to leverage RDMA on IB.  Our work on the Cornell Derecho system uses this same technology.

Now, continuing the back story, RDMA in this form dates to work done long ago at Cornell by Thorsten Von Eicken (went on to lead and Werner Vogels (now CTO at Amazon).  RDMA took off in industry, and some think it killed off the older style of parallel supercomputing, replacing it with less costly clusters of commodity machines connected with RDMA networks on IB.  Anyhow, right from the start the leaders pushed for a TCP-like model.  In fact you actually start by creating a TCP connection, a normal one, then put an RDMA connection in place "next to it", using TCP for passing control messages until the RDMA layer is up and running.  RDMA itself involves a firmware program that runs in the NIC (so these NICs are like little co-processors, and in fact they often have a full Linux O/S on them, and are coded in C or C++).  The interaction between end-host and NIC is through shared "queue" data structures, which are always grouped: there is a queue used for send/receive/read/write operations (called "verbs" for reasons lost in the mists of time), and then a separate completion queue where the NIC reports on operations that have finished.

Verbs, the RDMA API, define a huge number of specialization options.  Data can be "inline" (inside the operation) or external (via a pointer, and a length), or one can specify a vector of locations (scatter-gather).  Operations can be selected for completion reporting.  There is a way to multiplex and demultiplex more than one logical connection over a single RDMA connection, using "keys".  There are "tags" that can carry an additional few bytes of information such as object length data (useful in cases where a big transfer is broken down into little sub-transfers).

Then there are ways to use RDMA verbs for unreliable datagram-like communication.  Unlike the normal case, where the sender will be certain that messages got through, in order, reliably, with unreliable datagrams some form of end-to-end acknowledgement and retransmission would be needed in the application.  And this mode also includes an unreliable multicast feature.

RDMA is a troublesome technology in normal optical ethernet clusters and data centers, and many customers just run it on a second side-by-side IB network to avoid the hassle.  But there is an RDMA on ethernet standard called RoCE, and slowly the industry is learning to deploy it without destabilizing normal TCP/IP traffic that the same network would presumably carry.  RoCE hasn't been trivial to use in this way: one reads about storms of "priority pause frames" (PPFs), and there are at least two flow control concepts (DCQCN and TIMELY), both in active use, but neither fully mature.  Microsoft has just started to run RoCE+DCQCN on its big Azure clusters, although the company is limiting use of RDMA to just its own internal infrastructure tools.  Google uses it too.  In contrast, any end-user application on an HPC system depends on RDMA via MPI (Azure HPC permits this too).  So there are ways to access RDMA, but not directly, on most platforms.  Direct use is limited to people who build new infrastructure tools, like we did with our Derecho work.

Against this backdrop, there is a question of who will win in the RoCE market.  Intel seemed likely to become a big player when they purchased QLogic, and Mellanox, the dominate company on the HPC side, is the obvious incumbent, with maybe 80% of market share.

But RDMA has technical limits that worry some companies.  One concern is the Verbs API itself, which is sort of weird, since it requires these helper TCP connections (you won't find this documented in any clear way, but I challenge you to use RDMA Verbs without having an n * n array of TCP connections handy... CMU seems to have done this in their HeRD/FaSST system, but I'm not aware of any other project that pulled it off).  A second is that every RDMA transfer actually occurs as a parallel transfer of some number of "lanes" of data.  Each lane is a pair of single-bit connections, one for data travelling in each direction, and at the high-end of the range, lanes typically run at speeds of 14Gbps.  Thus when you read that the Connect-X4 from Mellanox runs at 100Gbp/s, you can assume that they employ 7 or 8 lanes to do this.  In fact the rates are quoted as one-way rates so you should double them in the case of Derecho, which is aggressive about using connections in a full bidirectional way.  By scaling the number of lanes, the industry is now starting to deploy 200Gbps solutions with 400 within sight and 1Tbps expected by 2025 or so.

Now, at these rates it isn't efficient to send a single bit at a time, so instead, RDMA devices normally send frames of data, generally 84 bytes (this is the frame size for 100GE).  Thus if you have 8 lanes, a single RDMA transfer might actually carry 512 bytes, with 1024 or more likely by the time the technology reaches the higher rates just mentioned.

The issue is that not every application wants to move objects in blocks of such large size.  After all, with direct RDMA writes, a typical operation might access just a few bytes or a word: perhaps 32 or 64 bits.  This will be incredibly inefficient over RDMA.

In Derecho we work around that by designing the system around a style that we think of as "steady flow".  We want a stream of updates to transit any given path, never pausing even briefly, because we want ways to aggregate traffic.  You can read about how we do this in our paper (it was under review by TOCS, but now we are revising it for resubmission, so the version I pointed to is going to change).

Intel has led a group of vendors arguing that this entire direction is ultimately unworkable: we will get to 1Tbps RDMA this way (and Derecho will let you access that speed), but most applications will find that they get a tiny fraction of the full performance because of patterns of transfers that send entire RDMA frames just to read or write a byte or two at a time -- in such cases the majority of the RDMA transfer is just wasted, because the NICs are exchanging perhaps 8 or 16 bytes of useful information and yet every RDMA transfer is of full size: 512 or 1024 bytes.

One answer is for everyone to use Derecho.  But LibFabrics, which is Intel's story, aims at a different concept: they want more flexibility about how the hardware could multiplex large numbers of concurrent remote I/O operations so that these huge frames can carry much more useful traffic.

At the same time, LibFabrics breaks with the RDMA Verbs model and tries to more explicitly mimic the normal Linux sockets API.  One can love or hate sockets, but at least they are a widely adopted standard that all of us know how to work with.  Lots of software runs through TCP or UDP over sockets, hence all of that software can (with some effort) also run on LibFabrics.  Yet RDMA is still available as a LibFabrics mapping: there is a way to run directly on Mellanox and theoretically, it maps to the identical RDMA Verbs we would have used in the first place.

So with this in mind, we bit the bullet and ported Derecho to LibFabrics.  Now our main software distribution actually will have both options, and we'll use some form of configuration manager (we're thinking about Getopt or its sibling, the slightly fancier Getpot).  Then you'll be able to tell us which NIC to select, which mode to run in, etc.

Derecho it looking pretty good on LibFabrics right now.  We're seeing identical performance for RDMA versus LibFabrics (on the identical NIC and network, obviously), provided that the transfer size is large.  For small operations, LibFabrics seems slower.  This is clearly at odds with a direct mapping to RDMA verbs (otherwise we would have seen identical performance for this case too), so we'll need to understand the root cause more deeply.  We've also tested Derecho over LibFabrics mapped to TCP, still on the same network but using the kernel stack, and it seems to be about 4x slower than with RDMA, again in the large objects case.  Derecho over TCP with small objects is very slow, so there is definitely a serious puzzle there.

One very cool option will be to look at Derecho on WAN links over TCP.  RDMA can't run on a WAN network, but we definitely can run LibFabrics this way.

Summary: work in progress, but work revealing very nice progress, even if some mysteries remain to be resolved.  Watch this blog for updates!  I'm sure we'll have lots of them in months to come.

(Added May 30): Here's data from an experiment run by Weijia Song, Alex Katz and Sagar Jha for a 3-member Derecho shard or group, first with 1 sender and all three members receiving and then with all sending and all receiving, using our atomic multicast configuration of ordered_send (compare with Zookeeper's ZAB protocol, without checkpointing).  

What you see here is that (1) LibFabrics didn't introduce overhead when we map down to RDMA.  To see this compare Derecho ROCE and IB against the Derecho Baseline (which used Verbs directly).  So, two left data sets versus the one on the right.  Axis is in gigabytes-per-second and the color coding is for the message size: 100MB, 1MB and 10KB.

Next look at the middle three cases.  These were ones in which we used the same NICs but told LibFabrics not to use RDMA and instead to run on the native TCP for those NICs (TCP works perfectly well on these NICs, but doesn't leverage RDMA).  So of course performance drops, by roughly the factor of 4x mentioned above, but what is impressive is that if you were to go to our Derecho paper and compare this with Zookeeper or LibPaxos, we are still close to 100x faster than those libraries!  That would be a true apples-to-apples comparison, since ZAB and LibPaxos run on TCP or UDP, not RDMA, so when we compared Derecho against them previously, it was a bit unfair to them -- they had no way to leverage RDMA.  

Why is Derecho so much faster than classic Paxos?  Clearly the world already had reached a data rate in which streaming data and doing receiver-side batching is far preferable to the classic 2-phase commit for each action.  Nobody had noticed this, but Derecho happens to live in this new configuration because our development methodology led us to that formulation of the protocols.  And here we can see that it pays off even on a TCP infrastructure!

Sorry for all the explanation points, but after something like 35 years of working on this kind of thing, I'm thrilled to see these numbers and to realize what a great opportunity this represents.

Wednesday, 16 May 2018

Type-checking Schrödinger's Cat

Computer scientists have long needed to deal with an issue that  triggered schisms within the physics community starting in the early 1900’s and continuing until the present.  

We all know Schrödinger's thought experiment: he proposed that one visualize placing a cat in a box with a flask of poison gas, doomed if a uranium atom happens to decay during the five minutes it is in the box, but free to live another (eight? nine?) lives otherwise.   His suggestion was that the entire apparatus, cat included, enters a quantum superposition state.  Opening the box and observing the cat causes the state to collapse: you'll either see a happy living cat, or a feline tragedy.   

There is no need to actually perform the experiment: you can never actually observe a superposition state, and doing the experiment once to see whether the cat survives would just be cruel; it wouldn't answer any questions.  The challenge is to explain how to apply quantum mechanics to the situation such a setup creates.  Does quantum mechanisms operate only at small scale, or does it genuinely have large-scale implications?  In the early 1900's, it was common to assume that the quantum world was somehow disconnected from the world of larger material objects.  The experiment sets up a scenario in which, at least at first glance, a small-scale phenomenon shapes a large-scale outcome.

The physics community debated the meaning of this strange scenario endlessly, but were unable to resolve their views: one group held that indeed, the cat ends up in a superposition state resolved only by opening the box.  A second community insisted that the outcome was "observed" when the uranium either did or did not decay.  Yet another community questioned whether the problem was properly posed.  Ultimately, exhaustion set in, at which point they agreed that it might be best to stop arguing.  

One reason for not arguing this question to a conclusion is that irrespective of the interpretation of the experiment, Schrödinger's equation for predicting the evolution of state in a system of particles works quite well, so the debate centers not on the physics, per se, but rather on the best way to explain the physics.  But while the cat story was intended to elevate the model to a more intuitive level, it only left everyone baffled.  So the decision to not argue about it was really an agreement that whatever the experiment illustrates, it definitely doesn't help explain the physics.

In fact, the many-worlds model resolves this puzzle, for those willing to accept that there are an infinity of worlds, all equally real.  But let's not go there, and instead focus on the statement of the puzzle itself.

As a computer scientist, it strikes me that a key part of the confusion is this: Can statements about quantum mechanics be made about cats?  I'm putting statements in italics because I want to emphasize that I'm asking a syntax question, not a semantic one.  Can the phrase "quantum superposition" be used as a modifier for "cat", as we might use "angry cat" or "happy cat"?  Is it correct to talk about "quantum superposition of a cat"?

One of my favorite courses to teach is Cornell’s CS2110, our second programming class, where we introduce abstractions and object oriented computing.  To me, there is nothing more foundational about computing than the shift in perspective that occurs when we define a new data type, instantiate it, and endow it with operations that have semantic meaning only with respect to the new abstraction.

With CS2110 in mind, let me pose a slightly different version of the cat question.  Suppose that I define a new data type, a “foo”, and a true/false property “shimmer.”  Given a foo object, you can always ask whether or not it shimmers.  But now let's introduce a second unrelated data type, one we'll call "bar".  If bar has no shimmer property, you simply can't ask if a bar object shimmers.  

Back to animals.  If one pets a cat, it may or may not purr.  You can pet a dog, but it will never purr.  What would it even mean for a dog to purr?  Asking "did that dog purr?" is thus a nonsense question.  Without defining what purring means, for dogs, the words simply don't fit together into a meaningful question.

So... armed with this very basic insight about abstract data types, let's revisit our physics puzzle. Subatomic particles can be placed into superposition states But is the term "superposition" even relevant to a cat? 

The cat in the box scenario conflates a “perceptual” abstraction (the cat) with a model related to the wave/particle duality (superposition). Superposition as properly defined is a specific phenomenon seen in a totally different setting: one involving elementary particles, as we define and model them. You can't simply take such a term and apply it willy-nilly.

This now leads to a second and surprisingly subtle question:  What in fact, is a cat?

You might be tempted to say something about atoms, molecules, cells.   This would be like saying something about computer memories and bits and how Java compiles to basic operations, if we were talking about data types.  But just as memory locations change  over time, a cat's atoms and molecules and cells are constantly changing.  Schrödinger's cat started life as a single fertilized cell.  It acquired most of its current mass as it grew to become an adult.  So, defining a cat in terms of its constituent parts seem wrong.

Is there a cat definition that transcends the “moving parts”?

As a computer scientist, an appealing answer is to suggest that catness centers on information.   This view would define the living cat as a biological program that implements all the various processes by which a cat comes into existence, grows, and also sustains itself.  In this perspective, the cat is a distributed program, admittedly one more complex than anything we know how to create today.  

This programming perspective can be used to talk about many kinds of things in our perceived world.   Even inanimate objects like stones are, in some sense, as much about information as they are about their constituent atoms and molecules and crystals. 

In fact an information perspective can even be used to assign a kind of discrete existence to things that have no physical embodiment.  We often talk about properties such as love, anger, happiness.  A cat purrs to express contentment.  Yet there is no physical object that corresponds to this  emotional state that causes the cat to purr: contentment, hence purring,  is a state resulting from a form of information-based biological computation, performed by the cat (I should almost say that the computation is executed on a biological machine that evolves over time under control of that same biological computation). 

If a cat is actually best understood by thinking about information in this sense, and Schrödinger's box is a form of information, what does it really mean to say that we have placed a cat in the box?  Obviously, at some high level of abstraction, you open the box, pick up Fluffy, and gently place her in the box.  You close it, wait a few minutes, then reopen the box and remove Fluffy.  But as you can see, if we really try to pin down the semantics of each of these steps it gets somewhat tricky.  Each one of them involves non-trivial definitions.  Some of these seem extremely difficult to really define in a full, logical way -- perhaps it isn't even possible in some cases.

Think now about the process as it was understood by late 19th century and early 20'th century particle physicists.   They started out on a path towards the basic structure of the universe.  This involved an effort to understand the composition of atoms, which they eventually understood to be atoms composed of protons, neutrons and electrons.  Today, of course, we know that protons, neutrons and electrons are themselves structures composed of more elementary entities called quarks, and that quarks in turn may arise from mathematical structures that modern string-theorists like to call m-branes.  But let's focus just on elementary particles like the ones making up atoms.  At normal energies, this is a reasonable thing to do (you don't see protons and neutrons and electrons behaving like a soup of quarks unless you force them to collide, which requires extremely high energies).  

So we have atoms, and we have elementary particles within them.  We can model these abstractly, and to first approximation, they reside on something foundational and ultimately, are "atomic" in the original Greek sense of the term: indivisible, irreducible, fundamental.  

And, returning to our point, as it turns out we can and should talk about superposition for these elementary particles.  In this statement, the term "superposition" is applicable to our way of abstracting and modelling the behavior of individual particles.  Indeed, we are really forced to define terms like superposition and entanglement to develop a sensible physical description of particles like neutrons and protons and electrons.  Lacking these terms and their associated physical laws, we end up with an incomplete mathematical characterization.

The terminology, viewed this way, is meaningful in the context where it was first proposed, and is part of the mathematical physics that works so incredibly well at predicting the evolution of state for quantum systems.  However, the terminology isn't absolute.  We can't necessarily translate physics statements about this particular model of elementary particles to totally different settings.

In particular, when we take a term such as quantum superposition and apply it to Fluffy, we arrive at a conundrum: Is it, in fact, meaningful to make such a statement about a cat in a box?   Yes, we can write the statement easily enough: "the cat in the box is in a superposition state."  Yet one must question the well-formedness of the resulting statement itself.   

This issue is especially striking when we consider an information-based perspective on catness, because such a definition of what it means to be a cat, or a box, really transcends the physical realities of the constituent components of the cat and the box, and exists in a different semantic context.  Arguably, each level of abstraction in a physical system needs its own set of definitions, meanings, and physical laws.  While these layers of meaning obviously do emerge from more basic ones, including the layer at which the original quantum superposition abstraction was first recognized, it isn't always the case that a property or physical law from one layer carries over to another higher one.  Rather, each higher layer is "emergent" and as such, could be quite distinct from the layers below it.  My foo data type, for example, might be very remote from the Java JVM instruction set or the data values used to represent a particular foo instance.

Of particular note is that Fluffy is really the same cat, even though she grew from a kitten into an adult cat over the past few years.  Fluffy's atoms are mostly different from when she was small, yet she is still the same cat.  So here we have an emergent notion of Fluffy that actually departs from the subatomic components of Fluffy in two senses: first, Fluffy endures even through her constituent atoms may be entirely different.  And second, the behavior of Fluffy's atoms might not actually be evident at the higher level where Fluffy herself really exists.  There is a disconnect between the layers of reality.

This brings me to a recently published book about quantum physics:  What is Real?  The Unfinished Quest for the Meaning of Quantum Physics, by Adam Becker (I recommend it highly).  Becker describes a problem similar to the one I've outlined, and talks about the struggle Neils Bohr went through, trying to explain why measurement of quantum phenomena was (in his eyes) deeply problematic.  For him, the cat in the box puzzle was simply an incorrectly posed problem.  I think we would say today that his objection comes down to a “type checking” error.  

When you try to talk about quantum superposition states in relation to cats, you are combining two abstractions from completely different type systems.  And this simply isn’t well founded, a point that Bohr apparently fought to express to colleagues (who often were totally unable to appreciate his view).  Bohr perhaps didn't even arrive at the observation that the cat is really an information-based abstraction that sustains itself, a biological machine.  His focus was on the cat as a huge assemblage of particles.  But if you take this next step and think of the cat as a form of information you arrive at a further insight: the concept of superposition applicable to elementary particles might (conceivably) apply to assemblages of them.  But how it could it possibly apply to a purely informational abstraction?

Even in the case of a cat viewed as a massive collection of particles, there is endless debate about whether such a collection is so entangled that we can think of it as a massive cat-particle, or is in fact a collection of independent particles, which would potentially have independent superposition states.  Becker discusses this at length... unfortunately he doesn't use the language of computer science, because for me our perspective on this is far clearer.  Yet he does a good job, in a more intuitive way of explaining things.  Certainly his writing is far better than mine... I could never write a book for his audience. 

Getting back to Bohr, I find in his views an attempt to express precisely this issue of type inconsistency.  Schrödinger's experiment simply doesn't type-check.

Monday, 7 May 2018

Is the universe a big asynchronous distributed system?

Now that summer is approaching, I'll have a chance to do some reading.  One topic I've been fascinated by, for a few years now, centers on the way that research on quantum computing is shedding light on the way the universe itself is "programmed". I'm fascinated by something that for me, was a surprise: a connection to distributed computing.  I'm not one to write a whole book on this topic (I once tried, but then gave up).  I'll summarize sort of briefly.

The place to start is with a famous real experiment by John Wheeler, the Princeton quantum mechanics professor who passed away in 2008 without a Nobel prize. Seeking to debunk the "spooky action at a distance" (also called "Copenhagen") model of quantum mechanisms, Wheeler proposed a strange twist on those famous one and two slit interference experiments. His version worked like this:

You'll need a source of charged particles, very focused and able to run at low power (for example, one electron per millisecond). Then a beam splitter for that kind of beam, with a 50% probability of sending your beam "up" and a 50% probability of sending it "down". Mirrors reflect the beam back towards a narrow slit. And then you put a particle detector on the far side of the slit. A digital camera sensor will do the trick. You can find images of this, or of his two-slit version online (sometimes it is called a "delayed choice" experiment, but as we'll see, delay isn't the point).

Last, put a little detector on one arm of the experiment. If enabled, it will sense the passing electrons and report those by blinking an LED. But the detector should have an on-off switch. Initially, turn it off. The cool thing is that because our particle is charged, we can actually sense it without disturbing it, so the detector seemingly observes the particle without changing any property of the particle. (The "work" of turning the LED on or off is done by the detector hardware, not the particle beam. And you can even arrange the detector to be far enough from the path of the particle so that it will already have reached the camera and impacted the LCD screen before the "which way?" detection even occurs.)

Each single electron that passes through our system will actually hit just one pixel on the LCD screen of the camera on the far side of the slit. This is because a single electron delivers a single quantum of energy to a single LCD pixel: the image for just one electron would be just one very dim spot somewhere on the screen.  But if we run the experiment billions of times, we build up an image.

In this sense the image is a histogram revealing the underlying probabilities.  Each single run of the system (each electron) was a kind of probe of the probability density function of the system. Any given single electron picked an outcome at random among the possible outcomes, in accordance with the probabilities of each possible path and outcome. Thus we are visualizing the probability function -- the single electron itself wasn't smeared out at all. To get interference we really had no option except to run the experiment until billions of data points had been gathered. The interference pattern is the cumulative record of billions of individual events.

Well, lets run our experiment. What happens? As you might guess, this form of experiment reveals interference patterns: the electron has a 50-50 probability of taking the up path versus the down path. But this is true only with the LED that tracks the actual electron path turned off. If you turn on the LED detector... the pattern collapses to a bright dot! A very curious finding.

Even stranger: turn the detector on, but encase the LED output in a black bag so that no information about it escapes... the pattern reappears.

In effect, overtly "knowing" which way the electron went eliminates the diffraction pattern. Not sensing it at all, or destroying the data after sensing it, restore the diffraction pattern.

Sounds like magical nonsense? Yes, definitely. Yet this is a real experiment.  For me, it definitely deserved a Nobel prize.

Now, there are a few ways to make sense of such a finding. One is to consider relativistic time perspectives. If you do this, you quickly discover that concepts like "before" and "after" had no real meaning. Brian Greene's first book on modern physics discusses this and makes the point that causality is meaningful, but that non-causal concepts of time are ultimately subjective. Einstein was the first to realize this, and it underlies his theories of relativity. So was that sensor active when the particle passed? Yes, or no, depending on how you measured time. Not a well-posed question.

But people have done versions of this experiment in which the detector, and that decision, are physically quite far away. Using a speed of light argument, this version of the experiment potentially eliminates any relativistic question of interpretation.  Your decision genuinely, provably, occurs after the particles hit the LCD detector.  And seemingly, it lets you retroactively change the past. Cool, huh? And you thought that scenarios like this were the stuff of terrible TV serials.   Now you learn that they are just the stuff of terrible blogs!  But keep in mind: this is real science. 

Back to interpretations.  The second interpretation is the Copenhagen one. In this, the observation made by the which-way sensor collapses the quantum superposition that characterizes the system state, once the beam splitter has done its thing. The sense in which it is a spooky event is that seemingly, you can delay and decide whether or not to activate the which-way sensor until after the particle already hit the LCD pixel. Cool, huh? But utter nonsense if you believe that in reality, information cannot move backwards in time (faster than the speed of light). So people who adopt this view are tacitly accepting that quantum observations actually can move faster than the speed of light.

A third, more modern interpretation is called the many-worlds model. This model views the world as a giant quantum superposition. The superpositions are really non-quantum classical world-lines that follow Newtonian rules. So in a single world-line the behavior of our electron was fully determined by various factors: the specific electron hit the half-silvered mirror just when this molecule of aluminum dioxide was oriented just so, and hence it reflected up, deterministically. There was no probability involved at that step.

But a many-worlds model assumes that because we live at the tip of a causal cone stretching 13.8B years into the past, we only "know" a tiny amount about the universe in a causal sense of actual past observations. So in this model, each interaction between two elementary particles is an observation, and lives on into the causal future of both particles: an entanglement, but without any quantum fuzz. Today, we sit at the tip of this cone of past observations, but now think of all the unobserved state in the universe. Our world-line observed something (the state of the mirror) that was genuinely unpredictable. After the observation it was fully known.  Before observation, 50-50.

The many-worlds model holds that if both outcomes were equally probable, then this means that in some world-line the electron bounced up, and in others, it bounced down. The world lines were indistinguishable until that happened, but then each "learned" a different past.

So in this model, the arrow of time is concerned with increasing knowledge of the past. Any single world-line accumulates knowledge in a steady way.

Critically, no world-line ever collapses, vanishes, reverses course and changes its path. We just learn more and more as events progress.  The new knowledge rules out certain possibilities. Prior to hitting the mirror, it was genuinely possible that the electron might bounce up, and might bounce down. But after hitting the mirror, any world-line in which we "know" the state of the electron is one in which we have eliminated all uncertainty about that mirror interaction. By turning on the electron path detector, we eliminated all the uncertainty, and this is why the diffraction pattern vanished: with no uncertainty, our electron beam gives a sharp, tightly focused spot on the detector. 

How did this all explain our Wheeler experiment? Well, with the sensor turned off, an analysis of the probabilities of the various outcomes does give rise to an intersection not of the electron with itself, but rather of the two "branches" of the probabilistic state of the system, leading to the interference effect that we see as a pattern on our eventual LDC screen, built up one pixel at a time.  You can compute this pattern using Schrodinger's equation. With the sensor turned on, the probabilistic analysis is changed: if information from the sensor can reach the LDC detector, we eliminate uncertainty in a way that leaves us with a clean focused spot, or a clean set of lines (depending on how the slit is set up).

If you are big believer in quantum computing, you'll be happiest with this third model, although many people in that field tell me that they prefer not to think of it in terms of real-world interpretations ("shut up and compute" is the common refrain). No need to worry about the meaning of all those world-lines.  Are those other world-lines "real?" Well, in what sense is our reality objectively more real, or less real? Any objective attempt to define reality either ends up with an unmeasurable property, or concludes that any world-line that has some non-zero probability of arising is as real as any other.

In a similar vein, it is wise to avoid speculation about free will, and about whether or not such a model leaves room for religion.

I myself am a believer in the many-worlds interpretation. It eliminates spooky faster-than-light action at a distance and other oddities. We are left with something elegant and simple: causality, nothing more. Lacking knowledge of the current state, some things seem possible, with various probabilities. Then we learn things, and that rules out other things. Viola.

Now, how does this lead us towards a perspective relevant to computing, and distributed systems at that?

Well, we have these deterministic world-lines that are all about interactions of elementary particles (or maybe m-branes in the modern 11-dimensional string theories -- whatever your favorite most elementary model may be). Let me ask a small question: do you believe that all of this is governed by mathematics? Sure, mathematics we haven't fully learned yet, and without question, the mathematics may be very foreign to our normal mathematical systems. But do you believe that ultimately, the physical world is governed by physical law?

I myself do believe this: I have yet to hear of a physical situation that wasn't ultimately subject to a scientific explanation. So we are then looking at model of the universe in which strict mathematical laws govern the evolution of these world-lines from their initial state, back at the big bang when time began, up to the present, 13.8B years later.

There were a set of possible initial conditions, and each initial state gives rise to a world-line. But there are a seemingly infinite number of possible initial conditions consistent with the 13.8B years of observations prior to me typing these characters into this blog-page. I am uncertain as to those unobserved states, and hence unable to predict the actual future: in some sense, I am all of the me's in all of these identical (up to now) world-lines. Then time progresses, events occur (interactions between elementary particles), and more knowledge is learned. My experience is that I reside in the world lines consistent with my personal past. But other versions of me experience the other possible outcomes. And so it goes, branching infinitely.

How does the universe compute these next states? Here, we approach the distributed systems question. The first answer is that it does so by applying the (only partially known to us) physical laws to the states of the world-lines, in a Newtonian style, computing each next state from the current state. As it turns out, this deterministic world-line model is actually fully reversible, so you can run time backwards and forwards (obviously, if you run a world-line backwards the system forgets its future, and regains uncertainty, but you can do this if you wish -- quantum computers depend on this property and wouldn't work, at all, without it).

So what does it mean to have a single world-line and to apply the "next" events to it? Seemingly, the proper model requires that we combine two models: one to model space, and one to model particles, both quantized. So we think of space as a mesh of 3-D (or perhaps 10-D) locations, plus time if you want to reintroduce clocks -- I'll leave them out for a moment, but Brian Greene prefers to include them, adopting the view that events sweep forward at the speed of light, such that any particle is moving at speed c, as a vector in space+time. In Brian's explanation, a photon moves at speed c and experiences no movement in time at all: it experiences its entire existence as a single event in time, spread over space. An electron in motion moves through space at some speed, and then experiences time at whatever rate will give us speed c for its space-time vector. Cool idea (Einstein's, actually).

So we have this mesh of spatial locations. What does space "know?" Well, it knows of the local gravitational gradient (the proper term is the "geodesic"), it knows of nearby particles and their contribution to electromagnetic fields, and in fact it knows of other forces too: the weak and strong force, the Higgs field, etc. So space is a record of fields. And then the particles seemingly know where they are (which space-time location they are in), and where they are heading (their vector of movement), plus other properties like mass, spin, etc.

The distributed computation is one that takes each element of space-time from its state "now" to its state one interaction into the future. This involves interactions with the particles in the vicinity, and also interactions with the neighboring space-time locations. The particles, similarly, interact with space-time (for example, our electron might be pushed by a local magnetic field, causing its trajectory to change, and it would learn the prevailing EM field by interrogating the space-time location), and also with one-another (when particles collide). Perhaps the collisions can be fully expressed as interactions mediated through space-time -- that would be nice.

Very much to my taste, this particular model has a dynamic form of group membership! First, the presence of mass causes space-time itself to stretch: new space-time elements form. This would generate the geodesic mentioned earlier. For example, near a heavy mass, like as we approach the event horizon of a black hold, space-time curves: more spatial volume forms. Indeed, the infall of matter towards the singularity at the core of the black hole seemingly would cause an unbounded expansion of space-time, but entirely within the event horizon, where we can't actually see this happening. And then there is a second aspect, which is that as the most basic particles or m-branes combine or break apart, the population of particles seemingly changes too: perhaps, a neutron bangs into a proton, and a spray of other particles is emitted. As I said, the universal computation seems to track dynamic group membership!

Back to the real world (but here's a little Matrix-like puzzle, referring to the movie: if the universe is a superposition of world-lines and the world-lines are deterministic and governed by physics, and the physics ultimately has all the details filled in -- even if we don't know how the thing works, the universe itself obvious does -- does anyone need to run the computation? After all: there is a Taylor-series expansion of pi, and with it, I could compute the 2^2^2^2^2^...'th digit of pi in any radix you like. Perhaps I tell you that such and such a digit is 7, base 10. But that digit has never actually been computed by anyone except me. Would you trust me? No, but at the same time, you do have a way to validate the claim. The digit definitely has a defined value: my claim is either true, or false: it can't wiggle around and change value. So in a similar sense, the mathematics, given the initial conditions, predicts this moment, 13.8B years down the road. Do we care whether or not the computation has actually occurred? Is it meaningful to claim that we have some existence outside of this definition? And if so, how would you make that claim rigorous)?

At MIT, Scott Aaronson (who has since moved to Austin) and his colleague Seth Lloyd speculated about this -- two stars in the slowly growing field of quantum computing. Seth wrote book in which he argues that even the universe wouldn't be able to solve unsolvable problems -- we computer scientists do have Russell's Paradox to wrestle with, not to mention basic questions of complexity.  If a problem takes unbounded time and resources to compute, it may not be solvable even if there is an algorithm for performing the computation. In this sense all the digits of pi are computable, but some digits cannot be "reached" in the 13.8B years the universe has had to compute them: they are much further out there.

Seth's point is interesting, because it raises real questions about what the distributed computation that comprises the universe might be up to. Taking this point about pi a bit further: pi itself has no finite representation. Yet pi certainly arises in physics, along with many other transcendental constants. How can the universe actually carry out such computations in bounded time?

A few ideas: it could be computing symbolically. The actual mathematics of the universe might be expressed in a different mathematics than we normally use, in which those constants vanish into some other representation (think of the way a transformation to polar coordinates eliminates the need to talk about pi... although such a transformation also introduces a problem, namely that the (x,y,z) coordinate (0,0,0) has no single representation in a polar coordinate system: it has vector length 0, but the associated angle is undefined). Anyhow, perhaps there is a mathematics in which these constants don't arise in their unbounded-length form. Or perhaps the universe only needs to carry out the computation until it has enough digits to unambiguously determine the next state.

Traditional models of singularities pose issues too, once you start to think in terms of tractable mathematics (Seth's point is that if a problem can't be solved computationally, the universe must not be solving it -- either we have the physics wrong, or it has a clever work-around!) A singularity is infinitely small and infinitely dense: clearly not a state amenable to a state-by-state computation. Again, various ideas have been floated. Perhaps the singularity is just not reachable from within the universe: the particle falling into it needs an unbounded amount of time to get there, as perceived from outside (in fact from the outside, the particle remains smeared over the event horizon and the associated information mixes with that of other in-falling particles, and this is all we can know). Still, for the universe to be doing this calculation, we would need to figure out what it "does" for space-time locations closer and closer to the event horizon, and very likely that model needs to account for the behavior inside the black hole and at the singularity, too. Otherwise, the mathematics would be incomplete. Seth argues against such views: for him, the thing that makes a universe possible is that it is self-consistent and has a complete specification of initial state and the physical laws by which that state evolves.

Here at Cornell, my colleague Paul Ginsparg always has the answers to such questions, up to the event horizon. Beyond that, though... as I said, perhaps we shouldn't view the questions or answers as part of this universe. Yet Paul points out that an object passing the event horizon wouldn't notice anything unusual, including the increasingly extreme curvature of space-time. Whatever the mathematics are that govern the universe, they should still work in there.

And here's yet one more puzzle.  If computing the next state involves some form of computation over the entire universe, the associated physical model has to be wrong: such a computation couldn't be feasible.  Yet this seems to imply that much of contemporary physics is incorrect, because these unbounded integrals do arise (for example, contemporary models of quantum uncertainty assign a non-zero probability to finding a particular particle anywhere in the entire universe).

Instead, some form of boundedness must enter the picture. At a minimum, the computation shouldn't need to reach out to more of the universe than we have had time to interact with: an electron on my computer screen has interacted (at most) with particles within a causal cone stretching just 13.B years into the past, and in fact with just a tiny subset of those. This tells us that computing the next state is a finite problem, even if it is a bit large by computational standards. But it also tells us that as time elapses, it gets harder and harder for the universe to compute its own next state. Does it make any sense at all to imagine that physics works this way, or does the universe have some way to bound the computational task so that it would be of manageable scale?

Which leads to another insight.  Armed with our many-worlds hypothesis, suppose that all of us track down the URL of the "ANU quantum random number project."  As you might surmise, this project  uses quantum noise from outer space to generate what seem to be totally random numbers.  Now use those numbers as the basis for lottery ticket purchases in the next PowerBall drawing.  You'll win, in some world-line.  In fact, do it 50 times in succession.  If the game isn't rigged and you don't get arrested, kidnapped or killed, you'll win 50 times -- in some world-line.  But not in any world-line you are likely to actually be living in.

In fact there was an old science fiction story along these same lines: some very elderly fellow had lived an increasingly implausible life, surviving one catastrophe after another.  After all: if survival isn't impossible, but just very unlikely, then by definition there must be some possibility of surviving.  And in that world-line, you do indeed survive, and so forth.

Do low-probability events "really" ever occur?  Or does the universe somehow save itself the trouble and never compute them?  Could it be there is even some hidden natural law that cuts off absurdly low probability sequences?  Brian Greene discusses this too, in a chapter that asks whether the initial conditions of the universe would have allowed a kind of Petit Prince scenario: could the initial universe have started with a single planet in it, and a single person standing on that planet, fully aware and healthy, air to breath, but nothing else in the universe at all?  Extreme interpretations of the multi-verse theory seem to say that yes, such an initial condition would be possible.  But Greene argues that there is a sense in which all of these examples violate the second law of Thermodynamics: the law that says that entropy always increases (obviously, we can expend energy and create local order, but in doing so, we create even more entropy elsewhere).  So perhaps this law about entropy has a generalization that prunes low probability world-lines.

One could make the case that pruning low probability world-lines might be necessary because the total computational complexity of computing the universe would be too high if the universe were to compute every possible pathway, including all of these extremely obscure scenarios.  Pruning really obscure corner cases could be the key to computational tractability, in some very real sense (after all: many problems are infeasible in their general form, yet we manage to solve them computationally because we really only run into an easier subset that wouldn't normally include the very hard instances).  Is there a world-line out there where I actually won lotteries 50 times in a row, using those ANU quantum numbers?   Such a sequence would violate this generalized entropy principle.  So, maybe not, after all.  Sigh.  But on the positive side, that wouldn't have been a very fair way to win the lottery.  It would be nice if a natural law actually prevented it!

A fun topic. I'm always hungry for new books about it -- the people I've mentioned mostly have written such books (it seems to be a rite of passage for physicists). With summer arriving soon, let me know if  you have any to recommend!

Wednesday, 4 April 2018

Blockchains and the new mythology

For two years now, the drumbeat of Blockchain technology has gradually dominated one area of systems after another: in the public eye, Blockchains are somehow the universal solution to every problem.  This I find very odd: the core BlockChain concept is technically flawed (I don’t want to repeat prior blogs, so I’ll simply point out that four or five of my older postings were on technical problems with the model).  The model doesn’t even fit many of the imagined uses.  And in actual fact, we see few examples of real uses, other than to support cryptocurrencies that run some risk of evaporating from your wallet.  Yet this dream of using BlockChain technology for everything that really matters has somehow taken hold.

BlockChain has evolved into a mythology.

I remember a talk by Michael Brody, the CTO of Verizon around 1998 (back when it was still part of the GTE empire).  He focused on the psychological wish for magic silver bullets that can slay every technical barrier.  In companies struggling with technology challenges, it can be very appealing to wish for miracles (and all too easy to worry that the other guy will find it first and win market dominance by so doing).  At the time, the silver bullets were client server architectures, CORBA, Paxos.  But the underlying pattern was similar: an overwhelming desire to believe, coupled with eyes closed against the limitations.

We see this now for artificial intelligence, too: how many self-driving cars will have to run down bicyclists and swerve into ongoing traffic before people realize that putting a robot in charge of a car is simply an overreach? The technology isn’t ready yet.

And so too with BlockChain.  Yet when I attend talks on Digital Agriculture, or the future of medicine, or banking, somehow the very term seems to command authority (and to shut down any skepticism the audience might normally have expressed).  In the New York Times on April 3, an article talked about BlockChain in all of these and many other “uses”, quoting one gushing entrepreneur as saying that BlockChain is a revolutionary, disruptive and nearly universal technology for storage, communication, security and safety.  Oh, and he suggests we use it for online voting, to repel those Russian hackers.  Come again?  All this from an append-only log, running on anonymous servers, and prone to rollbacks?

It does seem true that a permissioned BlockChain (one running on specified servers, probably in the machine room of a bank, or sold as a turn-key product by a storage or cloud vendor) would be a great place to log transactions that you want to keep on record indefinitely.  Moreover, a permissioned  Blockchain won’t roll back unexpectedly.  But the expert quoted by the NY Times apparently wants all sorts of digital information logged into indelible records, and seemingly has the permissionless variety of BlockChains in mind (he would never trust any single bank or company to host the chain).

Beyond the privacy issues raised by having your life logged in a globally shared place, we get the oddity of using a type of log that by construction is capable of spontaneously erasing itself.  It could even be erased deliberately by the same Russian hackers out to tamper with the election you are trying to protect!

Setting the technical issues to the side, the psychology of this article speaks to Brody’s old story of using silver bullets to slay dragons.  Technology has become so complex that it certainly can feel like magic, and magical thinking is a natural fit.  Nobody wants their illusions punctured.  No technology is perfect, but even this plays into the story: if you point to a flaw, like the tendency of permissionless Blockchain to roll back, Blockchain fans just assert that version 2.0 will fix that.

The dialog reminds me of science fiction.  We start with a conceit: “dilithium crystals and antimatter  enable faster than light travel” (and every other technical miracle the script writers could dream up).  The physics wouldn’t bear up under close scrutiny, but we don’t look too closely.  And from this starting point, we boldly go where no one has gone before.

But one can easily understand the motivation of a science fiction script writer.   Where I’m left puzzled is with the motivations of all these self-proclaimed experts.  are they deliberately lying?

Quite a few must know better. Yet somehow, this chance to be the expert seems to drive rational computer scientists to make wild claims, while brushing obvious concerns to the side.  While the scam endures, these people are becoming millionaires.

I imagine that it must be fun to be able to sound off at fashionable cocktail parties, too.  “Am I concerned by the opioid crisis?  Well, of course.   But in my view, the entire problem could be solved by using Blockchain to log every transaction involving these habit forming drugs...”

Down the road reality will impose itself.  But I guess that by then, these experts will have long since cashed out, bought Napa vineyards, and moved on to speculate in some other illusory commodity.

Wednesday, 7 March 2018

Data stream computing

I've actually posted on this topic in the past, but I spend so much of my own time thinking about it that I'm going to revisit the question from my "revised and improved" perspective, a year or so down the road.

Context: with the move towards a disaggregated data center model in which we use RDMA or other forms of next-generation direct access to talk to remote memory (perhaps persistent, perhaps not), we find ourselves in a world where the underlying costs and barriers have shifted dramatically relative to the programming style familiar in the past.

Specific example: Consider 2-phase commit, or your favorite consensus protocol, or Paxos.  In the recent past, we would program these protocols by having some leader who initiates the protocol and sends out a request to the other participants: "can you commit transaction 71?" or "I vote 1", or  "I propose that we place value v into Paxos log slot 22, and this is my 3rd ballot for that slot."  Responses eventually come back, and after a round or two of these back-and-forth interactions, the protocol terminates.

In that classic way of thinking, there was a kind of implicit balance in mind between the cost of moving the data (value v in the example above) and the cost of the back-and-forth round-trips.  I'm not quite sure how to best express this sense of balance, but what one finds is that the cost of the round-trip actions (the RTT, as we say in the business...) is sort of "reasonable" relative to the bandwidth available for moving the value field, often over TCP connections or UDP.

Today, with new forms of direct remote memory access, talking to remote memory over a technology like RDMA is actually faster than talking to local memory, but only in the sense that RDMA can sustain a far higher bandwidth than local actions like using memcpy (the fastest available copying primitive on a Linux/C computing platform) to move data from one place in memory to another, assuming that the data in question isn't in the L2 cache.   As RDMA speeds up, this gap will only get larger.  So it is very cheap to move data from machine A to machine B.

In contrast, the RTT for a modern optical Ethernet is surprisingly high, when you consider quite how fast copying can be.  We see RTTs in the 1.5us to 2.5us range: quite fast compared to the 1-10ms numbers that were common for protocols running over TCP/UDP/IP, but relatively sluggish if you consider the number of bytes that could have been sent in that time, particularly if you add in the delays for scheduling a thread to notice the request coming in (when A's message arrives on B), or to send the reply (when B's response reaches A).

If A blocks sending while waiting for this RTT to complete, quite a few bytes won't be sent.  And as RDMA gets faster (today, 100Gbps, but tomorrow 200, then 400, and 1Tbps is within range now), the amount of missed transmission opportunity gets pretty substantial.

Concurrency feels like the obvious answer,  but this instinctive reaction proves to be naive.  The theory would be that if A has lots of side by side transactions with B, then while one waits, another can transmit.  But that overlooks overheads: concurrency also brings many kinds of costs, and this model is anything but a clear win!  Those costs could quickly add up: locking to ensure that the transfers won't interfere with one-another, memory contention on the DRAM chunks where the data lives, various forms of state that might be sitting around while the round-trip interactions are running (threads consume stack space, for example).  So concurrency doesn't come for free.

Batching is a second possibility:  A could collect a bunch of requests, then send them as a batch (while collecting the next batch), B could respond on the whole series of requests (one by one, but as a batch of responses), and so forth.  This amortizes costs in a nice way.

Batching makes sense on a receiver, too.  Rather than sending batches and forcing the receiver to use those batch sizes, a sender could send continuously, and the receiver could bite off a chunk of incoming records based on when it happens to be ready to run. So if we have A sending continuously, rather than having A batch 10 items at a time, it could stream steadily to B.  B might be ready for its next chunk of input at a moment when there are 11 items available; it treats these as a batch of 11.  Then at the next loop, B is perhaps ready for input when 4 new items have turned up: a batch of 4.  We still see a benefit of amortized costs, but the batch sizes are variable and determined by when B happens to be ready, rather than fixed and decided by A (the sender).

Derecho works this way.

Now why are we discussing these obvious points?  Well, the optimality of Derecho is directly tied to its use of this style of receiver-side batching.    The key "thing" about Derecho is that it runs a version of Paxos modified to make decisions as soon as possible, based on the Paxos notion of safety, using receiver-side batches.  This allows us to stream an exceptionally high rate of Paxos requests, process them with no unnecessary delays, and amortize all costs over batches of events.

In fact, the value of this method centers on the steady stream of updates.  With one update, by itself, classic Paxos is a reasonable choice of protocol.  But with a stream of updates, we need to amortize costs for efficiency.  The classic protocols run in a greedy way, and cease to be efficient because they miss this opportunity!

So now the question arises: is this a story purely about Derecho, or does the same opportunity transition to other kinds of systems that aren't doing atomic multicast (vertical Paxos) and atomic log updates (classic durable Paxos)?  For example, could a distributed file system somehow gain performance by being implemented to use the same sort of pattern?  Could a database system be optimized by leveraging this sort of transformation?

And we can pose the problem even more broadly.   As a programming question (almost a compilation question), is there a systematic programming methodology here, one that would lead the developer to an optimized solution matched to the underlying cost model mentioned earlier (the cost model in which steady flow works extremely well at high rates and with large data volumes, but RTT looks somewhat costly)?  How could one start with a specification of a file system, or a specification of Paxos, and systematically arrive at a solution like Derecho?

I find it fascinating that despite all the work that has been done on stream processing in database systems, there seems to be very little prior work on the kind of protocol optimization I've outlined.  The underlying "memory model" is really just an exaggerated NUMA one, with a cache-line memory coherence model, remote memory access, but with unusually high latencies relative to the extremely  high bandwidth.   By and large, my colleagues in the PL community view the resulting questions as being variations on what they are already used to: they have extensive experience with NUMA programming.  Yet here we can see that we actually are after solutions that can be quite different from the classic ones: Derecho doesn't look a lot like Paxos at first glance, and the optimality we achieved emerges from the Paxos protocol transformations that let us optimize for the RDMA environment.  So there was something non-trivial happening here, even though it happens over the standard NUMA model.

I don't know if we have PL readers of this blog, but I would love to hear their thoughts, if so!

Sunday, 18 February 2018


Recent revelations about troll postings to Facebook, Twitter and other media sites create an obvious question: can we trust the authenticity of online commentary, or should we basically distrust everything?  After all, even a posting by my brother or an email from my mother could be some kind of forgery.

The question has broader dimensions.  Twenty-two years ago, David Cooper and I published a little paper on secure and private email.  In this paper, we asked whether there is a way for two people (who know one-another) to send and receive emails in a monitored environment, with some form of untrusted observer trying to detect that communication has occurred, and hoping to retrieve the message itself, too.

Our 1995 solution uses cryptography and fake traffic to solve the problem.  The fake traffic ensures that there is a steady flow of bytes, whether or not the two people are communicating.  Then we designed a kind of shared storage system that plays the role of an email server: you can hide data inside it.  The email itself was encrypted, but also broken into bits in the storage layer and hidden inside vast amounts of noise.  Then the act of sending or receiving an email was mapped to a vast amount of reading and rewriting of blocks in the storage system.  We showed that an observer learns very little in this case, and yet you can send and receive emails in a way that guarantees the authenticity of the messages.

This week a Cornell visitor told us about ways to improve on that style of older system, but the broad framework was similar.  I have always loved solutions built from little cryptographic building blocks, so I thought this was a really fun talk.   The problem, of course, is that nobody adopts tools like these, and unless everyone is using them, the mere fact of having a copy of the software might tip bad actors off to your interest in secret communication (then they can abduct you and force you to reveal everything).  To really work, we would need a universally adopted standard, one that nearly everyone was using even without realizing it -- the WhatsApp of secure email.  That way, when they come to question you, you can pretend to have absolutely no idea what they are talking about.

The other problem is that in contemporary society, there is a slight bias against privacy.  While most people would agree that we have a right to privacy, they seem to mean "unless you are trying to hide a secret we want to know about."  So there is a contradiction in the sense that we accept the right to privacy, yet also seem to believe in a broader societal right to intrude, particularly if the individuals are celebrities -- as if privacy rights vanish with any form of fame or notoriety.  There is also a significant community that assumes that privacy is something people would want primarily as a way to hide something: an unusual sexual preference, or criminal activity, or terrorism.

Back to the trolls.  In the cases recently publicized by the FBI, CIA and NSA, we learned that Russia has at least one (maybe more) companies, with large numbers of employees (80 or more) who work full time, day in and day out, planting fake news, false commentaries and incendiary remarks in the US and European press and social networks.  Here in Ithaca, the local example seems to be a recent event in which a dispute arose about the diversity of casting for a high school play (although the lead role is that of Carmen, a gypsy woman and hence someone who would normally have dark skin, the casting didn't reflect that aspect of the role).  This was then cited as part of a pattern, and a controversy around casting erupted.

Any small town has such episodes, but this one was unusual because suddenly, a torrent of really vile postings, full of racist threads, swamped the local debate.  One had the sense that Ithaca (a northern town that once had a big role on the underground railway for helping escaped slaves reach freedom) was some sort of a hotbed of racism.  But of course there is another easy explanation: perhaps we are just seeing the effects of this Russian-led trolling.  The story and this outburst of racism are precisely in line with what the FBI reported on.  In fact, some of the nasty stuff is home grown and purely American.  But these trolling companies apparently are masters at rabble-rousing and unifying the Archie Bunkers of the world to charge in whatever direction they point.

So here we have dual questions.  With Facebook, or in the commentary on an article in a newspaper, I want to be confident that I'm seeing a "legitimate" comment, not one manufactured in a rabble-rousing factory in Russia.  Arguably, this formulation is at odds with anonymity, because just knowing an account name won't give me much confidence that the person behind the account is a real person.  Trolls create thousands of accounts and use them to create the illusion that massive numbers of people agree passionately about whatever topic they are posting about.  They even use a few as fake counter-arguers to make it all seem more real.

So it isn't enough that Facebook has an account named Abraham Lincoln, and that the person posting on that account has the password.  There is some sense in which you want to know that this is really good old honest Abe posting from the great beyond, and not an imposter (or even a much younger namesake).  Facebook doesn't try to offer that assurance.

This is a technical question, and it may well have a technical answer, although honestly, I don't see an immediate way to solve it.  A quick summary:
  • Desired is a way to communicate, either one-to-one (email), or one-to-many (Facebook, within a social network), or one-to-all (commentary on newspapers and other public media web sites).
  • If the individuals wish to do so privately, we would wish for a way to do this that reveals no information to the authorities, under the assumption that "everyone uses social media".  So there should be a way to communicate privately that somehow hides itself as completely normal web site browsing or other normal activities.
  • If the individuals wish to post publically, others should be able to authenticate both the name of the person behind the posting (yes, this is the real "Ken Birman," not a forged and twisted fake person operated as a kind of web avatar under control of trolls in St. Petersburg), and the authenticity of the posting (nobody forged this posting, that sort of thing).
  • In this public mode, we should have several variations:
    • Trust in the postings by people in our own social network.
    • Indirect trust when some unknown person posts, but is "vouched for" by a more trusted person.  You can think of this as a form of "trust distance".
    • A warning (think of it as a "red frame of shame") on any posting that isn't trustworthy at all.  The idea would be to put a nice bright red frame around the troll postings. 
  • When someone reacts to a troll posting by reposting it, or replying to it, it would be nice if the social networking site or media site could flag that secondary posting too ("oops! The poster was trolled!").  A cute little icon, perhaps?  This could become a valuable tool for educating the population at large about the phenomenon, since we often see secondary commentary without understanding the context in which the secondary remark was made.
Then we would want these ideas widely adopted by email systems, Facebook, Twitter, Google, the New York Times, the Breitbart News, and so forth.  Ideally, every interaction would offer these options, so that any mail we send, posting we make, or any that we read, is always protected in the intended manner.

Could this agenda be carried out?  I believe so, if we are willing to trust the root authentication system.  The Cornell visitor from this week pointed out that there is always an issue of the root of trust: once someone can spoof the entire Internet, you don't have any real protections at all.  This extends to your computer too: if you are using a virtualized computer that pretends to support the trust framework but in reality, shares your information with the authorities, all privacy bets are (obviously) off.  If the display system carefully and selectively removes some red frames, and inserts others where they don't belong, we're back to square zero.

So there are limits.  But I think that with "reasonable assumptions" the game becomes one of creating the right building blocks and then assembling them into solutions with the various options.  Then industry would need to be convinced to adopt those solutions (perhaps under threat of sanctions for violating European privacy rules, which are much tougher than the ones in the US).  So my bet is that it could be done, and frankly, when we consider the scale of damage these hackers and trolls are causing, it is about time that we did something about it.