Thursday 28 June 2018

When open source is the right model

At DSN, I found myself in conversation with some entrepreneurs who were curious to know why, in an era when people are making billions on relatively small ideas, we aren't adopting a more mercenary IP stance with Derecho.  For them, our focus on papers at DSN and TOCS, and on open software that really works, was a strange choice, given that we have a shot at products and startups that could be pretty lucrative.

Here are some dimensions of the question worth pondering.
  • Academic research is judged by impact, meaning broad adoption, lots of citations, etc.  
  • We started our project with DARPA MRC funding.  DARPA insisted that we use open source licensing from the start, but even if it hadn't, the reasoning below leads to the same conclusion.
  • Publicly funded research should benefit the taxpayers who wrote the checks.  For a system like Derecho, this means the system needs to be useful: adopted by US companies that can leverage our ideas in their products, helping them create lots of high-paying jobs.  Derecho should enable high-value applications that would not have been possible without it.
Should Derecho be patented and licensed for a fee?
  • Patents don’t protect mathematical proofs or theorems (or algorithms, or protocols).  Patents protect artifacts.  I often end up in debate with theory people who find this frustrating.  Yet it is the way the system works.  Patents simply cannot be used to protect conceptual aspects, even those fundamental to engineering artifacts that use those concepts in basic ways.  They protect the actual realization: the physical embodiment, the actual lines of code.  
  • Thus my group could perhaps patent Derecho itself, the actual software, through Cornell (this ownership assignment is defined under the US Bayh-Dole Act).  But we cannot pursue a patent on state machine replication, the 1980s model (the theory) underlying Derecho.  Our patent would be narrow, and would not stop you from creating your own system, Tornado, with similar algorithms inspired directly by our papers.  Sure, we could work with a lawyer to arrive at tricky patent-claim wording that a naive reader might believe to cover optimal state machine replication.  Yet even if the USPTO were to allow the resulting claims, no judge would uphold the broad interpretation and rule in our favor against Tornado, because patents are defined to cover artifacts, not mathematical theories.  Software patent claims must be interpreted as statements about Derecho's embodiment of the principle, not the principle itself.  This is just how patents work.
  • Wait, am I some kind of expert on patents?  How would I know this stuff about patent law?  Without belaboring the point, yes, I actually am an expert on software IP and patents, although I am not an IP lawyer.  I got my expertise by fighting lawsuits starting in the 1990s, most recently serving as the lead expert witness in a case with billions of dollars at stake, and I've worked with some of the country's best legal IP talent.  My side in these cases never lost, not once.  I also helped Cornell develop its software IP policies.
  • Can open source still be monetized?  Sure.  Just think about Linux.  Red Hat and other companies add high value, yet Linux itself remains open and free.  Or Databricks, built around open source Spark.  Nothing stops us from someday following that path.
So why does any of this imply that Derecho should be free and open source?
  • There are software systems that nobody wants to keep secret.  Thousands of people know every line of the Linux kernel source code, maybe even tens of thousands, and this is good because it enables Linux to play a universal role: the most standard "device driver" available to the industry, treating the whole machine as the device.  We need this form of universal standard, and we learned decades ago that without standards, we end up with a Tower of Babel: components that simply don't interoperate.  The key enabler is open source.
  • That same issue has denied us a standard, universal solution for state machine replication.  We want this to change, and for Derecho to be the standard.
  • There are already a ton of free, open source software libraries for group communication.  Linux even has a basic block replication layer built right in.  You need to value an artifact by first assessing the value of other comparable things, then asking what the value-add of your new technology is, then using the two to arrive at a fair market value and sales proposition.
  • But this suggests that because Derecho competes with viable options ("incumbents") that have a dollar value of zero, even if it offers high differentiated value as a much better tool, the highest price the market would bear is quite low.  So yes, we could perhaps charge $5 for a license, but we would be foolish to try to charge $500K.
  • You might still be able to construct a logic for valuing Derecho very high and then licensing it just once.  The broader market would reject the offering, but some single company might consider taking an exclusive license.  So you could protect Derecho with a patent, then sell it.  The system would end up as a proprietary product owned fully by the buyer: your US tax dollars hard at work on behalf of that one lucky buyer.  But then what happens?  In fact, someone else could simply create a new free version.  Examples?  Think about HDFS, Hadoop and Zookeeper (created to mimic GFS, MapReduce and Chubby, all proprietary).  To me the writing is on the wall: if Derecho isn't offered for free, the market will reject it in favor of less performant free software, and then, if the speed issue becomes a problem, someone else will eventually build a Derecho clone and offer it as free software to fill the gap.  They would find this fairly easy, given our detailed papers, and it would be entirely legal: recall that you can't patent protocols.
Conclusion? To maximize impact, Derecho needs to be open source.

Thursday 14 June 2018

If RDMA is just 4x faster, will that doom adoption?

I'm staring at our latest Derecho numbers and thinking about their implications.

With additional tuning, Weijia and Sagar now have Derecho on libfabric over TCP running just 4x slower than Derecho mapped to RDMA on the same NIC.  This number makes sense: on our clusters the memcpy speed was about 3.75 GB/s for uncached objects, while RDMA transfers run at a peak rate of 14 GB/s, just about a 4x difference.  So what we are seeing is that with one memcpy in the critical path from user to kernel, and one more from the receiver's kernel up to its user (TCP lives in the kernel), throughput drops to the speed of the copy operation.  In Derecho, adequately deep pipelines can absorb this effect; Derecho is far slower if the window size is limited to too small a value.
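To make the copy-speed ceiling concrete, here is the kind of microbenchmark that produces numbers like that 3.75 GB/s figure.  This is a minimal sketch of my own, not our actual measurement harness; the buffer size and iteration count are arbitrary choices:

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        // Buffers much larger than the last-level cache, so the copy
        // streams from DRAM: this approximates the "uncached object" case.
        const size_t len = 1ull << 30;  // 1 GiB
        std::vector<char> src(len, 1), dst(len, 0);
        const int iters = 10;

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++)
            std::memcpy(dst.data(), src.data(), len);
        auto stop = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(stop - start).count();
        double gbps = (double)len * iters / secs / 1e9;
        // Print a byte of dst so the copies can't be optimized away.
        std::printf("memcpy bandwidth: %.2f GB/s (dst[0]=%d)\n", gbps, dst[0]);
        return 0;
    }

Whatever number this prints is roughly the ceiling for any protocol stack with a memcpy on its critical path.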

Latency measurements are hard to carry out accurately at these speeds, but once we have them, I'm sure we will see that latency is higher with TCP, rising from about 1.5us for small objects by some factor associated with memcpy delay and thread scheduling (libfabric has extra threads that RDMA avoids).  Weijia's preliminary results suggest that one-way latency rises by tens of microseconds for small objects and hundreds of microseconds for large ones.  But today's cloud applications live happily with much higher latency due to multitasking and scheduling overheads, so this may not be a primary concern for many users.
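As an aside, the standard way to dodge the clock-synchronization problem in one-way latency measurement is a ping-pong test: time round trips on a single clock and report RTT/2.  A minimal sketch over plain TCP, my own illustration rather than Weijia's harness, with error handling elided and Linux sockets assumed:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    static void pingpong(int fd, bool initiator) {
        char buf[64] = {0};              // a "small object" payload
        const int iters = 10000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++) {
            if (initiator) {
                send(fd, buf, sizeof(buf), 0);
                recv(fd, buf, sizeof(buf), MSG_WAITALL);
            } else {
                recv(fd, buf, sizeof(buf), MSG_WAITALL);
                send(fd, buf, sizeof(buf), 0);
            }
        }
        auto stop = std::chrono::steady_clock::now();
        if (initiator) {
            double us = std::chrono::duration<double, std::micro>(stop - start).count();
            // RTT/2 approximates one-way latency without synchronized clocks.
            std::printf("estimated one-way latency: %.2f us\n", us / iters / 2.0);
        }
    }

    int main(int argc, char** argv) {
        int one = 1;
        if (argc == 3 && std::strcmp(argv[1], "server") == 0) {
            int ls = socket(AF_INET, SOCK_STREAM, 0);
            setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = INADDR_ANY;
            addr.sin_port = htons(atoi(argv[2]));
            bind(ls, (sockaddr*)&addr, sizeof(addr));
            listen(ls, 1);
            int fd = accept(ls, nullptr, nullptr);
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
            pingpong(fd, false);
            close(fd); close(ls);
        } else if (argc == 4 && std::strcmp(argv[1], "client") == 0) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            inet_pton(AF_INET, argv[2], &addr.sin_addr);
            addr.sin_port = htons(atoi(argv[3]));
            connect(fd, (sockaddr*)&addr, sizeof(addr));
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
            pingpong(fd, true);
            close(fd);
        } else {
            std::fprintf(stderr, "usage: %s server <port> | %s client <ip> <port>\n", argv[0], argv[0]);
            return 1;
        }
        return 0;
    }

TCP_NODELAY matters here: with Nagle's algorithm left on, small messages get batched and the numbers become meaningless.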

Moreover, to the extent that memcpy is the culprit, data center hardware vendors could speed up memcpy if they really set out to do so.  They could extend the DRAM itself to do the transfer, and offer that functionality via a special instruction.  That would probably double memcpy speeds.  But the demand hasn’t been there.  One could also modify TCP to do some form of scatter-gather with a DMA operation directly from the user-space object, avoiding the memcpy.  Of course this would break the semantics of the standard synchronous TCP send, but you could offer the fast path purely with asynchronous I/O, in which case the semantics wouldn’t need to change.
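In fact, recent Linux kernels (4.14 and later) already offer a step in this direction: MSG_ZEROCOPY, where send() pins the user pages and transmits directly from them, posting a completion notification to the socket's error queue once the data has been sent.  A rough sketch of the send side, with error handling mostly elided; note that the buffer must stay untouched until the completion arrives:

    #include <poll.h>
    #include <sys/socket.h>
    #include <cerrno>
    #include <cstddef>

    // Sketch: zero-copy TCP send on Linux 4.14+. The kernel transmits
    // directly from 'buf', so the caller must not modify or free 'buf'
    // until the completion notification arrives on the error queue.
    bool send_zerocopy(int fd, const void* buf, size_t len) {
        int one = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) != 0)
            return false;             // old kernel, or unsupported socket
        if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
            return false;

        // Wait for the completion. Pending error-queue data raises POLLERR,
        // which poll() reports even when no events are requested. A real
        // sender would handle this asynchronously rather than blocking here.
        for (;;) {
            pollfd p = {fd, 0, 0};
            poll(&p, 1, -1);
            msghdr msg = {};
            char control[128];
            msg.msg_control = control;
            msg.msg_controllen = sizeof(control);
            if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0)
                return true;          // done (kernel may have fallen back to copying)
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return false;
        }
    }

The semantics are exactly the asynchronous flavor suggested above: the send returns quickly, and the application learns later when it may reuse the buffer.  For small messages the kernel may simply fall back to copying, since pinning pages has costs of its own.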

Derecho is very tolerant of high latency.  Systems like TensorFlow are too, as would be any event-stream system that uses a queuing service (Kafka, SQS) to interconnect its components.  But some applications would care more, namely those dominated by RPC.  Are those still common?

This leads to my question: if RDMA only gives us a 4x speedup on datacenter networks, and latency increases but only by fractions of a millisecond, will operators adopt the technology, given the complexity of deployment?  As you know, if you've read this blog, Microsoft and Google are gaining experience with datacenter RDMA, using DCQCN or TIMELY for congestion control and various tricks to run RDMA on Converged Ethernet (RoCE) with flow isolation.  They are succeeding, but finding it fairly hard to pull off.  Normally, a technology that is this hard to manage and brings less than a 10x benefit doesn't make it.

One case for RDMA might be based on CPU loads.  TCP will peg a CPU core doing all this copying, so with Derecho over TCP the typical 12-core NUMA node acts more like an 11-core machine, and maybe even like a 10-core machine, since there are other threads involved too.  If you own the datacenter, this 10-15% tax on your compute resources could be a big issue: the reclaimed cores compound with the 4x speedup, giving RDMA that magic 10x benefit not in one year, but cumulatively over a four- or five-year period.

A second point is that not every project has all-star developers to do the tuning.  So our 4x may translate to a 20x difference for MPI, or 25x for Oracle.  That would drive towards RDMA adoption after all.  What Weijia and Sagar did came down to adjusting the SST multicast window size in Derecho; but first you have to figure out that for Derecho in this mode, the window size is the dominant factor.  Who knows what the equivalent knob would be for MVAPICH (MPI), or Oracle, or TensorFlow?  RDMA takes that puzzle, and that datacenter tax, off the table.
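For the curious, in Derecho this kind of knob lives in the configuration file each node reads at startup.  From memory, the relevant stanza looks roughly like this; parameter names are approximate, so check the sample config that ships with the source:

    [SUBGROUP/DEFAULT]
    max_payload_size = 10240
    block_size = 1048576
    # Depth of the send pipeline. Too small a window and throughput
    # collapses toward round-trip-bound behavior, as described above.
    window_size = 16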

My money is still on RDMA.  But these results over TCP definitely came as a surprise!