I’m staring at our latest Derecho numbers, thinking about their implications.
With additional tuning, Weijia and Sagar now have Derecho on LibFabrics over TCP running just 4x slower than Derecho mapped to RDMA on the same NIC. I guess this number makes sense: on our clusters, memcpy runs at about 3.75GB/s for uncached objects, while RDMA transfers peak at 14GB/s, just about a 4x difference. So what we are seeing is that with one memcpy on the critical path from user space down into the kernel, and another from the receiver’s kernel back up to user space (TCP lives in the kernel), throughput drops to the speed of the copy operation. In Derecho, adequately deep pipelines can absorb this effect; with the window size set too small, Derecho is far slower.
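For the curious, the kind of memcpy number I’m quoting can be reproduced with a trivial microbenchmark along these lines (a sketch, not our actual measurement harness; the buffers are sized far beyond the last-level cache so that we measure the DRAM-to-DRAM “uncached” case that matters on the TCP critical path):

    // memcpy_bw.cpp -- rough memcpy bandwidth probe (illustrative sketch).
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        const size_t len = 1ull << 30;             // 1 GiB per copy, well past any cache
        std::vector<char> src(len, 1), dst(len);
        const int iters = 10;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++)
            std::memcpy(dst.data(), src.data(), len);
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        double gbps = (double)len * iters / secs / 1e9;
        // print a byte of dst so the copies cannot be optimized away
        std::printf("memcpy bandwidth: %.2f GB/s (check byte %d)\n", gbps, dst[len / 2]);
    }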
Latency measurements are hard to carry out accurately at these speeds, but once we have them, I’m sure we will see that latency is higher with TCP: rising from about 1.5us for small objects by some factor associated with memcpy delay and thread scheduling (LibFabrics runs extra threads that RDMA avoids). Weijia’s preliminary results suggest that one-way latency rises by tens of microseconds for small objects and hundreds of microseconds for large ones. But today’s cloud applications live happily with much higher latencies, due to multitasking and scheduling overheads, so this may not be a primary concern for many users.
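Measuring latency at this scale means a ping-pong test, something like the sketch below. This is loopback TCP in one process purely for illustration; a real experiment would put sender and receiver on separate machines, pin threads, and report percentiles over many round trips rather than one mean:

    // pingpong.cpp -- one-byte TCP round-trip timer (illustrative sketch).
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9000);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(srv, (sockaddr*)&addr, sizeof(addr));
        listen(srv, 1);

        if (fork() == 0) {                          // child: trivial echo server
            int c = accept(srv, nullptr, nullptr);
            char b;
            while (read(c, &b, 1) == 1) write(c, &b, 1);
            _exit(0);
        }

        int cli = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(cli, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); // defeat Nagle
        connect(cli, (sockaddr*)&addr, sizeof(addr));

        const int iters = 100000;
        char b = 'x';
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++) { write(cli, &b, 1); read(cli, &b, 1); }
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        std::printf("one-way latency estimate: %.2f us\n", us / iters / 2);
    }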
Moreover, to the extent that memcpy is the culprit, datacenter hardware vendors could speed up memcpy if they really set out to do so. They could extend the DRAM itself to perform the transfer and expose that functionality through a special instruction, which would probably double memcpy speeds; the demand simply hasn’t been there. One could also modify TCP to do a form of scatter-gather, DMAing directly from the user-space object and avoiding the memcpy entirely. Of course this would break the semantics of the standard synchronous TCP send, but you could offer the fast path purely for asynchronous I/O, in which case the semantics wouldn’t need to change.
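In fact, Linux already gestures in this direction: kernels from 4.14 on offer MSG_ZEROCOPY, which pins the user pages and lets the NIC DMA directly from them, essentially the asynchronous fast path I’m describing. The catch is exactly the semantic one: the buffer must not be touched until a completion notification arrives on the socket’s error queue. Roughly (a sketch; needs Linux >= 4.14 and a reasonably recent glibc):

    // zc_send.cpp fragment -- zero-copy TCP send on Linux (illustrative sketch).
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    void zc_send(int fd, const char* buf, size_t len) {
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        send(fd, buf, len, MSG_ZEROCOPY);      // pages pinned, no copy into the kernel

        // Completion: a SO_EE_ORIGIN_ZEROCOPY notification on the error queue
        // tells us the NIC is done and buf may be reused or freed.
        char ctrl[128];
        msghdr msg{};
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);
        // Real code would poll() for POLLERR and loop; this call returns
        // -1/EAGAIN if no completion has been queued yet.
        recvmsg(fd, &msg, MSG_ERRQUEUE);
    }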
Derecho is very tolerant of high latency. Systems like TensorFlow are too, as would be any event-stream system that uses a queuing service (Kafka, SQS) to interconnect its components. But some applications would care more: namely, those dominated by RPC. Are those still common?
Which leads to my question: if RDMA gives us only a 4x speedup on datacenter networks, and latency rises by mere fractions of a millisecond, will operators adopt the technology, given the complexity of deploying it? As you know if you’ve read this blog, Microsoft and Google are gaining experience with datacenter RDMA, using DCQCN or TIMELY for congestion control plus various tricks to run RDMA on Converged Ethernet (RoCE) with flow isolation. They are succeeding, but finding it fairly hard to pull off. Normally, a technology that is this hard to manage and brings less than a 10x benefit doesn’t make it.
One case for RDMA might be based on CPU load. TCP will peg a CPU doing all this copying, so with Derecho over TCP the typical 12-core NUMA node acts more like an 11-core machine, perhaps even a 10-core one once the extra threads are counted. Losing one or two cores in twelve is a 10-15% tax on your compute resources, and as the datacenter owner that could be a big issue: combined with the 4x raw speedup, RDMA could reach that magic 10x benefit, not in one year, but amortized over a four or five year hardware lifetime.
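Back of the envelope, to make the tax concrete: if memcpy moves about 3.75GB/s, then every 1GB/s of TCP traffic burns roughly a quarter of a core on the sending node and a quarter on the receiving node, just copying. A node pushing a few GB/s in each direction has quietly given up one or two of its twelve cores, which is where my 10-15% figure comes from.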
A second point is that not every project has all-star developers to do the tuning. Our 4x might translate to a 20x difference for MPI, or 25x for Oracle. That would drive towards RDMA adoption after all. What Weijia and Sagar did came down to adjusting the SSTMC window size in Derecho (see the configuration fragment below), but you had to figure out that, in this mode, the window size is the dominant factor for Derecho. Who knows what the equivalent knob would be for MVAPICH (MPI), or Oracle, or TensorFlow? RDMA takes that puzzle, and that datacenter tax, off the table.
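For those who want to try this at home, the knob lives in Derecho’s configuration file. The fragment below is illustrative only, written from memory of the sample config shipped with Derecho; the exact key names and values may differ in your version:

    # derecho.cfg fragment (illustrative; check the sample config in the repo)
    [RDMA]
    provider = sockets      # LibFabrics TCP path; "verbs" selects real RDMA
    domain = eth0
    [SUBGROUP/DEFAULT]
    max_payload_size = 10240
    block_size = 1048576
    window_size = 16        # the knob that mattered: deep windows hide the copies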
My money is still on RDMA. But these results over TCP definitely came as a surprise!