What could cause a 5x slowdown?
As it turns out, the issue is easy to understand, and it points to a deeper and rather interesting puzzle. RDMA is blazingly fast, as you know if you've read my earlier postings on the topic. Basically, transfers occur entirely in hardware: the NIC on machine A talks to the memory unit on machine A and grabs chunks of data, which zip straight over the wire to the NIC on machine B, which stores the data directly into the memory unit on B. This gives a rate of data transfer that can be much higher than any single-core memcpy operation could approach: even if there is a hardware instruction for copying blocks of memory, that instruction will still operate as a series of fetch and store operations, and will need to talk to the memory unit twice for each cache-line-sized chunk of data.
So RDMA tends to run 2x faster than copying. If you look at an end-to-end pipeline, you'll generally see that RDMA latencies are sharply higher than the latency of interacting with local DRAM, but for a big transfer the data rate can maintain this 2x benefit, from the data on A all the way to the receive buffer on B.
This is what my student ran into last night. In his case, he used Derecho to send long null-terminated strings because strings are easy to check for correct content ("Hello world, this is update 227!"), but they are very costly to create and transmit. Derecho was running 5x too slowly because it spent all its time waiting for his expensive string-creation code to run, and for Derecho's orderedsend to marshal the objects before sending. Yet our instinct tells us that those should be viewed as fast operations. Well, instinct has ceased to be correct: RDMA is a new world.
In fact you've probably thought about the following question. You are driving on highway 101 in a fancy Tesla sports car with one of those insane speed buttons. Elon Musk has really pushed the limits, and the car can reach 2/3rds of the speed of light. Yet your commute from Menlo Park to South San Jose takes exactly as long as it took back when you were driving your old Subaru Impreza! The key insight isn't a very deep one: even with a supercar, the "barriers" on the highway will still limit you to roughly the same total commute time, if those barriers are frequent enough.
In modern operating systems and languages, these kinds of barriers are pervasive.
Today's most widely used systems maintain data in various typed data structures: strings, classes defined by developers, and so forth. Even an object as simple as a string may require byte-by-byte copying just to find the null terminating character, and by itself this will be a further 2.5x slower than RDMA. So send a string, and your end-to-end numbers might be 5x slower than the best we get from Derecho with byte arrays of known size. Send a class that needs complex marshaling and the costs go even higher (if the fields can be copied directly from memory, scatter-gather is an option, but otherwise we would often need to first copy the data into a send buffer, then send it, then free the buffer: a sequence that could push us even beyond that 5x slowdown).
What you can see here is that at speeds of 100Gbps or higher, copying is a devastatingly slow operation! Yet in fact, modern operating systems copy like crazy:
- They copy data from user space into kernel space prior to doing I/O, and back later.
- Modern languages are very relaxed about creating clones of objects.
- With the exception of C++ 14's move semantics, method calls copy their arguments onto the stack, item by item.
- Garbage collectors copy and compact all over the place.