Thursday, 17 August 2017

The systems-area obsession with peak performance

In systems, there has always been a completely understandable focus on peak performance.  For me personally, Butler Lampson's early paper "Hints for Computer System Design" (which basically argued that we need to unclutter the critical path), the famous End-to-End paper by Saltzer, Reed, and Clark (which argued for taking unnecessary functionality out of the network), and the Schroeder and Burrows paper on the performance of Firefly RPC (which argued for taking unnecessary mechanism out of the remote procedure call path) were awe-inspiring classics: papers you reread decades later, and that still amaze.


In fact, for people who get pleasure from programming, there is a natural tendency to build systems and evaluate them, and any such task naturally centers on a peak-performance perspective.  Optimizing peak performance is fun, and honestly, can be addictive: there is such direct feedback when you succeed.  It is very rare to see a first version of a system that can't be sped up by a factor of 10 or more just by doing basic optimizations, and in some cases we end up with speedups of 100 or 1000-fold, or even more.  What a thrill!
Yet there is a sense in which our love for speed worries me: I wonder whether this classical style of systems work is fading as a form of innovation in the eyes of the field as a whole.  The core issue is that existing systems (here I mean the mainstream workhorses: the operating system, the network layer, perhaps the compiler) all work pretty well.  Of course, speeding them up is a worthy endeavor, but it may no longer matter enough to be a justifiable goal in its own right.  Research on performance is just not a compelling story if you focus on this particular layer.
Why should this matter?  In fact, it might not matter at all, were the systems community itself aligned with the larger external forces that shape the way we are perceived by other communities and by computer science as a discipline.  But right now, I suspect, there is a substantial disconnect: people like me are addicted to speed (hmm... that doesn't sound quite right), while people who hang out at conferences like NIPS and KDD don't spend much time worrying about the performance of the systems components my crowd focuses upon, like the latest version of Linux running on the latest multicore hardware platform.
As I write this blog entry, this same dynamic is evident even within my own research group.  For us, it plays out as a tension between telling the Derecho story as a story about a new concept ("smart memory") and telling it as a story about raw speed ("fastest Paxos and Atomic Multicast, ever!"). 
It seems to me that the broader field tends to value impact more than what might be called "narrow" metrics, such as the speed of the Linux I/O path.  Invent a new way of doing things, and you can change the world in interesting ways.  So the puzzle is this: if the systems community has started to drift relative to computer science as a whole, don't we run the risk of marginalizing ourselves, by over-emphasizing aspects the broader community views as unimportant while rejecting innovations it might be thrilled to hear about?

Take Spark, a recent home run story from Berkeley.  If you think back, the first research papers on Spark told a story (primarily) about huge speedups for MapReduce/Hadoop, obtained through smarter in-memory caching of data sets (RDDs: Resilient Distributed Datasets) and smarter task scheduling, so that computations would tend to exhibit affinity relative to the cached content.  Years later, it seems clear that the more significant aspect of Spark, and the more impactful innovation, was that it created a longer-term computing "model" in which data loaded into the system lives in memory, is transformed through various stages of computation, and in which the end user has a powerful new experience of data mining with vastly better performance, because the RDDs remain resident in memory for as long as there is space and they continue to be used now and then.  Systems people may think of this as a performance story... but NIPS and KDD people perceive it as a paradigm shift.  As a systems person, it seems to me that our community in fact accepted the right papers, but for the wrong reason, and that the early advising of the Spark inventors (by their faculty mentors) may even have misunderstood the real contribution and steered them down the less vital path.  Of course, the enthusiasm for Spark quickly reset the focus, and today Databricks, the company that offers commercial support for the Spark platform, emphasizes high-productivity data mining with blazing performance rather than "speeding up your Hadoop jobs."


It isn't a realistic thing to wish for, but I'll wish for it anyhow: as a field, it seems to me that we need to try to pivot, and to accept that the styles of computing that really matter have changed.  The world has changed drastically in the past decade or two: what matters most, right now, is machine learning.  This is partly because our traditional systems already work pretty well.  Disruption comes from big reaches, not small 10x optimizations to things that already worked fairly well.


I don't know anything more about the future than anyone else.  My focus, in Derecho, is on "smart memory," but will this ever become a recognized field of research, one that other people would work on?  Does the world really need smart memory services for IoT applications?  I hope so, but of course that question will be answered by other people, not by me.  And so one can easily understand why my students love the raw speed story: for them, fast replication is a better-defined systems topic, with an obvious and specific role in existing systems.  People use replication solutions.  So it makes sense for them to gravitate towards speed records.


Indeed, for them, viewing machine learning as the real goal, and performance as just one dimension, makes systems research feel secondary to machine learning research.  Nobody wants to feel like the plumber or the electrician: we all want to build the house itself.  Yet perhaps this is the new reality for systems researchers.
Will such a pivot be feasible?  Perhaps not: the systems addiction to speed runs deep.  But at the same time, when I visit colleagues in industry, I find them deeply embedded in groups doing important practical work that often centers on a machine learning objective.  So it seems to me that if we don't evolve in this necessary way, we'll simply fade in importance to the broader field.  We just have to try, even if we might not succeed.

Thursday, 10 August 2017

Zero-copy computing

Last night, one of my group members tossed together a simple experiment on Derecho as the first step towards a much fancier experiment he needs to run.  To his surprise, the performance was a fifth of what we've been seeing in our experiments that will go into the ACM TOCS submission we plan to send out any day now.


What could cause a 5x slowdown?


As it turns out, the issue is easy to understand, and it points to a deeper and rather interesting puzzle.  RDMA is blazingly fast, as you know if you've read my earlier postings on the topic.  Basically, transfers occur entirely in hardware: the NIC on machine A talks to the memory unit on machine A and grabs chunks of data, which zip straight over the wire to the NIC on machine B, which stores the data directly into the memory unit on B.  This gives a rate of data transfer that can be much higher than any single-core memcpy operation could approach: even if there is a hardware instruction for copying blocks of memory, that instruction will still operate as a series of fetch and store operations, and will need to talk to the memory unit twice for each cache-line-sized chunk of data.
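To make the comparison concrete, here is a tiny single-core copy benchmark one could run.  This is just a sketch: the 256 MiB buffer size, the iteration count, and the 100 Gbps arithmetic in the comments are my own illustrative assumptions, not measurements from Derecho.

#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // 256 MiB buffers and 10 iterations: arbitrary choices for illustration.
    const std::size_t len = 256ull * 1024 * 1024;
    std::vector<char> src(len, 'x'), dst(len);

    const int iters = 10;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        // Every byte is read once and written once, so the memory system
        // actually moves about twice the payload: the copy penalty above.
        std::memcpy(dst.data(), src.data(), len);
    }
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    double gb   = iters * static_cast<double>(len) / 1e9;
    // For comparison: a 100 Gbps RDMA NIC moves roughly 12.5 GB/s end to end
    // without the host CPU touching the data at all.
    std::cout << "single-core memcpy: " << (gb / secs) << " GB/s"
              << " (spot check byte: " << dst[len / 2] << ")" << std::endl;
    return 0;
}

On most machines this lands somewhere in the single-digit GB/s range per core, give or take, which is exactly the point of the comparison.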

So RDMA tends to run 2x faster than copying.  If you look at an end-to-end pipeline, you'll generally see that RDMA latencies are sharply higher than the latency of interacting with local DRAM, but for a big transfer the data rate can maintain this 2x benefit, from the data on A all the way to the receive buffer on B.


This is what my student ran into last night.  He was using Derecho to send long null-terminated strings: convenient, because strings are easy to check for correct content ("Hello world, this is update 227!"), but very costly to create and transmit.  Derecho was running 5x too slowly because it spent all its time waiting for his expensive string-creation code to run, and for Derecho's orderedsend to marshal the objects before sending.  Yet our instincts tell us that those should be fast operations.  Well, instinct has ceased to be correct: RDMA is a new world.
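The effect is easy to reproduce without Derecho at all.  The sketch below looks only at sender-side data preparation (the message size, update format, and helper names are hypothetical, and no Derecho calls appear): building every update as a freshly allocated null-terminated string, versus formatting it in place into a preallocated buffer of known size.

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

constexpr std::size_t kMsgSize = 4096;   // assumed fixed message size, purely illustrative

// Costly path: every update allocates a fresh std::string; a transport must
// then rediscover the length (strlen for a raw C string) and copy the bytes
// into its own buffer before anything reaches the NIC.
std::string make_string_update(int seq) {
    return "Hello world, this is update " + std::to_string(seq) + "!";
}

// Cheap path: format directly into a preallocated buffer of known size, so
// nothing needs to be re-scanned or re-copied before sending.
void fill_buffer_update(char* buf, std::size_t len, int seq) {
    std::snprintf(buf, len, "Hello world, this is update %d!", seq);
}

int main() {
    std::vector<char> send_buf(kMsgSize);   // stands in for a registered send buffer
    for (int seq = 0; seq < 100000; seq++) {
        // Slow variant: build a string, then copy it into the send buffer.
        // std::string s = make_string_update(seq);
        // std::memcpy(send_buf.data(), s.c_str(), s.size() + 1);

        // Fast variant: write the bytes where the transport wants them.
        fill_buffer_update(send_buf.data(), send_buf.size(), seq);
    }
    return 0;
}

The slow variant also hides an allocation per update; in a garbage-collected language the same pattern would be worse still.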


In fact, you've probably run through something like the following thought experiment.  You are driving on highway 101 in a fancy Tesla sports car with one of those insane speed buttons.  Elon Musk has really pushed the limits and the car can reach two-thirds of the speed of light.  Yet your commute from Menlo Park to South San Jose takes exactly as long as it took back when you were driving your old Subaru Impreza!  The key insight isn't a very deep one: even with a supercar, the "barriers" on the highway will still limit you to roughly the same total commute time, if those barriers are frequent enough.

In modern operating systems and languages, these kinds of barriers are pervasive.


Today's most widely used systems maintain data in various typed data structures: strings, classes defined by developers, and so forth.  Even an object as simple as a string may require byte-by-byte copying, just to find the null terminating character, and by itself this can be a further 2.5x slower than RDMA.  So send a string, and your end-to-end numbers might be 5x slower than the best we get from Derecho with byte arrays of known size.  Send a class that needs complex marshaling and the costs go even higher (if the fields can be copied directly from memory, scatter-gather is an option, but otherwise we would often need to first copy the data into a send buffer, then send it, then free the buffer: a sequence that could push us even beyond that 5x slowdown).
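To illustrate the two marshaling strategies just mentioned, here is a sketch that uses the POSIX writev scatter-gather call as a stand-in for what an RDMA gather list would do.  The Header layout, payload, and the choice of writev are illustrative assumptions, not Derecho's actual wire format.

#include <cstdint>
#include <cstring>
#include <sys/uio.h>    // writev, struct iovec
#include <unistd.h>     // write, STDOUT_FILENO
#include <vector>

struct Header { std::uint32_t sender; std::uint32_t seqno; std::uint32_t payload_len; };

// (1) Scatter-gather: the kernel (or, for RDMA verbs, the NIC) walks a list
// of pointers; the payload is never copied at user level.
ssize_t send_gather(int fd, const Header& h, const char* payload, size_t len) {
    struct iovec iov[2];
    iov[0].iov_base = const_cast<Header*>(&h);
    iov[0].iov_len  = sizeof(h);
    iov[1].iov_base = const_cast<char*>(payload);
    iov[1].iov_len  = len;
    return writev(fd, iov, 2);
}

// (2) Staging buffer: header and payload are copied side by side, the
// combined buffer is sent, then freed; the extra copy is pure overhead.
ssize_t send_staged(int fd, const Header& h, const char* payload, size_t len) {
    std::vector<char> staging(sizeof(h) + len);
    std::memcpy(staging.data(), &h, sizeof(h));
    std::memcpy(staging.data() + sizeof(h), payload, len);
    return write(fd, staging.data(), staging.size());
}

int main() {
    Header h{1, 227, 12};
    const char payload[] = "hello update";
    // Writing to stdout just keeps the sketch self-contained and runnable.
    send_gather(STDOUT_FILENO, h, payload, sizeof(payload) - 1);
    send_staged(STDOUT_FILENO, h, payload, sizeof(payload) - 1);
    return 0;
}

The gather version is what a fast replication path wants to approximate; the staged version is essentially the copy-into-a-send-buffer sequence described in the paragraph above.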


What you can see here is that at speeds of 100 Gbps or higher, copying is a devastatingly slow operation!  Yet in fact, modern operating systems copy like crazy:
  • They copy data from user space into kernel space prior to doing I/O, and back again later (see the sketch after this list).
  • Modern languages are very relaxed about creating clones of objects.
  • Other than C++ 14, every method call copies arguments onto the stack, item by item.
  • Garbage collectors copy and compact all over the place.
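As a concrete example of the first of those bullets (the user-space/kernel-space copies), compare the classic read()/write() path, which copies every byte into a user-space bounce buffer and then back into the kernel, with Linux's sendfile(), which keeps the bytes inside the kernel.  A minimal sketch, with placeholder file paths:

#include <fcntl.h>          // open
#include <sys/sendfile.h>   // sendfile (Linux-specific)
#include <unistd.h>         // read, write, close

// (1) The classic path: every byte crosses the user/kernel boundary twice.
void copy_with_read_write(int in_fd, int out_fd) {
    char buf[64 * 1024];                                  // user-space bounce buffer
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {     // copy #1: kernel -> user
        if (write(out_fd, buf, n) != n) break;            // copy #2: user -> kernel
    }
}

// (2) sendfile: the kernel moves the bytes itself; the bounce buffer and its
// two copies disappear.  RDMA with registered buffers goes further still,
// taking even the kernel out of the data path.
void copy_with_sendfile(int in_fd, int out_fd, size_t len) {
    off_t offset = 0;
    while (len > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, len);
        if (n <= 0) break;
        len -= static_cast<size_t>(n);
    }
}

int main() {
    // Placeholder paths; swap in copy_with_sendfile to try the in-kernel route.
    int in_fd  = open("/tmp/source.dat", O_RDONLY);
    int out_fd = open("/tmp/dest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in_fd >= 0 && out_fd >= 0) copy_with_read_write(in_fd, out_fd);
    if (in_fd >= 0)  close(in_fd);
    if (out_fd >= 0) close(out_fd);
    return 0;
}

Of course, sendfile() only covers one narrow case (data that already lives in a file); a genuinely zero-copy operating system would need to make that posture the default for all I/O, which is roughly what the next paragraph is asking for.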
This suggests that it may be time for someone to create a zero-copy operating system, and perhaps also to start looking carefully at what it would take to do zero-copy versions of modern programming language frameworks like Python, Go, C# and so forth (C++ is already pretty good in this respect).  Otherwise, as RDMA pushes towards 400 Gbps (200 Gbps is already becoming fairly common), and 1 Tbps within a decade, we'll find that our RDMA path seems nearly instant... but that the applications just can't benefit!