Thursday 17 August 2017

The systems-area obsession with peak performance

In systems, there has always been a completely understandable focus on peak performance. For me personally, Butler Lampson's classic "Hints for Computer System Design" (which basically argued that we need to unclutter the critical path), the famous End-to-End paper by Saltzer, Reed, and Clark (which argued for taking unnecessary functionality out of the network), and the Birrell and Nelson paper on implementing remote procedure calls (which argued for taking unnecessary mechanism out of the remote procedure call path) were awe-inspiring classics: papers you reread decades later, and that still amaze.


In fact, for people who get pleasure from programming, there is a natural tendency to build systems and evaluate them, and obviously any such task centers on a peak-performance perspective. Optimizing peak performance is fun and, honestly, can be addictive: there is such direct feedback when you succeed. It is very rare to see a first version of a system that can't be sped up by a factor of 10 or more just by doing basic optimizations, and in some cases we end up with speedups of 100- or 1000-fold, or even more. What a thrill!
Yet there is a sense in which our love for speed worries me: I wonder if this classical way of thinking about systems might be fading as a pure form of innovation in the eyes of the field as a whole.  The core issue is that existing systems (here I mean the mainstream workhorses: the operating system, the network layer, perhaps the compiler) all work pretty well.  Of course, speeding them up is a worthy endeavor, but it may no longer matter enough to be a justifiable goal in its own right.  Research on performance is just not a compelling story, if you focus on this particular layer.
Why should this matter? In fact, it might not matter at all, were the systems community itself aligned with these larger external forces that shape the way we are perceived by other communities and by computer science as a discipline. But right now, I suspect, there is a substantial disconnect: people like me are addicted to speed (hmm... that doesn't sound quite right), while people who hang out at conferences like NIPS and KDD don't really spend much time worrying about the performance of the systems components my crowd focuses upon, like the latest version of Linux running on the latest multicore hardware platform.
As I write this blog entry, this same dynamic is evident even within my own research group.  For us, it plays out as a tension between telling the Derecho story as a story about a new concept ("smart memory") and telling it as a story about raw speed ("fastest Paxos and Atomic Multicast, ever!"). 
It seems to me that the broader field tends to value impact more than what might be called "narrow" metrics, such as the speed of the Linux I/O path. Invent a new way of doing things, and you can change the world in interesting ways. So the puzzle that emerges is this: if the systems community has started to drift relative to computer science as a whole, don't we run the risk of becoming marginalized, by over-emphasizing aspects that the broader community views as unimportant, while rejecting innovations that same community might be thrilled to hear about?

Take Spark, a recent home-run story from Berkeley. If you think back, the first research papers on Spark told a story (primarily) about huge speedups for MapReduce/Hadoop, obtained by smarter in-memory caching of files (as RDDs: Resilient Distributed Datasets) and smarter task scheduling, so that computations would tend to exhibit affinity relative to the cached content. Years later, it seems clear that the more significant aspect of Spark -- the more impactful innovation -- was that it created a longer-term computing "model": data loaded into Spark (the platform now commercialized by Databricks) lives in memory, is transformed through various stages of computation, and the end-user gets a powerful new experience of data mining with vastly better performance, because these RDDs remain resident in memory as long as there is enough space and they continue to be used now and then. Systems people may think of this as a story of performance... but NIPS and KDD people perceive it as a paradigm shift. As a systems person, it seems to me that our community in fact accepted the right papers, but for the wrong reason, and that the early advising of the Spark inventors (by their faculty mentors) may even have misunderstood the real contribution and steered them down the less vital path. Of course, the enthusiasm for Spark quickly reset the focus, and today Databricks, the company that offers commercial support for the Spark platform, focuses on high-productivity data mining with blazing performance, rather than portraying the story as "speeding up your Hadoop jobs."
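
To make that caching model concrete, here is a minimal PySpark sketch of the pattern described above; the input path and the filter predicates are hypothetical placeholders, and this is just an illustration of the idea, not anything from the Spark papers themselves. The dataset is loaded once, pinned in memory, and then reused across several computations without rereading it from disk.

```python
# Minimal sketch of RDD caching in PySpark (hypothetical path and predicates).
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-caching-sketch")

# Load the data once and ask Spark to keep the resulting RDD in memory.
logs = sc.textFile("hdfs:///data/web-logs")   # hypothetical input path
logs.cache()                                  # mark the RDD for in-memory reuse

# The first action materializes the RDD and populates the cache...
total = logs.count()

# ...and later computations reuse the cached partitions instead of
# rereading the file, which is where the interactive "feel" comes from.
errors = logs.filter(lambda line: "ERROR" in line).count()
warnings = logs.filter(lambda line: "WARN" in line).count()

print(total, errors, warnings)
sc.stop()
```

If memory runs short, Spark evicts cached partitions and recomputes them from their lineage when they are needed again, which is the behavior behind the "as long as there is enough space" caveat above.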


It isn't a realistic thing to wish for, but I'll wish for it anyhow: as a field, it seems to me that we need to try to pivot, and to embrace a change in which styles of computing really matter. The world has changed drastically in the past decade or two: what matters most, right now, is machine learning. This is partly because systems already work pretty well. Disruption comes from big reaches, not from small 10x optimizations to things that already worked fairly well.


I don't know anything more about the future than anyone else. My focus, in Derecho, is on "smart memory," but will this ever become a recognized field of research, one that other people would work on? Does the world really need smart memory services for IoT applications? I hope so, but of course that question will be answered by other people, not by me. And so one can easily understand why my students love the raw speed story: for them, fast replication is a better-defined systems topic, with an obvious and specific role in existing systems. People use replication solutions. So it makes sense for them to gravitate towards speed records.


Indeed, for them, viewing machine learning as the real goal, and performance as just one dimension, makes systems research feel secondary to machine learning research.  Nobody wants to feel like the plumber or the electrician: we all want to build the house itself.  Yet perhaps this is the new reality for systems researchers.
Will such a pivot be feasible? Perhaps not: the systems addiction to speed runs deep. But at the same time, when I visit colleagues in industry, I find them deeply embedded in groups doing important practical work that often centers on a machine learning objective. So it seems to me that if we don't evolve in this necessary way, we'll simply fade in importance to the broader field. We just have to try, even if we might not succeed.
