Thursday 19 January 2017

We're actually all engineers

When Roger Needham was very ill, near the end, his lab members threw a final party for him.  Roger wasn't strong enough to attend, but he sent a video, in which he is wearing an engineer's hard hat.  He explains that he did some theory and came up with some clever principles and deep ideas, but he hopes to always be remembered as an engineer.  And indeed, this is how I remember him.

The image comes to mind because as I travel around on sabbatical, I'm increasingly struck by the degree to which the field of distributed systems (and systems more broadly) has begun to pivot towards the engineering side of the discipline.  This poses really hard puzzles for academic research.

For most of the first three or four decades of research in distributed computing, there were genuinely deep, mathematically hard questions we could wrestle with.  My favorite among these problems has been replication of data, and its cousin, coordinated distributed computation (state machine replication and variants of that model).  But one should also point to the problem of asynchronous consensus, with its FLP impossibility result and the various failure-oracle models for guaranteed progress; Byzantine agreement in various models; the logic of knowledge, and ideas of causality and time; gossip protocols and convergent consistency; peer-to-peer overlays; inexpensive synchronization mechanisms.  Honestly, we've had a good run.

But every good run eventually comes to a point where fundamentally, things need to change.

When I entered the field, operating systems researchers were just beginning to "reject" abstractions: there was a big pivot underway, back to basics.  As John Ousterhout once put it, we had plenty of abstractions and the ones in Linux worked well for our users, so we needed to focus on keeping those working well as we scaled them out, migrated to new hardware and put them under new patterns of load.  But he was negative about new abstractions, or at least preferred not to see those as part of the OS research agenda, and he wasn't much more charitable towards new theory: he felt that we had the theory we needed, just as we had the abstractions we needed.  In retrospect, he was more right than I understood at the time.  Today, nearly two decades after he made that point, Linux has evolved in many ways, yet its grab-bag of abstractions is pretty similar to what existed back then.

It may be time to embrace Ousterhout's point from back then: distributed computing is becoming an engineering discipline, and the best work in the coming years may be dominated by great engineering, with less and less opportunity to innovate through new theory results, or new styles of distributed computing, or new kinds of abstractions.  Like Roger Needham, we need to put on our hard hats and roll up our sleeves and turn to those engineering roots.

And there are plenty of great engineering challenges.  I'm very excited about RDMA, as you know if you've read my other postings, or read the paper on Derecho (we've concluded that 12 pages is just too short and are treating that version as a technical report; we'll send a longer version to a journal soon).  I think Derecho has some very cool abstractions, obviously: the "torrent of data" supported by RDMC, and the distributed shared memory model enabled by SST.  These innovate in the way they are engineered, and because those engineering tricks lead to somewhat peculiar behaviors, we then get to innovate by matching our protocols to our new building blocks -- and yet the protocols themselves are actually quite similar to the original Isis gbcast: the first really practical protocol strong enough to solve consensus, and in fact the first example of what we now think of as the family of Paxos protocols.  Derecho isn't identical to the Isis gbcast: the Isis protocols had an optimistic early delivery mode, for reasons of performance, and you had to invoke flush to push the pipeline of messages to their targets; we eliminated that feature in Derecho, because we no longer need it.  So we innovated by getting rid of something... but this is not the kind of innovation people usually have in mind, when I sit down with my Cornell colleagues and they rave about the latest accomplishment of unsupervised deep learning systems.
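To make that flush contract concrete, here is a minimal single-process sketch.  Every name in it (OptimisticChannel, send, flush) is hypothetical, invented purely for illustration -- this is neither the Isis nor the Derecho API, and stabilization is simulated with a timer rather than a real acknowledgement protocol.  The point is just the shape of the contract: messages reach the application early, before they are stable, and a sender that needs safety pays for it explicitly by calling flush.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

class OptimisticChannel {
public:
    explicit OptimisticChannel(std::function<void(const std::string&)> deliver)
        : deliver_(std::move(deliver)) {}

    // Optimistic path: hand the message to the application immediately,
    // before it is stable, then let a background "protocol" stabilize it.
    void send(const std::string& msg) {
        {
            std::lock_guard<std::mutex> lk(m_);
            ++pending_;
        }
        deliver_(msg);  // early delivery: fast, but not yet safe
        std::thread([this] {  // simulated stabilization (stand-in for acks from all targets)
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            std::lock_guard<std::mutex> lk(m_);
            --pending_;
            cv_.notify_all();
        }).detach();
    }

    // flush(): block until every optimistically delivered message in the
    // pipeline has stabilized.  Callers pay for safety only when they need it.
    void flush() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return pending_ == 0; });
    }

private:
    std::function<void(const std::string&)> deliver_;
    std::mutex m_;
    std::condition_variable cv_;
    int pending_ = 0;
};

int main() {
    OptimisticChannel ch([](const std::string& m) {
        std::cout << "early-delivered: " << m << "\n";
    });
    ch.send("update-1");
    ch.send("update-2");
    ch.flush();  // returns only once both updates are stable
    std::cout << "stable: now safe to reply to the client\n";
}
```

Derecho drops this two-step contract entirely, so flush -- and the class of bugs that came from forgetting to call it -- simply disappears: innovation by removal, as I said.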

Recasting that gbcast protocol into this new programming style has been incredibly effective: Derecho is (so far as I can tell) a constructive lower bound for such protocols, optimal in every sense I can think of, including the mapping from protocol to hardware.  Yet the work is basically engineering, and while I could definitely make a case that the innovations are foundational, deep down, what I'm most proud of remains the engineering of the system.  That's the innovation.

We'll have more such challenges getting Derecho to scale to half a million nodes, or to run over a WAN: engineering challenges.  To solve the problems we'll encounter, we'll need to innovate.  Yet Derecho will remain, at its core, a reengineering of a concept that existed back in the late 1980s, when we first invented that early version of Isis gbcast (I guess we should have called it Paxos, particularly given that the name Isis is now seared into the public consciousness as pretty much the epitome of evil).

I could go on and talk about other areas within operating systems and systems in general, but I'm going to skip to my point instead: across the board, the operating systems community is finding that we already know the theory and already have the core solutions in hand, to the extent that those solutions have any kind of foundational flavor to them.  But we face huge practical challenges: engineering challenges.  The tough questions center on making new hardware easy to use, and on matching familiar ways of computing to the sweet spot of the new hardware we're computing on.

This situation is going to be very tough for academic research departments.  I'll give one example.  Multicore hardware is fantastic for heavily virtualized cloud computing data centers, so we have Amazon AWS, Microsoft Azure (if you haven't looked at Azure recently, take a new look: very impressive!), Google's cloud infrastructure, and the list goes on.  All of them are basically enabled by the chips and the cost-effective sharing model they support.  But virtualization for the cloud isn't really deeply different from virtualization when the idea first surfaced decades ago.

So academic research departments, seeking to hire fresh talent, have started to see a dismaying pattern: the top students are no longer doing work on radical new ways of programming with multicore.  Instead, the best people are looking at engineering issues created by the specific behaviors of multicore hardware systems or other kinds of cutting-edge hardware.  The solutions don't create new abstractions or theory, but more often draw on ideas we've worked with for decades, adapting them to these new platforms.  My colleagues and I have had "help wanted" signs out for years now, hoping that brilliant foundational systems researchers would show up.  We just haven't known how to deal with the amazingly strong engineers who came knocking.

These days, the best systems candidates are certainly innovators, but where is the next FLP impossibility result, or the next Byzantine agreement protocol?  One hears non-systems people complaining.  A few years back, those of us in systems were all talking about Corfu, and our colleagues were baffled, asking all sorts of questions that betray a deep disconnect: "What's the big deal about Corfu, Microsoft's amazingly fast append-only log, with its Paxos guarantees?  Didn't we have logs?  Didn't we have Paxos?  Is an SSD such a big deal?"  For me as an engineer, the answer is that Corfu embodies a whole slew of novel ideas that are incredibly important -- engineering ideas, and engineering innovations.  But for our colleagues from other fields, it has become harder and harder to explain that the best work in systems isn't about foundational concepts and theory anymore.
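To see why "didn't we have logs?" misses the point, here is a deliberately tiny sketch of the shared-log abstraction, with hypothetical names and a toy in-memory implementation (this is not Corfu's actual API).  Nothing about the interface is new, and that is exactly why outsiders shrug: everything remarkable about Corfu lives beneath an interface like this one.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <optional>
#include <string>

class SharedLog {
public:
    // Append a record at the tail; the returned position is the record's
    // slot in the single, global total order.
    uint64_t append(std::string record) {
        std::lock_guard<std::mutex> lk(m_);
        uint64_t pos = next_++;
        entries_[pos] = std::move(record);
        return pos;
    }

    // Read the record at a given position, if one has been written there.
    std::optional<std::string> read(uint64_t pos) const {
        std::lock_guard<std::mutex> lk(m_);
        auto it = entries_.find(pos);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }

private:
    mutable std::mutex m_;
    uint64_t next_ = 0;
    std::map<uint64_t, std::string> entries_;
};

int main() {
    SharedLog log;
    uint64_t a = log.append("put x=1");
    uint64_t b = log.append("put x=2");
    std::cout << a << ": " << *log.read(a) << "\n"
              << b << ": " << *log.read(b) << "\n";
}
```

The toy version is trivial; making the same contract hold across a cluster, at the speed of the flash itself, is what had the systems community so excited -- and that is an engineering story, not a new abstraction.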

The effect of this is to make it harder and harder to hire systems people, and harder for them to succeed in the field.  The very best certainly do incredibly well.  Yet the number of applicants is way down, and the shape of these successes looks more and more practical, and less and less foundational.

The sheer scale of the cloud stimulated a huge wave of brilliant engineering, and wonderful research papers from industry.  Yet we in academia lack testbeds or real use cases that are even remotely as ambitious, and our development teams are small, whereas companies like Microsoft and Google routinely put fifteen or twenty people onto major projects.  How can academic researchers even compete in this kind of game?

I could go on at some length on this theme, but I hope you catch the basic drift: systems has a vibrant future, but the future of systems will be more and more closely linked to the hardware and to the deployment models.  Don't expect your next systems hires to be theory students who are inventing radical new concepts that will transform our conception of consistency: we know what consistency is; we figured it out thirty years ago.  You'll need to come to grips with hiring superb engineers, because increasingly, the engineering side of the field is the important, hot part of the domain.  Meanwhile, the mathematical side struggles to say anything relevant that wasn't said long ago: without new abstractions to reason about and prove things about, it is hard for its members to position themselves as applied mathematicians.  Mathematicians, certainly, but not necessarily ones whose work applies to anything that matters to Amazon, Google or Microsoft!

Honestly, this worries me.  I can definitely see how one could create a new kind of systems community from scratch, with totally new conferences dedicated to what could be called brilliantly principled systems engineering.  I love that sort of stuff; so did Roger Needham.  But the main conferences already exist.  And the dynamic that has become established is one that really squeezes academic research to a smaller and smaller niche, for all the reasons I've noted: we lack the testbeds, and the user-pull, and can't easily validate our work, and don't have large enough teams.

One answer might be for the main conferences to start to target "great systems engineering" and to treat really neat engineering ideas with the same awe that we accord to Google's latest globe-spanning consistent database (built using a custom worldwide network, dedicated satellite links, GPS-synchronized clocks that even worry about relativistic effects...).  This might be hard, because academic systems work these days is often beautiful in small, crystalline ways, and our program committees are full of people who were trained to look for big ideas, big innovations, and big demonstrations.

My own preference would be for us to deliberately take an innovation Hank Levy introduced, but to push it even further.  Hank, worried that papers were "backing up" and clogging the system, convinced OSDI and SOSP to double the number of accepted papers.  The downside is that the conferences started to have very long schedules.  I think we might consider going even further and doubling the accept rates again, basically by taking every solid paper, so long as it innovates in some recognizable way, including through experiments or just quality of engineering.  Then we'd have far too many papers for any sane conference schedule, and the PC would have to pick some for plenary presentations, and delegate others to parallel sessions or (my preference) to big poster sessions, maybe with a WIPS-style introduction where each author would get three minutes to advertise the work.

If we treat such papers as first-class ones (a SOSP paper is a SOSP paper, whether presented in a plenary session or not), I think we could revive the academic side of the field.

What if we fail to tackle these issues?  I think the answer is already evident, and I see it as I travel.  The field has fewer and fewer young faculty members and young researchers, and the ones doing PhDs are more and more inclined to join industry, where the rewards and the type of work are better aligned.  We'll just age out and fade into irrelevance.  When we do hire, our new hires will either do fancy but somewhat shallow work (a traditional response to such situations), or will have trouble getting tenure: letter writers will stumble looking for those core "foundational" innovations, those new abstractions, and they won't credit engineering brilliance to nearly the degree that they should.

Does systems have a future?  I sure hope so, because I truly do love fantastic engineering, and there are fantastic engineering questions and opportunities as far into the future as one can see.  Clever tricks with new hardware, clever ways to do the same old stuff but faster, at greater scale, and better than ever before -- I love that sort of work.  I enjoy talking to great engineers, and to people who are just amazingly strong software developers.

But academics haven't learned to accept the validity of such work.  And this is a mistake: we really have to adopt new performance metrics and a new mindset, because otherwise, academic systems will surely perish as this new era continues to transform the field.
