A Few Thoughts on Distributed Computing: Systems engineering viewed as a science

At Cornell University, where I've done my research and taught since 1982, we've always had a reputation for being a top theory department. Because Cornell is relatively small compared to some of our peers, and because many of my colleagues are famous for rigorous systems theory, this reputation extends to systems. I wouldn't call it a bad thing: I actually like to specify protocols as carefully as possible and to tease out a correctness argument, and my work has often benefitted from rigor.

And yet I'm reminded of something Roger Needham loved to stress. You've probably heard the story. Roger headed the most widely known systems group in Europe for many years, at Cambridge University, and he went on to found the Microsoft Research Laboratory in Cambridge.

Roger's health finally failed, and his colleagues gathered to celebrate Roger's accomplishments. This came near the end of his life, when the cancer he was suffering from had left him very frail. As a result, Roger was unable to physically attend, but he did send a video, and in this video we see him in a wheelchair holding an engineer's hard hat on his lap. He starts out by saying a few words about his career (which included a wide range of work, some very theoretical, some very practical). And then he puts the hat on. Roger looks directly at the camera and says that he hasn't very long now, and wants people to remember him this way: wearing a hard hat, very much the engineer.

For Roger, computer systems research was an engineering discipline first: our primary obligation is to really build the things we invent, and to build production-quality software that people will really want to use. Roger loved ideas, but for him, an unimplemented idea was inferior to a working one, and a working system that people use was the gold standard. Throughout his career, Roger repeatedly acknowledged the value of rigor and used theoretical tools to achieve it. But the key thing is that for him, theory in the systems arena was a tool, secondary to the actual engineering work itself. The real value resided in the artifact: the working system.

For me this was one of the most iconic images in my entire career: Roger wearing that hat and underscoring that we are at our best when we build things of high value: useful things, beautifully designed concrete implementations. And there is a genuine esthetic of systems creation too: a meaningful sense in which great systems are beautiful things.

Today, I see a huge need to elevate our view of systems engineering: rather than thinking of it as mere implementation work, we need to begin to appreciate the science and beauty of great systems work. There is an inherent value to this act of creation: building really good software that solves some really hard problem, having it work, and seeing it used. We're at our best when we create artifacts.

When I visit people at the French air traffic control agency, I'm kind of awed that the software we built for them in 1990 is still working today, without a single unmanaged failure in 27 years. The NYSE floor trading system I developed ran the show for ten years, and there was never a single disruption to trading in all that time. Every time you install Oracle, watch closely: in the middle of the install script Oracle sets up my old Isis toolkit system as its network management tool. And this has worked for nearly 30 years. Nobody phones me to ask how to fix obscure bugs. (Of course, once Derecho is finally out there, I bet many will call to ask how to migrate to it.)

We built these really difficult engineered infrastructures with incredibly demanding specifications, came up with a beautiful and elegant model (virtual synchrony, which turns out to be a variation on the Paxos state machine replication model), and it actually worked! This is as good as it gets.

Our field continues to struggle with the tension between its mathematically-oriented theory side and its roots in systems engineering. One sees this again and again. Papers are brushed to the side by conferences because the contributions are primarily about sound engineering and great performance: "minor and incremental" aspects in the eyes of some PC member (who has probably never coded a large C++ program in his life)! Funding programs somehow assume that the massive data collection infrastructures that will be needed as hosting environments for their fancy new machine learning technologies are trivial and will just build themselves. You write a reference letter for a student, and back comes an email query: yes, ok, the student is a gifted distributed systems engineer. But do they have a talent for theory? And if not, where's the originality? I've become accustomed to that sinking feeling of needing to translate my enthusiasm for systems engineering to make sense to a person who just doesn't understand the creativity and originality that the best systems research demands.

So here's my little challenge to the community: we need to start going out of our way to seek out and accept papers on the best systems, and to deliberately validate and celebrate the systems engineering side of the field as we do so. Over time, we need to recreate a culture of artifacts: by deliberate choice, we need to tilt the field back towards valuing systems to a greater degree. Roger was right: we're at our best when we wear those hard hats.

A few concrete suggestions:

Let's start to write papers that innovate by revealing clever ways to build amazing things.
Those of us in a position to do so should press program committees to set aside entire sessions to highlight real systems: practical accomplishments that highlight the underlying science of systems engineering.
We should be giving awards to the best academic systems research work. Somehow, over time, prizes like the ACM Software System Award started going purely to massive projects done by big teams in industry (which is fine, if ACM wants to orient the prize that way). But we also need prizes to recognize the best systems built in academic and research settings.

Why do these things? Because we're at our best when we create working code, and when we teach others to appreciate the elegance of the code itself: the science behind the engineering. Systems building is what we do, and we should embrace that underlying truth.

Wednesday, 15 February 2017