A Few Thoughts on Distributed Computing: Infrastructures matter too

This posting is kind of a gripe, so I'll try to keep it short and more blog-like.

In my research area, which is broadly considered to be "systems" and more narrowly has names like fault-tolerance, distributed computing or cloud computing, funding used to be extremely generous. I think that this was perhaps because the US military (DARPA, AFRL/AFOSR, etc.) was finding it difficult to deal with its distributed infrastructures and wanted to catalyze change. We had a great run, in many senses: systems research led to today's massive cloud computing platforms, gave us big leaps forward in the performance of all kinds of systems, gave us genuinely exciting new security options, and enabled far higher productivity. Our students (at every level: PhD, MS and MEng, even undergraduates) did projects that prepared them for jobs at the top companies. Some actually went on to found those companies, or to lead their engineering sides.

So the investment paid off.

Even so, under the Bush/Cheney administration, there was a sudden and very striking shift. First, security research flourished and expanded hugely. Somewhat ironically, I actually led the initial studies that created DARPA's original security research programs, but that was back in 1995; I myself never became a real security researcher. But be that as it may, there was a huge increase in funding for work on security (by now we see the value of this: many great ideas emerged from all that investment). Today there are new kinds of trusted platform modules with really remarkable capabilities, which definitely trace to this big push to create new secure computing options.

Meanwhile, systems research continued to receive attention, and that mattered because with the emergence of scaled-out systems for the cloud, we needed non-industry study of all the questions that aren't of immediate, "we need it yesterday" importance to companies like Microsoft and Google. But there was an odd flavor to it: to do systems work in that period, you often needed a big brother: a company that somehow was viewed favorably by the Cheney crowd and the DARPA leadership. Jokes were made about needing a "FOT project leader" ("Friend of Tony", for Tony Tether). And many projects were cancelled abruptly or suspended for months pending reviews to make sure they were really innovative. The FOT projects usually avoided those reviews; the non-FOT projects suffered. Over at NSF, the Bush/Cheney administration imposed sharp cuts, which Congress then tended to reverse. A very complex picture.

This was often justified by funding leaders who would explain that the US government shouldn't invest in IT research if industry will do the work anyhow, or if there isn't some unique need specific to the government, or the military.

In fact, you can count on industry to solve today's problems, and yesterday's problems. But if your needs are a bit unusual (as in the critical computing sector, or the military, or hospitals), don't hold your breath. Anything that isn't aligned with the largest market sectors will languish, and our job as academic researchers is to solve the special needs of those (important) areas, while also educating the students who go on to fill all the key roles in industry.

Fast forward eight years. Bush and Cheney are now out, replaced by the Obama administration. One might have expected research funding to flourish.

In fact, at least in my area (but I think more broadly), a major problem emerged instead.

As you perhaps know, President Obama and the Congress he had to work with simply didn't see eye to eye, on anything. One side-effect of the disagreement is that big discretionary funding cuts were imposed: the so-called "sequester". For the national research programs at DARPA and NSF and other large players, this all took place just as machine learning and AI began to have really dramatic payoff. Yet at the government budgeting level, agency research budgets stopped growing and at best, slipped into a mode in which they barely tracked inflation. We had what is called a "zero-sum" situation: one in which someone has to lose if someone else wins. So as funding shifted towards security on the systems side, and towards AI and ML on the other side, money for systems research dried up. This happened first in the database area, and then crept into all areas of systems with the possible exception of computer networks. For whatever reason, networking research was much more favorably treated. A new area called Cyberphysical Systems emerged and was funded fairly well, but perhaps this is because it has unusually effective leadership. Still, in the large picture, we saw growth on the AI/ML side coupled with shrinkage in the areas corresponding to systems.

When I talk to people at DARPA or NSF, I get a very mixed, and often inaccurate, story of how and why this happened, and how they perceive it in retrospect. NSF program managers insist that their programs were and remain as healthy as ever, and that the issue is simply that they can't find program managers interested in running programs in systems areas. They also point to the robust health of the network research programs. But such comments usually brush over the dearth of programs in other areas, like distributed and cloud computing, or databases, or other kinds of scaled-out systems for important roles, like controlling self-driving cars.

NSF also has absurdly low success rates now for programs in these areas. They run a big competition, then fund one proposal out of 25, and halve the budget. This may seem like a success to the NSF program officers but honestly, when you get allocated one half a student per year for your research, and have to submit 5 proposals to see 1 funded, it doesn't look like such a wonderful model!

So that summarizes what I tend to hear from folks at NSF. At DARPA one hears two kinds of messages. One rather frequent "message" takes the form of blame. Several years ago, DARPA hired a researcher named Peter Lee from CMU, and gave him a special opportunity: to create a new DARPA advanced research topics office. This would have owned the whole area I'm talking about.

But when Rick Rashid announced retirement plans at Microsoft, Peter took Rick's old role, and DARPA couldn't find anyone as strong as Peter to replace him. After a bit of dithering around, DARPA folded those new programs into I2O, but I2O's main agenda is AI and ML, together with a general search for breakthrough opportunities in associated areas like security. Maybe asking I2O to also tackle ambitious cloud computing needs is too much? At any rate, after an initially strong set of programs in cloud computing, I2O backed away in the face of the sequester and funding was cut very deeply.

If you ask about this cut, specifically, you hear a second kind of message -- one that can sound cold and brutal. People will tell you that the perception at DARPA is that today, the best systems research happens in industry, and that if DARPA can't create and shape the industry, then DARPA has little interest in investing in the research. And they will tell you that whether NSF acknowledges this or not, the same view holds in the front offices of NSF, AFOSR, ONR, you name it.

Then you hear that same rather tired, nonsensical claim: that there "are no program officers interested in these topics." Obviously, there would be candidates if NSF and DARPA and DOE and so forth tried in a serious way to recruit them. But the front offices of those agencies don't identify these areas as priorities, and don't make a show of looking for stars to lead major programs. Given the downplayed opportunity, who would want the job? A program manager wants to help shape the future of a field, not to babysit a shrinking area on the road to funding extinction. So blaming the lack of program officers just ignores the obvious fact that the program officers are a mirror image of the priorities of the front office leadership. Recruit and thou shalt find.

Three reasons (one especially doubtful). But you know what? I think all three are just dumb and wrong.

One problem with this situation is that the industry needs top-quality employees, who need to be trained as students: trained by people like me in labs like mine. Without research funding, this pipeline stalls, and over time, industry ends up with fewer and fewer young people. The older ones become established, less active, and move towards managerial roles. So innovation stalls. As a result, the current priorities guarantee that systems platforms will stop evolving in really radical ways. Of course if systems platforms are boring, useless, unimportant -- well, then nobody should care. But I would argue that systems platforms are core parts of everything that matters. So of course we should care! And if we strangle the pipeline of people to build those platforms, they are not going to evolve in exciting ways.

Second, innovation requires out-of-the-box thinking, and under pressure to compete and be first to market with new cloud computing concepts or other new functionality, industry focuses furiously on just a subset of topics. So by starving academic research, we end up with great systems work on the topics industry prioritizes most urgently, but little attention to topics that aren't of near-term impact. And you can see this concretely: when MSR shut down their Silicon Valley lab, the word most of us heard was that the work there was super, but that the degree of impact on the company was deemed to be fairly low. MSR no longer had the resources, internally, to invest in speculative projects and was sending a message: have impact, or do your work elsewhere. Google's employees hear that same message daily. Honestly, almost all the big companies have this mindset.

So there is a sense in which the government pivot away from systems research, however it may have been intended, starves innovation in the systems area, broadly defined (so including distributed systems, databases, operating systems, etc). All of us are spending more and more time chasing less and less funding.

The bottom line as I see it is this. On the one hand, we all see merit in self-driving cars (I would add "... provided that they get lots of help from smart highways"), smart power grids, and other kinds of smart cities, buildings and infrastructures. So all of this is for the good.

But when you look at these kinds of Internet of Things use cases, there is always a mix of needs: you need a platform to robustly (by which I mean securely, perhaps privately, fault-tolerantly) capture data. You need the AI/ML system to think about this data and make decisions. And then you need the platform to get those decisions back to the places where the actions occur.

This dynamic is being ignored: we are building the AI/ML systems as if only the self-driving car runs continuously. All the cloud-hosted aspects are being designed using data in static files, ignoring the real-time dynamics and timing challenges. The cloud platforms today are, if anything, designed to be slow and to pipeline their actions, trading staleness at the edge for higher transaction rates deep inside: a good tradeoff for Facebook, but a much less obvious one for a self-driving car control system! And the research required to create the platforms we need is just not occurring: not in industry, and not in academic settings either.
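To make the staleness-versus-throughput tradeoff concrete, here is a toy model in Python. It is only a sketch under made-up assumptions, not a description of any real cloud platform: the function name batch_tradeoff, the fixed per-batch overhead, the per-item cost and the arrival rate are all hypothetical numbers chosen for illustration.

```python
def batch_tradeoff(batch_size, arrival_rate_hz, overhead_s, per_item_s):
    """Toy model of a pipeline that waits to fill a batch before processing it.

    All parameters are illustrative assumptions:
      batch_size       items collected before processing starts
      arrival_rate_hz  items arriving per second at the edge
      overhead_s       fixed cost paid once per batch
      per_item_s       extra cost for each item in the batch

    Returns (capacity in items/s, worst-case staleness in seconds).
    """
    # The first item in a batch waits for the remaining items to arrive.
    fill_time_s = (batch_size - 1) / arrival_rate_hz
    # Processing cost: the fixed overhead is amortized over the whole batch.
    service_time_s = overhead_s + batch_size * per_item_s
    capacity = batch_size / service_time_s
    staleness_s = fill_time_s + service_time_s
    return capacity, staleness_s


if __name__ == "__main__":
    for size in (1, 10, 100):
        capacity, staleness = batch_tradeoff(size, 1000.0, 0.010, 0.001)
        print(f"batch={size:4d}  capacity={capacity:8.1f} items/s  "
              f"worst-case staleness={staleness * 1000:7.1f} ms")
```

With these invented numbers, going from a batch of 1 to a batch of 100 raises capacity about tenfold, while the oldest item in each batch goes from roughly 11 ms stale to roughly 209 ms stale. That is a fine bargain for a social feed, and a much harder one to accept for a control loop.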

Even if you only care about AI/ML you actually should care about this last point. Suppose that the "enemy" (maybe some kind of future belligerent country) has fighter planes with less effective AI/ML (they only use the open source stuff, so they run a few years behind us). But their AI/ML isn't terrible (the open source stuff is pretty good). And now suppose that they invested in those ignored platform areas and managed to get everything to run 10x or 100x faster, or to scale amazingly well, just by leveraging RDMA hardware, or something along those lines.

So suppose now that we have an incoming Foo Fighter closing at Mach 4.5 and armed with last year's AI/ML, but running that code 100x faster. And our coalition fighter has the cutting edge new AI/ML, maybe some kind of deep neural network trained from data collected from the best pilots in the USAF. But the software platform is sluggish and doesn't even make use of the cutting edge NetFPGA cards and GPU accelerators because the 2010 version of Linux they run didn't support those kinds of things. Who would you bet on? I myself might bet on the Foo Fighter.

Slightly dumber AI on a blazingly fast hardware/software infrastructure beats AI++ on old crappy platforms. This is just a reality that our funding leadership needs to come to grips with. You can't run shiny new AI on rusty old platforms and expect magic. But you definitely can take open source AI and run it on an amazing breakthrough platform and you might actually get real magic that way.

Platforms matter. They really do. Systems research is every bit as important as AI/ML research!

We have a new administration taking office now, and there will clearly be a shakeup in the DC research establishment. Here's my hope: we need a bit of clear thinking about balance at NSF and DARPA (and at DOE/ARPAe, AFOSR, ONR, AFRL, you name it). All of this investment in security and machine learning is great, and should continue. But that work shouldn't be funded at the expense of investment in platforms for the real-time Internet of Things, or new kinds of databases for mobile end-users (including mobile human users, mobile cars, mobile drones, mobile warfighters and police officers and firemen...). We need these things, and the work simply takes money: money to do the research, and as a side-effect, to train the students who will later create the products.

I'm not casting a net for my own next job: I love what I do at Cornell and plan to continue to do it, even if research funding is in a drought right now. I'm managing to scrape together enough funding to do what I do. Anyhow, I don't plan to move to DC and fix this. The real issue is priorities and funding allocations: we've been through a long period during which those priorities have gotten way out of balance, and the effect is to squeeze young talent out of the field entirely, to dry up the pipeline, and to kill innovation that departs from what industry happens to view as its top priorities.

The issues will vanish if priorities align with the real needs and money is allocated by Congress. Let's hope that with the changing of the guard comes a rebalancing of priorities!

1 comment:

Unknown, 23 January 2017 at 11:42
Thank you Ken for this background, and for driving home a great point with your Foo Fighter illustration. Infrastructures still matter indeed!

This is a poor excuse, but I would guess many people, myself included, will read about some of the great work you and others are engaged in, and then take it for granted that infrastructure R&D and next gen production environments are keeping pace with those of applications.

Let's get together and talk about how we can re-invigorate funding in this arena.



Thursday, 5 January 2017
