Monday, 30 January 2017

What's the right role for research?

Something tells me that under the new administration, we'll soon be under fire to defend the core principle that government should fund academic research.  It won't be good enough to say "we invented the Internet", because policy makers might just respond "it would have been invented in industry soon enough, in any case."

With this in mind, what's our best option for making the case that government investment in academic research is worthwhile and should be sustained, and if anything, redoubled?

To me this isn't actually as obvious a proposition as it may sound, in part because over the course of my career I've seen quite a few failures of government research policy.  Here are some examples that come to mind, and I could easily summon up additional ones:
  • You may be aware that I worked with the French to revamp their air traffic control system architecture in the 1990s, and in that timeframe also built a reliable communications architecture for the US Navy AEGIS and a floor trading communication system for the New York Stock Exchange (this was back when I was commercializing a software library we created at Cornell).  What was striking was that in all three cases, there were advanced prototyping and research programs dedicated to those projects that failed to produce the needed solutions.  In fact the French ATC project approached me at the same time as a much-touted US FAA rebuild started, using cutting-edge ideas from industry.  The US project collapsed just a few years later, and billions were lost.  The French project was actually launched in a spirit of catch-up, and they approached me with some frustration that France lacked the technology know-how to solve the problem.  And this is true of the other projects I mentioned too: each of them turned to me after the "official" technology pipeline broke down and didn't deliver what they needed.  So here we have three examples, and I could give many more, in which the main government R&D projects that should have solved the problems stumbled badly.
  • Back in the 1992 timeframe, a program manager at DARPA proudly told me that he had made a gutsy decision to kill all DARPA research investments in database technologies.  Today, as we look at the amazing successes that machine learning on big data has already delivered, it is sobering to realize that DARPA turned its back on the whole area just when it should have been investing to stimulate the development of these exciting opportunities, and to ensure that the military side of the research didn't get overlooked.  (This latter point matters because the military often needs chocolate-flavored research in a world that limits itself to vanilla when focused purely on non-classified commercial opportunities.  If nobody is pushing to create chocolate-flavored technology, it doesn't necessarily get done all by itself.)
  • Under the Bush-Cheney administration, the head of DARPA was a fellow named Tony Tether, who had some sort of complex family tie to Dick Cheney.  Tony felt that our research mostly helped Indian and Chinese students learn American technology secrets which they would then sneak back home to commercialize.  The upshot of this xenophobic view was that DARPA would only fund academic research as subcontracts to proposals from companies, and surprisingly often, one got strong hints that those companies should be from a group people started calling the "FOT four": the companies run by "friends of Tony."  The evidence, years later?  Tony did a ton of damage, but it harmed the military much more than anyone else.  Today's military has really struggled to take advantage of the cloud (the new "government cloud" projects at NSA are signs that we seem to finally be getting there), deep learning, and even the new Internet of Things technology spaces.
  • A while back, I remember being told by an NSF leader, in one of those hallway conversations with voices lowered, that I needed to quit pushing quite so hard for money for the top proposals in a particular research program where I was on the selection panel.  The message was that, for political reasons, a certain percentage simply had to go to weak proposals from institutions that weren't known for doing quality research at all, and that by fighting to divert that earmarked money back into the pool (to spend it on work from a place like MIT), I was just making trouble.
So what you can see from these examples is that several kinds of failings combine to cause trouble.  One is a tendency of government officials not to understand the direct importance and value technology has in key aspects of what we do as a nation, notably in critical infrastructure and military settings.  The issue is that when our research leadership (and our funding programs) stumble, rather than landing some kind of clever shot across the bow at India and China, we simply set our own military back by 15 years or so in terms of leveraging the hottest new tech.  Potentially, we also end up with big projects like air traffic control stumbling.

A second side of the failing is the vanilla-flavored technology issue.  If we ask why industry didn't have solutions that could succeed in the US FAA rebuild, or for that matter the French ATC project, or the NYSE of the time, what stands out is that US industry is heavily optimized by Wall Street to focus narrowly and exclusively on short-term commercial opportunities with very large payback for success.  By and large, industry and investors punish companies that engage in high-risk research even if there is a high possible reward (the point being that, by definition, the work wouldn't be highly risky if it didn't run a big risk of failing, and investors want a high likelihood of a big payday).  So nobody looks very hard at niche markets, or at technical needs that will matter in 10 years but that haven't yet become front and center in their priorities, or at stabs in the dark -- attempts to find new ways to solve very stubborn problems without even a premise that there might be some kind of high-reward payout in the near term (high-risk research usually at least has the potential to hit some kind of identifiable home run, and in a fairly short period of time).

A third is the one I mentioned briefly in the final point.  When the folks who write the checks deeply misunderstand the purpose of research, it makes perfect sense for them to insist that some portion of the money go to the big state school in their home district, or be dedicated to research by teams that aren't really representative of the very best possible research ideas but that do have other attributes the funding agency wants to promote.  I'm all for diversity, and for charity to support higher education, seriously, but not to the point where my assessment of the quality of a research proposal would somehow change if I realized that the proposal was put forward by a school that has never had a research effort before, or by a team that had certain characteristics.  To me the quality of a research idea should be judged purely by the idea and by asking whether the team behind it has the ability to deliver a success, not by other "political" considerations.

If I distill it down, it comes to the following.  The value of research is really quite varied.  One role is to invent breakthroughs, but I think this can be overemphasized and that doing so is harmful, because it reflects a kind of imbalance: while some percentage of research projects certainly should yield shiny new toys that the public can easily get excited about, if we only invest in cool toys, we end up with the kinds of harmful knowledge gaps I emphasized above.

So a second role is to bridge those knowledge gaps: to invest also in research that seeks to lower the conceptual barriers that can prevent us from solving practical problems or fully leveraging new technologies.  Often this style of research centers on demonstrating best-of-breed ways to leverage new technologies and new scientific concepts, and then reducing the insights to teaching materials that can accompany the demonstrations themselves.  This second kind of research, in my view, is often somewhat missed: it comes down to "investing in engineering better solutions", and engineering can seem dull side by side with totally out-there novelties.  Yet if we overlook the hard work of turning an idea into something real, the breakthroughs never actually deliver their value.

And finally, we've touched upon a third role of research, which is to educate, but not merely to educate the public.  We also have the role of educating industry thought leaders, students, government, and others in decision-making situations.  This education involves showing what works and what doesn't, but also showing our clientele a style of facts-driven thinking and experimentally grounded validation that they might otherwise lack.

So these are three distinct roles, and for me, research really needs to play all three at once.  Each of them requires a different style of investment strategy.

So where does all of this lead?  Just as there isn't one style of research that is somehow so much better than all others as to dominate, I think there isn't a single answer, but really a multitude of narrow answers that live under a broader umbrella, and we need a new kind of attention to the umbrella.  The narrow answers relate to the specific missions of the various agencies: NSF, DARPA, DOE, OSD, AFOSR, AFRL, ONR, ARO, NIH, etc.    These, I think, are fairly well-defined; the real issue is to maintain attention on the needs of the technology consumers, and on the degree to which the funding programs offer balanced and full coverage of the high priority needs.

The umbrella question is the puzzle, and I think it is on the umbrella point that many past mistakes have centered, including the examples I highlighted.

What it comes down to is this: when we have serious technology failings, it points to a failure of oversight.  Government leadership needs to include priority-setting, and it requires a process for deliberately parceling out ownership, at least for urgent technology questions that bear directly on the ability of our nation to do things that really matter, like operating the bulk electric power grid in ways that aren't only safe (we've been exceptionally good at power safety), but that are also defensible against new kinds of terrorist threats, and effective in leveraging new kinds of energy sources.  We have lots of narrow research, but a dearth of larger integrative work.

That's one example; the issue I'm worried about isn't specific to that case but more generic.  My point is that it was during this integrative step that the failures I've pointed to all originated: they weren't so much narrow failures, but rather reflected a loss of control at the umbrella level: a failure to develop a broad mission statement for government-sponsored research, broadly construed, and to push that research mission down in ways that avoid leaving big gaps.

This issue of balance is, I think, often missed, and I would say that it has definitely been missing in the areas where I do my work: as noted in a previous one of these postings, there has been a huge tilt towards machine learning and AI -- great ideas -- but at the expense of research on systems of various kinds -- and this is the mistake.  We need both kinds of work, and if we do one but starve the other, we end up with an imbalanced outcome that ill-serves the real consumers, be they organizations like the ones that run the stock exchanges, power grids, or air traffic control systems, or even industry, where the best work ultimately sets broad directions for whole technology sectors.

A new crowd is coming in to Washington, and sooner or later, they will be looking hard at how we spend money to incentivize research.  Those of us doing the work need to think this whole question through, because we'll be asked, and we need to be ready to offer answers.  It comes down to this: to the degree that we articulate our values, our vision, and our opportunities, there will be some opportunity for a realignment of goals.  But conversely, we mustn't assume that the funding agencies will automatically get it right.

This will require considerable humility: we researchers haven't always gotten it right in the past, either.  But it seems to me that there is a chance to do the right thing here, and I would hope that if all of us think hard not just in parochial terms about our own local self-interest, but also about the broader needs, we have a chance to really do something positive.

Sunday, 29 January 2017

In 2015, Russia hacked the Ukraine power grid. How can we protect our grid? (Part 3 of 3)

The first two of my postings on this topic looked at what happened in 2015/2016, and then asked whether a similar attack could succeed here, concluding that there is in fact little doubt that a limited but highly disruptive event would be feasible: limited, because of the innate diversity of technologies in use by the nation's 10 RTO and ISO control centers and the 15 or so TOs that do bulk transmission, but disruptive even so, because an attack that succeeded against even one of these could be tremendously harmful to the US economy and to national confidence in the power grid, and could even result in substantial loss of life and damage to equipment.

There are two broad ways to think about cybersecurity.  One focuses on how good the procedures are that we use to install patches, monitor our systems, audit the logs later to catch previously unnoticed events, and so forth.  The US sets a global standard in these respects: we treat power systems as nationally critical infrastructures, and are employing security policies that couldn't be better if these were sensitive military assets.  Honestly, if there is a known cybersecurity protection concept, someone in the government regulatory agencies that oversee the grid has thought about applying it to the grid, and if the conclusion was that doing so would be beneficial, it has been done.

The problem is that Ukraine wasn't so terrible -- the cybersecurity posture of that country may not have been quite as sophisticated as ours, but it wasn't sloppy, not by a long shot.  So we have a situation here where we are a little better than Ukraine, but shouldn't assume that a little better represents absolute protection.  Quite the contrary: the right conclusion is that this is a good posture, and reassuring, but not enough.

This leads to the second point of view: where cybersecurity threats are concerned, one must also accept that extremely large, complex computing systems are so fundamentally insecure, by their very nature, that security is just not achievable.  In the case of the power grid, while we have long lists of the people authorized to access the systems in our control centers, one wouldn't want to assume that those lists aren't accessible to Russia or China, and that there isn't a single one of those people with a gambling issue or a serious financial problem that might make them vulnerable.  We bring new hardware into these centers continuously: things like new printers, and new telephones, and new routers, and new advanced digital relays (this last is a reference to building-sized computer-operated systems that do power transformations from high to low tension, or from DC to AC and back, and are controlled by banks of built-in computers).  My guess is that in a single day, a typical RTO deploys 1000 or more computing devices into their infrastructure.  Would anyone in their right mind really be ready to bet that not one of those has a vulnerability?
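To make that bet concrete, a bit of back-of-envelope arithmetic helps.  The 0.1% per-device flaw rate below is purely my own illustrative assumption, not a measured figure from any utility:

```python
def p_at_least_one_vulnerable(n_devices, p_per_device):
    """Probability that at least one of n independent devices is flawed:
    the complement of 'every single device is clean'."""
    return 1.0 - (1.0 - p_per_device) ** n_devices

# 1000 devices deployed in a day, each with an assumed 0.1% chance of a flaw:
print(f"{p_at_least_one_vulnerable(1000, 0.001):.0%}")  # roughly 63%
```

Even at odds of one in a thousand per device, the chance that a single day's deployments introduced at least one vulnerability comes out to nearly two in three.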

So as a practical matter, one really could imagine the kinds of exploits seen in Ukraine, or similar ones, succeeding in the US. 

We have to open our eyes to the serious possibility of waking up some morning with the power out, and with the SCADA systems needed to restore power so badly compromised that we can't trust them while restoring and operating the grid for the months it might take to rip them all out and replace them!  Worse, the news might well be reporting that a nuclear power control system has also been disabled, and that all the nuclear reactors made by such-and-such a vendor have been shut down worldwide as a precautionary measure.

Fortunately, there really are steps we could take right now so that the morning this all happens, our situation would be stronger.  As of today, the biggest worry I would express is simply that the morning after, we'll be blind as well as knocked down: knocked down in the sense that the grid would be out, but blind because we wouldn't trust our standard control solutions, or they might have been so badly damaged in a physical sense that we just can't restart them (like a car after someone dumps sugar into the gas tank: until you've cleaned the engine, you won't get it started even if you try really, really hard).

But this issue of blindness is one we can tackle.  At Cornell my group has been working with the New England Independent System Operator (ISO-NE) and the New York Power Authority (NYPA), and has had a great dialog with a number of other ISOs, RTOs and similar organizations.  We've been prototyping systems that could offer a survivable monitoring and data mining capability: an option to which power operators might turn on the morning of the day after.

Our GridCloud system is a hardened technology that uses ideas from secure cloud computing to create a powerful data collection and archiving framework.  Right now GridCloud captures data mostly from phasor measurement units, or PMUs (the devices that produce synchrophasor data), but the plan is to also begin to track network models, SCADA system outputs (so-called EMS data), data from ADRs, etc.  In fact there is a great deal of telemetry in the modern grid, and we really think one could stream all of it into such a system and archive everything, cloud-style.  It would be a big-data system, but the cloud teaches that big data shouldn't scare us.  In fact, GridCloud runs on today's commercial clouds: we support AWS from Amazon (in its virtual private cloud mode, of course), and are about to add Microsoft's Azure container environment as a second option, also in a secured mode of operation.  You encrypt all the communication connections, send all the data in triplicate, and run the entire cloud system mirrored -- our software can handle this -- and voila.
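The triplicate-delivery idea can be sketched in a few lines.  This is my own toy illustration of a majority-voting rule, not GridCloud's actual wire protocol, and the record format and channel representation are made up:

```python
from collections import Counter

def send_in_triplicate(record, channels):
    """Deliver the same record over three independent channels (plain lists
    here, standing in for encrypted network connections)."""
    for ch in channels:
        ch.append(record)

def majority_read(copies):
    """Accept the value that a majority of channels agree on, masking a
    single corrupted or lost copy."""
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise ValueError("no majority: data cannot be trusted")
    return value

channels = [[], [], []]
send_in_triplicate("PMU-7:60.002Hz", channels)
channels[1][0] = "PMU-7:00.000Hz"     # one copy corrupted in transit
print(majority_read([ch[0] for ch in channels]))  # the two good copies win
```

The same pattern extends naturally to mirrored copies of the whole system: as long as a majority of replicas agree, a single compromised replica cannot silently falsify the record.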

Internally, GridCloud includes a self-management component called CM (we used to think of this as standing for CloudMake, because the solution mimics the Linux makefile model, but lately we are finding that CloudManage might be more evocative for users).  CM keeps the GridCloud wired together, helping restart failed elements and telling each component which connections to restore.
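The makefile analogy is worth making concrete.  Here is a toy sketch of dependency-driven restart in the spirit of CM; the component names and the recovery rule are hypothetical, not CM's real configuration:

```python
# A toy dependency graph: each component lists what must be running first.
# The component names are hypothetical, for illustration only.
DEPS = {
    "collector": [],
    "archive":   ["collector"],
    "estimator": ["archive"],
    "viewer":    ["estimator"],
}

def restart_plan(failed, deps=DEPS):
    """Like a makefile: a failed component invalidates everything that
    depends on it, and restarts proceed in dependency order."""
    dirty = {failed}
    changed = True
    while changed:                      # propagate the failure downstream
        changed = False
        for comp, needs in deps.items():
            if comp not in dirty and dirty.intersection(needs):
                dirty.add(comp)
                changed = True
    order = []                          # bring dependencies back first
    while dirty:
        ready = [c for c in dirty if not dirty.intersection(deps[c])]
        order.extend(sorted(ready))
        dirty.difference_update(ready)
    return order

print(restart_plan("archive"))  # ['archive', 'estimator', 'viewer']
```

The point of the analogy: just as make rebuilds only what a changed file invalidates, CM restarts only what a failed element invalidates, and tells each restarted component which connections to restore.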

Another component of GridCloud is our Freeze Frame File System, mentioned here previously.  FFFS is exciting because it incorporates a very flexible and powerful temporal data retrieval option, where you can pull up data from any time you like, with millisecond precision.  The retrieved data looks like a snapshot of the file system, although in fact we materialize it only when needed and retain it only as long as applications are actively reading the files.  These snapshots are very fast to extract, look just like normal file system contents, and are also tamper-proof: if Tom Cruise zip-lines into one of our systems and changes something, he won't get away undetected.  FFFS will notice instantly and will pinpoint the change he made.
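The two FFFS ideas described above, millisecond-precision retrieval and tamper evidence, can both be illustrated with a toy hash-chained log.  This is purely a sketch: the real FFFS is a full file system, and its internals differ:

```python
import bisect
import hashlib

class TemporalLog:
    """A toy append-only log in the spirit of FFFS: every write is
    timestamped (ms) and hash-chained, so past state can be read back at
    any instant, and any later tampering can be pinpointed."""

    def __init__(self):
        self.entries = []        # (t_ms, value, chained_hash)
        self._prev = b"genesis"

    def append(self, t_ms, value):
        h = hashlib.sha256(self._prev + f"{t_ms}:{value}".encode()).digest()
        self.entries.append((t_ms, value, h))
        self._prev = h

    def value_at(self, t_ms):
        """Return the last value written at or before t_ms."""
        times = [t for t, _, _ in self.entries]
        i = bisect.bisect_right(times, t_ms) - 1
        return self.entries[i][1] if i >= 0 else None

    def first_tampered(self):
        """Re-walk the hash chain; return the index of the first entry whose
        stored hash no longer matches its contents, or None if intact."""
        prev = b"genesis"
        for i, (t_ms, value, h) in enumerate(self.entries):
            expected = hashlib.sha256(prev + f"{t_ms}:{value}".encode()).digest()
            if h != expected:
                return i
            prev = expected
        return None

log = TemporalLog()
log.append(1000, "breaker OPEN")
log.append(1500, "breaker CLOSED")
print(log.value_at(1200))        # 'breaker OPEN': state as of t=1200ms
log.entries[1] = (1500, "breaker OPEN", log.entries[1][2])  # Tom Cruise strikes
print(log.first_tampered())      # 1: the altered entry is pinpointed
```

Changing any archived value without recomputing every downstream hash breaks the chain, which is the essence of the tamper-evidence property.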

Then we have tools for analysis of the data.  A very useful one simply reconstructs the power system state from the captured data (this is called a linear state estimation problem), and we use a version developed by researchers at Washington State University for this purpose.  Their solution could run at the scale of the entire country, in real-time, with delays from when data enters to when we visualize the output that can be as small as 100 to 200ms including network transmission latency.
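For readers curious what a linear state estimator actually computes: given a measurement model z = H*x, the least-squares estimate is x = (H^T H)^-1 H^T z.  Here is a stripped-down sketch with made-up numbers, and none of the weighting or bad-data detection a production estimator such as the Washington State one needs:

```python
def estimate_state(H, z):
    """Least-squares fit of a 2-element state x to measurements z = H*x,
    via the normal equations, solved with an explicit 2x2 inverse."""
    a = sum(h[0] * h[0] for h in H)
    b = sum(h[0] * h[1] for h in H)
    d = sum(h[1] * h[1] for h in H)
    r0 = sum(h[0] * zi for h, zi in zip(H, z))
    r1 = sum(h[1] * zi for h, zi in zip(H, z))
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)

H = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]   # measurement sensitivities
true_x = (1.02, 0.98)                        # per-unit bus voltages (made up)
z = [h[0] * true_x[0] + h[1] * true_x[1] for h in H]  # noise-free readings
print(estimate_state(H, z))                  # recovers (1.02, 0.98)
```

With three measurements constraining two unknowns, the estimator has redundancy: a single bad reading shifts the estimate but does not determine it, which is exactly why PMU-dense grids can be reconstructed even with some sensors compromised.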

Our current push is to add more analysis flexibility and power, in the form of simple ways of dynamically extracting data from the power system archives we create (for example, time-varying matrices or tensors that contain data sampled directly from the input we captured and stored into FFFS), and then allowing our users to define computations over these little data objects.
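A sketch of what those dynamically extracted "little data objects" might look like; the archive layout and the field meanings here are my own invention for illustration:

```python
# Archived PMU samples: (time_ms, per-bus frequency readings).  Made-up data.
archive = [
    (1000, [60.00, 59.99]),
    (1033, [60.01, 59.98]),
    (1066, [59.97, 59.99]),
    (1100, [60.02, 60.00]),
]

def window_matrix(archive, t_start, t_end):
    """Materialize the archived samples in [t_start, t_end] as a matrix:
    rows are sample times, columns are buses."""
    return [row for t, row in archive if t_start <= t <= t_end]

def column_means(matrix):
    """One example of a user-defined computation over the extracted object."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

m = window_matrix(archive, 1000, 1066)
print(column_means(m))   # per-bus mean frequency over the first ~66 ms
```

The idea is that the heavy lifting (capture, archiving, time indexing) lives in the platform, while analysts write small computations like `column_means` over whatever window they extract.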

We have a dialog with Internet2 to figure out ways to run on their networking technology even if the entire Internet itself were disabled (they have a special arrangement that lets them run on isolated network links, and they can use clusters of computers on military bases or in places like hospitals as their operating nodes).  So we could get a minimal computing capability running this way, firewall it against new attacks by the bad guys, and bootstrap GridCloud inside, all with very strong security.

The idea is that the morning after, rather than waking up blind, a GridCloud operator would have an option: to use GridCloud itself as a backup SCADA solution.  The system experts could data mine to understand what the bad guys just did, and to inventory the damage and identify healthy aspects of the system.  Maybe the exploit against the nuclear reactor left a recognizable signature in the power grid itself: if so, GridCloud would have seen it prior to the crash, and archived it.  Moreover, since terrorists really do engage in repeated dry runs, maybe they tried out the same idea in the past: we can data mine to see when and how they did it, and perhaps even to figure out who within the ISO staff might have helped them.  Plus, we can then look for evidence of similar dry runs at other nuclear plants -- finding them would validate keeping those plants offline, but not seeing signs of trouble could reassure the owners enough to restart the ones that haven't been compromised.

All in all, this kind of active response would give us a day-after capability that might get power back up and running far more quickly than without such a tool: maybe hours instead of days, or days instead of weeks.  We would reduce the risk of the whole event escalating into a war, because politicians with more knowledge of the facts are less likely to act reflexively on the basis of incomplete data or theories, and are far less likely to be tricked into blaming the wrong parties.

Could our backup solution be compromised, too?  Sure, but my belief is that each separate exploit the attacker would need to carry out successfully diminishes the chance of real success.  Tom Cruise can zipline into one control room, use his magic knock-out gas to incapacitate a few guards, and maybe plant one of his team members in the group that services the control room computers too.  But could he mount an exploit 20x more difficult, simply by virtue of needing to compromise more and more distinct systems, all watching one another, all sensing divergences in view?  I doubt it.
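That "20x more difficult" intuition is just compounding probabilities: if every stage of the attack must succeed independently, the success rates multiply.  The 50% per-stage figure below is an assumption chosen purely for illustration:

```python
# Back-of-envelope defense-in-depth arithmetic; the per-stage success
# rate is an illustrative assumption, not an estimate.
def p_full_compromise(stages, p_per_stage):
    """If every stage must succeed independently, success rates multiply."""
    return p_per_stage ** stages

print(f"{p_full_compromise(1, 0.5):.6f}")   # one system to beat
print(f"{p_full_compromise(20, 0.5):.6f}")  # twenty systems watching one another
```

Even generous per-stage odds collapse to roughly one in a million across twenty independent stages, which is the whole argument for one more independent layer.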

So while GridCloud may sound like just one more layer of monitoring tool, actually having one more layer may be exactly what the doctor ordered. 

Friday, 27 January 2017

In 2015, Russia hacked the Ukraine power grid. How big is the risk here? (Part 2 of 3)

If you've read part 1, hopefully I've convinced you that we do face a genuine risk.

But suppose someone set out to attack us.  How damaging could an attack really be?

Let's imagine that an attacker invests the needed resources.  And let's not fool ourselves: the country that attacked Ukraine must have been Russia, but Russia isn't the only country able to prepare and carry off such exploits: at a minimum, China can too, and North Korea may have found ways to hire the needed expertise on the black market.  So there are at least three possible nation-state actors to consider.  Beyond that, there is probably at least one free-lance group out there with very high skill sets, working for organized criminals who use those skills for theft, blackmail and so forth.  And then there are friendly countries with deep hacking skills: the UK, the rest of Europe, Israel.  So there are a bunch of potential bad actors.

For the sake of argument, let's not rehash part 1: assume that some bad actor has already done his homework.  So Russia, or China, or whoever, has penetrated, perhaps, a handful of US RTO and ISO organizations.  How much harm would they be able to do, if they were inclined to attack?

With control over a SCADA system, a few styles of attack become feasible.  First and easiest is to just disable the SCADA control system itself.  With modern computers, if you reformat the disks and reflash the PROMs used for booting, you can render the machine pretty much useless, at least without a lot of hassle.  You can also cause some nasty crashes simultaneously with a blackout, and toss in additional barriers that block restarts even from clean computers: in Ukraine, the uninterruptible power supplies that were supposed to guarantee power for the grid operations center were hacked too: not only did they fail to supply clean power, but they were reprogrammed to actually cause power surges rather than protect against them.  So that's easy.  Given years during which the attack was being planned, you could probably also compromise backup systems and hide some backdoor options, so that even after being discovered, there might be a way back in.

This alone is probably enough to cause a week or two of total chaos, similar to what Ukraine experienced right after the attack.  They ended up cobbling together a power grid and running it with purely human operations for a while.

But in fact you could do far more. 

Having broken in, you don't necessarily have to start by destroying the SCADA system.  Another option is to subvert the SCADA system to attack some of the technology components used in the grid itself.  A power grid has all sorts of elements: switching stations of the kind usually stuck in obscure locations out in the woods, dams and coal burning generators and nuclear reactors, wind farms, etc.  Some of these could be damaged by power surges, and others would probably be vulnerable to bizarre and incorrect control commands.  In fact, one thing you could take advantage of is that because SCADA systems are presumed to be secure, they usually have special direct ways to communicate into the control centers that operate such components.  

So you could explore ways of logging into the control center for a nuclear reactor and messing with its protections against core meltdown, or maybe look into the possibility of opening drainage in a series of dams in succession to generate a massive downstream flood.  Perhaps you could trick wind turbines into tearing themselves apart by deliberately configuring the wind-vanes to put them under as much stress as possible.  

Now we have to start to imagine an attack that could destabilize nuclear plants, flood entire cities, and leave wind farms in shreds: attacks with very real physical consequences!  And that kind of damage could take months to repair, or even years.

Of course, such attacks wouldn't be so easy to prepare: installing and debugging the exploit becomes the hardest step.  There's no way you could do this to 10 RTO/ISO control systems simultaneously, and to their associated TOs, and to a large number of nuclear plants and dams and wind farms.  And keep in mind: a nuclear plant control room may accept requests (to increase or cut power production) from the local RTO, but this isn't the same, at all, as being wide open to massive intrusion through the connection used to send those commands.  By and large, you'll be blocked at every step by firewalls, multi-factor authentication requirements, monitoring systems of varying levels of sophistication, you name it.  Try to deploy an intrusion capable of doing damage at the scale of the whole US and you'll be detected long before you can launch the attack.

But maybe you could set a smaller goal and succeed.

With years to prepare, and unlimited national backing, my guess is that a really professional team would overcome the barriers at least in some settings.  In Ukraine, an attack focused mostly on SCADA compromise was already enough to wreak utter havoc: their system was down for weeks.  One might assume that in the US the impact would be more limited and shorter, but my guess is exactly the converse: I think that a cleverly planned exploit could be far more harmful here, even if pretty narrow in scope, simply because we depend so strongly on electrically powered technologies.  Moreover, you could take advantage of our tendency to panic: in the US, we tend to overreact in extreme ways to certain types of fears.

For example, suppose that one icy cold winter morning we awoke with the power out for the northern 20% of the US: just a mundane blackout, but even so, lots of houses would suddenly feel very cold.  Worse, suppose that as the government was taking stock of the situation, several large hydroelectric generators suddenly malfunctioned in ways that indicated serious damage: perhaps two massive transformers took irreparable hits and will take a year or more to replace.  And then suddenly in comes a report that a nuclear reactor control system may have been compromised too: a particular reactor shut down into a fail-safe mode, and every nuclear reactor from that same vendor in the whole US has been taken offline as a precautionary measure, expanding our regional problem into a massive nation-wide power shortage.  And just to enliven things, perhaps a few other mishaps occur at the same time: a jet landing at JFK collides with a plane on the runway, killing 500 passengers.  Two more near-miss events of the same kind are reported at the Denver and Atlanta airports, and nationwide air traffic control shuts down.  A train carrying toxic chemicals derails in downtown Atlanta.  A huge explosion is reported in the new Keystone oil pipeline, and it shuts down.  Things like that.

Well, we know what would very likely happen next.

But let's not even go there.  Instead, in part 3, I'll offer some thoughts on how to make things better.

In 2015, Russia hacked the Ukraine power grid. Are we next? (Part 1 of 3)

In a widely publicized episode, the electric power grid in Ukraine was attacked during late 2015 and 2016, using computer viruses that destabilized the supervisory control and data acquisition  (SCADA) system.  Experts are about as certain as one can be that the Russian government was behind the episode.

Could this occur in the United States, and if so, what could we do about it?  If it happens, should we assume that the attack originated in Russia, like the Ukrainian event?

To answer such a question, we really need to break it down.  Here, I'll start with the basics: what do we really know about the Ukraine attack?  My sources, though, are unclassified (in fact there is one Wired article with such detail that I tend to trust it heavily).  I mention this because we do live in an era of fake news, and there could be a deeper layer of insights that is not available to me.  For a question of national aggression -- literally an act of war by one country against another -- one needs to go deep and not trust the superficial!

But I'm just a regular guy without a clearance, and if I knocked on doors at the CIA and NSA, I wouldn't get very far.  Here's what I've learned from public materials.

First, in the wake of the event, a series of very reputable groups flew to Ukraine and participated in really careful studies of the precise modality of the attack.  There can be little doubt that the attack was extremely sophisticated and carefully planned, and that Ukraine was not some sort of banana republic with incompetent management of its national grid (it turns out that Ukraine was highly professional and pretty close to the state of the art).  From this, we have little choice but to acknowledge that the US grid is probably vulnerable, too.

Ukraine built its grid during a period when it was fairly wealthy, and was in a position to buy cutting edge technologies.  The system was a relatively standard high-quality SCADA solution, obtained from the same vendors who sell such systems here in the US.  Moreover, the country knew of the threat from outside, and managed its system quite professionally, using a "military" security standard.   However, Ukraine wasn't the most paranoid operator you could imagine.  In particular, it did allow operators to use computers attached to its network to receive email with links and attachments, and they could access Internet web sites from their office computers.

Apparently, this was the first portal the attackers leveraged: they sent some form of normal looking work-related email, but it lured operators to a poisoned web site, which downloaded a virus that connected back to the attacker control system.  The trick being: the web site did whatever it was nominally supposed to do, so the operators never realized they had been compromised.

For example, think of the first time you visited the real-time market feed data site provided by your bank or retirement fund: you probably agreed to install some form of web browser plug-in to see the animated graphs of market activity.  The first step of the Ukraine attack was a bit like that, but that plug-in (in addition to doing what it promised), did a bit more.

Before concluding that this first step could never succeed in the US, one has to pause and realize that many of us receive emails from the HR organizations of our employers that require clicking links.   Many of us work for companies that use plug-ins to offer all sorts of functionality through browser extensions.  In fact, I work for such a company: Cornell University does this too.  If you were familiar with Cornell's web page layouts and logos, and knew how to compose a professional looking email with the right content, even a security-conscious person like me might follow the link without much thought.

The core insight is that because of the so-so state of security on our computer operating systems, web browsers and other technologies, even normal news sites and other mundane web sites can potentially be a launch-point for attacks.   So this first step of the Ukraine attack could be successful in the US too, today.  In fact I know of very similar events that led to intrusions right into top-secret DoD systems and ones used within the White House, and that's without even having access to the classified version of the picture.  This definitely could happen in the US, even in highly sensitive systems.

Ok, but in fact, Ukraine's office computers weren't actually connected to the SCADA systems.  In fact the vast majority of  the state-operated power grid company employees had mundane jobs, like taking new orders, billing, scheduling repair crews.  Breaking into their computers wouldn't lead anywhere. So what happened next?

The initial exploit gave the attackers a toehold: it left them with hooks they could use to (in effect) log into a few computers, inside the Ukraine power grid operations center, but not ones concerned with actual power grid operations.  Those systems were much better protected.  So, our hackers needed to break through a second firewall.

As I understand it, this required finding systems used by operators who actually had permissions to log into the SCADA network.  Apparently, the Ukraine system masked the roles of the computers, and figuring this out wasn't simple and took months.  Nonetheless, step by step, the intruders managed to identify several such systems. 

In their next step, it seems that the hackers used so-called root kits to attack these computers from inside the Ukraine power system corporate network.  A root kit is a package of software, collected by hackers over decades, that takes advantage of subtle software bugs to sneak into a computer and grant the attacker superuser control, unnoticed by the owner of the machine.  There are a surprisingly large number of such kits -- you can download dozens from the web.  And then beyond those are specialized ones created by national intelligence services: they often use vulnerabilities that their designers discovered, and that nobody else was even aware of.

Software has bugs, and older software systems are the worst of all.  Don't fool yourself: any system you can buy or use today has vulnerabilities.  Some are child's play to break into, and some are much more resilient, but none is foolproof.

So, our intruders laid low, figured out which machines to attack, and then finally after months of effort, managed to compromise a machine with VPN access to the SCADA platform.  But VPN software expects passwords and often more: two-factor systems that use RSA keychain dongles, fingerprints, special cards -- all such things are common.  I take it that Ukraine was using such a system.

To circumvent those issues, the trick is to modify the operating system itself, so that the next time a legitimate operator logs into the VPN, you can ride along with him or her, sneaking in for a little while and then ultimately, if luck is on your side, to leave a subtle open doorway, perhaps in the form of a legitimate-looking data feed that actually is carrying your covert traffic. 

So, after waiting for someone to activate the VPN from that machine, our attackers eventually managed to leapfrog into the secured environment, at some instant when the VPN connection was open.  Moreover, once in, even more steps were needed to actually penetrate and ultimately, incapacitate the SCADA platform, and those had to occur without site security systems noticing the intrusion.   They apparently had further layers of passwords to crack (here, access to a supercomputer can be helpful), and all of this had to occur without tripping the monitoring systems.

So who was behind this?  President Trump talks often about the 300lb pimply kid sitting in his bedroom.  Was it him?

Notice that the first steps required fluency in written Ukrainian, and detailed knowledge of HR emails and other corporate emails within the organization.  Subsequent steps required knowing the software systems and versions and their vulnerabilities, and there may even have been a step at which a supercomputer was used to break a password by brute force.  Step by step, one would have to understand what monitoring tools were in use and how to avoid detection by them.  Military-quality root kits aren't so easy to come by.
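To get a feel for why a supercomputer helps at the password-cracking step, here is a quick back-of-the-envelope sketch.  The guess rates are illustrative numbers of my own, not figures from the Ukraine investigation:

```python
# Back-of-the-envelope (illustrative figures, not from the Ukraine reports)
# of why brute-forcing a password benefits from supercomputer-class hardware.

def seconds_to_crack(alphabet_size, length, guesses_per_second):
    """Worst-case time to exhaust every password of the given length."""
    keyspace = alphabet_size ** length
    return keyspace / guesses_per_second

# 8 characters drawn from the ~95 printable ASCII symbols.
keyspace_8 = 95 ** 8

# A single desktop GPU might manage ~1e9 hash guesses per second; a
# national-scale cluster could plausibly reach ~1e13 (assumed rates).
desktop = seconds_to_crack(95, 8, 1e9)
cluster = seconds_to_crack(95, 8, 1e13)

print(f"keyspace: {keyspace_8:.2e} candidates")
print(f"one GPU:  {desktop / 86400:.0f} days")
print(f"cluster:  {cluster / 60:.0f} minutes")
```

The point of the arithmetic: what takes a lone attacker's machine months falls to minutes on national-scale hardware -- exactly the kind of resource a kid in a bedroom doesn't have.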

The bottom line?  It is unquestionable that Russia was behind this attack.   And Russia happened to benefit from it too, enormously.  They had the means, the motive, and the timing coincided with a major flare-up in military tensions between Russia and Ukraine over Crimea, which Russia had just annexed.

The imaginary fat kid could never have managed this.  What we see here is how a very sophisticated exploit that took years to prepare was carried out very slowly and deliberately.  Without getting even more detailed, investigators were able to show that the intruders had debugged the exploit over an extended period, using all sorts of pre-exploit testing and trial runs of aspects of the ultimate attack.  Even attack software needs to be debugged!

Ukraine was the victim of a genuinely sophisticated team of expert attackers, backed by a country or organization with massive national resources, and the patience to chip away at the system for literally years, entirely undetected.  And that country was Russia.

The US media has tended to portray Ukraine as a backwoods country that set itself up for trouble, but as I read the story, that interpretation is totally invalid.  Most of what I've described definitely could occur in the US.   Our power grid operators may have plugged the specific holes that have now been identified in the products deployed in Ukraine, and may be better at monitoring the system and applying patches, but honestly, I wouldn't bet too heavily on that sort of thing. 

I once was invited to an unclassified NSA briefing in which the question of breaking into systems came up.  Their expert said that well, he couldn't get specific, but that honestly, it was impossible to build a system today that his people couldn't break into.  He said that modern computers have millions of lines of code, not to mention devices of many kinds that themselves include computers (routers, printers, even network interface cards).  NSA had made a science out of finding back doors into them.

He said that we should imagine a little village where the houses all had bowls of jewels on the main dining room table.  And all the doors and windows are wide open.  Even if they weren't, most of the windows have no latches and the doors have locks that share a single key.  And even if you fixed the doors and windows, the walls themselves are made of plywood screwed into wood beams, and with a screwdriver and a few minutes, you could make a new door just for yourself.  To say nothing of using a ladder to try breaking in upstairs, where the air conditioners turn out to not actually be attached and can just be pushed out of their slots.

What the US NSA can do, the Russian intelligence service can do as well.  Plus, they have tons of people who are fluent in Ukrainian human resources memo-writing.

So could the same thing happen to us?  Sure, without question.

Oddly, our main strength isn't that we operate our systems better or that we monitor them better.  We don't, and it isn't for lack of trying: these kinds of systems can't be defended against that sort of attack.

Our real advantage (a small one) is that to compromise our entire national grid, all at once, you would need to pull off at least 10 and perhaps more like 25 Ukraine-style attacks, because there are roughly 10 large scale regional transmission operators and independent system operators (RTOs and ISOs), and they work with an additional 15 or so smaller transmission operating entities.  Each makes its own technical choices, although there are some popular technologies that are near monopolies in their particular roles.  Thus what worked in Ukraine probably could work here, but might "only" knock out some subset of the overall national grid.

 (But this is plenty for one blog entry, so let's pause for a quick coffee and then we can resume...)

Thursday, 19 January 2017

We're actually all engineers

When Roger Needham was very ill, near the end, his lab members threw a final party for him.  Roger wasn't strong enough to attend, but he sent a video, in which he is wearing an engineer's hard hat.  He explains that he did some theory and came up with some clever principles and deep ideas, but he hopes to always be remembered as an engineer.  And indeed, this is how I remember him.

The image comes to mind because as I travel around on sabbatical, I'm increasingly struck by the degree to which the field of distributed systems (and systems more broadly) has begun to pivot towards the engineering side of the discipline.  This poses really hard puzzles for academic research.

For most of the first three or four decades of research in distributed computing, there were genuinely deep, mathematically hard questions we could wrestle with.  My favorite among these problems has been replication of data, and its cousin, coordinated distributed computation (state machine replication and variants of that model).  But one should also point to the problem of asynchronous consensus, with its FLP impossibility result and the various failure-oracle models for guaranteed progress, Byzantine agreement in various models, and convergent consistency.  The logic of knowledge, and ideas of causality and time.  Gossip protocols and peer-to-peer overlays.  Inexpensive synchronization mechanisms.  Honestly, we've had a good run.

But every good run eventually comes to a point where fundamentally, things need to change.

When I entered the field, operating systems researchers were just beginning to "reject" abstractions: there was a big pivot underway back to basics.  As John Ousterhout once put it, we had plenty of abstractions and the ones in Linux worked well for our users, so we needed to focus on keeping those working well as we scaled them out, migrated to new hardware and put them under new patterns of load.  But he was negative about new abstractions, or at least preferred not to see those as part of the OS research agenda, and he wasn't much more charitable towards new theory: he felt that we had the theory we needed, just as we had the abstractions we needed.  In retrospect, he was more right than I understood at the time.  Today, nearly two decades after he made that point, Linux has evolved in many ways yet the grab-bag of abstractions is pretty similar to what existed back then.

It may be time to embrace Ousterhout's point from back then: distributed computing is becoming an engineering discipline, and the best work in the coming years may be dominated by great engineering, with less and less opportunity to innovate through new theory results, or new styles of distributed computing, or new kinds of abstractions.  Like Roger Needham, we need to put on our hard hats and roll up our sleeves and turn to those engineering roots.

And there are plenty of great engineering challenges.  I'm very excited about RDMA, as you know if you've read my other postings, or read the paper on Derecho (we've concluded that 12 pages is just too short and are treating that version as a technical report; we'll send a longer version to a journal soon).  I think Derecho has some very cool abstractions, obviously: the "torrent of data" supported by RDMC, and the distributed shared memory model enabled by SST.  These innovate in the way they are engineered, and because those engineering tricks lead to somewhat peculiar behaviors, we then get to innovate by matching our protocols to our new building blocks -- and yet the protocols themselves are actually quite similar to the original Isis gbcast: the first really practical protocol strong enough to solve consensus, and in fact the first example of what we now think of as the family of Paxos protocols.  Derecho isn't identical to the Isis gbcast: the Isis protocols had an optimistic early delivery mode, for reasons of performance, and you had to invoke flush to push the pipeline of messages to their targets; we eliminated that feature in Derecho, because we no longer need it.  So we innovated by getting rid of something... but this is not the kind of innovation people usually have in mind, when I sit down with my Cornell colleagues and they rave about the latest accomplishment of unsupervised deep learning systems.

Recasting that gbcast protocol into this new programming style has been incredibly effective: Derecho is (so far as I can tell) a constructive lower bound for such protocols, optimal in every sense I can think of, including the mapping from protocol to hardware.  Yet the work is basically engineering, and while I could definitely make a case that the innovations are foundational, deep down, what I'm most proud of remains the engineering of the system.  That's the innovation.

We'll have more such challenges getting Derecho to scale to half a million nodes, or to run over a WAN:  Engineering challenges.  To solve the problems we'll encounter, we'll need to innovate.  Yet Derecho will remain, at its core, a reengineering of a concept that existed back in the late 1980's, when we first invented that early version of Isis gbcast (I guess we should have called it Paxos, particularly given that the name Isis is now seared into the public consciousness as pretty much the epitome of evil).

I could go on and talk about other areas within operating systems and systems in general, but I'm going to skip to my point instead: across the board, the operating systems community is finding that we already know the theory and already have the core solutions in hand, to the extent that those solutions have any kind of foundational flavor to them.  But we face huge practical challenges: engineering challenges.  The tough questions center on making new hardware easy to use, and on matching familiar ways of computing to the sweet spot of the new hardware we're running on.

This situation is going to be very tough for academic research departments.  I'll give one example.  Multicore hardware is fantastic for heavily virtualized cloud computing data centers, so we have Amazon AWS, Microsoft Azure (if you haven't looked at Azure recently, take a new look: very impressive!), Google's cloud infrastructure, and the list goes on.  All of them are basically enabled by the chips and the cost-effective sharing model they support.  But virtualization for the cloud isn't really deeply different from virtualization when the idea first surfaced decades ago.

So academic research departments, seeking to hire fresh talent, have started to see a dismaying pattern: the top students are no longer doing work on radical new ways of programming with multicore.  Instead, the best people are looking at engineering issues created by the specific behaviors of multicore hardware systems or other kinds of cutting edge hardware.  The solutions don't create new abstractions or theory, but more often draw on ideas we've worked with for decades, adapting them to these new platforms.  My colleagues and I have had "help wanted" signs out for years now, and hoped that brilliant foundational systems researchers would show up.  We just haven't known how to deal with the amazingly strong engineers who came knocking. 

These days, the best systems candidates are certainly innovators, but where is the next FLP impossibility result, or the next Byzantine agreement protocol?  One hears the complaints from non-systems people.  A few years back, those of us in systems were all talking about Corfu.  Our colleagues were baffled, asking all sorts of questions that betrayed a deep disconnect: "What's the big deal about Corfu, Microsoft's amazingly fast append-only-log, with its Paxos guarantees: didn't we have logs?  Didn't we have Paxos?  Is an SSD such a big deal?"  For me as an engineer, the answer is that Corfu embodies a whole slew of novel ideas that are incredibly important -- engineering ideas, and engineering innovations.  But for our colleagues from other fields, it has become harder and harder to explain that the best work in systems isn't about foundational concepts and theory anymore.

The effect of this is to make it harder and harder to hire systems people, and harder for them to succeed in the field.  The very best certainly do incredibly well.  Yet the numbers of applicants are way down, and the shape of these successes looks more and more practical, and less and less foundational.

The sheer scale of the cloud stimulated a huge wave of brilliant engineering, and wonderful research papers from industry.  Yet we in academics lack testbeds or real use cases that are even remotely as ambitious, and our development teams are small, whereas companies like Microsoft and Google routinely put fifteen or twenty people onto major projects.  How can academic researchers even compete in this kind of game?

I could go on at some length on this theme, but I hope you catch the basic drift: systems has a vibrant future, but the future of systems will be more and more closely linked to the hardware and to the deployment models.  Don't expect your next systems hires to be theory students who are inventing radical new concepts that will transform our conception of consistency: we know what consistency is; we figured it out thirty years ago.  You'll need to come to grips with hiring superb engineers, because increasingly, the engineering side of the field is the important, hot part of the domain.  Meanwhile, the mathematical side struggles to say anything relevant that wasn't said long ago: without new abstractions to reason about and prove things about, it is hard for them to position themselves as applied mathematicians: mathematicians, certainly, but not necessarily very applicable to anything that matters to Amazon, Google or Microsoft!

Honestly, this worries me.  I can definitely see how one could create a new kind of systems community from scratch, with totally new conferences dedicated to what could be called brilliantly principled systems engineering.  I love that sort of stuff; so did Roger Needham.  But the main conferences already exist.  And the dynamic that has become established is one that really squeezes academic research to a smaller and smaller niche, for all the reasons I've noted: we lack the testbeds, and the user-pull, and can't easily validate our work, and don't have large enough teams.

One answer might be for the main conferences to start to target "great systems engineering" and to treat really neat engineering ideas with the same awe that we accord to Google's latest globe-spanning consistent database (built using a custom worldwide network, dedicated satellite links, geosynchronized clocks that even worry about relativistic effects...)  This might be hard, because academic systems work these days is often beautiful in small, crystalline ways, and our program committees are full of people who were trained to look for big ideas, big innovations, and big demonstrations.

My own preference would be for us to deliberately take an innovation Hank Levy introduced but to push it even further.  Hank, worried that papers were "backing up" and clogging the system, convinced OSDI and SOSP to double the number of accepted papers.  The downside is that the conferences started to have very long schedules.  I think we might consider going even further, and doubling the accept rates again, basically by taking every solid paper, so long as it innovates in some recognizable way, including through experiments or just quality of engineering.  Then we'll have way too many papers for any sane conference schedule, and the PC would have to pick some for plenary presentations, and delegate others to parallel sessions or (my preference) to big poster sessions, maybe with a WIPS style introduction where authors would get 3 minutes to advertise their work.

If we treat such papers as first class ones (a SOSP paper is a SOSP paper, whether presented in plenary session or not), I think we could perhaps revive the academic side of the field.

What if we fail to tackle these issues?  I think the answer is already evident, and I see it as I travel.  The field has fewer and fewer young faculty members and young researchers, and the ones doing PhDs are more and more inclined to join industry, where the rewards and the type of work are better aligned.  We'll just age out and fade into irrelevance.  When we do hire, our new hires will either do fancy but sort of shallow work (a traditional response to such situations), or will have trouble getting tenure: letter writers will stumble looking for those core "foundational" innovations, those new abstractions, and they won't validate engineering brilliance to nearly the degree that we should.

Does systems have a future?  I sure hope so, because I truly do love fantastic engineering, and there are fantastic engineering questions and opportunities as far into the future as one can see.  Clever tricks with new hardware, clever ways to do the same old stuff but faster and at greater scale and better than ever before -- I love that sort of work.  I enjoy talking to great engineers, and to people who just are amazingly strong software developers.

But academics haven't learned to accept the validity of such work.  And this is a mistake: we really have to adopt new performance metrics and a new mindset, because otherwise, academic systems will surely perish as this new era  continues to transform the field.

Thursday, 12 January 2017

Disconnected life in a connected world

While on sabbatical, moving from country to country, we've periodically lost reasonable connectivity options: an iPhone is more than happy to roam, but once you notice that the $10/day charges are accumulating you quickly take the phone off the network.  The days of open WiFi are long past, and so you find yourself moving from one pool of WiFi connectivity to another with big disconnected gaps in between.

This, of course, is a common pattern: all of us experience such events when driving or in trains, or in situations with poor cellular signals, and for people in the military, police, fire, or other first-responder roles, there is also the issue of being in a setting that may have experienced a major disruption, so that things like cellular phones are down.

Life in that disconnected edge has become precarious and in some ways, dangerous.  A soldier who loses connectivity is at risk from "friendly fire": the good guys won't know who he or she is, but might see movement and fire.  An emergency responder entering an area devastated by a tornado won't know where the other team members are, what's been searched and what hasn't, and might have trouble calling for reinforcements.   The automated self-driving car might find itself in fully autonomous mode, and as I've mentioned several times, I happen to think that self-driving cars that don't slave themselves to smart highway systems will be a menace.

The core problem is that as the world migrates more and more strongly to a cloud-centric model, we are on the one hand increasingly dependent on cloud-hosted applications and infrastructure, and yet on the other hand, are also forced to operate without that infrastructure support whenever disconnected.

What can be done?  I used to work on gossip and peer to peer protocols, and some believe that mesh networks offer a possible response: at a minimum, an emergency response team or a squad of ground troops would have connectivity between themselves, which is already an improvement, and if one of the team members has visibility to a satellite or an orbiting drone that can relay network messages, we could even cobble together a form of Internet connectivity.

But mesh connectivity is, I think, of limited value.  The protocols for establishing a peering relationship between mobile nodes are surprisingly slow, especially if it needs to be authenticated, and TCP performs poorly on such connections because of their relatively high loss rates.  UDP datagram protocols would be a better option, but we lack a form of UDP that could avoid the initial peering dialog: you would want a kind of UDP "anycast" (accepted by any node within range), but you only get something more like UDP tunneled over a hardware level secure channel that needs a whole three-way handshake to establish.
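To make the contrast concrete, here is a minimal sketch of the connectionless style (my own demo using standard Python sockets over loopback, not a real mesh stack): a UDP datagram needs no handshake before the first byte flows, whereas a TCP connection can't carry data until the peering dialog completes.

```python
import socket

# A "listener" node: bind and wait for datagrams from anyone -- no accept(),
# no three-way handshake, no prior peering relationship.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
listener.settimeout(2.0)
addr = listener.getsockname()

# A "sender" node: fire off a datagram immediately, with zero setup round
# trips.  (The payload is an invented example, just for the demo.)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"position-report: squad member 3", addr)

data, _ = listener.recvfrom(1024)
print(data.decode())
sender.close()
listener.close()
```

On real radio hardware, the same datagram sent to a broadcast address (with the SO_BROADCAST socket option) is about as close as today's stacks get to the "anycast" mode described above; what's still missing is hardware that makes the neighbor-sensing and security steps equally instantaneous.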

Similarly, while I love gossip protocols, I honestly don't see much of a fit.  We want a very limited kind of communication between nearby peers, and reach-back to the cloud when feasible.  Gossip imagines a whole different model based on one to all broadcast (like for Bitcoin) and to me, that just isn't a match to this use case.

Today's approach is useful, even if not universal.  For example, a kind of long-term "quasi-static" mesh connectivity (based on connections that are costly to create but that endure for a long time) can work in situations like a military patrol, where connectivity is continuous even though the squad loses reach-back to the base.  But for other cases, like groups of self-driving cars, I don't see it as a good option.  In this I'm at odds with the self-driving car community; they are great fans of convoy-style protocols in which groups of cars get near one-another, connect, and cooperate for short periods of time; my belief is that forming the communication group will take so long that the cars could easily already have banged into one-another before the first message can be sent. 

I think we need new hardware and new software, and a new push to create a mode in which cloud computing applications could routinely disconnect for periods of time but continue to operate with lightweight mobile peering relationships.  Thus first, you need a model of cloud computing that understands disconnection; we have parts of the solution, but not all.  Next, we need hardware that can support an unpeered wireless connectivity model in which nearby nodes could sense one-another very rapidly and exchange unreliable datagrams without any prior setup (if the military or police would like a special secure mode with stronger connections, no problem, but it should run over this weaker packet mode, not the other way around).  And finally, we would want to use this functionality to show that we could create peered groups of cars in milliseconds, rather than seconds to minutes as seems to be the case today. 

None of these strikes me as technically doubtful.  The big win is that we could move towards an increasingly nimble style of sometimes-connected functionality.  When connected to the cloud, applications would have all the benefits of cloud computing that we enjoy (and are becoming very dependent upon) today.  But when connectivity is poor, at least nearby cars could coordinate to avoid crashing into one-another, or nearby emergency responders could cooperate to efficiently search a damaged building for survivors.  For IoT scenarios, we could leverage this same functionality to support rapid interrogation of nearby devices, or quick sharing of data with them.

Up to now, the cloud hasn't really needed disconnected mobility, so perhaps it isn't surprising that the cloud and the edge devices try quite so hard to operate in a mode that mimics TCP, or even TCP with SSL security.  But times are changing, and we need a new edge model that can better match the real use cases.

Thursday, 5 January 2017

Infrastructures matter too

This posting is kind of a gripe, so I'll try to keep it short and more blog-like.

In my research area, which is broadly considered to be "systems" and more narrowly has names like fault-tolerance, distributed computing or cloud computing, funding used to be extremely generous.  I think that this was perhaps because the US military (DARPA, AFRL/AFOSR, etc) were finding it difficult to deal with their distributed infrastructures and wanted to catalyze change.  We had a great run, in many senses: systems research led to today's massive cloud computing platforms, gave us big leaps forward in terms of the performance of all kinds of systems, has given us genuinely exciting new security options, and enabled far higher productivity.  Our students (at every level: PhD, MS and MEng, even undergraduates) did projects that prepared them for jobs at the top companies.  Some actually went on to found those companies, or to lead their engineering sides.

So the investment paid off.

Even so, under the Bush/Cheney administration, there was a sudden and very striking shift.  First, security research flourished and expanded hugely.  Somewhat ironically, I actually led the initial studies that created DARPA's original security research programs, but that was back in 1995; I myself never became a real security researcher.  But be that as it may, there was a huge increase in funding for work on security (by now we see the value of this: many great ideas emerged from all that investment). Today there are new kinds of trusted platform modules with really remarkable capabilities, which definitely trace to this big push to create new secure computing options.

Meanwhile, systems research continued to receive attention, and that mattered because with the emergence of scaled-out systems for the cloud, we needed non-industry study of all the questions that aren't of immediate, "we need it yesterday" importance to companies like Microsoft and Google.  But there was an odd flavor to it: to do systems work in that period, you often needed a big brother: a company that somehow was viewed favorably by the Cheney crowd and the DARPA leadership.  Jokes were made about needing a "FOT project leader" ("Friend of Tony", for Tony Tether).  And many projects were cancelled abruptly or suspended for months pending reviews to make sure they were really innovative.  The FOT projects usually avoided those reviews; the non-FOT projects suffered.  Over at NSF, the Bush/Cheney administration imposed sharp cuts, which Congress then tended to reverse.  A very complex picture.

This was often justified by funding leaders who would explain that the US government shouldn't invest in IT research if industry will do the work anyhow, or if there isn't some unique need specific to the government, or the military.

In fact, you can count on industry to solve today's problems, and yesterday's problems.  But if your needs are a bit unusual (as in the critical computing sector, or the military, or hospitals), don't hold your breath.  Anything that isn't aligned with the largest market sectors will languish, and our job as academic researchers is to solve the special needs of those (important) areas, while also educating the students who go on to fill all the key roles in industry.

Fast forward eight years.  Bush and Cheney are now out, replaced by the Obama administration.  One might have expected research funding to flourish.

In fact, at least in my area (but I think more broadly), a major problem emerged instead.

As you perhaps know, President Obama and the Congress he had to work with simply didn't see eye to eye, on anything.  One side-effect of the disagreement is that big discretionary funding cuts were imposed: the so-called "sequester".  For the national research programs at DARPA and NSF and other large players, this all took place just as machine learning and AI began to have really dramatic payoff.  Yet at the government budgeting level, agency research budgets stopped growing and at best, slipped into a mode in which they barely tracked inflation.  We had what is called a "zero-sum" situation: one in which someone has to lose if someone else wins.  So as funding shifted towards security on the systems side, and towards AI and ML on the other side, money for systems research dried up.  This happened first in the database area, and then crept into all areas of systems with the possible exception of computer networks.  For whatever reason, networking research was much more favorably treated.  A new area called Cyberphysical Systems emerged and was funded fairly well, but perhaps this is because it has unusually effective leadership.  Still, in the large picture, we saw growth on the AI/ML side coupled with shrinkage in the areas corresponding to systems.

When I talk to people at DARPA or NSF, I get a very mixed and often inaccurate story of how and why this happened, and how they perceive it in retrospect.  NSF program managers insist that their programs were and remain as healthy as ever, and that the issue is simply that they can't find program managers interested in running programs in systems areas.  They also point to the robust health of the network research programs.  But such comments usually brush over the dearth of programs in other areas, like distributed and cloud computing, or databases, or other kinds of scaled-out systems for important roles, like controlling self-driving cars.

NSF also has absurdly low success rates now for programs in these areas.  They run a big competition, then fund one proposal out of 25, and halve the budget.  This may seem like a success to the NSF program officers, but honestly, when your allocation works out to half a student per year for your research, and you have to submit 5 proposals to see 1 funded, it doesn't look like such a wonderful model!

So that summarizes what I tend to hear from folks at NSF.  At DARPA one hears two kinds of messages.  One rather frequent "message" takes the form of blame.  Several years ago, DARPA hired a researcher named Peter Lee from CMU, and gave him a special opportunity: to create a new DARPA advanced research topics office.  This would have owned the whole area I'm talking about.

But when Rick Rashid announced retirement plans at Microsoft, Peter took Rick's old role, and DARPA couldn't find anyone as strong as Peter to replace him.  After a bit of dithering, DARPA folded those new programs into I2O, but I2O's main agenda is AI and ML, together with a general search for breakthrough opportunities in associated areas like security.  Maybe asking I2O to also tackle ambitious cloud computing needs is too much?  At any rate, after an initially strong set of programs in cloud computing, I2O backed away in the face of the sequester, and funding was cut very deeply.

If you ask about this cut, specifically, you hear a second kind of message -- one that can sound cold and brutal.  People will tell you that the perception at DARPA today is that the best systems research happens in industry, and that if DARPA can't create and shape the industry, then DARPA has little interest in investing in the research.  And they will tell you that whether NSF acknowledges this or not, the same view holds in the front offices of NSF, AFOSR, ONR, you name it.

Then you hear that same rather tired nonsensical claim: that there "are no program officers interested in these topics."  Obviously, there would be candidates if NSF and DARPA and DOE and so forth tried in a serious way to recruit them.  But the front offices of those agencies don't identify these areas as priorities, and don't make a show of looking for stars to lead major programs.  Given the downplayed opportunity, who would want the job?  A program manager wants to help shape the future of a field, not to babysit a shrinking area on the road to funding extinction.  So blaming the lack of program officers just ignores the obvious fact that the program officers are a mirror image of the priorities of the front office leadership.  Recruit and thou shall find.

So there you have three explanations (one especially doubtful).  But you know what?  I think all three are just dumb and wrong.

One problem with this situation is that industry needs top-quality employees, who need to be trained as students: trained by people like me, in labs like mine.  Without research funding, this pipeline stalls, and over time industry ends up with fewer and fewer young people.  The older ones become established, less active, and drift into managerial roles.  So innovation stalls.  As a result, the current priorities guarantee that systems platforms will stop evolving in really radical ways.  Of course, if systems platforms are boring, useless, unimportant -- well, then nobody should care.  But I would argue that systems platforms are core parts of everything that matters.  So of course we should care!  And if we strangle the pipeline of people to build those platforms, they are not going to evolve in exciting ways.

Second, innovation requires out-of-the-box thinking, and under pressure to compete and be first to market with new cloud computing concepts or other new functionality, industry maintains a furious focus on just a subset of topics.  So by starving academic research, we end up with great systems work on the topics industry prioritizes most urgently, but little attention to topics without near-term impact.  And you can see this concretely: when MSR shut down their Silicon Valley lab, the word most of us heard was that the work there was super, but that its impact on the company was deemed fairly low.  MSR no longer had the resources, internally, to invest in speculative projects and was sending a message: have impact, or do your work elsewhere.  Google's employees hear that same message daily.  Honestly, almost all the big companies have this mindset.

So there is a sense in which the government pivot away from systems research, however it may have been intended, starves innovation in the systems area, broadly defined (so including distributed systems, databases, operating systems, etc).  All of us are spending more and more time chasing less and less funding.

The bottom line as I see it is this.  On the one hand, we all see merit in self-driving cars (I would add "... provided that they get lots of help from smart highways"), smart power grids, and other kinds of smart cities, buildings, and infrastructure.  So all of this is for the good.

But when you look at these kinds of Internet of Things use cases, there is always a mix of needs: you need a platform to robustly (by which I mean securely, perhaps privately, fault-tolerantly) capture data.  You need the AI/ML system to think about this data and make decisions.  And then you need the platform to get those decisions back to the places where the actions occur.

This dynamic is being ignored: we are building the AI/ML systems as if only the self-driving car runs continuously.  All the cloud-hosted aspects are being designed using data in static files, ignoring the real-time dynamics and timing challenges.  The cloud platforms today are, if anything, designed to be slow and to pipeline their actions, trading staleness at the edge for higher transaction rates deep inside: a good tradeoff for Facebook, but less evident for a self-driving car control system!  And the research needed to create these platforms is just not occurring: not in industry, and not in academic settings either.
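The staleness-versus-throughput tradeoff behind that pipelining can be made concrete with a toy model (a sketch of mine; the cost constants are assumptions, not measurements of any real cloud platform): batching amortizes a fixed per-request overhead, which boosts aggregate throughput, but an update may sit waiting for its batch to fill before it is even shipped.

```python
# Toy model of batched/pipelined delivery: throughput vs. edge staleness.
# All constants are illustrative assumptions, not real measurements.

RTT_OVERHEAD_MS = 10.0   # fixed cost per network round trip (assumed)
PER_ITEM_MS = 0.1        # marginal cost to process one update (assumed)

def throughput(batch_size):
    """Updates processed per second when shipped in batches of batch_size."""
    batch_time_ms = RTT_OVERHEAD_MS + PER_ITEM_MS * batch_size
    return batch_size / (batch_time_ms / 1000.0)

def worst_case_staleness_ms(batch_size, arrival_rate_per_s):
    """An early update waits for the batch to fill, then for the round trip."""
    fill_time_ms = 1000.0 * batch_size / arrival_rate_per_s
    return fill_time_ms + RTT_OVERHEAD_MS + PER_ITEM_MS * batch_size

# With 1000 updates/s arriving:
#   batch of 1:    ~99 updates/s,   ~11 ms worst-case staleness
#   batch of 1000: ~9091 updates/s, ~1110 ms worst-case staleness
print(throughput(1), worst_case_staleness_ms(1, 1000))
print(throughput(1000), worst_case_staleness_ms(1000, 1000))
```

Big batches win on throughput by two orders of magnitude here, which is exactly the right choice for a social-network backend, and exactly the wrong one when a control decision must reach a car within a few tens of milliseconds.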

Even if you only care about AI/ML you actually should care about this last point.  Suppose that the "enemy" (maybe some kind of future belligerent country) has fighter planes with less effective AI/ML (they only use the open source stuff, so they run a few years behind us).  But their AI/ML isn't terrible (the open source stuff is pretty good).  And now suppose that they invested in those ignored platform areas and managed to get everything to run 10x or 100x faster, or to scale amazingly well, just by leveraging RDMA hardware, or something along those lines. 

So suppose now that we have an incoming Foo Fighter closing at Mach 4.5 and armed with last year's AI/ML, but running that code 100x faster.  And our coalition fighter has the cutting edge new AI/ML, maybe some kind of deep neural network trained from data collected from the best pilots in the USAF.  But the software platform is sluggish and doesn't even make use of the cutting edge NetFPGA cards and GPU accelerators because the 2010 version of Linux they run didn't support those kinds of things.  Who would you bet on?  I myself might bet on the Foo Fighter.

Slightly dumber AI on a blazingly fast hardware/software infrastructure beats AI++ on old crappy platforms.  This is just a reality that our funding leadership needs to come to grips with.  You can't run shiny new AI on rusty old platforms and expect magic.  But you definitely can take open source AI and run it on an amazing breakthrough platform and you might actually get real magic that way.

Platforms matter.  They really do.  Systems research is every bit as important as AI/ML research!

We have a new administration taking office now, and there will clearly be a shakeup in the DC research establishment.  Here's my hope: we need a bit of clear thinking about balance at NSF and DARPA (and at DOE/ARPAe, AFOSR, ONR, AFRL, you name it).  All of this investment in security and machine learning is great, and should continue.  But that work shouldn't be funded at the expense of investment in platforms for the real-time Internet of Things, or new kinds of databases for mobile end-users (including mobile human users, mobile cars, mobile drones, mobile warfighters, police officers, and firemen...).  We need these things, and the work simply takes money: money to do the research and, as a side-effect, to train the students who will later create the products.

I'm not casting a net for my own next job: I love what I do at Cornell and plan to continue to do it, even if research funding is in a drought right now.   I'm managing to scrape together enough funding to do what I do.  Anyhow, I don't plan to move to DC and fix this.  The real issue is priorities and funding allocations: we've been through a long period during which those priorities have gotten way out of balance, and the effect is to squeeze young talent out of the field entirely, to dry up the pipeline, and to kill innovation that departs from what industry happens to view as its top priorities. 

These issues will vanish if priorities realign with real needs and Congress allocates the money accordingly.  Let's hope that with the changing of the guard comes a rebalancing of priorities!