Sunday 29 January 2017

In 2015, Russia hacked the Ukrainian power grid. How can we protect our grid? (Part 3 of 3)

The first two of my postings on this topic looked at what happened in 2015/2016, and then asked whether a similar attack could succeed here. The conclusion was that there is little doubt that a limited but highly disruptive event would be feasible: limited, because of the innate diversity of technologies in use by the nation's 10 RTO and ISO control centers and the 15 or so TOs that handle bulk transmission; but disruptive even so, because a successful attack against even one of these could do tremendous harm to the US economy and to national confidence in the power grid, and could even result in substantial loss of life and damage to equipment.

There are two broad ways to think about cybersecurity.  One focuses on how good our procedures are: how we install patches, monitor our systems, audit the logs later to catch previously unnoticed events, and so forth.  The US sets a global standard in these respects: we treat power systems as critical national infrastructure, and we employ security policies that couldn't be better if these were sensitive military assets.  Honestly, if there is a known cybersecurity protection concept, someone in the government regulatory agencies that oversee the grid has thought about applying it, and if the conclusion was that doing so would be beneficial, it has been done.

The problem is that Ukraine's posture wasn't so terrible either -- that country's cybersecurity may not have been quite as sophisticated as ours, but it wasn't sloppy, not by a long shot.  So we are a little better than Ukraine, but we shouldn't assume that a little better amounts to absolute protection.  Quite the contrary: the right conclusion is that our posture is good, and reassuring, but not enough.

This leads to the second point of view: where cybersecurity threats are concerned, one must also accept that extremely large, complex computing systems are so fundamentally insecure, by their very nature, that perfect security is just not achievable.  In the case of the power grid, while we keep long lists of the people authorized to access the systems in our control centers, one wouldn't want to assume that those lists aren't accessible to Russia or China, or that not a single one of those people has a gambling issue or a serious financial problem that might make them vulnerable.  We bring new hardware into these centers continuously: new printers, new telephones, new routers, new advanced digital relays (this last refers to the computer-operated systems, some the size of buildings, that transform power from high to low tension or from DC to AC and back, controlled by banks of built-in computers).  My guess is that in a single day, a typical RTO deploys 1,000 or more computing devices into its infrastructure.  Would anyone in their right mind really be ready to bet that not one of those has a vulnerability?

So as a practical matter, one really could imagine the kinds of exploits seen in Ukraine, or similar ones, succeeding in the US. 

We have to open our eyes to the serious possibility of waking up some morning with the power out, and with the SCADA systems needed to restore power so badly compromised that we can't trust them while restoring and operating the grid for the months it might take to rip them all out and replace them!  Worse, the news might well be reporting that a nuclear power control system has also been disabled, and that all the nuclear reactors made by such-and-such a vendor have been shut down worldwide as a precautionary measure.

Fortunately, there really are steps we could take right now so that the morning this all happens, our situation would be stronger.  As of today, the biggest worry I would express is simply that the morning after, we'll be blind as well as knocked down: knocked down in the sense that the grid would be out, but blind because we wouldn't trust our standard control solutions, or they might have been so badly damaged in a physical sense that we just can't restart them (like a car after someone dumps sugar into the gas tank: until you've cleaned the engine, you won't get it started even if you try really, really hard).

But this issue of blindness is one we can tackle.  At Cornell my group has been working with ISO New England (ISO-NE) and the New York Power Authority (NYPA), and has had a great dialog with a number of other ISOs, RTOs and similar organizations.  We've been prototyping systems that could offer a survivable monitoring and data mining capability: an option to which power operators could turn on the morning of the day after.

Our GridCloud system is a hardened technology that uses ideas from secure cloud computing to create a powerful data collection and archiving framework. Right now GridCloud captures data mostly from phasor measurement units, or PMUs (the devices that produce synchrophasor readings), but the plan is to also begin to track network models, SCADA system outputs (so-called EMS data), data from advanced digital relays, and so on.  In fact there is a great deal of telemetry in the modern grid, and we really think one could stream all of it into such a system and archive everything, cloud-style.  It would be a big-data system, but the cloud teaches that big data shouldn't scare us.  In fact, GridCloud actually runs on today's commercial clouds: we support Amazon's AWS (in its virtual private cloud mode, of course), and are about to add Microsoft's Azure container environment as a second option, also in a secured mode of operation.  You encrypt all the communication connections, send all the data in triplicate, and run the entire cloud system mirrored -- our software can handle this -- and voila.
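
To make the "encrypt everything and send it in triplicate" idea concrete, here is a minimal sketch in Python of what the ingest side of such a pipeline might look like.  This is not our production code; the collector hostnames, the port, and the record format are invented placeholders.

    # A minimal sketch (not the actual GridCloud code) of the "encrypt everything,
    # send everything in triplicate" idea.  Hostnames, port, and record format are
    # hypothetical placeholders.
    import json
    import socket
    import ssl
    import time

    MIRRORED_COLLECTORS = [        # three mirrored cloud ingest points (hypothetical)
        "collector-a.example.net",
        "collector-b.example.net",
        "collector-c.example.net",
    ]
    INGEST_PORT = 4712             # placeholder port number

    def send_record(record: dict) -> int:
        """Send one telemetry record to every mirror over TLS; return how many accepted it."""
        payload = (json.dumps(record) + "\n").encode()
        context = ssl.create_default_context()      # verifies server certificates
        delivered = 0
        for host in MIRRORED_COLLECTORS:
            try:
                with socket.create_connection((host, INGEST_PORT), timeout=2) as raw:
                    with context.wrap_socket(raw, server_hostname=host) as tls:
                        tls.sendall(payload)
                        delivered += 1
            except OSError:
                # One mirror being down is tolerable: the others still archive the data.
                pass
        return delivered

    if __name__ == "__main__":
        sample = {"pmu_id": "PMU-17", "t": time.time(), "v_mag": 0.98, "v_angle": -12.4}
        print(f"delivered to {send_record(sample)} of {len(MIRRORED_COLLECTORS)} mirrors")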

Internally, GridCloud includes a self-management component called CM (we used to think of this as standing for CloudMake, because the solution mimics the Linux makefile model, but lately we are finding that CloudManage might be more evocative for users).  CM keeps GridCloud wired together, restarting failed elements and telling each component which connections to restore.
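
For readers who like code, here is a toy illustration of that makefile-like idea: components declare what they depend on, a supervision pass restarts anything that failed in dependency order, and each piece is then told which connections to restore.  The component names and the start/connect hooks are hypothetical; this shows the shape of the approach, not the actual CM implementation.

    # Toy sketch of a makefile-style supervision pass, not the real CM.
    from dataclasses import dataclass, field

    @dataclass
    class Component:
        name: str
        depends_on: list = field(default_factory=list)
        alive: bool = False

        def start(self):
            print(f"starting {self.name}")
            self.alive = True

        def connect(self, upstream):
            print(f"{self.name} reconnecting to {[c.name for c in upstream]}")

    def reconcile(components):
        """One pass of the supervision loop: bring up failed pieces in dependency order."""
        by_name = {c.name: c for c in components}
        done = set()

        def bring_up(c):
            if c.name in done:
                return
            upstream = [by_name[d] for d in c.depends_on]
            for dep in upstream:
                bring_up(dep)             # dependencies first, like make targets
            if not c.alive:
                c.start()
            c.connect(upstream)           # tell it which connections to restore
            done.add(c.name)

        for c in components:
            bring_up(c)

    if __name__ == "__main__":
        archive   = Component("archive")
        collector = Component("collector", depends_on=["archive"])
        estimator = Component("state-estimator", depends_on=["collector"])
        archive.alive = True              # pretend only the archive survived a crash
        reconcile([archive, collector, estimator])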

Another component of GridCloud is our Freeze Frame File System, mentioned here previously.  FFFS is exciting because it incorporates a very flexible and powerful temporal data retrieval option: you can pull up data from any time you like, with millisecond precision.  The retrieved data looks like a snapshot of the file system, although in fact we materialize it only when needed and retain it only as long as applications are actively reading the files.  These snapshots are very fast to extract, look just like normal file system contents, and are also tamper-evident: if Tom Cruise zip-lines into one of our systems and changes something, he won't get away undetected.  FFFS will notice the change instantly and will pinpoint exactly what he altered.
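
To give a flavor of those two properties -- reads indexed by time at millisecond precision, and tamper evidence -- here is a tiny sketch.  It is not the real FFFS, and the class and its API are invented for illustration.

    # Not the real FFFS: just a sketch of time-indexed reads plus tamper evidence
    # via a hash chain over the append-only update history.
    import bisect
    import hashlib

    class TemporalFile:
        def __init__(self):
            self._times = []            # update timestamps in ms, kept in order
            self._blobs = []            # file contents as of each update
            self._chain = ["genesis"]   # rolling hash over the update history

        def append(self, t_ms: int, data: bytes):
            self._times.append(t_ms)
            self._blobs.append(data)
            prev = self._chain[-1]
            self._chain.append(hashlib.sha256((prev + data.hex()).encode()).hexdigest())

        def read_at(self, t_ms: int) -> bytes:
            """Contents as of time t_ms: a point-in-time 'snapshot' view of the file."""
            i = bisect.bisect_right(self._times, t_ms)
            return b"" if i == 0 else self._blobs[i - 1]

        def verify(self) -> bool:
            """Recompute the hash chain from the stored data; False means history was altered."""
            h = "genesis"
            for data, recorded in zip(self._blobs, self._chain[1:]):
                h = hashlib.sha256((h + data.hex()).encode()).hexdigest()
                if h != recorded:
                    return False
            return True

    if __name__ == "__main__":
        f = TemporalFile()
        f.append(1000, b"breaker A: closed")
        f.append(1250, b"breaker A: open")
        print(f.read_at(1100))             # -> b'breaker A: closed'
        f._blobs[0] = b"breaker A: open"   # quietly edit history, Tom Cruise style...
        print(f.verify())                  # -> False: the tampering is detectable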

Then we have tools for analysis of the data.  A very useful one simply reconstructs the power system state from the captured data (this is called linear state estimation), and we use a version developed by researchers at Washington State University for this purpose.  Their solution could run at the scale of the entire country, in real time, with delays from when data enters to when we visualize the output as small as 100 to 200ms, including network transmission latency.
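
For those curious about the underlying math: linear state estimation amounts to a weighted least-squares solve.  Given measurements z related to the unknown state x by z = Hx + noise, the estimate is x = (H^T W H)^-1 H^T W z, where W weights each measurement by our confidence in it.  Below is a textbook-style sketch (not the Washington State University code) with made-up numbers.

    # Textbook weighted least-squares state estimation; H, z, and the weights are
    # invented small examples, not real grid data.
    import numpy as np

    def linear_state_estimate(H, z, weights):
        W = np.diag(weights)
        # Solve the normal equations; at national scale a sparse solver would be used instead.
        return np.linalg.solve(H.T @ W @ H, H.T @ W @ z)

    if __name__ == "__main__":
        H = np.array([[1.0,  0.0],
                      [0.0,  1.0],
                      [1.0, -1.0]])          # measurement model (hypothetical)
        z = np.array([1.02, 0.97, 0.06])     # PMU readings (hypothetical)
        w = np.array([10.0, 10.0, 5.0])      # confidence in each measurement
        print(linear_state_estimate(H, z, w))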

Our current push is to add more analysis flexibility and power, in the form of simple ways of dynamically extracting data from the power system archives we create (for example, time-varying matrices or tensors that contain data sampled directly from the input we captured and stored into FFFS), and then allowing our users to define computations over these little data objects.
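
As a purely illustrative example of what one of these little data objects might look like: sample a set of archived PMU streams at a fixed cadence to form a time-by-sensor matrix, then let the user run an analysis over it.  The archive interface and the data below are invented stand-ins; in GridCloud the samples would come out of FFFS.

    # Sketch of extracting a time-by-sensor matrix from an archive and running a
    # user-defined computation over it.  read_value and the fake data are invented.
    import numpy as np

    def extract_matrix(read_value, pmu_ids, t_start_ms, t_end_ms, step_ms):
        """Rows are sample times, columns are PMUs."""
        times = range(t_start_ms, t_end_ms, step_ms)
        return np.array([[read_value(pmu, t) for pmu in pmu_ids] for t in times])

    def largest_swing(matrix):
        """A user-defined analysis: the biggest excursion any sensor saw in the window."""
        return float((matrix.max(axis=0) - matrix.min(axis=0)).max())

    if __name__ == "__main__":
        # Stand-in for the temporal archive: a deterministic fake signal per PMU.
        fake = lambda pmu, t_ms: np.sin(t_ms / 500.0) * (1.0 if pmu == "PMU-1" else 0.2)
        m = extract_matrix(fake, ["PMU-1", "PMU-2"], 0, 5000, 100)
        print(m.shape, largest_swing(m))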

We have a dialog underway with Internet2 to figure out ways to run on their networking technology even if the entire Internet itself were disabled (they have a special arrangement that lets them run over isolated network links and use clusters of computers on military bases, or in places like hospitals, as their operating nodes).  So we could get a minimal computing capability running this way, firewall it against new attacks by the bad guys, and bootstrap GridCloud inside, all with very strong security.

The idea is that the morning after, rather than waking up blind, a GridCloud operator would have an option: to use GridCloud itself as a backup SCADA solution.  The system experts could mine the data to understand what the bad guys just did, to inventory the damage, and to identify healthy parts of the system.  Maybe the exploit against the nuclear reactor left a recognizable signature in the power grid itself: if so, GridCloud would have seen it prior to the crash, and archived it.  Moreover, since terrorists really do engage in repeated dry runs, maybe they tried out the same idea in the past: we can mine the archive to see when and how they did it, and perhaps even to figure out who within the ISO staff might have helped them.  Plus, we can then look for evidence of similar dry runs at other nuclear plants -- finding them would validate keeping those plants offline, while not seeing signs of trouble could reassure the owners enough to restart the ones that haven't been compromised.
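
One could imagine the dry-run search as a simple signature scan over the archived telemetry.  Purely as an illustration (the signature, the threshold, and the data here are all invented), something along these lines:

    # Sketch of a forensic scan: slide over archived telemetry looking for windows
    # that resemble a known disturbance signature, to spot earlier "dry runs".
    import numpy as np

    def find_dry_runs(series, signature, threshold=0.95):
        """Return start indices where the archive correlates strongly with the signature."""
        n = len(signature)
        hits = []
        for i in range(len(series) - n + 1):
            window = series[i:i + n]
            if np.std(window) == 0:
                continue                 # skip flat stretches; correlation is undefined there
            if np.corrcoef(window, signature)[0, 1] >= threshold:
                hits.append(i)
        return hits

    if __name__ == "__main__":
        sig = np.array([0.0, -0.3, -0.6, -0.3, 0.0])    # shape of the observed disturbance
        history = np.concatenate([np.zeros(20), 0.5 * sig, np.zeros(30), sig, np.zeros(10)])
        print(find_dry_runs(history, sig))              # finds the scaled-down dry run and the real event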

All in all, this kind of active response would give us a day-after capability that might get power back up and running far more quickly than without such a tool: maybe hours instead of days, or days instead of weeks.  We would reduce the risk of the whole event escalating into a war, because politicians with more knowledge of the facts are less likely to act reflexively on the basis of incomplete data or theories, and are far less likely to be tricked into blaming the wrong parties.

Could our backup solution be compromised, too?  Sure, but my belief is that each separate exploit the attacker would need to carry out successfully diminishes the chance of overall success.  Tom Cruise can zipline into one control room, use his magic knock-out gas to incapacitate a few guards, and maybe plant one of his team members in the group that services the control room computers too.  But could he mount an exploit 20x more difficult, simply by virtue of needing to compromise more and more distinct systems, all watching one another, all sensing divergences in their views?  I doubt it.

So while GridCloud may sound like just one more layer of monitoring, having one more layer may be exactly what the doctor ordered.
