Thursday 13 September 2018

Will HPC survive the cloud?

I just got back from an HPC workshop, where much of the discussion focused on the impact of cloud computing on HPC.  Here are a few of the main takeaways.
  • First, to put this up front: HPC is alive and well.  A whole slew of amazing new computers are about to be powered up, operating at speeds that just defy human understanding.  So HPC isn't about to collapse and die tomorrow.  (Ten years out, though, is a more complex question.)
  • Some of the really big financial drivers for HPC are things that genuinely need massive compute infrastructure: tasks like weather prediction, scientific computing from experiments like the LIGO gravitational-wave observatory, and modelling the air flow around a supersonic jet.
  • But more and more HPC tasks have an embarrassingly parallel structure: they really decompose into huge numbers of subtasks that don't need massive computers to perform (see the sketch just after this list).  One speaker estimated that 90 to 95% of the workload on today's biggest computers consists of vast numbers of smaller jobs that run as a batch, but could easily be handled by smaller machines with the right hardware.
  • And several speakers put up pictures of big cloud-computing data centers and pointed out that no matter how exciting those new HPC systems will be, even a small cloud data center has vastly more compute power in it, and vastly more storage capacity.
  • On top of this, we have the runaway success of Microsoft's Azure HPC, which has become a genuinely hot cloud platform -- judging from industry articles I've followed over the past few years, demand far exceeds what anyone expected.  Azure HPC offers smallish clusters that might have, say, 48 machines, but those machines run the same "bare metal" platforms you see on massive HPC supercomputers.  And nothing stops Azure HPC from ramping up and offering larger and larger configurations.  Rather than run MPI over RoCE, Microsoft simply puts a second network infrastructure on its Azure HPC clusters, using InfiniBand for MPI and treating the standard Ethernet as a control network for general TCP/IP uses.
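
To make the "embarrassingly parallel" idea concrete, here is a minimal Python sketch of the pattern.  It is illustrative only: simulate_one is a hypothetical stand-in for whatever per-task kernel a real workload would run.  The point is simply that the subtasks never talk to each other, so they run just as happily on lots of small cloud VMs as on one giant machine.

```python
# Embarrassingly parallel batch: independent subtasks, no messages between them.
from concurrent.futures import ProcessPoolExecutor

def simulate_one(params: int) -> int:
    """Placeholder for one independent subtask (e.g., a single parameter sweep)."""
    return sum(i * params for i in range(1_000_000)) % 997

if __name__ == "__main__":
    batch = range(100)                    # 100 independent jobs
    with ProcessPoolExecutor() as pool:   # could just as well be 100 small VMs
        results = list(pool.map(simulate_one, batch))
    print(len(results), "subtasks finished with no inter-task communication")
```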

So this is the looming threat to the HPC community: not so much that HPC might suddenly lose steam, but rather that some non-trivial percentage of HPC jobs could migrate toward platforms like Azure HPC.  And in fact one speaker at the workshop, the head of computing for a large research university, told us about a consortium being formed to promote just that transition.  He explained that while really big HPC still needs the big data centers, like the U. Texas XSEDE systems, most campus needs could be adequately served with smaller resources.  This makes it appealing for universities to rent rather than own, and by forming consortia they gain the bargaining power to strike financially compelling deals with big cloud HPC operators like Microsoft (and not just Microsoft -- he pointed out that, as a buyer shopping around, he was getting bids from quite a few cloud providers).

The issue this raises is that it redirects money that would in the past have flowed to the HPC data centers toward those big providers.  Imagine a world in which, say, five years from now, 30% of today's HPC has moved to cloud solutions.  The loss of that income base could make it very hard for the big data centers to continue to invest and upgrade.  Meanwhile, all that cash flowing to the cloud operators would give them an incentive to explore ever more ambitious cloud-hosted HPC products, trying to find the sweet spot that maximizes income without overstretching them.

The second issue I'm seeing relates to my new favorite topic: the intelligent, reactive cloud edge.  Recall from my past few blog postings that I've been fascinated by the evolution of the first tier of the cloud: machines inside the data center, but on the front line, running services that directly handle incoming data from IoT devices, real-time uses like smart cars or smart homes, and other time-critical, highly demanding applications.  Part of my interest is that I'm really not fond of just working on web servers, and these intelligent IoT applications need the mix of fault-tolerance and consistency that my group specializes in: they seem like the right home for our Derecho technology and the file system that runs over it, Freeze Frame.

But this has an HPC ramification too: if companies like Microsoft want Azure HPC to be a player in their cloud applications, they will invest to strengthen the options for using HPC solutions as part of real-time edge applications.  We'll see a growing range of HPC platforms that tie deeply into the Azure IoT Edge, for example, and HPC could start to perform demanding tasks under real-time pressure.

Right now, that integration isn't at all mature -- HPC systems are casual about endlessly slow startup (I did one experiment with MPI and was shocked to realize that multi-minute delays between when a job "starts" and when the full configuration is actually up and ready to run my application are totally common).  We could talk about why this is the case: the systems do silly things like pulling container images onto the nodes one by one as they launch, and sometimes even pull in DLLs one by one as needed, so the causes are totally mundane.  Derecho (or even its RDMC component) could be "life transforming" for this kind of thing!  But the real point is that it can be fixed.
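
If you want to see this for yourself, here is a minimal sketch of the idea behind that measurement (not my exact test harness, and it assumes mpi4py is installed): each rank notes the wall-clock time the moment it launches, and rank 0 reports how long it took before every rank reached a common barrier.

```python
# Measure the gap between process launch and the whole MPI job being ready.
import time
t0 = time.time()                 # wall clock the moment this rank starts
from mpi4py import MPI           # MPI_Init runs during this import
comm = MPI.COMM_WORLD
comm.Barrier()                   # returns only once every rank has arrived
t1 = time.time()
if comm.Get_rank() == 0:
    print(f"{comm.Get_size()} ranks ready after {t1 - t0:.1f} seconds")
```

Run it with something like mpirun -n 48 python startup_probe.py.  On a well-behaved cluster the reported time should be tiny; multi-minute values are the smell of those sequential image and DLL pulls.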

So imagine that over a five-year period, the Azure edge, and similar systems from Amazon and other providers, start to do a really great job of integrating HPC into the cloud.  The rich and extensive tool set the HPC community has developed suddenly becomes available to cloud application creators, for use in real-time situations, and it becomes easy to capture data and "farm it out" to HPC with no significant delay at all (I mean milliseconds, whereas today it might be minutes...).  What an exciting thing this could yield!

For example, in the electric power grid settings I've worked on, one could make micro-predictions of wind or sunshine patterns and use them to anticipate the power output of wind farms or solar farms.  You could adjust the wind turbines dynamically to maximize their productivity.  Someday, with enough knowledge of the communities connected to the grid, we could even predict the exact power output of city-scale rooftop solar deployments.  That one use case alone could be transformative!
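
To give a flavor of the arithmetic, here is a back-of-the-envelope sketch of how a short-horizon wind forecast maps to turbine output, using the standard power-curve relation P = ½ρACpv³.  Every constant below (rotor area, power coefficient, rated output, cut-in/cut-out speeds) is an illustrative assumption, not data from any real farm.

```python
# Idealized wind-turbine power curve: P = 0.5 * rho * A * Cp * v^3, capped at rated power.
RHO = 1.225          # air density, kg/m^3
AREA = 7854.0        # swept rotor area for a ~100 m diameter rotor, m^2
CP = 0.40            # assumed power coefficient (the Betz limit is ~0.59)
RATED_KW = 3000.0    # assumed rated output per turbine, kW

def turbine_kw(wind_mps: float) -> float:
    """Ideal power-curve output of one turbine at a given wind speed."""
    if wind_mps < 3.0 or wind_mps > 25.0:    # below cut-in / above cut-out
        return 0.0
    kw = 0.5 * RHO * AREA * CP * wind_mps ** 3 / 1000.0
    return min(kw, RATED_KW)                 # cap at rated power

# Feed a micro-forecast of wind speeds through the curve:
forecast = [5.2, 7.8, 11.4, 13.0]            # hypothetical m/s predictions
print([round(turbine_kw(v)) for v in forecast])
```

The real-time HPC part of the story would be producing that micro-forecast fast enough for the numbers to still matter when they arrive.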

Then you can imagine all kinds of image processing and data fusion tasks that are feasible today in offline settings, but way out of reach for real-time applications.  Suddenly they could become HPC subtasks in this hybrid cloud: a fast, reactive edge with HPC clusters available as a module to the edge applications.  HPC could become a major player in the cloud ecosystem.

This is the bigger threat to the traditional HPC community, as I see it: a threat of explosive innovation that could win simply by being more exciting, faster growing, more lucrative, and more massive in scale.  It wouldn't take long before HPC on the cloud was the hot setting for young researchers to tackle, and HPC on traditional supercomputers would begin to starve, simply because it would look more and more like a legacy world.

At the workshop, one speaker actually made the case that HPC supercomputers are in a "race" to own time-critical (real-time) HPC compute tasks.  But many speakers, myself included, argued that no, the race is already over -- the cloud won before the HPC community even knew the opportunity existed.  Today, the real race is to be a player in this new thing: the intelligent, IoT-based edge.  And HPC as a component of that story clearly has a very bright future.

1 comment:

  1. Ken, thanks for an informative blog... While I don't follow the HPC market/technology, edge computing is definitely an exciting area, one that should encourage the development of new computational models, or perhaps the retooling of current methods, to deliver predictably low-latency results. There are interesting developments in mobile edge computing standards that look to specify platform requirements, including transport and link layer parameters.
