Saturday, 19 September 2020

Who will host the future edge cloud?

As you know if you've followed my blog, I'm particularly excited about the potential of the IoT edge.  The term is a bit ambiguous, so I'll explain what I mean by it first, and then speculate a little about who will end up "hosting" this important emerging real estate.  The host of any large-scale cloud stands to earn quite a bit of revenue (witness AWS and Azure), so the question has big monetary implications.

With the explosive growth of machine learning in the standard cloud, it can be easy to overlook a key point: very little of the cloud is actually capable of reacting in real time to evolving data.  The core issue is that ML emerged from the back-end big-data analytic frameworks: massive storage infrastructures coupled to scalable computing infrastructures (think of Spark side-by-side with HDFS or Ceph).  Such a structure is designed for sharding: we first upload our data in a spread-out way (either using randomized placement, or with some form of control over that step), then initiate computations that run in a decentralized manner.  The resulting performance and scalability can be astonishing.

But if we work in this manner, we end up with batched jobs that recompute the models or perform other tasks after some delay: perhaps minutes, and in many settings days or even weeks.  Thus when today's cloud edge performs a task using a machine-learned model, that model will probably be somewhat stale.

By cloud edge, I have in mind the first tiers of a typical data center.  In systems like AWS, Azure, Facebook, Google (and the list, of course, is endless...), a data center is typically split between that back-end infrastructure and a front-end system.  This front-end portion is deployed as a series of layers: the first tier is a highly elastic pool of servers that handle the majority of incoming requests, such as building web pages or processing video uploads from webcams.  The idea is that a first-tier server should be fast, lightweight and specialized, and we work with a pay-for-what-you-use model in which these servers are launched on demand, run while they are actively doing something, then terminate.  Thus any given node in the first-tier pool can be repurposed quite often.

The increasingly popular "function" model is an example of a first-tier capability. In Google's cloud or AWS, the first tier can launch user-provided code: computing on demand, as needed.  (Azure supports a function service too, but one that is strangely slow: whereas a Google or AWS function can start in as little as 50-100ms, Azure functions often need many seconds to launch... and hence are probably not something that will be useful in the future edge cloud I have in mind, unless Microsoft reimplements that tier to offer competitive speed.)
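To make the function model concrete, here is a minimal sketch in the AWS Lambda Python handler style. The event fields (`user_id`, `page`) are invented for illustration; a real function would be wired to an API gateway or a storage trigger.

```python
# Minimal first-tier "function" sketch, AWS Lambda Python handler style.
# The event fields here are hypothetical, just to show the shape.
import json

def handler(event, context=None):
    """Do a tiny amount of work and return immediately -- first-tier
    work should finish in tens of milliseconds, not seconds."""
    user = event.get("user_id", "anonymous")
    page = event.get("page", "home")
    body = {"greeting": f"Hello, {user}", "page": page}
    return {"statusCode": 200, "body": json.dumps(body)}
```

The key property is that the process does its small job and returns; the platform is free to freeze or discard it the moment it finishes, which is what makes the tier so elastic.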

The nodes supporting the function tier generally lack very high-end GPUs, simply because a GPU needs a lot of care and feeding: we need to pre-provision it with the kernels (code libraries) we plan to use, and perhaps preload models and hyperparameters as well.  My point about Azure being slow applies: if one wanted to leverage these kinds of resources, launching a function might take seconds... far too long for a first tier that nominally must be elastic at a scale of fractions of a second, running totally different sets of functions on those same nodes.  This all adds up to a simple rule of thumb: a service running in this tier isn't the obvious candidate for heavyweight ML tasks.

We solve this problem by pairing the first tier with a second tier of services that provide supporting functionality.  The second-tier servers are managed by something called the "App Service", and again are quite dynamic: a collection (or more often, a graph) of service pools, each pool providing functionality specialized to a particular task, and interconnected because these pools might need help from one another too.  Amazon once remarked that a typical web page actually reflects contributions from as many as 100 distinct servers: we have a web-page builder in the first tier, and those 100 servers are actually members of 100 service pools, each pool containing an elastically varied set of servers.  Thus, for Amazon's own uses alone, the second tier would be crowded with hundreds of service pools, to say nothing of second-tier uses associated with AWS customers.
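As a rough sketch of what "elastically varied" means, here is toy pool-sizing logic in Python. The thresholds and the target requests-per-server figure are invented for illustration; they are not any vendor's actual App Service policy.

```python
# Toy elastic pool sizing, in the spirit of the "App Service" above.
# target_per_server and the replica bounds are illustrative numbers only.
import math

def desired_replicas(requests_per_sec: float,
                     target_per_server: float = 50.0,
                     min_replicas: int = 2,
                     max_replicas: int = 200) -> int:
    """Size the pool so each server handles ~target_per_server req/s,
    keeping a small floor so the pool never fully drains."""
    needed = math.ceil(requests_per_sec / target_per_server)
    return max(min_replicas, min(max_replicas, needed))
```

A controller running this rule every few seconds would spawn instances as shoppers pour in and let the pool shrink back toward the floor when load drops.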

What might a second-tier service be doing?  Well, think about the "also recommended" product list on your Amazon page.  Some ML model was used to realize that if you are looking at The Da Vinci Code (which you bought years ago), you might also enjoy reading The Lost Symbol (another of Dan Brown's novels), or The Templar Codex (another "chase through Europe seeking a medieval artifact"), or perhaps An Instance of the Fingerpost (an obscure but excellent historical novel...).  Amazon has a whole service to compute recommendations, and it in turn uses other services to pull in sub-recommendations in various genres, all specialized for "people like you" based on your history of browsing and purchases.
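Here is a caricature of the computation such a recommendation service performs: rank items by how often they were bought together with the one being viewed. Amazon's real models are vastly richer, and the item names below are just placeholders, but it shows the shape of the second-tier work.

```python
# Toy "also recommended" logic based on co-purchase counts.
# A caricature of a real recommender, for illustration only.
from collections import Counter
from itertools import combinations

def build_cooccurrence(orders):
    """Count, for every pair of items, how many orders contained both."""
    co = Counter()
    for order in orders:
        for a, b in combinations(set(order), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1   # store both directions for easy lookup
    return co

def also_recommended(item, co, k=3):
    """Top-k items most often bought alongside `item`."""
    scores = Counter({other: n for (i, other), n in co.items() if i == item})
    return [other for other, _ in scores.most_common(k)]
```

In production this lookup would itself fan out to genre-specialized sub-services, which is exactly how one page view ends up touching dozens of pools.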

So just this one example might give us several services, each one operating as a pool because the load on it would vary so greatly as Amazon shoppers poke around.  We would run such a pool with a small number of service instances when load drops, but the App Service can spawn more on demand.  

Unlike the first tier, second-tier service pools deal mostly with internal traffic (the first tier is triggered by requests coming over the Internet, from browsers and other web-enabled apps).  And whereas the first tier turns on a dime, spawning or terminating instances in fractions of a second, the second tier can be much less quick to adapt: think in terms of processes that run on multi-purpose nodes but, once launched, tend to own the hardware for minutes or hours.  So it makes sense to put GPUs and FPGAs on these second-tier instances.

The second tier is closely paired to a third tier of vendor-supplied services, which often can be customized in various ways.  So in this third tier we have storage solutions (file systems, DHTs of various kinds), databases, AI engines, etc.  Some modern developers work purely with the pre-supplied options.  On the other hand, if you are tackling an unusual situation, like automating a dairy farm, the odds are that you won't find the mix of services you really need.  You'll pretty much be forced to either work with slow back-end systems (those run in batched style, once every few hours or days), or you might need to build a new second-tier service of your own.

With this context, let's return to my real topic.

Cloud operators like AWS and Azure make a great deal of their revenue in these second and third tiers of the cloud.  Yet in adopting this model, a hidden dynamic has emerged that favors the big warehouse style of clouds: the vast majority of today's web is centered around web pages and "high functionality" mobile devices.  Any rapidly-changing data lives and is used right on your mobile device; for example, this is where Google Maps runs.  Thus the data center isn't under enormous time pressure, even in the first and second tiers.  We tend to focus more on pipelines: we want a steady stream of web pages constructed within 100ms per page.  That may seem fast, but consider the speed at which a self-driving car needs to react -- on a high-speed road, 100ms is an eternity.   Perhaps this observation explains why Azure has been competitive despite its strangely slow function tier: if people currently write functions that often run for minutes, a few seconds to launch them wouldn't be noticeable.
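The self-driving-car point is easy to quantify. A back-of-the-envelope calculation, using illustrative round-number speeds:

```python
# Back-of-the-envelope: how far does a car travel during a network delay?
# Speeds here are illustrative round numbers.

def distance_during_delay(speed_kmh: float, delay_ms: float) -> float:
    """Metres travelled at speed_kmh during a delay of delay_ms milliseconds."""
    speed_ms = speed_kmh * 1000.0 / 3600.0   # convert km/h to m/s
    return speed_ms * (delay_ms / 1000.0)
```

At 120 km/h, a 100ms delay means the car has already moved more than three metres before any answer arrives -- which is why a page-building latency budget is the wrong yardstick for IoT.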

But we are seeing more and more uses that need near-instantaneous responses.  Today's cloud edge is really just the outer surface of the cloud, and not responsive enough.  As we deploy IoT devices into the environment, we are starting to see a need for something else: an edge cloud (did you notice that the word order flipped?).  By this I mean a way to situate my computing closer to the devices.  The cloud edge might be too far away, and the latency of talking to it could cripple my IoT solution.

The need, then, is for a new form of data center: an edge cloud data center.  What might it look like, and where would it run?

My view is that it would look like the Google Cloud Functions or AWS Lambda tier (and I bet that by the time this plays out, the Azure function tier will have been fixed: the Azure cloud is clearly equal to any rival in every arena of competitive importance right now).  Basically, the edge cloud will host first-, second- and third-tier software, probably using the exact same APIs we see today.

It would run on something like Microsoft's Azure IoT Edge: a small compute cluster (maybe even just a little box running a NUMA CPU chip), situated close to the sensors, with a WAN network link reaching back into the cloud for any tasks that can't be performed on this platform.

So who will run these boxes?  If we think of them in the aggregate, we are suddenly talking about a true edge cloud that could host third-party apps, charge for cycles and other resources consumed, and hence generate revenue for the operator.  These apps could then support smart homes, smart grid, smart roadways to guide smart cars... whatever makes sense, and represents a potential market.  Yet there is a chicken-and-egg puzzle here: the scale at which one can really call something a "cloud" is certainly national and perhaps global.  So any operator able to play in this market would need to deploy and manage an infrastructure at planetary scale: otherwise, we end up with just a handful of boxes of type A in the US, others of type B in China, each owned and operated by a distinct player, and each with its own customized standards.  I find it hard to see how that kind of heterogeneity could give us a true edge cloud.

I've been involved with 5G, and for a while I wondered whether telephony operators might be the answer.  They already need to deploy 5G point-of-presence systems, which often look quite similar to what I've described.  Indeed, Azure's IoT Edge seems to be an early winner in this market.  But the resulting deployments are still operated by literally thousands of players, and perhaps I'm underestimating by a factor of 10.  The number of likely 5G operators is simply too large for more than a few to tackle this opportunity (I can see AT&T doing it, for example, or Comcast -- but not some random little 5G operator that sells services through phone SIM cards).

Another possibility would be that one of the big cloud vendors goes all in on the edge cloud and deploys these boxes through various deals: perhaps the 5G players agree to rent the needed physical space, power and other resources, but the box itself belongs to Alphabet, Amazon, Microsoft or AT&T.  This would be a story similar to the land-grab that occurred when Akamai emerged as dominant in the content-serving market back in 1998.  On the other hand, it is notable that as Akamai grew, their deployment consolidated: around 2000, they were out to place an Akamai compute rack in every single ISP switching office in the world.  By 2010, Akamai was operating a smaller number of data centers, mostly ones they themselves owned, and yet obtaining better performance.  In fact, although I can't find any public studies of the question, I wouldn't be surprised if Akamai's global datacenter deployment today, in 2020, actually resembles the AWS, Azure and Google ones.  The question is of interest because first, it may suggest that Akamai has a bit of an incumbency advantage.  And second, any structure performant enough for content serving would probably work for IoT.

But perhaps we will soon know the answer.  The industry clearly is waking up to the potential of the IoT Edge, and the edge cloud has such obvious revenue appeal that only a very sleepy company could possibly overlook it.  If you look at the Google ngram graphs for terms like edge cloud, the trend is obvious: this is an idea whose time has arrived.