A Few Thoughts on Distributed Computing: Inventing the intelligent, active, edge

At the recent Microsoft faculty research summit, I was thrilled (and also somewhat relieved) to see that the company decided to run a whole workshop emphasizing systems (including networking), and the ways that systems can support machine learning. It seemed clear that Microsoft has decided to become a major player in what they call "intelligent edge" computing and is urging all of us to jump on board.

These terms may be new, so I thought I might summarize the trend, based on my current understanding of it. A good place to start is with a little mini-tutorial on what cloud computing infrastructures have been doing up to now, because the intelligent active edge really builds on the current architecture (over time, it will be more and more differentiated, but today, the overlap is substantial).

So: Today, we have a cloud dominated by a style of computing that prevailed in the 2000-2010 web server and services period. In a first draft of this essay I wrote a page or so about this topic but it got heavy on jargon and I felt that it was taking too long to get to the point. So I'll get there very quickly:

Many people think of the cloud as a way to rent Linux containers, but the bigger and more exciting trend focuses on elastic platforms that are event driven, connected by various ways to pass objects from layer to layer, and customizable by providing little event handlers: "functions".
Examples of platforms like this include Amazon Lambda, Microsoft Azure Functions, Google Tensor Flow, Spark/DataBricks RDDs.
The connections tend to be via some form of queuing service (Amazon SQS, Apache Kafka, Azure Service Queues, IBM's MQSeries, OpenSplice, etc). Big objects are often just stored into a large file system (S3, Google GFS, Hadoop HDFS, etc).
Everything is sharded from start to finish. Data shows up on HTTP connections (web requests to web services), but programmable edge routers (like Amazon Route 53) extract keys and use standard distributed hashing schemes to vector the requests into "shards", within which they may additionally load-balance.
We cache everything in sight, using DHTs like Amazon Dynamo, Cassandra, Microsoft FaRM.
The long-term storage layers are increasingly smart, like Azure Cosmos-DB. They may do things like deduplication, compression, image segmentation and tagging, creation of secondary objects, etc. Often they are backed by massive long-term storage layers like Azure Pelican.
Then of course we also have standard ways to talk to databases, pre-computed files, other kinds of servers and services, back-end systems that can run MapReduce (Hadoop) or do big-data tasks, etc.
The heavy lifting is hardware accelerated using GPU, TPU, FPGA and similar technologies, and as much as possible, we move data using RDMA and store it into memory-mapped non-volatile memory units (SSD or the newer 3D-XPoint NVMs like Optane).

Whew! I hope you are still with me...

The nice thing about this complex but rather "standard" structure is that the developer simply writes a few event handlers for new web requests and most of the rest is automated by the AWS Lambda, Google Tensor Flow or Azure Functions environment. Learning to work in this model is a bit of a challenge because there is a near total lack of textbooks (my friend Kishore Kumar is thinking of writing one), and because the technologies are still evolving at an insane pace.

The big players have done what they can to make these things a little easier to use. One common approach is to publish a whole suite of case-study "demos" with nice little pictures showing the approach, like you would find here for Azure, or here for AWS. In the best cut-and-paste fashion, the developer just selects a design template similar to what he or she has in mind, downloads the prebuilt demo, then customizes it to solve their own special problem by replacing the little functions with new event handlers of his or her own design, coded in any language that feels right (Python is popular, but you typically get a choice of as many as 40 popular options including JavaScript), and that will run in a little containerized VM with very fast startup -- often 1ms or less to launch for a new request.

This is the opposite of what we teach in our undergraduate classes, but for the modern cloud is probably the only feasible way to master the enormous complexity of the infrastructures.

So... with this out of the way, what's the excitement about the intelligent edge (aka active edge, reactive edge, IoT edge...)?

The key insight to start with is that the standard cloud isn't a great fit for the emerging world of live machine-learning solutions like support for self-driving cars, smart homes and power grids and farms, you name it. First, if you own a huge number of IoT devices, it can be an enormous headache to register them and set them up (provisioning), securely monitor them, capture data privately (and repel attacks, which can happen at many layers). Next, there is an intense real-time puzzle here: to control self-driving cars or drones or power grids, we need millisecond reaction times plus accurate, consistent data. The existing cloud is more focused on end-to-end web page stuff where consistency can be weak and hence the fast reactions can use stale data. So CAP is out the window here. Next, we run into issues of how to program all of this. And if you solve all of this in the cloud, you run into the question of what to do if your farmer happens to have poor connectivity back to the cloud.

So the exciting story about Azure IoT Edge was that Microsoft seems to have tackled all of this, and has a really coherent end-to-end vision that touches on every element of the puzzle. This would be endless if I shared everything I learned, but I'll summarize a few big points:

They have a clean security and provisioning solution, and a concept of IoT life cycle with monitoring, visualization of "broken stuff", ways to push updates, etc.
They work with ISPs to arrange for proper connectivity and adequate bandwidth, so that applications can safely assume that the first-hop networking won't be a huge barrier to success.
They have a concept of an Azure IoT "hub" that can run as a kind of point-of-presence. Apparently it will eventually even be able to have a disconnected mode. Many companies operate huge numbers of PoP clusters and a smaller number of cloud data centers, so here Azure IoT Edge is taking that to the next level and helping customers set those up wherever they like. You could imagine a van driving to a farm somewhere with a small rack of computers in it and running Azure IoT "hub" in the van, with a connection while the van is back at the home office, but temporarily disconnected for the couple of hours the company is working at that particular farm.
Then the outer level of Azure itself would also run the Azure IoT edge framework (same APIs) but now with proper connectivity to the full cloud. And the framework has a strong emphasis o topics dear to me like real-time, ways of offering consistency, replication for parallelism or fault-tolerance, etc. I'm looking at porting Derecho into this setting so that we can be part of this story, as a 3rd party (open source!) add-on. They have a plan to offer a "marketplace" for such solutions.

As a big believer in moving machine learning to the edge, this is the kind of enabler I've been hoping someone would build - right now, we've lacked anything even close, although people are cobbling solutions together on top of Amazon AWS Lambda (which perhaps actually is close, although to me has less of a good story around the IoT devices themselves), or Google Tensor Flow (which is more of a back-end story, but has some of the same features). As much as I love Spark/Databricks RDDs, I can't see how that could be an IoT story anytime soon.

So I plan to dive deep on this stuff, and will share what I learn in the coming year or so! Stay tuned...

A Few Thoughts on Distributed Computing

Wednesday, 22 August 2018

Inventing the intelligent, active, edge

No comments:

Post a Comment