A Few Thoughts on Distributed Computing: March 2019

I've been fascinated by a puzzle that will probably play out over several years. It involves a deep transformation of the cloud computing marketplace, centered on a choice. In one case, IoT infrastructures will be built the way we currently build web services that do things like intelligent recommendations or ad placements. In the other, edge IoT will require a "new" way of developing solutions that centers on creating new and specialized services... ones that embody real-time logic for making decisions or even learning in real-time.

I'm going to make a case for bespoke, hand-built services: the second scenario. But if I’m right, there is hard work to be done and whoever starts first will gain a major advantage.

So to set the stage, let me outline the way IoT applications work today in the cloud. We have devices deployed in some enterprise setting, perhaps a factory, or an apartment complex, or an office building. These might be quite dumb, but they are still network enabled: they could be things like temperature and humidity sensors, motion detectors, microphones or cameras, etc. Because many are dumb, even the smart ones (like still and video cameras with built-in autofocus, deblurring, depth perception) are treated in a rather rigid manner: the basic model is of a device with a limited API that can be configured, and perhaps can be patched if the firmware has issues, but that otherwise just generates simple events with metadata describing what happened.

In a posting a few weeks ago, I noted that unmanaged IoT deployments are terrifying for system administrators, so the world is rapidly shifting towards migrating IoT device management into systems like Azure's infrastructure for Office 365. Basically, if my company already uses Office for other workplace tasks, it makes sense to also manage these useful (but potentially dangerous) devices through the same system.

Azure's IoT Hub handles that managerial role: secure connectivity to the sensors, patches guaranteed to be pushed as soon as feasible... and in the limit, maybe nothing else. But why stop there? My point a few weeks back was simply that even just managing enterprise IoT will leave Azure in a position of managing immense numbers of devices -- and hence, in a position to leverage the devices by bringing new value to the table.

Next observation: this will be an "app" market, not a "platform" market. In this blog I don't often draw on marketing studies and the like, but for this particular case, it makes sense to point to market studies that explain my thinking (look at Lecture 28 in my CS5412 cloud computing class to see charts from the studies I drew on).

Cloud computing, perhaps far more than most areas of systems, is shaped by the way cloud customers actually want to use the infrastructure. In contrast, an area like databases or big data is about how people want to use the data, which shapes access patterns. But those users aren't trying to explicitly route their data through FPGA devices that will transform it in some way, or doing computations that can't keep up unless they run in GPU clusters. So, because my kind of cloud customers migrate to the clouds that make it easier to build their applications, they will favor the cloud that has the best support for IoT apps.

A platform story basically offers minimal functionality, like bare metal running Linux, and leaves the developers to do the rest. They are welcome to connect to services but not required to do so. Sometimes this is called the hybrid cloud.

Now, what's an app? As I'm using the term, you would want to visualize the iPhone or Android app store: small programs that share many common infrastructure components (the GUI framework, the storage framework, the motion sensor and touch sensors, etc), and that then connect to their bigger cloud-hosted servers over a Web Services layer that tends to match nicely with the old Apache-dominated cloud for doing highly concurrent construction of web pages. So this is the intuition.

For IoT, though, an app model wouldn't work in the same way -- in fact, it can't work in the same way. First, IoT devices that want help from intelligent machine-learning will often need support from something that learns in real-time. In contrast, today's web architecture is all about learning yesterday and then serving up read-only data at ultra-fast rates from scalable caching layers that could easily be stale if the data was actually changing rapidly. So suddenly we will need to do machine learning, decision making and classification, and a host of other performance-intensive tasks at the edge, under time pressure, and with data changing quite rapidly. Just think of a service that guides a drone surveying a farming area that wants to optimize its search strategy to "sail on the wind" and you'll be thinking about the right issues.

Will the market want platforms, or apps? I think the market data strongly suggests that apps are winning. Their relatively turnkey development advantages outweigh the limitations of programming in a somewhat constrained way. If you do look at the slides from my course, you can see how this trend is playing out. The big money is in apps.

And now we get to my real puzzle. If I'm going to be creating intelligent infrastructure for these rather limited IoT devices (limited by power, and by compute cycles, and by bandwidth), where should the intelligence live? Not on the devices: we just bolted them down to a point where they probably wouldn't have the capacity. Anyhow, they lack the big picture: if 10 drones are flying around, the cloud can build a wind map for the whole farm. But any single drone wouldn't have enough context to create that situational picture, or to optimize the flight plan properly. There is even a famous theoretical result on the "price of anarchy", showing that you don't get the global optimum if you have a lot of autonomous agents making individually optimal choices. No, you want the intelligence to reside in the cloud. But where?
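To make the "big picture" point concrete, here is a minimal sketch in Python. The reading format, drone names, grid cells and wind speeds are all invented for illustration; the only point is that a map built from every drone's readings covers more of the farm than any single drone's view.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reading format: (drone_id, grid_cell, wind_speed_m_s).
# Drone names, cells, and values are made up for illustration.
readings = [
    ("drone-1", (0, 0), 3.0),
    ("drone-1", (0, 1), 4.0),
    ("drone-2", (0, 1), 6.0),
    ("drone-2", (1, 1), 2.0),
    ("drone-3", (1, 0), 5.0),
]

def build_wind_map(readings, only_drone=None):
    """Average the wind speed per grid cell, optionally from one drone's data only."""
    per_cell = defaultdict(list)
    for drone_id, cell, speed in readings:
        if only_drone is None or drone_id == only_drone:
            per_cell[cell].append(speed)
    return {cell: mean(speeds) for cell, speeds in per_cell.items()}

farm_map = build_wind_map(readings)              # the cloud's view: all drones
local_map = build_wind_map(readings, "drone-1")  # what drone-1 alone can see

print(len(farm_map), "cells known to the cloud")
print(len(local_map), "cells known to drone-1")
```

With this toy data the cloud knows four cells and drone-1 knows two. A real wind model would of course be far richer than a per-cell average.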

Today, machine intelligence lives at the back, but the delays are too large. We can’t control today’s drones with yesterday’s wind patterns. We need intelligence right at the edge!

Azure and AWS both access their IoT devices through a function layer ("lambdas" in the case of AWS). This is an elastic service that hosts containers, launching as many instances of your program as needed on the basis of events. Functions of this kind are genuine programs and can do anything they need to do, but they run in what is called a "stateless" mode, meaning that they flash into existence when an event arrives (or are even warm-started ahead of time, so that when the event arrives, the delay is minimal) and then handle the event. They can't save any permanent data locally, even though the container does have a small file system that works perfectly well: as soon as the event handling ends, the container will garbage collect itself and that local file system will evaporate.
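Here is a small simulation of that stateless behavior, written in Python. It does not use the Azure or AWS APIs: `handle_event`, `scratch` and `external_service` are hypothetical names, with a plain dictionary standing in for the container's local file system and another for a service that outlives the call.

```python
def handle_event(event, durable_store):
    """Simulate one stateless function invocation.

    `scratch` stands in for the container's local file system: it is created
    fresh for every invocation and discarded when the handler returns.
    `durable_store` stands in for an external service that outlives the call.
    """
    scratch = {}                                  # local, vanishes on return
    scratch["seen"] = scratch.get("seen", 0) + 1  # always 1: nothing carried over
    durable_store["seen"] = durable_store.get("seen", 0) + 1
    return {
        "event": event,
        "local_count": scratch["seen"],
        "durable_count": durable_store["seen"],
    }

external_service = {}  # hypothetical stand-in for a stateful back-end service
results = [handle_event(e, external_service) for e in ("motion", "temp", "motion")]
for r in results:
    print(r)
```

The local count never gets past 1, while the count kept in the external store climbs with each event. Anything the function needs to remember has to live somewhere else.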

So, the intelligence and knowledge and learning have to occur in a bank of servers. One scenario, call it the PaaS mode, would be that Amazon and Microsoft pre-build a set of very general purpose AI/ML services, and we code all our solutions by parameterizing those and mapping everything into them. So here you have AI-as-a-service. Seems like a guaranteed $B startup concept! But very honestly, I'm not seeing how it can work. The machine learning you would do to learn wind patterns and direct drones to sail on the wind is just too different from what you need to recognize wheat blight, or to figure out what insect is eating the corn.

The other scenario is the "bespoke" one. My Derecho library could be useful here. With a bespoke service, you take some tools like Derecho and build a little cluster-hosted service of your very own, which you then tell the cloud to host on your behalf. Then your functions or lambdas can talk to your services, so that if an IoT event requires a decision, the path from device to intelligence is just milliseconds. With consistent data replication, we can even eliminate stale data issues: these services would learn as they go (or at least, they could), and then use their most recent models to handle each new stage of decision-making.
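As a rough illustration of "learn as you go, with no stale answers", here is a toy Python sketch. To be clear, this is not the Derecho API: `Replica` and `ReplicatedService` are hypothetical classes running in a single process, and the "model" is just a running mean. The assumption being illustrated is that every update reaches every replica, in the same order, before the update call returns.

```python
class Replica:
    """One member of a hypothetical bespoke service; holds a running mean."""

    def __init__(self):
        self.version = 0     # number of updates applied so far
        self.estimate = 0.0  # current model: mean of all observations

    def apply(self, observation):
        self.version += 1
        # Incremental mean: the model is refreshed on every update.
        self.estimate += (observation - self.estimate) / self.version


class ReplicatedService:
    """Applies each update to every replica, in the same order, before returning."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]

    def update(self, observation):
        for replica in self.replicas:
            replica.apply(observation)

    def query(self, replica_index):
        replica = self.replicas[replica_index]
        return replica.version, replica.estimate


service = ReplicatedService()
for wind_speed in (4.0, 6.0, 8.0):
    service.update(wind_speed)

answers = [service.query(i) for i in range(3)]
print(answers)
```

Whichever replica a function happens to reach, it sees the same version and the same estimate. In a real deployment the replicas would sit on separate machines, and keeping them in step is precisely the hard part that a replication library is meant to handle.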

But without far better tools, it will be quite annoying to create these bespoke services, and this, I think, is the big risk to the current IoT edge opportunity: do Microsoft and Amazon actually understand this need, and will they enlarge the coverage of VSCode or Visual Studio or in Amazon's case, Cloud9, to "automate" as many aspects of service creation as possible, while still leaving flexibility for the machine learning developer to introduce the wide range of customizations that her service might require?

What are these automation opportunities? Some are pretty basic (but that doesn't mean they are easy to do by hand)! To actually launch a service on a cloud, there needs to be a control file created, typically in a JSON format, with various fields taking on the requisite values. Often, these include magically generated 60-hexadecimal-digit keys or other kinds of unintuitive content. When you use these tools to create other kinds of cloud solutions, they automate those steps. By hand, I promise that you’ll spend an afternoon and feel pretty annoyed by the waste of your time. A good hour will be lost on those stupid registry keys alone.
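For a sense of what such automation might look like, here is a short Python sketch that generates a control file. The field names (`serviceName`, `instanceCount`, `accessKey`) and the service name are invented for illustration and do not match any real cloud's schema; only the standard-library calls are real.

```python
import json
import secrets

def make_control_file(service_name, instances):
    """Build a launch descriptor as a JSON string; every field name is illustrative."""
    config = {
        "serviceName": service_name,
        "instanceCount": instances,
        # 30 random bytes rendered as hex gives a 60-hex-digit key.
        "accessKey": secrets.token_hex(30),
    }
    return json.dumps(config, indent=2)

control_file = make_control_file("wind-model", 3)
print(control_file)
```

The point is less the ten lines of code than the fact that a tool, not the developer, should be the one who knows which fields are required and how the keys are produced.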

Interface definitions are a need too. If we want functions and lambdas talking to our new bespoke micro-services ("micro" to underscore that these aren't the big vendor-supplied ones, like CosmosDB), the new micro-service needs to export an interface that the lambda or function can call at runtime. Again, help needed!
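One way to picture the interface question is the Python sketch below. Everything in it is an assumption made for illustration: the `WindService` interface, the JSON request shape and the `dispatch` helper are hypothetical, and a real lambda or function would reach the service over the network rather than through a direct call.

```python
import json
from typing import Protocol


class WindService(Protocol):
    """Hypothetical interface a bespoke micro-service might export."""

    def report(self, cell: str, speed: float) -> None: ...
    def latest(self, cell: str) -> float: ...


class InMemoryWindService:
    """Toy implementation; a real service would run on its own cluster."""

    def __init__(self) -> None:
        self._latest = {}

    def report(self, cell: str, speed: float) -> None:
        self._latest[cell] = speed

    def latest(self, cell: str) -> float:
        return self._latest[cell]


EXPORTED = ("report", "latest")  # only these methods are callable remotely


def dispatch(service: WindService, request_json: str) -> str:
    """Decode a JSON request, call the named method, and encode the reply."""
    request = json.loads(request_json)
    if request["method"] not in EXPORTED:
        return json.dumps({"error": "unknown method"})
    result = getattr(service, request["method"])(**request["args"])
    return json.dumps({"result": result})


svc = InMemoryWindService()
dispatch(svc, json.dumps({"method": "report", "args": {"cell": "A1", "speed": 5.5}}))
reply = dispatch(svc, json.dumps({"method": "latest", "args": {"cell": "A1"}}))
print(reply)
```

Writing the interface down once, and generating both the service-side dispatcher and the caller-side stub from it, is the kind of chore a development tool could take over.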

In fact the list is surprisingly long, even though the items on it are (objectively) trivial. The real point isn’t that these are hard to do, but rather that they are arcane and require looking for the proper documentation, following some sort of magic incantation, figuring out where to install the script or file, testing your edited version of the example they give, etc. Here are a few examples:

Launch service
Authenticate if needed
Register micro-service to accept RPCs
There should be an easy way to create functions able to call the service, using those RPC APIs
We need an efficient upload path for image objects
There will need to be tools for garbage collection (and tools to track space use)
… and tools for managing the collection of configuration parameter files and settings for an entire application
… and lifecycle tools, for pushing patches and configuration changes in a clean way.
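To pick just one item from this list, space tracking and garbage collection, here is a deliberately tiny Python sketch. The `ObjectStore` class, its byte quota and its oldest-first eviction policy are assumptions chosen for illustration, not a description of any existing cloud tool.

```python
from collections import OrderedDict


class ObjectStore:
    """Toy image store with a byte quota; the oldest objects are collected first."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._objects = OrderedDict()  # insertion order doubles as age order

    def put(self, key, blob):
        """Store a blob, then collect garbage; return the keys that were evicted."""
        if key in self._objects:
            self.used_bytes -= len(self._objects.pop(key))
        self._objects[key] = blob
        self.used_bytes += len(blob)
        return self.collect()

    def collect(self):
        """Evict oldest objects until usage fits the quota; return evicted keys."""
        evicted = []
        while self.used_bytes > self.quota_bytes and self._objects:
            key, blob = self._objects.popitem(last=False)
            self.used_bytes -= len(blob)
            evicted.append(key)
        return evicted


store = ObjectStore(quota_bytes=10)
store.put("img-1", b"aaaa")
store.put("img-2", b"bbbb")
evicted = store.put("img-3", b"cccc")
print(evicted, store.used_bytes)
```

Here the third upload pushes usage to 12 bytes against a quota of 10, so the oldest image is dropped. A production tool would need smarter policies, but the bookkeeping it has to expose looks much like this.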

Then there are some more substantial needs:

Code debugging support for issues missed in development and then arising at runtime
Performance monitoring, hotspot visualization and performance optimization (or even, performance debugging) tools
Ways to enable a trusted micro-service to make use of hardware accelerators like RDMA or FPGA even if the end user might not be trusted to safely do so (many accelerators save money and improve performance but are just not suitable for direct access by hordes of developers with limited skill sets. Some could destabilize the data center or crash nodes, and some might have security vulnerabilities).

This makes for a long list, but in my view, a strong development team at Amazon or Microsoft, perhaps allied with a strong research group to tackle the open-ended tasks, could certainly succeed. Success would open the door to mature intelligent edge IoT. Lacking such tools, though, it is hard not to see edge IoT as being pretty immature today: huge promise, but more substance is needed.

My bet? Well, companies like Microsoft need periodic challenges to set in front of their research teams. I remember that when I visited MSR Cambridge back in 2016, everyone was asking what they should be doing as researchers to enable the next steps for the product teams... the capacity is there. And those market slides I mentioned make it clear: The edge is a huge potential market. So I think the pieces are in place, and that we should jump on the IoT edge bandwagon (in some cases, “yet again”). This time, it may really happen!

Wednesday, 13 March 2019

Intelligent IoT Services: Generic, or Bespoke?