Wednesday, 24 July 2019

In theory, asymptotic complexity matters. In practice...

Derecho matches Keidar and Shraer’s lower bounds for dynamically uniform agreement:  No Paxos protocol can  safely deliver messages with fewer "information exchange" steps.  But does this matter?

Derecho targets a variety of potential deployments and use cases.  A common use would be to replicate state within some kind of "sharded" service -- a big pool of servers but broken into smaller replicated subservices that use state machine replication in subsets of perhaps 2, 3 or 5.  A different use case would be for massive replication -- tasks like sharing a VM image, a container, or a machine-learned model over huge numbers of nodes.  In those cases the number of nodes might be large enough for asymptotic protocol complexity bounds to start to matter -- Derecho's optimality could be a winning argument.  But would an infrastructure management service really stream high rates of VM images, containers, and machine-learned models? I suspect that this could arise in future AI Systems... it wouldn't today.

All of which adds up to an interesting question: if theoretical optimality is kind of a "meh" thing, what efficiency bounds really matter for a system like Derecho?  And how close to ideal efficiency can a system like this really come?

To answer this question, let me start by arguing that 99% of Derecho can be ignored.  Derecho actually consists of a collection of subsystems: you link your C++ code to one library, but internally, that library has several distinct sets of "moving parts".  A first subsystem is concerned with moving bytes: our data plane.  The second worries about data persistency and versioning.  A third is where we implement the Paxos semantics: Derecho's control plane.  In fact it handles more than just Paxos -- Derecho's control plane is a single thread that loops through a set of predicates, testing them one by one and then taking triggered actions for any predicate that turns out to be enabled.  A fourth subsystem handles requests that query the distributed state: it runs purely on data that has become stable and is totally lock-free and asynchronous -- the other three subsystems can ignore this one entirely.  In fact the other three subsystems are as lock-free and asynchronous as we could manage, too -- this is the whole game when working with high speed hardware, because the hardware is often far faster than the software that manages it.  We like to think of the RDMA layer and the NVM storage as two additional concurrent systems, and our way of visualizing Derecho is a bit like imagining a machine with five separate moving parts that interact in a few spots, but are as independent as we could manage.
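
To make the control-plane structure a bit more concrete, here is a minimal sketch of that predicate-loop pattern. Everything in it (the PredicateAction type, run_control_plane, the toy rules) is invented for illustration; it is not Derecho's actual internal API, just the shape of the idea:

    #include <atomic>
    #include <functional>
    #include <vector>

    // Hypothetical sketch of a predicate/trigger control loop (not Derecho's real internals).
    struct PredicateAction {
        std::function<bool()> predicate;   // a cheap test over monitored state
        std::function<void()> action;      // fired whenever the predicate is enabled
    };

    void run_control_plane(std::vector<PredicateAction>& rules, std::atomic<bool>& shutting_down) {
        // A single thread owns this loop, so the actions themselves need no locks.
        while (!shutting_down.load(std::memory_order_acquire)) {
            for (auto& rule : rules) {
                if (rule.predicate()) {
                    rule.action();
                }
            }
        }
    }

    int main() {
        std::atomic<bool> shutting_down{false};
        int pending_messages = 3;
        std::vector<PredicateAction> rules = {
            {[&] { return pending_messages > 0; },     // "is something ready to deliver?"
             [&] { --pending_messages; }},             // "deliver it"
            {[&] { return pending_messages == 0; },    // "nothing left to do?"
             [&] { shutting_down.store(true); }},
        };
        run_control_plane(rules, shutting_down);
    }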

For steady state performance -- bandwidth and latency -- we can actually ignore everything except the update path and the query path.  And as it happens, Derecho's query path is just like any query-intensive read-only subsystem: it uses a ton of hashed indices to approximate one-hop access to objects it needs, and it uses RDMA if that one hop involves somehow fetching data from a remote node, or sending a computational task to that remote node.  This leads to fascinating questions, in fact: you want those paths to be lock-free, zero-copy, ideally efficient, etc.  But we can set those questions to the side for our purposes here -- results like the one by Keidar and Shraer really are about update rates.  And for this, as noted a second ago, almost nothing matters except the data-movement path used by the one subsystem concerned with that role.  Let's have a closer look.

For large transfers Derecho uses a tree-based data movement protocol that we call a binomial pipeline.  In simple terms, we build a binary tree, and over it, create a flow pattern of point-to-point block transfers that obtains a high level of internal concurrency, like a two-directional bucket brigade (we call this a reliable multicast over RDMA, or "RDMC").  Just like in an actual bucket brigade, every node settles into a steady behavior, receiving one bucket of data (a "chunk" of bytes) as it sends some other bucket, more or less simultaneously.  The idea is to max out the RDMA network bandwidth (the hardware simply can't move data more efficiently).  The actual data structure creates a hypercube "overlay" (a conceptual routing diagram that lives on our actual network, which allows any-to-any communication) of dimension d, and then wraps d binomial trees over it; you can read about it in our DSN paper, or in the main Derecho paper.
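
To convey the bucket-brigade intuition (and only the intuition: this toy sketch uses a simple relay chain, not RDMC's actual hypercube overlay with its d binomial trees), here is what the steady state looks like once the pipeline fills:

    #include <cstdio>

    // Toy illustration of the bucket-brigade idea on a plain relay chain.
    // NOT the real RDMC schedule -- just the key property: once the pipeline fills,
    // every node receives one chunk while sending another, in every round.
    int main() {
        const int num_relays = 4;   // nodes downstream of the sender
        const int num_chunks = 6;   // the large object, split into chunks
        for (int round = 0; round < num_relays + num_chunks - 1; ++round) {
            std::printf("round %d:", round);
            for (int node = 0; node < num_relays; ++node) {
                int chunk = round - node;              // chunk reaching this node now
                if (chunk >= 0 && chunk < num_chunks)
                    std::printf("  node %d gets chunk %d", node, chunk);
            }
            std::printf("\n");
        }
        // Total rounds = num_relays + num_chunks - 1: the chain is slow to fill,
        // but at steady state it moves one chunk per node per round.
    }

The binomial pipeline keeps that steady-state property but cuts the "fill" cost from something proportional to the member count down to roughly its base-2 logarithm, which is why the scheme holds up at larger scales.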

A binary tree is the best you can hope for when using point-to-point transfers to replicate large, chunked objects.  And indeed, when we measure RDMC, it seems to do as well as one can possibly do on RDMA, given that RDMA lacks a reliable one-to-many chunk transfer protocol.   So here we actually do have an ideal mapping of data movement to RDMA primitives.

Unfortunately, RDMC isn't very helpful for data too small to "chunk".  If we don't have enough data, a binomial tree won't settle into its steady-state bucket-brigade mode and we would just see a series of point-to-point copying actions.  This is still "optimal" at large scale, but recall that often we will be replicating in a shard of size two, three or perhaps five.  We decided that Derecho needed a second protocol for small multicasts, and Sagar Jha implemented what he calls the SMC protocol.

SMC is very simple.  The sender, call it process P, has a window, and a counter.  To send a message, P places the data in a free slot in its window (each sender has a different window, so we mean "P's window"), and increments the counter (again, P's counter).  When every receiver (call them P, Q and R: this protocol actually loops data back, so P sends to itself as well as to the other shard members) has received the message, the slot is freed and P can reuse it, round-robin.  In a shard of size three where all the members send, there would be one instance of this per member: three windows, three counters, three sets of receive counters (one per sender).
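
Here is a minimal sketch of the per-sender bookkeeping just described.  The names and layout are mine, invented purely for illustration (the real SMC data structures live in Derecho's source and look different); the window size and message cutoff are the example values used later in this post:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical sketch of SMC's per-sender state (names invented for illustration).
    constexpr std::size_t WINDOW_SLOTS  = 1000;   // example window size
    constexpr std::size_t MAX_SMALL_MSG = 100;    // "small message" cutoff, in bytes

    struct SmcSenderState {
        // The sender's window: each receiver exposes a mirror of this region for direct RDMA writes.
        std::array<std::array<std::uint8_t, MAX_SMALL_MSG>, WINDOW_SLOTS> slots;
        std::uint64_t full_slots_counter = 0;     // incremented and pushed after each send
    };

    struct SmcReceiverState {
        std::uint64_t received_from_sender = 0;   // one such counter per sender, mirrored to all members
    };

    int main() {
        SmcSenderState p;                             // process P's window and counter
        SmcReceiverState q_view_of_p, r_view_of_p;    // Q's and R's progress receiving from P
        // P may reuse slot (full_slots_counter % WINDOW_SLOTS), round-robin, only once every
        // receiver's counter for P has advanced far enough that the slot is no longer in use:
        bool slot_free =
            q_view_of_p.received_from_sender + WINDOW_SLOTS > p.full_slots_counter &&
            r_view_of_p.received_from_sender + WINDOW_SLOTS > p.full_slots_counter;
        std::printf("slot free to reuse? %s\n", slot_free ? "yes" : "no");
    }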

SMC is quite efficient with small shards.  RDMA has a direct-remote-write feature that we can leverage (RDMC uses a TCP-like feature where the receiver needs to post a buffer before the sender transmits, but this direct write is different: here the receiver declares a region of memory into which the sender can do direct writes, without locking).

Or is it?  Here we run into a curious philosophical debate that centers on the proper semantics of Derecho's ordered_send: should an ordered_send be immediate, or delayed for purposes of buffering, like a file I/O stream?  Sagar, when he designed this layer, opted for urgency.  His reasoning was that if a developer can afford to batch messages and send big RDMC messages that carry thousands of smaller ones, this is exactly what he or she would do.  So a developer opting for SMC must be someone who prioritizes immediate sends, and wants the lowest possible latency.

So, assume that ordered_send is required to be "urgent".  Let's count the RDMA operations that will be needed to send one small object from P to itself (ordered_send loops back), Q and R.  First we need to copy the data from P to Q and R: two RDMA operations, because reliable one-sided RDMA is a one-to-one action.  Next P increments its full-slots counter and pushes it too -- the updated counter can't be sent in the same operation that sends the data, because RDMA has a memory consistency model under which a single operation that spans different cache lines only guarantees sequential consistency on a per-cache-line basis, and we wouldn't want Q or R to see the full-slots counter increment without certainty that the data itself would be visible to them.  You need two distinct RDMA operations to be sure of that (each is said to be "memory fenced").  So, two more RDMA operations are required.  In our three-member shard, we are up to four RDMA operations per SMC multicast.

But now we need acknowledgements.  P can't overwrite the slot until P, Q and R have received the data and looked at it, and to report when this has been completed, the three update their receive counters.  These counters need to be mirrored to one another (for fault-tolerance reasons), so P must send its updated receive counter to Q and R, Q to P and R, and R to P and Q: six more RDMA operations, giving a total of ten.  In general with a shard of size N, we will see 2*(N-1) RDMA operations to send the data and counter, and N*(N-1) for these receive-counter reports, a total of N^2+N-2.  Asymptotically, RDMC will dominate because of the N^2 term, but N would need to be much larger than five for this to kick in.  At a scale of two to five members, we can think of N as more or less a constant, and so this entire term is like a constant.
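
Put as code, the count for a shard of N members (this is just the arithmetic from the paragraph above, nothing Derecho-specific):

    #include <cstdio>

    // RDMA writes per "urgent" SMC multicast in a shard of N members:
    //   data pushes:               N - 1      (one per remote receiver)
    //   full-slots counter pushes: N - 1      (a separate, fenced operation)
    //   receive-counter mirroring: N * (N-1)  (each member updates every other member)
    constexpr int rdma_ops_per_multicast(int n) {
        return 2 * (n - 1) + n * (n - 1);      // = N^2 + N - 2
    }

    int main() {
        const int shard_sizes[] = {2, 3, 5};
        for (int n : shard_sizes)
            std::printf("N=%d: %d RDMA operations per SMC multicast\n", n, rdma_ops_per_multicast(n));
        // N=3 gives 10, matching the count above.
    }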

So by this argument, sending M messages using SMC with an urgent-send semantic "must" cost us M*(N^2+N-2) RDMA operations.  Is this optimal?

Here we run into a hardware issue.  If you check the specification for the Mellanox ConnectX-4 device used in my group's experiments, you'll find that it can transmit 75M RDMA messages per second, and also that it has a peak performance of 100Gbps (12.5GB/s) in each link direction.   But if your 75M messages are used to report updates to tiny little 4-byte counters, you haven't used much of the available bandwidth: 75M times 4 bytes is only 300MB/s, and as noted above, the device is bidirectional.  Since we are talking about bytes, the bidirectional speed could be as high as 25GB/s with an ideal pattern of transfers.  Oops: we're using barely one percent of the link -- off by a factor of more than 80x!
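
Spelling out that arithmetic (all of the numbers come straight from the paragraph above):

    #include <cstdio>

    int main() {
        // Back-of-the-envelope check of the counter-update traffic versus the NIC's capacity.
        const double msgs_per_sec   = 75e6;       // small-message rate of the NIC
        const double counter_bytes  = 4;          // one tiny counter update
        const double link_bytes_sec = 12.5e9;     // 100 Gbps, one direction
        const double used = msgs_per_sec * counter_bytes;     // 300 MB/s of actual payload
        std::printf("counter traffic:             %.0f MB/s\n", used / 1e6);
        std::printf("fraction of one direction:   %.1f%%\n", 100.0 * used / link_bytes_sec);
        std::printf("fraction of both directions: %.1f%%\n", 100.0 * used / (2 * link_bytes_sec));
    }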

In our TOCS paper SMC peaks at around 7.5M small messages per second, which bears out this observation.  We seem to be leaving a lot of capacity unused.  If you think about it, everything centers on the assumption that ordered_send should be as urgent as possible.  This is actually limiting performance, and for applications that average out at 7.5M SMC messages per second or less but have bursts that might be much higher, it even inflates latency (a higher-rate burst will just fill the window and the sender will have to wait for a slot).

Suppose our sender wants fast SMC streaming and low latency, but simply isn't able to do application-level batching (maybe the application has a few independent subsystems of its own that send SMC messages).  Well, everyone is familiar with file I/O streaming and buffering.  Why not use the same idea here?

Clearly we could have aggregated a bunch of SMC messages, and then done one RDMA transfer for the entire set of full window slots (it happens that RDMA has a scatter-gather put feature, and we can use that to transfer precisely the newly filled slots even if they wrap around the window).  Now one counter update covers the full set.  Moreover, the receivers can do "batched" receives, and one receive-counter update would then cover the full batch of receives.

An SMC window might have 1000 sender slots in it, with the cutoff for "small" messages being perhaps 100B.  Suppose we run with batches of size 250.  We'll have cut the overhead factors dramatically: for 1000 SMC messages in the urgent approach, the existing system would send 1000*10 RDMA messages for the 3-member shard: 10,000 in total.  Modified to batch 250 messages at a time, only 40 RDMA operations are needed: a clean 250x improvement.  In theory, our 7.5M SMC messages per second could then leap to 1.9B/second.  But here, predictions break down: with 100-byte payloads, that rate would be substantially over the limit we calculated earlier, 25GB/s, which caps us at 250M SMC messages per second.  Still, 250M is quite a bit faster than 7.5M and worth trying to achieve.
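
Again, restating that calculation in code (the per-multicast cost of 10 comes from the earlier count for a 3-member shard):

    #include <cstdio>

    int main() {
        // Batching arithmetic for a 3-member shard, using the numbers from the text.
        const int ops_per_multicast = 10;      // urgent SMC cost for N=3, counted earlier
        const int messages          = 1000;
        const int batch_size        = 250;
        const int urgent_ops  = messages * ops_per_multicast;                  // 10,000
        const int batched_ops = (messages / batch_size) * ops_per_multicast;   //     40
        std::printf("urgent: %d ops, batched: %d ops (%dx fewer)\n",
                    urgent_ops, batched_ops, urgent_ops / batched_ops);

        // Once the NIC is bandwidth-bound rather than message-bound, the ceiling becomes:
        const double bidir_bytes_sec = 25e9;   // 2 x 12.5 GB/s
        const double payload_bytes   = 100;    // the "small message" cutoff
        std::printf("bandwidth-limited ceiling: %.0fM SMC messages/sec\n",
                    bidir_bytes_sec / payload_bytes / 1e6);
    }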

It might not be trivial to get from here to there, even with batching.  Optimizations at these insane data rates often aren't nearly as simple as a pencil-and-paper calculation might suggest.  And there are also those urgency semantics to think about:  a bursty sender might have some gaps in its sending stream.  Were one to occur in the middle of a 250-message batch, we shouldn't leave those SMC messages dangling: some form of automatic flush has to kick in.  We should also have an API operation so that a user could explicitly force a flush.
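
A sketch of what that policy might look like, with invented names and a made-up 1ms timeout; this is not Derecho's API, just the shape of a batched sender with both an automatic and an explicit flush:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>

    // Hypothetical batching sender (names and parameters invented for illustration).
    // A batch is pushed when it fills, when the application calls flush() explicitly,
    // or when a short timeout expires after the first unsent message, so that a gap
    // in a bursty stream never leaves messages dangling.
    class BatchingSender {
    public:
        void ordered_send() {                       // payload omitted in this sketch
            if (pending_ == 0) first_pending_ = std::chrono::steady_clock::now();
            ++pending_;
            maybe_flush();
        }
        void flush() {
            if (pending_ == 0) return;
            std::printf("pushing %zu messages in one scatter-gather RDMA transfer\n", pending_);
            pending_ = 0;                           // real code would do the RDMA push here
        }
    private:
        void maybe_flush() {
            const bool batch_full = pending_ >= batch_limit_;
            const bool timed_out  = std::chrono::steady_clock::now() - first_pending_ >
                                    std::chrono::milliseconds(1);
            if (batch_full || timed_out) flush();
        }
        std::size_t pending_ = 0;
        const std::size_t batch_limit_ = 250;
        std::chrono::steady_clock::time_point first_pending_;
    };

    int main() {
        BatchingSender sender;
        for (int i = 0; i < 1000; ++i) sender.ordered_send();
        sender.flush();                             // the explicit, user-forced flush
    }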

Interestingly, once you start to think about this, you'll realize that in this latency sense, Sagar's original SMC is probably "more optimal" than any batched solution can be.  If you have just one very urgent notification to send, not a batch, SMC is already a very low-latency protocol; arguably, given his argument that the API itself dictates that SMC should be an urgent protocol, his solution actually is "ideally efficient."  What we see above is that if you question that assumption, you can identify an inefficiency -- not that the protocol as given is inefficient under the assumptions it reflects.

Moral of the story?  The good news is that right this second, there should be a way to improve Derecho performance for small messages, if the user is a tiny bit less worried about urgency and would like to enable a batching mode (we can make it a configurable feature).  But more broadly, you can see that although Derecho lives in a world governed in part by theory, in the extreme performance range we target and with the various hardware constraints and properties we need to keep in mind, tiny decisions can sometimes shape performance to a far greater degree.

I happen to be a performance nut (and nobody stays in my group unless they share that quirk).  Now that we are aware of this SMC performance issue, which was actually called to our attention by Joe Israelevitz when he compared his Acuerdo protocol over RDMA with our Derecho one for 100B objects and beat us hands-down,  we'll certainly tackle it.  I've outlined one example of an optimization, but it will certainly turn out that there are others too, and I bet we'll end up with a nice paper on performance, and a substantial speedup, and maybe even some deep insights.  But they probably won't be insights about protocol complexity.  At the end of the day, Derecho may be quite a bit faster for some cases, and certainly this SMC one will be such a case.  Yet the asymptotic optimality of the protocol will not really have been impacted: the system is optimal in that sense today!  It just isn't as fast as it probably should be, at least for SMC messages sent in high-rate streams!

Wednesday, 26 June 2019

Whiteboard analysis: IoT Edge reactive path

One of my favorite papers is the one Jim Gray wrote with Pat Helland, Patrick O'Neil and Dennis Shasha, on the costs of replicating a database over a large set of servers, which they showed to be prohibitive if you don't fragment (shard) the database into smaller and independently accessed portions: mini-databases.  In some sense, this paper gave us the modern cloud, because you can view Brewer's CAP conjecture and the eBay/Amazon BASE methodologies as both flowing from Gray's original insight.

Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all its locks before releasing any), and then terminate using a two-phase or three-phase commit.  They show that without some mechanism to prevent lock conflicts, there is a predictable and steadily increasing rate of lock conflicts leading to delay and even deadlock/rollback/retry.  The phenomenon causes overheads to rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe it was N^3 in the number of servers, and T^5 in the rate of transactions.  So your single replicated database will suffer a performance collapse.  With shards, using state machine replication (implemented using Derecho!) this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's DynamoDB offers.

Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan of working out the critical paths and measuring costs along them.  In his cloud database setup, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight.  But even if you don't have an identical chance to change the world, it makes sense to try and follow a similar path.

This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week.  Right now, those paths are pretty messy, if you approach it this way.  Without an edge service, we would see something like this:

   IoT Sensor --------------->  IoT Hub --->  Function Server --->  Micro Service

In this example I am acting as if the function server "is" the function itself, and hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps had one waiting, warm-started), and then hands off the event data to the function for handling on one of its servers.  Had I included this handoff the image would be more like this:


   IoT Sensor --------------->  IoT Hub --->  Function Server --->  F --->  Micro Service

F is "your function", coded in a language like C#, F#, C++ or Python, and then encapsulated into a container of some form.  You'll want to keep these programs very small and lightweight for speed.  In particular, a function is not the place to do any serious computing, or to try and store anything.  Real work occurs in the micro service, the one you built using Derecho.  Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a second.  And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!

Even more concerning, many sensors can't connect directly to the cloud, and we end up cloning the architecture and running it twice: once within an IoT Edge system (think of that as an operating system for a small NUMA machine or a cluster, running close to the sensors), which relays data to the main cloud if it can't handle the events out near the sensor device, and then again in the cloud itself:

   IoT Sensor --------------->  Edge Hub --->  Edge Function Server --->  F ======>  IoT Hub --->  Function Server --->  CF --->  Micro Service

Notice that now we have two user-supplied functions on the path.  The first one will have decided that the event can't be handled out at the edge, and forwarded the request to the cloud, probably via a message queuing layer that I haven't shown as a separate box, but have represented with the double arrow: ===>.  This could have chosen to store the request and send it later, but with luck the link was up and it was passed to the cloud instantly, didn't need to sit in an arrival queue, and was instantly given to the cloud's IoT Hub, which in turn finally passed it to the cloud function server, the cloud function (CF) and the Micro Service.

The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others.  The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.

We also have the difficulty that many sensors capture images and videos.  These are initially stored on the device itself, which has substantial capacity but limited compute power.  The big issue is that the first link, from sensor to the edge hub, would often be bandwidth limited.  So we can't upload everything.  Very likely what travels from sensor to hub is just a thumbnail and other meta-data.  Then the edge function concludes that a download is needed (hopefully without too much delay), sends back a download request to the imaging device, and then the device moves the image to the cloud.

Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store ("blob" is short for "binary large object"), which in turn is edge-aware and will mirror them to the main cloud blob store.  Thus we have a whole pathway from IoT sensor to the edge blob server, which will eventually generate another event later to tell us that the data is ready.  And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving that image via the queuing service to the cloud.

So how long will all of this take?  Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G).  So perhaps we will see a 10ms delay at that stop, plus more if the data is large.  Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays.  The big delay will probably be the handoff to the user-defined function, F.

My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100 for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!".  We're up near 350ms and haven't done any work at all yet!

Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path).  Call this 10ms?  And then the micro service can decide what to do with this image.

Add another 75ms or so if we have to forward the request to the cloud.  So the cloud might not be able to react to a photo in less than about 500ms, today.
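
Adding up my guesses (and they are only guesses -- I've used the midpoints of the ranges above), the budget looks something like this:

    #include <cstdio>

    int main() {
        // Rough reactive-path latency budget; every number is an estimate from the text above.
        struct Step { const char* what; double ms; };
        const Step steps[] = {
            {"event reaches F (sensor -> hub -> function server)",     75},   // "50-100ms"
            {"F decides it needs the full image, asks for it",         75},   // "50-100ms"
            {"camera uploads the image to the blob server",           200},
            {"blob server fires the follow-up event F'",               25},
            {"handoff/binding from the function to the micro service", 10},
            {"forwarding onward to the cloud",                         75},
        };
        double total = 0;
        for (const auto& s : steps) {
            total += s.ms;
            std::printf("%6.0f ms  %s\n", s.ms, s.what);
        }
        std::printf("%6.0f ms  total -- before any real work happens\n", total);
    }

That lands around 460ms, which is how I get to the roughly 500ms figure once you allow a little slack.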

None of this involved a Jim Gray kind of analysis of contention and backoff and retry.  If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story.  But if you were to use a database solution like MongoDB (CosmosDB on Azure), it seems to me that you might easily see a further 250ms right there.

What should one do about these snowballing costs?  One answer is that many of the early IoT applications just won't care: if the goal is to just journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal.  But if the goal is to be reactive, we need to do a lot better.

I'm thinking that this is a great setting for various forms of shortcut datapaths, that could be set up after the first interaction and offer direct bypass options to move IoT events or data from the source directly to the real target.  Then with RDMA in the cloud, and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.

On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently), and contention, it is easy to see how you could trigger Jim's "naive replication" concerns.  So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns that a car should "watch out for that reckless motorcycle approaching on your left!" it will already have zoomed past...   

These are exciting times to work in computer systems.  Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!

Saturday, 22 June 2019

Data everywhere but only a drop to drink...

One peculiarity of the IoT revolution is that it may explode the concept of big data.

The physical world is a domain of literally infinite data -- no matter how much we might hope to capture, at the very most we see only a tiny set of samples from an ocean of inaccessible information because we had no sensor in the proper place, or we didn't sample at the proper instant, or didn't have it pointing in the right direction or focused or ready to snap the photo, or we lacked bandwidth for the upload, or had no place to store the data and had to discard it, or misclassified it as "uninteresting" because the filters used to make those decisions weren't parameterized to sense the event the photo was showing.

Meanwhile, our data-hungry machine learning algorithms currently don't deal with the real world: they operate on snapshots, often ones collected ages ago.  The puzzle will be to find a way to somehow compute on this incredible ocean of currently-inaccessible data while the data is still valuable: a real-time constraint.  Time matters because in so many settings, conditions change extremely quickly (think of a smart highway, offering services to cars that are whizzing along at 85mph).

By computing at the back-end, AI/ML researchers have baked in very unrealistic assumptions, so that today's machine learning systems have become heavily skewed: they are very good at dealing with data acquired months ago and painstakingly tagged by an army of workers, and fairly good at using the resulting models to make decisions within a few tens of milliseconds, but in a sense consider the action of acquiring data and processing it in real-time to be part of the (offline) learning side of the game.  In fact many existing systems wouldn't even work if they couldn't iterate for minutes (or longer) on data sets, and many need that data to be preprocessed in various ways, perhaps cleaned up, perhaps preloaded and cached in memory, so that a hardware accelerator can rip through the needed operations.  If a smart highway were capturing data now that we would want to use to relearn vehicle trajectories so that we can react to changing conditions within fractions of a second, many aspects of this standard style of computing would have to change.

To me this points to a real problem for those intent on using machine learning everywhere and as soon as possible, but also a great research opportunity.  Database and machine learning researchers need to begin to explore a new kind of system in which the data available to us is understood to be a "skim" (I learned this term when I used to work with high performance computing teams in scientific computing settings where data was getting big decades ago.  For example, the CERN particle accelerators capture far too much data to move it all off the sensors, so even uploading "raw" data involves deciding which portions to keep, which to sample randomly, and which to completely ignore).

Beyond this issue of deciding what to include in the skim, there is the whole puzzle of supporting a dialog between the machine-learning infrastructure and the devices.  I mentioned examples in which one needs to predict that a photo of such and such a thing would be valuable, anticipate the timing, point the camera in the proper direction, pre-focus it (perhaps on an expected object that isn't yet in the field of view, so that auto-focus wouldn't be useful because the thing we want to image hasn't yet arrived), plan the timing, capture the image, and then process it -- all under real-time pressure.

I've always been fascinated by the emergence of new computing areas.  To me this looks like one ripe for exploration.  It wouldn't surprise me at all to see an ACM Symposium on this topic, or an ACM Transactions journal.  Even at a glance one can see all the elements: a really interesting open problem that would lend itself to a theoretical formalization, but also one that will require substantial evolution of our platforms and computing systems.  The area is clearly of high real-world importance and offers a real opportunity for impact, and a chance to build products.  And it emerges at a juncture between systems and machine learning: a trending topic even now, so that this direction would play into gradually building momentum at the main funding agencies, which rarely can pivot on a dime, but are often good at following opportunities in a more incremental, thoughtful way.

The theoretical question would run roughly as follows.  Suppose that I have a machine-learning system that lacks knowledge required to perform some task (this could be a decision or classification, or might involve some other goal, such as finding a path from A to B).  The system has access to sensors, but there is a cost associated with using them (energy, repositioning, etc).  Finally, we have some metric for data value: a hypothesis concerning the data we are missing that tells us how useful a particular sensor input would be.  Then we can talk about the data to capture next that minimizes cost while maximizing value.  Given a solution to the one-shot problem, we would then want to explore the continuous version, where the new data changes these model elements, fixed-points for problems that are static, and quality of tracking for cases where the underlying data is evolving.
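
A toy rendering of the one-shot version (the candidate actions, the numbers, and the linear value-minus-cost scoring are all made up purely to show the shape of the problem):

    #include <cstdio>
    #include <vector>

    // Toy one-shot version of the problem: pick the sensing action whose estimated
    // value, net of its acquisition cost, is largest.  In a real system both numbers
    // would come from the learned model and the value metric discussed above.
    struct SensorAction { const char* name; double expected_value; double cost; };

    int main() {
        const std::vector<SensorAction> candidates = {
            {"re-point camera 3 at the on-ramp", 5.0, 2.0},
            {"grab one lidar sweep of lane 2",   3.5, 0.5},
            {"do nothing and wait",              0.0, 0.0},
        };
        const SensorAction* best = &candidates[0];
        for (const auto& c : candidates)
            if (c.expected_value - c.cost > best->expected_value - best->cost) best = &c;
        std::printf("chosen action: %s (net value %.1f)\n",
                    best->name, best->expected_value - best->cost);
    }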

The practical systems-infrastructure and O/S questions center on the capabilities of the hardware and the limitations of today's Linux-based operating system infrastructure, particularly in combination with existing offloaded compute accelerators (FPGA, TPU, GPU, even RDMA).  Today's sensors run the gamut from really dumb fixed devices that don't even have storage to relatively smart sensors that can do various tasks on the device itself, have storage and some degree of intelligence about how to report data, etc.  Future sensors might go further, with the ability to download logic and machine-learned models for making such decisions: I think it is very likely that we could program a device to point the camera at such and such a lane on the freeway, wait for a white vehicle moving at high speed that should arrive in the period [T0,T1], obtain a well-focused photo showing the license plate and current driver, and then report the image capture accompanied by a thumbnail.  It might even be reasonable to talk about prefocusing, adjusting the spectral parameters of the imaging system, selecting from a set of available lenses, etc.

Exploiting all of this will demand a new ecosystem that combines elements of machine learning on the cloud with elements of controlled logic on the sensing devices.  If one thinks about the way that we refactor software, here we seem to be looking at a larger-scale refactoring in which the machine learning platform on the cloud, with "infinite storage and compute" resources, has the role of running the compute-heavy portions of the task, but where the sensors and the other elements of the solution (things like camera motion control, dynamic focus, etc) would need to participate in a cooperative way.  Moreover, since we are dealing with entire IoT ecosystems, one has to visualize doing this at huge scale, with lots of sensors, lots of machine-learned models, and a shared infrastructure that imposes limits on communication bandwidth and latency, computing at the sensors, battery power, storage and so forth.

It would probably be wise to keep as much of the existing infrastructure as feasible.  So perhaps that smart highway will need to compute "typical patterns" of traffic flow over a long time period with today's methodologies (no time pressure there), current vehicle trajectories over mid-term time periods using methods that work within a few seconds, and then can deal with instantaneous context (a car suddenly swerves to avoid a rock that just fell from a dumptruck onto the lane) as an ultra-urgent real-time learning task that splits into the instantaneous part ("watch out!") and the longer-term parts ("warning: obstacle in the road 0.5miles ahead, left lane") or even longer ("at mile 22, northbound, left lane, anticipate roadway debris").  This kind of hierarchy of temporality is missing in today's machine learning systems, as far as I can tell, and the more urgent forms of learning and reaction will require new tools. Yet we can preserve a lot of existing technology as we tackle these new tasks.

Data is everywhere... and that isn't going to change.  It is about time that we tackle the challenge of building systems that can learn to discover context, and use current context to decide what to "look more closely" at, and with adequate time to carry out that task.  This is a broad puzzle with room for everyone -- in fact you can't even consider tackling it without teams that include systems people like me as well as machine learning and vision researchers.  What a great puzzle for the next generation of researchers!

Sunday, 12 May 2019

Redefining the IoT Edge

Edge computing has a dismal reputation.  Although continuing miniaturization of computing elements has made it possible to put small ARM processors pretty much anywhere, general purpose tasks don't make much sense at the edge.  The most obvious reason is that no matter how powerful the processor could be, a mix of power, bandwidth and cost constraints argues against that model.

Beyond this, the interesting forms of machine learning and decision making can't possibly occur in an autonomous way.  An edge sensor will have the data it captures directly and any configuration we might have pushed to it last night, but very little real-time context:  if every sensor were trying to share its data with every other sensor that might be interested in that data, the resulting n^2 pattern would overwhelm even the beefiest ARM configuration.  Yet exchanging smaller data summaries implies that each device will run with different mixes of detail.

This creates a computing model constrained by hard theoretical bounds.  In papers written in the 1980's, Stony Brook economics professor Pradeep Dubey studied the efficiency of game-theoretic multiparty optimization.  His early results inspired follow-on research by Berkeley's Elias Koutsoupias and Christos Papadimitriou, and by my colleagues here at Cornell, Tim Roughgarden and Eva Tardos.  The bottom line is unequivocal: there is a huge "price of anarchy."  In an optimization system where parties independently work towards an optimal state using non-identical data, even when they can find a Nash equilibrium, that state can be far from the global optimum.

As a distributed protocols person who builds systems, one obvious idea would be to explore more efficient data exchange protocols for the edge: systems in which the sensors iteratively exchange subsets of data in a smarter way, using consensus to agree on the data so that they are all computing against the same inputs.  There has been plenty of work on this, including some of mine.  But little of it has been adopted or even deployed experimentally.

The core problem is that communication constraints make direct sensor to sensor data exchange difficult and slow.  If a backlink to the cloud is available, it is almost always best to just use it.  But if you do, you end up with an IoT cloud model, where data first is uploaded to the cloud, then some computed result is pushed back to the devices.  The devices are no longer autonomously intelligent: they are basically peripherals of the cloud.

Optimization is at the heart of machine learning and artificial intelligence, and so all of these observations lead us towards a cloud-hosted model of IoT intelligence.  Other options, for example ones in which brilliant sensors are deployed to implement a decentralized intelligent system, might yield collective behavior, but that behavior will be suboptimal, and perhaps even unstable (or chaotic).   I was once quite interested in swarm computing (it seemed like a natural outgrowth of gossip protocols, on which I was working at the time).   Today, I've come to doubt that robot swarms or self-organizing convoys of smart cars can work, and even if they can, that the quality of their decision-making could compete against cloud-hosted solutions.

In fact the cloud has all sorts of magical superpowers that enable it to perform operations inaccessible to the IoT sensors.  Consider data fusion: with multiple overlapping cameras operated from different perspectives, we can reconstruct 3D scenes -- in effect, using the images to generate a 3D model and then painting the model with the captured data.  But to do this we need lots of parallel computing and heavy processing on GPU devices.  Even a swarm of brilliant sensors could never create such a fused scene given today’s communication and hardware options.

And yet, even though I believe in the remarkable power of the cloud, I'm also skeptical about an IoT model that presumes the sensors are dumb devices.   Devices like cameras actually possess remarkable powers too, ones that no central system can mimic.  For example, if preconfigured with some form of interest model, a smart sensor can classify images: data to  upload, data to retain but report only as a thumbnail with associated metadata, and data to discard outright.  A camera may be able to pivot so as to point the lens at an interesting location, or to focus in anticipation of some expected event, or to configure a multispectral image sensor.  It can decide when to snap the photo, and which of several candidate images to retain (many of today's cameras take multiple images and some even do so with different depths of field or different focal points).  Cameras can also do a wide range of on-device image preprocessing and compression.  If we overlook these specialized capabilities, we end up with a very dumb IoT edge and a cloud unable to compensate for its limitations.

The future, then, actually will demand a form of edge computing -- but one centered on a partnership in which the cloud (or perhaps a cloud edge running on a platform near the sensors, as with Azure IoT Edge) works in close concert with the attached sensors: dynamically configuring them, perhaps reconfiguring them as conditions change, and even passing them knowledge models computed on the cloud that they can use on-camera (or on-radar, on-lidar, on-microphone) to improve the quality of the information captured.  Each element has its unique capabilities and roles.

Even the IoT network is heading towards a more and more dynamic and reconfigurable model.  If one sensor captures a huge and extremely interesting object, while others have nothing notable to report, it may make sense to reconfigure the WiFi network to dedicate a maximum of resources to that one WiFi link.  Moments later, having pulled the video to the cloud edge, we might shift those same resources to a set of motion sensors that are watching an interesting pattern of activity, or to some other camera.

Perhaps we need a new term for this kind of edge computing, but my own instinct is to just coopt the existing term -- the bottom line is that the classic idea of edge computing hasn't really gone very far, and its reputation, reviled or not, is mostly held by people who aren't even active in the field today.  The next generation of edge computing will be done by a new generation of researchers and product developers, and they might as well benefit from the name recognition -- I think they can brush off the negative associations fairly easily, given that edge computing never actually took off and then collapsed, nor had any kind of extensive coverage in the commercial press.

The resulting research agenda is an exciting one.  We will need to develop models for computing that single globally optimal knowledge state, yet also for "compiling" elements of it to be executed remotely.  We'll need to understand how to treat physical-world actions like pivoting and focusing as elements of an otherwise von Neumann computational framework, and to include the possibility of capturing new data side by side with the possibility of iterating a stochastic gradient descent one more time.  There are questions of long-term knowledge (which we can compute on the back-end cloud using today's existing batched solutions), but also contextual knowledge that must be acquired on the fly, and then physical-world "knowledge" such as a motion detection that might be used to trigger a camera to acquire an image.  The problem poses open questions at every level: the machine learning infrastructure, the systems infrastructure on which it runs, and the devices themselves -- not brilliant and autonomous, but not dumb either.  As the area matures and we gain some degree of standardization around platforms and approaches, the potential seems enormous!

So next time you teach a class on IoT and mention exciting ideas like smart highways that might sell access to high speed lanes or other services to drivers or semi-autonomous cars, pause to point out that this kind of setting is a perfect example of a future computing capability that will soon supplant past ideas of edge computing.  Teach your students to think of robotic actions like pivoting a camera, or focusing it, or even configuring it to select interesting images, as one facet of a rich and complex notion of edge computing that can take us into settings inaccessible to the classical cloud, and yet equally inaccessible even to the most brilliant of autonomous sensors.   Tell them about those theoretical insights: it is very hard to engineer around an impossibility proof, and if this implies that swarm computing simply won't be the winner, let them think about the implications.  You'll be helping them prepare to be leaders in tomorrow's big new thing!

Wednesday, 3 April 2019

The intractable complexity of machine-learned control systems for safety-critical settings.

As I read the reporting on the dual Boeing 737 Max air disasters, what I find worrying is that the plane seems to have depended on a very complicated set of mechanisms that interacted with each other, with the pilot, with the airplane's flaps and wings, and with the environment in what one might think of as a kind of exponentially large cross-product of potential situations and causal-sequence chains.  I hope that eventually we'll understand the technical failure that brought these planes down, but for me, the deeper story is already evident, and it concerns the limits on our ability to fully specify extremely complex cyber-physical systems, to fully characterize the environments in which they need to operate, to anticipate every plausible failure mode and the resultant behavior, and to certify that the resulting system won't trigger a calamity.   Complexity is the real enemy of assurance, and the failure to learn that lesson can result in huge loss of lives.

One would think that each event of this kind would be sobering and lead to a broad pushback against "over-automation" of safety-critical systems.  But there is a popular term that seemingly shuts down rational thinking: machine learning.

The emerging wave of self-driving cars will be immensely complex -- in many ways, even more than the Boeing aircraft, and also far more dependent upon a high quality of external information coming from systems external to the cars (from the cloud).  But whereas the public seems to perceive the Boeing flight control system as a "machine" that malfunctioned, and has been quick to affix blame, it doesn't seem as if accidents involving self-driving cars elicit a similar reaction: there have been several very worrying accidents by now, and several deaths, yet the press, the investment community and even the public appear to be enthralled.

This is amplified by ambiguity about how to regulate the area.  Although any car on the road is subject to safety reviews both by a federal organization called the NHTSA and by state regulators, the whole area of self-driving vehicles is very new.  As a result, these cars don't undergo anything like the government "red team" certification analysis required for planes before they are licensed to fly.  My sense is that because these cars are perceived as intelligent, they are somehow being treated differently from the more mechanical style of system we have in mind when we think about critical systems on aircraft -- quite possibly because machine intelligence brings such an extreme form of complexity that there actually isn't any meaningful way to fully model or verify its potential behavior.  Movie treatments of AI focus on themes like "transcendence" or "exponential self-evolution" and in doing so, they both highlight the fundamental issue here (namely, that we have created a technology we can't truly characterize or comprehend), while at the same time elevating it to human-like or even superhuman status.

Take a neural network: beyond even a modest number of nodes, today's neural network models become mathematically intractable in the sense that although we do have mathematical theories that can describe their behavior, the actual instances are too complex to model.  Thus one can definitely build such a network and experiment upon it, but it becomes impossible to make rigorous mathematical statements.  In effect, these systems are simply too complex to predict their behavior, other than by just running them and watching how they perform.  On the one hand, I suppose you can say this about human pilots and drivers too.  But on the other hand, this reinforces the point I just made above: the analogy is false, because a neural network is very different from an intelligent human mind, and when we draw that comparison we conflate two completely distinct control models.

With safety critical systems, the idea of adversarial certification, in which the developer proves the system safe while the certification authority poses harder and harder challenges, is well established.  Depending on the nature of the question, the developer may be expected to use mathematics, testing, simulation, forms of root-cause analysis, or other methods.  But once we begin to talk about systems that have unquantifiable behavior, and yet that may confront stimuli that could never have been dreamed up during testing,  we enter a domain that can't really be certified for safe operation in the way that an aircraft control system normally would be -- or at least, would have been in the past, since the Boeing disasters suggest that even aircraft control systems may have finally become overwhelmingly complex.

When we build systems that have neural networks at their core, for tasks like robotic vision or robotic driving, we enter an inherently uncertifiable world in which it just ceases to be possible to undertake a rigorous, adversarial, analysis of risks.

To make matters even worse, today's self-driving cars are designed in a highly secretive manner, tested by the vendors themselves, and are really being tested and trained in a single process that occurs out on the road, surrounded by normal human drivers, people walking their dogs or bicycling to work, children playing in parks and walking to school.  All this plays out even as the systems undergo the usual rhythm of bug identification and correction, frequent software patches and upgrades: a process in which the safety critical elements are continuously evolved even as the system is developed and tested.

The government regulators aren't being asked to certify instances of well-understood control technologies, as with planes, but are rather being asked to certify black boxes that in fact are nearly as opaque to their creators as to the government watchdog.  No matter what the question, the developer's response is invariably the same: "in our testing, we have never seen that problem."  Boeing reminds us to add the qualifier: "Yet."

The area is new for the NHTSA and the various state-level regulators, and I'm sure that they rationalize this inherent opacity by saying to themselves that over time, we will gradually develop  safety-certification  rules -- the idea that by its nature, these technologies may not permit a systematic certification seems not to have occurred to the government, or the public.  And yet a self-driving car is a 2-ton robot that can accelerate from 0 to 80 in 10 seconds.

You may be thinking that well, in both cases, the ultimate responsibility is with the human operator.  But in fact there are many reasons to doubt that a human can plausibly intervene in the event of a sudden problem: people simply aren't good at reacting in a fraction of a second.  In the case of the Boeing 737 Max, the pilots of the two doomed planes certainly weren't able to regain control, despite the fact that one of them apparently did disable the problematic system seconds into the flight.  Part of the problem relates to unintended consequences: apparently, Boeing recommends disabling the system by turning off an entire set of subsystems, and some of those are needed during takeoff, so the pilot was forced to reengage them, and with them, the anti-stall system reactivated. A second issue is just the lack of adequate time to achieve "affirmative control:"  People need time to formulate a plan when confronted with a complex crisis outside of their experience, and if that crisis is playing out very rapidly, may be so overwhelmed that even if a viable recovery plan is available they might fail to discover it.

I know the feeling.  Here in the frosty US north, it can sometimes happen that you find that your car is starting to skid on icy, snowy roads.  Over the years I've learned to deal with skids, but it takes practice.  The first times, all your instincts are wrong: in fact, for an inexperienced driver faced with a skid, the safest reaction is to freeze.  The actual required sequence is to start by figuring out which direction the car is sliding in (and you have to do this while your car is rotating).  Then you should steer towards that direction, no matter what it happens to be.  Your car should straighten out, at which point you can gently pump the brakes.  But all this takes time, and if you are skidding quickly, you'll be in a snowbank or a ditch before you manage to regain control.  In fact the best bet of all is to not skid in the first place, and after decades of experience, I never do.  But it takes training to reach this point.  How do we train the self-driving machine learning systems on these rare situations?  And keep in mind, every skid has its very own trigger.  The nature of the surface, the surroundings of the road, the weather, the slope or curve, other traffic -- all factor in.

Can machine learning systems powered by neural networks and other clever AI tools somehow magically solve such problems?

When I make this case to my colleagues who work in the area, they invariably respond that the statistics are great... and yes, as of today, anyone would have to acknowledge this point.  Google's Waymo has already driven a million miles without any accidents, and perhaps far more because that number has been out there for a while.

But then (sort of like with studies of new, expensive medications) you run into the qualifiers.  It turns out that Google tests Waymo in places like Arizona, where roads are wide, temperatures can be high, and the number of people and pets and bicyclists is often rather low (120F heat doesn't really make bicycling to work all that appealing).   They also carefully clean and tune their cars between each test drive, so the vehicles are in flawless shape.  They also benefit simply because so many human drivers shouldn't be behind the wheel in the first place: Waymo is never intoxicated, isn't likely to be distracted by music, phone calls, texting, or arguments between the kids in the back.  It has some clear advantages even before it steers itself from the parking lot.

Yet there are easy retorts:
  • Let's face it: conditions in places like Phoenix or rural Florida are about as benign as can be imagined.  In any actual nationwide deployment, cars would need to cope with mud and road salt, misaligned components, power supply issues (did you know that chipmunks absolutely love to gnaw on battery wires?).  Moreover, these vehicles had professional drivers in the emergency backup role, who focused attentively on the road and the dashboard while being monitored by a second level of professionals with the specific role of reminding them to pay attention.  In a real deployment, the human operator might be reading the evening sports results while knocking back a beer or two and listening to the radio or texting a friend.
  • Then we run into issues of roadwork that invalidates maps and lane markings.  GPS signals are well known to bounce off buildings, resulting in echoes that can confuse a location sensor (if you have ever used Google Maps in a big city, you know what I mean).  Weather conditions can result in vehicle challenges never seen in Phoenix: blizzard conditions, flooded or icy road surfaces, counties that ran low on salt and money for plowing and left their little section of I87 unplowed in the blizzard, potholes hiding under puddles or in deep shadow, tires with uneven tread wear or that have gone out of balance -- the list is really endless.  On the major highways near New York, I've seen cars abandoned right in the middle lane, trashcans upended to warn drivers of missing manhole covers, and had all sorts of objects fly off flatbed trucks right in front of me: huge metal boxes, chunks of loose concrete or metal, a mattress, a refrigerator door...  This is the "real world" of driving, and self-driving cars will experience all of these things and more from the moment we turn them loose in the wild.
  • Regional driving styles vary widely too, and sometimes in ways that don't easily translate from place to place and that might never arise in Phoenix.  For example, teenagers who act out in New Jersey and New York are fond of weaving through traffic.  At very high speeds.  In Paris, this has become an entire new concept in which motorcyclists get to use the lanes "between" the lanes of cars as narrow, high-speed driving lanes (and they weave too).  But New Jersey has its own weird rule of the road: on the main roads near Princeton, for some reason there is a kind of sport of not letting cars enter the flow of traffic, and even elderly drivers won't leave you more than a fraction of a second and a few inches to spare as you dive into the endless stream.  I'm a New Yorker, and can drive like a taxi driver there... well, any New York taxi driver worth his or her salary can confirm that this is a unique experience, a bit like a kind of high-speed automotive ballet (or if you prefer, like being a single fish in a school of fish).  The taxis flow down the NYC avenues at 50mph, trying to stay with the green lights, and flowing around obstacles in strangely coordinated ways.  But New York isn't special. Over in Tel Aviv, drivers will switch lanes without a further thought after glancing no more than 45 degrees to either side, and will casually pull in front of you leaving centimeters to spare.  Back in France, at the Arc de Triomphe and Place Victor Hugo, the roundabouts give incoming traffic priority over outgoing traffic... but only those two use this rule; in all the rest of Europe, the priority favors outgoing traffic (this makes a great example for teaching about deadlocks!)  And in Belgium, there are a remarkable number of unmarked intersections.  On those, the priority always allows the person entering from the right to cut in front of the person on his/her left, even if the person from the right is crossing the street or turning, and even if the person on the left was on what seemed like the main road.  In Provence, the roads are too narrow: everyone blasts down them at 70mph but is also quick to actually go off the road, with one tire on the grass, if someone approaches in the other direction.  If you didn't follow that rule... bang!  In New Delhi and Chennai, anything at all is accepted -- anything.  In rural Mexico, at least the last time I was there, the local drivers enjoyed terrifying the non-local ones (and I can just imagine how they would have treated robotic vehicles).
And those are just environmental worries.  For me, the stranger part of the story is the complacency of the very same technology writers who are rushing to assign blame in the recent plane crashes.  This gets back to my use of the term "enthralled."  Somehow, the mere fact that self-driving cars are "artificially intelligent" seems to blind those technology reviewers to the evident reality: there are tasks that are far too difficult for today's machine learning solutions, and driving a car is one of them -- these systems simply aren't up to it, not even close!

What, precisely, is the state of the art?  Well, we happen to be wrapping up an exciting season of  faculty hiring focused on exactly these areas of machine learning.  In the past few weeks I've seen talks on vision systems that try to make sense of clutter or to anticipate what might be going on around a corner or behind some visual obstacle.  No surprise: the state of the art is rather primitive.  We've also heard about research on robotic motion just to solve basic tasks like finding a path from point A to point B in a complex environment, or ways to maneuver that won't startle humans in the vicinity.

Let me pause to point out that if these basic tasks are considered cutting-edge research, shouldn't it be obvious that the task of finding a safe path in real time (cars don't stop on a dime, you know) is not a solved one either?  If we can't do it in a warehouse, how in the world have we talked ourselves into doing it on Phoenix city streets?

Self-driving cars center on deep neural networks for vision, and yet nobody quite understands how to relate the problem these networks solve to the real safety issue that cars confront.  Quite the opposite: neural networks for vision are known to act bizarrely for seemingly trivial reasons.  A neural network that is the world's best for interpreting photos can be completely thrown off simply by placing a toy elephant somewhere in the room.  A different neural network, that one a champ at making sense of roadway scenes, stops recognizing anything if you inject just a bit of random noise.  Just last night I read a report that Tesla cars can be tricked into veering into oncoming traffic if you put a few spots of white paint on the lane they are driving down, or if the lane marking to one side is fuzzy.  Tesla, of course, denies that this could ever occur in a real-world setting, and points out that they have never observed such an issue, not even once.

People will often tell you that even if the self-driving car concept never matures, at least it will spin out some amazing technologies.  I'll grant them this: the point is valid.

For example, Intel's Mobileye is a genuinely amazing little device that warns you if the cars up ahead of you suddenly brake.  I had it in a rental car recently and it definitely saved me from a possible rear-ender near Newark airport.  I was driving on the highway when the wind blew a lot of garbage off a passing truck.  Everyone (me included) glanced in that direction, but someone up ahead must have also slammed on the brakes.  A pileup was a real risk, but Mobileye made this weird squawk (as if I was having a sudden close encounter with an angry duck) and vibrated the steering wheel, and it worked: I slowed down in time.

On the other hand, Mobileye also gets confused.  A few times it thought I was drifting from my lane when actually the lane markers themselves were just messy (some sort of old roadwork had left traces of temporary lane markings).  And at one point I noticed that it was watching speed limit signs, but was confused by the speed limits for exit-only lanes to my right, thinking that they also applied to the through-traffic lanes I was in.

Now think about this: if Mobileye gets confused, why should you assume that Waymo and Tesla and Uber self-driving cars never get confused?  All four use neural network vision systems.  This is a very fair question.

Another of my favorite spinouts is Hari Balakrishnan's startup in Boston.  His company is planning to monitor the quality of drivers: the person driving your car and perhaps those around your car too.  What a great idea!

My only worry is that if this were really to work well, could our society deal with the consequences?  Suppose that your head-up display somehow drew a red box around every dangerous car anywhere near you on the road.  On the positive side, now you would know which cars were being driven by hormonal teenagers, which have drivers who are distracted by texting, which are piloted by drunk or stoned drivers, which have drivers with severe cataracts who can't actually see much of anything...

But on the negative side, I honestly don't know how we will react.  The fact is that we're surrounded by non-roadworthy cars, trucks carrying poorly secured loads of garbage,  and drivers who probably should be arrested!

But it cuts the other way, too.  If you are driving on a very poor road surface, you might be swerving to avoid the potholes or debris.  A hands-free phone conversation is perfectly legal, as is using Google Maps to find the address of that new dentist's office.  We wouldn't want to be "red boxed" and perhaps pulled over by a state trooper for reasons like that.

So I do hope that Hari can put a dent in the road-safety problem.  But I suspect that he and his company will need quite a bit of that $500M they just raised to pull it off.

So where am I going with all of this?  It comes down to an ethical question.  Right this second, the world is in strong agreement that Boeing's 737 Max is unsafe under some not-yet-fully-described condition.  Hundreds of innocent people have died because of that.  And don't assume that Airbus is somehow different -- John Rushby could tell you some pretty hair-raising stories about Airbus technology issues (their planes have been "fly by wire" for decades now, so they are not new to the kind of puzzle we've been discussing).  Perhaps you are thinking that, well, at least Airbus hasn't killed anyone.  But is that really a coherent way to think about safety?

Self-driving cars may soon be carrying far more passengers under far more complex conditions than either of these brands of aircraft.  And in fact, driving is a much harder job than flying a plane.  Our colleagues are creating these self-driving cars, and in my view, their solutions just aren't safe to launch onto our roads yet.  This generation of machine learning may simply not be up to the task, and our entire approach to safety certification isn't yet ready to cope with the needed certification tasks.

When we agree to allow these things out on the road, that will include roads that you and your family will be driving on, too.  Should Hari's company put a red warning box around them, to help you stay far away?  And they may be driving on your local city street too.  Your dog will be chasing sticks next to that street, your cats will be out there doing whatever cats do, and your children will be learning to bicycle, all on those same shared roads.

There have already been too many  deaths.  Shouldn't our community be calling for this to stop, before far more people get hurt?

Wednesday, 13 March 2019

Intelligent IoT Services: Generic, or Bespoke?

I've been fascinated by a puzzle that will probably play out over several years.  It involves a deep transformation of the cloud computing marketplace, centered on a choice.  In one case, IoT infrastructures will be built the way we currently build web services that do things like intelligent recommendations or ad placements. In the other, edge IoT will require a "new" way of developing solutions that centers on creating new and specialized services... ones that embody real-time logic for making decisions or even learning in real-time.

I'm going to make a case for bespoke, hand-built services: the second scenario.  But if I'm right, there is hard work to be done, and whoever starts first will gain a major advantage.

So to set the stage, let me outline the way IoT applications work today in the cloud.  We have devices deployed in some enterprise setting, perhaps a factory, or an apartment complex, or an office building.  These might be quite dumb, but they are still network enabled: things like temperature and humidity sensors, motion detectors, microphones or cameras, etc.  Because many are dumb, even the smart ones (like cameras with built-in autofocus, deblurring, and depth perception) are treated in a rather rigid manner: the basic model is a device with a limited API that can be configured, and perhaps patched if the firmware has issues, but otherwise just generates simple events with metadata describing what happened.

In a posting a few weeks ago, I noted that unmanaged IoT deployments are terrifying for system administrators, so the world is rapidly shifting towards migrating IoT device management into systems like Azure's infrastructure for Office 365.  Basically, if my company already uses Office for other workplace tasks, it makes sense to also manage these useful (but potentially dangerous) devices through the same system.  

Azure's IoT Hub handles that managerial role: secure connectivity to the sensors, patches guaranteed to be pushed as soon as feasible... and in the limit, maybe nothing else. But why stop there? My point a few weeks back was simply that even just managing enterprise IoT will leave Azure in a position of managing immense numbers of devices -- and hence, in a position to leverage the devices by bringing new value to the table.

Next observation: this will be an "app" market, not a "platform" market.  In this blog I don't often draw on marketing studies and the like, but for this particular case, it makes sense to point to market studies that explain my thinking (look at Lecture 28 in my CS5412 cloud computing class to see charts from the studies I drew on).

Cloud computing, perhaps far more than most areas of systems, is shaped by the way cloud customers actually want to use the infrastructure.  In contrast, an area like databases or big data is shaped by how people want to use the data, which determines access patterns.  Those users aren't trying to explicitly route their data through FPGA devices that will transform it in some way, or to run computations that can't keep up unless they execute on GPU clusters.  So, because my kind of cloud customers migrate to whichever cloud makes it easiest to build their applications, they will favor the cloud with the best support for IoT apps.

A platform story basically offers minimal functionality, like bare metal running Linux, and leaves the developers to do the rest.  They are welcome to connect to services but not required to do so.  Sometimes this is called the hybrid cloud.

Now, what's an app?  As I'm using the term, you should visualize the iPhone or Android app store: small programs that share many common infrastructure components (the GUI framework, the storage framework, the motion and touch sensors, etc.), and that then connect to their bigger cloud-hosted servers over a web services layer -- one that matches nicely with the old Apache-dominated cloud for doing highly concurrent construction of web pages.  So this is the intuition.

For IoT, though, an app model wouldn't work in the same way -- in fact, it can't work in the same way.  First, IoT devices that want help from intelligent machine learning will often need support from something that learns in real time.  In contrast, today's web architecture is all about learning yesterday and then serving up read-only data at ultra-fast rates from scalable caching layers that could easily be stale if the data were actually changing rapidly.  So suddenly we will need to do machine learning, decision making and classification, and a host of other performance-intensive tasks at the edge, under time pressure, and with data changing quite rapidly.  Just think of a service that guides a drone surveying a farm, where the drone wants to optimize its search strategy to "sail on the wind," and you'll be thinking about the right issues.
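
To make that contrast concrete, here is a tiny sketch of what "learning in real time" means in this setting (all names are invented for illustration): the model gets nudged on every incoming report, rather than being trained overnight and then served read-only from a cache.

    #include <initializer_list>
    #include <iostream>

    // Hypothetical running estimate of the wind at one spot on the farm,
    // updated on every drone report instead of recomputed by a nightly batch job.
    struct WindEstimate {
        double speed = 0.0;    // current estimate, in m/s
        double alpha = 0.2;    // how quickly new observations displace old ones

        void observe(double measured_speed) {
            // exponentially weighted update: cheap enough to run per event
            speed = alpha * measured_speed + (1.0 - alpha) * speed;
        }
    };

    int main() {
        WindEstimate w;
        for (double sample : {3.0, 3.5, 7.0, 6.5}) {   // reports arriving in real time
            w.observe(sample);
            std::cout << "current wind estimate: " << w.speed << " m/s\n";
        }
    }

Nothing fancy -- the point is simply that the model's state changes with every event, which is exactly what today's cache-centric web tier is not designed to do.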

Will the market want platforms, or apps?  I think the market data strongly suggests that apps are winning.  Their relatively turnkey development advantages outweigh the limitations of programming in a somewhat constrained way.  If you do look at the slides from my course, you can see how this trend is playing out.  The big money is in apps.

And now we get to my real puzzle.  If I'm going to be creating intelligent infrastructure for these rather limited IoT devices (limited by power, by compute cycles, and by bandwidth), where should the intelligence live?  Not on the devices: we just bolted them down to a point where they probably wouldn't have the capacity.  Anyhow, they lack the big picture: if 10 drones are flying around, the cloud can build a wind map for the whole farm, but any single drone wouldn't have enough context to create that situational picture, or to optimize the flight plan properly.  There is even a famous theoretical result on the "price of anarchy," showing that you don't get the global optimum when a lot of autonomous agents each make individually optimal choices.  No, you want the intelligence to reside in the cloud.  But where?

Today, machine intelligence lives at the back of the cloud, but the delays are too large.  We can't control today's drones with yesterday's wind patterns.  We need intelligence right at the edge!

Azure and AWS both access their IoT devices through a function layer ("lambdas" in the case of AWS).  This is an elastic service that hosts containers, launching as many instances of your program as needed on the basis of incoming events.  Functions of this kind are genuine programs and can do anything they need to do, but they run in what is called "stateless" mode: they flash into existence (or are even warm-started ahead of time, so that when an event arrives, the delay is minimal), handle the event, and then vanish.  They can't save any permanent data locally, even though the container does have a small file system that works perfectly well: as soon as the event handling ends, the container garbage collects itself and that local file system evaporates.
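
Here is a minimal sketch of that "stateless" contract, written in C++ purely for consistency with the rest of this discussion -- real function layers dictate their own handler signatures, and every name below is hypothetical.  The handler can compute, and it can even scribble in the container's scratch file system, but nothing it writes locally survives past the end of the event.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Hypothetical shape of an event delivered by the IoT hub.
    struct IoTEvent {
        std::string device_id;
        std::string payload;
    };

    // A "stateless" handler: it may use the container's small local file system
    // while it runs, but the container (and that file system) is reclaimed as
    // soon as the handler returns, so nothing here carries over to the next event.
    std::string handle_event(const IoTEvent& ev) {
        std::ofstream scratch("/tmp/scratch.txt");     // fine during this invocation
        scratch << ev.payload;
        return "handled event from " + ev.device_id;   // durable state must live elsewhere
    }

    int main() {
        std::cout << handle_event({"drone-7", "wind: 6.5 m/s"}) << "\n";
    }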

So, the intelligence and knowledge and learning have to occur in a bank of servers.  One scenario, call it the PaaS model, would be that Amazon and Microsoft pre-build a set of very general-purpose AI/ML services, and we code all our solutions by parameterizing those and mapping everything onto them.  So here you have AI-as-a-service.  Seems like a guaranteed $B startup concept!  But very honestly, I'm not seeing how it can work.  The machine learning you would do to learn wind patterns and direct drones to sail on the wind is just too different from what you need to recognize wheat blight, or to figure out what insect is eating the corn.

The other scenario is the "bespoke" one.  My Derecho library could be useful here.  With a bespoke service, you take some tools like Derecho and build a little cluster-hosted service of your very own, which you then ask the cloud to host on your behalf.  Then your functions or lambdas can talk to your services, so that if an IoT event requires a decision, the path from device to intelligence is just milliseconds.  With consistent data replication, we can even eliminate stale-data issues: these services would learn as they go (or at least, they could), and then use their most recent models to handle each new stage of decision-making.
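
To make the bespoke scenario concrete, here is a rough sketch of the shape such a service might take.  The class and method names are invented, not the actual Derecho API; the point is only that the service holds consistently replicated state that it updates as events arrive, and exposes a narrow interface that a function can call within milliseconds.

    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical sample reported by one drone.
    struct WindSample { double x, y, speed, heading; };

    // Sketch of a bespoke, cluster-hosted "wind map" micro-service.  In a real
    // deployment the updates would flow through a replication library (Derecho
    // or similar) so that every replica applies them in the same order.
    class WindMapService {
        std::map<std::string, WindSample> latest;      // the replicated state
    public:
        // Invoked (via RPC) by a cloud function each time a drone reports in.
        void update(const std::string& drone_id, const WindSample& s) {
            latest[drone_id] = s;                      // the model "learns as it goes"
        }
        // Invoked by a function that needs a flight-plan decision right now.
        WindSample estimate_at(double x, double y) const {
            // placeholder: a real service would interpolate from recent samples
            return {x, y, 0.0, 0.0};
        }
    };

    int main() {
        WindMapService svc;                                // one replica, for illustration
        svc.update("drone-7", {10.0, 20.0, 6.5, 90.0});    // an event arriving via a function
        std::cout << svc.estimate_at(12.0, 18.0).speed << "\n";
    }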

But without far better tools, it will be quite annoying to create these bespoke services, and this, I think, is the big risk to the current IoT edge opportunity: do Microsoft and Amazon actually understand this need, and will they enlarge the coverage of VS Code or Visual Studio (or, in Amazon's case, Cloud9) to "automate" as many aspects of service creation as possible, while still leaving flexibility for the machine-learning developer to introduce the wide range of customizations that her service might require?

What are these automation opportunities?  Some are pretty basic (but that doesn't mean they are easy to do by hand)!  To actually launch a service on a cloud, there needs to be a control file, typically in a JSON format, with various fields taking on the requisite values.  Often these include magically generated 60-hexadecimal-digit keys or other kinds of unintuitive content.  When you use these tools to create other kinds of cloud solutions, they automate those steps.  Doing it by hand, I promise you'll spend an afternoon and feel pretty annoyed by the waste of your time.  A good hour will be lost on those stupid registry keys alone.
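
To give a feel for what doing this by hand involves, here is the sort of control file I have in mind.  Every field name below is made up -- the flavor is what matters: nested boilerplate plus opaque generated keys that are easy to get wrong.

    {
      "_comment": "all field names here are hypothetical",
      "serviceName": "wind-map-service",
      "region": "us-east",
      "replicas": 3,
      "registryKey": "<60-hexadecimal-digit key generated by the portal>",
      "endpoints": [
        { "name": "update",      "protocol": "rpc" },
        { "name": "estimate_at", "protocol": "rpc" }
      ]
    }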

Interface definitions are a need too.  If we want functions and lambdas talking to our new bespoke micro-services ("micro" to underscore that these aren't the big vendor-supplied ones, like CosmosDB), the new micro-service needs to export an interface that the lambda or function can call at runtime.  Again, help is needed!
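
Continuing the sketch from above (again, every name is invented), the interface-definition step is about making the micro-service callable from the function layer, so that a handler can do something like this at runtime -- ideally with the client stub generated by the tooling rather than written by hand:

    #include <iostream>
    #include <string>

    struct WindSample { double x, y, speed, heading; };

    // Hypothetical client stub that tooling would generate from the
    // micro-service's interface definition.
    class WindMapClient {
    public:
        explicit WindMapClient(const std::string& ep) : endpoint(ep) {}
        WindSample estimate_at(double x, double y) {
            // placeholder: a generated stub would marshal the call over the
            // network to the replicated micro-service here
            return {x, y, 5.0, 90.0};
        }
    private:
        std::string endpoint;
    };

    // Inside the lambda/function handler:
    int main() {
        WindMapClient wind("wind-map-service.internal");   // hypothetical endpoint name
        WindSample s = wind.estimate_at(41.3, -73.9);
        std::cout << "wind at waypoint: " << s.speed << " m/s\n";
    }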

In fact the list is surprisingly long, even though the items on it are (objectively) trivial.  The real point isn't that they are hard to do, but rather that they are arcane: each requires hunting for the proper documentation, following some sort of magic incantation, figuring out where to install the script or file, testing your edited version of the example they give, etc.  Here are a few examples:
  • Launch service
  • Authenticate if needed
  • Register the micro-service to accept RPCs
  • There should be an easy way to create functions able to call the service, using those RPC APIs
  • We need an efficient upload path for image objects
  • There will need to be tools for garbage collection (and tools to track space use)
  • … and tools for managing the collection of configuration parameter files and settings for an entire application
  • … and lifecycle tools, for pushing patches and configuration changes in a clean way.
Then there are some more substantial needs:
  • Code debugging support for issues missed in development and then arising at runtime
  • Performance monitoring, hotspot visualization and performance optimization (or even, performance debugging) tools
  • Ways to enable a trusted micro-service to make use of hardware accelerators like RDMA or FPGA even if the end user might not be trusted to do so safely (many accelerators save money and improve performance, but are just not suitable for direct access by hordes of developers with limited skill sets; some could destabilize the data center or crash nodes, and some might have security vulnerabilities).
This makes for a long list, but in my view, a strong development team at Amazon or Microsoft, perhaps allied with a strong research group to tackle the open ended tasks, could certainly succeed.  Success would open the door to mature intelligent edge IoT.  Lacking such tools, though, it is hard not to see edge IoT as being pretty immature today: huge promise, but more substance is needed.
My bet?  Well, companies like Microsoft need periodic challenges to set in front of their research teams.  I remember that when I visited MSR Cambridge back in 2016, everyone was asking what they should be doing as researchers to enable the next steps for the product teams... the capacity is there.  And those market slides I mentioned make it clear: The edge is a huge potential market.  So I think the pieces are in place, and that we should jump on the IoT edge bandwagon (in some cases, “yet again”).  This time, it may really happen!