Thursday 10 October 2019

If the world was a (temporal) database...

I thought I might share a problem we've been discussing in my Cornell PhD-level class on programming at the IoT edge.

Imagine that at some point in the future, a company buys into the idea (my idea!) that we'll really need smart highway systems to ever safely use self-driving cars. A bit like air traffic control, but scaled up.

In fact a smart highway might even offer guidance and special services to "good drivers", such as permission to drive at 85mph in a special lane for self-guided cars and superior drivers... for a fee.  Between the smart cars and the good drivers, there would be a lot of ways to earn revenue here.  And we could even make some kind of dent in the endless gridlock that one sees in places like Silicon Valley from around 6:45am until around 7pm.

So this company, call it Smart Highways Inc, sets out to create the Linux of smart highways: a new form of operating system that the operator of the highway could license to control their infrastructure.

What would it take to make a highway intelligent?  It seems to me that we would basically need to deploy a great many sensors, presumably in a pattern intended to give us some degree of redundancy for fault-tolerance, covering such things as roadway conditions, weather, vehicles on the road, and so forth.

For each vehicle we would want to know various things about it: when it entered the system (for eventual tolls, which will be the way all of this pays for itself), its current trajectory (a path through space-time annotated with speeds and changes in speed or direction), information about the vehicle itself (is it smart?  is the driver subscribed to highway guidance or driving autonomously?), and so forth.

Now we could describe a representative "app": perhaps, for the stretch of CA 101 from San Francisco to San Jose, a decision has been made to document the "worst drivers" over a one month period.  (Another revenue opportunity: this data could definitely be sold to insurance companies!)   How might we do this?  And in particular, how might we really implement our solution?

What I like about this question is that it casts light on exactly the form of Edge IoT I've been excited about.  On the one hand, there is an AI/ML aspect: automated guidance to the vehicles by the highway, and in this example, an automated judgment about the quality of driving.  One would imagine training an expert system to take trajectories as input and output a quality metric: a driver swerving between other cars at high speed, accelerating and turning abruptly, braking abruptly, etc.: all the hallmarks of poor driving!
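
To make that concrete, here's a minimal sketch of the kind of feature extraction such an expert system might sit on top of.  Everything here is invented for illustration -- the Sample type, the thresholds, the feature names -- and a real system would learn the scoring rather than hard-code it:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float      # seconds since entering the highway
    x: float      # meters along the roadway
    y: float      # meters across it (lane position)
    speed: float  # meters/second

def driving_features(traj):
    """Crude per-trajectory counts of abrupt maneuvers (thresholds invented)."""
    hard_brakes = hard_accels = swerves = 0
    for a, b in zip(traj, traj[1:]):
        dt = b.t - a.t
        if dt <= 0:
            continue
        accel = (b.speed - a.speed) / dt   # longitudinal, m/s^2
        lateral = abs(b.y - a.y) / dt      # lateral drift, m/s
        if accel < -4.0:
            hard_brakes += 1
        elif accel > 3.0:
            hard_accels += 1
        if lateral > 1.5:
            swerves += 1
    return {"hard_brakes": hard_brakes, "hard_accels": hard_accels,
            "swerves": swerves}
```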

But if you think more about this you'll quickly realize that to judge quality of driving you need a bit more information.  A driver who swerves in front of another car with inches to spare, passes when there are oncoming vehicles, causes others to brake or swerve to avoid collisions -- that driver is far more of a hazard than a driver who swerves suddenly to avoid a pothole or some other form of debris, or one who accelerates only while passing, and passes only when nobody is anywhere nearby in the passing lane.  A driver who stays to the right "when possible" is generally considered to be a better driver than one who lingers in the left lane, if the highway isn't overly crowded.

A judgment is needed: was this abrupt action valid, or inappropriate?  Was it good driving that evaded a problem, or reckless driving that nearly caused an accident?

So in this you can see that our expert system will need rich context information.  We would want to compute the set of cars near each vehicle's trajectory, and would want to be able to query the trajectories of those cars to see if they were forced to take any kind of evasive action.  We need to synthesize metrics of "roadway state" such as crowded or light traffic, perhaps identify bunches of cars (even on a lightly driven road we might see a grouping of cars that effectively blocks all the lanes), etc.  Road surface and visibility clearly are relevant, and so is roadway debris.  We would need some form of composite model covering all of these considerations.
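
As a toy illustration of that context computation, one could imagine queries along these lines (reusing the hypothetical Sample type from the sketch above; the radius, window, and braking threshold are all made up):

```python
import math

def nearby_vehicles(ego, others, radius=15.0, window=1.0):
    """Other-car samples within `radius` meters and `window` seconds of the
    ego vehicle's sample; `others` is an iterable of (vehicle_id, Sample)."""
    return [(vid, s) for vid, s in others
            if abs(s.t - ego.t) <= window
            and math.hypot(s.x - ego.x, s.y - ego.y) <= radius]

def took_evasive_action(traj, brake_threshold=-4.0):
    """A stand-in for a much richer classifier: did this trajectory
    include hard braking?"""
    return any((b.speed - a.speed) / (b.t - a.t) < brake_threshold
               for a, b in zip(traj, traj[1:]) if b.t > a.t)
```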

I could elaborate, but I'm hoping you can already see that we are looking at a very complicated real-time database question.  What makes it interesting to me is that on the one hand, it clearly does have a great deal of relatively standard structure (like any database): a schema listing information we can collect for each vehicle (of course some "fields" may be populated for only some cars...), one for each segment of the highway, perhaps one for each driver.  When we collect a set of documentation on the bad drivers of the month, we end up with a database with bad-driver records in it: each linked to a vehicle and a driver (after all, a few people might share one vehicle and perhaps only some of the drivers are reckless), and then a series of videos or trajectory records demonstrating some of the "all time worst" behavior by that particular driver over the past month.
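
A first cut at that schema might look something like the following sketch.  The record types and field names are mine, invented for illustration, not anything Smart Highways Inc would actually ship:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VehicleRecord:
    vehicle_id: str
    entered_at: float                 # for eventual tolls
    is_smart: Optional[bool] = None   # some "fields" exist for only some cars
    guidance_subscriber: Optional[bool] = None

@dataclass
class BadDriverRecord:
    driver_id: str                    # linked to the driver, not just the car
    vehicle_id: str
    evidence: list = field(default_factory=list)  # worst clips of the month
```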

But on the other hand, notice that our queries also have an interesting form of locality: they are most naturally expressed as predicates over a series of temporal events that make up a particular trajectory, or a particular set of trajectories: on this date at this time, vehicle such-and-such swerved to pass an 18-wheeler truck on the right, then dove three lanes across to the left (narrowly missing the bumper of a car as it did so), accelerated abruptly, braked just as suddenly and dove to the right...  Here, I'm describing some really bad behavior, but the behavior is best seen as a time-linked (and driver/vehicle-linked) series of events that are easily judged as "bad" when viewed as a group, and yet that actually would be fairly difficult to extract from a completely traditional database in which our data is separated into tables by category.  
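
One way to see the difficulty: even a toy version of "does this driver's event stream contain that swerve-accelerate-brake pattern?" ends up as a hand-written subsequence matcher rather than a declarative query.  A rough sketch, with invented event names:

```python
def matches_sequence(events, pattern, max_span=15.0):
    """Greedy first-match check: does the time-ordered list of (t, kind)
    events contain `pattern` as a subsequence within `max_span` seconds?
    A real system would want a proper complex-event-processing engine."""
    idx, start = 0, None
    for t, kind in events:
        if kind == pattern[idx]:
            if idx == 0:
                start = t
            idx += 1
            if idx == len(pattern):
                return t - start <= max_span
    return False

# e.g. matches_sequence(ev, ("pass_on_right", "swerve_left",
#                            "hard_accel", "hard_brake", "swerve_right"))
```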

Standard databases (and even temporal ones) don't offer particularly good ways to abstract these kinds of time-related event sequences if the events themselves are from a very diverse set.  The tools are quite a bit better for very regular structures, and for time-series data with identical events -- and that problem arises here, too.  For example, when computing a vehicle trajectory from identical GPS records, we are looking at a rather clean temporal database question, and some very good work has been done on this sort of thing (check out TimescaleDB, created by students of my friend and colleague, Mike Freedman!).  But the full-blown problem clearly is the very diverse version, and it is much harder.  I'm sure you are thinking about one level of indirection and so forth, and yes, this is how I might approach such a question -- but it would be hard even so.

In fact, is it a good idea to model a temporal trajectory as a database relation?  I suspect that it could be, and that representing the trajectory that way would be useful, but this particular kind of relation just lists events and their sequencing.  Think also about this issue of event type mentioned above: here we have events linked by the fact that they involve some single driver (directly or perhaps indirectly -- maybe quite indirectly).  Each individual event might well be of a different type: the data documenting "caused some other vehicle to take evasive action" might depend on the vehicle, and the action, and would be totally different from the data documenting "swerved across three lanes" or "passed a truck on its blind side."
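
Here's what that one level of indirection might look like in the simplest possible form: a single event relation with a small typed core and an untyped per-kind payload.  Again, all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Event:
    t: float
    driver_id: str
    vehicle_id: str
    kind: str                  # "swerved_across_lanes", "forced_evasion", ...
    payload: dict[str, Any]    # per-kind detail, e.g. the other vehicle involved

def trajectory_relation(events, driver_id):
    """The trajectory-as-relation: the time-ordered events for one driver."""
    return sorted((e for e in events if e.driver_id == driver_id),
                  key=lambda e: e.t)
```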

Even explaining the relationships and causality can be tricky: Well, car A swerved in front of car B, which braked, causing C to brake, causing D to swerve and impact E.  A is at fault, but D might be blamed by E!

In fact, as we move through the world -- you or me, as drivers of our respective vehicles, or for that matter even as pedestrians trying to cross the road -- this business of building self-centered, domain-specific temporal databases seems to be something we do very commonly, and yet don't model particularly well in today's computing infrastructures.  Moreover, you and I are quite comfortable with highways that might carry cars and trucks, motorcycles, double-length trucks, police enforcement vehicles, ambulances, construction vehicles... extremely distinct "entities" that are all capable of turning up on a highway.  Our standard ways of building databases seem a bit overly structured for dealing with this kind of information.

Think next about IoT scaling.  If we had just one camera, aimed at one spot on our highway, we could still do some useful tasks with it: we could, for example, equip it with a radar speed detector that triggers photos, and use those to automatically issue speeding tickets, as they do throughout Europe.  But the task I described above fuses information from what may be tens of thousands of devices deployed along the nearly 50-mile stretch I specified, and that highway could have a quarter-million vehicles on it at peak commute hours.

As a product opportunity, Smart Highways Inc is looking at a heck of a good market -- but only if they can pull off this incredible scaling challenge.  They won't simply be applying their AI/ML "driving quality" evaluation to individual drivers, using data from within the car (that easier task is the one Hari Balakrishnan's Cambridge Mobile Telematics has tackled, and even that problem had his company valued in the billions as of round A).  Smart Highways Inc is looking at the cross-product version of that problem: combining data across a huge number of sensor inputs, fusing the knowledge gained, and eventually making statements that involve observations taken at multiple locations by distinct devices.  Moreover, we would be doing this at highway scale, concurrently, for all of the highway all of the time.

In my lecture today, we'll be talking about MapReduce, or more properly, the Spark/Databricks version of Hadoop, which combines an open-source version of MapReduce with extensions that maximize the quality of in-memory caching and introduce a big-data analytic ecosystem.  The aspect of interest to me is the caching mechanism:  Spark centers on a kind of cacheable query object called a Resilient Distributed Dataset, or RDD.  An RDD describes a scalable computation designed to be applicable across a sharded dataset, which enables a form of SIMD computing at the granularity of files or tensors being processed on huge numbers of compute nodes in a datacenter.
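
For readers who haven't seen the style, here is a minimal PySpark sketch (the input pairs and event kinds are invented): the same few lines run unchanged whether the data sits on one node or ten thousand, and cache() marks the RDD as a reusable, cacheable query object:

```python
from pyspark import SparkContext

sc = SparkContext(appName="bad-driver-counts")

# Hypothetical input: (driver_id, event_kind) pairs extracted upstream.
events = sc.parallelize([
    ("d1", "hard_brake"), ("d2", "swerve"), ("d1", "swerve"),
])

risky = (events
         .filter(lambda kv: kv[1] in {"swerve", "hard_brake"})
         .map(lambda kv: (kv[0], 1))
         .reduceByKey(lambda a, b: a + b)   # risky events per driver
         .cache())                          # a cacheable query object

print(risky.collect())
```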

The puzzle for my students, which we'll explore this afternoon, is whether the RDD model could be transferred out of the batched, non-real-time settings in which it normally runs (settings that are, more than that, functional: Spark treats every computation as a series of read-only data transformation steps, from a static batched input set through a series of MapReduce stages to a sharded, distributed result).  So our challenge is: could a graph of RDDs and an iterative compute model express tasks like the Smart Highway ones?

RDDs are really a linkage between database models and a purely functional, Lisp-style Map and Reduce computing model.  I've always liked them, although my friends who do database research tend to view them dimly, at best.  They often feel that all of Spark is doing something a pure database could have done far better (and perhaps more easily).  Still, people vote with their feet, and for whatever reason this RDD + computing style of coding is popular.

So... could we move RDDs to the edge?  Spark itself, clearly, wouldn't be the proper runtime: it works in a batched way, and our setting is event-driven, with intense real-time needs.  It might also entail taking actions in real time (even pointing a camera or telling it to take a photo, or to upload one, is an action).  So Spark per se isn't quite right here.  Yet Spark's RDD model feels appropriate.  TensorFlow uses a similar model, by the way, so I'm being unfair when I treat this as somehow Spark-specific.  I just have more direct experience with Spark, and additionally, see Spark RDDs as a pretty clear match to the basic question of how one might start to express database queries over huge IoT sensor systems with streaming data flows.  TensorFlow has many uses, but I've seen far more work on using it within a single machine, to integrate a local computation with some form of GPU or TPU accelerator attached to that same host.  And again, I realize that this may be unfair to TensorFlow.  (And beyond that I don't know anything at all about Julia, yet I hear that system name quite often lately...)

Anyhow, back to RDDs.  If I'm correct, maybe someone could design an IoT Edge version of Spark, one that would actually be suitable for connecting to hundreds of thousands of sensors, and that could really perform tasks like the one outlined earlier in real-time.  Could this solve our problem?  It does need to happen in real-time: a smart highway generates far too much data per second to keep much of it, so a quick decision is needed that we should document the lousy driving of vehicle A when driver so-and-so is behind the wheel, because this person has caused a whole series of near accidents and actual ones -- sometimes, quite indirectly, yet always through his or her recklessness.  We might need to make that determination within seconds -- otherwise the documentation (the raw video and images and radar speed data) may have been discarded.
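
A toy way to picture that constraint: raw evidence lives in a short ring buffer, and the "document this driver" decision has to land before the buffer overwrites it.  The horizon and interface below are invented:

```python
from collections import deque

class EvidenceBuffer:
    """Raw clips survive only `horizon` seconds unless flagged in time."""
    def __init__(self, horizon=30.0):
        self.horizon = horizon
        self.clips = deque()              # (timestamp, raw_clip) pairs

    def append(self, t, clip):
        self.clips.append((t, clip))
        while self.clips and t - self.clips[0][0] > self.horizon:
            self.clips.popleft()          # unflagged evidence is gone for good

    def flag(self, t0, t1):
        """Persist everything in [t0, t1]; only works if we decide quickly."""
        return [clip for t, clip in self.clips if t0 <= t <= t1]
```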

If I were new to the field, this is the problem I personally might tackle.  I've always loved problems in systems, and in my early career, systems meant databases and operating systems.  Here we have a problem of that flavor.

Today, however, scale and data rates and the sheer size of data objects are transforming the game.  The kind of system needed would span entire datacenters, and we would need to use accelerators on the data path to have any chance at all of keeping up.  So we have a mix of old and new... just the kind of problem I would love to study, if I were hungry for a hugely ambitious undertaking.  And who knows... if the right student knocks on my door, I might even tackle it.