Saturday 22 June 2019

Data everywhere but only a drop to drink...

One peculiarity of the IoT revolution is that it may explode the concept of big data.

The physical world is a domain of literally infinite data -- no matter how much we might hope to capture, at the very most we see only a tiny set of samples from an ocean of inaccessible information.  Perhaps we had no sensor in the proper place, or didn't sample at the proper instant, or didn't have the camera pointing in the right direction, focused, and ready to snap the photo; perhaps we lacked bandwidth for the upload, or had no place to store the data and had to discard it, or misclassified it as "uninteresting" because the filters used to make those decisions weren't parameterized to sense the event the photo was showing.

Meanwhile, our data-hungry machine learning algorithms currently don't deal with the real world: they operate on snapshots, often ones collected ages ago.  The puzzle will be to find a way to compute on this incredible ocean of currently inaccessible data while the data is still valuable: a real-time constraint.  Time matters because in so many settings, conditions change extremely quickly (think of a smart highway, offering services to cars that are whizzing along at 85 mph).

By computing at the back end, AI/ML researchers have baked in very unrealistic assumptions, and today's machine learning systems have become heavily skewed as a result: they are very good at dealing with data acquired months ago and painstakingly tagged by an army of workers, and fairly good at using the resulting models to make decisions within a few tens of milliseconds, but in a sense they treat the acquisition and real-time processing of data as part of the (offline) learning side of the game.  In fact, many existing systems wouldn't even work if they couldn't iterate for minutes (or longer) on data sets, and many need that data to be preprocessed in various ways, perhaps cleaned up, perhaps preloaded and cached in memory, so that a hardware accelerator can rip through the needed operations.  If a smart highway were capturing data now that we wanted to use to relearn vehicle trajectories, so that we could react to changing conditions within fractions of a second, many aspects of this standard style of computing would have to change.

To me this points to a real problem for those intent on using machine learning everywhere and as soon as possible, but also a great research opportunity.  Database and machine learning researchers need to begin to explore a new kind of system in which the data available to us is understood to be a "skim" -- a term I learned when I used to work with high-performance computing teams in scientific computing settings where data was getting big decades ago.  For example, the CERN particle accelerators capture far too much data to move all of it off the sensors, so even uploading "raw" data involves deciding which portions to keep, which to sample randomly, and which to ignore completely.
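To make the idea concrete, here is a minimal sketch of the kind of skim policy described above.  The function names and the 1% sampling rate are my own illustration; real trigger systems (at CERN or elsewhere) are far more elaborate, but the shape of the decision is the same: keep what the filter flags, randomly sample a little of the rest, and discard everything else at the sensor.

```python
import random

def skim(readings, is_interesting, sample_rate=0.01):
    """Yield the subset of raw readings worth uploading."""
    for r in readings:
        if is_interesting(r):
            yield r                           # always keep events the filter flags
        elif random.random() < sample_rate:
            yield r                           # keep a small random sample of the rest
        # everything else is silently discarded at the sensor
```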

Beyond this issue of deciding what to include in the skim, there is the whole puzzle of supporting a dialog between the machine-learning infrastructure and the devices.  I mentioned examples in which one needs to predict that a photo of such-and-such a thing would be valuable, anticipate the timing, point the camera in the proper direction, pre-focus it (perhaps on an expected object that isn't yet in the field of view, so that auto-focus wouldn't be useful because the thing we want to image hasn't yet arrived), capture the image, and then process it -- all under real-time pressure.
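Purely as a sketch, the device-side half of that dialog might look something like the following.  Every helper here (predict_arrival, aim, prefocus, capture, summarize) is hypothetical; the point is only that each stage has to fit inside an end-to-end deadline.

```python
import time

def capture_on_cue(camera, model, deadline_s=0.5):
    """Hypothetical device-side loop: aim, pre-focus, capture, and summarize
    before the result stops being useful."""
    start = time.monotonic()
    cue = model.predict_arrival()          # when and where the interesting object should appear
    camera.aim(cue.direction)              # point the camera before the object arrives
    camera.prefocus(cue.expected_range)    # auto-focus won't help: nothing is in frame yet
    frame = camera.capture(at=cue.eta)     # grab the image at the predicted instant
    if time.monotonic() - start > deadline_s:
        return None                        # missed the real-time budget; discard
    return model.summarize(frame)          # on-device processing before any upload
```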

I've always been fascinated by the emergence of new computing areas.  To me this looks like one ripe for exploration.  It wouldn't surprise me at all to see an ACM Symposium on this topic, or an ACM Transactions journal.  Even at a glance one can see all the elements: a really interesting open problem that would lend itself to a theoretical formalization, but also one that will require substantial evolution of our platforms and computing systems.  The area is clearly of high real-world importance and offers a real opportunity for impact, and a chance to build products.  And it emerges at a juncture between systems and machine learning: a trending topic even now, so that this direction would play into gradually building momentum at the main funding agencies, which rarely can pivot on a dime, but are often good at following opportunities in a more incremental, thoughtful way.

The theoretical question would run roughly as follows.  Suppose that I have a machine-learning system that lacks knowledge required to perform some task (this could be a decision or a classification, or might involve some other goal, such as finding a path from A to B).  The system has access to sensors, but there is a cost associated with using them (energy, repositioning, etc.).  Finally, we have some metric for data value: a hypothesis concerning the data we are missing that tells us how useful a particular sensor input would be.  Then we can ask which data to capture next so as to minimize cost while maximizing value.  Given a solution to this one-shot problem, we would then want to explore the continuous version, in which newly captured data updates these model elements; fixed points for problems that are static; and quality of tracking for cases where the underlying data is evolving.
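In toy form (my own notation, not a standard formulation), the one-shot version reduces to scoring each candidate capture by its estimated value net of its cost; the continuous version would wrap this in a loop in which each new capture updates the value estimates.

```python
def choose_next_capture(candidates):
    """candidates: iterable of (sensor_action, estimated_value, cost) triples,
    where estimated_value comes from the hypothesis about the missing data."""
    best_action, best_score = None, float("-inf")
    for action, value, cost in candidates:
        score = value - cost                # net benefit of taking this reading
        if score > best_score:
            best_action, best_score = action, score
    return best_action if best_score > 0 else None   # capture nothing if no reading pays off
```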

The practical systems-infrastructure and O/S questions center on the capabilities of the hardware and the limitations of today's Linux-based operating system infrastructure, particularly in combination with existing offloaded compute accelerators (FPGA, TPU, GPU, even RDMA).  Today's sensors run the gamut from really dumb fixed devices that don't even have storage to relatively smart sensors that can do various tasks on the device itself, and that have storage and some degree of intelligence about how to report data.  Future sensors might go further, with the ability to download logic and machine-learned models for making such decisions: I think it is very likely that we could program a device to point the camera at such-and-such a lane on the freeway, wait for a white vehicle moving at high speed that should arrive in the period [T0,T1], obtain a well-focused photo showing the license plate and current driver, and then report the image capture accompanied by a thumbnail.  It might even be reasonable to talk about pre-focusing, adjusting the spectral parameters of the imaging system, selecting from a set of available lenses, and so forth.
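One could imagine the downloaded logic taking the form of a small task specification along these lines; the field names and units are invented for illustration and don't correspond to any real sensor API.

```python
from dataclasses import dataclass

@dataclass
class CapturePlan:
    lane: int                   # which freeway lane to watch
    target_description: str     # e.g. "white vehicle moving at high speed"
    window_start: float         # T0: earliest expected arrival (epoch seconds)
    window_end: float           # T1: latest expected arrival
    focus_distance_m: float     # pre-focus distance, chosen before the vehicle is in frame
    report_thumbnail: bool      # send a thumbnail along with the capture notification

def is_armed(plan: CapturePlan, now: float) -> bool:
    """True while the device should be watching for the target."""
    return plan.window_start <= now <= plan.window_end
```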

Exploiting all of this will demand a new ecosystem that combines elements of machine learning on the cloud with elements of controllable logic on the sensing devices.  If one thinks about the way that we refactor software, here we seem to be looking at a larger-scale refactoring in which the machine learning platform on the cloud, with "infinite storage and compute" resources, has the role of running the compute-heavy portions of the task, but where the sensors and the other elements of the solution (things like camera motion control, dynamic focus, and so on) would need to participate in a cooperative way.  Moreover, since we are dealing with entire IoT ecosystems, one has to visualize doing this at huge scale, with lots of sensors, lots of machine-learned models, and a shared infrastructure that imposes limits on communication bandwidth and latency, computing at the sensors, battery power, storage, and so forth.

It would probably be wise to keep as much of the existing infrastructure as feasible.  So perhaps that smart highway will need to compute "typical patterns" of traffic flow over a long time period with today's methodologies (no time pressure there), compute current vehicle trajectories over mid-term time periods using methods that work within a few seconds, and then deal with instantaneous context (a car suddenly swerves to avoid a rock that just fell from a dump truck onto the lane) as an ultra-urgent real-time learning task that splits into the instantaneous part ("watch out!"), a longer-term part ("warning: obstacle in the road 0.5 miles ahead, left lane"), and an even longer-term one ("at mile 22, northbound, left lane, anticipate roadway debris").  This kind of hierarchy of temporality is missing in today's machine learning systems, as far as I can tell, and the more urgent forms of learning and reaction will require new tools.  Yet we can preserve a lot of existing technology as we tackle these new tasks.
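A crude way to picture that hierarchy (the tier names and latency budgets below are mine, purely for illustration) is as a set of learning and reaction loops that differ mainly in how much time each is allowed to take.

```python
# Three tiers of the hypothetical smart-highway example, each with a latency budget in seconds.
TIERS = [
    ("instantaneous", 0.1),    # "watch out!" -- fractions of a second
    ("short-term",    5.0),    # "obstacle in the road 0.5 miles ahead, left lane"
    ("long-term",  3600.0),    # "at mile 22, northbound, anticipate roadway debris"
]

def dispatch(observation, handlers):
    """Hand the same observation to every tier, along with its latency budget,
    so each loop can decide how much work it can afford to do in time."""
    for name, budget_s in TIERS:
        handlers[name](observation, budget_s)
```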

Data is everywhere... and that isn't going to change.  It is about time that we tackle the challenge of building systems that can learn to discover context, and use current context to decide what to "look more closely" at, and with adequate time to carry out that task.  This is a broad puzzle with room for everyone -- in fact you can't even consider tackling it without teams that include systems people like me as well as machine learning and vision researchers.  What a great puzzle for the next generation of researchers!
