Context: Temporal Data Mining is Hard Today
Everyone is talking about the Internet of Things (IoT), but fast, scalable IoT solutions can be difficult to create. IoT applications often require real-time analysis on data streams, but most software is designed to just read data from existing files. As a result, the developer may be forced to create entirely new applications that read data continuously. Developers who would rather write scripts that assemble new solutions from existing analytic tools find this frustrating and time-consuming.
In fact, this problem has been troublesome to computing professionals for decades. There is an enormous variety of prebuilt software that can do almost any imaginable computation provided that the input is in files (for example tables represented in comma-separated value format, or files that list a series of values from some kind of sensor, or files written one per input, such as photos captured from an image sensor). In contrast, there is little prebuilt software for incremental data feeds (settings in which programs need to run continuously and accept one new input at a time).
This is why so many developers prefer to store data in files, then carry out any needed analysis on the files. But here we run into another issue: in today’s systems, the delay (latency) of the store-then-compute style of analysis is often very high. As an answer to this problem, my team recently created a system we call the Freeze-Frame File System. FFFS is able to accept streams of updates while fulfilling “temporal reads” on demand. By running analytics on the files that FFFS captures, we can get the best of both worlds: high productivity through reuse of existing code, and ultra-low latency. The key is that FFFS bridges between a model of continuously writing data into files and one of performing analysis on snapshots representing instants in time (much as if it were a backup system creating a backup after every single file update operation).
- The developer sees a standard file system.
- Snapshots are available and will materialize past file state at any time desired.
- The snapshots don’t need to be planned ahead of time, and look to the application like file system directories (folders), with the snapshot time embedded in the name.
- Applications don’t need any modifications at all, and can run on these snapshots with extremely low delay between when data is captured and when the analysis runs.
- The file system uses RDMA data transfers for extremely high speed.
The result is that even though FFFS looks like a standard file system, and supports a standard style of program that just stores data into files and then launches analytics that use those files as input, performance is often better than that of a custom data streaming solution. Developers can keep their existing file-oriented data analytics and still get the performance of a custom-built streaming solution.
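Because a snapshot looks like an ordinary directory with the time embedded in its name, an unmodified program only needs a path to read past state. The sketch below shows the idea in Python. The mount point, the `.snapshot` directory, and the timestamp format are all assumptions made up for illustration, not the naming scheme FFFS actually uses; check the FFFS documentation for the real layout.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# ASSUMPTION: this layout (<mount>/.snapshot/<UTC timestamp>/<file>) is
# hypothetical and only illustrates "snapshot time embedded in the name".
SNAPSHOT_DIR = ".snapshot"


def snapshot_path(mount, when, relative):
    """Return the path of `relative` as it existed at time `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    return PurePosixPath(mount) / SNAPSHOT_DIR / stamp / relative


# Ask for the state of one sensor file at a specific instant.
when = datetime(2016, 5, 1, 12, 0, 0, 250000, tzinfo=timezone.utc)
path = snapshot_path("/fffs", when, "grid/cell_042.dat")
print(path)
```

An existing analytic would then open that path exactly as it opens any other file; nothing in the program needs to know that the directory was materialized on demand.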
Cornell’s Freeze-Frame File System in Action
To illustrate the kind of computing I have in mind, here’s an example that arises in the smart grid. We simulated a wave propagating through a very simple mesh network, and generated 100 10x10 image streams, as if each cell in the mesh were monitored by a distinct camera, including a timestamp for each of these tiny images. We streamed the images in a time-synchronized manner to a set of data collectors over TCP, using a separate connection for each stream (mean RTT was about 10ms). Each data collector simply writes incoming data to files. Finally, to create a movie, we extracted the data for the time appropriate to each frame (trusting the file system to return the proper data), fused the data, then repeated for the next frame. Figures 1-3 show representative output.
Example of a real-time IoT application. The full animations are available here.
The image on the left used the widely popular HDFS system to store the files. HDFS has a built-in snapshot feature; we triggered snapshots once every 100 milliseconds (HDFS snapshots must be planned in advance and requested at the exact time the snapshot should cover). Then we rendered the data included in that snapshot to create each frame of the movie. Even from the single frame shown, you should be able to instantly see that this version is of low quality: HDFS is oblivious to timestamps and hence often mixes frames from different times.
In the middle case, we ran the same application but now saved the files into FFFS, configured to assume that each update occurred at the time the data reached the data storage node. Then we used FFFS snapshots to make the movie. You can immediately see the improvement: this is because FFFS understands both time and data consistency (if you run the animation, however, you will see that this version isn't perfect either). Another nice feature is that FFFS snapshots don’t need to be preplanned: you just ask for data at any desired time.
Finally, on the right, we again used FFFS, but this time configured it to extract time directly from the original image by providing a datatype-specific plug-in. (Some programming may be needed to create a new plug-in if the format of your sensor data is one we haven’t worked with previously.) Now the movie is perfect, although if data had shown up very late (in particular, after we made the snapshots that were combined into the movie frames), we could have seen glitches here too. Still, this is getting quite good.
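The difference between the middle and right-hand runs comes down to which timestamp indexes each update. Here is a toy in-memory model of a temporal read in Python: a read at time t returns, for each cell, the latest update stamped at or before t. This is a sketch of the idea only, not how FFFS is implemented; the `TemporalStore` class, the sample updates, and the link delays are all invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


class TemporalStore:
    """Toy model of temporal reads (illustrative, not FFFS internals)."""

    def __init__(self):
        self._times = defaultdict(list)   # key -> sorted timestamps
        self._values = defaultdict(list)  # key -> values aligned with _times

    def write(self, key, timestamp, value):
        # Insert in timestamp order so out-of-order arrivals sort correctly.
        i = bisect_right(self._times[key], timestamp)
        self._times[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def read_at(self, key, t):
        # Latest value stamped at or before t, or None if there is none.
        i = bisect_right(self._times.get(key, []), t)
        return self._values[key][i - 1] if i else None


# (cell, sensor_time_ms, arrival_time_ms, value); cell "b" has a slower link.
updates = [
    ("a", 0, 5, "a@0"), ("b", 0, 40, "b@0"),
    ("a", 100, 105, "a@100"), ("b", 100, 140, "b@100"),
]

by_arrival, by_sensor = TemporalStore(), TemporalStore()
for cell, sensor_t, arrival_t, value in updates:
    by_arrival.write(cell, arrival_t, value)  # like the middle movie
    by_sensor.write(cell, sensor_t, value)    # like the right-hand movie


def frame(store, t):
    """Fuse one movie frame from the per-cell state at time t."""
    return {cell: store.read_at(cell, t) for cell in ("a", "b")}


mixed = frame(by_arrival, 110)   # cell "b" still shows its older reading
aligned = frame(by_sensor, 110)  # both cells show the reading from t=100
print(mixed, aligned)
```

Indexing by arrival time mixes readings from different instants whenever one link is slower than another, which matches the residual glitches in the middle movie. Indexing by the sensor's own timestamp lines the cells up, provided the data has arrived by the time the frame is built.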
With this image in mind, now imagine that rather than making a movie, the plan was to run an advanced smart-grid data analytic on the captured data. If a file system can’t support making a movie, as in Figure 1, it clearly wouldn’t work well for other time-sensitive computations, either. In effect, the HDFS style of file system fights against the IoT application. FFFS, by understanding time, offers a far superior environment for doing these kinds of smart analytics.
Why the Internet of Things Needs a Real-time Cloud
I hope you’ll check out our paper on FFFS and maybe even download and use the system itself. We believe it opens the door to a whole new style of data capture and computing. It also enables temporal forensics: you use FFFS to create an archive of data, then mine that data later to explain something unexpected that left you puzzled, looking back in time to see when the issue first arose. But rather than limit myself to FFFS here, I want to offer a totally different thought: maybe it is time for other parts of the cloud to embrace time, just as we did in FFFS.
Here are some reasons time could matter, a lot, in the future cloud. First, the world is gravitating to a wider and wider range of applications like the smart grid. Smart cars are an example: they will have autonomous onboard controllers but will probably often be paired with cloud-hosted services that run on behalf of the car and help it pick a route, point out nearby restaurants and book tables, notice if a road suddenly gets blocked and replan your travels, or book you into a hotel if the weather becomes dangerously bad. And I think these will be highly available cloud-hosted real-time applications, for lots of reasons, not least being that they need to be super responsive.
Banking and other investment trading systems will operate more and more in this mode, and so will the systems controlling smart homes and smart cities and smart factories, and the list really goes on and on.
But once you imagine a whole world of time-based computing, with deadlines and scheduling and priorities, data replicated at high speeds for availability if a fault occurs, parallel computing in the loop so that machine learning can keep up with a rapidly changing external world, you aren’t talking about today’s asynchronous cloud anymore. CAP (the claimed tradeoff that weakens consistency to favor availability and partition tolerance) will be pushed to the side in favor of solutions like FFFS with strong consistency and strong temporal guarantees.
I do see hard puzzles: we need to scale this stuff up and make it handle georeplication too. We also need to integrate FFFS with the data analytic tools in systems like Microsoft Azure, Google and Amazon. So we researchers have our work cut out for us. But I can also see why you’ll definitely want FFFS not just now, but long into the future, and all these other goodies that really need to surround it in a grown-up real-time cloud!