Cloud computing systems ignore time, and this is a mistake.
Context: Temporal Data Mining is Hard Today
Everyone is talking about the Internet of Things
(IoT), but fast, scalable IoT solutions can be difficult to create. IoT applications often require real-time
analysis on data streams, but most software is designed to just read data from
existing files. As a result, the
developer may be forced to create entirely new applications that read data
continuously. Developers who would rather
write scripts that assemble new solutions from existing analytic tools
find this frustrating and time-consuming.
In fact, this problem has troubled
computing professionals for decades.
There is an enormous variety of prebuilt software that can do almost any
imaginable computation provided that the
input is in files (for example tables represented in comma-separated value
format, or files that list a series of values from some kind of sensor, or
files written one per input, such as photos captured from an image
sensor). In contrast, there is little
prebuilt software for incremental data feeds (settings in which programs need
to run continuously and accept one new input at a time).
This is why so many developers prefer to
store data into files, then carry out any needed analysis on the files. But here we run into another issue: in
today’s systems, the delay (latency) of the store-then-compute
style of analysis is often very high. As an answer to this problem, my team
recently created a system we call the Freeze-Frame File System. FFFS is able to
accept streams of updates while fulfilling “temporal reads” on demand. By running analytics on the files that FFFS
captures, we can get the best of both worlds: high productivity through reuse
of existing code, and ultra-low latency.
The key is that FFFS bridges between a model of continuously writing
data into files and one of performing analysis on snapshots representing
instants in time (much as if it were a backup system creating a
backup after every single file update operation).
- The developer sees a standard file system.
- Snapshots are available on demand and materialize the file system state as of any desired past time.
- The snapshots don’t need to be planned ahead of time, and look to the application like file system directories (folders), with the snapshot time embedded in the name.
- Applications don’t need any modifications at all, and can run on these snapshots with extremely low delay between when data is captured and when the analysis runs.
- The file system uses RDMA data transfers for extremely high speed.
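To make the idea of a temporal read concrete, here is a toy in-memory sketch in Python. It is not FFFS code and says nothing about how FFFS is implemented: the class `TemporalFile` and its `write` and `read_at` methods are hypothetical names invented for illustration. The only point is that if every update is kept along with its timestamp, a reader can later ask for the file contents as of any instant, without having planned the snapshot in advance.

```python
import bisect


class TemporalFile:
    """Toy model of a file that remembers every version (hypothetical, not the FFFS API)."""

    def __init__(self):
        self._times = []     # update timestamps, kept sorted
        self._versions = []  # file contents after each update, same order

    def write(self, timestamp, data):
        # Keep the history sorted by time, even if updates arrive out of order.
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._versions.insert(i, data)

    def read_at(self, timestamp):
        # Return the newest version written at or before `timestamp`,
        # or None if the file did not exist yet at that instant.
        i = bisect.bisect_right(self._times, timestamp)
        return self._versions[i - 1] if i else None


f = TemporalFile()
f.write(1.0, b"first")
f.write(3.0, b"third")
f.write(2.0, b"second")   # arrives late, but is slotted in by timestamp
state = f.read_at(2.5)    # contents as of time 2.5
```

An unmodified analytic would not call anything like `read_at` directly; in the model described above it would simply open files in a snapshot directory, and the file system would do the equivalent lookup on its behalf.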
The result is that even though FFFS
looks like a standard file system, and supports a standard style of program that
just stores data into files and then launches analytics that use those files as
input, performance is often better than that of a custom data streaming
solution. This gives developers the best
of two worlds: they can leverage existing file-oriented data analytics, but
gain the performance of a custom-built data streaming solution!
Cornell’s Freeze-Frame File System in Action
To illustrate the kind of computing I have
in mind, here’s an example that arises in the smart grid. We simulated a wave propagating through a
very simple mesh network, and generated 100 10x10 image streams, as if each
cell in the mesh were monitored by a distinct camera, including a timestamp for
each of these tiny images. We streamed the images in a time-synchronized manner
to a set of data collectors over TCP, using a separate connection for each
stream (mean RTT was about 10ms). Each data collector simply writes incoming
data to files. Finally, to create a movie, we extracted the data for the time
appropriate to each frame (trusting the file system to return the proper data),
fused the data, then repeated for the next frame. In Figures 1-3 we see representative output.
Example of a real-time IoT application. The full animations are available here.
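The movie-making loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code we ran: `read_at(stream_id, t)` is a hypothetical stand-in for "read this stream's file from the snapshot for time t", and the sketch assumes the 100 tiles are laid out as a 10x10 grid in row-major order.

```python
def fuse_tiles(tiles, grid=10, tile=10):
    """Assemble grid*grid tiles (each a tile x tile list of rows) into one image.

    Assumes a row-major layout of tiles; the real layout is an assumption here.
    """
    size = grid * tile
    image = [[0] * size for _ in range(size)]
    for index, pixels in enumerate(tiles):
        row0 = (index // grid) * tile   # top edge of this tile in the image
        col0 = (index % grid) * tile    # left edge of this tile in the image
        for r in range(tile):
            for c in range(tile):
                image[row0 + r][col0 + c] = pixels[r][c]
    return image


def make_movie(read_at, stream_ids, frame_times):
    """Build one fused frame per requested time.

    `read_at` is a hypothetical callable standing in for a temporal read.
    """
    return [
        fuse_tiles([read_at(s, t) for s in stream_ids])
        for t in frame_times
    ]


# Stand-in data source: every pixel of stream s just holds the value s.
def fake_read(stream_id, t):
    return [[stream_id] * 10 for _ in range(10)]


movie = make_movie(fake_read, range(100), [0.0, 0.1])
```

The quality of the movie depends entirely on what the temporal read returns for each stream at time t, which is exactly where the three file system configurations compared below differ.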
The image on the left used the widely
popular HDFS system to store the files.
HDFS has a built-in snapshot feature; we triggered snapshots once every
100 milliseconds (HDFS snapshots must be planned in advance and requested at
the exact time the snapshot should cover).
Then we rendered the data included in that snapshot to create each frame
of the movie. Even from the single frame shown, you should be able to instantly
see that this version is of low quality: HDFS is oblivious to timestamps and
hence often mixes frames from different times.
In the middle case, we ran the same
application but now saved the files into FFFS, configured to assume that each
update occurred at the time the data reached the data storage node. Then we used FFFS snapshots to make the
movie. You can immediately see the
improvement: this is because FFFS understands both time and data consistency (if you run the animation, however, you will see that this version isn't perfect either). Another nice feature is that FFFS snapshots don’t need to be
preplanned: you just ask for data at any desired time.
Finally, on the right, we again used
FFFS, but this time configured it to extract time directly from the original
image by providing a datatype-specific plug-in.
(Some programming may be needed to create a new plug-in if the format of
your sensor data is one we haven’t worked with previously.) Now the movie is perfect, although if data had shown up very late (in particular, after we made the snapshots that were combined into the movie frames), we could obviously have seen glitches here too. Still, this is getting quite good.
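The difference between the middle and right-hand cases comes down to which timestamp indexes each update. Here is a small Python sketch of that effect. The `Update` record, the `snapshot` helper, and the sample numbers are all made up for illustration and are not the FFFS plug-in interface. Two cells are sampled at the same instants, but one cell's link is slower.

```python
from typing import NamedTuple


class Update(NamedTuple):
    cell: int
    sensor_time: float   # time embedded in the image by the sensor
    arrival_time: float  # time the data reached the storage node
    value: int


def snapshot(updates, t, key):
    """Return {cell: latest update whose key(update) <= t}."""
    latest = {}
    for u in updates:
        if key(u) <= t:
            current = latest.get(u.cell)
            if current is None or key(u) > key(current):
                latest[u.cell] = u
    return latest


# Both cells are sampled at 0.0 and 0.1; cell 1's data arrives later.
updates = [
    Update(0, 0.0, 0.01, 10), Update(0, 0.1, 0.11, 11),
    Update(1, 0.0, 0.05, 20), Update(1, 0.1, 0.15, 21),
]

# Snapshot for time 0.12, computed after all four updates have been stored.
by_arrival = snapshot(updates, 0.12, key=lambda u: u.arrival_time)
by_sensor = snapshot(updates, 0.12, key=lambda u: u.sensor_time)

arrival_sample_times = {u.sensor_time for u in by_arrival.values()}
sensor_sample_times = {u.sensor_time for u in by_sensor.values()}
```

Indexed by arrival time, the snapshot pairs cell 0's sample from 0.1 with cell 1's sample from 0.0, so a frame built from it mixes two instants. Indexed by sensor time, both cells contribute their 0.1 sample. The sketch takes the snapshot after everything has arrived; had it been taken at wall-clock time 0.12, cell 1's newer sample would not yet have been there, which is the late-data glitch mentioned above.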
With this image in mind, now imagine that rather
than making a movie, the plan was to run an advanced smart-grid data analytic
on the captured data. If a file system
can’t support making a movie, as in Figure 1, it clearly wouldn’t work well for
other time-sensitive computations, either. In effect, the HDFS style of file
system fights against the IoT application.
FFFS, by understanding time, offers a far superior environment for doing
these kinds of smart analytics.
Why the Internet of Things Needs a Real-time Cloud
I hope you’ll check out our paper on FFFS
and maybe even download and use the system itself. We believe it opens the door to a whole new
style of data capture and computing, and also enables temporal forensics, where you use FFFS to create an archive of
data, and then mine the data later to explain things that happened
unexpectedly and left you puzzled: you could look back in time to see when the
issue first arose. But rather than limit
myself to FFFS here, I want to offer a totally different thought: maybe it is
time for the cloud to start to embrace time, just like we did in FFFS, but in
other parts of the cloud too.
Here are some reasons time could matter, a
lot, in the future cloud. First, the
world is gravitating to a wider and wider range of applications like the smart
grid. Smart cars are an example: they
will have autonomous onboard controllers but will probably often be paired with
cloud-hosted services that run on behalf of the car and help it pick a route,
point out nearby restaurants and book tables, notice if a road suddenly gets
blocked and replan your travels, or book you into a hotel if the weather becomes
dangerously bad. And I think these will
be highly available cloud-hosted real-time applications, for lots of reasons,
not least being that they need to be super responsive.
Banking and other investment trading
systems will operate more and more in this mode, and so will the systems
controlling smart homes and smart cities and smart factories, and the list
really goes on and on.
But once you imagine a whole world of time-based
computing, with deadlines and scheduling and priorities, data replicated at
high speeds for availability if a fault occurs, parallel computing in the loop
so that machine learning can keep up with a rapidly changing external world,
you aren’t talking about today’s asynchronous cloud anymore. CAP (the claimed tradeoff that weakens
consistency to favor availability and partition tolerance) will be pushed to
the side in favor of solutions like FFFS with strong consistency and strong
temporal guarantees.
I do see hard puzzles: we need to scale
this stuff up and make it handle georeplication too. We also need to integrate FFFS with the data analytic tools offered on cloud platforms from Microsoft (Azure), Google and Amazon. So we researchers have our work cut out for
us. But I can also see why you’ll definitely
want FFFS not just now, but long into the future, and all these other goodies
that really need to surround it in a grown-up real-time cloud!
This blog is inactive as of early in 2020. Comments have been disabled, and will be rejected as spam.