Let's imagine that we've created the ultimate data warehouse, using Derecho. This warehouse hosts terabytes of persistent memory, absorbs updates at a staggering rate (hundreds of gigabits per second), has tens of thousands of processors that crunch the data down and then store it as collections of higher-level "knowledge models", and is always hungry for more.
We used to think of data warehouses as repositories for nuggets of data, so perhaps we could call these "muggets": models that can now be queried. Hopefully this term isn't painfully cute.
Next, we imagine some collection of machine learning applications that consume these muggets and learn from them, or compute some sort of response to them. For example, if the muggets represent knowledge collected on the local highway, the queries might be posed by smart cars trying to optimize their driving plans. "Is it ok with everyone if I shift to the passing lane for a moment?" "Does anyone know if there are obstacles on the other side of this big truck?" "What's the driving history for this motorcycle approaching me: is this guy some sort of daredevil who might swoop in front of me with inches to spare?" "Is that refrigerator coming loose, the one strapped to the pickup truck way up ahead?"
If you drive enough, you realize that answers to such questions would be very important to a smart car! Honestly, I could use help with such things now and then too, and my car is pretty dumb.
We clearly want to answer such queries with strong consistency (for a smart car, a quick but incorrect answer might not be helpful!), but also very rapidly, even if this entails using slightly stale data. In Derecho, we have a new way to do this that adapts the FFFS snapshot approach described in our SOCC paper to run against what we call version vectors, which are how Derecho stores volatile and persistent data. Details will be forthcoming shortly, I promise.
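To make the idea concrete, here's a minimal sketch of the kind of data structure I have in mind. To be clear, this is not Derecho's actual API: the class and method names below are invented purely for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A minimal, invented sketch -- not Derecho's real API. One versioned
// variable whose update log is indexed both by version and by time.
template <typename T>
class VersionedValue {
    struct Entry {
        uint64_t version;       // position in the total update order
        uint64_t timestamp_us;  // platform timestamp of the update
        T value;
    };
    std::vector<Entry> log_;    // appended in total order, so timestamps ascend

public:
    // Updates arrive via the totally ordered atomic multicast, so every
    // replica appends the same entries in the same order.
    void apply_update(uint64_t version, uint64_t timestamp_us, T value) {
        log_.push_back({version, timestamp_us, std::move(value)});
    }

    // Temporally precise, read-only snapshot query: the value as of time t.
    // Every replica evaluating the same t returns the same answer, even
    // while newer updates are still streaming in past this point in the log.
    std::optional<T> get_by_time(uint64_t t_us) const {
        std::optional<T> result;
        for (const auto& e : log_) {
            if (e.timestamp_us > t_us) break;
            result = e.value;
        }
        return result;
    }
};
```

The point of indexing by time as well as by version is that a query can name a specific instant, and every replica will agree on exactly which updates precede that instant, which is what makes the snapshot reads strongly consistent.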
Here's my question: Derecho's kind of data warehouse currently can't support the full ACID database style of computation, because at present Derecho offers only read-only consistent queries against its temporally precise, causally consistent snapshots. So we have strong consistency for updates, which are totally ordered atomic actions against sets of version vectors, and strong consistency for read-only queries, but not for read/write transactions, where an application might read the current state of the data warehouse, compute on it, and then update it. Is this a bad thing?
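In the same invented notation as above (a string stands in for a knowledge model, and the helpers are assumptions of the sketch, not Derecho calls), the three operation shapes look like this. The first two are what I just described as supported; the commented-out third is the missing ACID piece.

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Assumed helper for this sketch: microseconds since the epoch.
static uint64_t now_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Builds on the invented VersionedValue sketch from earlier.
void operation_shapes(VersionedValue<std::string>& warehouse,
                      uint64_t next_version) {
    // 1. Supported: a totally ordered atomic update to a version vector.
    warehouse.apply_update(next_version, now_us(), "refreshed model");

    // 2. Supported: a strongly consistent read-only query against a
    //    temporally precise snapshot, here lagged ~100ms behind real time.
    auto snapshot = warehouse.get_by_time(now_us() - 100'000);

    // 3. NOT supported: an ACID read/modify/write transaction, e.g.
    //      begin();
    //      auto m = read_current(warehouse);    // read the latest state
    //      warehouse.apply_update(retrain(m));  // write depends on the read
    //      commit();                            // atomic across read + write
    (void)snapshot;  // suppress unused-variable warning in this sketch
}
```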
I've been turning the question over as I bike around on the bumpy, potholed roads here in Sebastopol, where we're finishing up my sabbatical (visiting my daughter, though I've also dropped in down in the Valley and at Berkeley now and then). Honestly, cyclists could use a little smart-warehouse help too, around here! I'm getting really paranoid about fast downhills with oak trees: the shadows invariably conceal massive threats! But I digress...
The argument that Derecho's time-lagged model suffices is roughly as follows: ACID databases are hard to scale, as Jim Gray observed in his lovely paper "The Dangers of Replication and a Solution". Basically, the standard model slows down as n^5 (where n is the number of nodes running the system). This observation gave us CAP and BASE and, ultimately, today's wonderfully scalable key-value stores and NoSQL databases. But those come with very weak consistency guarantees.
Our Derecho warehouse, sketched above (and fleshed out in the TOCS paper we are just about to submit), gets a little further. Derecho can work quite well for that smart highway or similar purposes, especially if we keep the latency low enough. Sure, queries will only be able to access the state as of perhaps 100ms in the past, because the incoming update pipeline is busy computing on the more current state. But this isn't so terrible.
So the question we are left with is this: for machine learning in IoT settings, or similar online systems, are there compelling use cases that actually need the full ACID model? Or can machine learning systems always manage with a big "nearly synchronous" warehouse, running with strong consistency but lagged 100ms or so relative to the current state of the world? Is there some important class of control systems or applications that can provably be ruled out by that limitation? I really want a proof, if so: show me that omitting the fully ACID behavior would be a huge mistake, and convince me with mathematics, not handwaving... (facts, not alt-facts).
I'm leaning towards the "Derecho would suffice" answer. But I'm very curious to hear other thoughts...