Tuesday 19 December 2017

Data concentrators: Follow the bytes

In my last posting, I discussed the need for cloud-hosted data concentrators in support of fog computing.  I thought it might be interesting to dive a bit deeper on this topic.  A number of questions arise, but I want to suggest that they center on understanding the "critical path" for data.  In computer systems parlance, the term just means "the portion of the system that limits performance."  Once we understand which elements make up the critical path, we can then ask further questions about how one might implement an architecture to support this form of data concentrator, with various cloud-like goals: resource sharing, scalability, and so forth. 

To set the context, let me give an example of how this form of reasoning can be valuable.  Consider Eric Brewer's famous CAP conjecture for web servers (and the associated BASE methodology).  It may be surprising to realize that before Eric actually put this forward, many web server systems were built as classic 3-tier database structures, in which every web page was basically the output of a database query, and the database server was thus on the critical path even for read-only web transactions.  The database queries ultimately became the bottleneck: nobody could figure out how to get linear scaling without having the database concurrency mechanism dead center on the critical path and ultimately, overloaded by the load.

At that time we saw some very clever workarounds that came close to solving the problem.  For example, with snapshot isolation, it was shown that you could approximate serializability by running transactions slightly in the past, on consistent snapshots.  There was a scheme for gradually narrowing the snapshot window, to make it as current as practical and yet ensure that the data in the snapshot was committed and consistent.  So by doing a read off this window, you could avoid consulting the back-end database more frequently than required and end up generating your web response in a consistent way, but totally locally, just talking to the cache.  Yet even snapshot isolation frustrated some vendors.  They wanted far more scalability, and snapshot isolation still had some overheads that involved queries to the backend database.  It seemed clear that consistency would be at odds with eliminating this round-trip to the back end.

CAP originated from Eric's decision to study this exact issue: he observed that  responsiveness for web page generation was overwhelmingly more critical than any other aspect of the system, and that database queries are slow due to their consistency mechanisms -- mechanisms that do guarantee that the answer is correct, but involve locking and other delays.  This led him to ask whether web servers actually needed consistency, and to speculate about how much data can be served out of cached content.  He ended up arguing for a "soft state" model, in which pretty much every web request is handled as much as possible from cached content, and his company, Inktomi, was born.  CAP then gave rise to a development methodology (the BASE approach), resulted in scalable key-value caching architectures (MemCached, Cassandra, etc), argued for NoSQL database-like models such as in DynamoDB, and so forth.

Honestly, my crowd was up in arms at the time.  Very few "strong consistency" researchers liked CAP or its conclusion that we should just toss consistency out the window.  I could point to colleagues who rail against CAP, even though CAP itself is 15 years old.  And how can one really dispute Eric's deeper point: if these applications don't actually need consistency, why are we insisting that they pay for a property that they don't want, and that isn't all that cheap either?

Notice how from single core insight about the critical path, we end up with a fertile debate and a whole area of research that played out over more than a decade.  You may not love CAP, yet it was actually extremely impactful. 

CAP isn't the only such story.  Jim Gray once wrote a lovely little essay on the "Dangers of Database Replication" in which he did a paper-and-pencil analysis (jointly with colleagues at Microsoft: Pat Helland, Dennis Shasha and ) and showed that if you just toss state machine replication into a database, the resulting system will probably slow down as n^5, where n is the number of replicas.  Since you probably wanted a speedup, not a slowdown, this is a classic example of how critical path analysis leads to recognition that a design is just very, very wrong!  And Jim's point isn't even the same one Eric was making.  Eric was just saying that if you want your first tier to respond to queries without sometimes pausing to fetch fresh data from back-end databases, you need to relax consistency.

Ultimately, both examples illustrate variations on the famous End to End principle: you should only solve a problem if you actually think you have that problem, particularly if the solution might be costly!  And I think it also points to a trap that the research community often falls into: we have a definite tendency to favor stronger guarantees even when the application doesn't need guarantees.  We are almost irresistibly drawn to making strong promises, no matter who we are making them to!  And face it: we tend to brush costs to the side, even when those costs are a really big deal.

So, armed with this powerful principle, how might we tackle the whole data concentrator question?  Perhaps the best place to start is with a recap of the problem statement itself, which arises in fog/edge computing, where we want a cloud system to interact closely with sensors deployed out in the real world.  The core of that story centered on whether we should expect the sensors themselves to be super-smart.  We argued that probably, they will be rather dumb, for a number of reasons: (1) power limits; (2) impracticality of sending them large machine-learned models, and of updating those in real-time; (3) least-common denominator, given that many vendors are competing in the IoT space and companies will end up viewing all these devices as plug-and-play options; (4) lack of peer-to-peer connectivity, when sensors would need to talk to one-another in real-time to understand a situation.  In the previous blog I didn't flesh out all 4 of these points, but hopefully they are pretty self-evident (I suppose that if you already bet your whole company on brilliant sensors, you might disagree with one or more of them, but at least for me, they are pretty clear).

Specialized sensors for one-time uses might be fancier (sensors for AUVs, for example, that do various kinds of information analysis and data fusion while in flight).  But any kind of general purpose system can't bet on those specialized features. We end up with a split: a military surveillance system might use very smart sensors, operating autonomously in an environment where connectivity to the ground was terrible in any case.  But a smart highway might ignore those super-fancy features because it wants to be compatible with a wide range of vendors and anyhow, wants to work with a very dynamically-updated knowledge model.  So, across the broad spectrum of products, a system that wants to have lots of options will end up forced towards that least-common denominator perspective.

From this, we arrived at various initial conclusions, centering on the need to do quite a bit of analysis on the cloud: you can probably do basic forms of image analysis on a camera, but matching your video or photo to a massive "model" of activity along Highway 101 near Redwood City -- that's going to be a cloud-computing task.  Many people believe that speech understanding is basically a cloud-scale puzzle, at least for the current decade or so (the best solutions seem to need deep neural networks implemented by massive FPGA or ASIC arrays).  Most forms of machine learning need big TPU or GPU clusters to crunch the data. 

So most modern roads lead to the cloud.  I'll grant that someday this might change, as we learn to miniaturize the hardware and figure out how to pack all the data in those models into tiny persistent storage units, but it isn't going to happen overnight.  And again, I'm not opposed to fancy sensors for AUVs.  I'm just saying that if some of your AUVs come from IBM, some from Microsoft and some from Boeing, and you want to work with all three options, you won't be able to leverage the fanciest features of each.

The data concentrator concept follows more or less directly from the observations above.  Today's cloud is built around Eric's CAP infrastructure, with banks of web servers that rush to serve up web pages using cached data spread over key-value stores.  Most of the deep learning and adaptation occurs as a back-end task, more or less offline.  It isn't unusual for edge models to be hours out of date.  And it is very hard to keep them within seconds of current state.  But if you plan to control a smart highway, or even a smart home, seconds are too high a latency.  "Watch out for that rusty pipe!" isn't a very helpful warning, if you (or your smart car) won't get that warning until 10s too late.

So what would a data concentrator do, and how does the answer to such a question reveal the most likely critical paths?  Basically, data concentrators give rise to a new kind of tier-one system.  Today's first tier of the cloud runs Eric Brewer's architecture: we have vast army's of very lightweight, stateless, web-page generating engines.  A data concentrator would be a new form of first tier, one that focuses on stateful responses using machine-learned models and customized using machine-learning code, much as we customize a web server to generate the various web pages needed for media, direct sales, advertising, etc.  Over time, I'm betting we would code these in a high level language, like TensorFlow, although today C++ might give much better performance.

Back to basics.  You know the business adage about following the money.   For a fog system where we care mostly about performance and real-time responses, the "costly" thing is slowdowns or delays.  So let's follow the data.  The whole reason for building data concentrators is that our sensors are kind of dumb, but the volumes of data are huge, hence we'll need a system element to filter and intelligently compress the data: I like to think of the resulting systems as "smart memory", meaning that we might treat them like file systems, and yet they are smart about how they store the stuff they receive. A smart memory is just one form of data concentrator; I can think of other forms, too; maybe a future blog could focus on that.

Via its file system personality, a smart-memory system could be used just like any storage infrastructure, supporting normal stuff like Spark/Databricks, TimescaleDB, you name it.  But in its edge-of-the-cloud personality, it offers the chance to do tasks like image or voice classification using big machine-learned models, real-time reaction to urgent events (like mufflers falling off on the highway), and perhaps even simple learning tasks ("Ken's car is 2" to the left of what was predicted").

Where would the highest data pressures arise?  If we have sensors capturing (and perhaps storing) video or voice, the individual data links back to the data collector tier of the cloud won't be terribly burdened: the core Internet has a lot of capacity, and even a dumb camera can probably do basic tasks like recognizing that the highway is totally empty.  But each of these collectors will be a concentrator of streams, capturing data from lots of sources in parallel.  So the highest demand may be on that very first leg inside the cloud: we should be asking what happens to a data stream as it arrives in a single data collector.

Without knowing the application, this may seem like an impossible goal, but consider this: we actually know a lot about modern applications, and they share a lot of properties.  One is the use of accelerators: if the data is video, the first steps will be best fitted to GPU or TPU or FPGA hardware, which can do tasks like parallel image processing, evaluating large Bayesian or Neural Network models, and so forth.  Right now, it is surprisingly hard to actually load data from an incoming TCP connection directly into an accelerator of these kinds: most likely it will need to be copied many times before it ever reaches GPU memory and we can turn that hardware loose.  And we've already concluded that only a limited amount of that kind of work can occur on the sensor itself.

So a first insight more or less leaps at us: for this data collection model to work, we have to evolve the architecture to directly shunt data from incoming TCP connections to GMEM or TMEM or through a bump-in-the-wire FPGA transformation.  Today's operating systems don't make this easy, so we've identified an O/S research need.  Lacking O/S research, we've identified a likely hot-spot that could be the performance-limiting step of the architecture.  That was basically how CAP emerged, and how Jim Gray's database scaling work advanced: both were paper-and-pencil analyses that led to pinch-points (in the case of CAP, the locking or cache-coherency delays needed to guarantee consistency for read-mostly workloads; in the case of database updates, the concurrency control and conflicts that caused aborts).

Distilled to its bare bones, the insight is: data concentrators need ways to move data from where it is captured to where it will be processed that minimize delay and maximize throughput -- and lack those options today.  Without them, we will see a lot of copying, and it will be very costly.

A second such issue involves comparing data that is captured in multiple ways.  If we believe in a sensor-rich IoT environment, we'll have many sensors in a position to hear a single utterance, or to see a single event, and different sensors may have differing qualities of signal.  So we might wish to combine sensor inputs to strengthen a signal, enrich a set of 2-D visual perspectives into a 3-D one, triangulate to locate objects in space, select the best qualify image from a set of candidates, etc.

Again, we have a simple insight: cross-node interactions will also be a pressure point.  The basic observation leads essentially the same findings as before.

Another form of cross node interaction arises because most data concentration systems form some kind of pipeline, with stages of processing.  If you know about Microsoft's Cosmos technology, this is what I have in mind.  Hadoop jobs (MapReduce) are another good example: the output of stage one becomes input to stage two.

So for such systems, performance is often dominated by flow of data between the stages.  For example, suppose that one stage is a GPU computation that segments video images, and a second stage guesses at what the segmented objects might be ("car", "truck", "rusty pipe"), and a third one matches those against a knowledge model ("Ken's car", "Diane's truck"), then compares the details ("is Ken on the same trajectory as predicted?").  We might see some form of fanout: there could be one element in the first stage, but there could be two or even many second or third stage components, examining very different information and with distinct purposes.  Each might send events or log data into storage.

So this then leads to scrutiny of architectural support for that form of structure, that form of handoffs, and attention to the requirements.  As one example: if we want our infrastructure to be highly available, does that imply that handoff needs to be reliable on an event-by-event basis?  The answers may differ case by case: data from a smart highway is very redundant because cars remain on it for long periods; if a glitch somehow drops one event, a follow-on event will surely have the same content.  So in an end-to-end sense, we wouldn't need reliability for that specific handoff, especially if we could gain speed by relaxing guarantees.  Eric would like that sort of thing.  Conversely, if an event is one of a kind ("Oh my goodness, Diane's muffler is falling off!  There it goes!!") you might want to view that as a critical event that must not be dropped, and that has many urgent consequences.

The tradeoffs between data throughput and latency are intriguing too.  Generally, one wants to keep an architecture busy, but a system can be busy in good or bad ways.  Further, goodness is very situation-specific: for warnings about roadway trash, a quick warning is really important.  For aggregated steady state data processing, longer pipelines and batch processing pay off.  And these are clearly in tension.

The pithy version would be along these lines: improved support for data pipelines will be important, and is lacking in today's systems.  Lacking progress, not only will we see a fair amount of inefficient copying, but we could end up with barrier delays much as in Hadoop (where stage two needs to sit waiting for a set of results from stage one, and any single delayed result can snowball to delay the entire computation).

I won't drag this out -- for a blog, this is already a rather long essay!  But hopefully you can see my core point, which I'll summarize.  Basically, we need to visualize a scalable data concentrator system in terms of the data flows it can support and process, and optimize to make those flows as efficient as possible.  Modern O/S structures seem ill-matched to this, although it may not be very hard to improve them until the fit is much better.  But in doing so, it jumps out that low latency and high throughput may well be in conflict here.  Following the bytes down these different paths might lead to very different conclusions about the ideal structure to employ!

I love puzzles like this one, and will surely have more to say about this one, someday.  But jump in if you want to share a perspective of your own!

No comments:

Post a Comment

This blog is inactive as of early in 2020. Comments have been disabled, and will be rejected as spam.

Note: only a member of this blog may post a comment.