A Few Thoughts on Distributed Computing: April 2017

Monday, 24 April 2017

Will smart memory need an ACID model?

This is really part II of my posting on smart memory from ten days ago.

Let's imagine that we've created the ultimate data warehouse, using Derecho. This warehouse hosts terabytes of persistent memory. It can absorb updates at a staggering rate (hundreds of gigabits per second), has tens of thousands of processors that crunch the data down and then store it as collections of higher-level "knowledge models", and is always hungry for more.

We used to think of data warehouses as repositories for nuggets of data, so perhaps we could call these "muggets": models that can now be queried. Hopefully this term isn't painfully cute.

Next, we imagine some collection of machine learning applications that consume these muggets and learn from them, or compute some sort of response to them. For example if the muggets represent knowledge collected on the local highway, the queries might be posed by smart cars trying to optimize their driving plans. "Is it ok with everyone if I shift to the passing lane for a moment?" "Does anyone know if there are obstacles on the other side of this big truck?" "What's the driving history for this motorcycle approaching me: is this guy some sort of a daredevil who might swoop in front of me with inches to spare?" "Is that refrigerator coming loose, the one strapped to the pickup truck way up ahead?"

If you drive enough, you realize that answers to such questions would be very important to a smart car! Honestly, I could use help with such things now and then too, and my car is pretty dumb.

We clearly want to answer the queries with strong consistency (for a smart car, a quick but incorrect answer might not be helpful!), but also very rapidly, even if this entails using slightly stale data. In Derecho, we have a new way to do this that adapts the FFFS snapshot approach described in our SOCC paper to run over what we call version vectors, which are how Derecho stores volatile and persistent data. Details will be forthcoming shortly, I promise.

Here's my question: Derecho's kind of data warehouse currently can't support the full ACID database style of computation, because at present, Derecho only has read-only consistent queries against its temporally precise, causally consistent snapshots. So we have strong consistency for updates, which are totally ordered atomic actions against sets of version vectors, and strong consistency for read-only queries, but not read/write queries, where an application might read the current state of the data warehouse, compute on it, and then update it. Is this a bad thing?
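
To see what is at stake, here is a toy Python sketch of the gap. It is emphatically not Derecho's API: the class `LaggedWarehouse` and its methods are hypothetical names invented for illustration, and the whole warehouse is reduced to a single counter. Updates are appended in one total order and reads come from a (possibly older) version. Two clients that each read, compute, and write back can silently lose an update, which is the anomaly a read/write transaction would prevent.

```python
class LaggedWarehouse:
    """Toy model: totally ordered updates, reads served from a stored version."""

    def __init__(self, initial):
        self.versions = [initial]      # versions[i] is the state after i updates

    def update(self, new_value):
        # Each update is atomic and appended in a single total order.
        self.versions.append(new_value)
        return len(self.versions) - 1  # version number just created

    def read_snapshot(self, lag=0):
        # Return (version, value) for the state `lag` versions behind the newest.
        v = max(0, len(self.versions) - 1 - lag)
        return v, self.versions[v]

    def update_if_unchanged(self, read_version, new_value):
        # Hypothetical ACID-style check: reject the write if anything
        # was written after the version this client read.
        if read_version != len(self.versions) - 1:
            return False
        self.update(new_value)
        return True


def blind_increment_twice():
    # Two clients read the same snapshot, then both write "value + 1".
    w = LaggedWarehouse(0)
    _, a = w.read_snapshot()
    _, b = w.read_snapshot()
    w.update(a + 1)
    w.update(b + 1)
    return w.versions[-1]              # one increment is lost


def checked_increment_twice():
    # Same interleaving, but writes are validated against the version read.
    w = LaggedWarehouse(0)
    va, a = w.read_snapshot()
    vb, b = w.read_snapshot()
    ok_a = w.update_if_unchanged(va, a + 1)
    ok_b = w.update_if_unchanged(vb, b + 1)   # rejected: state moved on
    return w.versions[-1], ok_a, ok_b
```

In this sketch `blind_increment_twice()` ends at 1 rather than 2, while the checked variant rejects the second write so that the client can retry. Whether real IoT workloads ever need that second behavior is exactly the question.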

I've been turning the question over as I bike around on the bumpy, potholed roads here in Sebastopol, where we are finishing up my sabbatical (visiting my daughter, but I've also dropped in down in the valley and at Berkeley now and then). Honestly, cyclists could use a little smart warehouse help too, around here! I'm getting really paranoid about fast downhills with oak trees: the shadows invariably conceal massive threats! But I digress...

The argument that Derecho's time-lagged model suffices is roughly as follows: ACID databases are hard to scale, as Jim Gray and his coauthors observed in their lovely paper "The Dangers of Replication and a Solution". Basically, the standard model slows down as n^5 (n is the number of nodes running the system). This observation gave us CAP and BASE and ultimately, today's wonderfully scalable key-value stores and NoSQL databases. But those have very weak consistency guarantees.

Our Derecho warehouse, sketched above (fleshed out in the TOCS paper we are just about to submit) gets a little further. Derecho can work quite well for that smart highway or similar purposes, especially if we keep the latency low enough. Sure, queries will only be able to access the state as of perhaps 100ms in the past, because the incoming database pipeline is busy computing on the more current state. But this isn't so terrible.

So the question we are left with is this: for machine learning in IoT settings, or similar online systems, are there compelling use cases that actually need the full ACID model? Or can machine learning systems always manage with a big "nearly synchronous" warehouse, running with strong consistency but lagged by 100ms or so relative to the current state of the world? Is there some important class of control systems or applications that can provably be ruled out by that limitation? I really want a proof, if so: show me that omitting the fully ACID behavior will be a huge mistake, and convince me with mathematics, not handwaving... (facts, not alt-facts).

I'm leaning towards the "Derecho would suffice" answer. But very curious to hear other thoughts...

Saturday, 15 April 2017

Will smart memory be the next big thing?

We're fast approaching a new era of online machine learning and online machine-learned behavior: platforms that will capture real-time data streams at high data rates, process the information instantly, and then act on the output of the processing stage. A good example would be a smart highway that tells cars what to expect up around the next curve, or a real-time television feed integrated with social networking tools (like a world-cup soccer broadcast that lets the viewers control camera angles while their friends join from a remote location).

If you ask how such systems will need to be structured, part of the story is familiar: as cloud-hosted services capturing data from sensors of various kinds (I'm including video cameras here), crunching on it, then initiating actions.

A great example of such a service is Microsoft's Cosmos data farm. This isn't a research platform but there have been talks on it at various forums. The company organized a large number of storage and compute nodes (hundreds of thousands) into a data warehouse that absorbs incoming objects, then replicates them onto a few SSD storage units (generally three replicas per object, with the replication pattern fairly randomized to smooth loads, but in such a way that the replicas are on fault-independent machines: ones that live in different parts of the warehouse and are therefore unlikely to crash simultaneously).
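
I don't know how Cosmos actually chooses replica locations, so treat the following Python sketch purely as an illustration of the idea described above: pick three copies at random, but never two in the same fault domain. The function name `place_replicas` and the notion of a "domain" (a rack, a power circuit, a section of the warehouse) are assumptions made for the example.

```python
import random


def place_replicas(nodes_by_domain, replicas=3, rng=None):
    """Pick `replicas` nodes at random, each from a different fault domain.

    nodes_by_domain maps a fault-domain name to a non-empty list of node names.
    Returns a list of (domain, node) pairs.
    """
    rng = rng or random.Random()
    if len(nodes_by_domain) < replicas:
        raise ValueError("not enough independent fault domains")
    # Randomizing the choice of domains spreads load across the warehouse.
    domains = rng.sample(sorted(nodes_by_domain), replicas)
    # Within each chosen domain, any node will do.
    return [(d, rng.choice(nodes_by_domain[d])) for d in domains]


if __name__ == "__main__":
    warehouse = {
        "rackA": ["a1", "a2"],
        "rackB": ["b1"],
        "rackC": ["c1", "c2"],
        "rackD": ["d1"],
    }
    print(place_replicas(warehouse))
```

Because no two replicas share a domain, a single rack or power failure can take out at most one copy of any object.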

Once the data is securely stored, Cosmos computes on it: it might compress or resize an image or video, or it could deduplicate, or run an image segmentation program. This yields intermediary results, which it also replicates, stores, and then might further process: perhaps, given the segmented image, it could run a face recognition program to automatically tag people in a photo. Eventually, the useful data is served back to the front-end for actions (like the smart highway that tells cars what to do). Cold but valuable data is stored to a massive backend storage system, like Microsoft's Pelican shingled storage server. Unneeded intermediary data is deleted to make room for new inputs.

Thus Cosmos has a lot in common with Spark, the famous data processing platform, except on a much larger scale, and with an emphasis on a pipeline of transformations rather than on MapReduce.

If we step way back, we can start to perceive Cosmos as an example of a smart memory system: it remembers things, and also can think about them (process them), and could potentially query them too, although as far as I know Cosmos and Spark have limited interactive database query functionality. But you could easily imagine a massive storage system of this kind with an SQL front-end, and with some form of internal schema, dynamically managed secondary index structures, etc.

With such a capability, a program could interrogate the memory even as new data is received and stored into it. With Cornell's Derecho system, the data capture and storage steps can be an asynchronous pipeline that would still guarantee consistency. Then, because Derecho stores data into version vectors, queries can run asynchronously too, by accessing specific versions or data at specific times. It seems to me that the temporal style of indexing is particularly powerful.
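
As a rough illustration of what temporal indexing buys you, here is a minimal Python sketch of a multi-version store. This is not Derecho code and makes no claim about how Derecho lays out its version vectors; `VersionedStore`, `put`, `get_at` and `snapshot` are hypothetical names, and timestamps are plain integers. Each key keeps an append-only history, and a query names the instant it wants to see.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Toy multi-version store: each key keeps an append-only history."""

    def __init__(self):
        self._times = defaultdict(list)    # key -> increasing timestamps
        self._values = defaultdict(list)   # key -> values, parallel to _times

    def put(self, key, timestamp, value):
        # Assumes updates reach each key in timestamp order.
        if self._times[key] and timestamp <= self._times[key][-1]:
            raise ValueError("timestamps must increase per key")
        self._times[key].append(timestamp)
        self._values[key].append(value)

    def get_at(self, key, timestamp):
        """Newest value written at or before `timestamp`, or None."""
        times = self._times.get(key, [])
        i = bisect.bisect_right(times, timestamp)
        return self._values[key][i - 1] if i else None

    def snapshot(self, keys, timestamp):
        """Read several keys as of the same instant."""
        return {k: self.get_at(k, timestamp) for k in keys}


if __name__ == "__main__":
    store = VersionedStore()
    store.put("lane1", 100, "clear")
    store.put("lane2", 120, "truck")
    store.put("lane1", 150, "blocked")
    # A query pinned to time 130 is unaffected by the later write at 150.
    print(store.snapshot(["lane1", "lane2"], 130))
```

Since old versions are never overwritten, a reader pinned to a past instant never has to coordinate with writers appending newer versions, which is why such queries can run in parallel with the update pipeline.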

The interesting mix here is massive parallelism, and massive amounts of storage, with strong consistency... and it is especially interesting that in Derecho, the data is all moved asynchronously using RDMA transfers. Nobody has to wait for anything, and queries can be done in parallel.

Tomorrow's machine learning systems will surely need this kind of smart memory, so for those of us working in systems, I would say that smart memory architectures jump out as a very exciting next topic to explore. How should such a system be organized, and what compute model should it support? As you've gathered, I think Cosmos is already pretty much on the right track (and that Derecho can support this, even better than Cosmos does). How can we query it? (Again, my comment about SQL wasn't haphazard: I would bet that we want this thing to look like a database). How efficiently can it use the next generation of memory hardware: 3-D XPoint, phase-change storage, high-density RAID-style SSD, RDMA and Gen-Z communication fabrics?

Check my web page in a few years: I'll let you know what we come up with at Cornell...

Wednesday, 5 April 2017

Implicit protocols and dynamic protocol discovery

During the past ten days I've driven in the area around Tel Aviv, then Brussels, then New Jersey/New York City. The rules of the road are widely different in each of these areas.

Beyond the actual driving laws, there is an interesting kind of social networking at play, and it seems to suggest a possible research topic... one outside of my immediate area (anyhow, Derecho has me busy!), so I'll share it here in the hope that someone else might find the thought useful.

To set the stage, let me describe a few examples of what I'll call "implicit protocols" that drivers employ in these different areas. Even before giving examples, I should explain what I mean by this term. I have in mind situations where some actual communication occurs between drivers: we might look at each other and I might nod, meaning that I'm letting you pull in front of me on a crowded street. You could drift slightly towards my lane, and I might acknowledge that by slowing my car, just a hair, and this is enough for you to realize that you have my permission to shift lanes. The list goes on, and becomes pretty long. So, an implicit driving protocol is (1) a behavioral rule, familiar to the people who drive in the region, and (2) one that isn't automatically used; it has to somehow be requested, and acknowledged.

So here are some examples:

Example 1: In Israel, as far as I can tell, drivers almost never look behind themselves when shifting lanes (they do check in other situations, I'm talking here specifically about lane-to-lane movement). They seem to check behind them normally, but once they decide that a lane-change is needed, the decision of when to do that apparently is based on the 180 degree region sideways and in front. Once you get used to this you can easily understand intentions, but if you don't realize that they never check behind themselves, Israeli drivers often seem to be cutting you off because they will shift into your lane with inches to spare. The key point is that you, as the driver coming up from the rear, were supposed to realize "long ago" that the driver up ahead was planning to shift into your lane and leave room for that. If a driver seems to intend to change lanes and they don't (even with just inches to spare, as I said), you as the driver behind them are allowed to get impatient and honk the horn: "get on with it!". And then conversely, if they shift lanes and nearly crash into you, they will be angry at you if they think the rules made it quite clear that this is what they were doing. "You idiot, are you driving with your eyes closed?!! Anyone could see what I was doing!" They don't actually say stuff like that, by the way. I just mean that they are thinking it.

In fact the first time you drive in Tel Aviv you feel as if people break the law whenever they find it convenient to do so. But later you realize they actually have social rules for when to defy the laws, and even the city police respect those. Break the rules in a way that they don't normally consider appropriate, do something nobody should ever do, and I promise you: every driver in blocks around will lean on his or her horn instantly, and you just might find that the local policeman wants to chat, too! (There is also a whole protocol around weird rude behavior, like stopping your car in a way that blocks the entire street and then hanging out and smoking a cigarette because you are waiting for your cousin to come down from the fifth floor, but she's elderly and slow and it might take five minutes, during which every single car will be honking non-stop... because after all, if you don't wait for her right at that spot, well, you might have to drive around the block, or she might have to stand there... but let's not even go there!)

Example 2: In Belgium, the whole country uses a rule called "priorité à droite" (priority to the right). You may not know it by name, but you are probably familiar with the opposite rule, priority to the left, and a good way to understand it is to start by thinking about a roundabout (a traffic circle). Suppose someone wants to get into the circle, and you are driving around the circle. Priority to the left is the usual rule: the driver in the circle has the right to keep driving, and the one who wants to enter has to wait for a free slot. As you can imagine, a traffic circle deadlocks with priority to the right ("à droite" in French) because new cars can enter in priority over cars wishing to exit. (Fun fact: for historical reasons rooted deep in the past, France has two major traffic circles that actually use priorité à droite, namely the ones at the Etoile and at Place Victor Hugo. I've seen them get deadlocked and stay that way for hours, literally hours. The same cars just sit there... forever. Very interesting for me as a computer scientist!)
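
For the computer scientists in the audience, the deadlock can be reproduced with a deliberately crude Python model. Everything here is an assumption made for illustration: the circle is a single-lane ring of slots, every slot has an entry road with an endless queue (rush hour), a car can only advance into an empty slot, and each car leaves after travelling `trip` slots. The function name `simulate` and its parameters are hypothetical. The only thing that changes between the two runs is who yields.

```python
def simulate(priority, slots=8, trip=3, steps=50):
    """Count cars that leave the circle within `steps` ticks.

    priority="circle":   entering cars yield to cars already in the circle.
    priority="entering": cars in the circle yield to entering cars.
    """
    ring = [None] * slots   # ring[i] = slots left to travel, or None if empty
    exited = 0

    def enter():
        for i in range(slots):
            if ring[i] is not None:
                continue
            # Under "circle" priority, yield to a car about to arrive
            # from the upstream slot.
            if priority == "circle" and ring[i - 1] is not None:
                continue
            ring[i] = trip

    def advance():
        # Decide all moves from the current occupancy, so that each car
        # moves at most one slot per tick.
        movers = [i for i in range(slots)
                  if ring[i] is not None and ring[(i + 1) % slots] is None]
        for i in movers:
            ring[(i + 1) % slots] = ring[i] - 1
            ring[i] = None

    for _ in range(steps):
        # Cars that have reached their exit leave first.
        for i in range(slots):
            if ring[i] == 0:
                ring[i] = None
                exited += 1
        if priority == "entering":
            enter()      # newcomers grab every free slot first
            advance()
        else:
            advance()    # circulating cars move first
            enter()
    return exited


if __name__ == "__main__":
    print("priority to the circle:", simulate("circle"))
    print("priority to entering cars:", simulate("entering"))
```

With priority to entering cars, the newcomers fill every free slot on the first tick, no car has room to advance toward its exit, and the count of completed trips stays at zero forever. With priority to the circle, an entering car always leaves the slot behind it free, so the ring never fills and cars keep flowing.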

I'll toss out an example 2-a but this is a minor one: They have a ton of traffic circles, in fact, all over Europe. And there often are two lanes. How will they be used? Well, the rule is to use the inner lane if you will exit two or more exits from here, but be in the outer lane if this next exit is where you will leave the circle. All the drivers let you do this, if you know the rule and follow the rule. But tourists often get it wrong, and cause fender-benders.

But enough about traffic circles. Let's get to the real example 2. So, here's the actual issue for Belgium. It turns out that in Belgium, a great many small streets lack any kind of stop sign or traffic lights on the corners. So you are driving on a street, maybe even a nice large road, 2 lanes in each direction, and in comes this dinky little street from the right with no markings on the intersection, and in fact the little thing could almost be a driveway, it looks so dubious as far as streetness goes.

In the US, without question, you would assume that you on the main road have priority over the little road coming in from the right. And actually, legally, you do have priority: we use a priority to the left rule in this situation, like on normal traffic circles. So when an intersection is unmarked, the car to the left gets to drive directly through, and the car to the right must stop, check that the left is clear, and only then can advance into the intersection.

In Belgium, the rule is actually the opposite: dinky or not, the driver on that little alley has priority over the vehicles on the main 2-lane road if the intersection is unmarked, unless they tell you that you are on a "route prioritaire", which literally means "a road with priority over incoming traffic". So everyone on the main road is guaranteed to tap the brakes at every intersection. As the driver behind such a car you could crash right into them by not realizing they are forced to do this, and for that matter if you were that clueless, you could easily drive through without checking to the right, causing a major accident. As it happens, this rule also gets used in the south of Europe quite often, so do make sure to understand it if you plan to drive there.

Where is the implicit protocol? Partly, of course, the protocol is the legal rule. But beyond that there is also this question of whether you are too far into the intersection to slow down and yield. So there is still an exchange of context between you and the other driver that takes place: a form of communication that might be eye to eye, or might relate to how you drive your car. But information is conveyed and helps you know if the guy coming in is ok with you passing even though he has priority, or if he is unaware of you, or deeply committed, in which case, slam on your brakes my friend!

Example 3: In the area around New York and especially in the city itself, cars collaborate "school of fish" style to maximize throughput on streets that happen to have slightly coordinated traffic lights, such as the main north/south avenues that run for miles from the top of Manhattan to the bottom. So on these, all the drivers are implicitly in agreement that the plan is to get as far as possible before stopping for the next red light, and with this in mind, they need to shift lanes and otherwise coordinate much like fish in a school of fish. The car density might be very high, and yet the cars try to flow around obstacles, which are unfortunately common in New York City: gaping holes in the pavement where work on a steam pipe is underway, perhaps with fencing around the hole, perhaps just a big hole; steel plates that look unstable, maybe a homeless person pushing a shopping cart of stuff across the street right at that moment. The cars just all flow, seamlessly.

When you understand this it feels like a kind of dancing: a ballet of cars, and especially of taxi cars because there are additional complications: the taxi drivers all drive "correctly" but people from out of town on these roads are treated kind of like potholes: the taxis are very wary around them because they know that those drivers are clueless and might do stupid (or inefficient) things. But there is a form of implicit communication here too: very quickly, I find that I'm in sync with the taxi and other drivers around me: I'm not in a yellow cab, yet the school of cabs accepts me. It isn't an eye-to-eye thing: it just comes down to the rules you are following.

I could definitely extend this list: Switzerland has its own style (absurdly, excessively, polite and respectful of the rules, plus: "tout ce qui n'est pas interdit est obligatoire", meaning that "If it isn't illegal, it is obligatory"). California has its own style. The style in Mexico can be rather creative.

The computer science puzzle: Ok, hopefully we're on the same page now. As you can see, by driving in all of these different settings in a short amount of time, I've had to repeatedly adapt my meta-driving-protocol: I've had to shift to different implicit protocols, so that I would behave as expected from the point of view of all those others on the road. Make the shift and you drive more safely and efficiently; fail to make it, and horns blare at you from all directions!

But how am I doing this? And how did I figure out the rules in Tel Aviv, which is a new environment for me? Or other such places?

I'll postulate that first of all, humans have explicit and implicit ways to communicate. We talk, but we also communicate eye-to-eye through glances and little nods, and through behavior like when your car drifts just a little to the right or left while in motion in a way that "makes sense" under the local implicit driving rules. So we send information through a dozen little "tells", as a poker player might express the idea, and this is often a two-way thing or even an n-way thing, as in the case of schools of drivers heading south on 2nd Avenue in New York.

Next, the rules that matter on roadways are discoverable: a person who drives well and doesn't get stressed easily can learn these rules very quickly. How? I have no idea: I'm not adept at languages (I know like five words of Hebrew by now, a bit less than two words per month for the past three months: not impressive!) But somehow these implicit protocols are so evident that they are just totally obvious once you settle into the situation, and you develop hypotheses and quickly validate them.

The means of dynamic discovery is also incredibly robust: we can learn these rules even though perhaps half or more of the drivers aren't using them at all, or not using them properly. We can learn which cars "get it" and which ones don't, and then we as a school of cars can all cooperate to shun the ones that are clueless. In fact you can figure out that the guy driving that SUV is clueless, and even though you are way up ahead of me, I'll realize this too, and will route around the SUV too, because I've learned something from you. And yet I can't see you, don't know you. Somehow you communicate this knowledge through your behavior! And I somehow deduce it from that behavior.

Why does all this matter? Here's why. First, I think this represents a kind of exciting topic for research in social networks. I bet that if you plan to come to do a PhD at Cornell and knock on Jon Kleinberg's door, you could even talk him into working on this with you. (The topic is more his kind of thing than mine). So do that! We always need more brilliant PhD students.

And the second reason is that self-driving cars will never be safe until we understand this aspect of human behavior. Obviously, if you follow these blog postings of mine, you'll know that I've become very worried about self-driving cars (check out that tag if you missed my prior comments on this). But setting that stuff to the side, the situation we have is that Uber and others want to create self-driving taxis that will be safe in New York, and in California too, and Tel Aviv, and Brussels. Well, good luck to them on this if they don't figure this stuff out! I myself will try to opt out for as long as possible.

And in case you are wondering, I'm still driving standard shifts, as much as possible. Of course I can drive an automatic. But I just don't trust those automatic shifting systems...