Fundamentally, what Jim and his colleagues did was to undertake a whiteboard analysis of the scalability of concurrency control in an uncontrolled situation, where transactions are simply submitted to some big pool of servers, compete for locks in accordance with a two-phase locking model (one in which a transaction acquires all of its locks before releasing any), and then terminate using a two-phase or three-phase commit. They showed that without some mechanism to prevent lock conflicts, there is a predictable and steadily increasing rate of conflicts, leading to delay and even deadlock/rollback/retry. The phenomenon causes overheads to rise as a polynomial in the number of servers over which you replicate the data, and quite sharply: I believe the deadlock rate grew as the cube of the number of servers and the fifth power of the transaction size. So your single replicated database will suffer a performance collapse. With shards, using state machine replication (implemented using Derecho!), this isn't an issue, but of course we don't get the full SQL model at that point -- we end up with a form of NoSQL on the sharded database, similar to what MongoDB or Amazon's DynamoDB offers.
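To make the shape of that analysis concrete, here is a tiny Python sketch of the trend (my own paraphrase of the scaling law; the constants and exact formula from the paper are omitted):

    # Rough paraphrase of the trend in Gray et al.'s analysis of naive eager
    # replication: the deadlock/reconciliation rate grows roughly as the cube
    # of the number of replicas and the fifth power of the transaction size.
    # Constants are dropped; this only illustrates how fast the curve rises.
    def relative_deadlock_rate(replicas, actions_per_txn):
        return replicas ** 3 * actions_per_txn ** 5

    base = relative_deadlock_rate(1, 4)
    for n in (1, 2, 5, 10):
        ratio = relative_deadlock_rate(n, 4) / base
        print(f"{n:3d} replicas -> {ratio:6.0f}x the single-replica rate")
    # 10 replicas already means a 1000x higher rate: the performance collapse.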
Of course the "dangers" paper is iconic, but the techniques it uses are of broad value. And this was central to the way Jim approached problems: he was a huge fan of working out the critical paths and measuring costs along them. In the cloud database setting, a bit of fancy mathematics let the group he was working with turn that sort of thinking into a scalability analysis that led to a foundational insight. But even if you don't have an identical chance to change the world, it makes sense to try to follow a similar path.
This has had me thinking about paper-and-pencil analysis of the critical paths and potential consistency conflict points for large edge IoT deployments of the kind I described last week. Right now, those paths are pretty messy, if you approach it this way. Without an edge service, we would see something like this:
IoT                     IoT      Function       Micro
Sensor ---------------> Hub ---> Server ------> Service
In this example I am acting as if the function server "is" the function itself, hiding the step in which the function server looks up the class of function that should handle this event, launches it (or perhaps it had one waiting, warm-started), and then hands off the event data to the function for handling on one of its servers. Had I included this handoff, the picture would look more like this:
IoT                     IoT      Function       Function Micro
Sensor ---------------> Hub ---> Server ------> F -----> Service
F is "your function", coded in a language like C#, F# or C++ or Python, and then encapsulated into a container of some form. You'll want to keep these programs very small and lightweight for speed. In particular, a function is not the place to do any serious computing, or to try and store anything. Real work occurs in the micro service, the one you built using Derecho. Even so, this particular step looks costly to me: without warm-starting it, launching F could take a substantial fraction of a section. And if F was warm-started, the context switch still involves some form of message passing, plus waking F up, and could still be many tens or even hundreds of milliseconds: an eternity at cloud speeds!
Even more concerning, many sensors can't connect directly to the cloud, so we end up cloning the architecture and running it twice: once within an IoT Edge system (think of that as an operating system for a small NUMA machine or cluster running close to the sensors), which then relays data to the main cloud if it can't handle the events out near the sensor device:
IoT                     Edge     Edge Fcn           IoT      Function        Micro
Sensor ---------------> Hub ---> Server -> F======> Hub ---> Server -> CF -> Service
Notice that now we have two user-supplied functions on the path. The first one will have decided that the event can't be handled out at the edge, and forwarded the request to the cloud, probably via a message-queuing layer that I haven't actually shown, but have represented using a double arrow: ===>. The queuing layer could have chosen to store the request and send it later, but with luck the link was up and the request was passed to the cloud instantly, didn't need to sit in an arrival queue, and was handed straight to the cloud's IoT Hub, which in turn passed it to the cloud function server, the cloud function (CF), and finally the Micro Service.
The Micro Service may actually be a whole graph of mutually supporting Micro Services, each running on a pool of nodes, and each interacting with some of the others. The cloud's "App Server" probably hosts these and provides elasticity if a backlog forms for one of them.
We also have the difficulty that many sensors capture images and videos. These are initially stored on the device itself, which has substantial capacity but limited compute power. The big issue is that the first link, from sensor to the edge hub, would often be bandwidth limited. So we can't upload everything. Very likely what travels from sensor to hub is just a thumbnail and other meta-data. Then the edge function concludes that a download is needed (hopefully without too much delay), sends back a download request to the imaging device, and then the device moves the image to the cloud.
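Here is a minimal sketch of that edge-side decision, again in Python with invented names (the real interfaces would come from the camera vendor and the IoT Edge runtime): only the thumbnail and metadata cross the bandwidth-limited first hop, and the full image is pulled on demand.

    # Hypothetical sketch of the thumbnail-first protocol described above.
    # The callables are passed in so the sketch stays self-contained; in a
    # real system they would wrap the camera's and blob store's actual APIs.
    from typing import Callable

    def on_thumbnail_event(meta: dict,
                           thumbnail: bytes,
                           is_interesting: Callable[[bytes], bool],
                           request_full_upload: Callable[[str], None]) -> None:
        if is_interesting(thumbnail):
            # Ask the camera to push the full image; once the upload lands in
            # the blob store, a second "your image is ready" event will fire.
            request_full_upload(meta["image_id"])
        # Otherwise nothing more crosses the constrained sensor-to-hub link.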
Moreover, there are industry standards for uploading photos and videos to a cloud, and those put the uploaded objects into the edge version of the blob store ("blob" is short for "binary large object"), which in turn is edge-aware and will mirror them to the main cloud blob store. Thus we have a whole pathway from the IoT sensor to the edge blob server, which will eventually generate another event to tell us that the data is ready. And as noted, for data that needs to reach the actual cloud and can't be processed at the edge, we replicate this path too, moving the image via the queuing service to the cloud.
So how long will all of this take? Latencies are high and bandwidth low for the first hop, because sensors rarely have great connectivity, and almost never have the higher levels of power required for really fast data transfers (even with 5G). So perhaps we will see a 10ms delay at that stop, plus more if the data is large. Inside the edge we should have a NUMA machine or perhaps a small cluster, and can safely assume 10G connections with latencies of 10us or less, although of course software like TCP will often impose its own delays. The big delay will probably be the handoff to the user-defined function, F.
My guess is that for an event that requires downloading a small photo, the very best performance will be something like 50ms before F sees the event (maybe even 100ms), then another 50-100ms for F to request a download, then perhaps 200ms for the camera to upload the image to the blob server, and then a small delay (25ms?) for the blob server to trigger another event, F', saying "your image is ready!". We're up near 350ms and haven't done any work at all yet!
Because the function server is limited to lightweight computing, it hands off to our micro-service (a quick handoff because the service is already running; the main delay will be the binding action by which the function connects to it, and perhaps this can be done off the critical path). Call this 10ms? And then the micro service can decide what to do with this image.
Add another 75ms or so if we have to forward the request to the cloud. So the cloud might not be able to react to a photo in less than about 500ms, today.
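Tallying my guesses up to this point (and they are only guesses, not measurements), the budget looks roughly like this:

    # Back-of-the-envelope tally of the estimates above; every number is a
    # rough guess from the text, not a measurement.
    budget_ms = {
        "event reaches F (warm-started)":               75,   # "50ms ... maybe even 100ms"
        "F requests the download":                      75,   # "another 50-100ms"
        "camera uploads the image to the blob server": 200,
        "blob server fires the 'image is ready' event": 25,
        "handoff/binding to the micro service":         10,
        "forwarding the request to the cloud":          75,
    }
    for step, ms in budget_ms.items():
        print(f"{step:<46} {ms:4d} ms")
    print(f"{'total':<46} {sum(budget_ms.values()):4d} ms")   # ~460 ms: call it 500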
None of this involved a Jim Gray-style analysis of contention, backoff and retry. If you took my advice and used Derecho for any data replication, the 500ms might be the end of the story. But if you were to use a database solution like MongoDB (or its Azure counterpart, CosmosDB), it seems to me that you might easily see a further 250ms right there.
What should one do about these snowballing costs? One answer is that many of the early IoT applications just won't care: if the goal is to just journal that "Ken entered Gates Hall at 10am on Tuesday", a 1s delay isn't a big deal. But if the goal is to be reactive, we need to do a lot better.
I'm thinking that this is a great setting for various forms of shortcut datapaths that could be set up after the first interaction and offer direct bypass options for moving IoT events or data from the source straight to the real target. Then, with RDMA in the cloud and Derecho used to build your micro service, the 500ms could drop to perhaps 25 or 30ms, depending on the image size, and even less if the photo can be fully handled on the IoT Edge server itself.
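As a thought experiment (the numbers are guesses and the bypass itself is hypothetical), the same kind of tally with a shortcut in place looks more like this:

    # Same back-of-the-envelope style, assuming a shortcut datapath that lets
    # the image bypass the hub/function/blob hops. Numbers are guesses.
    bypass_budget_ms = {
        "sensor pushes the image directly to the micro service": 20,
        "RDMA-based replication inside the micro service":        5,
    }
    print(sum(bypass_budget_ms.values()), "ms")   # ~25 ms, versus ~500 ms above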
On the other hand, if you don't use Derecho but you do need consistency, you'll get into trouble quickly: with scale (lots of these pipelines all running concurrently), and contention, it is easy to see how you could trigger Jim's "naive replication" concerns. So designers of smart highways had better beware: if they don't heed Jim's advice (and mine), by the time that smart highway warns that a car should "watch out for that reckless motorcycle approaching on your left!" it will already have zoomed past...
These are exciting times to work in computer systems. Of course a bit more funding wouldn't hurt, but we certainly will have our work cut out for us!
I really enjoyed reading "Whiteboard analysis: IoT Edge reactive path". Building scalable, timely and predictable IoT services over the uncertainties of today's underlying infrastructures and protocols is a very big challenge... I wonder whether such increasing demands for new IoT services (especially those requiring consistency, as you pointed out) can ever be fully satisfied without considering a complete redesign of these underlying infrastructures/protocols.
I've always found that there is value in the existing technology base and that reusing and extending it is generally the way to go unless something is outright "wrong"! For this particular example, some kind of bypass (short-path) optimization would probably eliminate most of the issues I've pointed to. A form of futures could eliminate the remainder (so, in effect, a node actually touching the data would trigger the real download, and the data would be delivered to it, not to some intermediary entity).
But having said this, I was thinking of writing another of these blogs on the fast paths in Derecho -- and those tend to confirm your view! In Derecho we really did need to do a full redesign of the way data was being shipped around in our version of virtually synchronous Paxos in order to benefit from the full potential speed of the RDMA hardware (and it turns out to be a win even over TCP, which was a nice discovery).
Butler Lampson wrote papers on critical paths quite a while back. He basically said that you shouldn't overdesign or overoptimize, but that once your system is as simple and stable as possible, you should go back and optimize the commonly used, performance-critical paths in ways that don't complicate the interfaces (and he emphasized: the commonly used ones, not the obscure ones). He felt that complexity and cumbersome APIs were the biggest threat, but that after you've simplified and arrived at something clean, you do go back, find the mission-critical data paths, and "hack" those until performance is amazing. In a way, the design of Derecho confirms that approach...