Friday 20 December 2019

A few 10-Year Challenges for Distributed Systems and IoT


A newspaper column on "next decade" predictions got me thinking about crystal ball prognoses.   Tempting as it is to toss in my views on climate change, surveillance in China and self-driving cars, I'll focus this particular blog on computer systems topics in my area.

1. AI Sys and RT ML.  These terms refer to computer systems built to support AI/ML applications, often under real-time constraints.  There is a second meaning, centered on using AI tools inside networks, operating systems and database platforms; I'm open-minded but, so far, haven't seen convincing demonstrations that this will yield big advances.  I'll focus on the first meaning here.

Although AI Sys terminology is trendy, the fact is that we are still at the very earliest stages of incorporating sensing devices into applications that leverage cloud-scale data and machine learning.  As this style of system deployment accelerates in coming years, we'll start to see genuinely smart power grids (existing grids often proclaim themselves to be "smart" but honestly, not much use is being made of ML as of yet), smart homes and offices, smart cities and highways, smart farms....  The long-term potential is enormous, but to really embrace it we need to rethink the cloud, and especially, the cloud edge where much of the reactive logic needs to run.  This is why the first of my predictions centers on the IoT edge: we'll see a new and trustworthy edge IoT architecture emerge and mature, in support of systems that combine sensors, cloud intelligence and big data.

Getting there will require more than just redesigning today's cloud edge components, but the good news is that an area begging for disruptive change can be an ideal target for a researcher.  To give just one example: in today's IoT hub services, we use a database model to track the code revision level and parameter settings for sensors, whether or not they are currently connected and accessible.  The IoT hub manages secure connectivity to the sensors, pushes updates to them, and filters incoming event notifications, handing off to the function service for lightweight processing.  I really like the hub concept, and I think it represents a huge advance relative to the free-for-all we currently see when sensors are connected to the cloud.  Moreover, companies like Microsoft are offering strong quality of service guarantees (delay, bandwidth, and even VPN security) for connectivity to the edge.  They implement the software, then contract with ISPs and telcos to obtain the needed properties.  From the customer's perspective, what matters is that the sensors are managed in a trustworthy, robust and secure manner.

The puzzle relates to the reactive path, which is very far from satisfactory right now.  When a sensor sends some form of event to the cloud, the IoT hub operates like a windowing environment handling mouse movements or clicks: it functions as the main loop for a set of handlers that can be customized and even standardized (thus, a Canon camera could eventually have standard cloud connectivity with standard events such as "focused", "low power mode", or "image acquired").  As with a GUI, incoming events that need user-defined processing are passed to these user-defined functions, which can customize what happens next.
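To make that dispatch model concrete, here is a minimal Python sketch of the GUI-style main loop described above.  The event names and the handler-registration interface are purely illustrative assumptions, not any vendor's actual IoT hub API.

```python
# A minimal sketch of the GUI-style dispatch model: the hub acts as a main
# loop, routing incoming sensor events to user-defined handlers.
# Event names and the registration API here are hypothetical.

from typing import Callable, Dict

class IoTHubDispatcher:
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def on(self, event_type: str, handler: Callable[[dict], None]) -> None:
        """Register a user-defined handler, much like binding a GUI callback."""
        self._handlers[event_type] = handler

    def dispatch(self, event: dict) -> None:
        """Main-loop body: route an incoming sensor event to its handler."""
        handler = self._handlers.get(event["type"])
        if handler is not None:
            handler(event)

hub = IoTHubDispatcher()
hub.on("image_acquired", lambda e: print("new photo from", e["device_id"]))
hub.on("low_power_mode", lambda e: print(e["device_id"], "entering low power"))

hub.dispatch({"type": "image_acquired", "device_id": "canon-001"})
```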

The core problem is with the implementation.  First, we split the path: sensor-to-cloud uploads of large objects like photos or videos follow one path, and end up with the data automatically stored in a binary large object (BLOB) store, replicated for fault-tolerance.

Meanwhile, other kinds of events, like the ones just mentioned, are handled by small fragments of logic: cloud functions.  But these aren't just lambdas written in C++ or Scala -- they are typically full programs built for Linux and handed to the cloud as containers, perhaps with configuration files and even their own virtualized network mapping.  As a result, the IoT hub can't just perform the requested action -- it needs to launch the container and pass the event into it.

The IoT hub accomplishes this using the "function service", which manages a pool of machines, picks one on which to launch the container for this particular Canon photo-acquisition event, and then lets the program load that event's meta-data and decide what to do.  In effect, we launch a Linux command.

Normally, launching a Linux command has an overhead of a few milliseconds.  Doing so through the IoT hub is much slower: today, this process can take as much as two seconds.  The issues are several.  First, because the IoT hub is built on a database like SQL Server or Oracle, we incur overheads associated with the way databases talk to services like the function service.  Next, the function service itself turns out to do a mediocre job of warm-starting functions -- here the delays center on caching, on binding the function to any microservices it will need to talk to (something that could happen ahead of time, off the critical path), and on dealing with any synchronization the function may require.

I can't conceive of a sensible real-time use case where we can tolerate two seconds of delay -- even web page interactions are down in the 10-50ms range today, well below the 100ms level at which A/B tests show that click-through drops.  So I would anticipate a complete redesign of the IoT hub and function layer to warm-start commonly needed functions, allow them to pre-bind to any helper microservices they will interact with (binding is a potentially slow step but can occur off the critical path), and otherwise maintain a shorter critical path from sensor to user-mediated action.  I think we could reasonably target sub-1ms delays... and need to do so!
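Here is a hedged sketch of what that redesign might look like in miniature: keep a pool of resident workers whose bindings to helper microservices were established ahead of time, so the critical path from event arrival to execution is just a queue hand-off.  The class and function names are hypothetical; this is not Azure's (or anyone's) actual function service.

```python
# A sketch of the warm-start idea: workers stay resident, with their
# microservice connections already bound, so handling an event never
# involves launching a container on the critical path.
# All names here are illustrative, not a real cloud API.

import queue
import threading

class WarmFunctionPool:
    def __init__(self, handler, bind_dependencies, workers: int = 4) -> None:
        self._events: "queue.Queue[dict]" = queue.Queue()
        for _ in range(workers):
            # Binding (connections, model loading, auth) happens here,
            # off the critical path, before any event arrives.
            deps = bind_dependencies()
            t = threading.Thread(target=self._run, args=(handler, deps), daemon=True)
            t.start()

    def _run(self, handler, deps) -> None:
        while True:
            event = self._events.get()
            handler(event, deps)          # no container launch on this path

    def submit(self, event: dict) -> None:
        """Critical path: just enqueue -- microseconds rather than seconds."""
        self._events.put(event)
```

The design point is simply that everything slow (container launch, binding, caching) moves to pool-construction time, leaving only a hand-off per event.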

There are many other unnecessarily long delays in today's IoT infrastructures, impacting everything from photo and video upload to ML computation on incoming objects.  But none of this is inevitable, and from a commercial perspective, the value of reengineering it (in a mostly or fully compatible way) would be huge.

2. Cost-efficient sharable hardware accelerators for IoT Edge.  In prior blog postings, I've written about the puzzle of hardware for the IoT edge (many people take that to mean "outside" the cloud, but I also mean "in the outermost tier of a data center supporting the cloud," like Azure IoT).  Here, the central question involves costs: modern ML, and especially model training, is cost-effective only because we can leverage hardware accelerators like GPUs, TPUs and custom FPGAs to offload the computationally parallel steps into ultra-efficient hardware.  To this, add RDMA and NVM.

The current generation of hardware components evolved in back-end systems, and it is no surprise that they are heavily optimized for batched, offline computing.  And this leads to the key puzzle: today's ML accelerators are expensive devices that are cost-effective only when they can be kept busy.  The big batches of work seen in the back-end let today's accelerators run very long tasks, which keeps them busy and makes them cost-effective.  If the same devices were mostly idle, this style of accelerated ML would become extremely expensive.

In some sense, today's ML accelerators would have been at home in the old-style batch computing systems of the 1970's.  As we migrate toward a more event-driven IoT edge, we also will need to migrate machine learning (model training) and inference into real-time contexts, and this means that we'll be using hardware accelerators in settings that lack the batched pipelining that dominates in the big-data, HPC-style settings where those accelerators currently reside.  To be cost-effective we will either need completely new hardware (sharable between events or between users), or novel ways to repurpose our existing hardware for use in edge settings.

It isn't obvious how to get to that point, making it a fascinating research puzzle.  As noted, edge systems are event-dominated, although we do see streams of image and video data (image-processing tasks on photo or video streams can be handled fairly well with existing GPU hardware, so that particular case can be solved cost-effectively now).  The much harder case involves singleton events: "classify this speech utterance," or "decide whether or not to retain a copy of that photo."  So the problem is to do snap analysis of an event.  And while my examples involve photos and videos, any event could require an intelligent response.  We may only have milliseconds to react, and part of that reaction may entail retraining or incrementally adjusting the ML models -- dynamic learning.
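One approach that is often discussed for this situation -- and I offer it only as an illustration of the design space, not as the answer -- is to micro-batch singleton events that arrive within a few milliseconds of one another, so that a single accelerator kernel launch serves many requests.  A rough sketch, with hypothetical pull_event and run_on_accelerator hooks:

```python
# A sketch of micro-batching: collect singleton inference events for a few
# milliseconds, then run them as one batch to keep the accelerator busy.
# pull_event(timeout=...) is a caller-supplied hook that returns the next
# event, or None on timeout; run_on_accelerator is likewise a stand-in.

import time

def micro_batch(pull_event, run_on_accelerator, max_batch=32, window_ms=2.0):
    """Collect events for up to window_ms, then run them as one batch."""
    while True:
        batch = [pull_event()]                      # block for the first event
        deadline = time.monotonic() + window_ms / 1000.0
        while len(batch) < max_batch and time.monotonic() < deadline:
            event = pull_event(timeout=deadline - time.monotonic())
            if event is None:
                break
            batch.append(event)
        run_on_accelerator(batch)                   # one kernel launch per batch
```

Note that batching across users runs straight into the isolation problem discussed next, which is part of why this remains a research puzzle rather than a solved engineering task.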

The hardware available today isn't easily sharable across scaled-out event-driven systems where the events may originate in very different privacy domains, or from different users.  We lack ways to protect data inside accelerators: Intel's new SIMD instruction set offers standard protections, but a GPU or TPU or FPGA is typically operated as a single security context.  It is wide open if a task runs on my behalf immediately after one that ran on yours -- the kernel I've invoked could simply reach over and extract any data left behind after your task finished.

So why not use Intel's SIMD solutions?  For classification tasks, this may be the best option, but for training, which is substantially more expensive from a computational point of view, the Intel SIMD options are currently far slower than GPU or TPU (FPGA is the cheapest of all the options, but would typically be somewhere in between the SIMD instructions and a GPU on the performance scale).

It will be interesting to watch this one play out, because we can see the end goal easily, and the market pressure is already there.  How will the hardware vendors respond?  And how will those responses force us to reshape the IoT edge software environment?

3. Solutions for the problem blockchain was supposed to solve.  I'm pretty negative about cryptocurrencies but for me, blockchain is a puzzle.  Inside the data center we've had append-only logs for ages, and the idea of securing them against tampering using entangled cryptographic signatures wasn't particularly novel back when the blockchain for Bitcoin was first proposed.  So why is a tamper-proof append-only log like Microsoft's Corfu system not a blockchain?
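For readers who haven't seen the trick, here is a minimal sketch of the entangled-hash idea: each appended record carries the hash of its predecessor, so tampering with any entry invalidates every later hash.  This illustrates the general technique only; it is not how Corfu or any production blockchain is actually built.

```python
# A minimal tamper-evident append-only log: each record includes the hash of
# the previous one, so altering any entry breaks every subsequent hash.

import hashlib
import json

def append(log, payload):
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"prev": prev_hash, "payload": payload}
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return log

def verify(log):
    prev_hash = "0" * 64
    for entry in log:
        record = dict(entry)
        claimed = record.pop("hash")
        if record["prev"] != prev_hash:
            return False
        recomputed = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        if recomputed != claimed:
            return False
        prev_hash = claimed
    return True

log = []
append(log, {"sensor": "thermometer-7", "temp_c": 3.9})
append(log, {"sensor": "thermometer-7", "temp_c": 4.1})
assert verify(log)
```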

There are several ways in which blockchain departs from that familiar, well-supported option.  I'll touch on them in an unusual order: practical uses first, then the more esoteric (almost "religious") considerations, which I'll tackle last.  Then I want to argue that the use cases do point to a game-changing opportunity, but that the whole story is confused by the religious zealotry around these secondary and, actually, much less important aspects.

First among the novel stories is the concept of a smart contract, which treats the blockchain as a database and permits the developer to place executable objects into blockchain records, with the potential of representing complex transactions like the mortgage-backed securities that triggered the 2008 meltdown.  The story goes that if we can capture the full description of the security (or whatever the contract describes), including the underlying data that should be used to value it, we end up with a tamper-proof and self-validating way to price such things, and our transactions will be far more transparent.

I see the value in the concept of a smart contract, but worry that the technology has gotten ahead of the semantics: as of the end of 2019 you can find a dozen tools for implementing smart contracts (Ethereum is the leader, but Hyperledger is popular too).  Less clear is the question of precisely how these are supposed to operate.  Today's options are a bit like the early C or Java programming languages: both omitted specifications for all sorts of things that actually turned out to matter, leaving it to the compiler-writer to make a choice.  We ended up with ambiguities that gave us today's security problems with C programs.

With blockchain and smart contracts you have even nastier risks because some blockchain implementations are prone to rollback (abort), and yet smart contracts create dependency graphs in which record A can depend on a future record B.  A smart contract won't seem so smart if this kind of ambiguity is allowed to persist... I predict that 2020 will start a decade when smart contracts with strong semantics will emerge.  But I'll go out on a limb and also predict that by the time we have such an option, there will be utter chaos in the whole domain because of these early but inadequate stories.  Smart contracts, the real kind that will be robust with strong semantics?  I bet we won't have them for another fifteen years -- and when we do get them, it will be because a company like Oracle or Microsoft steps in with a grown-up product that was thought through from bottom to top.  We saw that dynamic with Java and CORBA giving way to C# and LINQ and .NET, which in turn fed back into languages like C++.  And we will see it again, but it will take just as long!

But if you talk to people enamored with blockchain, it turns out that in fact, smart contracts are often seen as a cool curiosity.  I might have a narrow understanding of the field, but among people I'm in touch with, there is little interest in cryptocurrency and even less interest in smart contracts.  More common, as far as I can tell, is a focus on the auditability of a tamperproof ledger.

I'll offer one example that I run into frequently here at Cornell, in the context of smart farming.  You see variants of it in medical centers (especially ones with partner institutions that run their own electronic health systems), human resource management, supply chains, airports that need to track airplane maintenance, and the list goes on.  At any rate, consider farm-to-table cold-chain shipment for produce or agricultural products like cheese or processed meats.  A cup of yoghurt starts with the cow being milked, and even at that stage we might wish to track which cow we milked, how much milk she produced and its fat content, and to document that she was properly washed before the milking machine kicked in, that we tested for milk safety and checked her health, and that the milk was promptly chilled and then stored at the proper temperature.  Later the milk is aggregated into a big batch, transported, tested again, pasteurized, homogenized, graded by fat content, and cultured (and that whole list kicks in again: in properly sterile conditions, at the right temperature...).

So here's the challenge: could we use a blockchain to capture records of these kinds in a secure and tamperproof manner, and then be in a position to audit that blockchain for various tasks, such as confirming that the required safety steps were followed, or looking for optimization opportunities?  Could we run today's ML tools on it, treating the records as an ordered collection and mapping that collection into a form that TensorFlow or Spark/Databricks could ingest and analyze?  I see this as a fantastic challenge problem for the coming decade.
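As a toy illustration of what such an audit might look like, here is a sketch that first checks the chain for tampering (reusing the verify routine from the earlier hash-chain sketch) and then flattens the records into an ordered table that an analytics or ML pipeline could consume.  The field names (step, temp_c, timestamp) and the 4.5°C threshold are hypothetical.

```python
# A sketch of the audit idea: refuse to audit a tampered log, then flatten
# the verified records into ordered rows for downstream analysis.
# Field names and the temperature threshold are illustrative assumptions.

def audit_cold_chain(log, max_temp_c=4.5):
    """Return (violations, rows) after confirming the log is untampered."""
    if not verify(log):                      # verify() from the sketch above
        raise ValueError("log fails tamper check; refusing to audit")
    rows, violations = [], []
    for i, entry in enumerate(log):
        rec = entry["payload"]
        rows.append((i, rec.get("step"), rec.get("temp_c"), rec.get("timestamp")))
        if rec.get("temp_c", 0.0) > max_temp_c:
            violations.append(i)             # cold-chain requirement violated
    return violations, rows                  # rows could feed a DataFrame or TF dataset
```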

The task is fascinating and hard, for a lot of reasons.  One is that the domain is partly disconnected (my colleagues have created a system, Vegvisir, focused on this aspect).  A second question concerns the integrity of our data-capture infrastructure: can I trust that this temperature record is from the proper thermometer, correctly calibrated, and so forth?  Do I have fault-tolerant redundancy?  How can we abstract from the chain of records to a trustworthy database, and what do trust-preserving queries look like?  How does one do machine learning on a trusted blockchain, and what trust properties would the model then carry?  Can a model be self-certifying, too?  What would the trust certificate look like?  (At a minimum, it would need to say that "if you trust X and Y and Z, you can trust me for purpose A under assumption B...")  I'm reminded of the question of self-certifying code... perhaps those ideas could be applied in this domain.

I commented that this is the problem blockchain really should be addressing.  I say this because, as far as I can tell, the whole area is bogged down in debates that have more to do with religion than with rigorous technical argument.  To me this is at least in part because of the flawed belief that anonymity and permissionless mining are key properties that every blockchain should offer.  The former is of obvious value if you plan to do money laundering, but I'm pretty sure we wouldn't even want this property in an auditing setting.  As for the permissionless mining model, the intent was to spread the blockchain mining revenue fairly, but this has never really been true in any of the main blockchain systems: they are all quite unfair, and all the revenue goes to shadowy organizations that operate huge block-mining systems.  As such, the insistence on permissionless mining with anonymity really incarnates a kind of political opinion, much like the "copyleft" clause built into GNU licenses, which incarnated a view that software shouldn't be monetized.  Permissionless blockchain incarnates the view that blockchains are for cybercurrency, that cybercurrency transactions shouldn't be taxed or regulated, and that management of this infrastructure is a communal opportunity, but also a communal revenue source.

Turning to permissionless blockchain as it exists today, we have aspects of this dreamed-of technology, but the solutions aren't fair, and in fact demand a profoundly harmful mining model that squanders energy in the form of hugely expensive proof-of-work certifications.  My colleague, Robbert van Renesse, has become active in the area and has recently been doing a survey that also looks at some of the other ideas people have floated: proof of stake (a model in which the rich get richer, but the compute load is much reduced, so they spend less to earn their profits...), proof of elapsed time (a lovely acronym, PoET, but in fact a problematic model because it can be subverted using today's Intel SGX hardware), and all sorts of one-way functions that are slow to compute and easy to verify (the parallelizable ones can be used for proof-of-work, but the sequential ones simply reward whoever has the fastest computer, which causes them to fail on a different aspect of the permissionless blockchain mantra: they are "undemocratic", meaning that they fail to distribute the income for mining blocks in a fair manner).  The bottom line, according to Robbert, is that for now, permissionless blockchain demands computational cycles, and those cycles make it pretty much the least-green technology on earth.  There is some irony here, because those who promote this model generally seem to have rather green politics in other ways.  I suppose this says something about the corrupting influence of potentially vast wealth.
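To make the energy argument concrete, here is a toy proof-of-work puzzle: the miner must find a nonce whose hash begins with a run of zeros, which takes exponentially many hash attempts as the difficulty grows, yet is trivial to verify.  Real systems differ in the details, but the cost structure -- cheap verification, enormously expensive search -- is the same, and that search is where the electricity goes.

```python
# A toy proof-of-work puzzle. With `difficulty` leading zero hex digits,
# finding a valid nonce takes roughly 16**difficulty hash attempts on
# average; verifying the answer takes one. This is an illustration, not
# Bitcoin's exact scheme.

import hashlib

def mine(block_data: str, difficulty: int = 5):
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest          # the "proof": cheap to check,
        nonce += 1                        # expensive (in energy) to find
```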

Meanwhile, more or less betting on the buzz, we have a whole ecosystem of companies convinced that what people really want are blockchain curation products for existing blockchain models.  These might include tools that build the blockchain for you using the more-or-less standard protocols, that back it up, clean up any garbage, index it for quick access, integrate it with databases and AI/ML.  We also have companies promoting some exceptionally complex protocols, many of which seem to have the force of standards simply because people are becoming familiar with their names.  It will take many years to even understand whether or not some of these are correct -- I have serious doubts about a few of the most famous ones!

But here's my bet for the coming decade: in 2029, we'll be seeing this market morph into a new generation of WAN database consumers, purchasing products from today's database companies.  Those customers won't really be particularly focused on whether they use blockchain or some other technology (and certainly won't insist on permissionless models with pervasive anonymity and proof of work).  They will be more interested in tamperproof audits and ML on the temporally-ordered event set.

Proof of work per-se will have long since died from resource exhaustion: the world simply doesn't have enough electrical power and cooling to support that dreadful model much longer (don't blame the inventors: the blame here falls squarely on the zealots in the cybercoin community, who took a perfectly good idea and twisted it into something harmful as part of their quest to become billionaires off the back of a pie-in-the-sky economic model).

The future WAN databases that emerge from the rubble will have sophisticated protection against tampering, and the concept of trust in a record will have been elevated to a notion of a trustworthy query result, one that can be checked efficiently by the skeptical end-user.  And this, I predict, will be a huge market opportunity for the first players to pull it off.  It would surprise me if those players don't turn out to include today's big database companies.

4. Leave-nothing-sensitive-behind privacy.  The role of the cloud in smart settings -- the ones listed above, or others you may be thinking about -- is deeply enshrined by now: very few smart application systems can avoid a cloud-centric computing model in which the big data and the machine intelligence are at least partly cloud-hosted.  However, for IoT uses, we also encounter privacy and security considerations that the cloud isn't terribly good at right now, with some better examples (Azure, on the whole, is excellent) and some particularly poor ones (I won't point a finger, but I will comment that companies incented to place a lot of advertising often find it hard to avoid viewing every single user interaction as an invaluable asset that must be captured in perpetuity and then mined endlessly for every possible nugget of insight).

The upshot of this is that the cloud is split today between smart systems that are trying their best to spy on us, and smart systems that are just doing smart stuff to benefit us.  But I suspect that the spying will eventually need to end, at least if we hope to preserve our Western democracies.  How then can we build privacy-preserving IoT clouds?

I've written about this in the past, but in a nutshell, I favor a partnership: a style of IoT application that tries to "leave no trace behind" coupled to a cloud vendor infrastructure that promises not to deliberately spy on the end-user.  Thus for example when a voice command is given to my smart apartment, it may well need to be resolved up on the cloud, but shouldn't somehow be used to update databases about me, my private life, my friends...

I like the mental imagery of camping in a wilderness where there are some bears roaming around.  The cloud needs a model under which it can transiently step in to assist in making sense of my accent and choice of expressions, perhaps even contextualized by knowledge of me and my apartment, and yet when the task finishes, there shouldn't be anything left behind that can leak to third party apps that will rush into my empty campsite, hungry to gobble up any private data for advertising purposes (or worse, in countries like China, where the use of the Internet to spy on the population is a serious threat to personal liberties).  We need to learn to enjoy the benefits of a smart IoT edge without risk.

Can this be done?  I think so, if the cloud partner itself is cooperative.  Conversely, the problem is almost certainly not solvable if the cloud partner will see its revenue model break without all that intrusive information, and hence is hugely incented to cheat.  We should tackle the technical aspects now, and once we've enabled such a model, I might even favor asking legislative bodies to mandate privacy-preservation as a legally binding obligation on cloud vendor models.  I think this could be done in Europe, but the key is to first create the technology so that we don't end up with an unfunded and infeasible mandate.  Let's strike a blow against all those companies that want to spy on us!  Here's a chance to do that by publishing papers in top-rated venues... a win-win for researchers!

5. Applications that prioritize real-time.  Many IoT systems confront deadlines, and really have no choice except to take actions at the scheduled time.  Yet if we want to also offer guarantees, this poses a puzzle: how do we implement solutions that are always sure to provide the desired timing properties, yet are also "as consistent" as possible, or perhaps "as accurate as possible", given those constraints?

To me this is quite an appealing question because it is easy to rattle off a number of ways one might tackle such questions.  For example, consider an ML algorithm that iterates until it converges, which typically involves minimizing some sort of error estimate.  Could we replace the fixed estimate by adopting a model that permits somewhat more error if the deadline is approaching?
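As a sketch of that idea -- with step() and error() standing in for whatever iterative ML routine is actually in use -- the convergence tolerance could be relaxed smoothly as the deadline approaches, so the solver always returns something on time:

```python
# A sketch of deadline-aware convergence: the acceptable error grows from
# base_tol toward relaxed_tol as the time budget is consumed, so the loop
# always terminates by the deadline. step() and error() are stand-ins for
# the real iterative routine.

import time

def solve_by_deadline(step, error, deadline_s, base_tol=1e-4, relaxed_tol=1e-1):
    start = time.monotonic()
    while True:
        step()                                         # one solver iteration
        frac = min((time.monotonic() - start) / deadline_s, 1.0)
        tol = base_tol + frac * (relaxed_tol - base_tol)
        if error() <= tol or frac >= 1.0:
            return error()                             # best answer available in time
```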

Or here's an idea: what about simply skipping some actions because it is clear we can't meet the deadline for them?  I'm reminded of work Bart Selman, a colleague of mine, did fifteen years ago.  Bart was looking at situations in which an AI system confronted an NP-complete question, but in a streaming context where variations on that question would be encountered every few seconds (he was thinking about robot motion planning, but similar issues arise in many AI tasks).  What he noticed was that heuristics for solving these constrained optimization problems sometimes converge rapidly, but in other situations diverge and compute endlessly.  So his idea, very clever, was to take the quick answers but to just pull the plug on computations that take too long.  In effect, Bart argued that if the robot is faced with a motion-planning task it won't be able to solve before its next step occurs, it should take the previously-planned step and then try again.  Sooner or later the computation will converge quickly, and the overall path will be both high quality and fast.
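A sketch of that pull-the-plug strategy, with improve() and converged() standing in for the real planner's internals: give the heuristic a fixed budget, and if it hasn't converged when the budget runs out, coast on the previous plan and try again next cycle.

```python
# A sketch of the "pull the plug" strategy: bound the heuristic's work; if it
# diverges, reuse the previous plan for this cycle. improve() and converged()
# are hypothetical stand-ins for the real planner.

def plan_with_budget(improve, converged, state, previous_plan, max_iters=10_000):
    plan = previous_plan
    for _ in range(max_iters):
        plan = improve(state, plan)
        if converged(plan):
            return plan, True          # quick convergence: use the fresh plan
    return previous_plan, False        # diverging run: coast on the old plan
```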

We could do similar things in many IoT edge settings, like the smart-things cases enumerated earlier.  You might do better to have a smart grid that finds an optimized configuration setting once every few seconds, but then coasts along using old settings, than to pause to solve a very hard configuration problem 20 times per second if in doing so, you'll miss the deadline for actually using the solution.  The same is true for management of traffic flow on a highway or in a dense city.

For safety purposes, we will sometimes still want to maintain some form of risk envelope.  If I'm controlling a smart car in a decision loop that runs 20 times per second, I might not run a big risk if I toss up my hands even 4 or 5 times in a row.  But we would not want to abandon active control entirely for 30 seconds, so there has to be a safety mechanism too, one that kicks in long before the car could cause an accident (or miss the next turn), forcing it into a safe mode.  I don't see any reason we couldn't do this: a self-driving car (or a self-managed smart highway) would need some form of safety monitor in any case, to deal with all sorts of possible mishaps, so having it play the role of making sure the vehicle has fresh guidance data seems like a fairly basic capability.  Then in the event of a problem, we would somehow put that car into a safe shutdown mode (it might use secondary logic to pull itself into a safety lane and halt, for example).
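A sketch of that safety mechanism, with illustrative thresholds: a watchdog tolerates a few consecutive missed control cycles, but forces safe mode well before stale guidance could become dangerous.

```python
# A sketch of the risk-envelope idea: count consecutive control cycles that
# lacked a fresh plan, and force safe mode once a threshold is crossed.
# The threshold and the safe-mode action are illustrative assumptions.

class GuidanceWatchdog:
    def __init__(self, max_consecutive_misses: int = 5, enter_safe_mode=None):
        self.max_misses = max_consecutive_misses
        self.misses = 0
        self.enter_safe_mode = enter_safe_mode or (lambda: print("SAFE MODE"))

    def cycle(self, fresh_plan_available: bool) -> None:
        """Call once per control cycle (e.g., 20 times per second)."""
        if fresh_plan_available:
            self.misses = 0
        else:
            self.misses += 1
            if self.misses >= self.max_misses:
                self.enter_safe_mode()   # e.g., pull into a safety lane and halt
```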

I could probably go on indefinitely, but every crystal ball eventually fogs over, so perhaps we'll call it quits here.  Have a great holiday and see you in the next decade!
