
Saturday, 19 September 2020

Who will host the future edge cloud?

As you know if you've followed my blog, I'm particularly excited about the potential of the IoT edge.  The term is a bit ambiguous, so I'll explain what I mean by it first, but then I want to speculate a little about who will end up "hosting" this important emerging real-estate.  The host of any large-scale cloud has the potential to earn quite a bit of revenue (witness AWS and Azure), so the question has big monetary implications.

With the explosive growth of machine learning in the standard cloud, it can be easy to overlook a key point: in fact, very little of the cloud is capable of reacting in real-time to evolving data.  The core issue is that ML emerged from the back-end big data analytic frameworks: massive storage infrastructures coupled to scalable computing infrastructures (think of Spark side-by-side with HDFS or Ceph).  Such a structure is designed for sharding, so we can first upload our data in a spread-out way (either using a randomized placement, or with some form of control over that step), then initiate computations which will run in a decentralized manner.  The resulting performance and scalability can be astonishing.

But if we work in this manner, we end up with batched jobs that recompute the models or perform other tasks after some delay, perhaps minutes and in many settings, days or even weeks.  Thus today when the cloud edge performs a task using a machine-learned model, that model will probably be somewhat stale.

By cloud edge, I have in mind the first tiers of a typical data center.  In systems like AWS, Azure, Facebook, Google (and the list, of course, is endless...), a data center is typically split between that back-end infrastructure and a front-end system.  This front-end portion is deployed as a series of layers: the first tier is a highly elastic pool of servers that handle the majority of incoming requests, such as building web pages or processing video uploads from webcams.  The idea is that a first-tier server should be fast, lightweight and specialized, and we work with a pay-for-what-you-use model in which these servers are launched on demand, run while they are actively doing something, then terminate.  Thus any given node in the first-tier pool can be repurposed quite often.  

The increasingly popular "function" model is an example of a first-tier capability. In Google's cloud or AWS, the first tier can launch user-provided code: computing on demand, as needed.  (Azure supports a function service too, but one that is strangely slow: whereas a Google or AWS function launches within as little as 50-100ms, Azure functions often need many seconds to start... and hence are probably not something that will be useful in the future edge cloud I have in mind, unless Microsoft reimplements that tier to offer competitive speed.)
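
To make that concrete, here is a minimal sketch of what such a first-tier function might look like, written in Python in the AWS Lambda style (a handler that receives an event and a context).  The event fields, the response shape and the "work" it does are placeholders I've invented purely for illustration.

    # Minimal sketch of a first-tier "function": stateless, launched on demand,
    # and done in milliseconds.  Event fields here are hypothetical.
    import json

    def handler(event, context):
        # A first-tier function typically does a small amount of work itself
        # and delegates anything heavy to second-tier services.
        user_id = event.get("user_id")
        page_kind = event.get("page", "home")

        # In a real deployment this would call out to second-tier service pools
        # (recommendations, session state, ...); here we just fake a response.
        body = {"user": user_id, "page": page_kind, "items": []}
        return {"statusCode": 200, "body": json.dumps(body)}

    if __name__ == "__main__":
        print(handler({"user_id": "u-42", "page": "orders"}, None))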

The nodes supporting the function tier generally lack very high-end GPUs, simply because a GPU needs a lot of care and feeding: we need to pre-provision it with the kernels (code libraries) we plan to use, and perhaps preload hyperparameters and models, etc.  My point about Azure being slow applies: if one wanted to leverage these kinds of resources, launching a function might take seconds... far too long for a first tier that nominally will be elastic at a scale of fractions of a second, running totally different sets of functions on those same nodes.  This all adds up to a simple rule of thumb: a service running in this tier isn't really the obvious candidate for heavyweight ML tasks.

We solve this problem by pairing our first tier with a tier (call it the second tier) of services that provide supporting functionality.  The second-tier servers are managed by something called the "App Service", and again are quite dynamic: a collection (or more often, a graph) of service pools, each pool providing some functionality specialized to a particular task, and interconnected because these pools might need help from one another too.  Amazon once remarked that a typical web page actually reflects contributions from as many as 100 distinct servers: we have a web page builder in the first tier, and those 100 servers are actually members of 100 service pools, each pool containing an elastically varied set of servers.  Thus, for Amazon's own uses, the second tier would be crowded with hundreds of service pools, to say nothing of second-tier uses associated with AWS customers.

What might a second-tier service be doing?  Well, think about the "also recommended" product list on your Amazon page.  Some ML model was used to realize that if you are looking at the Da Vinci Code (which you bought years ago), you might also enjoy reading The Lost Symbol (another of Dan Brown's books), or The Templar Codex (another "chase through Europe seeking a medieval artifact"), or perhaps An Instance of the Fingerpost (an obscure but excellent historical novel...).  Amazon has a whole service to compute recommendations, and it in turn uses other services to pull in sub-recommendations in various genres, all specialized for "people like you" based on your history of browsing and purchases.

So just this one example might give us several services, each one operating as a pool because the load on it would vary so greatly as Amazon shoppers poke around.  We would run such a pool with a small number of service instances when load drops, but the App Service can spawn more on demand.  
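
Purely as a thought experiment (this is not Amazon's actual design), a recommendation service of this kind might fan a request out to per-genre sub-services and merge whatever comes back.  In the Python sketch below, the genre pool names and the stand-in RPC are invented; a real deployment would issue calls to elastically scaled service pools.

    # Sketch of a second-tier recommendation service fanning out to per-genre
    # sub-services.  The pools and their results are invented for illustration.
    from concurrent.futures import ThreadPoolExecutor

    GENRE_POOLS = ["thrillers", "historical-fiction", "art-history"]  # hypothetical

    def query_genre_pool(pool, user_id):
        # Stand-in for an RPC to one genre-specialized service pool.
        return [f"{pool}-pick-{i}" for i in range(2)]

    def recommend(user_id):
        # Fan out to the pools in parallel, then merge (and, in reality, rank).
        with ThreadPoolExecutor(max_workers=len(GENRE_POOLS)) as executor:
            futures = [executor.submit(query_genre_pool, g, user_id) for g in GENRE_POOLS]
            merged = [item for f in futures for item in f.result()]
        return merged[:5]

    if __name__ == "__main__":
        print(recommend("user-123"))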

Unlike a first tier, second tier service pools deal mostly with internal traffic (the first tier is triggered by requests coming over the Internet, from browsers and other web-enabled apps), and whereas the first tier spins on a dime, spawning or terminating instances in fractions of a second, the second tier might be much less quick to adapt: think in terms of processes that do run on multi-purpose nodes, but once they launch tend to own the hardware for minutes or hours.  So it makes sense to put GPUs and FPGAs on these second tier instances.

The second tier is closely paired to a third tier of vendor-supplied services, which often can be customized in various ways.  So in this third tier we have storage solutions (file systems, DHTs of various kinds), databases, AI engines, etc.  Some modern developers work purely with the pre-supplied options.  On the other hand, if you are tackling an unusual situation, like automating a dairy farm, the odds are that you won't find the mix of services you really need.  You'll pretty much be forced either to work with slow back-end systems (those run in batched style, once every few hours or days) or to build a new second-tier service of your own.

With this context established, let's return to my real topic.  

Cloud operators like AWS and Azure make a great deal of their revenue in these second and third tiers of the cloud.  Yet in adopting this model, a hidden dynamic has emerged that favors the big warehouse style of clouds: the vast majority of today's web is centered around web pages and "high functionality" mobile devices.  Any rapidly-changing data lives and is used right on your mobile device; for example, this is where Google Maps runs.  Thus the data center isn't under enormous time pressure, even in the first and second tiers.  We tend to focus more on pipelines: we want a steady stream of web pages constructed within 100ms per page.  That may seem fast, but consider the speed at which a self-driving car needs to react -- on a high-speed road, 100ms is an eternity.   Perhaps this observation explains why Azure has been competitive despite its strangely slow function tier: if people currently write functions that often run for minutes, a few seconds to launch them wouldn't be noticeable.

But we are seeing more and more uses that need instantaneous responses.  Today's cloud edge is really just the outside of the cloud, and not responsive enough.  As we deploy IoT devices into the environment, we are starting to see a need for something else: an edge cloud (did you notice that the word order flipped?).  By this I mean a way to situate my computing closer to the devices.  The cloud edge might be too far away, and the latency of talking to it could cripple my IoT solution.

The need, then, is for a new form of data center: an edge cloud data center.  What might it look like, and where would it run?

My view is that it would look like the Google or AWS lambda tier (and I bet that by the time this plays out, the Azure function tier will have been fixed: the Azure cloud clearly is equal to any rival in every arena that has competitive importance right now).  Basically, the edge cloud will host first, second and third-tier software, probably using the exact same APIs we see today.

It would run on something like Microsoft's Azure IoT Edge: a small compute cluster (maybe even just a little box running a NUMA CPU chip), situated close to the sensors, with a WAN network link reaching back into the cloud for any tasks that can't be performed on this platform.

So who will run these boxes?  If we think of them in the aggregate, we are suddenly talking about a true edge cloud that could host third-party apps, charge for cycles and other resources consumed, and hence generate revenue for the operator.  These apps could then support smart homes, smart grid, smart roadways to guide smart cars... whatever makes sense, and represents a potential market.  Yet there is a chicken-and-egg puzzle here: the scale at which one can really call something a "cloud" is certainly national and perhaps global.  So any operator able to play in this market would need to deploy and manage an infrastructure at planetary scale: otherwise, we end up with just a handful of boxes of type A in the US, others of type B in China, each owned and operated by a distinct player, and each with its own customized standards.  I find it hard to see how that kind of heterogeneity could give us a true edge cloud.

I've been involved with 5G and, for a while, wondered whether telephony operators might be the answer.  They already need to deploy 5G point-of-presence systems, which often look quite similar to what I've described.  Indeed, Azure's IoT Edge seems to be an early winner in this market.  But the resulting deployments are still operated by literally thousands of players, and perhaps I'm underestimating by a factor of 10.  The number of likely 5G operators is simply too large for more than a few to tackle this opportunity (I can see AT&T doing it, for example, or Comcast -- but not some random little 5G operator that sells services through phone SIM cards).

Another possibility would be that one of the big cloud vendors goes all in on the edge cloud and deploys these boxes through various deals: perhaps the 5G players agree to rent the needed physical space, power and other resources, but the box itself belongs to Alphabet, Amazon, Microsoft or AT&T.  This would be a story similar to the land-grab that occurred when Akamai emerged as dominant in the content-serving market back in 1998.  On the other hand, it is notable that as Akamai grew, their deployment consolidated: around 2000, they were out to place an Akamai compute rack in every single ISP switching office in the world.  By 2010, Akamai was operating a smaller number of data centers, mostly ones they themselves owned, and yet obtaining better performance.  In fact, although I can't find any public studies of the question, I wouldn't be surprised if Akamai's global datacenter deployment today, in 2020, actually resembles the AWS, Azure and Google ones.  The question is of interest because first, it may suggest that Akamai has a bit of an incumbency advantage.  And second, any structure performant enough for content serving would probably work for IoT.

But perhaps we will soon know the answer.  The industry clearly is waking up to the potential of the IoT Edge, and the edge cloud has such obvious revenue appeal that only a very sleepy company could possibly overlook it.  If you look at the Google ngram graphs for terms like edge cloud, the trend is obvious: this is an idea whose time has arrived.

Friday, 16 November 2018

Forgotten lessons and their relevance to the cloud

While giving a lecture in my graduate course on the modern cloud, and the introduction of sensors and hardware accelerators into machine-learning platforms, I had a sudden sense of deja-vu.

In today's cloud computing systems, there is a tremendous arms-race underway to deploy hardware as a way to compute more cost-effectively, to process more data faster, or simply to offload very repetitious tasks onto specialized subsystems that are highly optimized for those tasks.  My course covers quite a bit of this work: we look at RDMA, new memory options, FPGA, GPU and TPU clusters, challenges of dealing with NUMA architectures and their costly memory coherence models, and similar topics.  The focus is nominally on the software gluing all of this together ("the future cloud operating system") but honestly, since we don't really know what that will be, the class is more of a survey of the current landscape with a broad agenda of applying it to emerging IoT uses that bring new demands into the cloud edge.

So why would this give me a sense of deja-vu?  Well, grant me a moment for a second tangent and then I'll link my two lines of thought into a single point.  Perhaps you recall the early days of client-server computing, or the early web.  Both technologies took off explosively, only to suddenly sag a few years later as a broader wave of adoption revealed deficiencies.

If you take client-server as your main example, we had an early period of disruptive change that you can literally trace to a specific set of papers: Hank Levy's first papers on the VAX Cluster architecture, written when he was still working with DEC.  (It probably didn't hurt that Hank was the main author: in systems, very few people are as good as Hank at writing papers on topics of that kind.)  And in a few tens of pages, Hank upended the mainframe mindset and introduced us to this other vision: clustered computing systems in which lots of components somehow collaborate to perform scalable tasks, and it was a revelation.  Meanwhile, mechanisms like RPC were suddenly becoming common (CORBA was in its early stages), so all of this was accessible.  For people accustomed to file transfer models and batch computing, it was a glimpse of the promised land.

But the pathway to the promised land turned out to be kind of challenging.  DEC, the early beneficiary of this excitement, got overwhelmed and sort of bogged down: rather than being a hardware company selling infinite numbers of VAX clusters (which would have made them the first global titan of the industry), they somehow got dragged further and further into a morass of unworkable software that needed a total rethinking.  Hank's papers were crystal clear and brilliant, but a true client-server infrastructure needed 1000x more software components, and not everyone can function at the level Hank's paper more or less set as the bar.  So, much of the DEC infrastructure was incomplete and buggy, and for developers, this translated to a frustrating experience: a fast on-ramp followed by a very bumpy, erratic experience.  The ultimate customers felt burned and many abandoned DEC for Sun Microsystems, where Bill Joy managed to put together a client-server "V2" that was somewhat more coherent and complete.  Finally, Microsoft swept in and did a really professional job, but by then DEC had vanished entirely, and Sun was struggling with its own issues of overreach.

I could repeat this story using examples from the web, but you can see where I'm going: early technologies, especially disruptive, revolutionary ones, often take on a frenetic life of their own that can get far ahead of the real technical needs.  The vendor then becomes completely overwhelmed and unless they somehow can paper over the issues, collapses.

Back in that period, a wonderful little book came out on this: Crossing the Chasm, by Geoffrey Moore.  It describes how technologies often follow a bumpy adoption curve over time.  The first bump is associated with the early adopters (the kind of people who live to be the first to use a new technology, back before it even becomes stable).  But conservative organizations prefer to be "first to be last," as David Bakken says.  They hold back, waiting for the technology to mature, hoping to avoid the pain but also not miss the actual surge of mainstream adoption.  Meanwhile, the pool of early adopters dries up and some of them wander off to the next even newer thing, so the adoption curve sags, perhaps for years.  Wired writes articles about the "failure of client-server" (well, back then it would have been ComputerWorld).

Finally, for the lucky few, the really sustainable successes, you see a second surge in adoption and this one would typically play out over a much longer period, without sagging in that same way, or at least not for many years.  So we see a kind of S-curve, but with a bump in the early period.

All of which leads me back to today's cloud and this craze for new accelerators.  When you consider any one of them, you quickly discover that they are extremely hard devices to program. FPGA pools in Microsoft's setting, for example, are clearly going to be expert-only technologies (I'm thinking about the series of papers associated with Catapult).  It is easy to see why a specialized cloud micro-service might benefit, particularly because the FPGA performance to power-cost ratio is quite attractive.  Just the same, though, creating an FPGA design is really an experts-only undertaking.  Worse, a broken FPGA could be quite disruptive to the data center.  So we may see use of these pools by subsystems doing things like feature ranking for Bing search, crypto for the Microsoft Azure VPC, or data compression and other similar tasks in Cosmos.  But I don't imagine that my students here at Cornell will be creating new services with new FPGA accelerators anytime soon.

GPU continues to be a domain where CUDA programming dominates every other option.  This is awesome for the world's CUDA specialists, and because they are good at packaging their solutions in libraries we can call from the attached general-purpose machine, we end up with great specialized accelerators for graphics, vision, and similar tasks.  In my class we actually do read about a generalized tool for leveraging GPUs: a system invented at MSR called Dandelion.  The surface programming language was familiar: C# with LINQ, about as popular a technology as you could name.  The Dandelion compiler then mapped the LINQ queries to the GPU, if one was available.  I loved that idea... but the Dandelion work stalled several years ago without really taking off in a big way.

TPU is perhaps easier to use: with Google's Tensor Flow, the compiler does the hard work (as with Dandelion), but the language is just Python.  To identify the objects a TPU could compute on, the whole model focuses on creating functions that have vectors, matrices or higher-dimensional tensors as their values.  This really works well and is very popular, particularly on a NUMA machine with an attached TPU accelerator, and especially for Google's heavy lifting in their ML subsystems.  But it is hard to see Tensor Flow as a general-purpose language or even as a general-purpose technology.
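
To give a feel for that style, here is a tiny sketch: ordinary Python, but the function of interest consumes and produces tensors, and the tf.function decorator lets the runtime compile the computation and place it on a TPU or GPU if one happens to be available.  The shapes and values are purely illustrative.

    # Minimal sketch of the Tensor Flow programming style: a plain Python
    # function whose inputs and outputs are tensors, compiled via tf.function.
    import tensorflow as tf

    @tf.function
    def affine(x, w, b):
        # A tensor-in, tensor-out computation: y = xW + b
        return tf.matmul(x, w) + b

    x = tf.random.normal([4, 8])   # a batch of 4 feature vectors (made-up shapes)
    w = tf.random.normal([8, 2])
    b = tf.zeros([2])
    print(affine(x, w, b).shape)   # (4, 2)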

And the same goes with research in my own area.  When I look at Derecho or Microsoft's FaRM or other RDMA technology, I find it hard not to recognize that we are creating specialist solutions, using RDMA in sophisticated ways, and supporting extensible models that are probably best viewed as forms of PaaS infrastructures even if you tend to treat them as libraries.  They are sensational tools for what they do.  But they aren't "general purpose".  (For distributed computing, general purpose might lead you to an RPC package like the OMG's IDL-based solutions, or to REST, or perhaps to Microsoft's WCF).

So where does this leave us?  Many people who look at the modern cloud are predicting that the cloud operating system will need to change in dramatic ways.  But if you believe that difficulty of use, fragility and a lack of tools make the accelerators "off limits" except for a handful of specialists, and that the pre-built PaaS services will ultimately dominate, then what's wrong with today's micro-service models?  As I see it, not much: they are well-supported, scale nicely (although some of the function-server solutions really need to work on their startup delays!), and there are more and more recipes to guide new users from problem statement to a workable, scalable, high-performance solution.  These recipes often talk to pre-built microservices and sure, those use hardware accelerators, but the real user is shielded from their complexity.  And this is a good thing, because otherwise, we would be facing a new instance of that same client-server issue.

Looking at this as a research area, we can reach some conclusions about how one should approach research on the modern cloud infrastructure.

A first observation is that the cloud has evolved into a world of specialized elastic micro-services and that the older style of "rent a pile of Linux machines and customize them" is fading slowly into the background.  This makes a lot of sense, because it isn't easy to end up with a robust, elastic solution.  Using a pre-designed and highly optimized microservice benefits everyone: the cloud vendor gets better performance from the data center and better multi-tenancy behavior, and the user doesn't have to reinvent these very subtle mechanisms.

A second is that specialized acceleration solutions will probably live mostly within the specialized microservices that they were created to support.  Sure, Azure will support pools of FPGAs.  But those will exist mostly to speed up things like Cosmos or Bing, simply because using them is extremely complex, and misusing them can disrupt the entire cloud fabric.  This can also make up for the dreadful state of the supporting tools for most if not all cloud-scale elastic mechanisms.  Like early client-server computing, home-brew use of technologies like DHTs, FPGA and GPU and TPU accelerators, RDMA, Optane memory -- none of that makes a lot of sense right now.  You could perhaps pull it off, but more likely, the larger market will reject such things... except when they result in ultra-fast, ultra-cheap micro-services that they can treat as black boxes.

A third observation is that as researchers, if we hope to be impactful, we shouldn't fight this wave.  Take my own work on Derecho.  Understanding that Derecho will be used mostly to build new microservices helps me understand how to shape the APIs to look natural to the people who would be likely to use it.  Understanding that those microservices might be used mostly from Azure's function server or Amazon's AWS Lambda tells me what a typical critical path would look like, and can focus me on ensuring that this particular path is very well supported, leverages RDMA at every hop if RDMA is available, lets me add auto-configuration logic to Derecho based on the environment one would find at runtime, and so forth.

We should also be looking at the next generation of applications and by doing so, should try to understand and abstract by recognizing their needs and likely data access and computational patterns.  On this, I'll point to work like the new paper on Ray from OSDI: a specialized microservice for a style of computing common in gradient-descent model training, or Tensor Flow: ultimately, a specialized microservice for leveraging TPUs, or Spark: a specialized microservice to improve the scheduling and caching of Hadoop jobs.  Each technology is exquisitely matched to context, and none can simply be yanked out and used elsewhere.  For example, you would be unwise to try and build a new Paxos service using Tensor Flow: it might work, but it wouldn't make a ton of sense.  You might manage to publish a paper, but it is hard to imagine such a thing having broad impact.  Spark is just not an edge caching solution: it really makes sense only in the big-data repositories where the DataBricks product line lives.  And so forth.

Monday, 15 October 2018

Improving asynchronous communication in u-service systems.

I've been mostly blogging on high-level topics related to the intelligent edge, IoT, and BlockChain lately.  In my work, though, there are also a number of deeper tradeoffs that arise.  I thought I might share one, because it has huge performance implications in today's systems and yet seems not to get much "press".

Context.  To appreciate the issue I want to talk about, think about a data pipeline moving some sort of stream of bytes at very high speed from point A to point B.  If you want to be very concrete, maybe A is a video camera and B is a machine that runs some form of video analytic.  Nothing very exotic.  Could run on TCP -- or it could use RDMA in some way.  For our purposes here, you have my encouragement to think of RDMA as a hardware-accelerated version of TCP.

Now, what will limit performance?  One obvious issue is that if there is some form of chain of relaying between A and B, any delays in relaying could cause a performance hiccup.  Why might a chain arise?  Well, one obvious answer would be network routing, but modern cloud systems use what we call a micro-services (u-services) model, in which we deliberately break computations down into stages.  Each stage is implemented by attaching a small "function" program: a normal program written in C++ or Java or a similar language, running in a lightweight container setting like Kubernetes, with "services" such as file storage provided by other services it talks to in the same data center, mostly running on different machines.  You code these much as you might define a mouse-handler method: there is a way to associate events with various kinds of inputs or actions, then to associate an event handler with those events, and then to provide the code for the handler.  The events themselves are mostly associated with HTML web pages, and because these support a standard called "web services", there is a way to pass arguments in, and even to get a response: a form of procedure call that runs over web pages and invokes a function handler program coded in this particular style.
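
Here is a rough sketch of that event-handler style in Python, with Flask standing in for the cloud vendor's function-hosting machinery.  The route name and the payload handling are invented purely for illustration.

    # Sketch of the event-handler programming model described above; Flask is
    # just a stand-in for the vendor's hosting machinery.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/photo-uploaded", methods=["POST"])
    def on_photo_uploaded():
        # The platform invokes this handler when the "photo uploaded" event
        # fires; arguments arrive with the request, and the return value is
        # the web-service style response.
        photo_bytes = request.get_data()
        return jsonify({"received_bytes": len(photo_bytes)})

    if __name__ == "__main__":
        app.run(port=8080)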

So why might a u-services model generate long asynchronous chains of actions?  Well, if you consider a typical use case, it might involve something like a person snapping a photo, which uploads into the cloud, where it is automatically analyzed to find faces, and then where the faces are automatically tagged with best matches in the social network of the camera-owner, etc.  Each of these steps would occur as a function event in the current functional cloud model, so that single action of taking a photo (on Facebook, WhatsApp, Instagram, or whatever) generates a chain.  In my work on Derecho, we are concerned with chains too.  Derecho is used to replicate data, and we often see a pipeline of photos or videos or other objects, which then are passed through layers of our system (a chain of processing elements) before they finally reach their delivery target.

Chains of any kind can cause real issues.  If something downstream pauses for a while, huge backlogs can form, particularly if the system configures its buffers generously.  So what seems like a natural mechanism to absorb small delays turns out to sometimes cause huge delays and even total collapse!

Implications.  With the context established, we can tackle the real point, namely that for peak performance the chain has to run steadily and asynchronously: we need an efficient, steady, high rate of events flowing through the system with minimal cost.  This in turn means that we want some form of buffering between the stages, to smooth out transient rate mismatches or delays.  But when we buffer large objects, like videos, we quickly fill the in-memory buffer capacity, and data will then spill to disk.  The stored data will later need to be reread when it is finally time to process it: a double overhead that can incur considerable delay and use quite a lot of resources.  With long pipelines, these delays can be multiplied by the pipeline length.  Even worse, modern standards (like HTML) often use a text-based data representation when forwarding information, but a binary one internally: the program will be handed a photo, but the photo was passed down the wire in an ascii form.  Those translations cost a fortune!
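
The cost of that text encoding is easy to see in a few lines of Python: a base64-style ascii encoding inflates a binary payload by roughly a third, and both ends burn CPU converting it back and forth.  The 4 MB "photo" below is just synthetic data.

    # Illustration of the ascii-encoding overhead on a binary payload.
    import base64, os

    photo = os.urandom(4 * 1024 * 1024)      # pretend this is a 4 MB image
    encoded = base64.b64encode(photo)        # the ascii form sent down the wire
    decoded = base64.b64decode(encoded)      # converted back at the receiver

    print(len(photo), len(encoded))          # roughly a 4/3 size inflation
    assert decoded == photo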

What alternative did we have?  Well, one option to consider is back-pressure: preventing the source from sending the photo unless there is adequate space ("credits") available in the receiver.  While we do push the delay all the way back to the camera, the benefit is that these internal overheads are avoided, so the overall system capacity increases.  The end-to-end system may perform far better even though we've seemingly reduced the stage-by-stage delay tolerance.
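
Here is a toy sketch of credit-based back-pressure: the source may only inject a new item while it holds a credit, and the receiver hands a credit back each time it finishes one.  The credit count, queue and simulated processing delay are all made up for illustration.

    # Toy credit-based back-pressure between a producer and a consumer.
    import queue, threading, time

    CREDITS = 4                              # max items in flight end to end
    credits = threading.Semaphore(CREDITS)
    pipe = queue.Queue()

    def camera(n_frames):
        for i in range(n_frames):
            credits.acquire()                # blocks when the pipeline is full
            pipe.put(f"frame-{i}")
        pipe.put(None)                       # end-of-stream marker

    def analytic():
        while pipe.get() is not None:
            time.sleep(0.01)                 # pretend to analyze the frame
            credits.release()                # return a credit to the source

    t1 = threading.Thread(target=camera, args=(20,))
    t2 = threading.Thread(target=analytic)
    t1.start(); t2.start(); t1.join(); t2.join()
    print("done: at most", CREDITS, "frames were ever buffered")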

But wait, perhaps we can do even better.   Eliminating the ascii conversions would really pay off.  So why not trick the system into taking the original photo, storing it into a blob ("binary large object") store, and substituting a URL in the pipeline?  Now our pipeline isn't carrying much data at all: the objects being shipped around might be just a few tens of bytes in size, instead of many megabytes.   The blob store can offer a more optimized photo transfer service, and the need for buffering in the pipeline itself is eliminated because these tiny objects would be so cheap to ship.

This would be a useless optimization if most stages needed to touch the image data itself... unless, that is, the stage-by-stage protocol is hopelessly tied to ascii representations while the blob-store transfer path is super-fast (but wait... that might be true!).

However, there is even more reason to try this: in a u-services world, most stages do things like key-value storage, queuing or message-bus communication, and indexing.  Quite possibly the majority of stages aren't data-touching at all, and hence wouldn't fetch the image itself.  The tradeoff here would be to include the meta-data actually needed by the intermediary elements of the pipeline, while storing the rarely-needed bits in the blob store for retrieval only when actually needed.  We could aspire to a "zero copy" story: one that never copies unnecessarily (and that uses RDMA for the data transfers that actually are needed, of course!)
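
To illustrate, here is a small sketch in which an in-memory dictionary stands in for the blob store: the pipeline stages pass around a tiny metadata record, and only the one stage that truly needs the pixels dereferences the blob.  The record fields and stage names are invented.

    # Sketch of the blob-reference idea: heavy bytes go to a blob store once,
    # and the pipeline passes only a small metadata record.
    import os, uuid

    BLOB_STORE = {}                          # stand-in for a real blob service

    def put_blob(data):
        key = str(uuid.uuid4())
        BLOB_STORE[key] = data
        return key

    def ingest(photo_bytes, camera_id):
        # Stage 1: store the bytes once, pass a tiny record downstream.
        return {"camera": camera_id, "blob": put_blob(photo_bytes), "tags": []}

    def index_stage(record):
        # A metadata-only stage: never touches the image bytes.
        record["tags"].append("indexed")
        return record

    def face_tag_stage(record):
        # The one stage that needs pixels fetches them via the blob reference.
        pixels = BLOB_STORE[record["blob"]]
        record["tags"].append(f"faces-checked({len(pixels)} bytes)")
        return record

    record = face_tag_stage(index_stage(ingest(os.urandom(1_000_000), "cam-7")))
    print(record["tags"])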