While giving a lecture in my graduate course on the modern cloud and the introduction of sensors and hardware accelerators into machine-learning platforms, I had a sudden sense of déjà vu.
In today's cloud computing systems, there is a tremendous arms race underway to deploy hardware as a way to compute more cost-effectively, to process more data faster, or simply to offload very repetitive tasks into specialized subsystems that are highly optimized for those tasks. My course covers quite a bit of this work: we look at RDMA, new memory options, FPGA, GPU and TPU clusters, the challenges of dealing with NUMA architectures and their costly memory-coherence models, and similar topics. The focus is nominally on the software gluing all of this together ("the future cloud operating system"), but honestly, since we don't really know what that will be, the class is more of a survey of the current landscape, with a broad agenda of applying it to emerging IoT uses that bring new demands to the cloud edge.
So why would this give me a sense of déjà vu? Well, grant me a moment for a second tangent and then I'll link my two lines of thought into a single point. Perhaps you recall the early days of client-server computing, or the early web. Both technologies took off explosively, only to sag suddenly a few years later as a broader wave of adoption revealed deficiencies.
If you take client-server as your main example, we had an early period of disruptive change that you can literally trace to a specific set of papers: Hank Levy's first papers on the VAX Cluster architecture, written when he was still working with DEC. (It probably didn't hurt that Hank was the main author: in systems, very few people are as good as Hank at writing papers on topics of that kind.) In a few tens of pages, Hank upended the mainframe mindset and introduced us to a different vision: clustered computing systems in which lots of components somehow collaborate to perform scalable tasks. It was a revelation. Meanwhile, mechanisms like RPC were suddenly becoming common (CORBA was in its early stages), so all of this was accessible. For people accustomed to file-transfer models and batch computing, it was a glimpse of the promised land.
But the pathway to the promised land turned out to be kind of challenging. DEC, the early beneficiary of this excitement, got overwhelmed and sort of bogged down: rather than being a hardware company selling infinite numbers of VAX clusters (which would have made them the first global titan of the industry), they somehow got dragged further and further into a morass of unworkable software that needed a total rethinking. Hank's papers were crystal clear and brilliant, but a true client-server infrastructure needed 1000x more software components, and not everyone can function at the level Hank's papers more or less set as the bar. So, much of the DEC infrastructure was incomplete and buggy, and for developers this translated to a frustrating experience: a fast on-ramp followed by a very bumpy, erratic ride. Customers ultimately felt burned, and many abandoned DEC for Sun Microsystems, where Bill Joy managed to put together a client-server "V2" that was somewhat more coherent and complete. Finally, Microsoft swept in and did a really professional job, but by then DEC had vanished entirely, and Sun was struggling with its own issues of overreach.
I could repeat this story using examples from the web, but you can see where I'm going: early technologies, especially disruptive, revolutionary ones, often take on a frenetic life of their own that can get far ahead of the real technical needs. The vendor then becomes completely overwhelmed and, unless it can somehow paper over the issues, collapses.
Back in that period, a wonderful little book came out on this: Crossing the Chasm, by Geoffrey Moore. It describes how technologies often follow a bumpy adoption curve over time. The first bump is associated with the early adopters (the kind of people who live to be the first to use a new technology, back before it even becomes stable). But conservative organizations prefer to be "first to be last," as David Bakken says. They hold back, waiting for the technology to mature, hoping to avoid the pain but also not miss the actual surge of mainstream adoption. Meanwhile, the pool of early adopters dries up and some of them wander off to the next even newer thing, so you see the adoption curve sag, perhaps for years. Wired writes articles about the "failure of client-server" (well, back then it would have been ComputerWorld).
Finally, for the lucky few, the really sustainable successes, you see a second surge in adoption, and this one typically plays out over a much longer period without sagging in the same way, or at least not for many years. So we see a kind of S-curve, but with a bump in the early period.
All of which leads me back to today's cloud and this craze for new accelerators. When you consider any one of them, you quickly discover that they are extremely hard devices to program. FPGA pools in Microsoft's setting, for example, are clearly going to be expert-only technologies (I'm thinking about the series of papers associated with Catapult). It is easy to see why a specialized cloud microservice might benefit, particularly because the FPGA performance-to-power-cost ratio is quite attractive. Just the same, though, creating an FPGA design is really an experts-only undertaking, and a broken FPGA could be quite disruptive to the data center. So we may see these pools used by subsystems doing things like feature ranking for Bing search, crypto for the Microsoft Azure VPC, or data compression and other similar tasks in Cosmos. But I don't imagine that my students here at Cornell will be creating new services with new FPGA accelerators anytime soon.
GPU continues to be a domain where CUDA programming dominates every other option. This is awesome for the world's CUDA specialists, and because they are good at packaging their solutions in libraries we can call from the attached general-purpose machine, we end up with great specialized accelerators for graphics, vision, and similar tasks. In my class we actually do read about a generalized tool for leveraging GPUs: a language invented at MSR called Dandelion. The surface programming language was an easy sell: C# with LINQ, about as popular a technology as you could name. Dandelion then mapped the LINQ queries to the GPU, if one was available. I loved that idea... but the Dandelion work stalled several years ago without really taking off in a big way.
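Dandelion itself was C# and LINQ, so I can't show it here, but as a rough Python analogue of the idea of "write one high-level query and run it on a GPU if one happens to be available," consider a sketch like the following. CuPy and NumPy are purely my stand-ins for illustration; they have nothing to do with Dandelion.

# A rough analogue of the "same high-level query runs on the GPU if present" idea.
# CuPy/NumPy are illustrative stand-ins, not part of Dandelion.
try:
    import cupy as xp          # GPU-backed arrays, if CUDA is present
except ImportError:
    import numpy as xp         # otherwise fall back to the CPU

data = xp.arange(1_000_000, dtype=xp.float32)
result = xp.sqrt(data).sum()   # the same "query" runs on GPU or CPU transparently
print(float(result))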
TPU is perhaps easier to use: with Google's TensorFlow, the compiler does the hard work (as with Dandelion), but the language is just Python. To identify the objects a TPU can compute on, the whole model focuses on creating functions whose values are vectors, matrices, or higher-dimensional tensors. This works really well and is very popular, particularly on a NUMA machine with an attached TPU accelerator, and particularly for Google's heavy lifting in its ML subsystems. But it is hard to see TensorFlow as a general-purpose language, or even as a general-purpose technology.
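To give a feel for what that looks like, here is a minimal sketch in TensorFlow 2.x style (the specific model and shapes are mine, chosen only for illustration): the programmer writes a plain Python function over tensors, and the TensorFlow runtime and compiler decide how to map it onto whatever accelerator is attached.

import tensorflow as tf

# The programmer expresses the computation purely over tensors; the TensorFlow
# runtime (and its compiler) decides how to place it on CPU, GPU, or TPU.
@tf.function
def predict(weights, inputs):
    # A tiny linear model: matrix multiply followed by a softmax.
    return tf.nn.softmax(tf.matmul(inputs, weights))

weights = tf.random.normal([4, 3])   # 4 features, 3 classes
inputs = tf.random.normal([2, 4])    # a batch of 2 examples
print(predict(weights, inputs))      # result has shape [2, 3]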
And the same goes for research in my own area. When I look at Derecho, or Microsoft's FaRM, or other RDMA technologies, I find it hard not to recognize that we are creating specialist solutions, using RDMA in sophisticated ways, and supporting extensible models that are probably best viewed as forms of PaaS infrastructure even if you tend to treat them as libraries. They are sensational tools for what they do. But they aren't "general purpose". (For distributed computing, general purpose might lead you to an RPC package like the OMG's IDL-based solutions, or to REST, or perhaps to Microsoft's WCF.)
So where does this leave us? Many people who look at the modern cloud are predicting that the cloud operating system will need to change in dramatic ways. But if you believe that difficulty of use, fragility, and lack of tools make the accelerators "off limits" except for a handful of specialists, and that the pre-built PaaS services will ultimately dominate, then what's wrong with today's microservice models? As I see it, not much: they are well supported, scale nicely (although some of the function-server solutions really need to work on their startup delays!), and there are more and more recipes to guide new users from problem statement to a workable, scalable, high-performance solution. These recipes often talk to pre-built microservices, and sure, those use hardware accelerators, but the real user is shielded from their complexity. And this is a good thing, because otherwise we would be facing a new instance of that same client-server issue.
Looking at this as a research area, we can reach some conclusions about how one should approach research on the modern cloud infrastructure.
A first observation is that the cloud has evolved into a world of specialized elastic microservices, and that the older style of "rent a pile of Linux machines and customize them" is slowly fading into the background. This makes a lot of sense, because it isn't easy to end up with a robust, elastic solution. Using a pre-designed and highly optimized microservice benefits everyone: the cloud vendor gets better performance from the data center and better multi-tenancy behavior, and the user doesn't have to reinvent these very subtle mechanisms.
A second is that specialized acceleration solutions will probably live mostly within the specialized microservices they were created to support. Sure, Azure will support pools of FPGAs, but those will exist mostly to speed up things like Cosmos or Bing, simply because using them is extremely complex and misusing them can disrupt the entire cloud fabric. Keeping them inside pre-built services also compensates for the dreadful state of the supporting tools for most, if not all, cloud-scale elastic mechanisms. As in early client-server computing, home-brew use of technologies like DHTs, FPGA, GPU and TPU accelerators, RDMA, or Optane memory -- none of that makes a lot of sense right now. You could perhaps pull it off, but more likely the larger market will reject such things... except when they show up as ultra-fast, ultra-cheap microservices that users can treat as black boxes.
A third observation is that as researchers, if we hope to be impactful, we shouldn't fight this wave. Take my own work on Derecho. Understanding that Derecho will be used mostly to build new microservices helps me shape its APIs to look natural to the people likely to use it. Understanding that those microservices might be invoked mostly from Azure's function service or Amazon's AWS Lambda tells me what a typical critical path would look like. That focuses me on ensuring this particular path is very well supported, that it leverages RDMA at every hop where RDMA is available, that Derecho can auto-configure itself based on the environment it will find at runtime, and so on.
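To make that critical path concrete, here is a rough sketch of what such a caller might look like: an AWS Lambda handler in Python that invokes a hypothetical Derecho-backed key-value microservice over HTTP. The service name, endpoint, and JSON schema are entirely made up for illustration; Derecho's actual API is a C++ library and is not shown here. The point is simply that every hop on this path, from function startup to the RPC into the service to the RDMA hops inside it, is something the service designer can measure and optimize.

import json
import urllib.request

# Hypothetical endpoint of a Derecho-backed key-value microservice; the URL,
# path, and request format are illustrative assumptions, not a real Derecho API.
SERVICE_URL = "http://derecho-kv.internal:8080/get"

def lambda_handler(event, context):
    """AWS Lambda entry point: the 'critical path' is one call into the service."""
    key = event.get("key", "default")
    req = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps({"key": key}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        value = json.loads(resp.read())
    return {"statusCode": 200, "body": json.dumps(value)}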
We should also be looking at the next generation of applications, and should try to understand and abstract their needs and their likely data-access and computational patterns. On this, I'll point to work like the new paper on Ray from OSDI (a specialized microservice for a style of computing common in gradient-descent model training), or TensorFlow (ultimately, a specialized microservice for leveraging TPUs), or Spark (a specialized microservice to improve the scheduling and caching of Hadoop jobs). Each technology is exquisitely matched to its context, and none can simply be yanked out and used elsewhere. For example, you would be unwise to try to build a new Paxos service using TensorFlow: it might work, but it wouldn't make a ton of sense. You might manage to publish a paper, but it is hard to imagine such a thing having broad impact. Spark is just not an edge-caching solution: it really makes sense only in the big-data repositories where the Databricks product line lives. And so forth.
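As a tiny illustration of why Ray fits that niche, here is roughly what its task model looks like in Python. This is a minimal sketch: the gradient function is a stand-in of my own, not code from the Ray paper.

import ray

ray.init()  # start (or connect to) the Ray runtime

@ray.remote
def compute_gradient(weights, batch):
    # Stand-in for a real per-batch gradient computation.
    return [w * 0.9 for w in weights]

weights = [1.0, 2.0, 3.0]
batches = range(8)

# Launch the per-batch gradient tasks in parallel; Ray schedules them across
# the cluster and hands back futures immediately.
futures = [compute_gradient.remote(weights, b) for b in batches]
gradients = ray.get(futures)   # block until all tasks finish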
This is fantastic to read.
The ideas of a specialized microservice for leveraging TPUs, and a specialized microservice to improve the scheduling and caching of Hadoop/Spark jobs, are exciting and very interesting. It would also be nice to have a unified language/framework to program all types of accelerators (GPUs, TPUs, etc.).
Thank you, Kishore. In fact, the Dandelion language was created with exactly that goal -- the paper talks about using GPU as an example, but also about eventually supporting compilation to other architectures. TPUs would have been a natural option for them. FPGA is trickier because you write the code in Verilog and there isn't any simple mapping from C to Verilog (there is a lot of research on the topic, but first of all, it isn't simple, and secondly, it isn't yet competitive with hand-written Verilog). So I wish Dandelion had lived on and grown to cover more cases, but I don't think any single story can win. And keep in mind: someday, people may add quantum processors to this list, or could create other kinds of specialized accelerators that don't fit so well into the FPGA/TPU/GPU model (honestly, they don't even use the same model, except that all are plug-in components for PCI Express).
Great read. Thanks.
Do you see any role for CPU schedulers in cloud operating systems? I think current CPU schedulers (Linux's CFS, used by KVM, or Xen's credit scheduler) are not appropriate for cloud operating systems. They are best-effort CPU schedulers that aim at sharing processor resources fairly or improving performance. But, I think, in clouds we seek high QoS rather than high performance. By QoS, I mean how close the delivered service is to the user's expectation: the closer, the higher the QoS. Therefore, we should adopt new CPU schedulers that, unlike traditional CPU schedulers, offer differentiated service qualities to cloud users.
Definitely -- in fact there are already a number of papers and products for efficient scheduling of containers on multi-tenant systems and for efficient function scheduling on function services (PaaS like the ones I mention -- Amazon Lambda or Microsoft Functions). Very important area.