Tuesday, 28 March 2017

Is it sensible to try to "teach" cloud computing?

Many years ago, I was teaching a rather polished course on distributed computing, and students seemed to like it very much: we looked at distributed programming models and algorithms, and the students learned to create new protocols of their own, solving problems like atomic multicast.  Mostly I taught from my own textbook, initially created back in 1996 on a prior sabbatical.

When distributed systems suddenly expanded enormously and started to be known as cloud infrastructure systems (I would date this roughly to 2006, when AWS launched S3 and EC2), it seemed like an obvious moment to modernize that class and shift it to become a cloud computing course.

What followed became a cycle: I would revise my distributed systems textbook, update all my slides,  and then the course would feel reasonably current: I did this in 2010, then again in 2012.

But today I'm facing an interesting puzzle: as part of my sabbatical I thought I might again attempt to revamp the class, but the more I learn about the current state of the art, the less clear it is that this is even possible!  Cloud computing may have become too large a topic, and too arcane, to permit a proper treatment.  Here are a few of the more important topics:

  • Edge platforms.  Devices like iPhones and iPads were completely new back when the cloud emerged, and fairly simple.  Today, they are incredibly complex and very powerful platforms, and it would probably take a full year-long course to just cover the full range of technologies embedded into them.
  • The network itself is increasingly sophisticated, and I use the word "increasingly" with some hesitation: after decades of continuous evolution and change, it is hard for a person who thinks of the Internet in terms of the old TCP/IP structure to fathom quite how changed the modern network actually has become.  Today's network is a computational structure: it dynamically adapts routing, models each separate enterprise as a distinct domain with its own security and quality of service, caches huge amounts of data, and understands mobility and adapts proactively, so that by the time your car emerges from the tunnel, the network is ready to reconnect and deliver the next bytes of your children's videos.  With the push towards P4, the network is increasingly programmable: an active entity that computes on data flows running at rates of 100 Gb/s or more.
  • Any serious cloud computing company operates a lot of data centers, at locations spread globally (obviously, smaller players lease capacity from data centers operated by specialists, then customize their slice with their own software layers).  Some systems are just limited-functionality cache and web service structures: simple points of presence; others are full-featured data warehouses that do extensive computation.   Thus the cloud is a heavily distributed structure, with a hierarchy.  Routing of client requests is heavily managed.
  • Within any single data center we have layers of functionality: edge systems that run from cache and are mostly stateless (but this is changing), then back-end systems that track dynamic data, and compute engines that apply iterative computational steps to the flow: data arrives, is persisted, is compressed or analyzed, this creates new meta-data artifacts that in turn are processed, and the entire infrastructure may run on tens or hundreds of thousands of machines.
  • Big data is hosted entirely in the cloud, simply because there is so much of it.  So we also have these staggeringly-large data sets of every imaginable kind, together with indices of various kinds intended to transform all that raw stuff into useful "content".
  • We have elaborate scalable tools that are more and more common: key-value stores for caching (the basic memcached model), transactional ones that can support SQL queries, and even more elaborate key-value based database systems.
  • The cloud is a world of extensive virtualization and virtualized security enclaves.  All the issues raised by multitenancy arise, along with those associated with data leakage, ORAM models, and technologies like Intel SGX that offer hardware remedies.
  • Within the cloud, the network itself is a complex and dynamic creation, increasingly supporting RDMA communication, with programmable network interface cards, switches and routers that can perform aspects of machine learning tasks, such as in-network reductions and aggregation.
  • There are bump-in-the-wire processors: NetFPGA boards and ASIC devices, plus GPU clusters, and these are all interconnected via new and rather exotic high speed bus technologies that need to be carefully managed and controlled, but permit amazingly fast data transformations.
  • File systems and event notification buses have evolved and proliferated, so in any given category one has an endless list of major players.  For example, beyond the simple file systems like HDFS we have ones that offer strong synchronization, like ZooKeeper, ones that are object oriented, like Ceph, real-time versions like Cornell's Freeze-Frame File System (FFFS), big-data oriented ones, and the list goes on and on.  Message bus options might include Kafka, RabbitMQ, OpenSplice, and these are just three of a list that could extend to include 25.  There are dozens of key-value stores.  Each solution has its special feature set, advantages and disadvantages.
  • There are theories and counter-theories: CAP, BASE, FLP, you name it.  Most are actually false theories, in the sense that they do apply to some specific situation but don't generalize.  Yet developers often elevate them to the status of folk-legend: CAP is so true in the mind of the developers that it almost doesn't matter if CAP is false in any strong technical sense.
  • We argue endlessly about consistency and responsiveness and the best ways to program asynchronously.  The technical tools support some models better than others, but because there are so many tools, there is no simple answer.
  • Then in the back-end we have all the technologies of modern machine learning: neural networks and MapReduce/Hadoop and curated database systems with layers of data cleaning and automated index creation and all the functionality associated with those tasks.
I could easily go on at far greater length.  In fact I was able to attend a presentation on Amazon's latest tools for AWS and then, soon after, one for Microsoft's latest Azure offerings, and then a third one focused on the Google cloud.  Every one of these presentations reveals myriad personalities you can pick and choose from: Azure, for example, is really an Azure PaaS offering for building scalable web applications, an Azure fabric and storage infrastructure, an Azure IaaS product that provides virtual machines running various forms of Linux (yes, Microsoft is a Linux vendor these days!), a container offering based on Mesos/Docker, and then there is Azure HPC, offering medium-size clusters that run MPI supercomputing codes over InfiniBand.  All of this comes with storage and compute managers and developer tools to help you build and debug your code, and endless sets of powerful tools you can use at runtime.

Yet all of this is also terribly easy to misuse.  A friend who runs IT for a big company was just telling me about their move to the cloud: a simple mistake ran up a $100k AWS bill one weekend by generating files that simply got bigger and bigger and bigger.  On a single computer, you run out of space and your buggy code crashes.  In the cloud, the system actually has the capacity to store exabytes... so if you do this, it just works.  But then, on Monday, the bill arrives.  Microsoft, Amazon and Google are vying for the best ways to protect your IT department against surprises, but of course you need to realize that the dashboard has those options, and enable them, much like credit card alerts from Visa or MasterCard.
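One way to get the old "run out of space and crash" behavior back is to impose a quota on yourself.  What follows is only a minimal sketch under stated assumptions: QuotaWriter and QuotaExceeded are hypothetical names invented for this illustration (they are not part of any cloud provider's SDK), and an in-memory buffer stands in for the cloud storage target.  It complements the dashboard alerts rather than replacing them.

```python
import io


class QuotaExceeded(RuntimeError):
    """Raised when a write would push total output past the budget."""


class QuotaWriter:
    """Wraps a binary file-like sink and enforces a hard byte budget.

    Hypothetical helper for illustration only, not a real SDK class.
    """

    def __init__(self, sink, max_bytes):
        self._sink = sink
        self._max_bytes = max_bytes
        self.written = 0  # bytes successfully passed to the sink so far

    def write(self, data):
        # Check the budget before touching the sink, so a rejected
        # write leaves the stored data unchanged.
        if self.written + len(data) > self._max_bytes:
            raise QuotaExceeded(
                f"refusing to write {len(data)} bytes: "
                f"{self.written} of {self._max_bytes} already used"
            )
        self._sink.write(data)
        self.written += len(data)
        return len(data)


if __name__ == "__main__":
    store = io.BytesIO()  # stand-in for a cloud object or file
    out = QuotaWriter(store, max_bytes=1024)
    chunk = b"x" * 100
    try:
        while True:  # the buggy loop that never stops generating data
            out.write(chunk)
            chunk += chunk  # each file is bigger than the last
    except QuotaExceeded as err:
        print("stopped:", err)
```

The point of the design is simply that the failure happens in your code on Saturday, not in your inbox on Monday.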

So even cloud dashboards have become an elaborate topic.

Where does this story end?  The short answer is that it definitely isn't going to end, not for decades.  If anything, it will accelerate: with the trend towards Internet of Things, we'll be moving all sorts of mission-critical real-time systems into the cloud, and even larger data flows: self-driving cars, self-managed smart highways and cities and buildings, you name it. 

And you know what?  I don't see a way to teach this anymore!  Yesterday at Technion I was talking to Idit Keidar, quite possibly the world's very best distributed systems researcher, and she told me a little about her graduate class in distributed systems.  Five years ago, I might have thought to myself that wow, she should really be teaching the cloud, that students will insist on it.  Yesterday my reaction was exactly the opposite: when I resume teaching this fall at Cornell, I might just do what she's doing and return to my real roots.  Cloud computing was a fun course to teach for a while, but the growth of the area has simply taken it completely out of the scope of what we as faculty members can possibly master and cover in any one class.
