Monday, 11 December 2017

Fog computing and cloud-hosted data concentrators

When a cloud touches the ground, we get fog, an image that inspires a new buzzword: "fog computing."  If you are interested in cloud integration with edge devices you work in the fog!

Internet of Things is going to a big deal, and this suggests that as a research topic, fog computing deserves close scrutiny.  Today's most interest question is more of a meta-question: figuring out which system elements should play which roles (once this question is resolved, a series of more technical follow-on questions would arise). 

The question isn't really so new: At least since the 1980's, researchers have speculated about the challenges of computing systems that might reach very broadly into the physical world through a variety of sensing modalities (still and video imaging, radar and lidar and IR motion detectors, temperature and humidity sensors, microphones, etc), use machine-learned techniques to make sense of those inputs, and then perhaps take action.  But the game changer relative to those early days is the contemporary appreciation of how cloud computing can drive costs down, and enable a new form of nimble startup -- the kind of company that builds a cool new technology and then can scale it to support millions of users almost overnight.  Past versions of edge systems mired in issues of cost and suffered market failures: the users who might have benefitted didn't want to pay, and because the costs were genuinely steep, the technology simply didn't take off.

Today, we can at least identify a strong market pull.  In any technology media publication you read about smart homes and office complexes and cities, smart highways and power grids, smart healthcare technologies.  Clearly there is a wave of interest for this form of deeply integrated machine intelligence.  Moreover, and this points to a social shift relative to the 1980's, the dominant tone isn't really a worry about privacy, although you do see some articles that fret about the risks (and they are quite real; we need to acknowledge and engage on that front).  But the more dominant tone looks at upside, drawing analogies to our cell phones.

For example, I have an advisory role in a company called Caspar.ai.  It was founded by Ashutosh Saxena, a friend who was a Cornell faculty member until he left to do the startup, and David Cheriton, who you may know of as the first outside investor in Google (he was a Stanford professor but also a very successful entrepreneur, and when Larry and Sergie knocked on his door with their idea and an early proof of concept, Dave jumped to help them get established).  They view the apartment or condominium as the next platform, and Ashutosh actually gives talks in which he shows you the architecture diagram of an iPhone and then a nearly identical one for a condo in Santa Rosa.  The Caspar.ai system is like the iPhone O/S and could host apps, and because Caspar works with the developer who built the entire development, the technology can orient itself: it is in such-and-such a room listening to the apartment owner giving instructions about music for the party tonight, etc.

The example highlights one of the puzzles we'll want to think about: Caspar has orientation because it is built right into the condominium or home.  But most fog computing devices today are just small gadgets that individuals buy and install by hand.  How would they know where they are, which way they are pointing, etc?  And even if a very geeky technology person could configure such a thing, could her grandfather do it, or would he need to call his granddaughter to come and set the device up?  Part of the fog computing puzzle is visible right in this example, and in the contrast with Caspar.ai: how will these systems orient themselves, and how will the devices be told what roles to play?

It isn't just about smart homes.  One of the more exciting ideas I heard about most recently centers on smart agriculture: I attended a workshop on digital agriculture at Cornell a few weeks ago, and completely coincidentally, was invited to attend a second one on the concept of a smart soil "macroscope" at Chicago almost immediate afterward.

So how would the fog impact agriculture or dive into the soil?  Researchers spoke about tracking produce from farm to table, literally step by step, and using that knowledge to reduce loss due to fresh produce sitting on the shelf for too long, to improve the efficiency of the produce supply chain, prevent accidental contamination by E-Coli or other bacteria, redirect shipments to match demand more closely, etc.  A researcher at Microsoft showed that with a new kind of white-fi communications, drones could fly over fields and map out insect infestations in real-time, enabling the farmer to stamp out the pests with spot applications of insecticides, reducing unnecessary pesticides by a factor of 1000x.  You could do the same with fertilizer, or when watering.   One person talked about underground WiFi: it turns out to work surprisingly well, if you have enough power!  Who would have imagined an underground WiFi network?  But if the need ever becomes real enough, it can be done!  The one caveat is that they need fairly well-drained soil; pooled water can block the signals.

Who needs this?  I found one answer out in Napa.  I happen to love great wines, and I'm friendly with some growers who own or manage insanely expensive vineyards.   They would love to be able to visualize and track the subsurface biome, the movement of nutrients and water, and the conversion of surface materials into soil.  This might help a grower identify particularly promising spots to place the next great winery.  Of course, they are also quick to point out that no technology is a complete answer to any question, and that going from data to "useful insights" is quite a complicated matter.  But in a world of constant climatic change, they are keenly interested in knowing what is going on down there.  In fact I'm thinking I should start a little company to work on this topic, if for no other reason than as an excuse to visit and taste some of those wines!  A lot of them are way to expensive for me to actually buy and drink on a routine basis.

Getting technical again: what questions can we identify from this set of examples, and how do they shape the likely form a fog computing system might take?

Part of the puzzle centers on the limitations of sensing devices.  Devices are gaining in potential compute power, but you need to ask whether computing on the device itself is a smart move, given that more and more sensing devices are designed to operate on batteries, or with minimal power draw.  Computing at the edge isn't a very power-efficient model, and relaying data back to a data center has overwhelmingly dominated when you look at actual deployed IoT products. 

In fact there is much to be said for viewing sensors as dumb devices that might store a few hours or days of data, but don't compute very much.  First, if you want to understand an image or a spoken command, the size of database you would use to do that is huge -- deep neural networks and Bayesian networks generate models that can be terabytes or even petabytes in size when used for tasks of this kind.  Keeping the neural network models and precomputed data back in your local cloud where they can be shared among a number of users is far more cost-effective than shipping those petabytes to the devices, and then needing to keep those updated as the runtime conditions and goals evolve.

The proper way to process data might also depend on things the sensor is unlikely to have access to, such as the best estimate of its location and orientation, knowledge of which people are in the home or office, context associated with their prior commands to the system that were given in a different location, and captured by some other device.  As we saw, while Caspar.ai might actually have this kind of orientation at hand, most devices lack that sort of context information (think of an Alexa camera/microphone/speaker that people shift around much like a flower vase: it isn't in the same place from day to day, and that camera could easily end up pointing at a wall, or a stack of books!)   All of this might argue for a sensor model in which sensors capture everything in the vicinity, store a copy locally, but then just blindly relay the acquired data to the cloud.  The sensor could still do some very basic stuff: for example, perhaps it can figure out that absolutely nothing is happening at all, and skip the upload in that case, or upload a tiny marker saying "no change."  This really limited form of local computing is something that even very simple, disoriented devices can perform.

However, arguing in the other direction, there are sensing options that only make sense if deployed at the edge.  For example, you can't easily correct for focus on vibration of a video after capturing it, so that form of dynamic adjustment should be performed right on the camera.   A subsurface sensor used to track humidity in the soil may need to dynamically vary the way it operates its sensing components, because the best options for measuring moisture vary enormously depending on the depth of the water table, how far from fully saturated the soil is, etc.  So for cases like these, a dumb sensor might end up generating endless streams of low-quality data that can't be patched up later.

Broadly, I think we'll need to do both, but that the sensors will be used mostly in pretty dumb ways (like to hold video, and to discard empty content), but then will relay most of the potentially interesting stuff back to the cloud.

So this starts to answer the meta question.  Given this model, we can see what the technical need might be: the model argues that we should create a new technology base focused on cloud-hosted data concentrators that are integrated deeply into cloud storage systems: I'm fond of the term "smart memory" for this functionality.  A single instance of a concentrator, on some single mid-range compute server within the cloud, might handle some large but not unbounded number of sensors: perhaps, 10,000 smart homes in some community, or 100 video cameras.  If you need more capacity, you would just spread your incoming data streams over more data concentrators.

Notice that I'm not working within a standard 3-tier cloud model in this example.  A standard cloud has a first tier that generates web content, often using cached data and other second-tier services.  The third tier covers everything at the back end.  A concentrator is an example of a new kind of first-tier: one that is stateful and smart, perhaps with real-time guarantees and consistency promises.  This is not today's most common cloud model -- although it is close enough to it that today's cloud could definitely evolve to fit this picture, and in fact it might not even be a major reach to pull it off!

Within the data concentrator we would have a machine-learning model, dynamically updated and accurate, that could be used to immediately "understand" the data.  Thus if a person in an apartment utters a remark that only makes sense in context, we could create a dynamic machine-learned utterance model that is accurate to the second and expresses the system knowledge of recent past; even if the speaker is moving from room to room, that model would evolve to reflect the system understanding of his or her request.  For example, "Caspar, can you adjust the shade to get rid of that glare?" can be understood only by a system that knows the position of the sun, the locations of the windows and shades and the TV, and the options for adjusting that kind of window shade, but with that data, it can be done -- and that data would most likely live in a cloud-style concentrator or a smart-home "brain unit" if we put the concentrator right in the home (appealing as a response to privacy worries).  Tell Alexa or Siri to do this, and because those technologies are kind of a free-standing autonomous edge, the best you can hope for is a response like "I'm sorry, Ken, I don't know how to do that."

The other argument for cloud models leads to a data storage and update insight.  In particular, it isn't just the massive databases used for vision and speech understanding that would probably need to live on the cloud.  There is also a question of the smaller knowledge models used to make sense of edge events, and to adapt as they occur.

A smart highway might have an evolving understanding of the situation on such-and-such a segment of the freeway, and a smart farming system might build up a model of the insect infestation in a field, adapting the drone's flight plan to focus on areas that are at highest risk, while spending less time scanning areas that seem to be unaffected by the bugs.

The argument for smart storage would simply be that as we capture and make sense of these data streams, we're in a unique position to decide what to keep and what to discard, which data to route to back-end systems for further evaluation using offline techniques, etc.  The back-end systems would view the entire smart memory as a big storage cluster containing read-only files, reflecting the knowledge acquired by the smart analytic layer, and indexed by time.  Of course they could write files too, for example to update parameters of the knowledge model in use on the front end.

As an example, if a smart highway were to observe that some car is a few cm to the side relative to predictions, the system would probably just tweak the model parameters at the edge.  But if a car changes lanes unexpectedly, that would be a big event, and might be better handled by forwarding the information to a back-end system running Hadoop (Spark/Databricks), where we could recompute the entire set of expected vehicle trajectories for that segment of highway.

In other blog entries, I've shifted to a pitch for Derecho around this point, but this blog is getting long and perhaps I'll just wrap up.  In fact, it isn't just Derecho that I've omitted: I haven't even touched on the need for specialized hardware (FPGA, GPU clusters and TPU clusters seem like the most promising technologies for really understanding speech and video at scale), privacy, security, or data consistency: all could be topics for future blogs.  But those argue for a cloud model too.  Overall, it does strike me as a very promising area for study.  My one qualm, really, centers on the buzzword: fog computing isn't my favorite term; it sounds way too much like marketing drivel, and we've all heard a lot of that sort of thing.  What was wrong with plain old IoT?  Or "smart edge"?

Tuesday, 28 November 2017

Dercho discussion thread

 As requested by Scott...  Scott, how about reposting your questions here?
===
Hi Ken.

Thanks very much for the info. I wasn't familiar with LibFabrics. I think since I don't have access to RDMA hardware right now I will wait for the derecho over libfabrics. How can I know when it is available?

In the mean time a question: you mentioned in one of your blog posts that it would be difficult to use other languages than c++ for using the derecho replicated object API. Could you comment on why that would be?...e.g. relative to python or java say.

To share one thought: the OSGi (java) concept of a service...or rather a remote service...is conceptually similar. OSGi services are plain 'ol object instances that are managed by a local broker (aka service registry) and accessed by interfaces.

One thing that makes OSGi services different from java objects is support for dynamics...i.e. OSGi services can come and go at runtime

Which brings me to another question: Is it possible for replicated object instances to be created and destroyed at runtime within a derecho process group?

BTW, if there is some derecho mailing list that is more appropriate than your blog for such questions please just send me there.

Scott

Friday, 24 November 2017

Can distributed systems have predictable delays?

I ran into a discussion thread about “deterministic networking” and got interested...

Here's the puzzle: suppose a system requires the lowest possible network latencies, the highest possible bandwidth, and steady timing.  What obstacles prevent end-to-end timing from being steady and predictable, and how might we overcome them?  What will be the cost?

To set context, it helps to realize that modern computing systems are fast because of asynchronous pipelines at every level.  The CPU runs in a pipelined way, doing branch prediction, speculative prefetch, and often executing multiple instructions concurrently, and all of this is asynchronous in the sense that each separate aspect runs independent of the others, as much as it can. Now and then barriers arise that prevent concurrency, of course.  Those unexpected barriers make the delays associated with such pipelines unpredictable.

Same for the memory unit.  It runs asynchronously from the application, populating cache entries, updating or invalidation them, etc.  There is a cache of memory translation data, managed in an inverted lookup format that depends on a fast form of parallel lookup hardware for fast address translations.  With a cache hit memory performance might be 20x better than if you get a miss.

Or consider reading input from a file.  The disk is asynchronous, with a buffer of pending operations in between your application and the file system, then another in between the file system buffer area and the disk driver, and perhaps another inside the disk itself (a rotating disk or SSD often has a battery-backed DRAM cache to mask hardware access delays, and to soak up DMA transfers from main memory, which ideally occur at a higher speed than the SSD can really run at).  If everything goes just right, your performance could be 100x better than if luck runs against you, for a particular operation.

NUMA effects cause a lot of asynchronous behavior and nondeterminism.  Plaforms hide the phenomenon by providing memory management software that will assign memory from a DRAM area close to your core in response to a malloc, but if you use data sharing internal to a multithreaded application and it runs on different cores, beware: cross core accesses can be very slow.  Simon Peter researched this, and ended up arguing that you should invariably make extra copies of your objects and use memcpy to rapidly move the bytes in these cross-core cases, so that each thread always accesses a local object.  His Barrelfish operating system is built around that model.  Researchers at MIT sped up Linux locking this way, too, using “sloppy counters” and replicating inode structures.  In fact locking is a morass of unpredictable delays.

Languages like C# and Java can hide some NUMA effects, but on the other hand, they also do garbage collection and other memory compaction operations, which can be costly and hard to anticipate or control.  You can eliminate this with C++, but the standard libraries sometimes do copying in unexpected ways too, so you can’t just assume that everything is roses even with these low level languages.  These days, even C can have strange library-level delays.

The case I worry about is when activities with different priorities share a lock: it is easy to create a priority inversion in which urgent action A is stuck waiting for thread B, which holds a lock but is very low priority and has been preempted by medium priority thread C.  This seems arcane as I describe it, but if you think of a NIC or an interrupts handler as a thread, that could be A.  Your user code is thread B. And the garbage collector is thread C.  Voila: major problems are lurking if A and C share a lock.  Locks make concurrent coding easier, but wow, do they ever mess up timing!

In distributed systems, any form of remote interaction brings sources of nondeterministic timing.  First,  on the remote side, we need some kind of task to wake up and run, so you have a scheduling effect.  But in addition to this, most networks use oversubscribed COS structures, with spine and lead routers or switches that can’t actually run at a full all-to-all data rate.  They congest and drop packets or send “slow down” messages or mark traffic to warn the endpoints about excessive load.  So any kind of cross-traffic effect can cause variable performance.

Few of us have a way to control the layout of our application tasks into the nodes in the computer.  So network latency can introduce nondeterminism simply because when things run, every different launch of your application may end up with nodes P and Q at different distances from each other, measured in network units.  In fact routing can also change, and tricks involving routing are surprisingly common.  Refreshing IP address leases and other kinds of dynamic host bindings can also glitch otherwise rock-steady performance.

If you use VPN security, encryption and decryption occurs at the endpoints, adding delay.  There are cryptographic schemes that have constant cost, but many are more or less rapid depending on the data you hand them, hence more non-determinism creeps in at that step.

Failure handling and any kind of system management activity can cause performance to vary in unexpected ways.  This would include automated data replication, copying, compaction and similar tasks on your file storage subsystem.

So... could we create a new operating system with genuinely deterministic timing?  Cool research question!  If I find the right PhD student, maybe I’ll let you know (but don’t hold your breath:  with research, timing can be unpredictable).

Sunday, 5 November 2017

Disaggregating data centers

One of the hot new buzz-phrases in the systems community is "disaggregation", as in "the future data center will be disaggregated".

Right now, we tend to build data centers by creating huge rack-mounted collections of computers that often run the same operating systems we use on the desktop: versions of Linux, perhaps virtualized (either with true VMs or via container models, like Docker on Mesos).  The core challenge centers on guaranteeing good performance while sharing resources in ways that optimize power and other costs.  The resulting warehouse-size systems can easily contain tens or hundreds of thousands of racked machines.

Researchers pressing for a disaggregated model generally start by pointing out that we really have a lot of unused resource in these settings.  For example, there might be large numbers of machines equipped with GPU coprocessors that aren't currently running any applications that use them (or ones that are FPGA-capable, but don't actually need FPGA support right now), machines with lots of spare DRAM or SSD space, etc.  You can make a long list of such resources: other forms of ASICs (such as for high-speed SHA3 hashing), accelerators for tasks like DFFT computation, DRAM, unused cores, NICs that actually have computers built into them and hence could be used as network-layer coprocessors, switching and routers with programmable capabilities, massive storage systems with ARM or similar compute power down near the medium, etc.

So, the proposal goes, why not shift towards a greater degree of sharing, where your spare resources could be available to my machine?

With RDMA and RDMA-enabled persistent memory (PMEM) standards starting to work properly in settings that long-ago standardized around Ethernet and IP (RDMA on Infiniband is just not a hugely popular technology in data centers, although Azure HPC offers it), it makes sense to talk about sharing these spare resources in more ambitious ways.  My spare DRAM could be loaned out to you to expand your in-memory cache.  Your unused FPGA could become part of my 1000-FPGA deep neural network service.  There are more and more papers about doing these kinds of things; my class this fall has been reading and discussing a bunch of them.  In our work on Derecho, we've targeted  opportunities in this space.  Indeed, Derecho could be viewed as an example of the kind of support disaggregated computing might require.

But deciding when disaggregation would be beneficial is a hard problem.  A remote resource will never have the sorts of ultra-low latency possible when a task running locally accesses a local resource, and studies of RDMA in large settings show surprisingly large variability in latency, beyond which one sees all sorts of congestion and contention-related throughput impacts.  So the disaggregated data center is generally going to be a NUMA environment: lots of space resources may be within reach, but unless care is taken to share primarily with nearby nodes  (perhaps just one hop away), you would need to anticipate  widely varying latency and some variability in throughput: orders of magnitude differences depending on where something lives.  Get the balance right and you can pull off some really nifty tricks.  But if you accidentally do something that is out of balance, the performance benefit can quickly evaporate.

Even with variability of this kind, a core issue centers on the shifting ratio of latency to bandwidth.  In a standard computer, talking to a local resource, you'll generally get very high bandwidth (via DMA) coupled with low latency.  You can do almost as well when running in a single rack, talking to a nearby machine over a single hop via a single TOR switch.  But when an application reaches out across multiple hops in a data-center network, the latency, and variability of latency, grow sharply. 

The bandwidth story is more complex.  Although modern data centers have oversubscribed leaf and  spine switches and routers, many applications are compute intensive, and it would be rare to see long periods with heavy contention.  Thus, even if latency becomes high, bandwidth might not be lost.

In our experiments with RDMA we've noticed that you often can get better bandwidth when talking to a remote resource than to local ones.  This is because with two machines connected by a fast optical link, two memory modules can cooperate to transfer data "in parallel", with one sending and one receiving.  The two NICs will need to get their DMA engines synchronized, but this works well, hence "all" their capacity can be dedicated (provided that the NICs have all the needed data for doing the transfer in cache). 

In contrast, on a single machine, a single memory unit is asked to read data from one place and write it to some other place.  This can potentially limit the operation to half its potential speed, simply because with one core and one memory unit, it is possible to "overload" one or both.

We thus confront a new balancing act: a need to work with applications that run over hardware with relatively high latency coupled to very high bandwidth.  In my view, this shifting ration of latency to bandwidth is a game-changer: things that worked well in classic systems will need to be modified in order to take advantage of it.  The patterns of communication that work best in a disaggregated world are ones dominated by asynchronous flows: high-rate pipelines that move data steadily with minimal delays: no locking, no round-trip requests where a sender has to pause waiting for responses. 

As you  know from these blogs, almost all of my work is on distributed protocols.  Mapping these observations to my area, you'll quickly realize that in modern settings, you really don't want to run 2-phase commit or similar protocols.  Traditional versions of Paxos or BFT are a performance disaster.  In fact, this is what led us to design Derecho in the way we ultimately settled upon: the key challenge was to adapt Paxos to run well in this new setting.  It wasn't easy and it isn't obvious how to do this in general, for some arbitrary distributed algorithm or protocol.

So the puzzle, and I won't pretend to have any really amazing ideas for how to solve it, centers on the generalized version of this question.  Beyond the protocols and algorithms, how does Linux need to evolve for this new world?  What would the ideal O/S look like, for a disaggregated world?

As a concrete example, think about a future database system.  Today's database systems mostly work on a read, compute, output model: you start by issuing a potentially complex query.  Perhaps, it asks  the system to make a list of open orders that have been pending for more than six weeks and involve a supplier enrolled in a special "prioritized orders" program, or some other fairly complex condition. 

The system reads the relations containing the data, performs a sequence of joins and selects and projects, and eventually out comes the answer.  With small data sets, this works like a charm.

But suppose that the database contains petabytes of data spread over thousands of storage systems.
Just reading the data could saturate your NIC for days!  In such a situation, you may need to find new ways to run your query.  You would think about disaggregation if the underlying hardware turns out to be reasonably smart.  For example, many modern storage devices have a local processor, run Linux, and can run a program on your behalf.

Clearly you want the database operations to be pushed right into the storage fabric, and would want to harness all that compute power.  You might want to precompute various indices or data samples, so that when the query is issued, the work has partly been done in advance, and the remaining work can happen down in the layer that hosts the data.

But of course I'm a systems person, so my system-centric question, posed in this context, is this: what will you want as an O/S for that environment?  And more to the point: how will that O/S differ from todays’s cloud systems?

If you are unsure how to think of such a question, consider this: if you wanted to implement a disaggregated database today, how would you even convince the storage layer to run these little add-on programs?  In a purely mechanical sense: how does one run a program on a modern SSD storage unit?  I bet you have absolutely no idea.

If you wanted to do everything by hand, logging into the storage servers one by one is an option.  Today, it might be the only option, and the environment you would find inside them could be pretty arcane.  Each vendor has its own weird variation on the concept. 

But we could change this.  We could create a new kind of operating system just for such tasks.  The power of an operating system is that it can impose standardization and increase productivity through portability, easy access paths, good debugging support, proper integration of the needed tools, virtualization for sharing, etc.  You would lack all of those features, today, if you set out to build this kind of database solution.

I mentioned at the start that data centers are optimized to minimize wasted resources.  One would need to keep an eye on several metrics in solving this database challenge using a disaggregated model.  For example, if DRAM on my machine gets loaned out to your machine, did the total performance of the data center as a whole improve, or did your performance improve at my expense?  Did the overall power budget just change in some way that the data center owner might need to think about?  How did this shift in resources impact the owner's ability to manage (to schedule) resources, so as to match profiles of intended level of resources for each class of job?

Then there are lower level questions of what the needed technical enablers would be.  I happen to believe that Derecho could become part of the answer -- this is one of the motivating examples we thought about when we developed the system, and one of my main reasons for writing these thoughts down in this blog.  But we definitely don't have the full story in hand yet, and honestly, it isn't obvious to me how to proceed!  There are a tremendous number of choices, options, and because the vendors don't have any standards for disaggregated computing, a total lack of. existing structure!  It’s going to be the Wild West, for a while... but of course, as a researcher, that isn't necessarily a bad news story...

Thursday, 12 October 2017

Deriving assumptions and constraints

In the past few blog entries, I focused on issues of "specification": to what extent are we able to specify the behavior our systems guarantee, the behavior we expect clients to respect, or other kinds of assumptions we might be making about the setting, peak loads, and so forth?

Today I thought I might ask a closely related question: could formal theorem-proving tools, like the Ivy prover used by my friends at Tel Aviv University, or the NuPRL system that Bob Constable's group here at Cornell created, be used to "extract" a specification, including any assumptions or constraints the solution imposes on applications that use it? 

I should mention that while these two theorem proving systems use high level logic languages, not all such systems are quite so mathematical in approach.  If you prefer Java as your coding language, Microsoft's Dafny would be another system where you could pose my question.  In fact, Dafny really embodies the main idea behind this blog posting, but approaches it the question in a slightly different way.  In Dafny, when you specify a component of a system, you encode any preconditions or invariants as assertions in a special notation that goes into the documentation for the associated module and methods.  Later, when compiling your code, Dafny requires you to prove that the method actually respects the constraints that you expressed this way.  So Dafny doesn't automate specification, per se, but it does enforce that a method must respect its specification, and it can sometimes automatically prove that this property holds.  If so, you won't need to do anything more at all.  Further, if your users later violate an assumption, Dafny would refuse to compile their code, ensuring that your module will be used only as intended.

The problem of automatically figuring out what these specifications should include looks super hard, but is fascinating to me: an ideal topic for research, if you have the math and type-theory background (or are willing to learn: nobody is born knowing that stuff...)

So here's the setting.  Suppose I have a method for computing some function, F(x), such as "it computes x/2".  Now, assume that I decide to run it purely with integer arithmetic.  When will method F() be correct?  Well, for this simple example, it will work for even values of x.  Had I been using Dafny, it might refuse to compile my code without some help, such as "assert even(x)" as a precondition (the definition of even(x) could be something like x%2 == 0, for an integer x).

So Dafny wouldn't necessarily compile code with an error in it, but it often wouldn't know why the code isn't correct.  The question becomes: could a theorem prover have figured this out for me?  With this very simple example, the answer is "sort of."  Take a theorem prover, and tell it that F is a "lambda" that maps from x to x/2.  Now define the argument x to be of type integer, and tell it that the arithmetic operator is of type (integer, integer) => integer.

If you tell NuPRL or Ivy to "verify" this code, it should fail, just as Dafny should fail, giving an example such as "x=1", and meaning "with 1 as its input, F(1) gives an undefined result".  Another possibility, for situations where the prover understands that integer division rounds down, would to try and verify the invariant that F(X)*2 = X.  This should fail, with the example F(1)=0.

So now we know that our function doesn't "work" as intended.  What we really want is for the system to suggest "Have you considered requiring that X must be even?"

So this would be an automated derivation of an assumption, or of a constraint on the inputs.  (For this blog posting, we can assume that assumption and constraint are synonyms, although there are some PL communities that use the terms in slightly different ways.)

How hard is to automate the inference of assumptions?  With current programming languages, it can be very difficult, for several reasons.  A core issue is specificity and naturalness.  Notice first that because false implies anything, there is a very simple universal constraint, but it renders the problem totally useless.  This rules out any kind of simple way of solving the problem.  So we want a non-trivial solution.

Next, consider that specifications can involve multiple kinds of formalisms: multiple areas of mathematics.  For example, over the course of my career I've worked with systems that have real-time guarantees,  protocols in which you typically specify a convergence guarantee,  more classic temporal logic properties, such as consistency for state-machine replication.  Years ago, one of my students (Kryzs Ostrowski) made the point that there aren't any programming languages in which all three of these "models" would easily fit.  He argued for a different form of type checking, but setting that to the side, the bottom line is that if you code in a language where temporal logic is relatively easy to express (like Dafny), you probably won't be able to specify real-time properties, or probabilistic convergence properties.  Each style of property has a different most natural style of formalism, and they don't necessary compose in obvious ways.

The same goes for having a system automatically infer specifications: the odds are that any given formalism will be expressive enough to capture and perhaps infer at least some kinds of specifications in its own formalism, but that it won't even be able to "understand" properties that are outside its normal scope of coverage.

This makes specification of the automatic-specification-derivation question itself tricky.  A really good derivation of assumptions would factor out each distinct element, so that we would have a small list of them, each "speaking" about a distinct aspect of the runtime environment.  But a theorem prover, lacking a human notion of what it means to be distinct and natural, is much more likely to extract some single messy condition for safety, and perhaps a second messy condition for liveness ("total" correctness can be viewed as having two aspects: "nothing bad happens", and "if condition X holds, then something good will eventually happen"). 

Of course, if English was useful for describing this goal, it would have been much easier.  We basically want a normal human reader of the specification to be given a list of preconditions (requirements the caller must respect), each separately intuitive and sensible: "The program will work if (1) The inputs are even numbers.  (2) No input is negative.  (3) No input exceeds 2^16-1..."  So getting from some single condition for correctness, if we had one, into a cleanly refactored form that feels right to human users will be a puzzle.  Then we would want to specify post-conditions, again in an easily understood form, and invariants.

But while these easy cases are probably very feasible, any general approach to extracting a specification purely in temporal logic is guaranteed to be hard.   Taking a favorite classic paper: I would love to see someone explore extracting the specification of uniform agreement (consensus).  In principle, doing so should  "derive" the famous Chandra and Toueg <>W conditions: consensus with N>2 members, of which at most one is faulty, can be achieved if the failure detection input has the property that eventually, it permanently stabilizes in a state where some correct process is never again suspected of being faulty by any correct process.  A really powerful system might even be able to give a result for the more general case of N members and T failures, and derive that N > 2T+1.

If you follow this blog, you've noticed that I'm obsessed with our new Derecho system.  For a system like Derecho, what could a tool of this kind derive?  We think we know the actual specification (we understand our own protocols well), but it would be fun to see a surprising "angle" that we haven't considered.  And you can't rule out such a thing happening: Derecho is a complex engineered piece of software, and even if we think we understand the conditions for progress and safety, perhaps there are aspects of the environment that Derecho depends upon, and that we haven't thought much about.  For example, what are we assuming about the network layer (RDMA, in the case of Derecho)?  What are we assuming about the SSD storage units?  Are there any timing requirements?

Monday, 18 September 2017

What we can learn about specifications from ZooKeeper's asynchronous mode, and its unsafe ForceSync=no option?

At the time of this writing, ZooKeeper is surely the world's most widely used data replication tool.  You run it on a few machines, normally 3, and it offers a file system API with very strong guarantees to the user.  In fact, if configured to do so, ZooKeeper implements the Paxos specification: Leslie Lamport's formalism of the properties required for correct state machine replication.  (My post-doc, Weijia Song, points out that actually, the Zookeeper Atomic Broadcast, ZAB, isn't necessarily a true Paxos protocol, but the issue he raises is very subtle, so I'll set that to the side.  For our purposes here, ZAB is close enough).

In a recent blog posting, we discussed some of the missing aspects of that very specification.  As a result, when I read the ZooKeeper documentation, I was intrigued to realize that the documentation more or less urges that the system be configured to violate Paxos!  In fact the document is short, and easy to read, so have a look if you are skeptical. 

You'll learn about all sorts of parameters that represent ZooKeeper's response to those missing specification elements, such as how to deal with disks that fill up completely, or avoiding inconsistency in the list of servers running the ZooKeeper service.

And then, in the middle of the same document, you run into a fascinating option: there is a small section called "Unsafe configuration options" that explains that "The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does."  Then we read about an option called ForceSync: "If this option is set to no, ZooKeeper will not require updates to be synced to the media." There is no discussion of risks at all.

Some people know about this but think of it in terms of a broader approach to "using Zookeeper asynchronously".  Used asynchronously, Zookeeper lets you start a series of operations but either ignore their termination, or at least not wait one by one.  Of course flow control always kicks in eventually, to prevent congestion, but you end up with a stream of requests.  In this mode it is nearly universal that you would also set ForceSync=no.

So how safe are such actions?

Elsewhere, on the ZooKeeper blog, Flavio Junquera writes that the system would perfectly well if this option is used, and that it can offer big speedups.  He comments that for safety, there are several options: "You could consider using write barriers, or battery-backed raid SSD".  The write barrier remark relates to a Linux system call, "fsync".  A battery-backed raid SSD is a type of SSD storage with a DRAM cache that can hold pending writes in memory (in DRAM), but with battery backup so that if power fails, the pending writes will definitely complete.  Then behind the DRAM are a set of SSD storage units arranged to handle transfers in parallel, so that the aggregate bandwidth might be enough to keep up with the DRAM transfer rates.

On StackOverflow and elsewhere, you can easily find threads encouraging you to configure ZooKeeper with ForceWrites=no, and assuring the reader that nobody has ever observed any bad consequences.

In effect, there is very little discussion of risks, except in the sense of "yes, you should definitely use this feature, but remember to also do these other things...."

So what's the issue, and why is it interesting?

At the core of any Paxos implementation is the transaction log where Paxos stores its state.  In Derecho, this takes the form of replicated data residing in the replicated C++ objects defined by the developer.  In classic Paxos, it was a list of log entries associated with the "acceptor role".  Most people understand this to have been an append-only disk file, but my colleague and friend Robbert van Renesse, a Paxos expert, questions that assumption.  He thinks that Leslie was deliberately vague about where the logs live, with the intent that it could equally well be used as an in-memory atomic multicast.  Derecho does exactly that: it has one protocol, with two configuration options, and you get to pick.  Durable storage on disk gives you a durable Paxos, and in-memory storage, a form of atomic multicast with total ordering and fault-tolerance.

The same is true in ZooKeeper, in which performance centers on the speed of the ZooKeeper transaction log.  You need to tell it where you want the log to reside.  Some popular options include placing it in RamDisk (in memory), or on a real disk, or perhaps an SSD.  Above you saw recommendations that it be on a battery-backed raid SSD.

The problem is that if you just put the log on a normal disk or even a normal SSD disk, you get Paxos guarantees of durability... but you also see a heck of a big slowdown.  Partly this is because DMA to an SSD is quite slow compared to copying in memory.  But the bigger issue is that each time you do an SSD write, if you actually wait for the write to fully complete ("a forced sync"), you pay a full millisecond just waiting.

Even with concurrency this limits the typical SSD configuration of ZooKeeper to about 1000 write operations per second.  

Early in the ZooKeeper story, the developers ran into this issue, and added a new option: ForceSync=no.  With it, ZooKeeper "on its own" ceases to be a true Paxos log, because it will build a backlog of in-memory updates queued up to be written to disk, and won't actually carry out those updates instantly.  But it gains hugely in performance: 50,000 writes per second become completely feasible.  A 50x speedup... at what cost?

This is where those comments about battery-backed SSDs and write barriers enter the picture.  And this is the puzzle: in fact, you can use ZooKeeper safely in this mode, at no cost and no risk at all.  But it depends on your perception of cost, and of risk.

Lets start by setting ForceWrites=no but ignoring the helpful advice.  ZooKeeper will be buggy.  But, to be bit by this particular bug two things have to happen.  First, you need to have a service that crashes and develops amnesia about a batch of committed transactions (updates) that were pending at the time of the crash.  And second, someone or something needs to notice.

The point about "someone noticing" is the key to why so many applications get away with setting ForceSync=no, and yet pay no attention to Flavio's advice.  Think about the sequence of events for an application using ZooKeeper.  Some application is about not to complete something important, like launching the rocket ship.  So it writes to the ZookKeeper log "... two, one, ignition!" and presses the launch button.

Exactly as this occurs, the power goes out, and on recovery, the system has no record that the button was about to get pushed.  So we have an inconsistency that Paxos normally doesn't permit: Lamport requires that Paxos must never forget a committed transaction, meaning that once the application is told the commit has occurred, Paxos has an obligation to not lose it.

But this is not a likely failure sequence!  The amnesia part, sure, that really is likely.  A bit like with a normal Linux file system: if a program crashes before calling fsync, the last bytes it wrote could easily be lost (maybe even the last few thousand).  We know that, and learn to call fsync.  But someone actually caring, about that specific operation, yet neglecting to manually call fsync?  Seems very unlikely...

So here we have ZooKeeper acting... like Linux file systems normally act!  In fact, you can manually call fsync anytime you like in ZooKeeper, so if you do need it, there it is.  That's the write-barrier approach.

The battery-backed raid SSD option is less common.

So who is wrong: Leslie, for including this rule in the specification?   The good user, who learns to call fsync when necessary?  Or the bad user, for recklessly breaking the properties of Paxos, all for a lousy 50x or 100x speedup?

As a builder, I have to feel sympathy for any developer who wants the speed.  And I honestly question that Paxos specification.  Maybe the requirement really is too strong!  Couldn't it be reexpressed in terms of fsync: "no committed request will ever be lost if fsync was invoked, and completed after the commit?"

In fact the interesting issue here is that when ForceSync=no, ZooKeeper simply imposes an extra obligation on the behavior of the developer (use fsync, or confirm that you have the right kind of specialized SSD).  As we discussed in that prior blog entry, Paxos already imposes obligations on its users, and doesn't express those either.  Why is this different?

Yet I also understand Leslie.  I've asked him about this, and he thinks developers will just get it wrong.  They want the speed, because otherwise they look bad, so they flip this switch, but do something for extra speed in a situation where it really isn't appropriate.

Here in a college town, with students who definitely drive unsafely and fast, I get it.

How many of the developers who push ZooKeeper's insane speed button actually know what they are doing, and think about when to manually call fsync?  Yet on the other hand, how many of their applications would break if the ZooKeeper storage subsystem were to slow down by 50x or 100x?

So you tell me: should systems follow the ZooKeeper lead?

Seriously: what do you think?  Should Derecho support ForceSync=no?

Thursday, 7 September 2017

Inadequacy of the Paxos specification, and what we can learn from the issue

In a blog ten days ago, I discussed the issue of specifications that omit coverage for cases that actually arise in real systems.  Since then two colleagues who follow the blog asked for examples to illustrate the issues, so I thought I might say a few more words on this, focusing on the classic specification of Paxos: Leslie Lamport's solution to the State Machine Replication problem (also sometimes called Consensus, or Uniform Agreement).

The traditional specification of Paxos has the following elements:
  • A specified set of participants.
  • An assignment of roles {leader, acceptor, learner} to the participants.  Each can have more than one role, and we often think of external clients as adding two additional roles: {command-initiator, command-consumer}. 
    • A leader runs a protocol for putting new commands into the Paxos log.
    • An acceptor holds a Paxos log (an ordered list of slots, each of which can be empty, or can hold a command), and information about which commands are known to have committed.  Any given acceptor might have gaps: slots for which it lacks the committed command (on top of this, there is also a somewhat subtle failure case in which a slot will permanently be left empty, so that every acceptor would have a gap at that spot).
    • A learner runs a protocol for computing the full list of committed commands and their ordering.  Any single acceptor might have gaps in its log, so the learner does this by merging logs in order to fill those gaps.
    • Paxos really says nothing at all about command initiators and consumers, except that the initiator waits for a response to the request that the command be posted (in doing so, Paxos eliminates what we sometimes refer to as a causal order obligation, in which a system is expected to track and respect ordering on pending asynchronous actions).
  • A rule for what it means for a command to be valid (non-triviality). 
  • Agreement on ordering.
  • Durability: in any state where a Paxos service responds to "learn" requests, it needs to return the entire list of previously ordered commands. 
Many authors just focus on the three bolded properties.  Yet notice that from this set of five elements, we can easily discern a whole series of additional, unspecified aspects:
  • There is a Paxos reconfiguration protocol, but it doesn't introduce additional specification elements, except to change the set of participants.  Yet there are several aspects one would normally wish to see addressed:
    • Malkhi has convincingly argued that one should terminate a membership epoch before starting actions in the next epoch.  This avoids having a command that was initiated in epoch k commit much later, perhaps in epoch k+1.  While Lamport has often said that this isn't necessarily a bad thing, Malkhi's point is that a delayed commit can be confusing, because many systems operate in a configuration-sensitive way.
    • A new member of a Paxos acceptor group should probably not have an initially empty log.  A means of performing state transfer to it is thus required.  Simply copying some existing log to the joiner is incorrect, because by doing so, a command that previously lacked a quorum and hence was uncommitted (Paxos has a scenario in which it leaves a slot "empty") can become committed because duplicating a log effectively duplicates a vote for that command.
    • When Paxos restarts after all its members crash, the protocol doesn't specify the rule for figuring out the proper initial configuration for the restarted service (this matters with dynamic membership).
  • The specification says nothing about flow-control, yet if a Paxos protocol has no flow control mechanism at all, a quorum of acceptors could advance unboundedly far into the future relative to a stalled or failed acceptor.  This might mean that the faulty acceptor has no feasible way to catch up later, and in effect, would decrease the resilience of the service: rather than having N members, we would have to think of it as having just N-1.
  • The specification says nothing about sizes of objects (size of commands), yet acceptors will presumably have bounded amounts of memory, and might not be able to accept arbitrarily large objects.  Solving this isn't necessarily hard (one could have commands that point to external objects), but then the objects would be less replicated than the commands, and one has to ask whether that somehow violates the durability property.
  • The specification says nothing about fairness or other quality of service properties, such as timeliness of response.  Yet real systems need fairness when many clients share a Paxos service and all the clients want a fair-share of access.  In fact, one can then ask what specific notion of fairness is desired: should it be round-robin (like in Derecho)?  Or should some clients be allowed to send 2x more commands than others, because they happen to be busier?  Should a slower client be "delayed" by activity of a faster client?
  • I mentioned that Paxos seemingly excludes situations where clients might issue a series of requests, without waiting for replies.  In practice, this is common (clients often "stream" requests).  We would want the client ordering to be respected by the system: if a client sends request A, then B, Paxos should preserve the A happened before B relationship.
  • At Google, the Paxos owner (in fact the leader of the team responsible for the Chubby service) pointed out to me that his big issue is concerned with wide area deployments of Paxos, which introduces a whole set of issues Lamport never considered:
    • Proper specification of a hierarchical Paxos.  Guerraoui and Pedone and Quema all have looked at this question, primarily in connection with Ring-Paxos protocols.
    • Heavy tailed behaviors.  Chubby struggles to deal with very delayed data that can be caused by overloaded or damaged WAN links.
    • One way to get around "late" data is to design systems that assume they have correct and current data, for example using locks or leases.  Chubby does this, but it turns out that when one does so, getting WAN service instances to release their read locks so that the data can be updated, or preventing them from renewing those leases, can be very slow.  This might violate a specification that requires fast normal read behavior, but also requires a fast way to be able to update "rarely changed" configuration parameters or program versions or other forms of WAN data that does change now and then.
Beyond these points, one encounters a further concern.  The Paxos specification is ideally suited to verifying the correctness of Paxos, with respect to its own "promises".  But the specification doesn't tell us anything at all about correct use of Paxos, or required behavior of the application using the service. 


For example, suppose that Paxos isn't the real repository for data but is playing an intermediary role: the program using the Paxos service might itself be a replicated database, or some other form of replicated service that wants updates delivered in a deterministic order.
  • Do we expect that such a service would always be able to use the identical order to the Paxos log?  What does this imply about the specification the client service must respect?
  • On restart from crashes, we now have two forms of state: state in the durable Paxos logs, and state in the database service replicas.  How should these be reconciled?
  • We won't want the Paxos state to grow without bounds, so we will need to truncate the Paxos logs.  There is a truncate protocol in Paxos, but what obligations fall upon the client service that wishes to make use of that truncate command?  How does truncation interplay with failure cases that might arise?
Believe or not, I could actually go on and list even more issues!  My point, though, is that when we use Lamport's core specification, we are really working from an inadequate specification that omits major, very important aspects of the real service we intend to build and use. 


But notice how hard it would be to check a Paxos specification for adequacy.  We would need a fairly elaborate (and adequate) specification of the environment, and of our larger goals.  Otherwise, questions such as flow-control or other aspects of bounding resource consumption, or of client state, could never even be posed.  So there is sense of chicken and egg: to understand if a Paxos specification is adequate, we really need an adequate specification of the setting where Paxos will be used, so that we can study the questions it poses about the Paxos service per-se.


I'll stop on that point, but there is more one can say about adequacy of specifications.  A good topic for some other posting down the road...