Tuesday, 27 October 2020

#EOF

If you've followed my blog, I want to thank you for your time and, I hope, interest.  However, I've decided to put the blog to the side for a while -- perhaps indefinitely.  In this blog, probably my last one, I thought I might simply share a few closing thoughts.  

I've enjoyed writing these essays, and for several years really looked forward to finding time to work on whatever would be next -- I maintained lists of topics that seemed ideal for my style of analysis, and spent weeks researching the technologies behind many of the essays before releasing them.  There is a real pleasure in engaging in that sort of purposeful research, and in communicating insights on topics people seem to care about.  It made for many enjoyable afternoons, starting during my 2016 sabbatical, when I wrote my first essays.  But I've reached a point where the technical path forward would dive into really obscure questions.  Were I not writing this final essay, I might have blogged on memory fences in RDMA.  I know quite a lot about the question!  But as fascinating as that topic might be, it just isn't one that deserves wide attention.  In 2018 or 2019, if I didn't have a good idea for the blog, I just had to stand by my door in Gates Hall, and in no time, I would learn something fascinating that made good sense to share.  But in 2020, that hallway is rather quiet.

Were my focus less technical, I think there would be many more topics on which one could blog.  Diversity in systems, for example, is finally improving... a welcome change. The business of cloud computing is evolving, and the new synthesis of systems and AI is exposing a wealth of open questions.  But I’ve written about some of those topics, and some of the others await a new generation of young researchers... their stories will be fascinating, but of course until the work has been done it would be pure speculation to try and anticipate it.  Unfortunately for this blog, speculation just isn’t my thing.

And so I've come to the end of my brief career as a blogger.  On the other hand, teaching and research have been keeping me busier than ever.  On which note:  My systems programming class starts in 10 minutes, and I need to launch Zoom!  So again, thank you for your attention, stay safe, and farewell!

Saturday, 19 September 2020

Who will host the future edge cloud?

As you know if you've followed my blog, I'm particularly excited about the potential of the IoT edge.  The term is a bit ambiguous, so I'll explain what I mean by it first, but then I want to speculate a little about who will end up "hosting" this important emerging real-estate.  The host of any large-scale cloud has the potential to earn quite a bit of revenue (witness AWS and Azure), so the question has big monetary implications.

With the explosive growth of machine learning in the standard cloud, it can be easy to overlook a key point: in fact, very little of the cloud is capable of reacting in real-time to evolving data.  The core issue is that ML emerged from the back-end big data analytic frameworks: massive storage infrastructures coupled to scalable computing infrastructures (think of Spark side-by-side with HDFS or Ceph).  Such a structure is designed for sharding, so we can first upload our data in a spread-out way (either using a randomized placement, or with some form of control over that step), then initiate computations which will run in a decentralized manner.  The resulting performance and scalability can be astonishing.

But if we work in this manner, we end up with batched jobs that recompute the models or perform other tasks after some delay, perhaps minutes and in many settings, days or even weeks.  Thus today when the cloud edge performs a task using a machine-learned model, that model will probably be somewhat stale.

By cloud edge, I have in mind the first tiers of a typical data center.  In systems like AWS, Azure, Facebook, Google (and the list, of course, is endless...), a data center is typically split between that back-end infrastructure and a front-end system.  This front-end portion is deployed as a series of layers: the first tier is a highly elastic pool of servers that handle the majority of incoming requests, such as to build web pages or process video uploads from webcams.  The idea is that a first tier server should be fast, lightweight and specialized, and we work with a pay for what you use model in which these servers are launched on demand, run while they are actively doing something, then terminate.  Thus any given node in the first-tier pool can be repurposed quite often.  

The increasingly popular "function" model is an example of a first-tier capability. In Google's cloud or AWS, the first tier can launch user-provided code: computing on demand, as needed.  (Azure supports a function service too, but one that is strangely slow: Whereas a Google or AWS function runs within as little as 50-100ms, Azure functions often need many seconds to launch... and hence are probably not something that will be useful in the future edge cloud I have in mind, unless Microsoft reimplements that tier to offer competitive speed).
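To make the "function" idea concrete, here is a minimal sketch in the shape of an AWS Lambda Python handler (the event fields and the make_thumbnail helper are invented for illustration; a real deployment would wire the handler to a trigger such as an object-store upload notification):

```python
# Minimal sketch of a first-tier "function": launched on demand, it handles
# one event and then exits, so the node can be repurposed moments later.

def handler(event, context):
    # e.g. a webcam just uploaded a frame; do the lightweight part here
    frame_id = event.get("frame_id")
    thumbnail = make_thumbnail(event.get("payload", b""))   # hypothetical helper
    return {"statusCode": 200, "body": f"processed {frame_id} ({len(thumbnail)} bytes)"}

def make_thumbnail(raw_bytes):
    # placeholder for the real image work; anything heavyweight (GPU inference,
    # model loading) would instead be handed off to a second-tier service
    return raw_bytes[:16]

# local smoke test with a made-up event
print(handler({"frame_id": "cam-7/000123", "payload": b"...raw jpeg bytes..."}, None))
```

The key property is that the handler holds no state of its own and finishes in milliseconds, which is what makes the pay-per-use, launch-on-demand model work.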

The nodes supporting the function tier generally lack very high-end GPUs, simply because a GPU needs a lot of care and feeding: we need to pre-provision it with the kernels (code libraries) we plan to use, perhaps preload hyperparameters and models, and so forth.  My point about Azure being slow applies: If one wanted to leverage these kinds of resources, launching a function might take seconds... far too much for a first-tier that nominally will be elastic at a scale of fractions of a second, running totally different sets of functions on those same nodes.  This all adds up to a simple rule of thumb: A service running in this tier isn't really the obvious candidate for heavyweight ML tasks.

We solve this problem by pairing our first tier with a tier (call it the second tier) of services that provide supporting functionality.  The second tier servers are managed by something called the "App Service", and again are quite dynamic: a collection (or more often, a graph) of service pools, each pool providing some functionality specialized to a particular task, and interconnected because these pools might need help from one another too.  Amazon once remarked that a typical web page actually reflects contributions from as many as 100 distinct servers: we have a web page builder in the first tier, and then those 100 servers are actually members of 100 service pools, each pool containing an elastically varied set of servers.  Thus, for Amazon's own uses, the second tier would be crowded with hundreds of service pools, to say nothing of second tier uses associated with AWS customers.

What might a second-tier service be doing?  Well, think about the "also recommended" product list on your Amazon page.  Some ML model was used to realize that if you are looking at The Da Vinci Code (which you bought years ago), you might also enjoy reading The Lost Symbol (another of Brown's books), or The Templar Codex (another "chase through Europe seeking a medieval artifact"), or perhaps An Instance of the Fingerpost (an obscure but excellent historical novel...).  Amazon has a whole service to compute recommendations, and it in turn is using other services to pull in sub-recommendations in various genres, all specialized for "people like you" based on your history of browsing and purchases.

So just this one example might give us several services, each one operating as a pool because the load on it would vary so greatly as Amazon shoppers poke around.  We would run such a pool with a small number of service instances when load drops, but the App Service can spawn more on demand.  
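As a sketch of the shape of this interaction (all service names are invented), a first-tier page builder might fan requests out to a handful of second-tier pools and assemble the results:

```python
# Toy sketch of a first-tier page builder calling several second-tier service
# pools in parallel, in the spirit of the "100 services per page" remark above.
# Each named pool would itself be an elastic set of servers behind a load balancer.
from concurrent.futures import ThreadPoolExecutor

def call_service(name, request):
    # stand-in for an RPC to an elastic service pool
    return f"{name}: results for {request}"

def build_page(user_id):
    services = ["recommendations", "pricing", "reviews", "ads"]
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = {s: pool.submit(call_service, s, user_id) for s in services}
        return {s: f.result() for s, f in futures.items()}

print(build_page("customer-42"))
```

The point is simply that the page builder is thin and stateless, while the interesting (and resource-hungry) work lives in the pools it calls.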

Unlike a first tier, second tier service pools deal mostly with internal traffic (the first tier is triggered by requests coming over the Internet, from browsers and other web-enabled apps), and whereas the first tier spins on a dime, spawning or terminating instances in fractions of a second, the second tier might be much less quick to adapt: think in terms of processes that do run on multi-purpose nodes, but once they launch tend to own the hardware for minutes or hours.  So it makes sense to put GPUs and FPGAs on these second tier instances.

The second tier is closely paired to a third tier of vendor-supplied services, which often can be customized in various ways.  So in this third tier we have storage solutions (file systems, DHTs of various kinds), databases, AI engines, etc.  Some modern developers work purely with the pre-supplied options.  On the other hand, if you are tackling an unusual situation, like automating a dairy farm, the odds are that you won't find the mix of services you really need.  You'll pretty much be forced either to work with slow back-end systems (those run in batched style, once every few hours or days), or to build a new second-tier service of your own.

With this context, let's revert to my real topic.  

Cloud operators like AWS and Azure make a great deal of their revenue in these second and third tiers of the cloud.  Yet in adopting this model, a hidden dynamic has emerged that favors the big warehouse style of clouds: the vast majority of today's web is centered around web pages and "high functionality" mobile devices.  Any rapidly-changing data lives and is used right on your mobile device; for example, this is where Google Maps runs.  Thus the data center isn't under enormous time pressure, even in the first and second tiers.  We tend to focus more on pipelines: we want a steady stream of web pages constructed within 100ms per page.  That may seem fast, but consider the speed at which a self-driving car needs to react -- on a high-speed road, 100ms is an eternity.   Perhaps this observation explains why Azure has been competitive despite its strangely slow function tier: if people currently write functions that often run for minutes, a few seconds to launch them wouldn't be noticeable.

But we are seeing more and more uses that need instantaneous responses.  Today's cloud edge is really just the outside of the cloud, and not responsive enough.  As we deploy IoT devices into the environment, we are starting to see a need for something else: an edge cloud (did you notice that the word order flipped?)  By this I mean a way to situate my computing closer to the devices.  The cloud edge might be too far away, and the latency of talking to it could cripple my IoT solution.

The need, then, is for a new form of data center: an edge cloud data center.  What might it look like, and where would it run?

My view is that it would look like the Google or AWS lambda tier (and I bet that by the time this plays out, the Azure function tier will have been fixed: the Azure cloud clearly is equal to any rival in every arena that has competitive importance right now).  Basically, the edge cloud will host first, second and third-tier software, probably using the exact same APIs we see today.

It would run on something like Microsoft's Azure IoT Edge: a small compute cluster (maybe even just a little box running a NUMA CPU chip), situated close to the sensors, with a WAN network link reaching back into the cloud for any tasks that can't be performed on this platform.

So who will run these boxes?  If we think of them in the aggregate, we are suddenly talking about a true edge cloud that could host third-party apps, charge for cycles and other resources consumed, and hence generate revenue for the operator.  These apps could then support smart homes, smart grid, smart roadways to guide smart cars... whatever makes sense, and represents a potential market.  Yet there is a chicken-and-egg puzzle here: the scale at which one can really call something a "cloud" is certainly national and perhaps global.  So any operator able to play in this market would need to deploy and manage an infrastructure at planetary scale: otherwise, we end up with just a handful of boxes of type A in the US, others of type B in China, each owned and operated by a distinct player, and each with its own customized standards.  I find it hard to see how that kind of heterogeneity could give us a true edge cloud.

I've been involved with 5G, and for a while I wondered whether telephony operators might be the answer.  They already need to deploy 5G point of presence systems, which often look quite similar to what I've described.  Indeed, Azure's IoT Edge seems to be an early winner in this market.  But the resulting deployments are still operated by literally thousands of players, and perhaps I'm underestimating by a factor of 10.  The number of likely 5G operators is simply too large for more than a few to tackle this opportunity (I can see AT&T doing it, for example, or Comcast -- but not some random little 5G operator that sells services through phone SIM cards).

Another possibility would be that one of the big cloud vendors goes all in on the edge cloud and deploys these boxes through various deals: perhaps the 5G players agree to rent the needed physical space, power and other resources, but the box itself belongs to Alphabet, Amazon, Microsoft or AT&T.  This would be a story similar to the land-grab that occurred when Akamai emerged as dominant in the content-serving market back in 1998.  On the other hand, it is notable that as Akamai grew, their deployment consolidated: around 2000, they were out to place an Akamai compute rack in every single ISP switching office in the world.  By 2010, Akamai was operating a smaller number of data centers, mostly ones they themselves owned, and yet obtaining better performance.  In fact, although I can't find any public studies of the question, I wouldn't be surprised if Akamai's global datacenter deployment today, in 2020, actually resembles the AWS, Azure and Google ones.  The question is of interest because first, it may suggest that Akamai has a bit of an incumbency advantage.  And second, any structure performant enough for content serving would probably work for IoT.

But perhaps we will soon know the answer.  The industry clearly is waking up to the potential of the IoT Edge, and the edge cloud has such obvious revenue appeal that only a very sleepy company could possibly overlook it.  If you look at the Google ngram graphs for terms like edge cloud, the trend is obvious: this is an idea whose time has arrived.

Friday, 21 August 2020

Knowledge logics (and love)

Alice and Bob, two star-crossed lovers, had planned to meet for lunch in downtown Palo Alto if the weather was good, this to be determined by Alice, who would email Bob, who then would confirm (Bob being a late riser, there was some risk that he wouldn't actually see the email in time).  But they never actually did meet, and indeed, Alice never once sent an email -- despite the fact that Northern California had a near record of three months without a single cloudy day!  When those emails didn't arrive, Bob realized that he couldn't imagine life without her.

By now you've figured out the puzzle, if you tried.  The core problem is uncertainty: When Alice emails Bob  "I just checked, nice weather again (no surprise...).  How about Vegan Burger at 12:15?", she has no certainty Bob will see the email in time to meet her.  Bob, who woke up early that morning, replies instantly: "Can't wait!".  But did Alice see his reply?  Alice replies to Bob: "It will be so nice to see you again...".  And so it goes... with neither ever reaching a state of certainty.   So, realizing this, Alice didn't waste the bits on a pointless, unending exchange.  Pankaj Mehra wins the prize for being first to post the correct answer!  

In fact the deeper puzzle is precisely the opposite of the one I posed.  The question that we really need to ask is "Why, then, do lovers ever manage to meet for lunch?  And how can distributed systems reach agreement, on anything?"

Yoram Moses and Joe Halpern formalized this scenario using what came to be known as a "knowledge logic."  You start with a normal logic, but have a set of participants (Alice, Bob, ...) and a way for them to exchange messages, which are generally treated as reliable, but with unbounded delays.  If predicate P is true for Alice, we say that Alice "knows" P: K_Alice(P).  Now we can write K_Bob(K_Alice(P)) to denote that Bob knows that Alice knows P.  For example, Bob learns that Alice knows that the weather is fine. The problem is that at the moment Bob learns this (transitions from not knowing to knowing), Alice does not know that Bob knows that Alice knows that the weather is fine.  When Bob emails his confirmation, the same issue arises but with their roles reversed: now Alice knows that Bob knows that the weather is nice, but Bob doesn't know that Alice saw the confirmation, hence for him, the situation is actually unchanged.  This knowledge asymmetry is the root of their dilemma.

Last week I suggested that anyone puzzled by my puzzle watch The Princess Bride.  Early in the movie there is a fantastic scene structured around this exact question.  The criminal Vizzini -- who happens to be the smartest man in the world (Wallace Shawn) -- engages in a duel to the death with Westley, who is the current incarnation of the Dread Pirate Roberts (Cary Elwes).   The duel centers on deciding which of two cups of wine has been poisoned.  Vizzini debates himself until, after some number of back-and-forth ripostes, he suddenly concludes "but Westley certainly isn’t smart enough to have realized this!"  He seizes one of the cups and drinks it, while Westley drinks the other.  Vizzini dies... but was his logic wrong?

The logic of knowledge is quite elaborate, and can express all sorts of statements about distributed systems and multi-party reasoning.  It also comes with some theorems, including the following.  To get rid of all the names, let's say that Bob has "K1" knowledge of P if Bob knows that someone (or some set of people) knows P.  So if Bob knows that Alice knows the weather, he has K1 knowledge (Alice has K0 knowledge).  Once Alice reads Bob's first acknowledgement, she has K2 knowledge, and so forth.

The bizarre aspect of our lovers' dilemma is that both Alice and Bob actually do know that the weather will be fine -- and both of them quickly learn that both of them know this to be true.  Both, that is, transition through a K1 knowledge state. Yet the particular (peculiar, one might say) setup is such that the weather being fine is not enough.  To actually meet, both must be sure that both will be there.  This is the aspect over which the plan stumbles.

K* knowledge, also called "Common Knowledge", is defined to be a level of knowledge that implies Kn for all values of n.  Yoram and Joe proved in their papers that common knowledge cannot be achieved in a symmetric system where all the participants have equal trust in one another and communicate via messages.  Alice and Bob, in effect, agreed to require common knowledge, not realizing that with uncertain communication delays, this wouldn't be possible.
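To make the hierarchy concrete, here is a toy simulation (my own sketch, not anything from the Moses-Halpern papers) in which each successfully delivered email deepens the reader's knowledge by exactly one level.  Any finite exchange leaves both parties at some finite Kn, never at K*:

```python
# Toy model: each delivered message raises the reader's knowledge of the
# weather by one nesting level (K1, K2, ...).  Common knowledge would
# require infinite depth, which no finite exchange can reach.

def exchange(messages, lost_at=None):
    """Alternating emails between Alice and Bob about one fact.

    messages -- how many emails the two attempt to send in total
    lost_at  -- index of an email that is silently dropped (None = all arrive)
    Returns the knowledge depth each party ends up with.
    """
    depth = {"Alice": 0, "Bob": -1}    # Alice saw the weather; Bob knows nothing yet
    sender, receiver = "Alice", "Bob"
    for i in range(messages):
        if i == lost_at:               # dropped in transit; the sender can't tell
            break
        depth[receiver] = depth[sender] + 1   # "I now know what you knew, one level deeper"
        sender, receiver = receiver, sender
    return depth

print(exchange(messages=4))               # {'Alice': 4, 'Bob': 3}: finite, always
print(exchange(messages=4, lost_at=1))    # {'Alice': 0, 'Bob': 1}: chain cut short
```

The only point of the sketch is that "one more acknowledgement" always leaves someone a level behind -- exactly the asymmetry described above.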

A curious feature of common knowledge is that if we simply introduce a special participant who is trusted to disseminate common knowledge, a system can simply accept the statements of this omnipotent participant and achieve things that, lacking such a participant, would be infeasible.

Here's a famous example that I often use when teaching the idea.  Alice and Bob, now a couple with children, have organized a birthday party for their daughter Carol.  The kids come in from the yard, where they have been playing and on their way in, they wash up... but poorly.  None washes their face, and as a result all have mud on their faces, but nowhere else.  So Bob announces "in our house, you need to be clean to sit at the table.  So if you want cake, you need a clean face!".  No child loves to wash twice (and children are like cats where water on their faces is concerned!).  Not one budges.  Bob, increasingly frustrated, repeats this a few times.  Dessert is in peril!

Meanwhile, in a different reality, Bob said something slightly different: when the kids came in, his first remark was "Wow, I see a dirty face!"  All of the children hear him clearly, and of course, young children place absolute trust in adults.  In this situation, with N children, all march to the sink the N'th time Bob repeats the rule.

The oddity in all of these stories is that everyone seemingly knows the answer, yet they behave very differently depending on how common knowledge is handled.  Alice knows the weather is fine.  When Bob reads her email, he knows it too, and when she gets his reply she is sure he knows -- and if she takes the next step and replies to his reply, he knows that she knows this.  Clearly, this should be adequate for a nice veggie burger special!  

Or consider the children: every child sees mud on every face (except of course, their own), so on Bob's very first announcement, why didn't all of them head for the sink?  But in fact, in version two, they do head for the sink.  

In reality two, Bob's introduction of common knowledge seems to be of vital importance.  But common knowledge is infinitely elusive!  You can construct an inductive proof that reduces the N-child case to the N-1 child case, and the crux of the proof turns out to be similar to the dilemma of Bob and Alice: with 2 children, perhaps Carol and Doug, you can play this out.  In reality one, Carol is hoping against hope that her face is clean (and similarly, Doug).  She sees mud on Doug's face, he sees mud on hers, so neither budges.  But Carol is hoping Doug sees a clean face, and Doug is hoping she is looking at his clean face.  Lacking certainty that at least one child has a dirty face, neither moves.  When Bob introduces that common knowledge, it changes the game, and both can deduce that their own face, tragically, must be muddy.  How can we avoid making the mistake of requiring Common Knowledge in a problem statement?
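Here is a small simulation of that induction (my own toy sketch, assuming perfectly logical children who all act simultaneously and publicly after each repetition of Bob's rule):

```python
# Toy simulation of the muddy-children induction.  Assumes every child is a
# perfect logician, muddy faces stay muddy, and everyone sees whether anyone
# walked to the sink after each repetition of the rule.

def muddy_children(n_muddy, n_clean=0):
    """Return the round at which the muddy children realize their state,
    after the public announcement 'at least one face is muddy'."""
    faces = [True] * n_muddy + [False] * n_clean        # True = muddy
    for round_no in range(1, len(faces) + 1):
        movers = []
        for i, muddy in enumerate(faces):
            others = sum(faces) - (1 if muddy else 0)    # muddy faces child i can see
            # Induction: "I see m muddy faces.  If mine were clean, they would
            # all have walked at round m.  They didn't, so mine is muddy too."
            if muddy and others == round_no - 1:
                movers.append(i)
        if movers:
            return round_no, movers
    return None

print(muddy_children(n_muddy=3))              # (3, [0, 1, 2]): all move in round 3
print(muddy_children(n_muddy=1, n_clean=2))   # (1, [0]): one muddy child moves at once
```

Without the public "I see a dirty face" there is no base case and the induction never starts, which is reality one; with it, everyone reaches the sink at round N, which is reality two.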

We often build systems that need to cooperate, elect leaders and so forth.  We certainly use a logic of knowledge when designing them; the resulting built-in facts are, in effect, common knowledge.  But the dynamic form of knowledge associated with new events is the more interesting case.  Why is it that we can solve problems like implementing state machine replication using virtually synchronous Paxos (the consistency model and protocol employed in Derecho)?  

Do we think of Derecho as a trusted source of common knowledge, or is it simply that these kinds of problems don't actually need common knowledge?

For me, the really fascinating insight is that when a system manages its own membership, creating epochs of stable membership, state machine replication seems to be a complete story for everything we ever need to do in distributed systems: if we need consistency, this model offers the answers; if we don't need consistency -- well, then use TCP and you should be happy!  Moreover, as far as I can tell, these state machine replication protocols use the membership as a form of common knowledge.  This suggests that a system like Derecho is actually a two-level knowledge structure, with one structure subordinate to and trusting the other, and that higher level structure permitted to hand common knowledge statements to the lower one!

This, however, is not something I have ever seen treated in a paper.  Perhaps someone has looked at it, but if so, it didn't cross my desk!

The other side of this story, equally fascinating, is that K1 knowledge is sufficient both for self-managed membership and for tasks such as ordered multicast delivery with safety in the event of a crash: In the Derecho paper I linked to earlier, we discuss this and present our reduction of the protocols to a knowledge logic and our implementation over 1-sided RDMA.  So, in this two-level formulation, we seem to have a K1 protocol at each level, but with one level viewed as a source of common knowledge by the other level!  

Back in the real world, there is a third aspect to the same insight.  If you reflect on it, Bob and Alice's dilemma "made sense" and does seem to capture the way we describe situations and problems.  Yet the K1 style of solution used in real systems also seems relevant: in real life, Bob and Alice would never have gotten into such a jam -- Alice would have emailed that the weather is fine, Bob would have acknowledged.  His iPhone would show a little check mark: sent.  Then a second one: read by Alice (or more accurately, it popped up on her screen and she swiped it away).   Thus in the real world, Alice and Bob make a kind of leap of faith and begin to treat the weather as common knowledge.

This idea of a leap of faith is really about probabilistic reasoning: we simply decide, at some point, that the likelihood of a problem is so low that it can be dismissed.  Logicians (including both Joe and Yoram) have looked at this issue, and shown that it has a great deal of relevance to AI.  But I'm not actually convinced that Derecho involves such a leap of faith.

If anyone knows of work on this question, I would love to hear about it!  And in the meanwhile, stay safe and enjoy the rest of your summer, happy in the knowledge that it all worked out for Alice and Bob.

Wednesday, 12 August 2020

A little puzzle to amuse you on a summer evening

Alice and Bob, both logicians, met at the company coffee bar just before Covid struck.  Both wanted to see each other again, but safety is paramount, so they agreed to meet for lunch outdoors, at a safe social distance, in a local park.  

Being detail-oriented, both instantly realized that they would need a way to cancel just in case of a rainy day (unlikely in Silicon Valley, but not impossible).  This led to the following plan: Alice, who was in the habit of checking the weather each morning, would email Bob if conditions looked good.  Bob, a late riser, would confirm her email (once he was awake and read it), and on this basis they could be sure to go forward with the plan only if the skies were blue, the birds happily chirping, and both were definitely aware that the rendezvous was on.

Months later, they happened upon one another in Trader Joe’s, and it was love.  Yet they had never once managed to meet on a lunch date, although it had been sunny without exception for the entire time since their first meeting!  Your task is to explain why this happened.

This being a puzzle, it wouldn’t be right for me to go ahead and solve it for you, but I should provide a  little more information. First, it may help to know that although both followed the protocol to the letter, neither was the kind of person to engage in useless activities.  And thus Alice had never once sent Bob an email confirming the promising weather.  Bob, who fully appreciated Alice’s dilemma, didn’t blame her: after all, he could easily confirm her reasoning.  Indeed, they both quickly concluded that it is impossible for committed logicians to meet for a lunch date, and this shared insight only increased their affection!

As Bob explained it later, when Alice failed to send that first email, he realized that she was the one.

Explain their conclusion.  Is there a protocol that would have worked?  If you need a hint, I recommend watching The Princess Bride: a wonderful family-friendly holiday movie that just happens to feature a related puzzle!

I’ll have a bit more to say about the answer and what it teaches us about distributed systems in a week.  (As for Alice and Bob, no worries.  They’ve set a date for the wedding, and plan for it to go forward rain or shine!)

Thursday, 23 July 2020

Game of Thrones

Beyond the pandemic this has been a period of national dialog about bias.  In that context I thought it might be appropriate to reflect on assessment and bias in computer science.  The topic is a large one, but blog postings need to be reasonably short, so for today, I want to highlight a narrower question.

Assessment is fundamental to a research career: work is constantly assessed in various ways.  We participate in the process, and are ourselves assessed, and careers thrive or fall on it.  As a PhD student I helped review some papers for a few conferences, but as a new faculty member I was surprised to receive a blizzard of requests: paper reviews, proposal reviews, tenure and other promotion reviews, award reviews, reviews of entire departments and research centers.  Over my career, I've led reviews of portions of the National Science Foundation and other equally large agencies.

The very idea of ranking quality of work, or individuals, or departments poses fundamental questions.  One can assess the expected mileage for a car without too much debate.  But the subjective nature of assessment in computer research is the puzzle: suppose that you have an idea that promises big payoffs but departs from standard ways of building cloud computing platforms.  I'm asked to assess you for a promotion, or your group for research funding, or your department as a whole.  Should I applaud your audacity and risk?  Or should I criticize you for going way out on a limb?  Will the cloud computing vendors adopt your methods, and if they do, will you get explicit recognition?  

How impactful is your work?  Universities rarely pursue patents, because a series of court cases led to a recognition that software patents only protect specific instances of programs: the mathematical content cannot be patented.  As a result, when we invent ideas in academic settings, and then publish papers on them and release them as open source software, we often abandon the possibility of seeking a patent.  Companies are free to read those papers and may be influenced (they might even use the software), yet they are under no obligation to cite that prior work in any public way.  In fact, because companies do protect their products with patents, and patent law treats prior-work citations according to a different model than we use in research, ideas can be highly impactful and yet the inventor may never get proper recognition. 

Rather than focusing on promotion and tenure assessments, or even program committees for conferences, let's drill down on the question in a less emotionally charged context: rankings of entire departments or universities.  

One can actually trace the concept of ranking departments to colleagues of mine here at Cornell: early in the development of computing as a field, people like Juris Hartmanis and David Gries were heavily involved in the creation of organizations like the Association for Computing Machinery and the Computing Research Association (CRA).  The CRA was explicitly founded to aggregate data on topics like salaries and workloads, and over time, this expanded to include an early role in assessing quality and ranking departments.

This was "early days" for the field, and the real goal of these rankings was to promote the idea that national research funding should be spread over a broader and more diverse set of institutions.  The need for self-advocacy drove the process as a whole: unlike established disciplines such as physics or mechanical and aerospace, computer science had few voices in Washington back then, and was not winning much of the national research investment funding stream.  One can trace this to the evolution of computer science from mathematics on the one hand, and electrical engineering on the other.  Thus, we saw support for theoretical computer science that emerged as an outgrowth of the mathematics arms of organizations like NSF, AFOSR and ONR, all of which had long track records of supporting the mathematical and theoretical sciences, side by side with support for computer engineering from corporate partnerships and the military.  My own work is in computer systems: an engineering discipline.  For my area, the defining event occurred back in 1978 when DARPA emerged as a funding leader, soon followed by AFRL and ARO.  In distinction from the theory side of the field, all of these focused on concrete "deliverables:" useful artifacts, like the original BSD Unix extensions, or the Internet.

Even these early steps were consequential.  When I joined Cornell, the department was widely seen as a leader in theory but weak in systems: precisely the message conveyed by those national assessments I just listed.  As a young researcher, I applied for NSF funding from a program that nominally supported work of all kinds.  Foolish me: the NSF program officer called for a quiet chat to explain that although my proposal was ranked high, the NSF program only wanted to support research with a predominantly theoretical focus.  We negotiated a strange compromise: NSF would fund the theoretical aspects of my work, and didn't wish to be cited on papers that were entirely focused on software systems!  Obviously, this was a quirky situation centered in many ways on that one program officer, but I find it quite telling that as a new researcher, I was already being warned that my work would be judged in one of two very distinct ways.  Much later, I was asked to chair an external assessment of NSF itself, and was somehow not surprised to discover that this same bias towards mathematics was still very evident, although I didn't hear any stories of phone calls like the one I had received!

Meanwhile, though, Jim Gray did me a favor.  Shortly after that call, I was contacted by someone at DARPA who was creating a new program in software fault-tolerance, and had spoken to Jim.  They were excited about the idea of software tools for fault-tolerance and requested a short proposal focused on software deliverables.  The program director emphasized that DARPA didn't want to see incremental work: he was ready to spend a lot but wanted to "buy" a disruptive new idea, the kind of advance that could create entire new industries.   And he explicitly said that Jim had suggested a few people, but he was limiting himself to the very top departments.

When I took this job, it impressed me that Cornell was ranked among the top five departments (at that time my colleagues griped endlessly that we should be even higher ranked!), but I didn't really appreciate the degree to which this would be beneficial.  Systems research is expensive, and without that funding, it would have been very hard to succeed.  In retrospect, it is absolutely certain that Cornell's broader reputation was a key reason that people invested in my vision.

Assessments and rankings play a central role at every step of a research career.  The particular NSF program officer I spoke to made it clear that he was only prepared to support my work because of Cornell's high repute in theory: he viewed it as a bet that "for a change" systems research might have a robust theoretical basis if I engaged in a meaningful way with my colleagues.  As it turned out, this worked well because it created an obligation: In effect, I agreed to set aside a part of my effort to really pursue the theory underpinning state machine replication and fault-tolerant consistency, and this in turn shaped the software systems I built.  

In effect, NSF made a bet on my group, and DARPA did too, based very much on reputation -- but of course at the start, I myself had no real reputation at all.  This was a bet on Jim Gray's reputation, and on Cornell's reputation.  And they did so with very distinct goals: for DARPA, this was about creating a substantial software group and impacting a real and important user community.  Meanwhile, NSF was keen to see papers in top conferences and journals, and wanted to see that those papers were being cited and triggering follow-on work by a broad crowd: the NSF officer wanted the distributed systems theory community to think more about relevant impact and to see examples of theoretically rigorous work that had real-world value.  Thus both were betting on a form of disruption... but in different ways, and both were basing that bet heavily on Cornell's ranking: they both believed that Cornell offered a high enough podium to amplify the potential disruptive impact.  That it happened to be me who benefitted from this was somewhat incidental to both organizations, although it obviously worked in my favor!

As this played out, DARPA kept pushing to move faster.  They urged me to spin off companies (I ultimately did so a few times... one was quite successful, the others, less so).  There was even some discussion of them investing venture capital, although in the end holding stock in a startup was deemed to be outside of DARPA's authorized scope.   But even that one successful company opened doors: my group was invited to design the core consistency and resiliency mechanisms of the French air traffic control system, the New York Stock Exchange, the US Navy AEGIS, and many other real systems.  How many PhD students can say that as their PhD topic, they redesigned the floor trading system of the NYSE?  One of mine did!  And those successes in turn brought contacts with development teams at many companies facing a next set of challenges, which became follow-on research projects.  Along the way, I also was promoted to tenure, which wasn't such an easy thing for a systems researcher at Cornell back then (less of an issue today). So you can look at this and recognize that in a direct way, assessments of the department, and expectations that I really imposed on myself when defining "milestones" shaped and guided my career.

Thus, that original CRA ranking actually had a huge impact that really shaped my career.  The CRA ranking was just one of many.  US News and World Report quickly emerged through its special issues that rank colleges and universities, including research specializations such as computing.   Back in 2010 there was a National Research Council ranking (produced by a group within the National Academy of Sciences).  The NRC set out to put ranking on a firmer statistical footing, but then seems to have left those rankings untouched since releasing them a decade ago.   Of late, one sees a new commercial ranking site more and more often: the "QS ranking of top research universities".  

Should we trust a ranking such as the one by US News and World Report, or the QS one?  As it happens, US News and World Report is relatively transparent about its process, or at least was transparent about it when I inquired back in the late 1990's.  At that time, their ranking of computer science PhD research programs had emerged as the winner; people need to cite some sort of consensus ranking in various situations, and this was the ranking most of us would point to.  Relatively few, however, realized that the entire US News and World Report ranking centers on a single number provided by each of two people in each institution they track.  (QS seems to be similar, but doesn't disclose details of the way they arrive at their ranking, so we won't say more about that one).

The way it works is this: every three years (the next cycle should start soon, with the results published in spring 2021), the company sends a questionnaire that collects various data about research expenditures, program sizes, and a few other metrics, but also includes a little fill-in-the-bubble sheet, listing 500 programs that grant PhD or MS/MEng degrees.  500 is a large number, particularly because all of these programs are located in the United States (perhaps also Canada).  As a result, the survey also reaches out to a great many schools that don't do research and do not grant PhD degrees. The relevance of this will be clear in a moment.

All of this data is "used" but not all of it contributes to the ranking.  US News and World Report does summarize statistics, but the actual ranking depends only on this second bubble-sheet and the 1000 or so people who are asked to complete it.  The bubble-form is sent to two people at each graduate-degree granting school, one copy to the department chair or dean, and one to the director of the graduate program.  These two people rate each program by its research strength: from exceptional to poor, five options.  Presumably, many don't respond, but the publication takes the forms they do receive, computes average scores for each department, and voila!  An instant ranking.

There is an aspect of this approach that I have to admire for its frankness.  As you might expect, it is terribly hard to devise a formula that would properly weight research funding, papers, impact, program size, coverage of hot areas, etc.  CRA believed this could be done, even so, and the early CRA ranking centered on a somewhat secretive formula that was created by a few national research leaders, who in effect locked in their own intuition about the proper ranking.  Later, this mysterious formula came under scrutiny, and when the NRC decided to create a new and "better" ranking, they concluded that no single formula could properly address the diversity of students and student goals.  Accordingly, they collected all sorts of metrics, but their web site asked each visitor to provide their own weights for each element.  This makes a lot of sense: there is no obvious way to come up with standardized weights that would somehow reflect all of our different value models and objectives.  Yet it leads to a very unstable ranking: the CRA ranking was quite stable for decades (a topic that led to criticism by schools not in the top few!), but when using the NRC version, even tiny changes in weights led to very different rankings. Moreover, it is quite easy to end up with a ranking that produces nonsensical results!  For example, it seems natural that funding levels should be a factor in rankings, yet this ignores the huge research centers that some schools have attracted in specialty areas -- centers that they are permitted to include in their funding summaries.  Thus any weight at all on funding will hugely promote the standings of those schools.  Yet this overlooks the actual substantive quality and nature of those centers, some of which do classified work or are very narrow in other senses.  If a center doesn't contribute to the academic profile of a department, reflecting it into a funding metric is a questionable choice.  And again, this is just one of many examples I could give.  The trouble with metrics is that they are often biased in ways that one might not have anticipated.
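To see just how sensitive a "choose your own weights" scheme is, consider a toy example (the departments, metric values and weights below are all invented):

```python
# Hypothetical illustration of why per-visitor weights make rankings unstable.
departments = {
    "Dept A": {"funding": 90, "citations": 60, "placement": 70},
    "Dept B": {"funding": 55, "citations": 85, "placement": 80},
    "Dept C": {"funding": 70, "citations": 75, "placement": 65},
}

def rank(weights):
    """Order departments by a weighted sum of their metric values."""
    score = lambda metrics: sum(weights[k] * metrics[k] for k in weights)
    return sorted(departments, key=lambda d: score(departments[d]), reverse=True)

print(rank({"funding": 0.5, "citations": 0.3, "placement": 0.2}))  # ['Dept A', 'Dept C', 'Dept B']
print(rank({"funding": 0.2, "citations": 0.5, "placement": 0.3}))  # ['Dept B', 'Dept C', 'Dept A']
```

Nothing about the departments changed between the two calls; only the visitor's weights did, and the order flipped top to bottom.  That is exactly the instability the NRC site exposed.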

Yet even this statement expresses a form of biased expectations: like any researcher, I have a pretty good idea of the top programs and groups in my own research area, and hence a strong sense of what the ranking should look like both in systems and more broadly, for computer science departments as a whole.  For example, Cornell is quite low in the most recent (2018) ranking of systems research departments in US News and World Report.  This really bugs me because we have a superb group, and I know perfectly well that we should be one of the top five or six in any sane ranking.  Of course, I can remind myself that a ranking reduced to a 1-5 bubble-score vote polled from 1000 or so computer science academics is terribly simplistic.  For me, the ranking that matters would be one you could get by polling my peers, who are concentrated at a much smaller set of schools: perhaps 25 here in the United States, and another 25 globally.  US News and World Report polled department chairs and directors, and included 475 schools that don't conduct research in systems at all.  Yet it still rankles!

On the other hand, there is also an argument to be made that US News and World Report has hit on a brilliant and "robust" approach, at least for overall rankings of departments as a whole.  Consider this: by asking the chair of a computer science department and the director of a graduate program to distill their impressions into a single number, the US News and World Report treats these individuals as a kind of biological neural network, which their system "samples."  Granted, the basis of their individual responses may be hard to quantify, yet if one accepts that as professionals in the field those impressions have meaning, then one has to accept that the resulting ranking is based on... something.  

But what, in fact, is this basis?  The survey doesn't include any guidance.  Respondents can, of course, visit department web sites -- but how many do so?  My guess is that most just zip down the page filling in bubbles based on vague impressions, skipping the schools they have never heard of.

For example, what is the first thing that comes to mind when you think about the University of Illinois (UIUC)?  Presumably, you'll note that the department is renowned for its work on high performance computing hardware, but of course that really isn't a computer science topic.  Then you might point to the Mosaic web browser: back when CERN came up with the idea of turning the Internet into a world wide web, UIUC created the first tools to make this a reality.  Ask 1000 professional academics about the history of the web, and I would bet that quite a few of them do know this aspect of history -- enough to bias the ranking substantially. It isn't an accident that UIUC soared in the subsequent US News and World Report ranking, and stayed there.  

Game of Thrones focused on individuals, grudges, and nasty back-stabbing behavior.  Yet I've written almost entirely about assessment through a crowd-sourcing approach that really eliminates most opportunities for individual bias to torque the outcome.  Ironically I was very critical of the US News and World Report rankings for years: to me, they rank but do not "assess" in any meaningful sense.  But what I see today is that any other approach to ranking would almost certainly be worse, and much more at risk of manipulation.  As for assessment... I honestly have no idea whether what we do is valid, even after participating in the system for nearly 40 years.

This thought invites the next: to ask whether computer systems has endured its own Game of Thrones played out through paper reviews and promotion letters.  The area certainly has had its share of colorful disputes, and anger doesn't lend itself to fairness.  But what has helped, a lot, is that more than two decades ago the systems area leadership began to take concrete actions aimed at supporting existing researchers while also educating newcomers.  We have a workshop specifically focused on diversity, there are more and more anti-bias mechanisms in place, and we talk openly about the harm bias can cause and the importance of active measures, not just lip-service.  To me it is obvious that this has helped.  The field is far more diverse today, more welcoming, and even paper reviews have evolved to focus more on the positive and less on destructive negativism.  We have a long path to follow, but we're moving in a good direction.

Let me end by sharing a story about Jim Gray, who passed away in 2012.  Jim used to go out of his way to engage with every new systems person he could find.  I mentioned one example of how he helped me above, but in fact this was just one of many times he helped in my career.  I thought of him as an advisor and mentor from the very first time I met him, not long after I started my graduate studies at Berkeley: he urged me to work on fault-tolerance and consistency questions "outside of database settings", and that advice ultimately shaped everything!  The only thing he ever asked in return was that those of us he helped "play it forward," by helping others when the opportunity arose.  

How did Jim act on this philosophy?  He was constantly reaching out, offered to read papers, and without fail would find something to get excited about, something to encourage.  He wrote wonderful tenure and promotion letters, taking the time to think hard about each element of a candidate's work, listening hard to their personal research and teaching vision statements, and invariably looking for the positive story.  He was behind that first phone call I got from DARPA, and later, when I launched companies, he helped introduce me to potential customers who might have been very, very, nervous without his endorsement.  For Jim, strengths always outweighed weaknesses,  risk was a virtue, and even a project that ultimately failed often yielded insights we could build upon.  Where others saw high-risk work that rejected things they had worked on, Jim loved debating and took actual pleasure in finding weaknesses in his own past approaches.  He invariably viewed edgy work as having promise.  He could be blunt when something didn't work, but he didn't view failure as an individual weakness.  Instead he would ask why this happened, what assumption had we made that turned out to be wrong, and what intuition we needed to learn to question.  There was always a next step, built on insights from the first step.  

This was a style of assessment that we might all consider emulating.

Saturday, 13 June 2020

That's impossible!

Distributed computing, like many mature areas of computer science, has its share of impossibility results.  I was lucky to be in the field before many of them were discovered, and because I'm basically a software builder, my research group built all sorts of distributed systems and tools.  Later, though, the theory community showed that some of those things were impossible, which leaves a puzzle: How can anything solve an impossible problem?  

Today, I thought it might be fun to explore an example.  I want to go deeply enough into the question to shed light on the limitations of what is feasible in computer networks and cloud datacenters.  At the same time, we'll also discover limitations on what can be modelled and proved, and even some limitations on the subtle ways that choices of wording (in English) can shape perceptions about what can and cannot be done.  

When we talk about consistency in a distributed system, people generally map the work to their favorite model. For fans of database models, this might be transactional serializability and the ACID properties: Atomicity, Consistency, Isolation and Durability.  Distributed systems researchers would point to State Machine Replication or Linearizability.  The systems I created also worry about dynamic changes to the pool of servers, and have a consistency mechanism called virtual synchrony: membership changes in atomic steps.  During a period when membership is stable (an epoch), we use state machine replication within the members.  Protocols like Paxos solve this part of the problem.

If you take a big step back and abstract, all of these forms of consistency boil down to forms of fault-tolerant consensus: we have some set of processes, and they propose various kinds of actions, and vote on what action to do next.  Consistency is the property that they all decide the same thing.  Thus, anything we can say about fault-tolerant consensus sheds light on fundamental limitations that apply to databases, distributed systems, and all sorts of other settings.
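As a toy statement of the problem (my own sketch, and one that tolerates no failures at all), the goal is simply that every process decides, and that all the decisions are identical:

```python
# Toy statement of the agreement problem itself, with no fault tolerance:
# every process proposes a value, and "consistency" means the decisions,
# once made, are identical at every process.

def naive_consensus(proposals):
    """Have everyone decide the proposal of the lowest-numbered process.
    Trivial when nothing fails; the difficulty FLP addresses is preserving
    the 'everyone decides the same value' property when a process can
    crash partway through the protocol."""
    chosen = proposals[min(proposals)]          # deterministic rule all follow
    return {p: chosen for p in proposals}

decisions = naive_consensus({1: "commit", 2: "abort", 3: "commit"})
assert len(set(decisions.values())) == 1        # agreement: one decided value
print(decisions)
```

Everything that follows is about preserving that single assertion when processes can crash mid-protocol and messages can be delayed arbitrarily.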

Any mathematically-based discussion starts with a model.  For example, what does it mean to say that a failure has occurred?  The most obvious choice is to say that well, some component halts -- it crashes suddenly and without doing anything horrible right as it collapses.  But it is easy to come up with other failure models.  

For example, consider timing.  If a processor has a local clock that fails (drifts from the truth, or perhaps starts to show nonsense values), sometimes the remainder of the system stays healthy, even including the process on the machine with the faulty clock.  Yet that could cause the process running on that machine to "lie" about what time some event occurred, or to miss a deadline.  We tend to wrap all of these up, and call them timing faults.  

Communication faults are tricky to model too.  You may not be aware of this, but on Linux a process can send a message that the operating system or network then drops.  Should we blame the process itself?  The network?  And should we say that the message wasn't sent, or wasn't received?  Worse, a network could split into parts that can't talk to each other: network "partitioning" failures.  

Then we get the "hit by a cosmic ray" issues.  Things like that (or more mundane problems like a power fluctuation) can cause the computer memory to flip bits.  As a result, a process could experience some form of data corruption.  And this doesn't even consider the case where the process is hijacked by a virus.  We tend to lump all such issues into what we call "malicious" failure models, but even within the malicious models, there is a range that includes whether or not one allows collusion, as opposed to a strictly isolated form of nefarious misbehavior: A virus that can infect one process might be able to infect every process running the same code, and then mount a coordinated attack on something else. In contrast, that bit flip is just a case of flakey hardware and would only impact a single process at a time.

There is a lot of work that explores this range of behaviors.  In fact one model, called the BAR model of failures, starts with these cases and then goes further by introducing incentives: are the participants out to cause chaos (a so-called byzantine case)?  Or are they altruistic?  Purely rational?  Crash failures are then layered in, giving a cross-product that you can study to yield an abundance of impossibility results for tasks like reaching agreement or electing a leader.  

For our purposes today, the result I want to discuss is one from a paper by Fischer, Lynch and Paterson called Impossibility of Distributed Consensus with One Faulty Process.  We often refer to this just as the FLP impossibility result, and it is arguably the most famous of all such results.   As the title suggests, the paper seemingly shows that agreement (consensus) is actually not possible if a system is at risk of crash failures.  On the face of it, the FLP assumptions about the network and the possible failures are very mild -- and the result seemingly implies that databases, Cornell's amazing new Derecho system, dozens of other Paxos-based systems, Blockchain solutions for Bitcoin (all of which can solve consensus) are "doing something impossible."  Indeed, FLP seems to imply that the very idea of consistency as a goal is hopeless: if we accept the title, don't waste your time building a database!  But as we will see, a lot centers on the details of what the words in that title actually mean.  The result is a real, valid one.  It does apply to all of those kinds of systems.  But it doesn't mean we should pack up and head home.

The paper came out in 1985, which was an important year for me: it just happened to be the year when I wrote my first paper about a distributed system my students and I had created, one that implemented atomic multicast and Paxos and could be used to solve all sorts of practical agreement problems.  We called it Isis (as it turned out, an unfortunate choice).  Isis was released to the public, open source, and by 1987 or so it had a surprisingly wide uptake.  The system was ultimately used in the New York Stock Exchange to run the trading floor, was adopted in the French Air Traffic Control System, where it is still used for many purposes, and even found its way into the Oracle database product, which launches Isis every time the Oracle cluster management system boots -- and these are just a few examples.  

As you can guess, right from the start I was getting questions about FLP, including in panel sessions at major conferences where this oddity was debated.  My mother (an academic) has a saying that academic debates become heated precisely because the issues really don't matter.  The big conference in systems, SOSP, was infamous in that time period for fireworks, very much validating my mother's point.  In retrospect, I would have to say that relatively few SOSP attendees cared in a deep sense about FLP.  But they were curious to understand the result:  The paper is short and easy to read, but surprisingly hard to map to any simple intuition about what it "means".  And let's face it: they also enjoyed lively panel debates.

Isis users didn't fuss too much about FLP: if they knew about it at all, they perceived it as an oddity.  For a decade, all those traders on the floor of the NYSE happily traded stocks.  Eventually the parts of the system that used Isis were phased out during a routine upgrade, but not because it had caused any issues -- in fact there wasn't a single disruption to trading during that entire ten-year period (crashes did occur, but Isis orchestrated self-healing recovery in every case).  

By now, the French ATC system has been safely guiding airplanes since 1995 and will probably not be upgraded before 2025: a 30 year run!  The designers of those platforms, and others, liked Isis both as a practical tool and because of its consistency-preserving consensus mechanisms.  Isis created a world of state machine replication elements, which enabled self-managed, self-healing applications.  Moreover, just as Isis itself could be described easily using the mathematics of state machine replication, those applications could also be proved correct and safe, even when components failed.  

For example, one obviously wants to be sure that an Air Traffic Control System guarantees that each plane and each airport runway will have a single designated controller who is responsible for managing it.  Isis allowed the French system to obtain this property.  Each flight should have a single active flight plan; any change to the flight plan subsumes the prior one.  Isis was central to their correctness proof for this property, too (one can think of an ATC system as centering on a log of flight plan versions, in which each change is a durable append to the log, and the current flight plan is always the version closest to the end of the log).  
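
To make the log idea concrete, here is a minimal sketch in Python (hypothetical names and structure of my own, not the actual Isis or ATC code) of an append-only flight plan log in which each change is an append and the current plan is simply the most recent version:

    # Minimal sketch of the "log of flight plan versions" idea (hypothetical,
    # not the actual ATC or Isis code).  Each change is a durable append; the
    # current flight plan is the version closest to the end of the log.

    class FlightPlanLog:
        def __init__(self, flight_id):
            self.flight_id = flight_id
            self.versions = []          # append-only list of flight plan versions

        def append(self, plan):
            # In the real system this append would be a replicated, durable
            # consensus operation; here it is just a local list append.
            self.versions.append(plan)

        def current(self):
            # The active plan is always the most recent append.
            return self.versions[-1] if self.versions else None

    log = FlightPlanLog("AF123")
    log.append({"route": "LFPG->LFBO", "level": 350})
    log.append({"route": "LFPG->LFBO", "level": 370})   # new version subsumes the old
    assert log.current()["level"] == 370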

ATC systems never generate huge loads, and for this reason it was also possible to quantify the system's performance profile, and even to show that performance was stable across a wide range of stress tests and failure sequences (one does this by designing tests that carefully measure delays and bandwidth even as failures are injected to mimic scenarios believed possible during actual operation).  This enabled the developers to convince skeptics that if the workstation used by some controller were to fail, someone else would take over the role within 30 seconds.   During certification, the French used red-teams that were free to pore over the protocols, the code, and the way the system used them.  Then they would often demand proofs for challenging scenarios they would construct.  Sometimes the best response included a mathematical argument, but more often, these red-teams wanted to see experimental responses: an experiment that would mimic the case they worried about, and ride it out successfully.  Over time, the system evolved to have an enormous range of unit tests and integration tests.  Isis passed every test... and yet, FLP sits there.   

Curiously, the one time I can remember this coming up, the red-team dismissed the FLP work, feeling that it made assumptions that didn't apply in an ATC setting (as we will see, FLP is posed in a very abstracted way, and assumes a very general model of the network).  Yet I still felt that I needed a really cogent answer.  Suppose that someone really were to challenge Isis in this way.  What would be the best possible way to respond and explain how Isis relates to the FLP result?  

In fact this was an actual worry for me.  Those customers trusted me, and under my guidance, we were using the Isis system for applications where it genuinely matters that the solution have safe, consistent, self-healing behavior, and within seconds too!   We had correctness proofs for the system (really, proofs of safety, not liveness, but the performance work had the flavor of a liveness claim).   FLP proclaimed this combination of guarantees to be impossible... how can that be?  Indeed, there were follow-on papers that appeared soon after FLP, pointing out that Isis and its version of Paxos were clearly subject to the FLP impossibility result.   I had no choice but to figure out the proper rebuttal.  

A great deal of the explanation centers on the FLP paper's approach to modelling the problem.  Earlier, I said a few words about their system model, but didn't mention the network, and I didn't explain how they define asynchronous execution.  So let's tackle those aspects first.  The FLP paper assumes a network that mimics several aspects of TCP.  Processes are permitted to send each other messages, and FLP assumes that these are sent over reliable network channels that never drop, corrupt or duplicate packets.  I mentioned earlier that Linux, and the network itself, can both drop messages.  But in fact if you use TCP, the TCP protocol itself compensates.  On the face of it, FLP seems to assume exactly what TCP guarantees in a standard networked system, like a cloud data center.  TCP, as you probably know, obtains this behavior by sequencing data, and then using explicit acknowledgments or complaints ("negative" acknowledgements) to trigger retransmissions and rate control.  Duplicates can be filtered out because TCP has already seen those bytes.  

On closer study, the FLP network model departs from TCP by allowing out-of-order delivery.  Moreover, this matters: their proof requires this property, because it involves delaying some messages and allowing others to skip past the delayed ones (we'll see how they use this feature in a moment).  For a while, it seemed plausible that this could be the key.  Perhaps state machine replication only seems so feasible because we (mostly) run over TCP, which delivers messages in order.  However in the end, it turned out that this particular aspect of the FLP model was unimportant.  FLP applies even to a protocol built over TCP or RDMA.

Another puzzle relates to the way the FLP model defines asynchronous behavior.  In the FLP description of processes, there are no clocks and indeed nothing even faintly resembling a clock: a set of processes could run the protocol for a billion years before achieving agreement, and this would be "just fine."  Obviously, an air traffic control system wouldn't be happy with billion year delays, so real systems like Isis and Derecho have timers built in: if some process isn't responsive, the others vote the silent one out.  To avoid a partitioning event (where our ATC system might split in half, with two subsets that each believe the other to have crashed, implying that two controllers would think themselves responsible for the same plane), we just require that in any such vote, a majority of the system has to "survive" long enough to vote in favor of the new membership.  The majority rule eliminates the risk of split-brain behaviors, which could threaten safety.   
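
As a toy illustration of that majority rule (my own sketch, not the real Isis or Derecho membership protocol), a proposed new membership view is installed only if a strict majority of the current view endorses it, which is what rules out two disjoint partitions both proceeding:

    # Toy sketch of majority-based membership change (not the actual Isis or
    # Derecho protocol).  A new view is installed only if a majority of the
    # *current* view votes for it, so two disjoint partitions can never both
    # succeed.

    def try_install_view(current_view, proposed_view, votes_in_favor):
        quorum = len(current_view) // 2 + 1          # strict majority of current members
        supporters = set(votes_in_favor) & set(current_view)
        if len(supporters) >= quorum:
            return proposed_view                      # view change accepted
        return current_view                           # not enough votes: keep the old view

    current = ["p1", "p2", "p3", "p4", "p5"]
    # p5 seems unresponsive; the others vote to drop it.
    print(try_install_view(current, ["p1", "p2", "p3", "p4"], ["p1", "p2", "p3", "p4"]))
    # A minority partition {p4, p5} cannot form its own view:
    print(try_install_view(current, ["p4", "p5"], ["p4", "p5"]))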

These points, it turns out, are a bit more complicated than the one involving TCP.  The first thing to appreciate is that rapid responsiveness is very important in an ATC system.  When a pilot is approaching a landing strip, the ATC system is under a timed obligation to tell the pilot what to do: should plane A land first, or will plane B land first?  Can plane A climb 2,500 feet to avoid that thunderhead?  

Failures can threaten those quick responses, and this means that an ATC system may be in a hurry to kick out an unresponsive component.  Yet if a pilot was cleared to land, but then the original controller's computer freezes and a new controller takes over, he or she should know about that prior clearance.  This turns out to be the same requirement as the FLP rule that if any process decides v, then every process decides v.  This tells us that even though ATC systems are biased towards availability, the most fundamental aspect of the FLP way of defining consensus still applies.

What about clocks?  Real-world control systems like ATC platforms make heavy use of time, and take precautions to protect against errors that could be introduced by machines with faulty clocks.  FLP had no clocks... does this somehow take us out of the FLP scope?  One can imagine that it might: clocks enable timed actions, and we use timeouts for fault detection: an ATC platform will give up on any tardy process and treat it as faulty, simply because we don't want the entire system to end up waiting for some very overloaded machine to catch up.   Doesn't FLP itself need a way to model the detection of failures?  And because timeouts are a form of non-determinism, how can we reconcile this use of time with the deterministic state machine model for the protocol itself?

As it turns out, FLP does have a way to handle non-determinism.  In the state machine formalism, processes are state machines, but FLP allows states to have null transition edges.  That is, if a process (call it p) is in some state s, FLP models p as having a set of possible next states that we could reach from s: perhaps, states s' or s'' (loop-back edges are fine, by the way, so one of these could just be s itself).  Each such transition is described by an edge in a state transition graph, and these edges are labeled by a kind of pattern that unambiguously tells us which transition to take for a given input.  Thus, when a message is available, we could either consume that message, or take a null transition: a non-deterministic event. 

Given that the behavior of a deterministic state machine is fully determined by its sequence of inputs, you can see that the real question centers on the decision as to whether the next input will be a message or a null.  FLP gives the network this power: they model the network as an active entity that makes its own decisions.  In general, the network has a set of messages ready to deliver, and decides which to deliver, in what order, and whether or not to deliver a null to some process rather than a message.  Thus, in the FLP model, timeouts are viewed as a network "decision" to deliver a null message in a state where the protocol may have been waiting for something else (perhaps p is waiting for a message from q, but instead gets a null, and interprets this to mean that a timeout has occurred).
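
Here is a tiny Python rendering of that setup (my own simplification, not the formalism in the paper): each process is a deterministic state machine that consumes either a message or a null, and the network is the entity that decides, at every step, which of the two it will hand over:

    import random

    # Simplified rendering of the FLP setup (my own sketch, not the paper's
    # formalism).  A process is a deterministic state machine whose next state
    # is fully determined by its inputs; each input is either a message or a
    # null.  The network decides which it will be, and when.

    class Process:
        def __init__(self, name):
            self.name = name
            self.state = "waiting"

        def step(self, event):
            if event is None:
                # A null delivery: the process might interpret this as a timeout.
                self.state = "suspects-timeout"
            else:
                self.state = f"received:{event}"

    class AdversarialNetwork:
        def __init__(self):
            self.pending = []            # messages accepted but not yet delivered

        def send(self, msg):
            self.pending.append(msg)

        def deliver_one(self, process):
            # The network chooses: deliver some pending message, or a null.
            # It must eventually deliver everything, but nothing bounds how
            # long it may delay any particular message.
            if self.pending and random.random() < 0.5:
                process.step(self.pending.pop(random.randrange(len(self.pending))))
            else:
                process.step(None)

    p = Process("p")
    net = AdversarialNetwork()
    net.send("vote=1 from q")
    net.deliver_one(p)
    print(p.state)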

FLP's network is unusually powerful.  In effect, it is able to scrutinize each and every message, selectively deciding when to deliver it.  There is an obligation to eventually deliver every message... but no bound on how long the network can first delay it.  And here, it turns out, is the crux of why FLP concludes that consensus is impossible, even though protocols based on the Paxos model solve consensus billions of times every day.  Real systems always time out when a process has been unresponsive for more than a few seconds.  But FLP doesn't require the network to behave that way.  The network controls null transitions, yet it might not do so in a way that has anything to do with timeouts.  Here, then, is another possible candidate for explaining why FLP doesn't preclude building your favorite consensus-based technology: FLP's way of using nulls has no connection at all to timeouts, and doesn't actually model the way that timeouts are used in real systems.  And let me add: this is a very credible argument.  Many people invested years thinking about this exact argument, or ideas somewhat like this.

There is an old adage about big systems that if something can happen, it will happen.   The FLP authors would argue that their network simply schedules messages, that nulls are examples of places where a timeout could have occurred, and that because even TCP may have to retransmit a few times before a message gets through, that delayed delivery is a real and valid thing.  Thus any schedule FLP invents could arise in a real system, particularly if you imagine that a network link was somehow broken, then later repaired.   But as we will see (and one can easily turn the observation into a rigorous proof), the FLP network really has impossible superpowers, because the particular messages that end up being delayed this way, and the particular moments when a null transition occurs, need to be chosen with exquisite care.  Yes, each individual event could occur in the wild, but every one of them would be so unlikely that any sequence of them would be of zero probability: the probability of a sequence of unlikely events is the product of their probabilities, and with small probabilities, we approach zero exponentially quickly.    Yet just this kind of zero-probability sequence is the core of the FLP proof.
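
To put a number on that intuition (with purely illustrative figures of my own, not anything taken from the FLP paper): if each of the "just wrong" scheduling choices had, say, a one-in-a-hundred chance of happening on its own, a run that needs forty of them in a row is already hopelessly improbable:

    # Illustrative arithmetic only; the probabilities are made up.  Independent
    # unlikely events multiply, so a long run of them vanishes exponentially.
    p_single = 0.01               # assumed chance of one "just wrong" scheduling choice
    n_events = 40                 # assumed number of such choices the adversary needs
    print(p_single ** n_events)   # 1e-80: effectively zero in any real deployment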

To see all of this in action, let's walk through a scenario that the FLP proof constructs.  Imagine that we have built a system that must vote on something (0 or 1), and that we happen to be in a scenario with a near-tie.  Further, assume that we were given a tie-breaking rule.  Normally, there actually won't be a tie, but a failure could result in a tie, and then we would use the tie-breaker.  For example, perhaps we have 11 processes, 5 vote 0 and 6 vote 1.  The winner of this vote should be 1, but only if all of the 6 votes for 1 are tabulated.  If one of the processes voting 1 crashes, we have a 5:5 tie, and the tie-breaking rule might award the win to 0.
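
In code, the decision rule for this scenario might look something like the following sketch (hypothetical, just to fix ideas):

    # Hypothetical rendering of the voting scenario: 11 processes, majority
    # wins, and a deterministic tie-breaking rule (here: ties go to 0).

    def decide(votes):
        ones = sum(1 for v in votes if v == 1)
        zeros = sum(1 for v in votes if v == 0)
        if ones > zeros:
            return 1
        if zeros > ones:
            return 0
        return 0                      # the tie-breaking rule

    all_votes = [0] * 5 + [1] * 6
    print(decide(all_votes))          # 1: all six 1-votes were tabulated
    print(decide(all_votes[:-1]))     # 0: one 1-voter crashed, 5:5 tie, tie-break fires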

FLP sets up this scenario, and the proof centers on delaying messages to confuse the protocol about whether to use its normal rule, or the tie-breaking rule.  They show that any fault tolerant protocol that can switch over and use its tie-breaking rule would require a few state-machine transitions during which, presumably, it would be internally reconfiguring to switch modes.  Then, just at this moment, FLP delivers the delayed vote, selecting some other message and delaying it instead.  In effect, just as the vote itself was a multi-step procedure, they are playing with the idea that switching to tie-breaking mode would also be a form of consensus: another multi-step procedure.

In a formal analysis they actually treat both of these as transitions from a state with two possible outcomes (0 or 1) to a state with just one decision.  And what they prove is that in any such state, a network can selectively delay some messages, and selectively deliver other messages (or nulls), and force the system back to a two-outcome condition.  Then they let the delayed messages through and attack again, in the identical way.  Thus, any decision is indefinitely delayed.

Interestingly, they can do this without any failures at all: they prove that any fault tolerant consensus protocol has a run in which no process fails, and yet even though nothing fails, no decision is ever reached!  But the key is that the network must have that superpower enabling it to selectively delay just the right messages at just the right instant, while also knowing just when to deliver the delayed messages, so that the network obligation to eventually deliver every message is respected.

From this, we can see that the FLP authors actually meant something peculiar by their use of the word "impossibility."  In normal conversation, impossible means just what it sounds like.  If I tell you that it is impossible for a protocol that can only decide 0 or 1 to decide -1, we would all agree on this.  But given what I've described, if I claimed that it is impossible to build a network that has the properties assumed by FLP, you would probably agree with me.  The FLP authors would disagree: for them, "exponentially convergent to 0 probability" still leaves that tiny, ultra-unlikely bad case.  No matter how unlikely, if a system could exhibit some behavior, the FLP model would consider that it is possible for that behavior to occur. 

Conversely, in FLP, impossible also has a meaning... but not the meaning I might instinctively assume.  Think about the classic definition of correctness for a protocol, as Dijkstra first framed it: a correct algorithm is one that guarantees safety (nothing bad happens, using a predicate to define bad behavior), and liveness (something good eventually happens).  The FLP definition of impossibility centers on liveness: if you really dig deep, FLP is telling us that if we have a safe consensus protocol for some asynchronous distributed system, it cannot also be a live protocol.  A person could be forgiven for assuming from the paper's title that there are no correct (safe) consensus protocols, but this is not what the paper actually does.  In fact it does the opposite: it assumes we have a safe protocol, and then shows that the protocol cannot also guarantee liveness, by pointing out that the zero-probability schedule discussed above is capable of delaying decisions indefinitely.

We arrive at a peculiar insight.  On the one hand, ATC systems and other similar distributed applications need to make quick progress: we don't want an ATC system to delay for a billion years before deciding who can land on runway 2-E.   This leads them to reconfigure if a process seems unresponsive, dropping that slow process to remain highly available.  At the same time, they do require safety guarantees, and those stem from the safety properties of consensus.  Due to the risk of network partitioning, we know this approach can't guarantee liveness, but we accept that because the frequency of network partitioning is very low -- data center power loss is a more common problem.

Then, on the other hand, we have FLP.  One could brush it to the side and say that, well, FLP isn't relevant here.  But is that really correct?  FLP doesn't consider the network partitioning scenario we knew about (and that already precluded liveness, given our availability goals).  Yet FLP seems to be warning that even if we could somehow eliminate this exposure to a crash due to partitioning, there actually is another "unstoppable" scenario, involving a peculiar network behavior, that no consensus protocol can defend against.  

But this leads to another insight: Reverting to Dijkstra, those of us who deliver safety-critical code to ATC organizations might actually want to prove that our systems are safe and live.  FLP teaches us that if you wish to prove that a system like Derecho will always make progress, you'll need to introduce some extra assumptions beyond the ones FLP employs.  Without extra assumptions, we can't create such proofs because if we could, we would have violated FLP.  How might those assumptions look?  Think of them as a list: "if there are no network partitions, and if the network has no way to selectively delay messages... ".  What would that list of conditions need to include?

There has been wonderful work on this question too: in a famous paper, Chandra and Toueg (later joined by Hadzilacos) described a very basic two-phase commit protocol for consensus, and worked out the weakest assumptions one can make that would still allow it to guarantee liveness: something they called the <>W failure detection oracle.  The plain-English name for <>W is "eventually weak", and in plain English, <>W is a failure detector that can make mistakes, but where eventually, some non-faulty process is recognized as healthy by all the other non-faulty processes.  This state needs to be sustained for long enough for the consensus protocol to terminate, and <>W expresses that by just saying "and we stay in this state forever".  

More recent work by Idit Keidar and her students, notably Alex Shraer, showed that in real systems, one can generally assume a stronger failure detector called <>P.  This failure detector can make mistakes for a while, but eventually settles into a state in which it reports failures perfectly, for long enough to allow consensus to complete.  

In a real system, there is actually a way to create a <>P guarantee.  The trick centers on making sure that if any process suspects q of having failed, q really has crashed.  How do we solve that aspect?  The real but somewhat silly answer is that real systems reboot, reimage or replace malfunctioning components.  James Hamilton came up with this little adage, and we refer to it as "Hamilton's three R's."  In effect, we unplug q.  Voila!  That process has definitely failed.
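
A rough illustration of the trick, in Python (my own sketch, with invented names and timeout values): combine a heartbeat-style detector with an explicit "unplug" step, so that by the time q is reported as faulty, q genuinely is not running anymore:

    import time

    # Rough illustration of turning suspicion into certainty (hypothetical
    # code).  If q misses enough heartbeats we don't just suspect q -- we
    # forcibly remove it (reboot / reimage / replace, i.e. "unplug" it), so a
    # report of "q has failed" is guaranteed to be true by the time it is made.

    HEARTBEAT_TIMEOUT = 5.0      # assumed timeout, in seconds

    class FailureDetector:
        def __init__(self):
            self.last_heartbeat = {}

        def heartbeat(self, q):
            self.last_heartbeat[q] = time.time()

        def check(self, q, unplug):
            if time.time() - self.last_heartbeat.get(q, 0) > HEARTBEAT_TIMEOUT:
                unplug(q)            # e.g. power-cycle the node via its management interface
                return "failed"      # now provably true, not merely suspected
            return "healthy"

    fd = FailureDetector()
    fd.heartbeat("q")
    status = fd.check("q", unplug=lambda q: print(f"power-cycling {q}"))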

In a setting where we can assume <>P, FLP would actually not apply.  Now, Derecho and similar systems can be shown to be both safe and live.   Of course, events that violate our assumptions can still crash the system -- network partitioning, loss of datacenter power -- but our proof would hold as long as those conditions for progress are maintained.   

If you really ponder the point, with a <>P solution based on Hamilton's three R's, FLP becomes a network partitioning attack: rather than delaying messages, it kills so many processes that Derecho would shut down, and Paxos would stop accepting new updates or letting applications learn (read) the committed part of the Paxos log.

All of this discussion... just to address one impossibility result.  In fact we have many others, and they come with oddities too.  For example, there are many such results for Byzantine Agreement, where we have a set of processes, and some subset of them can fail in any way you like, including malicious behavior crafted to disrupt the system.  But the Byzantine model normally is explored in networks with perfect clocks and perfect synchronization -- and we don't have such networks.  Moreover, with Byzantine solutions, if the number of faulty components reaches and then passes the assumed limit, all bets are off.

Let's generalize this to the broader question I posed at the start.  If distributed computing theory is full of these kinds of pitfalls and peculiar results, what does that tell us about the discipline as a science?    To me, the moral of the story is simply that theory motivated by real systems can shed valuable light but that one has to view these results with a degree of sophistication and caution: they might not mean quite what you would assume, even if the paper you are studying has an especially provocative title.  The mathematics teaches deep things, but it often isn't trivial to relate those insights back to your actual system.

But this also makes it very hard to teach this material, especially to students who see distributed computing as an interesting curiosity, but not necessarily as a topic they really want to wrestle with.  While a comprehensive course would certainly delve deeply into the theory, the subtle nature of the result makes it very hard to include FLP in a more practical course, like my spring class on cloud computing.  For very practical students, when they hear that FLP says that consensus is impossible, there can be a tendency to jump wholeheartedly into Brewer's CAP model, with its "you can have two out of three" mantra. It happens that CAP is not a theorem and is often not even true, yet it can seem like a very simple and appealing rule of thumb.  CAP tells my students to just assume that consistency is hard and promises them (perhaps, falsely) that inconsistency is far easier.

I did spend part of one lecture on FLP this spring.  I worried that the course might otherwise be unbalanced -- that I needed to at least expose the students to a few of the most important theoretical results.  In that lecture, my main message was that one shouldn't take every paper's title at face value.  I got the students to propose definitions of "impossible" and then surprised them with the FLP meaning of impossibility.  And I do think they understood the point: most later got a quiz question on this right.  A good tradeoff: after all, FLP really is a lovely bit of mathematics, and at least they heard about it.

Wednesday, 20 May 2020

Contact Tracing Apps Don't Work Very Well

The tension between privacy and the public interest is old, hence it is no surprise to see the question surface with respect to covid-19 contact-tracing apps.  

Proponents start by postulating that almost everyone has a mobile phone with Bluetooth capability.  In fact, not everyone has a mobile phone that can run apps (such devices are expensive).  Personal values bear on this too: even if an app had magical covid-prevention superpowers, not everyone would install it.  Indeed, not everyone would even be open to dialog.  

But let's set that thought to the side and just assume that in fact everyone has a suitable device and agrees to run the app.  Given this assumption, one can configure the phone to emit Bluetooth chirps within a 2m radius (achieved by limiting the signal power).  Chirps are just random numbers, not encrypted identifiers.  Each phone maintains a secure on-board record of chirps it generated, and those that it heard.  On this we can build a primitive contact-tracing structure.
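
In rough Python (my own sketch of the general scheme, not any particular app's implementation), each phone would do something along these lines:

    import os, time

    # Sketch of the basic chirp scheme (not any specific app's code).  Chirps
    # are just random numbers; the phone remembers what it sent and what it
    # heard, and keeps both records local to the device.

    def new_chirp():
        return os.urandom(16).hex()          # an opaque random identifier

    sent_chirps = []                          # (timestamp, chirp) we broadcast
    heard_chirps = []                         # (timestamp, chirp) we received

    def broadcast_tick():
        chirp = new_chirp()
        sent_chirps.append((time.time(), chirp))
        # ... transmit chirp over low-power Bluetooth here ...

    def on_chirp_heard(chirp):
        heard_chirps.append((time.time(), chirp))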

Suppose that someone becomes ill.  The infected user would upload chirps the phone sent during the past 2 weeks into an anonymous database hosted by the health authority.  This step requires a permission code provided by the health authority, intended to block malicious users from undertaking a form of DDoS exploit.  In some proposals, the phone would also upload chirps it heard:  Bluetooth isn't perfect, hence "I heard you" could be useful as a form of redundancy.   The explicit permission step could be an issue: a person with a 104.6 fever who feels like she was hit by a cement truck might not be in shape to do much of anything.  But let's just move on.  

The next task is to inform people that they may have been exposed.  For this, we introduce a query mechanism.  At some frequency, each phone sends the database a filtering query encoding the chirps it heard (for example, it could compute a Bloom filter and pass it to the database as an argument to its query).  The database uses the filter to select chirps that might match ones the phone actually heard.  The filter doesn't need to be overly precise: we do want the response sent to the phone to include every infected chirp the phone might have heard, but it is actually desirable for the response to also include chirps the phone never asked about.  Then, as a last step, the phone checks to see whether it actually did hear (or emit) any of these.  
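
A toy version of that query path might look like this (illustrative only; the filter parameters and hashing are invented for the example):

    import hashlib

    # Toy Bloom-filter query (illustrative only).  The phone summarizes the
    # chirps it heard; the database returns every infected chirp that matches
    # the filter.  False positives are fine -- they just pad the response --
    # and the phone does the exact matching locally.

    FILTER_BITS = 1024
    NUM_HASHES = 3

    def _positions(chirp):
        digest = hashlib.sha256(chirp.encode()).hexdigest()
        return [int(digest[i*8:(i+1)*8], 16) % FILTER_BITS for i in range(NUM_HASHES)]

    def build_filter(heard):
        bits = [False] * FILTER_BITS
        for chirp in heard:
            for pos in _positions(chirp):
                bits[pos] = True
        return bits

    def query(infected_chirps, bits):
        # Run by the health-authority database: return chirps that *might* match.
        return [c for c in infected_chirps if all(bits[p] for p in _positions(c))]

    heard = ["chirp-a", "chirp-b"]
    infected = ["chirp-b", "chirp-x", "chirp-y"]
    candidates = query(infected, build_filter(heard))
    actual_matches = [c for c in candidates if c in heard]   # exact check, done on the phone
    print(candidates, actual_matches)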

Why did we want the database to always send a non-empty list?  Well, if every response includes a set of chirps, the mere fact of a non-empty response reveals nothing.  Indeed, we might even pad the response to some constant size!

Next, assume that your phone discovers some actual matches.  It takes time in close proximity to become infected.  Thus, we would want to know whether there was just a brief exposure, as opposed to an extended period of contact.  A problematic contact might be something like this: "On Friday afternoon your phone detected a close exposure over a ten minute period", meaning that it received positive chirps at a strong Bluetooth power level.  The time and signal-strength thresholds are parameters, set using epidemiological models that balance the risk of infection against the risk of false positives.
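
The matching step might then be scored roughly as follows (the thresholds here are invented for illustration; a real deployment would calibrate them against those epidemiological models):

    # Illustrative exposure scoring (thresholds invented for the example).
    # A contact counts as "problematic" only if matched chirps were heard at
    # strong signal strength over a sufficiently long window of time.

    MIN_DURATION_SECONDS = 10 * 60      # assumed time constant
    MIN_RSSI_DBM = -65                  # assumed "close proximity" signal threshold

    def problematic(observations):
        # observations: list of (timestamp, rssi) for chirps that matched the database
        close = sorted(t for t, rssi in observations if rssi >= MIN_RSSI_DBM)
        if not close:
            return False
        return (close[-1] - close[0]) >= MIN_DURATION_SECONDS

    # Example: matched chirps heard at strong power spanning about twelve minutes.
    print(problematic([(0, -60), (300, -58), (720, -62)]))   # True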

Finally, given a problematic contact your device would walk you through a process to decide if you need to get tested and self-quarantine.    This dialog is private: no public agency knows who you are, where you have been, or what chirps your device emitted or heard.  

Covid contact-tracing technology can easily result in false positives.  For example, perhaps a covid-positive person walked past your office a few times, but you kept the door closed...  the dialog might trigger and yet that isn't, by itself, determinative.   Moreover, things can go wrong in precisely the opposite way too.  Suppose that you were briefly at some sort of crowded event -- maybe the line waiting to enter the local grocery store.  Later you learn that in fact someone tested positive at this location... but the good news is that your app didn't pick anything up!  If you were certain that everyone was using a compatible app, that might genuinely tell you something.   But we already noted that the rate of use of an app like this might not be very high, and moreover, some people might sometimes disable it, or their phones might be in a place that blocks Bluetooth signals.  The absence of a notification conveys very little information.  Thus, the technology can also yield false negatives.

The kind of Covid contact tracing app I've described above is respectful of privacy.  Nobody can force you to use the app, and for all the reasons mentioned, it might not be active at a particular moment in time. Some of the apps won't even tell you when or where you were exposed or for how long, although at that extreme of protectiveness, you have to question whether the data is even useful.  And the government health authority can't compel you to get tested, or to upload your chirps even if you do test positive.

But there are other apps that adopt more nuanced stances.  Suppose that your phone were to also track chirp signal power, GPS locations, and time (CovidSafe, created at the University of Washington, has most of this information).  Now you might be told that you had a low-risk (low signal power) period of exposure on the bus from B lot to Day Hall, but also had a short but close-proximity exposure when purchasing an espresso at the coffee bar.  The app would presumably help you decide if either of these crosses the risk threshold at which self-quarantine and testing is recommended.  On the other hand, to provide that type of nuanced advice, much more data is being collected.  Even if held in an encrypted form on the phone, there are reasons to ask at what point too much information is being captured.  After all, we all have seen endless reporting on situations in which highly sensitive data leaked or was even deliberately shared in ways contrary to stated policy and without permission.

Another issue now arises.  GPS isn't incredibly accurate, which matters because Covid is far more likely to spread with prolonged close exposure to an infectious person: a few meters makes a big difference (an especially big deal in cities, where reflections off surfaces can make GPS even less accurate -- which is a shame because a city is precisely the sort of place where you could have frequent but remote brief periods of proximity to Covid-positive individuals).  You would ideally want to know more.  And cities raise another big issue: GPS doesn't work inside buildings.  Would the entire 50-story building be treated as a single "place"?   If so, with chirps bouncing around in corridors and stairwells and atria, the rate of false positives would soar!

On campus we can do something to push back on this limitation.  One idea would be to try and improve indoor localization.  For example, imagine that we were to set up a proxy phone within spaces that the campus wants to track, like the Gimme! Coffee café in Gates Hall.  Then when so-and-so tests positive, the café itself learns that "it was exposed".  That notification could be useful to schedule a deep cleaning, and it would also enable the system to relay the risk notification, by listing the chirps that the café proxy phone emitted during the period after the exposure occurred (on the theory that if you spend an hour at a table that was used by a covid-positive person who was in the café twenty minutes ago, that presumably creates a risk).   In effect, we would treat the space as an extension of the covid-positive person who was in it, if they were there for long enough to contaminate it.

Similarly, a phone could be configured to listen for nearby WiFi signals.  With that information, the phone could "name" locations in terms of the MAC addresses it heard and their power levels.  Phone A could then warn that during a period when A's user was presumed infectious, there was a 90-minute period with 4-bars WiFi X and 2-bars WiFi Y, with WiFi Z flickering at a very low level.  One might hope that this defines a somewhat smaller space.    We could then create a concept of a WiFi signal strength distance metric, at which point phone B could discover problematic proximity to A.  This could work if the WiFi signals are reasonably steady and the triangulation is of high quality.  But WiFi devices vary their power levels depending on numbers of users and choice of channels, and some settings, like elevators, rapidly zip through a range of WiFi connectivity options...  Presumably there are research papers on such topics... 
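
One could imagine a crude "WiFi distance" along these lines (entirely hypothetical; real indoor localization is far more involved than this):

    # Entirely hypothetical "WiFi fingerprint distance": each location is named
    # by the access points heard there and their signal strengths (MAC -> dBm),
    # and two fingerprints are close if they heard similar APs at similar power.

    def wifi_distance(fp_a, fp_b):
        macs = set(fp_a) | set(fp_b)
        missing = -100                        # assumed floor for "not heard at all"
        return sum(abs(fp_a.get(m, missing) - fp_b.get(m, missing)) for m in macs) / len(macs)

    phone_a = {"wifi-X": -45, "wifi-Y": -70, "wifi-Z": -90}
    phone_b = {"wifi-X": -50, "wifi-Y": -72}
    print(wifi_distance(phone_a, phone_b))    # a small value suggests (but doesn't prove) proximity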

Another idea I heard about recently was suggested by an avid Fitbit user (the little device and app that encourage you to do a bit more walking each day).  Perhaps one could have a "social distancing score" for each user (indeed, if Fitbit devices can hear one another, maybe Fitbit itself could compute such a score).  The score would indicate your degree of isolation, and your goal would be to have as normal a day as possible while driving that number down.  Notice that the score wouldn't be limited to contacts with Covid-positive people.  Rather, it would simply measure the degree to which you are exposed to dense environments where spread is more likely to occur rapidly.  To do this, though, you really want to use more than just random numbers as your "chirp", because otherwise, a day spent at home with your family might look like a lot of contacts, and yet you all live together.  So the app would really want to count the number of distinct individuals with whom you have prolonged contacts.  A way to do this is for each device to stick to the same random number for a whole day, or at least for a few hours.  Yet such a step would also reduce anonymity... a problematic choice.

As you may be aware, Google has moved to acquire Fitbit, and companies like Facebook, of course, already know all about our social networks.  A company holding both kinds of data would be particularly well placed to correlate location and contact data with your social network, enabling it to build models of how the virus might spread if someone in your social group is ever exposed.  Doing so would enable various forms of proactive response.  For example, if a person is egregiously ignoring social distancing guidance, the public authorities could step in and urge that he or she change their evil ways.  If the social network were to have an exposure, we might be able to warn its members to "Stay clear of Sharon; she was exposed to Sally, and now she is at risk."  But these ideas, while cute, clearly have sharp edges that could easily become a genuine threat.  In particular, under the European GDPR (a legal framework for privacy protection), it might not even be legal to do research on such ideas, at least within the European Union.  Here in the US, a company like Facebook could certainly explore the options, but it would probably think twice before introducing products.

Indeed, once you begin to think about what an intrusive government or employer could do, you realize that there are already far too many options for tracking us, if a sufficiently large entity were so-inclined.  It would be easy to combine contact tracing from apps with other forms of contact data.  Most buildings these days use card-swipes to unlock doors and elevators, so that offers one source of rather precise location information.  It might be possible to track purchases at food kiosks that accept cards, and in settings where there are security cameras, it would even be possible to do image recognition...   There are people who already live their days in fear that this sort of big-brother scenario is a real thing, and in constant use.  Could covid contact tracing put substance behind their (at present, mostly unwarranted) worries?

Meanwhile, as it turns out, there is considerable debate within the medical community concerning exactly how Covid spreads.  Above, I commented that just knowing you were exposed is probably not enough.  Clearly, virus particles need to get from the infected person to the exposed one.   The problem is that while everyone agrees that direct interactions with a person actively shedding virus are highly risky, there is much less certainty about indirect interactions, like using the same table or taking the same bus.  If you follow the news, you'll know of documented cases in which covid spread fairly long distances through the air, from a person coughing at one table in a restaurant all the way around the room to people fairly far away, and you'll learn that covid can survive for long periods on some surfaces.   But nobody knows how frequent such cases really are, or how often they give rise to new infections.    Thus if we ratchet up our behavioral tracing technology, we potentially intrude on privacy without necessarily gaining greater protection.

When I've raised this point with people, a person I'm chatting with will often remark that "well, I don't have anything to hide, and I would be happy to take any protection this offers at all, even if the coverage isn't perfect."  This tendency to personalize the question is striking to me, and I tend to classify it along with the tendency to assume that everyone has equal technology capabilities, or similar politics and civic inclinations.  One sees this sort of mistaken generalization quite often, which is a surprise given the degree to which the public sphere has become polarized and political.  

Indeed, my own reaction is to worry that even if I myself don't see a risk to being traced in some way, other people might have legitimate reasons to keep some sort of activity private.  And I don't necessarily mean illicit activities.  A person might simply want privacy to deal with a health issue or to avoid the risk of some kind of discrimination.  A person may need privacy to help a friend or family member deal with a crisis -- nothing illicit, simply something that isn't suitable for a public space.  So yes, perhaps a few people do have nasty things to hide, but my own presumption tends to be that all of us sometimes have a need for privacy, and hence that all of us should respect one-another's needs without prying into the reasons.  We shouldn't impose a tracking regime on everyone unless the value is so huge that the harm the tracking system itself imposes is clearly small in comparison.

In Singapore, these contact-tracing apps were aggressively pushed by the government -- a government that at times has been notorious for repressing dissidents.  Apparently, this overly assertive rollout triggered a significant public rejection:  people were worried by the government's seeming bias in favor of monitoring and its seeming dismissal of the privacy risks, concluded that whatever other people might do, they themselves didn't want to be traced, and many rejected the app.  Others installed it (why rock the boat?), but then took the obvious, minor, steps needed to defeat it.  Such a sequence renders the technology pointless: a nuisance at best, an intrusion at worst, but ineffective as a legitimate covid-prevention tool.  In fact just last week (mid May) the UK had a debate about whether or not to include location tracking in their national app.  Even the debate itself seems to have reduced the public appetite for the app, and this seems to be true even though the UK ultimately leaned towards recommending a version that has no location tracing at all (and hence is especially weak, as such tools go).

I find this curious because, as you may know, the UK deployed a great many public video cameras back in the 1980's (a period when there was a lot of worry about street crime, together with frequent, high-visibility terrorist threats).  Those cameras live on, and yet seem to have had limited value.  

When I spent a few months in Cambridge in 2016, I wasn't very conscious of them, but now and then something would remind me to actually look for the things, and they still seem to be ubiquitous.  Meanwhile, during that same visit, there was a rash of bicycle thefts and a small surge in drug-related street violence.  The cameras apparently had no real value in stopping such events, even though the mode of the bicycle thefts was highly visible: thieves were showing up with metal saws or acetylene torches, cutting through the 2-inch thick steel bike stand supports that the city installed during the last rash of thefts, and then reassembling the stands using metal rods and duct-tape, so that at a glance, they seemed to be intact.  Later a truck could pull up, they could simply pull the stand off its supports, load the bikes, and reassemble the stand.  

Considering quite how "visible" such things should be to a camera, one might expect that a CCTV system should be able to prevent such flagrant crimes.  Yet they failed to do so during my visit.  This underscores the broader British worry that monitoring often fails in its stated purpose, yet leaves a lingering loss of privacy.  After all: the devices may not be foiling thefts, yet someone might still be using them for cyberstalking. We all know about web sites that aggregate open webcams, whether the people imaged know it or not.  Some of those sites even use security exploits to break into cameras that were nominally disabled.

There is no question that a genuinely comprehensive, successful, privacy-preserving Covid tracing solution could be valuable.  A recent report in the MIT Technology Review shows that if one could trace 90% of the contacts for each Covid-positive individual, the infection could be stopped in its tracks.  Clearly this is worthwhile if it can be done.  On the other hand, we've seen how many technical obstacles this statement raises.

And these are just technical dimensions.  The report I cited wasn't even focused on technology!  That study focused on human factors at scale, which already limit the odds of reaching the 90% level of coverage.  The reasons were mundane, but also seem hard to overcome.  Many people (myself included) don't answer the phone if a call seems like possible spam.  For quite a few, calls from the local health department probably have that look.  Some people wouldn't trust a random caller who claims to be a contact tracer.  Some people speak languages other than English and could have difficulty understanding the questions being posed, or recommendations.  Some distrust the government.  The list is long, and it isn't one on which "more technology" jumps out as the answer.  

Suppose that we set contact tracing per-se to the side.  Might there be other options worth exploring?  A different use of "interaction" information could be to just understand where transmission exposures are occurring, with the goal of dedensifying those spots, or perhaps using other forms of policy to reduce exposure events.  An analyst searching for those locations would need ways to carry out the stated task, yet we would also want to block him or her from learning irrelevant private information.  After all, if the goal is to show that a lot of exposure occurs at the Sunflower Dining Hall, it isn't necessary to also know that John and Mary have been meeting there daily for weeks.

This question centers on data mining with a sensitive database, and the task would probably need to occur on a big-data analytic platform (a cloud system).  As a specialist in cloud computing, I can point to many technical options for such a task.  For example, we could upload our oversight data into a platform running within an Intel SGX security enclave, with hardware-supported protection.  A person who legitimately can log into such a system (via HTTPS connections to it, for example) would be allowed to use the database for tasks like contact tracing, or to discover hot-spots on campus where a lot of risk occurs -- so this solution doesn't protect against a nosy researcher.  The good news is that unauthorized observers would learn nothing because all the data moved over the network is encrypted at all times, if you trust the software (but should we trust the software?)  

There are lots of other options.  You could also upload the data in an encrypted form, and perhaps query it without decrypting it, or perhaps even carry out the analysis using a fully homomorphic data access scheme.  You can create a database but inject noise into the query results, concealing individual data (this is called the differential privacy query model).  
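
For instance, a differentially private count query can be sketched in a few lines (a textbook Laplace-mechanism example, not a production design):

    import random

    # Textbook Laplace-mechanism sketch (not a production design).  A count
    # query has sensitivity 1, so adding Laplace noise with scale 1/epsilon
    # gives epsilon-differential privacy for that single query.

    def private_count(records, predicate, epsilon=0.5):
        true_count = sum(1 for r in records if predicate(r))
        # The difference of two Exp(epsilon) samples is Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    visits = [{"place": "Sunflower Dining Hall"}] * 120 + [{"place": "Gates Hall"}] * 40
    print(private_count(visits, lambda r: r["place"] == "Sunflower Dining Hall"))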

On the other hand, the most secure solutions are actually the least widely used.  Fully homomorphic computing and Intel SGX, for example, are viewed as too costly.  Few cloud systems deploy SGX tools; there are a variety of reasons, but the main one is just that SGX requires a whole specialized "ecosystem" and we lack this.  More common is to simply trust the cloud (and maybe even the people who built and operate it), and then use encryption to form a virtually private enclave within which the work would be done using standard tools: the very same spreadsheets and databases and machine-learning tools any of us use when trying to make sense of large data sets.

But this all leads back to the same core question.  If we go down this path, and explore a series of increasingly aggressive steps to collect data and analyze it, to what degree is all of that activity measurably improving public safety?  I mentioned the MIT study because at least it has a numerical goal: for contact tracing, a 90% level of coverage is effective; below 90% we rapidly lose impact.  But we've touched upon a great many other ideas... so many that it wouldn't be plausible to do a comprehensive study of the most effective place to live on the resulting spectrum of options.

The ultimate choice is one that pits an unquantifiable form of covid-safety tracing against the specter of intrusive oversight that potentially violates individual privacy rights without necessarily bringing meaningful value.   On the positive side, even a placebo might reassure a public nearly panicked over this virus, by sending the message that "we are doing everything humanly possible, and we regret any inconvenience."  Oddly, I'm told, the inconvenience is somehow a plus in such situations.  The mix of reassurance with some form of individual "impact" can be valuable: it provides an outlet and focus for anger, and this reduces the threat that some unbalanced individual might lash out in a harmful way. Still, even when deploying a placebo, there needs to be some form of cost-benefit analysis!

Where, then, is the magic balancing point for Covid contact tracing?  I can't speak for my employer, but I'll share my own personal opinion.  I have no issue with installing CovidSafe on my phone, and I would probably be a good citizen and leave it running if doing so doesn't kill my battery.  Moreover, I would actually want to know if someone who later tested positive spent an hour at the same table where I sat down not long afterwards.  But I'm under no illusion that covid contact tracing is really going to be solved with technology.  The MIT study has it right: this is simply a very hard and very human task, and we delude ourselves to imagine that a phone app could somehow magically whisk it away.