
Wednesday, 20 May 2020

Contact Tracing Apps Don't Work Very Well

The tension between privacy and the public interest is an old one, so it is no surprise to see the question surface with respect to covid-19 contact-tracing apps.  

Proponents start by postulating that almost everyone has a mobile phone with Bluetooth capability.  In fact, not everyone has a mobile phone that can run apps (such devices are expensive).  Personal values bear on this too: even if an app had magical covid-prevention superpowers, not everyone would install it.  Indeed, not everyone would even be open to dialog.  

But let's set that thought to the side and just assume that in fact everyone has a suitable device and agrees to run the app.  Given this assumption, one can configure the phone to emit Bluetooth chirps within a 2m radius (achieved by limiting the signal power).  Chirps are just random numbers, not encrypted identifiers.  Each phone maintains a secure on-board record of chirps it generated, and those that it heard.  On this we can build a primitive contact-tracing structure.
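To make this concrete, here is a minimal sketch in Python of the bookkeeping a phone would do.  It is purely illustrative (a real implementation would use Bluetooth LE advertisements, secure storage, and more careful chirp rotation): generate random chirps, remember what was sent, remember what was heard, and discard anything older than two weeks.

import os
import time
from collections import deque

CHIRP_BYTES = 16                     # each chirp is just a random number, not an identifier
RETENTION_SECONDS = 14 * 24 * 3600   # keep two weeks of history

class ChirpLog:
    """On-board record of chirps this phone generated and chirps it heard."""
    def __init__(self):
        self.sent = deque()    # entries: (timestamp, chirp)
        self.heard = deque()   # entries: (timestamp, chirp, signal_strength)

    def new_chirp(self):
        chirp = os.urandom(CHIRP_BYTES)          # a fresh random number
        self.sent.append((time.time(), chirp))
        self._expire()
        return chirp                             # to be broadcast over Bluetooth

    def record_heard(self, chirp, rssi):
        self.heard.append((time.time(), chirp, rssi))
        self._expire()

    def _expire(self):
        cutoff = time.time() - RETENTION_SECONDS
        for log in (self.sent, self.heard):
            while log and log[0][0] < cutoff:
                log.popleft()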

Suppose that someone becomes ill.  The infected user would upload chirps the phone sent during the past 2 weeks into an anonymous database hosted by the health authority.  This step requires a permission code provided by the health authority, intended to block malicious users from undertaking a form of DDoS exploit.  In some proposals, the phone would also upload chirps it heard:  Bluetooth isn't perfect, hence "I heard you" could be useful as a form of redundancy.   The explicit permission step could be an issue: a person with a 104.6 fever who feels like she was hit by a cement truck might not be in shape to do much of anything.  But let's just move on.  
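As a sketch of the upload step, reusing the ChirpLog bookkeeping above (the payload format and the permission-code handling are my own assumptions; the real proposals each define their own), the phone could bundle its recent chirps together with the one-time code issued by the health authority, and the result would be sent over HTTPS to the authority's anonymous database:

import json
import time

def build_upload(chirp_log, permission_code):
    """Bundle the past two weeks of chirps for the health authority's database.
    The permission code is the one-time code issued by the health authority; it
    lets the server reject uploads from people who haven't tested positive (a
    crude defense against flooding the database)."""
    payload = {
        "permission_code": permission_code,
        "sent_chirps": [c.hex() for (_, c) in chirp_log.sent],
        # optional redundancy: chirps this phone heard, since Bluetooth is lossy
        "heard_chirps": [c.hex() for (_, c, _) in chirp_log.heard],
        "uploaded_at": int(time.time()),
    }
    return json.dumps(payload)   # this string would be POSTed over HTTPS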

The next task is to inform people that they may have been exposed.  For this, we introduce a query mechanism.  At some frequency, each phone sends the database a filtering query encoding chirps it heard (for example, it could compute a Bloom Filter and pass it to the database as an argument to its query).  The database uses the filter to select chirps that might match ones the phone actually heard.  The filter doesn't need to be overly precise: we do want the response sent to the phone to include all the infected chirps it heard, but it is actually desirable to also include others that the phone wasn't asking about.  Then, as a last step, the phone checks to see whether it actually did hear (or emit) any of these.  
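Here is a minimal sketch of how that query might look (the Bloom filter parameters and the database interface are illustrative assumptions, not any particular proposal's wire format): the phone summarizes the chirps it heard in a Bloom filter, the server returns every infected chirp that matches the filter (which naturally includes some extras), and the phone then checks for exact matches locally.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:4], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# server side: return every infected chirp that matches the filter
def query_infected(infected_chirps, bloom):
    # false positives are fine (even desirable): they pad the response
    return [c for c in infected_chirps if bloom.might_contain(c)]

# phone side: build the filter from chirps it heard, then check the response exactly
def check_exposure(heard_chirps, infected_chirps):
    bloom = BloomFilter()
    for c in heard_chirps:
        bloom.add(c)
    candidates = query_infected(infected_chirps, bloom)
    return set(candidates) & set(heard_chirps)   # exact local match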

Why did we want the database to always send a non-empty list?  Well, if every response includes a set of chirps, the mere fact of a non-empty response reveals nothing.  Indeed, we might even pad the response to some constant size!

Next, assume that your phone discovers some actual matches.  It takes time in close proximity to become infected.  Thus, we would want to know whether there was just a brief exposure, as opposed to an extended period of contact.  A problematic contact might be something like this: "On Friday afternoon your phone detected a close exposure over a ten minute period", meaning that it received positive chirps at a strong Bluetooth power level.  The time and signal-strength thresholds are parameters, set using epidemic models that balance the risk of infection against the risk of false positives.
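As a sketch of how that thresholding might work (the ten-minute window and the signal-strength cutoff below are purely illustrative values, standing in for whatever the epidemic models recommend):

RSSI_CLOSE = -65                # stronger than this is treated as close proximity (illustrative)
MIN_CONTACT_SECONDS = 10 * 60   # "prolonged" contact threshold (illustrative)

def problematic_contacts(matched_sightings):
    """matched_sightings: (timestamp, chirp, rssi) entries whose chirps turned out
    to belong to infected users.  Collect the strong sightings of each chirp and
    flag those spanning at least the minimum contact time."""
    by_chirp = {}
    for ts, chirp, rssi in sorted(matched_sightings):
        if rssi >= RSSI_CLOSE:
            by_chirp.setdefault(chirp, []).append(ts)

    flagged = []
    for chirp, times in by_chirp.items():
        duration = times[-1] - times[0]
        if duration >= MIN_CONTACT_SECONDS:
            flagged.append((chirp, times[0], duration))
    return flagged   # e.g. "close exposure over a ten minute period on Friday afternoon"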

Finally, given a problematic contact your device would walk you through a process to decide if you need to get tested and self-quarantine.    This dialog is private: no public agency knows who you are, where you have been, or what chirps your device emitted or heard.  

Covid contact-tracing technology can easily result in false positives.  For example, perhaps a covid-positive person walked past your office a few times, but you kept the door closed...  the dialog might trigger and yet that isn't, by itself, determinative.   Moreover, things can go wrong in precisely the opposite way too.  Suppose that you were briefly at some sort of crowded event -- maybe the line waiting to enter the local grocery store.  Later you learn that in fact someone tested positive at this location... but the good news is that your app didn't pick anything up!  If you were certain that everyone was using a compatible app, that might genuinely tell you something.   But we already noted that the rate of use of an app like this might not be very high, and moreover, some people might sometimes disable it, or their phones might be in a place that blocks Bluetooth signals.  The absence of a notification conveys very little information.  Thus, the technology can also yield false negatives.

The kind of Covid contact tracing app I've described above is respectful of privacy.  Nobody can force you to use the app, and for all the reasons mentioned, it might not be active at a particular moment in time. Some of the apps won't even tell you when or where you were exposed or for how long, although at that extreme of protectiveness, you have to question whether the data is even useful.  And the government health authority can't compel you to get tested, or to upload your chirps even if you do test positive.

But there are other apps that adopt more nuanced stances.  Suppose that your phone were to also track chirp signal power, GPS locations, and time (CovidSafe, created at the University of Washington, has most of this information).  Now you might be told that you had a low-risk (low signal power) period of exposure on the bus from B lot to Day Hall, but also had a short but close-proximity exposure when purchasing an espresso at the coffee bar.  The app would presumably help you decide if either of these crosses the risk threshold at which self-quarantine and testing is recommended.  On the other hand, to provide that type of nuanced advice, much more data is being collected.  Even if held in an encrypted form on the phone, there are reasons to ask at what point too much information is being captured.  After all, we all have seen endless reporting on situations in which highly sensitive data leaked or was even deliberately shared in ways contrary to stated policy and without permission.

Another issue now arises.  GPS isn't incredibly accurate, which matters because Covid is far more likely to spread with prolonged close exposure to an infectious person: a few meters makes a big difference (an especially big deal in cities, where reflections off surfaces can make GPS even less accurate -- which is a shame, because a city is precisely the sort of place where you could have frequent but brief, more distant encounters with Covid-positive individuals).  You would ideally want to know more.  And cities raise another big issue: GPS doesn't work inside buildings.  Would an entire 50-story building be treated as a single "place"?  If so, with chirps bouncing around in corridors and stairwells and atria, the rate of false positives would soar!

On campus we can do something to push back on this limitation.  One idea would be to try to improve indoor localization.  For example, imagine that we were to set up a proxy phone within spaces that the campus wants to track, like the Gimme! Coffee café in Gates Hall.  Then when so-and-so tests positive, the café itself learns that "it was exposed".  That notification could be useful for scheduling a deep cleaning, and it would also enable the system to relay the risk notification by listing the chirps that the café proxy phone emitted during the period after the exposure occurred (on the theory that if you spend an hour at a table that was used by a covid-positive person who was in the café twenty minutes ago, that presumably creates a risk).   In effect, we would treat the space as an extension of the covid-positive person who was in it, if they were there for long enough to contaminate it.
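A sketch of that relay logic, reusing the ChirpLog bookkeeping from earlier (the one-hour contamination window is an invented parameter, standing in for whatever the epidemiologists would actually recommend):

def secondary_exposure_chirps(proxy_log, infected_chirps, contamination_seconds=3600):
    """If the café's proxy phone heard an infected chirp, treat the space itself as
    contaminated for a while, and report the chirps the proxy emitted during that
    window so later visitors can be warned."""
    exposure_times = [ts for (ts, c, _) in proxy_log.heard if c in infected_chirps]
    if not exposure_times:
        return []
    start = min(exposure_times)
    end = max(exposure_times) + contamination_seconds
    return [c for (ts, c) in proxy_log.sent if start <= ts <= end]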

Similarly, a phone could be configured to listen for nearby WiFi signals.  With that information, the phone could "name" locations in terms of the MAC addresses it heard and their power levels.  Phone A could then report that, during the period when A's user was presumed infectious, there was a 90-minute stretch with 4 bars of WiFi X and 2 bars of WiFi Y, with WiFi Z flickering at a very low level.  One might hope that this defines a somewhat smaller space.  We could then create a concept of a WiFi signal-strength distance metric, at which point phone B could discover problematic proximity to A.  This could work if the WiFi signals are reasonably steady and the triangulation is of high quality.  But WiFi devices vary their power levels depending on numbers of users and choice of channels, and some settings, like elevators, rapidly zip through a range of WiFi connectivity options...  Presumably there are research papers on such topics... 
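One could imagine a signal-strength distance metric along these lines (a sketch only; serious indoor-localization work has to deal with calibration and the fluctuating power levels just mentioned):

def wifi_distance(fingerprint_a, fingerprint_b, missing_dbm=-100):
    """Each fingerprint maps a WiFi MAC address to received power in dBm.  Treat the
    fingerprints as vectors over the union of access points and compute a Euclidean
    distance; a small distance suggests the two phones were in roughly the same space."""
    macs = set(fingerprint_a) | set(fingerprint_b)
    total = 0.0
    for mac in macs:
        a = fingerprint_a.get(mac, missing_dbm)
        b = fingerprint_b.get(mac, missing_dbm)
        total += (a - b) ** 2
    return total ** 0.5

# phone A's fingerprint during the risky 90 minutes vs. phone B's at the same time
a = {"wifi-X": -50, "wifi-Y": -75, "wifi-Z": -90}
b = {"wifi-X": -55, "wifi-Y": -72}
print(wifi_distance(a, b))   # small value: plausibly problematic proximity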

Another idea I heard about recently was suggested by an avid Fitbit user (the little device and app that encourage you to do a bit more walking each day).  Perhaps one could have a "social distancing score" for each user (indeed, if Fitbit devices can hear one another, maybe Fitbit itself could compute such a score).  The score would indicate your degree of isolation, and your goal would be to have as normal a day as possible while driving that number down.  Notice that the score wouldn't be limited to contacts with Covid-positive people.  Rather, it would simply measure the degree to which you are exposed to dense environments where spread is more likely to occur rapidly.  To do this, though, you really want to use more than just random numbers as your "chirp", because otherwise, a day spent at home with your family might look like a lot of contacts, and yet you all live together.  So the app would really want to count the number of distinct individuals with whom you have prolonged contacts.  A way to do this is for each device to stick to the same random number for a whole day, or at least for a few hours.  Yet such a step would also reduce anonymity... a problematic choice.
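A sketch of such a score, assuming each device keeps the same identifier for the whole day (the 15-minute definition of a "prolonged" contact is invented for illustration):

def social_distancing_score(sightings, min_contact_seconds=15 * 60):
    """sightings: (timestamp, device_id) pairs heard during one day, where each
    device_id stays fixed all day (the anonymity trade-off discussed above).
    The score is the number of distinct devices with prolonged contact; the goal
    is to drive it down."""
    first_seen = {}
    last_seen = {}
    for ts, dev in sightings:
        first_seen.setdefault(dev, ts)
        last_seen[dev] = ts

    prolonged = [d for d in first_seen
                 if last_seen[d] - first_seen[d] >= min_contact_seconds]
    return len(prolonged)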

As you may be aware, Fitbit is actually in the process of being acquired by Google, not Facebook -- but Facebook, of course, knows all about us.  A company holding both kinds of data would be particularly qualified to correlate location and contact data with your social network, enabling it to build models of how the virus might spread if someone in your social group is ever exposed.  Doing so would enable various forms of proactive response.  For example, if a person is egregiously ignoring social distancing guidance, the public authorities could step in and urge that he or she change their evil ways.  If the social network were to have an exposure, we might be able to warn its members to "Stay clear of Sharon; she was exposed to Sally, and now she is at risk."  But these ideas, while cute, clearly have sharp edges that could easily become a genuine threat.  In particular, under the European GDPR (a legal framework for privacy protection), it might not even be legal to do research on such ideas, at least within the European Union.  Here in the US, Facebook could certainly explore the options, but it would probably think twice before introducing products.

Indeed, once you begin to think about what an intrusive government or employer could do, you realize that there are already far too many options for tracking us, if a sufficiently large entity were so-inclined.  It would be easy to combine contact tracing from apps with other forms of contact data.  Most buildings these days use card-swipes to unlock doors and elevators, so that offers one source of rather precise location information.  It might be possible to track purchases at food kiosks that accept cards, and in settings where there are security cameras, it would even be possible to do image recognition...   There are people who already live their days in fear that this sort of big-brother scenario is a real thing, and in constant use.  Could covid contact tracing put substance behind their (at present, mostly unwarranted) worries?

Meanwhile, as it turns out, there is considerable debate within the medical community concerning exactly how Covid spreads.  Above, I commented that just knowing you were exposed is probably not enough.  Clearly, virus particles need to get from the infected person to the exposed one.   The problem is that while everyone agrees that direct interactions with a person actively shedding virus are highly risky, there is much less certainty about indirect interactions, like using the same table or taking the same bus.  If you follow the news, you'll know of documented cases in which covid spread fairly long distances through the air, from a person coughing at one table in a restaurant all the way around the room to people fairly far away, and you'll learn that covid can survive for long periods on some surfaces.   But nobody knows how frequent such cases really are, or how often they give rise to new infections.    Thus if we ratchet up our behavioral tracing technology, we potentially intrude on privacy without necessarily gaining much additional prevention.

When I've raised this point with people, a person I'm chatting with will often remark that "well, I don't have anything to hide, and I would be happy to take any protection this offers at all, even if the coverage isn't perfect."  This tendency to personalize the question is striking to me, and I tend to classify it along with the tendency to assume that everyone has equal technology capabilities, or similar politics and civic inclinations.  One sees this sort of mistaken generalization quite often, which is a surprise given the degree to which the public sphere has become polarized and political.  

Indeed, my own reaction is to worry that even if I myself don't see a risk to being traced in some way,  other people might have legitimate reasons to keep some sort of activity private.  And I don't necessarily mean illicit activities.  A person might simply want privacy to deal with a health issue or to avoid the risk of some kind of discrimination.  A person may need privacy to help a friend or family member deal with a crisis: nothing wrong, simply something that isn't suitable for a public space.  So yes, perhaps a few people do have nasty things to hide, but my own presumption tends to be that all of us sometimes have a need for privacy, and hence that all of us should respect one-another's needs without prying into the reasons.  We shouldn't impose a tracking regime on everyone unless the value is so huge that the harm the tracking system itself imposes is clearly small in comparison.

In Singapore, these contact-tracing apps were aggressively pushed by the government -- a government that at times has been notorious for repressing dissidents.  Apparently, this overly assertive rollout triggered a significant public rejection:  people were worried by the government's seeming bias in favor of monitoring and its seeming dismissal of the privacy risks, concluded that whatever other people might do, they themselves didn't want to be traced, and many rejected the app.  Others installed it (why rock the boat?), but then took the obvious, minor steps needed to defeat it.  Such a sequence renders the technology pointless: a nuisance at best, an intrusion at worst, but ineffective as a legitimate covid-prevention tool.  In fact just last week (mid May) the UK had a debate about whether or not to include location tracking in their national app.  Even the debate itself seems to have reduced the public appetite for the app, and this seems to be true even though the UK ultimately leaned towards recommending a version that has no location tracing at all (and hence is especially weak, as such tools go).

I find this curious because, as you may know, the UK deployed a great many public video cameras back in the 1980's (a period when there was a lot of worry about street crime together with high-visibility terrorist threats).  Those cameras live on, and yet seem to have had limited value.  

When I spent a few months in Cambridge in 2016, I wasn't very conscious of them, but now and then something would remind me to actually look for the things, and they still seem to be ubiquitous.  Meanwhile, during that same visit, there was a rash of bicycle thefts and a small surge in drug-related street violence.  The cameras apparently had no real value in stopping such events, even though the mode of the bicycle thefts was highly visible: thieves were showing up with metal saws or acetylene torches, cutting through the 2-inch thick steel bike stand supports that the city installed during the last rash of thefts, and then reassembling the stands using metal rods and duct-tape, so that at a glance, they seemed to be intact.  Later a truck could pull up, they could simply pull the stand off its supports, load the bikes, and reassemble the stand.  

Considering how "visible" such things should be to a camera, one might expect that a CCTV system would be able to prevent such flagrant crimes.  Yet they failed to do so during my visit.  This underscores the broader British worry that monitoring often fails in its stated purpose, yet leaves a lingering loss of privacy.  After all: the devices may not be foiling thefts, yet someone might still be using them for cyberstalking. We all know about web sites that aggregate open webcams, whether the people imaged know it or not.  Some of those sites even use security exploits to break into cameras that were nominally disabled.

There is no question that a genuinely comprehensive, successful, privacy-preserving Covid tracing solution could be valuable.  A recent report in the MIT Technology Review shows that if one could trace 90% of the contacts for each Covid-positive individual, the infection could be stopped in its tracks.  Clearly this is worthwhile if it can be done.  On the other hand, we've seen how many technical obstacles this statement raises.

And these are just technical dimensions.  The report I cited wasn't even focused on technology!  That study focused on human factors at scale, which already limit the odds of reaching the 90% level of coverage.  The reasons were mundane, but also seem hard to overcome.  Many people (myself included) don't answer the phone if a call seems like possible spam.  For quite a few, calls from the local health department probably have that look.  Some people wouldn't trust a random caller who claims to be a contact tracer.  Some people speak languages other than English and could have difficulty understanding the questions being posed, or recommendations.  Some distrust the government.  The list is long, and it isn't one on which "more technology" jumps out as the answer.  

Suppose that we set contact tracing per-se to the side.  Might there be other options worth exploring?  A different use of "interaction" information could be to just understand where transmission exposures are occurring, with the goal of dedensifying those spots, or perhaps using other forms of policy to reduce exposure events.  An analyst searching for those locations would need ways to carry out the stated task, yet we would also want to block him or her from learning irrelevant private information.  After all, if the goal is to show that a lot of exposure occurs at the Sunflower Dining Hall, it isn't necessary to also know that John and Mary have been meeting there daily for weeks.

This question centers on data mining with a sensitive database, and the task would probably need to occur on a big-data analytic platform (a cloud system).  As a specialist in cloud computing, I can point to many technical options for such a task.  For example, we could upload our oversight data into a platform running within an Intel SGX security enclave, with hardware-supported protection.  A person who can legitimately log into such a system (via HTTPS connections to it, for example) would be allowed to use the database for tasks like contact tracing, or to discover hot-spots on campus where a lot of risk occurs -- so this solution doesn't protect against a nosy researcher.  The good news is that unauthorized observers would learn nothing, because all the data moving over the network is encrypted at all times, if you trust the software (but should we trust the software?).  

There are lots of other options.  You could upload the data in an encrypted form and perhaps query it without decrypting it, or even carry out the analysis using a fully homomorphic data access scheme.  You could also create a database but inject noise into the query results, concealing individual records (this is called the differential privacy query model).  
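To give a flavor of the differential-privacy option, the basic trick is to add calibrated noise to each released count.  Here is a minimal sketch of the Laplace mechanism (choosing the privacy parameter epsilon and the query sensitivity correctly is the hard part, and is glossed over here):

import random

def dp_count(true_count, epsilon=0.5, sensitivity=1):
    """Release a count perturbed by Laplace noise of scale sensitivity/epsilon.
    A Laplace variate is the difference of two i.i.d. exponential variates."""
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# e.g., "how many exposure events occurred at the Sunflower Dining Hall today?"
print(dp_count(true_count=42))   # close to 42, but no single visit is revealed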

On the other hand, the most secure solutions are actually the least widely used.  Fully homomorphic computing and Intel SGX, for example, are viewed as too costly.  Few cloud systems deploy SGX tools; there are a variety of reasons, but the main one is just that SGX requires a whole specialized "ecosystem" and we lack this.  More common is to simply trust the cloud (and maybe even the people who built and operate it), and then use encryption to form a virtually private enclave within which the work would be done using standard tools: the very same spreadsheets and databases and machine-learning tools any of us use when trying to make sense of large data sets.

But this all leads back to the same core question.  If we were to go down this path, and explore a series of increasingly aggressive steps to collect data and analyze it, to what degree would all of that activity measurably improve public safety?  I mentioned the MIT study because at least it has a numerical goal: for contact tracing, a 90% level of coverage is effective; below 90% we rapidly lose impact.  But we've touched upon a great many other ideas... so many that it wouldn't be plausible to do a comprehensive study of the most effective place to live on the resulting spectrum of options.

The ultimate choice is one that pits an unquantifiable form of covid-safety tracing against the specter of intrusive oversight that potentially violates individual privacy rights without necessarily bringing meaningful value.   On the positive side, even a placebo might reassure a public nearly panicked over this virus, by sending the message that "we are doing everything humanly possible, and we regret any inconvenience."  Oddly, I'm told, the inconvenience is somehow a plus in such situations.  The mix of reassurance with some form of individual "impact" can be valuable: it provides an outlet and focus for anger, and this reduces the threat that some unbalanced individual might lash out in a harmful way. Still, even when deploying a placebo, there needs to be some form of cost-benefit analysis!

Where, then, is the magic balancing point for Covid contact tracing?  I can't speak for my employer, but I'll share my own personal opinion.  I have no issue with installing CovidSafe on my phone, and I would probably be a good citizen and leave it running if doing so doesn't kill my battery.  Moreover, I would actually want to know if someone who later tested positive spent an hour at the same table where I sat down not long afterwards.  But I'm under no illusion that covid contact tracing is really going to be solved with technology.  The MIT study has it right: this is simply a very hard and very human task, and we delude ourselves to imagine that a phone app could somehow magically whisk it away.

Friday, 21 February 2020

Quantum crypto: Caveat emptor...

There is a great deal of buzz around the idea that with quantum cryptographic network links, we can shift to eavesdropping-proof communication that would be secure against every known form of attack.  The catch?  There is a serious risk of being tricked into believing that a totally insecure network link is a quantum cryptographic one.  In fact, it may be much easier and cheaper to build and market a fake quantum link than to create a real one!  Worse, the user probably wouldn't even be able to tell the difference.  You could easily eavesdrop on a naive user if you built one of these fakes and managed to sell it.  What's not to love, if you are hoping to steal secrets?

So, first things first.  How does quantum cryptography actually work, and why is it secure?  A good place to start is to think about a one-time pad: this is a source of random bits that is created in a paired, read-once form.  You and your friend each have one copy of the identical pad.  For each message, you tear off one "sheet" of random bits, and use the random bits as the basis of a coding scheme.

For example, the current sheet could be used as a key for a fast stream cryptographic protocol.  You would use it for a little while (perhaps even for just one message), then switch to the next sheet, which serves as the next key.  Even if an attacker somehow was able to figure out what key was used for one message, that information wouldn't help for the next message.
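A minimal sketch of the pad idea in Python (raw XOR for clarity; in practice each sheet would key a vetted stream cipher, as just described):

import os

def make_pad(num_sheets, sheet_bytes=32):
    """Generate the pad once, give each party an identical copy, never reuse a sheet."""
    return [os.urandom(sheet_bytes) for _ in range(num_sheets)]

def xor_with_sheet(data, sheet):
    assert len(data) <= len(sheet), "never stretch one sheet across too much data"
    return bytes(d ^ k for d, k in zip(data, sheet))

pad = make_pad(num_sheets=100)
ciphertext = xor_with_sheet(b"attack at dawn", pad[0])   # use sheet 0, then discard it
plaintext = xor_with_sheet(ciphertext, pad[0])           # the receiver uses its own copy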

This is basically how quantum cryptography works, too.  We have some source of  entangled photons, and a device that can measure polarization, or "spin".  Say that up is 1 and down is 0.  In principle, you'll see a completely random sequence of 0/1 bits, just like one sheet of a one-time pad.

Because the photons are entangled, even though the property itself is random, if we measure this same property for both of the entangled photons, we obtain the same bit sequence.

Thus if we generate entangled photons, sending one member of the pair to one endpoint and the other photon to the other endpoint, we've created a quantum one-time pad.  Notice that no information is actually being communicated.  In some sense, the photons do not carry information per se, and can't be forced to do so.  The actual bits will be random, but because the photons are entangled, we are able to leverage the correlation to read exactly two copies out, one copy at each endpoint.  Then we can use this to obscure our messages (a classical method is used to authenticate the parties at each end, such as RSA-based public and private keys).

Quantum cryptography of this form is suddenly being discussed very widely in the media, and there are more and more companies willing to sell you these cables, together with the hardware to generate entangled photons and to read out the binary bit strings using measurements on the entangled photon pairs. So why shouldn't everyone leap up this very moment and rush down to Home Depot to buy one?

To see the issue, think back to the VW emissions scandal from 2015.  It turned out that from 2011 to 2015, the company was selling high-emission engines that had a way to sense when they were being tested.  In those periods, they would switch to a less economical (but very clean) mode of operation.  This would fool the department of motor vehicles, after which the car could revert to its evil, dirty ways.

Suppose the same mindset was adopted by a quantum cable vendor.  For the non-tested case, instead of entangling photons the company could generate a pseudo-random sequence of perfectly correlated unentangled ones.  For example, it could just generate lots of photons and filter out the ones with an unwanted polarization.  The two endpoint receivers measure polarization and see the same bits.  This leads them to think they share a secret one-time pad... but in fact the vendor of the cable not only knows the bit sequence but selected it!

To understand why this would be viable, it helps to realize that today's optical communication hardware already encodes data using properties like the polarization or spin of photons.  So the hardware actually exists, and it even runs at high data rates!  Yet with such a fake, the quantum cable vendor will know exactly what a user will measure at the endpoints.

How does this compare to the quantum version?  In a true quantum cryptographic network link, the vendor hardware generates entangled photons in a superposition state.  Now, this is actually tricky to achieve (superpositions are hard to maintain).  As a result, the vendor can predict that both endpoints will see correlated data, but because some photons will decorrelate in transmission, there will also be some quantum noise.  (A careful fake could mimic this too, simply by computing the statistical properties of the hardware and then deliberately transmitting different data in each direction now and then.)

So as a consumer, how would you test a device to unmask this sort of nefarious behavior?

The only way that a skeptic can test a quantum communication device is by running what is called a Bell's Inequality experiment.  With Bell's, the skeptic runs the vendor's cable, but then makes a random measurement choice at the endpoints.  For example, rather than always measuring polarization at some preagreed angle, it could be measured at a randomly selected multiple of 10 degrees.   The idea is to pick an entangled superposition property and then to measure it in a way that simply cannot be predicted ahead of time.

Our fraudulent vendor can't know, when generating the original photons, what you will decide to measure, and hence can't spoof entanglement behavior.  In effect, because you are making random measurements, you'll measure random values.  But if the cable is legitimate and the photons are genuinely entangled, now and then the two endpoints will happen to measure the same property in the identical way -- for example, you will measure polarization at the identical angle at both endpoints.  Now entanglement kicks in: both will see the same result.  How often would this occur?  Well, if you and I make random selections in a range of values (say, the value that a die throw will yield), sometimes we'll bet on the same thing.  The odds can be predicted very easily.

When we bet on the same thing, we almost always read the same value (as mentioned earlier, quantum noise prevents it from being a perfect match).  This elevated correlation implies that you've purchased a genuine quantum cryptography device.
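For readers who want the quantitative version, the standard form of this test is the CHSH (Bell) inequality: any cable whose outcomes were fixed in advance -- a local hidden-variable model, which is exactly what our fraudulent vendor built -- satisfies |S| ≤ 2, whereas genuinely entangled photons, whose polarization correlation at analyzer angles a and b is cos 2(a - b), can reach 2√2.  A quick calculation, as a sketch:

import math

def E(a_deg, b_deg):
    """Quantum correlation for polarization-entangled photons measured at angles
    a and b (in degrees): E = cos(2*(a - b))."""
    return math.cos(math.radians(2 * (a_deg - b_deg)))

# the standard CHSH measurement angles
a, a2 = 0, 45
b, b2 = 22.5, 67.5

S = E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2)
print(S)   # ~2.828, i.e. 2*sqrt(2): above the classical bound of 2
# A cable whose "measurement outcomes" were decided in advance by the vendor
# (a local hidden-variable model) can never push |S| beyond 2.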

But now think back to VW again.  The company didn't run with low emissions all the time -- they had a way to sense that the engine was being tested, and selected between emission modes based on the likelihood that someone might be watching.  Our fraudulent vendor could try the same trick.  When the cable is connected to the normal communication infrastructure (which the vendor supplies, and hence can probably detect quite easily), the cable uses fake entanglement and the fraudulent vendor can decode every message with ease.  When the cable is disconnected from the normal endpoint hardware, again easy to detect, the vendor sends entangled photons, and a Bell's test would pass!

Clearly, a quantum communications device will only be trustworthy if the user can verify the entire device.  But how plausible is this?  A device of this kind is extremely complex.

My worry is that naïve operators of systems that really need very good security, like hospitals, could easily be fooled.  The appeal of a quantum secure link could lure them to spend quite a lot of money, and yet most such devices may be black boxes, much like any other hardware we purchase.  Even if a device somehow could be deconstructed, who would have the ability to validate the design and implementation?  A skilled skeptical buyer might have no possible way to actually validate the design!

So, will quantum security of this form ever be a reality?  It already is, in lab experiments where the full system is implemented from the ground up.  But one cannot just purchase components and cobble such a solution together: the CIO of a hospital complex who wants a secure network would need to purchase an off-the-shelf solution.  I can easily see how one might spend money and end up with a system that would look as if it was doing something.  But I simply don't see a practical option for convincing a skeptical auditor that the solution actually works!

Monday, 7 January 2019

Leave no trace behind: A practical model for IoT privacy?

IoT confronts us with a seeming paradox.

There is overwhelming evidence that machine learning requires big data, specialized hardware accelerators and substantial amounts of computing resources, hence must occur on the cloud.

The eyes and ears of IoT, in contrast, are lightweight power-limited sensors that would generally have primitive computing capabilities, mostly “dedicated” storage capacity (for storing acquired images or other captured data), and limited programmability. These devices have bandwidth adequate to upload metadata, such as thumbnails, and then can upload selected larger data objects, but they can’t transmit everything.  And the actuators of the IoT world are equally limited: controllers for TVs and stereo receivers, curtains that offer robot controls, and similar simple, narrowly targeted robotic functionality.

It follows that IoT necessarily will be a cloud “play.”  Certainly, we will see some form of nearby point-of-presence in the home or office, handling local tasks with good real-time guarantees and shielding the cloud from mundane workloads.  But complex tasks will occur on the cloud, because no other model makes sense.

And here is the puzzle: notwithstanding these realities,  IoT systems will collect data of incredible sensitivity!  In aggregate, they will watch us every second of every day.  There can be no privacy in a smart world equipped with pervasive sensing capabilities.  How then can we avoid creating a dystopian future, a kind of technological Big Brother that watches continuously, knows every secret, and can impose any draconian policy that might be in the interests of the owners and operators of the infrastructure?

Indeed, the issue goes further: won’t society reject this degree of intrusiveness?  In China, we already can see how dangerous IoT is becoming.  Conversely, in Europe, privacy constraints are already very strong, and some countries, like Israel, even include a right to privacy in their constitutions.  If we want IoT to boom, we had better not focus on an IoT model that would be illegal in those markets, and would play into China’s most repressive instincts!   

IoT is the most promising candidate for the next wave of technology disruption.  But for this disruption to occur, and for it to enable the next wave of innovation and commerce, we need to protect the nascent concept against the risk posed by this seemingly inherent need to overshare with the cloud.

But there may be an answer.  Think about the rule for camping: pack it in, then pack it out, leaving no trace behind.  Could we extend the cloud to support a no-trace-left-behind computing model?

What I have in mind is this.  Our device, perhaps a smart microphone like Alexa, Siri, or Cortana, hears a command but needs cloud help to understand it.  Perhaps the command is uttered in a heavy accent, or makes reference to the speaker’s past history, or has a big data dimension.  These are typical of cases where big data and hardware accelerators and all that cloud technology make a huge difference.

So we ship the information up to Google, Microsoft, Amazon.  And here is the crux of my small proposal: suppose that this provider made a binding contractual commitment to retain no trace and to use every available technical trick to prevent attackers from sneaking in and stealing the sensitive data.  

Today, many cloud operators do the opposite.  But I’m proposing that the cloud operator forgo all that information, give up on the data-sales opportunities, and commit to perform the requested task in a secured zone (a secure “enclave” in security terminology).  

Could this be done, technically?  To me it seems obvious that the problem isn’t even very hard!

The home device can use firewalls, securely register and bind to its sensors, and send data over a secured protocol like https.   Perfect?  No.  But https really is very secure.  

In the cloud, the vendor would need to avoid cohosting the computation on nodes that could possibly also host adversarial code, which avoids the issue of leakage such as with “Meltdown.”  It would have to monitor for intrusions, and for insider “spies” trying to corrupt the platform.  It would need to scrub the execution environment before and after the task, making a serious effort to not leave traces of your question.  

The vendor would have to carry this even further, since a machine learning tool that  can answer a question like “does this rash look like it needs a doctor to evaluate it?” might need to consult with a number of specialized microservices.  Those could be written by third parties hoping to sell data to insurance companies.  We wouldn’t want any of them retaining data or leaking it.  Same for apps that might run in the home.  

But there is a popular “stateless” model for cloud computing that can solve this problem.  We want those microservices walled off, and by locking them into a stateless model (think of a firewall that blocks attempts to send data out) and only allowing them to talk to other stateless microservices, it can be done.  A serious attempt to monitor behavior would be needed too: those third-party apps will cheat if they can.

Today, many cloud companies are dependent on capturing private data and selling it.  But I don’t see why other companies, not addicted to being evil, couldn’t offer this model.  Microsoft has made very public commitments to being a trusted, privacy-preserving cloud partner.  What I’ve described would be right up their alley!  And if Azure jumped in with such a model, how long would it be before everyone else rushes to catch up?

To me this is the key:  IoT needs privacy, yet by its nature, a smart world will be an interconnected, cloud style environment, with many tasks occurring in massive data centers.  The cloud, up to now, has evolved to capture every wisp of personal information it can, and doing so made some people very wealthy, and enabled China to take steps breathtaking in their intrusiveness.  But there is no reason that the future IoT cloud needs to operate that way.  A “leave no trace model”, even if supported only by one big provider like Microsoft, could be the catalyst we’ve all been waiting for.  And just think how hard it will be for companies (or countries) locked into spying and reporting everything, to compete with that new model.

Let’s learn to pack it in...  and then to pack up the leftovers and clear them out.  The time is ripe for this, the technology is feasible, and the competition will be left reeling!

Sunday, 18 February 2018

Trolled!

Recent revelations about troll postings to Facebook, Twitter and other media sites create an obvious question: can we trust the authenticity of online commentary, or should we basically distrust everything?  After all, even a posting by my brother or an email from my mother could be some kind of forgery.

The question has broader dimensions.  More than twenty years ago, David Cooper and I published a little paper on secure and private email.  In this paper, we asked whether there is a way for two people (who know one-another) to send and receive emails in a monitored environment, with some form of untrusted observer trying to detect that communication has occurred, and hoping to retrieve the message itself, too.

Our 1995 solution uses cryptography and fake traffic to solve the problem.  The fake traffic ensures that there is a steady flow of bytes, whether or not the two people are communicating.  Then we designed a kind of shared storage system that plays the role of an email server: you can hide data inside it.  The email itself was encrypted, but also broken into bits in the storage layer and hidden inside vast amounts of noise.  Then the act of sending or receiving an email was mapped to a vast amount of reading and rewriting of blocks in the storage system.  We showed that an observer learns very little in this case, and yet you can send and receive emails in a way that guarantees the authenticity of the messages.
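To give the flavor (this is a loose illustration, not the actual protocol from that paper): every client rewrites a fixed number of blocks in the shared store at a fixed rate, whether or not it has anything to say, and real encrypted message blocks are slipped in among dummy rewrites, so the observed traffic looks identical either way.

import os
import random

BLOCK_BYTES = 256
STORE_SIZE = 4096
# the shared storage server: an array of opaque blocks, initially pure noise
store = [os.urandom(BLOCK_BYTES) for _ in range(STORE_SIZE)]

def round_of_traffic(outgoing_blocks=None, touches_per_round=64):
    """Run by every client at a fixed rate, with or without mail to send.  A constant
    number of randomly chosen blocks is rewritten; already-encrypted message blocks
    are hidden among the dummy rewrites.  (How the recipient learns which slots to
    read -- in practice, from a shared secret -- is omitted here.)"""
    outgoing = list(outgoing_blocks or [])
    real_slots = []
    for slot in random.sample(range(STORE_SIZE), touches_per_round):
        if outgoing:
            store[slot] = outgoing.pop(0)            # a real (encrypted) block
            real_slots.append(slot)
        else:
            store[slot] = os.urandom(BLOCK_BYTES)    # an indistinguishable dummy
    return real_slots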

This week a Cornell visitor told us about ways to improve on that style of older system, but the broad framework was similar.  I have always loved solutions built from little cryptographic building blocks, so I thought this was a really fun talk.   The problem, of course, is that nobody adopts tools like these, and unless everyone is using them, the mere fact of having a copy of the software might tip bad actors off to your interest in secret communication (then they can abduct you and force you to reveal everything).  To really work, we would need a universally adopted standard, one that nearly everyone was using even without realizing it -- the WhatsApp of secure email.  That way, when they come to question you, you can pretend to have absolutely no idea what they are talking about.

The other problem is that in contemporary society, there is a slight bias against privacy.  While most people would agree that we have a right to privacy, they seem to mean "unless you are trying to hide a secret we want to know about."  So there is a contradiction in the sense that we accept the right to privacy, yet also seem to believe in a broader societal right to intrude, particularly if the individuals are celebrities -- as if privacy rights vanish with any form of fame or notoriety.  There is also a significant community that assumes that privacy is something people would want primarily as a way to hide something: an unusual sexual preference, or criminal activity, or terrorism.

Back to the trolls.  In the cases recently publicized by the FBI, CIA and NSA, we learned that Russia has at least one company (maybe more), with large numbers of employees (80 or more) who work full time, day in and day out, planting fake news, false commentaries and incendiary remarks in the US and European press and social networks.  Here in Ithaca, the local example seems to be a recent event in which a dispute arose about the diversity of casting for a high school play (although the lead role is that of Carmen, a gypsy woman and hence someone who would normally have dark skin, the casting didn't reflect that aspect of the role).  This was then cited as part of a pattern, and a controversy around casting erupted.

Any small town has such episodes, but this one was unusual because suddenly, a torrent of really vile postings, full of racist threads, swamped the local debate.  One had the sense that Ithaca (a northern town that once played a big role in the Underground Railroad, helping escaped slaves reach freedom) was some sort of a hotbed of racism.  But of course there is another easy explanation: perhaps we are just seeing the effects of this Russian-led trolling.  The story and this outburst of racism are precisely in line with what the FBI reported on.  In fact, some of the nasty stuff is home grown and purely American.  But these trolling companies apparently are masters at rabble-rousing and unifying the Archie Bunkers of the world to charge in whatever direction they point.

So here we have dual questions.  With Facebook, or in the commentary on an article in a newspaper, I want to be confident that I'm seeing a "legitimate" comment, not one manufactured in a rabble-rousing factory in Russia.  Arguably, this formulation is at odds with anonymity, because just knowing an account name won't give me much confidence that the person behind the account is a real person.  Trolls create thousands of accounts and use them to create the illusion that massive numbers of people agree passionately about whatever topic they are posting about.  They even use a few as fake counter-arguers to make it all seem more real.

So it isn't enough that Facebook has an account named Abraham Lincoln, and that the person posting on that account has the password.  There is some sense in which you want to know that this is really good old honest Abe posting from the great beyond, and not an imposter (or even a much younger namesake).  Facebook doesn't try to offer that assurance.

This is a technical question, and it may well have a technical answer, although honestly, I don't see an immediate way to solve it.  A quick summary:
  • Desired is a way to communicate, either one-to-one (email), or one-to-many (Facebook, within a social network), or one-to-all (commentary on newspapers and other public media web sites).
  • If the individuals wish to do so privately, we would wish for a way to do this that reveals no information to the authorities, under the assumption that "everyone uses social media".  So there should be a way to communicate privately that somehow hides itself as completely normal web site browsing or other normal activities.
  • If the individuals wish to post publicly, others should be able to authenticate both the name of the person behind the posting (yes, this is the real "Ken Birman," not a forged and twisted fake person operated as a kind of web avatar under control of trolls in St. Petersburg), and the authenticity of the posting (nobody forged this posting, that sort of thing); a sketch of the signing building block appears just after this list.
  • In this public mode, we should have several variations:
    • Trust in the postings by people in our own social network.
    • Indirect trust when some unknown person posts, but is "vouched for" by a more trusted person.  You can think of this as a form of "trust distance".
    • A warning (think of it as a "red frame of shame") on any posting that isn't trustworthy at all.  The idea would be to put a nice bright red frame around the troll postings. 
  • When someone reacts to a troll posting by reposting it, or replying to it, it would be nice if the social networking site or media site could flag that secondary posting too ("oops! The poster was trolled!").  A cute little icon, perhaps?  This could become a valuable tool for educating the population at large about the phenomenon, since we often see secondary commentary without understanding the context in which the secondary remark was made.
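On the authentication side, the signing building block itself is routine; the genuinely hard part is binding a public key to a real person, which is simply assumed below.  A minimal sketch using the third-party Python cryptography package:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Key generation happens once; some root of trust must vouch that this public
# key really belongs to "Ken Birman" (the hard, unsolved part of the proposal).
signing_key = ed25519.Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()

def sign_posting(text):
    return signing_key.sign(text.encode("utf-8"))

def is_authentic(text, signature):
    try:
        verify_key.verify(signature, text.encode("utf-8"))
        return True
    except InvalidSignature:
        return False         # this posting would earn the "red frame of shame"

sig = sign_posting("This is really Ken posting.")
print(is_authentic("This is really Ken posting.", sig))        # True
print(is_authentic("This is a troll's edited version.", sig))  # False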
Then we would want these ideas widely adopted by email systems, Facebook, Twitter, Google, the New York Times, the Breitbart News, and so forth.  Ideally, every interaction would offer these options, so that any mail we send, posting we make, or any that we read, is always protected in the intended manner.

Could this agenda be carried out?  I believe so, if we are willing to trust the root authentication system.  The Cornell visitor from this week pointed out that there is always an issue of the root of trust: once someone can spoof the entire Internet, you don't have any real protections at all.  This extends to your computer too: if you are using a virtualized computer that pretends to support the trust framework but in reality shares your information with the authorities, all privacy bets are (obviously) off.  If the display system carefully and selectively removes some red frames, and inserts others where they don't belong, we're back to square one.

So there are limits.  But I think that with "reasonable assumptions" the game becomes one of creating the right building blocks and then assembling them into solutions with the various options.  Then industry would need to be convinced to adopt those solutions (perhaps under threat of sanctions for violating European privacy rules, which are much tougher than the ones in the US).  So my bet is that it could be done, and frankly, when we consider the scale of damage these hackers and trolls are causing, it is about time that we did something about it.