When it comes to protecting privacy in the digital age, anonymization is a terrifically important concept. In the context of the location data collected by so many mobile apps these days, it generally refers to decoupling that data from identifiers such as the user’s name or phone number. Used in this way, anonymization is supposed to allow the collection of huge amounts of information for business purposes while minimizing the risks if, for example, someone were to hack the developer’s database.
Except, according to research published in Scientific Reports on Monday, people’s day-to-day movements are usually so distinctive that even anonymized location data can be linked back to individuals with relative ease if correlated with a piece of outside information. Why? Because our movement patterns give us away.
The paper, entitled “Unique in the Crowd: The privacy bounds of human mobility,” took an anonymized dataset from an unidentified mobile operator containing call information for around 1.5 million users over 15 months. The purpose of the study was to figure out how many data points — based on time and location — were needed to identify individual users. The answer, for 95 percent of the “anonymous” users logged in that database, was just four.
From the paper:
“We showed that the uniqueness of human mobility traces is high, thereby emphasizing the importance of the idiosyncrasy of human movements for individual privacy. Indeed, this uniqueness means that little outside information is needed to re-identify the trace of a targeted individual even in a sparse, large-scale, and coarse mobility dataset. Given the amount of information that can be inferred from mobility data, as well as the potentially large number of simply anonymized mobility datasets available, this is a growing concern.”
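To make that concrete, here is a minimal sketch of the “unicity” idea with entirely made-up traces (this is not the authors’ code): each user’s record is treated as a set of (antenna, hour) points, and a user counts as unique if a handful of points drawn from their own trace matches nobody else’s.

```python
import random

# Toy illustration of the "unicity" measurement, not the paper's actual code:
# a trace is a set of (antenna_id, hour) points; a user is "unique" given p
# points if no other trace contains all p of them.

def is_unique(traces, user, points):
    """True if no other user's trace contains all of the given points."""
    return not any(points <= other for u, other in traces.items() if u != user)

def unicity(traces, p, trials=1000):
    """Fraction of sampled users singled out by p random points from their own trace."""
    hits = 0
    for _ in range(trials):
        user = random.choice(list(traces))
        trace = traces[user]
        if len(trace) < p:
            continue
        points = set(random.sample(sorted(trace), p))
        hits += is_unique(traces, user, points)
    return hits / trials

# Hypothetical mini-dataset: user -> set of (antenna, hour) observations.
traces = {
    "alice": {("A12", 8), ("B03", 9), ("C77", 13), ("A12", 18), ("D41", 22)},
    "bob":   {("A12", 8), ("E09", 10), ("C77", 13), ("F15", 19), ("A12", 23)},
    "carol": {("B03", 9), ("E09", 10), ("G02", 12), ("D41", 22), ("H88", 21)},
}

print(unicity(traces, p=4))  # the paper found roughly 0.95 on real data with four points
```

The striking part of the real result is that this worked on a coarse, antenna-level dataset covering 1.5 million people: four points were still enough to single most of them out.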
Just because you’re paranoid…
For those already worrying about the privacy-busting implications of mobile device use, this should come as no surprise. As CIA CTO Ira “Gus” Hunt stressed last week at GigaOM’s Structure:Data conference, mobility and security do not go hand-in-hand. You can be constantly tracked through your mobile device, even when it is switched off. What’s more, those sensors you’re pairing with your device make it ridiculously easy to identify you.
From Hunt’s speech:
“You guys know the Fitbit, right? It’s just a simple three-axis accelerometer. We like these things because they don’t have any – well, I won’t go into that [laughter]. What happens is, they discovered that just simply by looking at the data what they can find out is with pretty good accuracy what your gender is, whether you’re tall or you’re short, whether you’re heavy or light, but what’s really most intriguing is that you can be 100 percent guaranteed to be identified by simply your gait – how you walk.”
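None of what follows comes from Hunt or from Fitbit; it is just a rough sketch of how gait identification is typically approached – turn raw three-axis accelerometer samples into a compact feature vector (cadence, acceleration statistics) and match it against enrolled profiles.

```python
import numpy as np

# Illustrative only: a deliberately crude gait "fingerprint". Real systems use
# far richer features and trained classifiers; every name here is made up.

def gait_features(samples, rate_hz=50):
    """samples: (N, 3) array of x/y/z accelerometer readings. Returns a tiny feature vector."""
    magnitude = np.linalg.norm(samples, axis=1)      # overall acceleration per sample
    magnitude = magnitude - magnitude.mean()         # strip the gravity/DC offset
    spectrum = np.abs(np.fft.rfft(magnitude))
    freqs = np.fft.rfftfreq(len(magnitude), d=1.0 / rate_hz)
    step_freq = freqs[spectrum[1:].argmax() + 1]     # dominant cadence, in Hz
    return np.array([step_freq, magnitude.std(), np.abs(magnitude).mean()])

def closest_profile(features, profiles):
    """Toy nearest-neighbour match of a feature vector against enrolled gait profiles."""
    return min(profiles, key=lambda name: np.linalg.norm(features - profiles[name]))
```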
One of the explicit purposes of Unique in the Crowd was to raise awareness. As the authors put it: “these findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals.”
But this isn’t just about mobility; it’s also about the implications of our big data society. These are effectively two sides of the same coin – mobile devices make it easy to collect data, while big data capabilities make it increasingly trivial to take the resulting mass of supposedly anonymized data and tease out the kind of specificity that the anonymizers were trying to erase.
This was precisely the sort of problem foreseen by Europe’s cybersecurity agency, ENISA, a few months back when evaluating the continent’s proposed “right to be forgotten”. If a citizen really wants all traces of their personal data removed from the web, ENISA pointed out, that would have to mean removing their data from anonymized datasets as well as from more obvious repositories such as social networks and search indices.
As ENISA said at the time:
“Removing forgotten information from all aggregated or derived forms may present a significant technical challenge. On the other hand, not removing such information from aggregated forms is risky, because it may be possible to infer the forgotten raw information by correlating different aggregated forms.”
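A trivial, made-up example of the kind of correlation ENISA is warning about: if an aggregate is published before and after one person’s record is “forgotten”, simple arithmetic can recover exactly the value that was supposed to disappear.

```python
# Made-up numbers: publishing a sum (or an average plus a count) before and
# after a deletion leaks the deleted record, even though no raw data was released.

salaries_before = [41_000, 52_000, 47_000, 38_000, 60_000]  # includes the "forgotten" person
salaries_after  = [41_000, 52_000, 47_000, 38_000]          # their record removed

aggregate_before = sum(salaries_before)   # published last year
aggregate_after  = sum(salaries_after)    # published after the deletion request

print(aggregate_before - aggregate_after)  # 60000: the erased value, reconstructed
```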
Shall we just give up now?
The Unique in the Crowd authors stressed in a BBC interview that “we really don’t think that we should stop collecting or using this data — there’s way too much to gain for all of us — companies, scientists, and users.” So what can be done?
Personally speaking, I have been writing about issues around data privacy for many years, and I still cannot see any easy solution to this problem. If it were simply a case of which side of the argument carries more weight, I would have no hesitation in siding with the privacy brigade: selling data to advertisers in order to fund that “free” app does not justify the creation of a surveillance society.
But it’s just not that simple. That Fitbit is also trying to help you keep fit – the fact that it can identify you as a side effect doesn’t change that. Mobile operators’ datasets help keep their networks running. Location-based services don’t work without location. We even hope big data capabilities will help us fight diseases and socio-economic problems. And, most importantly, despite the fact that most people in the U.S. and European Union insist they want better data privacy, we see time and again that this desire doesn’t translate into action – people still give up their data without much consideration.
What we need is a new realpolitik for data privacy. We are not going to stop all this data collection, so we need to develop workable guidelines for protecting people. Those developing data-centric products also have to start thinking responsibly – and so do the privacy brigade. Neither camp will entirely get its way: there will be greater regulation of data privacy, one way or another, but the masses will also not be rising up against the data barons anytime soon.
There needs to be better regulation that works in practice – unlike Europe’s messy cookie law or the “right to be forgotten”. It may be that the restrictions will need to be on the use of data rather than its collection, as proposed in a recent World Economic Forum report. However, regulators tend not to be very proactive, particularly when the risks, while inevitable, remain mostly theoretical.
I suspect the really useful regulation will come some way down the line, as a reactive measure. I just shudder to think what event will necessitate it.