On data privacy, there are lots of questions and no real answers

The biggest takeaway from the Foreign Intelligence Surveillance Court opinion finding certain NSA activities illegal and unconstitutional shouldn’t be that the NSA acted badly. It should be that when it comes to how we legislate and regulate data in an era of big data, no one really has a clue how to handle it. There’s too much data, it’s too easy to collect and it’s really hard to define what’s acceptable use.

The headlines over the past 24 hours have all been about how many messages between U.S. citizens the agency has collected and that it did so “intentionally.” But the opinion, 85 pages in total, is not so cut and dried (in legalese, even “intentionally” doesn’t mean intentionally). If anything, it raises more questions and debate points than it answers.

Here’s the TL;DR version: The secret court has no idea what’s going on and no real way to find out other than by taking the NSA’s word for it. Its power to rule on the legality of certain actions is hamstrung as a result. The NSA actually seems to care about doing things legally, but even it has a hard time keeping track of the deluge of data it’s collecting.

Here’s a longer list of questions and concerns:

  • Is the real issue that the communications were incidentally collected, or how many were collected? The court seems to suggest the latter.
  • Are approximately 46,000 questionable messages (0.197 percent of “single communication transactions”) and between 2,000 and 10,000 seemingly illegal “multiple communication transactions” a significant amount? What about in light of the fact that the NSA was collecting about 13.25 million of these messages a year, legally, through its “upstream” program at the time?
  • The court acknowledges it can’t actually analyze all the data in question, so it’s relying on the NSA’s analysis to obtain those numbers.
  • Consider that we’re talking only about a small fraction (just over 5 percent) of what the NSA collects in a year under Section 702. The court doesn’t appear to have much issue with certain other massive collection efforts from ISPs. It doesn’t touch the collection of metadata from wireless carriers.
  • What about the very real possibility that none of these questionable communications are ever viewed by an analyst, and instead just sit in a repository somewhere? What if the communications were never read or queried, per se, but just fed into some systems doing deep text analysis over the entire body of intercepted communications?
  • Does the fact that the NSA’s systems can’t automatically distinguish between legal and illegal communications matter? Or that the agency apparently does some degree of self-reporting, and actually purges clearly illegal records from its system?
  • The court actually gives the NSA quite a bit of leeway in interpreting the laws that govern it, and consider that the reasonableness of collection is determined on an acquisition-by-acquisition basis.
  • This ruling only found the minimization efforts — not the targeting tactics — to be in violation of the relevant statutes and the Constitution, and only in certain situations. I haven’t read anything detailing how the NSA has amended its minimization efforts in light of this ruling.
  • The opinion mentioned other cases where the NSA apparently violated the scope of what the court thought were its rights. A lot has been made of those offenses, but do we know what changed as a result of them?
  • Does it matter that many U.S. citizens, if post-PRISM-leak polls are to be believed, would actually welcome this type of data collection?

Oh, and national security. Can’t forget about that one.

On the web, it’s not even what’s collected that matters

Regulating privacy in the consumer world isn’t any easier. If anything, consumer privacy (especially if no one is willing to pay for services) underscores just how complicated a mess this is and how hard it is to keep up with data and technology. That’s why, although federal regulators have been proposing consumer-privacy guidelines for years, they never get much traction.

This FTC privacy report from 2010, for example, talks a lot about personally identifiable information, or PII. But guess what: pretty much all data can now be used to identify us. This is especially true of the stuff we open up to the world on Twitter or LinkedIn that any other service is free to snatch up and tie back to us. There’s not much use focusing on PII anymore.

So the latest FTC privacy report, from 2012, talks about the “linkability” of data back to a particular user. It suggests actively de-identifying data and forbidding attempts to re-identify it. That’s not a bad idea, but there will certainly be exceptions (like for first-party data) and it seems like companies engaged in online advertising or data collection would fight this type of regulation very hard.
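To make “linkability” concrete, here is a minimal sketch of how records with the names stripped out can be tied back to people by matching on a few ordinary attributes. Everything in it is hypothetical: the records are made up, and the field names (`age`, `city`, `employer`, `diagnosis`) and the `link_records` helper are assumptions chosen for illustration, not anything drawn from the FTC reports.

```python
# Made-up "de-identified" records: names removed, ordinary attributes kept.
deidentified = [
    {"age": 34, "city": "Las Vegas", "employer": "Acme", "diagnosis": "asthma"},
    {"age": 34, "city": "Las Vegas", "employer": "Globex", "diagnosis": "diabetes"},
    {"age": 51, "city": "Tacoma", "employer": "Acme", "diagnosis": "migraine"},
]

# Made-up public profiles, the kind of thing people post about themselves.
public_profiles = [
    {"name": "Pat Example", "age": 34, "city": "Las Vegas", "employer": "Globex"},
    {"name": "Sam Sample", "age": 51, "city": "Tacoma", "employer": "Acme"},
    {"name": "Lee Nobody", "age": 29, "city": "Reno", "employer": "Initech"},
]

# Attributes that are not names but can still single someone out (assumed set).
QUASI_IDENTIFIERS = ("age", "city", "employer")


def link_records(deidentified, public_profiles, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where a profile matches exactly one record."""
    # Group the de-identified records by their quasi-identifier values.
    index = {}
    for record in deidentified:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    matches = []
    for profile in public_profiles:
        candidates = index.get(tuple(profile[k] for k in keys), [])
        # A unique match means the "anonymous" record now has a name attached.
        if len(candidates) == 1:
            matches.append((profile["name"], candidates[0]))
    return matches


if __name__ == "__main__":
    for name, record in link_records(deidentified, public_profiles):
        print(name, "->", record["diagnosis"])
```

No single field here is PII in the traditional sense, which is roughly why a rule written around PII alone misses the problem.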

The FTC is putting a lot of thought into privacy.

Even if audits were in place — where regulators actually came in and analyzed someone’s ad-targeting models, let’s say — prohibitions wouldn’t be foolproof. (This sort of thing happens in the banking industry, for example, where factors such as race cannot affect credit rating.) I was speaking with a professor recently, Ankur Teredesai from the University of Washington-Tacoma, who noted that good data scientists can always find proxies. They just test the hell out of other data points until they find something, or some combination, that gets them to the same place. And that’s what goes in the official scoring model.
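Here is a rough sketch of what that proxy hunt might look like, assuming the simplest possible approach: score every allowed feature by how strongly it correlates with the prohibited attribute, then rank them. The data is synthetic and the feature names (`zip_group`, `shoe_size`, `visits_per_week`) are invented for illustration; `zip_group` is deliberately built to track the protected attribute. This is not a description of how any real scoring model is built.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_synthetic_population(n=2000, seed=0):
    """Build made-up rows; 'protected' is the attribute a model may not use."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        protected = rng.random() < 0.5
        rows.append({
            "protected": int(protected),
            # Constructed to agree with the protected attribute ~90% of the time.
            "zip_group": int(protected if rng.random() < 0.9 else not protected),
            # Unrelated noise features.
            "shoe_size": rng.randint(5, 13),
            "visits_per_week": rng.randint(0, 20),
        })
    return rows


def rank_proxies(rows, protected_key="protected"):
    """Rank allowed features by absolute correlation with the protected one."""
    target = [row[protected_key] for row in rows]
    features = [key for key in rows[0] if key != protected_key]
    scores = {
        feature: abs(pearson([row[feature] for row in rows], target))
        for feature in features
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for feature, score in rank_proxies(make_synthetic_population()):
        print(f"{feature}: {score:.2f}")
```

An auditor could confirm that `protected` never appears in the model and still miss that `zip_group` carries nearly the same information. Real proxy searches would also try combinations of features, which this sketch does not attempt.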

It’s all about how you use it

OK, so we should regulate how companies use data rather than what they can collect. That has been a big push from the technology industry, and it certainly makes more sense than limiting what’s gathered — especially considering that most rights to gather data are granted contractually (capitalists love contracts) and it’s just so easy to collect it. But how do you regulate usage?

Even if we make companies like Google and Facebook tell users how they’re using user data, there’s still the challenge of timing. Surely, we can’t expect companies to get consent from users every time they’re experimenting with some new product or new model using our data, right? That seems like it would be a pretty big hindrance to innovation: push your product ideas into the public eye or get slammed with penalties.

Further, granting real permission is based on having all the facts. “We’re going to use your personal data for targeted advertising” is a lot different than saying “We’re going to take your age, city, site behavior and — ooh, you signed in via Twitter — Twitter account info to predict that you’re black, white, rich, poor, healthy or suffering from herpes.” If we were to mandate the latter type of disclosure, would we expect consent every time a company’s data scientists reweighted the variables in their models or found some new correlations? Could we revoke permissions because something happened and our profiles suddenly look less appealing?

How does it know I’m in Las Vegas? If I hadn’t clicked “do not track,” that would be more personalized.

Then there’s the idea of data lockers or, as the World Economic Forum has suggested, permissions that travel along with data. These are potentially great ideas, but they’ll still fall prey to any opaqueness around use versus inference. Depending on how they play out, they might also require the technologically challenging feat of getting every web company in the world to agree on certain standards and systems for data management.

I like data privacy as much as the next guy, but I don’t see how we’re going to achieve it anytime soon. Whether it’s national security or web-surfing, there’s a whole lot of data being collected and there aren’t a lot of good options for dealing with it. We might be stuck for quite a while with the current cycle of spotting egregious violations, debating them, punishing violators/passing reactionary laws and then moving on.


GigaOM