In social data, a fight between science and privacy

There is a fight brewing between academics and industry over the use of your Facebook activity or search history. Academics are upset that companies like Google, Microsoft and Facebook have the keys to the underlying data, but they won’t share. The issue may be a small war between universities and giant companies, but it’s part of the larger debate over consumer privacy versus the social good. As with many things, though, perhaps the answer lies in a grand compromise.

John Markoff of the New York Times gave a good overview of the debate in an article on Monday. He writes that social scientists, from universities and some large companies alike, are upset with Google and its peers because the companies’ usual practice of not releasing the source data from studies makes it very difficult to validate or invalidate company research.

Of Google, Facebook and Microsoft, only Google responded to Markoff’s request for comment, and it predictably cited privacy concerns as a primary reason it doesn’t release much of the data behind its research.

It’s tough to argue with privacy

The argument for openness is clear and powerful: access to source data is key to validating one’s claims, even more so if data really has become so important as to be called the fourth paradigm of science. But the argument that releasing data would potentially violate consumer privacy is difficult to overcome if you’re looking at the debate objectively. Google and Facebook, in particular, are locked into consent decrees with the Federal Trade Commission that subject them to privacy audits for the next two decades. They’re also constantly under the public microscope for privacy violations both real and perceived. Legally and publicly, they can’t afford too many more gaffes.

And gaffes do happen, even when data is supposed to be anonymous. In 2006, for example, AOL released three months’ worth of supposedly anonymous search records for researchers to use. The only problem was that they weren’t anonymous at all: New York Times reporters were able to connect the anonymous data to individuals, and others noted some disturbing personal data that would be devastating if it were linked to an individual.

If you ask a privacy expert today, you’ll likely hear that this situation is common thanks to our newfound proficiencies in analytics. It’s one reason that rules about protecting users’ personally identifiable information aren’t as meaningful as they once were. During a recent conversation about intent-based privacy, NYU Stern School of Business professor Arun Sundararajan called anonymity a “gray area.”

“Targeting has gotten to the point where firms can know who you are without knowing your name,” he told me.

But what if the two opponents borrow a practice from the legal industry to solve their problem, protect consumer privacy and still ensure robust scientific review? Ankur Teredesai, a University of Washington professor and data mining researcher, says that while he would love access to data from sites such as Google, it’s a little unreasonable to expect firms to make their source data public without some serious checks in place. One compromise he suggested was for conferences (or, presumably, academic journals) to condition presentation or publication on release of data to a small number of reviewers rather than the entire research community.

It’s like when parties in litigation demand trade secrets from the other side — a judge typically views documents in private and decides whether they’re discoverable and, if so, whether major portions should be redacted. At the very least, peer review by editors prohibited from releasing the data could help ensure studies are accurate, even if it means other researchers can’t actually get their hands on the data.

Additionally, Teredesai explained, it’s important to draw a distinction between data from behavioral studies and source data for technical papers. In the case of the latter, source data doesn’t matter nearly as much as being able to test the algorithms and replicate the results on any similar dataset.

Still, Teredesai acknowledged, keeping data private is a little annoying, and academic researchers end up having to reinvent the wheel to use social or consumer data. Already, he said, he’s spending a not-insignificant amount of time and resources gathering data manually from Twitter because Twitter stopped whitelisting access to its data for academia. Getting relevant data from other sources can be much more difficult.

A world divided?

If there isn’t some sort of compromise, it’s conceivable we could see a divided research space with industry on one side and academia on the other, each playing by its own rules. In that scenario, industry wins. Markoff quotes a letter to the journal Nature from Bernardo Huberman, an HP Labs social media director and industry-research critic, that sums up this particular concern:

“If this trend continues,” [Huberman] wrote, “we’ll see a small group of scientists with access to private data repositories enjoy an unfair amount of attention in the community at the expense of equally talented researchers whose only flaw is the lack of right ‘connections’ to private data.”

ACM SIGKDD Chair and ChoozOn CTO Usama Fayyad told me last year during a discussion at the International Conference on Knowledge Discovery and Data Mining that industry has already taken the lead in big data research for the same reasons Huberman cited as concerns in the social science realm. It was easy to hire talent when he was chief data officer at Yahoo, Fayyad said, because Yahoo had the huge datasets they wanted to work with and the hardware on which to run analyses.

That hasn’t changed. Everywhere you turn, you hear stories of large web companies plucking students out of Stanford Ph.D. programs and paying them a lot of money to come to work on computer science problems, or of students dropping out to launch their own startups.

Already, Teredesai noted, industry has spearheaded a group within the ACM called WSDM, whose conference doesn’t have the strict requirements on openness or replicability that many other academic conferences do. Armed with lots of good data and lots of money to hire talent, industry can play by its own rules and expect continued praise for its work in social science research.

Shared mission, shared effort

That’s why a tight relationship between web companies and everyone else around social and consumer data is so critical. Oren Etzioni, founder and CEO of Decide.com and computer science professor at the University of Washington, told me during a recent conversation that the relationship between university computer science departments and industry is a “bright spot in the economy.” It’s a cycle where research begins in universities and then is carried on in commercial settings where projects can get the resources they need to flourish and become more useful.

Maybe a similarly symbiotic relationship is possible around data, even if both sides have to bend a bit. Etzioni said university researchers certainly like it when new technologies and techniques are open sourced so they can use and study them, adding, however, that “where you stand [on an issue] is often influenced by where you sit.”

Feature image courtesy of Shutterstock user llin Sergey; train image courtesy of Flickr user John Spooner.
