Why the trick to analyzing Twitter data is more data

With election season in full swing and presidential debates kicking off on Wednesday, there’s a lot of talk about the role social media — particularly Twitter — will play in gauging how well candidates performed and how well they’re faring among voters. But not everyone is buying into Twitter’s status as a divining rod for public interest or sentiment on a particular topic, especially politics. There’s merit to both positions, as the real value in analyzing tweets comes not from the messages themselves, but from their inclusion in a giant data set that includes data from everywhere and anywhere someone can think of.

On Tuesday, MIT Technology Review reporter David Talbot wrote a post titled “The Real Debate Will Take Place on Twitter and Facebook.” It’s not an uncommon sentiment, as pollsters, marketers and anyone else whose job entails discerning public opinion are trying to use Twitter to find out just that information. And Talbot is right: on Thursday morning, probably earlier, analysts of all stripes will be poring over Twitter data to see who users thought won and lost the presidential debate, and what talking points struck a nerve.

Twitter itself will probably get into the act, too. The company has taken to posting tweets per minute during big events as a way of gauging their popularity, and it likely will supplement its regularly updated ranking of public sentiment toward President Barack Obama and challenger Mitt Romney with stats about how many tweets the debate generated.

Not so fast

However, in a column and subsequent blog post on Friday, Wall Street Journal “Numbers Guy” Carl Bialik laid out a convincing case for why Twitter is probably not too good an indicator of how excited the populace is about a given topic, or even how it feels. The difficulty in assessing the value of Twitter data really boils down to knowing what you’re looking for. Bialik serves up a litany of reasons Twitter might be ineffective at predicting, for example, elections or other broad issues of national concern, including:

  • It’s a large sample size, but still just a fraction of the population.
  • Even among internet users, Twitter skews toward a younger, more-connected demographic.
  • Twitter undercounts total tweets when the system can’t make a link between tweets and a given event.
  • The number of tweets per minute on any topic will naturally rise as Twitter’s user count rises.
  • Tweeting may be a sign that someone is less engaged in an activity (e.g., watching a presidential debate) than someone watching intently.
  • Sentiment analysis can be skewed by who’s tweeting about an issue (e.g., ardent supporters only, adversaries only, or the public at large).
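
One of the points above, that tweets per minute will rise simply because Twitter's user count rises, suggests scaling raw counts by the size of the user base before comparing events. Here is a minimal sketch of that idea in Python. The function name and every figure in it are hypothetical and chosen only for illustration; they are not real Twitter statistics.

```python
def tweets_per_minute_per_million_users(tweets_per_minute, active_users):
    """Scale a raw tweets-per-minute figure by the size of the user base."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return tweets_per_minute / (active_users / 1_000_000)


# Hypothetical numbers, for illustration only (not real Twitter data).
events = [
    {"name": "event A", "tweets_per_minute": 5_000, "active_users": 50_000_000},
    {"name": "event B", "tweets_per_minute": 10_000, "active_users": 125_000_000},
]

for event in events:
    rate = tweets_per_minute_per_million_users(
        event["tweets_per_minute"], event["active_users"]
    )
    print(f'{event["name"]}: {rate:.1f} tweets per minute per million users')
```

In this made-up comparison, event B generates twice the raw volume of event A but a lower rate once the larger user base is taken into account.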

Those are all good points, and it would be fair to call a fool anyone thinking Twitter alone is the answer to their data needs. Taken by itself, Twitter data probably beats more-traditional methods of assessing public opinion in only a few use cases.

For example, if there’s one thing Twitter users appear to be great at, it’s consuming products and commenting on those experiences. Telephone surveys and focus groups take time and effort (and think about the demographic biases inherent in people who’ll take the time to do them), but a tweet only takes a few seconds and those opinions are often unsolicited and unfiltered. I rarely tweet about politics or things not related to my day job, but even I will chime in if my experience with a product, service, movie or airline has been particularly good or bad.

And if you’re someone trying to reach a demographic that uses Twitter heavily, activity and sentiment on Twitter might be a good measure of how effective your message or product actually is. As Bialik’s column points out, a Pew Research study found that more than 30 percent of internet users between 18 and 24 years old are on Twitter, while others have found the average Twitter user is a 37-year-old woman.

Variety is the spice of life — and tweet data

What smart companies do today, though, and what anyone looking to derive meaning from Twitter data should do, is analyze it against other data sets that actually have meaning to them. Without context, tweets are just tweets, but part of what makes big data so big is the variety of data now available to analyze. Tying tweets to other data gives them, and the other data sources, meaning beyond their surface value.

A lot of insights might be gleaned over time by studying the correlation between Twitter activity and particular marketing campaigns or campaign strategies, or perhaps spikes and dropoffs in sales. Adding in some advanced sentiment analysis, one could find how users’ feelings change with those increases or decreases in activity, and perhaps how strongly they change, judging by the intensity of their language.
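
As a rough sketch of what studying that correlation could look like, the Python below computes a Pearson correlation coefficient between a series of daily tweet counts and a series of daily sales. Both series are invented for illustration, and the choice of Pearson correlation is an assumption; the article does not prescribe a particular statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical daily figures for one week (not real data).
daily_tweets = [120, 150, 400, 380, 160, 140, 130]
daily_sales = [20, 22, 35, 41, 30, 23, 21]

r = pearson(daily_tweets, daily_sales)
print(f"Correlation between tweet volume and sales: {r:.2f}")
```

A coefficient near 1 or -1 only says the two series move together. It does not show that the tweets drove the sales, which is part of why tying tweets to several other data sets matters.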

Going forward, I think the real value of Twitter will come from using it to gauge how specific demographics or types of individuals are feeling about certain topics. One could limit analysis of Twitter activity or sentiment to users in a particular geographic area, or to those who identify themselves as “father,” “mother,” “Java programmer” or whatever. It’s the sort of micro-targeting political campaigns already engage in today, only digital.
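
To make the idea of limiting analysis to self-identified groups concrete, here is a small Python sketch that averages sentiment for users whose profile bio contains a given keyword. The records, field names and sentiment scores are all hypothetical; the scores are assumed to come from some upstream sentiment tool and to range from -1 (negative) to 1 (positive).

```python
from collections import defaultdict

# Hypothetical records; field names and sentiment scores are assumptions.
tweets = [
    {"bio": "Father of two, Java programmer", "location": "Miami", "sentiment": 0.6},
    {"bio": "Mother, runner", "location": "Austin", "sentiment": -0.2},
    {"bio": "java programmer and coffee fan", "location": "Austin", "sentiment": 0.2},
    {"bio": "Student", "location": "Miami", "sentiment": -0.5},
]


def mean_sentiment_by_keyword(records, keywords):
    """Average sentiment for each keyword found (case-insensitively) in a bio."""
    scores = defaultdict(list)
    for record in records:
        bio = record["bio"].lower()
        for keyword in keywords:
            if keyword.lower() in bio:
                scores[keyword].append(record["sentiment"])
    return {keyword: sum(vals) / len(vals) for keyword, vals in scores.items()}


segments = mean_sentiment_by_keyword(tweets, ["father", "mother", "Java programmer"])
for keyword, average in segments.items():
    print(f"{keyword}: {average:+.2f}")
```

Plain substring matching is crude (a search for "father" would also match "grandfather"), so a real analysis would need more careful matching. The same grouping could be done on the `location` field to segment by geographic area.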

Taking a page from the Klout playbook, pollsters or marketing managers might actually look for trends among influential Twitter users (however their algorithms score influence) whose voices have a broader reach than some random 20-year-old from Miami. As the Technology Review’s Talbot points out, this is already happening to some degree, even if firms are just identifying influential users and not yet targeting them with messages.

Services such as Gnip and DataSift already provide much of this information as part of their social data offerings. At our Structure: Europe conference later this month in Amsterdam, I’ll actually be speaking with DataSift Co-Founder and CTO Nick Halstead about how a company can go about setting up an infrastructure that lets it take advantage of Twitter data successfully and in real time.

Is data on how many people are tweeting or how they feel really telling about who’ll win an election or what will win Best Picture? Probably not. But seeing the rates of tweets pick up and the sentiment sour every time you espouse a certain opinion, update a certain product or, say, place more limits on how developers can use your API could be truly meaningful information.

Feature image courtesy of Shutterstock user Pavel Ignatov.


GigaOM