There’s a war raging on the Twitter platform, and chances are you’re already part of it even if you don’t know it. It’s a war against the growing plague of Twitterbots — fraudulent accounts that exist to spread spam, further some group’s agenda, engage in social phishing or just annoy you.
According to estimates I’ve heard, around 7 percent of the average user’s followers are fake. Some studies (albeit less-than-scientific, like this one and this one) put that number as high as 35 percent. Twitter has sued in federal court to shut down some of the biggest suppliers of these fake accounts, but until that’s resolved — and probably even after — the bots will just keep on coming.
(For what it’s worth, my own quick, four-point analysis suggests the numbers rise the more followers you have. StatusPeople’s Fake Follower check thinks 5 percent of my 6,864 followers are fake, and friends with similar follower counts have similar percentages of fake followers. When I checked two of my followers who each have more than 1 million followers of their own, the fake predictions came in at more than 20 percent both times.)
Striking down fake accounts with laser accuracy
A group of researchers and Twitter employees, however, recently conducted an experiment that resulted in the identification and deletion of millions of fake accounts. And they did it without ever getting into the increasingly difficult challenge (although one that’s possibly about to get easier thanks to advances in deep learning) of trying to spot fake accounts by analyzing the content of their tweets.
According to Chris Grier, a researcher from the University of California, Berkeley, who is part of the team, this experiment was the culmination of years of research into the underground economy, the last few of them focused on Twitter. Over the course of 10 months, the team bought 121,027 accounts from 27 merchants that the researchers estimate account for between 10 and 20 percent of fake-follower volume on Twitter. They used what they learned from the registration process alone both to determine some of these merchants’ methods for getting around Twitter’s anti-spam processes and to train machine-learning algorithms that can perform massive sweeps of Twitter’s rolls and highlight the fakes.
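The paper’s modeling details don’t appear in this article, but the general idea is straightforward: treat the purchased accounts as labeled fakes, treat known-good accounts as legitimate, and train a classifier on registration-time features. Here’s a minimal sketch of that idea, assuming a simple logistic regression; the features and numbers are my own illustration, not the team’s actual pipeline.

```python
# A minimal sketch of training a classifier on registration-time features.
# The features, values and model choice are illustrative assumptions, not
# the researchers' actual pipeline.
from sklearn.linear_model import LogisticRegression

# Each row: [handle_matches_auto_pattern, seconds_to_fill_form, email_looks_phony]
X = [
    [1, 0.1, 1],   # purchased account: templated name, instant form fill
    [1, 0.2, 1],
    [0, 45.0, 0],  # known-good account: human-speed registration
    [0, 30.0, 0],
]
y = [1, 1, 0, 0]   # 1 = fake, 0 = legitimate

model = LogisticRegression().fit(X, y)

# Sweep a batch of new registrations and flag the bot-like one.
print(model.predict([[1, 0.15, 1], [0, 60.0, 0]]))  # -> [1 0]
```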
Those algorithms turned out to be thorough. In one big sweep, the model identified and deleted 95 percent of the accounts then registered by the 27 services it targeted. Better yet, the method was deadly accurate, falsely identifying accounts as fake only 0.0058 percent of the time (or about 6 out of every 100,000). To the best of Grier’s knowledge, Twitter is still using the model to run occasional sweeps, although he noted the approach could be converted into an online system that would identify fake accounts in real time.
I reached out to Twitter for a comment on whether or how it’s still using the researchers’ system, but have not received a response.
Forget what they tweet, look at how they registered
What they did, Grier explained, was analyze the numerous distinct signals that appear at registration time and that pretty clearly signify fraudulent behavior. He pointed, for example, to the lazy practice of automatically generating a Twitter handle that’s just the fake user’s first and last name followed by four digits.
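As a quick illustration of how mechanical that signal is, a check for the firstname-lastname-plus-four-digits template could be as simple as a regular expression. This is my own sketch, not the researchers’ actual rule:

```python
import re

# Illustrative check for the lazy "FirstnameLastname1234" template the
# article describes; the exact pattern is my own assumption.
AUTO_HANDLE = re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d{4}$")

def looks_auto_generated(handle):
    return bool(AUTO_HANDLE.match(handle))

print(looks_auto_generated("JaneDoe4821"))  # True: fits the template
print(looks_auto_generated("gigaom"))       # False
```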
Speed matters, too. “How fast you fill out the [registration] form is a good signal whether you’re human or not,” Grier said. Humans can fill out forms fast, he noted, but not in a tenth of a second while also solving a CAPTCHA.
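A timing check is similarly simple to sketch. The half-second threshold below is my own illustrative number, not one from the study:

```python
# Illustrative timing signal: a form submitted faster than any human could
# manage while also solving a CAPTCHA is treated as bot-like. The threshold
# is an assumption for the sake of the example.
MIN_HUMAN_FILL_SECONDS = 0.5

def bot_like_fill_time(seconds_to_submit, solved_captcha):
    return solved_captcha and seconds_to_submit < MIN_HUMAN_FILL_SECONDS

print(bot_like_fill_time(0.1, True))   # True: tenth-of-a-second fill
print(bot_like_fill_time(40.0, True))  # False: human speed
```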
Most interesting (to me, at least) was that although Grier and the team performed some serious text and statistical analysis to achieve maximum accuracy, many of the patterns were evident to the naked eye. “When you see a list of 1,000 accounts, it’s clear what the patterns are,” Grier said, citing the automated naming conventions and the clearly phony email addresses.
Snuffing out fake accounts across the web?
Grier said he’d like to see the team’s work expand into services beyond Twitter, some of which have fewer guards in place to stop fake accounts. While Twitter’s IP address limits aren’t much of a barrier to determined fake-account sellers (“We’re pretty sure they’re using big botnets,” he said), the limits do have some effect. CAPTCHAs and email verification work some, too, even if they’re not foolproof.
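For a sense of why a botnet defeats that guard, here’s a sketch of a per-IP signup limit (my simplification, not Twitter’s actual mechanism). Spreading registrations across thousands of botnet IPs keeps every address under the cap:

```python
import time
from collections import defaultdict, deque

# Illustrative per-IP registration limiter. The window and cap are
# assumed values for the sketch, not Twitter's real limits.
WINDOW_SECONDS = 3600  # look back one hour
MAX_SIGNUPS = 3        # registrations allowed per IP per window

_recent = defaultdict(deque)

def allow_signup(ip, now=None):
    now = time.time() if now is None else now
    q = _recent[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()          # forget signups outside the window
    if len(q) >= MAX_SIGNUPS:
        return False         # this IP has hit its cap
    q.append(now)
    return True

# One IP burns through its allowance quickly...
print([allow_signup("198.51.100.7") for _ in range(4)])  # [True, True, True, False]
# ...but a botnet with a fresh IP per signup never trips the limit.
```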
In fact, the team found, the extra cost of designing systems that can overcome email verification helps inflate the price of fake followers. Grier’s team paid about $40 per 1,000 accounts, but accounts from services without email verification were about 20 percent cheaper, or roughly $32 per 1,000.
Expanding the work to other services will probably require close collaboration between the researchers and the companies, though. Previously, Grier noted, his team had relied on public Twitter data, but this time it had access to private Twitter data thanks to the two Twitter employees on the team. As we’ve covered before, there’s a rift of sorts between the academic and web communities when it comes to sharing data that’s potentially valuable for research but also potentially sensitive from a commercial perspective.
Ideally, Grier said, his team could gain enough knowledge about how merchants selling fake accounts work that it could build models for fingerprinting them across the entire web.
There’s only one problem with that strategy: Right now, the collective market for fake accounts is like a gang of hydra-ninja hybrids. It’s like a hydra in the sense that even if Grier’s team cut the heads off these 27 merchants, the remaining heads would still be responsible for the other 80 to 90 percent of fake Twitter accounts. It’s like a ninja in the sense that many merchants — especially freelancers and those operating on web forums — are able to operate in the shadows.
Unfortunately, Grier acknowledged, no one knows just how many of these services are out there.
Feature image courtesy of Shutterstock user DeiMosz.