How MailChimp learned to treat data like orange juice and rethink email in the process

MailChimp Chief Data Scientist John Foreman likes to talk about orange juice. On the surface, it’s a strange way to start a discussion about data, but it all starts to make sense when you peel back the rind. It’s a way of thinking that’s letting MailChimp — which sends about 35 billion emails a year on behalf of roughly 3 million users — transform itself into a data-driven business 12 years into its existence.

When you’re in Atlanta, as I was during a recent trip, the obvious place to start talking about orange juice and data is with Coca-Cola. Foreman can tell you all about how the beverage giant, whose headquarters tower over the city just a mile away from MailChimp’s office, uses advanced algorithms and giant vats of different juices to ensure the proper flavor of its Simply Orange line of orange juice. However, it’s something else Coca-Cola is doing that inspired the way Foreman thinks about data and that’s helping MailChimp re-imagine what it means to engage with fans, readers and customers through their inboxes.

Anyone familiar with how large web companies came to pioneer the practice of what we now call “big data” should appreciate the analogy. Coca-Cola, which also owns Minute Maid, produces a lot of excess pulp when it makes orange juice. For decades, presumably, it had just been throwing that pulp away, but in 2006 it decided to make use of it by launching a new product called Minute Maid Pulpy. Sold primarily in Asian countries, Pulpy has become a billion-dollar business for Coca-Cola.

Once MailChimp is done with its primary business of sending emails, it has a lot of pulp of its own in the form of data. And rather than just ignoring it or writing up some cute blog posts (which he also does), Foreman and his bosses want to turn that data into revenue.

First things first: Making better orange juice

Neil Bainton

Actually, though, MailChimp first brought in Foreman in 2011 to help the company improve its core business of letting users build and send their emails. MailChimp’s culture was built around many things, COO Neil Bainton told me, but data wasn’t one of them. It had “various fits and starts” through the years trying to work data into its business model, and each step just added more complexity.

The challenges were technological as well as cultural, but Foreman had a plan, and focus was a key part of it. Keeping a tight focus meant Foreman and his lone-developer sidekick could build what they needed to in a short timeframe. It also meant the company didn’t have to worry about some massive overnight transformation into a data-obsessed company like Google.

John Foreman

“[They] don’t need to be afraid the entire culture is gonna fall down if we bring in this weird math guy,” he joked.

Foreman’s first project, deploying artificial intelligence models that automatically detect spammy email lists from MailChimp’s users, is actually critical to the way MailChimp operates, though. It was up and running in production within a year, after a technologically challenging effort of merging separate database instances for each customer into a single environment that would let MailChimp run complex analyses across its customer base.

It’s such an important project, Foreman explained, because internet service and email providers keep reputation scores on the IP addresses that send email through their systems. Because MailChimp serves as the email engine for its millions of users, too many messages that get flagged as spam will lower MailChimp’s reputation and have a negative impact on everyone. The company used to deal with spam manually, and only after recipients began complaining about the messages they received.

“It used to be before we had that AI model in place that everyone had a crappier experience,” Foreman said.

Say goodbye to those ’90s fans, Pearl Jam

Source: MailChimp

Now, however, MailChimp knows some of the telltale signs of spam for which it should be on the lookout. If too high a percentage of email addresses on a given list also appear on publicly available lists or those you can buy on sketchy corners of the internet, it’s probably spam. Too many old and far-more-likely-to-be-dead Earthlink or Compuserve addresses, or letters within one keystroke of each other as if someone just mashed the keyboard? Probably spam.
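To make two of those signals concrete, here is a minimal sketch of how one might score a list for legacy-domain addresses and keyboard-mashed names. It is not MailChimp’s model, which the article does not describe in detail; the domain set, the same-row definition of “one keystroke,” the cutoff value and all function names are assumptions made up for illustration.

```python
# Illustrative only: the domain list, cutoff, and function names are
# assumptions for this sketch, not MailChimp's actual model.
LEGACY_DOMAINS = {"earthlink.net", "compuserve.com"}

# Same-row QWERTY neighbors, a simplification of "within one keystroke".
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ADJACENT = set()
for row in QWERTY_ROWS:
    for left, right in zip(row, row[1:]):
        ADJACENT.add((left, right))
        ADJACENT.add((right, left))


def legacy_domain_share(addresses):
    """Fraction of addresses whose domain is in LEGACY_DOMAINS."""
    if not addresses:
        return 0.0
    hits = sum(
        1 for a in addresses
        if a.rsplit("@", 1)[-1].lower() in LEGACY_DOMAINS
    )
    return hits / len(addresses)


def keyboard_mash_score(address):
    """Fraction of consecutive letter pairs in the local part
    (the text before "@") that are neighbors on the same keyboard row."""
    local = address.split("@", 1)[0].lower()
    pairs = list(zip(local, local[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in ADJACENT) / len(pairs)


def list_signals(addresses, mash_cutoff=0.8):
    """Summarize both signals for one mailing list."""
    if not addresses:
        return {"legacy_share": 0.0, "mashed_share": 0.0}
    mashed = sum(
        1 for a in addresses if keyboard_mash_score(a) >= mash_cutoff
    )
    return {
        "legacy_share": legacy_domain_share(addresses),
        "mashed_share": mashed / len(addresses),
    }
```

A real system would presumably feed many such features into a trained model rather than apply fixed cutoffs, but the sketch shows how list-level percentages like the ones described above can be computed.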

Thankfully, though, about 98 percent of the spam that MailChimp identifies is what Foreman calls “ignorant” — that is, people or companies that just don’t know the laws or best practices around sending emails. But ignorance doesn’t mean MailChimp relaxes its rules. Recently, it even flagged Pearl Jam for spammy practices because the band was trying to reconnect with old fans whose email addresses read like a who’s who list of 1990s email providers.

Having such a high percentage of ignorant spam actually has a positive effect on the company’s overall goal of monetizing its vast data repositories. Because the AI model automates what used to be a manual process, and because most innocent spammers will fall in line quickly once they’re notified (as opposed to nefarious spammers who constantly try to outsmart the system), MailChimp can pretty much set the model loose, forget about it and get to work on new efforts, Foreman said.

Now, about that pulp

Spam under control, MailChimp can focus its efforts on actually building new products with data, just like Coca-Cola did with that extra pulp. One of its first orders of business is figuring out how to help customers better get to know the people to whom they’re sending their newsletters.

With this in mind, the company built a service called Wavelength that shows customers other newsletters that are similar to theirs. But the system that powers Wavelength also stores pretty much every interaction that every email address in the company’s database has with the newsletters they’re sent. That means what emails they open and when they open them, what links they click and when they click them, and what other newsletters they’re subscribed to. MailChimp also has a feature called Ecommerce360 that lets customers track clicks right through to conversions (marketing speak for someone actually buying something).

The company has been playing around with this data to identify clusters of users based on their behaviors and their interests (some of which Foreman has detailed on the company’s blog), and now it wants to roll it out to customers via a product MailChimp is calling ChimpQuery. Built atop Google’s BigQuery analytics service, ChimpQuery will let customers start doing this type of clustering and segmentation on their own, while saving MailChimp the trouble of hosting that infrastructure itself. (You can play with a monstrous, interactive graph of the entire MailChimp subscriber list here.)

If you sell knitting supplies and you find out there’s a big cluster of people on your mailing list who also are interested in wedding planning and custom jewelry, there might be an opportunity to create your content with these interests in mind or even to partner with companies in those spaces.

A sample cluster of subscribers.

Another topic that has been on Foreman’s mind lately is what he calls “frequency elasticity of engagement.” He’s done research suggesting that blasting the heck out of your email list might actually have detrimental effects in the long term (regardless of how the Obama campaign successfully exploited this strategy) but noted that engagement also has a lot to do with content and a particular company’s given user list. MailChimp’s data could help customers figure out the ideal schedule for emailing their subscribers.

For example, Birchbox has really high engagement because people love the service and have to open their emails to find out what goodies they’re receiving. Emails from a company like Papa John’s, on the other hand, might sit in someone’s inbox essentially as spam until they want to order a pizza and go searching for a coupon. Everyone has to figure out what pace and engagement metrics work for them.
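The article does not give Foreman’s formula for “frequency elasticity of engagement,” but by analogy with price elasticity in economics it can be read as the percentage change in engagement per percentage change in send frequency. The sketch below assumes that reading and uses the midpoint (arc) form; the function name and the choice of open rate as the engagement metric are illustrative assumptions.

```python
def arc_elasticity(freq_before, eng_before, freq_after, eng_after):
    """Midpoint (arc) elasticity of engagement with respect to send frequency.

    Assumed definition, borrowed from price elasticity:
    (% change in engagement) / (% change in frequency),
    with each percentage taken relative to the midpoint of the two values.
    """
    if freq_before == freq_after:
        raise ValueError("send frequency must change to measure elasticity")
    pct_freq = (freq_after - freq_before) / ((freq_after + freq_before) / 2)
    pct_eng = (eng_after - eng_before) / ((eng_after + eng_before) / 2)
    return pct_eng / pct_freq


# Hypothetical numbers: doubling sends from 2 to 4 per week while the
# open rate falls from 30% to 20%.
example = arc_elasticity(2, 0.30, 4, 0.20)
```

Under this reading, a negative value means engagement per email falls as frequency rises, and a sender like Birchbox would be expected to show a value closer to zero than a sender whose emails mostly sit unopened.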

Reining expectations back in

However, now that management is fully sold on the power of data, Foreman sometimes finds himself managing expectations rather than just pitching his ideas. COO Bainton, for example, is adamant that MailChimp start aiding its publishing-industry customers by using techniques such as natural-language processing and semantic analysis to help them personalize emails based on readers’ stated and unstated interests (that is, what boxes they check when they sign up and what stuff they actually click on).

Foreman, well, he’s pretty sure that’s too big a challenge for MailChimp to tackle considering how many publishing customers it has. MailChimp would have to understand all those customers’ industries to some degree (open source tools tend to highlight technically but not situationally relevant relationships, he said, and don’t always understand things like sarcasm) and probably the different languages they publish in, as well. Rather than understand content, he’d rather focus personalization efforts around how users are connected.

The company also needs to balance its ambitions with what’s legally and socially acceptable. The creep factor might be more important than what’s legal when it comes to email marketing. MailChimp determines the legality of everything it does before rolling it out, Foreman explained, but in an era of “post-modern spam” where legitimacy is in the eye of the recipient and where some people use their “spam” button as a proxy for unsubscribing, companies must be careful not to offend.

“The more we can tell you about that list without getting creepy is really useful,” Bainton said. However, he added, “I think expectation is more important than law.”

GigaOM