Pinterest, Flipboard and Yelp tell how to save big bucks in the cloud

Amazon Web Services can be a great platform for startups when they’re small, but costs can outpace revenue growth pretty quickly, especially if you’re offering a free consumer service. At AWS’s re:Invent user conference last week, engineers from Pinterest, Flipboard and Yelp shared their impressive and sometimes ingenious techniques for keeping costs under control and their bottom lines healthy.

Pinterest Operations Engineer Ryan Park had the stage to himself for a session on Wednesday, while Flipboard Chief Architect Greg Scallon and Yelp Engineering Manager Jim Blomo teamed up with Kleiner Perkins Caufield Byers Partner Ray Bradford to form a trifecta of wisdom on Thursday.

Know — and measure — your costs

Flipboard’s Scallon had a paradoxical lesson for the audience when it comes to managing cloud-based infrastructure: Embrace the cloud, but be afraid of the cloud. Yes, it’s flexible and affordable if done right, but all it takes is poor planning or a handful of servers left running ad infinitum, and the costs can begin to grow out of control. That’s why Flipboard assigns members of its engineering team the title of “chief miser,” which means they’re the ones who make sure that applications are using the right resources and using them wisely.

Thanks to a variety of practices, including its miserly ways, Scallon said Flipboard is now running about 900 instances at any given time. That’s down from a peak of about 1,500.

Some stats on Flipboard's AWS usage


One way to help ensure this sort of lean operation is to understand your business inputs and outputs, Kleiner Perkins’s Bradford explained. He suggests companies ask, for example, what it costs them to serve a free user on their platform, and how that cost changes with scale or affects the experience they can offer premium users. Pick metrics that really matter, he said (e.g., infrastructure cost per user per month) and then consider how long your current architecture can sustain that cost before it’s time to retool.
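The metric Bradford mentions is simple arithmetic, and tracking it over time shows whether an architecture is getting cheaper or more expensive per user as it scales. The sketch below is only an illustration: the function names and the monthly figures are hypothetical and do not come from any of the companies in this article.

```python
def cost_per_user_per_month(monthly_infra_cost: float, monthly_active_users: int) -> float:
    """Infrastructure dollars spent per active user in one month."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return monthly_infra_cost / monthly_active_users


def is_rising(series: list[float]) -> bool:
    """True if every value is strictly higher than the one before it."""
    return all(later > earlier for earlier, later in zip(series, series[1:]))


# Hypothetical history: (monthly infrastructure bill in dollars, active users).
history = [(30_000.0, 1_000_000), (45_000.0, 1_200_000), (66_000.0, 1_500_000)]
metric = [cost_per_user_per_month(cost, users) for cost, users in history]

for month, value in enumerate(metric, start=1):
    print(f"month {month}: ${value:.4f} per user")

if is_rising(metric):
    print("Cost per user is climbing; it may be time to revisit the architecture.")
```

In this made-up history the user base grows, but the bill grows faster, which is the kind of trend the metric is meant to surface.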

The secret weapon: Source your instances wisely

Pinterest, Yelp and Flipboard all swear by AWS’s pre-paid Reserved Instances in order to save money over the long haul. In fact, Flipboard’s Scallon said, the e-reading startup sees cost savings of about 80 percent over three years by using heavy-duty Reserved Instances instead of on-demand instances for its base workloads, and the break-even point might be only eight or nine months. Pinterest’s Park cited savings of about 70 percent over three years using them.
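To see where figures like these come from, here is a rough sketch of the break-even and savings arithmetic for an instance that runs around the clock. The prices are hypothetical placeholders, not actual AWS rates, and the model is deliberately simplified (one upfront fee plus a lower hourly rate, with an assumed 730 hours per month).

```python
HOURS_PER_MONTH = 730.0  # assumption: average hours in a month


def break_even_months(upfront_fee: float, reserved_hourly: float, on_demand_hourly: float) -> float:
    """Months of continuous use before the upfront fee is paid back."""
    hourly_saving = on_demand_hourly - reserved_hourly
    if hourly_saving <= 0:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (hourly_saving * HOURS_PER_MONTH)


def savings_fraction(upfront_fee: float, reserved_hourly: float,
                     on_demand_hourly: float, term_months: int) -> float:
    """Fraction of the on-demand bill saved over the whole term."""
    hours = term_months * HOURS_PER_MONTH
    on_demand_total = on_demand_hourly * hours
    reserved_total = upfront_fee + reserved_hourly * hours
    return 1 - reserved_total / on_demand_total


# Hypothetical prices for one always-on instance on a three-year term.
upfront, reserved_rate, on_demand_rate = 2_800.0, 0.05, 0.50

print(f"break-even after {break_even_months(upfront, reserved_rate, on_demand_rate):.1f} months")
print(f"savings over 36 months: {savings_fraction(upfront, reserved_rate, on_demand_rate, 36):.0%}")
```

With these placeholder prices the reservation pays for itself in under nine months. The calculation only holds for base workloads that really do run continuously; an instance that sits idle most of the time takes much longer to break even.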


The trick is queuing another job to take up the waste.

Yelp’s Blomo said his company is a heavy Elastic MapReduce (EMR) user, peaking at more than 350 EMR instances when many developers run their Hadoop jobs simultaneously or when it’s doing nightly analysis of its log files. In order to keep costs in check, Yelp uses Reserved Instances whenever possible to save on hourly bills and has implemented a job-flow pooling system to keep Hadoop jobs running continuously as resources become available. This helps avoid the situation where a job completes in 61 minutes, for example, thus triggering the charge for a full second hour of resources even though it used only a minute’s worth of that hour.
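The 61-minute example comes down to rounding every run up to whole hours. The sketch below illustrates why pooling helps under that billing model. It is not Yelp's pooling system: the function names and job durations are hypothetical, and it assumes jobs can be run back to back on a cluster that is already paid for.

```python
def billed_hours(runtime_minutes: int) -> int:
    """Hours charged when every started hour is billed in full."""
    return (runtime_minutes + 59) // 60  # integer ceiling division


def wasted_minutes(runtime_minutes: int) -> int:
    """Paid-for minutes that the job did not use."""
    return billed_hours(runtime_minutes) * 60 - runtime_minutes


def separate_billed_hours(job_minutes: list[int]) -> int:
    """Each job starts its own cluster, so each run is rounded up on its own."""
    return sum(billed_hours(minutes) for minutes in job_minutes)


def pooled_billed_hours(job_minutes: list[int]) -> int:
    """Jobs run back to back on one cluster, so only the total is rounded up."""
    return billed_hours(sum(job_minutes))


jobs = [61, 45, 14]  # hypothetical job runtimes in minutes

print("waste from the 61-minute job alone:", wasted_minutes(61), "minutes")
print("billed hours, separate clusters:", separate_billed_hours(jobs))
print("billed hours, pooled:", pooled_billed_hours(jobs))
```

Queuing the shorter jobs into the idle remainder of the second hour is what turns four billed hours into two in this example.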

In order to best gauge when it should use what type of instance, Yelp created a tool called EMRio that analyzes past usage to determine what resources are the most efficient choice for any given job.


The results of EMRio

When it comes to optimizing costs on AWS, though, Pinterest appears to have it all figured out, even how to make use of the somewhat tricky Spot Instances, which are priced based on demand and can be terminated without notice if the market price exceeds a user’s bid. Park explained how Pinterest uses the heck out of Reserved Instances and created its own auto-scaling “watchdog” service that decides whether to use Spot Instances or on-demand instances when more resources are required.
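The talk did not describe how the watchdog makes its choice, so the following is only a guess at the shape of such a decision, not Pinterest's implementation. The `Market` type, the `choose_purchase_type` function, the bid ceiling and the prices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Market:
    spot_price: float       # current spot price per instance-hour (hypothetical)
    on_demand_price: float  # fixed on-demand price per instance-hour (hypothetical)


def choose_purchase_type(market: Market, max_bid: float, interruptible: bool) -> str:
    """Pick 'spot' only when the workload tolerates termination and spot is cheap enough."""
    spot_is_affordable = market.spot_price <= max_bid
    spot_is_cheaper = market.spot_price < market.on_demand_price
    if interruptible and spot_is_affordable and spot_is_cheaper:
        return "spot"
    return "on-demand"


calm = Market(spot_price=0.08, on_demand_price=0.50)
spike = Market(spot_price=2.00, on_demand_price=0.50)

print(choose_purchase_type(calm, max_bid=0.30, interruptible=True))    # spot
print(choose_purchase_type(spike, max_bid=0.30, interruptible=True))   # on-demand
print(choose_purchase_type(calm, max_bid=0.30, interruptible=False))   # on-demand
```

The fallback to on-demand is the important part of the design: when the spot price spikes past the bid, capacity still arrives, just at a higher price.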

Ryan Park dropping knowledge -- and graphs


Although Spot Instance prices occasionally spike through the roof, Park’s experience is that they typically remain stable and can result in “massive” savings if you know how to use them effectively. Using Spot Instances to power Pinterest’s approximately 80 front-end servers costs only about $20 per hour, he said. All told, Pinterest has reduced its daily computing bill to about $440 from about $1,200.
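Taking Park's rounded figures at face value, a quick calculation puts them in per-server and percentage terms. The inputs are the approximate numbers quoted above, so the results are rough as well.

```python
# Approximate figures quoted in the talk.
front_end_servers = 80
spot_cost_per_hour = 20.0    # dollars per hour for the whole front-end fleet
daily_bill_before = 1_200.0  # dollars per day
daily_bill_after = 440.0     # dollars per day

cost_per_server_hour = spot_cost_per_hour / front_end_servers
daily_reduction = 1 - daily_bill_after / daily_bill_before

print(f"about ${cost_per_server_hour:.2f} per front-end server per hour")
print(f"daily bill reduced by about {daily_reduction:.0%}")
```

That works out to roughly 25 cents per server-hour for the front end and a daily bill a little under two-thirds lower than before.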

All this being said, though, Park, Blomo and Scallon all acknowledged that the flexibility of being able to mix on-demand, reserved and spot servers might not be all it’s cracked up to be if you don’t understand how they all work. Reserved Instances are inflexible in terms of size and region once you reserve them, and Spot Instances must be used wisely for jobs or applications that can handle their easy-come, easy-go nature. And now there’s even more to consider because Reserved Instances can be resold via AWS’s Reserved Instance Marketplace.

“It gets a little tricky,” Blomo said.

Pick your challenges

Although decisions such as database type and structure are largely architectural, there might be elements of cost efficiency at play, as well. Maybe Kleiner Perkins’s Bradford put it best while leading off the session with Scallon and Blomo. Bradford presented a slide containing a simple quote from Instagram Founder Mike Krieger: “Your users around the world don’t care that you wrote your own database.” Sometimes, Bradford added, it might be best to use what works (maybe even a managed service) rather than whatever’s trending highest on Hacker News.

Pinterest’s Park expressed a similar sentiment during his session, citing a lesson his team learned about trying out too many new databases. The site used to use MongoDB, Cassandra, Redis and other databases simultaneously, but learning all the new technologies and managing them became burdensome. Now, he said, Pinterest uses good, old-fashioned MySQL (granted, it sharded MySQL 4,000 times) and memcached — as well as Redis — because they have strong communities and new engineers are more likely to know how to work with them.

After explaining EMRio and some other custom-built Hadoop tools to the crowd, Yelp’s Blomo noted that companies should carefully consider whether the time and money it takes to build stuff will actually result in commensurate savings once those tools or systems are in production. That can require some tough balancing of criteria such as cost, performance, flexibility and user experience.

But it’s important to use human resources wisely. As Bradford said during his presentation, “There’s no free lunch when it comes to developer time.”


GigaOM