How Facebook solves the IT culture wars and scales its site

Scaling isn’t just a matter of software and code; there’s also a huge cultural issue at play. At Facebook, solving problems between the engineering and operations teams, quelling employees’ fears about automation-related job loss, and delivering tools to monitor the company’s IT operations all play big roles in helping the site scale to 950 million users.

In a talk at the Surge Conference in Baltimore, Md., Pedro Canahuati, director of production engineering and site reliability at Facebook, explained how the social network keeps the site available, reliable and efficient.

Scale smart, not just fast.

Adding servers is essential to keeping Facebook available to the rapidly growing user base, but to remain reliable for the long term, Facebook needed a system that could scale up in a hurry if it needed to add tens of thousands of servers at a time. When Canahuati arrived at Facebook in August 2009, he said, it took seven weeks to get 10,000 servers into production once they were plugged in. Thanks to the development of Triforce, it now takes seven days to get the site running on 10,000 servers. But in Canahuati’s opinion, the rate of getting software onto servers is still too slow, so two Facebook employees are still working on making that process faster.
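Canahuati didn’t describe how Triforce works under the hood, but the general pattern behind that kind of speedup is fanning provisioning work out in parallel instead of handling machines one at a time. The Python sketch below is purely illustrative, with a hypothetical provision() step standing in for whatever imaging and configuration work Facebook actually does:

```python
import concurrent.futures

# Hypothetical host list; Triforce's internals aren't public, so this only
# illustrates the idea of provisioning a fleet in parallel.
HOSTS = [f"web{i:05d}.example.com" for i in range(10_000)]

def provision(host: str) -> str:
    # Placeholder for the real work: push an OS image, install packages,
    # register the host with load balancers, run health checks, etc.
    return f"{host}: ok"

def provision_fleet(hosts, max_workers=256):
    # Bounded parallelism keeps the rollout fast without overwhelming
    # shared services such as package mirrors or config stores.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(provision, hosts))

if __name__ == "__main__":
    results = provision_fleet(HOSTS[:100])  # small slice for a dry run
    print(f"{len(results)} hosts provisioned")
```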

Building its own tools saves Facebook the headache of dealing with open source code that might not be able to handle the strain of its infrastructure demands, while also keeping Facebook from paying a vendor a license fee that could become astronomical as it scaled. Understanding the process of building the tools that keep the site’s infrastructure up and running, and the impact that process has on the ultimate goal of reliability, is where Facebook has carved out another advantage that other large web services could learn from.

Like his peers at Pinterest, Canahuati stressed that when you are at massive scale, you need to keep it simple.

Make it easy to see what’s wrong.

He also added a few more tenets, including “instrument your world.” He explained how Facebook collects a lot of data across many of its systems in order to understand how the different services that make up the site are performing and interacting. Tools such as Scuba and Claspin are examples of this effort to take complex operational data and make it easy to understand.
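Scuba and Claspin are internal tools and Canahuati didn’t walk through their APIs, but the underlying habit of emitting structured events from everywhere, so they can be aggregated and visualized later, can be sketched in a few lines of Python. This is only a generic illustration, not Facebook’s actual pipeline:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def timed(event: str, **tags):
    # Wrap any operation and emit a structured record describing it.
    start = time.monotonic()
    try:
        yield
    finally:
        record = {
            "event": event,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
            "ts": time.time(),
            **tags,
        }
        # In production this would feed a log pipeline that a tool like a
        # dashboard could query; printing keeps the example self-contained.
        print(json.dumps(record))

with timed("db.query", service="newsfeed", region="us-east"):
    time.sleep(0.05)  # stand-in for real work
```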

“Smart visualizations are often overlooked,” said Canahuati, but scanning data at a glance and then acting quickly on it is an essential ingredient for keeping systems up and running.

Automation can be a dream — or a nightmare.

One of Facebook’s secrets to scale is automation, but automation can create its own problems. At its worst, a team could build a system that automatically brings down the whole site. In other cases, automation can mask a more systemic problem.
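One common defense, sketched here as a hypothetical rather than anything Canahuati described, is to build a safety cap into automated remediation: if the automation is about to touch too much of the fleet at once, that is usually a sign of a systemic problem, and the right move is to stop and page a human instead of letting the tooling “fix” its way into an outage.

```python
# Guardrail sketch, assuming a hypothetical fleet API (restart_host).
MAX_FRACTION = 0.05  # never touch more than 5% of the fleet in one pass

def remediate(fleet_size: int, unhealthy_hosts: list[str]) -> list[str]:
    if len(unhealthy_hosts) > fleet_size * MAX_FRACTION:
        raise RuntimeError(
            f"{len(unhealthy_hosts)} unhealthy hosts exceeds the safety cap; "
            "likely a systemic problem, escalating to a human."
        )
    restarted = []
    for host in unhealthy_hosts:
        # restart_host(host)  # hypothetical fleet API call
        restarted.append(host)
    return restarted

print(remediate(fleet_size=1000, unhealthy_hosts=["web00017.example.com"]))
```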

However, a more persistent worry is that engineers and operations people working to automate certain actions might think they’re coding themselves out of a job. To alleviate those fears, Canahuati says Facebook has implemented several strategies that boil down to keeping a lean team that works on multiple jobs and problems. This ensures that when automation solves one problem, there’s another one waiting in the wings to be solved, and it also sets the expectation that employees are responsible for the whole site broadly and not just one tool set.

This approach carries all the way through to how Facebook manages and hires its employees. It expects the engineering team that builds Facebook products to be aware of the operations side of things and build tools that help the operations team out. Operations employees are expected to be able to code and work with the engineering teams. “The software guys who build the code must take ownership for it working on a big system at scale,” Canahuati said.

But these operationally aware engineering teams and engineering-aware operations teams must have buy-in at the top, because people who code generally cost more, so hiring operational team members who code requires a bigger budget.

The tactic seems to work for Facebook, which is clearly trying to build a culture of responsibility and effectiveness that can scale the same way its servers do. Hence its continued release of tools that others can use to monitor their own giant deployments, as well as its operational slogan: “Fix more, whine less.”

In many ways, Canahuati’s points are good policies for any corporate culture today. Habits such as communication between teams, avoiding blame and hiring employees who think strategically instead of just about their skill sets are helpful no matter how many servers you have.


GigaOM