How Facebook moved 30 petabytes of Hadoop data

For anyone who didn’t know, Facebook is a huge Hadoop user, and it does some very cool things to stretch the open source big-data platform to meet Facebook’s unique needs. Today, it shared the latest of those innovations — moving its whopping 30-petabyte cluster from one data center to another.

Facebook’s Paul Yang detailed the process on the Facebook Engineering page. The move was necessary because Facebook had run out of both power and space to expand the cluster, very likely the largest in the world, and had to find it a new home. Yang writes that there were two options: physically migrating the machines or replicating the data. Facebook chose replication to minimize downtime.

Once it made that decision, Facebook’s data team undertook a multi-step process to copy over data, trying to ensure that any file changes made during the copying process were accounted for before the new system went live. Perhaps not surprisingly, the sheer size of Facebook’s cluster created problems:

There were many challenges in both the replication and switchover steps. For replication, the challenges were in developing a system that could handle the size of the warehouse. The warehouse had grown to millions of files, directories, and Hive objects. Although we’d previously used a similar replication system for smaller clusters, the rate of object creation meant that the previous system couldn’t keep up.

Ultimately, Yang writes, the migration proved that disaster recovery is possible with Hadoop clusters. That could be an important capability for organizations that, like Facebook, rely on Hadoop (running Hive atop the Hadoop Distributed File System) as a data warehouse. As Yang notes, “Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.”

For Facebook, though, the fast-growing Hadoop data warehouse is just one part of a broader need for more space. Last night, Facebook confirmed it’s building a second data center in Prineville, Ore., next to its existing one. That will make three for the company, which is also building a data center in Forest City, N.C.

Image courtesy of Flickr user daretothink.




GigaOM — Tech News, Analysis and Trends