If there’s one thing that web startups, and particularly the engineers at those startups, think about a lot, it’s what they call “scale.” What they usually mean by that is the ability to take the beta service tested with a dozen friends and turn it into a globe-spanning colossus, with millions of users interacting simultaneously, all while ensuring that those users don’t experience delays in the service. But the route to those twin goals is never an easy one, says former Twitter engineer Alex Payne, who now works for a financial startup called BankSimple. As Payne notes in a blog post, focusing on speed can not only send you in the wrong direction, but also leave you high and dry when you are in desperate need of true scalability.
The impetus for Payne’s post was an ongoing discussion about Node, a software platform for running JavaScript code on the server rather than in the browser. But alongside his comments on that specific topic, the software engineer noted that many startups confuse engineering for speed with the ability to build something that can really scale. He writes that scaling is so hard that “the ability to scale is a deep competitive advantage of the sort that you can’t simply go out and download, copy, purchase, or steal.” As he noted later in his post, the availability of high-powered computing systems and plenty of bandwidth is great for speed, but that doesn’t solve the scale problem:
The power of today’s hardware is such that, for example, you can build a web application that supports thousands of users using one of the slowest available programming languages, brutally inefficient datastore access and storage patterns, zero caching, no sensible distribution of work, no attention to locality, etc. etc. Basically, you can apply every available anti-pattern and still come out the other end with a workable system, simply because the hardware can move faster than your bad decision-making.
When it comes to truly scaling to Twitter or even Facebook size, however, those stop-gap solutions don’t really work any more, Payne says.
When your system is faced with a deluge of work to do, no one technology is going to make it all better. When you’re operating at scale, pushing the needle means a complex, coordinated dance of well-applied technologies, development techniques, statistical analyses, intra-organizational communication, judicious engineering management, speedy and reliable operationalization of hardware and software, vigilant monitoring, and so forth. Scaling is hard.
Debates about scale aren’t just an esoteric discussion of interest to engineers and developers. As Twitter’s repeated issues with reliability have shown, getting the right architecture in place to grow quickly and seamlessly — that is, the right combination of both software and hardware — is incredibly important, as Om noted, because it’s very difficult to re-engineer a service as large and fast-growing as Twitter is after the fact. It’s a little like realizing that you have the wrong kind of airplane, and then trying to convert the one you have into the one you need, all while you are still flying, and without crashing or disturbing your passengers.
Twitter investor and VC Fred Wilson said recently that the service repeatedly breaks because “it wasn’t built right — [it] was built kind of as a hack and they didn’t really architect it to scale and they’ve never been able to catch up.” In the past, Twitter’s own founders have admitted that the architecture they chose couldn’t keep up with the company’s growth, in part because they didn’t expect the SMS-style service they started with to become such a widely used form of communication — used not just for personal updates, but for everything from breaking news stories to providing customer support for major corporations.
In his blog post, Payne notes that Twitter has solved some of its small problems with software changes, but that the service “is still fighting an uphill battle” to scale in a more substantial way. And unfortunately for every startup that is hoping something from the NoSQL movement, a specific development language, or Node will be the magic ingredient that transforms its service into one with Facebook scale, Payne adds that “there are no panaceas for problems of significant scale.” For more on how Facebook has managed to grow to serve more than 500 million users and handle 100 billion hits a day, check out this recent post from the social network’s head of engineering.