What happens if you can’t believe your own dashboard? Whether it’s for your car, your plane or your computing cloud, it’s not a good thing if the console that’s supposed to tell you what’s really going on just isn’t doing so.
That’s why the recent Heroku-Rap Genius dustup is important. To recap: About two years ago Rap Genius, which runs its Ruby-based application on Heroku’s platform as a service, started noticing performance issues. As traffic grew, it dutifully added more Heroku resources, or “dynos” in Heroku parlance. But performance still lagged. Rap Genius dealt with lots of customer complaints, although its Heroku log files and related New Relic dashboard said nothing was amiss.
Customer mandate: transparency and trust
It turns out that Heroku, the PaaS company acquired by Salesforce.com in 2010, had tinkered with the routing underpinnings of its site in such a way that incoming requests were not being routed to dynos optimally. The issue was this move from “intelligent load distribution” to “random load distribution,” plus the fact that the change was not documented, let alone publicized to customers.
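To get a feel for why that switch matters, here is a rough Python sketch that compares the two strategies on a toy workload. It is an illustration under stated assumptions, not a model of Heroku’s actual router: each dyno is assumed to handle one request at a time, “intelligent” routing is idealized as sending each request to whichever dyno will be free soonest, and the function names, dyno count and traffic figures are all made up for the example.

```python
import random


def simulate(route, n_dynos=10, n_requests=5000, seed=1):
    """Return the mean time a request waits in a dyno queue before being served.

    Assumptions (hypothetical, for illustration only): each dyno serves one
    request at a time, service times average 1.0 time unit, and traffic
    arrives at about 80 percent of total capacity.
    """
    traffic_rng = random.Random(seed)      # arrivals and service times
    routing_rng = random.Random(seed + 1)  # routing choices, kept separate so
                                           # both strategies see the same traffic
    free_at = [0.0] * n_dynos              # when each dyno finishes its backlog
    now = 0.0
    total_wait = 0.0
    for _ in range(n_requests):
        now += traffic_rng.expovariate(n_dynos * 0.8)  # next arrival time
        service = traffic_rng.expovariate(1.0)         # how long this request takes
        dyno = route(free_at, routing_rng)
        start = max(now, free_at[dyno])                # wait if the dyno is busy
        total_wait += start - now
        free_at[dyno] = start + service
    return total_wait / n_requests


def random_routing(free_at, rng):
    """Pick any dyno, regardless of how much work it already has queued."""
    return rng.randrange(len(free_at))


def least_busy_routing(free_at, rng):
    """Idealized 'intelligent' routing: pick the dyno that frees up soonest."""
    return min(range(len(free_at)), key=free_at.__getitem__)


if __name__ == "__main__":
    print("random routing, mean wait:    ", simulate(random_routing))
    print("least-busy routing, mean wait:", simulate(least_busy_routing))
```

In this toy setup, random routing leaves requests stuck behind slow ones on busy dynos while other dynos sit idle, so the average wait comes out far higher than with least-busy routing. Note that the waiting happens in the queue, which is exactly the kind of delay that goes unnoticed if the dashboard only measures time spent inside the application.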
In a February 13 Rap Genius blog post detailing the issue, the company said:
“A Rails dyno isn’t what it used to be. In mid-2010, Heroku quietly redesigned its routing system, and the change — nowhere documented, nowhere instrumented — radically degraded throughput on the platform. Dollar for dollar a dyno became worth a fraction of its former self.”
That blog post generated a ton of “up-votes” on Hacker News, and Heroku issued an apology, which TechCrunch covered.
Rap Genius co-founder Tom Lehman described to me what happened in a recent phone interview. “We had been running 90 dynos at $20,000 a month which we thought was sufficient based on the incorrect data we were getting but it turned out that 90 dynos was woefully inefficient. So we upgraded to 300 dynos at $40,000 per month and performance is still bad. We can’t pay $40,000 a month for this.”
On February 16, Heroku issued a more detailed apology and outlined a plan of action including:
- Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
- Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
- Adding metrics that let customers determine queuing impact on application response times
- Providing additional tools that developers can use to augment our latency and queuing metrics
- Working to better support concurrent-request Rails apps on Cedar
When asked for comment, Heroku referred back to its blog post.
Lehman said his company is in a tight spot. It can’t sustain payments of $40,000 per month. “Unless something changes we have to move.”
The likely destination? Amazon Web Services, a transition he would not take lightly because Heroku does much that AWS cannot. On the other hand, many of Rap Genius’ third-party providers are already on AWS. “I still have love for Heroku. Without it we couldn’t get to where we are today but they have not been 100 percent upfront with customers.”
In his view, this should not be the end of the story. “We feel Heroku (and therefore Salesforce.com) overcharged and misled a bunch of small (and big!) start-ups and if they indeed did something wrong they should be held accountable.”
The bigger picture
I’ve asked Lehman if he is party to the class action suit and will update when he responds, but let’s get back to the broader issue.
Companies already get the heebie-jeebies over the perception that moving to the cloud involves a “loss of control” over their IT. Imagine the impact if they think they can’t trust or believe in the metrics they’re given by their providers.
This is about way more than Heroku and Rap Genius. It’s about customer trust, and a lack of it is a real danger to cloud adoption.
Photo courtesy of Shutterstock user 3Art