Why becoming a data scientist might be easier than you think

Maybe the business world has jumped the gun with all the talk about a looming skills shortage in big data and advanced analytics. There’s mounting evidence that it doesn’t take much to turn a novice programmer or statistician into a perfectly capable data scientist. Maybe all it takes is just some cheap cloud computing servers, or a few weeks studying machine learning with Stanford professor Andrew Ng on Coursera.

Much of this evidence comes via Kaggle, a platform where companies and organizations award prizes for the best solutions to their predictive-modeling needs. In September, for example, I covered a first-time Kaggle user and admitted data science neophyte named Carter S. who won a competition using a simple but effective method he dubbed “overkill analytics.”

Impressive, sure, but Carter builds insurance-industry risk models for a living. While he’s able to learn new techniques such as natural-language processing and social network analysis as he goes, he’s no stranger to a linear regression. But what if someone’s only formal experience with computer science was a single undergraduate programming course?

Ask Luis Tandalla. That was his case before he took a handful of free online classes last year on Coursera. Yet the University of New Orleans senior recently scored his first victory in a Kaggle competition hosted by the Hewlett Foundation where he had to devise a model for accurately grading short-answer questions on exams. Not bad for a college senior who didn’t really know what artificial intelligence and machine learning were before he signed up to learn them.

Once Tandalla got started, he told me, he got passionate about learning more. So he also took Coursera classes on natural-language processing and probabilistic models, began studying on his own outside the online lectures and even got active on Kaggle (this was his first victory in five competitions). He’ll receive his bachelor’s degree in mechanical engineering in May 2013, but now Tandalla says he wants to pursue a master’s degree in machine learning and start his own predictive-software company.

The Coursera connection

Maybe Tandalla isn’t so unique after all. The second- and third-place finishers in the Hewlett Foundation competition, it turns out, also learned machine learning on Coursera. The latter, Xavier Conort, is a 39-year-old actuary from Singapore who just decided to become a data scientist last year and is now Kaggle’s top-ranked competitor.

Stanford professor and Coursera co-founder Andrew Ng, who teaches the machine-learning class that all three top finishers took, doesn’t think their success is just coincidence. If you’re not trying to make the types of contacts students at top universities are after, and your goal isn’t to perform advanced research, he explained, online education platforms such as Coursera (and, I’ll add, Udacity and EdX) can be incredibly valuable.

In particular, Ng said, “Machine learning has matured to the point where if you take one class you can actually become pretty good at applying it.” Familiarity with algebra and probability is certainly helpful, he added, but the only real prerequisite for his course is a basic understanding of programming.

And with machine learning becoming “one of the more highly sought-after skills in Silicon Valley,” Ng said, corporate recruiters say just completing a single course can significantly boost someone’s salary and job prospects at companies where such knowledge is still in short supply.

“I bet many students are going on to [do] great things because of these courses [even if we never hear about it],” Ng said.

Why it works, and why it could change the world

Ng thinks the current incarnation of online education platforms works so well because they’re essentially nurturing the already-talented students who seek them out. Some professionals, he explained, take courses to learn skills such as machine learning or iOS programming that weren’t in vogue or didn’t even exist when they earned their computer science degrees just a decade ago.

Furthermore, with students able to learn at their own pace, there’s a lot of valuable information disseminated in the discussion forums.

Free access to the best teachers around doesn’t hurt either. Ng said he couldn’t teach his course so well if he hadn’t spent so much time living in Silicon Valley learning best practices from some of the smartest computer scientists on the planet. That experience lets him spend less time teaching algorithms for the sake of algorithms and more time talking about how one might actually apply machine learning in the field.

Ng says that’s more important than just understanding the algorithms in a vacuum. He compares it to learning how to write a working computer program rather than learning only the syntax of a programming language without being able to string commands together into something useful. This approach isn’t entirely unique among the new order of online educators: On Udacity, for example, Google VP and Stanford professor Sebastian Thrun centers the Computer Science 101 curriculum around learning Python in the context of building a working search engine.

The value of this opportunity wasn’t lost on Tandalla. He said he can feel the passion that professors have even through the pre-recorded video lectures, and it feels good knowing you’re learning from the people who literally wrote the book on the subject you’re studying.

Who knows who’s the next Einstein

But ultimately, minting new data scientists, even Kaggle winners, is low-hanging fruit. Ng said we don’t yet know how much impact online education platforms like Coursera can have. In all fields, there are talented people all over the world who just need an avenue to hone their skills and a chance to distinguish themselves.

“It makes me wonder,” Ng said, “if the next Albert Einstein is a little girl in Afghanistan who just needs [the opportunity to access quality education].”


GigaOM