These days, it is impossible to work in the analytics industry without hearing phrases such as “data science” or “big data,” but the role of an actual data scientist is still relatively new. Just check out this Google Trends graph for the search term “data scientists” and you’ll see that ten years ago, the term didn’t even exist.
The graph raises an important question: if the role of a data scientist is so new, how do you know if you need to hire one?
Data Science’s Low Barrier to Entry
Today, the industry’s most famous and common machine learning algorithms are available for free via dozens of open source libraries (perhaps the most common are NumPy and SciPy, two Python libraries). A library is software that somebody else has written and that you build on top of.
Given that companies like Google, Netflix and many others all use the same fundamental data science concepts, these libraries are extremely powerful tools. Not only are these libraries free to use, but there are thousands of tutorials and books which help users understand exactly how to use them.
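To make "building on top of a library" concrete, here is a minimal sketch that fits a straight line with NumPy. The helper name fit_line and the sample points are made up for illustration; the only library call doing real work is numpy.linalg.lstsq.

```python
import numpy as np


def fit_line(x, y):
    """Fit y ~ slope * x + intercept by ordinary least squares.

    fit_line is a hypothetical helper written for this example.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: one column for x, one column of ones for the intercept.
    design = np.column_stack([x, np.ones_like(x)])
    # NumPy solves the least-squares problem; we never write the solver.
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    slope, intercept = coeffs
    return float(slope), float(intercept)


# Illustrative points that lie exactly on the line y = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(x, y)
print(slope, intercept)
```

The numerical work sits entirely inside the library call. Everything around it is ordinary glue code, which is why the barrier to entry is so low.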
So Why Hire A Data Scientist?
If the industry uses the same subset of algorithms, and these algorithms are free to use, and there are thousands of guides to help build them — why do you need a data scientist? Shouldn’t any motivated engineer be able to get the job done?
Not quite. In the same way you don’t have to be a mechanic to drive a car, you don’t need a Ph.D. to understand basic data science. Of course, that doesn’t mean every person with a driver’s license is able to pop the hood and start making adjustments. While a first version of a machine learning algorithm can be built in a matter of days (at Umbel we built our first ad recommendation engine in less than two days), optimization and scaling are just two of the myriad problems that require a deep understanding of the methods employed.
Scaling software to be able to handle thousands of queries, process billions of data points and work in a matter of seconds is an extremely challenging problem. It is far easier said than done. Moreover, open source libraries aren’t efficient enough to process information at the scale most companies need. Instead, these libraries help validate proofs of concept quickly. Production machine learning algorithms are often built from scratch, in a language that fits well with the needs and technical environment of the company. Optimization means being able to understand why an algorithm is or isn’t performing well. It means popping open the hood and tweaking the engine to make the car drive exactly how you want it to.
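As a rough illustration of how quickly a library gets you to a proof of concept, and where it stops, here is a toy item-to-item recommender in NumPy. It is a sketch under stated assumptions, not Umbel's engine: the recommend function, the cosine-similarity approach, and the click matrix are all invented for this example.

```python
import numpy as np


def recommend(interactions, user, k=2):
    """Return up to k unseen item indices for one user.

    interactions: users x items matrix, 1 if the user clicked the item.
    Hypothetical helper; scoring uses item-item cosine similarity.
    """
    interactions = np.asarray(interactions, dtype=float)
    # Length of each item column; guard against items nobody clicked.
    norms = np.linalg.norm(interactions, axis=0)
    norms[norms == 0] = 1.0
    normalized = interactions / norms
    # Dense items x items similarity matrix: fine for a demo, but it
    # grows quadratically with the number of items.
    similarity = normalized.T @ normalized
    # Score each item by its similarity to what the user already clicked.
    scores = similarity @ interactions[user]
    # Never recommend something the user has already seen.
    scores[interactions[user] > 0] = -np.inf
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:k] if np.isfinite(scores[i])]


# Made-up click data: 4 users (rows) by 4 ads (columns).
clicks = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
top_pick = recommend(clicks, user=0, k=1)
print(top_pick)
```

A handful of lines is enough to check whether the idea has merit. The dense similarity matrix is also exactly the kind of shortcut that falls over at billions of data points, which is where the scaling and optimization work begins.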
Scaling and optimization problems require a deep understanding of algorithms, statistics, and linear algebra concepts. While an engineer could certainly learn these concepts eventually, machine learning academia focuses heavily on solving exactly these kinds of problems. In these spaces, a Ph.D. certainly wouldn’t hurt.
Of course, Umbel’s platform is the only data scientist you need.
P.S.: If you’re interested in learning more about data science, our engineering team recommends the following resources:
Online Tutorial/Guide: A Programmer’s Guide to Data Mining
Sample Dataset: Iris Dataset