There's something of a cottage industry growing up around demystifying the data scientist.
A large part of the confusion, I think, comes down to the fact that data science isn't a specific role, but rather an approach to solving real problems within an organization. As such, it can represent a wide range of backgrounds, skill sets, tools and practices. No single summary of 'what makes a data scientist' can cover the gamut of what you'll see in the field among working data scientists.
Instead, it's probably easier to ask what the practice of data science is.
Briefly, data science is the practice of solving a practical problem with data-driven answers. The techniques for doing so can be wide-ranging. You'll often hear about data scientists using classical statistics, Bayesian methods, machine learning, computational tools and domain knowledge to solve these problems.
Sometimes the data comes at tremendous scale and some very sophisticated tools and methods are needed to pierce the nebulous fog and pinpoint a clear insight hidden within the data. But no single method, tool or equation can answer every problem, so there is no one thing that defines a data scientist.
Perhaps the best thing to do is to show, not tell. Let's step through a typically atypical day in the life of a data scientist.
First thing in the morning is probably the closest thing to a routine that exists in my day.
Our team starts the day with a status meeting to share the progress and problems from the day before. This is a bit different from your standard software 'stand-up' meeting because progress for us can be anything from building a piece of software to reading an academic paper that leads us towards a deeper understanding of a problem at hand.
While data science is different in many ways from the academic scientific process you'd find at a university, it is still a legitimate application of the scientific method.
Often, our challenge is to turn the unknown into the knowable. And furthermore, into the actionable.
This means testing a hypothesis by analyzing data, constructing methods to measure outcomes, and iterating on that process until our findings are sufficiently refined that they're useful. Our morning meetings are an opportunity to share how each of these experiments is progressing.
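To make "testing a hypothesis by analyzing data" a little more concrete, here is a minimal sketch of one common approach, a permutation test on the difference between two group means. It is only an illustration, not a description of any particular experiment of ours: the function name `permutation_test`, its parameters and the sample numbers are all made up for this example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration only. The null hypothesis is
    that the group labels are interchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign the labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements, purely to show the call.
control = [1, 2, 3, 4, 5]
variant = [101, 102, 103, 104, 105]
p_value = permutation_test(control, variant)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be an accident of how the groups were labeled. Deciding what to measure, and what counts as small enough, is the part that takes the iterating.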
Now that we all have our bearings, it's time to do some real, actual work. This is the fun part of the day. This is when I sit down and focus on a particular problem.
Maybe that means researching some methods for traversing a bi-graph or writing some code that will calculate a Gaussian hypergeometric function or modeling some probability distributions. But it's rarely the same problem from one week to the next. Having the background in mathematics, statistics and programming is essential to being able to approach these problems, but it's not enough. There's just no way to be an expert in every possible method. And there's no way to predict exactly what you'll need to know for the next problem.
This is why data scientists need to be boundlessly curious and constant learners.
More often than not, each new question requires a new approach. Possibly not just new to you, but new in the world, in its own little way. It's both the challenge and the excitement of data science.
Uncertainty isn't just a statistical property, it's a way of life.
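For a flavor of what one of those one-off problems can look like, here is a rough sketch of calculating a Gaussian hypergeometric function by summing its defining power series. It assumes real arguments with |z| < 1 and a value of c that is not zero or a negative integer. The name `hyp2f1` and the tolerance settings are choices made for this example, and in practice a well-tested library routine would usually be the better option.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-12, max_terms=10_000):
    """Sum the Gauss hypergeometric series 2F1(a, b; c; z) for |z| < 1.

    Illustrative sketch only: the series is
        sum over n >= 0 of (a)_n (b)_n / (c)_n * z**n / n!
    where (x)_n is the rising factorial. c must not be zero or a
    negative integer, or a division by zero will occur.
    """
    if abs(z) >= 1:
        raise ValueError("this series only converges for |z| < 1")

    term = 1.0   # the n = 0 term
    total = 1.0
    for n in range(max_terms):
        # Ratio of term n+1 to term n in the series.
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            return total
    raise ArithmeticError("series did not converge within max_terms")


# Sanity check against a known closed form:
# 2F1(1, 1; 2; z) = -ln(1 - z) / z
z = 0.5
approx = hyp2f1(1.0, 1.0, 2.0, z)
exact = -math.log(1.0 - z) / z
print(abs(approx - exact) < 1e-9)
```

Checking against a closed form like this is a cheap way to build confidence before trusting the code on inputs where no simple answer is known.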
The practice of data science extends beyond the technical details of implementing some algorithm or white-boarding out arcane math.
Fundamentally, we are engaged in solving a real-world problem. That means understanding the problems people face. This is a great time to meet with clients, business development, services, client success and anyone else to get a full, well-rounded understanding of the problems our clients and partners face, day-to-day.
All the power of big data analytics and machine learning doesn't mean much at all unless it makes someone's life easier.
There's an important, but regularly ignored, step in the process of data science that will ultimately determine the success or failure of any project. It's the bit that truly distinguishes data science from, say, academic science.
Translating a business problem into a rigorous research project, and then translating the results back into a practical solution, requires a deep understanding of the business domain and no small amount of creativity.
A data science team will never be successful working in seclusion, optimizing some algorithm endlessly. Sure, that's sometimes what's required to drive a project over the finish line. But what does it matter if it can't be put to use?
Keeping a close relationship with the people out in the field and with their everyday challenges is the only way to bridge the gap between the data itself and what it can tell us about the world.
So now that we've had a chance to roll up our sleeves and dig into the problems both from the technical and practical side of things, it's time to take a step back and consider the big picture.
We'll usually spend some time in the afternoons discussing the course of a project in detail or road-mapping the next steps to bring a research project into a deliverable format.
Frequent contact with the product team ensures our work stays aligned with the overall vision and course of the organization. It's important to keep ourselves focused. Because we're in the business of solving problems, we have to make sure our solutions work. I mean, actually work in practice. A proof-of-concept is just the beginning.
We need to build reliable, repeatable tools.
That includes both generalizing an experiment to apply to broader use cases, and engineering a solution that can go into our product.
Here's where we do our best impression of software developers and concern ourselves with performance and stability and scalability and writing a bunch of tests to ensure all of that. Our outstanding engineering team here at Umbel has built a tremendous system, and they don't need us giving them the extra work of cleaning up our messes.
The only way to be able to migrate these experiments into a functional piece of software is to keep that goal in mind through the entire course of a project. This is why the research stage is more than just the math involved.
We need to know that we can not only solve the problem, but build software to solve the problem under realistic constraints.
END OF DAY
After all of this, it's important to take a deep breath and look at just how far you've come. Some days it's further than others. Most experiments will fail.
Most approaches will require honing and adjustment before they'll be ready for prime time. There are rarely any guarantees that you're on the right path. This is the uncertainty that comes along with breaking new ground. But every day you go through this process you invariably learn something new. At the end of the day, we will reflect on what we've learned and take that new bit of knowledge about the world into tomorrow's morning meeting.
With enough effort we end up with a straightforward, actionable answer to a particular question.
And we can all be confident in that answer because it's supported, rigorously, by the data. Now, we're not done when the code compiles. As I said, the critical component of a data science project is translating the results into something meaningful and useful.
We have to take what we've learned and effectively communicate it to a varied audience. Fundamentally, we need to tell a story with the data. As is true of every other stage in the process, there's no one right way to construct a data story. This is where we look for ways to visualize our results in intuitive charts, or assemble a deck that walks through the problem to its solution, or even just say "OK, here's what we're gonna do."
The point is, once we're done, everyone should understand what to do, and why.
You may have seen one of the many diagrams floating around online showing all the overlapping skills needed for a good data scientist.
They'll show you that you need a researcher who can hold their own as a software engineer, who is a natural mathematician and possibly an MBA who creates elegant visualizations of their coffee consumption as a weekend hobby.
Hopefully, taking a peek into what data scientists are doing daily helps to explain where those diagrams are coming from. But data science isn't a job description, it's a process. It's true, you need all of those skills, and more, to successfully carry out a data science project.
But don't get hung up thinking you need them all in a single person. Data science, like any sufficiently complicated endeavor, is a team effort. That's why you'll rarely find any two data scientists who fit the same profile. A great data science team will run the gamut of those skills with a mixture of specialists and generalists, who all share a fundamental curiosity.
If there's one thing that truly defines what a data scientist must do well, it's learn.