How to Think like a Data Scientist, Even If You Aren’t One
If you need a data scientist, and have no inclination to become one, you might be wondering what the point is of learning how to think like one. After all, you can just hire one, right?
While this is certainly true, if you don’t understand how a data scientist thinks, your project has a good chance of failing while also costing you an arm and a leg. Conversely, if you understand how data scientists think and operate, you will be able to lay the groundwork for your project so that it runs more efficiently and delivers better results.
Start with the Problem
The key to applying data science effectively is to define the problem you want to solve as accurately as possible. There is little point to collecting tons of data if you have no clear purpose in mind for it. If you try to define the problem afterwards, you’ll find that either much of the data is useless, the collection method wasn’t appropriate, or you don’t have all the data you need. So, you first need to define a specific problem that you want to solve.
The next step is to find out if it’s worth solving the problem, and if it can be solved using data science. Everything takes time and money, which is why it’s important that you determine if the solution to the problem in question will help move your business forward, and by how much.
For example, if the problem you want to solve will generate an increase of 1% in your sales, yet solving another problem could generate an increase of 25%, it’s clear which one you should pursue. However, if you don’t quantify the potential impact, you will end up wasting time and resources.
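To make that comparison concrete, here is a minimal sketch of how you might rank candidate problems by expected payoff. Every figure in it (the annual sales number and the project costs) is a made-up assumption for illustration; only the 1% and 25% lifts come from the example above.

```python
# Hypothetical figures for illustration only.
annual_sales = 2_000_000  # assumed current annual sales

# Candidate problems: expected sales lift (as a fraction) and assumed project cost.
candidates = {
    "problem_a": {"lift": 0.01, "cost": 50_000},
    "problem_b": {"lift": 0.25, "cost": 120_000},
}


def net_gain(lift, cost, sales=annual_sales):
    """Expected extra sales minus the cost of solving the problem."""
    return sales * lift - cost


# Rank the candidates from most to least valuable.
ranked = sorted(
    candidates,
    key=lambda name: net_gain(**candidates[name]),
    reverse=True,
)

for name in ranked:
    print(name, round(net_gain(**candidates[name])))
```

Even a rough estimate like this is usually enough to show which problem deserves attention first.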
Another important aspect is to determine whether the problem can be solved. I’ve often found that businesses either have the wrong expectations or they don’t have the right data. And there’s also the issue that many try to do it on their own and bring in a data scientist much too late. Then, they expect that person to be able to magically fix everything, which rarely works without a significant investment of time and resources, as well as a lot of aggravation on the data scientist’s part.
So, while I’m certainly not advocating you do this on your own and strongly recommend that you do not make assumptions before consulting with a data scientist, I still feel it’s important to provide you with a few ways of determining if data science can be used to solve a specific problem. To do this, we will be using heuristics.
A Lightning Quick Look at Heuristics
To determine if a problem can be solved using data science, you must be able to phrase it either as a statistical modelling problem, a hypothesis test, a supervised learning problem, or an unsupervised learning problem.
A statistical modelling problem is one in which you are trying to figure out how two variables are related, and whether one has a meaningful effect on the other.
Hypothesis tests are employed when you want to conduct a comparison between two groups. In essence, you are looking to discover whether the two groups differ, and how they differ. A good example of this is A/B testing.
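For instance, a simple A/B test on conversion rates can be checked with a two-proportion z-test. The sketch below is one common way to do it, not the only one, and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 5,000 visitors saw each version of a page.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (0.05 is a common, if arbitrary, cutoff) suggests the difference between the two groups is unlikely to be due to chance alone.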
Supervised learning is a little more complicated and is generally used in scenarios where you want to automate tasks that a human currently has to do. For example, instead of getting someone to sit down and write out labels for all your images, you can automate the job by feeding a machine the raw data together with examples already labeled by a human, so that the machine can replicate the decision-making process.
Unsupervised learning is often used when you have highly complicated datasets with a significant number of variables and you really don’t know what to do with them.
Thus, if your problem can be phrased in any of the above ways, there is a good chance data science can be used to solve it. Once again, I recommend seeking the advice of a data scientist before making any assumptions, as it will save you a lot of time and aggravation.
Clearly, though, you need to have a basic understanding of statistics and machine learning to be able to apply these heuristics to problem-solving.
A Short Explanation of Statistics and Machine Learning
Statistics is a branch of mathematics that has been around for a few hundred years and was the first attempt scientists made at analyzing data systematically. As a field, statistics tends to deal with smaller samples and is very strict in terms of model assumptions and development since it is based on math.
The main difference between statistics and machine learning is that in machine learning the goal is to get the task done and only then ask questions, whereas with statistics, every assumption has to be validated and verified before you can move on to the next step.
At its core, machine learning involves giving computers the ability to learn without being explicitly programmed for the task. Thus, in machine learning, instead of giving the computer a set of rules it has to follow, you provide it with data so it can learn what to do.
There are two main types of machine learning, namely supervised learning and unsupervised learning.
In supervised learning, the algorithm is supervised in some shape or form. Thus, it is provided with the raw data set, but it is also given a target. When the algorithm makes an error, the supervisor attempts to correct it. It’s as if the algorithm is learning with the help of a teacher.
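Here is a deliberately tiny sketch of the idea, using a 1-nearest-neighbor classifier: the labeled examples play the role of the teacher, and a new item gets the label of whichever example it most resembles. The feature values and labels are invented, and real image labeling would use far richer features and models.

```python
from math import dist

# Labeled examples supplied by a human: (features, label).
# The two numbers per item are hypothetical measurements.
training = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((3.0, 3.5), "dog"),
    ((3.2, 2.9), "dog"),
]


def predict(features, examples=training):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, label = min(examples, key=lambda example: dist(example[0], features))
    return label


print(predict((1.1, 0.9)))
print(predict((2.9, 3.1)))
```

Notice that no rule such as "small numbers mean cat" was ever written down; the behavior comes entirely from the labeled examples.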
In unsupervised learning, the machine is provided with the data but there is no target. Instead, the algorithm is allowed to do its own thing and identify patterns in the data on its own.
The problem with unsupervised learning is that it tends to be ill-defined and almost always has to be interpreted by a human at some stage.
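To see both the pattern-finding and the need for interpretation, consider this minimal sketch of k-means clustering on a single variable. The order values are made up, and choosing two groups with those particular starting centers is my assumption, not something the algorithm decides for you.

```python
from statistics import mean


def k_means_1d(values, centers, iterations=10):
    """Plain k-means on a list of numbers, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assign every value to its nearest center.
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical order values with no labels attached.
order_values = [12, 15, 14, 13, 95, 102, 98, 11, 100]
centers, clusters = k_means_1d(
    order_values, centers=[min(order_values), max(order_values)]
)

print(centers)
print(clusters)
```

The algorithm separates the values into two groups, but it cannot tell you what the groups mean. Deciding that one is "small orders" and the other "bulk orders", or that the split is not useful at all, is still up to a person.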
All these concepts – and, of course, the terminology – can seem a little confusing.