Edvancer's Knowledge Hub

Essentials for a data scientist

Manu Jeevan 14/07/2018

People commonly expect a data scientist’s skill set to combine subject matter expertise, mathematics, and computer science. Obviously, only a few people fit that description, since mastering all three is no mean feat. “Ignorance is bliss” cannot work for you if you’re in this field. However, I’ve found that the skill set you need to be effective in practice tends to be more targeted and more achievable. This approach changes both your perspective on data science and what you expect from it.

A background in computer science helps with understanding software engineering, but building data products that actually perform requires techniques specific to writing data science code. Subject matter expertise is needed to pose interesting questions and interpret results, but this is often done in collaboration between the data scientist and subject matter experts (SMEs). In practice, the ability to engage SMEs in vigorous experimentation matters much more for data scientists. A background in mathematics and statistics is needed to understand the details of most machine learning algorithms, but applying those algorithms effectively requires a more specific understanding of how to evaluate hypotheses.

Realistic Expectations

In practice, collecting data comes later. Data scientists usually start with a question that they must answer using data that could provide some insight. A data scientist has to be able to come up with a hypothesis that can be used to explain the data. For example, consider a collaboration with HR to find the factors that had a positive effect on employee satisfaction at a firm. After a few short sessions with the SMEs, it was evident that an unsatisfied employee could probably be identified from just a few simple warning signs, which made decision trees (or association rules) a natural choice.
We selected a decision-tree algorithm and used it to produce a tree and error estimates based on employee survey responses.

Once we have a hypothesis, we need to figure out whether it can be trusted. The main challenge in judging a hypothesis is figuring out what available evidence would be useful for the task at hand.

In data science today, we spend far too much time on the intricacies of machine learning algorithms. A machine learning algorithm is to a data scientist what a compound microscope is to a biologist: a source of evidence. The biologist should understand that evidence and how it was produced, but we should not expect the biologist to grind custom lenses or calculate refraction indices.

It is essential for a data scientist to understand an algorithm. But confusion about what that implies causes would-be great data scientists to steer clear of the field, and practicing data scientists to focus on the wrong thing.

Interestingly, the Turing Test offers some insight here. The Turing Test gives us a way to recognize when a machine is intelligent: talk to the machine. If it is difficult to tell whether you are talking to a machine or a person, the machine is intelligent. We can use the same principle in data science. If you can hold an intelligent conversation about the results of an algorithm, then you more or less understand it. In general, here’s what it looks like:

Q: Why are the results of the algorithm X and not Y?
A: The algorithm operates on principle A. Because the circumstances are B, the algorithm produces X. We would have to change things to C to get result Y.

Here’s a more specific example:

Q: Why does your adjacency matrix show a relationship of 1 (instead of 3) between the term “cat” and the term “hat”?
A: The algorithm defines distance as the number of character changes needed to turn one term into another. Since the only difference between “cat” and “hat” is the first letter, the distance between them is 1. If we changed “cat” to, say, “dog”, we would get a distance of 3.

The point is to treat a machine learning algorithm as a scientific apparatus. Know its interface and its output like the back of your hand. Form mental models that allow you to bridge the gap between the two. Rigorously test those mental models. If you understand the algorithm, you can understand the hypotheses it generates, and you can start looking for evidence that will either support or contradict them.

We tend to judge data scientists by how much they’ve stored in their heads. We look for thorough knowledge of machine learning algorithms, a track record in a particular domain, and an overall understanding of computers. I believe it’s better, however, to judge the skill of a data scientist by their history of shepherding ideas through funnels of evidence and arriving at conclusions that are applicable in the real world.
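To make the “cat” and “hat” exchange above concrete, here is a minimal sketch of how such a distance could be computed. It assumes the distance in question is the standard Levenshtein edit distance (insertions, deletions, and substitutions each cost 1); the article does not name the algorithm, so treat that as a guess. The function names `edit_distance` and `adjacency_matrix` are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed metric, see above)."""
    # previous[j] is the distance between the prefix of `a` processed
    # so far and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string costs i
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def adjacency_matrix(terms):
    """Pairwise distances between all terms, as a list of rows."""
    return [[edit_distance(s, t) for t in terms] for s in terms]


terms = ["cat", "hat", "dog"]
matrix = adjacency_matrix(terms)

# Testing the mental model: one changed letter gives 1, three give 3.
assert edit_distance("cat", "hat") == 1
assert edit_distance("dog", "hat") == 3
```

The assertions at the end are the interesting part. They encode the answer from the conversation as a prediction and check it against what the algorithm actually returns, which is a small version of rigorously testing a mental model.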
