Edvancer's Knowledge Hub

Essentials for a data scientist

People commonly expect a data scientist’s skill set to combine subject matter expertise, mathematics, and computer science. Few people fit that description, since mastering all three is no mean feat. At the same time, “ignorance is bliss” cannot work for you if you’re in this field. In practice, however, I’ve found that the skill set you need to be effective tends to be more targeted and more achievable. Seeing it this way changes both your perspective on data science and what you expect from it.

A background in computer science helps with understanding software engineering, but building data products that actually perform requires specific techniques for writing data science code that delivers. Subject matter expertise is needed to pose interesting questions and interpret results, but this is often done in collaboration between the data scientist and subject matter experts (SMEs). In practice, the ability to engage SMEs in vigorous experimentation matters much more for a data scientist. A background in mathematics and statistics is needed to understand the details of most machine learning algorithms, but applying those algorithms effectively requires a more specific understanding of how to evaluate hypotheses.

Realistic Expectations

In practice, collecting data comes later. Data scientists usually start with a question that they must answer using data that could provide some insight. A data scientist has to be able to come up with a hypothesis that explains the data. For example, we collaborated with HR in an effort to find the factors that had a positive effect on employee satisfaction at a firm. After a few short sessions with the SMEs, it was evident that you could probably identify an unsatisfied employee from just a few simple warning signs, which made decision trees (or association rules) a natural choice.
We selected a decision-tree algorithm and used it to produce a tree and error estimates based on employee survey responses.

Once we have a hypothesis, we need to figure out whether it can be trusted. The main challenge in judging a hypothesis is working out which of the available evidence is useful for the task at hand. In data science today, we spend far too much time focusing on the intricacies of machine learning algorithms. A machine learning algorithm is to a data scientist what a compound microscope is to a biologist: a source of evidence. The biologist should understand that evidence and how it was produced, but we should not expect the biologist to grind custom lenses or calculate refraction indices.

It is essential for a data scientist to be able to understand an algorithm. But confusion about what that implies causes would-be great data scientists to steer clear of the field, and practicing data scientists to focus on the wrong thing. The Turing Test offers a useful insight here. It gives us a way to recognize when a machine is intelligent: talk to the machine. If it is difficult to tell whether you are talking to a machine or a person, the machine is intelligent. We can use the same principle in data science. If you can hold an intelligent conversation about the results of an algorithm, then you more or less understand it. In general, here’s what it looks like:

Q: Why are the results of the algorithm X and not Y?
A: The algorithm operates on principle A. Because the circumstances are B, the algorithm produces X. We would have to change things to C to get result Y.

Here’s a more specific example:

Q: Why does your adjacency matrix show a relationship of 1 (instead of 3) between the term “cat” and the term “hat”?
A: The algorithm defines distance as the number of single-character changes needed to turn one term into another.
Since the only difference between “cat” and “hat” is the first letter, the distance between them is 1. If we replaced “cat” with, say, “dog”, we would get a distance of 3.

The point is to engage with a machine learning algorithm as a scientific apparatus. Know its interface and its output like the back of your hand. Form a mental model that bridges the gap between the two, and test that model rigorously. If you understand the algorithm, you can understand the hypotheses it generates, and you can start looking for evidence that either supports or contradicts them.

We tend to judge data scientists by how much they’ve stored in their heads. We look for thorough knowledge of machine learning algorithms, a track record of experience in a particular domain, and an overall understanding of computers. I believe it’s better, however, to judge the skill of a data scientist by their history of shepherding ideas through funnels of evidence and arriving at conclusions that are applicable in the real world.
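
To make the “cat” and “hat” exchange above concrete, here is a minimal Python sketch of the kind of distance that answer describes. It assumes the distance in question is the Levenshtein edit distance (single-character insertions, deletions, and substitutions); the article does not name the algorithm, so treat this as one plausible reading. The function names `edit_distance` and `distance_matrix` are illustrative, not taken from any particular library.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance (an assumed reading of the article's 'distance'):
    the minimum number of single-character insertions, deletions, or
    substitutions needed to turn `source` into `target`."""
    # previous[j] = distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def distance_matrix(terms):
    """Pairwise edit distances between all terms (hypothetical helper)."""
    return [[edit_distance(a, b) for b in terms] for a in terms]


terms = ["cat", "hat", "dog"]
matrix = distance_matrix(terms)
for term, row in zip(terms, matrix):
    print(term, row)
```

Running this gives a distance of 1 between “cat” and “hat” and a distance of 3 between “dog” and either of the other two terms. That is the sort of check that lets you test your mental model of an algorithm against its actual output.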

Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.