There are various myths rampant about data science. Data science did not come into being overnight; it evolved over a long period to become an independent field. In 1960, Peter Naur used the term as a substitute for computer science. In 1974, Naur published a study that freely used the term data science for the contemporary data processing methods used in a wide range of applications. Fast forward to 2001, when William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate advances in computing with data. In 2008, DJ Patil and Jeff Hammerbacher used the title “Data Scientist” for the first time to define their jobs at LinkedIn and Facebook, respectively.
This brief history shows that data science has existed for a long time. It is gaining popularity thanks to the availability of huge amounts of data, better computing capabilities, and the development of scalable machine learning algorithms and libraries. Because it is a fast-developing field that opens up many opportunities, more and more people are looking to build a career in data science. There has been a lot of confusion about the roles and responsibilities of a data scientist, along with vague myths about how to become one. In this article, I try to address a few of these myths and shine a light on the realities behind them.
One needs to have a master's and/or PhD in Mathematics, Statistics or Data Science to pursue a career in Data Science.
We discussed the kinds of roles required in a data science project and team in our last article. To a person with a sound academic background in mathematics, statistics or data science, data science would be a natural choice of career. She or he could choose to work in any of the roles, depending on interests. Such a person could be involved in developing new algorithms, refining existing ones and so on. That does not mean, however, that a person without such expertise can’t be successful in data science. There are roles like Data Analyst and Data Engineer which do not require expertise in math or stats. One needs to develop an understanding of the business and its data, and pick up skills like Excel and coding in Python, R or Spark. It is easy to build knowledge of distributed computing through online courses. One can learn stats and machine learning concepts from online, self-paced or classroom-based courses. With the mix of theoretical concepts and hands-on experience on data science projects that these courses offer, one can easily switch to a career in data science.
Data Science is all about analytical tools and coding.
Data Science is not equivalent to learning and working with analytical tools like SAS or Kafka. Nor is it all about coding in languages like R and Python. Data science projects go through life cycles similar to those of a normal software project. One needs business acumen, interpretation skills, a problem-solving mindset, the ability to be creative with data, presentation skills and communication skills. Working with tools and coding is only one part of a data science role. A data scientist should be able to understand the business problem and think of innovative ways to solve it. A data scientist should also be capable of presenting findings and observations to non-experts through simple graphs and plain English.
Data Scientists always work on developing predictive models.
Because the term data scientist has been romanticized, we might think that a data scientist’s job is only to create complex predictive models. But that is only half the story. Building models is the least time-consuming and the easiest task on a data scientist’s list. A major chunk of a data scientist’s time goes towards gathering relevant data, cleaning it, processing it and engineering features. Data collection is the first and most challenging task in a data science project. One needs to figure out which data should be analyzed for the stated problem. The data then needs to be processed to make sense of it. Choosing the right variables for a given problem can be daunting and frustrating. That’s where domain knowledge helps. Hence, the job is not only and always about machine learning algorithms.
Working on Kaggle challenges translates to real-world project experience.
In Kaggle competitions, companies share problem statements along with datasets and invite participants to solve them. These competitions are indeed great resources for learning about machine learning algorithms and feature engineering methods. But they do not map to real-world data science projects. The datasets shared for such competitions are provided by the host companies, so the major task of gathering, cleaning and processing the data is done by the hosts themselves. In real-world projects, by contrast, data collection and processing have to be done by the data science team. These two tasks are very time-consuming and complex. Finding the right data is paramount, because analysis done on incorrect data is of no use to anyone.
You can learn Data Science in a few months.
Data Science is not like a tool or a set of fixed-format use cases that you can master in a few months. Data science is about identifying patterns and deriving insights from data, and applying that knowledge to solve business problems. Each business problem is unique and may require a different thought process and different ideas. One becomes a data science pro by working on a multitude of problem scenarios, through continued learning and hard work. It is a happening field and it is exciting to be a part of it, but it demands great effort and constant learning.
It can be difficult to choose or change a career. Nevertheless, do plenty of research and make informed choices. It is important to have the aptitude and attitude to make it big in data science.