Harvard Business Review called data scientist the sexiest job of the 21st century. In this post, I am going to show you what data science is and what skills every data scientist must have.
Who is a data scientist?
I’m going to let some experts do the talking on this one.
“Data Scientist = Statistician + programmer + coach + storyteller + artist” – Shlomo Argamon
“A data scientist is half hacker and half analyst, and he uses data to build and derive insights” – Monica Rogati.
Drew Conway created the Venn diagram below:
Considering the number of definitions and discussions floating about the web, your confusion is understandable and frankly, predictable. I’ve attempted to simplify what a data scientist does for you in the diagram below:
A data scientist first collates data and cleans it to create a dataset that can be analyzed. Real-world data is never clean (well-formatted). Once the dataset is ready, they may analyse it for present trends or use it to predict future trends.
Based on this, they can build data-driven products and also communicate their findings to other data scientists and the general public using reports, visualizations and blogs.
Let’s explore the skills required to become a data scientist.
Every year, KDnuggets conducts a poll on “What programming/statistics languages are used for work in data science”.
The chart below shows the survey result for the year 2014:
As you can see, the top contenders are R, Python, and SAS.
All these programming languages have their own advantages. If you’re into data analysis or bioinformatics, R is the best. If you are into web-scraping and machine learning, Python has the upper hand (grossly simplified of course).
Pick one, learn it, and then learn the other. But note this: private companies in the banking and pharma industries still use SAS more often than R. So, if you are looking for business analytics jobs in such companies, SAS is the way to go (they are wary of open source technologies because tools like R do not come with the vendor-backed customer support they expect).
SQL databases have been a primary data storage mechanism for over four decades. The majority of the data stored by businesses is in these relational databases. As a data scientist, you should be familiar with some MySQL concepts like:
How to install MySQL on your local machine.
How to create tables and insert data into SQL databases.
Filters, Joins, and Aggregations.
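If you want to try these concepts before installing anything, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (my assumption, chosen so the example runs anywhere); the SQL itself is basic enough that it should carry over to MySQL largely unchanged. The customers and orders tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create two hypothetical tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

# Insert illustrative rows.
cur.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Mumbai"), (2, "Ben", "London"), (3, "Chen", "Mumbai")],
)
cur.executemany(
    "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
    [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0), (4, 3, 150.0)],
)
conn.commit()

# Filter: customers in a single city.
mumbai_customers = cur.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Mumbai",)
).fetchall()

# Join + aggregation: total order amount per customer, largest first.
totals = cur.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    """
).fetchall()

print(mumbai_customers)
print(totals)
conn.close()
```

One difference to keep in mind: the `?` placeholders are sqlite3's style, and Python drivers for MySQL typically use a different placeholder syntax.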
There is a lot of debate on SQL vs. NoSQL databases. I would recommend learning SQL first and then proceeding to NoSQL because you will appreciate the pros and cons of NoSQL much more if you learn SQL and how to exploit and tune relational database structures.
Linear algebra and multivariable calculus
Your interviewer may ask you some fundamental questions based on multivariable calculus or linear algebra. You may wonder why an interviewer would ask these questions, considering there are linear algebra and calculus libraries in R, SAS and Python. But sometimes a data scientist may need to develop their own algorithm to improve the predictive power of a product, and a firm understanding of these concepts is crucial for that.
You should be familiar with the basics of statistics, such as statistical tests, p-values, distributions, and experimental design.
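To make this concrete, here is a small sketch of what writing your own algorithm can look like: fitting a straight line by gradient descent, where the update rule comes directly from the partial derivatives of the mean squared error. The data, learning rate, and number of steps are illustrative choices of mine, not recommendations.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2) w.r.t. w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

A library would solve this particular problem faster, but knowing where the gradient comes from is what lets you adapt the loss or the model when the standard routine does not fit.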
Statistics is crucial in companies where stakeholders will depend on your help to make decisions and design/evaluate experiments.
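As a rough illustration of a statistical test and a p-value, the sketch below runs a two-sided permutation test for a difference in means between two groups. The measurements are hypothetical, and the permutation test is just one of many tests you could use here; I picked it because it needs nothing beyond the standard library.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            extreme += 1
    # Share of shuffles at least as extreme as the observed difference.
    return extreme / n_permutations


# Hypothetical measurements from two versions of an experiment.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]
p_value = permutation_p_value(control, variant)
print(p_value)
```

A small p-value here means that a difference this large would rarely appear if the group labels did not matter.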
It is often said that data scientists spend as much as 80% of their time converting data into a usable form. Most of the time, the data you are analyzing is going to be messy (unstructured) and difficult to work with. Hence, cleaning data is a vital skill. Examples of messy data include missing values and inconsistent string formatting (e.g., ‘San Fransico’ versus ‘san francisco’ versus ‘ny’).
Companies that make data-driven decisions rely heavily on a data scientist’s ability to visualize data and tell a story with it, since one needs to communicate data-driven insights to both technical and non-technical people in the company.
Hence, as a data scientist, you need to be a good storyteller.
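Here is a minimal sketch of cleaning records with exactly these problems. The rows, the visits field, and the hand-maintained mapping of city spellings are all assumptions made for illustration; real projects usually need a larger mapping or fuzzy matching.

```python
# Hypothetical raw records with misspellings, stray whitespace, and missing values.
raw_rows = [
    {"city": "San Fransico", "visits": "10"},
    {"city": " san francisco ", "visits": ""},
    {"city": "ny", "visits": "7"},
    {"city": "New York", "visits": None},
]

# Hand-maintained mapping from known variants to one canonical spelling (assumption).
CANONICAL_CITIES = {
    "san fransico": "San Francisco",
    "san francisco": "San Francisco",
    "ny": "New York",
    "new york": "New York",
}


def clean_row(row):
    """Normalize the city name and convert visits to an int or None."""
    key = row["city"].strip().lower()
    # Fall back to the trimmed original if the variant is unknown.
    city = CANONICAL_CITIES.get(key, row["city"].strip())
    visits = row["visits"]
    # Treat empty strings and None as missing values.
    visits = int(visits) if visits not in (None, "") else None
    return {"city": city, "visits": visits}


clean_rows = [clean_row(r) for r in raw_rows]
for row in clean_rows:
    print(row)
```

Missing values are kept as None rather than replaced with a guess, so the decision about how to handle them stays explicit.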
There are many good visualization packages in R and Python to help you with this. D3.js is also a popular visualization library.
Thinking like a data scientist:
Companies want to see that you’re a (data-driven) problem solver.
Say you are working on an online advertising project. You may want to understand what type of people are coming to your website and how they are interacting with the website.
The right questions in this instance would be:
Why are people abandoning your website and not completing the sign-up forms?
What keywords have the greatest impact on conversions?
Data science is all about asking the right questions and drawing insights from the data, so that you can use those insights to make the right decisions.
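As a sketch of how the keyword question might be turned into an analysis, the snippet below computes a sign-up conversion rate per search keyword and ranks the keywords. The visit log and keyword names are hypothetical, and a real analysis would also need to account for sample size before drawing conclusions.

```python
from collections import defaultdict

# Hypothetical visit log: (search keyword, whether the visitor completed sign-up).
visits = [
    ("data science course", True),
    ("data science course", False),
    ("learn python", True),
    ("learn python", True),
    ("learn python", False),
    ("free sas tutorial", False),
    ("free sas tutorial", False),
]


def conversion_rates(rows):
    """Return the share of visits that ended in a sign-up, per keyword."""
    counts = defaultdict(lambda: [0, 0])  # keyword -> [sign-ups, visits]
    for keyword, converted in rows:
        counts[keyword][1] += 1
        if converted:
            counts[keyword][0] += 1
    return {k: signups / total for k, (signups, total) in counts.items()}


rates = conversion_rates(visits)
# Keywords ordered from highest to lowest conversion rate.
ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for keyword, rate in ranked:
    print(f"{keyword}: {rate:.0%}")
```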
Strong business acumen
A data scientist needs to have an in-depth understanding of the industry they work in and the business problems their company is trying to solve.
This puts them in a position to understand a business challenge, uncover opportunities, and use data-driven insights to deliver an efficient and measurable solution to the company’s problems.
Today, every man, woman, and okay not child, but everyone else walks around proclaiming themselves to be data scientists. I hope I have brought some clarity to the concept and given you a fair idea of the skills required to be one.
Learn some of the data science skills at Certified Business Analytics Professional.
In my next post, I will talk about some resources through which you can learn these skills. Stay tuned.