Edvancer's Knowledge Hub

Top mistakes beginners make while learning data science (part 2)

Manu Jeevan 29/08/2017

In my previous post, I talked about 5 mistakes that most beginners make while learning data science. Continuing the series, today I am going to talk about a few more mistakes that beginners should avoid when learning data science. Let’s get started.

1) Learning just the theoretical concepts

You can keep on learning as many theoretical concepts in machine learning and data science as you want, but what’s the use if you don’t know how to apply them to real-world problems? That’s why I urge beginners to implement the mathematical concepts they learn using Python or R. Let’s say you are learning linear regression theory: you should take a small dataset and implement linear regression on it using a programming language. Even if you learn advanced concepts like gradient descent or stochastic gradient descent, would you know when or how to apply them? That’s why I strongly suggest that beginners not start with advanced courses like Andrew Ng’s machine learning course.

2) Participating in Kaggle competitions

Just because you possess some programming, statistics and data science skills doesn’t mean you are skilled enough to compete against experienced Kaggle contestants. It is very difficult to solve Kaggle problems without the guidance of an experienced machine learning or data science professional. Some Kaggle problems are very difficult even for experienced machine learning professionals. Even the simplest problems, like Titanic Machine Learning, can’t be solved easily because the data is messy. As I said in my previous post, you first need to know how to clean data.

I have also seen many Kaggle competitors (especially people looking to get into data science) using a black box approach to solve Kaggle problems. A black box approach means simply applying an algorithm like neural networks to the problem and letting the algorithm spit out the results.
This approach doesn’t work in the real world, because you need to understand why you are using a certain algorithm. I have already discussed this in my previous post. So, do not try to solve Kaggle problems when you are just getting started.

3) Building generic projects

If you are a fresh graduate, instead of doing generic data science projects in different domains, pick a domain that you are passionate about and do 4 to 5 data science projects in it. For instance, if you are passionate about real estate, you can solve problems such as predicting housing prices in a particular region or identifying the potential prospects to buy a house in that region.

If you are an experienced professional looking to transition to data science, try to leverage your domain knowledge and solve problems in that particular domain using data science. For example, if you are an insurance professional, you can do several projects in the insurance sector. This will make you much more employable than doing several generic projects, as employers will highly value your domain knowledge.

4) Not focusing on communication of the results

A lot of data science work is about explaining technical findings to non-technical people. Some data scientists call this “storytelling”. The important thing here is to communicate insights in a clear, concise, and valid way, so that others in the company can act on those insights effectively. Once you complete a data science project, try to explain it to someone who doesn’t know anything about data science or, even better, write a blog post about it. This forces you to explain your project in layman’s terms.

5) Learning too many algorithms

You don’t have to learn a lot of algorithms to get an entry-level data science job. You just need to be pretty good at a few algorithms such as linear regression, logistic regression, K-means and K-nearest neighbours. These 4 algorithms are probably a good place to start.
If you can really understand them, talk about the trade-offs, talk about what works, and do a data-centric project with them, then you’ll be much more employable than if you know a little bit about dozens of algorithms.

Conclusion

Beginners assume that they have to know a lot of concepts in machine learning, statistics, linear algebra, multivariable calculus, R, Python, SQL, Hadoop and big data systems. So they cram a lot of theoretical concepts in these areas. Even worse, they blindly follow many guided tutorials and assume that they can talk about these tutorials during an interview. They do this because they are scared to solve real-world data science problems without the guidance of a mentor or a guided video tutorial.

But to work effectively on data science problems, you don’t have to learn many tools and concepts. Instead, you have to learn the basics thoroughly without drowning in theoretical concepts. Just brush up on the basics of math, programming and stats, and learn a few algorithms thoroughly. Then start working on data science projects. If you understand the basics of linear algebra and calculus, you can understand basic algorithms such as linear regression, logistic regression and K-means.
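
To make the advice from point 1 concrete, here is a minimal sketch of implementing linear regression on a small dataset in Python. The house sizes and prices are made up purely for illustration (and deliberately noise-free), the function names fit_linear_regression and predict are hypothetical, and using NumPy’s least-squares solver is only one of several reasonable ways to do the fit.

```python
import numpy as np

# Made-up toy data (an assumption for illustration, not real figures):
# house size in square metres and price in thousands.
sizes = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
prices = np.array([145.0, 170.0, 220.0, 270.0, 320.0])


def fit_linear_regression(x, y):
    """Fit y ~ intercept + slope * x by ordinary least squares."""
    # Design matrix: a column of ones (for the intercept) next to x.
    X = np.column_stack([np.ones_like(x), x])
    # Solve the least-squares problem min ||X @ coeffs - y||^2.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    intercept, slope = coeffs
    return intercept, slope


def predict(x, intercept, slope):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


intercept, slope = fit_linear_regression(sizes, prices)
print(f"intercept: {intercept:.2f}, slope: {slope:.2f}")
print(f"predicted price for 90 square metres: {predict(90.0, intercept, slope):.2f}")
```

Once this works, a natural next step is to swap in a messier real dataset from a domain you care about and see which parts of the theory you actually need in order to explain the result.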

About

Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.