Edvancer's Knowledge Hub

Linear regression in Python

In my previous post, I explained the concept of linear regression using R. In this post, I will explain how to implement linear regression in Python using scikit-learn, a powerful Python module for machine learning that comes with several default data sets. I will use one of these, the Boston Housing data set, which contains information about housing values in the suburbs of Boston.

Introduction

In my step-by-step guide to Python for data science, I explained how to install Python and the most commonly used libraries for data science. Go through that post to understand these libraries.

Linear regression using two-dimensional data

First, let's understand linear regression using just one independent and one dependent variable. I create two lists, xs and ys, and plot them with a scatter plot, taking xs as the independent variable and ys as the dependent variable. You can see that the dependent variable has a linear relationship with the independent variable. A linear regression line has the equation Y = mx + c, where m is the coefficient (slope) of the independent variable and c is the intercept.
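A minimal sketch of this setup follows. The post's original code appears only as screenshots, and the actual values of xs and ys are not shown, so the numbers below are illustrative.

```python
# Sketch of the two-list setup; the xs/ys values are illustrative,
# not the values from the original post.
import numpy as np
import matplotlib.pyplot as plt

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)  # independent variable
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)  # dependent variable

# Scatter plot of ys against xs
plt.scatter(xs, ys)
plt.xlabel("xs")
plt.ylabel("ys")
plt.show()
```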
The mathematical formula to calculate the slope (m) is:

m = (mean(x) * mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))

The formula to calculate the intercept (c) is:

c = mean(y) - m * mean(x)

Now, let's write a function for the intercept and slope (coefficient). To see the slope and intercept for xs and ys, we just need to call the function slope_intercept. reg_line is the equation of the regression line, which we can then plot over the scatter plot of xs and ys.

Root Mean Squared Error (RMSE)

RMSE is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far the data points are from the regression line, and RMSE is a measure of how spread out these residuals are. If Yi is the actual data point and Y^i is the value predicted by the equation of the line, then RMSE is the square root of the mean of (Yi - Y^i)^2.

Linear regression using scikit-learn

Now, let's run linear regression on the Boston Housing data set to predict the housing prices using different variables. I load the Boston data from scikit-learn and create a pandas data frame for the independent variables; boston.target holds the housing prices, which serve as the dependent variable. After exploring the data, I call a linear regression model.

In practice you won't fit linear regression on the entire data set; you have to split the data into training and test sets, so that you train your model on the training data and see how well it performs on the test data. I use 20 percent of the total data as my test data.
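The slope_intercept and RMSE helpers described above can be sketched as follows, using the mean-based formulas; the xs/ys values are illustrative, since the post's originals are not shown.

```python
# Sketch of the slope_intercept and rmse helpers described in the text;
# xs/ys values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)

def slope_intercept(x, y):
    """Return slope m and intercept c of the best-fit line y = m*x + c."""
    m = ((np.mean(x) * np.mean(y) - np.mean(x * y))
         / (np.mean(x) ** 2 - np.mean(x ** 2)))
    c = np.mean(y) - m * np.mean(x)
    return m, c

m, c = slope_intercept(xs, ys)
reg_line = m * xs + c  # predicted y for each x

def rmse(y_actual, y_predicted):
    """Square root of the mean of the squared residuals."""
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

print("RMSE:", rmse(ys, reg_line))

# Plot the regression line over the scatter plot
plt.scatter(xs, ys)
plt.plot(xs, reg_line)
plt.show()
```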
I fit the linear regression model to the training data set, then calculate the intercept, the mean squared error, the coefficients, and the variance score. The coefficients are the slopes (m) of the regression line with respect to the independent variables, so I attach each slope to its respective independent variable. Finally, I plot the predictions for x_test against the actual y_test values.

Selecting only the important variables for the model

Scikit-learn is a good way to fit a linear regression, but if we are using linear regression for modelling purposes then we need to know the importance (significance) of each variable with respect to the hypothesis. To do this, we calculate the p-value for each variable; if it is less than the desired cutoff (0.05 is the usual cutoff for 95% confidence), then we can say with confidence that the variable is significant. We can calculate the p-values using another library called statsmodels. Ordinary least squares (OLS), or linear least squares, is a method for estimating the unknown parameters in a linear regression model; we explained the OLS method in the first part of the tutorial.

model1 = sm.OLS(y_train, x_train)

We can drop a few variables, keeping only those with p-values < 0.05, and then check for improvement in the model. A general approach to comparing two different models is the AIC (Akaike Information Criterion); the model with the minimum AIC is the better one.

Dealing with multicollinearity

Multicollinearity is a problem that you can run into when you're fitting a regression model.
Simply put, multicollinearity is when two or more independent variables in a regression are highly related to one another, such that they do not provide unique or independent information to the regression. We can check for multicollinearity using the data frame's corr(method = "name of method") call, and then make a correlation plot to see which parameters have a multicollinearity issue. Since these are Pearson coefficients, values near 1 or -1 indicate high correlation. For example, we can drop AGE or DIS and then refit the linear regression model to see if there are any improvements.
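The multicollinearity check above can be sketched like this; df stands in for the post's Boston data frame, so its columns and values are illustrative (AGE and DIS are deliberately constructed to be correlated).

```python
# Sketch of the correlation-matrix check for multicollinearity;
# df is synthetic stand-in data, with AGE and DIS built to be correlated.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
age = rng.normal(size=100)
df = pd.DataFrame({
    "RM": rng.normal(size=100),
    "AGE": age,
    "DIS": -0.9 * age + rng.normal(scale=0.3, size=100),  # tied to AGE
})

corr = df.corr(method="pearson")  # Pearson correlation matrix
print(corr)

# Plot the matrix; values near 1 or -1 flag multicollinearity
plt.matshow(corr)
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.show()
```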

Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.