# Step-by-step guide to linear regression in R

Linear regression is one of the most popular and frequently used techniques in statistics: you predict a real-valued output from one or more input values. Technically, linear regression is a statistical technique for analyzing and predicting the linear relationship between a dependent variable and one or more independent variables. Say you want to predict the price of a house: the price is the dependent variable, and factors such as the size of the house, the locality, and the season of purchase act as independent variables, because the price depends on them.

R ships with many built-in data sets, and you can list them after loading the MASS library:

```r
install.packages("MASS")  # only needed once
library(MASS)
data()  # lists the data sets available in loaded packages
```

This gives you a list of available data sets, several of which are good practice material for linear regression problems.

## Analysing a default data set in R

In this post, I will use a built-in data set called "airquality", which records daily air quality measurements in New York City. These are the variables in the data set:
• Daily temperature from May to September
• Ozone concentration
• Solar radiation
• Wind speed
Our goal is to predict the temperature for a particular month in New York using the solar radiation, ozone and wind data. I am going to use linear regression (LR) to make the prediction. With LR, or any other algorithm, the first step is to state a hypothesis. Ours is: "Temperature depends on ozone, wind and solar radiation." The null hypothesis of linear regression says there is no relation between the dependent and independent variables, i.e. all coefficients are zero. If the equation is

Temp = a1·Solar.R + a2·Ozone + a3·Wind + error,

the alternate hypothesis says that at least one coefficient is non-zero, and hence a relationship exists between the dependent and independent variables. In mathematical notation:

H0: a1 = a2 = a3 = 0
Ha: at least one of a1, a2, a3 ≠ 0

Let's test the hypothesis with a linear regression model and draw a conclusion. To do so, we check the significance (p-value) of each variable. If the p-value is below the accepted significance level (conventionally 0.05, i.e. 95% confidence), we reject the null hypothesis and conclude that a relationship exists between the dependent and independent variables. If the p-value is above that level, we fail to reject the null hypothesis and cannot claim such a relationship.

Before that, let's understand the data by exploring it in R.

```r
data(airquality)        # load the data set
attach(airquality)      # make its columns available on the R search path
head(airquality, 10)    # see the first 10 rows
```

The `attach()` function makes the data available on the R search path. The `summary()` function gives you the range, quartiles, median and mean for numerical variables, and frequency tables for categorical variables.

```r
summary(airquality)
##      Ozone           Solar.R           Wind             Temp
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
##  NA's   :37       NA's   :7
##      Month            Day
##  Min.   :5.000   Min.   : 1.0
##  1st Qu.:6.000   1st Qu.: 8.0
##  Median :7.000   Median :16.0
##  Mean   :6.993   Mean   :15.8
##  3rd Qu.:8.000   3rd Qu.:23.0
##  Max.   :9.000   Max.   :31.0
```

## Data visualization

I use boxplots to visualize the daily temperatures for months 5, 6, 7, 8 and 9.

```r
month5 <- subset(airquality, Month == 5)
month6 <- subset(airquality, Month == 6)
month7 <- subset(airquality, Month == 7)
month8 <- subset(airquality, Month == 8)
month9 <- subset(airquality, Month == 9)
par(mfrow = c(3, 2))  # 3 rows and 2 columns of plots
boxplot(month5$Temp ~ month5$Day, main = "Month 5", col = rainbow(3))
boxplot(month6$Temp ~ month6$Day, main = "Month 6", col = rainbow(3))
boxplot(month7$Temp ~ month7$Day, main = "Month 7", col = rainbow(3))
boxplot(month8$Temp ~ month8$Day, main = "Month 8", col = rainbow(3))
boxplot(month9$Temp ~ month9$Day, main = "Month 9", col = rainbow(3))
```

I use a histogram to see the distribution of temperature data.

```r
hist(airquality$Temp, col = rainbow(2))
```

I use scatter plots to see whether there is a linear pattern between temperature and the other variables.

```r
plot(Temp ~ Day + Solar.R + Wind + Ozone, data = airquality, col = "blue")
```

It seems that Solar.R, Ozone and Wind all have a linear pattern with temperature: Solar.R and Ozone have a positive relationship, while Wind has a negative one. I use a conditioning plot (`coplot`) to see how ozone varies with solar radiation at different levels of wind.

```r
coplot(Ozone ~ Solar.R | Wind, data = airquality, panel = panel.smooth, col = "green")
```

It's time to run linear regression on our data set. Linear regression rests on a few assumptions that we should verify:

• A linear relationship between the variables
• Normally distributed residuals
• No or little multicollinearity, which can be checked with the variance inflation factor (VIF)
• Homoscedasticity: the variance of the residuals should be uniform across the regression line
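The rest of the post refers to a fitted model object named `Model_lm_best`, but the fitting code itself seems to have been lost from the text. A minimal sketch of how such a model could be fitted with `lm()` follows; the exact formula is an assumption (Month is included because the prediction step later supplies it):

```r
# Sketch: fit a linear model for Temp. The formula is an assumption,
# not necessarily the author's exact "best" model; lm() drops rows
# with missing Ozone/Solar.R values by default (na.action = na.omit).
data(airquality)
Model_lm_best <- lm(Temp ~ Solar.R + Wind + Ozone + Month, data = airquality)
summary(Model_lm_best)   # coefficients, p-values, R-squared
par(mfrow = c(2, 2))
plot(Model_lm_best)      # diagnostics: Residuals vs Fitted, Normal Q-Q, etc.
```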
R displays a summary of the model, giving the intercept and coefficient estimates for all independent variables along with the error terms (residuals). The linear relationship between the variables is verified by the significance (p-values) of the variables. The 'Residuals vs Fitted' plot shows the residuals scattered evenly around the fitted line, so the variance is uniform. The 'Normal Q-Q' plot shows that the residuals are normally distributed; you can also see this by plotting a histogram of the residuals:

```r
hist(Model_lm_best$residuals)
```

There are many ways to measure the quality of a model, but the residual sum of squares is the most common one. It is the criterion behind the 'line of best fit' through the scattered data points: the fitted line is the one with the least total error with respect to the actual data points. If Y is an actual data point and Y' is the value predicted by the equation, then the error is Y − Y'. Simply summing these errors would be biased by sign, because positive and negative errors would cancel each other out and the total would understate the true error. The usual remedy is to square each error, which serves two purposes: 1) it cancels out the effect of the signs, and 2) it penalizes large prediction errors.

## Prediction

To make a prediction, let's build a data frame with new values of Solar.R, Wind, Ozone and Month.

```r
Solar.R <- 185.93
Wind <- 9.96
Ozone <- 42.12
Month <- 9
new_data <- data.frame(Solar.R, Wind, Ozone, Month)
new_data
##   Solar.R Wind Ozone Month
## 1  185.93 9.96 42.12     9
pred_temp <- predict(Model_lm_best, newdata = new_data)
## [1] "the predicted temperature is:  81.54"
```

## Conclusion

The regression algorithm assumes that the data is normally distributed and that there is a linear relation between the dependent and independent variables. It is a good mathematical model for analyzing the relationship and significance of the various variables.
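As an aside, the residual sum of squares discussed above can be computed directly from a fitted model; a quick sketch (the model formula here is an assumption):

```r
# Compute the residual sum of squares, sum((Y - Y')^2), for a fitted model.
data(airquality)
fit <- lm(Temp ~ Solar.R + Wind + Ozone + Month, data = airquality)
rss <- sum(residuals(fit)^2)   # total squared error of the fitted line
rss
```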

### Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.