At some point while examining data, all of us check for correlations among the variables in it, especially pair-wise correlations.
Many business analysts in industry treat the **‘linear correlation coefficient’** as the only criterion for pair-wise correlation, so at most a **Proc Corr** is run in SAS to check for it. This is of course useful for finding correlations **between continuous variables**. However, more often than not it **falls short when confronted with real-life data**, which frequently contains **all kinds of variables: continuous, binary or multi-level categorical**.
**OR** is correlated with cont_var.

- When you are trying to find **correlation between continuous variables**, **Proc Corr** is a good choice, because it simply gives you linear correlation coefficients.

- When you are looking at correlation **between a binary variable and a continuous variable**, your idea of correlation needs a little change in perspective. A simple linear correlation coefficient is of little use here, because one of the variables no longer holds meaningful numbers, but categories. In many datasets these categories have been given numeric codes, but don’t confuse them with real numeric variables; they are merely represented using numbers. They could just as well have been coded differently, and the coefficient would change with the coding (swapping the two codes, for instance, flips its sign). How do you work around this problem?
  - Observe what a binary variable does to your continuous variable when the two are taken together. It divides the continuous variable into two chunks defined by the two levels of the binary variable. Correlation between the binary and the continuous variable means that when you move from one level of the binary variable to the other, the behaviour of the continuous variable changes as well. Whether that change is statistically significant can be checked with **‘Proc Ttest’**, using the continuous variable as the “variable in question” and the binary variable as the “class variable”.

- Correlation between a **multilevel categorical variable and a continuous variable** is simply an extension of the case above. Instead of just two levels, we now have multiple levels, so **‘Proc ANOVA’** comes into the picture. Use the continuous variable as the “variable in question” and the categorical variable as the “class variable”. The results of Proc ANOVA tell you whether the mean of the continuous variable differs significantly for at least one of the groups defined by the levels of the categorical variable. If it does, you can use the **Bonferroni test in conjunction with Proc ANOVA** to find out which of the classes are affecting your continuous variable.

- When it comes to **correlation between categorical variables**, whether binary or multilevel, the simple choice is the **Chi-squared test**, which can be carried out with **‘Proc Freq’**. Let’s see how we view “correlation” in this context. Look at the proportions in which one of the categorical variables, call it C1, is divided among its categories for the overall population. Correlation between C1 and a second categorical variable C2 means that this proportional distribution changes across the levels of C2. The statistical significance of this change is determined by the Chi-squared test.
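
For the continuous-versus-continuous case, the linear correlation coefficient that Proc Corr reports is the Pearson coefficient. Here is a minimal Python sketch of that calculation, given only as an illustration outside SAS; the function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data: two continuous variables.
ad_spend = [10.0, 12.0, 15.0, 19.0, 24.0]
sales_exact = [2 * a + 1 for a in ad_spend]      # perfectly linear in ad_spend
sales_noisy = [22.0, 24.0, 33.0, 37.0, 50.0]     # roughly linear in ad_spend

r_exact = pearson_r(ad_spend, sales_exact)
r_noisy = pearson_r(ad_spend, sales_noisy)
print(round(r_exact, 4), round(r_noisy, 4))
```

A perfectly linear pair gives a coefficient of 1 (up to floating-point rounding), and noise pulls the value below 1.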
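
For the binary-versus-continuous case, the comparison behind a two-sample t-test can be sketched in Python as follows. This is a rough illustration, not SAS output: it computes only the pooled (equal-variance) t statistic and its degrees of freedom, and leaves out the p-value. The function name `pooled_t` and the data are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(values, labels):
    """Pooled two-sample t statistic for a continuous variable split by a binary one."""
    levels = sorted(set(labels))
    if len(levels) != 2:
        raise ValueError("the class variable must have exactly two levels")
    # Split the continuous variable into the two chunks defined by the binary variable.
    a = [v for v, g in zip(values, labels) if g == levels[0]]
    b = [v for v, g in zip(values, labels) if g == levels[1]]
    na, nb = len(a), len(b)
    # Pooled sample variance (assumes both groups share a common variance).
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical data: a continuous variable and a binary class variable.
cont_values = [10, 12, 11, 13, 20, 22, 21, 23]
coded_01 = [0, 0, 0, 0, 1, 1, 1, 1]
coded_59 = [5, 5, 5, 5, 9, 9, 9, 9]   # same grouping, arbitrary different codes

t_01, df_01 = pooled_t(cont_values, coded_01)
t_59, df_59 = pooled_t(cont_values, coded_59)
print(t_01, df_01)
```

The statistic depends only on which observations fall into which group, so the two codings give the same result.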
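
For the multilevel case, one-way ANOVA compares the variation between group means with the variation within groups. Below is a minimal Python sketch of the F statistic, again only an illustration with hypothetical names and data. The Bonferroni line shows the usual adjustment (dividing the significance level by the number of pairwise comparisons), with 0.05 assumed as the starting level.

```python
from statistics import mean


def one_way_anova_f(values, labels):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand_mean = mean(values)
    n, k = len(values), len(groups)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: a continuous variable and a three-level class variable.
cont_values = [10, 12, 14, 20, 22, 24, 30, 32, 34]
region = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

f_stat, df_between, df_within = one_way_anova_f(cont_values, region)

# Bonferroni adjustment for follow-up pairwise comparisons (alpha = 0.05 assumed).
n_levels = len(set(region))
n_pairs = n_levels * (n_levels - 1) // 2
bonferroni_alpha = 0.05 / n_pairs
print(f_stat, df_between, df_within, bonferroni_alpha)
```

A large F relative to its degrees of freedom suggests that at least one group mean differs; the pairwise comparisons then show which ones.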
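
For two categorical variables, the chi-squared idea of proportions shifting across levels can be sketched in Python from a table of counts. This is a hedged illustration of the uncorrected Pearson chi-squared statistic only, without the p-value; the function name `chi_square` and both tables are hypothetical.

```python
def chi_square(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts: rows are levels of C2, columns are levels of C1.
shifting = [[30, 10],
            [10, 30]]   # C1 proportions differ between the two levels of C2
steady = [[20, 20],
          [10, 10]]     # C1 proportions are the same at both levels of C2

stat_shifting, df_shifting = chi_square(shifting)
stat_steady, df_steady = chi_square(steady)
print(stat_shifting, df_shifting, stat_steady, df_steady)
```

When the proportions of C1 are identical at every level of C2, observed and expected counts match and the statistic is zero; the more the proportions shift, the larger it gets.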
