
Understanding Correlation and Regression Analysis
Explore the concepts of correlation and regression analysis in data science, including types of correlation, methods of studying correlation, and the difference between positive and negative correlations. Learn how to determine the relationship between variables and understand the limitations of correlation analysis.
Presentation Transcript
Correlation & Regression Module V
CORRELATION Correlation analysis attempts to determine the degree of relationship between variables (e.g., between Income & Expenditure). Correlation does not tell us anything about causation. Problems in interpreting correlation: chance coincidence, the influence of a third variable, and mutual influence.
Types of Correlation 1. Positive Correlation 2. Negative Correlation 3. Linear Correlation 4. Non-linear Correlation 5. Simple Correlation 6. Partial Correlation 7. Multiple Correlation
1. Positive and Negative Correlation When the values of the two variables move in the same direction, i.e., when an increase in the value of one variable is associated with an increase in the value of the other (and a decrease with a decrease), the correlation is said to be positive. Examples: Height & Weight, Income & Expenditure, Price & Supply. When the values of the two variables move in opposite directions, the correlation is said to be negative. Examples: Price & Demand, Demand & Supply.
2. Linear and Non-linear Correlation A correlation is said to be linear when the change in one variable bears a constant ratio to the change in the other.
x: 10 20 30 40
y: 40 60 80 100
A correlation is said to be non-linear when the amount of change in the values of one variable does not bear a constant ratio to the amount of change in the corresponding values of the other variable.
X: 8 9 10 10 Y: 80 130 170 150 50 120 28 230 29 560 30 600
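The constant-ratio test for the linear example (x: 10, 20, 30, 40; y: 40, 60, 80, 100) can be sketched in a few lines of Python. The function name `has_constant_ratio` and the tolerance `tol` are assumptions made for this illustration; they do not come from the slides.

```python
def has_constant_ratio(xs, ys, tol=1e-9):
    """Return True if (change in y) / (change in x) is the same for every consecutive pair."""
    ratios = []
    for i in range(1, len(xs)):
        dx = xs[i] - xs[i - 1]
        dy = ys[i] - ys[i - 1]
        if dx == 0:
            return False  # no finite ratio when x does not change
        ratios.append(dy / dx)
    return all(abs(r - ratios[0]) <= tol for r in ratios)

x = [10, 20, 30, 40]
y = [40, 60, 80, 100]
print(has_constant_ratio(x, y))  # True: y rises by 20 for every rise of 10 in x
```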
3. Simple, Partial and Multiple Correlation If only two variables are chosen to study the correlation between them, the correlation is referred to as simple correlation. In partial correlation, two variables are chosen to study the correlation between them, but the effect of the other influencing variables is held constant. In multiple correlation, the relationship among three or more variables is studied simultaneously.
METHODS OF STUDYING CORRELATION 1. Scatter Diagram Method 2. Karl Pearson's Coefficient of Correlation 3. Spearman's Rank Method 4. Method of Least Squares 5. Concurrent Deviation Method
1. Scatter Diagram Method A scatter diagram is a quick, at-a-glance way of checking whether there is an apparent relationship between two variables. It is plotted with the values of the independent variable on the x-axis and the values of the dependent variable on the y-axis.
[Figure: Scatter diagram of Weight & BMI]
2. Karl Pearson's Coefficient It measures quantitatively the extent to which two variables x and y are correlated.
$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$
It is a number between -1 and +1 that summarizes the magnitude as well as the direction of association between two variables.
3. Spearman's Rank Correlation
$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
R = rank correlation coefficient; R_1 = rank of an observation with respect to the first variable; R_2 = rank of the observation with respect to the second variable; d = R_1 - R_2; n = number of pairs of observations being ranked.
4. Method of Least Squares The method of least squares calculates the correlation coefficient from the values of the two regression coefficients b_xy and b_yx:
$r = \pm\sqrt{b_{xy} \times b_{yx}}$
In other words, the correlation coefficient is the geometric mean of the two regression coefficients and takes their common sign. (Will be discussed later on.)
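5. Concurrent Deviation Method This method of studying correlation is the simplest of all the methods. It requires only the direction of change of the X variable and of the Y variable from one observation to the next.
$r_c = \pm\sqrt{\pm\frac{2c - n}{n}}$
r_c stands for the coefficient of correlation by the concurrent deviation method; c stands for the number of concurrent deviations, i.e., the number of positive signs obtained by multiplying Dx with Dy; n stands for the number of pairs of deviations (one fewer than the number of observations). The sign inside the root is chosen to make the quantity non-negative, and the same sign is placed outside the root.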
REGRESSION Regression is the measure of the average relationship between two or more variables in terms of the original units of the data. Regression lines: X on Y gives the value of X for a given value of Y; Y on X gives the value of Y for a given value of X.
REGRESSION MODEL The primary objective of regression analysis is the development of a regression model to explain the association between two or more variables in the given population. A regression model is the mathematical equation that provides prediction of value of dependent variable based on the known values of one or more independent variables.
Types of Regression Models 1. Simple and Multiple Regression Models 2. Linear and Nonlinear Regression Models
1. Simple & Multiple Regression If a regression model characterizes the relationship between a dependent variable y and only one independent variable x, it is called a simple regression model (Ex: Height & Weight). If more than one independent variable is associated with the dependent variable, it is called a multiple regression model (Ex: Turnover of a product is associated with demand, advertisement, quality, and so on).
2. Linear & Nonlinear Regression If the value of the dependent variable y in a regression model tends to change in direct proportion to changes in the values of the independent variables, the model is called linear. Linear models are very useful for prediction. If the line passing through the paired values of x and y is curvilinear, the relationship is called nonlinear. A nonlinear relationship is not very useful for prediction.
Regression Lines The fundamental aim of regression analysis is to determine a regression equation that makes sense and fits the representative data such that the error variance is as small as possible.
Regression equation of y on x: $y = a + bx$
Regression equation of x on y: $x = c + dy$
Correlation co-efficient From the following data calculate Karl Pearson's coefficient of correlation & interpret its value.
Roll No.   Accountancy   Statistics
1          48            45
2          35            20
3          17            40
4          23            25
5          47            45
Correlation co-efficient 1. Direct Method 2. Co-variance Method
Correlation co-efficient (Method I)
X      X^2     Y      Y^2     XY
48     2304    45     2025    2160
35     1225    20     400     700
17     289     40     1600    680
23     529     25     625     575
47     2209    45     2025    2115
ΣX=170   ΣX^2=6556   ΣY=175   ΣY^2=6675   ΣXY=6230
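Correlation co-efficient (Method I)
$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} = \frac{31150 - 29750}{\sqrt{3880 \times 2750}} = \frac{1400}{3266} = 0.43$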
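The direct method can be checked with a short Python sketch that uses the Accountancy and Statistics marks from the problem. The function name `pearson_direct` and the variable names are assumptions made for illustration only.

```python
from math import sqrt

def pearson_direct(xs, ys):
    """Pearson's r from raw sums: N, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

accountancy = [48, 35, 17, 23, 47]
statistics_marks = [45, 20, 40, 25, 45]
print(round(pearson_direct(accountancy, statistics_marks), 2))  # 0.43
```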
Correlation co-efficient (Method II)
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$
Correlation co-efficient (Method II), with x = X - X̄ and y = Y - Ȳ
Roll No.   X     x      x^2    Y     y      y^2    xy
1          48    +14    196    45    +10    100    +140
2          35    +1     1      20    -15    225    -15
3          17    -17    289    40    +5     25     -85
4          23    -11    121    25    -10    100    +110
5          47    +13    169    45    +10    100    +130
ΣX=170, X̄=34   Σx^2=776   ΣY=175, Ȳ=35   Σy^2=550   Σxy=280
Correlation co-efficient (Method II)
$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{280}{\sqrt{776 \times 550}} = 0.43$
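A minimal sketch of the deviation (co-variance) method on the same marks, assuming deviations are taken from the arithmetic means. The helper names `deviation_sums` and `pearson_from_deviations` are hypothetical.

```python
from math import sqrt

def deviation_sums(xs, ys):
    """Return (sum xy, sum x^2, sum y^2), where x and y are deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy, sum_x2, sum_y2

def pearson_from_deviations(xs, ys):
    sum_xy, sum_x2, sum_y2 = deviation_sums(xs, ys)
    return sum_xy / sqrt(sum_x2 * sum_y2)

accountancy = [48, 35, 17, 23, 47]
statistics_marks = [45, 20, 40, 25, 45]
print(deviation_sums(accountancy, statistics_marks))  # (280.0, 776.0, 550.0)
print(round(pearson_from_deviations(accountancy, statistics_marks), 2))  # 0.43
```

Both methods give the same r because the deviation sums are the raw-sum expressions of the direct method divided by N.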
Rank Correlation (with tied ranks)
$R = 1 - \frac{6\left\{\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right\}}{N^3 - N}$
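The tie-corrected formula can be sketched in Python as follows. The sketch assumes that each m is the number of observations sharing one tied value, that tied values receive the average of the ranks they span, and that the largest value gets rank 1. The function names and the sample data are hypothetical and are not taken from the slides.

```python
from collections import Counter

def average_ranks(values):
    """Rank from largest (rank 1) down; tied values get the mean of the ranks they occupy."""
    order = sorted(values, reverse=True)
    first_position = {}
    for position, v in enumerate(order, start=1):
        first_position.setdefault(v, position)
    counts = Counter(values)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    # One correction term (m^3 - m)/12 for every group of m tied values in either series.
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

# Hypothetical scores: the value 2 occurs twice in the first series (m = 2).
print(spearman_with_ties([1, 2, 2, 3], [1, 2, 3, 4]))  # 0.9
```

When no values are tied, every correction term vanishes and the expression reduces to the untied rank formula.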
Rank Correlation Two ladies were asked to rank 7 different brands of lipstick. The rankings given were as follows. Calculate Spearman's rank correlation co-efficient.
Lipstick:  A  B  C  D  E  F  G
Neethu:    2  1  4  3  5  7  6
Preethy:   1  3  2  4  5  6  7
Rank Correlation (Solution)
X (R1)   Y (R2)   D = R1 - R2   D^2
2        1        1             1
1        3        -2            4
4        2        2             4
3        4        -1            1
5        5        0             0
7        6        1             1
6        7        -1            1
ΣD^2 = 12
$R = 1 - \frac{6 \times 12}{7^3 - 7} = 0.786$
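The lipstick example has no tied ranks, so the simple rank formula applies. A short Python check, with the function name `spearman_no_ties` chosen only for this illustration:

```python
def spearman_no_ties(ranks_1, ranks_2):
    """Spearman's R = 1 - 6 * sum(D^2) / (n^3 - n) for rankings without ties."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

neethu = [2, 1, 4, 3, 5, 7, 6]
preethy = [1, 3, 2, 4, 5, 6, 7]
print(round(spearman_no_ties(neethu, preethy), 3))  # 0.786
```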
Concurrent Deviation Method Calculate the co-efficient of concurrent deviation from the following.
X: 60, 55, 50, 56, 30, 70, 40, 35, 80, 80, 75
Y: 65, 40, 35, 75, 63, 80, 35, 20, 80, 60, 60
X     Dx    Y     Dy    DxDy
60          65
55    -     40    -     +
50    -     35    -     +
56    +     75    +     +
30    -     63    -     +
70    +     80    +     +
40    -     35    -     +
35    -     20    -     +
80    +     80    +     +
80    0     60    -     0
75    -     60    0     0
Concurrent Deviation Method Solution:
$r_c = \pm\sqrt{\pm\frac{2c - n}{n}} = +\sqrt{+\frac{2 \times 8 - 10}{10}}$
Answer = 0.77
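The sign-counting procedure can be sketched in Python. As in the worked solution, only products of two deviations with the same non-zero sign are counted as concurrent, and a zero deviation is not. The names `sign` and `concurrent_deviation` are assumptions for this sketch.

```python
from math import sqrt

def sign(value):
    """Return +1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

def concurrent_deviation(xs, ys):
    # Direction of change from each observation to the next.
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                      # number of pairs of deviations
    c = sum(1 for p, q in zip(dx, dy) if p * q > 0)  # concurrent deviations
    inner = (2 * c - n) / n
    s = 1 if inner >= 0 else -1                      # same sign inside and outside the root
    return s * sqrt(s * inner)

x = [60, 55, 50, 56, 30, 70, 40, 35, 80, 80, 75]
y = [65, 40, 35, 75, 63, 80, 35, 20, 80, 60, 60]
print(round(concurrent_deviation(x, y), 2))  # 0.77
```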
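(Regression Equations - 2 Methods)
Method I (normal equations)
y on x: $\sum Y = Na + b\sum X$ and $\sum XY = a\sum X + b\sum X^2$
x on y: $\sum X = Na + b\sum Y$ and $\sum XY = a\sum Y + b\sum Y^2$
Method II (using r and the standard deviations)
y on x: $y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$
x on y: $x - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y})$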
Regression Equations Find the two regression equations for the following problem:
X: 6 2 10 4 8
Y: 9 11 5 8 7
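(Regression Equations - Method I)
y on x: $\sum Y = Na + b\sum X$ and $\sum XY = a\sum X + b\sum X^2$
40 = 5a + 30b
214 = 30a + 220b
Solving, a = 11.9 and b = -0.65, so Y = 11.9 - 0.65X
x on y: $\sum X = Na + b\sum Y$ and $\sum XY = a\sum Y + b\sum Y^2$
30 = 5a + 40b
214 = 40a + 340b
Solving, a = 16.4 and b = -1.3, so X = 16.4 - 1.3Y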
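The two pairs of normal equations can be solved mechanically. The Python sketch below uses the problem data (X: 6, 2, 10, 4, 8; Y: 9, 11, 5, 8, 7) and solves each 2x2 system by Cramer's rule. The function name `fit_line` is hypothetical.

```python
def fit_line(us, vs):
    """Solve the normal equations for v = a + b*u and return (a, b)."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    # Normal equations:  sum_v  = n*a     + b*sum_u
    #                    sum_uv = a*sum_u + b*sum_u2
    det = n * sum_u2 - sum_u * sum_u
    a = (sum_v * sum_u2 - sum_u * sum_uv) / det
    b = (n * sum_uv - sum_u * sum_v) / det
    return a, b

X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]
print(fit_line(X, Y))  # y on x: (11.9, -0.65)
print(fit_line(Y, X))  # x on y: (16.4, -1.3)
```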
Regression Equations: Method I
X     Y     XY    X^2    Y^2
6     9     54    36     81
2     11    22    4      121
10    5     50    100    25
4     8     32    16     64
8     7     56    64     49
ΣX=30   ΣY=40   ΣXY=214   ΣX^2=220   ΣY^2=340
(Regression Equations - Method II)
y on x: $y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$
Y - 8 = -0.92(2/2.82)(X - 6)
Y - 8 = -0.65(X - 6)
Y = 11.9 - 0.65X
x on y: $x - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y})$
X - 6 = -0.92(2.82/2)(Y - 8)
X - 6 = -1.30(Y - 8)
X = 16.4 - 1.3Y
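A short Python sketch of Method II on the same data. It assumes population standard deviations (dividing by N), which is what the values 2.82 and 2 in the worked solution correspond to. It also checks that the magnitude of r equals the geometric mean of the two regression coefficients. The function name `regression_coefficients` is an assumption for this sketch.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (r, b_yx, b_xy) using r and the population standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    r = cov / (sd_x * sd_y)
    b_yx = r * sd_y / sd_x  # slope of the regression of y on x
    b_xy = r * sd_x / sd_y  # slope of the regression of x on y
    return r, b_yx, b_xy

X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]
r, b_yx, b_xy = regression_coefficients(X, Y)
print(round(r, 2), round(b_yx, 2), round(b_xy, 2))  # -0.92 -0.65 -1.3
print(round(sqrt(b_yx * b_xy), 2))                  # 0.92, the magnitude of r
```

The square root gives only the magnitude; r takes the common sign of the two regression coefficients, which is negative here.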
Correlation Co-efficient
X     Y     XY      X^2     Y^2
50    10    500     2500    100
60    14    840     3600    196
55    15    825     3025    225
65    11    715     4225    121
75    12    900     5625    144
70    15    1050    4900    225
75    16    1200    5625    256
80    20    1600    6400    400
90    18    1620    8100    324
80    19    1520    6400    361
ΣX=700   ΣY=150   ΣXY=10770   ΣX^2=50400   ΣY^2=2352
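Combined Mean & SD
Combined Mean: $\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
Combined Standard Deviation: $\sigma_{12} = \sqrt{\frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$
where $d_1 = \bar{x}_1 - \bar{x}_{12}$ and $d_2 = \bar{x}_2 - \bar{x}_{12}$.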
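The combined formulas can be checked against a direct calculation on pooled data. The sketch below assumes population standard deviations (dividing by n) and takes d_1 and d_2 as the differences between each group mean and the combined mean. The two groups are hypothetical numbers chosen for illustration, and `combined_mean_sd` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, pstdev

def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and population SD of two groups from their summary statistics."""
    mean12 = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1, d2 = mean1 - mean12, mean2 - mean12
    var12 = (n1 * sd1 ** 2 + n2 * sd2 ** 2 + n1 * d1 ** 2 + n2 * d2 ** 2) / (n1 + n2)
    return mean12, sqrt(var12)

# Hypothetical groups, used only to compare against the pooled calculation.
group_a = [2, 4, 6]
group_b = [10, 12, 14, 16]
mean12, sd12 = combined_mean_sd(len(group_a), mean(group_a), pstdev(group_a),
                                len(group_b), mean(group_b), pstdev(group_b))
pooled = group_a + group_b
print(abs(mean12 - mean(pooled)) < 1e-9)   # combined mean matches the pooled mean
print(abs(sd12 - pstdev(pooled)) < 1e-9)   # combined SD matches the pooled SD
```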