
Exploring Multicollinearity in Regression using Principal Components Analysis
Discover how Principal Components Analysis can detect and correct multicollinearity in regression models, as showcased in a study on physical stature attributes among female police officer applicants. The study includes data on standing height and various physical measurements, highlighting the importance of addressing multicollinearity for accurate analysis.
Presentation Transcript
Multicollinearity in Regression: Principal Components Analysis
Standing Heights and Physical Stature Attributes Among Female Police Officer Applicants
Source: S.Q. Lafi and J.B. Kaneene (1992). "An Explanation of the Use of Principal Components Analysis to Detect and Correct for Multicollinearity," Preventive Veterinary Medicine, Vol. 13, pp. 261-275.
Data Description

Subjects: 33 females applying for police officer positions

Dependent variable:
Y = Standing Height (cm)

Independent variables:
X1 = Sitting Height (cm)
X2 = Upper Arm Length (cm)
X3 = Forearm Length (cm)
X4 = Hand Length (cm)
X5 = Upper Leg Length (cm)
X6 = Lower Leg Length (cm)
X7 = Foot Length (inches)
X8 = BRACH (100·X3/X2)
X9 = TIBIO (100·X6/X5)
Data (one row per applicant; the extraction flattened the table column by column, with Y printed last)

ID     Y     X1    X2    X3    X4    X5    X6   X7    X8     X9
 1   165.8  88.7  31.8  28.1  18.7  40.3  38.9  6.7  88.4   96.5
 2   169.8  90.0  32.4  29.1  18.3  43.3  42.7  6.4  89.8   98.6
 3   170.7  87.7  33.6  29.5  20.7  43.7  41.1  7.2  87.8   94.1
 4   170.9  87.1  31.0  28.2  18.6  43.7  40.6  6.7  91.0   92.9
 5   157.5  81.3  32.1  27.3  17.5  38.1  39.6  6.6  85.0  103.9
 6   165.9  88.2  31.8  29.0  18.6  42.0  40.6  6.5  91.2   96.7
 7   158.7  86.1  30.6  27.8  18.4  40.0  37.0  5.9  90.8   92.5
 8   166.0  88.7  30.2  26.9  17.5  41.6  39.0  5.9  89.1   93.8
 9   158.7  83.7  31.1  27.1  18.1  38.9  37.5  6.1  87.1   96.4
10   161.5  81.2  32.3  27.8  19.1  42.8  40.1  6.2  86.1   93.7
11   167.3  88.6  34.8  27.3  18.3  43.1  41.8  7.3  78.4   97.0
12   167.4  83.2  34.3  30.1  19.2  43.4  42.2  6.8  87.8   97.2
13   159.2  81.5  31.0  27.3  17.5  39.8  39.6  4.9  88.1   99.5
14   170.0  87.9  34.2  30.9  19.4  43.1  43.7  6.3  90.4  101.4
15   166.3  88.3  30.6  28.8  18.3  41.8  41.0  5.9  94.1   98.1
16   169.0  85.6  32.6  28.8  19.1  42.7  42.0  6.0  88.3   98.4
17   156.2  81.6  31.0  25.6  17.0  44.2  39.0  5.1  82.6   88.2
18   159.6  86.6  32.7  25.4  17.7  42.0  37.5  5.0  77.7   89.3
19   155.0  82.0  30.3  26.6  17.3  37.9  36.1  5.2  87.8   95.3
20   161.1  84.1  29.5  26.6  17.8  38.6  38.2  5.9  90.2   99.0
21   170.3  88.1  34.0  29.3  18.2  43.2  41.4  5.9  86.2   95.8
22   167.8  83.9  32.5  28.6  20.2  43.3  42.9  7.2  88.0   99.1
23   163.1  88.1  31.7  26.9  18.1  40.1  39.0  5.9  84.9   97.3
24   165.8  87.0  33.2  26.3  19.5  43.2  40.7  5.9  79.2   94.2
25   175.4  89.6  35.2  30.1  19.1  45.1  44.5  6.3  85.5   98.7
26   159.8  85.6  31.5  27.1  19.2  42.3  39.0  5.7  86.0   92.2
27   166.0  84.9  30.5  28.1  17.8  41.2  43.0  6.1  92.1  104.4
28   161.2  84.1  32.8  29.2  18.4  42.6  41.1  5.9  89.0   96.5
29   160.4  84.3  30.5  27.8  16.8  41.0  39.8  6.0  91.1   97.1
30   164.3  85.0  35.0  27.8  19.0  47.2  42.4  5.0  79.4   89.8
31   165.5  82.6  36.2  28.6  20.2  45.0  42.3  5.6  79.0   94.0
32   167.2  85.0  33.6  27.1  19.8  46.0  41.6  5.6  80.7   90.4
33   167.2  83.4  33.5  29.7  19.4  45.2  44.0  5.2  88.7   97.3
Standardizing the Predictors

$$X_{ij}^{*} = \frac{X_{ij}-\bar{X}_j}{\sqrt{n-1}\;S_j}, \qquad i=1,\ldots,33;\ j=1,\ldots,9, \qquad S_j^2=\frac{\sum_{i=1}^{n}\left(X_{ij}-\bar{X}_j\right)^2}{n-1}$$

With this scaling, the cross-product matrix of the standardized predictors is the correlation matrix:

$$\mathbf{X}^{*\prime}\mathbf{X}^{*} = \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{19} \\ r_{21} & 1 & \cdots & r_{29} \\ \vdots & & \ddots & \vdots \\ r_{91} & r_{92} & \cdots & 1 \end{bmatrix}, \qquad r_{jk} = \frac{\sum_{i=1}^{n}\left(X_{ij}-\bar{X}_j\right)\left(X_{ik}-\bar{X}_k\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{ij}-\bar{X}_j\right)^2 \sum_{i=1}^{n}\left(X_{ik}-\bar{X}_k\right)^2}}$$
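The standardization above can be sketched in a few lines of numpy. This is illustrative code, not from the paper; the data here are synthetic stand-ins for the applicant measurements.

```python
import numpy as np

# Synthetic data set (n = 33, 3 predictors) in place of the real measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(33, 3))
n = X.shape[0]

# X*_ij = (X_ij - Xbar_j) / (sqrt(n-1) * S_j), so that X*'X* = R
Xc = X - X.mean(axis=0)
S = X.std(axis=0, ddof=1)
Xstar = Xc / (np.sqrt(n - 1) * S)

R = Xstar.T @ Xstar
# R should equal the ordinary correlation matrix of the predictors
print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```

Each standardized column has mean 0 and sum of squares 1, so the cross-products are exactly the pairwise correlations.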
Correlation Matrix of Predictors and Inverse

R:
 1.0000  0.1441  0.2791  0.1483  0.1863  0.2264  0.3680  0.1147  0.0212
 0.1441  1.0000  0.4708  0.6452  0.7160  0.6616  0.1468 -0.5820 -0.0984
 0.2791  0.4708  1.0000  0.5050  0.3658  0.7284  0.4277  0.4420  0.4406
 0.1483  0.6452  0.5050  1.0000  0.6007  0.5500  0.3471 -0.1911 -0.0988
 0.1863  0.7160  0.3658  0.6007  1.0000  0.7150 -0.0298 -0.3882 -0.4099
 0.2264  0.6616  0.7284  0.5500  0.7150  1.0000  0.2821  0.0026  0.3434
 0.3680  0.1468  0.4277  0.3471 -0.0298  0.2821  1.0000  0.2445  0.3971
 0.1147 -0.5820  0.4420 -0.1911 -0.3882  0.0026  0.2445  1.0000  0.5082
 0.0212 -0.0984  0.4406 -0.0988 -0.4099  0.3434  0.3971  0.5082  1.0000

R^(-1):
   1.52    -3.48     3.15    0.41    13.15   -13.28   -0.62    -3.41    10.21
  -3.48   436.47  -390.31   -1.26   -83.83    77.01    1.18   425.55   -62.66
   3.15  -390.31   353.99   -0.07    91.67   -87.90   -1.25  -382.59    68.23
   0.41    -1.26    -0.07    2.46     4.89    -5.40   -0.81    -0.49     4.57
  13.15   -83.83    91.67    4.89   817.17  -807.75   -2.21   -76.90   603.81
 -13.28    77.01   -87.90   -5.40  -807.75   801.94    2.65    71.74  -597.88
  -0.62     1.18    -1.25   -0.81    -2.21     2.65    1.77     1.12    -2.49
  -3.41   425.55  -382.59   -0.49   -76.90    71.74    1.12   417.39   -58.24
  10.21   -62.66    68.23    4.57   603.81  -597.88   -2.49   -58.24   448.37
Variance Inflation Factors (VIFs)

VIF measures the extent to which a regression coefficient's variance is inflated by correlations among the set of predictors: VIFj = 1/(1 - Rj²), where Rj² is the coefficient of multiple determination when Xj is regressed on the remaining predictors. Values above 10 are often considered problematic. The VIFs can be obtained as the diagonal elements of R^(-1):

X1    1.52
X2  436.47
X3  353.99
X4    2.46
X5  817.17
X6  801.94
X7    1.77
X8  417.39
X9  448.37

Not surprisingly, X2, X3, X5, X6, X8, and X9 are problems: X8 and X9 are defined as ratios of these very variables (see the definitions of X8 and X9).
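The two routes to a VIF, the diagonal of R^(-1) and the direct definition 1/(1 - Rj²), agree. A minimal numpy sketch on synthetic near-collinear data (not the applicant data):

```python
import numpy as np

# Build a near-collinear design: column 4 is almost X2 + X3.
rng = np.random.default_rng(1)
n, p = 33, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 1] + X[:, 2] + 0.05 * rng.normal(size=n)

# Route 1: VIFs as the diagonal of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif_diag = np.diag(np.linalg.inv(R))

# Route 2: direct definition for one predictor, here X_4 (index 3)
j = 3
others = np.delete(X, j, axis=1)
A = np.column_stack([np.ones(n), others])
resid = X[:, j] - A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
vif_direct = 1 / (1 - r2_j)

print(np.isclose(vif_diag[j], vif_direct))  # True
```

With noise standard deviation 0.05 on a nearly exact linear combination, the VIF for the fourth column is well above the usual cutoff of 10.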
Regression of Y on [1|X*]

$$Y_i = \beta_0 + \beta_1 X_{i1}^{*} + \cdots + \beta_9 X_{i9}^{*} + \varepsilon_i$$

Regression Statistics: Multiple R 0.944825, R Square 0.892694, Adjusted R Square 0.850704, Standard Error 1.890412, Observations 33.

ANOVA
Source       df       SS        MS        F       Significance F
Regression    9   683.7823   75.9758   21.2600       0.0000
Residual     23    82.1941    3.5737
Total        32   765.9764

            Coefficients  Std Error    t Stat    P-value  Lower 95%   Upper 95%
Intercept     164.5636      0.3291    500.0743   0.0000   163.8829    165.2444
X1*            11.8900      2.3307      5.1015   0.0000     7.0686     16.7114
X2*             4.2752     39.4941      0.1082   0.9147   -77.4246     85.9751
X3*            -3.2845     35.5676     -0.0923   0.9272   -76.8616     70.2927
X4*             4.2764      2.9629      1.4433   0.1624    -1.8528     10.4057
X5*            -9.8372     54.0398     -0.1820   0.8571  -121.6270    101.9525
X6*            25.5626     53.5337      0.4775   0.6375   -85.1802    136.3055
X7*             3.3805      2.5166      1.3433   0.1923    -1.8255      8.5865
X8*             6.3735     38.6215      0.1650   0.8704   -73.5211     86.2682
X9*            -9.6391     40.0289     -0.2408   0.8118   -92.4453     73.1670

Note the surprising negative coefficients for X3*, X5*, and X9*.
Principal Components Analysis

Using a statistical or matrix computer package, decompose the correlation matrix into its eigenvalues and eigenvectors:

$$\mathbf{R} = \mathbf{X}^{*\prime}\mathbf{X}^{*} = \mathbf{V}\mathbf{L}\mathbf{V}' = \sum_{j=1}^{p} \lambda_j \mathbf{v}_j \mathbf{v}_j', \qquad \lambda_j = j\text{th eigenvalue}, \quad \mathbf{v}_j = j\text{th eigenvector}$$

$$\mathbf{V} = \left[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_p\right], \qquad \mathbf{L} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix}, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$$

subject to: $\mathbf{v}_j'\mathbf{v}_j = 1$ and $\mathbf{v}_j'\mathbf{v}_k = 0$ for $j \ne k$.

Condition index: $\kappa_j = \sqrt{\lambda_{\max}/\lambda_j}$.

Principal components: $\mathbf{W} = \mathbf{X}^{*}\mathbf{V}$. While the columns of X* are highly correlated, the columns of W are uncorrelated. The λ's represent the variance corresponding to each principal component.
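The decomposition and the claim that the components are uncorrelated can be checked numerically. A sketch with numpy on synthetic data (note `eigh` returns eigenvalues in ascending order, so they are flipped to match the slide's convention):

```python
import numpy as np

# Synthetic standardized design, then eigendecomposition of R = X*'X*.
rng = np.random.default_rng(2)
X = rng.normal(size=(33, 5))
n = X.shape[0]
Xstar = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))

R = Xstar.T @ Xstar
lam, V = np.linalg.eigh(R)
lam, V = lam[::-1], V[:, ::-1]        # sort eigenvalues descending
cond_index = np.sqrt(lam[0] / lam)    # kappa_j = sqrt(lambda_max / lambda_j)

W = Xstar @ V
# Columns of W are uncorrelated with variances lambda_j: W'W = diag(lambda)
print(np.allclose(W.T @ W, np.diag(lam)))  # True
```

Because V is orthogonal, W'W = V'RV = L exactly, which is why the components carry the eigenvalues as their variances.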
Police Applicants Height Data - I

V (eigenvectors as columns; listed below one eigenvector per line):
v1:  0.1853  0.4413  0.3934  0.4182  0.4125  0.4645  0.2141 -0.0852  0.0474
v2:  0.1523 -0.2348  0.3336 -0.0813 -0.3000  0.1011  0.3577  0.5467  0.5261
v3:  0.8017 -0.0986 -0.1642  0.0284 -0.0121 -0.2518  0.3790 -0.0498 -0.3320
v4:  0.2782 -0.2312  0.2336 -0.2063  0.3508  0.1658 -0.5862  0.4536 -0.2685
v5: -0.3707 -0.2551  0.1239  0.5765  0.0559 -0.2697  0.2139  0.3674 -0.4396
v6: -0.2327 -0.3191 -0.3183 -0.3703  0.4669  0.3798  0.4811  0.0367 -0.1027
v7:  0.1754 -0.3973 -0.4953  0.5529  0.0250  0.2786 -0.2484 -0.0418  0.3445
v8: -0.0005  0.5850 -0.5205  0.0009  0.1487 -0.1539  0.0009  0.5738  0.1089
v9:  0.0104 -0.1414  0.1397  0.0040  0.6106 -0.6040 -0.0022 -0.1352  0.4521

L = diag(3.6304, 2.4427, 1.0145, 0.7656, 0.6109, 0.3024, 0.2322, 0.0009, 0.0005)
Police Applicants Height Data - II

VLV' (reconstructed from the eigendecomposition):
 1.0000  0.1441  0.2791  0.1483  0.1863  0.2263  0.3680  0.1147  0.0212
 0.1441  1.0000  0.4708  0.6452  0.7160  0.6617  0.1468 -0.5820 -0.0985
 0.2791  0.4708  1.0000  0.5051  0.3658  0.7284  0.4277  0.4420  0.4406
 0.1483  0.6452  0.5051  1.0000  0.6007  0.5500  0.3471 -0.1911 -0.0988
 0.1863  0.7160  0.3658  0.6007  1.0000  0.7150 -0.0298 -0.3882 -0.4098
 0.2263  0.6617  0.7284  0.5500  0.7150  1.0000  0.2821  0.0026  0.3434
 0.3680  0.1468  0.4277  0.3471 -0.0298  0.2821  1.0000  0.2445  0.3971
 0.1147 -0.5820  0.4420 -0.1911 -0.3882  0.0026  0.2445  1.0000  0.5083
 0.0212 -0.0985  0.4406 -0.0988 -0.4098  0.3434  0.3971  0.5083  1.0000

R (original correlation matrix):
 1.0000  0.1441  0.2791  0.1483  0.1863  0.2264  0.3680  0.1147  0.0212
 0.1441  1.0000  0.4708  0.6452  0.7160  0.6616  0.1468 -0.5820 -0.0984
 0.2791  0.4708  1.0000  0.5050  0.3658  0.7284  0.4277  0.4420  0.4406
 0.1483  0.6452  0.5050  1.0000  0.6007  0.5500  0.3471 -0.1911 -0.0988
 0.1863  0.7160  0.3658  0.6007  1.0000  0.7150 -0.0298 -0.3882 -0.4099
 0.2264  0.6616  0.7284  0.5500  0.7150  1.0000  0.2821  0.0026  0.3434
 0.3680  0.1468  0.4277  0.3471 -0.0298  0.2821  1.0000  0.2445  0.3971
 0.1147 -0.5820  0.4420 -0.1911 -0.3882  0.0026  0.2445  1.0000  0.5082
 0.0212 -0.0984  0.4406 -0.0988 -0.4099  0.3434  0.3971  0.5082  1.0000

The two matrices agree to rounding, confirming the decomposition R = VLV'.
Regression of Y on [1|W]

$$E\{\mathbf{Y}\} = \gamma_0 \mathbf{1} + \mathbf{W}\boldsymbol{\gamma}$$

Regression Statistics: Multiple R 0.944825, R Square 0.892694, Adjusted R Square 0.850704, Standard Error 1.890412, Observations 33.

Note that W8 and W9 have very small eigenvalues and very small t-statistics. Their condition indices are 63.5 and 85.2, both well above 30.

ANOVA
Source       df       SS        MS        F       Significance F
Regression    9   683.7823   75.9758   21.2600       0.0000
Residual     23    82.1941    3.5737
Total        32   765.9764

            Coefficients  Std Error    t Stat    P-value  Lower 95%   Upper 95%
Intercept     164.5636      0.3291    500.0743   0.0000   163.8829    165.2444
W1             12.1269      0.9922     12.2227   0.0000    10.0744     14.1793
W2              4.5224      1.2096      3.7389   0.0011     2.0202      7.0245
W3              7.6160      1.8769      4.0578   0.0005     3.7334     11.4985
W4              4.9552      2.1605      2.2935   0.0313     0.4858      9.4246
W5             -3.5819      2.4185     -1.4810   0.1522    -8.5850      1.4213
W6              3.2973      3.4376      0.9592   0.3474    -3.8139     10.4085
W7              6.8268      3.9230      1.7402   0.0952    -1.2885     14.9422
W8              1.4226     64.0508      0.0222   0.9825  -131.0766    133.9219
W9            -27.5954     87.0588     -0.3170   0.7541  -207.6903    152.4995
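It is worth noting why this fit has exactly the same R², residual SS, and standard error as the regression on [1|X*]: W = X*V is an orthogonal change of basis, so the two designs span the same column space and give identical fitted values, with gamma-hat = V'beta-hat. A sketch on synthetic data (illustrative, not the authors' code):

```python
import numpy as np

def fitted(M, y):
    """OLS fitted values from regressing y on [1|M]."""
    A = np.column_stack([np.ones(len(y)), M])
    return A @ np.linalg.lstsq(A, y, rcond=None)[0]

# Synthetic response and design; standardize, decompose, rotate.
rng = np.random.default_rng(3)
n, p = 33, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(size=n)
Xstar = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))

lam, V = np.linalg.eigh(Xstar.T @ Xstar)
W = Xstar @ V

# Same fitted values from either parameterization
print(np.allclose(fitted(Xstar, y), fitted(W, y)))  # True
```

The gain from the rotation is not in the fit but in the diagnostics: each component's t-statistic and condition index can be judged separately, which is what motivates dropping W8 and W9 next.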
Reduced Model

Remove the last two principal components due to their small, insignificant t-statistics and high condition indices.

Let V(g) be the p × g matrix of the eigenvectors for the g retained principal components (p = 9, g = 7). Let W(g) = X*V(g). Then regress Y on [1|W(g)] to obtain the estimate γ̂(g).

V(g) (columns v1–v7 of V; listed below one eigenvector per line):
v1:  0.1853  0.4413  0.3934  0.4182  0.4125  0.4645  0.2141 -0.0852  0.0474
v2:  0.1523 -0.2348  0.3336 -0.0813 -0.3000  0.1011  0.3577  0.5467  0.5261
v3:  0.8017 -0.0986 -0.1642  0.0284 -0.0121 -0.2518  0.3790 -0.0498 -0.3320
v4:  0.2782 -0.2312  0.2336 -0.2063  0.3508  0.1658 -0.5862  0.4536 -0.2685
v5: -0.3707 -0.2551  0.1239  0.5765  0.0559 -0.2697  0.2139  0.3674 -0.4396
v6: -0.2327 -0.3191 -0.3183 -0.3703  0.4669  0.3798  0.4811  0.0367 -0.1027
v7:  0.1754 -0.3973 -0.4953  0.5529  0.0250  0.2786 -0.2484 -0.0418  0.3445
Reduced Regression Fit

Regression Statistics: Multiple R 0.944575, R Square 0.892223, Adjusted R Square 0.862045, Standard Error 1.817195, Observations 33.

ANOVA
Source       df       SS        MS        F       Significance F
Regression    7   683.4215   97.6316   29.5657       0.0000
Residual     25    82.5549    3.3022
Total        32   765.9764

            Coefficients  Std Error    t Stat    P-value  Lower 95%   Upper 95%
Intercept     164.5636      0.3163    520.2229   0.0000   163.9121    165.2151
W1             12.1268      0.9537     12.7151   0.0000    10.1625     14.0910
W2              4.5224      1.1627      3.8895   0.0007     2.1277      6.9170
W3              7.6160      1.8042      4.2213   0.0003     3.9002     11.3317
W4              4.9551      2.0768      2.3859   0.0249     0.6777      9.2324
W5             -3.5819      2.3249     -1.5407   0.1360    -8.3701      1.2063
W6              3.2972      3.3044      0.9978   0.3279    -3.5084     10.1028
W7              6.8268      3.7711      1.8103   0.0823    -0.9398     14.5934
Transforming Back to the Standardized X Scale

$$\hat{\boldsymbol{\beta}}_{(g)} = \mathbf{V}_{(g)}\hat{\boldsymbol{\gamma}}_{(g)}, \qquad \hat{V}\left\{\hat{\boldsymbol{\beta}}_{(g)}\right\} = s^2\,\mathbf{V}_{(g)}\mathbf{L}_{(g)}^{-1}\mathbf{V}_{(g)}'$$

where s² = 3.3022 is the residual mean square from the reduced fit.

gamma-hat(g): W1 12.1268, W2 4.5224, W3 7.6160, W4 4.9551, W5 -3.5819, W6 3.2972, W7 6.8268

beta-hat(g) and standard errors (square roots of the diagonal of V{beta-hat(g)}):

        beta-hat(g)   StdErr
X1*      12.1779      2.0639
X2*      -0.4583      2.0549
X3*       1.3113      2.3006
X4*       4.3866      2.8275
X5*       6.8020      1.7926
X6*       9.1146      1.8993
X7*       3.3197      2.4118
X8*       1.8268      1.4407
X9*       2.6829      1.9731

V{beta-hat(g)}:
 4.2598 -0.1779 -0.6883  1.0454 -0.8386 -0.0887 -1.8757 -0.4214  0.9289
-0.1779  4.2228  3.6089 -2.2379 -1.9307 -2.4561 -0.1330 -1.0423 -0.7562
-0.6883  3.6089  5.2928 -2.3318 -1.3892 -2.9496 -0.3347  1.1128 -2.2031
 1.0454 -2.2379 -2.3318  7.9948 -1.6401 -0.1911 -2.6329  0.1667  1.9223
-0.8386 -1.9307 -1.3892 -1.6401  3.2135  2.3480  1.4626  0.7180 -1.1223
-0.0887 -2.4561 -2.9496 -0.1911  2.3480  3.6074  0.1090 -0.1452  1.7520
-1.8757 -0.1330 -0.3347 -2.6329  1.4626  0.1090  5.8170 -0.1949 -1.7317
-0.4214 -1.0423  1.1128  0.1667  0.7180 -0.1452 -0.1949  2.0755 -1.2055
 0.9289 -0.7562 -2.2031  1.9223 -1.1223  1.7520 -1.7317 -1.2055  3.8931
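The back-transformation can be sketched in numpy (synthetic data and an arbitrary g = p - 1; not the authors' code). The identity X*·beta-hat(g) = W(g)·gamma-hat(g) confirms that the two coefficient vectors describe the same fit:

```python
import numpy as np

# Synthetic data: standardize, decompose, retain g components.
rng = np.random.default_rng(4)
n, p, g = 33, 4, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)
Xstar = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))

lam, V = np.linalg.eigh(Xstar.T @ Xstar)
lam, V = lam[::-1], V[:, ::-1]          # eigenvalues descending
Vg = V[:, :g]                           # p x g matrix of retained eigenvectors
Wg = Xstar @ Vg

# Regress y on [1|W(g)]; s^2 is the residual mean square
A = np.column_stack([np.ones(n), Wg])
coef, rss = np.linalg.lstsq(A, y, rcond=None)[:2]
gamma_g = coef[1:]
s2 = rss[0] / (n - g - 1)

# beta-hat(g) = V(g) gamma-hat(g);  V{beta-hat(g)} = s^2 V(g) L(g)^{-1} V(g)'
beta_g = Vg @ gamma_g
cov_beta = s2 * Vg @ np.diag(1 / lam[:g]) @ Vg.T
se_beta = np.sqrt(np.diag(cov_beta))
print(beta_g.shape, se_beta.shape)  # (4,) (4,)
```

Because W(g)'W(g) = L(g) is diagonal, the covariance of gamma-hat(g) is s²L(g)^{-1}, and the linear map V(g) carries it back to the X* scale.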
Comparison of Coefficients and SEs

               Original Model            Principal Components
               Coefficient  Std Error    beta-hat(g)   StdErr
Intercept       164.5636      0.3291       --            --
X1*              11.8900      2.3307      12.1779       2.0639
X2*               4.2752     39.4941      -0.4583       2.0549
X3*              -3.2845     35.5676       1.3113       2.3006
X4*               4.2764      2.9629       4.3866       2.8275
X5*              -9.8372     54.0398       6.8020       1.7926
X6*              25.5626     53.5337       9.1146       1.8993
X7*               3.3805      2.5166       3.3197       2.4118
X8*               6.3735     38.6215       1.8268       1.4407
X9*              -9.6391     40.0289       2.6829       1.9731

(The principal-components intercept is 164.5636 with standard error 0.3163, from the reduced fit.)
Predicted Values

Predicted values for the individual cases (or validation cases) can be obtained as follows:

$$\hat{\mathbf{Y}}_{PC} = \hat{\gamma}_0\mathbf{1} + \mathbf{W}_{(g)}\hat{\boldsymbol{\gamma}}_{(g)} = \bar{Y}\mathbf{1} + \mathbf{X}^{*}\hat{\boldsymbol{\beta}}_{(g)}$$

Applicant  Height   Yhat     Applicant  Height   Yhat     Applicant  Height   Yhat
     1      165.8   165.7         12     167.4   166.2         23     163.1   163.6
     2      169.8   171.2         13     159.2   157.7         24     165.8   165.8
     3      170.7   170.3         14     170.0   171.6         25     175.4   173.5
     4      170.9   167.2         15     166.3   167.5         26     159.8   162.9
     5      157.5   158.4         16     169.0   166.8         27     166.0   166.3
     6      165.9   167.6         17     156.2   157.3         28     161.2   164.1
     7      158.7   160.6         18     159.6   159.2         29     160.4   161.3
     8      166.0   164.3         19     155.0   153.9         30     164.3   166.0
     9      158.7   158.5         20     161.1   159.3         31     165.5   164.9
    10      161.5   161.1         21     170.3   167.4         32     167.2   166.1
    11      167.3   168.6         22     167.8   168.5         33     167.2   167.4
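A sketch of the prediction step in numpy (synthetic stand-in data; the slide's table uses the real measurements). Because the columns of W(g) are centered, the fitted intercept equals Ȳ, so the two forms of the prediction equation coincide:

```python
import numpy as np

# Synthetic data: standardize, retain g components, fit, predict.
rng = np.random.default_rng(5)
n, p, g = 33, 4, 3
X = rng.normal(size=(n, p))
y = 165 + X @ np.array([2.0, 1.0, 0.5, 1.5]) + rng.normal(size=n)
Xstar = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))

lam, V = np.linalg.eigh(Xstar.T @ Xstar)
Vg = V[:, ::-1][:, :g]                  # retained eigenvectors
Wg = Xstar @ Vg

A = np.column_stack([np.ones(n), Wg])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
beta_g = Vg @ coef[1:]                  # back-transformed coefficients

# Y-hat = gamma0 * 1 + W(g) gamma(g) = Ybar * 1 + X* beta(g)
yhat = coef[0] + Xstar @ beta_g
print(np.allclose(yhat, A @ coef))  # True
```

For a new (validation) case, the same formula applies after standardizing its measurements with the training means and standard deviations.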