Regression Model Assumptions and Normality Testing in NBA Player Data

Explore a dataset of NBA players' heights and weights from the 2013/14 season, review the regression statistics, and check the linear regression model assumptions (normality of errors, constant variance, linearity) using graphical and numerical tests.

  • Regression
  • NBA players
  • Data analysis
  • Normality testing

Presentation Transcript


  1. Checking Regression Model Assumptions: NBA 2013/14 Player Heights and Weights

  2. Data Description / Model

     Heights (X) and Weights (Y) for 505 NBA players in the 2013/14 season. Other variables included in the dataset: Age, Position.

     Simple linear regression model: $Y = \beta_0 + \beta_1 X + \varepsilon$

     Model assumptions:
       • $\varepsilon \sim N(0, \sigma^2)$
       • Errors are independent
       • Error variance ($\sigma^2$) is constant
       • Relationship between Y and X is linear
       • No important (available) predictors have been omitted

  3. Weight (Y) vs Height (X) - 2013/2014 NBA Players

     [Figure: scatterplot of Weight (lbs), axis from 150 to 300, against Height (inches), axis from 65 to 90]

  4. Regression Model

     Regression Statistics
       Multiple R          0.821
       R Square            0.674
       Adjusted R Square   0.673
       Standard Error      15.237
       Observations        505

     ANOVA
       Source       df    SS       MS       F      Significance F
       Regression   1     240985   240985   1038   0.0000
       Residual     503   116782   232
       Total        504   357767

     Coefficients
       Term        Coefficient   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
       Intercept   -279.869      15.551           -17.997   0.0000    -310.423    -249.316
       Height      6.331         0.197            32.217    0.0000    5.945       6.717

     Fitted equation: $\hat{Y} = b_0 + b_1 X = -279.869 + 6.331 X$, with $s\{b_1\} = 0.197$

     Critical value: $t(0.975; 503) = 1.965$ (cdf-based notation), equivalently $t_{0.025; 503}$ (upper-tail notation)

     t-test for the slope: $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$, test statistic $t^* = b_1 / s\{b_1\} = 32.217$

     95% confidence interval for $\beta_1$: $6.331 \pm 1.965(0.197) \equiv (5.945, 6.717)$

     Total (corrected) sum of squares: $SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 357767$
     Regression sum of squares: $SSR = SS_{Reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 240985$, $df_{Reg} = 1$
     Error sum of squares: $SSE = SS_{Res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 116782$, $df_{Err} = 505 - 2 = 503$

     Mean squares: $MSR = MS_{Reg} = 240985 / 1 = 240985$, $MSE = MS_{Res} = 116782 / 503 = 232$

     F-test: $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$, test statistic $F^* = MSR / MSE = 240985 / 232 = 1038$

     Coefficient of determination: $r^2 = SSR / SSTO = 240985 / 357767 = 0.674$

     Residual variance: $s^2 = MSE = 232$, so $s = \sqrt{232} = 15.24$
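The summary quantities on this slide follow from the two sums of squares and the slope output. The sketch below recomputes them in Python (the slides themselves use R and a spreadsheet); the inputs are the rounded values printed on the slide, so the results agree only up to rounding.

```python
import math

# Values copied from the ANOVA table and coefficient output on the slide.
n = 505
ssr, sse = 240985.0, 116782.0
b1, se_b1 = 6.331, 0.197
t_crit = 1.965  # t(0.975; 503), as quoted on the slide

ssto = ssr + sse                 # total (corrected) sum of squares
mse = sse / (n - 2)              # error mean square
f_star = (ssr / 1) / mse         # F* = MSR / MSE
r_sq = ssr / ssto                # coefficient of determination
s = math.sqrt(mse)               # residual standard error
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"SSTO={ssto:.0f}  MSE={mse:.1f}  F*={f_star:.0f}  r^2={r_sq:.3f}  s={s:.3f}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval endpoints differ from the printed (5.945, 6.717) in the third decimal because the slide's values come from unrounded estimates.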

  5. Checking Normality of Errors

     Graphically:
       • Histogram: should be mound shaped around 0
       • Normal probability plot: residuals versus expected values under normality should follow a straight line
           - Rank residuals from smallest (large negative) to highest ($k = 1, \ldots, n$)
           - Compute the quantile for the ranked residual: $p = (k - 0.375)/(n + 0.25)$
           - Obtain the Z-score corresponding to the quantile: $z(p)$
           - Expected residual $= \sqrt{MSE} \cdot z(p)$
           - Plot ordered residuals versus expected residuals

     Numerical tests:
       • Correlation test: obtain the correlation between the ordered residuals and $z(p)$. Critical values for n up to 100 are provided by Looney and Gulledge (1985).
       • Shapiro-Wilk test: similar to the correlation test, with more complex calculations. Printed directly by statistical software packages.
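The expected-residual steps above can be sketched in a few lines of Python. The function name `expected_residuals` is an illustrative choice, and the inputs n = 505 and s = 15.237 are taken from the regression output for the NBA data.

```python
from statistics import NormalDist

def expected_residuals(n, s):
    """Expected ordered residuals under normality: s * z(p_k) for k = 1..n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    out = []
    for k in range(1, n + 1):
        p = (k - 0.375) / (n + 0.25)      # quantile for the k-th smallest residual
        out.append(s * std_normal.inv_cdf(p))
    return out

# n and s = sqrt(MSE) from the fitted model of Weight on Height.
expected = expected_residuals(505, 15.237)
print(round(expected[0], 2), round(expected[252], 2), round(expected[-1], 2))
```

Plotting the sorted residuals against this list gives the normal probability plot, and their correlation is the statistic used by the correlation test.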

  6. Normal Probability Plot / Correlation Test

     Extreme and Middle Residuals
       rank   quantile   z(p)*s     e
       1      0.0012     -46.115    -45.583
       2      0.0032     -41.519    -44.921
       3      0.0052     -39.045    -39.929
       4      0.0072     -37.306    -36.921
       5      0.0092     -35.949    -36.590
       251    0.4960     -0.151     -0.260
       252    0.4980     -0.076     -0.260
       253    0.5000     0.000      -0.260
       254    0.5020     0.076      -0.260
       255    0.5040     0.151      0.063
       501    0.9908     35.949     40.748
       502    0.9928     37.306     42.079
       503    0.9948     39.045     44.417
       504    0.9968     41.519     49.740
       505    0.9988     46.115     56.079

     [Figure: Normal Probability Plot of Residuals, Residual against Expected Value Under Normality]

     The correlation between the residuals and their expected values under normality is 0.9972.

     Based on the Shapiro-Wilk test in R, the P-value for H0: Errors are normal is P = .0859 (do not reject normality).

  7. Checking the Constant Variance Assumption

     Plot residuals versus X or predicted values:
       • Random cloud around 0 ⇒ linear relation
       • Funnel shape ⇒ non-constant variance
       • Outliers fall far above (positive) or below (negative) the general cloud pattern

     Plot absolute residuals, squared residuals, or square root of absolute residuals:
       • Positive association ⇒ non-constant variance

     Numerical tests:
       • Brown-Forsythe test: 2-sample t-test of absolute deviations from group medians
       • Breusch-Pagan test: regresses squared residuals on model predictors (X variables)

  8. Residuals vs Fitted Values

     [Figure: Residuals, axis from -60 to 60, against Fitted Values, axis from 150 to 300]

  9. Absolute Residuals vs Fitted Values

     [Figure: Absolute Residuals, axis from 0 to 60, against Fitted Values, axis from 140 to 280]

  10. Equal (Homogeneous) Variance - I

     Brown-Forsythe Test:
       $H_0: \sigma_i^2 = \sigma^2$ (equal variance among errors)
       $H_A$: unequal variance among errors (increasing or decreasing in X)

     1) Split the dataset into 2 groups based on levels of X (or fitted values) with sample sizes $n_1, n_2$
     2) Compute the median residual in each group: $\tilde{e}_1, \tilde{e}_2$
     3) Compute the absolute deviation from the group median for each residual: $d_{ij} = |e_{ij} - \tilde{e}_j|$, $i = 1, \ldots, n_j$, $j = 1, 2$
     4) Compute the mean and variance for each group of $d_{ij}$: $\bar{d}_1, s_1^2, \bar{d}_2, s_2^2$
     5) Compute the pooled variance: $s^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

     Test statistic: $t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$ under $H_0$

     Reject $H_0$ if $|t_{BF}| \geq t(1 - \alpha/2;\ n_1 + n_2 - 2)$
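Steps 1 to 5 translate directly into code. The following Python sketch uses only the standard library; the function name `brown_forsythe_t` and the two residual lists are hypothetical and are not taken from the NBA data.

```python
import math
from statistics import mean, median, variance

def brown_forsythe_t(resid_group1, resid_group2):
    """Two-sample t statistic on absolute deviations from the group medians."""
    med1, med2 = median(resid_group1), median(resid_group2)
    d1 = [abs(e - med1) for e in resid_group1]
    d2 = [abs(e - med2) for e in resid_group2]
    n1, n2 = len(d1), len(d2)
    # Pooled variance of the absolute deviations (sample variances, n - 1 divisor).
    s2 = ((n1 - 1) * variance(d1) + (n2 - 1) * variance(d2)) / (n1 + n2 - 2)
    t_bf = (mean(d1) - mean(d2)) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t_bf, n1 + n2 - 2  # statistic and its degrees of freedom

# Hypothetical residuals: the second group is visibly more spread out.
low_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
high_x = [-6.0, -3.0, 0.0, 3.0, 6.0]
t_bf, df = brown_forsythe_t(low_x, high_x)
print(f"t_BF = {t_bf:.3f} on {df} df")
```

A negative statistic here means the absolute deviations are larger in the second group, which would point to variance increasing with X.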

  11. Equal (Homogeneous) Variance - II

     Breusch-Pagan (aka Cook-Weisberg) Test:
       $H_0: \sigma_i^2 = \sigma^2$ (equal variance among errors)
       $H_A: \sigma_i^2 = \sigma^2 h(\gamma_1 X_{i1} + \cdots + \gamma_p X_{ip})$ (unequal variance among errors)

     1) Let $SSE = \sum_{i=1}^{n} e_i^2$ from the original regression
     2) Fit the regression of $e_i^2$ on $X_{i1}, \ldots, X_{ip}$ and obtain $SS(Reg^*)$

     Test statistic: $X^2_{BP} = \frac{SS(Reg^*)/2}{\left( \sum_{i=1}^{n} e_i^2 / n \right)^2} \sim \chi^2_p$ under $H_0$

     Reject $H_0$ if $X^2_{BP} \geq \chi^2(1 - \alpha;\ p)$, where p = # of predictors
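For a single predictor (p = 1) the two steps reduce to one simple regression of the squared residuals on X. The Python sketch below assumes that case; `breusch_pagan` is an illustrative name, the data are hypothetical, and the P-value line relies on the identity between a chi-square variable with 1 df and a squared standard normal, so it does not extend to p > 1.

```python
import math

def breusch_pagan(x, resid):
    """Breusch-Pagan statistic and P-value for one predictor (p = 1)."""
    n = len(x)
    e2 = [e * e for e in resid]
    sse = sum(e2)                                   # SSE of the original fit
    x_bar, e2_bar = sum(x) / n, sum(e2) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (ei - e2_bar) for xi, ei in zip(x, e2))
    ss_reg_star = sxy ** 2 / sxx                    # regression SS of e^2 on x
    x2_bp = (ss_reg_star / 2) / (sse / n) ** 2
    p_value = math.erfc(math.sqrt(x2_bp / 2))       # chi-square upper tail, df = 1 only
    return x2_bp, p_value

# Hypothetical predictor values and residuals whose spread grows with x.
x = [1, 2, 3, 4, 5, 6]
resid = [0.1, -0.2, 0.5, -0.9, 1.4, -2.0]
x2_bp, p_value = breusch_pagan(x, resid)
print(f"X2_BP = {x2_bp:.3f}, P = {p_value:.3f}")
```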

  12. Brown-Forsythe and Breusch-Pagan Tests

     Brown-Forsythe Test: Group 1: Heights ≤ 79, Group 2: Heights ≥ 80. H0: Equal Variances Among Errors (Reject H0)
     Breusch-Pagan Test: H0: Equal Variances Among Errors (Reject H0)

     Brown-Forsythe Test
       Group   Heights(Grp)   n(Grp)   Med(e|Grp)   Mean(d|Grp)   Var(d|Grp)
       1       69-79          252      -1.2673      10.8039       70.4186
       2       80-87          253      0.7482       12.9193       108.7256

       MeanDiff           -2.1155
       PooledVar          89.6102
       PooledSD           9.4663
       sqrt(1/n1+1/n2)    0.0890
       s{d1bar-d2bar}     0.8425
       t*(BF)             -2.5110
       t(.975,505-2)      1.9647
       P-value            0.0247 (as printed; the two-sided P-value for |t*| = 2.511 on 503 df is about 0.012, so H0 is rejected at the 0.05 level either way)

     Regression of Weight on Height: ANOVA
       Source       df    SS
       Regression   1     240984.7782
       Residual     503   116782.3109
       Total        504   357767.0891

     Regression of e^2 on Height: ANOVA
       Source       df    SS
       Regression   1     963633.2703
       Residual     503   67658845.93
       Total        504   68622479.2

     Breusch-Pagan Test
       SSE(Model1)     116782.311
       n               505
       SS(Reg*)        963633.270
       X2(BP):Num      481816.635
       X2(BP):Denom    53477.534
       X2(BP)          9.010
       Chisq(.95,1)    3.841
       P-value         0.003
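Both test statistics can be recomputed from the summary values on this slide. The Python sketch below does only that arithmetic; it does not refit either regression, and the chi-square P-value shortcut via `math.erfc` applies only because there is a single predictor (1 df).

```python
import math

# Brown-Forsythe summaries from the table: n, mean(d), var(d) for each height group.
n1, d1_bar, var1 = 252, 10.8039, 70.4186     # heights 69-79
n2, d2_bar, var2 = 253, 12.9193, 108.7256    # heights 80-87

pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_bf = (d1_bar - d2_bar) / se_diff

# Breusch-Pagan pieces from the two ANOVA tables.
sse, n, ss_reg_star = 116782.311, 505, 963633.270
x2_bp = (ss_reg_star / 2) / (sse / n) ** 2
p_bp = math.erfc(math.sqrt(x2_bp / 2))       # chi-square upper tail, valid for df = 1

print(f"pooled var = {pooled_var:.4f}, t_BF = {t_bf:.4f}")
print(f"X2_BP = {x2_bp:.3f}, P = {p_bp:.3f}")
```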

  13. Linearity of Regression - F-Test for Lack-of-Fit ($n_j$ observations at c distinct levels of X)

     $H_0: E\{Y_i\} = \beta_0 + \beta_1 X_i$ vs $H_A: E\{Y_i\} = \mu_i \neq \beta_0 + \beta_1 X_i$

     Compute the fitted value $\hat{Y}_j$ and sample mean $\bar{Y}_j$ for each distinct X level.

     Lack-of-fit: $SS(LF) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2$, $df_{LF} = c - 2$
     Pure error: $SS(PE) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2$, $df_{PE} = n - c$

     Test statistic: $F_{LOF} = \frac{SS(LF)/(c - 2)}{SS(PE)/(n - c)} = \frac{MS(LF)}{MS(PE)} \sim F_{c-2,\ n-c}$ under $H_0$

     Reject $H_0$ if $F_{LOF} \geq F(1 - \alpha;\ c - 2,\ n - c)$

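  14. Linearity of Regression

     Full model: $H_A: E\{Y_{ij}\} = \mu_j$, with $\hat{Y}_{ij} = \hat{\mu}_j = \bar{Y}_j$
       $SSE(F) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 = SS(PE)$, $df_F = n - c$ (c means (parameters) are estimated)

     Reduced model: $H_0: E\{Y_{ij}\} = \beta_0 + \beta_1 X_j$, with $\hat{Y}_{ij} = \hat{Y}_j = b_0 + b_1 X_j$
       $SSE(R) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 = SSE$, $df_R = n - 2$ (2 parameters are estimated)

     Decomposition of SSE:
       $\sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left[ (Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - \hat{Y}_j) \right]^2$
       $= \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 + 2 \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)(\bar{Y}_j - \hat{Y}_j)$
       $= \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 + 2 \sum_{j=1}^{c} (\bar{Y}_j - \hat{Y}_j) \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)$
       $= \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 + 0$

     The cross-product term vanishes because deviations from a group mean sum to zero within each group, so $SSE = SS(PE) + SS(LF)$.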

  15. Test statistic:

     $F_{LOF} = \frac{[SSE(R) - SSE(F)] / (df_R - df_F)}{SSE(F) / df_F} = \frac{[SSE - SS(PE)] / [(n - 2) - (n - c)]}{SS(PE) / (n - c)} = \frac{SS(LF)/(c - 2)}{SS(PE)/(n - c)} = \frac{MS(LF)}{MS(PE)} \sim F_{c-2,\ n-c}$ under $H_0$

     Reject $H_0$ if $F_{LOF} \geq F(1 - \alpha;\ c - 2,\ n - c)$

     Computing Strategy:
     1) For each group (j), compute:
          $\bar{Y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}$
          $s_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2$ if $n_j > 1$, and 0 otherwise
          $\hat{Y}_j = b_0 + b_1 X_j$
     2) $SS(LF) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 = \sum_{j=1}^{c} n_j (\bar{Y}_j - \hat{Y}_j)^2$
     3) $SS(PE) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 = \sum_{j=1}^{c} (n_j - 1) s_j^2$

  16. Height and Weight Data: n=505, c=18 Groups

       Height   n     Mean     SD      Y-hat    SSLF       SSPE       SSE
       69       2     182.50   3.54    156.95   1305.39    12.50      1317.89
       71       4     175.75   15.52   169.61   150.62     722.75     873.37
       72       13    181.00   13.00   175.94   332.27     2028.00    2360.27
       73       16    186.13   12.09   182.28   237.15     2191.75    2428.90
       74       21    183.33   9.26    188.61   583.79     1716.67    2300.45
       75       41    193.71   11.58   194.94   61.96      5360.49    5422.44
       76       32    200.84   11.96   201.27   5.74       4434.22    4439.96
       77       31    204.13   10.70   207.60   373.06     3433.48    3806.55
       78       43    211.00   12.83   213.93   368.86     6912.00    7280.86
       79       49    221.35   18.70   220.26   57.94      16781.10   16839.04
       80       46    227.33   15.13   226.59   24.90      10300.11   10325.01
       81       67    232.49   19.63   232.92   12.30      25430.75   25443.05
       82       53    241.49   14.79   239.25   265.64     11369.25   11634.88
       83       44    245.66   17.55   245.58   0.26       13241.89   13242.14
       84       34    254.62   14.70   251.91   248.66     7128.03    7376.69
       85       7     247.86   10.75   258.24   755.21     692.86     1448.07
       86       1     278.00   0.00    264.57   180.24     0.00       180.24
       87       1     263.00   0.00    270.91   62.50      0.00       62.50
       Sum      505   #N/A     #N/A    #N/A     5026.479   111755.8   116782.3

       Source      df    SS         MS      F(LOF)   F(.95)   P-value
       LackFit     16    5026.5     314.2   1.369    1.664    0.1521
       PureError   487   111755.8   229.5

     Do not reject $H_0: \mu_j = \beta_0 + \beta_1 X_j$
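The computing strategy for the lack-of-fit test can be applied to the group summaries in this table. The Python sketch below uses the rounded n, Mean, SD and Y-hat columns as inputs, so its sums of squares match the printed SSLF and SSPE totals only approximately.

```python
# (height, n_j, mean_j, sd_j, fitted_j) for the c = 18 distinct heights in the table.
groups = [
    (69, 2, 182.50, 3.54, 156.95), (71, 4, 175.75, 15.52, 169.61),
    (72, 13, 181.00, 13.00, 175.94), (73, 16, 186.13, 12.09, 182.28),
    (74, 21, 183.33, 9.26, 188.61), (75, 41, 193.71, 11.58, 194.94),
    (76, 32, 200.84, 11.96, 201.27), (77, 31, 204.13, 10.70, 207.60),
    (78, 43, 211.00, 12.83, 213.93), (79, 49, 221.35, 18.70, 220.26),
    (80, 46, 227.33, 15.13, 226.59), (81, 67, 232.49, 19.63, 232.92),
    (82, 53, 241.49, 14.79, 239.25), (83, 44, 245.66, 17.55, 245.58),
    (84, 34, 254.62, 14.70, 251.91), (85, 7, 247.86, 10.75, 258.24),
    (86, 1, 278.00, 0.00, 264.57), (87, 1, 263.00, 0.00, 270.91),
]

n = sum(nj for _, nj, _, _, _ in groups)   # total sample size
c = len(groups)                            # number of distinct X levels

# Lack-of-fit: n_j * (group mean - fitted value)^2, summed over groups.
ss_lf = sum(nj * (ybar - yhat) ** 2 for _, nj, ybar, _, yhat in groups)
# Pure error: (n_j - 1) * s_j^2, summed over groups (0 for singleton groups).
ss_pe = sum((nj - 1) * sd ** 2 for _, nj, _, sd, _ in groups)

f_lof = (ss_lf / (c - 2)) / (ss_pe / (n - c))
print(f"SS(LF) = {ss_lf:.1f}, SS(PE) = {ss_pe:.1f}, F_LOF = {f_lof:.3f}")
```

Comparing `f_lof` with the tabled critical value F(.95; 16, 487) = 1.664 gives the same decision as the slide.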

  17. Box-Cox Transformations

     Automatically selects a transformation from the power family with the goal of obtaining normality, linearity, and constant variance (not always successful, but widely used).

     Goal: fit the model $Y' = \beta_0 + \beta_1 X + \varepsilon$ for various power transformations $Y'$ of Y, selecting the transformation producing minimum SSE (maximum likelihood).

     Procedure: over a range of $\lambda$ from, say, -2 to +2, obtain $W_i$ and regress $W_i$ on X (assuming all $Y_i > 0$, although adding a constant won't affect the shape or spread of the Y distribution):

       $W_i = K_1 (Y_i^{\lambda} - 1)$ if $\lambda \neq 0$
       $W_i = K_2 \ln(Y_i)$ if $\lambda = 0$

       where $K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n}$ and $K_1 = \frac{1}{\lambda K_2^{\lambda - 1}}$
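The grid search can be sketched as follows in Python, using the standardized transformation with $K_2$ as the geometric mean of Y. The helper names and the toy data are hypothetical; the data are built to be exactly log-linear so that the search should pick $\lambda = 0$.

```python
import math

def boxcox_w(y, lam):
    """Standardized Box-Cox transform W_i (all y must be positive)."""
    n = len(y)
    k2 = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean of Y
    if lam == 0:
        return [k2 * math.log(v) for v in y]
    k1 = 1.0 / (lam * k2 ** (lam - 1))
    return [k1 * (v ** lam - 1) for v in y]

def sse_simple_regression(x, w):
    """SSE from the least-squares line of w on x."""
    n = len(x)
    x_bar, w_bar = sum(x) / n, sum(w) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxw = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
    sww = sum((wi - w_bar) ** 2 for wi in w)
    return sww - sxw ** 2 / sxx

# Hypothetical data with an exactly log-linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]
lambdas = [k / 2 for k in range(-4, 5)]              # -2.0, -1.5, ..., 2.0
best = min(lambdas, key=lambda lam: sse_simple_regression(x, boxcox_w(y, lam)))
print("lambda with smallest SSE:", best)
```

Scaling by $K_1$ and $K_2$ is what makes the SSE values comparable across different $\lambda$; without it, each transformation would put W on a different scale.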

  18. Box-Cox Transformation Obtained in R

     Maximum occurs near $\lambda = 0$ (interval contains 0) ⇒ try taking logs of Weight

  19. Results of Tests (Using R Functions) on ln(WT)

     Normality of Errors (Shapiro-Wilk Test)

       > nba.mod2 <- lm(log(Weight) ~ Height)
       > summary(nba.mod2)
       Call:
       lm(formula = log(Weight) ~ Height)
       Coefficients:
                     Est      Std. Error   t value   Pr(>|t|)
       (Intercept)   3.0781   0.0696       44.20     <2e-16
       Height        0.0292   0.0009       33.22     <2e-16
       Residual standard error: 0.06823 on 503 degrees of freedom
       Multiple R-squared: 0.6869, Adjusted R-squared: 0.6863
       F-statistic: 1104 on 1 and 503 DF, p-value: < 2.2e-16

       > shapiro.test(e2)
       Shapiro-Wilk normality test
       data: e2
       W = 0.9976, p-value = 0.679

     Constant Error Variance (Breusch-Pagan Test)

       > bptest(log(Weight) ~ Height,studentize=FALSE)
       Breusch-Pagan test
       data: log(Weight) ~ Height
       BP = 0.4711, df = 1, p-value = 0.4925

     Linearity of Regression (Lack of Fit Test)

       > nba.mod3 <- lm(log(Weight) ~ factor(Height))
       > anova(nba.mod2,nba.mod3)
       Analysis of Variance Table
       Model 1: log(Weight) ~ Height
       Model 2: log(Weight) ~ factor(Height)
         Res.Df   RSS      Df   Sum of Sq   F       Pr(>F)
       1 503      2.3414
       2 487      2.2478   16   0.093642    1.268   0.2131

     Model fits well on all assumptions
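Two numbers in this output can be checked or interpreted by hand. The Python sketch below recomputes the lack-of-fit F statistic from the printed ANOVA table, and converts the slope on the log scale into a percentage change in weight per inch of height; that interpretation is a standard property of log-linear models and is not stated on the slide.

```python
import math

# Residual sums of squares and df from the anova(nba.mod2, nba.mod3) output.
rss_reduced, df_reduced = 2.3414, 503   # log(Weight) ~ Height
rss_full, df_full = 2.2478, 487         # log(Weight) ~ factor(Height)

ss_lack_of_fit = 0.093642               # "Sum of Sq" column as printed
f_lof = (ss_lack_of_fit / (df_reduced - df_full)) / (rss_full / df_full)

# Slope of log(Weight) on Height, read as a multiplicative effect per inch.
b1 = 0.0292
pct_per_inch = (math.exp(b1) - 1) * 100

print(f"F_LOF = {f_lof:.3f}")
print(f"about {pct_per_inch:.1f}% heavier per extra inch of height")
```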
