Understanding Multiple Regression Analysis in Statistics


Explore the intricacies of multiple regression analysis with Professor William Greene of the Stern School of Business. Learn about hypothesis testing for individual coefficients, an application to the prices of Monet paintings, equivalent tests, and partial effects. Gain insight into testing sets of coefficients, R² as a fit measure, rejection regions, and more in this comprehensive overview.

  • Statistics
  • Regression Analysis
  • Hypothesis Testing
  • Multiple Regression
  • Data Analysis




Presentation Transcript


  1. Statistics and Data Analysis. Professor William Greene, Stern School of Business, IOMS Department and Department of Economics. 24-1/45 Part 24: Multiple Regression Part 4

  2. Statistics and Data Analysis. Part 24: Multiple Regression, 4.

  3. Hypothesis Tests in Multiple Regression. Simple regression: test β = 0. Testing individual coefficients in a multiple regression. R² as the fit measure in a multiple regression. Testing R² = 0. Testing sets of coefficients. Testing whether two groups have the same model.

  4. Regression Analysis. Investigate: is the coefficient in a regression model really nonzero? Testing procedure. Model: y = α + βx + ε. Hypothesis: H0: β = 0. Rejection region: the least squares coefficient is far from zero. Test: level for the test = 0.05 as usual. Compute t = b/StandardError. Reject H0 if |t| is above the critical value: 1.96 for a large sample, or the value from the t table for a small sample. Equivalently, reject H0 if the reported P value is less than the level. Degrees of freedom for the t statistic: N - 2.
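The testing procedure on this slide can be sketched in a few lines of Python. The data below are invented purely for illustration; the mechanics (least squares slope, its standard error, and the t ratio with N - 2 degrees of freedom) follow the slide.

```python
import math

# Hypothetical data for illustration only (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                    # least squares slope
a = ybar - b * xbar              # intercept
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - 2)   # residual variance, df = N - 2
se_b = math.sqrt(s2 / sxx)       # standard error of the slope

t = b / se_b                     # reject H0: beta = 0 if |t| > critical value
print(b, se_b, t)
```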

  5. Application: Monet Paintings. Does the size of the painting really explain the sale prices of Monet's paintings? Investigate: compute the regression. Hypothesis: the slope is actually zero. Rejection region: slope estimates that are very far from zero. The hypothesis that β = 0 is rejected.

  6. An Equivalent Test. Is there a relationship? H0: no correlation. Rejection region: large R². Test: F = (N - 2)R² / (1 - R²). Reject H0 if F > 4. Math result: F = t². Degrees of freedom for the F statistic: 1 and N - 2.
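A quick numeric check of the F = t² equivalence. The R² and sample size here are assumed values, not figures from the deck:

```python
import math

# Hypothetical simple-regression fit: R^2 = 0.25 from N = 102 observations.
N, R2 = 102, 0.25
F = (N - 2) * R2 / (1 - R2)   # F with 1 and N-2 degrees of freedom
t = math.sqrt(F)              # |t| for the slope in the same regression
print(F, t)                   # F is well above 4, so H0 is rejected
```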

  7. Partial Effects in a Multiple Regression. Hypothesis: if we include the signature effect, size does not explain the sale prices of Monet paintings. Test: compute the multiple regression; then H0: β1 = 0. Level for the test = 0.05 as usual. Rejection region: large value of b1 (the size coefficient). Test based on t = b1/StandardError. Degrees of freedom for the t statistic: N - 3 = N - number of predictors - 1.

    Regression Analysis: ln (US$) versus ln (SurfaceArea), Signed
    The regression equation is
    ln (US$) = 4.12 + 1.35 ln (SurfaceArea) + 1.26 Signed

    Predictor          Coef     SE Coef      T      P
    Constant         4.1222      0.5585   7.38  0.000
    ln (SurfaceArea) 1.3458     0.08151  16.51  0.000
    Signed           1.2618      0.1249  10.11  0.000

    S = 0.992509   R-Sq = 46.2%   R-Sq(adj) = 46.0%

    Reject H0.
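The t ratios in a printout like this can be reproduced directly from the reported coefficients and standard errors; a small sketch, with the figures copied from the slide:

```python
# t = Coef / SE Coef for each predictor in the Monet regression printout.
coefs = {
    "ln (SurfaceArea)": (1.3458, 0.08151),
    "Signed":           (1.2618, 0.1249),
}
for name, (b, se) in coefs.items():
    t = b / se
    print(name, round(t, 2))   # both t ratios are far above 2
```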

  8. Use individual t statistics: T = Coef / SE Coef. T > +2 or T < -2 suggests the variable is significant. T for LogPCMacs = +9.66. This is large.

  9. Women appear to assess health satisfaction differently from men.

  10. Or do they? Not when other things are held constant.

  11. (Figure only on this slide.)

  12. Confidence Interval for a Regression Coefficient. Coefficient on OwnRent: estimate = +0.040923, standard error = 0.007141. Confidence interval (large sample): 0.040923 ± 1.96 × 0.007141 = 0.040923 ± 0.013996 = 0.02693 to 0.05492. Form a confidence interval for the coefficient on SelfEmpl. (Left for the reader.)
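The interval arithmetic, redone with the slide's OwnRent figures:

```python
# 95% large-sample confidence interval: estimate +/- 1.96 * standard error.
b, se = 0.040923, 0.007141
half = 1.96 * se
lo, hi = b - half, b + half
print(round(lo, 5), round(hi, 5))   # about 0.02693 to 0.05492
```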

  13. Model Fit. How well does the model fit the data? R² measures fit: the larger the better. Time series: expect .9 or better. Cross sections: it depends. Social science data: .1 is good; industry or market data: .5 is routine. Use R² to compare models and find the right model.

  14. Dear Prof William, I hope you are doing great. I have got one of your presentations on Statistics and Data Analysis, particularly on regression modeling. There you said that an R squared value could come in around .2 and not be bad for large scale survey data. Currently, I am working on a large scale survey data set (1975 samples) and the R squared value came out as .30, which is low. So, I need to justify this. I thought to consider your presentation in this case. However, do you have any reference book which I can refer to while justifying the low R squared value of my findings? The purpose is a scientific article.

  15. Pretty Good Fit: R² = .722. Regression of Fuel Bill on Number of Rooms.

  16. A Huge Theorem. R² always goes up when you add variables to your model. Always.

  17. The Adjusted R Squared. Adjusted R² penalizes your model for obtaining its fit with lots of variables. Adjusted R² = 1 - [(N-1)/(N-K-1)] × (1 - R²). Adjusted R² is denoted R̄² ("R bar squared"). Adjusted R² is not the mean of anything and it is not a square. This is just a name.
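The formula is easy to check numerically. The values below are the Movie Madness figures that appear on the next slide (N = 2198 observations, K = 20 predictors, R² = 0.570):

```python
# Adjusted R^2 = 1 - [(N-1)/(N-K-1)] * (1 - R^2).
N, K, R2 = 2198, 20, 0.570
adj_r2 = 1 - (N - 1) / (N - K - 1) * (1 - R2)
print(round(adj_r2, 3))   # 0.566, matching R-Sq(adj) = 56.6%
```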

  18. The Adjusted R Squared.

    S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

    Analysis of Variance
    Source           DF       SS      MS       F      P
    Regression       20  2617.58  130.88  144.34  0.000
    Residual Error 2177  1974.01    0.91
    Total          2197  4591.58

  If N is very large, R² and adjusted R² will not differ by very much. 2198 is quite large for this purpose.

  19. Success Measure. Hypothesis: there is no regression. Equivalent hypothesis: R² = 0. How to test: for now, a rough rule: look for F > 2 for multiple regression. (The critical F was 4 for simple regression.) F = 144.34 for Movie Madness.

  20. Testing the Regression. Model: y = α + β1x1 + β2x2 + ... + βKxK + ε. Hypothesis: the x variables are not relevant to y. H0: β1 = 0 and β2 = 0 and ... βK = 0. H1: at least one coefficient is not zero. Set the level to 0.05 as usual. Rejection region: in principle, values of the coefficients that are far from zero; for purposes of the test, large R². The test is equivalent to a test of the hypothesis that R² = 0. Test procedure: compute F = [R²/K] / [(1 - R²)/(N-K-1)]. Reject H0 if F is large; the critical value depends on K and N-K-1 (see next page). Degrees of freedom for the F statistic: K and N-K-1. (F is not the square of any t statistic if K > 1.)
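A sketch of the procedure with the Movie Madness numbers (N = 2198, K = 20, R² = 0.570); the small gap from the printed F = 144.34 comes from rounding in the reported R²:

```python
# Overall F test of R^2 = 0: F = [R^2/K] / [(1-R^2)/(N-K-1)].
N, K, R2 = 2198, 20, 0.570
F = (R2 / K) / ((1 - R2) / (N - K - 1))
print(round(F, 1))   # about 144, far above the rough rule F > 2
```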

  21. The F Test for the Model. Determine the appropriate critical value from the table. Is the F from the computed model larger than the theoretical F from the table? Yes: conclude the relationship is significant. No: conclude R² = 0.

  22. (F table.) n1 = number of predictors; n2 = sample size - number of predictors - 1.

  23. Movie Madness Regression.

    S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

    Analysis of Variance
    Source           DF       SS      MS       F      P
    Regression       20  2617.58  130.88  144.34  0.000
    Residual Error 2177  1974.01    0.91
    Total          2197  4591.58

  24. Compare Sample F to Critical F. F = 144.34 for Movie Madness. The critical value from the table is 1.57. Reject the hypothesis of no relationship.

  25. An Equivalent Approach: What is the P Value? We observed an F of 144.34 (or whatever it is). If there really were no relationship, how likely is it that we would have observed an F this large (or larger)? This depends on N and K. The probability is reported with the regression results as the P value.

  26. The F Test.

    S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

    Analysis of Variance
    Source           DF       SS      MS       F      P
    Regression       20  2617.58  130.88  144.34  0.000
    Residual Error 2177  1974.01    0.91
    Total          2197  4591.58

  27. A Cost Function Regression. The regression is significant; F is huge. Which variables are significant? Which variables are not significant?

  28. What About a Group of Variables? Is Genre significant in the movie model? There are 12 genre variables. Some are significant (fantasy, mystery, horror); some are not. Can we conclude the group as a whole is? Maybe. We need a test.

  29. Theory for the Test. A larger model has a higher R² than a smaller one. (A larger model contains all the variables in the smaller one, plus some additional ones.) Compute this statistic with a calculator: F = [(R²Larger Model - R²Smaller Model) / number of added variables] / [(1 - R²Larger Model) / (N - K - 1 for the larger model)].

  30. Is Genre Significant? With the 12 Genre indicator variables: R² = 57.0%. Without the 12 Genre indicator variables: R² = 55.4%. The F statistic is F = [(0.570 - 0.554)/12] / [(1 - 0.570)/(2198 - 20 - 1)] = 6.750. Calc -> Probability Distributions -> F: the critical value shown by Minitab is 1.76. F is greater than the critical value. Reject the hypothesis that all the genre coefficients are zero.
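The genre calculation, redone in Python with the slide's values:

```python
# Partial F for the 12 genre dummies: does the group matter jointly?
R2_with, R2_without = 0.570, 0.554
J = 12                   # number of genre indicators being tested
N, K = 2198, 20          # sample size; predictors in the larger model
F = ((R2_with - R2_without) / J) / ((1 - R2_with) / (N - K - 1))
print(round(F, 3))       # 6.750 > critical value 1.76, so reject
```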

  31. Now What? If the value that Minitab shows you is less than your F statistic, then your F statistic is large, i.e., conclude that the group of coefficients is significant. This means that at least one is nonzero, not that all necessarily are.

  32. Application: Part of a Regression Model. The regression model includes variables x1, x2, ... (I am sure of these variables) and maybe variables z1, z2, ... (I am not sure of these). Model: y = α + β1x1 + β2x2 + γ1z1 + γ2z2 + ε. Hypothesis: γ1 = 0 and γ2 = 0. Strategy: start with the model including x1 and x2 and compute R². Then compute the new model that also includes z1 and z2. Rejection region: R² increases a lot.

  33. Test Statistic. Model 0 contains x1, x2, ... Model 1 contains x1, x2, ... and additional variables z1, z2, ... R²0 = the R² from Model 0; R²1 = the R² from Model 1. R²1 will always be greater than R²0. The test statistic is F = [(R²1 - R²0)/(number of z variables)] / [(1 - R²1)/(N - total number of variables - 1)]. The critical F comes from the table of F[KZ, N - KX - KZ - 1]. (Unfortunately, Minitab cannot do this kind of test automatically.)
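Since Minitab cannot run this test automatically, a small helper function is a natural workaround. This is a sketch; the function name and argument names are mine, not from the deck:

```python
def partial_f(r2_0, r2_1, n, k_x, k_z):
    """F statistic for H0: the k_z added z-variables all have zero coefficients.

    r2_0: R^2 of the model with only the k_x x-variables (Model 0)
    r2_1: R^2 after also adding the k_z z-variables (Model 1)
    Compare the result with the critical value of F[k_z, n - k_x - k_z - 1].
    """
    return ((r2_1 - r2_0) / k_z) / ((1 - r2_1) / (n - k_x - k_z - 1))
```

For example, `partial_f(0.93643, 0.96047, 52, 2, 3)` reproduces the gasoline-market statistic of about 9.32 computed a few slides later.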

  34. Gasoline Market

  35. Gasoline Market.

    Regression Analysis: logG versus logIncome, logPG
    The regression equation is
    logG = - 0.468 + 0.966 logIncome - 0.169 logPG

    Predictor       Coef   SE Coef      T      P
    Constant    -0.46772   0.08649  -5.41  0.000
    logIncome    0.96595   0.07529  12.83  0.000
    logPG       -0.16949   0.03865  -4.38  0.000

    S = 0.0614287   R-Sq = 93.6%   R-Sq(adj) = 93.4%

    Analysis of Variance
    Source          DF      SS      MS       F      P
    Regression       2  2.7237  1.3618  360.90  0.000
    Residual Error  49  0.1849  0.0038
    Total           51  2.9086

  R² = 2.7237/2.9086 = 0.93643

  36. Gasoline Market.

    Regression Analysis: logG versus logIncome, logPG, ...
    The regression equation is
    logG = - 0.558 + 1.29 logIncome - 0.0280 logPG - 0.156 logPNC
           + 0.029 logPUC - 0.183 logPPT

    Predictor       Coef   SE Coef      T      P
    Constant     -0.5579    0.5808  -0.96  0.342
    logIncome     1.2861    0.1457   8.83  0.000
    logPG       -0.02797   0.04338  -0.64  0.522
    logPNC       -0.1558    0.2100  -0.74  0.462
    logPUC        0.0285    0.1020   0.28  0.781
    logPPT       -0.1828    0.1191  -1.54  0.132

    S = 0.0499953   R-Sq = 96.0%   R-Sq(adj) = 95.6%

    Analysis of Variance
    Source          DF       SS       MS       F      P
    Regression       5  2.79360  0.55872  223.53  0.000
    Residual Error  46  0.11498  0.00250
    Total           51  2.90858

  Now, R² = 2.7936/2.90858 = 0.96047. Previously, R² = 2.7237/2.90858 = 0.93643.

  37. R² increased from 0.93643 to 0.96047 when the 3 variables were added to the model. The F statistic is [(0.96047 - 0.93643)/3] / [(1 - 0.96047)/(52 - 2 - 3 - 1)] = 9.32482.
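The same statistic can be built up from the two ANOVA tables instead of the rounded R² values; a quick check, using the printed sums of squares:

```python
# R^2 = SS(Regression) / SS(Total) for each model, then the partial F.
ss_total = 2.90858
r2_small = 2.7237 / ss_total    # two-variable model
r2_big = 2.79360 / ss_total     # after adding logPNC, logPUC, logPPT
F = ((r2_big - r2_small) / 3) / ((1 - r2_big) / (52 - 2 - 3 - 1))
print(round(F, 2))              # about 9.32
```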

  38. (F table.) n1 = number of predictors; n2 = sample size - number of predictors - 1.

  39. Improvement in R². R² increased from 0.93643 to 0.96047. The F statistic is [(0.96047 - 0.93643)/3] / [(1 - 0.96047)/(52 - 2 - 3 - 1)] = 9.32482. Inverse cumulative distribution function for the F distribution with 3 DF in the numerator and 46 DF in the denominator: P(X <= x) = 0.95 at x = 2.80684. The null hypothesis is rejected. Notice that none of the three individual variables is significant, but the three of them together are.

  40. Application. Health satisfaction depends on many factors: age, income, children, education, marital status. Do these factors figure differently in a model for women compared to one for men? Investigation: multiple regression. Null hypothesis: the regressions are the same. Rejection region: estimated regressions that are very different.

  41. Equal Regressions. Setting: two groups of observations (men/women, countries, two different periods, firms, etc.). Regression model: y = α + β1x1 + β2x2 + ... + ε. Hypothesis: the same model applies to both groups. Rejection region: large values of F.

  42. Procedure: Equal Regressions. There are N1 observations in Group 1 and N2 in Group 2. There are K variables plus the constant term in the model, so K+1 parameters per group. The test requires you to compute three regressions and retain the sum of squared residuals from each: SS1 = sum of squares from the N1 observations in group 1; SS2 = sum of squares from the N2 observations in group 2; SSALL = sum of squares from the NALL = N1+N2 observations when the two groups are pooled. Then F = [(SSALL - SS1 - SS2)/(K+1)] / [(SS1 + SS2)/(N1 + N2 - 2K - 2)]. The hypothesis of equal regressions is rejected if F is larger than the critical value from the F table (K+1 numerator and NALL - 2K - 2 denominator degrees of freedom).
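The three-regression procedure reduces to one line of arithmetic once the sums of squares are in hand. The numbers below are from the health-satisfaction application on the following slides (5 slope variables plus a constant, so 6 parameters per group):

```python
# Test of equal regressions: women (SS1), men (SS2), pooled (SSALL).
ss1, ss2, ss_all = 66677.66, 66705.75, 133585.3
n1, n2 = 13083, 14243
k = 5                    # slope variables; k + 1 = 6 parameters per group
p = k + 1
F = ((ss_all - ss1 - ss2) / p) / ((ss1 + ss2) / (n1 + n2 - 2 * p))
print(round(F, 4))       # about 6.89; compare with the F-table critical value
```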

  43. Health Satisfaction Models: Men vs. Women.

    Variable   Coefficient   Standard Error        T   P value    Mean of X
    Women [NW = 13083]
    Constant    7.05393353       .16608124    42.473   .0000     1.0000000
    AGE         -.03902304       .00205786   -18.963   .0000    44.4759612
    EDUC         .09171404       .01004869     9.127   .0000    10.8763811
    HHNINC       .57391631       .11685639     4.911   .0000     .34449514
    HHKIDS       .12048802       .04732176     2.546   .0109     .39157686
    MARRIED      .09769266       .04961634     1.969   .0490     .75150959
    Men [NM = 14243]
    Constant    7.75524549       .12282189    63.142   .0000     1.0000000
    AGE         -.04825978       .00186912   -25.820   .0000    42.6528119
    EDUC         .07298478       .00785826     9.288   .0000    11.7286996
    HHNINC       .73218094       .11046623     6.628   .0000     .35905406
    HHKIDS       .14868970       .04313251     3.447   .0006     .41297479
    MARRIED      .06171039       .05134870     1.202   .2294     .76514779
    Both [NALL = 27326]
    Constant    7.43623310       .09821909    75.711   .0000     1.0000000
    AGE         -.04440130       .00134963   -32.899   .0000    43.5256898
    EDUC         .08405505       .00609020    13.802   .0000    11.3206310
    HHNINC       .64217661       .08004124     8.023   .0000     .35208362
    HHKIDS       .12315329       .03153428     3.905   .0001     .40273000
    MARRIED      .07220008       .03511670     2.056   .0398     .75861817

  German survey data over 7 years, 1984 to 1991 (with a gap). 27,326 observations on Health Satisfaction and several covariates.

  44. Computing the F Statistic.

                                  Women          Men            All
    HEALTH  Mean               6.634172     6.924362       6.785662
    Standard deviation         2.329513     2.251479       2.293725
    Number of observs.            13083        14243          27326
    Model size: parameters            6            6              6
    Degrees of freedom            13077        14237          27320
    Residuals: sum of squares  66677.66     66705.75       133585.3
    Standard error of e        2.258063     2.164574       2.211256
    Fit: R-squared             0.060762     0.076033       0.070786
    Model test F (P value)  169.20(.000) 234.31(.000)  416.24(.0000)

  F = {[133,585.3 - (66,677.66 + 66,705.75)] / 6} / {(66,677.66 + 66,705.75) / (27,326 - 2 × 6)} = 6.8904. The critical value for F[6, 27314] is 2.0989. Even though the regressions look similar, the hypothesis of equal regressions is rejected.

  45. Summary. Simple regression: test β = 0. Testing individual coefficients in a multiple regression. R² as the fit measure in a multiple regression. Testing R² = 0. Testing sets of coefficients. Testing whether two groups have the same model.
