Multiple Regression Analysis Overview


Explore multiple regression analysis, a statistical technique used to model relationships between a numeric response variable and multiple predictor variables. Understand the key concepts, assumptions, model building, and interpretation with a practical example on energy consumption of hotels in Lagos.

  • Regression Analysis
  • Statistical Modeling
  • Model Assumptions
  • Data Analysis
  • Predictor Variables


Presentation Transcript


  1. Section 9.2 Multiple Regression

  2. Multiple Regression
     Numeric response variable (Y); p numeric predictor variables (p < n).
     Model: $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$
     Partial regression coefficients: $\beta_j$ is the effect (on the mean response) of increasing the jth predictor variable by 1 unit, holding all other predictors constant.
     Model assumptions (involving the error terms $\varepsilon$):
       • Normally distributed with mean 0
       • Constant variance $\sigma^2$
       • Independent (problematic when data are series in time/space)

  3. Example: Energy Consumption of Hotels in Lagos
     Units: n = 28 hotels in Lagos, Nigeria
     Response: Annual energy consumption (1000s of megawatt hours)
     Predictors (p = 5 explanatory variables):
       • Hotel rating in stars (X1: 2, 3, 4, 5)
       • Floor area (X2, in 1000s of square meters)
       • Age (X3, in years)
       • Equivalent guest rooms (X4 = #Rooms × Occupancy Rate)
       • Employees (X5, in 100s)
     Source: P.O. Oluseyi, et al (2016). Assessment of Energy Consumption and Carbon Footprint from the Hotel Sector Within Lagos, Nigeria, Energy and Buildings, Vol. 116, pp. 106-113.

  4. Least Squares Estimation
     Population model for the mean response: $E\{Y\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
     Least squares fitted (predicted) equation, minimizing SSE:
     $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p \qquad SSE = \sum (Y - \hat{Y})^2$
     Statistical software packages/spreadsheets can compute least squares estimates and their standard errors.
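
As a rough illustration of what the software does, the sketch below solves the normal equations by hand and compares the result with `lm()`. The data are simulated and the variable names (`x1`, `x2`, `y`) are hypothetical; none of this comes from the hotel data set.

```r
# Simulated data (hypothetical; not the hotel data set).
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a leading column of 1s for the intercept.
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) b = X'y for the least squares estimates.
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Fitted values and the error sum of squares that the estimates minimize.
y_hat <- drop(X %*% beta_hat)
sse   <- sum((y - y_hat)^2)

# lm() should reproduce the same estimates up to rounding error.
fit      <- lm(y ~ x1 + x2)
max_diff <- max(abs(beta_hat - coef(fit)))
```

In practice `lm()` uses a QR decomposition rather than the normal equations, which is numerically safer; the hand computation is only meant to show what "minimizing SSE" amounts to.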

  5. Analysis of Variance
     Direct extension of the ANOVA based on simple linear regression. The only adjustments are to the degrees of freedom: DFR = p, DFE = n-(p+1).

     Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square            F
     Model                 SSR              p                    MSR = SSR/p            F = MSR/MSE
     Error                 SSE              n-(p+1)              MSE = SSE/(n-(p+1))
     Total                 TSS              n-1

     $R^2 = \frac{SSR}{TSS} = \frac{TSS - SSE}{TSS}$
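
The ANOVA quantities can be computed directly from a fitted model. The following is a minimal sketch on simulated data (the predictors `x1`, `x2`, `x3` and the response `y` are hypothetical), checking the hand-computed values against what `summary()` reports.

```r
# Simulated data (hypothetical) with p = 3 predictors.
set.seed(2)
n  <- 40
p  <- 3
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# Sums of squares.
tss <- sum((y - mean(y))^2)   # total
sse <- sum(resid(fit)^2)      # error
ssr <- tss - sse              # model (regression)

# Degrees of freedom and mean squares.
df_r <- p
df_e <- n - (p + 1)
msr  <- ssr / df_r
mse  <- sse / df_e

# F statistic and coefficient of determination.
f_obs <- msr / mse
r2    <- ssr / tss
```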

  6. Testing for the Overall Model - F-test
     Tests whether any of the explanatory variables are associated with the response.
     $H_0: \beta_1 = \cdots = \beta_p = 0$ (none of the Xs is associated with Y)
     $H_A$: Not all $\beta_j = 0$
     T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{R^2 / p}{(1 - R^2) / (n - (p+1))}$
     R.R.: $F_{obs} \ge F_{\alpha,\, p,\, n-(p+1)}$
     P-val: $P(F_{p,\, n-(p+1)} \ge F_{obs})$

  7. Example: Energy Consumption of Hotels in Lagos
     n = 28, p = 5, $R^2 = 0.7382$
     $H_0: \beta_1 = \cdots = \beta_5 = 0$
     $H_A$: Not all $\beta_j = 0$
     T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{R^2 / p}{(1 - R^2) / (n - (p+1))} = \frac{0.7382 / 5}{(1 - 0.7382) / (28 - (5+1))} = \frac{0.1476}{0.0119} = 12.4$
     R.R.: $F_{obs} \ge F_{.05,5,22} = 2.66$
     P-val: $P(F_{5,22} \ge 12.4) < 0.0001$
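
The arithmetic on this slide can be reproduced from the three reported summary numbers (n, p and R²) alone. The sketch below uses base R's `qf()` and `pf()`; the variable names are illustrative.

```r
# Summary values reported on the slide.
n  <- 28
p  <- 5
r2 <- 0.7382

# Error degrees of freedom.
df_e <- n - (p + 1)

# Overall F statistic written in terms of R-squared.
f_obs <- (r2 / p) / ((1 - r2) / df_e)

# Rejection-region cutoff at alpha = 0.05 and the upper-tail P-value.
f_crit <- qf(0.95, df1 = p, df2 = df_e)
p_val  <- pf(f_obs, df1 = p, df2 = df_e, lower.tail = FALSE)

round(c(f_obs = f_obs, f_crit = f_crit), 2)
```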

  8. Testing Individual Partial Coefficients - t-tests
     Wish to determine whether the response is associated with a single explanatory variable, after controlling for the others.
     $H_0: \beta_j = 0$
     $H_A: \beta_j \ne 0$ (2-sided alternative)
     T.S.: $t_{obs} = \frac{\hat{\beta}_j}{\widehat{SE}\{\hat{\beta}_j\}}$
     R.R.: $|t_{obs}| \ge t_{\alpha/2,\, n-(p+1)}$
     P-val: $2 P(t_{n-(p+1)} \ge |t_{obs}|)$

  9. Example: Energy Consumption of Hotels in Lagos

     Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
     (Intercept) -3.09648    2.81547  -1.100   0.2833
     stars        1.45974    0.96562   1.512   0.1448
     flr.area     0.14777    0.06011   2.458   0.0223 *
     age          0.02115    0.03932   0.538   0.5961
     equiv.gr    -0.02885    0.01573  -1.834   0.0802 .
     employees    1.23698    0.47758   2.590   0.0167 *
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Residual standard error: 3.629 on 22 degrees of freedom
     Multiple R-squared:  0.7382, Adjusted R-squared:  0.6786
     F-statistic:  12.4 on 5 and 22 DF,  p-value: 8.474e-06
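
The t values and P-values in this output follow from the estimates and standard errors alone. As a sketch, the code below recomputes them from the numbers printed in the coefficient table (typed in by hand here, so it does not need the original data) and applies the two-sided rejection rule at α = 0.05.

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(intercept = -3.09648, stars = 1.45974, flr.area = 0.14777,
         age = 0.02115, equiv.gr = -0.02885, employees = 1.23698)
se  <- c(2.81547, 0.96562, 0.06011, 0.03932, 0.01573, 0.47758)

# Error degrees of freedom: n - (p + 1) = 28 - 6.
df_e <- 22

# Test statistics and two-sided P-values.
t_obs <- est / se
p_val <- 2 * pt(abs(t_obs), df = df_e, lower.tail = FALSE)

# Rejection region: |t_obs| >= t_{alpha/2, n-(p+1)}.
t_crit <- qt(0.975, df = df_e)
reject <- abs(t_obs) >= t_crit

round(cbind(t_obs, p_val), 4)
```

Only floor area and employees fall in the rejection region, matching the `*` flags in the output.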

  10. Comparing Regression Models
      Conflicting goals: explaining variation in Y while keeping the model as simple as possible (parsimony).
      We can test whether a subset of p-g predictors (including possibly cross-product terms) can be dropped from a model that contains the remaining g predictors.
      $H_0: \beta_{g+1} = \cdots = \beta_p = 0$
      Complete model: contains all predictors.
      Reduced model: eliminates the predictors from $H_0$.
      Fit both models, obtaining sums of squares for each (or $R^2$ from each):
        Complete: $SSR_c$, $SSE_c$ ($R_c^2$)
        Reduced: $SSR_r$, $SSE_r$ ($R_r^2$)

  11. Comparing Regression Models
      $H_0: \beta_{g+1} = \cdots = \beta_p = 0$ (after removing the effects of $X_1, \ldots, X_g$, none of the other predictors is associated with Y)
      $H_A$: $H_0$ is false
      TS: $F_{obs} = \frac{(SSE_r - SSE_c) / (p - g)}{SSE_c / (n - (p+1))} = \frac{(R_c^2 - R_r^2) / (p - g)}{(1 - R_c^2) / (n - (p+1))}$
      RR: $F_{obs} \ge F_{\alpha,\, p-g,\, n-(p+1)}$
      P-value: $P(F_{p-g,\, n-(p+1)} \ge F_{obs})$, based on the F-distribution with p-g and n-(p+1) d.f.

  12. Example: Energy Consumption of Hotels in Lagos
      Test whether Electricity Consumption is associated with stars, age, and/or equivalent guest rooms, after controlling for floor area and employees.
      $H_0: \beta_1 = \beta_3 = \beta_4 = 0$
      $H_A$: $H_0$ is false
      n = 28, p = 5, g = 2, p - g = 3, n - (p+1) = 22
      Complete model: $SSE_c = 289.72$, $R_c^2 = .7382$, $dfE_c = 28 - (5+1) = 22$
      Reduced model: $SSE_r = 354.75$, $R_r^2 = .6794$, $dfE_r = 28 - (2+1) = 25$
      TS: $F_{obs} = \frac{(354.75 - 289.72) / (5 - 2)}{289.72 / (28 - (5+1))} = \frac{21.677}{13.169} = 1.646$
      RR: $F_{obs} \ge F_{.05,3,22} = 3.049$
      P-value: $P(F_{3,22} \ge 1.646) = .2076$

  13. R Program / Partial Output

      mod1 <- lm(year.EC ~ stars + flr.area + age + equiv.gr + employees)
      summary(mod1)
      anova(mod1)
      mod2 <- lm(year.EC ~ flr.area + employees)
      summary(mod2)
      anova(mod2)
      anova(mod2, mod1)

      > anova(mod2, mod1)
      Analysis of Variance Table
      Model 1: year.EC ~ flr.area + employees
      Model 2: year.EC ~ stars + flr.area + age + equiv.gr + employees
        Res.Df    RSS Df Sum of Sq     F Pr(>F)
      1     25 354.75
      2     22 289.72  3     65.03 1.646 0.2076
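
The complete-versus-reduced F-test can also be checked by hand from the two error sums of squares. The sketch below plugs in the values reported above (it does not refit the models, so the variable names are illustrative only) and should agree with the `anova(mod2, mod1)` line.

```r
# Values reported for the Lagos hotel example.
n     <- 28
p     <- 5        # predictors in the complete model
g     <- 2        # predictors kept in the reduced model
sse_c <- 289.72   # complete model error sum of squares
sse_r <- 354.75   # reduced model error sum of squares

# Numerator and denominator degrees of freedom.
df_num <- p - g
df_den <- n - (p + 1)

# Partial F statistic, critical value at alpha = 0.05, and P-value.
f_obs  <- ((sse_r - sse_c) / df_num) / (sse_c / df_den)
f_crit <- qf(0.95, df_num, df_den)
p_val  <- pf(f_obs, df_num, df_den, lower.tail = FALSE)

round(c(f_obs = f_obs, f_crit = f_crit, p_val = p_val), 4)
```

Since the statistic falls well below the critical value, the three extra predictors can be dropped at the 0.05 level, which is the conclusion the slide reaches.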

  14. Models with Dummy Variables
      Some models have both numeric and categorical explanatory variables.
      If a categorical variable has m levels, we need to create m-1 dummy variables that take the value 1 if the level of interest is present, 0 otherwise.
      The baseline level of the categorical variable is the one for which all m-1 dummy variables are set to 0.
      The regression coefficient corresponding to a dummy variable is the difference between the mean for that level and the mean for the baseline group, controlling for all numeric predictors.
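
To make the m-1 coding concrete, here is a small sketch on simulated data with a hypothetical three-level factor `group` and one numeric predictor `x`. With R's default treatment contrasts, `model.matrix()` shows the two dummy columns that `lm()` builds behind the scenes, with the first level ("A") as the baseline.

```r
# Simulated data (hypothetical): a 3-level factor and one numeric predictor.
set.seed(3)
group <- factor(rep(c("A", "B", "C"), each = 10))
x     <- rnorm(30)
y     <- 5 + 2 * (group == "B") - 1 * (group == "C") + 0.8 * x +
         rnorm(30, sd = 0.5)

# Design matrix: m - 1 = 2 dummy columns; level "A" is the baseline.
X <- model.matrix(~ group + x)
colnames(X)   # "(Intercept)" "groupB" "groupC" "x"

# Each dummy coefficient estimates the difference in mean response between
# that level and the baseline, holding x fixed.
fit <- lm(y ~ group + x)
coef(fit)
```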

  15. Example - Deep Cervical Infections
      Subjects - Patients with deep neck infections
      Response (Y) - Length of stay in hospital
      Predictors (one numeric, 10 dichotomous):
        • Age (x1)
        • Gender (x2 = 1 if female, 0 if male)
        • Fever (x3 = 1 if body temp > 38°C, 0 if not)
        • Neck swelling (x4 = 1 if present, 0 if absent)
        • Neck pain (x5 = 1 if present, 0 if absent)
        • Trismus (x6 = 1 if present, 0 if absent)
        • Underlying disease (x7 = 1 if present, 0 if absent)
        • Respiration difficulty (x8 = 1 if present, 0 if absent)
        • Complication (x9 = 1 if present, 0 if absent)
        • WBC > 15000/mm³ (x10 = 1 if present, 0 if absent)
        • CRP > 100 μg/ml (x11 = 1 if present, 0 if absent)
      Source: Wang, et al (2003)

  16. Example - Weather and Spinal Patients
      Subjects - Visitors to National Spinal Network in 23 cities completing the SF-36 form
      Response - Physical Function subscale (1 of 10 reported)
      Predictors:
        • Patient's age (x1)
        • Gender (x2 = 1 if female, 0 if male)
        • High temperature on day of visit (x3)
        • Low temperature on day of visit (x4)
        • Dew point (x5)
        • Wet bulb (x6)
        • Total precipitation (x7)
        • Barometric pressure (x8)
        • Length of sunlight (x9)
        • Moon phase (new, wax crescent, 1st Qtr, wax gibbous, full moon, wan gibbous, last Qtr, wan crescent => 8-1 = 7 dummy variables)
      Source: Glaser, et al (2004). Weather Conditions and Spinal Patients, Spine, Vol. 29, #12, pp. 1369-1373.

  17. Modeling Interactions
      Statistical interaction: when the effect of one predictor (on the response) depends on the level of other predictors.
      Can be modeled (and thus tested) with cross-product terms (case of 2 predictors):
      $E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
      $X_2 = 0$: $E(Y) = \beta_0 + \beta_1 X_1$
      $X_2 = 10$: $E(Y) = \beta_0 + \beta_1 X_1 + 10\beta_2 + 10\beta_3 X_1 = (\beta_0 + 10\beta_2) + (\beta_1 + 10\beta_3) X_1$
      The effect of increasing $X_1$ by 1 on E(Y) depends on the level of $X_2$, unless $\beta_3 = 0$ (t-test).
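
A short sketch of the same idea in R, on simulated data (the true coefficients and the variable names are hypothetical): fit a model with a cross-product term, then compute the slope with respect to X1 at X2 = 0 and X2 = 10. The helper `slope_x1()` is introduced here only for illustration.

```r
# Simulated data (hypothetical) with a true interaction of 0.3.
set.seed(4)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 1 + 2 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rnorm(n)

# x1:x2 adds the cross-product term to the model.
fit <- lm(y ~ x1 + x2 + x1:x2)
b   <- coef(fit)

# Estimated effect of a 1-unit increase in x1 at a given level of x2:
# beta1 + beta3 * x2.
slope_x1 <- function(x2_level) unname(b["x1"] + b["x1:x2"] * x2_level)

slope_x1(0)    # slope when x2 = 0
slope_x1(10)   # slope when x2 = 10

# t-test of H0: beta3 = 0 (no interaction).
summary(fit)$coefficients["x1:x2", ]
```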

  18. Example: Hotel Energy in Lagos and Singapore
      Extension of the hotel energy consumption example to hotels in Lagos (28 hotels) and Singapore (29 hotels), n = 57.
      Response: Annual energy usage (Y, in 1000s of MWh)
      Predictors:
        • Floor area (X1, in 1000s of m²)
        • Employees (X2, in 100s)
        • Singapore indicator (dummy) variable (X3 = 1 if Singapore, 0 if Lagos)
      Model 1: $E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
      Model 2: $E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$
      Model 3: $E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_3 + \beta_5 X_2 X_3$

  19. Example: Hotel Energy in Lagos and Singapore
      Model 1: fitted coefficients of magnitude 0.410, 0.395 and 0.040 (signs not recoverable from this transcript)
      Model 2: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267 X_3$
        Lagos: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2$
        Singapore: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267(1) = 3.115 + 0.318 X_1 + 0.438 X_2$
      Model 3: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2 - 0.361 X_3 + 0.115 X_1 X_3 + 2.056 X_2 X_3$
        Lagos: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2$
        Singapore: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2 - 0.361(1) + 0.115 X_1 (1) + 2.056 X_2 (1) = -0.092 + 0.271 X_1 + 2.790 X_2$

      Model   p   SSE      R-square   df_ERR
      1       2   1120.7   0.7729     54
      2       3    833.8   0.8309     53
      3       5    482.2   0.9022     51

  20. Example: Hotel Energy in Lagos and Singapore
      Model 1 vs 2: Test of Singapore effect (different intercepts)
      Model 1: fitted coefficients of magnitude 0.410, 0.395 and 0.040 (signs not recoverable from this transcript); $SSE_1 = 1120.7$, $R_1^2 = 0.7729$, $df_{ERR1} = 57 - (2+1) = 54$
      Model 2: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267 X_3$; $SSE_2 = 833.8$, $R_2^2 = 0.8309$, $df_{ERR2} = 57 - (3+1) = 53$
        Lagos: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267(0) = -2.152 + 0.318 X_1 + 0.438 X_2$
        Singapore: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267(1) = 3.115 + 0.318 X_1 + 0.438 X_2$
      $H_0^{12}: \beta_3 = 0$ in Model 2.
      TS: $F_{obs}^{12} = \frac{(1120.7 - 833.8) / (54 - 53)}{833.8 / 53} = \frac{286.9}{15.73} = 18.24$
      RR: $F_{obs}^{12} \ge F_{.05,1,53} = 4.023$
      P-value: $P(F_{1,53} \ge 18.24) < .0001$

  21. Example: Hotel Energy in Lagos and Singapore
      Model 2 vs 3: Different slopes with respect to X1 and X2 by city
      Model 2: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267 X_3$
        Lagos: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267(0) = -2.152 + 0.318 X_1 + 0.438 X_2$
        Singapore: $\hat{Y} = -2.152 + 0.318 X_1 + 0.438 X_2 + 5.267(1) = 3.115 + 0.318 X_1 + 0.438 X_2$
      Model 3: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2 - 0.361 X_3 + 0.115 X_1 X_3 + 2.056 X_2 X_3$
        Lagos: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2 - 0.361(0) + 0.115 X_1 (0) + 2.056 X_2 (0) = 0.269 + 0.156 X_1 + 0.734 X_2$
        Singapore: $\hat{Y} = 0.269 + 0.156 X_1 + 0.734 X_2 - 0.361(1) + 0.115 X_1 (1) + 2.056 X_2 (1) = -0.092 + 0.271 X_1 + 2.790 X_2$
      $H_0^{23}: \beta_4 = \beta_5 = 0$ in Model 3.
      TS: $F_{obs}^{23} = \frac{(833.8 - 482.2) / (53 - 51)}{482.2 / 51} = \frac{175.8}{9.45} = 18.60$
      RR: $F_{obs}^{23} \ge F_{.05,2,51} = 3.179$
      P-value: $P(F_{2,51} \ge 18.60) < .0001$

  22. Example: Hotel Energy in Lagos and Singapore
      Model 1 vs 3: Same regression equation for each city
      (Note that since the previous two F-tests were highly significant, we know this test will reject the null hypothesis.)
      $H_0^{13}: \beta_3 = \beta_4 = \beta_5 = 0$ in Model 3.
      TS: $F_{obs}^{13} = \frac{(1120.7 - 482.2) / (54 - 51)}{482.2 / 51} = \frac{212.83}{9.45} = 22.52$
      RR: $F_{obs}^{13} \ge F_{.05,3,51} = 2.786$
      P-value: $P(F_{3,51} \ge 22.52) < .0001$

  23. R Program

      nsh.data <- read.csv("E:\\public_html\\data\\nigeria_singapore_hotel.csv")
      attach(nsh.data); names(nsh.data)
      singapore <- city - 1; year.EC <- year.EC / 1000
      flr.area <- flr.area / 1000; employees <- employees / 100
      hotel.mod1 <- lm(year.EC ~ flr.area + employees)
      summary(hotel.mod1); anova(hotel.mod1)
      hotel.mod2 <- lm(year.EC ~ flr.area + employees + singapore)
      summary(hotel.mod2); anova(hotel.mod2)
      hotel.mod3 <- lm(year.EC ~ flr.area + employees + singapore +
                       I(flr.area*singapore) + I(employees*singapore))
      summary(hotel.mod3); anova(hotel.mod3)
      anova(hotel.mod1, hotel.mod2)
      anova(hotel.mod2, hotel.mod3)
      anova(hotel.mod1, hotel.mod3)

  24. R Partial Output

      > anova(hotel.mod1, hotel.mod2)
      Model 1: year.EC ~ flr.area + employees
      Model 2: year.EC ~ flr.area + employees + singapore
        Res.Df     RSS Df Sum of Sq      F    Pr(>F)
      1     54 1120.69
      2     53  833.76  1    286.93 18.239 8.114e-05 ***

      > anova(hotel.mod2, hotel.mod3)
      Model 1: year.EC ~ flr.area + employees + singapore
      Model 2: year.EC ~ flr.area + employees + singapore + I(flr.area * singapore) + I(employees * singapore)
        Res.Df    RSS Df Sum of Sq      F    Pr(>F)
      1     53 833.76
      2     51 482.20  2    351.55 18.591 8.627e-07 ***

      > anova(hotel.mod1, hotel.mod3)
      Model 1: year.EC ~ flr.area + employees
      Model 2: year.EC ~ flr.area + employees + singapore + I(flr.area * singapore) + I(employees * singapore)
        Res.Df    RSS Df Sum of Sq     F    Pr(>F)
      1     54 1120.7
      2     51  482.2  3    638.48 22.51 2.025e-09 ***
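
All three model comparisons can be reproduced from the reported SSE and error degrees of freedom without refitting anything. The sketch below defines a small helper, `nested_f()`, which is a hypothetical name introduced here for illustration and not part of the original program; it applies the complete-versus-reduced F formula to the summary values for Models 1 to 3.

```r
# SSE and error d.f. for the three hotel models, as reported in the slides.
models <- data.frame(model  = 1:3,
                     p      = c(2, 3, 5),
                     sse    = c(1120.7, 833.8, 482.2),
                     df_err = c(54, 53, 51))

# Hypothetical helper: F-test of a reduced model against a complete model.
nested_f <- function(reduced, complete, alpha = 0.05) {
  rd <- models[models$model == reduced, ]
  cm <- models[models$model == complete, ]
  df_num <- rd$df_err - cm$df_err          # number of coefficients tested
  f_obs  <- ((rd$sse - cm$sse) / df_num) / (cm$sse / cm$df_err)
  c(f_obs  = f_obs,
    f_crit = qf(1 - alpha, df_num, cm$df_err),
    p_val  = pf(f_obs, df_num, cm$df_err, lower.tail = FALSE))
}

f12 <- nested_f(1, 2)   # Singapore intercept shift
f23 <- nested_f(2, 3)   # city-specific slopes
f13 <- nested_f(1, 3)   # any city difference at all

round(rbind(f12, f23, f13), 4)
```

Because the inputs are rounded, the statistics may differ from the `anova()` output in the last digit, but the conclusions are the same: each test rejects its null hypothesis at the 0.05 level.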
