Statistical Thinking for Data Science by Mahrita Harahap

Statistical thinking plays a crucial role in data science, helping us make informed decisions and draw accurate conclusions from data. This content covers various statistical methods such as linear regression, ANOVA, logistic regression, and more. It delves into topics like model selection, multicollinearity, and exploratory data analysis, providing valuable insights for analyzing and interpreting data effectively.

  • Statistical thinking
  • Data science
  • Mahrita Harahap
  • Linear regression
  • Multicollinearity


Presentation Transcript


  1. 36103 Statistical Thinking for Data Science By Mahrita Harahap

  2. Choosing a statistical method (assuming the data meet the assumptions for parametric tests):
     Response type    | No. of responses | No. of predictors | Type of predictor                              | Method
     Continuous       | One              | One               | Continuous                                     | Simple Linear Regression
     Continuous       | One              | One               | Categorical                                    | One-Way ANOVA
     Continuous       | One              | Two or more       | Any                                            | Multiple Regression
     Categorical      | One              | One               | Continuous                                     | Logistic Regression
     Categorical      | One              | One               | Categorical                                    | Pearson Chi-Square or Likelihood Ratio
     Categorical      | One              | Two or more       | Categorical                                    | Loglinear Analysis
     Categorical      | One              | Two or more       | Continuous, or both categorical and continuous | Multinomial Regression
     Counts of events | One              | Any amount        | Any type                                       | Poisson Model
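As a minimal sketch (not part of the original slides), the model families in this table map to R calls roughly as follows. The data frame df and the variables y, y01, ycat, count, x1 and x2 are hypothetical placeholders: y is a continuous response, y01 a binary response, ycat a multi-category response, count a count response, x1 a continuous predictor and x2 a categorical (factor) predictor.

     # Sketch only: hypothetical data frame df and variables, for illustration
     lm(y ~ x1, data = df)                              # simple linear regression
     aov(y ~ x2, data = df)                             # one-way ANOVA
     lm(y ~ x1 + x2, data = df)                         # multiple regression
     glm(y01 ~ x1, family = binomial, data = df)        # logistic regression
     chisq.test(table(df$ycat, df$x2))                  # Pearson chi-square for two categorical variables
     nnet::multinom(ycat ~ x1 + x2, data = df)          # multinomial regression (requires the nnet package)
     glm(count ~ x1 + x2, family = poisson, data = df)  # Poisson model for counts of events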

  3. Model Selection. Given a data set with many potential explanatory variables, we need to decide which ones to include in our model and which ones to leave out. Possible procedures include: all possible models (best subsets), forward selection, backward elimination, and stepwise selection.

  4. Multicollinearity. With multiple regression we are often adding variables that are themselves partially related to each other. This becomes a problem when a combination of variables is close to collinear: if some of the predictor variables are (highly) correlated, we say that multicollinearity exists. WHY IS MULTICOLLINEARITY A PROBLEM: If two (or more) variables are collinear, we cannot interpret a parameter as the relationship between one x variable and y holding ALL the other x's constant; we can get negative relationships when they should be positive. Adding highly collinear variables also inflates the standard errors of the parameter estimates, which can make individual variables look non-significant. Variance Inflation Factor: the VIF measures how much the variance of an estimated regression coefficient is inflated because of collinearity. We are typically concerned if any of the Variance Inflation Factors are above 10.
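As a minimal sketch (not part of the original slides, and assuming the Houses.csv data used in the following slides), VIFs can be computed with the vif() function from the car package:

     # Sketch only: assumes Houses.csv with columns Price, Floor, Block, Rooms, Age, CentralHeating
     library(car)                        # provides vif()
     houses <- read.csv("Houses.csv")
     fit_all <- lm(Price ~ Floor + Block + Rooms + Age + CentralHeating, data = houses)
     vif(fit_all)                        # values above about 10 suggest problematic collinearity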

  5. Exploratory Data Analysis
     setwd(".")                                  # replace "." with the folder containing Houses.csv
     houses <- read.csv("Houses.csv")
     summary(houses)
     head(houses)
     ##   Price Floor Block Rooms Age CentralHeating
     ## 1 52.00   111   830     5 6.2             no
     ## 2 54.75   128   710     5 7.5             no
     install.packages("corrplot")
     library(corrplot)
     Correlation <- round(cor(houses[, c(1, 2, 3, 4, 5)]), 2)   # correlations of the numeric columns
     corrplot(Correlation, method = "ellipse")
     corrplot(Correlation, method = "number")
     library(car)
     scatterplotMatrix(~ Price + Floor + Block + Rooms + Age + CentralHeating, data = houses)

  6. Correlation Plot

  7. Matrix Plot

  8. All Possible Regressions (Best Subsets). Considering all possible models is time consuming unless there are only a small number of predictors, because there are 2^p possible linear regression models and we require procedures for choosing one (or a small number) of them. It is still difficult to choose the best model, as many test results will be available, often giving conflicting information. We can select the best models based on Adjusted R2, Mallows' Cp, AIC or BIC. Adjusted R2 is used instead of R2 because it penalises for the number of parameters and the sample size. There are usually too many models to consider manually, so we need an automatic system for deciding which models to consider and in which order. It is better to use a logical procedure such as forward selection, backward elimination or stepwise selection, where each test is acted upon sequentially, and not to ignore any substantive theory.
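As a minimal sketch (not part of the original slides, assuming the houses data loaded above), the criteria mentioned here can be extracted directly from fitted models when comparing a handful of candidates by hand:

     # Sketch only: compare two hypothetical candidate models on the houses data
     m1 <- lm(Price ~ Floor + Rooms, data = houses)
     m2 <- lm(Price ~ Floor + Rooms + Age + CentralHeating, data = houses)
     summary(m1)$adj.r.squared          # adjusted R-squared: larger is better
     summary(m2)$adj.r.squared
     AIC(m1, m2)                        # AIC: smaller is better
     BIC(m1, m2)                        # BIC: smaller is better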

  9. All Possible Regressions (Best Subsets). You can perform best subsets (all possible regressions) using the regsubsets() function from the leaps package. In the following code, nbest indicates the number of subsets of each size to report: here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.). The scale options for plot() are "adjr2", "bic", and "Cp"; the subsets() function from the car package can also plot these models against BIC, Cp, Adjusted R2, or RSS. This video https://www.youtube.com/watch?v=LkifE44myLc goes through these measures in depth.
     install.packages("leaps")
     library(leaps)
     leaps <- regsubsets(Price ~ Floor + Block + Rooms + Age + CentralHeating, data = houses, nbest = 10)
     leaps
     plot(leaps, scale = "adjr2")
     plot(leaps, scale = "bic")
     plot(leaps, scale = "Cp")

  10. All Possible Regressions (Best Subsets). In these plots, black indicates that a variable is included in the model, while white indicates that it is not. Looking at the values on the y-axis of the plot indicates that the top four models have roughly the same adjusted R-squared. Automatic methods are useful when the number of explanatory variables is large and it is not feasible to fit all possible models. In this case, it is more efficient to use a search algorithm (e.g. forward selection, backward elimination, or stepwise regression) to find the best model.

  11. Automatic Methods: forward, backward, stepwise. The R function step() can be used to perform variable selection. To perform forward selection we need to begin by specifying a starting model and the range of models we want to examine in the search.
     attach(houses)   # not strictly needed, since data = houses is supplied below
     # null model (intercept only)
     null <- lm(Price ~ 1, data = houses)
     null
     ## Call: lm(formula = Price ~ 1, data = houses)
     ## Coefficients:
     ## (Intercept)
     ##       71.56
     # full model (all predictors)
     full <- lm(Price ~ ., data = houses)
     full
     ## Call: lm(formula = Price ~ ., data = houses)
     ## Coefficients:
     ##       (Intercept)              Floor              Block              Rooms
     ##         18.251533           0.166485          -0.002324           6.176126
     ##               Age  CentralHeatingyes
     ##         -1.884105           6.399189
     detach(houses)

  12. Variable Selection: Forward Selection
     # Forward Selection Method
     > step(null, scope = list(lower = null, upper = full), direction = "forward")
     # Final model selected:
     Step:  AIC=58.75
     Price ~ Floor + Age + Rooms + CentralHeating
              Df Sum of Sq    RSS    AIC
     <none>                 228.88 58.749
     + Block   1   0.81331  228.06 60.678
     Call:
     lm(formula = Price ~ Floor + Age + Rooms + CentralHeating, data = houses)
     Coefficients:
           (Intercept)              Floor                Age              Rooms
               15.9814             0.1622            -1.8479             6.3253
     CentralHeatingyes
                6.3502

  13. Variable Selection: Backward Elimination
     # Backward Elimination Method
     > step(full, data = houses, direction = "backward")
     # Final model selected:
     Step:  AIC=58.75
     Price ~ Floor + Rooms + Age + CentralHeating
                      Df Sum of Sq    RSS    AIC
     <none>                        228.88 58.749
     - Floor           1     85.47 314.35 63.095
     - CentralHeating  1    115.80 344.68 64.938
     - Rooms           1    155.11 383.99 67.098
     - Age             1    324.12 553.00 74.393
     Call:
     lm(formula = Price ~ Floor + Rooms + Age + CentralHeating, data = houses)
     Coefficients:
           (Intercept)              Floor              Rooms                Age
               15.9814             0.1622             6.3253            -1.8479
     CentralHeatingyes
                6.3502

  14. Variable Selection: Stepwise Selection
     # Stepwise Selection Method
     > step(null, scope = list(upper = full), data = houses, direction = "both")
     # Final model selected:
     Step:  AIC=58.75
     Price ~ Floor + Age + Rooms + CentralHeating
                      Df Sum of Sq    RSS    AIC
     <none>                        228.88 58.749
     + Block           1      0.81 228.07 60.678
     - Floor           1     85.47 314.35 63.095
     - CentralHeating  1    115.80 344.68 64.938
     - Rooms           1    155.11 383.99 67.098
     - Age             1    324.12 553.00 74.393
     Call:
     lm(formula = Price ~ Floor + Age + Rooms + CentralHeating, data = houses)
     Coefficients:
           (Intercept)              Floor                Age              Rooms
               15.9814             0.1622            -1.8479             6.3253
     CentralHeatingyes
                6.3502

  15. Based on these outputs, which model would you select? (Remember to examine the residual analysis of the final selected model to check its validity.)

  16. Looks like all four procedures selected this model: Price ~ Floor + Age + Rooms + CentralHeating. So we will fit this model in R and analyse its residual plots to check the validity of the model.
     > fit <- lm(Price ~ Floor + Age + Rooms + CentralHeating, data = houses)
     > summary(fit)
     Call:
     lm(formula = Price ~ Floor + Age + Rooms + CentralHeating, data = houses)
     Residuals:
         Min      1Q  Median      3Q     Max
     -6.5211 -1.6048  0.1236  2.2921  6.4283
     Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
     (Intercept)        15.98136    7.67135   2.083 0.054751 .
     Floor               0.16221    0.06854   2.367 0.031827 *
     Age                -1.84790    0.40094  -4.609 0.000341 ***
     Rooms               6.32527    1.98387   3.188 0.006108 **
     CentralHeatingyes   6.35019    2.30507   2.755 0.014741 *
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     Residual standard error: 3.906 on 15 degrees of freedom
     Multiple R-squared: 0.919,  Adjusted R-squared: 0.8974
     F-statistic: 42.57 on 4 and 15 DF,  p-value: 5.123e-08
     > plot(fit)
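As a minimal sketch (not part of the original slides, assuming the fit object above), the four diagnostic plots produced by plot(fit) can be arranged on one screen and the residuals inspected directly:

     # Sketch only: residual diagnostics for the selected model
     par(mfrow = c(2, 2))     # show the four diagnostic plots in a 2x2 grid
     plot(fit)                # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
     par(mfrow = c(1, 1))
     hist(resid(fit))         # rough check of the residual distribution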

  17. References:
     www.stat.columbia.edu/~martin/W2024/R10.pdf
     http://www.statmethods.net/stats/regression.html
     https://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html
     Videos: Linear Model Selection and Best Subset Selection (13:44); Forward Stepwise Selection (12:26); Backward Stepwise Selection (5:26); Estimating Test Error Using Mallow's Cp, AIC, BIC, Adjusted R-squared (14:06); Estimating Test Error Using Cross-Validation (8:43)
