Understanding Selection Bias in Data Analysis

Explore the concepts of selection bias in data analysis, including self-selection and survivorship bias. Learn how selection can impact study outcomes and how to mitigate its effects in research.

ofarrell_j Follow

Uploaded on Jun 05, 2025 | 1 Views

Presentation Transcript

QM222 Class 5 Section A1: Simple Regression
QM222 Fall 2017 Section A1

To-dos
- There is no class this Friday. However, TAs will be in room 314 (computer lab) to help anyone who wants help with Stata and/or data.
- Assignment 2 is due next Wednesday. I will also be writing to everyone to say whether they are good to go with their topic or need to talk to me more about their project topic before approval. I have time for drop-ins today 1-1:45 and 2:45-4.
- Otherwise, it is time to write down the variables that you want in the data sources (and later start downloading them). If it is a small data set, you can download the whole dataset (survey etc.).
- The midterm will be on Thursday Oct. 31 at 6pm and will be over in time for you to do Halloween.

Today's Agenda
- Review of selection and the WWII bomber example
- Introduction to simple regression (Chapter 6)
- Nomenclature of parts of the regression
- How to interpret regressions
- In-class exercise
- Very brief extension to multiple regression

Selection
Selection is the general term for cases where the group that you are studying is not representative of the population as a whole. There are general cases of selection and also two special cases:
- Self-selection: when people select into the sample
- Survivorship bias: where only survivors are observed

Cases of selection
- From the video: Married men live longer
- From our exercise: Kids in schools with smaller classes are more likely to go to college
- Self-selection: Struggling students go to TA office hours

The key in all these selection cases
Terminology: When you are measuring the effect of something in an experiment, you give this treatment to some of the people/rats/etc. (the treatment group) and do not give it to the others (the control group).
Selection will bias our measure of the effect if the treatment group and control group are likely to be different for reasons unrelated to the treatment that could be creating the outcome.
Self-selection is when people choose which group they are in.
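To make the treatment/control point concrete, here is a minimal simulation sketch of the office-hours example, written in Python rather than the course's Stata. Every number in it (the 5-point true effect, the score distribution, the attendance probabilities) is a made-up assumption, not course data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 5.0  # hypothetical: office hours add 5 exam points

def simulate_student():
    """Return (attended_office_hours, exam_score) for one made-up student."""
    ability = random.gauss(70, 10)  # score the student would get without help
    # Self-selection: weaker students are more likely to seek help.
    attends = random.random() < (0.8 if ability < 70 else 0.2)
    score = ability + (TRUE_EFFECT if attends else 0.0)
    return attends, score

students = [simulate_student() for _ in range(20000)]
treated = [score for attends, score in students if attends]
control = [score for attends, score in students if not attends]

# Naive estimate: compare average scores of the two self-chosen groups.
naive_estimate = statistics.mean(treated) - statistics.mean(control)

print(f"True effect of office hours:     {TRUE_EFFECT:+.1f} points")
print(f"Naive treated-minus-control gap: {naive_estimate:+.1f} points")
```

Because struggling students choose the treatment, the naive gap comes out negative even though the simulated treatment helps every student who takes it.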

Another example of this: Is this a good way to estimate the average number of children per family in the US?
Suppose that I want to estimate the average number of children in families in the US and use you (the class) as a sample. I ask each of you how many children are in your family (including you).
What are the problems with this estimation? Do you think it would be an overestimate or an underestimate of the average number of children? Why?
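One way to explore this question is with a small simulation. The sketch below is in Python (not the course's Stata) and uses an invented distribution of family sizes, so only the direction of the bias is meaningful, not the specific numbers.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical distribution of children per family (NOT real US data).
sizes = [0, 1, 2, 3, 4]
weights = [0.30, 0.25, 0.25, 0.15, 0.05]
families = random.choices(sizes, weights=weights, k=50000)

# What we want: the average over families.
true_mean = statistics.mean(families)

# What the classroom measures: the average over children. A family with
# k children supplies k potential students, and a family with 0 children
# supplies none, so it can never be sampled.
answers = [k for k in families for _ in range(k)]
classroom_mean = statistics.mean(answers)

print(f"Average children per family (all families): {true_mean:.2f}")
print(f"Average answer given by the children:       {classroom_mean:.2f}")
```

Under these assumptions the classroom answer overstates the family average, because childless families are missing and large families are counted once per child.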

Another example from WWII
During World War II, some of the most important mathematicians acted as secret agents of the US armed forces (as a group called the Applied Mathematics Panel). When a commander would stumble into a problem that might be related to statistics, he'd ask this Panel.
During World War II, the chance of a member of a bomber crew making it through a tour of duty was 50%. How, the Air Force asked, could they improve the odds of a bomber making it home?
The military looked at the bombers that had returned from enemy territory, recording where those planes had taken the most damage.

Discussion: Why did statisticians say the commanders were completely wrong?
They saw the bullet holes tended to accumulate along the wings, around the tail gunner, and down the center of the body.
The commanders wanted to put the thicker protection where they could clearly see the most damage, where the holes clustered.
But the statistician (Abraham Wald) said: No, it is the OPPOSITE. Where these bombers are unharmed is where these bombers are most vulnerable. Put protection THERE!
Why did he say this? Where should they add protection? ANSWER part d!
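Wald's reasoning can be illustrated with a toy simulation in Python. This is only a sketch: the sections "engines" and "cockpit" and all of the loss probabilities are hypothetical choices made for illustration and do not come from the slides or from historical records.

```python
import random
from collections import Counter

random.seed(2)  # fixed seed so the sketch is reproducible

# Assumption: each bomber takes one hit, equally likely in any section,
# but a hit in some sections is far more likely to bring the plane down.
SECTIONS = ["wings", "tail", "fuselage", "engines", "cockpit"]
LOSS_PROB = {"wings": 0.05, "tail": 0.05, "fuselage": 0.05,
             "engines": 0.80, "cockpit": 0.80}

hits_all = Counter()       # hits on every bomber that flew
hits_returned = Counter()  # hits the military could actually inspect

for _ in range(20000):
    section = random.choice(SECTIONS)
    hits_all[section] += 1
    if random.random() > LOSS_PROB[section]:  # bomber made it home
        hits_returned[section] += 1

for s in SECTIONS:
    print(f"{s:9s} hit on {hits_all[s]:5d} bombers, "
          f"seen on {hits_returned[s]:5d} returned bombers")
```

In this sketch every section is hit about equally often, yet the returned bombers show few hits in the sections where a hit is usually fatal. The clean areas on the survivors mark the vulnerable spots.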

This special kind of selection bias is called Survivorship Bias
Survivorship bias occurs when those who survive are different from those who don't, but you only measure the survivors.
Another example: Suppose you look at the 10-year % return of mutual funds. The funds you can look at are the ones that survived, and they will be the ones that got the highest % return (even if returns were random across funds).
So don't expect this % return if you invest in mutual funds for 10 years.
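The mutual fund claim (survivors look good even when returns are pure luck) can be checked with a short Python sketch. The return distribution and the rule that money-losing funds get closed are assumptions chosen for illustration only.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Every fund draws its yearly return from the SAME distribution,
# so no fund is truly better than any other.
funds = []
for _ in range(5000):
    value = 1.0
    for _ in range(10):
        value *= 1 + random.gauss(0.05, 0.15)  # hypothetical yearly return
    funds.append(value - 1)  # 10-year cumulative return

# Hypothetical closure rule: funds that lost money over the decade were
# shut down, so they no longer appear in today's fund listings.
survivors = [r for r in funds if r > 0.0]

mean_all = statistics.mean(funds)
mean_survivors = statistics.mean(survivors)

print(f"Funds started: {len(funds)}, funds still listed: {len(survivors)}")
print(f"Average 10-year return, all funds:      {mean_all:.1%}")
print(f"Average 10-year return, survivors only: {mean_survivors:.1%}")
```

The survivor average is higher than the all-funds average purely because the unlucky funds were removed from view.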

In sum
Selection is a major reason that correlation doesn't imply causality, and it can often be related to a missing confounding factor.

If you want to measure the slope of the line, you can't use correlation. You need regression! (example)
[Figure 6.1: Brookline Condos sold in 2009/2010. Scatter diagram with Price (US$) on the vertical axis and Size (Sq Feet) on the horizontal axis.]
Here is a scatter diagram of the price of Brookline condos versus the size (in square feet) of the condo.

Scatter diagram with a linear regression line (which is what a trend line in Excel is)
[Figure 6.1: Brookline Condos sold in 2009/2010. Scatter diagram of Price (US$) against Size (Sq Feet) with the fitted line $\hat{Y} = b_0 + b_1 X$.]
Every line is a linear equation. We write simple regression equations as follows: $\hat{Y} = b_0 + b_1 X$

$\hat{Y} = b_0 + b_1 X$
In regression, variables play one of two roles:
1. Dependent (LHS): what we want to explain or predict, $Y$. We use the hat over $Y$ to mean predicted $Y$: $\hat{Y}$.
2. Explanatory (RHS): variables used to explain or predict the dependent variable, $X$.
$b_0$ is the intercept of the line.
$b_1$ is the slope of the line. We also call it the coefficient of $X$. This is NOT the same as the correlation coefficient. Coefficient kind of means estimated number.
A simple regression means only one $X$.
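The difference between the slope and the correlation coefficient can be seen with a few lines of Python. This is a sketch using the standard least-squares formulas on five made-up condos, not the Brookline data. The function names `simple_regression` and `correlation` are introduced here for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def simple_regression(x, y):
    """Return (b0, b1) of the least-squares line y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope: change in y per one-unit change in x
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through the means
    return b0, b1

def correlation(x, y):
    """Return the correlation coefficient between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up condos: size in square feet, price in dollars.
size = [600, 800, 1000, 1200, 1500]
price = [260000, 330000, 420000, 500000, 610000]
price_k = [p / 1000 for p in price]  # same prices in thousands of dollars

b0, b1 = simple_regression(size, price)
b0_k, b1_k = simple_regression(size, price_k)
r, r_k = correlation(size, price), correlation(size, price_k)

print(f"price in $:     slope = {b1:.2f}, correlation = {r:.4f}")
print(f"price in $1000: slope = {b1_k:.4f}, correlation = {r_k:.4f}")
```

Changing the units of price rescales the slope by 1000 but leaves the correlation unchanged, which is why the correlation cannot tell you how many dollars an extra square foot goes with.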

The computer estimated this specific regression equation as: Price = 12934 + 407.5 size
[Figure 6.1: Brookline Condos sold in 2009/2010. Scatter diagram of Price (US$) against Size (Sq Feet) with the fitted line $\hat{Y} = b_0 + b_1 X$.]
Interpreting the coefficient: Every additional square foot increases the price by $407.50 on average.

The computer estimated this specific regression equation as: Price = 12934 + 407.5 size
[Figure 6.1: Brookline Condos sold in 2009/2010. Scatter diagram of Price (US$) against Size (Sq Feet) with the fitted line $\hat{Y} = b_0 + b_1 X$.]
Interpreting the intercept 12934: Mathematically, a condo with 0 size sells for $12,934. But such a condo doesn't exist. So really, the intercept just helps fit the line through the dots we have.
More generally, we cannot apply an equation outside the range of X values we observe.

In your project, you will estimate regressions like this.
Price = 12934 + 407.5 size
However, does this tell us that an additional square foot (size) CAUSES the price to increase by $407.50?
No, it just says: Price TENDS to increase by $407.50 when square feet (size) increases by 1.

(new) Predicted line $\hat{Y} = b_0 + b_1 X$ and errors
[Figure 6.2: Brookline Condos sold in 2009/2010. Scatter diagram of Price (US$) against Size (Sq Feet) with the fitted line. For one value of X, the vertical gap between the observed $Y$ and the predicted $\hat{Y}$ on the line is labeled "error, residual".]
$\hat{Y} = b_0 + b_1 X$
$Y = \hat{Y} + \text{error}$
$Y = b_0 + b_1 X + \text{error}$
For any X (e.g. 2700), we predict the value $\hat{Y}$ along the line. But each observation is not exactly the same as that predicted value. The difference is called the RESIDUAL or ERROR, and it can be positive or negative.
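As a small worked example of a residual, the Python sketch below uses the regression equation from the slides (Price = 12934 + 407.5 size) and two invented sale prices for 2700 square foot condos. The sale prices are hypothetical, not observations from the data.

```python
B0, B1 = 12934, 407.5  # intercept and slope reported on the slides

def predicted_price(size_sqft):
    """Y-hat: the point on the regression line for this size."""
    return B0 + B1 * size_sqft

def residual(size_sqft, actual_price):
    """Error = actual Y minus predicted Y-hat."""
    return actual_price - predicted_price(size_sqft)

# Two hypothetical condos of the same size that sold for different prices.
above_line = residual(2700, 1_200_000)  # sold for more than predicted
below_line = residual(2700, 1_000_000)  # sold for less than predicted

print(f"Predicted price at 2700 sq ft: ${predicted_price(2700):,.0f}")
print(f"Residual for the $1,200,000 sale: {above_line:+,.0f}")
print(f"Residual for the $1,000,000 sale: {below_line:+,.0f}")
```

The first condo sits above the line (positive residual) and the second below it (negative residual), even though both get the same predicted price.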

Regression in Stata
regress yvariablename xvariablename
For instance, to run a regression of price on size, type:
regress price size

Table 6.1

          Source |       SS       df       MS              Number of obs =    1085
    -------------+------------------------------           F(  1,  1083) = 3232.35
           Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
        Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
    -------------+------------------------------           Adj R-squared =  0.7488
           Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
           _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
    ------------------------------------------------------------------------------

How to read the table:
- price (top left of the coefficient table) is the name of the dependent variable.
- size is the name of the explanatory variable; its Coef. is $b_1$.
- _cons is the constant (intercept); its Coef. is $b_0$.
price = 12934 + 407.45 size
If size increases by 1 sq ft, sales price increases by $407.
We have a total of 1085 observations.
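Several entries in the Stata table can be recomputed from other entries, which is a useful way to learn what they mean. The Python sketch below copies numbers from Table 6.1 and checks them. The 1.96 multiplier is a large-sample approximation assumed here, so the interval matches Stata's only to about two decimal places.

```python
# Numbers copied from the Stata output in Table 6.1.
ss_model, ss_total = 5.6104e13, 7.4902e13
ms_model, ms_resid = 5.6104e13, 1.7357e10
coef_size, se_size = 407.4513, 7.166659

# R-squared: share of the variation in price accounted for by the line.
r_squared = ss_model / ss_total

# F statistic: model mean square over residual mean square.
f_stat = ms_model / ms_resid

# t statistic for size: coefficient divided by its standard error.
t_size = coef_size / se_size

# Approximate 95% confidence interval for the slope. Stata uses the exact
# t critical value for 1083 degrees of freedom; 1.96 is an approximation.
ci_low = coef_size - 1.96 * se_size
ci_high = coef_size + 1.96 * se_size

print(f"R-squared = {r_squared:.4f}")
print(f"F         = {f_stat:.2f}")
print(f"t (size)  = {t_size:.2f}")
print(f"95% CI    = [{ci_low:.2f}, {ci_high:.2f}]")
```

With a single explanatory variable, the F statistic is the square of the slope's t statistic, up to rounding in the printed table.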

Using this regression to predict: Price = 12934 + 407.5 size
If you know a condo's size, you can use the regression to predict that condo's price. Simply plug a value of square feet into the equation above:
The predicted price for a condo with 1000 sq. feet is: 12934 + 407.5*1000 = $420,434.
The predicted price for a condo with 2000 sq. feet is: 12934 + 407.5*2000 = $827,934.
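The plug-in arithmetic can be written as a tiny Python function. This is just a sketch of the calculation using the slide's coefficients. The function name `predicted_price` is introduced here for illustration.

```python
B0, B1 = 12934, 407.5  # intercept and slope reported on the slides

def predicted_price(size_sqft):
    """Predicted sale price in dollars for a condo of the given size."""
    return B0 + B1 * size_sqft

p1000 = predicted_price(1000)
p2000 = predicted_price(2000)

print(f"1000 sq ft: ${p1000:,.0f}")
print(f"2000 sq ft: ${p2000:,.0f}")
print(f"Difference: ${p2000 - p1000:,.0f}")
```

The difference between the two predictions equals the slope times the 1000 square foot gap, because the intercept cancels out.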

Using this regression to explain the relationship between size and price: Price = 12934 + 407.5 size
If there are two condos and one has an extra 1000 square feet compared with the other, how much would the sales price change? 407.5 * 1000 = $407,500
After all, the difference between $827,934 and $420,434 is just: 407.5*1000 = $407,500
The predicted price for a condo with 1000 sq. feet is: 12934 + 407.5*1000 = $420,434.
The predicted price for a condo with 2000 sq. feet is: 12934 + 407.5*2000 = $827,934.

Do in-class exercise side 1
Next, I want to go out of order a bit to tell you where we are going in later classes: related to confounding factors.

Going back to the Brookline Condos
Maybe there are confounding factors, correlated with both price and size. Perhaps size really has a smaller effect.
How can we separate out the effect of size and, for instance, # of parking spaces?
If we had a HUGE data set, we might be able to look at condos with only 1 space, then with 2 spaces, etc. But we would need much more data than exists.
Instead we use multiple regression.

In multiple regression, we have more than one explanatory variable.
At first we had: Price = 12934 + 407.5 size. Each additional square foot increases the price by $407.50.
What if I run a regression in 3-dimensional space and get an equation with three variables?
Price = 15639 + 388.5 size + 40463.1 parkingspaces
If there are two condos, both with 2 parking spaces, but one has an extra square foot, how much does price change? Notice nothing changes in the equation but the size term. So price goes up by $388.50 (which is less than $407.50).

So when you add additional variables into the equation
The multiple regression isolates the effect of each explanatory variable, holding the other explanatory variables constant.
Price = 15639 + 388.5 size + 40463.1 parkingspaces
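To see why the size coefficient shrinks once parking spaces are included, here is a simulation sketch in Python (not Stata). It generates fake condos from the slide's multiple regression equation. The way parking is tied to size and the noise level are invented assumptions, so the simple-regression slope it produces is illustrative and will not equal 407.5.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

def mean(values):
    return sum(values) / len(values)

def cross(u, v):
    """Sum of products of deviations of u and v from their means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

# Fake condos. Assumption: bigger condos tend to have more parking spaces.
size, parking, price = [], [], []
for _ in range(2000):
    s = random.uniform(500, 3500)
    p = max(0, round(s / 1000 + random.gauss(0, 0.7)))
    size.append(s)
    parking.append(p)
    price.append(15639 + 388.5 * s + 40463.1 * p + random.gauss(0, 50000))

# Simple regression of price on size alone.
b1_simple = cross(size, price) / cross(size, size)

# Multiple regression of price on size and parking, using the
# least-squares formulas for the two-explanatory-variable case.
s11, s22, s12 = cross(size, size), cross(parking, parking), cross(size, parking)
s1y, s2y = cross(size, price), cross(parking, price)
det = s11 * s22 - s12 ** 2
b1_multi = (s22 * s1y - s12 * s2y) / det
b2_multi = (s11 * s2y - s12 * s1y) / det
b0_multi = mean(price) - b1_multi * mean(size) - b2_multi * mean(parking)

print(f"Simple regression slope on size:     {b1_simple:.1f}")
print(f"Multiple regression slope on size:   {b1_multi:.1f}")
print(f"Multiple regression slope on parking: {b2_multi:.1f}")
```

In the simple regression, size gets credit for the parking spaces that tend to come with bigger condos. Once parking is in the equation, the size coefficient falls back toward the value used to generate the data.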

Assignment 2: Due Wednesday, Sept. 27
You cannot start this assignment before you have gotten approval of your topic from me. I will send you emails. Then you'll need to understand your data well enough to know the variables that you'll use and how many observations there are.
The questions (paraphrased):
- What specific question or questions will your project address?
- What company, governmental body or other organization would be interested in knowing the answer to this question?
- What data source(s) are you using?
- In your data, what does each observation represent?
- In your data, how many usable observations are there?
- What is the dependent variable(s) you plan to focus on?
- What is the main explanatory variable(s) that you will focus on?
- What additional, possibly confounding, variables can you measure that you are planning to include in your analysis?

I have also made a single sheet Project Description that you will start filling out and handing in each week
I'll use it to keep up with your project and your data.
For Assignment 2, you need to just copy your first 5 answers.
But with each additional Assignment, you will:
- Fill in more of this sheet.
- Change some of what you have written.

Today we
- Reviewed how, many times when there are confounding factors, it is because of selection, including the special cases of self-selection and survivorship bias.
- Learned a lot about simple regression: an estimated line; special words (dependent variable, explanatory variable, regression coefficient); how to use simple regressions to predict, and to measure how much one variable changes when the other changes. (I am avoiding saying how one variable affects the other.)
- Were introduced to multiple regression to separate out the effect of confounding factors.