
Understanding Sampling and Statistical Inference in Data Analysis
Explore the concepts of sampling, population, and statistical inference in data analysis with Dr. Kari Lock Morgan. Learn how data can be collected, variables categorized, and inferences drawn. Discover the importance of different factors like bias in sampling and the broader application of sample data to populations.
Presentation Transcript
STAT 101, Dr. Kari Lock Morgan. Collecting Data: Sampling (Section 1.2). Topics: Sample versus Population; Statistical Inference; Sampling Bias; Simple Random Sample; Other Sources of Bias. Statistics: Unlocking the Power of Data, Lock5
Review of Last Class: Data are everywhere and pertain to a wide variety of topics. A dataset usually consists of variables measured on cases. Variables are either categorical or quantitative. Data can be used to provide information about essentially anything we are interested in and want to collect data on!
Sample versus Population: A population includes all individuals or objects of interest. A sample is all the cases that we have collected data on (a subset of the population). Statistical inference is the process of using data from a sample to gain information about the population.
The Big Picture (diagram): sampling takes us from the population to the sample; statistical inference takes us from the sample back to the population.
Most Important to You: Which of the following is most important to you? (a) Athletics (b) Academics (c) Social Life (d) Community Service (e) Other
Most Important to You: Suppose researchers studying student life at Duke use the results of our clicker question to investigate what Duke students find important. What is the sample? What is the population? Can the sample data be generalized to make inferences about the population? Why or why not?
Dewey Defeats Truman? The paper was published before the conclusion of the 1948 presidential election and was based on the results of a large telephone poll, which showed Dewey sweeping Truman. However, Harry S. Truman won the election. What went wrong?
Sampling Bias: Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. If sampling bias exists, we cannot trust generalizations from the sample to the population.
Sampling (diagram: a population and samples drawn from it). GOAL: Select a sample that is similar to the population, only smaller.
Can you avoid sampling bias? The next slide shows Lincoln's Gettysburg Address. The entire population, all words in his address, will be shown to you. What is the average word length? Your task: Select a sample of 10 words that resembles the overall address. Write them down. Calculate the average number of letters for the words in your sample. Place a dot above your sample average on the board.
Lincoln's Gettysburg Address: Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they here gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain, that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, for the people, shall not perish from the earth.
Can you avoid sampling bias? Actual average: [value missing from the transcript]. People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias! We need a better way.
Random Sampling: How can we make sure to avoid sampling bias? Take a RANDOM sample! Imagine putting the names of all the units of the population into a hat and drawing out names at random to be in the sample. More often, we use technology.
Random Sampling: Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama. In the actual election, 53% voted for Obama. Random sampling is a very powerful tool!
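To see why a sample of this size can land so close to the election result, here is a minimal simulation sketch. It assumes a simple random sample from a very large population in which exactly 53% support the candidate; the function name `simulate_poll`, the seed, and the number of repeated polls are illustrative choices, not part of the original lecture, and real polls may use more complicated designs.

```python
import random

def simulate_poll(true_share, sample_size, rng):
    """Draw one hypothetical poll and return the sample proportion of supporters."""
    supporters = sum(rng.random() < true_share for _ in range(sample_size))
    return supporters / sample_size

rng = random.Random(2008)   # fixed seed so the run is repeatable
TRUE_SHARE = 0.53           # assumed population proportion (the election result)
SAMPLE_SIZE = 2847          # sample size reported on the slide

# Repeat the poll many times to see how much the sample proportion varies.
polls = [simulate_poll(TRUE_SHARE, SAMPLE_SIZE, rng) for _ in range(500)]
average_of_polls = sum(polls) / len(polls)
within_3_points = sum(abs(p - TRUE_SHARE) <= 0.03 for p in polls) / len(polls)

print(f"average sample proportion: {average_of_polls:.3f}")
print(f"share of polls within 3 points of the truth: {within_3_points:.3f}")
```

Under these assumptions, the simulated sample proportions cluster tightly around the population value, which is consistent with the 52% versus 53% result on the slide.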
Random Numbers: 1. Pick 10 random numbers between 1 and 268. Write these numbers down. (Note: When choosing a real sample, you should use technology to generate random numbers. This is simply for illustrative purposes in class.) 2. Using the next slide, calculate the average number of letters in the words corresponding to your random numbers. 3. Place a dot below this average on the board.
1 Four 2 score 3 and 4 seven 5 years 6 ago, 7 our 8 fathers 9 brought 10 forth
11 upon 12 this 13 continent 14 a 15 new 16 nation: 17 conceived 18 in 19 liberty, 20 and
21 dedicated 22 to 23 the 24 proposition 25 that 26 all 27 men 28 are 29 created 30 equal.
31 Now 32 we 33 are 34 engaged 35 in 36 a 37 great 38 civil 39 war, 40 testing
41 whether 42 that 43 nation, 44 or 45 any 46 nation 47 so 48 conceived 49 and 50 so
51 dedicated, 52 can 53 long 54 endure. 55 We 56 are 57 met 58 on 59 a 60 great
61 battlefield 62 of 63 that 64 war. 65 We 66 have 67 come 68 to 69 dedicate 70 a
71 portion 72 of 73 that 74 field 75 as 76 a 77 final 78 resting 79 place 80 for
81 those 82 who 83 here 84 gave 85 their 86 lives 87 that 88 that 89 nation 90 might
91 live. 92 It 93 is 94 altogether 95 fitting 96 and 97 proper 98 that 99 we 100 should
101 do 102 this. 103 But, 104 in 105 a 106 larger 107 sense, 108 we 109 cannot 110 dedicate,
111 we 112 cannot 113 consecrate, 114 we 115 cannot 116 hallow 117 this 118 ground. 119 The 120 brave
121 men, 122 living 123 and 124 dead, 125 who 126 struggled 127 here 128 have 129 consecrated 130 it,
131 far 132 above 133 our 134 poor 135 power 136 to 137 add 138 or 139 detract. 140 The
141 world 142 will 143 little 144 note, 145 nor 146 long 147 remember, 148 what 149 we 150 say
151 here, 152 but 153 it 154 can 155 never 156 forget 157 what 158 they 159 did 160 here.
161 It 162 is 163 for 164 us 165 the 166 living, 167 rather, 168 to 169 be 170 dedicated
171 here 172 to 173 the 174 unfinished 175 work 176 which 177 they 178 who 179 fought 180 here
181 have 182 thus 183 far 184 so 185 nobly 186 advanced. 187 It 188 is 189 rather 190 for
191 us 192 to 193 be 194 here 195 dedicated 196 to 197 the 198 great 199 task 200 remaining
201 before 202 us, 203 that 204 from 205 these 206 honored 207 dead 208 we 209 take 210 increased
211 devotion 212 to 213 that 214 cause 215 for 216 which 217 they 218 gave 219 the 220 last
221 full 222 measure 223 of 224 devotion, 225 that 226 we 227 here 228 highly 229 resolve 230 that
231 these 232 dead 233 shall 234 not 235 have 236 died 237 in 238 vain, 239 that 240 this
241 nation, 242 under 243 God, 244 shall 245 have 246 a 247 new 248 birth 249 of 250 freedom,
251 and 252 that 253 government 254 of 255 the 256 people, 257 by 258 the 259 people, 260 for
261 the 262 people, 263 shall 264 not 265 perish 266 from 267 the 268 earth.
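The random-number exercise can also be done with technology, as the note on the exercise slide recommends. The sketch below is one possible way to do it in Python, not the course's own tool. To stay short it uses only the first 30 numbered words as a stand-in population; the helper names `random_word_numbers` and `average_length` and the seed are hypothetical.

```python
import random
import string

# Stand-in population: only the first 30 numbered words (an assumption made
# to keep the example short; the class exercise uses all 268 words).
TEXT = ("Four score and seven years ago, our fathers brought forth upon this "
        "continent a new nation: conceived in liberty, and dedicated to the "
        "proposition that all men are created equal.")
WORDS = [w.strip(string.punctuation) for w in TEXT.split()]

def random_word_numbers(n_words, sample_size, seed=None):
    """Pick distinct word numbers between 1 and n_words, each equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_words + 1), sample_size))

def average_length(words, numbers):
    """Average number of letters in the words with the given 1-based numbers."""
    return sum(len(words[i - 1]) for i in numbers) / len(numbers)

numbers = random_word_numbers(len(WORDS), 10, seed=1)
sample_average = average_length(WORDS, numbers)
print("word numbers:", numbers)
print("sample average word length:", sample_average)
```

For the full exercise, `WORDS` would hold all 268 words and the first argument to `random_word_numbers` would be 268.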
Random vs Non-Random Sampling: Random samples have averages that are centered around the correct number. Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number. Only random samples can truly be trusted when making generalizations to the population!
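The contrast between random and non-random samples can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is only the first sentence of the address, and the non-random sampler is a hypothetical model in which longer words are more likely to be picked (one plausible way hand-picked samples could go wrong, not a claim about what actually happened in class).

```python
import random

# Assumed population: word lengths in the first sentence of the address.
TEXT = ("Four score and seven years ago our fathers brought forth on this "
        "continent a new nation conceived in Liberty and dedicated to the "
        "proposition that all men are created equal")
LENGTHS = [len(w) for w in TEXT.split()]

def srs_mean(lengths, k, rng):
    """Mean word length in a simple random sample of k words."""
    sample = rng.sample(lengths, k)
    return sum(sample) / k

def length_biased_mean(lengths, k, rng):
    """Mean word length when longer words are more likely to be chosen
    (selection probability proportional to length, without replacement)."""
    pool = list(lengths)
    chosen = []
    for _ in range(k):
        index = rng.choices(range(len(pool)), weights=pool, k=1)[0]
        chosen.append(pool.pop(index))
    return sum(chosen) / k

rng = random.Random(2024)
REPS = 2000
true_mean = sum(LENGTHS) / len(LENGTHS)
random_center = sum(srs_mean(LENGTHS, 10, rng) for _ in range(REPS)) / REPS
biased_center = sum(length_biased_mean(LENGTHS, 10, rng) for _ in range(REPS)) / REPS

print(f"population mean:             {true_mean:.2f}")
print(f"center of random samples:    {random_center:.2f}")
print(f"center of biased samples:    {biased_center:.2f}")
```

In this setup the random-sample averages center on the population mean, while the length-biased averages center on a larger value.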
Bowl of Soup Analogy: Think of tasting a bowl of soup. Population = entire bowl of soup. Sample = whatever is in your tasting bites. If you take bites non-randomly from the soup (if you stab with a fork, or prefer noodles to vegetables), you may not get a very accurate representation of the soup. If you take bites at random, only a few bites can give you a very good idea of the overall taste of the soup.
Simple Random Sample: In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample. More complicated random sampling schemes exist, but will not be covered in this course.
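The "same chance of being selected" property can be checked empirically. The sketch below assumes a toy population of 10 labeled units and samples of size 3 (both hypothetical), and uses Python's `random.sample` to draw without replacement. It only checks each unit's individual inclusion rate, which should be close to 3/10; it does not check the joint behavior of pairs of units.

```python
import random
from collections import Counter

population = list(range(1, 11))   # toy population: units labeled 1 to 10
SAMPLE_SIZE = 3
REPS = 20000

rng = random.Random(12)           # fixed seed so the run is repeatable
counts = Counter()
for _ in range(REPS):
    # random.sample draws without replacement, each unit equally likely
    counts.update(rng.sample(population, SAMPLE_SIZE))

# Fraction of samples that included each unit; expected value is 3/10.
inclusion_rates = {unit: counts[unit] / REPS for unit in population}
for unit, rate in inclusion_rates.items():
    print(unit, round(rate, 3))
```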
Realities of Sampling: While a random sample is ideal, often it isn't feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population. Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled from. In practice, think hard about potential sources of sampling bias, and try your best to avoid them.
Non-Random Samples: Suppose you want to estimate the average number of hours that Duke students spend studying each week. Which of the following is the best method of sampling? (a) Go to the library and ask all the students there how much they study. (b) Email all Duke students asking how much they study, and use all the data you get. (c) Give a clicker question in STAT 101 and force every student to respond. (d) Stand outside the Bryan Center and ask everyone going in how much they study.
Alcohol, Marijuana, and Driving: The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance. Volunteers who responded to advertisements for the study on rock radio stations were given a random combination of the two drugs, then their performance was observed. What is the sample? What is the population? Is there sampling bias? Will the results be informative and/or do you think the study is worth conducting? Source: Chesher, G., Dauncey, H., Crawford, J. and Horn, K., The Interaction between Alcohol and Marijuana: A Dose Dependent Study on the Effects of Human Moods and Performance Skills, Report No. C40, Federal Office of Road Safety, Federal Department of Transport, Australia, 1986.
Papers: Note: The original sources for the studies are provided, and the papers will appear on the website in the order they are used in class. If interested in the details of the study, please read the original article!
Data Collection and Bias (diagram): Going from the population to the sample, is there sampling bias? Going from the sample to the DATA, are there other forms of bias?
Other Forms of Bias: Even with a random sample, data can still be biased, especially when collected on humans. Other forms of bias to watch out for in data collection: question wording, context, inaccurate responses. Many other possibilities exist: examine the specifics of each study!
Question Wording: A random sample was asked: "Should there be a tax cut, or should money be used to fund new government programs?" A different random sample was asked: "Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?"
Context: Ann Landers' column asked readers, "If you had it to do over again, would you have children?" The first request for data contained a letter from a young couple which listed worries about parenting and various reasons not to have kids. The second request for data was in response to this number, in which Ann wrote how she was stunned, disturbed, and just plain flummoxed.
Having Children: If we were to run the question all by itself in the newspaper with a request for responses, could we trust the results? (a) Yes (b) No
Inaccurate Responses: In a study on US students, 93% of the sample said they were in the top half of the sample regarding driving skill. Source: Svenson, O. (February 1981). "Are we all less risky and more skillful than our fellow drivers?". Acta Psychologica 47 (2): 143-148.
Summary: Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences. This is the easiest way to instantly become a more statistically literate individual!
To Do: Read Section 1.2. CAOS pretest (1 point, due Tues 1/14, 7pm). Class survey (1 point, due Wed 1/15, 7pm). Get a clicker (grading will start Wed, 1/22).