Describing Data: One Variable - Categorical and Quantitative Examples

This presentation covers categorical and quantitative variables in statistics, the importance of randomization in data collection, the distinction between random samples and randomized experiments, and when each is needed for different study objectives.

daan_27

Uploaded on Mar 09, 2025

STAT 101, Dr. Kari Lock Morgan
Describing Data: One Variable
Sections 2.1, 2.2, 2.3, 2.4
One categorical variable (2.1)
One quantitative variable (2.2, 2.3, 2.4)
Statistics: Unlocking the Power of Data, Lock5

Announcements
Homework 1 is due now; turn it in according to lab section.
Clicker grading starts today!

Why not always randomize?
Randomized experiments are ideal, but sometimes not ethical or possible.
Often, you have to do the best you can with data from observational studies.
Example: research for the Supreme Court case on whether preferences for minorities in university admissions help or hurt the minority students.

Randomization in Data Collection
Was the sample randomly selected?
Yes: possible to generalize to the population.
No: should not generalize to the population.
Was the explanatory variable randomly assigned?
Yes: possible to make conclusions about causality.
No: cannot make conclusions about causality.

Two Fundamental Questions in Data Collection
From population to sample: random sample?
From sample to data: randomized experiment?

Randomization
Doing a randomized experiment on a random sample is ideal, but rarely achievable.
If the focus of the study is using a sample to estimate a quantity for the entire population, you need a random sample, but do not need a randomized experiment (example: election polling).
If the focus of the study is establishing causality from one variable to another, you need a randomized experiment and can settle for a non-random sample (example: drug testing).

Review from Last Class
Association does not imply causation!
In observational studies, confounding variables almost always exist, so causation cannot be established.
Randomized experiments involve randomly determining the level of the explanatory variable.
Randomized experiments prevent confounding variables, so causality can be inferred.
A control or comparison group is necessary.
The placebo effect exists, so a placebo and blinding should be used.

The Big Picture
Population to sample: sampling.
Sample to population: statistical inference.
Within the sample: descriptive statistics.

Descriptive Statistics
In order to make sense of data, we need ways to summarize and visualize it.
Summarizing and visualizing variables, and relationships between two variables, is often known as descriptive statistics (also known as exploratory data analysis).
The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).

One Categorical Variable
A random sample of US adults in 2012 was surveyed regarding the type of cell phone owned: Android, iPhone, Blackberry, non-smartphone, or no cell phone.

Cell Phones
Which type of cell phone do you own?
a) Android
b) iPhone
c) Blackberry
d) Non-smartphone
e) No cell phone

Frequency Table
A frequency table shows the number of cases that fall in each category.
US data:
Android           458
iPhone            437
Blackberry        141
Non Smartphone    924
No cell phone     293
Total            2253
R: table(x)

Proportion
The proportion in a category is found by
$\text{proportion} = \frac{\text{number in that category}}{\text{total number}}$
Proportion for a sample: $\hat{p}$ ("p-hat")
Proportion for a population: $p$

Proportion
What proportion of adults sampled do not own a cell phone?
From the frequency table, 293 of the 2253 adults sampled have no cell phone:
$\hat{p} = \frac{293}{2253} = 0.13$, or 13%
Proportions and percentages can be used interchangeably.

Relative Frequency Table
A relative frequency table shows the proportion of cases that fall in each category.
All the numbers in a relative frequency table sum to 1.
Android           0.203
iPhone            0.194
Blackberry        0.063
Non Smartphone    0.410
No cell phone     0.130
R: table(x)/length(x)
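The two tables and the sample proportion can be reproduced from the counts on the slides. The sketch below is only illustrative: the raw data vector `x` is rebuilt from the published counts (an assumption, since the survey file itself is not part of the transcript).

```r
# Counts copied from the frequency table on the slides.
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Rebuild one entry per respondent (hypothetical stand-in for the raw data).
x <- rep(names(counts), times = counts)

freq <- table(x)                # frequency table
rel_freq <- freq / length(x)    # relative frequency table

print(freq)
print(round(rel_freq, 3))

# Sample proportion with no cell phone: 293 / 2253
p_hat <- counts[["No cell phone"]] / sum(counts)
print(round(p_hat, 2))
```

Note that `table()` lists the categories alphabetically, so the row order differs from the slide.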

Bar Chart/Plot/Graph
In a bar chart, the height of each bar is the number of cases falling in that category.
R: barchart(x)

Pie Chart
In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category.
R: pie(table(x))
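As a rough sketch of both plots in base R graphics: `barplot()` is the base function for bar charts, and whether the slides' `barchart(x)` refers to a course helper or to another package is not stated in the transcript. The counts are taken from the frequency table; the titles and axis label are made up for illustration.

```r
# Counts copied from the frequency table on the slides.
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

pdf(NULL)  # null graphics device, so the sketch also runs without a display

# Bar chart: bar height is the number of cases in each category.
bar_midpoints <- barplot(counts, ylab = "Number of cases",
                         main = "Cell phone type")

# Pie chart: slice area is proportional to the share in each category.
pie(counts, main = "Cell phone type")

invisible(dev.off())
```

In an interactive session, drop the `pdf(NULL)` and `dev.off()` lines to see the plots on screen.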

StatKey
www.lock5stat.com/statkey

Summary: One Categorical Variable
Summary statistics: proportion, frequency table, relative frequency table.
Visualization: bar chart, pie chart.

One Quantitative Variable
World gross for all 2011 Hollywood movies (dataset: HollywoodMovies2011)
More graphics on profits for Hollywood movies

HollywoodMovies2011

Dotplot
In a dotplot, each case is represented by a dot and dots are stacked.
It is an easy way to see each case.

Histogram
The height of each bar corresponds to the number of cases within that range of the variable.
R: hist(x)

Histogram vs Bar Chart
This is a
a) Histogram
b) Bar chart
c) Other
d) I have no idea

Histogram vs Bar Chart
A bar chart is for categorical data, and the x-axis has no numeric scale.
A histogram is for quantitative data, and the x-axis is numeric.
For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed.
For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars.
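To see how the choice of bars changes a histogram, the sketch below bins the same values two ways. The `gross` values are made up for illustration (they are not the HollywoodMovies2011 data), and the bin widths are arbitrary choices.

```r
# Hypothetical world gross values in millions (illustrative only).
gross <- c(12, 25, 31, 40, 48, 55, 63, 76, 90, 110, 150, 210, 340, 620, 1300)

# plot = FALSE returns the bin counts without drawing anything.
h_wide   <- hist(gross, breaks = seq(0, 1400, by = 350), plot = FALSE)
h_narrow <- hist(gross, breaks = seq(0, 1400, by = 100), plot = FALSE)

print(h_wide$counts)    # 4 wide bars
print(h_narrow$counts)  # 14 narrow bars
```

Both calls describe the same 15 values; only the number of bars, and therefore the apparent shape, changes.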

Shape
[Figure: example distributions labeled symmetric, right-skewed (long right tail), and left-skewed.]

Notation
The sample size, the number of cases in the sample, is denoted by $n$.
We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$.
$x_1 = 97.009$, $x_2 = 201.897$, $x_3 = 216.196$, ...

Mean
The mean or average of the data values is
$\text{mean} = \frac{\text{sum of all data values}}{\text{number of data values}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$
Sample mean: $\bar{x}$
Population mean: $\mu$ ("mu")
R: mean(x)

Median
The median, m, is the middle value when the data are ordered.
If there is an even number of values, the median is the average of the two middle values.
The median splits the data in half.
R: median(x)

Measures of Center
[Figure: distribution of World Gross (in millions), with the median $m = 76.66$ and the mean $\bar{x} = 150.74$ marked.]
The mean is pulled in the direction of skewness.

Skewness and Center
A distribution is left-skewed. Which measure of center would you expect to be higher?
a) Mean
b) Median
The mean will be pulled down towards the skewness (towards the long tail).
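A small numeric check of this pull, using made-up values: one vector with a long right tail and its mirror image with a long left tail.

```r
# Hypothetical data with a long right tail.
right_skewed <- c(2, 3, 3, 4, 5, 6, 40)
# Mirror image: the same shape with a long left tail.
left_skewed <- -right_skewed

# Right-skewed: the mean is pulled above the median.
print(c(mean = mean(right_skewed), median = median(right_skewed)))

# Left-skewed: the mean is pulled below the median.
print(c(mean = mean(left_skewed), median = median(left_skewed)))
```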

Outlier
An outlier is an observed value that is notably distinct from the other values in a dataset.

Outliers
[Figure: World Gross (in millions), with Harry Potter, Transformers, and Pirates of the Caribbean labeled.]

Resistance
A statistic is resistant if it is relatively unaffected by extreme values.
The median is resistant, while the mean is not.
                        Mean            Median
With Harry Potter       $150,742,300    $76,658,500
Without Harry Potter    $141,889,900    $75,009,000

Outliers
When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake.
If not, you have to decide whether the outlier is part of your population of interest or not.
Usually, for outliers that are not a mistake, it's best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.
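Running the analysis twice can be as simple as the sketch below. The values are hypothetical (in millions), and treating the single largest value as the outlier is an assumption made only for this illustration.

```r
# Hypothetical world gross values in millions, with one extreme value.
gross <- c(40, 55, 63, 76, 90, 110, 150, 1328)

# For this illustration only, flag the largest value as the outlier.
is_outlier <- gross == max(gross)

with_outlier    <- c(mean = mean(gross), median = median(gross))
without_outlier <- c(mean = mean(gross[!is_outlier]),
                     median = median(gross[!is_outlier]))

print(rbind(with_outlier, without_outlier))

# How much each statistic moves when the outlier is dropped.
change <- with_outlier - without_outlier
print(change)
```

The mean shifts far more than the median, which is what resistance means in practice.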

Standard Deviation
The standard deviation for a quantitative variable measures the spread of the data.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Sample standard deviation: $s$
Population standard deviation: $\sigma$ ("sigma")
R: sd(x)

Standard Deviation
The standard deviation gives a rough estimate of the typical distance of a data value from the mean.
The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
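As a check on the sample standard deviation formula, $s = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$, the sketch below computes it by hand on a small made-up vector and compares the result with R's `sd()`.

```r
# Small hypothetical data set.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Squared distances from the mean, summed and divided by n - 1.
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Built-in sample standard deviation (also uses n - 1).
s_builtin <- sd(x)

print(c(manual = s_manual, builtin = s_builtin))
```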

Standard Deviation
[Figure: two frequency histograms on the same scale from -15 to 15, one with s = 1 and one with s = 4.]
Both of these distributions are bell-shaped.

95% Rule
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.
For a population, about 95% of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$.
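The rule can be checked by simulation. This sketch draws bell-shaped data with `rnorm()`; the sample size, seed, and standard deviation of 4 are arbitrary choices, and the proportion found will vary slightly with the seed.

```r
set.seed(101)  # arbitrary seed, only for reproducibility

# Simulated bell-shaped data (hypothetical; mean 0, standard deviation 4).
x <- rnorm(10000, mean = 0, sd = 4)

# Interval of two sample standard deviations around the sample mean.
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of the data inside the interval; should be close to 0.95.
prop_within <- mean(x >= lower & x <= upper)
print(prop_within)
```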

The 95% Rule
[Figure: two frequency histograms, one with s = 1 on a scale from -3 to 3 and one with s = 4 on a scale from -15 to 15.]
StatKey

The 95% Rule
The standard deviation for hours of sleep per night is closest to
a)
b) 1
c) 2
d) 4
e) I have no idea
s = 2.03

z-score
The z-score for a data value, $x$, is
$z = \frac{x - \bar{x}}{s}$
For a population, $\bar{x}$ is replaced with $\mu$ and $s$ is replaced with $\sigma$.
Values farther from 0 are more extreme.

z-score
A z-score puts values on a common scale.
A z-score is the number of standard deviations a value falls from the mean.
95% of all z-scores fall between what two values? -2 and 2.
z-scores beyond -2 or 2 can be considered extreme.

z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
ACT: $\mu = 21$, $\sigma = 5$
SAT: $\mu = 1500$, $\sigma = 325$
Assume ACT and SAT scores have approximately bell-shaped distributions.
a) ACT score of 28
b) SAT score of 2100
c) I don't know
ACT: $z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4$
SAT: $z = \frac{2100 - 1500}{325} = \frac{600}{325} = 1.85$
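The same comparison in R, as a minimal sketch. The helper `z_score()` is a name introduced here for illustration; the means and standard deviations are the ones given on the slide.

```r
# z-score: number of standard deviations a value falls from the mean.
# (Hypothetical helper, not a built-in R function.)
z_score <- function(x, mean, sd) (x - mean) / sd

z_act <- z_score(28, mean = 21, sd = 5)         # (28 - 21) / 5
z_sat <- z_score(2100, mean = 1500, sd = 325)   # (2100 - 1500) / 325

print(round(c(ACT = z_act, SAT = z_sat), 2))
```

The SAT score has the larger z-score, so it sits farther above its mean in standard deviation units.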

Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
Q1 = median of the values below m
Q3 = median of the values above m

Five Number Summary
Five Number Summary: Min, Q1, m, Q3, Max
About 25% of the data fall in each of the four intervals between consecutive values.
R: summary(x)
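The quartile definition used on the slides (median of the values below and above m) can be written out directly. The function `five_number()` and the data are hypothetical, and splitting the sorted data into lower and upper halves that exclude the middle value is one common reading of that definition. R's `summary()` and `quantile()` interpolate by a different rule by default, so their quartiles can differ slightly.

```r
# Five number summary using "median of each half" quartiles.
# (Hypothetical helper; assumes at least two values.)
five_number <- function(x) {
  stopifnot(length(x) >= 2)
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)
  lower <- x[seq_len(half)]        # values below the median position
  upper <- x[(n - half + 1):n]     # values above the median position
  c(Min = x[1], Q1 = median(lower), m = median(x),
    Q3 = median(upper), Max = x[n])
}

# Hypothetical data.
x <- c(1, 3, 4, 6, 7, 9, 12, 15, 30)

res <- five_number(x)
print(res)

# For comparison: R's default quantile rule gives different quartiles here.
print(quantile(x, c(0.25, 0.75)))
```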

Five Number Summary
> summary(study_hours)
   Min. 1st Qu.  Median 3rd Qu.    Max.
   2.00   10.00   15.00   20.00   69.00
The distribution of number of hours spent studying each week is
a) Symmetric
b) Right-skewed
c) Left-skewed
d) Impossible to tell

Percentile
The Pth percentile is the value which is greater than P% of the data.
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.
We could also have used percentiles:
ACT score of 28: 91st percentile
SAT score of 2100: 97th percentile
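Two ways to get a percentile are sketched below. The first counts the share of observed values below a score, using made-up scores and a hypothetical helper `percentile_rank()`. The second assumes a bell-shaped (normal) distribution with the means and standard deviations given earlier. The transcript does not say how the 91st and 97th percentiles were obtained, so the normal-based values are only an approximation to compare against.

```r
# Percent of observed values that fall below a given value.
# (Hypothetical helper, not a built-in R function.)
percentile_rank <- function(value, x) 100 * mean(x < value)

# Hypothetical ACT scores for ten students.
scores <- c(12, 15, 18, 20, 21, 22, 24, 26, 28, 31)
rank_28 <- percentile_rank(28, scores)
print(rank_28)

# Normal approximation, assuming bell-shaped score distributions.
act_pct <- 100 * pnorm(28, mean = 21, sd = 5)
sat_pct <- 100 * pnorm(2100, mean = 1500, sd = 325)
print(round(c(ACT = act_pct, SAT = sat_pct), 1))
```

Under the normal assumption the SAT score again lands at the higher percentile, in line with the z-score comparison.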
