
Exploratory Data Analysis in Statistics Course at Duke University
Explore the slides from the introductory session of the Statistics course at Duke University, covering topics like data analysis, clicker registration, readiness assessment, and important reminders for application exercises. Get insights into EDA methods, visualizations, and assignments.
Presentation Transcript
Unit 1: Introduction to data. 2. Exploratory data analysis. Sta 101, Spring 2020. Duke University, Department of Statistical Science. Dr. Ellison. Slides posted at https://www2.stat.duke.edu/courses/Spring20/sta101.002/
Register your Clicker!
To do now:
1. Turn on your clicker (orange button).
2. Wait about 6 seconds.
3. IF your clicker says READY:
   a) Look for when your name appears on the slides. (If you don't see your name and you are officially enrolled in the course, let DR. ELLISON know!)
   b) When you see your name, type the 4 letters you see under it (you have 20 seconds).
   c) If your name box turned GRAY, your clicker should now be registered to the class!
IF your clicker does not say READY:
   a) Hold down the orange button until the clicker screen changes.
   b) Quickly press AA.
   c) Wait ~6 seconds.
   d) Your screen should now say READY. (If not, ask a TA for help!)
We will be registering clickers 1/14, 1/16, 1/21; clicker grading begins on 1/23!
Clicker Help in Readiness Assessment
Syncing to the Quiz: You should see this screen once I press the RA clicker start button (I'll tell you when). If not, press the blue refresh button!
Reviewing/Changing your Answer: The most recent letter you chose should show up here. If you want to change your answer, just press the letter you want. My iclicker box stores the most recent letter you pressed for that number.
Moving to Other Questions: Press the up button to go to higher numbers. Press the down button to go to lower numbers.
Submitting your Answer: Select the letter you want to pick for the number shown here.
Important Reminder for Scratch Cards and Application Exercises
Accurately taking group attendance on the scratch cards and application exercises is part of your participation grade!
Correct Name Format:
   Group Name: Stats IS FUN
   Present Group Members: Amy Lastname, Barry Lastname
   Absent Group Members: Missing Mary, Absent Albert
Incorrect Name Format:
   Stats IS FUN
   Amy Lastname, Barry Lastname, Missing Mary, Absent Albert
TO-DO:
1. Complete the Pre-test and Getting to Know You Survey (on Sakai main page) by Wednesday 1/15.
2. Buy clicker and register in class by Wednesday 1/23.
3. Problem Set 1 due Sunday 1/26.
4. Lab Assignment 1 due Monday 1/27.
Outline
1. General EDA
   1. Always start your EDA with a visualization before summary statistics.
2. EDA: Single Numerical Variable
   1. Visualizations:
      a) What can histograms, dotplots, and boxplots say about shape?
      b) How to draw a boxplot (with data).
   2. Discussing:
      a) When describing numerical distributions, discuss shape, center, spread, and unusual observations.
      b) How do mean, median, and skewness relate?
      c) Understanding the standard deviation equation.
      d) Robust statistics are not easily affected by outliers and extreme skew.
      e) How to guess the shape of a distribution.
      f) How to approximate standard deviation (with a histogram).
Outline Should we calculate a summary statistic or make a data visualization first?
From a past Sta 101 survey... How old were you when you had your first kiss?
[Figure: distribution of responses; horizontal axis: age at first kiss]
Do you see anything out of the ordinary? Some people reported very low ages, which might suggest the survey question wasn't clear: romantic kiss or any kiss?
Answer: Start with a visualization. This helps us catch unclear questions, data errors, etc.
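A minimal sketch of the "visualize first" idea, using only the Python standard library. The ages below are invented for illustration (they are not the survey responses), and `text_histogram` and `BIN_WIDTH` are hypothetical names chosen for this example.

```python
from collections import Counter

# Hypothetical responses (NOT the real survey data): age at first kiss.
ages = [1, 2, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18]
BIN_WIDTH = 5  # assumed bin width, chosen only for this illustration


def text_histogram(values, bin_width):
    """Return {bin_start: count} for every bin from the lowest to the highest."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    starts = range(min(counts), max(counts) + bin_width, bin_width)
    # Empty bins are kept so that gaps in the data stay visible.
    return {start: counts[start] for start in starts}


bins = text_histogram(ages, BIN_WIDTH)
for start, count in bins.items():
    # One '#' per observation in the bin [start, start + BIN_WIDTH).
    print(f"{start:>2} to {start + BIN_WIDTH - 1:>2}: {'#' * count}")
```

In this made-up example the two values in the lowest bin sit apart from the rest, separated by an empty bin. A mean computed first would hide that pattern, while the picture makes it obvious that something needs a second look.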
EDA: One Numerical Variable. First visualize.
1. Histogram
2. Dot plot
3. Box plot
EDA: One Numerical Variable. Then describe. What four things should we always discuss?
1. Shape of the distribution
   a) Modality: unimodal, bimodal, multimodal, or uniform
   b) Skewness: symmetric, left-skewed, or right-skewed
2. Measures of center: an estimate of a typical observation in the data
3. Spread: a measure of variability of the data
4. Any outliers?
Clicker question: Which of the following visualizations does NOT show us BOTH modality and skewness of a distribution?
(a) Histogram
(b) Boxplot
(c) Dot plot
EDA: One Numerical Variable. The four things to discuss, in detail:
1. Shape of the distribution
   a) Modality: unimodal, bimodal, multimodal, or uniform
   b) Skewness: symmetric, left-skewed, or right-skewed
Summary statistics:
2. Measures of center: an estimate of a typical observation in the data
   a) Mean
   b) Median
3. Spread: a measure of variability of the data
   a) Standard deviation (or variance)
   b) IQR
   c) Range
4. Any outliers?
   a) Suspected outlier: observations that stand out from the rest of the data
   b) Actual outlier: observation < Q1 - 1.5 × IQR, or observation > Q3 + 1.5 × IQR
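The 1.5 × IQR rule, together with the five-number summary used to draw a boxplot, can be sketched as follows. The data are made up for illustration, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption; a hand method from a textbook may give slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical data for illustration only (not from the course).
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles. Conventions differ between textbooks and software, so Q1 and Q3
# from a hand calculation may differ slightly from these values.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Boxplot whiskers reach to the most extreme observations inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high, "outliers:", outliers)
```

With these made-up values the fences are -1 and 15, so 30 is flagged as an outlier and the whiskers run from 2 to 10.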
How do mean, median, and skewness relate?
Mean vs. median
Clicker question: How do the mean and median of the following two datasets compare?
Dataset 1: 30, 50, 70, 90
Dataset 2: 30, 50, 70, 1000
(a) $\bar{x}_1 = \bar{x}_2$, median1 = median2
(b) $\bar{x}_1 < \bar{x}_2$, median1 = median2
(c) $\bar{x}_1 < \bar{x}_2$, median1 < median2
(d) $\bar{x}_1 > \bar{x}_2$, median1 < median2
(e) $\bar{x}_1 > \bar{x}_2$, median1 = median2
Answer: (b). Dataset 1 is symmetric: mean = median = 60. Dataset 2 is right-skewed: mean > median = 60.
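The clicker answer can be checked directly with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are chosen for this example.

```python
import statistics

# The two datasets from the clicker question.
dataset1 = [30, 50, 70, 90]
dataset2 = [30, 50, 70, 1000]

mean1, median1 = statistics.mean(dataset1), statistics.median(dataset1)
mean2, median2 = statistics.mean(dataset2), statistics.median(dataset2)

# The single large value pulls the mean up but leaves the median alone.
print("Dataset 1: mean =", mean1, "median =", median1)
print("Dataset 2: mean =", mean2, "median =", median2)
```

Replacing 90 with 1000 moves the mean from 60 to 287.5, while the median stays at 60 in both datasets. This is why the median is described as robust and why the mean sits above the median in a right-skewed distribution.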
Standard deviation and variance
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
Variance = (standard deviation)$^2$
Why do we square the differences in these calculations?
1. Get rid of the negatives.
2. Large deviations are weighted more strongly.
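A short sketch of the two formulas written out term by term, checked against `statistics.pstdev` and `statistics.stdev` from the standard library. The function names `population_sd` and `sample_sd` are chosen for this example, and the data are Dataset 1 from the mean vs. median clicker question.

```python
import math
import statistics

data = [30, 50, 70, 90]  # Dataset 1 from the mean vs. median clicker question


def population_sd(values):
    """Divide the sum of squared deviations by N (treats values as a population)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Divide the sum of squared deviations by n - 1 (treats values as a sample)."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))


# Without squaring, positive and negative deviations cancel out.
mean = sum(data) / len(data)
raw_deviation_sum = sum(x - mean for x in data)

print("sum of raw deviations:", raw_deviation_sum)
print("population SD:", population_sd(data))
print("sample SD:", sample_sd(data))
```

For this dataset the squared deviations sum to 2000, so the population SD is the square root of 500 (about 22.4) and the sample SD is the square root of 2000/3 (about 25.8). The raw deviations sum to zero, which illustrates the first reason for squaring.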
Standard deviation and variance: Why n - 1 vs. n?
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Why not use the same form, dividing by n, for the sample standard deviation instead?
Dividing by n: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$. This is a biased estimate of $\sigma$ and the smaller of the two values. It tends to be an underestimate of $\sigma$. Underestimation is caused by having to estimate ONE PARAMETER: $\mu$ (with $\bar{x}$).
Dividing by n - 1: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$. This is the sample standard deviation and the larger of the two values. Its square, $s^2$, is an unbiased estimate of $\sigma^2$.
Using n - 1 helps counterbalance the underestimation of the biased estimate of $\sigma$.
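The underestimation can be seen in a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with $\sigma = 1$, the sample size, number of trials, and seed are arbitrary choices, and the comparison is made on variances, where dividing by n - 1 is exactly unbiased.

```python
import random

random.seed(101)  # arbitrary seed so the run is repeatable

TRUE_SIGMA = 1.0  # assumed population SD of the simulated normal population
N = 5             # small sample size, where the bias is easiest to see
TRIALS = 20_000   # number of simulated samples

biased_vars, unbiased_vars = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, TRUE_SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N  # the ONE estimated parameter
    squared_dev_sum = sum((x - x_bar) ** 2 for x in sample)
    biased_vars.append(squared_dev_sum / N)          # divide by n
    unbiased_vars.append(squared_dev_sum / (N - 1))  # divide by n - 1

avg_biased = sum(biased_vars) / TRIALS
avg_unbiased = sum(unbiased_vars) / TRIALS

print("true variance:", TRUE_SIGMA ** 2)
print("average when dividing by n:", round(avg_biased, 3))
print("average when dividing by n - 1:", round(avg_unbiased, 3))
```

In expectation the divide-by-n version averages $(n-1)/n \cdot \sigma^2$, which is 0.8 here, while the n - 1 version averages close to the true value of 1. The deviations are measured from $\bar{x}$, which is always at least as close to the sample as $\mu$ is, so the sum of squares comes out too small unless the divisor shrinks to compensate.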