
Exploratory Data Analysis Techniques for Visualizing Relationships in Categorical and Numerical Variables
Explore data visualization methods for analyzing relationships between categorical and numerical variables, including segmented bar plots, mosaic plots, and side-by-side box plots. Build intuition for making inferences and stay aware of Simpson's paradox in statistical analysis.
Presentation Transcript
Unit 1: Introduction to data. 3. More exploratory data analysis. Sta 101, Fall 2019, Duke University, Department of Statistical Science. Dr. Ellison. Slides posted at https://www2.stat.duke.edu/courses/Fall19/sta101.001/
1. Housekeeping
2. Main ideas
   A. Two Categorical Variables: Use segmented bar plots or mosaic plots for visualizing relationships between two categorical variables
   B. One Numerical and One Categorical Variable: Use side-by-side box plots to visualize relationships between a numerical and a categorical variable
   C. Building Intuition For Making Inferences: Not all observed differences are statistically significant. Be aware of Simpson's paradox
3. Application Exercise
News/Coming up: Office hours have opened up! https://www2.stat.duke.edu/courses/Fall19/sta101.001/officehours.html Problem Set 1 is due Friday 9/6 at 11:55pm on Sakai! Submission options: handwrite and scan (number the pages), type up and submit as a PDF, or write in the Sakai text box.
What types of plots can we use to visualize two categorical variables?

Class Year (Independent Variable)   In a Relationship? (Dependent Variable)
Freshman                            Yes
Senior                              Yes
Freshman                            It's complicated
Sophomore                           No
Junior                              No
Junior                              Yes
Freshman                            Yes
Why is a mosaic plot or a frequency segmented bar plot better than a segmented bar plot for visualizing the relationship between two categorical variables, such as class year (independent variable) and relationship status (dependent variable)?
What do the heights of the segments represent? Is there a relationship between class year and relationship status? Do the widths of the bars represent anything?
[Figure: Segmented Bar Plot, "Relationship status vs. class year". x-axis: class year (First year, Sophomore, Junior, Senior); y-axis: count (0 to 30); fill: relationship_status (yes, no, it's complicated).]
What do the widths of the bars represent? What about the heights of the boxes? Is there a relationship between class year and relationship status? What descriptive statistics can we use to summarize these data?
[Figure: Mosaic Plot, "Relationship status vs. class year". Bars: class year (First year, Sophomore, Junior, Senior); boxes: relationship status (yes, no, it's complicated).]
What do the widths of the bars represent? What about the heights of the boxes? Is there a relationship between class year and relationship status? What descriptive statistics can we use to summarize these data?
[Figure: Frequency Segmented Bar Plot, "Relationship status vs. class year". Bars: class year (Freshman, Sophomore, Junior, Senior); segments: relationship status (yes, no, it's complicated).]
Mosaic plots and frequency segmented bar plots show the proportion of each dependent-variable level within each independent-variable level. A segmented bar plot shows counts. Because the groups can differ in size, counts alone are hard to compare across groups. Proportions tell us whether there is a relationship!
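The difference between counts and within-group proportions can be sketched in a few lines of Python. This is only an illustration: the six responses below are hypothetical (they are not the class survey data), and the function names are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical (class year, relationship status) responses, for illustration only.
responses = [
    ("First year", "yes"), ("First year", "no"),
    ("First year", "no"), ("First year", "no"),
    ("Senior", "yes"), ("Senior", "no"),
]

def counts_by_group(pairs):
    """Count each status within each class year (what a segmented bar plot shows)."""
    table = defaultdict(Counter)
    for group, status in pairs:
        table[group][status] += 1
    return table

def proportions_by_group(pairs):
    """Turn counts into within-group proportions (what a mosaic plot shows)."""
    result = {}
    for group, counter in counts_by_group(pairs).items():
        total = sum(counter.values())
        result[group] = {status: n / total for status, n in counter.items()}
    return result

counts = counts_by_group(responses)
props = proportions_by_group(responses)
for year in counts:
    print(year, dict(counts[year]), props[year])
```

In this made-up sample both class years have the same count of "yes" responses, but the proportion of "yes" differs, which is the pattern the proportion-based plots make visible.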
What is a good plot to use for visualizing the relationship between a categorical and a numerical variable?

Are You a Vegetarian? (Independent Variable)   How many nights do you spend drinking a week? (Dependent Variable)
yes                                            0
yes                                            1
no                                             7
no                                             3
no                                             4
yes                                            2
no                                             1
How do drinking habits of vegetarian vs. non-vegetarian students compare?
[Figure: side-by-side box plots, "Nights drinking/week vs. vegetarianism". x-axis: vegetarian (no, yes); y-axis: nights drinking (0 to 6).]
Side-by-side box plots are useful for visualizing the relationship between a categorical and a numerical variable, such as vegetarianism (independent variable) and nights spent drinking per week (dependent variable).
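A side-by-side box plot compares the distribution of the numerical variable within each category. As a rough sketch, the per-group minimum, median, and maximum can be computed from the seven example responses shown on the slide. Quartiles are left out here because their definition varies between software packages; the function names are invented for this example.

```python
from statistics import median

# Example rows from the slide: (vegetarian?, nights drinking per week).
rows = [
    ("yes", 0), ("yes", 1), ("no", 7), ("no", 3),
    ("no", 4), ("yes", 2), ("no", 1),
]

def group_values(pairs):
    """Collect the numerical values separately for each category."""
    groups = {}
    for label, value in pairs:
        groups.setdefault(label, []).append(value)
    return groups

def summarize(values):
    """Minimum, median, and maximum: three of the values a box plot displays."""
    return {"min": min(values), "median": median(values), "max": max(values)}

summaries = {label: summarize(vals) for label, vals in group_values(rows).items()}
for label, summary in summaries.items():
    print(label, summary)
```

Comparing the two medians gives the same kind of comparison that the center lines of the two boxes give in the plot.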
Building Intuition For Making Inferences: Are all observed differences between two groups actually meaningful?
What percent of the students sitting on the left side of the classroom have PCs? What about on the right? Are these numbers exactly the same? If not, do you think the difference is real, or due to random chance?
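One way to build intuition for "real or random chance" is to shuffle the left/right labels many times and see how often chance alone produces a difference at least as large as the observed one. The sketch below assumes hypothetical counts (12 of 20 students on the left and 9 of 20 on the right have PCs); these numbers are not from the class, and the shuffling approach is an illustration rather than the method used in the lecture.

```python
import random

def diff_in_proportions(labels, outcomes):
    """Proportion with a PC on the left minus the proportion on the right."""
    left = [o for side, o in zip(labels, outcomes) if side == "left"]
    right = [o for side, o in zip(labels, outcomes) if side == "right"]
    return sum(left) / len(left) - sum(right) / len(right)

def shuffle_p_value(labels, outcomes, n_shuffles=5000, seed=0):
    """Share of random relabelings with a difference at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(diff_in_proportions(labels, outcomes))
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(diff_in_proportions(shuffled, outcomes)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical classroom: 1 = has a PC, 0 = does not.
labels = ["left"] * 20 + ["right"] * 20
outcomes = [1] * 12 + [0] * 8 + [1] * 9 + [0] * 11

observed_diff = diff_in_proportions(labels, outcomes)
p_value = shuffle_p_value(labels, outcomes)
print(f"observed difference: {observed_diff:.2f}, share of shuffles as extreme: {p_value:.2f}")
```

If a large share of the shuffles produce a difference this big, the observed difference is easy to explain by random chance alone.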
Building Intuition For Making Inferences: Some studies suggest one relationship but, on closer inspection, reveal the opposite relationship. Beware of Simpson's Paradox!
Race and death-penalty sentences in Florida murder cases. A 1991 study by Radelet and Pierce on race and death-penalty (DP) sentences gives the following table:

Defendant's race   Caucasian   African American   Total
DP                 53          15                 68
No DP              430         176                606
Total              483         191                674
% DP               11%         7.9%

Who is more likely to get the death penalty?
Adapted from Subsection 2.3.2 of A. Agresti (2002), Categorical Data Analysis, 2nd ed., and http://math.stackexchange.com/questions/83756/examples-of-simpsons-paradox.
Another look. Same data, taking into consideration the victim's race:

Victim's race      Caucasian   Caucasian          African American   African American   Total
Defendant's race   Caucasian   African American   Caucasian          African American
DP                 53          11                 0                  4                  68
No DP              414         37                 16                 139                606
Total              467         48                 16                 143                674
% DP               11.3%       22.9%              0%                 2.8%

Who is more likely to get the death penalty?
Contradiction? People of one race are more likely to murder others of the same race, murdering a Caucasian is more likely to result in the death penalty, and there are more Caucasian defendants than African American defendants in the sample.
Same-race vs. different-race cases: (467 + 143) / 674 ≈ 90% vs. (48 + 16) / 674 ≈ 10%
Death-penalty rate by victim's race: (53 + 11) / (467 + 48) ≈ 12% for Caucasian victims vs. (0 + 4) / (16 + 143) ≈ 2.5% for African American victims
Controlling for the victim's race reveals more insight into the data and changes the direction of the relationship between race and death penalty.

Death Penalty? (Dependent Variable)   Defendant Race (Independent Variable)   Defendant and Victim Races (Independent Variable)
yes                                   White Defendant                         Black Victim - White Defendant
yes                                   Black Defendant                         Black Victim - Black Defendant
no                                    White Defendant                         Black Victim - White Defendant
no                                    White Defendant                         White Victim - White Defendant
no                                    White Defendant                         Black Victim - White Defendant
yes                                   White Defendant                         White Victim - White Defendant
no                                    White Defendant                         White Victim - White Defendant
no                                    Black Defendant                         Black Victim - Black Defendant
This phenomenon is called Simpson's Paradox: an association, or a comparison, that holds when we compare two groups can disappear or even be reversed when the original groups are broken down into smaller groups according to some other feature (a confounding/lurking variable).
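The reversal can be checked directly from the counts in the two tables above. The following is a minimal sketch in Python using those counts; the dictionary layout and function names are choices made for this illustration.

```python
# Counts from the table: (victim's race, defendant's race) -> (DP, No DP).
counts = {
    ("Caucasian", "Caucasian"): (53, 414),
    ("Caucasian", "African American"): (11, 37),
    ("African American", "Caucasian"): (0, 16),
    ("African American", "African American"): (4, 139),
}

def dp_rate(defendant, victim=None):
    """Death-penalty rate for a defendant race, optionally within one victim race."""
    cells = [
        cell for (vic, dfn), cell in counts.items()
        if dfn == defendant and (victim is None or vic == victim)
    ]
    dp = sum(c[0] for c in cells)
    total = sum(c[0] + c[1] for c in cells)
    return dp / total

races = ("Caucasian", "African American")

# Collapsed over victim's race: Caucasian defendants have the higher rate.
overall = {d: dp_rate(d) for d in races}

# Within each victim's race: African American defendants have the higher rate.
stratified = {v: {d: dp_rate(d, victim=v) for d in races} for v in races}

print("Overall:", {d: f"{r:.1%}" for d, r in overall.items()})
for victim, rates in stratified.items():
    print(f"Victim {victim}:", {d: f"{r:.1%}" for d, r in rates.items()})
```

The collapsed comparison and the within-group comparisons point in opposite directions because most cases involve a defendant and victim of the same race, and cases with Caucasian victims have a much higher death-penalty rate.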
Application exercise: 1.2 Histogram to boxplot. See the course website for instructions. How to approximate a boxplot, given a histogram.