
Descriptive Statistics and Skewness in Data Analysis
Explore the concepts of shape, skewness, and the impact of outliers on statistical measures like mean, median, and mode. Learn how to interpret skewness in datasets and examine the practical application of descriptive statistics using examples and tools like MS Excel.
Presentation Transcript
Descriptive Statistics: Part II Thursday, April 3, 2025
Shape
A third important property of data, after location and dispersion, is its shape. Shape can be described by the degree of asymmetry (i.e., skewness).
mean > median: positive or right-skewness
mean = median: symmetric or zero-skewness
mean < median: negative or left-skewness
Positive skewness can arise when the mean is increased by some unusually high values. Negative skewness can arise when the mean is decreased by some unusually low values.
Thursday, April 3, 2025 Applied Statistics
Skewness
[Figure: three distribution shapes labeled Left skewed, Right skewed, and Symmetric]
Example: # hours to complete a task
Data (for n = 12 employees): 2 3 8 8 9 10 10 12 15 18 22 63
x̄ = 180/12 = 15 hours
Median = 10 hours
The (extremely slow) employee who took 63 hours to complete the task skewed the entire distribution to the right. (Slide annotation on the value 63: "This guy took a VERY long time!")
s² = 2868/11 = 260.73
s = 16.15 hours
CV = s / x̄ = 107.6%
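The figures in this example can be checked with a short script. This is a minimal sketch using only Python's standard `statistics` module; the variable names and the helper `skew_direction` are illustrative and not part of the slides.

```python
import statistics

# Hours taken by the 12 employees in the example
hours = [2, 3, 8, 8, 9, 10, 10, 12, 15, 18, 22, 63]

mean = statistics.mean(hours)          # 180 / 12
median = statistics.median(hours)      # average of the 6th and 7th sorted values
variance = statistics.variance(hours)  # sample variance, divides by n - 1
std_dev = statistics.stdev(hours)      # square root of the sample variance
cv_percent = 100 * std_dev / mean      # coefficient of variation in percent


def skew_direction(mean_value, median_value):
    """Apply the mean-versus-median rule from the Shape slide."""
    if mean_value > median_value:
        return "right-skewed (positive)"
    if mean_value < median_value:
        return "left-skewed (negative)"
    return "symmetric (zero skewness)"


shape = skew_direction(mean, median)
print(f"mean={mean}, median={median}, s^2={variance:.2f}, "
      f"s={std_dev:.2f}, CV={cv_percent:.1f}%, shape: {shape}")
```

Dropping the value 63 from the list and rerunning shows how strongly a single outlier pulls the mean while barely moving the median.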
Example: Using MS Excel
Scores of 17 students on a national calculus exam.
Data: 0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95
Open MS Excel. Go to Data Analysis > Analysis Tools > Descriptive Statistics. If Data Analysis is not available, use the Add-ins feature to add it to MS Excel (the Analysis ToolPak add-in). Make sure to check the Summary Statistics box once you are in Descriptive Statistics. See the MS Excel output on the next slide.
Using MS Excel
From the output:
mean is 38.94
median is 30
mode is 0
standard deviation is 33.44
variance is 1118.43
skewness is 0.78 (positive)
range is 95
n is 17
MS Excel calculates skewness with a moment-based formula (the adjusted Fisher-Pearson standardized moment coefficient), not with Pearson's mean-minus-median coefficient of skewness. You do not have to know the formula. If the coefficient is 0 or very close to it, the distribution is approximately symmetric.
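For readers without Excel, the same summary can be reproduced in Python. This is a sketch, not the Excel implementation: it assumes the skewness value corresponds to the adjusted moment-based formula n / ((n - 1)(n - 2)) × Σ((x - x̄) / s)³, which does reproduce the 0.78 reported on the slide for this data.

```python
import statistics

# Scores of the 17 students on the calculus exam
scores = [0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
variance = statistics.variance(scores)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(scores)       # sample standard deviation
data_range = max(scores) - min(scores)

# Adjusted moment-based sample skewness (assumed to match the Excel output)
skewness = n / ((n - 1) * (n - 2)) * sum(
    ((x - mean) / std_dev) ** 3 for x in scores
)

print(f"n={n} mean={mean:.2f} median={median} mode={mode}")
print(f"std dev={std_dev:.2f} variance={variance:.2f} "
      f"range={data_range} skewness={skewness:.2f}")
```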
Five Number Summary
When examining a distribution for shape, sometimes the five-number summary is useful:
Smallest | Q1 | Median | Q3 | Largest
Example: 2 3 8 8 9 10 10 12 15 18 22 63 (x̄ = 15)
5-number summary: 2 | 8 | 10 | 16.5 | 63
This data is right-skewed. In right-skewed distributions, the distance from Q3 to Xlargest (16.5 to 63) is significantly greater than the distance from Xsmallest to Q1 (2 to 8).
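The summary above can be reproduced with a small helper. The slides do not say which quartile rule they use; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data, which gives the 8 and 16.5 shown. Other quartile conventions (including some spreadsheet functions) can return slightly different values. The function name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data; the middle value is left out when n is odd.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )


hours = [2, 3, 8, 8, 9, 10, 10, 12, 15, 18, 22, 63]
smallest, q1, median, q3, largest = five_number_summary(hours)

# Compare the two tail distances mentioned on the slide
upper_tail = largest - q3    # Q3 to largest
lower_tail = q1 - smallest   # smallest to Q1
print(smallest, q1, median, q3, largest, upper_tail, lower_tail)
```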
Boxplot
The boxplot is a way to graphically portray a distribution of data by means of its five-number summary.
- Vertical line drawn within the box is the median.
- Vertical line at the left side of the box is Q1.
- Vertical line at the right side of the box is Q3.
- Line on the left connects the left side of the box with Xsmallest (lower 25% of data).
- Line on the right connects the right side of the box with Xlargest (upper 25% of data).
A boxplot can be drawn horizontally or vertically.
[Figure: boxplot of a bell-shaped, symmetric data distribution]
Frequency Distribution Data collected in original form is called raw data. A frequency distribution is the organization of raw data in table form, using classes and frequencies. Nominal- or ordinal-level data that can be placed in categories is organized in categorical frequency distributions.
Example of Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood type.
Raw Data:
A, B, B, AB, O
O, O, B, AB, B
B, B, O, A, O
A, O, O, O, AB
AB, A, O, B, A
Construct a frequency distribution for the data.
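A categorical frequency distribution is just a tally per category. The following sketch counts the raw data above with `collections.Counter`; the percentage column is an addition for illustration and is not on the slide.

```python
from collections import Counter

# The 25 blood types, in the order given in the raw data
raw = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()

counts = Counter(raw)
total = sum(counts.values())

print("Class  Frequency  Percent")
for blood_type in ("A", "B", "O", "AB"):
    frequency = counts[blood_type]
    print(f"{blood_type:<5}  {frequency:>9}  {100 * frequency / total:>6.0f}%")
print(f"Total  {total:>9}")
```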
General rules for constructing frequency distribution
1. Determine the range of values.
2. Select and determine the number of classes.
3. Determine the class length or width.
4. Determine the class frequency.
Example: The following data represent the compressive strength of concrete cubes in MPa. Construct a frequency distribution table.
28.3 30.5 25.4 26.8 35.2 34.6 31.3 30.0 31.1 34.5
27.5 30.8 32.0 26.4 27.5 28.6 29.7 34.2 36.3 29.6
37.5 37.5 36.3 29.5 33.0 35.2 35.2 38.5 36.3 33.9
35.1 35.1 38.4 37.3 38.4 38.4 38.2
Solution:
1. Range = highest value - lowest value = 38.5 - 25.4 = 13.1
2. Number of classes: Sturges' method = 1 + (3.3 * log10 N) = 1 + (3.3 * log10 37), which is about 6.2. Note that the number of classes should be between 5 and 15. We select 6.
3. Class length = range / number of classes = 13.1 / 6 = 2.18, rounded up to 2.2. Rounding rule: always round up if there is a remainder.
4. Class frequency:
Classes      Frequency (fi)
25.4-27.6     5
27.6-29.8     5
29.8-32.0     5
32.0-34.2     3
34.2-36.4    11
36.4-38.6     8
Total        37
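The four steps can be sketched in code. Two assumptions here are not stated on the slide: values are converted to integer tenths of an MPa so that class limits compare exactly, and a value equal to a shared limit (such as 32.0 or 34.2) is counted in the higher class, which is the convention that reproduces the frequencies in the table.

```python
import math

strengths = [
    28.3, 30.5, 25.4, 26.8, 35.2, 34.6, 31.3, 30.0, 31.1, 34.5,
    27.5, 30.8, 32.0, 26.4, 27.5, 28.6, 29.7, 34.2, 36.3, 29.6,
    37.5, 37.5, 36.3, 29.5, 33.0, 35.2, 35.2, 38.5, 36.3, 33.9,
    35.1, 35.1, 38.4, 37.3, 38.4, 38.4, 38.2,
]

# Work in integer tenths of an MPa to avoid floating-point edge effects
tenths = [round(x * 10) for x in strengths]
low, high = min(tenths), max(tenths)

# Step 1: range (131 tenths = 13.1 MPa)
data_range = high - low

# Step 2: Sturges' rule, rounded to the nearest whole number of classes
sturges = 1 + 3.3 * math.log10(len(strengths))
num_classes = round(sturges)

# Step 3: class width, always rounded up (22 tenths = 2.2 MPa)
width = math.ceil(data_range / num_classes)

# Step 4: class frequencies; each class includes its lower limit
frequencies = [0] * num_classes
for value in tenths:
    index = min((value - low) // width, num_classes - 1)
    frequencies[index] += 1

for k, frequency in enumerate(frequencies):
    lower = (low + k * width) / 10
    upper = (low + (k + 1) * width) / 10
    print(f"{lower:.1f}-{upper:.1f}  {frequency}")
print(f"Total      {sum(frequencies)}")
```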
Relative frequency (Percentage)
Relative frequency = frequency of class / total frequency = fi / Σfi
Classes      Frequency (fi)   Relative frequency
25.4-27.6     5               5/37 = 0.135
27.6-29.8     5               5/37 = 0.135
29.8-32.0     5               5/37 = 0.135
32.0-34.2     3               3/37 = 0.081
34.2-36.4    11               11/37 = 0.297
36.4-38.6     8               8/37 = 0.216
Total        37               1.00
Cumulative frequency
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.
Classes      Frequency (fi)   Relative frequency   Cumulative frequency
25.4-27.6     5               5/37 = 0.135          5
27.6-29.8     5               5/37 = 0.135         10
29.8-32.0     5               5/37 = 0.135         15
32.0-34.2     3               3/37 = 0.081         18
34.2-36.4    11               11/37 = 0.297        29
36.4-38.6     8               8/37 = 0.216         37
Total        37               1.00
Cumulative relative frequency
Cumulative relative frequency = cumulative frequency / total observations in data set
Classes      Frequency (fi)   Relative frequency   Cumulative frequency   Cumulative relative frequency
25.4-27.6     5               5/37 = 0.135          5                     5/37 = 0.135
27.6-29.8     5               5/37 = 0.135         10                     10/37 = 0.270
29.8-32.0     5               5/37 = 0.135         15                     15/37 = 0.405
32.0-34.2     3               3/37 = 0.081         18                     18/37 = 0.486
34.2-36.4    11               11/37 = 0.297        29                     29/37 = 0.784
36.4-38.6     8               8/37 = 0.216         37                     37/37 = 1.00
Total        37               1.00
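The three derived columns follow mechanically from the class frequencies. A minimal sketch, taking the frequencies from the concrete-strength table as given and using `itertools.accumulate` for the running totals:

```python
from itertools import accumulate

classes = ["25.4-27.6", "27.6-29.8", "29.8-32.0",
           "32.0-34.2", "34.2-36.4", "36.4-38.6"]
frequencies = [5, 5, 5, 3, 11, 8]

total = sum(frequencies)

# Relative frequency: class frequency divided by total frequency
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the class frequencies
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: cumulative frequency divided by total
cumulative_relative = [c / total for c in cumulative]

for row in zip(classes, frequencies, relative, cumulative, cumulative_relative):
    label, f, rel, cum, cum_rel = row
    print(f"{label}  {f:>2}  {rel:.3f}  {cum:>2}  {cum_rel:.3f}")
```

The last cumulative frequency always equals the total number of observations, and the last cumulative relative frequency is always 1, which makes a quick check on the table.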
Histograms, Frequency Polygons, and Ogives
The histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies of the classes. The class boundaries are represented on the horizontal axis.
Example: Construct a histogram to represent the data for the record high temperatures for each of the 50 states (see the table).
Class limit   Frequency
100-104        2
105-109        8
110-114       18
115-119       13
120-124        7
125-129        1
130-134        1
The frequency polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the class midpoints. The frequencies are represented by the heights of the points.
Example: Construct a frequency polygon to represent the data for the record high temperatures for each of the 50 states (see the table of class limits and frequencies in the histogram example).
Frequency polygons use class midpoints and frequencies of the classes.
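The points of the frequency polygon can be computed before any plotting. This sketch takes each midpoint as the average of the lower and upper class limits from the record-high-temperature table; the variable names are illustrative.

```python
# Class limits and frequencies from the record-high-temperature table
class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Midpoint of each class: average of its lower and upper limits
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Each polygon vertex is (class midpoint, class frequency)
polygon_points = list(zip(midpoints, frequencies))

for midpoint, frequency in polygon_points:
    print(f"{midpoint:5.0f}  {frequency}")
```

Plotting these pairs and joining them with straight lines gives the frequency polygon.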
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution. The upper class boundaries are represented on the horizontal axis.
Example: Construct the ogive to represent the data for the record high temperatures for each of the 50 states (see the table of class limits and frequencies in the histogram example).
[Figure: ogive; vertical axis: cumulative frequency]
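The points of the ogive can be tabulated as follows. The slides give only class limits, so this sketch assumes the usual convention for whole-degree data that each class boundary lies 0.5 beyond the class limit (99.5, 104.5, and so on). It also assumes the ogive starts at zero at the first lower boundary.

```python
from itertools import accumulate

class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Assumed convention: boundaries sit 0.5 beyond the stated class limits
upper_boundaries = [upper + 0.5 for _, upper in class_limits]
first_lower_boundary = class_limits[0][0] - 0.5

# Cumulative frequency below each upper boundary
cumulative = list(accumulate(frequencies))

# Ogive vertices, starting from zero at the first lower boundary
ogive_points = [(first_lower_boundary, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, count in ogive_points:
    print(f"less than {boundary:5.1f}: {count}")
```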
Descriptive Statistics: 2 variables
Categorical Data: graphical representation
- Contingency Table
- Side-by-Side Bar Chart
Numerical Data: looking for relationships in bivariate data
- Scatter Plot
- Correlation
- The Regression Line
The Contingency Table
Two categorical variables are most easily displayed in a contingency table. This is a table of two-way frequencies.
Example: Who would you vote for in the next election?
                       Male   Female   Total
Republican Candidate    250      250     500
Democrat Candidate      150      350     500
Total                   400      600    1000
This also works for two-way percentages.
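The percentage version of the table did not survive in this transcript. As a sketch of one common reading, the code below expresses each cell as a percentage of the grand total of 1000 respondents; row or column percentages would be equally valid choices, and the dictionary layout is only illustrative.

```python
# Two-way frequencies from the voting example: counts[candidate][gender]
counts = {
    "Republican": {"Male": 250, "Female": 250},
    "Democrat": {"Male": 150, "Female": 350},
}

genders = ["Male", "Female"]

# Marginal totals
row_totals = {cand: sum(cells.values()) for cand, cells in counts.items()}
col_totals = {g: sum(counts[cand][g] for cand in counts) for g in genders}
grand_total = sum(row_totals.values())

# Assumption: two-way percentages are taken relative to the grand total
percentages = {
    cand: {g: 100 * counts[cand][g] / grand_total for g in genders}
    for cand in counts
}

for cand in counts:
    cells = "  ".join(f"{g}: {percentages[cand][g]:.0f}%" for g in genders)
    print(f"{cand:<10}  {cells}  Total: {100 * row_totals[cand] / grand_total:.0f}%")
```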
The Side-by-Side Bar Chart
[Chart: Relative Performance (Source: Microsoft.com)]
Scatter Plots and Correlation
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y. A scatter plot is used to determine the nature of the relationship that exists between the two variables.
[Figure: scatter plot axes, with x on the horizontal axis and y on the vertical axis]
Analyzing the Scatter Plot 1. A positive linear relationship exists when the points fall approximately in an ascending straight line from left to right and both the x and y values increase at the same time. 2. A negative linear relationship exists when the points fall approximately in a descending straight line from left to right.
Analyzing the Scatter Plot 3. A nonlinear relationship exists when the points fall in a curved line. 4. It is said that no relationship exists when there is no discernable pattern of the points.
Example
Two numerical variables: Grade and Height (in inches)
Y (Grade)   X (Height)
100         73
 95         79
 90         62
 80         69
 70         74
 65         77
 60         81
 40         63
 30         68
 20         74
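The transcript gives no numerical result for this example. As a sketch of how the strength of a linear relationship could be quantified, the code below computes the Pearson correlation coefficient r for the ten (height, grade) pairs from its definition; the function name `pearson_r` is illustrative.

```python
import math

# Ordered pairs from the table: X = height in inches, Y = grade
heights = [73, 79, 62, 69, 74, 77, 81, 63, 68, 74]
grades = [100, 95, 90, 80, 70, 65, 60, 40, 30, 20]


def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(heights, grades)
print(f"r = {r:.2f}")  # about 0.12 for this data
```

A value this close to zero is consistent with a scatter plot that shows no discernible linear pattern between height and grade.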