Descriptive Statistics in Data Analysis: Insights from Professor William Greene


Delve into the world of descriptive statistics with Professor William Greene from the Stern School of Business. Explore topics such as breach rates in mortgages, forensic analysis of loans, population vs. samples, random sampling techniques, and more.

  • Data Analysis
  • Descriptive Statistics
  • Population
  • Sampling
  • Professor


Presentation Transcript


  1. Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics 1/54 2: Descriptive Statistics

  2. Statistics and Data Analysis, Part 2: Descriptive Statistics. Summarizing data with useful statistics.

  3. Use random samples and basic descriptive statistics. What is the breach rate in a pool of tens of thousands of mortgages? (Breach = improperly underwritten, improperly serviced, or otherwise faulty mortgage.)

  4. The forensic analysis was an examination of statistics from a random sample of 1,500 loans.

  5. Descriptive Statistics Agenda
     • Populations and Random Samples
     • Descriptive Statistics for a Variable
       • Measures of location: mean, median, mode
       • Measure of dispersion: standard deviation
     • Measuring Correlation of Two Variables
       • Understanding correlation
       • Measuring correlation
       • Scatter plots and regression

  6. Populations and Samples
     • Population: the collection of all possible observations (data points) on a variable.
     • Sample: a subset of the data points in the population.
     • Random sample: defined by the way the sample data are obtained. All points in the population are equally likely to be drawn in any particular sample.
     What is the purpose of obtaining a sample? To describe or learn about the population. The sample is observed; the population is assumed. In order to learn confidently about the population from a sample, the sample must be random.
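A minimal sketch of drawing a simple random sample and using it to describe a population, in the spirit of the mortgage breach-rate question. The population here is hypothetical: the pool size of 50,000 and the 20% breach rate are assumptions made up for illustration, not figures from the slides. Only the sample size of 1,500 echoes the forensic analysis mentioned earlier.

```python
import random

# Hypothetical population (assumed numbers): 50,000 mortgages,
# of which 10,000 (20%) are in breach. 1 = breach, 0 = no breach.
population = [1] * 10_000 + [0] * 40_000
true_breach_rate = sum(population) / len(population)

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Simple random sample without replacement: every loan is equally
# likely to be drawn.
sample = rng.sample(population, 1_500)
sample_breach_rate = sum(sample) / len(sample)

print(f"population breach rate (normally unknown): {true_breach_rate:.3f}")
print(f"sample breach rate (what we observe):      {sample_breach_rate:.3f}")
```

In practice only the sample rate is observed; the population rate is printed here only because the population was constructed by hand.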

  7. Random Sampling
     A production process produces circuit boards. When the process is in control, the boards produced in each hour have an average of 2 defects per board. Each hour, the engineer examines a random sample of 100 circuit boards. The average number of defects per board (the mean of 100 boards) in each hour of a particular 30-hour week is
     Hour 1: 1.95, Hour 2: 2.65, Hour 3: 1.80, ..., Hour 30: 2.35.
     (These are estimates of the defect rate per board.)
     The objective of drawing the sample is to determine whether the process is in control or not. (The process is under control if the defect rate is < 2.) Method: assuming the process is in control, would we expect to see this rate of defects?

  8. Random samples of behavior are difficult to obtain, especially by telephone.

  9. Nonrandom Samples: Nonrandom samples produce tainted, sometimes unbelievable results. They are biased with respect to the population and may describe a specific subset of the population that is not useful.

  10. (Non)Randomness of Samples
      Sources of bias in samples (generally related):
      • Bad sample design, e.g., home phone surveys conducted during working hours
      • Survey (non)response bias, e.g., opinion surveys about service quality
      • Participation bias, e.g., voluntary participation in a survey
      • Self-selection: volunteering for a trial or an opinion sample (Shere Hite's cultural revolution)
      • Attrition bias from clinical trials, e.g., if the drug works, the subject does not come back

  11. Nonrandom results in incubator funds. The NYU No Action Letter

  12. Nonscientific, Nonrandom (non)Sampling. A Cultural Revolution: 3000 women, ages 14 to 78, describe in their own words

  13. http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692 A Cultural Revolution: 3000 women, ages 14 to 78, describe in their own words

  14. http://en.wikipedia.org/wiki/Shere_Hite

  15. The Lesson: Having a really big sample does not assure you of an accurate result. It may assure you of a really solid, really bad (inaccurate) result.
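The lesson can be illustrated with a small simulation. This is only a sketch under assumed numbers: the population of 100,000 customers, the 30% dissatisfaction rate, and the response probabilities are all hypothetical. It compares a small random sample with a much larger self-selected one.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 customers, 30% dissatisfied (coded 1).
population = [1] * 30_000 + [0] * 70_000
true_rate = sum(population) / len(population)

# Small but random sample: every customer is equally likely to be drawn.
random_sample = rng.sample(population, 200)
random_estimate = sum(random_sample) / len(random_sample)

# Large but self-selected sample (assumed response behavior): dissatisfied
# customers respond with probability 0.5, satisfied ones with 0.05.
respond_prob = {1: 0.5, 0: 0.05}
volunteers = [x for x in population if rng.random() < respond_prob[x]]
volunteer_estimate = sum(volunteers) / len(volunteers)

print(f"true rate:                      {true_rate:.3f}")
print(f"random sample, n = {len(random_sample)}:         {random_estimate:.3f}")
print(f"self-selected sample, n = {len(volunteers)}: {volunteer_estimate:.3f}")
```

The self-selected sample is far larger, yet its estimate sits far from the true rate because the people who respond are not representative of the population.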

  16. How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and publishers? The following relates to terrestrial radio, which, as a group, pays a lump sum into the pool, which is then allocated by the PRSs. http://old.cni.org/docs/ima.ip-workshop/Massarsky.html

  17. A Descriptive Statistic Is ? Describes what?
      • The sample data
      • The population that the data came from

  18. Measures of Location
      These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?
      1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
      2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
      1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
      Location and central tendency:
      • There exists a distribution of values.
      • We are interested in the center of the distribution.
      • Two measures are the sample mean and the sample median.
      • They look similar, and measure the same thing.
      • They differ systematically (and predictably) when the data are not symmetric.

  19. The Sample Mean
      These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?
      1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
      2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
      1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
      There are N observations (data points) in the sample. In this sample, N = 30.
      Sample data: $[y_1, y_2, y_3, y_4, \ldots, y_N]$
      The sample mean is
      $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\,[y_1 + y_2 + y_3 + y_4 + \cdots + y_N]$
      $\bar{y} = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767$
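As a check on the arithmetic, here is a short sketch that computes the sample mean of the 30 defect values directly from the definition. The variable names are illustrative choices.

```python
# The 30 hourly average-defect values from the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

n = len(defects)         # number of observations, N
total = sum(defects)     # y_1 + y_2 + ... + y_N
sample_mean = total / n  # (1/N) * sum of the y_i

print(f"N = {n}, sum = {total:.2f}, mean = {sample_mean:.4f}")
```

The sum and mean agree with the slide's 56.30 and 1.8767 after rounding.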

  20. It may be necessary to weight aggregate data. Average Home Listings:
      $\overline{\text{Listing}} = \frac{1}{51}(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687$

  21. Averaging Averages?
      Hawaii's average listing = $896,800; Hawaii's population = 1,275,194
      Illinois's average listing = $377,683; Illinois's population = 12,763,371
      Illinois and Hawaii each get weight 1/51 ≈ 0.0196 when the mean is computed. Looks like Hawaii is getting too much influence.

  22. A Properly Weighted Average
      Simple average: $\overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}$, with $\text{Weight}_{\text{State}} = \frac{1}{51} \approx 0.0196$.
      Illinois is 10 times as big as Hawaii. Suppose we use weights that are in proportion to the state's population. (The weights sum to 1.0.) The weight varies from 0.001717 for Wyoming to 0.121899 for California.
      The new average is 409,234 compared to 369,687 without weights, an error of 11%.
      Sometimes an unequal weighting of the observations is necessary.
      State populations from http://www.factmonster.com/ipka/A0004986.html
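A minimal sketch of the population-weighted average. Only the Hawaii and Illinois figures appear in the transcript, so this example uses just those two states. It therefore does not reproduce the 51-state result of 409,234; it only shows the mechanics of the weighting.

```python
# Average listings and populations for the two states given on the slides.
listing = {"Hawaii": 896_800, "Illinois": 377_683}
population = {"Hawaii": 1_275_194, "Illinois": 12_763_371}

# Simple average: every state gets the same weight.
simple_average = sum(listing.values()) / len(listing)

# Population weights: proportional to population, summing to 1.0.
total_population = sum(population.values())
weights = {state: population[state] / total_population for state in listing}

# Weighted average: sum over states of weight * listing.
weighted_average = sum(weights[state] * listing[state] for state in listing)

print(f"weights:          {weights}")
print(f"simple average:   {simple_average:,.0f}")
print(f"weighted average: {weighted_average:,.0f}")
```

With population weights, Illinois dominates and the weighted average moves well below the simple average, toward the Illinois listing.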

  23. Averaging Trending Time Series Observations Is Usually Not Informative. Note how the mean changes completely depending on what time interval is used to compute it. Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)

  24. The Sample Median
      Median = the middle observation after the data are sorted.
      • Odd number of observations: the central observation. Med[1,2,4,6,8,9,17] = 6
      • Even number of observations: the midpoint between the two central observations. Med[1,2,4,6,8,9,14,17] = (6+8)/2 = 7
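The definition translates directly into code. The following is a sketch: the function name `median` is an illustrative choice, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle observation after sorting; midpoint of the two central
    observations when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: central observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: midpoint

print(median([1, 2, 4, 6, 8, 9, 17]))      # odd count, as on the slide
print(median([1, 2, 4, 6, 8, 9, 14, 17]))  # even count, as on the slide
```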

  25. Sample Median of (Sorted) Defects Data
      1.05 1.30 1.40 1.45 1.45 1.50 1.55 1.60 1.60 1.65
      1.65 1.70 1.70 1.70 1.70 1.90 1.90 1.95 2.05 2.05
      2.05 2.20 2.25 2.30 2.30 2.35 2.35 2.35 2.60 2.70
      Median = 1.8000, Mean = 1.8767
      [Figure: Histogram of DEFECTS (frequency vs. defects).]

  26. (Let's deduce estimates of the mean and median from the histogram.) Tomorrow I will compute the average number of defectives for a 61st day. What is a good guess of the number I will find?

  27. Skewed Earnings Distribution: Mean vs. Median in Skewed Data. Monthly earnings: N = 595, median = 800, mean = 883. These data are skewed to the right. The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)

  28. Extreme Observations Distort Means but Not Medians
      Outlying observations distort the mean:
      Med [1,2,4,6,8,9,17] = 6
      Mean[1,2,4,6,8,9,17] = 6.714
      Med [1,2,4,6,8,9,17000] = 6 (still)
      Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
      This typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.
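The comparison is easy to reproduce with Python's standard `statistics` module. This is a small sketch using the two lists from the slide.

```python
from statistics import mean, median

clean = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]  # same data, one extreme value

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    # The median depends only on the middle of the sorted data;
    # the mean uses every value, so one outlier can move it a long way.
    print(f"{label:13s} median = {median(data)}, mean = {mean(data):.3f}")
```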

  29. 29/54 2: Descriptive Statistics

  30. The mean does not give information about the shape of the distribution. Two problems with the computations: (1) the data are ratings, not quantitative; (2) the mean does not suggest the extreme nature of the data.

  31. The problem with the mean or median as a description of a sample: more information is usually needed. Both data sets have a mean of about 100.

  32. Dispersion of the Observations
      These are 30 hours of average defect data on sets of circuit boards.
      1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
      2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
      1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
      We quantify the variation of the values around the mean. Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job.
      [Figure: Histogram of Defects (frequency vs. defects).]

  33. The Problem with the Range as a Measure of Dispersion: These two data sets both have 1,000 observations that range from about 10 to about 180.

  34. A Measure of Dispersion
      The standard deviation is the interesting value. You need to compute the variance to get the standard deviation.
      Variance: $s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$
      Standard deviation: $s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2}$
      Note the units of measurement. The standard deviation has the same units as the mean. The standard deviation is the standard measure for the dispersion (spread) of a set of values (sample of observations).

  35. The variance is the average squared deviation of the sample values from the mean. Why is N-1 in the denominator of s²?
      • Everyone else does it
      • Minitab does it
      • I have totally no idea.
      • Tendency of the variance to be too small when computed using 1/N when the sample size, N, is itself small. (When N is large, it won't matter.)
      See HOG, p. 37
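The tendency of the 1/N version to come out too small can be seen in a short simulation. This is a sketch under assumptions of my own choosing: samples of size N = 5 are drawn from a normal population whose true variance is 1, and the two versions of the variance are averaged over many repetitions.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
N = 5                   # small sample size (assumed for illustration)
REPS = 20_000           # number of simulated samples

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(REPS):
    sample = [rng.gauss(0.0, 1.0) for _ in range(N)]  # true variance = 1
    m = sum(sample) / N
    ss = sum((y - m) ** 2 for y in sample)  # sum of squared deviations
    sum_div_n += ss / N
    sum_div_n_minus_1 += ss / (N - 1)

avg_div_n = sum_div_n / REPS
avg_div_n_minus_1 = sum_div_n_minus_1 / REPS
print(f"average variance using 1/N:     {avg_div_n:.3f}")
print(f"average variance using 1/(N-1): {avg_div_n_minus_1:.3f}")
```

With N = 5 the 1/N version averages about (N-1)/N = 0.8 of the true variance, while the 1/(N-1) version averages close to 1. As N grows, the two become nearly identical.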

  36. Computing a Standard Deviation

      Y      Deviation from mean   Squared deviation
      1            -2.1                  4.41
      4             0.9                  0.81
      6             2.9                  8.41
      0            -3.1                  9.61
      3            -0.1                  0.01
      2            -1.1                  1.21
      6             2.9                  8.41
      4             0.9                  0.81
      4             0.9                  0.81
      1            -2.1                  4.41
      Sum = 31      0.0                 38.90

      Mean = 31/10 = 3.1
      Sum of squared deviations = 38.90
      Variance = 38.90/(10-1) = 4.322
      Standard deviation = 2.079
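The same steps can be followed in code. This sketch mirrors the columns of the worked example (deviation from the mean, squared deviation) using the ten Y values from the slide.

```python
import math

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]

n = len(y)
mean_y = sum(y) / n                            # 31 / 10
deviations = [v - mean_y for v in y]           # deviation from the mean
squared = [d ** 2 for d in deviations]         # squared deviation
sum_sq = sum(squared)                          # sum of squared deviations
variance = sum_sq / (n - 1)                    # divide by N - 1
std_dev = math.sqrt(variance)

print(f"mean = {mean_y:.1f}, sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```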

  37. Standard Deviation
      These are 30 hours of average defect data on sets of circuit boards.
      1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
      2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
      1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
      Variance: $s_y^2 = \frac{1}{30-1}\sum_{i=1}^{30}(Y_i - 1.8767)^2 = \frac{1}{30-1}\,4.808667 = 0.165816$
      Standard deviation: $s_y = \sqrt{\frac{1}{30-1}\sum_{i=1}^{30}(Y_i - 1.8767)^2} = 0.407205$

  38. Distribution of Values [Figure: Histogram of Defects (frequency vs. defects).]

  39. Reliable Rules of Thumb (for data with a roughly bell-shaped distribution)
      • Almost always, about two-thirds (roughly 68%) of the observations in a sample will lie in the range [mean - 1 s.d. to mean + 1 s.d.]
      • Almost always, about 95% of the observations in a sample will lie in the range [mean - 2 s.d. to mean + 2 s.d.]
      • Almost always, about 99.7% of the observations in a sample will lie in the range [mean - 3 s.d. to mean + 3 s.d.]
      When these rules are not met, they will almost be met. Data nearly always act this way.

  40. A Reliable Empirical Rule
      Mean ± 2s = 1.8767 ± 2(0.4072) = 1.06 to 2.69, which includes 28/30 = 93% of the observations.
      Mean ± 1s = 1.47 to 2.28, which includes 18/30 = 60% of the observations.
      [Figure: Dotplot of Defects.] Minitab: Graph > Dotplot
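The counts can be verified directly. This sketch computes the share of the 30 defect values that fall within k standard deviations of the mean; the helper name `share_within` is an illustrative choice.

```python
from statistics import mean, stdev

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

m = mean(defects)
s = stdev(defects)  # sample standard deviation (N - 1 in the denominator)

def share_within(k):
    """Fraction of observations in [mean - k*s, mean + k*s]."""
    low, high = m - k * s, m + k * s
    return sum(low <= v <= high for v in defects) / len(defects)

for k in (1, 2, 3):
    print(f"mean ± {k}s: {m - k * s:.2f} to {m + k * s:.2f}, "
          f"share = {share_within(k):.0%}")
```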

  41. Rules For Transformations
      Mean of a + bY = $a + b\,\bar{y}$
      Standard deviation of a + bY = $|b|\,s_y$

  42. Which city is warmer, New York (USA) or Old York (England)? Which is more variable?
      Average temperatures, (high + low)/2

      Month   NY (°F)   OY (°C)
      Jan      29.5       2.0
      Feb      32.0       2.0
      Mar      35.0       4.5
      Apr      50.0       8.5
      May      60.5       9.5
      Jun      70.0      13.0
      Jul      75.5      15.5
      Aug      73.5      15.0
      Sep      66.0      13.0
      Oct      55.0       9.5
      Nov      45.0       6.0
      Dec      35.0       3.5

      City       Mean    Std.Dev.   Min     Max
      Old York   8.500    4.913     2.000   15.50
      New York   52.25   16.93      29.50   75.50
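To compare the two cities, the Old York statistics need to be in the same units as the New York ones. A sketch using the transformation rules with the standard conversion °F = 32 + 1.8 × °C, so a = 32 and b = 1.8. It starts from the summary statistics on the slide.

```python
# Summary statistics from the slide.
old_york_mean_c, old_york_sd_c = 8.500, 4.913  # degrees Celsius
new_york_mean_f, new_york_sd_f = 52.25, 16.93  # degrees Fahrenheit

# Fahrenheit = a + b * Celsius
a, b = 32.0, 1.8

# Rules for transformations:
#   mean of a + bY               = a + b * mean(Y)
#   standard deviation of a + bY = |b| * sd(Y)
old_york_mean_f = a + b * old_york_mean_c
old_york_sd_f = abs(b) * old_york_sd_c

print(f"Old York: mean = {old_york_mean_f:.2f} F, s.d. = {old_york_sd_f:.2f} F")
print(f"New York: mean = {new_york_mean_f:.2f} F, s.d. = {new_york_sd_f:.2f} F")
```

Once both are in Fahrenheit, New York has both the higher mean and the larger standard deviation in these data.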

  43. Application: Cost of Defects
      These are 30 observations of average defect data on sets of manufactured circuit boards.
      1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
      2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
      1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
      Suppose the cost to repair defects is $25 + $10 × Defects, i.e., a $25 setup cost plus $10 per defect.
      Mean defects = 1.8767; standard deviation = 0.407205
      Mean cost = $25 + $10(1.8767) = $43.767
      Standard deviation of cost = $10(0.407205) = $4.07205
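The shortcut can be checked against a brute-force calculation. This sketch transforms every observation into a cost first, then compares the resulting mean and standard deviation with the values given by the transformation rules.

```python
from statistics import mean, stdev

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

SETUP_COST = 25.0       # a in cost = a + b * defects
COST_PER_DEFECT = 10.0  # b

# Brute force: compute the cost for every observation, then summarize.
costs = [SETUP_COST + COST_PER_DEFECT * d for d in defects]
mean_cost_direct = mean(costs)
sd_cost_direct = stdev(costs)

# Shortcut: apply the transformation rules to the defect statistics.
mean_cost_rule = SETUP_COST + COST_PER_DEFECT * mean(defects)
sd_cost_rule = abs(COST_PER_DEFECT) * stdev(defects)

print(f"mean cost: direct = {mean_cost_direct:.3f}, rule = {mean_cost_rule:.3f}")
print(f"s.d. cost: direct = {sd_cost_direct:.5f}, rule = {sd_cost_rule:.5f}")
```

Note that the setup cost shifts the mean but has no effect on the standard deviation.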

  44. Correlation
      • Variables Y and X vary together.
      • Causality vs. correlation: Does movement in X cause movement in Y in some metaphysical sense?
      • Correlation:
        • Simultaneous movement through a statistical relationship
        • Simultaneous variation induced by the variation of a common third effect

  45. Samples of House Listings and Per Capita Incomes at a Particular Time

  46. Scatter Plot Suggests Positive Correlation. [Figure: Scatterplot of Listing vs IncomePC.]

  47. Regression Measures Correlation. Regression line: Listing = a + b IncomePC. [Figure: Scatterplot of Listing vs IncomePC with the regression line.]

  48. Correlation Is Not Causation. Price and income seem to be positively related. The U.S. gasoline market: data are yearly from 1953 to 2004. [Figure: Scatterplot of Income vs GasPrice, per capita income plotted against the gasoline price index.]

  49. The Hidden (Spurious) Relationship. Not positively related to each other; both positively related to time. [Figures: Scatterplot of Income vs Year; Scatterplot of GasPrice vs Year.]

  50. Correlation is the interesting number. We must compute the covariance and the two standard deviations first.
      Standard deviations: $s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2}$, $s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2}$
      Covariance: $s_{XY} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$
      Correlation: $r_{XY} = \frac{s_{XY}}{s_X\,s_Y}$, with $-1 \le r_{XY} \le +1$. Units free: a pure number.
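A sketch of the computation, following the three steps in order (standard deviations, covariance, correlation). The listing and income data are not in the transcript, so the example reuses the monthly New York (°F) and Old York (°C) temperatures from the earlier slide. The function names are illustrative. It also shows that the correlation is unit free: converting Old York to Fahrenheit leaves it unchanged.

```python
import math

# Monthly average temperatures, January through December.
new_york_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
old_york_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

def sample_sd(values):
    """Sample standard deviation with N - 1 in the denominator."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

def sample_cov(x, y):
    """Sample covariance with N - 1 in the denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """r = covariance divided by the product of the standard deviations."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

r_mixed_units = correlation(new_york_f, old_york_c)

# Changing units (Celsius to Fahrenheit) rescales the covariance and the
# standard deviation by the same factor, so r does not change.
old_york_f = [32.0 + 1.8 * c for c in old_york_c]
r_same_units = correlation(new_york_f, old_york_f)

print(f"r (NY in F, OY in C): {r_mixed_units:.4f}")
print(f"r (both in F):        {r_same_units:.4f}")
```

The two cities' temperatures move together through the seasons, which is an instance of simultaneous variation induced by a common third effect (time of year) rather than one city's weather causing the other's.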
