
Understanding Random Variables in Statistics and Data Analysis
Explore the concept of random variables in statistics, including types of random variables such as discrete and binary, with real-world examples like credit card approval and default rates. Learn how random variables play a crucial role in modeling and analyzing data.
Presentation Transcript
Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics 1/35 Part 5: Random Variables
Statistics and Data Analysis, Part 5: Random Variables
Random Variable
Using random variables to organize the information about a random occurrence.
Random variable: A variable that will take a value assigned to it by the outcome of a random experiment.
Realization of a random variable: The outcome of the experiment after it occurs. The value that is assigned to the random variable is the realization.
X = the variable, x = the outcome
Types of Random Variables
Discrete: Takes integer values.
  Binary: Will an individual default (X=1) or not (X=0)?
  How many messages arrive at a switch (customers at a service point) per unit of time?
  Finite: How many female children in families with 4 children? Values = 0, 1, 2, 3, 4.
  Infinite: How many people will catch a certain disease per year in a given population? Values = 0, 1, 2, 3, … (How can the number be infinite? It is a model.)
Continuous: A measurement.
  How long will a light bulb last? Values X = 0 to ∞.
  Performance of financial assets over time.
  How do we describe the distribution of biological measurements?
  Measures of intellectual performance.
Modeling Fair Isaacs: A Binary Random Variable
(Real) sample of applicants for a credit card.
Experiment = one randomly picked application.
Let X = 0 if rejected.
Let X = 1 if accepted.
X is DISCRETE (binary). This is called a Bernoulli random variable.
The outcome is random from the credit card vendor's point of view. Fair Isaacs uses a formula. Given the information on the application, the outcome is not random to Fair Isaacs. It is random to the vendor because they do not know the formula.
The Random Variable Lenders Are Really Interested In Is Default
Of 10,499 people whose application was accepted, 996 (9.49%) defaulted on their credit account (loan).
We let X denote the behavior of a credit card recipient.
X = 0 if no default
X = 1 if default
This is a crucial variable for a lender. They spend endless resources trying to learn more about it.
Distribution Over a Count
Of 13,444 applications, 2,561 had at least one derogatory report in the previous 12 months.
Let X = the number of reports for individuals who have at least 1. X = 1, 2, …, >10.
X is a discrete random variable.
(There are also about 9,500 individuals in this data set who had X=0.)
Discrete Qualitative Random Variable
Response (0 to 10) to the question: How satisfied are you with your health right now?
Experiment = the response of an individual drawn at random.
Let X = their response to the question. X = 0, 1, …, 10.
This is a DISCRETE random variable, but it is not a count.
Do women answer systematically differently from men?
Continuous Variable: Light Bulb Lifetimes
Probability for a specific value is 0. Probabilities are defined over intervals, such as P(1000 < Lifetime < 2500). Needs calculus.
Lightbulb Lifetimes
Distribution of T = the lifetime of the bulb.
10,000 hours? Philips DuraMax Long Life: lasts 1 year, life 1000 hours. Exactly?
Probability for a specific value is 0. Probabilities are defined over intervals, such as P(200 < Lifetime < 250). Needs calculus.
Probability Distribution
Range of the random variable = the set of values it can take.
  Discrete: A set of integers. May be finite or infinite.
  Continuous: A range of values.
Probability distribution: Probabilities associated with values in the range.
Bernoulli Random Variable: Probability Distribution
Experiment = a randomly picked application.
Let X = 0 if rejected.
Let X = 1 if accepted.
The range of X is the set {0, 1}.

Outcome   x   P(X=x)
Reject    0   0.5556
Approve   1   0.4444
Probability Distribution Over Derogatory Reports
X = number of derogatory reports

 x    P(X=x)
 1    .5100
 2    .2085
 3    .0953
 4    .0547
 5    .0430
 6    .0226
 7    .0148
 8    .0125
 9    .0109
10    .0277
Notation
Probability distribution = probabilities assigned to outcomes. P(X=x) or P(Y=y) is common.
Probability function = P_X(x). Sometimes called the density function.
Cumulative probability is Prob(X ≤ x) for the specific x.
Cumulative Probability
X = number of derogatory reports

 x    P(X=x)   P(X≤x)
 1    .5100    .5100
 2    .2085    .7185
 3    .0953    .8138
 4    .0547    .8685
 5    .0430    .9115
 6    .0226    .9341
 7    .0148    .9489
 8    .0125    .9614
 9    .0109    .9723
10    .0277   1.0000
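The cumulative column is just a running sum of the point probabilities. A minimal sketch of that bookkeeping, assuming the slide's probabilities are exact; the names `pmf` and `cumulative` are illustrative and not from the lecture:

```python
from itertools import accumulate

# P(X = x) for x = 1..10 derogatory reports, copied from the slide.
pmf = {1: .5100, 2: .2085, 3: .0953, 4: .0547, 5: .0430,
       6: .0226, 7: .0148, 8: .0125, 9: .0109, 10: .0277}

def cumulative(pmf):
    """Return {x: P(X <= x)} by accumulating probabilities in increasing x."""
    xs = sorted(pmf)
    running = accumulate(pmf[x] for x in xs)
    return {x: round(c, 4) for x, c in zip(xs, running)}

cdf = cumulative(pmf)
for x, c in cdf.items():
    print(f"{x:>2}  {pmf[x]:.4f}  {c:.4f}")
```

Rounding to four places only mirrors the precision shown on the slide.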
Rules for Probabilities
1. 0 ≤ P(x) ≤ 1 (valid probabilities)
2. \( \sum_{x = \text{all possible outcomes}} P(x) = 1 \)
3. For different values of x, say A and B, Prob(X=A or X=B) = P(A) + P(B)
Probabilities
X = number of derogatory reports

 x    P(X=x)   P(X≤x)
 1    .5100    .5100
 2    .2085    .7185
 3    .0953    .8138
 4    .0547    .8685
 5    .0430    .9115
 6    .0226    .9341
 7    .0148    .9489
 8    .0125    .9614
 9    .0109    .9723
10    .0277   1.0000

P(a ≤ x ≤ b) = P(a) + P(a+1) + … + P(b)
  E.g., P(5 ≤ Derogs ≤ 8) = .0430 + .0226 + .0148 + .0125 = .0929
P(a ≤ x ≤ b) = P(x ≤ b) - P(x ≤ a-1)
  E.g., P(5 ≤ Derogs ≤ 8) = P(Derogs ≤ 8) - P(Derogs ≤ 4) = .9614 - .8685 = .0929
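Both routes to an interval probability can be checked against each other. The following is a small sketch, assuming integer-valued outcomes as in the derogatory reports table; `prob_between` and `cdf` are hypothetical helper names:

```python
# P(X = x) for x = 1..10 derogatory reports, copied from the slide.
pmf = {1: .5100, 2: .2085, 3: .0953, 4: .0547, 5: .0430,
       6: .0226, 7: .0148, 8: .0125, 9: .0109, 10: .0277}

def prob_between(pmf, a, b):
    """P(a <= X <= b), summing the point probabilities directly."""
    return sum(p for x, p in pmf.items() if a <= x <= b)

def cdf(pmf, x):
    """Cumulative probability P(X <= x)."""
    return sum(p for v, p in pmf.items() if v <= x)

direct = prob_between(pmf, 5, 8)            # P(5) + P(6) + P(7) + P(8)
via_cdf = cdf(pmf, 8) - cdf(pmf, 4)         # P(X <= 8) - P(X <= 4)
print(round(direct, 4), round(via_cdf, 4))
```

Note that the lower term is the cumulative probability at a-1 = 4, not at a = 5; subtracting at 5 would drop P(X=5) from the interval.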
Mean of a Random Variable
Average outcome; outcomes weighted by probabilities (likelihood).
Denoted E[X]: \( E[X] = \sum_{i = \text{all outcomes}} P(X = x_i)\, x_i \)
Typical value. Usually not equal to a value that the random variable actually takes. E.g., the average family size in the U.S. is 1.4 children.
Usually denoted E[X] = μ (mu).
Expected Value
X = Derogs

 x    P(X=x)
 1    .5100
 2    .2085
 3    .0953
 4    .0547
 5    .0430
 6    .0226
 7    .0148
 8    .0125
 9    .0109
10    .0277

E[X] = 1(.5100) + 2(.2085) + 3(.0953) + … + 10(.0277) = 2.3610
μ = 2.361
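The expected value is a probability-weighted sum, which is a one-liner to verify. A minimal sketch using the table above; the helper name `expected_value` is an assumption for illustration:

```python
# P(X = x) for x = 1..10 derogatory reports, copied from the slide.
pmf = {1: .5100, 2: .2085, 3: .0953, 4: .0547, 5: .0430,
       6: .0226, 7: .0148, 8: .0125, 9: .0109, 10: .0277}

def expected_value(pmf):
    """E[X] = sum over outcomes of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())

mu = expected_value(pmf)
print(f"E[X] = {mu:.4f}")
```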
Expected Payoffs are Expected Values of Random Variables
Bet $1 on a number. If it comes up, win $35. If not, lose the $1.
The amount won is the random variable:
  Win = -1,   P(-1) = 37/38
  Win = +35,  P(+35) = 1/38
E[Win] = (-1)(37/38) + (+35)(1/38) = -0.053 = -5.3 cents (familiar).
18 red numbers, 18 black numbers, 2 green numbers (0, 00).
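The expected payoff can be worked in exact fractions to avoid any rounding. This sketch assumes the 38-slot wheel described on the slide (which appears to be American roulette) and a single-number bet; `payoffs` and `expected_win` are illustrative names:

```python
from fractions import Fraction

# Single-number bet on a 38-slot wheel: win $35 with probability 1/38,
# otherwise lose the $1 stake.
payoffs = {-1: Fraction(37, 38), 35: Fraction(1, 38)}

expected_win = sum(amount * prob for amount, prob in payoffs.items())
print(expected_win, float(expected_win))   # exact fraction, then decimal dollars
```

The exact value is -2/38 dollars per $1 bet, which rounds to the -5.3 cents quoted above.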
Buy a Product Warranty?
Should you buy a $20 replacement warranty on a $47.99 appliance? What are the considerations?
Probability of product failure = P (?)
Expected value of the insurance = -$20 + P × $47.99 < 0 if P < 20/47.99.
Expected value of the warranty is negative if P < 0.42.
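The break-even failure probability follows from setting the expected value to zero. A rough sketch, assuming the warranty pays exactly the $47.99 replacement price and ignoring everything else (hassle, time, risk preferences); the function name and the 10% trial value are hypothetical:

```python
WARRANTY_PRICE = 20.00
APPLIANCE_PRICE = 47.99

def warranty_expected_value(p_fail):
    """Expected value of buying the warranty: pay $20, recover $47.99 with probability p_fail."""
    return -WARRANTY_PRICE + p_fail * APPLIANCE_PRICE

break_even = WARRANTY_PRICE / APPLIANCE_PRICE   # failure probability where EV = 0
print(f"break-even P = {break_even:.4f}")
print(f"EV at P = 0.10: {warranty_expected_value(0.10):.2f}")  # 0.10 is an illustrative guess
```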
Median of a Random Variable
The median of X is the value x such that Prob(X ≤ x) = .5.
For a continuous variable, we will find this using calculus.
For a discrete variable, the median M satisfies Prob(X ≤ M) ≥ .5 and Prob(X ≤ M-1) < .5.

Health Satisfaction Sample Proportions

 x    Prob(X=x)   Prob(X≤x)
 0    .0164       .0164
 1    .0093       .0257
 2    .0235       .0492
 3    .0429       .0921
 4    .0509       .1430
 5    .1549       .2979
 6    .0926       .3905
 7    .1548       .5453
 8    .2259       .7712
 9    .1120       .8832
10    .1168      1.0000

Mean = 6.8, Median = 7
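For a discrete distribution, one common convention takes the median to be the smallest value whose cumulative probability reaches .5. The sketch below applies that convention to the health satisfaction proportions; the function name `discrete_median` is an assumption, and other tie-breaking conventions exist:

```python
# Health satisfaction sample proportions, copied from the slide.
pmf = {0: .0164, 1: .0093, 2: .0235, 3: .0429, 4: .0509, 5: .1549,
       6: .0926, 7: .1548, 8: .2259, 9: .1120, 10: .1168}

def discrete_median(pmf):
    """Smallest x with P(X <= x) >= 0.5."""
    running = 0.0
    for x in sorted(pmf):
        running += pmf[x]
        if running >= 0.5:
            return x
    raise ValueError("probabilities sum to less than 0.5")

median = discrete_median(pmf)
mean = sum(x * p for x, p in pmf.items())
print(f"median = {median}, mean = {mean:.1f}")
```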
Measuring the Spread of the Random Outcomes
X = number of derogatory reports

 x    P(X=x)
 1    .5100
 2    .2085
 3    .0953
 4    .0547
 5    .0430
 6    .0226
 7    .0148
 8    .0125
 9    .0109
10    .0277

The range is 1 to 10, but values outside 1 to 5 are rather unlikely. μ = 2.361
Variance
Variance = \( E[(X - \mu)^2] = \sigma^2 \) (sigma squared)
Compute \( \sigma^2 = \sum_{i = \text{all outcomes}} P(X = x_i)\,(x_i - \mu)^2 \)
The square root is usually more useful.
Standard deviation = \( \sigma = \sqrt{\sum_{i = \text{all outcomes}} P(X = x_i)\,(x_i - \mu)^2} \)
Compute \( \sigma^2 = \Big( \sum_{i = \text{all outcomes}} P(X = x_i)\, x_i^2 \Big) - \mu^2 \)
Variance Computation
X = Derogatory Reports. μ = 2.361

 x    P(X=x)   x-μ      (x-μ)²     P(X=x)(x-μ)²
 1    .5100    -1.361    1.85232   0.94468
 2    .2085    -0.361    0.13032   0.02717
 3    .0953     0.639    0.40832   0.03891
 4    .0547     1.639    2.68632   0.14694
 5    .0430     2.639    6.96432   0.29947
 6    .0226     3.639   13.24232   0.29928
 7    .0148     4.639   21.52032   0.31850
 8    .0125     5.639   31.79832   0.39748
 9    .0109     6.639   44.07632   0.48043
10    .0277     7.639   58.35432   1.61641
                             SUM   4.56928

σ² = 4.56928, σ = √4.56928 = 2.13759
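The table can be regenerated column by column, which is also a handy way to catch transcription slips in the squared deviations. A minimal sketch, with `rows`, `variance`, and `sd` as illustrative names:

```python
from math import sqrt

# P(X = x) for x = 1..10 derogatory reports, copied from the slide.
pmf = {1: .5100, 2: .2085, 3: .0953, 4: .0547, 5: .0430,
       6: .0226, 7: .0148, 8: .0125, 9: .0109, 10: .0277}

mu = sum(x * p for x, p in pmf.items())

# One row per outcome: x, P(X=x), deviation, squared deviation, weighted term.
rows = [(x, p, x - mu, (x - mu) ** 2, p * (x - mu) ** 2) for x, p in pmf.items()]
variance = sum(row[4] for row in rows)
sd = sqrt(variance)

for x, p, dev, dev2, term in rows:
    print(f"{x:>2}  {p:.4f}  {dev:7.3f}  {dev2:9.5f}  {term:.5f}")
print(f"variance = {variance:.5f}, sd = {sd:.5f}")
```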
Common Results for Random Variables: Concentration of Probability
For almost any random variable, 2/3 of the probability lies within μ ± 1σ.
For almost any random variable, 95% of the probability lies within μ ± 2σ.
For almost any random variable, more than 99.5% of the probability lies within μ ± 3σ.
What it means: For any random outcome,
  An (observed) outcome more than one σ away from μ is somewhat unusual.
  One that is more than 2σ away is very unusual.
  One that is more than 3σ away from the mean is so unusual that it might be an outlier (a freak outcome).
Outlier?
In the larger credit card data set, there was an individual who had 14 major derogatory reports in the year of observation. Is this within the expected range by the measure of the distribution?
The person's deviation is (14 - 2.361)/2.138 = 5.4 standard deviations above the mean. This person is very far outside the norm.
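The outlier check amounts to expressing the observation in standard deviation units and comparing it with the 3-standard-deviation rule of thumb. A small sketch using the slide's rounded mean and standard deviation; `standardized_deviation` and the threshold constant are hypothetical names:

```python
MU, SIGMA = 2.361, 2.138      # mean and standard deviation from the slides
OUTLIER_THRESHOLD = 3         # rule of thumb: more than 3 standard deviations away

def standardized_deviation(x, mu=MU, sigma=SIGMA):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = standardized_deviation(14)
print(f"z = {z:.1f}, possible outlier: {abs(z) > OUTLIER_THRESHOLD}")
```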
Application: Sharpe Ratio (From your text, pp. 212-213)
Sharpe Ratio = Distance of Return from Risk Free Rate, measured in standard deviations: S = (μ - risk free rate)/σ
Used to compare assets with different means and different standard deviations.
Example: 2002 - 2011. Risk free rate = 0.4%.
  Apple: μ_A = 2.48%, σ_A = 13.8%
  McDonalds: μ_M = 1.28%, σ_M = 6.5%
Which looks better?
  S(A) = (2.48 - 0.40)/13.8 = 0.151
  S(M) = (1.28 - 0.40)/6.5 = 0.135
Apple looks better.
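The comparison is a one-line formula applied to each asset. A minimal sketch using the percentages quoted on the slide; the function name `sharpe_ratio` is illustrative, and all inputs are assumed to be in the same units (percent per period):

```python
RISK_FREE = 0.40   # risk free rate, in percent

def sharpe_ratio(mean_return, sd_return, risk_free=RISK_FREE):
    """Excess return over the risk free rate per unit of standard deviation."""
    return (mean_return - risk_free) / sd_return

s_apple = sharpe_ratio(2.48, 13.8)
s_mcdonalds = sharpe_ratio(1.28, 6.5)
print(f"S(A) = {s_apple:.3f}, S(M) = {s_mcdonalds:.3f}")
```

The two ratios are close, so the ranking depends on the estimated means and standard deviations being reasonably accurate.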
Recall from day 2 of class: Reliable Rules of Thumb
Almost always, 66% of the observations in a sample will lie in the range [mean - 1 s.d., mean + 1 s.d.].
Almost always, 95% of the observations in a sample will lie in the range [mean - 2 s.d., mean + 2 s.d.].
Almost always, 99.5% of the observations in a sample will lie in the range [mean - 3 s.d., mean + 3 s.d.].
A Possibly Useful Shortcut
\( E[(X - \mu)^2] = E[X^2] - \mu^2 = \Big( \sum_{i = \text{all outcomes}} P(X = x_i)\, x_i^2 \Big) - \mu^2 \)
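The slide states the shortcut without proof. One standard way to see it, sketched here using only linearity of expectation and the definition \( \mu = E[X] \), is to expand the square:

```latex
\begin{aligned}
E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
               &= E[X^2] - 2\mu\,E[X] + \mu^2 \\
               &= E[X^2] - 2\mu^2 + \mu^2 \\
               &= E[X^2] - \mu^2 .
\end{aligned}
```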
Application
PartyPlanners plans parties each day, and must order supplies for the events. The number of requests for party plans varies day by day according to
  P(X=0) = .4, P(X=1) = .3, P(X=2) = .25, P(X=3) = .05
How many parties should they expect on a given day?
  E[X] = .4(0) + .3(1) + .25(2) + .05(3) = .95, or about 1.
What are the variance and standard deviation?
  Var[X] = .4(0²) + .3(1²) + .25(2²) + .05(3²) - .95² = .8475
  Standard deviation = √0.8475 = 0.9206
If they plan for 1 party per day, it is rather likely that they will run out of materials since 2 is only 1.1 standard deviations above the mean.
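The PartyPlanners numbers can be reproduced with the shortcut formula, and the "rather likely" remark can be quantified directly from the distribution. A minimal sketch; the variable names are illustrative, and the probability of more than one request is derived here rather than stated on the slide:

```python
from math import sqrt

# Daily requests for party plans, copied from the slide.
pmf = {0: .40, 1: .30, 2: .25, 3: .05}

mean = sum(x * p for x, p in pmf.items())
variance = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2   # shortcut: E[X^2] - mu^2
sd = sqrt(variance)

z_two = (2 - mean) / sd                                   # how unusual are 2 requests?
p_more_than_one = sum(p for x, p in pmf.items() if x > 1)  # chance that planning for 1 falls short

print(f"mean = {mean:.2f}, variance = {variance:.4f}, sd = {sd:.4f}")
print(f"2 requests is {z_two:.1f} sd above the mean; P(X > 1) = {p_more_than_one:.2f}")
```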
Important Algebra
Linear Translation: For the random variable X with mean E[X] = μ, if Y = a + bX, then E[Y] = a + bμ.
Scaling: For the random variable X with standard deviation σ_X, if Y = a + bX, then σ_Y = |b| σ_X.
Example: Repair Costs
The number of repair orders per day at a body shop is distributed by:

Repairs       0     1     2     3     4
Probability   .1    .2    .35   .2    .15

Opening the shop costs $500 for any repairs. Two people each cost $100/repair to do the work.
What are the mean and standard deviation of the number of repair orders?
  μ = 0(.1) + 1(.2) + 2(.35) + 3(.2) + 4(.15) = 2.10
  σ² = 0²(.1) + 1²(.2) + 2²(.35) + 3²(.2) + 4²(.15) - 2.1² = 1.39
  σ = √1.39 = 1.179
What are the mean and standard deviation of the cost per day to run the shop?
  Cost = $500 + $100*(2)*(Number of Repairs)
  Mean = $500 + $100*(2)*(2.1) = $920/day
  Standard deviation = $100*(2)*(1.179) = $235.80/day
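The linear rules E[a + bX] = a + bμ and σ_Y = |b|σ_X can be checked against a brute-force calculation on the transformed distribution. A sketch under the slide's cost model (a = $500 fixed, b = $200 per repair); `mean_sd` is a hypothetical helper:

```python
from math import sqrt

# Repair orders per day, copied from the slide.
repairs = {0: .10, 1: .20, 2: .35, 3: .20, 4: .15}
FIXED_COST = 500          # a: cost of opening the shop
COST_PER_REPAIR = 2 * 100 # b: two people at $100 per repair

def mean_sd(dist):
    """Mean and standard deviation of a discrete distribution {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum(p * (v - mean) ** 2 for v, p in dist.items())
    return mean, sqrt(var)

mu, sigma = mean_sd(repairs)

# Using the algebra rules for Y = a + bX.
mean_cost = FIXED_COST + COST_PER_REPAIR * mu
sd_cost = abs(COST_PER_REPAIR) * sigma

# Direct check: transform every outcome, then recompute from scratch.
cost_dist = {FIXED_COST + COST_PER_REPAIR * n: p for n, p in repairs.items()}
direct_mean, direct_sd = mean_sd(cost_dist)

print(f"rules:  mean = {mean_cost:.2f}, sd = {sd_cost:.2f}")
print(f"direct: mean = {direct_mean:.2f}, sd = {direct_sd:.2f}")
```

Note that the fixed cost shifts the mean but drops out of the standard deviation entirely.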
Summary
Random variables and random outcomes
  Outcome or sample space = range of the random variable
  Types of variables: discrete vs. continuous
Probability distributions
  Probabilities
  Cumulative probabilities
  Rules for probabilities
Moments
  Mean of a random variable
  Standard deviation of a random variable
Application: Expected Profits and Risk
You must decide how many copies of your self-published novel to print. Based on market research, you believe the following distribution describes X, your likely sales (demand).

 x     P(X=x)
 25    .10
 40    .30
 55    .45
 70    .15

(Note: Sales are in thousands. Convert your final result to dollars after all computations are done by multiplying your final results by $1,000.)
Printing costs are $1.25 per book. (It's a small book.) The selling price will be $3.25. Any unsold books that you print must be discarded (at a loss of $2.00/copy). You must decide how many copies of the book to print: 25, 40, 55 or 70. (You are committed to one of these four; 0 is not an option.)
A. What is the expected number of copies demanded?
B. What is the standard deviation of the number of copies demanded?
C. Which of the four print runs shown maximizes your expected profit? Compute all four.
D. Which of the four print runs is least risky, i.e., minimizes the standard deviation of the profit (given the number printed)? Compute all four.
E. Based on C. and D., which of the four print runs seems best for you?
X = Sales (Demand)

 x         P(X=x)
 25,000    .10
 40,000    .30
 55,000    .45
 70,000    .15

A. Expected Value = \( \sum_{\text{all values of } x} x\,P(X=x) \)
   = .1(25,000) + .3(40,000) + .45(55,000) + .15(70,000) = 49,750
B. Standard Deviation
Get the variance first:
\( \sigma^2 = \sum_{\text{all values of } x} (x - E[x])^2\, P(X=x) \)
   = .1(25,000 - 49,750)² + .3(40,000 - 49,750)² + .45(55,000 - 49,750)² + .15(70,000 - 49,750)²
   = 163,687,500
Standard deviation = square root of variance = √163,687,500 = 12,794.041
There is a shortcut:
\( \sum_{\text{all values of } x} (x - E[x])^2\, P(X=x) = \sum_{\text{all values of } x} x^2\, P(X=x) - (E[x])^2 \)
   = .1(25,000²) + .3(40,000²) + .45(55,000²) + .15(70,000²) - 49,750² = 163,687,500
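Parts A and B can be verified numerically, including the agreement between the definition of the variance and the shortcut. A minimal sketch with demand in copies rather than thousands; the variable names are illustrative:

```python
from math import sqrt

# Demand distribution (copies), copied from the slide.
demand = {25_000: .10, 40_000: .30, 55_000: .45, 70_000: .15}

mean = sum(x * p for x, p in demand.items())

# Variance from the definition: probability-weighted squared deviations.
var_definition = sum(p * (x - mean) ** 2 for x, p in demand.items())

# Variance from the shortcut: E[X^2] minus the squared mean.
var_shortcut = sum(p * x ** 2 for x, p in demand.items()) - mean ** 2

sd = sqrt(var_definition)
print(f"mean = {mean:,.0f}")
print(f"variance = {var_definition:,.0f} (definition), {var_shortcut:,.0f} (shortcut)")
print(f"sd = {sd:,.3f}")
```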
Revenue per book = $3.25
Cost per book = $1.25
Profit per book sold = $2.00/book

 x         P(X=x)
 25,000    .10
 40,000    .30
 55,000    .45
 70,000    .15

Expected Profit | Print Run = 25,000 is $2 × 25,000 = $50,000
  (Demand is guaranteed to be at least 25,000)
Expected Profit | Print Run = 40,000 is $2 × .9 × 40,000 + .1 × ($2 × 25,000 - $1.25 × 15,000) = $75,125
  (If print 40,000, .9 chance sell all and .1 chance sell only 25,000)
Expected Profit | Print Run = 55,000 is $2 × .6 × 55,000 + .1 × ($2 × 25,000 - $1.25 × 30,000) + .3 × ($2 × 40,000 - $1.25 × 15,000) = $85,625
Expected Profit | Print Run = 70,000 is $2 × .15 × 70,000 + .10 × ($2 × 25,000 - $1.25 × 45,000) + .30 × ($2 × 40,000 - $1.25 × 30,000) + .45 × ($2 × 55,000 - $1.25 × 15,000) = $74,187.50
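All four expected profits follow from one profit function applied to each demand outcome. The sketch below follows the accounting used in the computations above (each unsold copy costs its $1.25 printing cost, each sold copy nets $2.00); `profit` and `expected_profit` are hypothetical names:

```python
# Demand distribution (copies), copied from the slide.
demand = {25_000: .10, 40_000: .30, 55_000: .45, 70_000: .15}
PRICE, PRINT_COST = 3.25, 1.25   # selling price and printing cost per book

def profit(printed, demanded):
    """Revenue on copies actually sold minus printing cost on every copy printed."""
    sold = min(printed, demanded)
    return PRICE * sold - PRINT_COST * printed

def expected_profit(printed):
    """Probability-weighted profit over all demand outcomes for a given print run."""
    return sum(p * profit(printed, d) for d, p in demand.items())

expected = {run: expected_profit(run) for run in demand}
for run, value in expected.items():
    print(f"print {run:>6,}: expected profit ${value:,.2f}")
```

Writing the profit as revenue minus total printing cost is equivalent to the slide's form of $2.00 per copy sold minus $1.25 per unsold copy.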
[Figure: Expected Profit Given Print Run]
Variances
Print Run = 25,000. Variance = 0. Std. Dev. = 0. Demand will be at least 25,000.
Print Run = 40,000. Variance =
    .1 × [(2 × 25,000 - 1.25 × 15,000) - 75,125]²   (if demand is only 25,000)
  + .9 × [(2 × 40,000) - 75,125]²                    (if demand is at least 40,000)
  Standard Deviation = square root = $14,625
Print Run = 55,000. Variance =
    .1 × [(2 × 25,000 - 1.25 × 30,000) - 85,625]²   (if demand is only 25,000)
  + .3 × [(2 × 40,000 - 1.25 × 15,000) - 85,625]²   (if demand is 40,000)
  + .6 × [(2 × 55,000) - 85,625]²                    (if demand is at least 55,000)
  Standard Deviation = square root = $32,702.49
Print Run = 70,000. Variance =
    .1 × [(2 × 25,000 - 1.25 × 45,000) - 74,187.5]²   (if demand is only 25,000)
  + .3 × [(2 × 40,000 - 1.25 × 30,000) - 74,187.5]²   (if demand is 40,000)
  + .45 × [(2 × 55,000 - 1.25 × 15,000) - 74,187.5]²  (if demand is 55,000)
  + .15 × [(2 × 70,000) - 74,187.5]²                   (if demand is 70,000)
  Standard Deviation = square root = $41,580.64
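The risk side of the decision uses the same profit outcomes, this time measuring their spread around the expected profit. A minimal sketch, assuming the same accounting as the expected profit calculations ($2.00 per copy sold, $1.25 lost per unsold copy); `profit_mean_sd` is an illustrative name:

```python
from math import sqrt

# Demand distribution (copies), copied from the slide.
demand = {25_000: .10, 40_000: .30, 55_000: .45, 70_000: .15}

def profit(printed, demanded):
    """$2.00 margin per copy sold, minus $1.25 for each printed copy left unsold."""
    sold = min(printed, demanded)
    return 2.00 * sold - 1.25 * (printed - sold)

def profit_mean_sd(printed):
    """Expected profit and standard deviation of profit for a given print run."""
    outcomes = [(profit(printed, d), p) for d, p in demand.items()]
    mean = sum(p * v for v, p in outcomes)
    variance = sum(p * (v - mean) ** 2 for v, p in outcomes)
    return mean, sqrt(variance)

results = {run: profit_mean_sd(run) for run in demand}
for run, (mean, sd) in results.items():
    print(f"print {run:>6,}: expected ${mean:>10,.2f}   sd ${sd:>10,.2f}")
```

Laying the two columns side by side makes the trade-off visible: larger print runs raise the spread of profit every time, but raise the expected profit only up to 55,000.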
[Figure: the four print runs compared: Run=25,000, Run=40,000, Run=55,000, Run=70,000]
70,000 is inferior to 40,000: it has a lower expected profit ($74,187.50 vs. $75,125) and a larger standard deviation ($41,580.64 vs. $14,625).
Which of these choices would you prefer: Run=55,000, Run=40,000, or Run=25,000? 25,000 is safe, but an extremely risk averse choice and has far lower expected payoff than 40 or 55.