
Understanding Maximum Likelihood Estimation (MLE)
Learn about Maximum Likelihood Estimation (MLE), a statistical method that finds the parameter values most likely to have produced the observed data. Explore the concept, examples such as the Gaussian case, Fisher information, and the role of Ronald Aylmer Fisher in its development.
Maximum Likelihood Estimation
- MLE problem statement
- MLE examples
- Fisher information & the Cramér–Rao bound
- The Gaussian case
Maximum Likelihood Estimation
Most statistical methods are designed to minimize error: choose the parameter values that minimize the predictive error. Maximum likelihood estimation instead seeks the parameter values that are most likely to have produced the observed data. Introduced by Ronald Fisher (1890–1962).
Ronald Aylmer Fisher (1890–1962)
Known for: 1912: maximum likelihood; 1922: F-test; 1925: analysis of variance (Statistical Methods for Research Workers). Notable prizes: Royal Medal (1938), Copley Medal (1955). This is not how he looked when he introduced MLE. Back then he looked like you!
Basic setup
$\theta$: an unknown parameter. $\Theta$: the set of all possible values of $\theta$. $X$ is a random variable with p.d.f. $f(x\mid\theta)$ parameterized by $\theta$. $X_1,\dots,X_n$: i.i.d. observations of $X$. Joint p.d.f. of the random variables $X_1,\dots,X_n$:
$$f(x_1,\dots,x_n\mid\theta) = f(x_1\mid\theta)\,f(x_2\mid\theta)\cdots f(x_n\mid\theta) = \prod_{i=1}^n f(x_i\mid\theta)$$
Likelihood Function
Once the observations $X_1,\dots,X_n$ have been made, they are fixed values. For different values of $\theta$, the probability that such observations would occur is different. The likelihood function of $\theta$ given the observations $X_1,\dots,X_n$ is defined as:
$$L(\theta\mid x_1,\dots,x_n) = \prod_{i=1}^n f(x_i\mid\theta)$$
Joint p.d.f. vs. Likelihood Function
Identical quantities, different interpretations. The joint p.d.f. of $X_1,\dots,X_n$ is a function of $x_1,\dots,x_n$ for given $\theta$, with a probability interpretation:
$$f(x_1,\dots,x_n\mid\theta) = \prod_{i=1}^n f(x_i\mid\theta), \qquad \int_{-\infty}^{+\infty}\!\cdots\!\int_{-\infty}^{+\infty} f(x_1,\dots,x_n\mid\theta)\,dx_1\cdots dx_n = 1.$$
The likelihood function of $\theta$ is a function of $\theta$ for given $x_1,\dots,x_n$, with no probability interpretation:
$$L(\theta\mid x_1,\dots,x_n) = f(x_1,\dots,x_n\mid\theta) = \prod_{i=1}^n f(x_i\mid\theta).$$
Example: Normal Distribution
Suppose $X_1,\dots,X_n$ is a random sample from a normal distribution with p.d.f.
$$f(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\},$$
with vector parameter $\theta = (\mu,\sigma^2)$. Likelihood function:
$$L(\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x_i-\mu)^2}{2\sigma^2}\Big\} = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\Big\}.$$
MLE Problem Statement
Given the parameter space $\Theta$ and observations $\{X_1,\dots,X_n\} = \{x_1, x_2,\dots,x_n\}$, determine $\hat\theta$ by solving the optimization problem:
$$\max_{\theta\in\Theta} L(\theta\mid x_1,\dots,x_n) = \max_{\theta\in\Theta}\prod_{i=1}^n f(x_i\mid\theta).$$
Example: Bernoulli Distribution
If $x_1,\dots,x_n$ are a set of Bernoulli observations with $f(1\mid p) = p$ and $f(0\mid p) = 1-p$, i.e. $f(x_i\mid p) = p^{x_i}(1-p)^{1-x_i}$, the likelihood function for observations $x_1,\dots,x_n$ is
$$L(p\mid x_1,\dots,x_n) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{n-\sum_i x_i},$$
and the m.l.e. is the value of $p$ that maximizes this. The log-likelihood is
$$\ln L = \Big(\sum_i x_i\Big)\ln p + \Big(n - \sum_i x_i\Big)\ln(1-p).$$
Set
$$\frac{d\ln L}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1-p} = 0$$
and solve for $p$: $\hat p = \frac{1}{n}\sum_i x_i = \bar x$.
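To make this concrete, here is a minimal Python sketch (not from the original slides; the data are simulated and all names are illustrative) that checks the closed-form result $\hat p = \bar x$ against a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)    # simulated Bernoulli sample, true p = 0.3

def neg_log_L(p):
    # -ln L(p) = -[ (sum x_i) ln p + (n - sum x_i) ln(1 - p) ]
    s, n = x.sum(), len(x)
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

res = minimize_scalar(neg_log_L, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                # numerical optimum agrees with p_hat = x_bar
```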
MLE for Vector Parameters
For several unknown parameters $\theta_1,\dots,\theta_k$, the maximum likelihood estimates are found by maximizing the likelihood function
$$L(\theta_1,\dots,\theta_k\mid x_1,\dots,x_n) = \prod_{i=1}^n f(x_i;\theta_1,\dots,\theta_k).$$
Example
The normal distribution $f(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}$ has likelihood
$$L(\mu,\sigma^2\mid x_1,\dots,x_n) = \prod_{i=1}^n f(x_i\mid\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big[-\sum_{i=1}^n (x_i-\mu)^2/2\sigma^2\Big],$$
so that the log-likelihood is
$$\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^2}.$$
Setting the two derivative equations to zero,
$$\frac{\partial\ln L}{\partial\mu} = \frac{\sum_{i=1}^n (x_i-\mu)}{\sigma^2} = 0, \qquad \frac{\partial\ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^4} = 0,$$
gives
$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i = \bar x, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2.$$
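A hedged numerical check of the same result (again simulated data; parameterizing by $\log\sigma^2$ is my choice to keep the variance positive, not something in the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # true mu = 5, sigma^2 = 4

def neg_log_L(theta):
    mu, log_s2 = theta               # work with log(sigma^2) so it stays positive
    s2 = np.exp(log_s2)
    return 0.5 * len(x) * np.log(2 * np.pi * s2) + np.sum((x - mu) ** 2) / (2 * s2)

res = minimize(neg_log_L, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))    # numerical MLE of (mu, sigma^2)
print(x.mean(), x.var(ddof=0))       # closed form: x_bar, (1/n) sum (x_i - x_bar)^2
```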
Asymptotically Unbiased
An estimate is called asymptotically unbiased if it is unbiased as $n\to\infty$. The MLE estimate of $\sigma^2$ is asymptotically unbiased:
$$E(\hat\sigma^2) = E\Big(\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2\Big) = \frac{n-1}{n}\,E(s^2) = \frac{n-1}{n}\,\sigma^2,$$
where $s^2$ is the unbiased sample variance, so
$$\lim_{n\to\infty} E(\hat\sigma^2) = \lim_{n\to\infty}\frac{n-1}{n}\,\sigma^2 = \sigma^2.$$
Example: Defective Circuits on a Wafer
At a semiconductor company, 30 randomly selected wafers are inspected. If the distribution of the number of defective circuits per wafer is assumed to be a Poisson distribution,
$$f(x\mid\lambda) = \frac{e^{-\lambda}\lambda^x}{x!},$$
how should the parameter $\lambda$ be estimated? The method of moments sets $E(X) = \lambda$ equal to $\bar x$, giving $\hat\lambda = \bar x$.
MLE:
$$f(x_i\mid\lambda) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!},$$
so that the likelihood is
$$L(\lambda\mid x_1,\dots,x_n) = \prod_{i=1}^n f(x_i\mid\lambda) = \frac{e^{-n\lambda}\,\lambda^{x_1+\cdots+x_n}}{x_1!\cdots x_n!}.$$
The log-likelihood is therefore
$$\ln L = -n\lambda + (x_1+\cdots+x_n)\ln\lambda - \ln(x_1!\cdots x_n!),$$
so that
$$\frac{d\ln L}{d\lambda} = -n + \frac{x_1+\cdots+x_n}{\lambda} = 0 \quad\Rightarrow\quad \hat\lambda = \frac{x_1+\cdots+x_n}{n} = \bar x.$$
MLE for the Uniform Distribution
For some distributions, the MLE cannot be found by taking derivatives. If $X$ is uniformly distributed on $[0,\theta]$, what is the MLE of $\theta$ when a set of samples $x_1, x_2,\dots,x_n$ has been obtained? Here $L(\theta) = \theta^{-n}$ for $\theta \ge \max_i x_i$ and $0$ otherwise; $L$ is decreasing in $\theta$, so it is maximized at the boundary: $\hat\theta = \max_i x_i$.
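A tiny simulated illustration (my sketch, not from the slides) of this boundary maximizer:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 4.0
x = rng.uniform(0.0, theta_true, size=50)

# L(theta) = theta**(-n) for theta >= max(x_i) and 0 otherwise, so L is
# decreasing in theta and is maximized at the boundary: theta_hat = max(x_i).
theta_hat = x.max()
print(theta_hat)    # always <= theta_true: the MLE is biased low, but consistent
```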
Mean Time to Failure
$n$ chips have been tested and found to have failure times $t_1,\dots,t_n$. Hypothesis: the failure time has an exponential p.d.f. with mean $\tau$:
$$f(t;\tau) = \frac{1}{\tau}\exp(-t/\tau).$$
The log-likelihood is
$$\log L(\tau) = \sum_{i=1}^n \log f(t_i;\tau) = \sum_{i=1}^n\Big(\log\frac{1}{\tau} - \frac{t_i}{\tau}\Big) = -n\log\tau - \frac{1}{\tau}\sum_{i=1}^n t_i.$$
Set
$$\frac{\partial\log L}{\partial\tau} = -\frac{n}{\tau} + \frac{1}{\tau^2}\sum_{i=1}^n t_i = 0 \quad\Rightarrow\quad \hat\tau = \frac{1}{n}\sum_{i=1}^n t_i.$$
This estimator is unbiased:
$$E[\hat\tau(t_1,\dots,t_n)] = \int\!\cdots\!\int \hat\tau(t_1,\dots,t_n)\, f_{\text{joint}}(t_1,\dots,t_n;\tau)\, dt_1\cdots dt_n = \int\!\cdots\!\int \Big(\frac{1}{n}\sum_{i=1}^n t_i\Big) \frac{1}{\tau^n}\, e^{-\sum_j t_j/\tau}\, dt_1\cdots dt_n = \frac{1}{n}\sum_{i=1}^n \int \frac{t_i}{\tau}\, e^{-t_i/\tau}\, dt_i \prod_{j\ne i}\int \frac{1}{\tau}\, e^{-t_j/\tau}\, dt_j = \tau.$$
Mean Time to Failure
[Figure: histogram of 50 simulated failure times; true value $\tau = 1.0$, ML estimate $\hat\tau = 1.062$.]
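A minimal sketch reproducing this setup (simulated data; the exact estimate will differ from the slide's 1.062 run to run):

```python
import numpy as np

rng = np.random.default_rng(3)
tau_true = 1.0
t = rng.exponential(tau_true, size=50)   # 50 simulated failure times

tau_hat = t.mean()                       # MLE: tau_hat = (1/n) sum t_i
print(tau_hat)                           # scatters around 1.0, much as in the slide
```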
Maximum Likelihood
How about using $\lambda = 1/\tau$ as the parameter? For a function $a = a(\tau)$, $\frac{\partial L}{\partial a} = \frac{\partial L}{\partial\tau}\frac{\partial\tau}{\partial a} = 0$ at $\tau = \hat\tau$, so $\hat a = a(\hat\tau)$; here
$$\hat\lambda = \frac{1}{\hat\tau} = \frac{n}{\sum_{i=1}^n t_i}.$$
However,
$$E[\hat\lambda] = \frac{n}{n-1}\,\lambda,$$
so $\hat\lambda$ is an unbiased estimator for $\lambda$ only when $n\to\infty$.
Delta Method for Approximating the Variance of an Estimator
To estimate a nonlinear function $a(\theta)$, suppose that $E(\hat\theta)\approx\theta$, $\operatorname{Var}(\hat\theta)$ is known, and $a$ is a known function of $\theta$. Delta method:
$$a(\hat\theta) \approx a(\theta) + a'(\theta)\,(\hat\theta - \theta),$$
$$\operatorname{Var}\big(a(\hat\theta)\big) \approx \big[a'(\theta)\big]^2\,\operatorname{Var}(\hat\theta).$$
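A Monte Carlo sanity check of the delta method (my sketch, using the exponential example above with $a(\tau) = 1/\tau$; the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
tau_true, n, n_exp = 1.0, 50, 10000

# ML estimates of tau from many simulated experiments, then lambda_hat = 1/tau_hat
tau_hats = rng.exponential(tau_true, size=(n_exp, n)).mean(axis=1)
lam_hats = 1.0 / tau_hats

# Delta method with a(tau) = 1/tau: Var(a(tau_hat)) ~ a'(tau)^2 Var(tau_hat)
#                                                   = (1/tau^2)^2 * (tau^2/n)
delta_var = (1.0 / tau_true**2) ** 2 * (tau_true**2 / n)
print(lam_hats.var(), delta_var)         # the two should agree to a few percent
```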
Generalization to a multi-parameter nonlinear function:
$$a(\hat{\boldsymbol\theta}) \approx a(\boldsymbol\theta) + a'(\boldsymbol\theta)\,(\hat{\boldsymbol\theta} - \boldsymbol\theta),$$
where $\boldsymbol\theta$ is a column vector and $a'(\boldsymbol\theta) = \big(\partial a/\partial\theta_1,\dots,\partial a/\partial\theta_n\big)$ is a row vector. Then
$$E\big[\big(a(\hat{\boldsymbol\theta}) - a(\boldsymbol\theta)\big)\big(a(\hat{\boldsymbol\theta}) - a(\boldsymbol\theta)\big)^T\big] \approx a'(\boldsymbol\theta)\; E\big[(\hat{\boldsymbol\theta}-\boldsymbol\theta)(\hat{\boldsymbol\theta}-\boldsymbol\theta)^T\big]\; a'(\boldsymbol\theta)^T.$$
If the components of $\hat{\boldsymbol\theta}$ are uncorrelated,
$$\operatorname{Var}\big(a(\hat{\boldsymbol\theta})\big) \approx a'(\boldsymbol\theta)\,\operatorname{diag}\big(\operatorname{Var}(\hat\theta_1),\dots,\operatorname{Var}(\hat\theta_n)\big)\,a'(\boldsymbol\theta)^T = \sum_{i=1}^n \Big[\frac{\partial a}{\partial\theta_i}\Big]^2 \operatorname{Var}(\hat\theta_i).$$
Maximum Likelihood: Variance of ML Estimators
Over many experiments (same $n$): what is the spread of $\hat\tau$? Analytically (exponential):
$$V[\hat\tau] = E[\hat\tau^2] - \big(E[\hat\tau]\big)^2 = \int\!\cdots\!\int \Big(\frac{1}{n}\sum_{i=1}^n t_i\Big)^2 \frac{1}{\tau^n}\, e^{-\sum_j t_j/\tau}\, dt_1\cdots dt_n \;-\; \tau^2 = \frac{\tau^2}{n},$$
so $\sigma_{\hat\tau} = \tau/\sqrt{n}$. But this number is unknown, since it depends on $\tau$. Therefore we take $\hat\sigma_{\hat\tau} = \hat\tau/\sqrt{n}$: can we use $\hat\tau$ in place of $\tau$?
Maximum Likelihood Variance: the Monte Carlo Method
For cases too difficult to solve analytically, use the Monte Carlo method: simulate a large number of experiments, compute the ML estimate each time, and look at the distribution of the resulting values. The sample variance $S^2$ is an unbiased estimator for the variance of a p.d.f., so $S$ from the MC experiments gives the statistical error of the parameter estimated from the real measurement.
Maximum Likelihood
[Figure: distribution of $\hat\tau$ from 1000 simulated experiments with 50 observations each; sample standard deviation $s = 0.151$, compared with $\hat\sigma_{\hat\tau} = \hat\tau/\sqrt{n} = 1.062/\sqrt{50} = 0.150$.]
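The same MC experiment is easy to reproduce in a few lines (a sketch under the same assumptions as the slide: true $\tau = 1$, 1000 experiments, 50 observations each):

```python
import numpy as np

rng = np.random.default_rng(5)
tau_true, n_obs, n_exp = 1.0, 50, 1000

# 1000 simulated experiments with 50 observations each; the MLE is the sample mean
tau_hats = rng.exponential(tau_true, size=(n_exp, n_obs)).mean(axis=1)

print(tau_hats.std(ddof=1))              # MC spread of tau_hat, roughly 0.14-0.15
print(tau_true / np.sqrt(n_obs))         # analytic sigma = tau/sqrt(n) = 0.141
```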
Fisher Information:
$$I(\theta) = E\Big[\Big(\frac{d\ln f(X\mid\theta)}{d\theta}\Big)^2\Big] = \int \Big(\frac{d\ln f(x\mid\theta)}{d\theta}\Big)^2 f(x\mid\theta)\,dx.$$
Alternative expression:
$$I(\theta) = -E\Big[\frac{d^2\ln f(X\mid\theta)}{d\theta^2}\Big] = -\int \frac{d^2\ln f(x\mid\theta)}{d\theta^2}\, f(x\mid\theta)\,dx.$$
MLE tries to extract all of this information from the sample.
Note that $I(\theta) = E\big[(d\ln f/d\theta)^2\big] = \operatorname{Var}\big(d\ln f/d\theta\big)$, because the score has zero mean. From
$$\int f(x\mid\theta)\,dx = 1 \quad\Rightarrow\quad \int \frac{df(x\mid\theta)}{d\theta}\,dx = \frac{d}{d\theta}\,1 = 0,$$
and
$$\frac{df(x\mid\theta)}{d\theta} = f(x\mid\theta)\,\frac{1}{f(x\mid\theta)}\frac{df(x\mid\theta)}{d\theta} = f(x\mid\theta)\,\frac{d\ln f(x\mid\theta)}{d\theta},$$
it follows that
$$E\Big[\frac{d\ln f(X\mid\theta)}{d\theta}\Big] = \int \frac{d\ln f(x\mid\theta)}{d\theta}\, f(x\mid\theta)\,dx = 0.$$
To obtain the alternative expression
$$I(\theta) = E\Big[\Big(\frac{d\ln f(X\mid\theta)}{d\theta}\Big)^2\Big] = -E\Big[\frac{d^2\ln f(X\mid\theta)}{d\theta^2}\Big],$$
differentiate $\int \frac{d\ln f(x\mid\theta)}{d\theta}\, f(x\mid\theta)\,dx = 0$ once more:
$$\int \Big[\frac{d^2\ln f(x\mid\theta)}{d\theta^2}\, f(x\mid\theta) + \frac{d\ln f(x\mid\theta)}{d\theta}\,\frac{df(x\mid\theta)}{d\theta}\Big]\,dx = \int \Big[\frac{d^2\ln f(x\mid\theta)}{d\theta^2} + \Big(\frac{d\ln f(x\mid\theta)}{d\theta}\Big)^2\Big]\, f(x\mid\theta)\,dx = 0.$$
For an i.i.d. sample $X_1, X_2,\dots,X_n$ from p.d.f. $f(x\mid\theta)$:
$$I_n(\theta) = -E\Big[\frac{d^2\ln f(X_1, X_2,\dots,X_n\mid\theta)}{d\theta^2}\Big] = -E\Big[\frac{d^2}{d\theta^2}\big(\ln f(X_1\mid\theta) + \ln f(X_2\mid\theta) + \cdots + \ln f(X_n\mid\theta)\big)\Big]$$
$$= -E\Big[\frac{d^2\ln f(X_1\mid\theta)}{d\theta^2}\Big] - E\Big[\frac{d^2\ln f(X_2\mid\theta)}{d\theta^2}\Big] - \cdots - E\Big[\frac{d^2\ln f(X_n\mid\theta)}{d\theta^2}\Big] = I(\theta) + I(\theta) + \cdots + I(\theta) = nI(\theta).$$
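A quick simulated check of this additivity, using the Bernoulli model where $I(p) = 1/\big(p(1-p)\big)$ (my sketch; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
p, n = 0.3, 25

# Score of one Bernoulli observation: d ln f / dp = x/p - (1-x)/(1-p)
x = rng.binomial(1, p, size=(200000, n))
score_one = x / p - (1 - x) / (1 - p)

I_one = score_one.var()                    # ~ I(p) = 1/(p(1-p))
I_n = score_one.sum(axis=1).var()          # ~ n I(p): scores of i.i.d. terms add
print(I_one, 1 / (p * (1 - p)))            # per-observation information
print(I_n, n / (p * (1 - p)))              # additivity: I_n = n I
```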
For a $k$-dimensional vector parameter $\boldsymbol\theta = (\theta_1, \theta_2,\dots,\theta_k)$, with the p.d.f. of the r.v. $X$ written $f(x\mid\boldsymbol\theta)$, the information matrix of $\boldsymbol\theta$, $I(\boldsymbol\theta)$, is given by
$$I(\boldsymbol\theta)_{ij} = E\Big[\frac{\partial\ln f(X\mid\boldsymbol\theta)}{\partial\theta_i}\,\frac{\partial\ln f(X\mid\boldsymbol\theta)}{\partial\theta_j}\Big] = -E\Big[\frac{\partial^2\ln f(X\mid\boldsymbol\theta)}{\partial\theta_i\,\partial\theta_j}\Big].$$
Cramér–Rao Lower Bound
Let $X_1, X_2,\dots,X_n$ be a random sample from p.d.f. $f(x\mid\theta)$, and let $\hat\theta$ be any estimator of $\theta$ with bias $B(\theta) = E(\hat\theta) - \theta$. If $B(\theta)$ is differentiable in $\theta$ and certain regularity conditions hold, then
$$\operatorname{Var}(\hat\theta) \ge \frac{\big(1 + B'(\theta)\big)^2}{nI(\theta)} = \frac{\big(1 + B'(\theta)\big)^2}{-E\big[\partial^2\log L/\partial\theta^2\big]} \qquad\text{(Cramér–Rao inequality).}$$
The ratio of the lower bound to the variance of any estimator of $\theta$ is called the efficiency of the estimator, and is always $\le 1$. An estimator with efficiency 1 is called an efficient estimator; in that case the equality sign holds.
Maximum Likelihood
If efficient estimators exist for a problem, ML will find them, and ML estimators are always efficient in the large-sample limit. Example: exponential (mean time to failure), $f(t;\tau) = \frac{1}{\tau}\,e^{-t/\tau}$:
$$\frac{\partial^2\log L}{\partial\tau^2} = \frac{n}{\tau^2} - \frac{2\sum_{i=1}^n t_i}{\tau^3}, \qquad -E\Big[\frac{\partial^2\log L}{\partial\tau^2}\Big] = -\frac{n}{\tau^2} + \frac{2n\tau}{\tau^3} = \frac{n}{\tau^2}.$$
The MLE is unbiased, so $B = 0$ and
$$\operatorname{Var}(\hat\tau) \ge \frac{(1+0)^2}{n/\tau^2} = \frac{\tau^2}{n},$$
equal to the exact result: $\hat\tau$ is therefore an efficient estimator.
Large Sample Inferences
To make large-sample inferences on a single unknown parameter $\theta$, we need to estimate $\operatorname{Var}(\hat\theta) \approx 1/\big(nI(\theta)\big)$. The quantity $nI(\theta)$ is estimated by
$$n\hat I = \sum_{i=1}^n \Big[\frac{d\ln f(X_i\mid\theta)}{d\theta}\Big]^2_{\theta=\hat\theta}.$$
This estimate does not require evaluation of the expected value. An approximate large-sample $100(1-\alpha)\%$ CI for $\theta$ is
$$\Big[\hat\theta - z_{\alpha/2}\,\frac{1}{\sqrt{n\hat I}},\;\; \hat\theta + z_{\alpha/2}\,\frac{1}{\sqrt{n\hat I}}\Big].$$
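A minimal sketch of this recipe for the exponential example (simulated data; the 95% level and sample size are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
t = rng.exponential(1.0, size=50)        # exponential sample, f(t|tau) = e^(-t/tau)/tau
tau_hat = t.mean()                       # MLE

# Estimate nI(tau) by the sum of squared score contributions at tau_hat:
# d ln f(t|tau)/d tau = -1/tau + t/tau^2
score = -1.0 / tau_hat + t / tau_hat**2
nI_hat = np.sum(score**2)

z = norm.ppf(0.975)                      # z_{alpha/2} for a 95% interval
half = z / np.sqrt(nI_hat)
print(tau_hat - half, tau_hat + half)    # approximate large-sample CI for tau
```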
The regularity condition:
$$E\Big[\frac{\partial\ln f(X\mid\theta)}{\partial\theta}\Big] = \int \frac{\partial\ln f(x\mid\theta)}{\partial\theta}\, f(x\mid\theta)\,dx = 0.$$
The necessary and sufficient condition for achieving the lower bound is that there exist functions $I_n$ and $g$ such that
$$\frac{\partial\ln f(\mathbf x\mid\theta)}{\partial\theta} = I_n(\theta)\,\big(g(\mathbf x) - \theta\big).$$
In this case, the minimum variance unbiased estimator is $\hat\theta = g(\mathbf x)$.
In the Gaussian Case
$$L(\mu,\sigma^2\mid x_1,\dots,x_n) = (2\pi\sigma^2)^{-n/2}\exp\Big[-\sum_{i=1}^n (x_i-\mu)^2/2\sigma^2\Big], \qquad \ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^2}.$$
For $\mu$:
$$\frac{\partial\ln L}{\partial\mu} = \frac{\sum_{i=1}^n (x_i-\mu)}{\sigma^2} = \frac{n}{\sigma^2}\Big(\frac{1}{n}\sum_{i=1}^n x_i - \mu\Big).$$
Right form: $I_n(\mu) = n/\sigma^2$ and $g(\mathbf x) = \bar x$. For $\sigma^2$:
$$\frac{\partial\ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^4} = \frac{n}{2\sigma^4}\Big(\frac{\sum_{i=1}^n (x_i-\mu)^2}{n} - \sigma^2\Big).$$
Wrong form! (With $\mu$ unknown, the would-be $g$ depends on $\mu$ and is not a statistic.)
Graphical Method, Single Parameter
Expand $\log L(\theta)$ about the maximum:
$$\log L(\theta) = \log L(\hat\theta) + \Big[\frac{\partial\log L}{\partial\theta}\Big]_{\theta=\hat\theta}(\theta - \hat\theta) + \frac{1}{2!}\Big[\frac{\partial^2\log L}{\partial\theta^2}\Big]_{\theta=\hat\theta}(\theta - \hat\theta)^2 + \cdots = \log L_{\max} - \frac{(\theta - \hat\theta)^2}{2\hat\sigma_{\hat\theta}^2},$$
since the first derivative vanishes at $\hat\theta$. Therefore
$$\log L(\hat\theta \pm \hat\sigma_{\hat\theta}) = \log L_{\max} - \frac{1}{2},$$
and $[\hat\theta - \hat\sigma_{\hat\theta},\, \hat\theta + \hat\sigma_{\hat\theta}]$ is (as shown later) a 68.3% central confidence interval.
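The graphical rule is easy to apply numerically: scan $\log L$ and keep the region where it is within $1/2$ of its maximum. A sketch for the exponential example (simulated data; the scan range and grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
t = rng.exponential(1.0, size=50)
tau_hat = t.mean()

def log_L(tau):
    # exponential log-likelihood: -n ln(tau) - sum(t_i)/tau
    return -len(t) * np.log(tau) - t.sum() / tau

taus = np.linspace(0.5 * tau_hat, 2.0 * tau_hat, 2001)
inside = taus[log_L(tau_hat) - log_L(taus) <= 0.5]   # log L >= log L_max - 1/2
print(inside.min(), tau_hat, inside.max())           # graphical ~68.3% interval
```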
Example: phase of a sinusoid in noise.
$$x_n = A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) + w_n; \quad n = 1, 2,\dots,N,$$
where $A$, $f_{\mathrm{sig}}$, and $T_s$ are known but $\varphi$ is unknown, and $w_n$ is i.i.d. $N(0,\sigma^2)$.
$$f(\mathbf x\mid\varphi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}\sum_{n=1}^N \big(x_n - A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi)\big)^2\Big],$$
$$\frac{\partial\ln f(\mathbf x\mid\varphi)}{\partial\varphi} = -\frac{A}{\sigma^2}\sum_{n=1}^N \Big[x_n\sin(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) - \frac{A}{2}\sin(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big].$$
To check the regularity condition, find the expectation of this. To do this, we simply replace $x_n$ by its expectation $E[x_n] = A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi)$:
$$E\Big[\frac{\partial\ln f}{\partial\varphi}\Big] = -\frac{A}{\sigma^2}\sum_{n=1}^N \Big[A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi)\sin(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) - \frac{A}{2}\sin(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big] = 0,$$
since $\cos u\sin u = \frac{1}{2}\sin 2u$. Condition satisfied.
$$\frac{\partial^2\ln f(\mathbf x\mid\varphi)}{\partial\varphi^2} = -\frac{A}{\sigma^2}\sum_{n=1}^N \Big[x_n\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) - A\cos(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big],$$
$$-E\Big[\frac{\partial^2\ln f}{\partial\varphi^2}\Big] = \frac{A^2}{\sigma^2}\sum_{n=1}^N \Big[\cos^2(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) - \cos(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big] = \frac{A^2}{\sigma^2}\sum_{n=1}^N \Big[\frac{1}{2} - \frac{1}{2}\cos(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big] = \frac{NA^2}{2\sigma^2} - \frac{A^2}{2\sigma^2}\sum_{n=1}^N \cos(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi),$$
so
$$\operatorname{var}(\hat\varphi) \ge \frac{2\sigma^2}{NA^2},$$
where the $\ge$ sign becomes $=$ if and only if $N$, $T_s$, and $f_{\mathrm{sig}}$ are coherent (so that the cosine sum vanishes).
To find a minimum variance estimate that is unbiased, we need to see if we can get
$$\frac{\partial\ln f(\mathbf x\mid\varphi)}{\partial\varphi} = I_n(\varphi)\,\big(g(\mathbf x) - \varphi\big).$$
Under the assumption of coherency:
$$\frac{\partial\ln f(\mathbf x\mid\varphi)}{\partial\varphi} = -\frac{A}{\sigma^2}\sum_{n=1}^N \Big[x_n\sin(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) - \frac{A}{2}\sin(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi)\Big] = -\frac{A}{\sigma^2}\sum_{n=1}^N x_n\sin(2\pi f_{\mathrm{sig}}\, nT_s + \varphi)$$
$$= -\frac{A}{\sigma^2}\sum_{n=1}^N x_n\big[\sin(2\pi f_{\mathrm{sig}}\, nT_s)\cos\varphi + \cos(2\pi f_{\mathrm{sig}}\, nT_s)\sin\varphi\big]$$
$$= -\frac{A}{\sigma^2}\cos\varphi\,\Big(\sum_{n=1}^N x_n\cos(2\pi f_{\mathrm{sig}}\, nT_s)\Big)\Bigg[\tan\varphi + \frac{\sum_{n=1}^N x_n\sin(2\pi f_{\mathrm{sig}}\, nT_s)}{\sum_{n=1}^N x_n\cos(2\pi f_{\mathrm{sig}}\, nT_s)}\Bigg].$$
$$\frac{\partial\ln f(\mathbf x\mid\varphi)}{\partial\varphi} = -\frac{A}{\sigma^2}\cos\varphi\,\Big(\sum_{n=1}^N x_n\cos(2\pi f_{\mathrm{sig}}\, nT_s)\Big)\Bigg[\tan\varphi + \frac{\sum_{n=1}^N x_n\sin(2\pi f_{\mathrm{sig}}\, nT_s)}{\sum_{n=1}^N x_n\cos(2\pi f_{\mathrm{sig}}\, nT_s)}\Bigg]$$
Two issues here: the coefficient has $x_n$ in it, and inside the brackets we have $\tan\varphi$, not $\varphi$. For the first issue, can we take the expectation to get rid of $x_n$? For the second, can we just set the inside of the brackets to zero and solve for $\varphi$? This would correspond to treating $\tan\varphi$ as the parameter.
HW
Show that the following holds if and only if $N f_{\mathrm{sig}} T_s$ is an integer:
$$\sum_{n=1}^N \sin(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) = \sum_{n=1}^N \cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) = 0.$$
Now redo what we did on the last few slides for the three-parameter sine fitting problem:
$$x_n = C + A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) + w_n; \quad n = 1, 2,\dots,N,$$
where $C$, $A$, and $\varphi$ are all unknown. Other conditions are as before, including coherency. Use $C$, $A\sin\varphi$, and $A\cos\varphi$ as the parameters.
For multidimensional $\boldsymbol\theta = (\theta_1,\dots,\theta_m)$:
$$V_{ij} = \operatorname{cov}[\hat\theta_i, \hat\theta_j], \qquad \big(I_n(\boldsymbol\theta)\big)_{ij} = -E\Big[\frac{\partial^2\log L}{\partial\theta_i\,\partial\theta_j}\Big].$$
Then the CRLB is given by:
$$V - I_n^{-1}(\boldsymbol\theta) \;\text{ is positive semidefinite.}$$
In particular,
$$V_{ii} \ge \big(I_n^{-1}(\boldsymbol\theta)\big)_{ii}.$$
The inverse of the information matrix forms a lower bound. Since
$$\big(I_n(\boldsymbol\theta)\big)_{ij} = -E\Big[\frac{\partial^2\log L}{\partial\theta_i\,\partial\theta_j}\Big] = -\int\!\cdots\!\int \sum_{k=1}^n \frac{\partial^2\log f(x_k;\boldsymbol\theta)}{\partial\theta_i\,\partial\theta_j}\,\prod_{l=1}^n f(x_l;\boldsymbol\theta)\,dx_1\cdots dx_n = -n\int f(x;\boldsymbol\theta)\,\frac{\partial^2\log f(x;\boldsymbol\theta)}{\partial\theta_i\,\partial\theta_j}\,dx,$$
the lower bound on the estimator variance decreases as $1/n$, and the standard error decreases as $1/\sqrt{n}$.
$$x_n = A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi) + w_n; \quad n = 0, 1, 2,\dots,N-1,$$
with $\boldsymbol\theta = [A, f_{\mathrm{sig}}, \varphi]^T$, $A > 0$ and $0 < f_{\mathrm{sig}} < \frac{1}{2T_s}$.
$$f(\mathbf x\mid\boldsymbol\theta) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x_n - A\cos(2\pi f_{\mathrm{sig}}\, nT_s + \varphi)\big)^2\Big].$$
Assume:
$$\frac{1}{N}\sum_{n=0}^{N-1}\cos(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi) \approx 0, \qquad \frac{1}{N}\sum_{n=0}^{N-1}\sin(4\pi f_{\mathrm{sig}}\, nT_s + 2\varphi) \approx 0.$$
Take the various second derivatives of $\ln f(\mathbf x\mid\boldsymbol\theta)$ to get the information matrix $I(\boldsymbol\theta)$ and, from its inverse, the bounds
$$\operatorname{var}(\hat A) \ge \frac{2\sigma^2}{N}, \qquad \operatorname{var}(\hat f_{\mathrm{sig}}) \ge \frac{12\sigma^2}{(2\pi)^2 A^2 T_s^2\, N(N^2-1)}, \qquad \operatorname{var}(\hat\varphi) \ge \frac{4\sigma^2 (2N-1)}{A^2\, N(N+1)}.$$
Simplification: Large Data Sample
Evaluate the second derivatives with the measured data and the ML estimates:
$$\big(\hat I_N\big)_{ij} = -\frac{\partial^2\log L}{\partial\theta_i\,\partial\theta_j}\Big|_{\boldsymbol\theta=\hat{\boldsymbol\theta}}.$$
Use these to estimate the covariance matrix when the likelihood function is maximized numerically, instead of running MC to generate the covariance matrix.
ML Example with Two Parameters
A voltage sample $x$ is taken from a unit ramp signal with uniform time distribution, so $x$ is nominally uniformly distributed on $[-1, +1]$. But the distribution is actually nonlinear, parameterized as
$$f(x;\alpha,\beta) = \frac{1 + \alpha x + \beta x^2}{2 + \frac{2}{3}\beta}, \qquad -1 \le x \le +1 \text{ (normalized)}.$$
Realistic measurements cover only $x_{\min} \le x \le x_{\max}$:
$$f(x;\alpha,\beta) = \frac{1 + \alpha x + \beta x^2}{(x_{\max} - x_{\min}) + \frac{\alpha}{2}(x_{\max}^2 - x_{\min}^2) + \frac{\beta}{3}(x_{\max}^3 - x_{\min}^3)}.$$
Example: $\alpha = 0.5$, $\beta = 0.5$, $x_{\min} = -0.95$, $x_{\max} = 0.95$; generate $n = 2000$ events with Monte Carlo.
$$f(x_i;\alpha,\beta) = \frac{1 + \alpha x_i + \beta x_i^2}{\mathrm{den}(\alpha,\beta)}, \qquad L(\alpha,\beta\mid x_1, x_2,\dots,x_{2000}) = \prod_{i=1}^{2000} f(x_i;\alpha,\beta),$$
$$\log L(\alpha,\beta\mid\mathbf x) = \sum_{i=1}^{2000}\log\big(1 + \alpha x_i + \beta x_i^2\big) - 2000\log\mathrm{den}(\alpha,\beta).$$
$$\frac{\partial\log L}{\partial\alpha} = \sum_{i=1}^{2000}\frac{x_i}{1 + \alpha x_i + \beta x_i^2} - 2000\,\frac{\partial\mathrm{den}/\partial\alpha}{\mathrm{den}(\alpha,\beta)} = 0, \qquad \frac{\partial\log L}{\partial\beta} = \sum_{i=1}^{2000}\frac{x_i^2}{1 + \alpha x_i + \beta x_i^2} - 2000\,\frac{\partial\mathrm{den}/\partial\beta}{\mathrm{den}(\alpha,\beta)} = 0.$$
Solve for $\alpha$ and $\beta$: this cannot be done easily by hand; we have to use numerical tools.
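A hedged Python sketch of the whole exercise (accept-reject event generation and a Nelder-Mead fit are my choices; the slides do not specify the tools):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
a_true, b_true = 0.5, 0.5
x_min, x_max = -0.95, 0.95

def den(a, b):
    # normalization: integral of 1 + a*x + b*x^2 over [x_min, x_max]
    return ((x_max - x_min) + a * (x_max**2 - x_min**2) / 2
            + b * (x_max**3 - x_min**3) / 3)

# Generate 2000 events by accept-reject against the unnormalized density
f_max = 1 + abs(a_true) * 0.95 + b_true * 0.95**2
events = []
while len(events) < 2000:
    u = rng.uniform(x_min, x_max)
    if rng.uniform(0, f_max) < 1 + a_true * u + b_true * u**2:
        events.append(u)
x = np.array(events)

def neg_log_L(theta):
    a, b = theta
    q = 1 + a * x + b * x**2
    if np.any(q <= 0) or den(a, b) <= 0:
        return np.inf                          # outside the physical region
    return -(np.sum(np.log(q)) - len(x) * np.log(den(a, b)))

res = minimize(neg_log_L, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x)                                   # estimates scatter around (0.5, 0.5)
```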
The second derivatives (since $\mathrm{den}$ is linear in $\alpha$ and $\beta$, the $\log\mathrm{den}$ terms reduce to products of its first derivatives):
$$-\frac{\partial^2\log L}{\partial\alpha^2} = \sum_{i=1}^{2000}\frac{x_i^2}{(1 + \alpha x_i + \beta x_i^2)^2} - 2000\Big(\frac{\partial\mathrm{den}/\partial\alpha}{\mathrm{den}}\Big)^2,$$
$$-\frac{\partial^2\log L}{\partial\beta^2} = \sum_{i=1}^{2000}\frac{x_i^4}{(1 + \alpha x_i + \beta x_i^2)^2} - 2000\Big(\frac{\partial\mathrm{den}/\partial\beta}{\mathrm{den}}\Big)^2,$$
$$-\frac{\partial^2\log L}{\partial\alpha\,\partial\beta} = \sum_{i=1}^{2000}\frac{x_i^3}{(1 + \alpha x_i + \beta x_i^2)^2} - 2000\,\frac{(\partial\mathrm{den}/\partial\alpha)(\partial\mathrm{den}/\partial\beta)}{\mathrm{den}^2}.$$
Instead of performing the expectation, we can use the estimated values of $\alpha$ and $\beta$, together with the observed values of $x_i$, to estimate $I_{11}$, $I_{22}$, and $I_{12}$. From $I^{-1}$, we get estimates of $\operatorname{var}(\hat\alpha)$, $\operatorname{var}(\hat\beta)$, and the covariance between $\hat\alpha$ and $\hat\beta$.
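Continuing the previous sketch (it reuses `den`, `x`, `x_min`, `x_max`, and `res` from above; the helper `info_matrix` is mine, not from the slides): evaluate the analytic second derivatives at the estimates and invert to get the covariance matrix.

```python
import numpy as np

# Continues the previous sketch: reuses den(), x, x_min, x_max, and res.
def info_matrix(a, b, x):
    q = 1 + a * x + b * x**2
    d = den(a, b)
    da = (x_max**2 - x_min**2) / 2             # d(den)/da
    db = (x_max**3 - x_min**3) / 3             # d(den)/db
    n = len(x)
    # Minus second derivatives of log L, evaluated at (a, b):
    I_aa = np.sum(x**2 / q**2) - n * (da / d) ** 2
    I_bb = np.sum(x**4 / q**2) - n * (db / d) ** 2
    I_ab = np.sum(x**3 / q**2) - n * da * db / d**2
    return np.array([[I_aa, I_ab], [I_ab, I_bb]])

cov = np.linalg.inv(info_matrix(res.x[0], res.x[1], x))
print(np.sqrt(cov[0, 0]), np.sqrt(cov[1, 1]))      # std. devs of a_hat, b_hat
print(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))  # correlation coefficient rho
```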
ML Example with Two Parameters
True values: $\alpha = \beta = 0.5$, 2000 events. Estimated values: $\hat\alpha = 0.508$, $\hat\beta = 0.477$, with standard deviations $\hat\sigma_{\hat\alpha} = 0.052$, $\hat\sigma_{\hat\beta} = 0.108$, and $\operatorname{cov}(\hat\alpha,\hat\beta) = 0.0026$. Note that $\hat\alpha$ and $\hat\beta$ are correlated, with $\rho = 0.0026/(0.052\times 0.108) = 0.463$.
MC Simulation to Verify the MLE
500 MC experiments with 2000 events each: mean $\hat\alpha = 0.499$ with sample standard deviation $s_{\hat\alpha} = 0.051$; mean $\hat\beta = 0.498$ with $s_{\hat\beta} = 0.111$; $\operatorname{cov} = 0.0024$, $\rho = 0.43$. Both marginal p.d.f.s are approximately Gaussian.
The $\ln L$ Contour
The log-likelihood $\log L(\alpha,\beta\mid\mathbf x)$ can be approximated with a Taylor series expansion about the maximum, with $\boldsymbol\theta = (\alpha,\beta)^T$:
$$\log L(\alpha,\beta\mid\mathbf x) \approx \log L_{\max} - \frac{1}{2}(\boldsymbol\theta - \hat{\boldsymbol\theta})^T\, \hat I\, (\boldsymbol\theta - \hat{\boldsymbol\theta}) = \log L_{\max} - \frac{1}{2}(\boldsymbol\theta - \hat{\boldsymbol\theta})^T\, \widehat{\operatorname{Var}}{}^{-1}\, (\boldsymbol\theta - \hat{\boldsymbol\theta}).$$
For large $n$, this approximation is quite accurate. For $\log L > \log L_{\max}$, the equation $\log L(\alpha,\beta\mid\mathbf x) = \log L$ has no solution; for $\log L < \log L_{\max}$, it defines an ellipse.
(Co)variances from the $\ln L$ Contour
In the $(\alpha,\beta)$ plane for the first MC data set, draw the contour
$$\log L(\alpha,\beta\mid\mathbf x) = \log L_{\max} - 0.5.$$
Tangent lines to the contour give the standard deviations. The angle of the ellipse is related to the correlation:
$$\tan 2\phi = \frac{2\rho\,\sigma_{\hat\alpha}\sigma_{\hat\beta}}{\sigma_{\hat\alpha}^2 - \sigma_{\hat\beta}^2}.$$
Correlations between estimators result in an increase of their standard deviations (larger statistical errors).