
Maximum Likelihood and Maximum Entropy Duality
Explore the duality between Maximum Likelihood and Maximum Entropy in data modeling and prediction tasks, illustrated with examples and principles such as Laplace's principle of unbiased reasoning.
Presentation Transcript
Maximum Likelihood-Maximum Entropy Duality: Session 1 Pushpak Bhattacharyya Scribed by Aditya Joshi Presented in the NLP-AI talk on 14th January, 2015
Phenomenon/Event -> Data/Observation -> Model. A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction. A model uses the data in order to predict future observations w.r.t. the phenomenon.
Notations X: x1, x2, x3, ..., xm (m observations) A: random variable with n possible outcomes a1, a2, a3, ..., an e.g. one coin throw: a1 = 0, a2 = 1; one dice throw: a1 = 1, a2 = 2, a3 = 3, a4 = 4, a5 = 5, a6 = 6
Goal Goal: Estimate P(ai) = Pi. Two paths: ME (Maximum Entropy) and ML (Maximum Likelihood). Are they equivalent?
Calculating probability from data Suppose in X: x1, x2, x3, ..., xm (m observations), ai occurs f(ai) = fi times. e.g. Dice: if the outcomes are 1 1 2 3 1 5 3 4 2 1, then f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0 and m = 10. Hence, P1 = 4/10, P2 = 2/10, P3 = 2/10, P4 = 1/10, P5 = 1/10, P6 = 0/10
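As an illustration (not part of the original slides), a minimal Python sketch of this frequency-based estimate, with the dice observations above hard-coded:

    from collections import Counter

    # Dice observations from the slide (m = 10)
    X = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
    m = len(X)
    counts = Counter(X)

    # Empirical estimate: P_i = f(a_i) / m for each outcome a_i
    P = {a: counts.get(a, 0) / m for a in range(1, 7)}
    print(P)  # {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.0}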
In general, the task is... Task: Get θ, the probability vector <P(a1), P(a2), ..., P(an)>, from X.
MLE MLE: θ* = argmax_θ Pr(X; θ). With the i.i.d. (independent and identically distributed) assumption, θ* = argmax_θ Π_{i=1}^{m} Pr(xi; θ), where θ = <P1, P2, ..., Pn> and Pj = P(aj).
What is known about θ = <P1, P2, ..., Pn>: Σ_{i=1}^{n} Pi = 1 and Pi >= 0 for all i. Introducing entropy: H(θ) = -Σ_{i=1}^{n} Pi ln Pi, the entropy of the distribution <P1, P2, ..., Pn>.
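For concreteness, a small helper (a sketch, not from the talk) for computing this entropy in Python; natural log is used, and 0·ln 0 is treated as 0 by the usual convention:

    import math

    def entropy(P):
        """H(theta) = -sum_i P_i * ln(P_i), with 0*ln(0) taken as 0."""
        return -sum(p * math.log(p) for p in P if p > 0)

    print(entropy([1/6] * 6))  # ln(6) ~ 1.7918, the maximum for 6 outcomes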
Some intuition Example with a dice. Outcomes = 1, 2, 3, 4, 5, 6 and P(1) + P(2) + P(3) + ... + P(6) = 1. Entropy(Dice) = H(θ) = -Σ_{i=1}^{6} P(i) ln P(i). Now, there is a principle called Laplace's principle of unbiased reasoning (also known as the principle of insufficient reason).
The best estimate for the dice P(1) = P(2) = ... = P(6) = 1/6. We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability >= 0, and Σ Pi = 1.
What does best mean? BEST means most consistent with the situation: the Pi values should be such that they maximize the entropy.
Optimization formulation Max. -Σ_{i=1}^{6} P(i) log P(i) subject to: Σ_{i=1}^{6} P(i) = 1 and P(i) >= 0 for i = 1 to 6.
Solving the optimization (1/2) Using Lagrange multipliers, the optimization can be written as: Q = -Σ_{i=1}^{6} P(i) log P(i) - λ(Σ_{i=1}^{6} P(i) - 1) - Σ_{i=1}^{6} μi P(i). For now, let us ignore the last term. We will come to it later.
Solving the optimization (2/2) Differentiating Q w.r.t. P(i), we get ∂Q/∂P(i) = -log P(i) - 1 - λ. Equating to zero: log P(i) + 1 + λ = 0, i.e. log P(i) = -(1 + λ), i.e. P(i) = e^{-(1+λ)}. Since the right-hand side does not depend on i, to maximize entropy every P(i) must be equal.
This shows that P(1) = P(2) = ... = P(6). But P(1) + P(2) + ... + P(6) = 1. Therefore P(1) = P(2) = ... = P(6) = 1/6.
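A quick numerical sanity check of this conclusion (a sketch, not from the original talk): no randomly drawn probability vector on six outcomes beats the uniform one on entropy.

    import math, random

    def entropy(P):
        return -sum(p * math.log(p) for p in P if p > 0)

    uniform = [1/6] * 6
    for _ in range(1000):
        w = [random.random() for _ in range(6)]
        s = sum(w)
        P = [x / s for x in w]          # an arbitrary probability vector
        assert entropy(P) <= entropy(uniform) + 1e-12
    print("maximum entropy =", entropy(uniform))  # ln(6)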
Introducing data in the notion of entropy Now, we introduce data: X: x1, x2, x3, ..., xm (m observations); A: a1, a2, a3, ..., an (n outcomes). e.g. For a coin, in the absence of data, P(H) = P(T) = 1/2 (as shown in the previous proof). However, if data X is observed as follows: Obs-1: H H T H T T H H H T (m = 10, n = 2): P(H) = 6/10, P(T) = 4/10. Obs-2: T T H T H T H T T T (m = 10, n = 2): P(H) = 3/10, P(T) = 7/10. Which of these is a valid estimate?
Change in entropy Emax (uniform distribution); E1: P(H) = 6/10; E2: P(H) = 3/10. Both E1 and E2 are smaller than Emax: entropy reduces as data is observed!
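The claim can be checked numerically; a sketch with the three coin distributions above:

    import math

    def entropy(P):
        return -sum(p * math.log(p) for p in P if p > 0)

    E_max = entropy([0.5, 0.5])   # no data: uniform coin
    E1 = entropy([0.6, 0.4])      # Obs-1: P(H) = 6/10
    E2 = entropy([0.3, 0.7])      # Obs-2: P(H) = 3/10
    print(E_max, E1, E2)          # ~0.693 > ~0.673 > ~0.611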
Start of Duality Emax (maximum entropy: uniform distribution) -> entropy reduction -> Edata (entropy when observations are made). Maximizing entropy in this situation is the same as minimizing the 'entropy reduction' distance, i.e. minimizing relative entropy.
Concluding remarks Thus, in this discussion of ML-ME duality, we will show that MLE minimizes the relative-entropy distance from the uniform distribution. Question: the entropies correspond to probability vectors, and the distance between two vectors could also be measured by squared (Euclidean) distance. WHY relative entropy?
Maximum Likelihood-Maximum Entropy Duality: Session 2 Pushpak Bhattacharyya Scribed by Aditya Joshi Presented in NLP-AI talk on 28th January, 2015
Recap (1/2) Whatever we can show with MLE, we can (in most cases) show through ME as well. Laplace's unbiased reasoning principle: make only the most limited assumptions. We assume data X: x1, x2, ..., xm, where m is the number of observations and every xi is an outcome of a random variable.
Recap (2/2) E.g. xi ∈ {1, 2, ..., 6} for a dice. If there is no data, then by Laplace's principle, the uniform distribution is the best estimate of the probability of each outcome.
NLP Perspective Probability of mapping of words between parallel sentences. E.g. the mapping between 'blue' and 'neela', and between 'blue' and 'haraa' (a wrong translation). How can we maintain probabilities of correct mappings based on the corpus?
Starting off This week, we assume that we have data X: x1, ..., xm. The outcomes of the r.v. are a1, a2, a3, ..., an, such that each aj occurs fj times in the data. In the absence of any data, Pj = 1/n. What happens when we have data?
In presence of data When we have the data, Pj = ? Answer: Pj = fj / m. WHY? Can we arrive at this value through (a) MLE, or (b) ME?
Taking the MLE route MLE: we maximize the data likelihood P(X; θ), where θ = <P1, P2, ..., Pn>. Under i.i.d., P(X; θ) = Π_{i=1}^{m} P(xi; θ). e.g. the observation sequence 1 1 2 3 4 5 4 contributes P1 · P1 · P2 · P3 · P4 · P5 · P4 = P1^2 · P2^1 · P3^1 · P4^2 · P5^1. In general, P(X; θ) = Π_{j=1}^{n} Pj^{fj}, where Σ_{j=1}^{n} fj = m.
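A brief check of this factorization (a sketch; the observation sequence and the probability vector theta are illustrative, not estimated from anything):

    import math
    from collections import Counter

    X = [1, 1, 2, 3, 4, 5, 4]                  # example observation sequence
    theta = [0.3, 0.2, 0.2, 0.15, 0.1, 0.05]   # some probability vector <P1..P6>

    # Likelihood as a product over observations ...
    lik_obs = math.prod(theta[x - 1] for x in X)
    # ... equals the product over outcomes of P_j ** f_j
    f = Counter(X)
    lik_out = math.prod(theta[j - 1] ** f[j] for j in f)
    print(math.isclose(lik_obs, lik_out))      # True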
Maximization MLE demands: maximize P(X; θ), or equivalently maximize ln P(X; θ), subject to Σ_{j=1}^{n} Pj = 1 and Pj >= 0 for all j.
Evaluating the parameter Pj (1/2) Q = Σ_{j=1}^{n} fj ln Pj - λ(Σ_{j=1}^{n} Pj - 1). Taking the derivative w.r.t. λ and equating to zero: Σ_{j=1}^{n} Pj = 1 ... (1). Taking the derivative w.r.t. Pj: ∂Q/∂Pj = fj / Pj - λ; equating to zero: fj / Pj - λ = 0, i.e. Pj = fj / λ ... (2).
Evaluating the parameter Pj (2/2) From (1) and (2): Σ_{j=1}^{n} fj / λ = 1, i.e. λ = Σ_{j=1}^{n} fj = m. Hence Pj = fj / m. Proved!
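A numerical cross-check of the derived estimator (a sketch using the dice frequencies from Session 1): no randomly drawn probability vector attains a higher log-likelihood than f_j / m.

    import math, random

    f = [4, 2, 2, 1, 1, 0]              # frequencies from the dice example
    m = sum(f)
    P_mle = [fj / m for fj in f]

    def log_likelihood(P, f):
        # sum_j f_j * ln(P_j); outcomes with f_j = 0 contribute nothing
        return sum(fj * math.log(pj) for fj, pj in zip(f, P) if fj > 0)

    best = log_likelihood(P_mle, f)
    for _ in range(1000):
        w = [random.random() for _ in f]
        s = sum(w)
        P = [x / s for x in w]
        assert log_likelihood(P, f) <= best + 1e-12
    print("P_j = f_j / m:", P_mle)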
Summary 1) No observation: Pj = 1/n (via entropy maximization, as seen last time). 2) Observation: Pj = fj/m, where Σ fj = m (via MLE, as shown today). Can the same be obtained via ME?
Does entropy change? Before we move on to the discussion of ME in the case of observed data, we first see how entropy gets affected in the presence of data. In the context of the cases given on the previous slide, we wish to see: is it true that Entropy(Case 2) <= Entropy(Case 1)? Let's verify.
Situation 1: No observed data Entropy(Case 1) = -Σ_{j=1}^{n} P(j) log P(j) = -Σ_{j=1}^{n} (1/n) log(1/n) = -log(1/n) = log(n)
Situation 2: Observed Data (1/2) Entropy(Case 2) = -Σ_{j=1}^{n} P(j) log P(j) = -Σ_{j=1}^{n} (fj/m) log(fj/m) = -(1/m) Σ_{j=1}^{n} fj log(fj/m) = -(1/m) Σ_{j=1}^{n} fj [log(fj) - log(m)]
Situation 2: Observed Data (2/2) Entropy(Case 2) = -(1/m)[Σ_{j=1}^{n} fj log(fj) - Σ_{j=1}^{n} fj log(m)] = -(1/m)[Σ_{j=1}^{n} fj log(fj) - m log(m)] = [m log(m) - Σ_{j=1}^{n} fj log(fj)] / m
Comparing the entropies Now, we must show Entropy(Case 2) <= Entropy(Case 1), i.e. [m log(m) - Σ_{j=1}^{n} fj log(fj)] / m <= log(n), i.e. m log(m) - m log(n) <= Σ_{j=1}^{n} fj log(fj), i.e. m log(m/n) <= Σ_{j=1}^{n} fj log(fj). .... We will show this in the next class.
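As a numerical preview of the inequality (a sketch with the dice frequencies from earlier; the f_j = 0 term is taken as 0, consistent with 0·log 0 = 0):

    import math

    f = [4, 2, 2, 1, 1, 0]
    m, n = sum(f), len(f)

    lhs = m * math.log(m / n)
    rhs = sum(fj * math.log(fj) for fj in f if fj > 0)
    print(lhs, rhs, lhs <= rhs)   # ~5.108, ~8.318, True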
Change in entropy No data: Pj = 1/n, entropy E1. Data: Pj = fj/m, entropy E2. Entropy reduces as data is observed!
Concluding remark If we can show that MLE gives the same estimate as maximum entropy, then the MLE-ME duality is established.
Maximum Likelihood-Maximum Entropy Duality: Session 3 Pushpak Bhattacharyya Scribed by Aditya Joshi Presented in NLP-AI talk on 11th February, 2015
Recap X1, X2, X3, ..., XN: data. A1, A2, A3, ..., AM: outcomes. Goal: estimate P(Ai) = Pi. In the absence of any other information (not even data), Pi = 1/M. This was obtained using ME: max -Σ_{i=1}^{M} Pi log Pi subject to Σ_{i=1}^{M} Pi = 1. When data is observed, Pi = fi/N, where fi = freq(Ai). This was obtained using MLE: max Σ_{i=1}^{M} fi log Pi subject to Σ_{i=1}^{M} Pi = 1 and Σ_{i=1}^{M} fi = N.
Change in entropy No data: Pi = 1/M, entropy Eu. Data: Pi = fi/N, entropy Ed. Entropy changes as data is observed.
Today's goal Today, we will show that the entropy is reduced, i.e. Eu > Ed. We give two proofs.
Proof 1 (1/4) Suppose, without loss of generality, that the first two probabilities are perturbed: P1 -> P + δ, P2 -> P - δ. In the uniform case, P1 = P2 = ... = P = 1/M and δ = 0. Then Eu = -Σ_{i=1}^{M} Pi log Pi (with all Pi = P) and Ed = -(P + δ) log(P + δ) - (P - δ) log(P - δ) - Σ_{i=3}^{M} Pi log Pi.
Proof 1 (2/4) Eu - Ed = -P log P + (P + δ) log(P + δ) - P log P + (P - δ) log(P - δ) = P log((P + δ)/P) + δ(log P + log(1 + δ/P)) + P log((P - δ)/P) - δ(log P + log(1 - δ/P)) = P[log(1 + δ/P) + log(1 - δ/P)] + δ[log(1 + δ/P) - log(1 - δ/P)]
Proof 1 (3/4) Eu - Ed = P[log(1 + δ/P) + log(1 - δ/P)] + δ[log(1 + δ/P) - log(1 - δ/P)] = P log(1 - δ²/P²) + δ log((1 + δ/P)/(1 - δ/P)) = P log(1 - y²) + δ log((1 + y)/(1 - y)), where y = δ/P
Proof 1 (4/4) Eu - Ed = P log(1 - y²) + δ log((1 + y)/(1 - y)). Using the expansions log(1 + x) = x - x²/2 + x³/3 - x⁴/4 + ... and log((1 + x)/(1 - x)) = 2x + 2x³/3 + 2x⁵/5 + 2x⁷/7 + ..., we get Eu - Ed = P[-y² - y⁴/2 - y⁶/3 - y⁸/4 - ...] + δ[2y + 2y³/3 + 2y⁵/5 + 2y⁷/7 + ...]. Substituting δ = Py: Eu - Ed = -Py²[1 + y²/2 + y⁴/3 + y⁶/4 + ...] + Py²[2 + 2y²/3 + 2y⁴/5 + 2y⁶/7 + ...]. Comparing the two brackets term by term, (1 vs 2), (1/2 vs 2/3), (1/3 vs 2/5), etc., every coefficient in the second bracket is larger, so the value is > 0 for δ ≠ 0. i.e. Eu - Ed > 0, i.e. Eu > Ed.
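A direct numerical check of Proof 1 (a sketch with M = 6 and an arbitrarily chosen perturbation delta):

    import math

    M = 6
    P = 1 / M
    delta = 0.05                  # perturbation: P1 -> P + delta, P2 -> P - delta

    Eu = -M * P * math.log(P)
    Ed = (-(P + delta) * math.log(P + delta)
          - (P - delta) * math.log(P - delta)
          - (M - 2) * P * math.log(P))
    print(Eu > Ed)                # True: the perturbation strictly reduces entropy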
Proof 2 (1/2) Identity: for x, y > 0, y - y log y <= x - y log x. Proof: log t <= t - 1. Put t = x/y: log(x/y) <= (x/y) - 1, i.e. log x - log y <= (x - y)/y, i.e. y log x - y log y <= x - y, i.e. y - y log y <= x - y log x.
Proof 2 (2/2) Let p1, p2, ..., pM = 1/M (the uniform distribution) and q1, q2, ..., qM (a perturbed distribution). We put x = pi, y = qi in y - y log y <= x - y log x: qi - qi log qi <= pi - qi log pi. Taking the sum over i = 1, ..., M: Σ qi - Σ qi log qi <= Σ pi - Σ qi log pi. But Σ qi = 1 and Σ pi = 1, so 1 - Σ qi log qi <= 1 + log M · Σ qi, i.e. 1 + Ed <= 1 + log M = 1 + Eu. Hence Eu >= Ed.
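The key identity in Proof 2 can also be checked numerically; a sketch over randomly sampled positive pairs:

    import math, random

    # Identity: y - y*ln(y) <= x - y*ln(x) for all x, y > 0
    for _ in range(10000):
        x = random.uniform(1e-6, 10.0)
        y = random.uniform(1e-6, 10.0)
        assert y - y * math.log(y) <= x - y * math.log(x) + 1e-9
    print("identity holds on all sampled (x, y) pairs")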
Conclusion We showed today that there is a reduction in entropy when data is observed. We showed two proofs: the first is more intuitive but long; the second is simpler.
Maximum Likelihood-Maximum Entropy Duality: Session 4 Pushpak Bhattacharyya Scribed by Aditya Joshi Presented in NLP-AI talk on 25th March, 2015
A uniform distribution and any probability distribution P Consider the uniform distribution (1/n for each outcome) and an arbitrary probability distribution P. Let us now compute the relative entropy / KL divergence (KLD) between these two vectors.
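A small sketch of this computation (assuming P is the empirical dice distribution from earlier; outcomes with P_i = 0 contribute nothing): the KL divergence from P to the uniform distribution comes out to log n - H(P), i.e. exactly the 'entropy reduction' discussed in the earlier sessions.

    import math

    def kld(P, Q):
        # Relative entropy D(P || Q) = sum_i P_i * ln(P_i / Q_i), with 0*ln(0) = 0
        return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

    n = 6
    U = [1 / n] * n                         # uniform distribution
    P = [0.4, 0.2, 0.2, 0.1, 0.1, 0.0]      # empirical dice distribution
    H_P = -sum(p * math.log(p) for p in P if p > 0)

    print(kld(P, U))                        # D(P || uniform)
    print(math.log(n) - H_P)                # same value: log(n) - H(P)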