Parameter Estimation in Probabilistic Models: An Example in Probabilistic Machine Learning

Learn about parameter estimation in probabilistic models using Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation techniques. Understand how to estimate the bias of a coin by analyzing sequences of coin toss outcomes and incorporating prior distributions for more reliable estimates.

  • Probabilistic Models
  • Machine Learning
  • Parameter Estimation
  • Maximum Likelihood Estimation
  • Maximum A Posteriori Estimation




Presentation Transcript


  1. Parameter Estimation in Probabilistic Models: An Example CS772A: Probabilistic Machine Learning Piyush Rai

  2. Estimating a Coin's Bias: MLE
     Consider a sequence of N coin-toss outcomes (observations). Each observation y_n is a binary random variable: head means y_n = 1, tail means y_n = 0. Each y_n is assumed to be generated by a Bernoulli distribution, which serves as the likelihood or observation model, with parameter θ ∈ (0,1), the probability of a head: p(y_n | θ) = Bernoulli(y_n | θ) = θ^{y_n} (1 − θ)^{1 − y_n}. Here θ is the unknown parameter, which we want to estimate using MLE assuming i.i.d. data.
     The log-likelihood is Σ_{n=1}^{N} log p(y_n | θ) = Σ_{n=1}^{N} [ y_n log θ + (1 − y_n) log(1 − θ) ]. Maximizing the log-likelihood, or equivalently minimizing the negative log-likelihood (NLL), w.r.t. θ gives θ_MLE = (1/N) Σ_{n=1}^{N} y_n. Thus the MLE solution is simply the fraction of heads, which makes intuitive sense.
     But suppose I tossed a coin 5 times and got 1 head and 4 tails: does that mean θ = 0.2? The MLE approach says so. What if I see 0 heads and 5 tails? Does it mean θ = 0? Indeed, with a small number of training observations, MLE may overfit and may not be reliable. An alternative is MAP estimation, which can incorporate a prior distribution over θ.
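A minimal sketch in Python of the MLE computation above: the closed-form estimate is just the fraction of heads, and a grid search over the negative log-likelihood gives the same answer. The toss data and the grid are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical coin-toss data (1 = head, 0 = tail): 1 head, 4 tails
y = np.array([1, 0, 0, 0, 0])

# Closed-form MLE: fraction of heads
theta_mle = y.mean()
print("MLE estimate:", theta_mle)   # 0.2

# Sanity check: the negative log-likelihood is minimized at the same value
def nll(theta, y):
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
print("argmin of NLL on a grid:", grid[np.argmin([nll(t, y) for t in grid])])
```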

  3. Estimating a Coin's Bias: MAP
     Let's again consider the coin-toss problem (estimating the bias of the coin). Each likelihood term is Bernoulli: p(y_n | θ) = Bernoulli(y_n | θ) = θ^{y_n} (1 − θ)^{1 − y_n}. Since we want to do MAP estimation, we also need a prior. Since θ ∈ (0,1), a reasonable choice of prior for θ is the Beta distribution: Beta(θ | α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α − 1} (1 − θ)^{β − 1}, where Γ denotes the gamma function. Here α and β (both positive reals) are the two hyperparameters of this Beta prior; using α = 1 and β = 1 makes the Beta prior a uniform prior. These hyperparameters can be set based on intuition, chosen by cross-validation, or even learned.
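A small sketch of the Beta prior using scipy.stats.beta; the hyperparameter values below are hypothetical. It also checks that α = β = 1 gives a flat (uniform) density on (0,1).

```python
from scipy.stats import beta

# Hypothetical hyperparameters for the Beta prior over theta
a, b = 2.0, 2.0                    # mild preference for theta near 0.5
uniform_a, uniform_b = 1.0, 1.0    # alpha = beta = 1 recovers a uniform prior

for theta in [0.1, 0.5, 0.9]:
    print(theta,
          beta.pdf(theta, a, b),                  # Beta(2, 2) density
          beta.pdf(theta, uniform_a, uniform_b))  # constant 1.0 -> uniform
```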

  4. Estimating a Coin's Bias: MAP
     The log posterior for the coin-toss model is log-likelihood + log-prior: LP(θ) = Σ_{n=1}^{N} log p(y_n | θ) + log p(θ | α, β). Plugging in the expressions for the Bernoulli and Beta distributions and ignoring any terms that don't depend on θ, the log posterior simplifies to LP(θ) = Σ_{n=1}^{N} [ y_n log θ + (1 − y_n) log(1 − θ) ] + (α − 1) log θ + (β − 1) log(1 − θ).
     Maximizing this log posterior (or minimizing its negative) w.r.t. θ gives θ_MAP = ( Σ_{n=1}^{N} y_n + α − 1 ) / ( N + α + β − 2 ).
     The prior's hyperparameters have an interesting interpretation: we can think of α − 1 and β − 1 as the numbers of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Using α = 1 and β = 1 gives the same solution as MLE; recall that α = 1 and β = 1 for the Beta distribution is in fact equivalent to a uniform prior (hence making MAP equivalent to MLE). Such interpretations of a prior's hyperparameters as pseudo-observations exist for various other prior distributions as well (in particular, distributions belonging to the exponential family).
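A short sketch of the closed-form MAP estimate from this slide, with hypothetical data and hyperparameters; it also confirms that α = β = 1 recovers the MLE.

```python
import numpy as np

y = np.array([1, 0, 0, 0, 0])   # hypothetical tosses: 1 head, 4 tails
a, b = 2.0, 2.0                 # hypothetical Beta hyperparameters

N = len(y)
# MAP estimate: (sum(y) + alpha - 1) / (N + alpha + beta - 2)
theta_map = (y.sum() + a - 1) / (N + a + b - 2)
print(theta_map)                # 2/7 ~ 0.286, pulled toward 0.5 relative to the MLE of 0.2

# With alpha = beta = 1 (uniform prior) the MAP estimate reduces to the MLE
print((y.sum() + 1 - 1) / (N + 1 + 1 - 2))  # 0.2
```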

  5. Estimating a Coin's Bias: Fully Bayesian Inference
     In fully Bayesian inference, we compute the full posterior distribution over θ. The Bernoulli likelihood is p(y_n | θ) = Bernoulli(y_n | θ) = θ^{y_n} (1 − θ)^{1 − y_n}, and the Beta prior is p(θ) = Beta(θ | α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α − 1} (1 − θ)^{β − 1}. Let N_1 = Σ_{n=1}^{N} y_n denote the number of heads and N_0 = N − N_1 the number of tails.
     The posterior can be computed as p(θ | y) = p(y | θ) p(θ) / p(y) = [ Π_{n=1}^{N} p(y_n | θ) ] p(θ) / p(y) ∝ [ Π_{n=1}^{N} θ^{y_n} (1 − θ)^{1 − y_n} ] θ^{α − 1} (1 − θ)^{β − 1} = θ^{α + N_1 − 1} (1 − θ)^{β + N_0 − 1}.
     Here, even without computing the denominator (the marginal likelihood), we can identify the posterior: since p(θ | y) ∝ θ^{α + N_1 − 1} (1 − θ)^{β + N_0 − 1}, it is a Beta distribution. Exercise: show that the normalization constant equals the Beta function B(α + N_1, β + N_0) = Γ(α + N_1) Γ(β + N_0) / Γ(α + β + N). Hint: use the fact that the posterior must integrate to 1, i.e., ∫ p(θ | y) dθ = 1. Thus p(θ | y) = Beta(θ | α + N_1, β + N_0).
     Here, finding the posterior boiled down to simply multiplying, adding, and identifying the result. The posterior has the same form as the prior (both Beta): this is the defining property of conjugate priors.
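A sketch of the conjugate Beta posterior update described above, using scipy.stats.beta; the data and hyperparameters are hypothetical. The posterior mode matches the MAP estimate from the previous slide.

```python
from scipy.stats import beta

y = [1, 0, 0, 0, 0]             # hypothetical tosses
a, b = 2.0, 2.0                 # hypothetical prior hyperparameters

N1 = sum(y)                     # number of heads
N0 = len(y) - N1                # number of tails

# Conjugate update: posterior is Beta(a + N1, b + N0)
posterior = beta(a + N1, b + N0)
print("posterior mean:", posterior.mean())                      # (a + N1) / (a + b + N) = 1/3
print("posterior mode:", (a + N1 - 1) / (a + b + len(y) - 2))   # 2/7, equals the MAP estimate
```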

  6. Conjugacy and Conjugate Priors
     Many pairs of distributions are conjugate to each other:
       • Bernoulli (likelihood) + Beta (prior) → Beta posterior
       • Binomial (likelihood) + Beta (prior) → Beta posterior
       • Multinomial (likelihood) + Dirichlet (prior) → Dirichlet posterior
       • Poisson (likelihood) + Gamma (prior) → Gamma posterior
       • Gaussian (likelihood) + Gaussian (prior) → Gaussian posterior (not true in general, but in some cases, e.g., when the variance of the Gaussian likelihood is fixed)
     and many other such pairs.
     Tip: if two distributions are conjugate to each other, their functional forms are similar. For example, Bernoulli and Beta have the forms Bernoulli(y | θ) = θ^{y} (1 − θ)^{1 − y} and Beta(θ | α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α − 1} (1 − θ)^{β − 1}. This is why, when we multiply them while computing the posterior, the exponents get added and we get the same form for the posterior as for the prior, just with updated hyperparameters. It also means we can identify the posterior and its hyperparameters simply by inspection. More on conjugate priors when we look at exponential family distributions.
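As an illustration of another conjugate pair from the list above, here is a small sketch of the Poisson likelihood + Gamma prior update; the counts and hyperparameters are made up, and the closed-form update (shape plus the sum of counts, rate plus the number of observations) is the standard conjugate result.

```python
from scipy.stats import gamma

# Poisson likelihood + Gamma prior -> Gamma posterior
counts = [3, 5, 4, 2]           # hypothetical Poisson observations
a0, b0 = 1.0, 1.0               # hypothetical Gamma(shape, rate) prior hyperparameters

# Conjugate update: shape += sum of counts, rate += number of observations
a_post = a0 + sum(counts)
b_post = b0 + len(counts)
posterior = gamma(a_post, scale=1.0 / b_post)    # scipy parameterizes Gamma by scale = 1/rate
print("posterior mean rate:", posterior.mean())  # (a0 + 14) / (b0 + 4) = 3.0
```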

  7. Making Predictions
     Suppose we want to compute the probability that the next outcome y_{N+1} will be a head (= 1). The plug-in predictive distribution uses a point estimate θ̂ (e.g., from MLE or MAP): p(y_{N+1} = 1 | y) ≈ p(y_{N+1} = 1 | θ̂) = θ̂. The posterior predictive distribution instead averages over all values of θ, weighted by their respective posterior probabilities: p(y_{N+1} = 1 | y) = ∫ p(y_{N+1} = 1 | θ) p(θ | y) dθ = ∫ θ p(θ | y) dθ, which is the expectation of θ w.r.t. the Beta posterior. Therefore the posterior predictive probability of a head is E[θ | y] = (α + N_1) / (α + β + N).
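A brief sketch contrasting the plug-in predictive (using the MAP point estimate) with the posterior predictive (the posterior mean of θ); the data and hyperparameters are hypothetical.

```python
from scipy.stats import beta

y = [1, 0, 0, 0, 0]             # hypothetical tosses
a, b = 2.0, 2.0                 # hypothetical prior hyperparameters
N1, N0 = sum(y), len(y) - sum(y)

# Plug-in predictive using the MAP point estimate of theta
theta_map = (N1 + a - 1) / (len(y) + a + b - 2)
print("plug-in   P(y_next = 1):", theta_map)                    # 2/7 ~ 0.286

# Posterior predictive: average theta over the Beta posterior,
# which is just the posterior mean (a + N1) / (a + b + N)
print("posterior P(y_next = 1):", beta(a + N1, b + N0).mean())  # 1/3 ~ 0.333
```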
