Introduction to NLP: Generative vs. Discriminative Models

Explore the differences between Generative and Discriminative Models in Natural Language Processing (NLP). Learn about the assumptions of Discriminative Classifiers, the effectiveness of Discriminative vs Generative Classifiers, and dive into the Naïve Bayes Generative Classifier. Understand concepts like Bag of Words representation and how these models impact text classification accuracy.

  • NLP
  • Generative Models
  • Discriminative Models
  • Naïve Bayes
  • Bag of Words


Presentation Transcript


  1. NLP

  2. Introduction to NLP: Generative vs. Discriminative Models

  3. Generative vs. Discriminative Models
     Generative:
       • Learn a model of the joint probability p(d, c)
       • Use Bayes Rule to calculate p(c|d)
       • Build a model of each class; given an example, return the model most likely to have generated that example
       • Examples: Naïve Bayes, Gaussian Discriminant Analysis, HMM, Probabilistic CFG
     Discriminative:
       • Model the posterior probability p(c|d) directly
       • Class is a function of the document vector
       • Find the exact function that minimizes classification errors on the training data
       • Examples: Logistic regression, Neural Networks (NNs), Support Vector Machines (SVMs), Decision Trees
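
As a minimal sketch of this contrast (assuming scikit-learn is available; the tiny corpus and labels below are invented for illustration), a generative MultinomialNB and a discriminative LogisticRegression can be fit on the same bag-of-words features:

```python
# Hypothetical toy corpus: compare a generative and a discriminative classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

docs = ["great plot and great acting", "boring plot and terrible acting",
        "loved this film", "hated this film"]
labels = ["pos", "neg", "pos", "neg"]

X = CountVectorizer().fit_transform(docs)             # bag-of-words count vectors

generative = MultinomialNB().fit(X, labels)           # models p(d|c) and p(c), then applies Bayes rule
discriminative = LogisticRegression().fit(X, labels)  # models p(c|d) directly

print(generative.predict(X))       # predictions on the training documents
print(discriminative.predict(X))
```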

  4. Assumptions of Discriminative Classifiers
     • Data examples (documents) are represented as vectors of features (words, phrases, n-grams, etc.)
     • We look for a function that maps each vector to a class; this function can be found by minimizing the errors on the training data (plus various other criteria)
     • Different classifiers vary in what the function looks like, and in how they find it

  5. Discriminative vs. Generative Classifiers
     • Discriminative classifiers are generally more effective, since they directly optimize classification accuracy. But:
       - they are all sensitive to the choice of features, and so far these features are extracted heuristically
       - overfitting can happen if data is sparse
     • Generative classifiers are the opposite:
       - they directly model text, an unnecessarily harder problem than classification
       - but they can easily exploit unlabeled data

  6. Introduction to NLP: Generative Classifier: Naïve Bayes

  7. Naïve Bayes Intuition
     • A simple ("naïve") classification method based on Bayes rule
     • Relies on a very simple representation of the document: bag of words

  8. Bag of Words
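
A bag of words keeps only the word counts of a document and ignores word order; a minimal sketch in plain Python (the example sentence is made up):

```python
# Bag of words: a document reduced to unordered word counts.
from collections import Counter

doc = "this book is a great book"        # made-up example sentence
bag = Counter(doc.lower().split())
print(bag)   # Counter({'book': 2, 'this': 1, 'is': 1, 'a': 1, 'great': 1})
```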

  9. Remember Bayes Rule?
     Bayes Rule: P(c|d) = P(d|c) P(c) / P(d)
     • d is the document (represented as a list of features x1, ..., xn)
     • c is a class (e.g., "not spam")
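
A tiny numerical illustration of the rule (all probabilities below are invented, not taken from any corpus):

```python
# Bayes rule: P(c|d) = P(d|c) * P(c) / P(d), with assumed numbers.
p_d_given_c = 0.02   # likelihood of the document under class c (assumed)
p_c = 0.4            # prior probability of class c (assumed)
p_d = 0.01           # marginal probability of the document (assumed)

p_c_given_d = p_d_given_c * p_c / p_d
print(p_c_given_d)   # 0.8
```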

  10. Naïve Bayes Classifier (I)
     c_MAP = argmax_{c ∈ C} P(c|d)                 (MAP = maximum a posteriori, i.e., the most likely class)
           = argmax_{c ∈ C} P(d|c) P(c) / P(d)     (Bayes Rule)
           = argmax_{c ∈ C} P(d|c) P(c)            (dropping the denominator)
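
Because P(d) is the same for every class, dropping it leaves the argmax unchanged; a small sketch with invented numbers:

```python
# Dropping the constant denominator P(d) does not change the argmax over classes.
p_d_given_c = {"spam": 0.001, "ham": 0.004}   # assumed likelihoods
p_c = {"spam": 0.3, "ham": 0.7}               # assumed priors
p_d = sum(p_d_given_c[c] * p_c[c] for c in p_c)

with_denom = {c: p_d_given_c[c] * p_c[c] / p_d for c in p_c}
without_denom = {c: p_d_given_c[c] * p_c[c] for c in p_c}

print(max(with_denom, key=with_denom.get))        # ham
print(max(without_denom, key=without_denom.get))  # ham (same decision)
```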

  11. Naïve Bayes Classifier (II)
     c_MAP = argmax_{c ∈ C} P(d|c) P(c)
     Document d is represented as features x1, ..., xn:
     c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
     But where will we get these probabilities?

  12. Naïve Bayesian Classifiers
     Naïve Bayesian classifier:
     P(c | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | c) P(c) / P(F1, F2, ..., Fk)
     Assuming statistical independence of the features:
     P(c | F1, F2, ..., Fk) = [∏_{j=1..k} P(Fj | c)] P(c) / ∏_{j=1..k} P(Fj)
     Features = words (or phrases), typically

  13. Multinomial Naïve Bayes Independence Assumptions
     P(x1, x2, ..., xn | c)
     • Bag of Words assumption: assume position doesn't matter
     • Conditional Independence: assume the feature probabilities P(xi|c) are independent given the class c:
       P(x1, x2, ..., xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · ... · P(xn|c)
     [Jurafsky and Martin]
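
Under these two assumptions the likelihood of a document factorizes into a product over its words; a minimal sketch with made-up per-word probabilities:

```python
import math

# Conditional independence: P(x1, ..., xn | c) = product of P(xi | c).
# The per-word probabilities below are invented for illustration.
p_word_given_pos = {"love": 0.05, "this": 0.10, "film": 0.08}

test_words = ["love", "this", "film"]
likelihood = math.prod(p_word_given_pos[w] for w in test_words)
print(likelihood)   # 0.05 * 0.10 * 0.08 ≈ 0.0004
```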

  14. Multinomial Naïve Bayes Classifier
     c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
     c_NB  = argmax_{c ∈ C} P(c) ∏_{x ∈ X} P(x|c)
     This is why it's naïve!
     [Jurafsky and Martin]

  15. Learning the Multinomial Naïve Bayes Model
     First attempt: maximum likelihood estimates; simply use the frequencies in the data:
     P(cj) = doccount(C = cj) / Ndoc
     P(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
     [Jurafsky and Martin]
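
A minimal sketch of these estimates, computed as relative frequencies over a tiny invented training set:

```python
from collections import Counter

# Maximum likelihood estimates: P(c) and P(w|c) are just relative frequencies.
# The two-document training set below is invented for illustration.
training = [("good fun film", "pos"), ("dull boring film", "neg")]

doc_counts = Counter(c for _, c in training)
word_counts = {c: Counter() for c in doc_counts}
for text, c in training:
    word_counts[c].update(text.split())

p_c = {c: n / len(training) for c, n in doc_counts.items()}
p_w_given_c = {c: {w: n / sum(wc.values()) for w, n in wc.items()}
               for c, wc in word_counts.items()}

print(p_c)                          # {'pos': 0.5, 'neg': 0.5}
print(p_w_given_c["pos"]["film"])   # 1/3 of the "pos" word tokens are "film"
```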

  16. Parameter Estimation
     P(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
     i.e., the fraction of times word wi appears among all words in documents of topic cj.
     • Create a mega-document for topic j by concatenating all docs in this topic
     • Use the frequency of w in the mega-document
     [Jurafsky and Martin]

  17. Problem with Maximum Likelihood
     What if we have seen no training documents with the word "fantastic" classified in the topic positive (thumbs-up)?
     P("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
     Zero probabilities cannot be conditioned away, no matter the other evidence!
     c_MAP = argmax_c P(c) ∏_i P(xi | c)
     [Jurafsky and Martin]

  18. Laplace Smoothing
     Unsmoothed: P(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)
     Add-one (Laplace): P(wi | c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
                                  = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
     [Jurafsky and Martin]
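
A small sketch of the add-one estimator (the vocabulary and class counts are invented); note that an unseen word such as "fantastic" now gets a small nonzero probability instead of the zero from the previous slide:

```python
from collections import Counter

# Add-one (Laplace) smoothing over an assumed vocabulary and class counts.
vocab = {"fantastic", "boring", "film", "great"}
counts_positive = Counter({"great": 2, "film": 1})   # word counts for class "positive"

def p_w_given_c(word, counts, vocab):
    # (count(w, c) + 1) / (total word count in c + |V|)
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(p_w_given_c("fantastic", counts_positive, vocab))  # 1/7 ≈ 0.143, no longer zero
print(p_w_given_c("great", counts_positive, vocab))      # 3/7 ≈ 0.429
```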

  19. Multinomial Naïve Bayes: Learning
     • From the training corpus, extract the Vocabulary
     • Calculate the P(cj) terms: for each cj in C do
         docsj ← all docs with class = cj
         P(cj) ← |docsj| / |total # documents|
     • Calculate the P(wk | cj) terms:
         Textj ← single doc containing all docsj
         for each word wk in Vocabulary:
           nk ← # of occurrences of wk in Textj
           P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)    (n = total # of word tokens in Textj)
     [Jurafsky and Martin]
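
A compact sketch of this recipe, assuming the add-α smoothing above and working in log space to avoid numerical underflow (the toy sentences reuse the example on the next slide):

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes training + classification, following the slide's
# recipe: one mega-document per class, add-alpha smoothing, argmax over classes.
def train_nb(docs, alpha=1.0):
    vocab = {w for text, _ in docs for w in text.split()}
    classes = {c for _, c in docs}
    log_prior, log_lik = {}, defaultdict(dict)
    for c in classes:
        class_docs = [text for text, label in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        mega = Counter(w for text in class_docs for w in text.split())  # mega-document
        total = sum(mega.values())
        for w in vocab:
            log_lik[c][w] = math.log((mega[w] + alpha) / (total + alpha * len(vocab)))
    return vocab, log_prior, log_lik

def classify_nb(text, vocab, log_prior, log_lik):
    words = [w for w in text.split() if w in vocab]   # ignore out-of-vocabulary words
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(log_lik[c][w] for w in words))

model = train_nb([("i hate this book", "neg"), ("love this book", "pos")])
print(classify_nb("hate book", *model))   # prints "neg"
```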

  20. Example
     Features = {I, hate, love, this, book}
     Training: "I hate this book" (class 1), "Love this book" (class 2)
     Prior: P(Y) = (1/2, 1/2)
     Testing: "hate book". What is P(Y|X)?
     With α = 0 (no smoothing):
       Counts per class:   I: 1, 0   hate: 1, 0   love: 0, 1   this: 1, 1   book: 1, 1
       P(w|Y) per class:   I: 1/4, 0   hate: 1/4, 0   love: 0, 1/3   this: 1/4, 1/3   book: 1/4, 1/3
       P(Y|X) ∝ (1/2 · 1/4 · 1/4,  1/2 · 0 · 1/3) → (1, 0)
     With α = 1 (add-one smoothing):
       Smoothed counts:    I: 2, 1   hate: 2, 1   love: 1, 2   this: 2, 2   book: 2, 2
       P(w|Y) per class:   I: 2/9, 1/8   hate: 2/9, 1/8   love: 1/9, 2/8   this: 2/9, 2/8   book: 2/9, 2/8
       P(Y|X) ∝ (1/2 · 2/9 · 2/9,  1/2 · 1/8 · 2/8) → (0.613, 0.387)
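
A short sketch that recomputes the smoothed case above with the same five-word vocabulary and training sentences (exact arithmetic gives ≈ 0.612 / 0.388, which the slide reports as 0.613 / 0.387):

```python
# Recompute P(Y | "hate book") with add-one smoothing, as in the slide's example.
vocab = ["i", "hate", "love", "this", "book"]
train = {1: "i hate this book".split(), 2: "love this book".split()}
prior = {1: 0.5, 2: 0.5}

def p_w(word, c, alpha=1):
    # (count(w, c) + alpha) / (total words in class c + alpha * |V|)
    return (train[c].count(word) + alpha) / (len(train[c]) + alpha * len(vocab))

scores = {c: prior[c] * p_w("hate", c) * p_w("book", c) for c in train}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})
# {1: 0.6124..., 2: 0.3875...}  (the slide reports these as 0.613 and 0.387)
```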

  21. Example (continued)
     P(Y) = (1/2, 1/2)
     Smoothed likelihoods:  I: 2/9, 1/8   hate: 2/9, 1/8   love: 1/9, 2/8   this: 2/9, 2/8   book: 2/9, 2/8
     P(Y|X) ∝ (1/2 · 2/9 · 2/9,  1/2 · 1/8 · 2/8) → (0.613, 0.387)

  22. Ways Naïve Bayes Is Not So Naïve
     • Very fast, with low storage requirements
     • Robust to irrelevant features: irrelevant features cancel each other out without affecting results
     • Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially with little data)
     • Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
     • A good, dependable baseline for text classification, though other classifiers give better accuracy
     [Jurafsky and Martin]

  23. NLP
