Enhancing Natural Language Processing Models with Factor Graphs & Belief Propagation

Learn how to advance NLP models beyond traditional approaches by incorporating structure through factor graphs and utilizing belief propagation algorithms. Explore how to improve accuracy and efficiency in linguistic analysis with practical insights and strategies.

  • NLP
  • Factor Graphs
  • Belief Propagation
  • Algorithms
  • Linguistic Structures

Presentation Transcript


  1. Section 4: Incorporating Structure into Factors and Variables

  2. Outline: Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you!
     1. Models: Factor graphs can express interactions among linguistic structures.
     2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
     3. Intuitions: What's going on here? Can we trust BP's estimates?
     4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
     5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
     6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
     7. Software: Build the model you want!

  4. BP for Coordination of Algorithms: Each factor is tractable by dynamic programming; the overall model is no longer tractable, but BP lets us pretend it is. [Figure: tagger, parser, and aligner factors jointly coordinate tag, parse, and alignment variables for the sentence pair "the white house" / "la casa blanca".]

  6. Sending Messages: Computational Complexity. From variables: sending the messages from a variable to its neighboring factors costs O(d·k), where d = # of neighboring factors and k = # of possible values for Xi. To variables: sending the messages from a factor to its neighboring variables costs O(d·k^d), where d = # of neighboring variables and k = maximum # of possible values for a neighboring variable.
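
A minimal sketch (my own, not from the tutorial) of the two update rules above, to make the costs concrete: the factor-to-variable message enumerates all k^d joint assignments, while the variable-to-factor message only multiplies the other incoming vectors. The function names and array layout are assumptions for illustration; structured factors exist precisely to avoid the exponential loop in the first function.

```python
import itertools
import numpy as np

def factor_to_variable_message(potential, msgs_in, i):
    """Brute-force BP message from a factor to its i-th neighboring variable.

    potential: d-dimensional array with one axis per neighboring variable
               (k values each), giving the factor value psi(x_1, ..., x_d).
    msgs_in:   list of d vectors; msgs_in[j] is the message from variable j
               into this factor (msgs_in[i] is ignored, per the BP update).
    The sum ranges over all k^d assignments, which is why naive factor
    updates are exponential in the factor's degree.
    """
    d, k = potential.ndim, potential.shape[0]
    out = np.zeros(k)
    for assignment in itertools.product(range(k), repeat=d):
        term = potential[assignment]
        for j in range(d):
            if j != i:
                term *= msgs_in[j][assignment[j]]
        out[assignment[i]] += term
    return out

def variable_to_factor_message(msgs_in, a):
    """BP message from a variable to its a-th neighboring factor: the product
    of the messages arriving from all *other* neighboring factors, so all d
    outgoing messages together cost only O(d * k)."""
    out = np.ones_like(msgs_in[0])
    for b, msg in enumerate(msgs_in):
        if b != a:
            out = out * msg
    return out
```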

  8. INCORPORATING STRUCTURE INTO FACTORS

  9. Unlabeled Constituency Parsing. Given: a sentence. Predict: an unlabeled parse. We could predict whether each span is present (T) or not (F). [Figure: one binary variable per span of "time flies like an arrow", each with its own unary factor.] (Naradowsky, Vieira, & Smith, 2012)

  15. Unlabeled Constituency Parsing. Sending a message to a variable from its unary factor takes only O(d·k^d) time, where k=2 and d=1. (Naradowsky, Vieira, & Smith, 2012)

  17. Unlabeled Constituency Parsing. But nothing prevents non-tree structures. Add a CKYTree factor, which multiplies in 1 if the span variables form a tree and 0 otherwise. (Naradowsky, Vieira, & Smith, 2012)

  19. Unlabeled Constituency Parsing. How long does it take to send a message to a variable from the CKYTree factor? For the given sentence, O(d·k^d) time, where k=2 and d=15. For a length-n sentence there are O(n^2) span variables, so this is roughly O(n^2 · 2^(n^2)). But we know an algorithm (inside-outside) that computes all the marginals in O(n^3), so can't we do better? (Naradowsky, Vieira, & Smith, 2012)

  20. Example: The Exactly1 Factor. Variables: d binary variables X1, …, Xd. Global Factor: Exactly1(X1, …, Xd) = 1 if exactly one of the d binary variables Xi is on, 0 otherwise. (Smith & Eisner, 2008)

  26. Messages: The Exactly1 Factor. From variables: O(d·2), where d = # of neighboring factors (fast!). To variables: O(d·2^d), where d = # of neighboring variables.

  29. Messages: The Exactly1 Factor. But the outgoing messages from the Exactly1 factor are defined as a sum over the 2^d possible assignments to X1, …, Xd. Conveniently, E1(x_a) is 0 for all but d of those assignments, so the sum is sparse! So we can compute all the outgoing messages from E1 in O(d) time (fast!).

  31. Messages: The Exactly1 Factor. A factor has a belief about each of its variables. An outgoing message from a factor is the factor's belief with the incoming message divided out. We can compute the Exactly1 factor's beliefs about each of its variables efficiently: each of the shared terms needs to be computed only once for all the variables. (Smith & Eisner, 2008)
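
As an illustration of the sparsity trick on this slide (a sketch of my own, not the tutorial's code), the two shared quantities below, the product of all "off" messages and the sum of on/off ratios, are computed once and reused for every variable, giving all d outgoing messages in O(d) time. It assumes no incoming message puts zero mass on "off"; a robust version would handle zeros separately.

```python
import numpy as np

def exactly1_outgoing_messages(msgs_in):
    """Outgoing messages from an Exactly1 factor over d binary variables.

    msgs_in: array of shape (d, 2); row i holds (mu_i(off), mu_i(on)), the
    message from variable X_i into the factor (unnormalized is fine).
    Returns the d factor-to-variable messages in the same layout.
    Assumes mu_i(off) > 0 for every i.
    """
    msgs_in = np.asarray(msgs_in, dtype=float)
    off, on = msgs_in[:, 0], msgs_in[:, 1]

    # Shared terms, computed once for all variables (O(d) total work):
    prod_off = np.prod(off)      # product of every "off" message
    ratios = on / off            # r_j = mu_j(on) / mu_j(off)
    ratio_sum = ratios.sum()     # R = sum_j r_j

    out = np.empty_like(msgs_in)
    # mu_{E1->i}(on): every *other* variable must be off.
    out[:, 1] = prod_off / off
    # mu_{E1->i}(off): exactly one *other* variable is on.
    out[:, 0] = (prod_off / off) * (ratio_sum - ratios)
    return out

# Example with d = 4 variables, each slightly preferring "off".
print(exactly1_outgoing_messages([[0.7, 0.3], [0.8, 0.2], [0.6, 0.4], [0.9, 0.1]]))
```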

  32. Example: The CKYTree Factor. Variables: O(n^2) binary span variables Sij. Global Factor: CKYTree(S01, S12, …, S04) = 1 if the span variables form a constituency tree, 0 otherwise. [Figure: the span variables for a four-word sentence ("the barista made coffee").] (Naradowsky, Vieira, & Smith, 2012)

  33. Messages: The CKYTree Factor. From variables: O(d·2), where d = # of neighboring factors (fast!). To variables: O(d·2^d), where d = # of neighboring variables.

  36. Messages: The CKYTree Factor. But the outgoing messages from the CKYTree factor are defined as a sum over the O(2^(n^2)) possible assignments to {Sij}. CKYTree(x_a) is 1 for exponentially many values in the sum, but they all correspond to trees! With inside-outside we can compute all the outgoing messages from CKYTree in O(n^3) time (fast!).

  38. Example: The CKYTree Factor. For a length-n sentence, define an anchored weighted context-free grammar (WCFG): each span is weighted by the ratio of the incoming messages from the corresponding span variable. Run the inside-outside algorithm on the sentence a1, a2, …, an with the anchored WCFG. (Naradowsky, Vieira, & Smith, 2012)
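
A rough Python sketch of that recipe (my own illustration under simplifying assumptions, not the tutorial's implementation): weight every span by the ratio of its incoming messages, run inside-outside over all unlabeled binary bracketings, and read off the factor's belief that each span is present. Per the earlier slide, the outgoing message to a span variable is this belief with that variable's own incoming message divided back out. The sketch works in plain (non-log) space and assumes every incoming "false" message is strictly positive.

```python
import numpy as np

def cky_tree_span_beliefs(msg_true, msg_false):
    """Beliefs of a CKYTree factor about its span variables, via inside-outside.

    msg_true, msg_false: (n+1) x (n+1) arrays; entry [i, j] (for i < j) is the
    incoming message from span variable S_ij for the values True / False.
    Returns belief[i, j] = probability (under the factor and the incoming
    messages) that span (i, j) is in the tree.  O(n^3) time.
    """
    n = msg_true.shape[0] - 1
    # Anchored span weights: ratio of incoming messages (only i < j cells used).
    w = np.divide(msg_true, msg_false,
                  out=np.ones_like(msg_true), where=msg_false > 0)

    inside = np.zeros((n + 1, n + 1))
    for i in range(n):                              # width-1 spans
        inside[i, i + 1] = w[i, i + 1]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            splits = sum(inside[i, m] * inside[m, j] for m in range(i + 1, j))
            inside[i, j] = w[i, j] * splits

    outside = np.zeros((n + 1, n + 1))
    outside[0, n] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width
            parent = outside[i, j] * w[i, j]
            for m in range(i + 1, j):
                outside[i, m] += parent * inside[m, j]   # (i, m) as left child
                outside[m, j] += parent * inside[i, m]   # (m, j) as right child

    Z = inside[0, n]                                # total weight of all trees
    return inside * outside / Z                     # belief that each span is present
```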

  39. Example: The TrigramHMM Factor. Factors can compactly encode the preferences of an entire sub-model. Consider the joint distribution of a trigram HMM over 5 variables. It's traditionally defined as a Bayes network, but we can represent it as a (loopy) factor graph, and we could even pack all those factors into a single TrigramHMM factor. [Figure: tag variables X1, …, X5 and word variables W1, …, W5 for "time flies like an arrow", drawn first as a Bayes network and then as a factor graph.] (Smith & Eisner, 2008)

  44. Example: The TrigramHMM Factor. Variables: d discrete variables X1, …, Xd. Global Factor: TrigramHMM(X1, …, Xd) = p(X1, …, Xd) according to a trigram HMM model. Compute outgoing messages efficiently with the standard trigram HMM dynamic programming algorithm (junction tree)! (Smith & Eisner, 2008)
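
A compact sketch of what that dynamic program can look like (my own illustration, not the tutorial's code): collapse adjacent tag pairs into single "junction tree" states and run ordinary forward-backward over those pair states. Here unary[t] stands for the incoming message to tag variable X_t (times any emission score) and trans is the trigram transition table; both names are assumptions for illustration. The factor's outgoing message to X_t is the returned marginal with X_t's incoming message divided back out. Assumes at least two positions.

```python
import numpy as np

def trigram_factor_marginals(unary, trans):
    """Posterior tag marginals under a TrigramHMM-style factor.

    unary: (n, K) array; unary[t, x] = incoming message (and emission score)
           for tag value x at position t.
    trans: (K, K, K) array; trans[a, b, c] = trigram score for the tag
           sequence (x_{t-2}=a, x_{t-1}=b, x_t=c).
    Runs forward-backward over pair states (x_{t-1}, x_t) and returns an
    (n, K) array of marginals.  Requires n >= 2.
    """
    n, K = unary.shape

    alpha = np.zeros((n, K, K))                       # alpha[t][a, b]: x_{t-1}=a, x_t=b
    alpha[1] = np.outer(unary[0], unary[1])
    for t in range(2, n):
        alpha[t] = np.einsum('ab,abc->bc', alpha[t - 1], trans) * unary[t]

    beta = np.zeros((n, K, K))
    beta[n - 1] = 1.0
    for t in range(n - 2, 0, -1):
        beta[t] = np.einsum('abc,c,bc->ab', trans, unary[t + 1], beta[t + 1])

    Z = alpha[n - 1].sum()                            # partition function
    pair_post = alpha * beta / Z                      # posterior over (x_{t-1}, x_t)

    marg = np.zeros((n, K))
    marg[0] = pair_post[1].sum(axis=1)                # marginalize x_1 out of (x_0, x_1)
    for t in range(1, n):
        marg[t] = pair_post[t].sum(axis=0)            # marginalize x_{t-1}
    return marg
```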

  45. Combinatorial Factors. Usually it takes O(k^d) time to compute the outgoing messages from a factor over d variables with k possible values each. But not always:
     1. Factors like Exactly1, with only polynomially many nonzeroes in the potential table.
     2. Factors like CKYTree, with exponentially many nonzeroes, but in a special pattern.
     3. Factors like TrigramHMM, with all entries nonzero, but which factor further.

  46. Combinatorial Factors. Factor graphs can encode structural constraints on many variables via constraint factors. Example NLP constraint factors:
     • Projective and non-projective dependency parse constraint (Smith & Eisner, 2008)
     • CCG parse constraint (Auli & Lopez, 2011)
     • Labeled and unlabeled constituency parse constraint (Naradowsky, Vieira, & Smith, 2012)
     • Inversion transduction grammar (ITG) constraint (Burkett & Klein, 2012)

  47. Structured BP vs. Dual Decomposition
     • Sum-product BP. Output: approximate marginals. Structured variant: coordinates marginal inference algorithms. Example embedded algorithms: inside-outside, forward-backward.
     • Max-product BP. Output: approximate MAP assignment. Structured variant: coordinates MAP inference algorithms. Example embedded algorithms: CKY, Viterbi.
     • Dual Decomposition (Koo et al., 2010; Rush et al., 2010). Output: true MAP assignment (with branch-and-bound). Structured variant: coordinates MAP inference algorithms. Example embedded algorithms: CKY, Viterbi.
     (Duchi, Tarlow, Elidan, & Koller, 2006)

  48. Additional Resources. See the NAACL 2012 / ACL 2013 tutorial by Burkett & Klein, "Variational Inference in Structured NLP Models", for an alternative approach to efficient marginal inference for NLP: structured mean field. It also covers structured BP. http://nlp.cs.berkeley.edu/tutorials/variational-tutorial-slides.pdf
