
Probabilistic Language Modeling in Natural Language Processing
Explore language models, their applications, and computation of probabilities in NLP. Learn about joint probabilities, chain rule, conditional probabilities, and how language models play a vital role in tasks like machine translation, sentence completion, spell correction, and more.
Presentation Transcript
Natural Language Processing: Language Modeling. Demetris Paschalides, Department of Computer Science, University of Cyprus.
What is a Language Model? A language model computes the probability of a sentence or word sequence, i.e., the joint probability of W: P(W) = P(w1, w2, w3, ..., wn), or the probability of an upcoming word, i.e., a conditional probability: P(w5 | w1, w2, w3, w4). For example: "Please turn off your cell _____", "Your program does not _____". A model able to compute either P(W) or P(w5 | w1, w2, w3, w4) is called a language model.
Language Model Applications. Machine translation: p(strong winds) > p(large winds). Sentence completion: P(Today is Tuesday) > P(Tuesday Today is). Spell correction: "The office is about 15 minutes from my house" — p(15 minutes from my house) > p(15 minuets from my house). Speech recognition: p(I saw a van) >> p(eyes awe of an). Also summarization, question answering, handwriting recognition, etc.
Language Model Applications (figure). Image from Dr. Diyi Yang, CS 4650/7650, Georgia Tech.
Computing the Probability P(W). Let's compute the joint probability P(W) = P(its, water, is, so, transparent, that). Intuition: rely on the chain rule of probability. From the definition of conditional probability, P(B | A) = P(A, B) / P(A), so P(A, B) = P(A) P(B | A). In general: P(x1, x2, x3, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1). So P(its water is so transparent) = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so). How do we compute these conditional probabilities?
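As a concrete illustration of the chain-rule decomposition, here is a minimal Python sketch; the cond_prob(word, history) helper is a hypothetical placeholder for whatever estimator the later slides provide, not something defined in the slides themselves.

    # Chain rule: P(w1..wn) = product over i of P(wi | w1..wi-1)
    def sentence_probability(words, cond_prob):
        prob = 1.0
        for i, word in enumerate(words):
            prob *= cond_prob(word, tuple(words[:i]))  # P(wi | w1..wi-1)
        return prob

    # e.g. sentence_probability("its water is so transparent".split(), cond_prob)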
Bag-of-Words with N-Grams. An n-gram is a contiguous sequence of n tokens from a given piece of text. Image from http://recognize-speech.com/language-model/n-gram-model/comparison
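A short sketch of n-gram extraction (the function name and whitespace tokenization are illustrative assumptions, not from the slides):

    def ngrams(tokens, n):
        # contiguous n-token sequences, returned as tuples
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "its water is so transparent that".split()
    print(ngrams(tokens, 2))  # bigrams: ('its', 'water'), ('water', 'is'), ...
    print(ngrams(tokens, 3))  # trigrams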
N-Gram Models: unigram model, bigram model, trigram model, ..., n-gram model. A direct estimate such as P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that) does not work: there are too many possible sentences and not enough data.
Markov Assumption (Andrey Markov). Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that), or maybe P(the | its water is so transparent that) ≈ P(the | transparent that). In general, approximate each component using only the previous k-1 words: P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-k+1, ..., wi-1).
Unigram Model (k=1). Words are generated independently, so the "sentences" don't make sense. Examples generated from a unigram model: fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass; thrift, did, eighty, said, hard, july, bullish.
Bigram Model (k=2). Each word is conditioned on the previous word. Examples generated from a bigram model: outside, new, car, parking, lot, of, the, agreement, reached; this, would, be, a, record, november.
Limitations. We can extend to trigrams, 4-grams, 5-grams, etc., but n-grams are generally insufficient language models, because language contains long-distance word dependencies: "The computer I had just put into the machine room on the fifth floor crashed." "He is from France, so it makes sense that his first language is ________." Still, we can often get away with n-gram models.
Maximum Likelihood Estimate. Estimating bigram probabilities: P(wi | wi-1) = count(wi-1, wi) / count(wi-1). Example corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I | <s>) = 2/3   P(Sam | <s>) = 1/3   P(am | I) = 2/3
P(</s> | Sam) = 1/2   P(Sam | am) = 1/2   P(do | I) = 1/3
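These values can be reproduced in a few lines of Python; the following is a minimal sketch (the counting code and variable names are mine, not from the slides):

    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def p_mle(word, prev):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(p_mle("I", "<s>"))     # 2/3
    print(p_mle("am", "I"))      # 2/3
    print(p_mle("</s>", "Sam"))  # 1/2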
Practical Issues. We do everything in log space: it avoids numerical underflow, and adding is faster than multiplying: log(p1 p2 p3 p4) = log p1 + log p2 + log p3 + log p4.
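A tiny sketch of the log-space version, again assuming a hypothetical bigram_prob(word, prev) estimator such as p_mle above:

    import math

    def sentence_logprob(words, bigram_prob):
        # sum of log P(wi | wi-1) instead of a product of tiny probabilities
        logp = 0.0
        for prev, word in zip(words, words[1:]):
            logp += math.log(bigram_prob(word, prev))
        return logp  # math.exp(logp) recovers the probability if it does not underflow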
Berkeley Restaurant Project. A corpus of 9,222 sentences, for example: Can you tell me about any good Cantonese restaurants close by. Mid-priced Thai food is what I'm looking for. Tell me about Chez Panisse. Can you give me a listing of the kinds of food that are available. I'm looking for a good place to eat breakfast. When is Caffe Venezia open during the day.
Raw Bigram Counts (rows = first word wi-1, columns = second word wi):

         I     want  to    eat   Chinese  food  lunch  spend
I        5     827   0     9     0        0     0      2
want     2     0     608   1     6        6     5      1
to       2     0     4     686   2        0     6      211
eat      0     0     2     0     16       2     42     0
Chinese  1     0     0     0     0        82    1      0
food     15    0     15    0     1        4     0      0
lunch    2     0     0     0     0        1     0      0
spend    1     0     1     0     0        0     0      0
Normalize by Unigrams: divide each bigram count by the unigram count of its first word. Unigram counts: I 2533, want 927, to 2417, eat 746, Chinese 158, food 1093, lunch 341, spend 278. (The bigram counts are the raw counts in the table above.)
Calculating P(wi | wi-1) = count(wi-1, wi) / count(wi-1), using the unigram counts I 2533, want 927, to 2417, eat 746, Chinese 158, food 1093, lunch 341, spend 278:

         I        want     to       eat      Chinese  food     lunch    spend
I        0.002    0.33     0        0.0036   0        0        0        0.00079
want     0.0022   0        0.66     0.0011   0.0065   0.0065   0.0054   0.0011
to       0.00083  0        0.0017   0.28     0.00083  0        0.0025   0.087
eat      0        0        0.0027   0        0.021    0.0027   0.056    0
Chinese  0.0063   0        0        0        0        0.52     0.0063   0
food     0.014    0        0.014    0        0.00092  0.0037   0        0
lunch    0.0059   0        0        0        0        0.0029   0        0
spend    0.0036   0        0.0036   0        0        0        0        0
What did we learn? P(i | <s>) = 0.25, P(english | want) = 0.0011, P(chinese | want) = 0.0065, P(to | want) = 0.66, P(eat | to) = 0.28, P(food | to) = 0.00, P(want | spend) = 0.00. Some of these numbers reflect the specific corpus (people ask about Chinese food more than English food here); others reflect the way the English language behaves.
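As a quick check of how these entries combine, the sketch below multiplies bigram probabilities from the tables above to score "<s> i want chinese food" (the dictionary holds only the entries needed here, lower-cased; it is an illustration, not part of the slides):

    bigram_p = {
        ("<s>", "i"): 0.25,
        ("i", "want"): 0.33,
        ("want", "chinese"): 0.0065,
        ("chinese", "food"): 0.52,
    }

    def score(tokens):
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_p.get((prev, word), 0.0)  # unseen bigram -> probability 0
        return p

    print(score("<s> i want chinese food".split()))  # ~0.00028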
Language Generation: approximating Shakespeare (figure with generated samples).
Language Generation: approximating the Wall Street Journal (figure with generated samples). More: https://nbviewer.org/gist/yoavg/d76121dfde2618422139
Shakespeare as a Corpus: N = 884,647 tokens, V = 29,066 word types. Shakespeare produced about 300,000 bigram types out of V² = 844M possible bigrams, so 99.96% of the possible bigrams were never seen (zero entries). Quadrigrams are worse: the output looks like Shakespeare because it actually is Shakespeare. This is data overfitting.
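The sparsity figure follows directly from these numbers; a quick arithmetic check:

    V = 29_066                       # word types in Shakespeare
    possible_bigrams = V * V         # 844,832,356, i.e. ~844M
    seen_bigrams = 300_000           # bigram types Shakespeare actually used
    print(100 * (1 - seen_bigrams / possible_bigrams))  # ~99.96% of bigrams never seen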
Data Overfitting. N-gram models perform well for word prediction if the test corpus is similar to the training corpus; in reality, this rarely occurs. Language models should be robust and able to generalize, but unsmoothed n-grams assign zero probability to anything not seen in training. Training set: denied the allegations, denied the reports, denied the claims, denied the request. Test set: denied the offer, denied the loan. Then P(offer | denied the) = 0.
Smoothing. Intuition: our statistics are sparse, so we steal probability mass from seen events in order to generalize to unseen ones.
Laplace (Add-one) Smoothing: add one to every bigram count. MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1). Add-one estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size.
Add-one Smoothing Example (context "xy", vocabulary a-z, V = 26):

          count  MLE      count+1  add-one estimate
xy a      100    100/300  101      101/326
xy b      0      0/300    1        1/326
xy c      0      0/300    1        1/326
xy d      200    200/300  201      201/326
xy e      0      0/300    1        1/326
...
xy z      0      0/300    1        1/326
Total xy  300    300/300  326      326/326
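A compact sketch that reproduces the table above (variable names are mine; the vocabulary is the 26 letters a-z):

    import string

    vocab = list(string.ascii_lowercase)   # V = 26
    counts = {"a": 100, "d": 200}          # observed continuations of "xy"
    total = sum(counts.values())           # 300

    def p_mle(w):
        return counts.get(w, 0) / total                       # e.g. 0/300 for "b"

    def p_add_one(w):
        return (counts.get(w, 0) + 1) / (total + len(vocab))  # (c + 1) / (300 + 26)

    print(p_add_one("a"))  # 101/326
    print(p_add_one("b"))  # 1/326
    print(p_add_one("d"))  # 201/326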
Berkeley Restaurant Corpus: add-one smoothed bigram counts (each raw count plus one):

         I     want  to    eat   Chinese  food  lunch  spend
I        6     828   1     10    1        1     1      3
want     3     1     609   2     7        7     6      2
to       3     1     5     687   3        1     7      212
eat      1     1     3     1     17       3     43     1
Chinese  2     1     1     1     1        83    2      1
food     16    1     16    1     2        5     1      1
lunch    3     1     1     1     1        2     1      1
spend    2     1     2     1     1        1     1      1
Laplace-Smoothed Bigram Probabilities (V = 1446 in the Berkeley Restaurant Project corpus; unigram counts: I 2533, want 927, to 2417, eat 746, Chinese 158, food 1093, lunch 341, spend 278):

         I        want     to       eat      Chinese  food     lunch    spend
I        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
Chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058
Reconstructed Count Matrix (effective counts implied by add-one smoothing):

         I      want   to     eat    Chinese  food   lunch  spend
I        3.8    527    0.64   6.4    0.64     0.64   0.64   1.9
want     1.2    0.39   238    0.78   2.7      2.7    2.3    0.78
to       1.9    0.63   3.1    430    1.9      0.63   4.4    133
eat      0.34   0.34   1      0.34   5.8      1      15     0.34
Chinese  0.2    0.098  0.098  0.098  0.098    8.2    0.2    0.098
food     6.9    0.43   6.9    0.43   0.86     2.2    0.43   0.43
lunch    0.57   0.19   0.19   0.19   0.19     0.38   0.19   0.19
spend    0.32   0.16   0.32   0.16   0.16     0.16   0.16   0.16
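These reconstructed counts come from turning the add-one probabilities back into counts, c*(wi-1, wi) = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V). A brief sketch with a few entries from the tables above (the dictionaries are partial and the names are mine):

    V = 1446  # vocabulary size of the Berkeley Restaurant Project corpus
    unigrams = {"i": 2533, "want": 927, "to": 2417, "eat": 746}
    raw = {("i", "want"): 827, ("want", "to"): 608, ("eat", "lunch"): 42}

    def reconstructed(prev, w):
        c = raw.get((prev, w), 0)
        return (c + 1) * unigrams[prev] / (unigrams[prev] + V)

    print(round(reconstructed("i", "want")))     # ~527
    print(round(reconstructed("want", "to")))    # ~238
    print(round(reconstructed("eat", "lunch")))  # ~15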
Compare with Raw Counts: compare the raw bigram counts with the reconstructed counts after add-one smoothing (the two tables above). For example, count(want, to) drops from 608 to 238 and count(Chinese, food) from 82 to 8.2: add-one smoothing moves a large amount of probability mass to the unseen bigrams.
Add-one Estimation is a Blunt Instrument. It is not used for n-gram language models in practice. However, it can be used to smooth other NLP models, such as text classification models, and in domains where the number of zero counts is not as significant.
Web-scale N-Grams: dealing with a massive n-gram corpus, e.g., Google's N-grams. Pruning: only store n-grams with count > threshold; entropy-based pruning. Efficiency: efficient data structures (e.g., tries); Bloom filters (approximate language models); storing words as indexes (e.g., Huffman coding).
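For the count-threshold pruning idea, a toy sketch (the counts and the threshold of 5 are made up for illustration):

    bigram_counts = {("strong", "winds"): 40, ("large", "winds"): 2, ("cell", "phone"): 90}

    THRESHOLD = 5
    pruned = {ng: c for ng, c in bigram_counts.items() if c > THRESHOLD}
    # only ("strong", "winds") and ("cell", "phone") survive; rare n-grams are
    # dropped to keep a web-scale model small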
Google N-Grams: Google Ngram Viewer: https://books.google.com/ngrams/