
Probabilistic Modeling in Information Retrieval and User Queries
Explore the concept of probabilistic models in Information Retrieval, focusing on ranking principles, user query generation, language models, and estimating probabilities. Understand how these models aid in determining document relevance, ranking functions, and query generation processes.
CS246: Introduction to Probabilistic Model
Junghoo John Cho, UCLA

Probabilistic Model
- IR as a ranking problem: given a query $q$ and documents $d$, compute a ranking function $f(d, q)$ such that if $f(d_1, q) > f(d_2, q)$, then $d_1$ is more relevant than $d_2$.
- Probabilistic approach: use $f(d, q) = \Pr(R(d) = 1 \mid q)$.
- $R(d)$ is a binary relevance indicator variable: 1 if $d$ is relevant, 0 if not.
- Given a document $d$ and a query $q$, what is the probability that $d$ is relevant to $q$?

Probabilistic Ranking Principle (PRP)
- Return documents in decreasing order of their $\Pr(R(d) = 1 \mid q)$.
- [Robertson 77] proved that PRP is optimal assuming:
  - The utility of a document is independent of the utility of other documents.
  - The user looks at the results from top to bottom without skipping.
- However, how do we know the true probability $\Pr(R(d) = 1 \mid q)$?
- To compute the probability, we need a mathematical model that captures the gist of the user's querying process.
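
The ranking step of the PRP can be sketched in a few lines of Python. This is only an illustration: the function name `rank_by_relevance_probability` and the probability values are hypothetical, and the sketch assumes that an estimate of each document's relevance probability for the query is already available, which is exactly the part the slides say still needs a model.

```python
def rank_by_relevance_probability(prob_relevant):
    """Return document ids sorted by decreasing estimated relevance probability.

    prob_relevant maps a document id to an estimate of the probability
    that the document is relevant to the current query.
    """
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


# Hypothetical estimates for three documents (made-up values, illustration only).
estimates = {"d1": 0.2, "d2": 0.7, "d3": 0.5}
ranking = rank_by_relevance_probability(estimates)
print(ranking)
```

The sort itself is trivial; the hard part is producing the estimates.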

User's Query Model
- Before issuing any query, a user has in mind an ideal answer set $A$ that she wants to retrieve.
- Given $A$, the user comes up with a query $q$ that best describes $A$.
- Q: Exactly how does a user generate $q$ from $A$?
- Many different models exist for the generation of $q$ from $A$.
- Intuition: users are more likely to select a few key words in $A$.

Language Model
- A probability distribution over word sequences:
  - P("UCLA is best") ~ 0.001
  - P("USC is best") = 0
  - P("Poop grew would") ~ 0.000000001
- Used to estimate the probability that a particular word sequence is generated.
- Q: Where is it useful? A: Many different applications!
  - Spell correction: "John went there" vs "John went their"
  - Speech recognition: "Koreans love rice" vs "Koreans love lice"
- Can be used as a model of the user's query generation.
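
One way to read the spell-correction example is: among the candidate sentences, keep the one the language model considers most probable. The sketch below assumes a toy model stored as a lookup table. The table `TOY_LM`, its probabilities, and the helper names are all hypothetical and chosen only to illustrate the comparison.

```python
# Hypothetical toy language model: made-up probabilities, for illustration only.
TOY_LM = {
    ("john", "went", "there"): 1e-6,
    ("john", "went", "their"): 1e-9,
}


def sentence_probability(sentence, lm=TOY_LM):
    """Look up the probability of a sentence; unseen sentences get 0."""
    words = tuple(sentence.lower().split())
    return lm.get(words, 0.0)


def pick_most_likely(candidates, lm=TOY_LM):
    """Return the candidate sentence with the highest model probability."""
    return max(candidates, key=lambda s: sentence_probability(s, lm))


best = pick_most_likely(["John went their", "John went there"])
print(best)
```

A realistic model would not store whole sentences in a table. It would compute the probability from smaller units.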

Estimating Language Model
- Q: How do we compute P(sentence)?
- In principle, look at a large language corpus and see how many times the sentence appears.
- Example:
  - Corpus with 1,000,000,000 words
  - "UCLA is the best" appears 10,000 times
- Q: What is a reasonable estimate of P("UCLA is the best")?
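
A simple answer to the question above is the relative frequency: the number of occurrences divided by the corpus size. The sketch below is a minimal illustration under that assumption. Dividing by the number of words (rather than, say, the number of four-word windows) is a simplification, and the helper names are hypothetical.

```python
def count_occurrences(corpus_words, phrase_words):
    """Count how many times phrase_words appears contiguously in corpus_words."""
    n = len(phrase_words)
    return sum(
        1
        for i in range(len(corpus_words) - n + 1)
        if corpus_words[i:i + n] == phrase_words
    )


def relative_frequency(count, corpus_size):
    """Estimate a probability as count / corpus_size."""
    return count / corpus_size


# Counting on a tiny hypothetical corpus.
toy_corpus = "ucla is the best and ucla is the best".split()
toy_count = count_occurrences(toy_corpus, "ucla is the best".split())

# The numbers from the example: 10,000 occurrences in a 1,000,000,000-word corpus.
estimate = relative_frequency(10_000, 1_000_000_000)
print(toy_count, estimate)
```

With the numbers in the example, this gives 10,000 / 1,000,000,000 = 0.00001.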

Estimating Language Model
- Q: How do we estimate P(sentence) when the sentence was never seen?
  - "UCLA is located in a very expensive and safe neighborhood that everyone loves to visit"
- Assign P(sentence) = 0?
- We need ways to estimate P(sentence) for unseen sentences.
- Many different models exist.

Unigram Language Model
- Measure $P(w_i)$ for every word $w_i$.
- $P(w_1, w_2, \ldots, w_n) = P(w_1) P(w_2) \cdots P(w_n)$ (independence assumption)
- Simplest language model and easier to analyze.
- Less likely to be accurate, but better than no language model.
- Extensions for a more precise model are possible: bigram language model, n-gram language model.
  $P(w_1, w_2, w_3) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \approx P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) = P(w_1) \frac{P(w_1, w_2)}{P(w_1)} \frac{P(w_2, w_3)}{P(w_2)}$
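
The unigram and bigram models can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not a definitive implementation: the three-sentence corpus and all function names are hypothetical, probabilities are plain relative frequencies with no smoothing, and the conditional probability of a word given the previous word is approximated by count(previous, word) / count(previous).

```python
from collections import Counter


def train(corpus_sentences):
    """Count unigrams and adjacent-word bigrams in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def unigram_prob(words, unigrams):
    """Product of per-word relative frequencies (independence assumption)."""
    total = sum(unigrams.values())
    p = 1.0
    for w in words:
        p *= unigrams[w] / total
    return p


def bigram_prob(words, unigrams, bigrams):
    """P(first word) times P(word | previous word) for each later word."""
    if not words:
        return 1.0
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # previous word never seen, so no estimate is possible
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p


# Hypothetical toy corpus (13 words in total).
corpus = ["ucla is the best", "usc is the worst", "ucla is in los angeles"]
unigrams, bigrams = train(corpus)

unseen = "usc is the best".split()    # never appears as a whole sentence
scrambled = "best is ucla".split()    # same words as "ucla is best", reordered

p_unseen_unigram = unigram_prob(unseen, unigrams)
p_unseen_bigram = bigram_prob(unseen, unigrams, bigrams)
p_scrambled_unigram = unigram_prob(scrambled, unigrams)
p_scrambled_bigram = bigram_prob(scrambled, unigrams, bigrams)
print(p_unseen_unigram, p_unseen_bigram, p_scrambled_unigram, p_scrambled_bigram)
```

Both models assign a nonzero probability to the unseen sentence, which whole-sentence counting cannot do. The unigram model ignores word order, so it scores the scrambled sentence the same as the well-formed one, while the bigram model gives it zero here because the pair ("best", "is") never occurs in the toy corpus.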

Back to User's Query Model
- How do users generate a query and consider a document relevant?
- How can we capture this process using a mathematical model?

References
- [Robertson 77] S. E. Robertson, "The probability ranking principle in IR," Journal of Documentation, Vol. 33(4), 1977