
Understanding IR Models for Ranking Tasks
Explore the role of Information Retrieval (IR) models such as the Vector-Space Model (VSM) and the Boolean model in ranking tasks. Learn how similarity-based and probabilistic models compute relevance functions, how VSM represents documents as vectors, and why similarity functions such as the dot product matter in IR systems.
Presentation Transcript
CS246: VSM and TFIDF
Junghoo "John" Cho, UCLA
IR as a Ranking Problem
Given a query q and documents d in a corpus D, compute a ranking function f(d, q) such that if f(d_1, q) > f(d_2, q), then d_1 is more relevant than d_2.
Q: How can we compute the scoring function f(d, q)?
IR Models for the Ranking Problem
Many different models exist for computing f(d, q):
- Similarity-based models: f(d, q) = similarity(d, q). How do we compute similarity? Vector-space model (VSM).
- Probabilistic models: f(d, q) = Pr{d is relevant to q}. How do we compute this probability? Classic probabilistic model, language model, ...
Interestingly, different models often lead to very similar ranking functions: TFIDF (VSM) and BM25 (probabilistic) are the two most popular ranking functions.
Similarity-Based: Vector-Space Model (VSM)
Represent each document (and the query) as a vector:
- Each dimension corresponds to a term or concept.
- The value in each dimension captures the relevance of the document to that concept.
Relevance is measured by the similarity of two vectors. VSM itself does not tell us how the dimensions and their relevance values are chosen; that is the topic of the next discussion.
[Figure: documents d_1 through d_4 and a query plotted as vectors in a space whose axes are "movies", "school", and "car".]
Boolean Model: VSM Interpretation
Matrix representation of the Boolean model: a document-term matrix with a Boolean value (0 or 1) in each entry.

         school  movies  car  ...
  d_1      1       0      0   ...
  d_2      0       0      1   ...
  d_3      0       1      1   ...

Each document is a {0, 1} binary vector.
Q: How many dimensions does each document vector have?
A document is relevant to a query if its entries are 1 for all query terms.
Q: Any way to extend this to incorporate ranking? (See the sketch below.)
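As a minimal sketch of Boolean retrieval over such a matrix (the vocabulary, documents, and helper names below are illustrative, not from the slides):

```python
# A minimal sketch of Boolean retrieval: documents as {0,1} term vectors.
VOCAB = ["school", "movies", "car"]

docs = {
    "d1": {"school"},
    "d2": {"car"},
    "d3": {"movies", "car"},
}

def to_binary_vector(terms: set[str]) -> list[int]:
    """One dimension per vocabulary term: 1 if present, 0 if not."""
    return [1 if t in terms else 0 for t in VOCAB]

def boolean_match(query: set[str], doc_terms: set[str]) -> bool:
    """A document matches if every query term is present (AND semantics)."""
    return query <= doc_terms

query = {"movies", "car"}
for name, terms in docs.items():
    print(name, to_binary_vector(terms), boolean_match(query, terms))
# Only d3 matches; the model gives no ranking among matching documents.
```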
Similarity Function: Dot Product
d = (d_1, d_2, ..., d_n), q = (q_1, q_2, ..., q_n), where d_i (or q_i) = 1 if term i is present and 0 if not.
sim(d, q) = d · q = d_1 q_1 + ... + d_n q_n (dot product)
Q: What is the meaning of this similarity function?
Q: Can we further improve it?
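On binary vectors, the dot product simply counts how many query terms a document contains, which already yields a ranking. A small sketch (the vectors are illustrative):

```python
def dot(u: list[int], v: list[int]) -> int:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

q  = [0, 1, 1]   # query containing terms 2 and 3
d1 = [1, 0, 1]   # shares one query term  -> score 1
d2 = [0, 1, 1]   # shares both query terms -> score 2
print(dot(q, d1), dot(q, d2))  # 1 2: d2 ranks above d1
```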
General VSM
Instead of {0, 1}, assign a real-valued weight to each matrix entry, reflecting the relevance or importance of the term to the document.
Q: How should the weights be assigned? Ask Arturo?
Term Frequency (TF)
Q: Which one is more relevant to "UCLA"?
d_1 = {UCLA, UCLA, UCLA, USC}
d_2 = {UCLA, USC, USC, USC}
Term frequency tf_{t,d}: the number of occurrences of the term t in document d. It measures the relevance of t to d.
TF Scaling
Q: Do 10 occurrences of t signify 10 times more relevance?
Popular TF scaling functions:
- Linear: tf_{t,d}
- Log: 1 + log(tf_{t,d}) (if tf_{t,d} > 0)
- BM25: (k + 1) tf_{t,d} / (tf_{t,d} + k)   (k: parameter)
Q: What is the maximum TF for each scaling function?
BM25 enforces an upper bound on TF: as tf_{t,d} grows, its value approaches k + 1.
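A quick sketch comparing the three scaling functions (the parameter value and sample counts are chosen arbitrarily for illustration):

```python
import math
from collections import Counter

def tf_linear(tf: int) -> float:
    return float(tf)

def tf_log(tf: int) -> float:
    return 1 + math.log(tf) if tf > 0 else 0.0

def tf_bm25(tf: int, k: float = 1.2) -> float:
    # Saturates: approaches k + 1 as tf grows.
    return (k + 1) * tf / (tf + k)

tf = Counter(["UCLA", "UCLA", "UCLA", "USC"])["UCLA"]
print("tf of UCLA in d1:", tf)  # 3
for raw in (1, 3, 10, 100):
    print(raw, tf_linear(raw), round(tf_log(raw), 2), round(tf_bm25(raw), 2))
# Linear grows without bound, log grows slowly, BM25 never exceeds k + 1.
```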
Inverse Document Frequency (IDF)
q = {Princeton, university}
d_1 = {university, admission}, d_2 = {Princeton, admission}
Q: Intuitively, which one is more relevant?
Q: Using TF weights and dot-product similarity, which one is more relevant?
q = (Princeton: 1, university: 1, admission: 0)
d_1 = (Princeton: 0, university: 1, admission: 1)
d_2 = (Princeton: 1, university: 0, admission: 1)
q · d_1 = 1, q · d_2 = 1
Q: Any way to capture our intuition?
Inverse Document Frequency (IDF)
- Document frequency (df_t): the number of documents containing the term t.
- Collection frequency (cf_t): the total number of occurrences of t in the corpus.
- Inverse document frequency: idf_t = log(N / df_t), where N is the number of documents in the corpus.
IDF measures term specificity: rare words should be considered important because they are more specific. "Princeton" carries more information than "university".
IDF: Why log(N / df_t)?
Say N = 1,000,000, and compare df_t = 10 vs. df_t = 10,000.
Q: What is the difference between the two terms without log()? What is the difference with log()?
Information-theoretic view: the informational content of observing an event is -log(p), where p is the probability of the event (Shannon's information theory).
- df_t / N is the probability of t appearing in a random document.
- -log(df_t / N) = log(N / df_t) is therefore the informational content of the term t, i.e., idf_t.
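A small numeric check of the two views, using the slide's example numbers:

```python
import math

N = 1_000_000
for df in (10, 10_000):
    raw = N / df             # without log: 100000 vs. 100 (a 1000x gap)
    idf = math.log(N / df)   # with log: ~11.5 vs. ~4.6 (an additive gap ~6.9)
    print(df, raw, round(idf, 2))
# The log turns a multiplicative rarity difference into an additive one,
# matching -log(df/N), the information content of observing the term.
```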
TFIDF Weighting
A term t is important for document d:
- if t appears many times in d (tf_{t,d}), and
- if t is a rare term (idf_t).
TF·IDF weighting: use tf_{t,d} × idf_t as the weight (importance) of term t for document d.
Each component of a document vector corresponds to a term in the vocabulary.
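A minimal sketch of building TFIDF document vectors (the tiny corpus and tokenization are illustrative):

```python
import math
from collections import Counter

corpus = {
    "d1": ["university", "admission"],
    "d2": ["princeton", "admission"],
}

N = len(corpus)
df = Counter(t for tokens in corpus.values() for t in set(tokens))

def tfidf_vector(tokens: list[str]) -> dict[str, float]:
    """Weight of term t in document d: tf_{t,d} * log(N / df_t)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf_vector(corpus["d2"]))
# "princeton" (df = 1) gets a higher weight than "admission" (df = 2).
```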
Example: TFIDF Weighting
q = {Princeton, university}
d_1 = {university, admission}, d_2 = {Princeton, admission}
idf(Princeton) = 3, idf(university) = idf(admission) = 1
Q: Given q, what should f(d, q) be for d_1 and d_2?
A: First, represent both as vectors using TFIDF:

         UCLA  Princeton  university  admission
  d_1     0        0          1           1
  d_2     0        3          0           1
  q       0      1 (×3)       1           0

Depending on the system, IDF may or may not be included in q (the "×3" applies only if it is).
TFIDF and Similarity
Using the TFIDF vectors above, take the dot product between the query and document vectors as the similarity:
q · d_1 = 1 × 1 = 1
q · d_2 = 1 × 3 = 3
The dot product is the sum of the query-word weights of a document (assuming IDF is not included in the query vector).
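The same computation as code, with the idf values hard-coded from the slide's example:

```python
# idf values from the example; tf is 1 for every present term.
idf = {"Princeton": 3, "university": 1, "admission": 1}

d1 = {"university", "admission"}
d2 = {"Princeton", "admission"}
q  = {"Princeton", "university"}

def score(query: set[str], doc: set[str]) -> float:
    """Dot product = sum of TFIDF weights of query terms present in doc."""
    return sum(idf[t] for t in query & doc)

print(score(q, d1), score(q, d2))  # 1 3: d2 now ranks above d1
```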
Document Length Normalization
q = {UCLA}
d_1 = {UCLA, MIT, USC, UCLA, MIT, USC}
d_2 = {UCLA}
idf(UCLA) = idf(MIT) = idf(USC) = 3

         UCLA  MIT  USC
  d_1    2×3   2×3  2×3
  d_2     3     0    0
  q       1     0    0

Q: Intuitively, which document is more relevant to the query?
Q: If we use the dot product as relevance, which document is considered more relevant?
Q: Why do we get this result?
Document Length Normalization
We want to penalize long documents with many words; otherwise, a document that contains every word many times would be considered the most relevant to all queries! So we decrease the relevance of longer documents.
Cosine Similarity
Cosine similarity: the dot product of the q and d vectors divided by their lengths:
cos θ = (q · d) / (|q| |d|)
- Division by |d| penalizes long documents.
- Division by |q| does not affect the relative ranking of documents.
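A minimal sketch of cosine similarity applied to the length-normalization example above:

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Dot product divided by the Euclidean (L2) lengths of both vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v))

q  = {"UCLA": 1}
d1 = {"UCLA": 6, "MIT": 6, "USC": 6}  # long document, weights tf * idf = 2 * 3
d2 = {"UCLA": 3}                      # short document, weight 1 * 3
print(round(cosine(q, d1), 3), round(cosine(q, d2), 3))
# ~0.577 vs. 1.0: after length normalization d2 ranks above d1.
```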
Document Length Normalization
The document-length term |d| can be:
- Euclidean length (L2 norm): sqrt(w_1^2 + w_2^2 + ... + w_n^2)
- Sum of weights (L1 norm): w_1 + w_2 + ... + w_n
- Word count: |d|
- Size of the document in bytes
Pivoted Length Normalization
Set the normalization term n_d to 1 for a document of average length, and adjust the length penalty by a parameter b:
n_d = 1 - b + b · (|d| / avgdl)
- When b = 0: n_d = 1, i.e., no length penalty.
- When b = 1: n_d = |d| / avgdl, i.e., the full length penalty.
The larger b, the larger the length penalty.
[Figure: n_d plotted against |d|; the b = 1 line passes through (avgdl, 1), while the b = 0 line is flat at 1.]
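A sketch of the pivoted normalizer (the document lengths and the choice b = 0.75 are illustrative):

```python
def pivoted_norm(doc_len: int, avgdl: float, b: float = 0.75) -> float:
    """n_d = 1 - b + b * (|d| / avgdl); equals 1 for an average-length doc."""
    return 1 - b + b * (doc_len / avgdl)

avgdl = 100.0
for length in (50, 100, 400):
    print(length, pivoted_norm(length, avgdl))
# 0.625, 1.0, 3.25: shorter-than-average docs are boosted, longer penalized.
```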
Two Most Popular Similarity Functions
1. TFIDF:
   f(q, d) = (1 / |d|) Σ_{t ∈ q} tf_{t,d} · log(N / df_t)
2. BM25 (Okapi):
   f(q, d) = Σ_{t ∈ q} [ (k_1 + 1) tf_{t,d} / (tf_{t,d} + k_1 (1 - b + b · |d| / avgdl)) ] · log( (N - df_t + 0.5) / (df_t + 0.5) )
More discussion on BM25 when we cover probabilistic models.
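Both functions as a sketch. The corpus statistics are passed in, and the defaults k_1 = 1.2, b = 0.75 are common choices rather than values mandated by the slides:

```python
import math

def tfidf_score(tf_q, doc_len, N, df):
    """f(q,d) = (1/|d|) * sum over query terms of tf_{t,d} * log(N/df_t)."""
    return sum(tf * math.log(N / df[t]) for t, tf in tf_q.items()) / doc_len

def bm25_score(tf_q, doc_len, avgdl, N, df, k1=1.2, b=0.75):
    """Okapi BM25: pivoted length normalization inside the TF term."""
    score = 0.0
    for t, tf in tf_q.items():
        norm = k1 * (1 - b + b * doc_len / avgdl)
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        score += (k1 + 1) * tf / (tf + norm) * idf
    return score

# tf_q: term frequency in the document for each query term (made-up data).
df = {"princeton": 20, "university": 5_000}
tf_q = {"princeton": 2, "university": 1}
print(tfidf_score(tf_q, doc_len=120, N=10_000, df=df))
print(bm25_score(tf_q, doc_len=120, avgdl=100, N=10_000, df=df))
```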
Finding High Cosine-Similarity Documents
Q: How do we find the documents with the highest cosine similarity in the corpus?
Q: Any way to avoid a complete scan of the corpus?
Key idea: q · d_i = 0 if d_i has no query words, so we need to consider d_i only if it contains at least one query word.
Inverted Index for TFIDF Cosine Similarity
Inverted index: a mapping from each word to the documents containing it. Use it to filter out any document with q · d_i = 0.
Q: Can we do better? Can we use it to compute f(q, d) = (1 / |d|) Σ_{t ∈ q} tf_{t,d} · idf_t as well?
[Figure: a lexicon mapping words (Stanford, UCLA, MIT) to IDF values (1/3530, 1/9860, 1/937), each pointing to a posting list of (docid, TF, 1/|d|) entries, e.g. docids 12, 26, 96 with TFs 2, 30, 8 and stored 1/|d| values.]
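A term-at-a-time scoring sketch over such an index. The structure mirrors the figure, but all names and values below are made up for illustration:

```python
import math
from collections import defaultdict

# Lexicon: term -> idf; postings: term -> list of (docid, tf);
# doc_norm: docid -> 1/|d|.
lexicon = {"ucla": math.log(100 / 3), "mit": math.log(100 / 5)}
postings = {"ucla": [(12, 2), (26, 30)], "mit": [(12, 1), (96, 8)]}
doc_norm = {12: 1 / 40, 26: 1 / 500, 96: 1 / 60}

def rank(query_terms: list[str]) -> list[tuple[int, float]]:
    """Accumulate tf * idf / |d| per document, touching only posting lists."""
    scores = defaultdict(float)
    for t in query_terms:
        if t not in lexicon:
            continue
        for docid, tf in postings[t]:
            scores[docid] += tf * lexicon[t] * doc_norm[docid]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank(["ucla", "mit"]))  # only docs containing a query term are scored
```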
Three Key Data Structures for Ranking
1. Lexicon: all terms and their collection-wide statistics, such as idf_t.
2. Postings list: (term, document)-specific information: tf_{t,d} or the normalized tf_{t,d} / |d|; the field in which the term appears; the location of the term in the document (more discussion later); font size.
3. Document feature table: document length |d|, document popularity, document creation date.
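One way these three structures might be laid out in code. The field choices follow the slide; the type names themselves are mine:

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    docid: int
    tf: int                       # or a pre-normalized tf / |d|
    positions: list[int] = field(default_factory=list)  # term locations
    fields: list[str] = field(default_factory=list)     # e.g. title, body

@dataclass
class LexiconEntry:
    idf: float                    # collection-wide statistic for the term
    postings: list[Posting] = field(default_factory=list)

@dataclass
class DocFeatures:
    length: int                   # |d|
    popularity: float             # e.g. a link-based score
    created: str                  # creation date

lexicon: dict[str, LexiconEntry] = {}
doc_table: dict[int, DocFeatures] = {}
```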
Summary
- Vector-space model (VSM)
- TF·IDF weight: TF, IDF, length normalization
- BM25 (VSM interpretation)
- Cosine similarity
- Inverted index for document ranking