
Understanding Relevance Feedback in Information Retrieval Systems
Explore the concept of relevance feedback in information retrieval systems, covering topics such as user feedback, query expansion, learning-based retrieval, and real-world applications. Learn how feedback can improve search results by leveraging user judgments and expanding query terms based on relevant documents.
Presentation Transcript
Relevance Feedback
Hongning Wang, CS@UVa
What we have learned so far
[Diagram: the IR system pipeline. A crawler builds the indexed corpus; a document analyzer and indexer produce document representations and the index; the user's query becomes a query representation; a ranker returns results. Evaluation and feedback close the loop, and feedback is this lecture's focus.]
Why user feedback?
An IR system is an interactive system. The user translates an information need into a query; the system infers the information need from that query and returns ranked documents. A gap remains between the true information need and the inferred one, and feedback helps close it.
Relevance feedback
[Diagram: the user issues a query against the document collection; the retrieval engine returns scored results (d1 3.5, d2 2.4, ..., dk 0.5); the user judges them (d1 +, d2 -, d3 +, ..., dk -); the judgments are fed back to produce an updated query.]
Basic idea in feedback: query expansion
Feedback documents can help discover related query terms. E.g., for the query "information retrieval", relevant docs are likely to share closely related words such as "search", "search engine", "ranking", and "query". Expanding the original query with such words increases recall, and sometimes also precision.
Basic idea in feedback: learning-based retrieval
Feedback documents can be treated as supervision for updating the ranking model. Covered in the learning-to-rank lecture.
Relevance feedback in real systems
Google used to provide such a function: explicit "Relevant" / "Non-relevant" controls next to results. Guess why it no longer does?
Relevance feedback in real systems
Popularly used in image search systems.
Pseudo relevance feedback
What if users are reluctant to provide any feedback? Simply assume the top-k retrieved documents are relevant and feed them back as if the user had judged them.
[Diagram: the same loop as explicit relevance feedback, but the judgments (d1 +, d2 +, d3 +, ..., dk -) come from taking the top-k results as relevant rather than from the user.]
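A minimal sketch of this loop, assuming hypothetical search and update_query functions; update_query could be, e.g., the Rocchio update shown below:

```python
def pseudo_relevance_feedback(query, search, update_query, k=10, rounds=1):
    """Pseudo relevance feedback: assume the top-k results are relevant
    and use them to update the query, without asking the user."""
    for _ in range(rounds):
        results = search(query)          # ranked list of document vectors
        assumed_relevant = results[:k]   # no user judgments: trust the top k
        query = update_query(query, relevant=assumed_relevant, nonrelevant=[])
    return search(query)
```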
Feedback techniques
- Feedback as query expansion
  - Step 1: term selection
  - Step 2: query expansion
  - Step 3: query term re-weighting
- Feedback as a training signal (covered in learning to rank)
Relevance feedback in vector space models
General idea: query modification
- Adding new (weighted) terms
- Adjusting weights of old terms
The most well-known and effective approach is Rocchio [Rocchio 1971].
Illustration of Rocchio feedback
[Figure: relevant (+) and non-relevant (-) documents scattered in the vector space; the original query vector q is moved toward the centroid of the relevant documents and away from the non-relevant ones, yielding the modified query qm.]
Formula for Rocchio feedback
A standard operation in vector space:
\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j
where \vec{q} is the original query vector, D_r and D_n are the sets of relevant and non-relevant feedback documents, and the parameters \alpha, \beta, \gamma weight the original query, the relevant centroid, and the non-relevant centroid.
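A minimal Rocchio sketch over term-weight dictionaries, following the formula above; the default values for alpha, beta, and gamma are illustrative, not prescribed by the slide:

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query vector toward the centroid of the
    relevant documents and away from the non-relevant centroid.
    All vectors are dicts mapping term -> weight (e.g., TF-IDF)."""
    new_query = defaultdict(float)
    for term, w in query.items():
        new_query[term] += alpha * w
    for doc in relevant:                    # add the relevant centroid
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    for doc in nonrelevant:                 # subtract the non-relevant centroid
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(nonrelevant)
    # Keep only positively weighted terms; in practice, also truncate to the
    # top-weighted terms for efficiency (see the next slide).
    return {t: w for t, w in new_query.items() if w > 0}
```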
Rocchio in practice
- Negative (non-relevant) examples are not very important (why?)
- Efficiency concern: restrict the vector to a lower dimension, i.e., only consider highly weighted words in the centroid vector
- Avoid training bias: keep a relatively high weight on the original query
- Can be used for both relevance feedback and pseudo feedback
- Usually robust and effective
Feedback in probabilistic models
Classic probabilistic model: O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)}, contrasting a relevant-document model against a non-relevant-document model; its parameters are estimated from per-query judgments such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0).
Language model: O(R=1|Q,D) \propto P(Q|D,R=1), a relevant-query model, estimated from per-document judgments such as (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0).
Feedback:
- P(D|Q,R=1) can be improved for the current query and future documents
- P(Q|D,R=1) can be improved for the current document and future queries
Robertson-Sparck Jones model (Robertson & Sparck Jones, 1976)
The RSJ model ranks documents by the log-odds of relevance:
\log O(R=1|Q,D) \stackrel{rank}{=} \sum_{i:\, q_i = d_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}
Two parameters for each term A_i:
- p_i = P(A_i=1|Q,R=1): probability that term A_i occurs in a relevant document
- u_i = P(A_i=1|Q,R=0): probability that term A_i occurs in a non-relevant document
How do we estimate these parameters? Suppose we have relevance judgments:
p_i = \frac{\#(\text{rel. docs with } A_i) + 0.5}{\#(\text{rel. docs}) + 1}, \quad u_i = \frac{\#(\text{non-rel. docs with } A_i) + 0.5}{\#(\text{non-rel. docs}) + 1}
The +0.5 and +1 can be justified as priors in Bayesian estimation. P(D|Q,R=1) can thus be improved for the current query and future documents, but note that this is per-query estimation!
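A small sketch of this estimation, where each judged document is represented as a set of terms (that representation is an assumption for illustration):

```python
import math

def rsj_weight(term, rel_docs, nonrel_docs):
    """Estimate p_i and u_i with the +0.5 / +1 smoothing above and
    return the RSJ log-odds weight for one query term."""
    r = sum(term in d for d in rel_docs)       # rel. docs containing the term
    n = sum(term in d for d in nonrel_docs)    # non-rel. docs containing it
    p = (r + 0.5) / (len(rel_docs) + 1)        # P(A_i=1 | Q, R=1)
    u = (n + 0.5) / (len(nonrel_docs) + 1)     # P(A_i=1 | Q, R=0)
    return math.log(p * (1 - u) / (u * (1 - p)))
```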
Feedback in language models
Recap of the language modeling approach: rank documents by query likelihood,
\log p(q|d) = \sum_{w_i \in q} \log p(w_i|d), \quad \text{where } q = w_1, w_2, \dots, w_n
and p(w|d) is the document language model.
Difficulty: the documents are given, i.e., p(w|d) is fixed, so there is no obvious place to plug feedback into the query-likelihood formulation.
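For reference, a minimal query-likelihood scorer; the Jelinek-Mercer smoothing against a background collection model is an illustrative choice, not mandated by the slide:

```python
import math

def query_likelihood(query_terms, doc_counts, doc_len, bg_model, lam=0.1):
    """log p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)."""
    score = 0.0
    for w in query_terms:
        p_ml = doc_counts.get(w, 0) / doc_len   # maximum-likelihood estimate
        # assumes every query term appears in the background model's vocabulary
        score += math.log((1 - lam) * p_ml + lam * bg_model[w])
    return score
```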
Feedback in language models
Approach: introduce a probabilistic query model.
- Ranking: measure the distance between the query model and the document model
- Feedback: update the query model
Q: Back to the vector space model? A: Kind of, but from a different perspective.
Kullback-Leibler (KL) divergence based retrieval model
Probabilistic similarity measure:
score(Q; D) \propto -D(\theta_Q \| \theta_D) = \sum_w p(w|\theta_Q) \log p(w|\theta_D) - \sum_w p(w|\theta_Q) \log p(w|\theta_Q)
The second term (the query model's entropy) is query-specific and ignored for ranking. \theta_Q is the query language model, which needs to be estimated; \theta_D is the document language model, which we already know how to estimate.
Background knowledge: Kullback-Leibler divergence
A non-symmetric measure of the difference between two probability distributions P and Q:
D(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
It measures the expected number of extra bits required to code samples from P when using a code based on Q. P usually refers to the true data distribution, Q to the approximating distribution.
Properties: non-negative, and D(P \| Q) = 0 iff P = Q almost everywhere. This explains why score(Q; D) \propto -D(\theta_Q \| \theta_D).
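A tiny numeric check of the definition for discrete distributions (the example distributions are made up):

```python
import math

def kl(p, q):
    """D(P || Q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl(p, q), kl(q, p))   # both non-negative, and generally unequal
print(kl(p, p))             # 0.0, since D(P||P) = 0
```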
Kullback-Leibler (KL) divergence based retrieval model
Retrieval reduces to estimating \theta_Q and \theta_D. With the same smoothing strategy as before for the document model,
score(Q; D) \propto \sum_{w:\, c(w,d) > 0} p(w|\theta_Q) \log \frac{p_s(w|d)}{\alpha_d p(w|C)} + \log \alpha_d
This is a generalized version of the query-likelihood language model: when p(w|\theta_Q) is the empirical distribution of words in the query, we recover query likelihood.
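A sketch of the KL-based scorer over smoothed discrete models; since the query-entropy term is constant per query, ranking by cross entropy suffices (JM smoothing again as an illustrative choice):

```python
import math

def kl_score(query_model, doc_counts, doc_len, bg_model, lam=0.1):
    """Rank by -D(theta_Q || theta_D) up to a query-specific constant:
    score = sum_w p(w|theta_Q) * log p(w|theta_D)."""
    score = 0.0
    for w, q_prob in query_model.items():
        p_ml = doc_counts.get(w, 0) / doc_len
        p_wd = (1 - lam) * p_ml + lam * bg_model[w]   # smoothed doc model
        score += q_prob * math.log(p_wd)
    return score
```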
Feedback as model interpolation
Rank documents D against the query Q by D(\theta_{Q'} \| \theta_D). Given feedback documents F = {d1, d2, ..., dn}, update the query model by interpolation:
\theta_{Q'} = (1 - \alpha) \theta_Q + \alpha \theta_F
- \alpha = 0: no feedback, \theta_{Q'} = \theta_Q
- \alpha = 1: full feedback, \theta_{Q'} = \theta_F
The key is estimating the feedback model \theta_F, via a generative model.
Q: Is this Rocchio feedback in the vector space model? A: Very similar, but with different interpretations.
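The interpolation itself is a one-liner over term-probability dictionaries; a sketch:

```python
def interpolate(query_model, feedback_model, alpha=0.5):
    """theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F."""
    terms = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in terms}
```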
Feedback in language models
Example feedback documents (for the "airport security" query used below): snippets about protecting passengers, accidental/malicious harm, crime, and rules.
Generative mixture model of feedback
Assume each word in the feedback documents F = {d1, ..., dn} is generated either by a background model p(w|C), with probability \lambda, or by a topic model p(w|\theta_F), with probability 1 - \lambda:
\log p(F|\theta_F) = \sum_{d \in F} \sum_w c(w, d) \log \left[ (1 - \lambda)\, p(w|\theta_F) + \lambda\, p(w|C) \right]
\lambda is the noise ratio in the feedback documents. Maximum likelihood estimation:
\theta_F = \arg\max_{\theta} \log p(F|\theta)
[Figure: a word-frequency histogram split into background words explained by p(w|C) and topic words explained by p(w|\theta_F).]
How to estimate \theta_F?
The background model p(w|C) is known and fixed (e.g., the = 0.2, a = 0.1, to = 0.02, we = 0.01, flight = 0.0001, company = 0.00005) and takes weight \lambda = 0.7. The unknown query topic model p(w|\theta_F), here for "airport security" (accident = ?, regulation = ?, passenger = ?, rules = ?), takes weight 1 - \lambda = 0.3 and is fit to the feedback document(s) by an ML estimator.
If we knew the identity of each word (background vs. topic), estimation would be easy; but we don't...
Appeal to the Expectation-Maximization algorithm
Introduce a hidden identity variable z_i \in \{1 \text{ (background)}, 0 \text{ (topic)}\} for each word occurrence, e.g., in "... the paper presents a text mining algorithm ...".
Suppose the parameters were all known; what is a reasonable guess of z_i?
- It depends on \lambda (why?)
- It depends on p(w|C) and p(w|\theta_F) (how?)
E-step:
p(z_i = 1 \mid w_i) = \frac{p(z_i = 1)\, p(w_i \mid z_i = 1)}{p(z_i = 1)\, p(w_i \mid z_i = 1) + p(z_i = 0)\, p(w_i \mid z_i = 0)} = \frac{\lambda\, p(w_i|C)}{\lambda\, p(w_i|C) + (1 - \lambda)\, p(w_i|\theta_F)}
M-step:
p^{new}(w_i|\theta_F) = \frac{c(w_i, F) \left(1 - p(z_i = 1 \mid w_i)\right)}{\sum_{w_j \in \text{vocabulary}} c(w_j, F) \left(1 - p(z_j = 1 \mid w_j)\right)}
A toy example of EM computation
Expectation step: augment the data by guessing the hidden variables with the current parameters. Maximization step: with the augmented data, re-estimate the parameters by maximum likelihood. Assume \lambda = 0.5.

Word    #  P(w|C)  Iter.1 P(w|θF)  Iter.1 P(z=1)  Iter.2 P(w|θF)  Iter.2 P(z=1)  Iter.3 P(w|θF)  Iter.3 P(z=1)
the     4  0.5     0.25            0.67           0.20            0.71           0.18            0.74
paper   2  0.3     0.25            0.55           0.14            0.68           0.10            0.75
text    4  0.1     0.25            0.29           0.44            0.19           0.50            0.17
mining  2  0.1     0.25            0.29           0.22            0.31           0.22            0.31
Log-likelihood:    -16.96                         -16.13                         -16.02

Why did we not distinguish a word's identity in Rocchio?
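A compact EM implementation of this mixture model, run on the toy data above; it reproduces the table's values to two decimals:

```python
import math

def em_feedback(counts, bg, lam=0.5, iters=3):
    """Estimate the topic model p(w|theta_F) in the mixture
    (1 - lam) * p(w|theta_F) + lam * p(w|C) by EM."""
    words = list(counts)
    theta = {w: 1.0 / len(words) for w in words}    # uniform initialization
    for it in range(1, iters + 1):
        ll = sum(c * math.log((1 - lam) * theta[w] + lam * bg[w])
                 for w, c in counts.items())
        print(f"iter {it}:", {w: round(theta[w], 2) for w in words},
              "log-likelihood:", round(ll, 2))
        # E-step: posterior probability that each word is background noise
        p_bg = {w: lam * bg[w] / (lam * bg[w] + (1 - lam) * theta[w])
                for w in words}
        # M-step: re-estimate theta_F from counts attributed to the topic
        mass = {w: counts[w] * (1 - p_bg[w]) for w in words}
        total = sum(mass.values())
        theta = {w: m / total for w, m in mass.items()}
    return theta

# Toy data from the slide (lambda = 0.5)
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
bg = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
em_feedback(counts, bg)
```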
Example of feedback query model
Query: "airport security"; pseudo feedback with the top 10 documents.
Open question: how do we handle negative feedback?

λ = 0.7                     λ = 0.9
W           p(W|θF)         W               p(W|θF)
the         0.0405          security        0.0558
security    0.0377          airport         0.0546
airport     0.0342          beverage        0.0488
beverage    0.0305          alcohol         0.0474
alcohol     0.0304          bomb            0.0236
to          0.0268          terrorist       0.0217
of          0.0241          author          0.0206
and         0.0214          license         0.0188
author      0.0156          bond            0.0186
bomb        0.0150          counter-terror  0.0173
terrorist   0.0137          terror          0.0142
in          0.0135          newsnet         0.0129
license     0.0127          attack          0.0124
state       0.0127          operation       0.0121
by          0.0125          headline        0.0121

A larger noise ratio λ attributes more of the common words (the, to, of, and) to the background model, leaving a more discriminative feedback model.
What you should know
- The purpose of relevance feedback
- Rocchio relevance feedback for vector space models
- Query-model-based feedback for language models
Today's reading
Chapter 9: Relevance feedback and query expansion
- 9.1 Relevance feedback and pseudo relevance feedback
- 9.2 Global methods for query reformulation