Document Evaluation in Information Retrieval

Dive into the fundamentals of document evaluation in information retrieval, covering test collections, ranking assessment, user studies, and relevance judgments. Explore the batch evaluation model, IR test collection design, TREC ad hoc topics, set-based effectiveness measures, significance testing, and perspectives on relevance from Saracevic and Soergel.

  • Information Retrieval
  • Document Evaluation
  • Test Collections
  • Relevance Judgments
  • User Studies


Presentation Transcript


  1. Evaluation. INST 734, Module 5. Doug Oard.

  2. Agenda
  • Evaluation fundamentals
  • Test collections: evaluating sets
  • Test collections: evaluating rankings
  • Interleaving
  • User studies

  3. Batch Evaluation Model
  [Diagram: documents and a query feed the IR black box, which produces a search result; an evaluation module compares that result against relevance judgments and outputs a measure of effectiveness.]
  These are the four things we need: documents, queries (topics), relevance judgments, and a measure of effectiveness.
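  A minimal sketch of this batch loop, assuming a hypothetical search(query) callable standing in for the IR black box and set-valued relevance judgments; the measure plugged in here is set precision, but any of the measures discussed later would fit.

    # Batch evaluation sketch: run every topic through the (hypothetical) IR black box,
    # then score its output against the human relevance judgments.
    def batch_evaluate(topics, qrels, search, measure):
        """topics:  {topic_id: query string}
           qrels:   {topic_id: set of relevant doc ids}    (relevance judgments)
           search:  callable(query) -> list of doc ids     (the IR black box)
           measure: callable(retrieved_set, relevant_set) -> float"""
        per_topic = {}
        for topic_id, query in topics.items():
            retrieved = set(search(query))            # search result
            relevant = qrels.get(topic_id, set())     # judgments for this topic
            per_topic[topic_id] = measure(retrieved, relevant)
        # Macro-average the measure of effectiveness over topics
        return sum(per_topic.values()) / len(per_topic), per_topic

    def set_precision(retrieved, relevant):
        # One possible measure: fraction of retrieved documents that are relevant
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    # average, per_topic = batch_evaluate(topics, qrels, search=my_engine, measure=set_precision)
    # (my_engine is whatever system is being evaluated)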

  4. IR Test Collection Design
  • Representative document collection
    Size, sources, genre, topics, ...
  • Random sample of topics
    Associated somehow with queries
  • Known (often binary) levels of relevance
    For each topic-document pair (topic, not query!)
    Assessed by humans, used only for evaluation
  • Measure(s) of effectiveness
    Used to compare alternate systems
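  As a concrete illustration of "known levels of relevance for each topic-document pair": TREC-style collections conventionally ship the judgments as a "qrels" file of whitespace-separated lines (topic id, an unused iteration field, document id, relevance label). A minimal loading sketch, with a hypothetical filename and example ids:

    from collections import defaultdict

    def load_qrels(path="qrels.example.txt"):    # hypothetical filename
        """Return {topic_id: {doc_id: relevance}} from TREC-style qrels lines,
        e.g. "351 0 DOC-0001 1" (ids here are illustrative; relevance is often binary)."""
        judgments = defaultdict(dict)
        with open(path) as f:
            for line in f:
                topic, _iteration, docno, rel = line.split()
                judgments[topic][docno] = int(rel)
        return dict(judgments)

    # Relevant set for one (hypothetical) topic id:
    # relevant = {d for d, r in load_qrels()["351"].items() if r > 0}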

  5. A TREC Ad Hoc Topic
  Title: Health and Computer Terminals
  Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
  Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.
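  For reference, TREC ad hoc topics are conventionally distributed in a lightweight SGML-style markup with <num>, <title>, <desc>, and <narr> fields. A sketch of this topic in that form plus a minimal parse; the topic number is not shown on the slide, so it stays a placeholder, and the narrative is truncated:

    import re

    RAW_TOPIC = """
    <top>
    <num> Number: ___
    <title> Health and Computer Terminals
    <desc> Description:
    Is it hazardous to the health of individuals to work with computer
    terminals on a daily basis?
    <narr> Narrative:
    Relevant documents would contain any information that expands on any
    physical disorder/problems that may be associated with the daily
    working with computer terminals. ...
    </top>
    """

    def parse_topic(raw):
        """Pull the title, description, and narrative out of one <top> block."""
        fields = {}
        for tag in ("title", "desc", "narr"):
            match = re.search(rf"<{tag}>(.*?)(?=<|\Z)", raw, re.S)
            if match:
                fields[tag] = " ".join(match.group(1).split())
        return fields

    # parse_topic(RAW_TOPIC)["title"]  ->  "Health and Computer Terminals"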

  6. Saracevic on Relevance
  Relevance is the [measure / degree / dimension / estimate / appraisal / relation] of a [correspondence / utility / connection / satisfaction / fit / bearing / matching] existing between a [document / article / textual form / reference / information provided / fact] and a [query / request / information used / point of view / information need statement] as determined by a [person / judge / user / requester / information specialist].
  Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.

  7. Teasing Apart Relevance
  • Relevance relates a topic and a document
    Duplicates are equally relevant by definition
    Constant over time and across users
  • Pertinence relates a task and a document
    Accounts for quality, complexity, language, ...
  • Utility relates a user and a document
    Accounts for prior knowledge
  Dagobert Soergel (1994). Indexing and Retrieval Performance: The Logical Evidence. Journal of the American Society for Information Science, 45(8), 589-599.

  8. Set-Based Effectiveness Measures
  • Precision: How much of what was found is relevant?
    Often of interest, particularly for interactive searching
  • Recall: How much of what is relevant was found?
    Particularly important for law, patents, and medicine
  • Fallout: How much of what was irrelevant was rejected?
    Useful when different size collections are compared
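  A small sketch of all three set-based measures, assuming the retrieved set, the relevant set, and the full collection are available as Python sets of document ids; fallout is computed exactly as worded on this slide, i.e. the fraction of non-relevant documents that were rejected.

    def set_measures(retrieved, relevant, collection):
        """retrieved, relevant, collection: sets of document ids."""
        not_relevant = collection - relevant
        rejected = collection - retrieved
        # Precision: how much of what was found is relevant?
        precision = len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
        # Recall: how much of what is relevant was found?
        recall = len(retrieved & relevant) / len(relevant) if relevant else 0.0
        # Fallout (as worded on the slide): how much of what was irrelevant was rejected?
        fallout = len(rejected & not_relevant) / len(not_relevant) if not_relevant else 0.0
        return precision, recall, fallout

    # Toy example with a ten-document collection:
    # docs = {f"d{i}" for i in range(10)}
    # set_measures({"d1", "d2", "d3"}, {"d2", "d3", "d7"}, docs)  -> approx. (0.67, 0.67, 0.86)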

  9. [Venn diagram: the document collection split into Relevant and Retrieved sets; their intersection is Relevant + Retrieved, and everything outside both is Not Relevant + Not Retrieved.]

  10. Effectiveness Measures
                                     System
                           Retrieved             Not Retrieved
      Truth  Relevant      Relevant Retrieved    Miss
             Not Relevant  False Alarm           Irrelevant Rejected

      Precision = Relevant Retrieved / Retrieved                            (user-oriented)
      Recall    = Relevant Retrieved / Relevant      = 1 - Miss Rate        (system-oriented)
      Fallout   = Irrelevant Rejected / Not Relevant = 1 - False Alarm Rate (system-oriented)
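  The same quantities can be read directly off the 2x2 table; a sketch from raw counts that also makes the complement relationships explicit (recall = 1 - miss rate, and fallout as tabulated here = 1 - false alarm rate).

    def from_contingency(relevant_retrieved, miss, false_alarm, irrelevant_rejected):
        """Counts from the truth-vs-system contingency table above."""
        retrieved = relevant_retrieved + false_alarm
        relevant = relevant_retrieved + miss
        not_relevant = false_alarm + irrelevant_rejected
        return {
            "precision": relevant_retrieved / retrieved,        # user-oriented
            "recall": relevant_retrieved / relevant,            # = 1 - miss rate
            "miss rate": miss / relevant,
            "fallout": irrelevant_rejected / not_relevant,      # = 1 - false alarm rate
            "false alarm rate": false_alarm / not_relevant,
        }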

  11. Single-Figure Set-Based Measures
  • Balanced F-measure
    Harmonic mean of recall and precision: F = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)
    Weakness: What if no relevant documents exist?
  • Cost function
    Reward relevant retrieved, penalize non-relevant retrieved
    For example, 3R+ - 2N+ (3 per relevant document retrieved, minus 2 per non-relevant document retrieved)
    Weakness: Hard to normalize, so hard to average
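  A sketch of both single-figure measures: the balanced F written in the harmonic-mean form above (alpha = 0.5 weights recall and precision equally), and the slide's example cost function, reading R+ as relevant documents retrieved and N+ as non-relevant documents retrieved.

    def f_measure(precision, recall, alpha=0.5):
        """Weighted harmonic mean; alpha = 0.5 gives the balanced F = 1 / (0.5/P + 0.5/R).
        Returned as 0.0 when either input is 0 (e.g. when no relevant documents exist)."""
        if precision <= 0.0 or recall <= 0.0:
            return 0.0
        return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

    def cost(relevant_retrieved, nonrelevant_retrieved, reward=3, penalty=2):
        """Example cost function from the slide: 3 per relevant retrieved, minus 2 per non-relevant."""
        return reward * relevant_retrieved - penalty * nonrelevant_retrieved

    # f_measure(0.5, 0.5) -> 0.5;  cost(relevant_retrieved=10, nonrelevant_retrieved=5) -> 20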

  12. (Paired) Statistical Significance Tests
      Query:       1      2      3      4      5      6      7   | Average
      System A:   0.02   0.39   0.16   0.58   0.04   0.09   0.12 |  0.20
      System B:   0.76   0.07   0.37   0.21   0.02   0.91   0.46 |  0.40
      Sign test:   +      -      +      -      -      +      +   |  p=1.0
      Wilcoxon:  +0.74  -0.32  +0.21  -0.37  -0.02  +0.82  +0.38 |  p=0.94
      t-test:    +0.74  -0.32  +0.21  -0.37  -0.02  +0.82  +0.38 |  p=0.34
  [Figure: "95% of outcomes" marked around 0]
  Try some at: http://www.socscistatistics.com/tests/signedranks/
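  A sketch of running these three paired tests on the per-query scores above, using scipy (binomtest needs a reasonably recent scipy; older versions expose binom_test instead). Exact p-values can differ from those on the slide depending on each test's variant and tie handling.

    from scipy import stats

    system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
    system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

    # Sign test: count the queries where B beats A and compare against a fair coin.
    wins_b = sum(b > a for a, b in zip(system_a, system_b))
    n_untied = sum(b != a for a, b in zip(system_a, system_b))
    print("sign test      p =", stats.binomtest(wins_b, n_untied, 0.5).pvalue)

    # Wilcoxon signed-rank test: also uses the magnitudes of the per-query differences.
    print("Wilcoxon       p =", stats.wilcoxon(system_b, system_a).pvalue)

    # Paired two-tailed t-test: assumes the differences are roughly normal.
    print("paired t-test  p =", stats.ttest_rel(system_b, system_a).pvalue)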

  13. Reporting Results
  • Do you have a measurable improvement?
    Inter-assessor agreement limits maximum precision; using one judge to assess another yields about 0.8
  • Do you have a meaningful improvement?
    0.05 (absolute) in precision might be noticed; 0.10 (absolute) in precision makes a difference
  • Do you have a reliable improvement?
    Use a two-tailed paired statistical significance test

  14. Agenda
  • Evaluation fundamentals
  • Test collections: evaluating sets
  • Test collections: evaluating rankings
  • Interleaving
  • User studies
