Enhancing Scientific Document Retrieval with Hybrid Approach

Enhancing Scientific Document Retrieval with Hybrid Approach
Slide Note
Embed
Share

A hybrid approach combining sparse and dense retrieval methods to improve scientific document retrieval. Sparse models use high-dimensional Bag of Words vectors with TF-IDF weights, while dense models employ transformer-based LLM for nuanced vector representations. By leveraging both sparse and dense vector spaces, the hybrid approach achieves superior performance in information retrieval tasks.

  • Scientific Retrieval
  • Hybrid Approach
  • Sparse Models
  • Dense Models
  • Information Retrieval

Uploaded on Feb 23, 2025 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval Priyanka Mandikal, Raymond Mooney University of Texas at Austin

  2. Traditional IR Encodes documents and queries as sparse, high- dimensional Bag of Words (BoW) TF-IDF weighted vectors. Performs document retrieval using an inverted index to efficiently find documents whose sparse vector has high cosine-similarity with the query vector. Weakness: No use of linguistic syntax or semantics.

  3. Deep-Learning based Retrieval Use a transformer-based LLM to encode documents and queries into dense lower-dimensional vectors. Again, use vector similarity (Euclidean or cosine) to find document vectors that are close to the query. Strength: LLM uses syntax and semantics to produce a more nuanced vector representation. Weakness: Does not necessarily sufficiently account for exact token matching.

  4. Hybrid Retrieval We found that for a small, classic IR benchmark in a medical domain (cystic fibrosis), that modern dense retrieval did not perform better than a classical sparse approach. However, a very simple hybrid approach that combines similarity in both the sparse and dense vector spaces performs significantly better than either alone.

  5. Hybrid Architecture

  6. Medical IR Benchmark 1,239 documents published from 1974 to 1979 discussing Cystic Fibrosis. Just abstracts and major and minor topic lists 100 full-sentence queries. Can one distinguish between the effects of mucus hypersecretion and infection on the submucosal glands of the respiratory tract in CF? Each query has been annotated with ratings from four domain experts, where each document is assigned a relevance score: 0: not relevant, 1: marginally relevant, 2: highly relevant.

  7. Sparse Model A simple classical vector-space IR model implemented for teaching undergrad IR. Represents documents and queries as BOW vectors with Term-Frequency Inverse-Document- Frequency (TF-IDF) token weights. Ranks retrievals using cosine similarity. Similar to BM25

  8. Dense Model Uses SPECTER2 (Singh et al., 2023) SOTA deep-learned transformer-based language model for science. Developed at AI2 as a followup to SciBERT and SPECTER models trained on scientific documents from the Semantic Scholar search engine. A contrastive learning approach is used to encourage similar 768-dimensional dense embeddings for an article and its direct citations. Trained on 6M triplets spanning 23 fields of study.

  9. Hybrid Model Use a weighted linear combination of distance in sparse and dense spaces to rank retrievals

  10. Evaluation Metrics Precsion/Recall curves Counts any document marked as relevant by any judge as relevant. Normalized Discounted Cummulative Gain (NDCG) Uses average relevance rating across all judges as the gold-standard continuous relevance score used to compute gain.

  11. Precision Recall Results

  12. NDCG Results

  13. Hyperparameter Search

  14. Conclusions Modern dense retrieval using a SOTA science- based LLM does not improve performance on a classic medical IR problem over traditional sparse retrieval. A simple hybrid of sparse and dense retrieval does significantly improve results. Future work: Test hybrid approaches on other SDU problems.

Related


More Related Content