
Sparse vs Dense Retrieval Models
Explore the differences between Sparse and Dense Retrieval models such as BM25, Bi-encoder, and Cross-encoder. Learn about lexical matching, vocabulary mismatch problems, and the use of Bi-Encoder vs Cross-Encoder in Dense Retrieval. Dive into the intricate details of cosine similarity scoring and ranking in Dense Retrieval with Bi-Encoder.
Presentation Transcript
Overview. Sparse Retrieval: BM25. Dense Retrieval: Bi-encoder, Cross-encoder.
Overview: Sparse, Dense. [Figure: in sparse retrieval, a query Q and a document D are compared by lexical matching; in dense retrieval, Q and D are given to a model that outputs a score, e.g. 0.92.]
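Sparse Retrieval vs Dense Retrieval. A sparse retrieval model scores documents by lexical matching, that is, by the terms the query Q and the document D share. Examples: TF-IDF, BM25. Because only shared terms count, it suffers from the vocabulary mismatch problem. A dense retrieval model uses a language model (LM) to perform semantic matching. Examples: Bi-Encoder, Cross-Encoder.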
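To make lexical matching and the vocabulary mismatch problem concrete, here is a minimal sketch of BM25 scoring in plain Python. The parameter values `k1=1.5` and `b=0.75`, the particular IDF variant, the whitespace tokenization, and the toy documents are all assumptions chosen for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25 sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # lexical matching: a term that is absent adds nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical). The second document is about the same topic
# as the query but shares no words with it.
docs = [
    "the cat sat on the mat".split(),
    "a feline rested on a rug".split(),
    "stock prices fell sharply".split(),
]
scores = bm25_scores("cat mat".split(), docs)
```

In this toy corpus the second document gets a score of zero even though it paraphrases the first, which is the vocabulary mismatch that dense models are meant to address.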
Dense Retrieval: Retrieval, Reranking. [Figure: two-stage pipeline for a query Q. A sparse model or a bi-encoder selects the top-k documents (= Retrieval); a cross-encoder then sorts them (= Rerank).]
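Dense Retrieval: Bi-Encoder vs Cross-Encoder. Bi-encoder (used for Retrieval): the query and the document are each passed through the encoder separately, and the resulting embeddings are compared. It is fast, but less accurate than the cross-encoder. Cross-encoder (used for Reranking): the query and the document are fed to the model together, and the model outputs a score. It is more accurate, but slow.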
Dense Retrieval: Bi-Encoder. [Figure, built up over several slides: each document d_1, d_2, ..., d_n is passed through the encoder to obtain an embedding; the query is encoded in the same way; the cosine similarity between the query embedding and each document embedding is the score; documents are ranked by score, here score_2 > score_3 > score_1.]
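Dense Retrieval: Bi-Encoder. The query and the documents are encoded independently, and the cosine similarity between the query embedding and each document embedding is the score (= ranking criterion). Example: with 100 queries and 100,000 documents, the encoder runs 100 + 100,000 times (cosine similarity is computed separately on the embeddings).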
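The scoring and ranking step can be sketched as follows. The two-dimensional embeddings are made-up numbers standing in for encoder outputs, and the names `d1`, `d2`, `d3` are placeholders; only the cosine similarity computation and the sort reflect what the slide describes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; a real bi-encoder would produce these.
query_emb = [1.0, 0.0]
doc_embs = {"d1": [0.0, 1.0], "d2": [1.0, 0.1], "d3": [1.0, 1.0]}

# Score every document against the query, then rank by descending score.
scores = {name: cosine(query_emb, emb) for name, emb in doc_embs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these toy vectors the ranking comes out as d2, d3, d1, the same order as in the slide's example.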
Dense Retrieval: Cross-Encoder. Input: the query paired with each document, i.e. (query, d_1), (query, d_2), ..., (query, d_n). Example: with 100 queries and 100,000 documents, the model runs 100 * 100,000 = 10,000,000 times.
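The cost difference between the two architectures is simple arithmetic. The sketch below just counts model forward passes for the example of 100 queries and 100,000 documents; the variable names are illustrative.

```python
n_queries = 100
n_docs = 100_000

# Bi-encoder: every query and every document is encoded exactly once.
# The 100 * 100,000 cosine similarities are cheap vector operations,
# not model forward passes.
bi_encoder_passes = n_queries + n_docs

# Cross-encoder: one forward pass per (query, document) pair.
cross_encoder_passes = n_queries * n_docs

ratio = cross_encoder_passes / bi_encoder_passes
```

For these numbers the cross-encoder needs roughly a hundred times more forward passes, which is why it is usually applied only to a short candidate list.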
Dense Retrieval: Retrieval, Reranking. [Figure: a sparse model or a bi-encoder selects the top-k documents for the query Q (= Retrieval); a cross-encoder sorts them (= Rerank).]
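A minimal sketch of the retrieve-then-rerank pattern is shown below. The two scoring functions are deliberately simple stand-ins, not real models: `overlap_score` plays the role of the cheap first stage (sparse model or bi-encoder) and `phrase_score` plays the role of the expensive second stage (cross-encoder). Both functions and the toy documents are hypothetical.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score, k):
    # Stage 1 (= Retrieval): score all documents cheaply, keep the top-k.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2 (= Rerank): sort only the k candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)

def overlap_score(query, doc):
    # Stand-in for the first stage: number of shared words.
    return len(set(query.split()) & set(doc.split()))

def phrase_score(query, doc):
    # Stand-in for the second stage: rewards an exact phrase match.
    return 1.0 if query in doc else 0.0

docs = [
    "retrieval dense models",
    "dense retrieval with a bi-encoder",
    "cooking pasta at home",
]
result = retrieve_then_rerank("dense retrieval", docs, overlap_score, phrase_score, k=2)
```

The first stage cannot separate the two matching documents because both share two words with the query; the second stage, which looks at the pair jointly, puts the exact match first.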
BM25, Bi-encoder, Cross-Encoder: https://colab.research.google.com/drive/1Gx8sjSWz3W64ZbhPRTX_3uyHvisqtigS#scrollTo=D_hDi8KzNgMM
SentenceBERT Bi-encoder, Cross-Encoder: https://colab.research.google.com/drive/1KetpPOHrmSgif6UXmIJovl2phcKAZdTg#scrollTo=rqUfYfFROkq7
Practice 1: Bi-encoder (Inference)
1. Load the model.
2. Encode the documents. Input: [CLS] document [SEP]. Output: embedding.
3. Encode the query. Input: [CLS] query [SEP]. Output: embedding.
4. Score (= search): cosine similarity, or Faiss.
Practice 1: Bi-Encoder. [Figure: the query embedding is compared with each document embedding by cosine similarity, which is the score (= ranking criterion).]
Practice 1: Cross-Encoder
1. Load the model.
2. Score each pair (= rerank). Input: [CLS] query [SEP] document [SEP]. Output: score.
3. Sort by score.
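The paired input format can be sketched as plain string construction. The query-first order and the helper name `cross_encoder_input` are assumptions for illustration; in practice a tokenizer normally inserts the [CLS] and [SEP] tokens itself when given a pair of texts.

```python
def cross_encoder_input(query, doc):
    # Assumed layout: [CLS] query [SEP] document [SEP].
    # Hypothetical helper; real tokenizers add these special tokens themselves.
    return f"[CLS] {query} [SEP] {doc} [SEP]"

query = "what is dense retrieval"
docs = ["dense retrieval uses embeddings", "bm25 is a sparse model"]

# One input per (query, document) pair: n documents give n model inputs.
inputs = [cross_encoder_input(query, d) for d in docs]
```

Because the query is repeated in every input, nothing can be pre-computed per document, unlike the bi-encoder case.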
Practice 1: Cross-Encoder. [Figure: the input is the query paired with each document: (query, d_1), (query, d_2), ..., (query, d_n).]
Practice 1: Bi-encoder
Practice 1: Cross-encoder
Dense Retrieval: Retrieval, Reranking. [Figure: Retrieval (top-k) with a sparse model or a bi-encoder, followed by Rerank (sort) with a cross-encoder.]
Training: Bi-encoder
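Loss: Bi-encoder. The bi-encoder is trained with a contrastive loss. For a query q, the loss uses a relevant (= positive) passage and irrelevant (= negative) passages: training raises the similarity between q and the positive passage and lowers the similarity between q and the negative passages. The negatives have to be selected somehow (e.g., from other passages).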
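One common form of contrastive loss treats the other passages in a batch as negatives and applies a softmax over the similarity scores. The sketch below assumes that form (in-batch negatives, positive on the diagonal of the similarity matrix); the slide does not spell out which variant was used, so treat the details as an assumption.

```python
import math

def contrastive_loss(sim_matrix):
    """Mean of -log softmax(positive) over queries.

    sim_matrix[i][j] is the similarity of query i and passage j.
    Assumption: passage i is the positive for query i, and every other
    passage in the row serves as a negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        m = max(row)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)

# Positives score high, negatives low: small loss.
well_separated = contrastive_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])
# A negative outscores each positive: large loss.
confused = contrastive_loss([[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]])
```

The loss falls as the positive's similarity rises relative to the negatives, which is the pull-together and push-apart behaviour described above.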
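Loss: Bi-encoder. How are negative passages chosen? The training data consists of {query-passage} pairs, split into positive pairs and negative pairs. One option is BM25 negatives: a passage that BM25 ranks highly for the query but that is not the positive passage is used as a negative.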
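Mining such negatives can be sketched as ranking all passages with a lexical scorer and keeping the top ones that are not the positive. The `overlap_score` function below is a simple stand-in for BM25, and the helper name `mine_hard_negatives` and the toy passages are hypothetical.

```python
def overlap_score(query, passage):
    # Stand-in for BM25: number of words shared by query and passage.
    return len(set(query.split()) & set(passage.split()))

def mine_hard_negatives(query, positive, passages, score, n_neg=1):
    # Rank all passages by the lexical score, highest first.
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    # Keep the best-ranked passages that are not the positive.
    return [p for p in ranked if p != positive][:n_neg]

query = "capital of france"
positive = "paris is the capital of france"
passages = [
    positive,
    "the capital of germany is berlin",
    "bananas are yellow",
]
negatives = mine_hard_negatives(query, positive, passages, overlap_score)
```

The selected negative shares words with the query without answering it, which makes it harder for the model to tell apart from the positive than a random passage would be.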
Training: Cross-encoder. Training pairs and labels:
(Q1, D1) = 1, (Q1, D2) = 0, (Q1, D3) = 0
(Q2, D2) = 1, (Q2, D1) = 0, (Q2, D3) = 0
(Q3, D3) = 1, (Q3, D1) = 0, (Q3, D2) = 0
Loss: Cross-encoder. The cross-encoder is trained with a binary cross-entropy (BCE) loss. Each query-document pair is labeled as relevant (1) or not relevant (0).
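Binary cross-entropy over labeled pairs can be written out directly. In this sketch the predicted probabilities are made-up values standing in for cross-encoder outputs, and the labels follow the pattern of one relevant pair (1) followed by two non-relevant pairs (0).

```python
import math

def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Labels for one query: first pair relevant, the other two not.
labels = [1, 0, 0]

# Hypothetical model outputs.
good = bce_loss([0.9, 0.1, 0.2], labels)  # confident and correct
bad = bce_loss([0.2, 0.8, 0.9], labels)   # confident and wrong
```

The loss is small when the predicted probability agrees with the label and grows quickly when a confident prediction is wrong.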
BM25, Bi-encoder, Cross-Encoder: https://colab.research.google.com/drive/1Gx8sjSWz3W64ZbhPRTX_3uyHvisqtigS#scrollTo=D_hDi8KzNgMM
SentenceBERT Bi-encoder, Cross-Encoder: https://colab.research.google.com/drive/1KetpPOHrmSgif6UXmIJovl2phcKAZdTg#scrollTo=rqUfYfFROkq7
Practice 2: Bi-encoder training loop and loss
Ref
Cross Encoder from scratch:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder_scratch.py
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/cross_encoder/CrossEncoder.py
https://www.sbert.net/docs/package_reference/cross_encoder.html?highlight=model%20fit
Bi encoder from scratch:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py
https://www.sbert.net/examples/training/ms_marco/README.html?highlight=multiplenegative#bi-encoder
MTEB Leaderboard (Massive Text Embedding Benchmark): https://huggingface.co/spaces/mteb/leaderboard
1. Which task? 2. Multilingual (multi-lingual) support? 3. Model size? (Colab: ~500M)
Thank you
Huggingface? A platform for natural language processing (NLP) and machine learning that provides Transformer-based models. https://huggingface.co/
Mr-tydi: https://huggingface.co/datasets/castorini/mr-tydi/viewer/korean
KorQuad: https://korquad.github.io/ and https://huggingface.co/datasets/squad_kor_v1
BERT (bert-base-multilingual-uncased): https://huggingface.co/bert-base-multilingual-uncased
koBERT: https://huggingface.co/skt/kobert-base-v1
Dataset: see the korQuAD homepage.
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification