Compression Techniques in Information Retrieval: Enhancing Storage Efficiency

information retrieval information retrieval n.w

1 / 12

Embed Share

Explore the concept of compression in information retrieval, focusing on techniques to optimize storage efficiency, such as compression for inverted indexes, dictionary storage methods, and fixed-width terms. Learn why compression is crucial for maintaining main memory, reducing disk space usage, and improving search engine performance.

dart_y Follow

Uploaded on May 02, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

INFORMATION RETRIEVAL INFORMATION RETRIEVAL TECHNIQUES TECHNIQUES BY DR. ADNAN ABID Lecture # 11 Compression 1

ACKNOWLEDGEMENTS ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources 1. Introduction to information retrieval by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Sch tze 2. Managing gigabytes by Ian H. Witten, Alistair Moffat, Timothy C. Bell 3. Modern information retrieval by Baeza-Yates Ricardo, 4. Web Information Retrieval by Stefano Ceri, Alessandro Bozzon, Marco Brambilla

Outline Outline compression for inverted indexes Dictionary storage Dictionary-as-a-String Blocking 3

Basic indexing pipeline Documents to be indexed Friends, Romans, countrymen. Tokenizer Friends Romans Countrymen Token stream Linguistic modules friend roman countryman Modified tokens 2 4 Indexer friend 1 2 roman Inverted index 16 13 countryman 4

Why compression for inverted indexes? Why compression for inverted indexes? Dictionary Make it small enough to keep in main memory Make it so small that you can keep some postings lists in main memory too Postings file(s) Reduce disk space needed Decrease time needed to read postings lists from disk Large search engines keep a significant part of the postings in memory. Compression lets you keep more in memory 5

Dictionary storage Dictionary storage - - first cut first cut Array of fixed-width entries 500,000 terms; 28 bytes/term = 14MB. Freq. Terms Postings ptr. a 999,712 aardvark 71 . . zzzz 99 20 bytes 4 bytes each Allows for fast binary search into dictionary

Fixed Fixed- -width terms are wasteful width terms are wasteful Most of the bytes in the Term column are wasted we allot 20 bytes for 1 letter terms. And still can t handle supercalifragilisticexpialidocious. Written English averages ~4.5 characters. Exercise: Why is/isn t this the number to use for estimating the dictionary size? Short words dominate token counts. Average word in English: ~8 characters. Explain this. What are the corresponding numbers for Italian text?

Compressing the term list: Compressing the term list: Dictionary Dictionary- -as as- -a a- -String String Store dictionary as a (long) string of characters: Pointer to next word shows end of current word Hope to save up to 60% of dictionary space. .systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo . Freq. Postings ptr. Term ptr. Total string length = 500KB x 8 = 4MB 33 29 44 Pointers resolve 4M positions: log24M = 22bits = 3bytes 126 Binary search these pointers 8 8

Total space for compressed list Total space for compressed list 4 bytes per term for Freq. 4 bytes per term for pointer to Postings. 3 bytes per term pointer Avg. 8 bytes per term in term string 500K terms 9.5MB Now avg. 11 bytes/term, not 20. Total Space = 500K terms * (4 bytes Freq + 4 bytes Postings + 3 bytes term pointer + 8 bytes per term in term string) = 500K terms * (19 Bytes/Term)

Blocking Blocking Store pointers to every kth on term string. Example below: k=4. Need to store term lengths (1 extra byte) .7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo . Freq. Postings ptr. Term ptr. 33 Save 9 bytes on 3 pointers. 29 Lose 4 bytes on term lengths. 44 126 7

Net Net Where we used 3 bytes/pointer without blocking 3 x 4 = 12 bytes for k=4 pointers, now we use 3+4=7 bytes for 4 pointers. Shaved another ~0.5MB; can save more with larger k. Why not go with larger k?

Resources Resources Chapter 5 of IIR Resources at http://ifnlp.org/ir Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a) Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002) More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006) 12

Compression Techniques in Information Retrieval: Enhancing Storage Efficiency

Download Presentation

Presentation Transcript

Related

More Related Content