Information Retrieval


Discover the different models within Information Retrieval (IR) in Computer Science, from Exact Match Query to Best Match Query and Boolean model to Vector Space model. Explore how these models are used to retrieve information efficiently from large collections of unstructured text data.

trle215

Uploaded on Apr 04, 2025


Presentation Transcript

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval Models
A model is an abstract representation of a process or object.
Models are used to study properties, draw conclusions, and make predictions.
The quality of the conclusions depends on how closely the model represents reality.

Exact Match
The query specifies precise retrieval criteria.
Every document either matches or fails to match the query.
The result is a set of documents, usually in no particular order, though often in reverse-chronological order.

Best Match
The query describes retrieval criteria for the desired documents.
Every document matches the query to some degree.
The result is a ranked list of documents, best first.
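The difference between the two result types can be sketched in a few lines of Python. This is only an illustration: the toy collection is hypothetical, and the overlap count used for ranking is a stand-in scoring rule, not the scoring used by the vector space or probabilistic models.

```python
# Toy collection (hypothetical): document id -> set of terms.
docs = {
    "a": {"boolean", "retrieval", "model"},
    "b": {"vector", "space", "model"},
    "c": {"retrieval", "evaluation"},
}

def exact_match(query_terms, docs):
    """Return the unordered set of documents that contain every query term."""
    return {doc_id for doc_id, terms in docs.items() if query_terms <= terms}

def best_match(query_terms, docs):
    """Return all documents, ranked by how many query terms they share."""
    scores = {doc_id: len(query_terms & terms) for doc_id, terms in docs.items()}
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

query = {"retrieval", "model"}
exact_result = exact_match(query, docs)   # a set: documents match or they do not
ranked_result = best_match(query, docs)   # a list: every document gets a rank
```

Here the exact match keeps only the one document with both terms, while the best match still returns the partially matching documents, just lower in the list.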

Information Retrieval Models
Information retrieval models can be classified into:
1. Boolean model (Exact Match)
2. Vector Space model (Best Match)
3. Probabilistic model (Best Match)

Boolean model
The model can be explained by thinking of a query term as an unambiguous definition of a set of documents.
Documents are sets of terms.
Queries are Boolean expressions on terms: index terms linked by AND, OR, or NOT.
It is an exact match model, which implies that a document is retrieved if and only if it matches the description given by the query.

Example 1: Boolean model
D1 = computer information retrieval
D2 = computer retrieval
D3 = information
D4 = computer information
Q1 = information AND retrieval
Q2 = information BUT NOT computer
Answer:
Q1 = information AND retrieval: D1
Q2 = information BUT NOT computer: D3
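Example 1 can be reproduced with Python sets, treating each query term as the set of documents that contain it. This is a minimal sketch; the helper name `containing` is made up for illustration, and BUT NOT is read as AND NOT.

```python
# The four documents of Example 1, each represented as a set of terms.
docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}

def containing(term):
    """Set of ids of the documents whose term set includes `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

# AND is set intersection; NOT is the complement with respect to all documents.
q1 = containing("information") & containing("retrieval")
q2 = containing("information") & (all_docs - containing("computer"))
```

OR would be set union (`|`) in the same scheme.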

Example 2: Boolean model
Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies. Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.
Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Example 2: Boolean model
Query: (principles AND knowledge) OR (science AND engineering)
         principles  knowledge  science  engineering  Result
Doc 1    0           1          1        0            FALSE
Doc 2    1           0          1        1            TRUE

Example 3: Boolean model
Query: (principles OR knowledge) AND (science OR NOT engineering)
         principles  knowledge  science  engineering  Result
Doc 1    0           1          1        0            TRUE
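The two queries of Examples 2 and 3 can be checked mechanically. The sketch below assumes the four 0/1 values per document are listed in the order principles, knowledge, science, engineering (the order in which the terms appear in the query). The slide for Example 3 gives a result only for Doc 1; the code evaluates Doc 2 as well, purely as a computation from the same values.

```python
# Term presence per document, taken from the 0/1 rows of Example 2.
# Assumed column order: principles, knowledge, science, engineering.
presence = {
    "Doc 1": {"principles": False, "knowledge": True, "science": True, "engineering": False},
    "Doc 2": {"principles": True, "knowledge": False, "science": True, "engineering": True},
}

def example2(t):
    """(principles AND knowledge) OR (science AND engineering)"""
    return (t["principles"] and t["knowledge"]) or (t["science"] and t["engineering"])

def example3(t):
    """(principles OR knowledge) AND (science OR NOT engineering)"""
    return (t["principles"] or t["knowledge"]) and (t["science"] or not t["engineering"])

results2 = {doc: example2(t) for doc, t in presence.items()}
results3 = {doc: example3(t) for doc, t in presence.items()}
```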

Example 4: INDEX
The matrix below represents whether a certain word occurs (1) or does not occur (0) in a given document.
            d1  d2  d3  d4  d5  d6  ...
Antony      1   1   0   0   0   1
Brutus      1   1   0   1   0   0
Caesar      1   1   0   1   1   1
Calpurnia   0   1   0   0   0   0
Hence, the documents that contain Brutus and Caesar but do not contain Calpurnia are:
110100 AND 110111 AND 101111 = 100100
where 101111 is the bitwise complement of the Calpurnia row (010000). In words: d1 and d4.
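The bit-vector computation in Example 4 maps directly onto integer bitwise operators. In this sketch each row of the matrix is stored as a Python integer whose leftmost bit stands for d1; that encoding and the names `DOCS` and `MASK` are choices made for illustration, and only the six columns shown in the example are used.

```python
DOCS = ["d1", "d2", "d3", "d4", "d5", "d6"]

# One bit vector per term, copied from the incidence matrix (leftmost bit = d1).
incidence = {
    "Antony":    0b110001,
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Python integers are unbounded, so the complement is masked to six bits.
MASK = (1 << len(DOCS)) - 1
not_calpurnia = ~incidence["Calpurnia"] & MASK

# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia

# Turn the resulting bit vector back into document names.
bits = format(result, f"0{len(DOCS)}b")
matches = [doc for doc, bit in zip(DOCS, bits) if bit == "1"]
```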

Problems with index
Usually IR is done on a very large document collection (or corpus). For instance, assume we have:
- 1 million documents,
- each document is about 1,000 words (2-3 book pages),
- each word is about 6 bytes.
Then the document collection is about 6 gigabytes (GB) in size, with around 500,000 distinct terms.
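The size estimate follows from multiplying the three assumed figures. A quick check, using decimal gigabytes (10^9 bytes); the variable names are illustrative only:

```python
# Figures assumed on the slide.
num_docs = 1_000_000
words_per_doc = 1_000
bytes_per_word = 6

# Total number of word occurrences in the collection.
total_words = num_docs * words_per_doc

# Raw text size in bytes and in decimal gigabytes.
collection_bytes = total_words * bytes_per_word
collection_gb = collection_bytes / 10**9
```

This gives one billion word occurrences and 6 GB of text.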

Problems with index
The term-document matrix would be too big: a 500K * 1M matrix has half a trillion 0s and 1s, which would not fit in a computer's memory.
Almost all entries are 0s (sparse data): with about 1,000 words per document, the matrix contains at most one billion 1s, so at least 99.8% of the entries are 0.
It is better to record only the things that do occur, that is, the 1s. This is the idea behind the inverted index.
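A minimal sketch of an inverted index, built over the four documents of Example 1 with integer ids 1 to 4 standing in for D1 to D4. The function names are hypothetical, and the merge-style intersection is one common way to answer an AND query over sorted postings lists, not something the slides specify.

```python
from collections import defaultdict

# Documents of Example 1, keyed by an integer document id.
docs = {
    1: "computer information retrieval",
    2: "computer retrieval",
    3: "information",
    4: "computer information",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(p1, p2):
    """Walk two sorted postings lists in step, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

index = build_inverted_index(docs)

# information AND retrieval
answer = intersect(index["information"], index["retrieval"])
```

Only the term-document pairs that actually occur are stored, so the space used grows with the number of 1s rather than with the full matrix size.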
