
Introduction to Information Retrieval: Document Indexing and Parsing Techniques
Explore the key concepts of document indexing and parsing in information retrieval, including the basic indexing pipeline, complications with document formats and languages, and considerations for defining a document. Learn about the challenges and solutions in handling diverse document types and languages efficiently.
Presentation Transcript
Introduction to Information Retrieval: Document ingestion
Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer -> token stream: Friends Romans Countrymen
Linguistic modules -> modified tokens: friend roman countryman
Indexer -> inverted index: friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16
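The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the tiny hand-written lemma table standing in for the linguistic modules, and the three sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenizer: split raw text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Toy stand-in for the linguistic modules: lowercase the token, then
    look it up in a small hand-written lemma table (hypothetical)."""
    lemmas = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}
    lowered = token.lower()
    return lemmas.get(lowered, lowered)


def build_index(docs):
    """Indexer: map each normalized term to the sorted doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical collection; the doc IDs are chosen only for illustration.
docs = {1: "Romans", 2: "Friends, Romans, countrymen.", 4: "Friends"}
index = build_index(docs)
```

Each stage is a separate function, so a real stemmer or lemmatizer could replace `normalize` without touching the tokenizer or the indexer.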
Sec. 2.1 Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in?
What character set is in use? (CP1252, UTF-8, ...)
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically.
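As a rough illustration of the heuristic approach, the sketch below guesses a format from well-known leading bytes and guesses a character set by trial decoding. The function names and the returned labels are assumptions for this example; real systems typically use more signatures and statistical detection.

```python
def guess_format(data: bytes) -> str:
    """Guess a coarse file format from leading 'magic' bytes."""
    head = data[:1024]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # ZIP container; modern Word/Excel files (.docx/.xlsx) are ZIP archives.
        return "zip-container"
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file; legacy .doc/.xls files use this container.
        return "ole2"
    lowered = head.lstrip().lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"


def guess_charset(data: bytes) -> str:
    """Guess a character set by trial decoding: strict UTF-8 first,
    then CP1252, then Latin-1 (which accepts every byte) as a last resort."""
    for encoding in ("utf-8", "cp1252"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "latin-1"
```

Trying UTF-8 first is a deliberate choice: text in a legacy single-byte encoding rarely happens to be valid UTF-8, whereas almost any byte sequence decodes under CP1252, so the order of the attempts matters.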
Sec. 2.1 Complications: Format/language
Documents being indexed can include docs from many different languages.
A single index may contain terms from many languages.
Sometimes a document or its components can contain multiple languages/formats:
A French email with a German pdf attachment.
A French email quoting clauses from an English-language contract.
There are commercial and open source libraries that can handle a lot of this.
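To make the mixed-format case concrete, the sketch below builds a small email with a PDF attachment using Python's standard `email` package and lists the leaf parts, each of which may need its own format and language detection. The message text, the file name `bericht.pdf`, and the helper name `leaf_parts` are invented for the example.

```python
from email.message import EmailMessage


def leaf_parts(msg):
    """Return (content type, filename) for every non-container MIME part."""
    return [
        (part.get_content_type(), part.get_filename())
        for part in msg.walk()
        if not part.is_multipart()
    ]


# Hypothetical message: a French body with a (fake) German PDF attachment.
msg = EmailMessage()
msg["Subject"] = "Rapport"
msg.set_content("Bonjour, veuillez trouver le rapport en pièce jointe.")
msg.add_attachment(
    b"%PDF-1.4 placeholder bytes",
    maintype="application",
    subtype="pdf",
    filename="bericht.pdf",
)

parts = leaf_parts(msg)
```

Here `parts` holds one `text/plain` entry for the body and one `application/pdf` entry for the attachment, so an ingestion system can route each part to a different parser.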
Sec. 2.1 Complications: What is a document?
Our queries return "documents", but there are often interesting questions of grain size. What is a unit document?
A file?
An email? (Perhaps one of many in a single mbox file)
What about an email with 5 attachments?
A group of files (e.g., PPT or LaTeX split over HTML pages)
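One possible answer to the mbox question is to treat each message, rather than the file, as the unit document. The sketch below does this with Python's standard `mailbox` module; the sample mbox content, the file name, and the function name are assumptions made for illustration.

```python
import mailbox
import os
import tempfile


def messages_as_documents(mbox_path):
    """Split one mbox file into per-message units: (subject, body) pairs."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [(msg["Subject"], msg.get_payload().strip()) for msg in box]
    finally:
        box.close()


# Hypothetical two-message mbox; each message starts with a "From " line.
RAW = (
    b"From alice@example.com Mon Jan  1 00:00:00 2024\n"
    b"Subject: one\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From bob@example.com Mon Jan  1 00:00:01 2024\n"
    b"Subject: two\n"
    b"\n"
    b"body two\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "wb") as fh:
        fh.write(RAW)
    units = messages_as_documents(path)
```

A single file yields two index units here. Whether attachments should become further units of their own is the same grain-size decision, taken one level down.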