
Understanding Stemming and Lemmatization in Information Retrieval
Learn about stemming and lemmatization in information retrieval, where stemming reduces terms to their roots before indexing, while lemmatization reduces inflectional/variant forms to their base form. Explore popular algorithms like Porter's algorithm and other stemmers, and understand the language-specific transformations involved in these processes.
Introduction to Information Retrieval: Stemming and Lemmatization

Lemmatization (Sec. 2.2.4)
Reduce inflectional/variant forms to base form. E.g.:
  am, are, is → be
  car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color
Lemmatization implies doing "proper" reduction to dictionary headword form.
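
The lemmatization examples on this slide can be sketched as a dictionary lookup. This is a minimal illustration only: the `LEMMAS` table and the `lemmatize` helper are hypothetical names, and the table covers just the words from the slide, whereas a real lemmatizer relies on a full lexicon and morphological analysis.

```python
# Toy lemma table (an assumption for illustration, not a real lexicon).
# It covers only the word forms used in the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens):
    """Map each token to its headword; unknown tokens pass through lowercased."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(sentence.split())))
# prints: the boy car be different color
```

Unlike a stemmer, the lookup maps "are" to "be" even though the two share no common prefix, which is the sense in which lemmatization reduces to a dictionary headword.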

Stemming
Reduce terms to their "roots" before indexing.
"Stemming" suggests crude affix chopping:
  language dependent
  e.g., automate(s), automatic, automation all reduced to automat.
Example:
  for example compressed and compression are both accepted as equivalent to compress.
  → for exampl compress and compress ar both accept as equival to compress

Porter's algorithm
Commonest algorithm for stemming English.
  Results suggest it's at least as good as other stemming options.
Conventions + 5 phases of reductions:
  phases applied sequentially
  each phase consists of a set of commands
  sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
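
The longest-suffix convention can be sketched as follows. This is a minimal illustration, not the full algorithm: `STEP_1A` lists the four plural-handling rules of Porter's first step, and `apply_compound` is a hypothetical helper name chosen for this example.

```python
# Rules of one compound command, as (suffix, replacement) pairs.
# These four are the plural-handling rules from Porter's first step.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_compound(word, rules):
    """Apply the single rule whose suffix is the longest one matching `word`."""
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word  # no rule in this command applies
    suffix, repl = max(matches, key=lambda pair: len(pair[0]))
    return word[:len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_compound(w, STEP_1A))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The "ss" → "ss" rule looks like a no-op, but it matters under this convention: because it is longer than "s", it wins for "caress" and stops the final "s" from being stripped.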

Typical rules in Porter
  sses → ss
  ies → i
  ational → ate
  tional → tion
Rules sensitive to the weight (measure) of the word:
  (m>1) EMENT → (empty)
    replacement → replac
    cement → cement
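
The (m>1) condition refers to Porter's "measure" of the stem that remains once the suffix is removed: the number of vowel-consonant sequences it contains. The sketch below is an illustration under that definition; the helper names `is_consonant`, `measure`, and `strip_ement` are made up for this example, and only the single EMENT rule is implemented rather than the whole phase it belongs to.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only
    when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m

def strip_ement(word):
    """Apply only the rule (m>1) EMENT -> empty; leave other words unchanged."""
    if word.endswith("ement"):
        stem = word[:-len("ement")]
        if measure(stem) > 1:
            return stem
    return word

print(strip_ement("replacement"))  # replac  (stem "replac" has m = 2)
print(strip_ement("cement"))       # cement  (stem "c" has m = 0)
```

The measure condition is what keeps short words such as "cement" from being chopped down to a meaningless stem.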

Other stemmers
Other stemmers exist:
  Lovins stemmer
    http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
    Single-pass, longest suffix removal (about 250 rules)
  Paice/Husk stemmer
  Snowball
Full morphological analysis (lemmatization):
  At most modest benefits for retrieval

Language-specificity
The above methods embody transformations that are
  language-specific, and often
  application-specific.
These are "plug-in" addenda to the indexing process.
Both open source and commercial plug-ins are available for handling these.

Does stemming help?
English: very mixed results. Helps recall for some queries but harms precision on others.
  E.g., operative (dentistry) → oper
Definitely useful for Spanish, German, Finnish, ...
  30% performance gains for Finnish!
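
The recall/precision trade-off can be illustrated with a toy inverted index. Everything in this sketch is an assumption made for illustration: the three documents are invented, and the `STEMS` table stands in for a real stemmer's output (only the "operative" → "oper" mapping comes from the slide).

```python
from collections import defaultdict

# Stand-in for a stemmer: a hypothetical lookup table, not a real algorithm.
STEMS = {"operative": "oper", "operating": "oper", "operational": "oper"}

# Invented documents, one topic each.
DOCS = {
    1: "operative dentistry techniques",
    2: "operating system kernels",
    3: "operational research methods",
}

def normalize(token, stem):
    """Return the index term for a token, stemmed or unchanged."""
    return STEMS.get(token, token) if stem else token

def build_index(docs, stem=False):
    """Build term -> set of document ids, optionally stemming each token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token, stem)].add(doc_id)
    return index

def search(index, query_term, stem=False):
    """Look up one query term, normalized the same way as the index."""
    return sorted(index.get(normalize(query_term, stem), set()))

plain = build_index(DOCS)
stemmed = build_index(DOCS, stem=True)
print(search(plain, "operative"))               # [1]
print(search(stemmed, "operative", stem=True))  # [1, 2, 3]
```

With stemming, the query for "operative" also retrieves the operating-system and operations-research documents: more documents match, but two of the three are off-topic for a dentistry query.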