Building an Inverted Index with Dr. Claudia Pearce

how to build an inverted index n.w
1 / 30
Embed
Share

In this tutorial by Dr. Claudia Pearce, learn how to build an inverted index in phase 1 by parsing, down-casing words, ignoring punctuation, and creating files for each word. Incremental counts for terms like "he," "likes," "wink," and "drink" are demonstrated in documents D1 and D2.

  • Inverted Index
  • Text Processing
  • Data Analysis
  • Information Retrieval

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. How to Build an Inverted Index Dr. Claudia Pearce

  2. What we did in Phase 1 For each document: parse, remove tokens one at a time, down-case all words, Ignore all punctuation, write each word to a file, each on a new line Save each file created from doc-number.html in a new directory as doc-number.txt D1 stored as 1.txt D2 stored as 2.txt .

  3. 1|1 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink Doc ID = i |Di|= number of terms in Di he 1 df Hash table bucket (Dictionary Term in Python) 1 1 Count of word (he) in document Often written as tf doc id

  4. 1|2 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink likes 1 he 1 1 1 1 1

  5. 1|3 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink wink 1 likes 1 he 1 1 1 1 1 1 1

  6. 1|4 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink 4 wink 1 likes 1 he 1 1 1 1 2 1 1 Increment the count for he in D1

  7. 1|5 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink wink 1 likes 1 he 1 1 1 1 2 1 2 Increment the count for likes in D1

  8. 1|6 2|0 3|0 5|0 4|0 D1: he likes wink he likes drink wink 1 likes 1 he 1 drink 1 1 1 1 2 1 2 1 1

  9. 1|6 2|1 3|0 5|0 4|0 D2: he likes drink drink drink likes 1 wink 1 drink 1 he 2 df +1 1 1 1 2 1 2 1 1 2 1

  10. 1|6 2|2 3|0 5|0 4|0 D2: he likes drink drink drink df+1 likes 2 wink 1 drink 1 he 2 1 1 1 2 1 2 1 1 2 1 2 1

  11. 1|6 2|3 3|0 5|0 4|0 D2: he likes drink drink drink likes 2 wink 1 drink 2 he 2 1 1 1 2 1 2 1 1 2 1 2 1 2 1

  12. 1|6 2|4 3|0 5|0 4|0 D2: he likes drink drink drink likes 2 wink 1 drink 2 he 2 1 1 1 2 1 2 1 1 2 2 2 1 2 1 Increment the count for drink in D2

  13. 1|6 2|5 3|0 5|0 4|0 D2: he likes drink drink drink likes 2 wink 1 drink 2 he 2 1 1 1 2 1 2 1 1 2 3 2 1 2 1 Increment the count for drink in D2

  14. 1|6 2|5 3|1 5|0 4|0 D3: thing he likes drink ink likes 2 wink 1 drink 2 thing 1 he 2 1 1 1 2 1 2 1 1 3 1 2 3 2 1 2 1

  15. 1|6 2|5 3|2 5|0 4|0 D3: thing he likes drink ink likes 2 wink 1 drink 2 thing 1 he 3 1 1 1 2 1 2 1 1 3 1 2 3 2 1 2 1 3 1

  16. 1|6 2|5 3|3 5|0 4|0 D3: thing he likes drink ink likes 3 wink 1 drink 2 thing 1 he 3 1 1 1 2 1 2 1 1 3 1 2 3 2 1 2 1 3 1 3 1

  17. 1|6 2|5 3|4 5|0 4|0 D3: thing he likes drink ink likes 3 wink 1 drink 3 thing 1 he 3 1 1 1 2 1 2 1 1 3 1 2 3 2 1 2 1 3 1 3 1 3 1

  18. 1|6 2|5 3|5 5|0 4|0 D3: thing he likes drink ink ink 1 likes 3 wink 1 drink 3 thing 1 he 3 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 3 1 3 1 3 1

  19. 1|6 2|5 3|5 5|0 4|1 D4: ink he likes drink pink ink 2 likes 3 wink 1 drink 3 thing 1 he 3 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1

  20. 1|6 2|5 3|5 5|0 4|2 D4: ink he likes drink pink ink 2 likes 3 wink 1 drink 3 thing 1 he 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 4

  21. 1|6 2|5 3|5 5|0 4|3 D4: ink he likes drink pink ink 2 likes 4 wink 1 drink 3 thing 1 he 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 4 1 4

  22. 1|6 2|5 3|5 5|0 4|4 D4: ink he likes drink pink ink 2 likes 4 wink 1 drink 4 thing 1 he 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4

  23. 1|6 2|5 3|5 5|0 4|5 D4: ink he likes drink pink ink 2 pink 1 likes 4 wink 1 drink 4 thing 1 he 4 1 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4

  24. ink 2 pink 1 likes 4 wink 1 drink 4 thing 1 he 5 D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4 1 5 1|6 2|5 3|5 5|1 4|5

  25. ink 2 pink 1 likes 5 wink 1 drink 4 thing 1 he 5 D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 2 3 2 1 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4 1 1 5 5 1|6 2|5 3|5 5|2 4|5

  26. ink 2 pink 1 likes 5 wink 2 drink 4 thing 1 he 5 D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 1 2 3 2 1 5 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4 1 1 5 5 1|6 2|5 3|5 5|3 4|5

  27. ink 2 pink 1 likes 5 wink 2 drink 5 thing 1 he 5 D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 1 2 3 2 1 5 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4 1 1 5 5 1 5 1|6 2|5 3|5 5|4 4|5

  28. ink 2 pink 2 likes 5 wink 2 drink 5 thing 1 he 5 D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 1 1 2 3 2 1 5 5 2 1 1 4 3 1 3 1 3 1 1 1 4 4 1 4 1 1 5 5 1 5 1|6 2|5 3|5 5|5 4|5

  29. SUM(dfi) =23 ink 3 pink 2 likes 5 wink 2 drink 5 thing 1 he 5 Is the number of postings in your index D5: he likes wink drink pink ink 1 4 3 1 1 1 3 1 1 2 1 2 1 1 1 1 2 3 2 1 5 5 2 1 1 4 3 1 3 1 3 1 1 5 1 1 4 4 1 4 1 5 1 1 5 5 1|6 2|5 3|5 5|6 4|5

  30. Storage/Query: Inverted Index In memory if you have enough Now there are Petabyte Memory Machines - $$$$ Term-at-a-time On disc Pull up a list or more at a time from disc Doc-at-a-time Query mode If Query is Q: pink ink Then retrieve the resulting lists from disc Use Linear Merge to match up documents to score ink 3 pink 2 1 4 3 1 1 5 1 4 1 5

More Related Content