Analyzing Text Length and Lexical Diversity in NLTK Book Texts

LING 388: Computers and Language
Explore the relationship between text length and lexical diversity across the NLTK Book texts by ranking them on both measures. Investigate the hypothesis that lexical diversity decreases as text length increases, with supporting Python code snippets and sorted results.

  • NLTK Book
  • Text Length
  • Lexical Diversity
  • Python Code
  • Data Analysis


Presentation Transcript


  1. LING 388: Computers and Language, Lecture 22

  2. Administrivia
     • Homework 10 Review
     • Term Project Proposal
     • On Stylometry
     • Homework 11: easy! Due Sunday midnight.

  3. Homework 10
     Using text1, text2, ..., text9 from nltk.book, explore the hypothesis that lexical diversity decreases with text length:
       from nltk.book import *
       texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
     Hint: you can do it manually for each text, or use sorted() with key=len.
     https://docs.python.org/3/howto/sorting.html
     • Rank text1, ..., text9 in order of text length. Report the lengths.
     • Rank text1, ..., text9 in order of lexical diversity. Report the diversity scores.
     • Is the claim true?

  4. Homework 10 Review
       from nltk.book import *
       texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
       sorted(texts, key=len, reverse=True)
       [<Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>,
        <Text: Sense and Sensibility by Jane Austen 1811>, <Text: Wall Street Journal>,
        <Text: The Man Who Was Thursday by G . K . Chesterton 1908>, <Text: Chat Corpus>,
        <Text: The Book of Genesis>, <Text: Monty Python and the Holy Grail>,
        <Text: Personals Corpus>]
       [len(text) for text in sorted(texts, key=len, reverse=True)]
       [260819, 149797, 141576, 100676, 69213, 45010, 44764, 16967, 4867]

  5. Homework 10 Review
     Lexical diversity: len(set(text))/len(text)
       sorted(texts, key=lambda x: len(set(x))/len(x), reverse=True)
       [<Text: Personals Corpus>, <Text: Chat Corpus>,
        <Text: Monty Python and the Holy Grail>, <Text: Wall Street Journal>,
        <Text: The Man Who Was Thursday by G . K . Chesterton 1908>,
        <Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>,
        <Text: The Book of Genesis>, <Text: Sense and Sensibility by Jane Austen 1811>]
       ["{:.3f}".format(len(set(text))/len(text)) for text in sorted(texts, key=lambda x: len(set(x))/len(x), reverse=True)]
       ['0.228', '0.135', '0.128', '0.123', '0.098', '0.074', '0.066', '0.062', '0.048']
       [len(text) for text in sorted(texts, key=lambda x: len(set(x))/len(x), reverse=True)]
       [4867, 45010, 16967, 100676, 69213, 260819, 149797, 44764, 141576]
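
     To see both rankings side by side, each text's name, length, and diversity can be tabulated in one pass. A minimal sketch; the helper diversity() is ours, and it relies on the name attribute that nltk.text.Text objects carry:

       from nltk.book import *

       texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]

       def diversity(text):
           """Type-token ratio: distinct words divided by total words."""
           return len(set(text)) / len(text)

       # One row per text, from most to least diverse
       for text in sorted(texts, key=diversity, reverse=True):
           print(f"{text.name[:40]:40}  len={len(text):>7}  diversity={diversity(text):.3f}")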

  6. Homework 10 Review
       >>> import matplotlib.pyplot as plt
       >>> x = [len(text) for text in sorted(texts, key=lambda x: len(set(x))/len(x), reverse=True)]
       >>> y = [len(set(text))/len(text) for text in sorted(texts, key=lambda x: len(set(x))/len(x), reverse=True)]
       >>> plt.xlabel("Text length (in words)")
       Text(0.5, 0, 'Text length (in words)')
       >>> plt.ylabel("Lexical diversity")
       Text(0, 0.5, 'Lexical diversity')
       >>> plt.scatter(x, y)
       <matplotlib.collections.PathCollection object at 0x7f9dc0174dc0>
       >>> plt.show()
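
     With unlabeled points it is hard to tell which text is which on the scatter plot. One possible refinement (again leaning on the Text name attribute; the truncation to 20 characters is an arbitrary choice) annotates each point:

       import matplotlib.pyplot as plt
       xs = [len(t) for t in texts]
       ys = [len(set(t)) / len(t) for t in texts]
       plt.scatter(xs, ys)
       for t, x, y in zip(texts, xs, ys):
           plt.annotate(t.name[:20], (x, y), fontsize=8)   # label each point with its text
       plt.xlabel("Text length (in words)")
       plt.ylabel("Lexical diversity")
       plt.show()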

  7. Term Project Proposal
     • From Lecture 1: the term project (e.g. build some application) is worth 25% of the grade.
     • Ask yourself: what are you interested in exploring?
     • It must involve some use of what we've covered in terms of programming, e.g. straight Python or NLTK.
     • Propose some task, experiment, or application you plan to prototype or build (it doesn't have to be a complete application).
     • Write a one-page summary and send it to me (sandiway@email.arizona.edu) for project approval.
     • Soft deadline: due by the end of this week.

  8. On stylometry
     "The Characteristic Curves of Composition" by T. C. Mendenhall (1887).
     Course website: Mendenhall1887.pdf

  9. On stylometry
     [Figure: Mendenhall's characteristic curve for Charles Dickens' Oliver Twist]

  10. On stylometry: task 1
     Let's test Mendenhall's hypothesis on Moby Dick (text1) vs. Sense and Sensibility (text2).
     Task 1: write one list comprehension each for text1 and text2 that transforms words into word lengths. Save the resulting lists of numbers as len1 and len2, respectively.
     Note: len(len1) and len(text1) should be the same; likewise for len2 and text2.
     Example: ['This', 'is', 'a', 'test', '.'] transforms into [4, 2, 1, 4, 1]

  11. On stylometry: task 1
     Task 1 solution:
       >>> from nltk.book import *
       >>> len1 = [len(word) for word in text1]
       >>> len(len1)
       260819
       >>> len(text1)
       260819
       >>> len2 = [len(word) for word in text2]

  12. On stylometry: task 2
     Task 2: use NLTK's FreqDist() to plot the corpora len1 and len2.
     Name some reasons why it's difficult to compare the two graphs.

  13. On stylometry: task 2
     Task 2: use NLTK's FreqDist() to plot the corpora len1 and len2.
       >>> fd1 = FreqDist(len1)
       >>> fd2 = FreqDist(len2)
       >>> fd1.tabulate(10)
           3     1     4     2     5     6     7     8     9    10
       50223 47933 42345 38513 26597 17111 14399  9966  6428  3528
       >>> fd2.tabulate(10)
           3     2     1     4     5     6     7     8     9    10
       28839 24826 23009 21352 11438  9507  8158  5676  3736  2596
       >>> fd1.plot()
       >>> fd2.plot()
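
     One answer to the comparison question is corpus size: text1 has 260,819 tokens and text2 has 141,576, so the raw counts live on different scales. FreqDist's freq() method returns a sample's count as a proportion of all outcomes; a quick sketch:

       >>> fd1.freq(3)    # proportion of 3-letter tokens in text1: 50223/260819 ≈ 0.193
       >>> fd2.freq(3)    # proportion of 3-letter tokens in text2: 28839/141576 ≈ 0.204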

  14. On stylometry: task 2
     [Figures: fd1.plot() and fd2.plot(). Word lengths appear in decreasing frequency order: [3, 1, 4, 2, 5, 6, 7, ...] for text1 vs. [3, 2, 1, 4, 5, 6, 7, ...] for text2.]

  15. On stylometry: task 2
     Let's use matplotlib.pyplot to plot them together.
       >>> import matplotlib.pyplot as plt
       >>> mx = max(max(fd2), max(fd1))
       >>> mx    # largest word length across the two distributions
       20
       >>> plt.hist(len1, range(1, mx+1), histtype='step')
       >>> plt.hist(len2, range(1, mx+1), histtype='step')
       >>> ax = plt.gca()
       >>> ax.set_xticks(range(1, mx+1))
       >>> plt.show()
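
     With two overlaid step curves it helps to mark which is which. A small variation (the label strings are our choice) adds a legend:

       >>> plt.hist(len1, range(1, mx+1), histtype='step', label='Moby Dick')
       >>> plt.hist(len2, range(1, mx+1), histtype='step', label='Sense and Sensibility')
       >>> plt.legend()    # identify the two step curves
       >>> plt.gca().set_xticks(range(1, mx+1))
       >>> plt.show()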

  16. On stylometry: task 2
     [Figure: overlaid step histograms; y-axis: raw counts, x-axis: word length]

  17. On stylometry: task 2
     Documentation: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.hist.html

  18. On stylometry: task 2
     Using density=True gives us a normalized (proportional) y-axis instead of the raw counts:
       >>> plt.hist(len1, range(1, mx+1), histtype='step', density=True)
       >>> plt.hist(len2, range(1, mx+1), histtype='step', density=True)
       >>> ax = plt.gca()
       >>> ax.set_xticks(range(1, mx+1))
       >>> plt.show()

  19. On stylometry: task 2
     [Figure: overlaid density histograms; y-axis: proportion, x-axis: word length]

  20. On stylometry: task 3
     Task 3: Mendenhall's method uses groups of words, e.g. 10,000 at a time; text1 contains >260,000 words.
     Using only the first 100,000 words of text1, let's divide them into 10 groups of 10,000.
     Produce lists l1, l2, ..., l10 (the word lengths for each group of 10,000) from len1.

  21. On stylometry: task 3
     Let's do it manually (first):
       >>> l1 = len1[0:10000]
       >>> l2 = len1[10000:20000]
       >>> l3 = len1[20000:30000]
       etc.
     Let's do it with a loop:
       >>> l = []
       >>> for i in range(0, 100000, 10000):
       ...     l.append(len1[i:i+10000])
       ...
       >>> len(l)
       10
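
     The same chunking can be written as a single list comprehension, equivalent to the loop above:

       >>> l = [len1[i:i+10000] for i in range(0, 100000, 10000)]
       >>> [len(group) for group in l]    # ten groups of 10,000 word lengths each
       [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]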

  22. On stylometry: task 4
     Task 4: let's overlay the frequency plots to see whether the groups of 10,000 from text1 share a consistent characteristic signature.

  23. On stylometry: task 4
     Task 4: overlay the frequency plots for the groups of 10,000 from text1.
     Let's do it with a loop over the list l (from task 3):
       >>> mx = max(len1)
       >>> mx
       20
       >>> for group in l:    # renamed from 'list' to avoid shadowing the built-in
       ...     plt.hist(group, bins=range(1, mx+1), histtype='step', density=True)
       ...
       >>> ax = plt.gca()
       >>> ax.set_xticks(range(1, mx+1))
       >>> plt.show()

  24. On stylometry: task 4
     [Figure: ten overlaid density histograms, one per 10,000-word group of text1]

  25. Homework 11
       >>> from nltk.book import *
     text2: Sense and Sensibility by Jane Austen 1811
     text7: Wall Street Journal
     • Compute task 4 for the first 100,000 words of text2 and of text7: divide each text into groups of 10,000 words (as shown in class).
     • In your opinion, do the characteristic curves of text2 seem significantly different from those of text7?
     • Show your Python work and your graphs.
     • Due date: Sunday midnight.
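
     The whole task-4 pipeline generalizes to any text, which may help with Homework 11. A minimal sketch; the function name plot_characteristic_curves is ours, not part of the assignment:

       import matplotlib.pyplot as plt
       from nltk.book import *

       def plot_characteristic_curves(text, n_words=100000, group_size=10000):
           """Overlay Mendenhall-style word-length curves for consecutive groups of a text."""
           lengths = [len(word) for word in text[:n_words]]
           mx = max(lengths)
           for i in range(0, n_words, group_size):
               # One step histogram per group; density=True puts all groups on a proportion scale
               plt.hist(lengths[i:i+group_size], bins=range(1, mx+1),
                        histtype='step', density=True)
           plt.gca().set_xticks(range(1, mx+1))
           plt.xlabel("Word length")
           plt.ylabel("Proportion")
           plt.show()

       plot_characteristic_curves(text2)    # Sense and Sensibility
       plot_characteristic_curves(text7)    # Wall Street Journal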
