Text and Text Processing CC 2007, 2011 Attribution - R.B. Allen


This presentation explores several aspects of text processing and analysis: fonts and readability, optical character recognition (OCR), reading, author identification, spell-checking, summarization, information extraction, document retrieval, and the vector representation of documents and queries. Key concepts include the importance of clear fonts for highway signs, the role of linguistic and world knowledge in OCR, the authorship analysis of the Federalist Papers, and the use of vector models for text understanding.

  • Text Processing
  • OCR
  • Reading
  • Authorship
  • Information Extraction

Uploaded on Feb 18, 2025 | 0 Views



Presentation Transcript


  1. Text and Text Processing CC 2007, 2011 Attribution - R.B. Allen

  2. Fonts. Clearview (top) is a new font developed to make highway signs more readable. At highway speeds using headlights, the Clearview font is significantly more readable than the font traditionally used for highway signs.

  3. OCR (Optical Character Recognition). Recognition process: looking for features versus matching a template. How much linguistic and world knowledge is needed for processing? Collaborative corrections.
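The template-matching side of the recognition process can be sketched as follows. This is a minimal illustration, not a real OCR engine: the 3x5 bitmaps in `TEMPLATES` and the function names are assumptions made up for the example, and a glyph is classified as whichever template differs from it in the fewest pixels.

```python
# Hypothetical 3x5 binary bitmaps ("1" = ink) for three letters.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def pixel_mismatches(glyph, template):
    """Count the pixels where the scanned glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the name of the template with the fewest mismatching pixels."""
    return min(templates, key=lambda name: pixel_mismatches(glyph, templates[name]))
```

A slightly damaged "T" (bottom pixel missing) is still closest to the "T" template, which is the appeal of the approach; a feature-based recognizer would instead look for strokes such as a horizontal bar over a vertical stem.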

  4. Reading. Teaching reading. Close reading. Literacy.

  5. Authorship of the Federalist Papers. The Federalist Papers are a series of 85 essays published in newspapers to argue for the adoption of the U.S. Constitution. James Madison and Alexander Hamilton authored almost all of them, but for some of the essays the identity of the author was lost. Because each author used a distinctive set of terms, the authorship could be determined by a Bayesian statistical analysis of the words.
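The idea of attributing a text from distinctive word usage can be sketched with a small naive Bayes classifier. This is an illustration under stated assumptions, not the analysis actually carried out on the essays: the function names, the marker-word set, and the toy training texts are all invented for the example, and equal prior probabilities for the authors are assumed.

```python
import math
from collections import Counter


def train(samples, markers):
    """Estimate smoothed log P(marker word | author) from known texts.

    samples maps an author name to a list of texts known to be theirs.
    markers is the set of words whose usage is assumed to be distinctive.
    """
    model = {}
    for author, texts in samples.items():
        counts = Counter(
            w for text in texts for w in text.lower().split() if w in markers
        )
        # Add-one (Laplace) smoothing so unseen markers keep a nonzero probability.
        total = sum(counts.values()) + len(markers)
        model[author] = {m: math.log((counts[m] + 1) / total) for m in markers}
    return model


def log_scores(model, text):
    """Sum the log-probabilities of the marker words found in a disputed text."""
    words = text.lower().split()
    return {
        author: sum(logp[w] for w in words if w in logp)
        for author, logp in model.items()
    }


def most_likely_author(model, text):
    """Pick the author with the highest score (equal priors assumed)."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)
```

With toy training data in which one author favors "upon" and "while" and the other favors "whilst" and "on", a disputed sentence containing "whilst" and "on" is assigned to the second author.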

  6. Text Processing: spell-checking; edit distance; parsing; summarization; text categorization.
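Edit distance and spell-checking fit together naturally, and a short sketch may help. The code below computes the Levenshtein distance (one common definition of edit distance, counting insertions, deletions, and substitutions) and uses it to rank spelling suggestions. The function names and the `max_distance` threshold are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for distance, v in scored if distance <= max_distance]
```

For example, `edit_distance("kitten", "sitting")` is 3, and `suggest("speling", ["spelling", "spewing", "sailing"], max_distance=1)` keeps only the two words that are one edit away.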

  7. Information Extraction. Texts (e.g., Web pages) have a lot of information, but it is not well structured. If we could extract that information, we could develop better question answering systems. Named-entity extraction. Template matching.
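Both techniques named on this slide can be sketched with regular expressions. The example is deliberately crude: the "was born in" template, the capitalized-word heuristic for candidate names, and the function names are all assumptions for illustration, and practical systems use far richer patterns or trained models.

```python
import re

# Template: "<Person> was born in <year>", where Person is two or more
# capitalized words. This pattern is a hypothetical example template.
BORN_TEMPLATE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was born in (?P<year>\d{4})"
)

# Crude named-entity heuristic: runs of two or more capitalized words.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_births(text):
    """Fill a (person, birth_year) record for every match of the template."""
    return [
        {"person": m.group("person"), "birth_year": int(m.group("year"))}
        for m in BORN_TEMPLATE.finditer(text)
    ]


def candidate_entities(text):
    """Return multi-word capitalized phrases as candidate named entities."""
    return CAPITALIZED_RUN.findall(text)
```

Applied to "Ada Lovelace was born in 1815. Later, Alan Turing was born in 1912 in London.", the template yields two structured records that a question answering system could query directly.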

  8. Text Document Retrieval Literally

  9. Vector Model. Words carry a lot of the meaning of documents. Thus, we can represent the meaning of a document fairly well with a list (i.e., a vector) of terms. Queries can also be represented as vectors. Terms can be weighted by term frequency or document frequency.
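A minimal sketch of the vector model follows. It represents documents and queries as sparse term vectors and ranks documents by cosine similarity. Two choices here are assumptions rather than anything the slide specifies: terms are weighted by term frequency times inverse document frequency (a common weighting), and tokenization is a plain lowercase whitespace split. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency: rarer terms receive larger weights."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """Sparse vector mapping each known term to tf * idf."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(doc, idf)), i) for i, doc in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Because the query is just another vector, retrieval reduces to comparing it with every document vector and sorting by similarity.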

  10. Other Text-Retrieval Techniques. For Web pages, the hyperlinks are also an indication of similarity; this was captured in Google's PageRank algorithm. Learning from users. Social network links (what your friends are looking for).
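The link-based idea behind PageRank can be sketched as a simple power iteration over a small link graph. This is a textbook-style simplification, not a description of any production system: the damping factor of 0.85, the fixed iteration count, the treatment of pages without outgoing links, and the function name are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

In a graph where pages "a" and "b" both link to "c" and "c" links back to "a", page "c" ends up with the highest score because it collects rank from two sources.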

  11. Retrieval Interfaces

  12. Indexing the Web: spidering.
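Spidering amounts to following links outward from a starting page while remembering which pages have already been seen. The sketch below shows that traversal as a breadth-first search. To stay self-contained it makes no network requests: the caller supplies a `fetch_links` function, and the names `crawl`, `fetch_links`, and `max_pages` are assumptions for illustration.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from seed and return them in visit order.

    fetch_links(url) must return the list of URLs that a page links to.
    """
    seen = {seed}          # pages already queued, so nothing is visited twice
    queue = deque([seed])  # frontier of pages waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real spider would index the page content here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A real spider would also need to download and parse pages, respect site policies, and limit its request rate; those concerns are outside this sketch.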

  13. Search Engine Business Models: advertising; ad-words; search engines linked to other services.

  14. Automated Question Answering. Recall the discussion of answering reference questions. Automated question answering: question categorization; finding the answers from a knowledgebase; synthesizing answers from the Web.

  15. Sentiment Analysis and Blog Retrieval. There can be a great advantage to knowing what the populace is thinking. Example of difficulty. Valence detection.
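Valence detection, and one reason it is difficult, can be illustrated with a tiny lexicon-based scorer. The word lists, the rule that a negator flips the next sentiment word, and the function name are all assumptions for this sketch; real systems use much larger lexicons or trained classifiers.

```python
# Hypothetical miniature sentiment lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}
NEGATORS = {"not", "never", "no"}


def valence(text):
    """Return a signed score: positive words add 1, negative words subtract 1.

    A negator flips the polarity of the next sentiment word that follows it.
    """
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")
        if word in NEGATORS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score
```

The scorer handles "This is not good" correctly (score -1), but it rates the sarcastic "Oh great, another delay" as positive (score 1), a small example of why word-level valence is hard to get right.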

  16. Summarization. What do we mean by a summary? Techniques: extractive summarization. Teaching summarization.
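Extractive summarization selects existing sentences rather than writing new ones. One simple way to do that, sketched below, is to score each sentence by the average corpus frequency of its content words and keep the top scorers in their original order. The scoring rule, the small stopword list, and the function name are assumptions for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "are", "was", "in"}


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Sentences built from words that recur throughout the text score highest, while an off-topic sentence whose words appear only once tends to be dropped.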

  17. Translation. Surface translation. Pair-wise translations versus a common, language-neutral representation. Try a round-trip translation. Increasingly, statistical methods are used for improving translations.

  18. Speech Processing. Representing speech with phonemes, the basic sound units. In English there are about 44 phonemes. Vowels vs. consonants. Types of consonants: plosives, fricatives. Many applications: speaker identification; word spotting; language recognition.

  19. Speech Recognition. Digitize the sound waves. Spectrograms: from waveform to frequency. Can we find the phonemes? Look for formants.

  20. Automatically processing speech: creating a spectrogram. Original sound wave; sampled sound wave; wave representation; frequency representation.
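The step from a sampled sound wave to a frequency representation can be sketched in pure Python. The code cuts the samples into overlapping frames, applies a Hann window to each, and takes the magnitude of a discrete Fourier transform. The frame size, hop size, and function names are assumptions for illustration, and the direct DFT used here is far slower than the FFT a practical system would use.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 up to half the sampling rate."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of bin magnitudes per overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1)))
            for i, x in enumerate(frame)
        ]
        frames.append(dft_magnitudes(windowed))
    return frames


def sine_wave(freq_hz, sample_rate, n_samples):
    """Generate a sampled pure tone to use as test input."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)]
```

With a sampling rate of 8000 Hz and 64-sample frames, each bin is 125 Hz wide, so a 1000 Hz tone shows up as a peak in bin 8 of every frame. Speech would instead show several broad peaks per frame, which is what a formant search looks for.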

  21. Phonemes. Sounds which differentiate meaning: Bit, But, Bat, Bet, Robot. Types of phonemes: vowels; consonants, including fricatives (f, s), nasals (m, n), plosives (p, t, k), and the flap (tt in "utter"). Non-English sounds: trill (Spanish "perro"), click (!Kung).

  22. Processing Speech to Find Formants
