Text and Text Processing CC 2007, 2011 Attribution - R.B. Allen
This presentation explores various aspects of text processing and analysis: fonts, OCR, reading, authorship, text processing, information extraction, document retrieval, and the vector model. Topics include readability, character recognition, linguistic knowledge, author identification, spell-checking, summarization, information extraction, document retrieval, and the vector representation of documents and queries. Key concepts include the importance of clear fonts for highway signs, the role of linguistic and world knowledge in OCR, the authorship analysis of the Federalist Papers, and the use of vector models for text understanding.
Presentation Transcript
Fonts: Clearview (top) is a new font developed to make highway signs more readable. At highway speeds using headlights, the Clearview font is significantly more readable than the font traditionally used for highway signs.
OCR (Optical Character Recognition): Recognition process: looking for features versus matching a template. How much linguistic and world knowledge is needed for processing? Collaborative corrections.
Reading: Teaching reading. Close reading. Literacy.
Authorship of the Federalist Papers: The Federalist Papers are a series of 85 essays published in newspapers to argue for the adoption of the U.S. Constitution. Alexander Hamilton and James Madison authored almost all of them, but for some of the essays the identity of the author was lost. Because each author used a distinctive set of terms, the authorship could be determined by a Bayesian statistical analysis of the words.
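The idea of attributing a text from its word usage can be sketched with a small naive Bayes classifier. This is a simplified stand-in and not the statistical model used in the original Federalist study. The marker-word list, the author labels, and the training sentences in the usage note are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical marker words; a real study would choose these from the data.
MARKER_WORDS = ["upon", "while", "whilst", "on"]


def word_counts(text):
    """Count how often each marker word occurs in the text."""
    counts = Counter(text.lower().split())
    return {word: counts[word] for word in MARKER_WORDS}


def train(samples_by_author):
    """Estimate log P(word | author) with add-one smoothing."""
    model = {}
    for author, samples in samples_by_author.items():
        totals = Counter()
        for text in samples:
            totals.update(word_counts(text))
        denominator = sum(totals.values()) + len(MARKER_WORDS)
        model[author] = {
            word: math.log((totals[word] + 1) / denominator)
            for word in MARKER_WORDS
        }
    return model


def log_posteriors(model, text):
    """Unnormalized log posterior per author, assuming equal priors."""
    counts = word_counts(text)
    prior = math.log(1.0 / len(model))
    return {
        author: prior + sum(counts[w] * log_probs[w] for w in MARKER_WORDS)
        for author, log_probs in model.items()
    }


def attribute(model, text):
    """Return the author with the highest posterior score."""
    scores = log_posteriors(model, text)
    return max(scores, key=scores.get)
```

Usage, with made-up samples: `model = train({"author_a": ["upon the whole while we wait upon events"], "author_b": ["whilst on the other hand whilst we rely on reason"]})`, then `attribute(model, "whilst they rely on it")`.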
Text Processing: Spell-checking. Edit distance. Parsing. Summarization. Text categorization.
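Edit distance is one common basis for spell-checking. A minimal sketch, assuming the Levenshtein definition (single-character insertions, deletions, and substitutions, each costing 1); the `suggest` helper and its vocabulary argument are hypothetical names for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete from source
                current[j - 1] + 1,                   # insert into source
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(word, vocabulary):
    """Return the vocabulary entry closest to word (first one wins on ties)."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
```

For example, `suggest("speling", ["spelling", "spilling", "selling"])` picks the candidate with the fewest edits. Practical spell-checkers usually also weigh word frequency and keyboard layout, which this sketch ignores.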
Information Extraction: Texts (e.g., Web pages) contain a lot of information, but it is not well structured. If we could extract that information, we could develop better question-answering systems. Named-entity extraction. Template matching.
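Template matching can be illustrated with a regular expression that fills the slots of a fixed sentence pattern. This is only a sketch: the "was born in" template, the function name, and the crude capitalized-words rule for spotting a person's name are assumptions for illustration, and real named-entity extraction is considerably more involved.

```python
import re

# Template: "<Person> was born in <year>".
# A "person" is approximated as two or more capitalized words in a row.
BORN_TEMPLATE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was born in (?P<year>\d{4})"
)


def extract_births(text):
    """Return one structured record per match of the template."""
    return [
        {"person": match.group("person"), "year": int(match.group("year"))}
        for match in BORN_TEMPLATE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "James Madison was born in 1751. John Jay was born in 1745 in New York."
    for record in extract_births(sample):
        print(record)
```

The extracted records are structured (a name and a year), so they could be stored in a table and queried, which is what makes extraction useful for question answering.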
Text Document Retrieval Literally
Vector Model: Words carry a lot of the meaning of documents. Thus, we can represent the meaning of a document fairly well with a list (i.e., a vector) of terms. Queries can also be represented as vectors. Terms can be weighted by term frequency and by inverse document frequency, which down-weights terms that occur in many documents.
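The vector model can be sketched in a few lines: each document and the query become sparse term-weight vectors, and documents are ranked by cosine similarity to the query. The weighting below (raw term frequency times `log(N / df) + 1`) is one common choice and an assumption here, as are the whitespace tokenizer and all function names.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(document) for document in documents]
    n_documents = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    document_frequency = Counter(
        term for tokens in tokenized for term in set(tokens)
    )
    idf = {
        term: math.log(n_documents / count) + 1.0
        for term, count in document_frequency.items()
    }
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(documents)
    # Query terms never seen in the collection get weight 0.
    query_vector = {
        term: tf * idf.get(term, 0.0)
        for term, tf in Counter(tokenize(query)).items()
    }
    scores = [(cosine(query_vector, vector), i) for i, vector in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]
```

Calling `rank("federalist constitution", documents)` returns the indices of `documents` with the best match first.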
Other Text-Retrieval Techniques: For Web pages, the hyperlinks are also an indication of similarity. This was captured in Google's PageRank algorithm. Learning from users. Social network links (what your friends are looking for).
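The link-based idea can be sketched as a basic power-iteration version of PageRank over a small link graph. This is a textbook-style simplification, not Google's production algorithm; the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(
        set(links) | {target for targets in links.values() for target in targets}
    )
    n_pages = len(pages)
    rank = {page: 1.0 / n_pages for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n_pages for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n_pages
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

Pages that are linked to by many pages, or by highly ranked pages, end up with higher scores.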
Retrieval Interfaces
Indexing the Web: Spidering.
Search Engine Business Models: Advertising. Ad-words. Search engines linked to other services.
Automated Question Answering: Recall the discussion of answering reference questions. Automated question answering. Question categorization. Finding the answers. From a knowledge base. Synthesizing answers from the Web.
Sentiment Analysis and Blog Retrieval: There can be a great advantage to knowing what the populace is thinking. Example of difficulty. Valence detection.
Summarization: What do we mean by a summary? Techniques. Extractive summarization. Teaching summarization.
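Extractive summarization selects existing sentences rather than writing new ones. A minimal sketch, assuming a simple word-frequency scoring scheme; the stopword list, the sentence splitter, and the function names are illustrative assumptions, not a method given in the slides.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "to", "in", "is", "was", "each"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        # Average document-wide frequency of the sentence's content words.
        return sum(frequency[w] for w in words) / len(words)

    ranked = sorted(
        range(len(sentences)), key=lambda i: (-score(sentences[i]), i)
    )
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Sentences that reuse the document's most frequent content words score highest, so off-topic sentences tend to be dropped first.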
Translation: Surface translation. Pair-wise translations versus a common, language-neutral representation. Try a round-trip translation. Increasingly, statistical methods are used for improving translations.
Speech Processing: Representing speech with phonemes, the basic sound units. In English there are about 40 to 45 phonemes, depending on the dialect and the analysis. Vowels vs. consonants. Types of consonants: plosives, fricatives. Many applications: speaker identification, word spotting, language recognition.
Speech Recognition: Digitize the sound waves. Spectrograms: from waveform to frequency. Can we find the phonemes? Look for formants.
Automatically Processing Speech: Creating a spectrogram. Figure panels: original sound wave, sampled sound wave, wave representation, frequency representation.
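The path from a sampled sound wave to a frequency representation can be sketched with a plain discrete Fourier transform applied to short frames. This is an illustrative sketch only: it uses a synthetic sine tone instead of recorded speech, applies no window function, and uses a slow direct DFT where real systems would use an FFT. All function names and parameter values are assumptions.

```python
import cmath
import math


def sample_wave(frequency_hz, sample_rate, n_samples):
    """Sample a pure sine tone (a stand-in for a digitized sound wave)."""
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins from 0 Hz up to half the sample rate."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


def spectrogram(samples, frame_size, hop):
    """One magnitude spectrum per frame; frames start every `hop` samples."""
    return [
        magnitude_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]


def peak_frequency(spectrum, sample_rate, frame_size):
    """Frequency in Hz of the strongest bin in one spectrum."""
    strongest_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return strongest_bin * sample_rate / frame_size


if __name__ == "__main__":
    samples = sample_wave(1000, 8000, 256)
    for frame in spectrogram(samples, frame_size=64, hop=32):
        print(peak_frequency(frame, 8000, 64))
```

In a spectrogram of real speech, each frame would show several strong frequency bands at once instead of a single peak.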
Phonemes: Sounds which differentiate meaning. Bit, but, bat, bet, robot. Types of phonemes: vowels and consonants. Consonants: fricatives (f, s), nasals (m, n), plosives (p, t, k), flap (tt in "utter"). Non-English sounds: trill (Spanish "perro"), click (!Kung).
Processing Speech to Find Formants