Natural Language Processing Essentials

Explore the fundamentals of natural language processing (NLP) including tokenization, words, morphology, subwords, and lemmatization. Understand the importance of preprocessing text for NLP applications and the nuances of characterizing words, wordforms, and tokens in linguistic analysis.

  • NLP
  • Tokenization
  • Morphology
  • Lemmatization
  • Words


Presentation Transcript


  1. ECE467: Natural Language Processing
     Words, Morphology, Subwords, and Tokenization

  2. Tokenization
  • As specified in the intro lecture, the field of natural language processing (NLP) deals with tasks that involve the processing of human languages, a.k.a. natural languages
  • Statistical approaches to NLP rely on machine learning; such approaches have dominated symbolic approaches for decades
  • Conventionally, most statistical NLP applications were trained on labeled natural language text
  • Modern NLP systems are often pre-trained on unlabeled text, and then they are often fine-tuned on smaller, labeled datasets
  • Regardless, the text is often contained in documents (loosely speaking), which might have the form of text files, articles in various formats, web pages, etc.
  • Even after stripping metadata, images, non-relevant sections, etc., the text itself is often in the form of paragraphs divided into sentences
  • Sentences contain not only words, but also whitespace, punctuation marks, and sometimes other typographical symbols
  • Typically, the text needs to be preprocessed such that the input to NLP applications is in the form of a sequence of tokens; the task that achieves this is called tokenization
  • This topic is partly based on Ch. 2 of the textbook, titled "Regular Expressions, Tokenization, Edit Distance"

  3. Words, wordforms, and tokens
  • Tokenization is sometimes casually defined as the splitting of text into words, but the meaning of "words" in this sense is ambiguous
  • A wordform is a specific form of a word (e.g., a pluralized noun, a conjugated verb, etc.)
  • In conventional NLP, tokenization often involved splitting natural language text into a sequence of instances of wordforms
  • There are various possible ways to deal with punctuation
    • Some applications may just use whitespace to separate wordforms; however, the results are generally not clean
    • Some may strip punctuation altogether, which is generally OK if you don't care about sentences, but there will be some oddities
    • Some may count certain punctuation (e.g., certain periods, question marks, quotation marks, etc.) as separate tokens
    • Some applications may do more complicated splitting; e.g., some split a possessive -s ('s) into a separate token (see the sketch after this slide)
  • There is also often postprocessing applied to the wordforms to form the final tokens
    • For some conventional applications, tokens are kept in their original wordforms
    • Other applications may convert all letters to lowercase to be case-insensitive
    • Other applications may perform lemmatization or stemming (these topics are discussed on the next two slides)
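As a quick illustration of the splitting choices above (my own sketch, not from the lecture), the snippet below contrasts plain whitespace splitting with a small regex-based tokenizer that splits off a possessive 's and emits punctuation marks as separate tokens; the pattern TOKEN_RE is illustrative only, not a standard.

```python
import re

text = "Mr. Smith's dog didn't bark. \"Really?\" she asked."

# Naive whitespace splitting: punctuation stays glued to the wordforms.
print(text.split())
# ['Mr.', "Smith's", 'dog', "didn't", 'bark.', '"Really?"', 'she', 'asked.']

# Illustrative regex tokenizer: split off possessive 's and emit each
# punctuation mark as its own token.
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")
print(TOKEN_RE.findall(text))
# ['Mr', '.', 'Smith', "'s", 'dog', 'didn', "'", 't', 'bark', '.',
#  '"', 'Really', '?', '"', 'she', 'asked', '.']
```

Note the oddity with "didn't" splitting into three pieces; this is the kind of untidiness the bullets above allude to.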

  4. Lemmatization
  • A lemma is the base form of a word (sometimes called the canonical form or dictionary form)
  • Often, multiple wordforms (e.g., "sing", "sang", "sung", "singing") share the same lemma
  • The set of words that share the same lemma must all have the same major part of speech (POS)
    • We'll discuss parts of speech in more detail during our topic on linguistic concepts; the concept will also be mentioned briefly again later in this topic
  • Algorithms to perform lemmatization can involve morphological parsing of wordforms (we'll learn a bit more about this later in the topic)
  • However, it was more common in conventional NLP to use a resource to look up the lemma; e.g., a very popular resource for conventional NLP used to be WordNet (a brief example follows this slide)
    • WordNet also had various other important uses in conventional NLP
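For a resource-based lookup of the kind just described, here is a short, hedged example using NLTK's WordNet lemmatizer (assuming the nltk package and its WordNet data are installed; the outputs shown are the expected lemmas):

```python
# Assumes: pip install nltk, then nltk.download("wordnet") for the data.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# WordNet lookup maps a wordform to its lemma; the POS hint matters.
print(lemmatizer.lemmatize("singing", pos="v"))  # expected: sing
print(lemmatizer.lemmatize("sang", pos="v"))     # expected: sing
print(lemmatizer.lemmatize("geese", pos="n"))    # expected: goose
```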

  5. Stemming
  • Stemming is simpler than lemmatization
  • Stemming involves the use of a sequence of rules to convert a wordform to a simpler form
  • For example, the Porter stemmer was very popular in conventional NLP; a few sample rules are sketched below
  • In theory, words that share the same lemma should map to the same stem, and words that do not share the same lemma should not
  • In practice, it sometimes works but sometimes does not
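The slide's list of sample rules did not survive the transcript, so here is a small sketch using a few genuine Porter-style suffix rules (a tiny subset, not the full stemmer; the function is my own illustration):

```python
import re

def porter_like_stem(word: str) -> str:
    """Apply a tiny, illustrative subset of Porter-style rewrite rules."""
    # Plural-style rules: try the longer suffix first.
    if word.endswith("sses"):
        return word[:-2]                      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]                      # ponies -> poni
    # Drop -ing only if the remaining stem still contains a vowel.
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]                      # motoring -> motor, but sing -> sing
    # Derivational rewrite: -ational -> -ate.
    if word.endswith("ational"):
        return word[:-7] + "ate"              # relational -> relate
    return word

for w in ["caresses", "ponies", "motoring", "sing", "relational"]:
    print(w, "->", porter_like_stem(w))
```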

  6. Sentence Segmentation
  • Another component of many NLP applications is sentence segmentation (which is not trivial)
  • It may seem intuitive to split a document into sentences first, and then to tokenize sentences, and some NLP applications do this
  • More often, the opposite occurs, since the result of tokenization aids sentence segmentation
  • One complication is that end-of-sentence markers are also used for other purposes (a toy segmenter is sketched after this slide)
    • For example, periods are also used for acronyms (e.g., "U.S.A.", "m.p.h."), abbreviations (e.g., "Mr.", "Corp."), and decimal points (e.g., "$50.25")
    • Note that acronyms and some abbreviations ending in periods can end sentences at times
  • The process of tokenization, optionally followed by lemmatization or stemming and/or sentence segmentation, is often referred to as text normalization
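Below is a minimal sketch of rule-based sentence segmentation illustrating the abbreviation problem (my own toy example; the abbreviation list and regex are assumptions, and real systems are considerably more careful):

```python
import re

# Illustrative list of abbreviations that usually do not end a sentence.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Corp.", "U.S.A."}

def segment_sentences(text):
    """Break after . ! or ? followed by whitespace and a capital letter,
    unless the preceding word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_word = text[start:end].split()[-1]
        if last_word in ABBREVIATIONS:
            continue                          # e.g., "Dr." is not a boundary here
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(segment_sentences("Dr. Smith paid $50.25 at the store. Then he left. Wow!"))
# expected: ['Dr. Smith paid $50.25 at the store.', 'Then he left.', 'Wow!']
```

The decimal point in "$50.25" is skipped here only because it is not followed by whitespace and a capital letter; a real segmenter needs to handle many more cases.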

  7. The Chomsky Hierarchy
  • The Chomsky hierarchy (not mentioned in the textbook) defines four types of formal grammars that are useful for various tasks
  • These are unrestricted grammars (type 0), context-sensitive grammars (type 1), context-free grammars (type 2), and regular grammars (type 3)
  • The types are numbered from the most powerful / least restrictive (type 0) to the least powerful / most restrictive (type 3)
  • When choosing one of these formalisms for a task, it is often useful (i.e., simpler and more efficient) to use the most restrictive type of grammar that suits your purpose
  • Regular grammars are generally powerful enough to specify rules for allowable tokens
  • There are various equivalent ways to define a specific instance of each type of grammar that is part of the Chomsky hierarchy
    • For example, when we talk about context-free grammars during our topic on linguistic concepts, we will define them using productions, a.k.a. rewrite rules
    • Regular grammars can also be defined using rewrite rules, or they can be defined using finite state automata; however, in this topic, we will define them with regular expressions

  8. Regular Expressions
  • A regular expression (RE) is a grammatical formalism useful for defining regular grammars
  • Each regular expression is a formula in a special language that specifies simple classes of strings (a string is a sequence of symbols)
  • The textbook covers regular expressions in Section 2.1; we will briefly discuss them now
  • Here is the precedence hierarchy for defined operators in regular expressions, according to our textbook (I'll discuss the meanings of the operators in class):
    1. Parentheses: ( )
    2. Counters: * + ? { }
    3. Sequences and anchors (e.g., the, ^my, end$, etc.)
    4. Disjunction: |
  • Although the above list is fairly typical, some sources list brackets as having higher precedence than parentheses, and a backslash to indicate special characters above that
  • We'll look at an example of a regular expression using some of these symbols on the next slide
  • Most modern programming languages provide libraries allowing users to define and use regular expressions effectively; they typically also allow registers, which increases the power of REs

  9. Regular Expression Example
  • The textbook steps through an example of searching for the word "the" in a text without any false positives or false negatives
  • Although the current draft does not state it, the examples in the textbook use Perl syntax; Perl used to be a very popular programming language in the field of NLP
  • The book specifies that we want any non-letter to count as a word separator
    • That is, to the immediate left of the 't' and to the immediate right of the 'e', any non-letter may be present
    • We also need to consider that "the" may occur at the start or end of a line
    • Finally, they allow the 't' to be capital or lowercase, but the 'h' and 'e' must be lowercase
  • They end up with the following expression: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/ (see the Python version after this slide)
  • Regular expressions can easily be used to define rules for allowable tokens, and they are useful for implementing word-based tokenizers
  • Regular expressions are also useful for implementing lexical analyzers (e.g., for compilers), search/replace operations, and input validation
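The same pattern can be tried directly with Python's re module (a quick check I added; not from the textbook):

```python
import re

# The textbook's pattern, minus the Perl-style slashes.
THE_RE = re.compile(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)")

for line in ["The other day the cat slept.", "They gathered there."]:
    hits = [m.group(2) for m in THE_RE.finditer(line)]
    print(line, "->", hits)
# expected: ['The', 'the'] for the first line, [] for the second
# ("other", "They", and "there" are correctly rejected)
```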

  10. Morphology
  • Book: "Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes."
  • The current edition of the textbook has dropped almost all of its discussion of morphology
    • We will discuss it in more detail than the book, but significantly less than I used to; the examples generally come from the previous edition of the textbook
  • A morpheme is the smallest part of a word that has a semantic meaning; for example, given the wordform "going", the parsed form can be represented as: "VERB-go + GERUND-ing"
  • Morphological parsing is the task that takes a wordform and splits it into its component morphemes
  • The current draft of the textbook very briefly discusses morphological parsing, which they say is necessary for "sophisticated methods for lemmatization"
  • It was mentioned earlier that, in practice, lemmatization could also use an external resource, such as WordNet, to look up a wordform and retrieve the lemma; however, keeping such resources current involved a lot of manual effort
  • It was also mentioned that we can avoid lemmatization by applying stemming instead, which is much simpler, but it doesn't always work as well
  • Conventionally, morphological parsing sometimes played an important role for POS tagging (a task in which each word is automatically labeled with its POS)
  • For morphologically complex languages (we'll briefly discuss what this means soon), it can also play an important role for applications such as web search

  11. Stems and Affixes
  • Two broad classes of morphemes are stems and affixes
    • The stem is the main (i.e., the central, most important, or most significant) morpheme of the word
    • Affixes add additional meanings of various kinds
  • Affixes can be further divided into prefixes, suffixes, infixes, and circumfixes
    • English debatably does not have circumfixes, and proper English probably does not have any infixes
    • The second edition of the textbook, as well as some other sources I have seen, states that "English doesn't really have circumfixes"
    • I have seen some other sources disagree, listing possible circumfixes such as "a-ing" (e.g., "a-going"), "en-en" (e.g., "enlighten"), and "em-en" (e.g., "embolden")
    • The second edition of the textbook also specifies two "taboo" infixes that occur in "certain dialects of English"
    • These are "f**king" (e.g., "Man-f**king-hattan") and "bl**dy" (e.g., "abso-bl**dy-lutely"); the asterisks are from the textbook (I would have been comfortable leaving them out of "bloody")
  • A word can have more than one affix; for example, the word "unbelievably" has a stem ("believe"), a prefix ("un-"), and two suffixes ("-able" and "-ly")
  • English rarely stacks more than four or five affixes onto a word, but some morphologically complex languages allow many more; for example, Turkish may commonly contain words with nine or ten affixes
  • We will briefly discuss morphologically complex languages soon

  12. Combining Morphemes
  • Four methods of combining morphemes to create new words are inflection, derivation, compounding, and cliticization
  • Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same basic POS as the stem, usually filling some syntactic function like agreement
    • Examples of inflection in English include pluralizing a noun or changing the tense of a verb
    • Orthographic rules specify typical spelling changes due to inflection; for example, to pluralize a noun ending in "y", change the "y" to an "i" and add "es" (e.g., "bunnies"); a toy example is sketched after this slide
    • Exceptions must be accounted for separately (e.g., the plural of "fish" is "fish", and the plural of "goose" is "geese")
    • In some languages, inflection can also be used to account for case, grammatical gender, politeness level, etc.
  • Derivation is the combination of a word stem with a morpheme, usually resulting in a word of a different class, often with a meaning that is harder to predict
    • Examples from the textbook (previous edition) include "appointee" from "appoint" and "clueless" from "clue"
    • An example from Wikipedia is "happiness" from "happy"
  • Compounding is the combination of multiple word stems together; for example, "doghouse"
  • Cliticization is the combination of a word stem with a clitic
    • A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word
    • An example is the English morpheme "'ve" in words such as "I've" (substituting for "have") or the French definite article "l'" in words such as "l'opera" (substituting for "le")
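As a small illustration of such orthographic rules and their exceptions (a toy sketch of my own, nowhere near complete):

```python
import re

# Irregular forms must be listed explicitly.
EXCEPTIONS = {"fish": "fish", "goose": "geese", "child": "children"}

def pluralize(noun: str) -> str:
    """Toy orthographic rules for English noun pluralization."""
    if noun in EXCEPTIONS:
        return EXCEPTIONS[noun]
    if re.search(r"[^aeiou]y$", noun):        # consonant + y: bunny -> bunnies
        return noun[:-1] + "ies"
    if re.search(r"(s|x|z|ch|sh)$", noun):    # fox -> foxes, church -> churches
        return noun + "es"
    return noun + "s"

for n in ["bunny", "fox", "dog", "goose", "fish"]:
    print(n, "->", pluralize(n))
```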

  13. Morphologically Complex Languages
  • As alluded to a few times already, relative to English, some languages are morphologically complex, meaning that words can contain significantly more morphemes than is typical in other languages
  • Languages that the textbook calls morphologically complex include Arabic, Polish, Siberian Yupik, Czech, Hungarian, and Turkish (these references are scattered throughout the textbook)
  • Extreme examples of morphologically complex languages are called polysynthetic languages (the current draft of the textbook mentions this term once)
  • One example of a polysynthetic language comes from chapter 4 of "The Atoms of Language", by Mark C. Baker; Chapter 4 is titled "Baking a Polysynthetic Language"
    • The chapter's title stems from an analogy: adding yeast to a recipe makes bread seem completely different from a cracker
    • The example discussed in the chapter is the language Mohawk
    • An example of a long word in the language is: "Washakotya'tawitsherahetkvhta'se'"
    • According to Baker, this roughly translates to the English sentence: "He made the thing that one puts on one's body [e.g., a dress] ugly for her."
    • Baker points out that a single word is used in Mohawk for any garment that covers the torso
    • The apostrophes in the Mohawk word represent a glottal stop, a sound not commonly found in American English
  • Word-based tokenization for morphologically complex languages is problematic for multiple reasons (I'll discuss this a bit more in class)

  14. The Language Instinct
  • Steven Pinker is a cognitive psychologist, a psycholinguist, and a Professor in the Department of Psychology at Harvard University
  • He has written several books for general audiences, a few of which deal directly with language and language learning
  • I am basing the next few slides of this topic on material from chapter 5 of "The Language Instinct", by Pinker
    • The book, in general, discusses Pinker's theories related to how humans process language
    • The title of Chapter 5 is "Words, Words, Words"
  • Pinker explains that humans remember stems, rules related to inflection, rules related to derivation (but we will see that these are less predictable), and irregular words
  • According to Pinker, the fact that rules of inflection are remembered has been demonstrated in psychological studies with children
  • Pinker describes one such experiment known as "the wug test" (I'll discuss it in class)

  15. Pinker on Derivation
  • When Pinker talks about derivation, he says the original word is called a root
  • For example, the stem "electricity" is based on the root "electric" (but note that the pronunciation of the "c" has changed)
  • However, when new stems are formed from roots and an affix, the rules are not always consistent
    • This relates to meanings, pronunciations, and which roots can be converted to which stems
    • Example: "complexity" is the state of being complex, but "electricity" is not the state of being electric (you wouldn't say that the electricity of a can opener makes it convenient)
    • Pinker also points out that there are no such words as "academicity", "acrobaticity", or "alcoholicity"
    • Actually, Pinker's third example is valid according to MS Word/PowerPoint and dictionary.com, with the apparent meaning, "alcoholic quality or strength"
  • Pinker also points out that a newly formed stem can often accept additional inflectional affixes (e.g., we can form the word "complexities")

  16. Ordering Suffixes
  • Pinker also points out that there are rules that govern the order that suffixes must follow when multiple suffixes are added to a single stem
  • Consider, for example, the stems "Darwinian" and "Darwinism", both based on the same root (in this case, a name), "Darwin"
    • The word "Darwinian" might be defined as "pertaining to Darwin or his theories" or "relating to Darwin or his theories"
    • The word "Darwinism" might be defined as "something related to Darwin or his theories" (you may find more specific definitions of this word, these days)
  • From the stem "Darwinian", we can derive "Darwinianism" (it sounds good), and then using inflection, we could form "Darwinianisms"
  • However, from "Darwinism", we cannot derive the word "Darwinismian" (such a word might mean "pertaining to Darwinism", but it sounds bad)
  • Pinker also points out that you can't form a new stem from a root that has already been inflected; for example, "Darwinism" sounds fine, but "Darwinsism" does not
  • I add: Note that humans never explicitly learn these rules, but we seem to abide by them

  17. Listemes
  • Pinker proposes that stems (including those produced through derivation) generally must be memorized separately; they are part of our mental lexicon
  • As mentioned earlier, Pinker claims that in addition to stems, irregular forms of words also must be memorized
  • Pinker claims that humans also memorize names, some foreign words, acronyms, etc.
  • Pinker refers to each unit of the memorized list as a listeme
  • He cites one study that estimated the average high school graduate knows about 45,000 listemes
    • First, the researchers took an unabridged dictionary, and removed all inflections and some derivations if the meanings could be deduced from the parts
    • Then, they quizzed volunteers on random samples from the remaining list, and based on the percentage of words known, they calculated their estimate
  • Pinker considers this an underestimate of the true number of memorized listemes, because it does not include proper names, numbers, foreign words, acronyms, etc.
  • Pinker ultimately increases his estimate to 60,000 listemes; of course, from these listemes, many other inflected (and in some cases, derived) words can be created

  18. That's Amazing!
  • Assuming that humans start to learn words at age 1, Pinker points out that young humans learn new words at the approximate rate of one every 90 waking minutes, on average!
  • To indicate how amazing this is, Pinker discusses a thought experiment from the logician and philosopher Quine
    • He asks us to imagine a linguist studying the language of a newly discovered tribe; a rabbit scurries by, and a native shouts, "Gavagai!"
    • The linguist will probably conclude that "gavagai" means "rabbit", but why?
    • Pinker lists several logically plausible possibilities of what "gavagai" could mean (e.g., a rabbit's foot, a furry thing, an animal, a thing that scurries, scurrying in general, "anything that is either a rabbit or a Buick", etc.)
  • Pinker argues that children learning words make subconscious assumptions about the types of things that words are likely to refer to, staying away from the very specific and the very general
    • Ultimately, though, we do learn general words such as "animal" (as well as specific words, when appropriate)
    • Pinker discusses experiments showing that children subconsciously assume that new words do not share meanings with common words that they already know, as a partial explanation for this

  19. Out-of-vocabulary words
  • Words that have not been seen in a training set are called unknown words, a.k.a. out-of-vocabulary (OOV) words
  • For some machine learning applications, OOV words can be ignored, but this does not make sense for applications such as machine translation
  • One possibility often used in conventional NLP was to convert rare words in a training set to a special token, such as <UNK>, and to replace OOV words in later documents with the same token (a small sketch follows this slide)
  • Reasons for OOV words include:
    • New words are frequently introduced to the language; once such words become common, they are sometimes referred to as neologisms
    • It is also common for named entities to introduce new terms; e.g., people, organizations such as companies, names of games, etc.
    • Some words will be misspelled
    • Some words are borrowed from other languages; these are sometimes called loanwords
    • A related notion is transliteration, when a word from one language is converted to another alphabet, attempting to keep the pronunciation the same
  • The textbook used to discuss OOV words in the chapter on N-grams, but as of the August 2024 draft, it dropped this discussion, although the chapter on N-grams remains
    • I have dropped the topic of N-grams this semester, but I will probably briefly mention N-grams when we first talk about language models
  • The current draft of the textbook does, however, make brief references to unknown words as a source of potential problems or difficulties throughout various topics
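Here is a minimal sketch of the <UNK> convention described above (my own illustration; the count threshold and token name are arbitrary choices):

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_tokens, min_count=2):
    """Keep tokens seen at least min_count times in training; the rest map to <UNK>."""
    counts = Counter(train_tokens)
    return {tok for tok, c in counts.items() if c >= min_count} | {UNK}

def apply_vocab(tokens, vocab):
    return [tok if tok in vocab else UNK for tok in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)
print(apply_vocab(train, vocab))                  # rare training words become <UNK>
print(apply_vocab("the dog sat".split(), vocab))  # OOV 'dog' (and rare 'sat') become <UNK>
```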

  20. Problems with Word-based Tokenization
  • As we have seen, after word-based tokenization, other text normalization steps are often applied
    • For example, we previously mentioned case conversion, stemming, or lemmatization
    • The best techniques may differ from task to task, and from language to language
    • Once a choice is made, the same steps must be applied consistently (i.e., whatever techniques were used for training must be used for all future documents)
  • As we have learned, some languages are morphologically complex, with instances of words typically containing relatively more morphemes
    • This adds to the number of distinct wordforms in a dataset of a certain size, so each specific wordform occurs less frequently
    • Statistics learned about specific wordforms will therefore generally be less accurate
  • Word-based tokenization makes it difficult to deal with unknown words, a.k.a. out-of-vocabulary (OOV) words, which were discussed on the previous slide
    • For some applications, OOV words can be ignored, but this does not make sense for applications such as machine translation
    • Often, conventional NLP systems had to apply a hacky solution, such as replacing unknown words with a special <UNK> token (rare words in the training set would be replaced with the same token)
  • In some languages, words are not separated by spaces, making it difficult to separate words algorithmically
    • The book mentions Chinese, Japanese, and Thai as languages that do not separate words with spaces
    • For example, they state that in Chinese, words are composed of Hanzi characters
    • Each character generally represents a single morpheme and is pronounced as a single syllable

  21. Subwords
  • Most modern NLP systems and some conventional NLP systems tokenize based on smaller pieces of words, generally known as subwords
  • Possible methods for tokenizing based on subwords include:
    • As previously discussed, morphological parsing can be used to split words into morphemes, and these morphemes can be used as tokens; however, morphological parsing can be expensive, and proper methods differ between languages
    • Individual characters can be used as tokens; some deep-learning systems have successfully used this approach, mapping characters to character embeddings (as opposed to word embeddings)
    • Another approach to creating subword embeddings is known as fastText; the book (current draft) has one paragraph on fastText in Section 6.8.3, but we will not cover this approach
    • Language-specific rule-based methods can be used to split words into subwords; the book covers this sort of approach in Section 2.5.1, but we will not cover such approaches
    • An algorithm to learn a vocabulary of subwords from a training set can be applied
  • One such method, known as byte-pair encoding (BPE), is discussed in Section 2.5.2 of the textbook, and we will spend the rest of the current topic covering this method
  • An alternative method is known as WordPiece; the textbook briefly describes WordPiece in Section 13.2.1 (and mentions it in a few other locations), but I am not currently planning to cover this method in the course

  22. Byte-pair Encoding Overview
  • Byte-pair encoding (BPE) was originally developed in 1994 by Gage for use as a compression algorithm
  • BPE has been around for decades, but it was adapted for tokenization and used to generate subword embeddings, starting with a paper by Sennrich et al. in 2016
  • Our textbook discusses BPE as a method for tokenization in Section 2.5.2
  • BPE includes a preprocessing step in which a document is split into words, generally involving whitespace
    • Then, an extra end-of-word symbol is added to each word
    • BPE, as well as the previously mentioned WordPiece, leads to subwords that do not cross word boundaries
  • There is also a more general library called SentencePiece, which includes implementations of BPE and WordPiece (and other subword tokenization methods)
    • SentencePiece can also be trained on raw sentences in a way that allows tokens to cross word boundaries
    • I do not plan to discuss SentencePiece further in this course

  23. BPE Algorithm
  • Based on the preprocessing step, the BPE algorithm starts with a list of words, with counts, each split into characters and ending in the end-of-word symbol
  • The algorithm starts with a separate vocabulary symbol for each character in the text being processed (e.g., every ASCII character or every Unicode character)
  • There is also a special end-of-word symbol (we'll denote it as '_') concatenated to the end of each word in the corpus being used to generate the BPE tokens that will be recognized
  • Thus, the initial vocabulary (i.e., the recognized set of tokens) is the set of characters in the encoding plus the end-of-word symbol
  • The algorithm then iteratively combines the most frequent sequential pair of symbols to form a new symbol
    • The new symbol gets added to the vocabulary
    • Words containing the new symbol are updated in the corpus being processed
  • This proceeds for a fixed number of iterations, k, where k is a parameter of the algorithm
  • The algorithm only considers pairs of symbols within words (i.e., not crossing word boundaries)
  • We will step through an example to help explain this (a code sketch of the training loop follows this slide)
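Below is a compact Python sketch of this training loop as just described (my own code, not the textbook's reference implementation; ties between equally frequent pairs are broken arbitrarily here):

```python
from collections import Counter

def get_pair_counts(word_counts):
    """Count adjacent symbol pairs across the corpus (within words only)."""
    pairs = Counter()
    for symbols, count in word_counts.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs

def merge_pair(word_counts, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, count in word_counts.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = count
    return merged

def train_bpe(corpus_counts, k):
    """corpus_counts: {word: count}; returns the k learned merges, in order."""
    word_counts = {tuple(w) + ("_",): c for w, c in corpus_counts.items()}
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(word_counts)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        word_counts = merge_pair(word_counts, best)
    return merges
```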

  24. BPE Example: Initial State
  • We will step through an example from the textbook to help explain this
  • The example is based on a corpus with 5 distinct words, 18 total instances of words, and initially 11 distinct symbols (including the end-of-word symbol, represented as an underscore)
  • Initially, the vocabulary consists only of the 11 original symbols, and the counts of all the words are known
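The textbook's table is not reproduced in this transcript; a corpus consistent with the counts quoted on the following slides (assumed here) can be fed to the train_bpe sketch above:

```python
# Assumed example corpus: 5 distinct words, 18 word instances, and 11 initial
# symbols: {'_', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w'}.
corpus_counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}

merges = train_bpe(corpus_counts, k=8)
print(merges[:3])
# With this corpus the first three merges are ('e', 'r'), ('er', '_'), ('n', 'e'),
# matching the steps on the next slides (ties could also resolve differently).
```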

  25. BPE Example: Step 1
  • During the first iteration of the algorithm, it is determined that the most frequent symbol pair is 'e' followed by 'r', which occurs 9 times
  • Therefore, a new symbol for 'er' is created and added to the vocabulary
  • All instances of this pair are replaced by the new symbol
  • Note that 'r' followed by '_' ties with this, so that also would have been a valid choice

  26. BPE Example: Step 2
  • During the second iteration, the most frequent pair is 'er' followed by '_', which occurs 9 times
  • Therefore, a new symbol for 'er_' is created and added to the vocabulary
  • All instances of this pair are replaced by the new symbol
  • Note that 'e', 'r', and 'er' are also still members of the vocabulary

  27. BPE Example: Step 3
  • During the third iteration, the most frequent pair is 'n' followed by 'e', which occurs 8 times
  • This time, a new symbol for 'ne' is created and added to the vocabulary
  • All instances of this pair are replaced by the new symbol
  • Note that 'e' followed by 'w' ties with this, so that also would have been a valid choice

  28. BPE Example: Five More Merges
  • The book shows the merged pairs and the updated vocabularies for the next five steps
  • Note that the algorithm remembers the order in which the merges have been generated

  29. Using BPE to Tokenize
  • We can apply the result of the BPE training algorithm to tokenize future texts as follows:
    • Start with all characters represented by separate symbols
    • Then apply the merges greedily in the order that they were learned
  • Note that when we apply the tokenization, the frequencies in the data being tokenized do not play a role
  • Based on the example, the word "newer" in the document being tokenized would become a single token, "newer_"
  • The word "lower" would become "low er_" (two separate tokens)
  • A small sketch of this procedure follows this slide
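Here is a small sketch of the tokenization step, reusing the merge-replay logic from the training sketch above (again my own illustration):

```python
def bpe_tokenize(word, merges):
    """Split a word into characters plus '_' and replay the learned merges in order."""
    symbols = list(word) + ["_"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Using the merges learned from the assumed example corpus:
print(bpe_tokenize("newer", merges))  # expected: ['newer_']
print(bpe_tokenize("lower", merges))  # expected: ['low', 'er_']
```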

  30. BPE Advantages
  • As mentioned earlier, the training algorithm for BPE runs for a fixed number of iterations, k
  • This determines the size of the vocabulary, which is typically significantly smaller than the size of a conventional vocabulary
  • The book says that "in real settings BPE is run with many thousands of merges on a very large input corpus"
  • The book claims that "most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts"
  • This is a bit different from what I've read about the algorithm in the past; I learned that:
    • Common words will be represented as single symbols
    • Common morphemes (e.g., "un-", "-ies", etc.) will be represented as single symbols
  • BPE (as well as other modern tokenization algorithms) can achieve something similar to morphological parsing with a much simpler algorithm, efficiently producing sensible tokens
  • OOV words are likely to be tokenized into meaningful chunks
  • BPE and similar algorithms work well across many languages (although typical implementations of BPE assume that words are space-separated)
