Using Regular Expressions in Corpus Linguistics for Pattern Matching

"Discover the power of regular expressions in corpus linguistics for efficient pattern-matching tasks such as finding specific word sequences, adjectives, nouns, and more. Explore Chomsky's hierarchy and understand the language generated by a regular grammar."

Uploaded by stut_vea on Apr 03, 2025

Presentation Transcript

Course 2: Regular expressions and AntConc concordance tool. University of Bucharest, MMLE, February 2020. Anca Dinu

Regular expressions In corpus linguistics, much of what we want to do with a tool is pattern-matching over texts/corpora, for example: Find all words that begin with k and end with a vowel. Find all words that have a sequence of three vowels. Find all three-syllable words. Find all adjectives ending in -ic. Find all plural nouns preceded by "the" in questions.
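The first two searches above can be sketched with Python's standard re module. This is only an illustration: the sample sentence and variable names are invented here, and the searches that depend on part of speech (adjectives, plural nouns) would need a tagged corpus, which is not shown.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "The koala and the kiwi ate a delicious karaoke cake quietly"

# Words that begin with k and end with a vowel.
k_vowel = re.findall(r"\bk[a-z]*[aeiou]\b", text, flags=re.IGNORECASE)

# Words that contain a sequence of three vowels.
three_vowels = re.findall(r"\b[a-z]*[aeiou]{3}[a-z]*\b", text, flags=re.IGNORECASE)

print(k_vowel)
print(three_vowels)
```

Note that the patterns work on spelling only, so "vowel" here means one of the letters a, e, i, o, u rather than a vowel sound.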

Regular expressions Regexes are expressions that represent string patterns. They are extremely useful for extracting information from any text by searching for matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters). Regular expression searches are among the most popular, powerful, and easy-to-use search tools. They correspond to the regular languages of Chomsky's Hierarchy and were formalized by the mathematician Stephen Kleene (*).

Chomsky Hierarchy From most to least strict, the four formal grammars in the Chomsky Hierarchy (CH) are: Regular grammars, which retain no past state knowledge from input string to output string. Context-free grammars, which retain only recent state knowledge from input string to output string. Context-sensitive grammars, which keep all past state knowledge from input string to output string. Unrestricted (or recursively enumerable) grammars, which have all state knowledge and thus can create every output string imaginable from a given input string.

Chomsky Hierarchy

Regular Grammar - linguistic flavour A regular grammar is a mathematical object, G, with four components, G = (N, Σ, P, S), where N is a nonempty, finite set of nonterminal symbols, Σ is a finite set of terminal symbols (the alphabet), P is a set of grammar rules, each one having one of the forms A → aB, A → a, A → ε, for A, B ∈ N, a ∈ Σ, and ε the empty string, and S ∈ N is the start symbol.

The Language Generated by a Regular Grammar Let G be a regular grammar. The language generated by the regular grammar G = (N, Σ, P, S) is L(G) = {w | S ⇒* w, where w ∈ Σ*}. Translation: the language of a regular grammar is the set of all strings over the alphabet that can be derived from the start symbol S by application of the grammar rules.
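To make the definition concrete, here is a minimal sketch that enumerates the short strings of L(G) for one small regular grammar. The grammar itself (S → aS | bA, A → bA | ε), the RULES encoding and the language function are assumptions made for this illustration; they are not taken from the slides.

```python
from collections import deque

# Hypothetical right-linear grammar: S -> aS | bA,  A -> bA | ε
# Each rule body is (terminal, next_nonterminal); None means no nonterminal follows.
RULES = {
    "S": [("a", "S"), ("b", "A")],
    "A": [("b", "A"), ("", None)],
}

def language(rules, start="S", max_len=3):
    """Return all strings of length <= max_len derivable from `start`."""
    results = set()
    queue = deque([("", start)])  # (terminals produced so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, nxt in rules[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue  # prune derivations that are already too long
            if nxt is None:
                results.add(word)  # no nonterminal left: derivation finished
            else:
                queue.append((word, nxt))
    return sorted(results, key=lambda w: (len(w), w))

print(language(RULES))
```

Every string printed consists of zero or more a's followed by one or more b's, which is the same language the regex a*b+ describes.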

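Regular grammar-regex equivalence A formal grammar (like a regular grammar) generates and recognizes a language. Regexes do the same. Regular grammars are (approximately) equivalent to regexes: formal regular expressions describe exactly the regular languages, while practical regex engines add extensions such as backreferences.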

Regexes - CS flavour A regular expression is defined recursively as follows: The empty set is a regular expression. The empty string is a regular expression. For any character x in the input alphabet, x is a regular expression that produces the regular language {x}. Plus the following three operations:

Regexes - CS flavour Alternation: If x and y are regular expressions, then x | y is a regular expression. For example, the regular expression a|b produces the regular language {a, b}. Concatenation: If x and y are regular expressions, then x y is a regular expression. For example, the regular expression a b produces the regular language {ab}. Repetition (Kleene star): If x is a regular expression, then x* is a regular expression. For example, the regular expression a b* produces the regular language {a, ab, abb, abbb, ...}.
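The three operations can be sketched as operations on sets of strings. The function names below are invented for this illustration, and the star is cut off after a fixed number of repetitions (the max_repeats parameter, an assumption of this sketch) because the true language is infinite.

```python
def alternation(x, y):
    """x | y: the strings that belong to either language."""
    return x | y

def concatenation(x, y):
    """x y: every string of x followed by every string of y."""
    return {u + v for u in x for v in y}

def star(x, max_repeats=3):
    """x*: zero or more repetitions, truncated so the set stays finite."""
    result = {""}
    current = {""}
    for _ in range(max_repeats):
        current = concatenation(current, x)
        result |= current
    return result

a, b = {"a"}, {"b"}
print(sorted(alternation(a, b)))                    # language of a|b
print(sorted(concatenation(a, b)))                  # language of a b
print(sorted(concatenation(a, star(b)), key=len))   # language of a b*, truncated
```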

Regexes - CS flavour There are some other operators derived from combinations of the three original operations on regexes (alternation, concatenation, repetition): +, ?, etc. (see the regular expression cheat sheet of Michael Yoshitaka Erlewine). Parentheses add extra power with respect to regular grammars. Special characters need to be escaped (preceded by \) to be interpreted literally.
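A short sketch of why escaping matters, using Python's re module. The sample text and variable names are invented for illustration; re.escape is a standard-library helper that adds the backslashes for a literal search string.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Is 3.14 the same as 3514? The price is $5 (approx.)."

# Unescaped, "." is a wildcard, so the pattern also matches "3514".
wildcard_hits = re.findall(r"3.14", text)

# Escaped, "\." matches only a literal full stop.
literal_hits = re.findall(r"3\.14", text)

# re.escape() escapes the special characters in a literal search string.
escaped = re.escape("(approx.)")
escaped_hits = re.findall(escaped, text)

print(wildcard_hits)
print(literal_hits)
print(escaped, escaped_hits)
```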

Summary OR: A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey". Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey". Quantification (after a token): ? indicates zero or one occurrence of the preceding element. For example, colou?r matches both "color" and "colour". * indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on. + indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", ..., but not "ac".
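The examples in this summary can be checked directly in Python. The helper name matches and the candidate word lists are invented for this sketch; re.fullmatch requires the whole string to match the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matches(r"gray|grey", ["gray", "grey", "griy"]))
print(matches(r"gr(a|e)y", ["gray", "grey", "griy"]))
print(matches(r"colou?r", ["color", "colour", "colouur"]))
print(matches(r"ab*c", ["ac", "abc", "abbc", "abd"]))
print(matches(r"ab+c", ["ac", "abc", "abbc"]))
```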

Summary {n}: the preceding item is matched exactly n times. {min,}: the preceding item is matched min or more times. {min,max}: the preceding item is matched at least min times, but not more than max times. Wildcard: . matches any character. For example, a.b matches any string that contains an "a", then any other character and then a "b"; a.*b matches any string that contains an "a" and a "b" at some later point. For more examples, see https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
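A small sketch of the counted quantifiers and the wildcard in Python. The word lists are invented for illustration. One detail specific to Python's re module: by default the dot matches any character except a newline.

```python
import re

# Counted repetition of the letter e after an initial s.
words = ["se", "see", "seee", "seeee"]
exactly_two = [w for w in words if re.fullmatch(r"se{2}", w)]
two_or_more = [w for w in words if re.fullmatch(r"se{2,}", w)]
two_to_three = [w for w in words if re.fullmatch(r"se{2,3}", w)]

# re.search looks for the pattern anywhere inside the string.
strings = ["cab", "a-b", "ab", "a long gap then b", "b then a"]
a_dot_b = [s for s in strings if re.search(r"a.b", s)]   # exactly one character between
a_any_b = [s for s in strings if re.search(r"a.*b", s)]  # any number of characters between

print(exactly_two, two_or_more, two_to_three)
print(a_dot_b)
print(a_any_b)
```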

AntConc AntConc is a general-purpose tool for analysing corpora. It is free and easy to download and use. It can be used for virtually any language. It supports plain and annotated corpora. Made by Laurence Anthony: http://www.laurenceanthony.net/software/antconc/

Basics Loading corpus files; viewing files; word list; concordance tool; tool preferences; global settings.

Word list A word list is a list of the words in a corpus, ordered by frequency. Sort by: frequency (default), word (alphabetically), word end, or inverse order. The word list can be saved by AntConc as a text file. Tool preferences for Word list: Lemma list: a list with the inflections of words, for instance for be: is, are, been, was, were, etc. It returns the list of head words, accompanied by their family words (inflections) and their frequency.
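The idea behind a frequency-ordered word list can be sketched in a few lines of Python. This is not how AntConc is implemented; the mini-corpus and the simple letters-only tokenizer are assumptions made for the illustration.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
corpus = "The cat sat on the mat. The dog sat on the cat."

# Very simple tokenizer: lowercase everything, keep runs of letters.
tokens = re.findall(r"[a-z]+", corpus.lower())
frequencies = Counter(tokens)

# Sort by descending frequency, then alphabetically to break ties.
word_list = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))

for rank, (word, count) in enumerate(word_list, start=1):
    print(rank, count, word)
```

Changing the sort key (for example to the reversed word, item[0][::-1]) gives an ordering by word end.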

Word list Word list range: use specific words (only the words the user is interested in) or use a stop list (exclude the words in the stop list). The choice between these two options depends on the goal of the analysis: if the user studies the stylome of an author or performs authorship identification, s/he could look only at function words, like prepositions or pronouns, because they are harder to falsify; if the user performs a semantic study, s/he might want to exclude the function (stop) words.
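The two range options amount to filtering the token list before counting. In the sketch below the mini-corpus and the deliberately tiny function_words set are invented for illustration; a real study would use a much fuller list.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
corpus = "The cat sat on the mat. The dog sat on the cat."
tokens = re.findall(r"[a-z]+", corpus.lower())

# Hypothetical, deliberately tiny list of function words.
function_words = {"the", "on", "a", "of", "in"}

# "Use specific words": keep only function words (e.g. for authorship studies).
style_counts = Counter(t for t in tokens if t in function_words)

# "Use stop list": drop function words (e.g. for a semantic study).
content_counts = Counter(t for t in tokens if t not in function_words)

print(style_counts)
print(content_counts)
```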

Concordance tool Search for words and patterns. Sort by left and right context. Ex: report, reported, reporting, report on, to report. Search with wildcards, e.g. report* (all wildcards are listed in Global Settings). Editing tricks: click on the highlighted words, using shift, alt, ctrl. Search options: "not word" (word fragments such as rep, por), lower/upper case.
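A keyword-in-context (KWIC) display of the kind a concordance tool produces can be sketched as follows. This is an illustration only, not AntConc's implementation: the concordance function, the sample text and the context width are invented, and the wildcard search report* is approximated by the regex \breport\w*, assuming that * stands for zero or more word characters.

```python
import re

# Hypothetical sample text, invented for illustration.
text = ("She will report on the results. He reported the error yesterday. "
        "Reporting is hard, and nobody wants to report it.")

def concordance(text, pattern, width=20):
    """Return (left context, hit, right context) triples for each match."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(), right))
    return lines

# Rough regex counterpart of the wildcard search report*
hits = concordance(text, r"\breport\w*")

# Sort the concordance lines by their right context.
for left, hit, right in sorted(hits, key=lambda h: h[2].lower()):
    print(f"{left:>20} [{hit}] {right}")
```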

Concordance tool Search for regular expressions: \br[a-z]+?t\b \bcat\b \bcat\w. (cat|dog) [aeiou][aeiou][aeiou] \d \b(\w+)er\b Advanced search: load multiple search words or a search list from a file, and search for a context word within a window (e.g. said). Clone results to compare two or more sets of results. Exporting the results: Tool preferences (adding a delimiter, such as a tab, to the hit word for copy-pasting into an Excel spreadsheet).
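The regular expressions listed on this slide can be tried outside AntConc with Python's re module, whose syntax for these particular patterns is the same. The sample sentence and the descriptive labels are invented for this sketch, and the searches are case-sensitive as written.

```python
import re

# Hypothetical sample text, invented for illustration.
text = ("The rat sat on a root, the cat and the cats chased 3 dogs, "
        "and the beautiful teacher spoke louder.")

patterns = {
    "words starting with r and ending in t": r"\br[a-z]+?t\b",
    "cat as a whole word": r"\bcat\b",
    "cat + one word character + any character": r"\bcat\w.",
    "cat or dog": r"(cat|dog)",
    "three vowels in a row": r"[aeiou][aeiou][aeiou]",
    "a digit": r"\d",
    "words ending in -er": r"\b(\w+)er\b",
}

# m.group() is the whole match, even when the pattern contains a group.
results = {
    label: [m.group() for m in re.finditer(pattern, text)]
    for label, pattern in patterns.items()
}

for label, found in results.items():
    print(f"{label}: {found}")
```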
