
Understanding Regular Expressions for Pattern Matching
Regular expressions, also known as regex, are used to describe patterns or sequences of characters without specifying the characters literally. They are essential for pattern matching and pattern evaluation, allowing for efficient text processing and manipulation. This article explores the basic concepts of regular expressions, different types, key components such as anchors, character sets, and modifiers, and the distinction between simple and extended regex. Understanding regex is crucial for efficient text processing in various programming languages and utilities.
Presentation Transcript
Regular Expressions
Reading: Appendix A
Regular Expressions
Regular expressions describe patterns, or sequences of characters, without necessarily specifying the characters literally. You'll also hear this process referred to as "pattern matching". An expression is not interpreted literally; it is something that needs to be evaluated.
Understanding RE
Arithmetic expressions such as 2+4 and 2 + 3 * 4 combine literal constants with operators and are evaluated according to operator precedence. A regular expression, by contrast, is descriptive of a pattern or sequence of characters. Concatenation is the basic operation implied in every regular expression. That is, a pattern matches adjacent characters.
Regular Expressions
There are two main types of regular expressions:
Simple (basic): vi, sed, grep, csplit, dbx, more, ed, expr, lex, and pg are commands that understand these REs.
Extended: awk, nawk, and egrep are utilities that understand these REs.
The distinction between the two is getting blurred, and some versions of "simple REs" support extensions that are missing from extended regular expressions. We first discuss basic REs, and then extended REs. Programming languages such as Perl and Python may support other kinds of regular expressions.
Regular Expressions
There are three important parts to a regular expression:
Anchors specify the position of the pattern in relation to a line of text.
Character sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.
The Anchor Characters: ^ and $
Searching for a pattern that is at one end or the other of a line is accomplished by using anchors. The caret (^) is the starting anchor, indicating the beginning of a line. The dollar sign ($) is the end anchor, indicating the end of a line.
Most UNIX text facilities are line-oriented, so searching for patterns that span several lines is not easy to do. The EOL character is a separator, and it's not included in the block of text that is searched. REs examine the text between the separators.
The Anchor Characters: ^ and $
The RE ^A will match all lines that start with an uppercase A. The RE A$ will match all lines that end with an uppercase A. If the anchor characters are not used at the corresponding end of the pattern, they no longer act as anchors. For example, the expression $1 does not have an anchor, and neither does 1^. To match a ^ at the beginning or a $ at the end of a line, escape the special character by typing a backslash (\) before it.
The Anchor Characters: ^ and $
Anchor character examples:
^A     an A at the beginning of a line
A$     an A at the end of a line
A      an A anywhere on a line
$A     a $A anywhere on a line
^\^    a ^ at the beginning of a line
^^     same as ^\^
\$$    a $ at the end of a line
$$     same as \$$
Quote your RE properly when you type it on a shell command line. Otherwise, the shell expands an unquoted $$ to its process ID before the command ever sees the pattern.
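The anchor examples can be tried with a short sketch like the one below. The sample lines and variable names are made up for illustration. Every pattern is single-quoted so that the shell passes it to grep unchanged.

```shell
# Hypothetical sample input, one entry per line.
sample='Apple pie
BANANA
cost: 5$
$A is not anchored
^caret first'

# ^A : an A at the beginning of a line
starts_with_a=$(printf '%s\n' "$sample" | grep '^A')

# A$ : an A at the end of a line
ends_with_a=$(printf '%s\n' "$sample" | grep 'A$')

# $A : the $ is not at the end of the pattern, so it is an ordinary character
literal_dollar_a=$(printf '%s\n' "$sample" | grep '$A')

# \$$ : a literal $ at the end of a line
dollar_at_end=$(printf '%s\n' "$sample" | grep '\$$')

# ^\^ : a literal ^ at the beginning of a line
caret_at_start=$(printf '%s\n' "$sample" | grep '^\^')

printf '%s\n' "$starts_with_a" "$ends_with_a" "$literal_dollar_a" \
  "$dollar_at_end" "$caret_at_start"
```

Each pattern selects exactly one of the five sample lines, which makes it easy to see which anchor or escape did the work.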
Matching a Character with a Character Set
The simplest character set is a single character. The regular expression the contains three character sets: t, h, and e. It will match any line that contains the string the, including the word other. To prevent this, put spaces before and after the pattern: _the_ (each _ stands for a space). You can combine the string with an anchor. The pattern ^From: followed by a space will match the lines of a mail message that identify the sender:
% grep ^From: $MAIL
% grep '^From: ' /var/spool/mail/$USER
Some characters have a special meaning in REs. To search for such a character as itself, escape it with a backslash (\).
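A small sketch of the difference between a bare string, a space-delimited string, and an anchored string. It uses invented sample text rather than a real mailbox, and the variable names are assumptions.

```shell
# Hypothetical sample text standing in for a mail file.
text='another day
see the list
nothing here
From: alice@example.com
Reply sent From: the office'

# The bare string "the" also matches inside "another".
plain_count=$(printf '%s\n' "$text" | grep -c 'the')

# Surrounding spaces exclude "another".
spaced_count=$(printf '%s\n' "$text" | grep -c ' the ')

# Anchored: only lines that begin with "From: ".
sender=$(printf '%s\n' "$text" | grep '^From: ')

echo "plain=$plain_count spaced=$spaced_count sender=$sender"
```

The anchored pattern skips the last line even though it contains "From: ", because the match must start at the beginning of the line.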
Match any Character with . (Dot)
The dot (.) is one of those special meta-characters. By itself, it will match any character except the EOL character. The pattern that will match a line consisting of exactly one character is: ^.$
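As an illustration, the sketch below feeds five made-up lines (one of them empty) to grep and compares ^.$ with a bare dot.

```shell
# Hypothetical input: "a", an empty line, "ab", "7", "xyz".
# ^.$ keeps only the lines that consist of exactly one character.
one_char=$(printf 'a\n\nab\n7\nxyz\n' | grep '^.$')

# A bare dot matches every line that has at least one character.
non_empty=$(printf 'a\n\nab\n7\nxyz\n' | grep -c '.')

printf 'single-character lines:\n%s\n' "$one_char"
echo "non-empty lines: $non_empty"
```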
Specifying a Range of Characters with [...]
To match specific characters, use square brackets, [ ], to identify the exact characters in your search. The pattern that will match any line of text that consists of exactly one digit is: ^[0123456789]$ or ^[0-9]$
You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore: [A-Za-z0-9_]
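A brief sketch of both bracket patterns on invented input. Setting LC_ALL=C is an assumption added here so that ranges follow ASCII order regardless of the local environment.

```shell
export LC_ALL=C   # assumption: C locale, so ranges follow ASCII order

# Lines that consist of exactly one digit.
single_digit=$(printf '7\n42\nx\n3\n' | grep '^[0-9]$')

# Lines that consist of exactly one letter, digit, or underscore.
word_char=$(printf 'a\n_\n-\n5\n' | grep '^[A-Za-z0-9_]$')

printf 'single digits:\n%s\n' "$single_digit"
printf 'word characters:\n%s\n' "$word_char"
```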
Specifying a Range of Characters with [...] - Example
Character sets can be combined by placing them next to one another. Suppose you want to search for a word that is the first word on a line, starts with an uppercase T, has a lowercase letter as its second letter and a lowercase vowel as its third letter, and is three letters long (that is, it is followed by a space character). The regular expression would be "^T[a-z][aeiou] " (note the space following the right square bracket).
To be specific: a range is a contiguous series of characters, from low to high, in the ASCII chart. For example, [z-a] is not a range because it's backwards. The range [A-z] does match both uppercase and lowercase letters, but it also matches the six characters that fall between uppercase and lowercase letters in the ASCII chart: [, \, ], ^, _, and `
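The combined pattern and the [A-z] pitfall can be checked with a sketch like this one. The sample lines are invented, and the C locale is assumed so that [A-z] is interpreted by ASCII position.

```shell
export LC_ALL=C   # assumption: C locale, so ranges follow ASCII order

lines='The cat
Tea time
Then what
the end
A Tie here'

# T, any lowercase letter, a lowercase vowel, then a space, at the start of a line.
three_letter=$(printf '%s\n' "$lines" | grep '^T[a-z][aeiou] ')

# [A-z] also accepts the punctuation between Z and a in the ASCII chart.
wide_range=$(printf '_\n^\nq\n5\n' | grep '[A-z]')
letters_only=$(printf '_\n^\nq\n5\n' | grep '[A-Za-z]')

printf '%s\n' "$three_letter"
printf '[A-z] matched:\n%s\n' "$wide_range"
printf '[A-Za-z] matched:\n%s\n' "$letters_only"
```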
Exceptions in a Character Set
You can easily search for all characters except those in square brackets by putting a caret (^) as the first character after the left square bracket ([). To match all characters except lowercase vowels, use: [^aeiou]
Like the anchors in places where they can't be considered anchors, the right square bracket (]) and dash (-) do not have a special meaning if they directly follow a [.
Exceptions in a Character Set
Regular expression character set examples:
[0-9]      Any digit
[^0-9]     Any character other than a digit
[-0-9]     Any digit or a -
[0-9-]     Any digit or a -
[^-0-9]    Any character except a digit or a -
[]0-9]     Any digit or a ]
[0-9]]     Any digit followed by a ]
[0-99-z]   Any digit or any character between 9 and z
[]0-9-]    Any digit, a -, or a ]
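A few of the character set examples, counted against made-up input. This is only a sketch: the input lines and variable names are assumptions, and grep -c reports how many lines contain at least one matching character.

```shell
export LC_ALL=C   # assumption: C locale, so ranges follow ASCII order

input='abc
123
a1-
]
-'

# [^0-9]  : lines containing at least one non-digit
non_digit=$(printf '%s\n' "$input" | grep -c '[^0-9]')

# [-0-9]  : a leading dash is literal, so this is "a digit or a -"
dash_or_digit=$(printf '%s\n' "$input" | grep -c '[-0-9]')

# []0-9]  : a leading ] is literal, so this is "a digit or a ]"
bracket_or_digit=$(printf '%s\n' "$input" | grep -c '[]0-9]')

# [^-0-9] : any character except a digit or a -
neither=$(printf '%s\n' "$input" | grep -c '[^-0-9]')

echo "$non_digit $dash_or_digit $bracket_or_digit $neither"
```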
Repeating Character Sets with *
The third part of an RE is the modifier. It is used to specify how many times you expect to see the previous character set. The special character * (asterisk) matches zero or more copies. The RE 0* matches zero or more zeros, and the RE [0-9]* matches zero or more digits.
Think about the following RE:
% grep '^#*' example1.sh
What will it match? It will match every line, because every line starts with zero or more #'s.
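To see the "zero or more" effect, the sketch below uses four invented lines in place of example1.sh and compares ^#* with ^##*, which requires at least one #.

```shell
# Hypothetical stand-in for the contents of example1.sh.
script_lines=$(printf '%s\n' '#!/bin/sh' '# comment' 'echo hi' '## double')

# ^#* matches every line, because zero #'s is an acceptable match.
zero_or_more=$(printf '%s\n' "$script_lines" | grep -c '^#*')

# ^##* requires one # followed by zero or more #'s.
one_or_more=$(printf '%s\n' "$script_lines" | grep -c '^##*')

echo "zero or more: $zero_or_more, one or more: $one_or_more"
```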
Matching a Specific Number of Sets with \{ and \}
The modifier * specifies that the previous character set repeats zero or more times. To specify the minimum and maximum number of repeats, use the meta-characters \{ and \} and put two numbers between them, separated by a comma. The regular expression to match four, five, six, seven, or eight lowercase letters is: [a-z]\{4,8\}
Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of times specified by the first number.
You must remember that modifiers like * and \{1,5\} only act as modifiers if they follow a character set. If they were at the beginning of a pattern, they would not be modifiers.
Matching a Specific Number of Sets with \{ and \}
Examples and the exceptions:
*             Any line with a *
\*            Any line with a *
\\            Any line with a \
^*            Any line starting with a *
^A*           Any line
^A\*          Any line starting with an A*
^AA*          Any line starting with an A
^AA*B         Any line starting with one or more A's followed by a B
^A\{4,8\}B    Any line starting with four, five, six, seven, or eight A's followed by a B
^A\{4,\}B     Any line starting with four or more A's followed by a B
^A\{4\}B      Any line starting with an AAAAB
\{4,8\}       Any line with a {4,8}
A{4,8}        Any line with an A{4,8}
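The interval rows of the table can be tried with a sketch along these lines. The sample runs of A's are invented, with the number of A's noted in the comments.

```shell
# Hypothetical input: 3, 4, 8, and 9 A's before a B, plus two other lines.
runs='AAAB
AAAAB
AAAAAAAAB
AAAAAAAAAB
xAAAAB
A{4,8}'

# Four to eight A's at the start of the line, then a B.
four_to_eight=$(printf '%s\n' "$runs" | grep '^A\{4,8\}B')

# Four or more A's at the start of the line, then a B.
at_least_four=$(printf '%s\n' "$runs" | grep -c '^A\{4,\}B')

# Exactly four A's at the start of the line, then a B.
exactly_four=$(printf '%s\n' "$runs" | grep '^A\{4\}B')

# Without backslashes, the braces are ordinary characters in a basic RE.
literal_braces=$(printf '%s\n' "$runs" | grep 'A{4,8}')

printf '%s\n' "$four_to_eight" "$at_least_four" "$exactly_four" "$literal_braces"
```

The line with nine A's fails the \{4,8\} pattern because the match is anchored at the start of the line and the ninth character is another A, not a B.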
Matching a Specific Number of Sets with \{ and \}
CAUTION: Normally a backslash turns off the special meaning of a character. For example, a literal period is matched by \. and a literal asterisk is matched by \*
However, in basic regular expressions, if a backslash is placed before a <, >, {, }, (, or ), or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions, and changing the meaning of {, }, (, ), <, and > would have broken old expressions. View it as evolution.
Matching Words with \< and \>
The string the will match the word other. You can put spaces before and after the letters and use this regular expression: _the_ (each _ stands for a space). However, this does not match words at the beginning or the end of the line, and it does not match the case where there is a punctuation mark after the word.
Matching Words with \< and \>
The meta-characters \< and \> are similar to the ^ and $ anchors, as they don't occupy the position of a character. They anchor the expression between them so that it matches only on a word boundary. The pattern to search for the words the and The would be: \<[tT]he\>
Word boundary (using the above example): the character before the t or T must be either a newline character or anything except a letter, digit, or underscore (_). The character after the e must also be a character other than a digit, letter, or underscore, or it could be the EOL character.
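A sketch comparing the three approaches on invented lines. It assumes a grep that supports \< and \> (GNU grep does; they are an extension rather than part of the POSIX definition of basic REs).

```shell
# Hypothetical sample lines.
words='the end
other things
The start
see the, then go
bathe
by the way'

# Bare string: also matches inside other, then, and bathe.
plain=$(printf '%s\n' "$words" | grep -c 'the')

# Spaces on both sides: misses line starts and trailing punctuation.
spaced=$(printf '%s\n' "$words" | grep ' the ')

# Word anchors: the or The as a whole word, wherever it sits on the line.
whole_word=$(printf '%s\n' "$words" | grep '\<[tT]he\>')

echo "plain count: $plain"
printf 'spaced:\n%s\n' "$spaced"
printf 'whole word:\n%s\n' "$whole_word"
```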
Remembering Patterns with \( \) and \1
Consider searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In basic regular expressions, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit.
Remembering Patterns with \( \) and \1
To search for two identical letters, use: \([a-z]\)\1
You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1
Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command, and \1, etc., on the replacement side.
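The two backreference patterns, plus the sed usage mentioned above, can be sketched as follows. The word list and the "hello world" input are invented for illustration.

```shell
export LC_ALL=C   # assumption: C locale, so [a-z] follows ASCII order

word_list='letter
radar
level
plain
moon'

# \([a-z]\)\1 : a lowercase letter immediately followed by the same letter.
doubled=$(printf '%s\n' "$word_list" | grep '\([a-z]\)\1')

# Five-letter palindrome: letters 1 and 2 are recalled in reverse order.
palindromes=$(printf '%s\n' "$word_list" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1')

# sed: \( \) on the pattern side, \1 and \2 on the replacement side.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

printf 'doubled letters:\n%s\n' "$doubled"
printf 'palindromes:\n%s\n' "$palindromes"
echo "swapped: $swapped"
```

Neither grep pattern is anchored, so each one matches a line that contains the pattern anywhere, not only lines that consist of it.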
Extended Regular Expressions
At least two programs use extended regular expressions: egrep and awk. Perl uses expressions that are even more extended. With these extensions, the special meta-characters preceded by a backslash no longer have special meaning: \{, \}, \<, \>, \(, \), as well as \digit. There is a very good reason for this: if ( has a special meaning, then \( must be the ordinary character. This is the opposite of basic regular expressions, where ( is ordinary and \( is special.
Extended Regular Expressions
The question mark (?) matches zero or one instance of the character set before it. The plus sign (+) matches one or more copies of the character set before it.
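A short sketch of ? and + on invented input. It calls grep -E, which accepts the same extended REs as egrep.

```shell
export LC_ALL=C   # assumption: C locale, so [0-9] follows ASCII order

# u? : the u may appear zero times or once.
optional_u=$(printf 'color\ncolour\ncolouur\ncolr\n' | grep -E 'colou?r')

# + requires at least one digit, so the empty line is rejected.
plus_lines=$(printf '\n42\n7\nx\n' | grep -E '^[0-9]+$')

# * accepts zero digits, so the empty line is accepted as well.
star_count=$(printf '\n42\n7\nx\n' | grep -E -c '^[0-9]*$')

printf 'optional u:\n%s\n' "$optional_u"
printf 'one or more digits:\n%s\n' "$plus_lines"
echo "zero or more digits: $star_count lines"
```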
Extended Regular Expressions
The three important meta-characters in extended regular expressions are (, |, and ). Parentheses are used to group expressions, and the vertical bar acts as an OR operator. Together, they let you match a choice of patterns. As an example, you can use egrep to print all From: and Subject: lines from your incoming mail:
% egrep '^(From|Subject): ' /var/spool/mail/$USER
Using a simple RE, you would need something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet
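The sketch below contrasts the two patterns on a made-up set of header lines instead of a real mailbox. grep -E stands in for egrep.

```shell
# Hypothetical header lines, including the unlucky "Sromeet".
headers='From: alice
Subject: hi
To: bob
Sromeet: oops'

# Extended RE: a real choice between two strings.
extended=$(printf '%s\n' "$headers" | grep -E '^(From|Subject): ')

# Basic RE approximation: accepts the intended lines and some unintended ones.
approximate=$(printf '%s\n' "$headers" | grep '^[FS][ru][ob][mj]e*c*t*: ')

printf 'extended:\n%s\n' "$extended"
printf 'approximate:\n%s\n' "$approximate"
```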
Extended Regular Expressions
There is no \< or \> in extended REs. Matching the word the at the beginning, middle, or end of a sentence, or at the end of a line, can be done with the extended RE: (^| )the([^a-z]|$)
There are two choices before the word: a space or the beginning of a line. Following the word, there must be either something other than a lowercase letter or the end of the line. Some programming languages support \b for word boundaries.
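A sketch of the alternation-based word match on invented lines, again using grep -E in place of egrep.

```shell
export LC_ALL=C   # assumption: C locale, so [^a-z] follows ASCII order

sentences='the end
other
see the.
at the
bathe now
then'

# Before the word: start of line or a space.
# After the word: anything but a lowercase letter, or end of line.
whole_the=$(printf '%s\n' "$sentences" | grep -E '(^| )the([^a-z]|$)')

printf '%s\n' "$whole_the"
```

The lines other, bathe now, and then are rejected because the string the is either preceded or followed by a lowercase letter.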
Extended Regular Expressions
You can use the *, +, and ? modifiers after a (...) grouping. Here are two ways to match "a simple problem", "an easy problem", as well as "a problem":
% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
The second expression is more exact; the first one may also match a simpleproblem.
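The difference between the two expressions shows up when the input contains the run-together form. The sketch uses inline sample lines rather than a data file, and grep -E in place of egrep.

```shell
# Hypothetical contents of the data file.
phrases='a simple problem
an easy problem
a problem
a simpleproblem'

# The optional word and the optional space are independent of each other.
loose=$(printf '%s\n' "$phrases" | grep -E -c 'a[n]? (simple|easy)? ?problem')

# The word and its trailing space are optional only as a unit.
exact=$(printf '%s\n' "$phrases" | grep -E -c 'a[n]? ((simple|easy) )?problem')

echo "loose: $loose lines, exact: $exact lines"
```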
Getting Regular Expressions Right
The process of writing a regular expression involves three steps:
1. Knowing what you want to match and how it might appear in the text.
2. Writing a pattern to describe what you want to match.
3. Testing the pattern to see what it matches.
Don't confuse RE with wildcards
Regular expressions can be confusing because they look a lot like the file-matching patterns the shell uses. Both filename wildcards and REs give special meanings to the asterisk (*), question mark (?), parentheses (()), square brackets ([]), and vertical bar (|, the "pipe"). However, shells, find, and some other programs use file matching instead of regular expressions. Shell wildcards are expanded before the shell passes arguments to the program:
% ls *
% ls '*' (quoted to prevent shell wildcard expansion)
Use the echo command to see what the shell is doing.
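The echo technique can be sketched as follows. The scratch directory and file names are invented, and mktemp -d is assumed to be available (it is common but not required by POSIX).

```shell
# Scratch directory with two hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/b.txt"

# Unquoted: the shell replaces * with the matching file names.
unquoted=$(cd "$workdir" && echo *)

# Quoted: the program receives a literal *.
quoted=$(cd "$workdir" && echo '*')

# As a wildcard, a* means "names that start with a".
glob_match=$(cd "$workdir" && echo a*)

# As an RE, a* means "zero or more a's", which every file name satisfies.
regex_match=$(ls "$workdir" | grep -c 'a*')

rm -r "$workdir"
echo "unquoted=$unquoted quoted=$quoted glob=$glob_match regex=$regex_match"
```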
Unix Commands
grep: display lines matching a pattern. It supports both traditional REs and extended REs.
% grep '^a' file1.txt
% grep -E '(this|test)' file1.txt
find: search for files under a directory. This command supports many options and functions.
% find ~ -name 'proj1.cpp' -print
ps: display current processes.
% ps -elf
top: display a real-time view of the system.
% top
Reading Assignment
Chapter 15 on sed