Harnessing the Power of Regular Expressions for Efficient String Manipulation


Explore the world of regular expressions, a versatile tool for flexible pattern matching in strings, widely supported in various programming languages and Unix/Linux systems. Learn from Professor John Carelli at Kutztown University as he delves into the intricacies of using Python's re module for tasks like splitting, compiling, matching, and searching strings. Discover how to optimize your code by compiling expressions and leveraging match and search functionalities to enhance efficiency and accuracy in text processing.

Uploaded by wlak on Apr 23, 2025


Presentation Transcript

Regular Expressions

Regular expressions enable flexible pattern matching in strings.
They are widely supported in programming languages as well as in the Unix/Linux OS.
Python Regular Expressions (re module)

Professor John Carelli, Kutztown University

Regular Expressions in Python

The Python interface to regular expressions is the re module.
It must be imported: import re
Some re functions: split, match, search, sub, findall, compile

split example

    >>> import re
    >>> line = 'the quick brown fox jumped over a lazy dog'
    >>> re.split(' ', line)
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Note: this is comparable to line.split():

    >>> line.split(' ')
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Compiling a Regular Expression

    >>> import re
    >>> regex = re.compile(r'\s')
    >>> line = 'the quick brown fox jumped over a lazy dog'
    >>> regex.split(line)
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Note:
The \s pattern matches any whitespace character. The r prefix marks a raw string, so the backslash reaches the re module unchanged.
The split() method is called on the compiled expression object.
Compiling is useful when the same regular expression is reused, because the pattern is parsed only once.
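The reuse case can be sketched as follows. This is a minimal illustration, assuming a small made-up list named lines (not from the slides); it only shows that the compiled object and the module-level function produce the same result.

```python
import re

# Hypothetical sample data, for illustration only.
lines = ['the quick brown fox', 'jumped over', 'a lazy dog']

whitespace = re.compile(r'\s')  # compile the pattern once

# Reuse the compiled pattern object for every line.
compiled_result = [whitespace.split(text) for text in lines]

# Equivalent module-level calls, passing the pattern string each time.
module_result = [re.split(r'\s', text) for text in lines]

print(compiled_result == module_result)
```

The re module also keeps an internal cache of recently used patterns, so the speed difference is often small in short scripts; compiling mainly helps when a pattern is used many times or passed around as an object.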

match and search

match looks for a pattern starting at the beginning of a string.
search looks for the first occurrence of a pattern anywhere in a string.
Both return a match object when the pattern is found, and None otherwise.

Some match object methods:

    start()     index of the first matched character (always 0 for match)
    end()       index just past the last matched character
    group(0)    the matched string (more on groups later)
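A short sketch of the difference between the two, reusing the sample sentence from the earlier slides. The variable names are illustrative only.

```python
import re

line = 'the quick brown fox jumped over a lazy dog'

at_start = re.match('the', line)      # pattern sits at the beginning
anywhere = re.search('quick', line)   # pattern occurs later in the string
no_match = re.match('quick', line)    # match only tries position 0

print(at_start.start(), at_start.end(), at_start.group(0))
print(anywhere.start(), anywhere.end(), anywhere.group(0))
print(no_match)

# start() and end() can be used directly as slice bounds.
print(line[anywhere.start():anywhere.end()])
```

Because end() is one past the last matched character, line[start:end] always reproduces group(0).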

Comparisons to string methods

    >>> line.index('fox')    # string method
    16
    >>> regex = re.compile('fox')
    >>> result = regex.search(line)
    >>> result.start()
    16

If no match is found, None (of type NoneType) is returned:

    >>> result = re.match("fox", line)
    >>> type(result)
    <class 'NoneType'>

sub and findall

sub substitutes a string for a given pattern; it is similar to the string replace method.
findall finds all occurrences of a pattern and returns a list of all occurrences found.

Examples: sub and findall

    >>> line.replace('fox', 'BEAR')    # string method
    'the quick brown BEAR jumped over a lazy dog'
    >>> re.sub('fox', 'BEAR', line)    # no compile
    'the quick brown BEAR jumped over a lazy dog'
    >>> shakespeare = "to be or not to be"
    >>> regex = re.compile("be")       # compiled regex
    >>> regex.findall(shakespeare)
    ['be', 'be']
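Where sub and findall go beyond the string methods is when the target is a pattern rather than a fixed string. The sketch below assumes a made-up input with uneven spacing and uses the \s+ and \w+ patterns, which the later slides on character types and repetition explain.

```python
import re

# Hypothetical input with irregular whitespace, for illustration only.
messy = 'the   quick \t brown    fox'

# Replace every run of whitespace with a single space.
tidy = re.sub(r'\s+', ' ', messy)

# Collect every run of word characters.
words = re.findall(r'\w+', messy)

print(tidy)
print(words)
```

A plain str.replace call cannot express "any run of whitespace" in one step, which is the kind of task these functions are meant for.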

Why use regular expressions?

Simple strings are matched exactly:

    >>> regex = re.compile('ion')
    >>> regex.findall('Great Expectations')
    ['ion']

But some characters have special meanings:

    . ^ $ * + ? { } [ ] \ | ( )

They can be used to do more complex matching operations.

Common Special Characters

    .  (dot)     match a single character (except newline)
    ^  (caret)   match the start of a string
    $            match the end of a string
    *            match zero or more occurrences of the preceding regular expression
    +            match one or more occurrences of the preceding regular expression
    []           match any one of a set of characters
    ()           used for grouping
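A few of these characters in action. This is only a sketch with made-up strings; the variable names are illustrative.

```python
import re

text = 'cat bat cot'

start_hit = re.findall('^cat', text)    # anchored at the start of the string
end_hit = re.findall('cot$', text)      # anchored at the end of the string
dot_hits = re.findall('c.t', text)      # dot stands for any one character
set_hits = re.findall('[cb]at', text)   # 'c' or 'b', followed by 'at'

# * allows zero occurrences of the preceding 'a'; + requires at least one.
star_hits = re.findall('ca*t', 'ct cat caat')
plus_hits = re.findall('ca+t', 'ct cat caat')

print(start_hit, end_hit, dot_hits, set_hits)
print(star_hits, plus_hits)
```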

Character Types

The backslash can also be used to give normal characters special meaning.

    character   description
    "\d"        match any digit
    "\D"        match any non-digit
    "\s"        match any whitespace
    "\S"        match any non-whitespace
    "\w"        match any alphanumeric char (or underscore)
    "\W"        match any char that is not alphanumeric or underscore
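The table above can be checked with a few findall calls. The input strings here are made up for illustration.

```python
import re

sample = 'Room 101, Floor 3'   # hypothetical input

digits = re.findall(r'\d', sample)        # each digit separately
spaces = re.findall(r'\s', sample)        # each whitespace character
non_digits = re.findall(r'\D', 'a1b2')    # everything that is not a digit
word_chars = re.findall(r'\w', 'a_1!')    # letters, digits and underscore
non_word = re.findall(r'\W', 'a_1!')      # everything else

print(digits, spaces, non_digits, word_chars, non_word)
```

Note that each uppercase class is simply the complement of its lowercase counterpart.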

Character sets

Any one of a set of characters can be matched by enclosing the set in brackets [].
A character range can be specified with a dash.

Find all vowels in a string:

    >>> re.findall("[aeiou]", "vowel match")
    ['o', 'e', 'a']

A capital letter followed by a digit:

    >>> regex = re.compile('[A-Z][0-9]')
    >>> regex.findall("123A123, B22, 212, C5")
    ['A1', 'B2', 'C5']

Matching Repeated Characters

    character   description
    ?           match zero or one
    *           match zero or more
    +           match one or more
    {n}         match n repetitions
    {m,n}       match between m and n

    >>> s = "abc123"
    >>> re.findall(r"\w+", s)
    ['abc123']
    >>> re.findall(r"\d+", s)
    ['123']
    >>> regex = re.compile(r'\w{3}')
    >>> regex.findall('The quick brown fox')
    ['The', 'qui', 'bro', 'fox']
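The session above does not exercise ? or {m,n}; a small sketch with made-up strings covers those two rows of the table.

```python
import re

# ? makes the preceding 'u' optional (zero or one occurrence).
spellings = re.findall(r'colou?r', 'color colour colouur')

# {2,3} matches runs of two or three digits, taking as many as allowed.
runs = re.findall(r'\d{2,3}', '1 22 333 4444')

print(spellings)
print(runs)
```

In the second call the single digit is skipped because it is too short, and the four-digit run yields only its first three digits, since the leftover digit is again too short to match.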

Example: split on a pattern

    >>> line = 'the quick:brown # fox jumped, over a ?lazy dog'
    >>> regex = re.compile(r"[\s:#,?]+")
    >>> regex.split(line)
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

What is the above pattern?
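One way to read the pattern: the brackets define a set (whitespace, colon, hash, comma, question mark), and the + treats a run of such characters as a single separator. Inside brackets, ? is an ordinary character. The sketch below compares the pattern with and without the +; the variable names are illustrative.

```python
import re

line = 'the quick:brown # fox jumped, over a ?lazy dog'

# With +, a run such as ' # ' counts as one separator.
with_plus = re.split(r'[\s:#,?]+', line)

# Without +, every separator character splits on its own,
# leaving empty strings between adjacent separators.
without_plus = re.split(r'[\s:#,?]', line)

print(with_plus)
print(without_plus)
```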

Escape Characters

Special characters can be escaped with a preceding backslash.
The r prefix on a string literal indicates a raw string: Python does not process backslash escapes (such as \t) in it, so the backslash reaches the re module unchanged.

    >>> regex = re.compile(r'\$')
    >>> regex.findall("the cost is $20")
    ['$']
    >>> print('a\tb\tc')
    a       b       c
    >>> print(r'a\tb\tc')
    a\tb\tc
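A brief sketch of both ideas, using made-up strings. It also uses re.escape, a standard-library helper not mentioned in the slides, which adds the backslashes for special characters automatically.

```python
import re

# A normal literal turns \t into one tab character; a raw literal keeps
# the backslash and the 't' as two separate characters.
normal_len = len('\t')
raw_len = len(r'\t')

# Escaped dollar sign followed by one or more digits.
prices = re.findall(r'\$\d+', 'the cost is $20, not $5')

# re.escape builds a pattern that matches the given text literally.
literal = re.escape('1+1')
hits = re.findall(literal, '1+1=2')

print(normal_len, raw_len)
print(prices)
print(literal, hits)
```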

Email matcher

    email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

    "[\w.]+"      one or more alphanumeric characters or periods
    "@"           at sign
    "\w+"         one or more alphanumeric characters
    "\."          period
    "[a-z]{3}"    exactly three lowercase characters

    >>> line = "John Carelli email address: carelli@kutztown.edu"
    >>> email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
    >>> mat = email.search(line)
    >>> mat
    <re.Match object; span=(28, 48), match='carelli@kutztown.edu'>
    >>> mat.group(0)
    'carelli@kutztown.edu'

Match object and grouping

Groups can be defined in a pattern by enclosing them in parentheses ().
When a match or a search succeeds, a Match object is returned; if no match is found, None is returned.
The whole matched string is available as group 0, using the group() method.
Individual group matches are in additional groups, numbered from 1 up to the value of the lastindex attribute.
The groups() method returns a tuple with the group matches.
Note: __getitem__() is defined for Match, so [] can be used to access group items, including 0.

Email matcher with groups

    >>> line = "John Carelli email address: carelli@kutztown.edu"
    >>> email = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')
    >>> mat = email.search(line)
    >>> mat[0]
    'carelli@kutztown.edu'
    >>> mat.groups()
    ('carelli', 'kutztown', 'edu')
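The remaining Match features from the grouping slide (lastindex, numbered groups, and [] access) can be sketched with the same pattern. The second input string is made up to show the no-match case.

```python
import re

line = "John Carelli email address: carelli@kutztown.edu"
email = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

mat = email.search(line)
if mat is not None:                 # always check before using the result
    group_count = mat.lastindex     # index of the last matched group
    user = mat.group(1)             # first parenthesized group
    domain = mat[2]                 # [] access is equivalent to group(2)
    user_and_tld = mat.group(1, 3)  # several groups at once, as a tuple
    print(group_count, user, domain, user_and_tld)

# A string without an address yields None rather than a Match object.
missing = email.search('no address here')
print(missing)
```

Checking for None first avoids an AttributeError when the pattern is not found.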

More

W3Schools Tutorial: https://www.w3schools.com/python/python_regex.asp
