
Exploring Regular Expression Evolution in Computer Science
Explore the evolution of regular expressions in computer science, focusing on their impact in software analysis, evolution, and reengineering. Learn about the importance of regex in various applications like search, validation, and pattern matching. Follow the evolution journey through examples and understand the challenges developers face with regex. Gain insights into automated tools and strategies for mastering regex evolution.
Presentation Transcript
Computer Science. Exploring Regular Expression Evolution. Peipei Wang, Gina R. Bai, Kathryn T. Stolee, North Carolina State University (pwang7@ncsu.edu, rbai@ncsu.edu, ktstolee@ncsu.edu)
Evolution Of ...: SANER (Software Analysis, Evolution, and Reengineering); software evolution; source code evolution, clone evolution, etc.; regular expression evolution?
Regular Expressions Matter EVERYWHERE: search, find & replace, validation; user information validation; search engines; network devices, security; pattern matching in DNA sequences.
Regex Evolution Example: Step 0
Regex: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$
Example strings: joe@gmail.com, joe@web.info, smith@baidu.com, smith@163.com, pwang7@ncsu.edu
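To see which of the example addresses the Step 0 regex accepts, the pattern can be run directly. This is a minimal sketch that assumes Python's `re` dialect (the slides do not say which regex engine is meant); the helper name `check` is illustrative only.

```python
import re

# Step 0 pattern, copied from the slide. Python's `re` module is an assumption.
STEP0 = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$")

EMAILS = [
    "joe@gmail.com",
    "joe@web.info",
    "smith@baidu.com",
    "smith@163.com",
    "pwang7@ncsu.edu",
]

def check(pattern, strings):
    """Return a {string: matched?} mapping for one compiled pattern."""
    return {s: pattern.match(s) is not None for s in strings}

step0_results = check(STEP0, EMAILS)
for email, ok in step0_results.items():
    print(f"{email:18} {'match' if ok else 'no match'}")
```

Under this dialect, joe@web.info fails because its top-level domain has four letters, and smith@163.com fails because the domain class excludes digits.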
Regex Evolution Example: Step 1
From: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$
To: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$
Constraint being relaxed: top-level domain name of 2 to 3 characters.
Example strings: joe@web.info, smith@163.com
Regex Evolution Example: Step 2
From: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$
To: ^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$
Constraint being relaxed: domain name contains only letters and underscores.
Example strings: joe@web.info, smith@163.com
Regex Evolution Example: Step 2, 3, ...
From: ^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$
To: THIS IS NOT THE END
Example string: abc@zju.edu.cn
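The three versions of the email regex can be compared against the strings that motivated each edit. This is a sketch assuming Python's `re` dialect; `VERSIONS`, `PROBES`, and `evolution_table` are illustrative names, not part of the presentation.

```python
import re

# The three versions shown on the slides, in order of evolution.
VERSIONS = {
    "step 0": r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$",
    "step 1": r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$",
    "step 2": r"^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$",
}

# Strings that the slides use to motivate each change.
PROBES = ["joe@web.info", "smith@163.com", "abc@zju.edu.cn"]

def evolution_table(versions, probes):
    """For each regex version, record whether it matches each probe string."""
    return {
        name: [re.match(pattern, s) is not None for s in probes]
        for name, pattern in versions.items()
    }

table = evolution_table(VERSIONS, PROBES)
for name, row in table.items():
    print(name, dict(zip(PROBES, row)))
```

Each step fixes one more probe, yet abc@zju.edu.cn is still rejected by every version because none of them allows a second dot-separated label after the domain.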
Developers Need Help with Regex: regular expressions are error-prone; regular expressions are poorly tested; write once, read never.
My Goal: gain knowledge about regular expression evolution; understand the Why-What-How; automated tools.
Exploring Regex Evolution
RQ1: What are the characteristics of regular expression evolution?
RQ2: How similar is a regular expression to its predecessor syntactically and semantically?
RQ3: How do the features change in the evolution of a regular expression?
Exploring Regex Evolution: how data is collected
GitHub dataset: coarse-grained, persistent in code. Commit histories: git log command. 3,962 regexes after filtering out duplicates and syntax errors.
Video dataset: fine-grained, developer activity.
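The filtering step (dropping duplicates and regexes with syntax errors) could look roughly like the sketch below. It assumes the candidates are checked with Python's `re` compiler; the slides do not say which regex dialects or tools were actually used, and `filter_regexes` is a hypothetical helper.

```python
import re

def filter_regexes(candidates):
    """Keep the first occurrence of each pattern that compiles without error.

    Assumption: "syntax error" means the pattern fails to compile under
    Python's `re`; a different dialect would accept or reject other patterns.
    """
    seen = set()
    kept = []
    for pattern in candidates:
        if pattern in seen:
            continue  # exact duplicate of an earlier pattern
        seen.add(pattern)
        try:
            re.compile(pattern)
        except re.error:
            continue  # syntactically invalid in this dialect
        kept.append(pattern)
    return kept

filtered = filter_regexes(["[0-9]+", "[0-9]+", "(unclosed", "a|b"])
print(filtered)
```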
RQ2: Syntactic Changes
Regex for integers: r1: [0-9]+ ; r2: [1-9][0-9]* ; edit distance: 6
GitHub: mean 9.3; min 1.0; 10% 1.0; 25% 2.0; 50% 5.5; 75% 12.8; max 52
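The edit distance of 6 for the integer example can be reproduced with a character-level Levenshtein distance. The slides do not name the exact metric, so treating it as Levenshtein over the regex source strings is an assumption; `edit_distance` is an illustrative name.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]

distance = edit_distance("[0-9]+", "[1-9][0-9]*")
print(distance)
```

One way to reach 6 edits: insert the five characters of `[1-9]` in front of r1, then substitute `+` with `*`.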
RQ2: Semantic Changes: semantic reduction
Regex for integers: r1: [0-9]+ ; r2: [0-9]|[1-9][0-9]*
Example strings: 1a, 123, 02
1000 unique strings: matching strings for r1: 1000; matching strings for r2: 940
Matching strings for r1: 500; matching strings for r2: 500
RQ2: Semantic Changes: possible relations between the strings matched by r1 and r2: overlap, disjoint, equivalent (r1 == r2), expansion, reduction.
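The five relations can be approximated by comparing the sets of strings each regex accepts. The sketch below enumerates every string up to a small length over a small alphabet, which is a bounded approximation and not the analysis used in the study. It reads "reduction" as r2 accepting a strict subset of r1 and "expansion" as the reverse, an assumption consistent with the integer example; all names and defaults are illustrative.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All strings over `alphabet`, up to `max_len` long, fully matched by `pattern`."""
    rx = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if rx.fullmatch("".join(chars))
    }

def classify(r1, r2, alphabet="012a", max_len=3):
    """Classify the change from r1 to r2 on the bounded sample of strings."""
    old, new = language(r1, alphabet, max_len), language(r2, alphabet, max_len)
    if old == new:
        return "equivalent"
    if not (old & new):
        return "disjoint"
    if new < old:
        return "reduction"   # r2 accepts a strict subset of r1
    if old < new:
        return "expansion"   # r2 accepts a strict superset of r1
    return "overlap"

print(classify("[0-9]+", "[0-9]|[1-9][0-9]*"))
```

Because only short strings are enumerated, two regexes that differ only on longer inputs would be reported as equivalent; a larger `max_len` or alphabet reduces that risk at exponential cost.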
RQ2: Regex Semantic Changes
GitHub: Disjoint 44 (21%); Overlap 17 (8%); Equivalent 22 (11%); Reduction 20 (10%); Expansion 106 (51%)
Performance, security. Software requirement?
70% of edited regular expressions in GitHub are related to semantic correctness.
RQ2: Regex Semantic Changes
Regex for integers: r1: [0-9]+ ; r2: [0-9]|[1-9][0-9]*
1000 unique strings: matching strings for r1: 1000; matching strings for r2: 940
Matching strings for r1: 500; matching strings for r2: 500
Intersection: 94% (940/1000). Removal: 6% (60/1000). Addition: 0% (0/940).
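The intersection, removal, and addition figures can be illustrated on a small hand-written sample. This sketch assumes one sample of strings accepted by r1 and another accepted by r2. How the slides normalise the addition percentage is not fully recoverable, so dividing by the size of the r2 sample is an assumption. `semantic_change` and both sample lists are hypothetical.

```python
import re

def semantic_change(r1, r2, accepted_by_r1, accepted_by_r2):
    """Count strings kept, removed, and added when r1 evolves into r2.

    Assumes every string in `accepted_by_r1` matches r1 and every string
    in `accepted_by_r2` matches r2; this is not re-checked here.
    """
    kept = sum(1 for s in accepted_by_r1 if re.fullmatch(r2, s))
    removed = len(accepted_by_r1) - kept
    added = sum(1 for s in accepted_by_r2 if not re.fullmatch(r1, s))
    return kept, removed, added

R1 = r"[0-9]+"
R2 = r"[0-9]|[1-9][0-9]*"
sample_r1 = ["0", "7", "42", "100", "305", "9000", "18", "02", "007", "00"]
sample_r2 = ["0", "5", "10", "999", "1234"]

kept, removed, added = semantic_change(R1, R2, sample_r1, sample_r2)
print(f"intersection {kept}/{len(sample_r1)}, "
      f"removal {removed}/{len(sample_r1)}, "
      f"addition {added}/{len(sample_r2)}")
```

On this sample the strings with a leading zero (02, 007, 00) are the ones removed, mirroring the reduction shown on the slide.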
RQ2: Regex Semantic Changes: regular expressions are partially correct
GitHub, intersection (testing strings retained in the new regex): mean 57%, median 70%
GitHub, addition (testing strings that should have been accepted by the old regex): mean 39%, median 26%
GitHub, deletion (testing strings that should have been rejected by the old regex): mean 28%, median 0%
Key Facts
Why regular expressions evolve: semantic correctness (GitHub 70%); performance, security, software changes, etc. (disjoint, equivalent).
Regular expressions are partially correct.
Write once, read never (GitHub 5%).
Test, test, test.
Thanks. Peipei Wang (pwang7@ncsu.edu)