Exploring Regular Expression Evolution in Computer Science

This presentation explores how regular expressions evolve in software projects, in the context of software analysis, evolution, and reengineering. Regular expressions are used for search, validation, and pattern matching. The slides follow one regex through several edits, describe the challenges developers face with regexes, and report findings on how and why regexes change, with automated tool support as the goal.

  • Computer Science
  • Regular Expressions
  • Evolution
  • Software Analysis
  • Development

Presentation Transcript


  1. Computer Science. Exploring Regular Expression Evolution. Peipei Wang, Gina R. Bai, Kathryn T. Stolee. North Carolina State University. pwang7@ncsu.edu, rbai@ncsu.edu, ktstolee@ncsu.edu

  2. Evolution Of ... SANER (Software Analysis, Evolution, and Reengineering): software evolution.

  3. Evolution Of ... SANER (Software Analysis, Evolution, and Reengineering): software evolution; source code evolution, clone evolution, etc. Regular expression evolution?

  4. Regular Expressions Matter EVERYWHERE: search, find & replace, validation; user information validation; search engines; network devices, security; pattern matching in DNA sequences.

  5. Regex Evolution Example: Step 0. Regex: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ Example addresses: joe@gmail.com, joe@web.info, smith@baidu.com, smith@163.com, pwang7@ncsu.edu.

  6. Regex Evolution Example: Step 1. From: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ To: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$ Problem with the old regex: the top-level domain name must be 2 to 3 characters long. Example addresses: joe@web.info, smith@163.com.

  7. Regex Evolution Example: Step 2. From: ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$ To: ^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$ Problem with the old regex: the domain name may contain only letters and underscores. Example addresses: joe@web.info, smith@163.com.

  8. Regex Evolution Example: Step 2, 3, ... From: ^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$ To: THIS IS NOT THE END. Example address: abc@zju.edu.cn.
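
The evolution steps on slides 5 to 8 can be checked mechanically. The following is a minimal sketch, assuming Python's `re` module as the regex engine (the slides do not name one); the names `VERSIONS`, `EMAILS`, and `accepted` are illustrative and not from the presentation.

```python
import re

# The three versions of the email regex shown on slides 5 to 8.
VERSIONS = {
    "step 0": r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$",
    "step 1": r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,5}$",
    "step 2": r"^\w+@[a-zA-Z_0-9]+?\.[a-zA-Z]{2,5}$",
}

# Example addresses taken from the slides.
EMAILS = [
    "joe@gmail.com",
    "joe@web.info",
    "smith@baidu.com",
    "smith@163.com",
    "pwang7@ncsu.edu",
    "abc@zju.edu.cn",
]


def accepted(pattern, strings):
    """Return the strings matched by pattern, in their original order."""
    compiled = re.compile(pattern)
    return [s for s in strings if compiled.search(s)]


if __name__ == "__main__":
    for name, pattern in VERSIONS.items():
        print(name, accepted(pattern, EMAILS))
```

Under this sketch, each step accepts one more of the example addresses than the previous one, and abc@zju.edu.cn is still rejected after step 2, which is the point of "this is not the end".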

  9. Developers Need Help with Regex: regular expressions are error-prone; regular expressions are poorly tested; write once, read never.

  10. My Goal: gain knowledge about regular expression evolution; understand the why, what, and how; automated tools.

  11. Exploring Regex Evolution. RQ1: What are the characteristics of regular expression evolution? RQ2: How similar is a regular expression to its predecessor, syntactically and semantically? RQ3: How do the features change in the evolution of a regular expression?

  13. Exploring Regex Evolution: how data is collected. GitHub dataset: coarse-grained, persistent in code; commit histories obtained with the git log command; 3,962 regexes after filtering out duplicates and syntax errors. Video dataset: fine-grained, developer activity.

  14. RQ2: Syntactic Changes. Regex for integers: r1: [0-9]+, r2: [1-9][0-9]*. Edit distance: 6. Edit distance statistics for the GitHub dataset: mean 9.3, min 1.0, 10th percentile 1.0, 25th percentile 2.0, 50th percentile 5.5, 75th percentile 12.8, max 52.
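
The edit distance of 6 between the two integer regexes can be reproduced with a short sketch. It assumes the slide means character-level Levenshtein distance (insertions, deletions, and substitutions each cost 1); the slide says only "edit distance", and the function name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("[0-9]+", "[1-9][0-9]*"))  # prints 6
```

One way to reach 6: insert the five characters of `[1-9]` in front of r1, then substitute `+` with `*`.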

  15. RQ2: Semantic Changes (Semantic Reduction). Regex for integers: r1: [0-9]+, r2: [0-9]|[1-9][0-9]*. Example strings: 1a, 123, 02. Diagram labels: Matching strings for r1: 1000; Matching strings for r1: 500; 1000 unique strings; Matching strings for r2: 940; Matching strings for r2: 500.

  16. RQ2: Semantic Changes. [Figure: diagrams of the five semantic relations between r1 and r2: Disjoint, Overlap, Equivalent (r1 == r2), Expansion, Reduction.]
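
The five relations can be approximated on a finite sample of strings. This is a minimal sketch, not the authors' tooling: it assumes whole-string matching with Python's `re.fullmatch`, and it assumes "expansion" means the new regex r2 matches a strict superset of what r1 matches (and "reduction" a strict subset). The names `matched` and `classify` are illustrative. A sample can only suggest a relation; it does not prove it for all strings.

```python
import re


def matched(pattern, strings):
    """Set of sample strings that pattern matches in full."""
    return {s for s in strings if re.fullmatch(pattern, s)}


def classify(r1, r2, strings):
    """Classify the change from r1 to r2 on the given sample strings."""
    m1, m2 = matched(r1, strings), matched(r2, strings)
    if m1 == m2:
        return "equivalent"
    if m1.isdisjoint(m2):
        return "disjoint"   # no sample string is matched by both
    if m2 < m1:
        return "reduction"  # r2 matches a strict subset of r1's matches
    if m1 < m2:
        return "expansion"  # r2 matches a strict superset of r1's matches
    return "overlap"        # some shared matches, some unique to each


if __name__ == "__main__":
    samples = ["0", "7", "02", "123", "1a"]
    # The integer regexes from the slides: "02" is matched by r1 only.
    print(classify(r"[0-9]+", r"[0-9]|[1-9][0-9]*", samples))
```

On these samples the integer example is classified as a reduction, which agrees with the label on slide 15.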

  17. RQ2: Semantic Changes. GitHub: Disjoint 44 (21%), Overlap 17 (8%), Equivalent 22 (11%), Reduction 20 (10%), Expansion 106 (51%). Performance, security.

  18. RQ2: Regex Semantic Changes. GitHub: Disjoint 44 (21%), Overlap 17 (8%), Equivalent 22 (11%), Reduction 20 (10%), Expansion 106 (51%). 70% of edited regular expressions in GitHub are related to semantic correctness.

  19. RQ2: Regex Semantic Changes. GitHub: Disjoint 44 (21%), Overlap 17 (8%), Equivalent 22 (11%), Reduction 20 (10%), Expansion 106 (51%). Software requirement?

  20. RQ2: Regex Semantic Changes. Regex for integers: r1: [0-9]+, r2: [0-9]|[1-9][0-9]*. Intersection: 94% (940/1000). Removal: 6% (60/1000). Addition: 0% (0/940). Diagram labels: Matching strings for r1: 1000; Matching strings for r1: 500; 1000 unique strings; Matching strings for r2: 940; Matching strings for r2: 500.
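
The three percentages can be computed from two sets of matching strings. The sketch below infers the denominators from the fractions on the slide (intersection and removal relative to the strings matched by r1, addition relative to the strings matched by r2); that reading is an assumption, as are the function name `change_ratios` and the small hand-picked sample, which is not the presentation's data.

```python
import re


def change_ratios(r1, r2, strings):
    """Intersection, removal, and addition ratios for an edit from r1 to r2."""
    m1 = {s for s in strings if re.fullmatch(r1, s)}
    m2 = {s for s in strings if re.fullmatch(r2, s)}

    def ratio(part, whole):
        # Guard against an empty denominator.
        return len(part) / len(whole) if whole else 0.0

    return {
        "intersection": ratio(m1 & m2, m1),  # old matches kept by r2
        "removal": ratio(m1 - m2, m1),       # old matches rejected by r2
        "addition": ratio(m2 - m1, m2),      # r2 matches that r1 rejected
    }


if __name__ == "__main__":
    # Hypothetical sample strings, chosen only for illustration.
    sample = ["0", "5", "10", "123", "02", "007", "1a"]
    print(change_ratios(r"[0-9]+", r"[0-9]|[1-9][0-9]*", sample))
```

On this sample, r2 keeps four of the six strings that r1 matches, removes the two with leading zeros, and adds none, the same pattern as the slide's 94% / 6% / 0% split.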

  21. RQ2: Regex Semantic Changes. Regular expressions are partially correct. Intersection: test strings retained by the new regex. GitHub: Intersection mean 57%, median 70%; Addition mean 39%, median 26%; Deletion mean 28%, median 0%.

  22. RQ2: Regex Semantic Changes. Regular expressions are partially correct. Addition: test strings that should have been accepted by the old regex. GitHub: Intersection mean 57%, median 70%; Addition mean 39%, median 26%; Deletion mean 28%, median 0%.

  23. RQ2: Regex Semantic Changes. Regular expressions are partially correct. Deletion: test strings that should have been rejected by the old regex. GitHub: Intersection mean 57%, median 70%; Addition mean 39%, median 26%; Deletion mean 28%, median 0%.

  24. Key Facts. Why regular expressions evolve: semantic correctness (GitHub 70%); performance, security, software changes, etc. (disjoint, equivalent). Regular expressions are partially correct. Write once, read never (GitHub 5%). Test, test, test.

  25. Thanks. Peipei Wang (pwang7@ncsu.edu)
