Scanner, Automata, and Compiler Construction Overview

cse p501 compiler construction n.w
1 / 58
Embed
Share

Explore the concepts of scanner generation, hand-written scanners, grammar creation, token definition, and token types in the context of compiler construction. Learn about automata, intermediate representations, abstract syntax trees, and machine code generation. Spring 2014 session by Jim Hogg at UW.

  • Scanner
  • Compiler
  • Automata
  • Grammar
  • Token

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CSE P501 Compiler Construction Scanner Regex Automata Hand-Written Scanner Grammars & BNF Next Spring 2014 Jim Hogg - UW - CSE P501 B-1

  2. Scanner Target Source Middle End Back End Front End chars IR IR Select Instructions Scan tokens IR Optimize Allocate Registers Parse IR AST Semantics Emit IR IR Machine Code AST = Abstract Syntax Tree IR = Intermediate Representation Spring 2014 Jim Hogg - UW - CSE P501 A-2

  3. Automatic or Hand-Written? Use a scanner-generator - JFlex regex define tokens JFlex Scanner .jflex .java OR Write a scanner, in Java, by hand Easy and enlightening Will see an outline of how, later Spring 2014 Jim Hogg - UW - CSE P501 B-3

  4. Reminder: a token is . . . class C { public int fac(int n) { // factorial int nn; if (n < 1) nn = 1; else nn = n * this.fac(n-1); return nn; } } Key for Char Stream: newline \n space class C { public int fac(int n) { // factorial int nn; if(n < 1 ) nn = 1; else nn = n * (this.fac(n- 1)); return nn; } } CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE Spring 2014 Jim Hogg - UW - CSE P501 A-4

  5. A Token in your Java scanner class Token { public int kind; // eg: LPAREN, ID, ILIT public int line; public int column; // for debugging/diagnostics public String lexeme; public int value; // attribute of ILIT } // for debugging/diagnostics // eg: x , Total , ( , 42 Obviously this Token is wasteful of memory: lexeme is not required for primitive tokens, such as LPAREN, RBRACE, et value is only required for ILIT But, there's only 1 token alive at any instant during parsing, so no point refining into 3 leaner variants! Spring 2014 Jim Hogg - UW - CSE P501 B-5

  6. Typical Tokens Operators & Punctuation Single chars: + - * = / ( ] ; : Double chars: :: <= == != Keywords if while for goto return switch void Identifiers A single ID token kind, parameterized by lexeme Integer constants A single ILIT token kind, parameterized by int value See jflex-1.5.0\examples\java\java.flex for real example Spring 2014 Jim Hogg - UW - CSE P501 B-6

  7. Token Spotting if(a<=3)++grades[1]; // what are the tokens? (no spaces) public int fac(int n) { // what are the tokens? (need spaces?) Counter-example: fixed-format FORTRAN: DO 50 I = 1,99 DO 50 I = 1.2 // DO loop // assignment: DO50I = 1.2 Spring 2014 Jim Hogg - UW - CSE P501 B-7

  8. Principle of Longest Match Scanner should pick the longest possible string to make up the next token ( greedy algorithm) Example return idx <= iffy; should be scanned into 5 tokens: RETURN ID:idx LEQ ID:iffy SEMI <= is one token, not two iffy is an ID, not IF followed by ID:fy Spring 2014 Jim Hogg - UW - CSE P501 B-8

  9. Regex The syntax, of most programming languages can be specified using Regular Expressions REs in Cooper&Torczon regex is more common Tokens can be recognized by a deterministic finite automaton (DFA) DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex Spring 2014 Jim Hogg - UW - CSE P501 B-9

  10. Regex Cheat Sheet Pattern a a* a+ a? a|b ab Matches? a zero or more a s one or more a s zero or one a a or b a followed by b Pattern [c-f] [^0-3] . Matches? one of c or d or e or f any one character except 0-3 any character, except newline Precedence: * (highest), concatenation, | (lowest) Parentheses can be used to group regexs as needed Notice meta-characters, in red Escaped characters: \* \+ \? \| \. \t \n Spring 2014 Jim Hogg - UW - CSE P501 B-10

  11. Regex Examples regex [abc]+ Meaning? [abc]* (Kleene closure) [0-9]+ [1-9][0-9]* [a-zA-Z_][a-zA-Z0-9_]* (0|1)* 0 (a|b)*aa(a|b)* Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/ Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/ Spring 2014 Jim Hogg - UW - CSE P501 B-11

  12. regex Defined over some alphabet For programming languages, alphabet is ASCII or Unicode If re is a regular expression, L(re ) is the language (set of strings) generated by re Spring 2014 Jim Hogg - UW - CSE P501 B-12

  13. regex macros Possible syntax for numeric constants Digit = [0-9] Digits = Digit+ Number = Digits ( . Digits )? ( [eE] (+ | -)? Digits ) ? How would you describe this set in English? What are some examples of legal constants (strings) generated by Number? Tools like JFlex accept these convenient macros Spring 2014 Jim Hogg - UW - CSE P501 B-13

  14. Automata Finite automata (state machines) can be used to recognize strings generated by regular expressions Can build automaton by-hand or automagically Will not build by-hand in this course Will use the JFlex tool: given a set of regex, it generates an automaton recognizer (a Java class) Spring 2014 Jim Hogg - UW - CSE P501 B-14

  15. Finite Automata Terminology Phrase Finite Automaton Deterministic Finite Automaton Non-deterministic Finite Automaton Finite-State Automaton Abbreviation FA DFA NFA FSA = {DFA, NFA} Spring 2014 Jim Hogg - UW - CSE P501 B-15

  16. DFA for cat regex = cat c a t Accepting State (double circles) Start State Spring 2014 Jim Hogg - UW - CSE P501 B-16

  17. DFA for ILIT regex = [0-9][0-9]* = [0-9]+ 0-9 0-9 1 2 We have labelled the states Spring 2014 Jim Hogg - UW - CSE P501 B-17

  18. DFA for ID regex = [a-zA-Z_][a-zA-Z0-9_]* _ a-z A-Z _ a-z A-Z 1 0 0-9 Spring 2014 Jim Hogg - UW - CSE P501 B-18

  19. DFAs work like this . . . scan the input text string, character-by-character 1. following the arc/edge corresponding to the character just read 2. if there is no arc for the character just read, then, either: 3. if you are in an accepting state: you're done. Success! if you are not in an accepting state: you're done. Failure! a. b. Spring 2014 Jim Hogg - UW - CSE P501 B-19

  20. DFAs work like this - examples Scan "fac(int n);" for the regex, alphaid = [a-z]+ (lower-case alphas) We hit "(" and are already in state 1. Success 1. Scan "23;" for regex alphaid There is no arc for "2". We are still in state 0. Failure 2. Scan "today" for regex alphaid We hit end-of-string and are already in state 1. Success 3. a-z a-z 0 1 Note: no need to add arcs to the DFA for all error cases - they are implicit Spring 2014 Jim Hogg - UW - CSE P501 B-20

  21. Thompsons Construction: Combining DFAs b a DFA for: a DFA for: b a b NFA for: ab a NFA for a|b b Spring 2014 Jim Hogg - UW - CSE P501 B-21

  22. Combining DFAs, contd b a DFA for: a DFA for: b a NFA for: a* Spring 2014 Jim Hogg - UW - CSE P501 B-22

  23. Exercise Draw the NFA for: b(at|ag) | bug b a t a g u b g Spring 2014 Jim Hogg - UW - CSE P501 B-23

  24. Exercise Draw the NFA for: b(at|ag) | bug a t b a g u b g B-24 Spring 2014 Jim Hogg - UW - CSE P501

  25. NFA for a(b|c)* To recognize "acb" successfully, we need to: a guess the future correctly backtrack and retry if we fail to recognize somehow execute all possible paths b c None of these is attractive! Can we construct an equivalent DFA? b a c Spring 2014 Jim Hogg - UW - CSE P501 B-25

  26. Finite State Automaton (FSA) A finite set of states One marked as initial state One or more marked as final states States sometimes labeled or numbered A set of transitions from state to state Each labeled with symbol from , or Operate by reading input symbols (usually characters) Transition can be taken if labeled with current symbol -transition can be taken at any time (free bus ride) Accept when final state reached & no more input Scanner uses an FSA as a subroutine accept longest match from current location each time called, even if more input Reject if no transition possible, or no more input and not in final state (DFA) Spring 2014 Jim Hogg - UW - CSE P501 B-26

  27. DFA vs NFA Deterministic Finite Automata (DFA) No choice of which transition to take In particular, no transitions No guessing Non-deterministic Finite Automata (NFA) Choice of transition in at least one case Accepts if some way to reach final state on given input Reject if no possible way to final state How to implement in software? Spring 2014 Jim Hogg - UW - CSE P501 B-27

  28. DFAs in Scanners We really want DFA for speed: no backtracking, no guessing, no foretelling the future Conversion from regex to NFA is easy, right? But how to turn an NFA into an equivalent DFA? Turns out to be obvious (once seen) and easy Spring 2014 Jim Hogg - UW - CSE P501 B-28

  29. NFA to DFA NFA for a(b|c)* b 4 5 a 9 8 1 2 3 0 c 6 7 Starting with the above NFA, we want to 'collapse' epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with: b a 1 0 c B-29 Spring 2014 Jim Hogg - UW - CSE P501

  30. NFA to DFA NFA for a(b|c)* b 4 5 a 9 8 1 2 3 0 c 6 7 Begin in the Start state Foreach labelled arc leaving that state, what set of states can I reach, along labelled arc, or along transitions? Spring 2014 Jim Hogg - UW - CSE P501 B-30

  31. NFA to DFA NFA for a(b|c)* b n4 n5 a n9 n8 n1 n2 n3 n0 c n6 n7 NFA State a b c d0 = n0 d1 = {1,2,3,4,6,9} none none d1 = {1,2,3,4,6,9} none d2 = {3,4,5,6,8,9} d3 = {3,4,6,7,8,9} d2 = {3,4,5,6,8,9} none d2 = {3,4,5,6,8,9} d3 = {3,4,6,7,8,9} d3 = {3,4,6,7,8,9} none d2 = {3,4,5,6,8,9} d3 = {3,4,6,7,8,9} Spring 2014 Jim Hogg - UW - CSE P501 B-31

  32. NFA to DFA DFA for a(b|c)* d2 b b a b c d1 d0 c c d3 NFA State a b c d0 d1 - - d1 - d2 d3 d2 - d2 d3 d3 - d2 d3 Spring 2014 Jim Hogg - UW - CSE P501 B-32

  33. NFA to DFA - Even Better DFA for a(b|c)* b a d1 d0 c Can reduce number of states further, to yield above result If interested, see books for details States minimization is not examined in P501 Spring 2014 Jim Hogg - UW - CSE P501 B-33

  34. From NFA to DFA Subset construction (equivalence class) Construct DFA from NFA, where each DFA state represents a set of NFA states Key idea State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input Algorithm: example of a fixed-point computation If NFA has n states, DFA has at most 2n states => DFA is finite, can construct in finite # steps Spring 2014 Jim Hogg - UW - CSE P501 B-34

  35. Build DFA for: b(at|ag) | bug from its NFA a t 2 3 4 b 0 1 a g 12 5 6 7 u b g 8 9 10 11 NFA State a b g t u d0 = 0 - {1,2,5,9} - - - d1 = {1,2,5,9} ? ? ? ? ? ? ? ? ? ? ? B-35 Spring 2014 Jim Hogg - UW - CSE P501

  36. Build DFA for: b(at|ag) | bug from its NFA a t 2 3 4 b 0 1 a g 12 5 6 7 u b g 8 9 10 11 NFA State a b g t u d0={0} - d1={1,2,5,9} - - - d1 = {1,2,5,9} d2={3,6} - - - d3={10} d2 = {3,6} - - d4={7} d5={4,12} - d3 = {10} - - d6={11,12} - - TBD ? ? ? ? ? B-36 Spring 2014 Jim Hogg - UW - CSE P501

  37. Hand-Written Scanner Idea: show a hand-written DFA for some typical tokens Then use to construct hand-written scanner Setting: Parser calls scanner whenever it wants next token JFlex provides next_token Scanner stores current position in input For illustration only. Course project will use JFlex scanner-generator Note - most commercial compilers use hand-written scanners - generally faster Spring 2014 Jim Hogg - UW - CSE P501 B-37

  38. Scanner DFA Example Part 1 whitespace or comments 0 end of input Accept EOF 1 ( Accept LPAREN 2 ) Accept RPAREN 3 ; Accept SEMI 4 Spring 2014 Jim Hogg - UW - CSE P501 B-38

  39. Scanner DFA Example Part 2 = ! 5 6 Accept NEQ [other ] Accept NOT 7 = < 8 9 Accept LEQ [other ] Accept LESS 10 Spring 2014 Jim Hogg - UW - CSE P501 B-39

  40. Scanner DFA Example Part 3 [0-9] [0-9] 11 [other ] Accept ILIT 12 Spring 2014 Jim Hogg - UW - CSE P501 B-40

  41. Scanner DFA Example Part 4 [a-zA-Z0-9_] [a-zA-Z] 13 [other ] Accept ID or keyword 14 Strategies for handling identifiers vs keywords Hand-written scanner: look up identifier-like things in table of keywords Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords Spring 2014 Jim Hogg - UW - CSE P501 B-41

  42. Scanner class, ctor, skipWhite public class Scanner { private String prog; // the MiniJava program to be scanned private int p; // index in 'prog' of current char public Scanner(String prog) { this.prog = prog; p = 0; } private void skipWhite() { char c = prog.charAt(p); while ( Character.isWhitespace(c) ) c = prog.charAt(++p); } Spring 2014 Jim Hogg - UW - CSE P501 B-42

  43. Scanner- id private Token id() { int pBegin = p; // remember begin index of id char c = prog.charAt(p); // current char - alphabetic while ( Character.isAlphabetic(c) || Character.isDigit(c) || c == '_') { c = prog.charAt(++p); } return new Token(ID, prog.substring(pBegin, p)); } Spring 2014 Jim Hogg - UW - CSE P501 B-43

  44. Scanner - iLit private Token iLit() { int pBegin = p; // remember begin index of lexeme char c = prog.charAt(p); // current char int val = Character.getNumericValue(c); // convert to int while ( Character.isDigit(c) ) { c = prog.charAt(++p); val = 10 * val + Character.getNumericValue(c); } String lex = prog.substring(pBegin, p); return new Token(ID, lex, val); } // step thru chars of number Spring 2014 Jim Hogg - UW - CSE P501 B-44

  45. Scanner - nextToken public Token nextToken() { skipWhitespace(); // returns at prog[p] char c = prog.charAt(p); // current char in 'prog' char n = prog.charAt(p + 1); // next char in 'prog' switch (c) { case >': if (n == '=') { p++; p++; return new Token(GEQ, >="); } else { p++; return new Token(GT, >"); } // . . . case '+': p++; return new Token(PLUS, "+"); // . . . } // end of switch Spring 2014 Jim Hogg - UW - CSE P501 B-45

  46. Scanner nextToken, contd if (Character.isDigit(c)) { return this.iLit(); } else if (Character.isAlphabetic(c)) { return this.id(); } else { return new Token(BAD, ""); } } // end of nextToken } // end of class Scanner An entire hand-written scanner for MiniJava takes ~100 lines of Java Spring 2014 Jim Hogg - UW - CSE P501 B-46

  47. Grammars & BNF Since the 60s, the syntax of every significant programming language has been specified by a formal grammar First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax Borrowed from the linguistics community (Noam Chomsky) Spring 2014 Jim Hogg - UW - CSE P501 B-47

  48. Grammar for a Tiny Language statement | program statement assignStmt | ifStmt program statement assignStmt id = expr ; ifStmt expr id ilit if ( expr ) statement id | ilit | expr + expr a | b | c | i | j | k | n | x | y | z 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Note: often see ::= used instead of Spring 2014 Jim Hogg - UW - CSE P501 B-48

  49. program ::= statement | program statement statement ::= assignStmt | ifStmt assignStmt ::= id = expr ; ifStmt ::= if ( expr ) statement expr ::= id | ilit | expr + expr id ::= a | b | c | i | j | k | n | x | y | z ilit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Example Derivation P S | P S S A | I A id = E ; I if ( E ) S E id | ilit | E + E id [a-z] ilit [0-9] a = 1 ; if ( a + 1 ) b = 2 ; Spring 2014 Jim Hogg - UW - CSE P501 B-49

  50. Parse Tree - First Few Steps P S | P S S A | I A id = E ; I if ( E ) S E id | ilit | E + E id [a-z] ilit [0-9] P P S S A E = ; id ilit a = 1 ; if ( a + 1 ) b = 2 ; B-50

More Related Content