
Introduction to Compilers: Understanding the Basics
Explore the fundamentals of compilers through this compilation from a Compiler course. Learn about the process of compiling, data structures, symbol tables, and more essential concepts in compiler design.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Short introduction to compilers Prabhas Chongstitvatana This is a compilation from the slides used in Compiler course at Dept. of Math and CS, Chulalongkorn. It is generously granted to be used in this class.
Introduction Jaruloj Chongstitvatana Department of Mathematics and Computer Science Chulalongkorn University Chapter 1 2301373: Introduction 2
What is a Compiler? A compiler is a computer program that translates a program in a source language into an equivalent program in a target language. A source program/code is a program/code written in the source language, which is usually a high- level language. Source program Target program compiler A target program/code is a program/code written in the target language, which often is a machine language or an intermediate code. Error message Chapter 1 2301373: Introduction 3
Process of Compiling Stream of characters scanner Stream of tokens parser Parse/syntax tree Semantic analyzer Annotated tree Intermediate code generator Intermediate code Code optimization Intermediate code Code generator Target code Code optimization Target code Chapter 1 2301373: Introduction 4
Some Data Structures Symbol table Literal table Parse tree Chapter 1 2301373: Introduction 5
Symbol Table Identifiers are names of variables, constants, functions, data types, etc. Store information associated with identifiers Information associated with different types of identifiers can be different Information associated with variables are name, type, address,size (for array), etc. Information associated with functions are name,type of return value, parameters, address, etc. Chapter 1 2301373: Introduction 6
Symbol Table (contd) Accessed in every phase of compilers The scanner, parser, and semantic analyzer put names of identifiers in symbol table. The semantic analyzer stores more information (e.g. data types) in the table. The intermediate code generator, code optimizer and code generator use information in symbol table to generate appropriate code. Mostly use hash table for efficiency. Chapter 1 2301373: Introduction 7
Literal table Store constants and strings used in program reduce the memory size by reusing constants and strings Can be combined with symbol table Chapter 1 2301373: Introduction 8
Parse tree Dynamically-allocated, pointer-based structure Information for different data types related to parse trees need to be stored somewhere. Nodes are variant records, storing information for different types of data Nodes store pointers to information stored in other data structure, e.g. symbol table Chapter 1 2301373: Introduction 9
Introduction A scanner, sometimes called a lexical analyzer A scanner : gets a stream of characters (source program) divides it into tokens Tokens are units that are meaningful in the source language. Lexemes are strings which match the patterns of tokens. Scanner 2301373 Introduction to Compilers 11
Examples of Tokens in C Tokens identifier number string open parentheses close parentheses Semicolon reserved word if Lexemes Age, grade,Temp, zone, q1 3.1416, -498127,987.76412097 A cat sat on a mat. , 90183654 ( ) ; IF, if, If, iF Scanner 2301373 Introduction to Compilers 12
Scanning When a token is found: It is passed to the next phase of compiler. Sometimes values associated with the token, called attributes, need to be calculated. Some tokens, together with their attributes, must be stored in the symbol/literal table. it is necessary to check if the token is already in the table Examples of attributes Attributes of a variable are name, address, type, etc. An attribute of a numeric constant is its value. Scanner 2301373 Introduction to Compilers 13
How to construct a scanner Define tokens in the source language. Describe the patterns allowed for tokens. Write regular expressions describing the patterns. Construct an FA for each pattern. Combine all FA s. Scanner 2301373 Introduction to Compilers 14
Regular Expression a character or symbol in the alphabet : an empty string : an empty set if r and s are regular expressions r | s r s r * (r ) Scanner 2301373 Introduction to Compilers 15
Extension of regular expr. [a-z] any character in a range from a to z . any character r + one or more repetition r ? optional subexpression ~(a | b | c), [^abc] any single character NOT in the set Scanner 2301373 Introduction to Compilers 16
Examples of Patterns (a | A) = the set {a, A} [0-9]+ = (0 |1 |...| 9) (0 |1 |...| 9)* (0-9)? = (0 | 1 |...| 9 | ) [A-Za-z] = (A |B |...| Z |a |b |...| z) A . = the string with A following by any one symbol ~[0-9] = [^0123456789] = any character which is not 0, 1, ..., 9 Scanner 2301373 Introduction to Compilers 17
Describing Patterns of Tokens reservedIF = (IF| if| If| iF) = (I|i)(F|f) letter = [a-zA-Z] digit =[0-9] identifier = letter (letter|digit)* numeric = (+|-)? digit+ (. digit+)? (E (+|-)? digit+)? Comments { (~})* } /* ([^*]*[^/]*)* */ ;(~newline)* newline Scanner 2301373 Introduction to Compilers 18
Disambiguating Rules IF is an identifier or a reserved word? A reserved word cannot be used as identifier. A keyword can also be identifier. <= is < and = or <=? Principle of longest substring When a string can be either a single token or a sequence of tokens, single-token interpretation is preferred. Scanner 2301373 Introduction to Compilers 19
FA Recognizing Tokens Identifier Numeric letter letter,digit E . digit digit +,-,e +,-,e digit E digit digit digit Comment ~/ / / * * ~* Scanner 2301373 Introduction to Compilers 20
Lookahead letter, digit I,i F,f Return ID [other] Return IF Scanner 2301373 Introduction to Compilers 21
Nested IF switch (state) { case 0: { if isletter(nxt) state=1; elseif isdigit(nxt) state=2; else state=3; break; } case 1: { if isletVdig(nxt) state=1; else state=4; break; } } letter, digit other 1 4 digit 0 2 3 Scanner 2301373 Introduction to Compilers 22
Transition table St 0 1 2 3 letter, digit ch letter 1 1 .. .. other 1 4 digit 2 1 .. .. digit 0 2 3 4 .. 3 Scanner 2301373 Introduction to Compilers 23
Simulating a DFA initialize current_state=start while (not final(current_state)) { next_state=dfa(current_state, next) current_state=next_state; } Scanner 2301373 Introduction to Compilers 24
Context-Free Grammars Using grammars in parsers Jaruloj Chongstitvatana Department of Mathametics and Computer Science Chulalongkorn University 25 2301373 Chapter 3 Context-free Grammar
Parsing Process Call the scanner to get tokens Build a parse tree from the stream of tokens A parse tree shows the syntactic structure of the source program. Add information about identifiers in the symbol table Report error, when found, and recover from thee error 26 2301373 Chapter 3 Context-free Grammar
Context-Free Grammar a quintuple (V, T, P, S) where V is a finite set of nonterminals, containing S, T is a finite set of terminals, P is a set of production rules in the form of where is in V and is in (V UT )*, and S is the start symbol. Any string in (V U T)* is called a sentential form. 27 2301373 Chapter 3 Context-free Grammar
Examples E E O E E (E) E id O + O - O * O / S SS S (S)S S () S 28 2301373 Chapter 3 Context-free Grammar
Backus-Naur Form (BNF) Nonterminals are in < >. Terminals are any other symbols. ::= means . | means or. Examples: <E> ::= <E><O><E>| (<E>) | ID <O> ::= + | - | * | / <S> ::= <S><S> | (<S>)<S> | () | 29 2301373 Chapter 3 Context-free Grammar
Examples S SS | (S)S | () | S SS (S)SS (S)S(S)S (S)S(())S ((S)S)S(())S ((S)())S(())S ((())())S(())S ((())()) (())S ((())())(()) E E O E | (E) | id O + | - | * | / E E O E (E) O E (E O E) O E *((E O E) O E) O E ((id O E)) O E) O E ((id + E)) O E) O E ((id + id)) O E) O E *((id + id)) * id) + id 2301373 Chapter 3 Context-free Grammar 30
Parsing Jaruloj Chongstitvatana Department of Mathematics and Computer Science Chulalongkorn University
Introduction Parsing is a process that constructs a syntactic structure (i.e. parse tree) from the stream of tokens. We already learn how to describe the syntactic structure of a language using (context-free) grammar. So, a parser only need to do this? Stream of tokens Parse tree Parser Context-free grammar 32
Parse Trees and Derivations E E E + E id + E id + E * E id + id * E id + id * id E E + E * E E + E * id E + id * id id + id * id E E + E E id * id id Top-down parsing E E + E E E + E E id * id id Bottom-up parsing 33
Recursive-Descent Write one procedure for each set of productions with the same nonterminal in the LHS Each procedure recognizes a structure described by a nonterminal. A procedure calls other procedures if it need to recognize other structures. A procedure calls match procedure if it need to recognize a terminal. 34
Recursive-Descent: Example For this grammar: We cannot decide which rule to use for E, and If we choose E E O F, it leads to infinitely recursive loops. Rewrite the grammar into EBNF E E O F | F O + | - F ( E ) | id E ::= F {O F} O ::= + | - F ::= ( E ) | id procedure F { switch token { case (: match( ( ); case id: match(id); default: error; } } procedure E { E; O; F; } E; match( ) ); procedure E { F; while (token=+ or token=-) { O; F; } } 35
Match procedure procedure match(expTok) { if (token==expTok) then getToken else error } The token is not consumed until getToken is executed. 36
Problems in Recursive-Descent Difficult to convert grammars into EBNF Cannot decide which production to use at each point Cannot decide when to use -production A 37
LL(1) Parsing LL(1) Read input from (L) left to right Simulate (L) leftmost derivation 1 lookahead symbol Use stack to simulate leftmost derivation Part of sentential form produced in the leftmost derivation is stored in the stack. Top of stack is the leftmost nonterminal symbol in the fragment of sentential form. 38
Concept of LL(1) Parsing Simulate leftmost derivation of the input. Keep part of sentential form in the stack. If the symbol on the top of stack is a terminal, try to match it with the next input token and pop it out of stack. If the symbol on the top of stack is a nonterminal X, replace it with Y if we have a production rule X Y. Which production will be chosen, if there are both X Y and X Z ? 39
Example of LL(1) Parsing F N n E TX FNX (E)NX (TX)NX (FNX)NX (nNX)NX (nX)NX (nATX)NX (n+TX)NX (n+FNX)NX (n+(E)NX)NX (n+(TX)NX)NX (n+(FNX)NX)NX (n+(nNX)NX)NX (n+(nX)NX)NX (n+(n)NX)NX (n+(n)X)NX (n+(n))NX (n+(n))MFNX (n+(n))*FNX (n+(n))*nNX (n+(n))*nX (n+(n))*n T ( E X ( n + ( n ) ) * n $ F n A + F ) E TX X ATX| A + | - T FN N MFN| M * F ( E ) | n ( T N T N Finished Finished E X X M * F ) F n T N N E X $ 40