Compilers: Structure, Functions, and Tools

Delve into the world of compilers, exploring their role in translating source code, the differences between compilers and interpreters, hybrid compilers, bootstrapping, and the structure of a compiler. Learn about compiler construction tools and the pivotal role of the lexical analyzer in breaking down source code into tokens for further processing.

  • Compilers
  • Compiler Tools
  • Lexical Analyzer
  • Language Processing
  • Compiler Structure

Presentation Transcript


  1. Lecture 01. Sharun Akter Khushbu, Lecturer, CSE, DIU

  2. What is a Compiler? A compiler translates a program written in one language (the source language) into an equivalent program in another language (the target language). Diagram: source program → compiler → target program; the target program then takes input and produces output.

  3. Interpreter An interpreter, by contrast, appears to execute the source program directly on its input. Diagram: source program + input → interpreter → output.

  4. Hybrid Compiler A hybrid compiler combines translation and interpretation: a translator produces an intermediate program (byte codes, in the case of Java), and a virtual machine then runs that intermediate program on the input to produce output. Diagram: source program → translator → intermediate program; intermediate program + input → virtual machine → output. Some Java compilers, called just-in-time compilers, translate the intermediate program into machine code immediately before the program starts, for faster processing.

  5. Bootstrapping Developing a system using itself, for example writing a compiler in the very language it is meant to compile.

  6. A Language Processing System Diagram: source program → preprocessor → modified source program → compiler → target assembly program → assembler → relocatable machine code → linker/loader (together with library files and other relocatable object files) → target machine code.

  7. Structure of a Compiler The phases are grouped into a front end and a back end. Front end: lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator. Back end: code optimizer → code generator. The symbol table manager and the error handler interact with all of the phases.
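
  To make these phases concrete, here is a sketch of how the statement used later in this lecture, position := initial + rate * 60, moves through the phases. The intermediate forms follow the standard textbook walk-through (Aho et al.) and are illustrative rather than the lecture's own figures.

      Source:             position := initial + rate * 60
      Lexical analysis:   <id,1> <:=> <id,2> <+> <id,3> <*> <num,60>
      Syntax analysis:    a syntax tree:  := ( id1, + ( id2, * ( id3, 60 ) ) )
      Semantic analysis:  the integer 60 is converted for the floating-point context:
                          := ( id1, + ( id2, * ( id3, inttofloat(60) ) ) )
      Intermediate code:  t1 = inttofloat(60)
                          t2 = id3 * t1
                          t3 = id2 + t2
                          id1 = t3
      Code optimization:  t1 = id3 * 60.0
                          id1 = id2 + t1
      Code generation:    machine instructions that load id2 and id3, multiply, add, and store the result into id1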

  8. Compiler Construction Tools 1. Scanner generators, which generate lexical analyzers. 2. Parser generators, which generate syntax analyzers (parsers). 3. Syntax-directed translation engines, which generate collections of routines for walking a parse tree. 4. Automatic code generators. 5. Data-flow analysis engines.

  9. The Role of the Lexical Analyzer
  Primary role: scan the source program (a string of characters) and break it up into small, meaningful units called tokens.
  Example: position := initial + rate * 60; is transformed into meaningful units: identifiers, constants, operators, and punctuation.
  Other roles:
  • Removal of comments
  • Case conversion
  • Removal of white space
  • Interpretation of compiler directives or pragmas: for instance, in Turbo Pascal {$R+} means range checking is enabled.
  • Communication with the symbol table: store information regarding an identifier in the symbol table (not advisable in cases where scopes can be nested).
  • Preparation of the output listing: keep track of the source program, line numbers, and correspondences between error messages and line numbers.
  Why separate the lexical analyzer from the parser? Simpler design of both the lexical analyzer and the parser, a more efficient compiler, and a more portable compiler.

  10. Tokens
  Examples of tokens:
  • Operators: = + > ( { := == <>
  • Keywords: if while for int double
  • Numeric literals: 43 6.035 -3.6e10 0x13F3A
  • Character literals: 'a' '~' '\'
  • String literals: "3.142" "aBcDe" "\"
  Examples of non-tokens:
  • White space: space ( ), tab (\t), end of line (\n)
  • Comments: /* this is not a token */

  11. Interaction of the Lexical Analyzer and the Parser Diagram: the source program flows into the lexical analyzer; the parser requests the next token by calling nexttoken(), and the lexical analyzer returns a token; both components interact with the symbol table.

  12. How it works The lexical analyzer performs certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input). Sometimes, lexical analyzers are divided into two processes: a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one. b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output. A minimal sketch of the scanning half is shown below.
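
  The C fragment below is a hedged sketch of the scanning half only, under the assumption that scanning just deletes /* ... */ comments and compacts runs of whitespace into a single blank before any tokens are produced. It is an illustration, not the lecture's own code.

      #include <stdio.h>

      /* Scanning pass (sketch): copy stdin to stdout, deleting C-style comments
         and compacting consecutive whitespace characters into one blank. */
      int main(void) {
          int c, in_ws = 0;
          while ((c = getchar()) != EOF) {
              if (c == '/') {
                  int d = getchar();
                  if (d == '*') {                      /* comment: skip until the closing * / */
                      int prev = 0;
                      while ((c = getchar()) != EOF && !(prev == '*' && c == '/'))
                          prev = c;
                      continue;                        /* the comment is simply deleted */
                  }
                  if (d != EOF) ungetc(d, stdin);      /* not a comment: put the character back */
              }
              if (c == ' ' || c == '\t' || c == '\n') {
                  if (!in_ws) putchar(' ');            /* emit at most one blank per run */
                  in_ws = 1;
              } else {
                  putchar(c);
                  in_ws = 0;
              }
          }
          return 0;
      }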

  13. Two issues in lexical analysis: (1) How to specify tokens (patterns)? (2) How to recognize the tokens given a token specification (how to implement the nexttoken() routine)? How to specify tokens: all the basic elements in a language must be tokens so that they can be recognized. Example: main() { int i, j; for (i=0; i<50; i++) { printf("i = %d", i); } } There are not many types of tokens in a typical programming language: constants, identifiers, reserved words, operators, and miscellaneous symbols.
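
  The following is a hedged, hand-written sketch of a nexttoken() routine in C for a tiny C-like language. The token kinds, the Token structure, the keyword list, and the 64-character lexeme limit are illustrative assumptions, not the lecture's own code; a real scanner would normally be generated from regular expressions.

      #include <ctype.h>
      #include <stdio.h>
      #include <string.h>

      /* Hypothetical token kinds and token record. */
      typedef enum { TOK_NUM, TOK_ID, TOK_KEYWORD, TOK_OP, TOK_PUNCT, TOK_EOF } TokenKind;
      typedef struct { TokenKind kind; char lexeme[64]; } Token;

      static const char *keywords[] = { "int", "double", "for", "if", "while", NULL };
      static const char *kindname[] = { "NUM", "ID", "KEYWORD", "OP", "PUNCT", "EOF" };

      /* nexttoken(): skip whitespace, then group the next characters into one lexeme. */
      Token nexttoken(void) {
          Token t = { TOK_EOF, "" };
          int c, n = 0;
          while ((c = getchar()) != EOF && isspace(c))       /* strip separating whitespace */
              ;
          if (c == EOF) return t;
          if (isdigit(c)) {                                  /* constant: digit+ */
              t.kind = TOK_NUM;
              do { t.lexeme[n++] = (char)c; } while (isdigit(c = getchar()) && n < 63);
          } else if (isalpha(c) || c == '_') {               /* identifier or reserved word */
              t.kind = TOK_ID;
              do { t.lexeme[n++] = (char)c; } while ((isalnum(c = getchar()) || c == '_') && n < 63);
              t.lexeme[n] = '\0';
              for (int i = 0; keywords[i]; i++)
                  if (strcmp(t.lexeme, keywords[i]) == 0) t.kind = TOK_KEYWORD;
          } else if (strchr("+-*/=<>!", c)) {                /* operator (single character only) */
              t.kind = TOK_OP;
              t.lexeme[n++] = (char)c;
              c = getchar();
          } else {                                           /* misc. symbol: ( ) { } ; , ... */
              t.kind = TOK_PUNCT;
              t.lexeme[n++] = (char)c;
              c = getchar();
          }
          t.lexeme[n] = '\0';
          if (c != EOF) ungetc(c, stdin);                    /* give back the lookahead character */
          return t;
      }

      int main(void) {                                       /* driver: print each token and its lexeme */
          for (Token t = nexttoken(); t.kind != TOK_EOF; t = nexttoken())
              printf("%-8s %s\n", kindname[t.kind], t.lexeme);
          return 0;
      }

  Run on the for-loop fragment above, such a routine would classify for and int as keywords, i and j as identifiers, and 50 as a numeric constant.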

  14. Types of tokens in C++, for the example main() { int i, j; for (i=0; i<50; i++) { printf("i = %d", i); } }:
  • Constants: character constants such as 'a'; string constants such as "i = %d"; integer constants such as 50; floating-point constants
  • Identifiers: i, j, counter, ...
  • Reserved words: main, int, for, ...
  • Operators: +, =, ++, /, ...
  • Misc. symbols: (, ), {, }, ...
  Tokens are specified by regular expressions.
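
  As an illustration of that last sentence, a few textbook-style regular definitions for such token classes might look as follows; the names and the exact patterns are assumptions for illustration.

      letter   → [A-Za-z_]
      digit    → [0-9]
      id       → letter ( letter | digit )*
      intcon   → digit+
      floatcon → digit+ . digit+ ( (E | e) (+ | -)? digit+ )?
      relop    → < | <= | == | != | >= | >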

  15. Lexical Analysis vs Parsing There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.
  • Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
  • Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
  • Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

  16. Tokens, Patterns, and Lexemes
  • Token: a certain classification of entities of a program. There were four kinds of tokens in the previous example: identifiers, operators, constants, and punctuation.
  • Lexeme: a specific instance of a token, used to differentiate tokens. For instance, both position and initial belong to the identifier class, yet each is a different lexeme. The lexical analyzer may return a token type to the parser, but it must also keep track of attributes that distinguish one lexeme from another. Examples of attributes: for identifiers, the string; for numbers, the value. Attributes are used during semantic checking and code generation; they are not needed during parsing.
  • Pattern: a rule describing how the lexemes of a token are specified in a program. Patterns are needed because a language can contain infinitely many possible strings, and they cannot all be enumerated. Formal mechanisms are used to represent these patterns. Formalism helps in describing precisely (i) which strings belong to the language and (ii) which do not, and it also forms the basis for tools that can automatically determine whether a string belongs to a language.
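
  One common way to carry a token together with its distinguishing attribute is a tagged record. This is a hedged sketch; the type and field names are hypothetical, not part of the lecture.

      /* A token as seen by the parser: the kind drives parsing decisions,
         the attribute distinguishes one lexeme of that kind from another. */
      typedef enum { ID, NUM, OP, PUNCT } Kind;

      typedef struct {
          Kind kind;
          union {
              int    sym_index;   /* identifiers: index of the symbol-table entry */
              double value;       /* numbers: the numeric value */
              char   ch;          /* operators and punctuation: the character itself */
          } attr;
      } Token;

  With this layout, position and initial are both returned as kind ID, but they carry different sym_index attributes pointing at different symbol-table entries.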

  17. How are patterns specified? Using a meta-language called regular expressions.
  • Alphabet: a finite set of symbols; the symbol Σ is used for specifying an alphabet.
  • Sentence (string): a finite sequence of symbols drawn from the alphabet. The empty string, denoted ε, is the string of length 0.
  • Language: any set of strings defined over an alphabet. From the lexical analyzer's point of view, this language denotes the set of all tokens in the programming language.
  The following operators are defined over sets of strings L and U:
  1. Union: S = L ∪ U = { s | s ∈ L or s ∈ U }
  2. Concatenation: S = LU (also written L.U) = { st | s ∈ L and t ∈ U }
  3. Kleene closure: S = L* = the union of L^i for all i ≥ 0, i.e. the set of all strings formed by concatenating zero or more strings of L, including ε
  4. Positive closure: S = L+ = LL*
  Regular expression: a notation for defining the set of tokens that normally occur in programming languages. For each regular expression r there is a corresponding set of strings L(r), said to be derived from the regular expression; it is also called a regular set.
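
  A small worked example may make these operators concrete; the two languages are chosen only for illustration.

      Let L = { a, b } and U = { 0, 1 }. Then
        L ∪ U = { a, b, 0, 1 }
        LU    = { a0, a1, b0, b1 }
        L*    = { ε, a, b, aa, ab, ba, bb, aaa, ... }   (all strings over a and b, including ε)
        L+    = { a, b, aa, ab, ba, bb, aaa, ... }      (as L*, but without ε, since ε is not in L)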

  18. Example (contd.) printf("Total = %d\n", score);

  19. Example For a statement such as the one above, the token classes are typically chosen as follows: 1. One token for each keyword; the pattern for a keyword is the same as the keyword itself. 2. Tokens for the operators, either individually or in classes. 3. One token representing all identifiers. 4. One or more tokens representing constants, such as numbers and literal strings. 5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

  20. Lexical errors Consider fi (a==f(x)): is fi a misspelled keyword if, or an undeclared function identifier? Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler handle the error. How can the lexical analyzer recover when no pattern matches a prefix of the remaining input? Possible repair actions: 1. Delete one character from the remaining input. 2. Insert a missing character into the remaining input. 3. Replace a character by another character. 4. Transpose two adjacent characters. The simplest of these, deletion, is sketched below.
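
  The fragment below is a minimal, hedged C sketch of the first repair action only: characters that could not take part in any token of a small C-like language are deleted from the remaining input and reported, while everything else is passed through. The set of acceptable characters is an assumption for illustration.

      #include <ctype.h>
      #include <stdio.h>
      #include <string.h>

      /* Recovery by deletion (sketch): drop characters that cannot start or
         continue any token, report them, and keep scanning from the next one. */
      int main(void) {
          int c;
          while ((c = getchar()) != EOF) {
              if (isalnum(c) || isspace(c) || c == '_' ||
                  strchr("+-*/=<>!&|(){}[];,.\"'", c)) {
                  putchar(c);                                  /* may be part of a token: keep it */
              } else {
                  fprintf(stderr, "lexical error: deleting '%c'\n", c);
              }
          }
          return 0;
      }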

  21. To do... Regular expressions, recognizing tokens, deterministic finite automata.
