
Clang: An Overview of Code Analysis Techniques
Explore the motivation behind learning code analysis techniques, the structure of Clang's abstract syntax tree (AST), and practical examples of analyzing and modifying C/C++ code using Clang. Discover how Clang enables automated analysis and modification of program code through its AST manipulation capabilities.
Presentation Transcript

Clang Tutorial
Moonzoo Kim, School of Computing, KAIST
The original slides were written by Yongbae Park (yongbae2@gmail.com).

Content
- Motivation for learning code analysis techniques
- Overview of Clang
- AST structure of Clang
  - Decl class
  - Stmt class
- Traversing Clang AST

Motivation for Learning Code Analysis Techniques
- Biologists know how to analyze laboratory mice. They also know how to modify the mice by applying new medicine or artificial organs.
- Mechanical engineers know how to analyze and modify mechanical products using CAD tools.
- Software engineers likewise have to know how to analyze and modify software code, which is far more complex than any engineering product. Software analysis and modification therefore require automated analysis tools:
  - Source-level analysis frameworks (e.g., Clang, C Intermediate Language (CIL), the EDG parser)
  - Low-level intermediate representation (IR) analysis frameworks (e.g., LLVM IR)

Overview
- There are frequent opportunities to analyze or modify program code mechanically and automatically.
  - Ex1. Refactoring code for various purposes
  - Ex2. Generating test drivers automatically
  - Ex3. Inserting probes to monitor target program behavior
- Clang is a library that converts a C program into an abstract syntax tree (AST) and manipulates the AST.
  - Ex) Finding branches, renaming variables, pointer alias analysis, etc.
- Clang is particularly useful for making simple modifications to C/C++ code.
  - Ex1. Add printf("Branch Id:%d\n", bid) at each branch
  - Ex2. Add assert(pt != NULL) right before referencing pt
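
The two modifications named above (Ex1 and Ex2) can be pictured with a small hand-written sketch of what instrumented code might look like after such a tool has run. Nothing here is produced by Clang: the function is_positive, the helper probe, the counter array branch_hits and the branch ids 0 and 1 are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical per-branch hit counters, so the probes leave a checkable trace.
static int branch_hits[2] = {0, 0};

// Hypothetical probe: reports and counts the branch that was taken.
static void probe(int bid) {
  std::printf("Branch Id:%d\n", bid);
  ++branch_hits[bid];
}

// Assumed original body: if (*pt > 0) return 1; else return 0;
int is_positive(const int *pt) {
  assert(pt != nullptr);  // inserted right before pt is dereferenced
  if (*pt > 0) {
    probe(0);             // inserted at the "then" branch
    return 1;
  } else {
    probe(1);             // inserted at the "else" branch
    return 0;
  }
}
```

Calling is_positive once with a positive value and once with a negative value exercises both probes, so each counter ends at 1. Routing the probe through one helper keeps the inserted text at each branch to a single statement, which is the kind of edit that is easy to generate mechanically.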

Example C code
- 2 functions are declared: myPrint and main
  - The main function calls myPrint and returns 0
  - The myPrint function calls printf
  - myPrint contains if and for statements
- 1 global variable is declared: global

    //Example.c
    #include <stdio.h>

    int global;

    void myPrint(int param) {
        if (param == 1)
            printf("param is 1");
        for (int i = 0 ; i < 10 ; i++ ) {
            global += i;
        }
    }

    int main(int argc, char *argv[]) {
        int param = 1;
        myPrint(param);
        return 0;
    }

Example AST
- Clang generates 3 ASTs: one each for myPrint(), main(), and global
- A function declaration has a function body and parameters

AST for global:

    VarDecl global 'int'

AST for myPrint():

    FunctionDecl myPrint 'void (int)'
      ParmVarDecl param 'int'
      CompoundStmt
        IfStmt
          Null
          BinaryOperator '==' 'int'
            ImplicitCastExpr 'int'
              DeclRefExpr 'param' 'int'
            IntegerLiteral 1 'int'
          CallExpr 'int'
            ImplicitCastExpr 'int (*)()'
              DeclRefExpr 'printf' 'int ()'
            ImplicitCastExpr 'char *'
              StringLiteral "param is 1" 'char [11]'
          Null
        ForStmt
          DeclStmt
            VarDecl i 'int'
              IntegerLiteral 0 'int'
          Null
          BinaryOperator '<' 'int'
            ImplicitCastExpr 'int'
              DeclRefExpr 'i' 'int'
            IntegerLiteral 10 'int'
          UnaryOperator '++' 'int'
            DeclRefExpr 'i' 'int'
          CompoundStmt
            CompoundAssignOperator '+=' 'int'
              DeclRefExpr 'global' 'int'
              ImplicitCastExpr 'int'
                DeclRefExpr 'i' 'int'

AST for main():

    FunctionDecl main 'int (int, char **)'
      ParmVarDecl argc 'int'
      ParmVarDecl argv 'char **':'char **'
      CompoundStmt
        DeclStmt
          VarDecl param 'int'
            IntegerLiteral 1 'int'
        CallExpr 'void'
          ImplicitCastExpr 'void (*)()'
            DeclRefExpr 'myPrint' 'void ()'
          ImplicitCastExpr 'int'
            DeclRefExpr 'param' 'int'
        ReturnStmt
          IntegerLiteral 0 'int'

Structure of AST
- Each node in the AST is an instance of either the Decl or the Stmt class
  - Decl represents declarations, and there are subclasses of Decl for different declaration types
    - Ex) FunctionDecl class for a function declaration and ParmVarDecl class for a function parameter declaration
  - Stmt represents statements, and there are subclasses of Stmt for different statement types
    - Ex) IfStmt class for if and ReturnStmt class for function return
- Comments (i.e., /* */ and //) are not built into the AST
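
The two-rooted hierarchy described above can be mirrored in a few lines of plain C++. This is only a toy model: the classes live in a hypothetical namespace toy, they are not Clang's classes, and the direct inheritance shown is a simplification of Clang's real class tree. The kind() and describe() functions are likewise invented for this sketch.

```cpp
#include <cassert>
#include <string>

namespace toy {

// Toy stand-ins for the two root classes of the AST.
struct Decl {
  virtual ~Decl() = default;
  virtual std::string kind() const = 0;
};
struct Stmt {
  virtual ~Stmt() = default;
  virtual std::string kind() const = 0;
};

// One subclass per declaration type.
struct FunctionDecl : Decl {
  std::string kind() const override { return "FunctionDecl"; }
};
struct VarDecl : Decl {
  std::string kind() const override { return "VarDecl"; }
};

// One subclass per statement type.
struct IfStmt : Stmt {
  std::string kind() const override { return "IfStmt"; }
};
struct ReturnStmt : Stmt {
  std::string kind() const override { return "ReturnStmt"; }
};

// Overloads pick the root class, so a node is handled either as a
// declaration or as a statement.
std::string describe(const Decl &d) { return "declaration: " + d.kind(); }
std::string describe(const Stmt &s) { return "statement: " + s.kind(); }

}  // namespace toy
```

Passing a toy::FunctionDecl to describe() selects the Decl overload, and passing a toy::ReturnStmt selects the Stmt overload, which is the same split an analysis tool makes when it handles declarations and statements separately.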

Decl (1/4)
- The root of a function's AST is a Decl node
  - The root is an instance of FunctionDecl, which is a subclass of Decl

Figure: the source of main() next to its AST. The root node, FunctionDecl main, is labeled "Function declaration".

Legend for the AST figures:
- Declaration node: declaration type, name, type
- Statement node: statement type
- Expression node: expression type, value, type

Decl (2/4)
- A FunctionDecl can have an instance of ParmVarDecl for each function parameter, and a function body
  - ParmVarDecl is a child class of Decl
  - The function body is an instance of Stmt
    - In the example, the function body is an instance of CompoundStmt, which is a subclass of Stmt

Figure: AST of main(). The two ParmVarDecl nodes (argc and argv) are labeled "Function parameter declarations"; the CompoundStmt subtree is labeled "Function body".

Decl (3/4)
- VarDecl represents a local or global variable declaration
  - A VarDecl has a child if the variable has an initial value
  - In the example, VarDecl param has an IntegerLiteral child

Figure: AST of main() and of global. VarDecl global 'int' is labeled "Global variable declaration"; VarDecl param 'int' is labeled "Local variable declaration"; its child IntegerLiteral 1 'int' is labeled "Initial value".

Decl (4/4)
- FunctionDecl, ParmVarDecl and VarDecl each have a name and a declared type
  - Ex) The FunctionDecl for main has the name main and the type 'int (int, char **)'

Figure: AST of main(). The names (main, argc, argv, param) are labeled "Names"; the quoted types that follow them (for example 'int' and 'char **') are labeled "Types".

Stmt (1/9)
- Stmt represents a statement
  - Subclasses of Stmt:
    - CompoundStmt class for a code block
    - DeclStmt class for a local variable declaration
    - ReturnStmt class for a function return

Figure: AST of main(). The CompoundStmt, DeclStmt and ReturnStmt nodes are labeled "Statements".

Clang class reference (Doxygen): https://clang.llvm.org/doxygen/

Stmt (2/9)
- Expr represents an expression (Expr is a subclass of Stmt)
  - Subclasses of Expr:
    - CallExpr for a function call
    - ImplicitCastExpr for implicit type casts
    - DeclRefExpr for references to declared variables and functions
    - IntegerLiteral for integer literals

Figure: AST of main(). The CallExpr, ImplicitCastExpr, DeclRefExpr and IntegerLiteral nodes are labeled "Expressions (also statements)".

Stmt (3/9)
- A Stmt may have children containing additional information
  - A CompoundStmt has the statements of a code block enclosed in braces ({}) as its children

Figure: AST of main(). Each child of the CompoundStmt is paired with its source statement:
- DeclStmt: int param = 1;
- CallExpr: myPrint(param);
- ReturnStmt: return 0;

Stmt (4/9)
- A Stmt may have children containing additional information (cont.)
  - The first child of a CallExpr is the function pointer being called, and the remaining children are the function parameters (the arguments of the call)

Figure: AST of main() with these labels:
- "Declarations for DeclStmt": VarDecl param 'int'
- "Function pointer for CallExpr": ImplicitCastExpr 'void (*)()' with child DeclRefExpr 'myPrint' 'void ()'
- "Function parameter for CallExpr": ImplicitCastExpr 'int' with child DeclRefExpr 'param' 'int'
- "Return value for ReturnStmt": IntegerLiteral 0 'int'
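
The child layout of a call node (callee first, then one child per argument) can be sketched with a toy structure. The types below sit in a hypothetical namespace toy and are not Clang's classes; the accessor names callee, numArgs and arg are chosen for this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace toy {

// Toy expression node that only remembers its source text.
struct Expr {
  std::string text;
};

// Children are stored as [callee, arg0, arg1, ...].
// Precondition for every accessor: children is not empty.
struct CallExpr {
  std::vector<Expr> children;

  // The first child is the function being called.
  const Expr &callee() const { return children.front(); }

  // Every child after the first is an argument.
  std::size_t numArgs() const { return children.size() - 1; }

  // Argument i lives at child index i + 1; at() throws if i is out of range.
  const Expr &arg(std::size_t i) const { return children.at(i + 1); }
};

}  // namespace toy
```

For the call myPrint(param), the children would be {"myPrint", "param"}: callee() yields the first entry, numArgs() is 1, and arg(0) yields "param". Hiding the index shift behind accessors keeps the off-by-one in a single place.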

Stmt (5/9)
- An Expr has the type of the expression
  - Ex) The CallExpr node has the type void
- Some subclasses of Expr can have a value
  - Ex) An IntegerLiteral node has the value 1

Figure: AST of main(). The quoted types on the expression nodes (for example 'void' on CallExpr and 'int' on ImplicitCastExpr) are labeled "Types"; the literal values 1 and 0 on the IntegerLiteral nodes are labeled "Values".

Stmt (6/9)
- The myPrint function contains an IfStmt and a ForStmt in its function body

Figure: the source of myPrint() next to its AST. The body is a CompoundStmt whose children are the IfStmt (for if (param == 1) printf("param is 1");) followed by the ForStmt (for the loop that adds i to global).

Stmt (7/9)
- IfStmt has 4 children
  - A condition variable in VarDecl
    - In C++, you can declare a variable in the condition (not in C)
  - A condition in Expr
  - A then block in Stmt
  - An else block in Stmt

Figure: the IfStmt subtree of myPrint(), with each child labeled:

    IfStmt
      Null                                          <- Condition variable
      BinaryOperator '==' 'int'                     <- Condition
        ImplicitCastExpr 'int'
          DeclRefExpr 'param' 'int'
        IntegerLiteral 1 'int'
      CallExpr 'int'                                <- Then block
        ImplicitCastExpr 'int (*)()'
          DeclRefExpr 'printf' 'int ()'
        ImplicitCastExpr 'char *'
          StringLiteral "param is 1" 'char [11]'
      Null                                          <- Else block
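
The fixed four-slot layout listed above, with Null marking an unused slot, can be sketched as a toy structure. The namespace toy, the Node type and the dump function are hypothetical and are not part of Clang; the slot order simply follows the slide.

```cpp
#include <cassert>
#include <string>

namespace toy {

// Toy child node that only carries a label.
struct Node {
  std::string label;
};

// Four fixed slots in the order given on the slide; an unused slot is null.
struct IfStmt {
  const Node *condVar;   // condition variable (C++ only)
  const Node *cond;      // condition
  const Node *thenStmt;  // then block
  const Node *elseStmt;  // else block
};

// Renders the slots in order, writing "Null" for empty ones as the
// AST figure does.
std::string dump(const IfStmt &s) {
  const Node *slots[4] = {s.condVar, s.cond, s.thenStmt, s.elseStmt};
  std::string out;
  for (const Node *n : slots) {
    if (!out.empty()) out += " | ";
    out += n ? n->label : "Null";
  }
  return out;
}

}  // namespace toy
```

For if (param == 1) printf("param is 1"); only the condition and then slots are filled, so the first and last slots render as Null. Keeping empty slots in place means the position of a child always identifies its role.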

Stmt (8/9)
- ForStmt has 5 children
  - Initialization in Stmt
  - A condition variable in VarDecl
  - A condition in Expr
  - Increment in Expr
  - A loop block in Stmt

Figure: the ForStmt subtree of myPrint(), with each child labeled:

    ForStmt
      DeclStmt                                      <- Initialization
        VarDecl i 'int'
          IntegerLiteral 0 'int'
      Null                                          <- Condition variable
      BinaryOperator '<' 'int'                      <- Condition
        ImplicitCastExpr 'int'
          DeclRefExpr 'i' 'int'
        IntegerLiteral 10 'int'
      UnaryOperator '++' 'int'                      <- Increment
        DeclRefExpr 'i' 'int'
      CompoundStmt                                  <- Loop block
        CompoundAssignOperator '+=' 'int'
          DeclRefExpr 'global' 'int'
          ImplicitCastExpr 'int'
            DeclRefExpr 'i' 'int'

Stmt (9/9)
- BinaryOperator has 2 children, one for each operand
- UnaryOperator has 1 child for its operand

Figure: the ForStmt subtree of myPrint(). Under BinaryOperator '<' 'int', the children ImplicitCastExpr 'int' (wrapping DeclRefExpr 'i' 'int') and IntegerLiteral 10 'int' are labeled "Two operands for BinaryOperator". Under UnaryOperator '++' 'int', the child DeclRefExpr 'i' 'int' is labeled "An operand for UnaryOperator".

Traversing Clang AST (1/3)
- Clang provides a visitor design pattern for users to access the AST
- ParseAST() starts the building and traversal of an AST:
    void clang::ParseAST(Preprocessor &pp, ASTConsumer *C, ASTContext &Ctx, ...)
- The callback function HandleTopLevelDecl() in ASTConsumer is called for each top-level declaration
  - HandleTopLevelDecl() receives a list of function and global variable declarations as a parameter
- A user has to customize ASTConsumer to build his/her own program analyzer

    #include "clang/AST/ASTConsumer.h"
    #include "clang/AST/DeclGroup.h"
    #include "clang/Rewrite/Core/Rewriter.h"
    using namespace clang;

    class MyASTConsumer : public ASTConsumer {
    public:
        MyASTConsumer(Rewriter &R) {}

        // Called once for each group of top-level declarations.
        bool HandleTopLevelDecl(DeclGroupRef DR) override {
            for (DeclGroupRef::iterator b = DR.begin(), e = DR.end(); b != e; ++b) {
                // *b is one declaration (Decl *) in DR
            }
            return true;
        }
    };

Traversing Clang AST (2/3)
- HandleTopLevelDecl() calls TraverseDecl(), which recursively traverses the target AST from the top-level declaration by calling VisitStmt(), VisitFunctionDecl(), etc.
  - VisitStmt is called when a Stmt is encountered
  - VisitFunctionDecl is called when a FunctionDecl is encountered

    #include <cstdio>
    #include "clang/AST/ASTConsumer.h"
    #include "clang/AST/RecursiveASTVisitor.h"
    using namespace clang;

    class MyASTVisitor : public RecursiveASTVisitor<MyASTVisitor> {
    public:  // the Visit* callbacks must be accessible to RecursiveASTVisitor
        bool VisitStmt(Stmt *s) {
            printf("\t%s \n", s->getStmtClassName());
            return true;
        }

        bool VisitFunctionDecl(FunctionDecl *f) {
            if (f->hasBody()) {
                Stmt *FuncBody = f->getBody();
                (void)FuncBody;  // available for further analysis
                // getName() returns a StringRef, so convert before using %s
                printf("%s\n", f->getNameAsString().c_str());
            }
            return true;
        }
    };

    class MyASTConsumer : public ASTConsumer {
    public:
        bool HandleTopLevelDecl(DeclGroupRef DR) override {
            for (DeclGroupRef::iterator b = DR.begin(), e = DR.end(); b != e; ++b) {
                MyASTVisitor Visitor;
                Visitor.TraverseDecl(*b);
            }
            return true;
        }
    };

Traversing Clang AST (3/3)
- VisitStmt() in RecursiveASTVisitor is called for every Stmt object in the AST
  - RecursiveASTVisitor visits each Stmt in depth-first search order
  - If the return value of VisitStmt is false, the recursive traversal halts
- Example: the main function of the previous example

Figure: AST of main(). RecursiveASTVisitor visits all nodes in the boxed function body; the numbers give the order of traversal:

    FunctionDecl main
      ParmVarDecl argc 'int'
      ParmVarDecl argv 'char **':'char **'
      1  CompoundStmt
      2    DeclStmt
      3      VarDecl param 'int'
      4        IntegerLiteral 1 'int'
      5    CallExpr 'void'
      6      ImplicitCastExpr 'void (*)()'
      7        DeclRefExpr 'myPrint' 'void ()'
      8      ImplicitCastExpr 'int'
      9        DeclRefExpr 'param' 'int'
      10   ReturnStmt
      11     IntegerLiteral 0 'int'
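
The depth-first order and the halt-on-false rule described above can be tried out on a toy tree without Clang. Everything below is a stand-in written for this sketch: the namespace toy, the Node type, the traverse function and the mainBody helper are hypothetical and only imitate the behavior the slide describes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

namespace toy {

// Toy AST node: a label plus its children in source order.
struct Node {
  std::string label;
  std::vector<Node> children;
};

// Pre-order depth-first walk. Returns false as soon as visit() returns
// false, which stops the whole traversal rather than skipping a subtree.
bool traverse(const Node &n, const std::function<bool(const Node &)> &visit) {
  if (!visit(n)) return false;
  for (const Node &c : n.children) {
    if (!traverse(c, visit)) return false;
  }
  return true;
}

// The body of main() from the example, rebuilt as a toy tree.
Node mainBody() {
  return Node{"CompoundStmt", {
    Node{"DeclStmt", {
      Node{"VarDecl param", {Node{"IntegerLiteral 1", {}}}}}},
    Node{"CallExpr", {
      Node{"ImplicitCastExpr", {Node{"DeclRefExpr myPrint", {}}}},
      Node{"ImplicitCastExpr", {Node{"DeclRefExpr param", {}}}}}},
    Node{"ReturnStmt", {Node{"IntegerLiteral 0", {}}}}}};
}

}  // namespace toy
```

Walking toy::mainBody() with a callback that always returns true reaches all 11 nodes, starting at CompoundStmt and ending at IntegerLiteral 0. A callback that returns false on CallExpr stops the walk after 5 nodes, so ReturnStmt is never reached.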