
Efficient Clone Detection Approach for Software Products
Discover an innovative clone detection method for large-scale software products developed at Osaka University and Nara Institute of Science and Technology. Explore code reuse, code cloning, and related tools and techniques in software development.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
A clone detection approach for a collection of similar large-scale software products Eunjong Choi , Norihiro Yoshida , Yoshiki Higo , Katsuro Inoue Osaka University Nara Institute of Science and Technology Department of Computer Science, Graduate School of Information Science & Technology, Osaka University 1
Software Development for Mobile Device (1/2) Releases a new model in regular and rapid rushed intervals Adapts to various country constraints and needs e.g. Oshaifu-Keitai for Japan 2 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Software Development for Mobile Device (2/2) Develop software by reusing common pieces and implement unique pieces for each feature. Reused pieces Unique features 3 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Reused Source Code Pieces Source code is reused in code fragment level (code clones) and file level A code clone : a code fragment that has lexically, syntactically, or semantically . similar code fragments in source code Detecting and managing reused pieces is necessary e.g. Inconsistency management, plagiarism detection 4 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Code Clone Generated by: Code reuse by copy & paste Stereotyped functions or tool generated code Code Clone Code Clone Code Clone A clone set: A set of code clones that are similar or identical to each other 5 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Code Clone Detection Various detection techniques and tools have been proposed e.g. Text-based and line-based(dup)[Baker1995], token-base(CP-minder)[Li2006] CCFinder [Kamiya2002] A token-base clone detection tool Multi language support (C, C++, COBOL, Java, ...) Good speed (5MLOC/20m) [Baker1995] B. S. Baker. On nding duplication and near-duplication in large software systems. In Proc. of WCRE, pages 86, July 1995. [Li 2006] Z. Li, S. Lu, S. Myagmar and Y. Zhou. CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code. IEEE Transactions on Software Engineering, 32: pages 176-192, 2006 [Kamiya2002] T. Kamiya, S. Kusumoto and K. Inoue. CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28: 654-670, 2002 6 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Example of Clone Detection Technique: CCFinder Source files 1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { 11. RE exp = new RE("[0-9,]+"); 12. int sum = 0; 13. for (int i = 0; i < a.length; ++i) 14. if (exp.match(a[i])) 15. sum += parseNumber(exp.getParen(0)); 16. System.out.println("sum = " + sum); 17. } ) ; } ) ; } ) ; } ) ; } ) ; } 17. } static void foo static $ static 1. static void foo() throws RESyntaxException { static void foo static void foo [ ] [ ] [ 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { 11. RE exp = new RE("[0-9,]+"); 12. int sum = 0; 13. for (int i = 0; i < a.length; ++i) 14. if (exp.match(a[i])) 15. sum += parseNumber(exp.getParen(0)); 16. System.out.println("sum = " + sum); ( ( ( ) ) ) [ $ throws RESyntaxException throws RESyntaxException throws RESyntaxException ] { [ ] { String a { String a { String a $ $ ( ) throws ) $ { $ $ { $ $ $ ( throws $ Lexical analysis Lexical analysis Lexical analysis Lexical analysis [ [ ] ] = new String [ = new String [ = new String [ = $ $ ] = $ ] ] ] ; ; } { "123,400" { $ { $ $ { $ , "abc" , "orange 100" } ; org ; . apache . regexp } } } ; . RE pat = new org pat = new RE pat = new RE ( "[0-9,]+" ) ( $ $ ( $ . apache . regexp RE $ $ $ = $ new = Token sequence Token sequence Token sequence Token sequence new . RE ( "[0-9,]+" ) RE ( "[0-9,]+" ) for $ ; for ( ; ; ; ; ) int sum = int sum = int sum = ; i ; $ ; $ 0 0 0 $ = ) $ ; $ $ = $ $ $ ; ; ; a $ for for for . . $ ( ( ( ( $ . int int int ; $ i i i ; ++ ; = = = = $ ++ 0 0 0 i ) $ ; i < < < ( $ ( ; $ $ $ = < $ ; if ) i ( $ < Transformation Transformation Transformation Transformation a a . . . . $ . length length length ( $ ; ; ++ ++ ++ [ i i i ) ) ) ] if if if ) ( pat pat pat $ ( if . . match match match $ += ( ( ( a a a [ [ i i ] ] ) ) ) sum sum sum . $ ( $ ( [ $ [ ] ) ) $ ) ) ) $ $ ] ) $ += += += ) ) Sample Sample Sample ) ) ) . . . $ $ ; parseNumber parseNumber parseNumber . $ . $ ( ( ( pat pat pat println ( $ . getParen getParen getParen ( "sum = " $ ( 0 += . ( $ ( . $ . ( ( ( 0 0 . $ . $ $ $ $ Transformed token sequence Transformed token sequence Transformed token sequence Transformed token sequence ) ) + ) ) $ + ; ; ; ; ) $ System System System ; ; ) . . . static void goo static $ } static out out out . $ . $ . println println $ $ ( "sum = " ( "sum = " String $ ( $ $ ( . . ) + sum ) + sum ) + sum ) a [ [ $ $ ; ; static void goo static void goo throws RESyntaxException { throws $ ) throws $ } } } } ; ( String String = $ ( ( ( $ Match detection Match detection Match detection Match detection a a [ [ ] ] ] ] [ ) ) ) ) ] throws RESyntaxException throws RESyntaxException ; ) ; $ ) ; $ { { { 0 RE RE RE exp = exp = exp = $ { $ $ = new RE ( "[0-9,]+" ) new RE ( "[0-9,]+" ) new RE ( "[0-9,]+" ) ; forfor ; ; ; int sum = int sum = int sum = ; i ; $ ; $ 0 0 new $ ( $ ( $ $ = $ $ = new $ $ ( ( ( ( $ . $ ( $ $ = = = = $ ++ $ = < $ ; ; ; a $ for for for . . $ int int int ; $ i i i ; ++ ; 0 0 0 i ) $ ; i < < < ( $ ( ; if ) i ( $ < Clones on transformed sequence Clones on transformed sequence Clones on transformed sequence Clones on transformed sequence a a . . . . $ . length length length ( $ ; ; ++ ++ ++ [ i i i ) ) ) ] if if if ) ( exp exp exp sum $ ( if $ $ ( [ $ [ ] ) ) $ ) . . match match match $ $ += ( ( ( a a a [ [ i i ] ] ( ) ) ) sum sum getParen ) $ ( $ ) ) . $ $ ] ) += . parseNumber parseNumber ( $ . $ exp $ ( . getParen exp exp ( $ $ ( ) += += += ; ; parseNumber $ . . . $ ( . ( 0 ) ) ) ) ( ( 0 0 ) ) ) ) Formatting Formatting Formatting Formatting ( . getParen $ $ . $ System System System $ ; . . . $ . out out out . $ . . . $ . println println println ( $ ( "sum = " + sum ( "sum = " + sum ( "sum = " + sum + $ + $ $ ; ; Code clones 7 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Problem of Code Clone Detection Tools Take enormous time for existing tools to detect code clones on large-scale software Suggest an approach for detecting code clone for a collection of similar large-scale software products Excluding detecting code clones among each set of files that are identical each other A identical file set : A set of files that are identical each other 8 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Overview Of Our Approach Step1. Calculate MD5 hash. Step2. Prepare Input Files for CCFinder Step3. Detect code clones using CCFinder Step4. Generate all clone sets All Input Files for CCFinder Source Files Hashed files Clone Sets Clone Sets (1) (2) Prepare Input Files for CCFinder (4) (3) Detect Code Clones Using CCFinder Calculate MD5 Hash Generate All Clone Sets Identical File Sets 9 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Step1. Calculate MD5 Hash Creates MD5 hash value of input files MD5 hash does not require any large substitution tables MD5 Hash Ce9e187434e357 46abf2 C9ad2 A77bdd2 7ed90608d1 2622448 97ccd1164dc3 -------------- ------ ----- ------ -------- ----- ---------- Calculate Source Files Source Files 10 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Step2. Prepare Input Files for CCFinder (1/2) Detect identical file sets MD5 Hash 175 A9 0cc 0cc 0cc C1 D1 A1 A2 A3 Identical File Set A . . . . be05 be05 B2 B1 Identical File Set B . . . . Input Files for CCFinder Identical File Sets 11 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Step2. Prepare Input Files for CCFinder (2/2) Prepare Input Files for CCFinder 175 A9 0cc 0cc 0cc C1 D1 A1 A2 A3 Identical File Set A . . . . be05 be05 B2 B1 Identical File Set B . . . . Input Files for CCFinder Identical File Sets 12 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Step3. Detect Code Clones Use CCFinder to detect code clones Detected Code Clones C1 D1 A1 A2 A3 Identical File Set A . . . . B2 B1 Identical File Set B . . . . Code Clones Detected by CCFinder Identical File Sets 13 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Step3. Detect Code Clones Use CCFinder to detect code clones Detected Code Clones C1 D1 A1 A2 A3 Identical File Set A Clone Set 1 . . . . B2 B1 D1 Identical File Set B Clone Set 2 . . . . Identical File Sets Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Clone Sets Detected by CCFinder 14
Step4. Generate all clone sets Generate all clone sets C1 D1 A1 A1 A2 A3 Identical File Set A A2 Clone Set 1 A3 . . . . B2 B1 Identical File Set B D1 B2 B1 . . . . Clone Set 2 . . . . Identical File Sets All Clone Sets 15 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Overview of Case Study (2/2) Approach Compare detection time between our method and using only CCFinder Confirm that the detection result of our method is the same as the one of using only CCFinder Detection Environment 64 bits Windows 7 Professional workstation equipped with 2 processors and 24 gigabytes of main memory. 16 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Results of Case Study (1/2) Detection Time Our approach detects code clones faster than using only CCFinder. Project Name Only CCFinder #Clone Sets (sec.) Apache ant 11,169 241 Linux kernel 24,235 1,119 Galxy y pro(GT-B5510 model Our Approach #Identical File Sets 10,692 21,343 148,761 Time #Clone Sets Time (sec.) 3,083 1,967 28,181 89 168 325,274 113,44 11,902 5 17 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Results of Case Study (2/2) Accuracy of Results : manually checked outputs Arbitrary selected 30 clone sets that are detected by our approach from each OSS Selected 30 identical file sets from each OSS project. 18 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Summary Suggest an approach for detecting code clones for a collection of similar large-scale software products. MD5 hash to identify identical file sets CCFinder to detect code clones. Apply our approach to three OSS projects and compared code clone detection time between using only CCFinder and our approach. Our approach takes shorter time to detect code clones. 19 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Future Work Improve for detecting files with slightly modification as identical file sets Our current approach detects file that are identical each as a identical file set Apply to various size of software projects in different domains Introduce other code clones detection tools and compare results from them in the case study 20 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Thank you for your attentions 21 Department of Computer Science, Graduate School of Information Science & Technology, Osaka University