Data Selection for Cross-Project Defect Prediction with Source Code Features


This research introduces the ALGoF framework for data selection in Cross-Project Defect Prediction (CPDP), which automatically learns local and global features of source code files. The paper compares the effectiveness of ALGoF against traditional hand-crafted-feature methods for predicting defect-prone files, using rich semantic information extracted from Abstract Syntax Trees (ASTs) and structural information extracted from the classes in the source code. Approach details, dataset insights, and experimental results are presented to address research questions on the efficacy of CPDP data selection methods.

  • Data Selection
  • Cross-Project
  • Defect Prediction
  • Source Code
  • ALGoF




Presentation Transcript


  1. Data Selection for Cross-Project Defect Prediction with Local and Global Features of Source Code. Authors: Xuan Deng, Peng He*, Chunying Zhou. School of Computer Science and Information Engineering, Hubei University. SEKE 2022, July 1-10, 2022.

  2. Introduction
     Cross-Project Defect Prediction (CPDP) predicts defect-prone files in a project based on defect data collected from other projects. Nowadays, the growing collection of defect datasets on the Internet makes it an increasingly serious challenge to construct an appropriate training data set (TDS). Many approaches have been proposed to characterize code files using traditional hand-crafted features. Unfortunately, they ignore the rich semantics hidden in the code's Abstract Syntax Trees (ASTs) and the global structural information extracted from the classes in the source code files. In this paper, we propose a new framework called ALGoF to Automatically learn the Local semantic (fine-grained) and Global structural (coarse-grained) Features of code files for data selection in CPDP, and we seek empirical evidence that it can achieve acceptable performance compared with the traditional method.

  3. Approach - ALGoF
     [Pipeline diagram: source code files from Projects A/B/C feed class dependency network (CDN) modeling with node2vec for structural feature learning, and AST parsing with a CNN for semantic feature learning; after duplicate removal and similarity-based ranking, the selected instances form the TDS that trains the defect predictor for the target project.]
     I. Extract the dependencies between the classes from the source code files to construct a class dependency network (CDN).
     II. Perform network embedding learning on the CDN to generate the global structural features of classes.
     III. Leverage a Convolutional Neural Network (CNN) to automatically learn semantic features using token vectors extracted from the class files' abstract syntax trees (ASTs).
     IV. Combine the structural and semantic features and use them as the representation of the class file.
     V. Select one project from the defect data as the target set and all other projects as the training set.
     VI. Apply the cosine similarity index to construct a model for ranking source instances based on the given target instances.
     VII. Use the resulting source file instances to train the predictor and test on the target instances.
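The similarity-based ranking in step VI can be sketched as follows. This is a minimal illustration, assuming each class file is already represented by its combined structural-plus-semantic feature vector; the function name and the max-similarity ranking rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rank_source_instances(source_feats, target_feats, top_k):
    """Rank source instances by their best cosine similarity to any
    target instance and return the indices of the top_k candidates.
    Hypothetical sketch of a cosine-similarity data selection step."""
    # Normalize rows so a plain dot product equals cosine similarity.
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sim = s @ t.T                    # (n_source, n_target) similarity matrix
    scores = sim.max(axis=1)         # best match against the target set
    return np.argsort(-scores)[:top_k]
```

The selected indices would then be used to build the sTDS that trains the defect predictor.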

  4. Dataset: six defect datasets from the PROMISE repository.
     Project   Release   #files   Defect rate (%)
     Camel     1.4       892      17.1
     Lucene    2.0       186      48.9
     Poi       2.5       379      65.1
     Synapse   1.1       222      27.0
     Xalan     2.6       875      47.0
     Xerces    1.3       446      15.0
     #files and defect rate are the number of files and the percentage of defective files, respectively.

  5. Research Questions
     RQ1: Does data selection based on ALGoF and its variants work well for CPDP?
     RQ2: For CPDP data selection, which is better: automatically learned features or traditional hand-crafted features?

  6. Four Scenarios
     (i) THC: predictor based on the traditional hand-crafted features.
     (ii) ALoF: predictor based only on the local semantic features.
     (iii) AGoF: predictor based only on the global structural features.
     (iv) ALGoF: predictor based on both the local and global features.

  7. Experiment Results. RQ1: Does data selection based on ALGoF and its variants work well for CPDP?
     I. Comparison of CPDP F-measures under iTDS and sTDS. The initial source TDS without any selection is the baseline, labeled iTDS; the case of TDS selection using the features learned in this paper is labeled sTDS.
     Model    iTDS    sTDS    Improvement (%)
     ALoF     0.292   0.352   20.55
     AGoF     0.456   0.470    3.07
     ALGoF    0.325   0.482   48.31

  8. Experiment Results. RQ1: Does data selection based on ALGoF and its variants work well for CPDP?
     II. The Wilcoxon signed-rank test and Cliff's delta (0.33 <= |d| < 0.474 indicates a medium effect, |d| >= 0.474 a large effect).
     Comparison                  p-value (0.05)   Cliff's delta (d)
     ALoF_iTDS - ALoF_sTDS       0.934            -0.028
     AGoF_iTDS - AGoF_sTDS       0.745             0.000
     ALGoF_iTDS - ALGoF_sTDS     0.116            -0.484
     ALoF_sTDS - ALGoF_sTDS      0.219            -0.389
     AGoF_sTDS - ALGoF_sTDS      0.345            -0.083
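Cliff's delta, the non-parametric effect size used above, has a standard pairwise definition that can be computed in a few lines (this is the textbook formula, not code from the paper):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs.
    Conventionally, 0.33 <= |d| < 0.474 is a medium effect and
    |d| >= 0.474 is a large effect."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

A delta near 0 (e.g. -0.028 for ALoF) means the two F-measure distributions largely overlap, while |d| >= 0.474 (e.g. -0.484 for ALGoF) indicates a large separation.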

  9. Experiment Results. RQ2: For CPDP data selection, which is better: automatically learned features or traditional hand-crafted features?
     I. The improvement rate of F-measure values compared with the THC scenario (with data selection).

  10. Experiment Results. RQ2: For CPDP data selection, which is better: automatically learned features or traditional hand-crafted features?
     II. Wilcoxon signed-rank test and Cliff's effect size of the automatically extracted features versus THC.
     Comparison                p-value (0.05)   Cliff's delta (d)
     ALoF_sTDS - THC_sTDS      0.719            0.111
     AGoF_sTDS - THC_sTDS      0.035            0.417
     ALGoF_sTDS - THC_sTDS     0.035            0.667

  11. Conclusion
     The results indicate that features automatically learned from the source code (i.e., local semantic features and global structural features) are helpful for guiding training data selection in CPDP; compared with the case of no data selection, the F-measure improvement rate of ALGoF is 48.31%. The results also show that our method is significantly better than the traditional method, especially when using both the local semantic and global structural features as the representation of code files, and about 42.6% of actually defective instances can be additionally predicted by our method.

  12. Thank You! Any Questions? xuan_deng@qq.com
