
Effective Combinatorial Testing for Data Mining Algorithms
Explore the effectiveness of applying combinatorial testing to data mining algorithms through experimental design, research questions, subject programs, datasets, test generation, metrics, and results analysis. Discover the impact of different datasets on test coverage and the correlation between branch coverage and fault detection.
Presentation Transcript
Applying Combinatorial Testing to Data Mining Algorithms
Jaganmohan Chandrasekaran (UTA), Huadong Feng (UTA), Yu Lei (UTA), D. Richard Kuhn (NIST), Raghu Kacker (NIST)
March 13, 2017
Outline
- Introduction
- Experimental Design
  - Research Questions
  - Subject Programs
  - Datasets
  - Input Parameter Modeling
  - Test Generation
  - Metrics
- Experimental Results
  - Impact of Datasets
  - Branch Coverage Results of T-Way Testing
  - Mutation Coverage Results of T-Way Testing
- Conclusion & Future Work
Introduction
- Data Mining Algorithms
  - Widely developed and used
  - Large amounts of data as input
  - Intensive and complex computing
- Combinatorial Testing (CT)
  - Proven method for more effective software testing at lower cost
- How effective is Combinatorial Testing when applied to Data Mining Algorithms?
Experimental Design: Research Questions
- How effective is CT when applied to data mining algorithms?
- How do different datasets impact test coverage?
- Is branch coverage a good indicator of fault detection effectiveness for data mining algorithms?
Experimental Design: Subject Programs
- Top 5 most influential data mining algorithms*
  - C4.5, K-Means, SVM, Apriori, EM
- Implementations from WEKA
- CT tests are applied to the configuration options of the subject algorithms
Experimental Design: Datasets
- 51 benchmarking datasets
- Datasets provided by WEKA and the UC Irvine Machine Learning Repository
- Not all datasets are applicable to all algorithms
Experimental Design: Input Parameter Modeling (IPM)
- Applied to configuration options
- Equivalence partitioning based on domain knowledge
- Identify representative values of the equivalence partitions
- Constraints
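The modeling steps above can be sketched as a small Python example. The option names, representative values, and the constraint below are hypothetical, loosely inspired by decision-tree pruning options; they are not the model used in the study, which the slides do not reproduce.

```python
from itertools import product

# Hypothetical input parameter model: each configuration option maps to
# the representative values chosen for its equivalence partitions.
PARAMETERS = {
    "unpruned": [True, False],
    "confidence_factor": [0.05, 0.25, 0.5],  # low / middle / high partition
    "min_leaf_instances": [1, 2, 10],        # boundary / typical / large
}

def satisfies_constraints(config):
    """Hypothetical constraint: the confidence factor only matters when
    pruning is enabled, so unpruned configurations keep it at 0.25."""
    if config["unpruned"] and config["confidence_factor"] != 0.25:
        return False
    return True

def valid_configurations(parameters):
    """Yield every combination of values that satisfies the constraints."""
    names = list(parameters)
    for values in product(*(parameters[name] for name in names)):
        config = dict(zip(names, values))
        if satisfies_constraints(config):
            yield config

configs = list(valid_configurations(PARAMETERS))
print(len(configs), "valid configurations out of", 2 * 3 * 3)
```

Enumerating every valid combination is only feasible for a model this small; it is shown here to make the effect of a constraint visible, not as a test generation strategy.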
Experimental Design: Test Generation
- 1-way to 6-way positive tests
  - Generated using ACTS in extend mode
- Negative 1-way tests
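To make the idea of a t-way test set concrete, here is a minimal Python sketch of a naive greedy generator. It is not the algorithm ACTS uses, it ignores constraints, and it enumerates every full combination as a candidate, so it only suits very small models. The option names in the example are hypothetical.

```python
from itertools import combinations, product

def required_tuples(params, t):
    """All t-way value combinations that a t-way test set must cover."""
    names = sorted(params)
    required = set()
    for chosen in combinations(names, t):
        for values in product(*(params[name] for name in chosen)):
            required.add(tuple(zip(chosen, values)))
    return required

def tuples_in(test, t):
    """The t-way value combinations that one test covers."""
    names = sorted(test)
    return {
        tuple((name, test[name]) for name in chosen)
        for chosen in combinations(names, t)
    }

def greedy_t_way(params, t):
    """Repeatedly pick the candidate test covering the most uncovered tuples."""
    names = sorted(params)
    candidates = [
        dict(zip(names, values))
        for values in product(*(params[name] for name in names))
    ]
    uncovered = required_tuples(params, t)
    suite = []
    while uncovered:
        best = max(candidates, key=lambda c: len(tuples_in(c, t) & uncovered))
        suite.append(best)
        uncovered -= tuples_in(best, t)
    return suite

# Hypothetical model with three two-valued options.
example = {"binary_splits": [True, False],
           "laplace": [True, False],
           "subtree_raising": [True, False]}
pairwise = greedy_t_way(example, 2)
print(len(pairwise), "tests cover all pairs;", 2 ** 3, "tests would be exhaustive")
```

Raising `t` toward the number of parameters makes the suite approach the exhaustive set, which is why the study varies the strength from 1 to 6 and measures what each step adds.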
Experimental Design: Metrics
- Branch coverage measured by JaCoCo
  - A free code coverage library for Java
- Mutation coverage measured by PIT
  - A mutation testing tool developed by Henry Coles
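As a rough illustration of what mutation coverage measures, the Python sketch below computes the fraction of mutants that a set of test inputs "kills", that is, for which some input makes the mutant behave differently from the original. PIT generates mutants automatically from Java bytecode; here the function, the two hand-written mutants, and the test inputs are all hypothetical, and the original function serves as the oracle.

```python
def classify(value, threshold):
    """Original code under test."""
    return "high" if value >= threshold else "low"

def mutant_boundary(value, threshold):
    """Mutant: '>=' changed to '>'."""
    return "high" if value > threshold else "low"

def mutant_negated(value, threshold):
    """Mutant: condition negated ('>=' changed to '<')."""
    return "high" if value < threshold else "low"

def mutation_score(original, mutants, test_inputs):
    """Fraction of mutants whose output differs from the original on some input."""
    killed = 0
    for mutant in mutants:
        if any(mutant(*args) != original(*args) for args in test_inputs):
            killed += 1
    return killed / len(mutants)

mutants = [mutant_boundary, mutant_negated]
weak_tests = [(10, 5), (1, 5)]       # never exercises value == threshold
strong_tests = weak_tests + [(5, 5)]  # adds the boundary case

print(mutation_score(classify, mutants, weak_tests))
print(mutation_score(classify, mutants, strong_tests))
```

Both test sets execute both branches of `classify`, yet only the second kills the boundary mutant, which is the kind of gap between branch coverage and fault detection that the third research question asks about.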
Impact of Datasets
- Finding: Larger datasets do not necessarily achieve higher branch coverage. In some cases, smaller datasets achieve higher branch coverage than larger ones.
- Implication: The size of a dataset is not a dominating factor in determining its test effectiveness. Other characteristics must be considered, e.g., the dataset structure and the relationships between data instances.
- It is possible to create small datasets that are effective for testing data mining algorithms.
Branch Coverage of T-way Testing [chart slides; figures not reproduced in this transcript]
Branch Coverage of T-way Testing
- Finding: Branch coverage increases progressively more slowly as test strength increases. The increase stops at a relatively low test strength.
- Implication: Under CT, data mining algorithms behave similarly to general software applications. CT has the potential to be effective for testing data mining algorithms.
Mutation Coverage and Branch Coverage of T-way Testing [chart slides; figures not reproduced in this transcript]
Mutation Coverage of T-way Testing
- Finding: Higher branch coverage seems to imply higher mutation coverage, and vice versa.
- Implication: Branch coverage could be used as a good indicator of fault detection effectiveness for data mining algorithms, since mutation coverage is expensive to measure.
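One way to quantify a relationship like this would be a Pearson correlation coefficient over paired coverage measurements. The slides do not say how the authors assessed it, so the Python sketch below is only an assumption about a possible analysis, and the numbers are made-up placeholders, not results from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT measurements from the study):
# one pair per test strength, as fractions of branches covered / mutants killed.
branch_coverage = [0.40, 0.55, 0.62, 0.64, 0.64]
mutation_coverage = [0.30, 0.41, 0.50, 0.52, 0.52]

print(round(pearson(branch_coverage, mutation_coverage), 3))
```

A coefficient close to 1 would support using the cheaper branch coverage metric as a proxy, although correlation over a handful of points is weak evidence on its own.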
Conclusion
- Larger datasets do not necessarily achieve higher test coverage than smaller datasets.
- The test coverage of a CT test set increases progressively more slowly as test strength increases.
- Branch coverage correlates well with mutation coverage.
Future Work
- Detailed code analysis
  - Why are some branches not covered by our test cases?
- Apply CT to create or reduce datasets for data mining algorithms
- Further investigation and experiments on negative testing of data mining algorithms