
Errors in Machine Learning Experiments: A Study on Prevalence and Impact
Explore the prevalence of errors in machine learning experiments as highlighted in the research by Yousefi et al.: the concerning rates of inconsistencies across domains, the challenges of statistical inference, and the key findings on sampling methods and performance metrics.
Presentation Transcript
The Prevalence of Errors in Machine Learning Experiments
Leila Yousefi, Martin Shepperd, Mahir Arzoky, Andrea Capiluppi, Steve Counsell, Giuseppe Destefanis, Stephen Swift, Allan Tucker (Brunel University London); Yuchen Guo (Xi'an Jiaotong University, China); Ning Li (Northwestern Polytechnical University, China)
Errors are ubiquitous ...
1. The Background
Growing concerns about experimental reliability from other domains:
- Brown and Heathers (2017) checked psychology studies for simple arithmetic errors and found that, of 71 testable articles, around half (36/71) appeared to contain at least one inconsistent mean.
- Nuijten et al. (2016) checked the use of inferential statistics (e.g., t-tests and χ²) in psychology experiments: of ~250,000 p-values, ~50% are problematic, and in 12% of papers this impacts the statistical conclusion.
2. Sampling
- Interested in comparing supervised and unsupervised classifiers for software defect prediction, so we undertook a systematic review (arXiv preprint arXiv:1907.12027).
- Located 49 refereed papers containing 2,456 individual experimental results.
- Re-computed the confusion matrix to obtain comparable performance metrics (Matthews correlation coefficient, MCC).
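The re-computation step can be sketched as below. This is only an illustrative helper, not the authors' actual tooling; the function name and the example confusion-matrix counts are assumptions made for the sketch.

```python
from math import sqrt

def mcc_from_confusion(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from raw confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when any marginal total is zero
    return numerator / denominator

# Hypothetical counts reconstructed from a paper's reported figures:
print(mcc_from_confusion(tp=40, fp=10, tn=120, fn=30))  # ≈ 0.54
```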
Example inconsistency rule checking
1. Performance metric out of range, e.g., MCC ∈ [−1, 1]
2. Recomputed defect density d′ ≥ 0
3. |d − d′| ≤ ε, where d is the reported defect density, d′ the recomputed value, and ε a margin of error for rounding, etc.
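A minimal sketch of these three rules, assuming the reported MCC and defect densities have already been extracted from a paper; the function name, parameter names, and the 0.005 margin are illustrative assumptions, not part of the original study.

```python
EPSILON = 0.005  # assumed margin of error to allow for rounding in reported values

def check_consistency(reported_mcc: float,
                      reported_defect_density: float,
                      recomputed_defect_density: float) -> list[str]:
    """Apply the three example inconsistency rules to one reported result."""
    problems = []
    # Rule 1: the performance metric must lie in its valid range, e.g. MCC in [-1, 1]
    if not -1.0 <= reported_mcc <= 1.0:
        problems.append("MCC out of range")
    # Rule 2: a recomputed defect density can never be negative
    if recomputed_defect_density < 0:
        problems.append("negative recomputed defect density")
    # Rule 3: reported and recomputed densities should agree within the rounding margin
    if abs(reported_defect_density - recomputed_defect_density) > EPSILON:
        problems.append("reported vs. recomputed defect density mismatch")
    return problems
```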
Statistical inferencing errors
If a study uses the null hypothesis significance testing (NHST) paradigm, then the reject-H0 threshold α must be adjusted when there are multiple tests.
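To illustrate the adjustment, a Bonferroni-style correction is the simplest option; the slides do not prescribe a particular method, and the significance level and number of tests below are assumed for the example.

```python
def bonferroni_adjusted_alpha(alpha: float, n_tests: int) -> float:
    """Adjust the reject-H0 threshold when several hypotheses are tested."""
    return alpha / n_tests

# E.g. comparing one new learner against 10 baselines at a nominal alpha of 0.05:
adjusted = bonferroni_adjusted_alpha(0.05, n_tests=10)
print(adjusted)  # 0.005 -- each individual p-value must beat this, not 0.05
```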
3. Results
Checked all 49 papers for:
- confusion matrix inconsistencies
- NHST statistical errors (failure to adjust for multiple tests)
BUT many papers provide insufficient details (14/49 papers, ~30%).
Overall error rates

                            Stat error   No stat error   Total
Confusion matrix error           1             15           16
No confusion matrix error        3             16           19
Incomplete reporting             3             11           14
Total                            7             43           49

NB: One paper contains both classes of error.
Impact of publication venue

Paper Type    Incomplete reporting   Contains error(s)   No errors   Total
Conference            11                    12               8         31
Journal                3                     7               8         18
Total                 14                    19              16         49

Little difference in error rates, but conference papers are less likely to provide full reporting, perhaps due to page restrictions.
4. Summary and discussion
- We have audited 49 papers describing experiments based on learners for software defect prediction.
- Found a surprising error rate:
  - 14/49 papers contained incomplete results
  - 16/35 papers (where we can check) have confusion matrix inconsistencies
  - 7/49 papers contained statistical errors
- Overall there are problems with (14 + 16 + 7 − 1 = 36)/49 papers, since one paper contains both inconsistencies and statistical errors (see the check after this list).
- More prestigious venues are not immune, e.g., IST, IEEE Intelligent Systems, KBS.
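The overall figure mirrors the slide's own arithmetic: sum the three problem categories and remove the one paper counted twice. A small check, using only the counts quoted above:

```python
incomplete = 14        # papers with incomplete reporting
confusion_errors = 16  # papers with confusion-matrix inconsistencies (of the 35 checkable)
stat_errors = 7        # papers with NHST statistical errors
overlap = 1            # the one paper reported as having both classes of error

problematic = incomplete + confusion_errors + stat_errors - overlap
print(problematic, round(problematic / 49, 2))  # 36 papers, roughly 73% of the sample
```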
Limitations
- There are probably undetected errors (we focused on easy-to-detect ones).
- Some errors may be trivial.
- Perhaps there are differences between problem domains and research communities?
What can we do?
1. It seems easy to make errors.
2. Our experiments are complex.
3. Open to proper scrutiny → open science principles.
4. Approach authors with courtesy and professionalism.
5. Better correction and updating mechanisms.