
Thresholding Techniques for Outlier Detection
Explore various thresholding techniques for outlier detection, including SD, MAD, and IQR equations. Learn how to effectively threshold outlier scores using two-stage thresholding methods and the impact of outliers on statistical bias.
Presentation Transcript
Outlier Detection: How to Threshold Outlier Scores? Jiawei Yang, Susanto Rahardja, Pasi Fränti, 20.12.2019
Data with outliers (figure; legend: data object, outlier)
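Outlier detection steps
1. Scoring: each data object receives an outlier score (example scores: 1000, 60, 17, 1, 1500, 2, 3, 6, 18, 18, 8, 16, 500, 300).
2. Thresholding: the sorted scores 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500 are split by a threshold into normal objects and outliers.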
Thresholding techniques (figure: statistics based on literature, 6/2016 to 6/2018)
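Equations
SD: $T = \mathrm{mean}(X) + a \cdot \mathrm{SD}(X)$
MAD: $T = \mathrm{median}(X) + a \cdot \mathrm{MAD}$, where $\mathrm{MAD} = b \cdot \mathrm{median}\left(\lvert X - \mathrm{median}(X) \rvert\right)$
IQR: $T = Q_3 + c \cdot \mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$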
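The three thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the multiplier values (3 for SD and MAD, b = 1.4826, c = 1.5), the use of the population standard deviation, and linear interpolation for the quartiles are assumptions. They are common conventions, and they reproduce the thresholds the slides report for the example scores.

```python
import statistics

def sd_threshold(scores, a=3.0):
    """T = mean + a * SD (population standard deviation, assumed)."""
    return statistics.fmean(scores) + a * statistics.pstdev(scores)

def mad_threshold(scores, a=3.0, b=1.4826):
    """T = median + a * MAD, with MAD = b * median(|x - median|)."""
    med = statistics.median(scores)
    mad = b * statistics.median([abs(x - med) for x in scores])
    return med + a * mad

def iqr_threshold(scores, c=1.5):
    """T = Q3 + c * IQR, quartiles by linear interpolation (assumed)."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q3 + c * (q3 - q1)

# Example scores from the slides.
scores = [1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500]
for name, fn in [("SD", sd_threshold), ("MAD", mad_threshold), ("IQR", iqr_threshold)]:
    t = fn(scores)
    print(f"{name}: T = {t:.2f}, detected = {[s for s in scores if s > t]}")
```

Every score strictly above T is reported as an outlier; whether the comparison should be strict is also an assumption.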
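Statistics are biased. Reason: the outliers are present in the data from which the statistics (mean, SD, median, MAD, quartiles) are computed, so they inflate the threshold.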
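Performance of biased statistics
Scores: 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500 (figure: expected threshold position compared with the SD, MAD, and IQR thresholds)

| Method | SD | MAD | IQR |
|---|---|---|---|
| Threshold | 1574.81 | 84.22 | 590.25 |
| Detected outliers | {} | {300, 500, 1000, 1500} | {1000, 1500} |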
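As a worked check of the MAD threshold above, the arithmetic can be written out step by step. The multiplier 3 and the constant b = 1.4826 are assumptions (common defaults that the slides do not state), chosen because they reproduce the reported value.

```latex
\begin{aligned}
\operatorname{median}(X) &= \tfrac{17 + 18}{2} = 17.5 \\
\operatorname{median}\left(\lvert X - 17.5 \rvert\right) &= \tfrac{14.5 + 15.5}{2} = 15 \\
\mathrm{MAD} &= 1.4826 \times 15 \approx 22.24 \\
T &= 17.5 + 3 \times 22.24 \approx 84.22
\end{aligned}
```

The score 60 stays below this threshold even though it is more than three times the next largest score, 18.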
2T Algorithm
Select initial threshold T = (SD, MAD or IQR);
REPEAT
  1. Remove biggest outlier scores
  2. Re-calculate T = (SD, MAD or IQR)
UNTIL Stop condition
RETURN T
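The steps above can be sketched as a short loop. This is a hedged illustration rather than the authors' implementation. The slides leave the stop condition open, so the sketch assumes the loop ends when no remaining score exceeds the current threshold, and it removes every score above T in each round. The threshold functions repeat the assumed defaults (multiplier 3, b = 1.4826, c = 1.5), and `two_stage_threshold` is a hypothetical name.

```python
import statistics

def sd_threshold(scores, a=3.0):
    return statistics.fmean(scores) + a * statistics.pstdev(scores)

def mad_threshold(scores, a=3.0, b=1.4826):
    med = statistics.median(scores)
    return med + a * b * statistics.median([abs(x - med) for x in scores])

def iqr_threshold(scores, c=1.5):
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q3 + c * (q3 - q1)

def two_stage_threshold(scores, threshold_fn):
    """Recompute the threshold after removing the scores it flags."""
    remaining = list(scores)
    t = threshold_fn(remaining)              # stage 1: initial threshold
    while True:
        kept = [s for s in remaining if s <= t]
        # Assumed stop condition: nothing removed, or too few scores left.
        if len(kept) == len(remaining) or len(kept) < 2:
            return t
        remaining = kept
        t = threshold_fn(remaining)          # stage 2: re-calculate T

scores = [1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500]
for name, fn in [("SD", sd_threshold), ("MAD", mad_threshold), ("IQR", iqr_threshold)]:
    t = two_stage_threshold(scores, fn)
    print(f"{name}: T = {t:.2f}, detected = {[s for s in scores if s > t]}")
```

With SD the first threshold flags nothing, so the loop returns immediately and the threshold never changes.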
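Performance of 2T
Scores: 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500 (figure: expected threshold position compared with the SD, MAD, and IQR thresholds)

| Method | SD | MAD | IQR |
|---|---|---|---|
| Threshold | 1574.81 | 39.13 | 38.00 |
| Detected outliers | {} | {60, 300, 500, 1000, 1500} | {60, 300, 500, 1000, 1500} |

If the first stage fails, 2T will fail! With SD, the initial threshold of 1574.81 removes no scores, so recalculating it changes nothing.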
Datasets

| Dataset | Size | Outliers | Dim | Outlier object |
|---|---|---|---|---|
| KDDCup99 | 60632 | 246 | 38 | Network attack |
| Wilt | 4839 | 261 | 5 | Diseased trees |
| Stamps | 340 | 31 | 9 | Forged stamps |
| PageBlocks | 5473 | 560 | 10 | Pictures or graphics |
| Cardiotocography | 2126 | 471 | 21 | Patients |
| Pima | 768 | 268 | 8 | Patients |
| SpamBase | 4601 | 1813 | 57 | Spam email |
| HeartDisease | 270 | 120 | 13 | Patients |
| Arrhythmia | 450 | 206 | 259 | Affected patients |
| Parkinson | 195 | 147 | 22 | Patients |

Campos et al., "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study", Data Mining and Knowledge Discovery, 2016.
Outlier detectors: KNN and MOD (figure not reproduced)
KNN: Ramaswamy et al., "Efficient algorithms for mining outliers from large data sets", ACM SIGMOD Record, 2000.
MOD: Yang et al., "Mean-shift outlier detection", FSDM 2018.
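For orientation, a KNN-style outlier score can be sketched as follows. The sketch assumes the variant in which an object's score is its distance to its k-th nearest neighbour, as in Ramaswamy et al.; the slides do not say which variant or which k was used, so `knn_outlier_scores`, the value k = 2, and the sample points are hypothetical.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points on a unit square plus one far-away point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
print(scores)  # the last point gets by far the largest score
```

This brute-force version compares every pair of points, which is fine for a small illustration but slow on datasets the size of KDDCup99.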
Results with KNN detector (F1-score)

| Dataset | SD Original | SD 2T | Clever | MAD Original | MAD 2T | IQR Original | IQR 2T |
|---|---|---|---|---|---|---|---|
| KDDCup. | 0.54 | 0.46 | 0.00 | 0.43 | 0.37 | 0.48 | 0.45 |
| Wilt | 0.61 | 0.59 | 0.05 | 0.57 | 0.50 | 0.61 | 0.60 |
| Stamps | 0.58 | 0.72 | 0.08 | 0.75 | 0.61 | 0.59 | 0.69 |
| PageB. | 0.69 | 0.74 | 0.09 | 0.66 | 0.55 | 0.75 | 0.73 |
| Card. | 0.61 | 0.56 | 0.61 | 0.57 | 0.53 | 0.61 | 0.61 |
| Pima | 0.55 | 0.63 | 0.49 | 0.58 | 0.65 | 0.49 | 0.51 |
| Spam. | 0.42 | 0.48 | 0.42 | 0.47 | 0.51 | 0.41 | 0.43 |
| HeartD. | 0.52 | 0.59 | 0.38 | 0.53 | 0.59 | 0.42 | 0.46 |
| Arrhy. | 0.55 | 0.65 | 0.31 | 0.65 | 0.67 | 0.54 | 0.59 |
| Parki. | 0.31 | 0.42 | 0.30 | 0.39 | 0.46 | 0.31 | 0.34 |
| AVG | 0.53 | 0.58 | 0.33 | 0.57 | 0.54 | 0.51 | 0.54 |
Results with MOD detector (F1-score)

| Dataset | SD Original | SD 2T | Clever | MAD Original | MAD 2T | IQR Original | IQR 2T |
|---|---|---|---|---|---|---|---|
| KDDCup. | 0.54 | 0.47 | 0.00 | 0.43 | 0.37 | 0.48 | 0.45 |
| Wilt | 0.61 | 0.53 | 0.05 | 0.52 | 0.47 | 0.59 | 0.56 |
| Stamps | 0.60 | 0.72 | 0.60 | 0.73 | 0.65 | 0.60 | 0.66 |
| PageB. | 0.68 | 0.73 | 0.09 | 0.66 | 0.55 | 0.75 | 0.72 |
| Card. | 0.55 | 0.56 | 0.55 | 0.56 | 0.54 | 0.55 | 0.55 |
| Pima | 0.54 | 0.62 | 0.42 | 0.60 | 0.63 | 0.49 | 0.52 |
| Spam. | 0.55 | 0.49 | 0.38 | 0.56 | 0.52 | 0.42 | 0.43 |
| HeartD. | 0.52 | 0.54 | 0.38 | 0.51 | 0.54 | 0.40 | 0.42 |
| Arrhy. | 0.56 | 0.65 | 0.55 | 0.61 | 0.67 | 0.53 | 0.57 |
| Parki. | 0.35 | 0.49 | 0.34 | 0.42 | 0.48 | 0.34 | 0.36 |
| AVG | 0.55 | 0.58 | 0.34 | 0.56 | 0.54 | 0.51 | 0.52 |
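The F1-scores in these results combine precision and recall of the detected outlier set against the labelled outliers. A minimal sketch of that computation is given below. The helper name `f1_score` is hypothetical, and treating the five largest example scores as the true outliers is an assumption made only for illustration.

```python
def f1_score(detected, true_outliers):
    """Harmonic mean of precision and recall for a detected outlier set."""
    detected, true_outliers = set(detected), set(true_outliers)
    tp = len(detected & true_outliers)      # correctly detected outliers
    if tp == 0:
        return 0.0                          # also covers an empty detection
    precision = tp / len(detected)
    recall = tp / len(true_outliers)
    return 2 * precision * recall / (precision + recall)

# Assumed ground truth for the example scores (illustration only).
truth = {60, 300, 500, 1000, 1500}
print(round(f1_score({1000, 1500}, truth), 2))            # IQR, one stage
print(round(f1_score({300, 500, 1000, 1500}, truth), 2))  # MAD, one stage
print(round(f1_score(truth, truth), 2))                   # all five detected
```

A threshold that is too high keeps precision at 1 but loses recall, which is why detecting nothing at all scores 0.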
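Number of detected outliers (MOD detector)

| Dataset | Outliers | SD | 2T | Clever |
|---|---|---|---|---|
| KDDCup. | 246 | 3509 | 9383 | 48105 |
| Wilt | 261 | 330 | 1073 | 4806 |
| Stamps | 31 | 35 | 79 | 334 |
| PageB. | 560 | 229 | 834 | 5378 |
| Card. | 471 | 183 | 522 | 2103 |
| Pima | 268 | 108 | 228 | 734 |
| Spam. | 1813 | 693 | 1047 | 4186 |
| HeartD. | 120 | 44 | 85 | 263 |
| Arrhy. | 206 | 63 | 136 | 428 |
| Parki. | 147 | 22 | 52 | 174 |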
Time in seconds (MOD detector)

| Dataset (size) | SD | 2T | Clever |
|---|---|---|---|
| KDDCup. (60632) | <0.01 | 0.12 | 811.40 |
| Wilt (4839) | <0.01 | 0.01 | 8.39 |
| Stamps (340) | <0.01 | ≤0.01 | 0.06 |
| PageB. (5473) | <0.01 | ≤0.01 | 10.54 |
| Card. (2126) | <0.01 | ≤0.01 | 1.69 |
| Pima (768) | <0.01 | ≤0.01 | 0.26 |
| Spam. (4601) | <0.01 | ≤0.01 | 6.29 |
| HeartD. (270) | <0.01 | ≤0.01 | 0.05 |
| Arrhy. (450) | <0.01 | ≤0.01 | 0.10 |
| Parki. (195) | <0.01 | ≤0.01 | 0.02 |
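Conclusions
Why use it: simple but effective!
How it performs: improves existing thresholding! Average F1-score rises for SD and IQR with both detectors; for MAD it falls slightly (0.57 to 0.54 with KNN, 0.56 to 0.54 with MOD).
Usefulness: almost no extra coding needed!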