Thresholding Techniques for Outlier Detection


Explore various thresholding techniques for outlier detection, including SD, MAD, and IQR equations. Learn how to effectively threshold outlier scores using two-stage thresholding methods and the impact of outliers on statistical bias.

  • Outlier Detection
  • Thresholding Techniques
  • Statistical Bias
  • Two-Stage Thresholding
  • Outlier Scores


Presentation Transcript


  1. Outlier Detection: How to Threshold Outlier Scores?
     Jiawei Yang, Susanto Rahardja, Pasi Fränti
     20.12.2019

  2. Data with outliers
     [Figure legend: data object, outlier]

  3. Outlier detection steps
     Scoring: 1000, 60, 17, 1, 1500, 2, 3, 6, 18, 18, 8, 16, 500, 300
     Thresholding: 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500

  4. Thresholding techniques
     [Figure: statistics based on literature, 6/2016 to 6/2018]

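  5. Equations
     SD:  $T = \mathrm{mean}(X) + \alpha \cdot \mathrm{SD}$
     MAD: $T = \mathrm{median}(X) + \alpha \cdot \mathrm{MAD}$, where $\mathrm{MAD} = b \cdot \mathrm{median}(|X - \mathrm{median}(X)|)$
     IQR: $T = Q_3 + c \cdot \mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$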
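
The three threshold equations can be sketched in a few lines of Python. The constants are assumptions, not values stated on the slides: a multiplier of 3 for SD and MAD, b = 1.4826, c = 1.5, the population standard deviation, and linearly interpolated quartiles. With these choices the sketch happens to reproduce the thresholds reported for the example scores on slide 7.

```python
import statistics

def quantile(sorted_x, p):
    """Linearly interpolated quantile of an already sorted list (assumed convention)."""
    pos = p * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])

def sd_threshold(x, alpha=3.0):
    # T = mean + alpha * SD; population SD (divide by n) is an assumption.
    return statistics.fmean(x) + alpha * statistics.pstdev(x)

def mad_threshold(x, alpha=3.0, b=1.4826):
    # T = median + alpha * MAD, with MAD = b * median(|x - median(x)|).
    med = statistics.median(x)
    mad = b * statistics.median(abs(v - med) for v in x)
    return med + alpha * mad

def iqr_threshold(x, c=1.5):
    # T = Q3 + c * IQR, with IQR = Q3 - Q1.
    s = sorted(x)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    return q3 + c * (q3 - q1)

# Example outlier scores from slide 3.
scores = [1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500]

for name, fn in [("SD", sd_threshold), ("MAD", mad_threshold), ("IQR", iqr_threshold)]:
    t = fn(scores)
    print(f"{name}: T = {t:.2f}, detected = {[s for s in scores if s > t]}")
# SD: T = 1574.81, detected = []
# MAD: T = 84.22, detected = [300, 500, 1000, 1500]
# IQR: T = 590.25, detected = [1000, 1500]
```

A score is flagged as an outlier when it is strictly greater than T; that comparison is also an assumption of this sketch.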

  6. Statistics are biased
     Reason: the presence of outliers

  7. Performance of biased statistics
     Expected: 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500
     Method   Threshold   Detected outliers
     SD       1574.81     {}
     MAD      84.22       {300, 500, 1000, 1500}
     IQR      590.25      {1000, 1500}

  8. Two-stage Thresholding (2T)

  9. 2T Algorithm
     Select initial threshold T = (SD, MAD or IQR);
     REPEAT
       1. Remove biggest outlier scores
       2. Re-calculate T = (SD, MAD or IQR)
     UNTIL stop condition
     RETURN T
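
The loop above can be sketched as follows. The slide does not spell out what "biggest outlier scores" or the stop condition mean; this sketch assumes that every score above the current threshold is removed and that the loop stops once no remaining score exceeds T. The constants (multiplier 3, b = 1.4826, population SD) are likewise assumptions. Under them, the sketch reproduces the SD and MAD rows reported on slide 10.

```python
import statistics

def sd_threshold(x, alpha=3.0):
    # T = mean + alpha * SD (population SD is an assumption).
    return statistics.fmean(x) + alpha * statistics.pstdev(x)

def mad_threshold(x, alpha=3.0, b=1.4826):
    # T = median + alpha * b * median(|x - median(x)|).
    med = statistics.median(x)
    return med + alpha * b * statistics.median(abs(v - med) for v in x)

def two_stage_threshold(scores, threshold_fn):
    """Recompute the threshold after removing the scores it flags.

    Assumed stop condition: no remaining score exceeds the current threshold.
    """
    kept = list(scores)
    t = threshold_fn(kept)                      # initial threshold
    while True:
        trimmed = [s for s in kept if s <= t]   # remove the flagged scores
        if len(trimmed) == len(kept) or len(trimmed) < 2:
            return t                            # nothing left to remove
        kept = trimmed
        t = threshold_fn(kept)                  # re-calculate T on cleaned scores

scores = [1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500]

for name, fn in [("SD", sd_threshold), ("MAD", mad_threshold)]:
    t = two_stage_threshold(scores, fn)
    print(name, f"{t:.2f}", [s for s in scores if s > t])
# SD 1574.81 []
# MAD 39.13 [60, 300, 500, 1000, 1500]
```

The SD row shows the failure mode: the initial threshold flags nothing, so there is nothing to remove and the threshold never changes.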

  10. Performance of 2T
      Expected: 1, 2, 3, 6, 8, 16, 17, 18, 18, 60, 300, 500, 1000, 1500
      Method   Threshold   Detected outliers
      SD       1574.81     {}
      MAD      39.13       {60, 300, 500, 1000, 1500}
      IQR      38.00       {60, 300, 500, 1000, 1500}
      If the first stage fails, 2T will fail!

  11. Experiments

  12. Datasets
      Dataset            Size    Outliers   Dim   Outlier object
      KDDCup99           60632   246        38    Network attack
      Wilt               4839    261        5     Diseased trees
      Stamps             340     31         9     Forged stamps
      PageBlocks         5473    560        10    Pictures or graphics
      Cardiotocography   2126    471        21    Patients
      Pima               768     268        8     Patients
      SpamBase           4601    1813       57    Spam email
      HeartDisease       270     120        13    Patients
      Arrhythmia         450     206        259   Affected patients
      Parkinson          195     147        22    Patients
      Source: Campos et al., "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study", Data Mining and Knowledge Discovery, 2016.

  13. Outlier detectors: KNN and MOD
      [Figure: illustration of the KNN and MOD detectors]
      Ramaswamy et al., "Efficient algorithms for mining outliers from large data sets", ACM SIGMOD Record, 2000.
      Yang et al., "Mean-shift outlier detection", FSDM 2018.
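
The KNN detector of Ramaswamy et al. scores each object by the distance to its k-th nearest neighbour, so isolated objects receive large scores. A minimal brute-force sketch follows; the toy points and k = 2 are hypothetical and not taken from the slides.

```python
import math

def knn_scores(points, k=2):
    """Outlier score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical toy data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
for point, score in zip(points, knn_scores(points, k=2)):
    print(point, round(score, 2))
```

The resulting scores are what a thresholding rule such as SD, MAD or IQR would then split into normal objects and outliers.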

  14. Results with KNN detector (F1-score)
      Dataset   SD Original   SD 2T   Clever   MAD Original   MAD 2T   IQR Original   IQR 2T
      KDDCup.   0.54          0.46    0.00     0.43           0.37     0.48           0.45
      Wilt      0.61          0.59    0.05     0.57           0.50     0.61           0.60
      Stamps    0.58          0.72    0.08     0.75           0.61     0.59           0.69
      PageB.    0.69          0.74    0.09     0.66           0.55     0.75           0.73
      Card.     0.61          0.56    0.61     0.57           0.53     0.61           0.61
      Pima      0.55          0.63    0.49     0.58           0.65     0.49           0.51
      Spam.     0.42          0.48    0.42     0.47           0.51     0.41           0.43
      HeartD.   0.52          0.59    0.38     0.53           0.59     0.42           0.46
      Arrhy.    0.55          0.65    0.31     0.65           0.67     0.54           0.59
      Parki.    0.31          0.42    0.30     0.39           0.46     0.31           0.34
      AVG       0.53          0.58    0.33     0.57           0.54     0.51           0.54

  15. Results with MOD detector (F1-score)
      Dataset   SD Original   SD 2T   Clever   MAD Original   MAD 2T   IQR Original   IQR 2T
      KDDCup.   0.54          0.47    0.00     0.43           0.37     0.48           0.45
      Wilt      0.61          0.53    0.05     0.52           0.47     0.59           0.56
      Stamps    0.60          0.72    0.60     0.73           0.65     0.60           0.66
      PageB.    0.68          0.73    0.09     0.66           0.55     0.75           0.72
      Card.     0.55          0.56    0.55     0.56           0.54     0.55           0.55
      Pima      0.54          0.62    0.42     0.60           0.63     0.49           0.52
      Spam.     0.55          0.49    0.38     0.56           0.52     0.42           0.43
      HeartD.   0.52          0.54    0.38     0.51           0.54     0.40           0.42
      Arrhy.    0.56          0.65    0.55     0.61           0.67     0.53           0.57
      Parki.    0.35          0.49    0.34     0.42           0.48     0.34           0.36
      AVG       0.55          0.58    0.34     0.56           0.54     0.51           0.52
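
The results are reported as F1-scores. Assuming the standard definition (the harmonic mean of precision and recall over the detected outliers), the measure can be sketched as below. The object ids in the example are hypothetical and do not come from the slides.

```python
def f1_score(detected, actual):
    """F1 of a detected outlier set against the true outlier set (standard definition)."""
    detected, actual = set(detected), set(actual)
    tp = len(detected & actual)          # correctly detected outliers
    if tp == 0:
        return 0.0
    precision = tp / len(detected)       # share of detections that are true outliers
    recall = tp / len(actual)            # share of true outliers that were detected
    return 2 * precision * recall / (precision + recall)

# Hypothetical object ids: five true outliers, four of them detected.
actual = {10, 11, 12, 13, 14}
detected = {11, 12, 13, 14}
print(round(f1_score(detected, actual), 2))  # 0.89
```

A threshold that is too high lowers recall, while one that is too low lowers precision, so F1 penalises both under- and over-detection.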

  16. The amount of detected outliers (MOD detector)
      Dataset   Outliers   SD     2T     Clever
      KDDCup.   246        3509   9383   48105
      Wilt      261        330    1073   4806
      Stamps    31         35     79     334
      PageB.    560        229    834    5378
      Card.     471        183    522    2103
      Pima      268        108    228    734
      Spam.     1813       693    1047   4186
      HeartD.   120        44     85     263
      Arrhy.    206        63     136    428
      Parki.    147        22     52     174

  17. Time (s), MOD detector
      Dataset (size)    SD      2T      Clever
      KDDCup. (60632)   <0.01   0.12    811.40
      Wilt (4839)       <0.01   0.01    8.39
      Stamps (340)      <0.01   ≤0.01   0.06
      PageB. (5473)     <0.01   ≤0.01   10.54
      Card. (2126)      <0.01   ≤0.01   1.69
      Pima (768)        <0.01   ≤0.01   0.26
      Spam. (4601)      <0.01   ≤0.01   6.29
      HeartD. (270)     <0.01   ≤0.01   0.05
      Arrhy. (450)      <0.01   ≤0.01   0.10
      Parki. (195)      <0.01   ≤0.01   0.02

  18. Conclusions
      Why use it: Simple but effective!
      How it performs: Improves existing thresholding!
      Usefulness: Almost no extra coding needed!

  19. Thank you!

