Introduction to Data Mining Evaluation and Imbalanced Class Problem

This content introduces evaluation in data mining, focusing on the challenges posed by imbalanced class problems. It covers common tools such as accuracy and the confusion matrix, and examines the class imbalance issue in various classification scenarios. It also highlights the limitations of accuracy on imbalanced datasets and uses examples to illustrate the importance of detecting rare classes. Models are compared on the basis of predicted versus actual classes to emphasize the need for appropriate evaluation measures in such scenarios.

  • Data Mining
  • Imbalanced Class Problem
  • Evaluation
  • Classification
  • Accuracy


Presentation Transcript


  1. Data Mining Evaluation and Imbalanced Class Problem
     Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar
     Modified by Yuzhen Ye, Fall 2022

  2. Accuracy
     Confusion matrix:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  a (TP)      b (FN)
                     Class=No   c (FP)      d (TN)
     Most widely used metric:
     Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

  3. Confusion Matrix
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  a           b
                     Class=No   c           d
     a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
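
To make the matrix and the accuracy formula concrete, here is a minimal Python sketch (not part of the original slides; the example labels and the helper name confusion_counts are illustrative) that tallies a, b, c, d from actual and predicted labels and computes accuracy:

    # Minimal sketch: build a 2-class confusion matrix and compute accuracy.
    def confusion_counts(actual, predicted, positive="Yes"):
        """Return (a, b, c, d) = (TP, FN, FP, TN) for a 2-class problem."""
        a = b = c = d = 0
        for y, p in zip(actual, predicted):
            if y == positive and p == positive:
                a += 1          # true positive
            elif y == positive:
                b += 1          # false negative
            elif p == positive:
                c += 1          # false positive
            else:
                d += 1          # true negative
        return a, b, c, d

    # Made-up labels for illustration.
    actual    = ["Yes", "No", "Yes", "No", "No",  "Yes"]
    predicted = ["Yes", "No", "No",  "No", "Yes", "Yes"]
    a, b, c, d = confusion_counts(actual, predicted)
    print((a + d) / (a + b + c + d))   # accuracy = 4/6 ≈ 0.67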

  4. Class Imbalance Problem
     Many classification problems have skewed classes (far more records from one class than the other):
     • Credit card fraud
     • Intrusion detection
     • Defective products in a manufacturing assembly line
     • COVID-19 test results on a random sample
     Ref: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

  5. Challenges
     • Evaluation measures such as accuracy are not well suited for imbalanced classes
     • Detecting the rare class is like finding a needle in a haystack

  6. Problem with Accuracy
     Consider a 2-class problem:
     • Number of Class NO examples = 990
     • Number of Class YES examples = 10
     If a model predicts everything to be class NO, accuracy is 990/1000 = 99%. This is misleading because the trivial model does not detect any class YES example. Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.).
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  0           10
                     Class=No   0           990
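
As a quick check of the numbers above, a tiny sketch (not from the slides) of the trivial "always predict NO" model:

    # Sketch: 990 NO / 10 YES examples, model predicts NO for everything.
    actual    = ["YES"] * 10 + ["NO"] * 990
    predicted = ["NO"] * 1000

    accuracy = sum(y == p for y, p in zip(actual, predicted)) / len(actual)
    detected = sum(1 for y, p in zip(actual, predicted) if y == "YES" and p == "YES")
    print(accuracy)   # 0.99
    print(detected)   # 0 -- not a single rare-class example is detected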

  7. Which model is better?
     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  0           10
                     Class=No   0           990
     Accuracy: 99%
     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  10          0
                     Class=No   500         490
     Accuracy: 50%

  8. Which model is better?
     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  5           5
                     Class=No   0           990
     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  10          0
                     Class=No   500         490

  9. Alternative Measures
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  a           b
                     Class=No   c           d
     Precision (p) = a / (a + c)
     Recall (r)    = a / (a + b)
     F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
     The F-measure (F1 score, F score) is the harmonic mean of precision and recall.
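
A minimal Python sketch (not from the slides; the function name is illustrative) of these three formulas, using the counts a = TP, b = FN, c = FP, d = TN:

    # Sketch: precision, recall, and F-measure from confusion-matrix counts.
    def precision_recall_f(a, b, c, d):
        # d (TN) is not needed for these measures, but is kept for completeness.
        p = a / (a + c) if (a + c) else 0.0            # precision
        r = a / (a + b) if (a + b) else 0.0            # recall
        f = 2 * r * p / (r + p) if (r + p) else 0.0    # harmonic mean of p and r
        return p, r, f

    # The example from the next slide: a=10, b=0, c=10, d=980.
    print(precision_recall_f(10, 0, 10, 980))   # (0.5, 1.0, 0.666...)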

  10. Alternative Measures
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  10          0
                     Class=No   10          980
     Precision (p) = 10 / (10 + 10) = 0.5
     Recall (r)    = 10 / (10 + 0) = 1
     F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
     Accuracy      = 990 / 1000 = 0.99

  11. Alternative Measures
     Classifier 1:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  10          0
                     Class=No   10          980
     Precision (p) = 0.5, Recall (r) = 1, F-measure (F) ≈ 0.67, Accuracy = 0.99
     Classifier 2:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  1           9
                     Class=No   0           990
     Precision (p) = 1 / (1 + 0) = 1
     Recall (r)    = 1 / (1 + 9) = 0.1
     F-measure (F) = (2 × 0.1 × 1) / (0.1 + 1) = 0.18
     Accuracy      = 991 / 1000 = 0.991

  12. Which of these classifiers is better?
     Classifier A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  40          10
                     Class=No   10          40
     Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8
     Classifier B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  40          10
                     Class=No   1000        4000
     Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.08, Accuracy ≈ 0.8

  13. Measures of Classification Performance
                                PREDICTED CLASS
                                Yes      No
     ACTUAL CLASS    Yes        TP       FN
                     No         FP       TN
     α (alpha) is the probability that we reject the null hypothesis when it is true. This is a Type I error, or a false positive (FP).
     β (beta) is the probability that we accept the null hypothesis when it is false. This is a Type II error, or a false negative (FN).

  14. Alternative Measures
     Classifier A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  40          10
                     Class=No   10          40
     Precision (p) = 0.8, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) = 0.8, Accuracy = 0.8, TPR/FPR = 4
     Classifier B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  40          10
                     Class=No   1000        4000
     Precision (p) = 0.038, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) = 0.07, Accuracy = 0.8, TPR/FPR = 4

  15. Which of these classifiers is better?
     Classifier A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  10          40
                     Class=No   10          40
     Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2, F-measure (F) = 0.28
     Classifier B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  25          25
                     Class=No   25          25
     Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5, F-measure (F) = 0.5
     Classifier C:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  40          10
                     Class=No   40          10
     Precision (p) = 0.5, TPR = Recall (r) = 0.8, FPR = 0.8, F-measure (F) = 0.61

  16. ROC (Receiver Operating Characteristic)
     • A graphical approach for displaying the trade-off between detection rate and false alarm rate
     • Developed in the 1950s in signal detection theory to analyze noisy signals
     • An ROC curve plots TPR against FPR
     • The performance of a model is represented as a point on an ROC curve

  17. ROC Curve
     (TPR, FPR):
     • (0,0): declare everything to be the negative class
     • (1,1): declare everything to be the positive class
     • (1,0): ideal
     • Diagonal line: random guessing
     • Below the diagonal line: prediction is opposite of the true class

  18. ROC (Receiver Operating Characteristic)
     • To draw an ROC curve, the classifier must produce continuous-valued output
       - Outputs are used to rank test records, from the most likely positive-class record to the least likely
       - By applying different thresholds to this value, we create different variations of the classifier with different TPR/FPR trade-offs
     • Many classifiers produce only discrete outputs (i.e., the predicted class)
     • How to get continuous-valued outputs? Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM

  19. Example: Decision Trees
     [Figure: a decision tree that splits on attributes x1 and x2; each leaf carries a continuous-valued output (e.g., 0.059, 0.107, 0.220, 0.727) that can be used to rank test records.]

  20. ROC Curve Example
     [Figure: the same decision tree and the ROC curve obtained by thresholding its continuous-valued leaf outputs.]

  21. ROC Curve Example
     • 1-dimensional data set containing 2 classes (positive and negative)
     • Any point located at x > t is classified as positive
     At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88

  22. How to Construct an ROC Curve
     • Use a classifier that produces a continuous-valued score for each instance (the more likely the instance is to be in the + class, the higher the score)
     • Sort the instances in decreasing order of score
     • Apply a threshold at each unique value of the score
     • Count the number of TP, FP, TN, FN at each threshold
       TPR = TP / (TP + FN)
       FPR = FP / (FP + TN)
     Instance:     1     2     3     4     5     6     7     8     9     10
     Score:        0.95  0.93  0.87  0.85  0.85  0.85  0.76  0.53  0.43  0.25
     True Class:   +     +     -     -     -     +     -     +     -     +

  23. How to Construct an ROC Curve
     Class:         +     -     +     -     -     -     +     -     +     +
     Threshold >=:  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
     TP:            5     4     4     3     3     3     3     2     2     1     0
     FP:            5     5     4     4     3     2     1     1     0     0     0
     TN:            0     0     1     1     2     3     4     4     5     5     5
     FN:            0     1     1     2     2     2     2     3     3     4     5
     TPR:           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
     FPR:           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
     (The Class row gives the true class of the instance at each score; the final threshold 1.00 lies above all scores.)
     ROC curve: [Figure: TPR plotted against FPR at these thresholds.]
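
The threshold sweep above can be reproduced with a short Python sketch (not from the slides). It treats each unique score as a candidate threshold (so the three tied 0.85 scores collapse into one threshold) and prints the TPR and FPR of the resulting classifier:

    # Sketch: ROC points for the 10 scored instances from slide 22.
    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+',  '+',  '-',  '-',  '-',  '+',  '-',  '+',  '-',  '+']

    P = labels.count('+')   # total positives
    N = labels.count('-')   # total negatives

    # Candidate thresholds: each unique score, plus one above the maximum.
    for t in sorted(set(scores)) + [1.00]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        print(f"threshold >= {t:.2f}: TPR = {tp / P:.1f}, FPR = {fp / N:.1f}")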

  24. Using ROC for Model Comparison
     • No model consistently outperforms the other
       - M1 is better for small FPR
       - M2 is better for large FPR
     • Area Under the ROC Curve (AUC)
       - Ideal: area = 1
       - Random guess: area = 0.5
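
As an illustration (not from the slides), the AUC can be approximated with the trapezoidal rule; the (FPR, TPR) points below come from the unique-threshold sweep sketched after slide 23:

    # Sketch: trapezoidal-rule AUC over (FPR, TPR) points sorted by increasing FPR.
    points = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.6, 0.6),
              (0.8, 0.6), (0.8, 0.8), (1.0, 0.8), (1.0, 1.0)]

    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    print(round(auc, 2))   # 0.56 for this small example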

  25. Which measure to choose?
     • Many measures exist, but none of them may be ideal in all situations
       - Random classifiers can have high values for many of these measures
       - TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios
     • Given two classifiers, sometimes you can tell that one of them is strictly better than the other
       - C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or the same TPR and better FPR, and vice versa)
       - Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances
     • Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)

  26. Which Classifier is better?
     Classifier T1:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  50          50
                     Class=No   1           99
     Precision (p) = 0.98, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.66
     Classifier T2:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   10          90
     Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.94
     Classifier T3:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   1           99
     Precision (p) = 0.99, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.99

  27. Which Classifier is better? Medium Skew case
     Classifier T1:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  50          50
                     Class=No   10          990
     Precision (p) = 0.83, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.62
     Classifier T2:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   100         900
     Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.66
     Classifier T3:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   10          990
     Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.94

  28. Which Classifier is better? High Skew case
     Classifier T1:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  50          50
                     Class=No   100         9900
     Precision (p) ≈ 0.33, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure ≈ 0.4
     Classifier T2:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   1000        9000
     Precision (p) = 0.09, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.165
     Classifier T3:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS    Class=Yes  99          1
                     Class=No   100         9900
     Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.66

  29. Handling the Class Imbalance Problem
     • Sampling-based approaches
       - Undersample the majority class
       - Oversample the rare class
     • Cost-sensitive classification
       - Treats different misclassifications differently
       - Misclassifying the rare class as the majority class is more expensive than misclassifying the majority as the rare class

  30. Sampling-based Approaches
     • Modify the distribution of the training data so that the rare class is well represented in the training set
       - Undersample the majority class
       - Oversample the rare class
     • Advantages and disadvantages
       - Advantage: easy to implement
       - Disadvantages: prone to overfitting (oversampling); loss of useful information (undersampling)
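
A minimal sketch of both sampling strategies using only the Python standard library (not from the slides; libraries such as imbalanced-learn offer more sophisticated variants, e.g., SMOTE):

    import random

    # Sketch: random undersampling of the majority class and random oversampling
    # (with replacement) of the rare class. `data` is a list of (x, label) pairs.
    random.seed(0)
    data = [(i, "NO") for i in range(990)] + [(i, "YES") for i in range(10)]

    majority = [r for r in data if r[1] == "NO"]
    rare     = [r for r in data if r[1] == "YES"]

    # Undersample: keep only as many majority records as there are rare records.
    undersampled = random.sample(majority, len(rare)) + rare

    # Oversample: replicate rare records with replacement up to the majority size.
    oversampled = majority + random.choices(rare, k=len(majority))

    print(len(undersampled), len(oversampled))   # 20 1980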

  31. Cost-sensitive Classification: Cost Matrix
     Count matrix f(i, j):
                                PREDICTED CLASS
                                Class=Yes     Class=No
     ACTUAL CLASS    Class=Yes  f(Yes, Yes)   f(Yes, No)
                     Class=No   f(No, Yes)    f(No, No)
     Cost matrix C(i, j): cost of misclassifying a class i example as class j
                                PREDICTED CLASS
                                Class=Yes     Class=No
     ACTUAL CLASS    Class=Yes  C(Yes, Yes)   C(Yes, No)
                     Class=No   C(No, Yes)    C(No, No)
     Cost = Σ_{i,j} C(i, j) × f(i, j)

  32. Computing Cost of Classification
     Cost matrix C(i, j):
                                PREDICTED CLASS
                                +       -
     ACTUAL CLASS    +          -1      100
                     -          1       0
     Model M1:
                                PREDICTED CLASS
                                +       -
     ACTUAL CLASS    +          150     40
                     -          60      250
     Accuracy = 80%, Cost = 3910
     Model M2:
                                PREDICTED CLASS
                                +       -
     ACTUAL CLASS    +          250     45
                     -          5       200
     Accuracy = 90%, Cost = 4255
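
The two reported costs can be verified with a short sketch (not from the slides) that applies Cost = Σ C(i, j) × f(i, j):

    # Sketch: weighted cost of a classification, with count and cost matrices
    # both indexed by (actual, predicted) over the classes '+' and '-'.
    cost = {('+', '+'): -1, ('+', '-'): 100,
            ('-', '+'):  1, ('-', '-'):   0}

    def total_cost(counts, cost):
        return sum(cost[key] * counts[key] for key in counts)

    m1 = {('+', '+'): 150, ('+', '-'): 40, ('-', '+'): 60, ('-', '-'): 250}
    m2 = {('+', '+'): 250, ('+', '-'): 45, ('-', '+'):  5, ('-', '-'): 200}

    print(total_cost(m1, cost), total_cost(m2, cost))   # 3910 4255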

  33. Cost-sensitive Classification
     Cost-insensitive (cost-blind) classification. Given a test record x:
     • Compute p(i|x) for each class i
     • Decision rule: classify x as class k if k = argmax_i p(i|x)
     • For two classes, classify x as + if p(+|x) > p(-|x)
     This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+).

  34. Cost-sensitive Classification
     General decision rule: classify test record x as class k, where
       k = argmin_j Σ_i p(i|x) C(i, j)
     Two-class case:
       Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
       Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
     Decision rule: classify x as + if Cost(+) < Cost(-).
     If C(+,+) = C(-,-) = 0, this reduces to: classify x as + if
       p(+|x) > C(-,+) / (C(-,+) + C(+,-))
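
A minimal sketch of the two-class rule (not from the slides): with C(+,+) = C(-,-) = 0, a record is labeled + exactly when p(+|x) exceeds C(-,+) / (C(-,+) + C(+,-)):

    # Sketch: cost-sensitive decision rule for two classes, assuming zero cost
    # for correct predictions (C(+,+) = C(-,-) = 0).
    def cost_sensitive_label(p_pos, c_fp, c_fn):
        """p_pos = p(+|x), c_fp = C(-,+) (false-positive cost),
        c_fn = C(+,-) (false-negative cost)."""
        cost_pos = (1 - p_pos) * c_fp   # expected cost of predicting +
        cost_neg = p_pos * c_fn         # expected cost of predicting -
        return '+' if cost_pos < cost_neg else '-'

    # With a false negative 100x as costly as a false positive, even
    # p(+|x) = 0.05 yields '+', since 0.05 > 1 / (1 + 100) ≈ 0.0099.
    print(cost_sensitive_label(0.05, c_fp=1, c_fn=100))   # '+'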

  35. Cost-sensitive Classification
     • The values of the cost matrix must be carefully defined; the choice of costs will determine the quality of the model.
     • For credit card fraud classification, the cost of a false positive might be the company's monetary cost of following up with the customer, and the cost of a false negative might be the cost of the fraudulent transaction to the company.
     • For disease diagnosis, the cost of a false positive might be the monetary cost of extra tests; but how about the cost of a false negative?

  36. Summary
     • Imbalanced class problems are common
     • Examine the class distributions first (EDA)
     • Accuracy isn't a good metric for classification problems with imbalanced classes; consider alternative metrics
     • Consider approaches that can deal with imbalanced classes
