Understanding Classifier Evaluation Methods


Learn how to evaluate classifiers and regression models using measures like confusion matrix, accuracy, error rate, precision, recall, and F-measure. Explore techniques for comparing and selecting the best classifier based on performance metrics.

  • Classifier Evaluation
  • Model Assessment
  • Performance Metrics
  • Machine Learning
  • Evaluation Methods

Presentation Transcript


  1. Evaluating classifiers

  2. Problem
  • How do you evaluate a classifier or a regression model? E.g., how can we tell that one classifier is better than another?
  • We have seen some evaluation measures and the concept of training/testing, but have not yet systematically examined the evaluation methods.
  • Measures we have seen so far: MSE for regression; loss (for function approximation/optimization, neural networks).

  3. Outline
  • Measures for classifiers: confusion matrix, accuracy, and error rate; precision and recall; multi-class evaluation
  • Design of experiments

  4. Classifier Evaluation: Confusion Matrix

  Confusion matrix (binary case):

    Actual \ Predicted   C1                     ¬C1
    C1                   True Positives (TP)    False Negatives (FN)
    ¬C1                  False Positives (FP)   True Negatives (TN)

  Example of a confusion matrix:

    Actual \ Predicted    buy_computer = yes   buy_computer = no   Total
    buy_computer = yes    6954                 46                  7000
    buy_computer = no     412                  2588                3000
    Total                 7366                 2634                10000
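
  The same kind of matrix can be produced programmatically. Below is a minimal sketch using scikit-learn's confusion_matrix; the y_true/y_pred arrays are made-up placeholders, not output from any model in these slides.

```python
# Minimal sketch: building a confusion matrix with scikit-learn.
# y_true / y_pred are made-up placeholder arrays, not real model output.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = positive, 0 = negative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier predictions

# Rows = actual class, columns = predicted class.
# With labels=[1, 0] the layout matches the slide: [[TP, FN], [FP, TN]].
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```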

  5. Confusion matrix: multiple classes
  • Given m classes, an entry CM(i, j) in the confusion matrix indicates the number of tuples of class i that were labeled by the classifier as class j.
  • The matrix may have extra rows/columns to provide totals.

  6. Classifier Evaluation Measures: Accuracy, Error Rate

    Actual \ Predicted   C     ¬C    Total
    C                    TP    FN    P
    ¬C                   FP    TN    N
    Total                P'    N'    All

  • Classifier accuracy (recognition rate): percentage of test set tuples that are correctly classified.
    Accuracy = (TP + TN) / All
  • Error rate = 1 − Accuracy = (FP + FN) / All
  • Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive, so there is a significant majority of the negative class and a minority of the positive class.
  • Accuracy is not a good measure for imbalanced classes (why?)
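
  As a quick check of the definitions, the buy_computer example from slide 4 works out as follows (a worked calculation, not part of the original deck):

```python
# Worked numbers from the buy_computer example on slide 4.
TP, FN, FP, TN = 6954, 46, 412, 2588
total = TP + FN + FP + TN                # 10000

accuracy = (TP + TN) / total             # 9542 / 10000 = 0.9542
error_rate = (FP + FN) / total           # 458 / 10000  = 0.0458
print(accuracy, error_rate)              # error_rate == 1 - accuracy
```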

  7. Classifier Evaluation Measures: Precision, Recall, and F-measures
  • Precision (exactness): what % of the examples that the classifier labeled as positive are actually positive?
    Precision = TP / (TP + FP)
  • Recall (completeness): what % of the positive examples did the classifier label as positive?
    Recall = TP / (TP + FN)
  • A perfect score for precision or recall is 1.0.
  • F-measure (F1 or F-score): the harmonic mean of precision and recall,
    F1 = 2 × Precision × Recall / (Precision + Recall)

  8. Precision/recall: Example

    Actual \ Predicted   cancer = yes   cancer = no   Total
    cancer = yes         90 (TP)        210 (FN)      300
    cancer = no          140 (FP)       9560 (TN)     9700
    Total                230            9770          10000

  Precision = ?   Recall = ?
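
  Filling in the question above with the formulas from slide 7 (a worked answer added here for reference):

```python
# Worked numbers from the cancer example above.
TP, FN, FP, TN = 90, 210, 140, 9560

precision = TP / (TP + FP)                    # 90 / 230 ≈ 0.391
recall    = TP / (TP + FN)                    # 90 / 300 = 0.300
accuracy  = (TP + TN) / (TP + FN + FP + TN)   # 0.965
print(precision, recall, accuracy)
```

  Note that accuracy is above 96% even though recall is only 30%, which illustrates why accuracy alone is misleading on imbalanced data.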

  9. Tradeoffs between precision and recall
  • Important: the model should produce probabilistic output, for example logistic regression, most neural networks, or Naïve Bayes (distorted probabilities).
  • The default probability cutoff is 0.5, i.e., Prob > 0.5 : class 1, otherwise class 0.
  • However, you can move the cutoff and see how the results (precision and recall) change.
  • Plot the precision-recall pair for every cutoff value; the area under the resulting curve can be used as a metric as well.
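
  A minimal sketch of this cutoff sweep with scikit-learn; y_true and y_score are made-up placeholders, and in practice y_score would come from a classifier's predict_proba output.

```python
# Sketch: sweep the probability cutoff and read off precision/recall pairs.
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # made-up labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])   # made-up probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, list(thresholds) + [None]):
    print(f"cutoff={t}  precision={p:.2f}  recall={r:.2f}")

# Area under the precision-recall curve (average precision).
print("AP =", average_precision_score(y_true, y_score))
```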

  10. ROC Curves
  • ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models.
  • Move the probability threshold to get different (TPR, FPR) pairs.
  • The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate.
  • The plot also shows a diagonal line (left figure); classifier 1 performs better than classifier 2.

  11. Area under the ROC Curve (AUC)
  • A quantitative measure of overall model quality: 1 = perfect, 0.5 = random guess.
  • Can it be less than 0.5? How?


  13. ROC AUC or precision-recall AUC?
  • Rule of thumb: for balanced classes use ROC AUC; for imbalanced classes use precision-recall AUC.
  • sklearn has library functions to compute the ROC AUC score (roc_auc_score) and the area under the precision-recall curve (average_precision_score).
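
  A short sketch of the two sklearn calls named above, reusing made-up labels and scores:

```python
# Sketch: the two summary areas mentioned on the slide.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                     # made-up labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # made-up probabilities

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("PR AUC (average precision):", average_precision_score(y_true, y_score))
```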

  14. How to evaluate >2 classes (Sec. 15.2.4)
  • Sample dataset: the classic Reuters-21578 data set: 21578 documents; 9603 training and 3299 test articles; 118 categories.
  • An article can be in more than one category, so we learn 118 binary category distinctions.
  • Average document: about 200 tokens. Average number of classes assigned: 1.24 for docs with at least one category.
  • Only about 10 out of the 118 categories are large. Common categories (#train, #test):
    Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

  15. Micro- vs. Macro-Averaging
  • If we have more than one class, how do we combine multiple performance measures into one quantity?
  • Macroaveraging: compute performance for each class, then average.
  • Microaveraging: collect decisions for all classes into one contingency table, then evaluate.

  16. Micro- vs. Macro-Averaging: Example (Sec. 15.2.4)

  Class 1:
    Truth \ Classifier   yes        no
    yes                  10 (TP)    10 (FN)
    no                   10 (FP)    970 (TN)

  Class 2:
    Truth \ Classifier   yes        no
    yes                  90 (TP)    10 (FN)
    no                   10 (FP)    890 (TN)

  Micro-average (pooled) table:
    Truth \ Classifier   yes         no
    yes                  100 (TP)    20 (FN)
    no                   20 (FP)     1860 (TN)

  • Macroaveraged precision: (10/20 + 90/100) / 2 = 0.7
  • Microaveraged precision: (TP1 + TP2) / (TP1 + TP2 + FP1 + FP2) = 100/120 ≈ 0.83
  • The microaveraged score is dominated by the score on the common classes, so it is not a good choice for imbalanced classes.
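
  The same two numbers can be reproduced directly from the per-class counts (a small worked script added here, not part of the deck):

```python
# Worked numbers from the two per-class tables above.
tp1, fp1 = 10, 10    # Class 1: classifier said "yes" correctly / incorrectly
tp2, fp2 = 90, 10    # Class 2

p1 = tp1 / (tp1 + fp1)                                    # 0.5
p2 = tp2 / (tp2 + fp2)                                    # 0.9
macro_precision = (p1 + p2) / 2                           # 0.7
micro_precision = (tp1 + tp2) / (tp1 + tp2 + fp1 + fp2)   # 100/120 ≈ 0.83
print(macro_precision, micro_precision)
```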

  17. The proper design of experiments (for selecting a model)
  • Generalization (test) error: approximated by the error on a holdout test dataset, which has never been seen by the modeling process and can only be used once.
    E.g., the datasets used in the private leaderboard on Kaggle, which competitors don't see.
    Putting a model online and collecting the prediction error gives the online test error.
  • Validation dataset (actually it is what we often call "test data"): often a subset of the training dataset, left out and not used for model training. It can be used multiple times for hyperparameter tuning; "offline test error" usually refers to the error on the validation dataset.

  18. Design of experiments: Holdout & Cross-Validation
  • Methods for model selection.
  • Holdout method: the given data is partitioned into two independent sets (by a certain criterion, e.g., by date in Project 1):
    a training set (e.g., 2/3) for model construction and a validation set (e.g., 1/3) for estimating prediction error.
  • Often used for large datasets and expensive models (expensive to train multiple times).
  • Weakness: we don't know the error bar (i.e., the model variance, which we will discuss).
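
  A minimal holdout split with scikit-learn, using made-up X and y placeholders and a random split (see the next slide for when a random split is not appropriate):

```python
# Sketch: 2/3 training, 1/3 validation holdout split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)   # made-up feature matrix
y = np.arange(15) % 2              # made-up labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=0)
print(X_train.shape, X_val.shape)  # (10, 2) (5, 2)
```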

  19. Random split or not?
  • Random data splitting may lead to underestimating the generalization error.
  • Sequential data (e.g., house sales, stock prices): the validation set should not overlap with the training set in time.
  • Highly clustered examples (e.g., photos of the same person, clips of the same video): may need to split by clusters instead of by individual examples.
  • Highly imbalanced label classes: sample more from the minority classes.
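
  scikit-learn ships splitters for each of these situations; the sketch below shows them on made-up data (TimeSeriesSplit for sequential data, GroupKFold for clustered examples, and StratifiedKFold, which preserves class proportions rather than oversampling the minority class, as one common way to handle imbalance).

```python
# Sketch: splitters that respect the data structure described above.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold, StratifiedKFold

X = np.arange(24).reshape(12, 2)                           # 12 examples in time order
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1])         # imbalanced labels
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])    # e.g. person / video id

# Sequential data: each validation fold comes after its training fold in time.
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    print("time-based  train:", tr, "val:", va)

# Clustered examples: a group never appears in both training and validation.
for tr, va in GroupKFold(n_splits=3).split(X, y, groups):
    print("group-based train:", tr, "val:", va)

# Imbalanced classes: each fold keeps roughly the overall label proportions.
for tr, va in StratifiedKFold(n_splits=3).split(X, y):
    print("stratified  train:", tr, "val:", va)
```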

  20. Case study: house price prediction (needs to be split by date)
  Linear regression with a sequential split does not work!

  21. K-fold cross-validation: when a random split is OK
  K-fold cross-validation allows you to get the model's average performance and its standard deviation!
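
  A minimal cross-validation sketch with scikit-learn; the breast-cancer dataset and logistic regression model are placeholders chosen only so the example runs end to end.

```python
# Sketch: 5-fold cross-validation, reporting mean and standard deviation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("per-fold scores:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")
```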

  22. Two models A and B: which model do you want to select?
  • A: 0.5, 0.6, 0.55, 0.7, 0.65
    B: 0.55, 0.55, 0.6, 0.65, 0.7
  • A′: 0.5, 0.6, 0.55, 0.7, 0.65
    B′: 0.51, 0.62, 0.57, 0.71, 0.67
  • Large error bars are often not a good sign: the data quality or the model quality is not good.

  23. Error bars for k-fold cross validation
  • Standard deviation, standard error, and confidence interval.
  • Standard deviation: a theoretical property of a distribution, e.g., x_i ~ N(μ, σ²), that we often don't know.
  • Corrected (unbiased) sample standard deviation:
    s = sqrt( (1/(k−1)) · Σ_{i=1..k} (x_i − x̄)² )
    Different software may use the biased (1/n) or unbiased (1/(n−1)) version; check the details.
  • Standard error (of the mean), for computing the error in estimating the mean of samples: for k-fold CV we compute the average performance x̄ = (1/k) · Σ_{i=1..k} x_i. Assuming the x_i are independent (incorrect in practice!), var(x̄) = var((1/k) · Σ x_i) = (1/k²) · Σ var(x_i) = σ²/k, so the standard error is estimated as SE(x̄) = s / sqrt(k).
  • Confidence interval (of the estimated mean model performance): [x̄ − α·SE(x̄), x̄ + α·SE(x̄)], with α = 1.96 for a 95% confidence level.
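
  Putting these formulas into code for the 5-fold scores of model A from slide 22 (a worked example added here):

```python
# Sketch: error bar of a 5-fold CV result using the formulas above.
import numpy as np

scores = np.array([0.5, 0.6, 0.55, 0.7, 0.65])   # model A, slide 22
k = len(scores)

mean = scores.mean()
s = scores.std(ddof=1)               # corrected (1/(k-1)) sample standard deviation
se = s / np.sqrt(k)                  # standard error, assuming independent folds
ci = (mean - 1.96 * se, mean + 1.96 * se)   # ~95% confidence interval

print(f"mean={mean:.3f}  std={s:.3f}  se={se:.3f}  95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```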

  24. Comparing two models based on k-fold cross-validation (or multiple datasets)
  • If the confidence intervals (CIs) are not overlapping, the difference is roughly significant.
  • If the CIs are overlapping, the situation is more complicated.
  • Commonly used: a paired t-test between the two sets of k-fold CV results. Compute the per-fold differences d_i = a_i − b_i, their mean d̄ and sample standard deviation s_d, and the statistic
    t = d̄ / (s_d / sqrt(k))
  • Null hypothesis: the two models have the same mean performance, i.e., the mean difference is zero. If the null is true, the t statistic follows a known null distribution (approximately normal; more precisely, Student's t with k − 1 degrees of freedom).
  • Check the p-value: the probability of observing a difference at least this extreme if the null hypothesis were true. Often, if the p-value < 0.05, we reject the null hypothesis.
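
  A sketch of this paired test with SciPy's ttest_rel, applied to the A′/B′ scores from slide 22:

```python
# Sketch: paired t-test on per-fold CV scores of two models.
from scipy.stats import ttest_rel

a = [0.5,  0.6,  0.55, 0.7,  0.65]   # model A' (slide 22)
b = [0.51, 0.62, 0.57, 0.71, 0.67]   # model B' (slide 22)

t_stat, p_value = ttest_rel(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) would lead us to reject the null hypothesis
# that the two models have the same mean performance across folds.
```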

  25. Example: SpamAssassin
  • Naïve Bayes has found a home in spam filtering: Paul Graham's "A Plan for Spam".
    A mutant with more mutant offspring: a Naive Bayes-like classifier with weird parameter estimation, widely used in spam filters.
  • Classic Naive Bayes is superior when appropriately used, according to David D. Lewis.
  • But spam filters also use many other things: black-hole lists, etc.
  • Many email topic filters also use NB classifiers.

  26. Naïve Bayes on spam email (problems with the presentation?)

  27. Summary
  • Many learning tasks are about classification.
  • It is important to understand the well-known evaluation measures.
  • Imbalanced data.
  • Intuition behind micro- vs. macro-averaging for multi-class evaluation.
