Analysis of Cost-Sensitive Boosting Algorithms

Presentation Transcript


  1. Cost-Sensitive Boosting Algorithms: Do we really need them? Professor Gavin Brown, School of Computer Science, University of Manchester

  2. "The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them." Sir William Bragg, Nobel Prize in Physics, 1915

  3. Cost-sensitive Boosting algorithms: Do we really need them? Nikolaos Nikolaou, Narayanan Edakunni, Meelis Kull, Peter Flach and Gavin Brown. Machine Learning Journal, Vol. 104, Issue 2, Sept 2016. Nikos Nikolaou, Cost-sensitive Boosting: A Unified Approach, PhD thesis, University of Manchester, 2016.

  4. The Usual Machine Learning Approach: data + labels -> Learning Algorithm -> Model; new data (no labels) -> Model -> predicted label.

  5. The Boosting approach: data + labels -> Model 1; reweighted Dataset 2 -> Model 2; Dataset 3 -> Model 3; Dataset 4 -> Model 4. Each model corrects the mistakes of its predecessor.

  6. The Boosting approach (test time): new data is passed to Models 1-4, and their predictions are combined by a weighted vote.

  7. Boosting: A Rich Theoretical History. Can we turn a weak learner (marginally more accurate than random guessing) into a strong learner (arbitrarily high accuracy)? (Kearns, 1988). Yes! (Schapire, 1990). The AdaBoost algorithm (Freund & Schapire, 1997), Gödel Prize 2003.

  8. The AdaBoost algorithm (informal). Start with all datapoints equally important. For t = 1 to T: build a classifier h_t; calculate the voting weight for h_t; update the distribution over the datapoints, increasing emphasis where h_t makes mistakes and decreasing emphasis where h_t is already correct. Calculate the answer for a test point by weighted majority vote.
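For concreteness, here is a minimal Python sketch of the loop just described, using decision stumps from scikit-learn as the weak learner. This is an illustrative implementation of discrete AdaBoost, not the authors' code; the function names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Discrete AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                      # all datapoints equally important
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)    # voting weight for this classifier
        D *= np.exp(-alpha * y * pred)           # raise weight on mistakes, lower on correct
        D /= D.sum()
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted majority vote over the ensemble."""
    score = sum(a * h.predict(X) for h, a in zip(models, alphas))
    return np.sign(score)
```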

  9. The AdaBoost algorithm. Voting weight for classifier #t: alpha_t = (1/2) ln((1 - eps_t) / eps_t), where eps_t is the weighted error of h_t. Distribution update: D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i)). Majority vote on test example x: H(x) = sign(sum_t alpha_t * h_t(x)).

  10. Cost-sensitive Problems. What happens with differing False Positive / False Negative costs? Can we minimize the expected cost (a.k.a. risk)? (2x2 cost matrix over Prediction in {1, 0} and Truth in {1, 0}: the off-diagonal cells hold the false positive and false negative costs.)
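To make "expected cost" concrete, here is a small illustration with hypothetical cost values (a false negative costing five times a false positive, correct predictions costing zero); this is not taken from the slides.

```python
# Hypothetical costs: C_FN for missing a positive, C_FP for a false alarm.
C_FN, C_FP = 5.0, 1.0

def expected_cost(p1, predict_positive):
    """Risk of a decision given p1, the estimated P(y=1 | x)."""
    if predict_positive:
        return (1.0 - p1) * C_FP   # wrong only if the true class is 0
    return p1 * C_FN               # wrong only if the true class is 1

p1 = 0.3
print(expected_cost(p1, True), expected_cost(p1, False))  # 0.7 vs 1.5: predicting 1 is cheaper
```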

  11. Cost-sensitive AdaBoost? 15+ boosting variants over 20 years. Some re-invented multiple times. Most proposed as heuristic modifications to the original AdaBoost. Many treat FP/FN costs as hyperparameters.

  12. The AdaBoost algorithm, made cost-sensitive? The variants differ in which part of the algorithm on the previous slide they modify: some change the voting weights, some the distribution update, some the final decision rule, and some a combination of these.

  13. "The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them." Let's take a step back. Four ways of thinking about boosting: Functional Gradient Descent (Mason et al., 2000); Decision Theory (Freund & Schapire, 1997); Margin Theory (Schapire et al., 1998); Probabilistic Modelling (Lebanon & Lafferty, 2001).

  14. So, for a cost-sensitive boosting algorithm? Given a new "Algorithm X", does it still follow from each perspective: Functional Gradient Descent, Decision Theory, Margin Theory, Probabilistic Modelling?

  15. Functional Gradient Descent. Distribution update = direction in function space; voting weight = step size. Property 1: FGD-consistency. Are the voting weights and distribution updates consistent with each other (i.e. both a consequence of FGD on a given loss)?
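For reference, the standard functional-gradient-descent reading of AdaBoost on the exponential loss (standard material, not reproduced from the slide images):

```latex
% AdaBoost as FGD on L(H) = \sum_i \exp(-y_i H(x_i)): the weight distribution is the
% (negative) gradient direction the next weak learner is fitted to, and the voting
% weight is the step size obtained by a line search along that direction.
\[
  D_t(i) \;\propto\; \exp\!\big(-y_i H_{t-1}(x_i)\big), \qquad
  \alpha_t \;=\; \arg\min_{\alpha} \sum_i \exp\!\Big(-y_i \big(H_{t-1}(x_i) + \alpha\, h_t(x_i)\big)\Big).
\]
```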

  16. Decision Theory. Ideally: assign each example to the risk-minimizing class. Bayes optimal rule: predict class y=1 iff p(y=1|x) * c_FN > p(y=0|x) * c_FP, i.e. iff p(y=1|x) > c_FP / (c_FP + c_FN). Property 2: Cost-consistency. Does the algorithm use the above to make decisions (assuming good probability estimates)?
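A minimal sketch of that decision rule in code, with the cost values as hypothetical inputs:

```python
def predict_cost_sensitive(p1, c_fp, c_fn):
    """Bayes-optimal decision: predict class 1 iff the estimated P(y=1|x)
    exceeds the cost-derived threshold c_fp / (c_fp + c_fn)."""
    threshold = c_fp / (c_fp + c_fn)
    return int(p1 > threshold)

# With c_FP = 1 and c_FN = 5 the threshold drops from 0.5 to 1/6, so borderline
# examples are pushed towards the class that is expensive to miss.
print(predict_cost_sensitive(0.3, c_fp=1.0, c_fn=5.0))  # prints 1
```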

  17. Margin theory. The surrogate exponential loss is an upper bound on the 0/1 loss. Large margins encourage small generalization error. AdaBoost promotes large margins.
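In symbols (standard margin notation, not copied from the slide images):

```latex
% Margin of example (x, y) under the ensemble H(x) = \sum_t \alpha_t h_t(x), with y in {-1, +1}.
% The exponential surrogate upper-bounds the 0/1 loss at every margin value.
\[
  \mathrm{margin}(x, y) = y\, H(x), \qquad
  \exp\!\big(-y H(x)\big) \;\ge\; \mathbb{1}\big[\, y H(x) \le 0 \,\big].
\]
```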

  18. Margin theory with costs Different surrogates for each class.
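One way to write "different surrogates for each class" is to weight each class's exponential loss by its misclassification cost; the variants surveyed in the paper differ in exactly where the costs enter, so the form below is only illustrative:

```latex
\[
  L\big(y, H(x)\big) =
  \begin{cases}
    c_{FN}\, \exp\!\big(-H(x)\big) & \text{if } y = +1,\\[2pt]
    c_{FP}\, \exp\!\big(+H(x)\big) & \text{if } y = -1.
  \end{cases}
\]
```

Written as functions of the margin m = yH(x), these are c_FN * exp(-m) and c_FP * exp(-m), so their ratio stays c_FN / c_FP at every margin value; this is exactly what the asymmetry-preservation property on the next slide asks for.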

  19. Margin theory with costs. But some algorithms use surrogates whose relative penalty on the two classes swaps over as the margin varies, whereas we expect the asymmetry to hold everywhere. Property 3: Asymmetry preservation. Does the loss function preserve the relative importance of each class, for all margin values?

  20. Probabilistic Models. AdaBoost does not produce good probability estimates (Niculescu-Mizil & Caruana, 2005). "AdaBoost is successful at [..] classification [..] but not class probabilities." (Mease et al., 2007). "This increasing tendency of [the margin] impacts the probability estimates by causing them to quickly diverge to 0 and 1." (Mease & Wyner, 2008).

  21. Probabilistic Models. AdaBoost tends to produce probability estimates close to 0 or 1. (Figures: a histogram of AdaBoost output scores, and a reliability plot of empirically observed probability against AdaBoost output score.)

  22. Probabilistic Models. AdaBoost tends to produce probability estimates close to 0 or 1 (same figures as the previous slide). Property 4: Calibrated estimates. Does the algorithm generate calibrated probability estimates?

  23. The results are in. All algorithms produce uncalibrated probability estimates! So what if we calibrate these last three?

  24. Platt scaling (logistic calibration). Training phase: reserve part of the training data (here 50%) to fit a sigmoid that corrects the distortion in the reliability plot (empirically observed probability vs. AdaBoost output score). Test phase: apply the sigmoid transform to the score (the output of the ensemble) to get a probability estimate.
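A minimal sketch of logistic calibration on held-out scores, using scikit-learn's LogisticRegression on the one-dimensional score as a simple stand-in for Platt's original fitting procedure (which additionally regularizes the targets); this is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit a sigmoid p = 1 / (1 + exp(-(a*s + b))) on held-out (score, label) pairs."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lr

def platt_probabilities(lr, scores):
    """Map raw ensemble scores to calibrated estimates of P(y=1 | x)."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```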

  25. Experiments (lots) 15 algorithms. 18 datasets. 21 degrees of cost imbalance.

  26. In summary. (Chart: algorithms ranked by average Brier score, grouped into those with all four properties and those with all except calibrated probabilities, including Calibrated AdaMEC.) AdaMEC, CGAda & AsymAda outperform all others. Their calibrated versions outperform the uncalibrated ones.

  27. In summary. Calibrated AdaMEC was the algorithm ranked 2nd overall. So what is it? 1. Take the original AdaBoost. 2. Calibrate it (we use Platt scaling). 3. Shift the decision threshold. Consistent with all theory perspectives. No extra hyperparameters added. No need to retrain if the cost ratio changes. Consistently top (or joint top) in empirical comparisons.
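The three steps above can be assembled from standard scikit-learn parts. The sketch below is an illustrative stand-in, not the authors' implementation (their own Python code is linked on slide 32); the cost values and split sizes are placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_calibrated_adamec(X, y, n_estimators=100, cal_fraction=0.5, random_state=0):
    # Step 1: train plain AdaBoost on part of the data.
    X_tr, X_cal, y_tr, y_cal = train_test_split(
        X, y, test_size=cal_fraction, stratify=y, random_state=random_state)
    ada = AdaBoostClassifier(n_estimators=n_estimators, random_state=random_state)
    ada.fit(X_tr, y_tr)
    # Step 2: Platt-scale the ensemble scores on the held-out calibration split.
    scores_cal = ada.decision_function(X_cal).reshape(-1, 1)
    platt = LogisticRegression().fit(scores_cal, y_cal)
    return ada, platt

def predict_calibrated_adamec(ada, platt, X, c_fp, c_fn):
    # Step 3: shift the decision threshold according to the cost matrix.
    p1 = platt.predict_proba(ada.decision_function(X).reshape(-1, 1))[:, 1]
    return (p1 > c_fp / (c_fp + c_fn)).astype(int)
```

Note that if the cost ratio changes, only the threshold in the last line changes; neither the ensemble nor the calibrator needs retraining, which is the "no need to retrain" point above.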

  28. The results are in. All algorithms produce uncalibrated probability estimates!

  29. Question: What if we calibrate all algorithms? Answer: in theory, calibration will improve the probability estimates. But if a method is not cost-sensitive, calibration will not fix it; if the gradient steps/direction are not consistent, calibration will not fix it; if class importance is swapped, calibration will not fix it.

  30. Question: What if we calibrate all algorithms? (Chart: the calibrated algorithms, grouped into those with all properties and those with all except calibrated estimates, including the original AdaBoost of 1997.)

  31. Conclusions. We analyzed the cost-sensitive boosting literature (15+ variants over 20 years) from 4 different theoretical perspectives. Cost-sensitive modifications to the original AdaBoost seem unnecessary, provided the scores are properly calibrated and the decision threshold is shifted according to the cost matrix.

  32. Code to play with. All software at http://www.cs.man.ac.uk/~gbrown/costsensitiveboosting/ including: a step-by-step walkthrough of Calibrated AdaMEC, implemented in Python; and a detailed tutorial on all aspects, allowing the experiments to be reproduced.

  33. Results (extended). Isotonic regression beats Platt scaling for larger datasets. One can do better than a 50%-50% train-calibration split. (Calibrated) Real AdaBoost > (Calibrated) Discrete AdaBoost...
