
Performance Evaluation: Indices, Recognition Rate, and Data Partitioning
Learn about typical performance indices for classification and regression models, including recognition rate and error metrics. Understand concepts like underfitting, overfitting, and canonical data partitioning to optimize model construction and evaluation.
Presentation Transcript
Performance Evaluation: Accuracy Estimate for Classification and Regression
J.-S. Roger Jang (jang@mirlab.org), MIR Lab, CSIE Dept., National Taiwan University
http://mirlab.org/jang
2025/3/22
Introduction to Performance Evaluation
Performance evaluation is an objective procedure for deriving the performance index (or figure of merit) of a given model for classification or regression. It matters for comparing the performance of two models, for extracting useful features, for determining a model's complexity, and for AutoML.
Typical Performance Indices
For classification: accuracy (recognition rate) and error rate, top-K accuracy, precision, recall, and F-measure, EER (equal error rate) for binary classification, AUROC (area under the ROC curve), and AUPRC (area under the precision-recall curve).
For regression: RMSE (root mean squared error), $\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$; MAE (mean absolute error), $\tfrac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$; and the coefficient of determination, $R^2 = 1 - \sum_i (y_i-\hat{y}_i)^2 / \sum_i (y_i-\bar{y})^2$.
There are many more PIs!
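As an illustration (not part of the original slides), here is a minimal Python sketch that computes a few of these indices with scikit-learn on toy predictions; the numbers and library choice are assumptions for demonstration only.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, mean_absolute_error, mean_squared_error, r2_score)

# Classification: true labels, predicted labels, and predicted scores (toy values)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.4, 0.3, 0.9])
print("accuracy :", accuracy_score(y_true, y_pred))        # recognition rate
print("error    :", 1 - accuracy_score(y_true, y_pred))    # error rate
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("AUROC    :", roc_auc_score(y_true, y_score))

# Regression: targets and predictions (toy values)
t = np.array([1.0, 2.0, 3.0])
p = np.array([1.1, 1.9, 3.3])
print("RMSE:", np.sqrt(mean_squared_error(t, p)))
print("MAE :", mean_absolute_error(t, p))
print("R2  :", r2_score(t, p))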
Synonyms
Sets of synonyms used interchangeably (since we are focusing on classification): classifiers and models; recognition rate and accuracy; training/validation and design time; test and run time.
Performance Indices for Classifiers (Quiz!)
The performance indices of a classifier are its recognition rate (how do we derive it objectively?) and its computation load, both at design time (training and validation) and at run time (test). Our focus is the recognition rate and the procedures used to derive it. The estimated accuracy depends on the data partitioning and on the model's type and complexity.
Underfitting and Overfitting
For regression: how do we determine the order of the fitting polynomial? For classification: how do we determine the complexity of the classifier?
Canonical Data Partitioning: Concepts (Quiz!)
Data partitioning makes the best use of the dataset for both model construction and evaluation. The whole dataset is split into three disjoint subsets: a training set for model construction, a validation set for model evaluation and selection, and a test set for the final model evaluation.
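A minimal sketch of such a three-way split, assuming scikit-learn's train_test_split and an arbitrary 60/20/20 ratio (the slides do not prescribe specific proportions):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)           # 100 samples, 4 features (toy data)
y = np.random.randint(0, 2, 100)     # binary labels

# First carve out 20% as the test set, then split the rest into
# training (60% of the whole) and validation (20% of the whole).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20; the three sets are disjoint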
Canonical Data Partitioning: Usage Scenarios (Quiz!)
Your instructor asks you to develop a classifier using the training set that should also perform well on the validation set. Most likely, your instructor will finally check the results of your classifier on the test set, which was not shared with you.
Similarly, you can use historical data to create a stock predictor, partitioning the data into a training set for model construction and a validation set for model selection (to prevent overfitting). Future data then serves as the test set (which won't be available until the date comes) for evaluating the performance of your predictor.
Canonical Data Partitioning: Characteristics
The validation set is used for model selection, in particular to prevent overfitting. Performance on the validation set should be close to (or slightly higher than) the performance on the test set.
To make do with a small dataset, either use cross validation (CV) on the combined training and validation sets to obtain a more reliable estimate of the accuracy, or estimate the classifier's performance from cross validation alone, with no further test set.
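A minimal sketch of this workflow, assuming scikit-learn, the iris dataset, and a k-NN classifier whose k is chosen by 5-fold CV on the training+validation portion (all of these choices are illustrative assumptions, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# Model selection: pick k by cross validation on the training+validation portion.
best_k, best_cv = None, -1.0
for k in (1, 3, 5, 7, 9):
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_trval, y_trval, cv=5).mean()
    if cv_acc > best_cv:
        best_k, best_cv = k, cv_acc

# Final model: retrain on all of training+validation, then evaluate once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trval, y_trval)
print("chosen k:", best_k, "CV accuracy:", round(best_cv, 3),
      "test accuracy:", round(final_model.score(X_test, y_test), 3))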
Simplified Data Partitioning: Training & Test
Data partitioning: the whole dataset is split into two disjoint subsets, a training set for model construction and a test set for model evaluation.
Drawback: there is no validation set for model selection. The model is selected by the training error, which runs the risk of overfitting.
Simplified Data Partitioning: Training & Validation
Data partitioning: the whole dataset is split into a training set for model construction and a validation set for model evaluation and selection.
Drawback: there is no test set for final model evaluation, and performance on the validation set may also lead to overfitting.
Simplified Data Partitioning: Cross Validation
Data partitioning with cross validation (rotation): a training set for model construction and a validation set for model evaluation and selection, with the roles rotated across folds.
Characteristics: we obtain a more robust estimate of the model's performance based on the average performance over the validation sets, and the result is less dependent on any particular dataset partitioning.
Methods for Performance Evaluation
Typical methods to derive the recognition rate: inside test (not desirable), one-side holdout test, two-side holdout test (two-fold cross validation), m-fold cross validation, and leave-one-out cross validation.
The ultimate goal is to find a just-right model for better prediction, with no underfitting or overfitting and with reasonable computation time.
Inside Test: Concept
Data partitioning: use the whole dataset for both training and evaluation.
Given the dataset $D = \{(\mathbf{x}_i, y_i) \mid i = 1, 2, \ldots, n\}$ and the model $F_D$ identified by $D$, the inside-test (resubstitution) recognition rate is
$\mathrm{RR}_{\text{inside}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[\, y_i = F_D(\mathbf{x}_i) \,\right]$.
Inside Test: Characteristics
The inside-test RR is too optimistic, since it tends to be higher than the true RR; for instance, 1-NNC always has an inside-test RR of 100%. It can be used as an upper bound of the true RR. Potential reasons for a low inside-test RR include bad features in the dataset or a bad method for model construction, such as poor results from neural network training or from k-means clustering.
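A quick sketch of this effect, assuming scikit-learn and the iris dataset: the resubstitution accuracy of a 1-nearest-neighbor classifier is 100%, because each training sample is its own nearest neighbor.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Evaluating on the same data used for training (inside test / resubstitution):
print("inside-test RR of 1-NNC:", clf.score(X, y))   # 1.0, i.e., 100%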
One-side Holdout Test: Concept
Data partitioning: a training set for model construction and a validation set for performance evaluation.
Given the dataset $D = A \cup B$ with $A = \{(\mathbf{x}_i^A, y_i^A) \mid i = 1, \ldots, |A|\}$ and $B = \{(\mathbf{x}_i^B, y_i^B) \mid i = 1, \ldots, |B|\}$, and the model $F_A$ identified by the training set $A$:
$\mathrm{RR}_{\text{inside}} = \frac{1}{|A|} \sum_{i=1}^{|A|} \mathbb{1}\!\left[\, y_i^A = F_A(\mathbf{x}_i^A) \,\right]$,
$\mathrm{RR}_{\text{outside}} = \frac{1}{|B|} \sum_{i=1}^{|B|} \mathbb{1}\!\left[\, y_i^B = F_A(\mathbf{x}_i^B) \,\right]$.
One-side Holdout Test: Characteristics
Characteristics: the result is highly affected by the data partitioning, and usually $\mathrm{RR}_{\text{inside}} \ge \mathrm{RR}_{\text{outside}}$. The method is adopted when the training (design-time) computation load is high (for instance, deep neural networks).
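A minimal holdout sketch, assuming scikit-learn, the iris dataset, and an arbitrary 70/30 split into a training set A and a held-out set B:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# One-side holdout: train on A, evaluate on the held-out B.
X_A, X_B, y_A, y_B = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

model = GaussianNB().fit(X_A, y_A)
print("inside-test RR :", model.score(X_A, y_A))   # accuracy on the training set A
print("outside-test RR:", model.score(X_B, y_B))   # accuracy on the held-out set B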
Two-side Holdout Test: Concept
Data partitioning: a training set for model construction and a validation set for performance evaluation, with a role reversal between the two sets.
Given $D = A \cup B$ as before, and the models $F_A$ and $F_B$ identified by $A$ and $B$ respectively:
$\mathrm{RR}_{\text{inside}} = \left( \sum_{i=1}^{|A|} \mathbb{1}\!\left[ y_i^A = F_A(\mathbf{x}_i^A) \right] + \sum_{i=1}^{|B|} \mathbb{1}\!\left[ y_i^B = F_B(\mathbf{x}_i^B) \right] \right) \Big/ \left( |A| + |B| \right)$,
$\mathrm{RR}_{\text{outside}} = \left( \sum_{i=1}^{|A|} \mathbb{1}\!\left[ y_i^A = F_B(\mathbf{x}_i^A) \right] + \sum_{i=1}^{|B|} \mathbb{1}\!\left[ y_i^B = F_A(\mathbf{x}_i^B) \right] \right) \Big/ \left( |A| + |B| \right)$.
Two-side Holdout Test: Block Diagram
In the two-side holdout test (two-fold cross validation), dataset $A$ is used to construct model $A$, which is evaluated on dataset $B$ to give $\mathrm{RR}_B$; dataset $B$ is used to construct model $B$, which is evaluated on dataset $A$ to give $\mathrm{RR}_A$. The combined cross-validation recognition rate is
$\mathrm{RR}_{\text{CV}} = \frac{|A| \cdot \mathrm{RR}_A + |B| \cdot \mathrm{RR}_B}{|A| + |B|}$.
This is an outside test.
Two-side Holdout Test: Characteristics
Characteristics: better use of the dataset, although the result is still highly affected by the partitioning. It is suitable for models with a high training (design-time) computation load.
M-fold Cross Validation: Concept (Quiz!)
Data partitioning: partition the dataset into m folds; use one fold for validation and the other folds for training; repeat m times.
Given $D = D_1 \cup D_2 \cup \cdots \cup D_m$ and, for each fold $j$, the model $F_{D \setminus D_j}$ identified by $D \setminus D_j$:
$\mathrm{RR}_{\text{inside}} = \sum_{j=1}^{m} \sum_{(\mathbf{x}, y) \in D \setminus D_j} \mathbb{1}\!\left[ y = F_{D \setminus D_j}(\mathbf{x}) \right] \Big/ \sum_{j=1}^{m} |D \setminus D_j|$,
$\mathrm{RR}_{\text{outside}} = \sum_{j=1}^{m} \sum_{(\mathbf{x}, y) \in D_j} \mathbb{1}\!\left[ y = F_{D \setminus D_j}(\mathbf{x}) \right] \Big/ \sum_{j=1}^{m} |D_j|$.
M-fold Cross Validation: Block Diagram
The dataset is partitioned into m disjoint sets $D_1, D_2, \ldots, D_m$. For each $k$, model $k$ is constructed from $D \setminus D_k$ and evaluated on $D_k$ to give $\mathrm{RR}_k$. The cross-validation recognition rate is
$\mathrm{RR}_{\text{CV}} = \frac{\sum_{k=1}^{m} |D_k| \cdot \mathrm{RR}_k}{\sum_{k=1}^{m} |D_k|}$.
This is an outside test.
M-fold Cross Validation: Characteristics
Characteristics: when m = 2, this reduces to the two-side holdout test; when m = n, it becomes leave-one-out cross validation. The value of m depends on the computation load imposed by the selected model.
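A minimal m-fold CV sketch, assuming scikit-learn, the iris dataset, a naive Bayes classifier, and m = 5 (an arbitrary choice here):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
m = 5
cv = KFold(n_splits=m, shuffle=True, random_state=0)

# One outside-test accuracy per fold; since the folds here are equal in size,
# their simple average matches the size-weighted RR_CV formula above.
fold_acc = cross_val_score(GaussianNB(), X, y, cv=cv)
print("per-fold RR:", fold_acc)
print("RR_CV      :", fold_acc.mean())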
Stratified Partitioning for Cross Validation (Quiz!)
In stratified partitioning, each fold has a ratio of class sizes as close as possible to that of the original dataset. For example, with two classes whose class size ratio is 2:3 and three folds, each fold should preserve that 2:3 ratio.
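A minimal sketch of stratified partitioning, assuming scikit-learn's StratifiedKFold and toy labels with a 2:3 class size ratio:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 2:3 class size ratio (40 samples of class 0, 60 of class 1).
y = np.array([0] * 40 + [1] * 60)
X = np.arange(len(y)).reshape(-1, 1)    # dummy features; only the labels matter here

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for k, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[val_idx])
    print(f"fold {k}: class sizes in validation fold = {counts.tolist()}")
# Each validation fold keeps (roughly) the original 2:3 ratio.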
Leave-one-out CV: Concept (Quiz!)
Data partitioning: this is m-fold CV with m = n and $D_i = \{(\mathbf{x}_i, y_i)\}$.
Given the dataset $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ and, for each $j$, the model $F_{D \setminus \{(\mathbf{x}_j, y_j)\}}$ identified by the dataset with the j-th pair removed:
$\mathrm{RR}_{\text{inside}} = \frac{1}{n(n-1)} \sum_{j=1}^{n} \sum_{i \ne j} \mathbb{1}\!\left[ y_i = F_{D \setminus \{(\mathbf{x}_j, y_j)\}}(\mathbf{x}_i) \right]$,
$\mathrm{RR}_{\text{outside}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[ y_i = F_{D \setminus \{(\mathbf{x}_i, y_i)\}}(\mathbf{x}_i) \right]$.
Leave-one-out CV: Block Diagram
The n input/output pairs $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$ are rotated so that, for each $k$, model $k$ is constructed from all pairs except $(\mathbf{x}_k, y_k)$ and evaluated on that single pair, giving $\mathrm{RR}_k$ (which is either 0% or 100%). The overall estimate is
$\mathrm{RR}_{\text{LOOCV}} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{RR}_k$.
This is an outside test.
Leave-one-out CV: Characteristics (Quiz!)
Strength of LOOCV (leave-one-out cross validation): it makes the best use of the dataset to derive a reliable accuracy estimate. Drawback: it performs model construction n times, which is slow. To speed up LOOCV, construct a common part that can be reused repeatedly, such as the global mean and covariance for QC or NBC.
How do we construct the final model after CV? Use the selected model structure on the whole dataset (training set + validation set) to construct the final model, and then test it.
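A minimal LOOCV sketch, assuming scikit-learn, the iris dataset, and a naive Bayes classifier (all illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# n models are built; each is evaluated on its single left-out pair (score 0 or 1).
scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
print("RR_LOOCV:", scores.mean())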
CV Applications
Cross validation is used for feature selection in classification and regression, for model complexity determination (such as order determination in polynomial fitting, or the number of prototypes for a VQ-based 1-NNC), for performance comparison among different models, for meta-parameter tuning, for outlier detection (LOOCV in particular), and for AutoML.
CV Example: Feature Selection
Wine dataset (13 features), naive Bayes classifier, sequential forward selection, leave-one-out cross validation: 97.75% validation accuracy.
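A minimal sketch of this setup with scikit-learn; the number of features kept (5) is an assumption, and the result is not guaranteed to reproduce the 97.75% figure reported on the slide:

from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
nbc = GaussianNB()

# Sequential forward selection scored by leave-one-out cross validation.
sfs = SequentialFeatureSelector(nbc, n_features_to_select=5, direction="forward", cv=LeaveOneOut())
sfs.fit(X, y)
selected = sfs.get_support(indices=True)

acc = cross_val_score(nbc, X[:, selected], y, cv=LeaveOneOut()).mean()
print("selected feature indices:", selected)
print("LOOCV accuracy with selected features:", round(acc, 4))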
CV Example: Order Determination in Polynomial Fitting
Census dataset, polynomial fitting over years, leave-one-out cross validation: the best order is 2.
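A minimal sketch of order determination by LOOCV, using synthetic (x, y) data as a stand-in for the census dataset from the slide:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(scale=1.0, size=x.size)   # roughly quadratic data

def loocv_rmse(order):
    # Leave each point out, fit a polynomial of the given order on the rest,
    # and accumulate the squared error on the left-out point.
    errs = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        coeffs = np.polyfit(x[mask], y[mask], order)
        errs.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)
    return np.sqrt(np.mean(errs))

for order in range(1, 6):
    print(f"order {order}: LOOCV RMSE = {loocv_rmse(order):.3f}")
# The order with the smallest LOOCV RMSE is the one selected.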
Efficiency in LOOCV
Goal: how do we speed up LOOCV for performance evaluation? A common scenario: (1) construct a common model from the whole dataset; (2) update the model to remove the effect of one I/O pair; (3) use that I/O pair to evaluate the updated model; (4) repeat steps 2-3 until all I/O pairs are done.
This scenario is applicable to classifiers such as k-nearest-neighbor classifiers, naive Bayes classifiers, quadratic classifiers, and SVM.
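As one concrete illustration (an assumption, not taken from the slides): for a nearest-neighbor classifier, the "common part" can be the full pairwise distance matrix, computed once; removing the effect of the left-out pair then only requires ignoring its own entry when searching for the nearest neighbor.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances

X, y = load_iris(return_X_y=True)

# Common part, computed once: all pairwise distances.
dist = pairwise_distances(X)
np.fill_diagonal(dist, np.inf)       # a left-out sample may not vote for itself

# LOOCV for 1-NN without retraining: each sample is classified by its
# nearest neighbor among the remaining n-1 samples.
nearest = dist.argmin(axis=1)
rr_loocv = np.mean(y[nearest] == y)
print("LOOCV RR of 1-NNC:", round(rr_loocv, 4))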
Caveat of CV
Even with CV, overfitting is likely to happen when the dataset is small or the feature dimension is high. Do not try to boost the validation accuracy too much, or you run the risk of indirectly training on the left-out data. A typical example is applying feature selection to a random dataset, as sketched below.
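A minimal sketch of this caveat, assuming scikit-learn: the labels below are pure noise, so no classifier can truly do better than 50%, yet selecting the features that happen to correlate with the labels on the full dataset before running CV inflates the estimate.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))          # small dataset, high feature dimension
y = rng.integers(0, 2, size=50)         # random labels: nothing to learn

# Flawed protocol: select the 10 "best" features using ALL the data, then cross-validate.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
peeking_acc = cross_val_score(GaussianNB(), X_sel, y, cv=5).mean()

# Honest protocol: cross-validate on the raw features (selection would have to be
# repeated inside each training fold to be fair).
honest_acc = cross_val_score(GaussianNB(), X, y, cv=5).mean()

print("CV accuracy after selecting features on all data:", round(peeking_acc, 3))  # typically well above 0.5
print("CV accuracy without peeking:", round(honest_acc, 3))                        # near 0.5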
Exercise: Computation Time for Cross Validation
Question: for a dataset of n input/output pairs, building a model on the set requires 2n seconds, and evaluating the model on m input/output pairs requires m seconds. We have a dataset of 100 input/output pairs and want to perform cross validation on it.
(a) What is the overall time required for 10-fold cross validation?
(b) What is the overall time required for leave-one-out cross validation?
Answer: (a) 1,900 seconds; (b) 19,900 seconds.
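One way to arrive at these numbers (a sketch of the reasoning, consistent with the answers above): in 10-fold CV each fold trains on 90 pairs and evaluates on 10, and in LOOCV each iteration trains on 99 pairs and evaluates on 1, so
$10 \times (2 \times 90 + 10) = 10 \times 190 = 1{,}900$ seconds for 10-fold CV, and
$100 \times (2 \times 99 + 1) = 100 \times 199 = 19{,}900$ seconds for LOOCV.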