Enhancing Model Performance through Data Balancing Techniques
Explore methods like SMOTE and synthetic data generation to address class imbalance in machine learning models, enabling more accurate predictions. Learn why these techniques must be applied only to the training set, and what post-processing may be needed afterwards to keep probability estimates meaningful.
Presentation Transcript
Week 3, Video 6: Tweaking Towards Optimality
SMOTE
- Sometimes your data has a really bad class imbalance
- For example, only 3% of the training labels are the class you want to predict
- Many algorithms will bias in favor of predicting the 97% over the 3%
SMOTE
- SMOTE creates synthetic data points for the rare (minority) class by interpolating between existing minority points and their nearest neighbors
- It can also be combined with removing data points from the common (majority) class, although it is used this way less often
- Can lead to (mildly) better performance
- You always need to make sure to conduct SMOTE only on the training set (see the sketch below)
- Otherwise, your model performance estimates will not be representative of real-world conditions
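A minimal sketch of this workflow, assuming scikit-learn and the imbalanced-learn library; the synthetic dataset and variable names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Hypothetical 97%/3% imbalanced dataset standing in for real data
X, y = make_classification(n_samples=2000, weights=[0.97], random_state=42)

# Split FIRST, so the test set keeps the real-world class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority class in the TRAINING data only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train on (X_train_res, y_train_res); evaluate on the untouched (X_test, y_test)
```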
Rescaling after SMOTE
- Using SMOTE (or other algorithms that change the class distribution) will distort your probability estimates, driving them towards 50%
Rescaling after SMOTE
- If that matters (mostly when comparing classifiers to each other), there are several simple hacks
- Simplest is probably just to fit a linear function that maps the confidences back to the original proportion (see the sketch below)
- i.e., if the original proportion was 3%, but average confidence is now 16%, fit a linear function on confidence to bring it back to 3%
- You can also fit a spline curve to do this more smoothly
- Or, for cases where what you really care about is that proportions come out right across multiple classifiers, fit/rescale each classifier's confidences so that the final proportion comes out right
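A minimal sketch of the linear rescaling, using a hypothetical helper `rescale_confidences`; the numbers match the 3%/16% example above:

```python
import numpy as np

def rescale_confidences(confidences, original_rate):
    """Linearly rescale confidences so their mean matches the original
    class proportion, then clip back into [0, 1]. A hypothetical helper
    illustrating the 'simplest hack' described above."""
    confidences = np.asarray(confidences, dtype=float)
    scale = original_rate / confidences.mean()
    return np.clip(confidences * scale, 0.0, 1.0)

# Average confidence drifted to ~16% after SMOTE; true base rate is 3%
conf = np.array([0.10, 0.16, 0.22])
print(rescale_confidences(conf, original_rate=0.03))  # mean is now ~0.03
```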
Synthetic Data Generation
- As an alternative to SMOTE, various papers have experimented with different forms of synthetic data generation
- Trying to create new data points that look like existing data points (a toy sketch follows this list)
- Increasing recent interest in using generative AI (LLMs) to do this
- Watch this space: it's a fast-moving area
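A toy sketch of one very simple form of synthetic generation (noise-jittered copies of minority examples); this illustrates the general idea only and is not any specific published method:

```python
import numpy as np

def jitter_minority(X_minority, n_new, noise_scale=0.05, seed=None):
    """Create new synthetic rows by copying random minority-class
    examples and adding small Gaussian noise, scaled per feature.
    A toy illustration, not a published technique."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_minority), size=n_new)
    noise = rng.normal(0.0, noise_scale, size=(n_new, X_minority.shape[1]))
    return X_minority[idx] + noise * X_minority.std(axis=0)

# Example: 50 new synthetic rows from 10 real minority examples
X_min = np.random.default_rng(0).normal(size=(10, 4))
X_synth = jitter_minority(X_min, n_new=50, seed=1)
```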
Studying your model and improving it
- Another approach is to look at the cases your model fails on, and conduct further feature engineering to capture those cases (see the sketch below)
- Slater et al. (2020) propose a method involving visualizing cases the model fails for
- When doing this, it's important to hold out a test set!
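One minimal way to collect failure cases for inspection, sketched with scikit-learn on synthetic data; the visualization method from Slater et al. (2020) is not shown here:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds = model.predict(X_val)

# Collect the misclassified validation cases for manual inspection
wrong = preds != y_val
failures = pd.DataFrame(X_val[wrong])
failures["true"], failures["pred"] = y_val[wrong], preds[wrong]
print(failures.head())  # study these to guide further feature engineering
```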
Hyperparameter Tuning
- Most contemporary (and classic) algorithms have hyperparameters that govern their behavior
- Hyperparameter tuning can lead to much better performance
- Trying different parameter values and seeing if the data fit better
- Common strategies include grid search and iterative gradient descent (see the sketch below)
- Again, it's important to hold out a test set!
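A minimal grid-search sketch using scikit-learn's GridSearchCV; the parameter grid and dataset are illustrative, and the test set is held out before tuning begins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set BEFORE tuning; tune only on the training portion
X_tr, X_test, y_tr, y_test = train_test_split(X, y, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X_tr, y_tr)

# Report the best parameters and performance on the untouched test set
print(search.best_params_, search.score(X_test, y_test))
```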
Fine Tuning
- Modern foundation models based on neural networks allow fine-tuning
- Add some new examples and train the last output layer of the neural network to fit them (see the sketch below)
- Can be a powerful tool for getting new behaviors out of existing foundation models, and for improving model performance
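A minimal PyTorch sketch of last-layer fine-tuning; a pretrained torchvision ResNet stands in for a foundation model here (assumes a recent torchvision):

```python
import torch
import torch.nn as nn
from torchvision import models

# A pretrained torchvision ResNet as a stand-in for a "foundation model"
model = models.resnet18(weights="DEFAULT")

# Freeze all of the existing weights
for param in model.parameters():
    param.requires_grad = False

# Replace the final output layer with a fresh one for the new task
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g., 2 new classes

# Only the new layer's parameters get updated during training
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)
```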
Machine Labeling and Human Feedback
- Relatedly, you can fit a model, have it label cases, and have a human give input on whether the model is correct; that input becomes additional training data
- If the model presents the cases it is most uncertain about for labeling, this becomes active learning, which we'll discuss next week (a sketch follows this list)
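A minimal sketch of uncertainty-based querying, with illustrative names and synthetic data standing in for a real labeled set and unlabeled pool:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical pool: a small labeled set plus a larger unlabeled set
X, y = make_classification(n_samples=1200, random_state=0)
X_lab, y_lab, X_pool = X[:200], y[:200], X[200:]

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Confidence in the predicted class for each unlabeled case
confidence = model.predict_proba(X_pool).max(axis=1)

# Show the 10 most uncertain cases (confidence nearest 0.5) to a human;
# their human-provided labels become additional training data
query_idx = np.argsort(confidence)[:10]
```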
Next Up: Transfer Learning and Active Learning