
Intelligent Ensembles for Improved Machine Learning Performance
Learn about the concept of ensembles in machine learning, where multiple models are combined to boost accuracy and organize the ML process effectively. Explore bagging, boosting, random forests, stacking, and more to enhance your predictive modeling abilities.
Presentation Transcript
Ensembles & Combining Intelligence
Geoff Hulten
Model Ensembles
Instead of learning one model, learn several (possibly many) and combine them.
Reasons:
- Often improves accuracy, a lot
- Organizes the process of doing ML on a team
Many methods: bagging, boosting, gradient boosting (GBM), random forests, stacking (meta-models), sequencing, partitioning, etc.
Properties of Well-Organized Intelligence
- Accurate
- Comprehensible
- Easy to grow
- Measurable
- Loosely coupled
- Supportive of team
Bagging
Generate K training sets by sampling from the original training set (bootstrap sampling):
- The original training set contains N training examples.
- Each of the K training sets also contains N examples, created by sampling with replacement from the original.
Learn one model on each of the K training sets, then combine their predictions by uniform voting.
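A minimal sketch of this procedure in Python, assuming numpy arrays for the data, scikit-learn decision trees as the base models, and binary 0/1 labels; names like num_models are illustrative, not from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(x_train, y_train, num_models=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    models = []
    for _ in range(num_models):
        # Bootstrap sample: draw n example indices with replacement
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeClassifier().fit(x_train[idx], y_train[idx]))
    return models

def bagging_predict(models, x_test):
    # Uniform vote: average the 0/1 predictions, then threshold at 0.5
    votes = np.mean([m.predict(x_test) for m in models], axis=0)
    return (votes >= 0.5).astype(int)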
Bootstrap Sampling
Bootstrap sampling is sampling with replacement: each bootstrap training set is drawn from the original training set, so most bootstrap samples contain duplicates of the original examples, and most are missing some of them, roughly 37% on average.
[Figure: an original five-example training set and two bootstrap samples drawn from it, each containing duplicated examples and missing others.]
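The ~37% figure comes from the chance that a given example is never drawn in N tries with replacement, (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N. A quick numpy check (illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)        # one bootstrap sample of example indices
missing = 1 - len(np.unique(sample)) / n   # fraction of originals never drawn
print(missing)                             # prints roughly 0.37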
Advantages of Bagging
Each model focuses on a different part of the problem, and can fit that part of the problem better. Bagging introduces variance between the individual models, and voting tends to cancel that variance out.
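A toy illustration of why voting cancels variance, assuming (unrealistically) independent model errors; real bagged models are correlated, so the actual gain is smaller:

import numpy as np

rng = np.random.default_rng(0)
true_value = 0.7
# 10,000 estimates from one noisy model vs. the average of 25 such models
single = rng.normal(true_value, 0.2, size=10_000)
averaged = rng.normal(true_value, 0.2, size=(10_000, 25)).mean(axis=1)
print(single.var(), averaged.var())  # the averaged variance is about 25x smaller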
Boosting
for i in range(<num models>):
    Reweight the training samples so the weights sum to 1
    Learn a model M_i on the weighted training data
    Update the weights of the training data based on M_i's errors
    Add M_i to the ensemble with a weighted vote
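One concrete version of this loop is AdaBoost; a minimal sketch follows, assuming scikit-learn stumps as the weak models and labels in {-1, +1}. The specific weight-update and vote-weight formulas are AdaBoost's, not stated on the slide.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(x_train, y_train, num_models=10):
    n = len(x_train)
    weights = np.full(n, 1.0 / n)  # uniform weights summing to 1
    models, alphas = [], []
    for _ in range(num_models):
        # Learn a weak model (a depth-1 stump) on the weighted training data
        m = DecisionTreeClassifier(max_depth=1)
        m.fit(x_train, y_train, sample_weight=weights)
        pred = m.predict(x_train)
        err = weights[pred != y_train].sum()
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # this model's vote weight
        # Upweight the examples this model got wrong, then renormalize to sum to 1
        weights *= np.exp(-alpha * y_train * pred)
        weights /= weights.sum()
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, x_test):
    # Combine the ensemble with a weighted vote
    scores = sum(a * m.predict(x_test) for m, a in zip(models, alphas))
    return np.sign(scores)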
Random Forests
Build N trees:
- Use a bootstrap sample for each tree's training set (bagging)
- Restrict the features each tree can use
Combine the trees' predictions by uniform voting.
Example: RandomForest Grow
[Figure: growing two trees, each on its own bootstrap sample of a five-example training set with features x1, x2, x3. Tree 1 is restricted to features x1 and x2 and splits on x1, then x2; Tree 2 is restricted to features x1 and x3 and splits on x3, then x1.]
Example: RandomForest Predict
[Figure: the two grown trees each score a table of test examples over x1, x2, x3; the forest combines the trees' per-example predictions by vote.]
RandomForest Pseudocode
trees = []
for i in range(numTrees):
    (xBootstrap, yBootstrap) = BootstrapSample(xTrain, yTrain)
    # Select feature IDs from the feature count, not the example count
    featuresToUse = RandomlySelectFeatureIDs(len(xTrain[0]), numToUse)
    trees.append(GrowTree(xBootstrap, yBootstrap, featuresToUse))

yPredictions = [ PredictByMajorityVote(trees, xTest[i]) for i in range(len(xTest)) ]
yProbabilityEstimates = [ CountVotes(trees, xTest[i]) / len(trees) for i in range(len(xTest)) ]
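A runnable equivalent, assuming scikit-learn and the same xTrain/yTrain/xTest arrays as above. Note one difference: scikit-learn's RandomForestClassifier restricts the candidate features at each split rather than once per tree, a common variant of the same idea.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,       # numTrees
    max_features="sqrt",    # features considered at each split
    bootstrap=True)         # bootstrap-sample each tree's training set
forest.fit(xTrain, yTrain)
yPredictions = forest.predict(xTest)                  # majority vote
yProbabilityEstimates = forest.predict_proba(xTest)   # per-class estimates averaged across trees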
Model Sequencing
[Figure: models arranged in a sequence; the context flows from Model 1 to Model 2 to Model 3, each deciding whether to override; if no model overrides, a default answer of 0 is used. The slide scores sequencing against the well-organized-intelligence properties: accurate, comprehensible, easy to grow, measurable, loosely coupled, supportive of team.]
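A minimal sketch of the sequencing pattern, under assumptions not in the slides: each model returns a pair (override, answer), and control falls through until some model overrides.

def sequenced_predict(models, x, default_answer=0):
    # Models run in order; the first one that wants to override wins
    for model in models:
        override, answer = model(x)
        if override:
            return answer
    return default_answer  # no model overrode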
Partitioning Contexts
[Figure: contexts are routed by a test such as "Large web site?"; one ensemble handles the Yes branch and a second ensemble handles the No branch. The slide scores partitioning against the same properties: accurate, comprehensible, easy to grow, measurable, loosely coupled, supportive of team.]
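And a minimal sketch of partitioning, assuming a context predicate and one ensemble per partition (all names are illustrative):

def partitioned_predict(x, is_large_web_site, large_site_ensemble, small_site_ensemble):
    # Route each context to the ensemble trained for its partition
    if is_large_web_site(x):
        return large_site_ensemble.predict(x)
    return small_site_ensemble.predict(x)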
Ensembles & Combining Intelligence Summary
Almost every practical ML situation has more than one model. One important reason is accuracy; another is maintainability. Avoid Spaghetti Intelligence.