
Foundations of Algorithms and Machine Learning - Combining Models and Decision Trees
Explore the fundamentals of algorithms and machine learning through the concepts of combining models and decision trees. Learn how multiple models can be leveraged to reduce errors and how decision trees partition input space for simple modeling and learning. Discover examples of model combinations and the process of learning decision trees.
Presentation Transcript
Combining Models
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Motivation 1
- Many models can be trained on the same data; typically none is strictly better than the others (recall the no free lunch theorem).
- Can we combine predictions from multiple models? Yes, typically with a significant reduction of error!
Motivation 2
- Combined prediction using adaptive basis functions: $f(x) = \sum_{m=1}^{M} w_m \phi_m(x; \theta_m)$
- $M$ basis functions $\phi_m$, each with its own parameters $\theta_m$
- $w_m$: weight / confidence of each basis function
- The parameters, including $M$, are trained using data
- Another interpretation: automatically learning the best representation of the data for the task at hand
- Difference with mixture models?
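As a concrete illustration of the adaptive-basis-function form above, here is a minimal NumPy sketch (not from the slides) in which each basis function is a decision stump with its own parameters and the prediction is their weighted sum; the weights and thresholds below are made-up illustrative values.

```python
import numpy as np

def stump(X, feature, threshold):
    """Basis function phi_m(x; theta_m): a decision stump returning +/-1."""
    return np.where(X[:, feature] > threshold, 1.0, -1.0)

def combined_prediction(X, weights, params):
    """f(x) = sum_m w_m * phi_m(x; theta_m): weighted sum of adaptive basis functions."""
    return sum(w * stump(X, f, t) for w, (f, t) in zip(weights, params))

# Illustrative (made-up) ensemble of M = 3 stumps on 2-D inputs.
X = np.array([[0.2, 1.5], [1.8, 0.3], [0.9, 0.9]])
weights = [0.5, 0.3, 0.2]                    # w_m
params = [(0, 1.0), (1, 1.0), (0, 0.5)]      # theta_m = (feature index, threshold)
print(combined_prediction(X, weights, params))
```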
Examples of Model Combinations
Also called ensemble learning:
- Decision trees
- Bagging
- Boosting
- Committee / mixture of experts
- Feed-forward neural nets / multi-layer perceptrons
Decision Trees
- Partition the input space into cuboid regions
- Simple model for each region (classification: a single label; regression: a constant real value)
- Sequential process to choose the model for each instance
- Figure from Murphy: an example decision tree and the corresponding partition of the input space
Learning Decision Trees
- Decision for each region (regression: the average of the training data in the region; classification: the most likely label in the region)
- Learning the tree structure and splitting values: learning the optimal tree is intractable, so use a greedy algorithm
- Greedy step: find the (node, dimension, value) with the largest reduction of error, as in the sketch below
- Regression error: residual sum of squares; classification error: misclassification error, entropy, ...
- Stopping condition
- Preventing overfitting: pruning using cross-validation
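A minimal sketch of the greedy split search for a regression tree, assuming the residual-sum-of-squares criterion mentioned above; the exhaustive scan over all (dimension, value) pairs is kept deliberately naive.

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its mean (the regression error)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Greedy step: scan every (dimension, value) pair and return the split
    giving the largest reduction in RSS relative to leaving the node whole."""
    base = rss(y)
    best = None  # (gain, dimension, threshold)
    for dim in range(X.shape[1]):
        for t in np.unique(X[:, dim]):
            left, right = y[X[:, dim] <= t], y[X[:, dim] > t]
            gain = base - (rss(left) + rss(right))
            if best is None or gain > best[0]:
                best = (gain, dim, t)
    return best

# Toy data (illustrative): the best split separates the small and large inputs.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))   # approximately (24.0, 0, 3.0)
```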
Pros and Cons of Decision Trees
- Easily interpretable decision process; widely used in practice, e.g. medical diagnosis
- Not very good performance: restricted partition of the space, and restricted to one model per instance
- Unstable
Mixture of Supervised Models
- $p(y \mid x) = \sum_k \pi_k \, p_k(y \mid x, \theta_k)$
- Examples: mixture of linear regression models, mixture of logistic regression models (figures from Murphy)
- Training using EM
Conditional Mixture of Supervised Models
- Mixture of experts: $p(y \mid x) = \sum_k \pi_k(x) \, p_k(y \mid x, \theta_k)$
- The mixing (gating) weights $\pi_k(x)$ now depend on the input $x$ (figure from Murphy)
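A minimal sketch of how a mixture-of-experts prediction combines the experts with input-dependent weights, assuming softmax gating and linear-regression experts; the parameter values are made up for illustration.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts_predict(x, gate_W, expert_W):
    """Predictive mean of a mixture of K linear-regression experts:
    E[y | x] = sum_k pi_k(x) * (w_k . x), with softmax gating pi_k(x)."""
    pi = softmax(gate_W @ x)        # gating weights depend on the input x
    expert_means = expert_W @ x     # one linear expert per component
    return float(pi @ expert_means)

# Illustrative (made-up) parameters: K = 2 experts on 2-D inputs.
gate_W = np.array([[1.0, -1.0], [-1.0, 1.0]])
expert_W = np.array([[2.0, 0.0], [0.0, -3.0]])
print(mixture_of_experts_predict(np.array([0.5, 1.0]), gate_W, expert_W))
```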
Bootstrap Aggregation / Bagging
- Individual models (e.g. decision trees) may have high variance along with low bias
- Construct M bootstrap datasets and train a separate copy of the predictive model on each
- Average the prediction over the copies: $f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)$
- If the errors are uncorrelated, the bagged error reduces linearly with M
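A minimal bagging sketch, using scikit-learn's DecisionTreeRegressor purely as a convenient high-variance base model; any other regressor could be substituted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in high-variance base model

def bagged_predict(X_train, y_train, X_test, M=25, seed=0):
    """Train M copies of the base model on bootstrap resamples and average
    their predictions: f(x) = (1/M) * sum_m f_m(x)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)    # bootstrap sample, drawn with replacement
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)
```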
Random Forests
- Training the same algorithm on bootstraps creates correlated errors
- Randomly choose (a) a subset of variables and (b) a subset of the training data
- Good predictive accuracy, but a loss in interpretability
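Extending the bagging sketch above with a random feature subset per tree gives a rough random-forest flavour; note that standard random forests resample the feature subset at every split inside each tree, so this per-tree simplification is only an approximation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_predict(X_train, y_train, X_test, M=25, n_features=None, seed=0):
    """Bagging plus a random subset of features for each tree, which helps
    decorrelate the errors of the individual trees."""
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    k = n_features or max(1, d // 3)                  # a common heuristic for regression
    preds = []
    for _ in range(M):
        rows = rng.integers(0, n, size=n)             # bootstrap rows
        cols = rng.choice(d, size=k, replace=False)   # random feature subset
        model = DecisionTreeRegressor().fit(X_train[np.ix_(rows, cols)], y_train[rows])
        preds.append(model.predict(X_test[:, cols]))
    return np.mean(preds, axis=0)
```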
Boosting
- Combine weak learners that are ε-better than random, e.g. decision stumps
- Use a sequence of weighted datasets: the weight of a data point in each iteration is proportional to the number of misclassifications in earlier iterations
- The specific weighting scheme depends on the loss function
- Theoretical bounds on error
Example Loss Functions and Algorithms
- Regression losses: squared error $(y_i - f(x_i))^2$; absolute error $|y_i - f(x_i)|$
- Classification losses (in terms of the margin $y_i f(x_i)$):
  - Squared loss: $(1 - y_i f(x_i))^2$
  - 0-1 loss: $I(y_i \neq f(x_i))$
  - Exponential loss: $\exp(-y_i f(x_i))$
  - Log loss: $\frac{1}{\log 2} \log(1 + e^{-y_i f(x_i)})$
  - Hinge loss: $(1 - y_i f(x_i))_+$
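A small NumPy sketch of the margin-based classification losses listed above, assuming labels y in {-1, +1} and raw classifier scores f(x); the scores used in the demo are made up.

```python
import numpy as np

# Margin-based classification losses; y in {-1, +1}, f is the raw score f(x).
def squared_loss(y, f):     return (1 - y * f) ** 2
def zero_one_loss(y, f):    return (y * f <= 0).astype(float)   # indicator of a misclassification
def exponential_loss(y, f): return np.exp(-y * f)
def log_loss(y, f):         return np.log1p(np.exp(-y * f)) / np.log(2)
def hinge_loss(y, f):       return np.maximum(0.0, 1 - y * f)

y = np.array([1, -1, 1])
f = np.array([0.8, 0.5, -2.0])   # illustrative raw classifier scores
for loss in (squared_loss, zero_one_loss, exponential_loss, log_loss, hinge_loss):
    print(loss.__name__, loss(y, f))
```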
Example: AdaBoost
Binary classification with the exponential loss.
1. Initialize the data weights $w_n^{(1)} = 1/N$.
2. Train a classifier $y_m(x)$ minimizing the weighted error $J_m = \sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)$.
3. Evaluate $\epsilon_m = \sum_n w_n^{(m)} I(y_m(x_n) \neq t_n) / \sum_n w_n^{(m)}$ and $\alpha_m = \log \frac{1 - \epsilon_m}{\epsilon_m}$.
4. Update the weights: $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \neq t_n)\}$.
5. Predict: $Y_M(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m y_m(x)\big)$.
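A compact sketch of the AdaBoost steps above, using brute-force decision stumps as the weak learners; the exhaustive stump search and the small numerical guard on epsilon are implementation choices, not part of the slide.

```python
import numpy as np

def adaboost_train(X, t, M=50):
    """AdaBoost with decision stumps (axis-aligned threshold classifiers).
    Labels t must be in {-1, +1}. Returns a list of (alpha, dim, thr, sign)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # step 1: uniform weights
    ensemble = []
    for _ in range(M):
        # step 2: pick the stump with the smallest weighted error
        best = None
        for dim in range(d):
            for thr in np.unique(X[:, dim]):
                for sign in (+1, -1):
                    pred = sign * np.where(X[:, dim] > thr, 1, -1)
                    err = np.sum(w * (pred != t))
                    if best is None or err < best[0]:
                        best = (err, dim, thr, sign, pred)
        err, dim, thr, sign, pred = best
        eps = err / w.sum()                      # step 3: weighted error rate
        alpha = np.log((1 - eps) / max(eps, 1e-12))
        w = w * np.exp(alpha * (pred != t))      # step 4: up-weight the mistakes
        ensemble.append((alpha, dim, thr, sign))
    return ensemble

def adaboost_predict(X, ensemble):
    """Step 5: sign of the alpha-weighted vote of the stumps."""
    score = sum(a * s * np.where(X[:, dim] > thr, 1, -1)
                for a, dim, thr, s in ensemble)
    return np.sign(score)
```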
Neural Networks: Multilayer Perceptrons
- Multiple layers of logistic regression models, with the parameters of each layer optimized by training
- Motivated by models of the brain; a powerful learning model regardless
Linear and Logistic Regression Remembered
- Linear models with fixed basis functions: $y(x, w) = f\big(\sum_j w_j \phi_j(x)\big)$, where the $\phi_j$ are fixed non-linear transformations
- Making the basis functions adaptive: a linear combination followed by a non-linear transformation, $y(x; w, v) = f\big(\sum_{j=1}^{M} w_j \, h\big(\sum_{i=1}^{D} v_{ji} x_i\big)\big)$
Feed-Forward Network Functions
- Form M linear combinations of the input variables: $a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i$
- Apply a non-linear activation function: $z_j = h(a_j)$
- Form linear combinations of the hidden units to get the output activations: $a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j$
- Apply the output activation function to get the outputs: $y_k = \sigma(a_k)$
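A minimal NumPy sketch of this forward pass for a two-layer network, assuming tanh hidden activations and a logistic-sigmoid output activation (the slide leaves h and sigma unspecified); the weights in the demo are random, made-up values, and bias terms are omitted as in the equations above.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer forward pass: a_j = sum_i W1[j,i] x_i ; z_j = h(a_j) ;
    a_k = sum_j W2[k,j] z_j ; y_k = sigma(a_k)."""
    a_hidden = W1 @ x                  # hidden pre-activations a_j
    z = np.tanh(a_hidden)              # hidden activations z_j = h(a_j)
    a_out = W2 @ z                     # output pre-activations a_k
    y = 1.0 / (1.0 + np.exp(-a_out))   # output activations y_k = sigma(a_k)
    return y, (a_hidden, z, a_out)

# Illustrative (made-up) weights: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
y, _ = forward(np.array([0.5, -1.0, 2.0]), W1, W2)
print(y)
```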
Network Representation
[Figure: a two-layer feed-forward network diagram with input nodes x, hidden nodes z, output nodes y, and the weights connecting them]
- Easy to generalize to multiple layers
Power of Feed-Forward Networks
- Universal approximators: a two-layer network with linear outputs can uniformly approximate any smooth continuous function to arbitrary accuracy, given a sufficient number of nodes in the hidden layer
- Why are more than two layers needed?
Training
- Formulate the error function in terms of the weights: $E(w, v) = \sum_{n=1}^{N} \big( y(x_n; w, v) - t_n \big)^2$
- Optimize the weights using gradient descent: $(w, v)^{(\tau+1)} = (w, v)^{(\tau)} - \eta \, \nabla E\big((w, v)^{(\tau)}\big)$
- Deriving the gradient looks complicated because of the feed-forward structure
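The gradient-descent update itself is simple; here is a generic sketch of the rule above, demonstrated on a toy quadratic error rather than the network error E(w, v).

```python
import numpy as np

def gradient_descent(grad_E, theta0, eta=0.1, n_steps=100):
    """Plain batch gradient descent: theta <- theta - eta * grad E(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad_E(theta)
    return theta

# Toy example: E(theta) = ||theta - 3||^2, so grad E(theta) = 2 * (theta - 3).
print(gradient_descent(lambda th: 2 * (th - 3.0), np.zeros(2)))   # converges to [3., 3.]
```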
Error Backpropagation
- The full gradient is obtained from a sequence of local computations and propagations over the network.
- Output layer: $\frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}} = \delta_k z_j$, with $\delta_k = y_k - t_k$
- Hidden layer: $\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j x_i$, with $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$
Backpropagation Algorithm
1. Apply the input vector $x_n$ and compute the derived variables $a_j, z_j, a_k, y_k$
2. Compute $\delta_k$ at all output nodes
3. Back-propagate the $\delta_k$ to compute $\delta_j$ at all hidden nodes
4. Compute the derivatives $\partial E_n / \partial w_{kj}$ and $\partial E_n / \partial w_{ji}$
5. Batch: sum the derivatives over all input vectors
- Vanishing gradient problem
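A minimal sketch of these steps for the two-layer network from the forward-pass sketch, assuming tanh hidden units and, to keep $\delta_k = y_k - t_k$ exact, linear output units with the squared error; the weights and training case in the demo are made up.

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Gradients of the squared error for a two-layer network
    (tanh hidden units, linear output units)."""
    a_hidden = W1 @ x
    z = np.tanh(a_hidden)
    y = W2 @ z                                  # linear output activations
    delta_k = y - t                             # output-layer deltas
    delta_j = (1 - z ** 2) * (W2.T @ delta_k)   # hidden deltas: h'(a_j) * sum_k w_kj delta_k
    grad_W2 = np.outer(delta_k, z)              # dE/dw_kj = delta_k * z_j
    grad_W1 = np.outer(delta_j, x)              # dE/dw_ji = delta_j * x_i
    return grad_W1, grad_W2

# Illustrative single training case with made-up weights.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
g1, g2 = backprop(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0]), W1, W2)
print(g1.shape, g2.shape)   # (4, 3) (2, 4)
```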
Neural Network Regularization
- Given such a large number of parameters, preventing overfitting is vitally important
- Choosing the number of layers and the number of hidden nodes
- Controlling the weights: weight decay, early stopping, weight sharing
- Structural regularization: convolutional neural networks for invariances in image data
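Weight decay is the simplest of these to show in code: a sketch of one gradient step on an L2-regularized error, where the strength lam is an assumed hyperparameter.

```python
import numpy as np

def step_with_weight_decay(W, grad_W, eta=0.1, lam=1e-3):
    """One gradient step on E(W) + (lam/2) * ||W||^2: the extra lam * W term
    continually shrinks (decays) the weights toward zero."""
    return W - eta * (grad_W + lam * W)
```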
So Which Classifier Is the Best in Practice?
- Extensive experimentation by Caruana et al. (2006); see also more recent studies
- Low dimensions (9-200 features): 1. boosted decision trees, 2. random forests, 3. bagged decision trees, 4. SVM, 5. neural nets, 6. k-nearest neighbors, 7. boosted stumps, 8. decision tree, 9. logistic regression, 10. naïve Bayes
- High dimensions (500-100K features): 1. HMC MLP, 2. boosted MLP, 3. bagged MLP, 4. boosted trees, 5. random forests
- Remember the No Free Lunch theorem