Data Analytics: Understanding Classification Using Decision Trees

Explore the concept of classification in data analytics through decision trees, which involve determining group assignments for data elements based on attributes. Learn how classification works, the training process, and the goals of creating predictive models using decision trees.

  • Data Analytics
  • Decision Trees
  • Classification
  • Machine Learning




Presentation Transcript


  1. MIS2502: Data Analytics Classification using Decision Trees

  2. What is classification? Determining to what group a data element belongs, or attributes of that entity. Examples:
  • Determining whether a customer should be given a loan
  • Flagging a credit card transaction as a fraudulent charge
  • Categorizing a news story as finance, entertainment, or sports

  3. How classification works
  1. Choose a set of records for the training set
  2. Choose a set of records for the validation set
  3. Choose a class attribute (classifier)
  4. Find a model that predicts the class attribute as a function of the other attributes
  5. Apply that model to the validation set to check accuracy
  6. Apply the final model to future records to classify them

  4. Decision Tree Learning
  Training Set (fed to classification software to derive the model):

  Trans. ID   Charge Amount   Avg. Charge (6 months)   Item          Same state as billing   Classification
  1           $800            $100                     Electronics   No                      Fraudulent
  2           $60             $100                     Gas           Yes                     Legitimate
  3           $1              $50                      Gas           No                      Fraudulent
  4           $200            $100                     Restaurant    Yes                     Legitimate
  5           $50             $40                      Gas           No                      Legitimate
  6           $80             $80                      Groceries     Yes                     Legitimate
  7           $140            $100                     Retail        No                      Legitimate
  8           $140            $100                     Retail        No                      Fraudulent

  Validation Set (the derived model is applied to these records):

  Trans. ID   Charge Amount   Avg. Charge (6 months)   Item          Same state as billing   Classification
  101         $100            $200                     Electronics   Yes                     ?
  102         $200            $100                     Groceries     No                      ?
  103         $1              $100                     Gas           Yes                     ?
  104         $30             $25                      Restaurant    Yes                     ?
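The slides derive the model with dedicated classification software. As a rough sketch of the same six-step workflow in Python (assuming pandas and scikit-learn, which the slides themselves do not use):

```python
# Minimal sketch of the slide-3 workflow on the slide-4 fraud data.
# Assumes pandas and scikit-learn are installed; this is an illustration,
# not the classification software used in the course.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: training and validation sets (from the tables above).
train = pd.DataFrame({
    "charge":     [800, 60, 1, 200, 50, 80, 140, 140],
    "avg_charge": [100, 100, 50, 100, 40, 80, 100, 100],
    "item":       ["Electronics", "Gas", "Gas", "Restaurant",
                   "Gas", "Groceries", "Retail", "Retail"],
    "same_state": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
    "class":      ["Fraudulent", "Legitimate", "Fraudulent", "Legitimate",
                   "Legitimate", "Legitimate", "Legitimate", "Fraudulent"],
})
validation = pd.DataFrame({
    "charge":     [100, 200, 1, 30],
    "avg_charge": [200, 100, 100, 25],
    "item":       ["Electronics", "Groceries", "Gas", "Restaurant"],
    "same_state": ["Yes", "No", "Yes", "Yes"],
})

# Steps 3-4: "class" is the class attribute; one-hot encode the categorical
# predictors and find a model that predicts it from the other attributes.
X_train = pd.get_dummies(train.drop(columns="class"))
model = DecisionTreeClassifier().fit(X_train, train["class"])

# Steps 5-6: apply the model to the held-out records to classify them.
X_val = pd.get_dummies(validation).reindex(columns=X_train.columns, fill_value=0)
print(model.predict(X_val))
```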

  5. Goals
  • The trained model should assign new cases to the right category. It won't be 100% accurate, but it should be as close as possible.
  • The model's rules can be applied to new records as they come along: an automated, reliable way to predict the outcome.

  6. Classification Method: The Decision Tree
  A model to predict membership of cases, or values of a dependent variable, based on one or more predictor variables (Tan, Steinbach, and Kumar 2004).

  7. Example: Credit Card Default
  We create the tree from a set of training data. Each unique combination of predictors is associated with an outcome. This set was rigged so that every combination is accounted for and has an outcome.

  Training Data:

  TID   Income   Debt   Owns/Rents   Outcome
  1     25k      35%    Owns         Default
  2     35k      40%    Rents        Default
  3     33k      15%    Owns         No default
  4     28k      19%    Rents        Default
  5     55k      30%    Owns         No default
  6     48k      35%    Rents        Default
  7     65k      17%    Owns         No default
  8     85k      10%    Rents        No default

  8. Example: Credit Card Default
  Built from the training data above, the tree looks like this. Credit Approval is the root node, each split (Income, Debt, Owns/Rents) creates child nodes, and the outcomes are leaf nodes:

  Credit Approval
    Income <40k
      Debt >20%
        Owns house -> Default
        Rents      -> Default
      Debt <20%
        Owns house -> No Default
        Rents      -> Default
    Income >40k
      Debt >20%
        Owns house -> No Default
        Rents      -> Default
      Debt <20%
        Owns house -> No Default
        Rents      -> No Default
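Because the tree is just nested rules, it can be applied with plain conditionals. A minimal sketch of the slide-8 tree (the function name and argument encoding are my own; debt is a fraction of income):

```python
def classify(income, debt, owns):
    """Slide-8 tree: split on income, then debt, then owns/rents."""
    if income < 40_000:
        if debt > 0.20:
            return "Default"                      # owner or renter alike
        return "No Default" if owns else "Default"
    if debt > 0.20:
        return "No Default" if owns else "Default"
    return "No Default"                           # income >40k, debt <20%

# Training record 5: income 55k, debt 30%, owns a house.
print(classify(55_000, 0.30, owns=True))          # -> No Default
```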

  9. Same Data, Different Tree
  We just changed the order of the predictors (Owns/Rents first, then Income, then Debt), and the same training data yields a different tree:

  Credit Approval
    Owns
      Income <40k
        Debt >20% -> Default
        Debt <20% -> No Default
      Income >40k
        Debt >20% -> No Default
        Debt <20% -> No Default
    Rents
      Income <40k
        Debt >20% -> Default
        Debt <20% -> Default
      Income >40k
        Debt >20% -> Default
        Debt <20% -> No Default

  10. Apply to new (validation) data
  Validation Data:

  TID   Income   Debt   Owns/Rents   Decision (Predicted)   Decision (Actual)
  1     80k      35%    Rents        Default                No Default
  2     20k      40%    Owns         Default                Default
  3     15k      15%    Owns         No Default             No Default
  4     50k      19%    Rents        No Default             Default
  5     35k      30%    Owns         Default                No Default

  How well did the decision tree (from slide 8) do in predicting the outcome? When it's good enough, we've got our model for future decisions.
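Scoring these five records with the classify() sketch from slide 8 makes the accuracy question concrete (the encoding is my own):

```python
# (income, debt, owns_house, actual outcome) for the validation records above.
validation = [
    (80_000, 0.35, False, "No Default"),
    (20_000, 0.40, True,  "Default"),
    (15_000, 0.15, True,  "No Default"),
    (50_000, 0.19, False, "Default"),
    (35_000, 0.30, True,  "No Default"),
]
hits = sum(classify(income, debt, owns) == actual
           for income, debt, owns, actual in validation)
print(f"{hits} of {len(validation)} correct")     # -> 2 of 5 correct
```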

  11. In a real situation
  The tree induction software has to deal with instances where:
  • The same set of predictors results in different outcomes
  • Multiple paths result in the same outcome
  • Not every combination of predictors is in the training set

  12. Tree Induction Algorithms
  Tree induction algorithms take large sets of data and compute the tree. Similar cases may have different outcomes, so the probability of an outcome is computed. For instance, you may find that when income > 40k, debt < 20%, and the customer rents, no default occurs 80% of the time, so that leaf carries a probability of 0.8.

  13. How the induction algorithm works
  1. Start with a single node holding all the training data.
  2. Are the samples all of the same classification? If yes, this node is finished; go to step 5.
  3. If no: are there predictor(s) that will split the data? If not, this node is finished; go to step 5.
  4. Partition the node into child nodes according to the predictor(s).
  5. Are there more nodes (i.e., new child nodes) to process? If yes, go to the next node and repeat from step 2. If no: DONE!
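A minimal sketch of that loop in Python, run over the slide-7 credit data with the predictors pre-binned into the slide's categories. For simplicity it splits on the first predictor whose values differ; real induction algorithms choose the split statistically, as the next slides describe:

```python
def induce(records, predictors):
    """Recursively partition records until each node is a single class."""
    outcomes = {r["outcome"] for r in records}
    if len(outcomes) == 1:                 # all the same classification: leaf
        return outcomes.pop()
    for p in predictors:                   # any predictor that splits the data?
        values = {r[p] for r in records}
        if len(values) > 1:
            rest = [q for q in predictors if q != p]
            return {p: {v: induce([r for r in records if r[p] == v], rest)
                        for v in values}}
    return outcomes                        # out of predictors: mixed leaf

train = [
    {"income": "<40k", "debt": ">20%", "owns": "Owns",  "outcome": "Default"},
    {"income": "<40k", "debt": ">20%", "owns": "Rents", "outcome": "Default"},
    {"income": "<40k", "debt": "<20%", "owns": "Owns",  "outcome": "No default"},
    {"income": "<40k", "debt": "<20%", "owns": "Rents", "outcome": "Default"},
    {"income": ">40k", "debt": ">20%", "owns": "Owns",  "outcome": "No default"},
    {"income": ">40k", "debt": ">20%", "owns": "Rents", "outcome": "Default"},
    {"income": ">40k", "debt": "<20%", "owns": "Owns",  "outcome": "No default"},
    {"income": ">40k", "debt": "<20%", "owns": "Rents", "outcome": "No default"},
]
print(induce(train, ["income", "debt", "owns"]))
```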

  14. Start with root node
  We begin with a single root node, Credit Approval, holding all the training data (the finished tree from slide 8 is shown alongside for reference). There are both defaults and no defaults in the set, so we need to look for predictors to split the data.

  15. Split on income
  Income is a factor: (Income <40k, Debt >20%, Owns) leads to Default, but (Income >40k, Debt >20%, Owns) leads to No default. But there is also a mix of defaults and no defaults within each income group, so look for another split. So far the tree is just Credit Approval splitting into Income <40k and Income >40k.

  16. Split on debt
  Debt is also a factor: (Income <40k, Debt <20%, Owns) leads to No default, but (Income <40k, Debt <20%, Rents) leads to Default. There is still a mix of defaults and no defaults within some debt groups, so look for another split. Each income branch now splits into Debt >20% and Debt <20%.

  17. Split on Owns/Rents
  Owns/Rents is a factor too. For some cases it doesn't matter, but for some it does, so similar branches are grouped. And we stop because we're out of predictors! The finished tree:

  Credit Approval
    Income <40k
      Debt >20% -> Default
      Debt <20%
        Owns house -> No Default
        Rents      -> Default
    Income >40k
      Debt >20%
        Owns house -> No Default
        Rents      -> Default
      Debt <20% -> No Default

  18. How does it know when and how to split?
  There are statistics that show:
  • When a predictor variable maximizes distinct outcomes (if age is a predictor, we should see that older people buy and younger people don't)
  • When a predictor variable separates outcomes (if age is a predictor, we should not see older people who buy mixed up with older people who don't)

  19. Example: Chi-squared test
  Is the proportion of the outcome class the same in each child node? It shouldn't be, or the classification isn't very helpful. Consider this split:

  Root (n = 1500): Default = 750, No Default = 750
    Owns (n = 850):  Default = 300, No Default = 550
    Rents (n = 650): Default = 450, No Default = 200

  Observed      Owns   Rents   Total
  Default        300     450     750
  No Default     550     200     750
  Total          850     650    1500

  Expected      Owns   Rents   Total
  Default        425     325     750
  No Default     425     325     750
  Total          850     650    1500

  The test statistic is $\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$.

  20. Chi-squared test
  Plugging the Observed and Expected tables from the previous slide into the formula:

  $\chi^2 = \frac{(300-425)^2}{425} + \frac{(550-425)^2}{425} + \frac{(450-325)^2}{325} + \frac{(200-325)^2}{325} = 36.7 + 36.7 + 48.0 + 48.0 = 169.4$, with $p < 0.0001$.

  If the groups were the same, you'd expect an even split (Expected), but we can see they aren't distributed evenly (Observed). So Owns/Rents is a predictor that creates two different groups. Small p-values (i.e., less than 0.05) mean it's very unlikely the groups are the same; but is the difference enough (i.e., statistically significant)?
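The same arithmetic can be checked with scipy's contingency-table test (assuming scipy is available; correction=False matches the plain formula on the slide, without the continuity correction scipy otherwise applies to 2x2 tables):

```python
from scipy.stats import chi2_contingency

observed = [[300, 450],   # Default:    Owns, Rents
            [550, 200]]   # No Default: Owns, Rents
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p)            # ~169.7 (the slide's 169.4 reflects rounding); p << 0.0001
print(expected)           # [[425. 325.] [425. 325.]], as on the slide
```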

  21. Bottom line: Interpreting the Chi-Squared Test
  A high statistic (low p-value) from the chi-squared test means the groups are different. SAS shows you the logworth value, which is -log(p-value), a way to compare split variables (bigger logworth = better split). Low p-values are better: -log(0.05) ≈ 2.99 and -log(0.0001) ≈ 9.21.
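A quick check of those logworth values (the slide's figures match the natural logarithm):

```python
import math

for p in (0.05, 0.0001):
    print(p, -math.log(p))   # -> 0.05: ~2.996, 0.0001: ~9.210
```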

  22. Reading the SAS Decision Tree
  Outcome cases are not 100% certain: there are probabilities attached to each outcome in a node. So let's code Default as 1 and No Default as 0:

  Credit Approval
    Income <40k
      Debt >20% -> 0: 15%  1: 85%
      Debt <20%
        Owns house -> 0: 80%  1: 20%
        Rents      -> 0: 70%  1: 30%
    Income >40k
      Debt >20%
        Owns house -> 0: 78%  1: 22%
        Rents      -> 0: 40%  1: 60%
      Debt <20% -> 0: 74%  1: 26%

  So what is the chance that:
  • A renter making more than $40,000 with debt at more than 20% of income will default?
  • A home owner making less than $40,000 with debt at more than 20% of income will default?
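Reading the answers off the leaves above: the grouped leaves can be stored as a simple lookup table (the tuple encoding is my own):

```python
# (income_over_40k, debt_over_20pct, owns_house) -> P(default), from the tree above.
p_default = {
    (False, True,  True):  0.85, (False, True,  False): 0.85,
    (False, False, True):  0.20, (False, False, False): 0.30,
    (True,  True,  True):  0.22, (True,  True,  False): 0.60,
    (True,  False, True):  0.26, (True,  False, False): 0.26,
}

print(p_default[(True, True, False)])   # renter, >40k, debt >20%  -> 0.6
print(p_default[(False, True, True)])   # owner, <40k, debt >20%   -> 0.85
```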
