Understanding Decision Trees for Data Analysis
Dive into the world of decision trees: from their basic construction and popular algorithms like ID3 and C4.5 to the GUHA-based ETree approach of Petr Berka. Explore the principles, structure, and parameters involved in creating decision trees for data analysis.
Presentation Transcript
Martin Ralbovský, KIZI FIS VŠE, 6.12.2007
The GUHA method: provides a general framework for retrieving interesting information from data and has strong foundations in logic and statistics. One of the main principles of the method is to offer the user everything in the data that is interesting.
Decision trees: one of the best-known classification methods. Several well-known algorithms exist for constructing decision trees (ID3, C4.5, ...). Algorithm outline: iterate through the attributes and in each step choose the best attribute for branching, creating a node from that attribute. Only the single best decision tree appears in the output.
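To make the greedy outline concrete, here is a minimal sketch of classic top-down tree construction in the spirit of ID3, assuming categorical attributes and using information gain as the selection criterion. The types and names (Example, GreedyTree, Build) are illustrative only and are not part of Ferda or of the original ID3/C4.5 implementations.

using System;
using System.Collections.Generic;
using System.Linq;

// One training example: categorical attribute values plus a class label.
sealed record Example(IReadOnlyDictionary<string, string> Values, string Label);

static class GreedyTree
{
    // Shannon entropy of the class label over a set of examples.
    static double Entropy(List<Example> data) =>
        data.GroupBy(e => e.Label)
            .Sum(g => { double p = (double)g.Count() / data.Count; return -p * Math.Log2(p); });

    // Information gain of splitting the data by the given attribute.
    static double Gain(List<Example> data, string attribute) =>
        Entropy(data)
        - data.GroupBy(e => e.Values[attribute])
              .Sum(g => (double)g.Count() / data.Count * Entropy(g.ToList()));

    // Greedy construction: choose the single best attribute, branch, recurse.
    // An inner node is (attribute, children); a leaf is the majority label.
    public static object Build(List<Example> data, ISet<string> attributes)
    {
        if (attributes.Count == 0 || data.Select(e => e.Label).Distinct().Count() == 1)
            return data.GroupBy(e => e.Label).OrderByDescending(g => g.Count()).First().Key;

        string best = attributes.OrderByDescending(a => Gain(data, a)).First();
        var children = new Dictionary<string, object>();
        foreach (var branch in data.GroupBy(e => e.Values[best]))
            children[branch.Key] = Build(branch.ToList(),
                                         attributes.Where(a => a != best).ToHashSet());
        return (best, children);
    }
}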
Making decision trees the GUHA way (Petr Berka): a decision tree can be viewed as a GUHA verification/hypothesis, but the classic algorithms leave only one tree in the output. The ETree procedure is a modification of the initial algorithm: we do not branch according to the single best attribute, but according to the n best attributes; in each iteration, nodes suitable for branching are selected from the existing trees and branched; only sound decision trees go to the output.
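A rough sketch of this key modification, assuming the χ² statistic named on the parameters slide below as the ordering criterion: instead of keeping only the highest-scoring attribute, the n highest-scoring attributes are kept, and each of them later yields a separate candidate branching. The names below (ChiSquared, NBestAttributes) and the contingency-table layout are assumptions for illustration, not the actual ETree code.

using System;
using System.Collections.Generic;
using System.Linq;

static class AttributeOrdering
{
    // Chi-squared statistic of an attribute-vs-class contingency table.
    // Rows = attribute categories, columns = classes, cell = observed count.
    public static double ChiSquared(int[,] observed)
    {
        int rows = observed.GetLength(0), cols = observed.GetLength(1);
        var rowSum = new double[rows];
        var colSum = new double[cols];
        double total = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                rowSum[r] += observed[r, c];
                colSum[c] += observed[r, c];
                total += observed[r, c];
            }
        double chi = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                double expected = rowSum[r] * colSum[c] / total;
                if (expected > 0)
                    chi += Math.Pow(observed[r, c] - expected, 2) / expected;
            }
        return chi;
    }

    // Instead of the single best attribute, return the n best ones;
    // each of them produces one candidate branching of the node.
    public static IEnumerable<string> NBestAttributes(
        IReadOnlyDictionary<string, int[,]> tablesByAttribute, int n) =>
        tablesByAttribute.OrderByDescending(kv => ChiSquared(kv.Value))
                         .Take(n)
                         .Select(kv => kv.Key);
}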
ETree parameters (Petr Berka). Criterion for attribute ordering: χ². Trees: maximal tree depth (parameter); allow only full-length trees; number of attributes for branching. Branching: minimal node frequency; minimal node purity; stopping criterion for branching (frequency, purity, frequency OR purity). Sound trees: confusion matrix, F-measure + any 4ft-quantifier in Ferda.
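The slide names the confusion matrix and F-measure as the soundness criterion; a minimal sketch of that computation for one class of interest follows. The helper name FMeasure and the binary true/false-positive framing are illustrative assumptions; in Ferda, any 4ft-quantifier can play this role instead.

static class TreeQuality
{
    // F-measure (harmonic mean of precision and recall) computed from the
    // confusion matrix of the tree's predictions for the class of interest.
    public static double FMeasure(int truePositives, int falsePositives, int falseNegatives)
    {
        if (truePositives == 0) return 0.0;
        double precision = (double)truePositives / (truePositives + falsePositives);
        double recall = (double)truePositives / (truePositives + falseNegatives);
        return 2 * precision * recall / (precision + recall);
    }
}

A tree would then count as sound when this value reaches a chosen threshold, which is how the F-threshold columns in the result tables below can be read.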
How to branch I: diagram (not reproduced here) contrasting two branching options, (a) and (b), for Attribute1 = {A, B, C} and Attribute2 = {1, 2}.
How to branch II: diagram (not reproduced here) showing the same options (a) and (b) when the nodes are branched further by Attribute2 = {1, 2}.
Pseudocode of the algorithm

// Depth-first enumeration of candidate trees using a LIFO stack.
Stack<Tree> stack = new Stack<Tree>();
stack.Push(MakeSeedTree());
while (stack.Count > 0)
{
    Tree processTree = stack.Pop();
    // Push every tree obtained by branching a suitable node of processTree.
    foreach (Node n in NodesForBranching(processTree))
    {
        stack.Push(CreateTree(processTree, n));
    }
    // Output the tree if it passes the quality (soundness) criterion.
    if (QualityTree(processTree))
    {
        PutToOutput(processTree);
    }
}
Implementation in Ferda: instead of creating a new data mining tool, the modularity of Ferda was used. Existing data preparation boxes are reused, 4ft-quantifiers can be used to measure the quality of trees, and the MiningProcessor (the bit string generation engine) is employed.
ETree task example: a Ferda screenshot showing 4ft-quantifiers, the ETree task box, and existing data preparation boxes.
Experiment 1 - Barbora: the Barbora bank data, approx. 6,100 clients; classification of client status from loan amount, client district, loan duration, and client salary. Number of attributes for branching = 4; minimal node purity = 0.8; minimal node frequency = 61 (1% of the data).
Results - Barbora (performance: 36 verifications/sec)

Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1          | 0.5         | 5             | 2          | 0.75
2          | 0.7         | 17            | 7          | 0.88
3          | 0.85        | 193           | 26         | 0.88
4          | 0.85        | 7910          | 222        | 0.90
Experiment 2: Forest tree cover. A UCI KDD dataset for classification (10K sample); classification of tree cover based on the characteristics: wilderness area, elevation, slope, horizontal and vertical distance to hydrology, and horizontal distance to fire points. Number of attributes for branching: 1, 3, 5; minimal node purity: 0.8; minimal node frequency: 100 (1% of the dataset).
Results - Forest tree cover

Attributes for branching: 1 (performance: 39 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1          | 0.5         | 2             | 1          | 0.50
2          | 0.7         | 2             | 1          | 0.71
3          | 0.71        | 10            | 8          | 0.71
4          | 0.72        | 59            | 21         | 0.72

Attributes for branching: 3 (performance: 86 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1          | 0.5         | 4             | 1          | 0.50
2          | 0.7         | 21            | 5          | 0.72
3          | 0.72        | 273           | 67         | 0.73
4          | 0.74        | 6673          | 73         | 0.74

Attributes for branching: 5 (performance: 71 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1          | 0.5         | 6             | 1          | 0.50
2          | 0.7         | 31            | 7          | 0.73
3          | 0.73        | 551           | 52         | 0.74
4          | 0.74        | 17396         | 183        | 0.75
Experiment 3: Forest tree cover. Construction of trees for the whole dataset (approx. 600K rows). Does increasing the number of attributes for branching result in better trees? Tree depth = 3; the other parameters are the same as in Experiment 2. Number of attributes for branching = 1: best hypothesis 0.30, 6 VPS (strings in cache). Number of attributes for branching = 4: best hypothesis 0.52, 2 VPS (strings in cache).
Verifications: 4FT vs. ETree. On tasks over data tables of similar length, 4FT (in Ferda) reaches approx. 5,000 VPS, while ETree reaches about 70 VPS. The ETree verification is far more complicated: in addition to computing the quantifier, χ² is computed for each node suitable for branching; hard operations (sums) are used instead of easy operations (conjunctions); and not only is a tree verified, new trees derived from that tree are also constructed.
Further work: How new/well-known is the method? Boxes for attribute selection criteria; a classification box; better result browsing and result reduction; optimization; elective classification, where each tree has a vote (Petr Berka); experiments with various data sources; decision trees from fuzzy attributes; better estimation of the count of relevant questions.