
Introduction to Classification: Basic Concepts and Techniques
Explore the fundamental concepts of classification, including decision trees, model evaluation, and various techniques used in machine learning. Learn about the definition of classification, common tasks, and examples of its applications in real-world scenarios. Discover popular classification techniques such as decision tree-based methods, neural networks, support vector machines, and ensemble approaches.
Presentation Transcript
Classification: Basic Concepts, Decision Trees, and Model Evaluation (COSC 3337)
COSC 3337: Classification Part 1: Introduction to Classification: Basic Concepts and Decision Trees 1. Decision Tree Basics 2. Decision Tree Induction Algorithms 3. Overfitting and Underfitting (separate Slideshow!) 4. Other Issues with Decision Trees 5. Advantages and Disadvantages of Decision Trees 6. Regression Trees (brief) 7. Model Evaluation
COSC 3337: Classification Part 2: 1. K-nearest Neighbors 2. Support Vector Machines 3. Neural Networks 4. Ensemble Approaches (not covered in 2024)
Classification: Definition o Given a collection of records (training set): each record contains a set of attributes, one of which is the class. o Find a model for the class attribute as a function of the values of the other attributes. o Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task
(Figure: a learning algorithm is applied to a training set of labeled records (Tid 1-10 with attributes Attrib1-Attrib3 and a known Class) to learn a model (induction); the model is then applied to a test set of records with unknown class (Tid 11-15) to predict their classes (deduction).)
Examples of Classification Task o Predicting tumor cells as benign or malignant o Classifying credit card transactions as legitimate or fraudulent o Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil o Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques o Decision Tree based Methods * o Rule-based Methods o Memory based reasoning, instance-based learning, kNN o Neural Networks * o Naïve Bayes and Bayesian Belief Networks o Support Vector Machines * o Ensemble Methods ? *:= to be discussed in Fall 2024
Example of a Decision Tree
Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1 Yes Single 125K No | 2 No Married 100K No | 3 No Single 70K No | 4 Yes Married 120K No | 5 No Divorced 95K Yes | 6 No Married 60K No | 7 Yes Divorced 220K No | 8 No Single 85K Yes | 9 No Married 75K No | 10 No Single 90K Yes
Model: decision tree with splitting attributes Refund, MarSt, TaxInc:
Refund = Yes -> NO; Refund = No -> test MarSt; MarSt = Married -> NO; MarSt = Single, Divorced -> test TaxInc; TaxInc < 80K -> NO; TaxInc >= 80K -> YES
Another Example of Decision Tree
Same training data as on the previous slide. Model: first split on MarSt (Married -> NO; Single, Divorced -> test Refund), then on Refund (Yes -> NO; No -> test TaxInc), then on TaxInc (< 80K -> NO; >= 80K -> YES).
There could be more than one tree that fits the same data!
COSC 3337: Classification Part 1: Introduction to Classification: Basic Concepts and Decision Trees 1. Decision Tree Basics 2. Decision Tree Induction Algorithms 3. Overfitting and Underfitting (separate Slideshow!) 4. Other Issues with Decision Trees 5. Advantages and Disadvantages of Decision Trees 6. Regression Trees (brief) 7. Model Evaluation
Another Example of Decision Tree (re-shown after the outline): same training data, with the tree starting at MarSt. There could be more than one tree that fits the same data!
Decision Tree Induction o Goal: given a training set, create a decision tree! o Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT
General Structure of Hunt's Algorithm
o Let Dt be the set of training records that reach a node t.
o General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
(Training data: the Refund / Marital Status / Taxable Income / Cheat table from the earlier slides; Dt is the subset of those records reaching node t.)
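A minimal Python sketch of this recursive procedure. The `choose_test` helper and the majority-class default passed to the recursive calls are illustrative assumptions, not part of the slide's algorithm statement:

```python
from collections import Counter

def hunts(records, labels, default_class, choose_test):
    """Grow a decision tree with Hunt's recursive procedure (illustrative sketch).

    records       -- the training records reaching this node (Dt)
    labels        -- the class label of each record in Dt
    default_class -- label assigned when Dt is empty (the default class yd)
    choose_test   -- assumed helper: picks an attribute test and returns
                     {outcome: (sub_records, sub_labels)}, or None if no split helps
    """
    if not records:                              # Dt is empty -> leaf labeled with the default class
        return {"leaf": default_class}
    if len(set(labels)) == 1:                    # all records share the same class yt -> leaf labeled yt
        return {"leaf": labels[0]}
    split = choose_test(records, labels)         # Dt is impure -> pick an attribute test
    if split is None:                            # no attribute separates the classes
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    majority = Counter(labels).most_common(1)[0][0]
    return {outcome: hunts(sub_r, sub_l, majority, choose_test)   # recurse on each subset
            for outcome, (sub_r, sub_l) in split.items()}
```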
Hunt's Algorithm (illustrated on the training data)
Step 1: a single leaf predicting the majority class: Don't Cheat.
Step 2: split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: refine the Refund = No branch by splitting on Marital Status: Married -> Don't Cheat; Single, Divorced -> split further.
Step 4: split the Single, Divorced branch on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
Tree Induction
o Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Remark: finding optimal decision trees is NP-hard (http://en.wikipedia.org/wiki/NP-hard).
o Issues: 1. Determine how to split the records (How to specify the attribute test condition? How to determine the best split?) 2. Determine when to stop splitting.
Tree Induction
o Greedy strategy (http://en.wikipedia.org/wiki/Greedy_algorithm): creates the tree top-down, starting from the root, and splits the records based on an attribute test that optimizes a certain criterion.
o Issues: determine how to split the records (How to specify the attribute test condition? How to determine the best split?); determine when to stop splitting.
Side Discussion: Greedy Algorithms
o Make locally optimal choices at each stage.
o Produce a solution quickly and are therefore attractive for NP-hard problems and other problems with high complexity. Later decisions are made in the context of decisions selected earlier, dramatically reducing the size of the search space.
o They do not backtrack: if they make a bad decision (based on local criteria), they never revise it.
o They are not guaranteed to find the optimal solution(s), and sometimes they can get deceived and find really bad solutions.
o In spite of the above, a lot of successful and popular algorithms in computer science are greedy algorithms.
o Greedy algorithms are particularly popular in AI and Operations Research.
o See also: http://en.wikipedia.org/wiki/Greedy_algorithm
Popular greedy algorithms: decision tree induction, ...
How to Specify Test Condition? o Depends on attribute types Nominal Ordinal Continuous o Depends on number of ways to split 2-way split Multi-way split
Splitting Based on Nominal Attributes
o Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
o Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family} OR {Family, Luxury} vs. {Sports}.
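For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary partitions. A small Python sketch (the function name is illustrative) that enumerates them for the CarType example:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all 2^(k-1) - 1 ways to split k nominal values into two non-empty subsets."""
    values = sorted(values)
    first, rest = values[0], values[1:]          # fix one value on the left to avoid mirror duplicates
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                            # skip the trivial split with an empty right side
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# Prints the 3 candidate binary splits, e.g. {'Family'} vs {'Luxury', 'Sports'}
```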
Splitting Based on Ordinal Attributes
o Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
o Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large} OR {Medium, Large} vs. {Small}.
o What about the split {Small, Large} vs. {Medium}? (It does not respect the order of the attribute values.)
Splitting Based on Continuous Attributes
o Different ways of handling:
  - Discretization to form an ordinal categorical attribute. Static: discretize once at the beginning. Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, or supervised clustering.
  - Binary decision: (A < v) or (A >= v). Consider all possible splits and find the best cut v; can be more compute intensive.
Splitting Based on Continuous Attributes (examples)
(i) Binary split: Taxable Income > 80K? Yes / No.
(ii) Multi-way split: Taxable Income? < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
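A hedged NumPy sketch of these options. The 80K threshold and the income values are taken from the slides; the choice of four bins and of quartile edges is an illustrative assumption:

```python
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # Taxable Income (in K)

# Equal-interval (equal-width) bucketing: 4 bins of equal width over the value range
width_edges = np.linspace(income.min(), income.max(), 5)
width_bins = np.digitize(income, width_edges[1:-1])

# Equal-frequency bucketing: edges at the quartiles, so each bin holds roughly the same count
freq_edges = np.percentile(income, [25, 50, 75])
freq_bins = np.digitize(income, freq_edges)

# Binary decision (A < v): a single threshold, e.g. the 80K test from the slides
binary_split = income < 80

print(width_bins, freq_bins, binary_split, sep="\n")
```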
Tree Induction o Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. o Issues: determine how to split the records (How to specify the attribute test condition? How to determine the best split?); determine when to stop splitting.
How to determine the Best Split?
Before splitting: 10 records of class C0 and 10 records of class C1; Before = E(1/2, 1/2).
Candidate tests:
  Own Car? Yes: C0 6, C1 4; No: C0 4, C1 6.
  Car Type? Family: C0 1, C1 3; Sports: C0 8, C1 0; Luxury: C0 1, C1 7.
  Student ID? c1 ... c20: each partition contains a single record (C0 1, C1 0 or C0 0, C1 1).
After (for Car Type): 4/20*E(1/4,3/4) + 8/20*E(1,0) + 8/20*E(1/8,7/8). Gain = Before - After. Pick the test that has the highest gain!
Remark: E stands for Gini, H (entropy), Impurity (1 - Purity), ...
Example: Entropy of a 3-way Split X
x, o, and * represent examples belonging to 3 classes C1, C2, and C3. For a split into clusters C_1, ..., C_k:
H(X) = sum_{r=1..k} ( |C_r| / sum_{p=1..k} |C_p| ) * H(p_r), where p_r is the class distribution inside cluster r.
Before the split: H(0.6, 0.3, 0.1).
After the 3-way split (clusters of sizes 3, 4, and 3):
H(X) = 3/10*H(1/3,1/3,1/3) + 4/10*H(1/2,1/2,0) + 3/10*H(0,1,0) = 3/10*log2(3) + 4/10*1 + 3/10*0 = 0.8754888
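The weighted-entropy arithmetic above can be checked with a few lines of Python (a sketch; `H` here is simply the entropy function used throughout the deck):

```python
from math import log2

def H(*p):
    """Entropy of a class distribution; terms with p = 0 contribute 0."""
    return sum(-q * log2(q) for q in p if q > 0)

after = 3/10 * H(1/3, 1/3, 1/3) + 4/10 * H(1/2, 1/2, 0) + 3/10 * H(0, 1, 0)
print(H(0.6, 0.3, 0.1))   # entropy before the split, ~1.295
print(after)              # ~0.8755, matching the slide's 0.8754888
```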
How to determine the Best Split
o Greedy approach: nodes with a homogeneous class distribution (pure nodes) are preferred.
o Need a measure of node impurity, named E and H before:
  C0: 5, C1: 5 -> non-homogeneous, high degree of impurity.
  C0: 9, C1: 1 -> homogeneous, low degree of impurity.
Measures of Node Impurity o Gini Index o Entropy o Misclassification error Remark: All three approaches center on selecting tests that maximize purity. They just disagree in how exactly purity is measured.
How to Find the Best Split
Before splitting, the node has class counts C0: N00, C1: N01 and impurity M0.
Candidate test A? splits it into nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with impurities M1 and M2; their weighted impurity is M12.
Candidate test B? splits it into nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with impurities M3 and M4; their weighted impurity is M34.
Gain = M0 - M12 vs. M0 - M34: choose the test with the larger gain.
Measure of Impurity: GINI
o Gini index for a given node t: GINI(t) = 1 - sum_j [p(j|t)]^2
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
  - Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying least interesting information.
  - Minimum (0.0) when all records belong to one class, implying most interesting information.
Examples (class counts C1, C2): 0, 6 -> Gini = 0.000; 1, 5 -> Gini = 0.278; 2, 4 -> Gini = 0.444; 3, 3 -> Gini = 0.500.
Examples for computing GINI: GINI(t) = 1 - sum_j [p(j|t)]^2
C1: 0, C2: 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0.
C1: 1, C2: 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
C1: 2, C2: 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.
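A short Python sketch of GINI(t) that reproduces the counts above:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from per-class record counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
# [0, 6] 0.0   [1, 5] 0.278   [2, 4] 0.444   [3, 3] 0.5
```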
Examples for computing GINI for more than 2 classes: GINI(t) = 1 - sum_j [p(j|t)]^2
GINI((0.2, 0.4, 0.4, 0)) = 1 - 0.2^2 - 2*0.4^2 = 1 - 0.04 - 0.32 = 0.64
GINI((0, 1, 0, 0)) = 0
Splitting Based on GINI
o Used in CART, SLIQ, SPRINT.
o When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = sum_{i=1..k} (n_i / n) * GINI(i)
  where n_i = number of records at child i and n = number of records at node p.
Binary Attributes: Computing the GINI Index
o Splits into two partitions.
o Effect of weighing partitions: larger and purer partitions are sought.
Parent (before split B?): C1 = 6, C2 = 6, Gini = 0.500.
Children: N1 with C1 = 5, C2 = 2; N2 with C1 = 1, C2 = 4.
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * Gini(N1) + 5/12 * Gini(N2) = 0.371
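The weighted GINI_split for this B? example can be checked in Python (a sketch reusing a gini helper like the one above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Quality of a split: sum over children of (n_i / n) * GINI(child i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

n1, n2 = [5, 2], [1, 4]                # class counts (C1, C2) at nodes N1 and N2
print(round(gini(n1), 3))              # ~0.408
print(round(gini(n2), 3))              # 0.32
print(round(gini_split([n1, n2]), 3))  # ~0.371, versus the parent Gini of 0.5
```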
Continuous Attributes: Computing the Gini Index
o Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No).
o Several choices for the splitting value: number of possible splitting values = number of distinct values.
o Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A >= v.
o Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.
(Training data: the Refund / Marital Status / Taxable Income / Cheat table from the earlier slides.)
Continuous Attributes: Computing the Gini Index...
o For efficient computation: for each attribute, sort the attribute on its values, then linearly scan these values, each time updating the count matrix and computing the Gini index; choose the split position that has the least Gini index.
Sorted Taxable Income values: 60 70 75 85 90 95 100 120 125 220
Cheat labels (in sorted order): No No No Yes Yes Yes No No No No
Candidate split positions: 55 65 72 80 87 92 97 110 122 172 230
Gini at each split position: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Best split: Taxable Income <= 97, with Gini = 0.300.
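A Python sketch of this sort-once, scan-once computation. Variable names and the midpoint cut points are illustrative; the slide labels the best cut 97, while the arithmetic midpoint is 97.5:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_gini_split(values, labels):
    """Sort a continuous attribute once, then scan candidate cut points,
    keeping running class counts for the <= and > partitions."""
    classes = sorted(set(labels))
    data = sorted(zip(values, labels))
    left = {c: 0 for c in classes}                      # counts for A <= v
    right = {c: labels.count(c) for c in classes}       # counts for A  > v
    n = len(labels)
    best_v, best_g = None, float("inf")
    for i in range(len(data) - 1):
        v, y = data[i]
        left[y] += 1                                    # move one record to the left partition
        right[y] -= 1
        if data[i + 1][0] == v:                         # only cut between distinct values
            continue
        cut = (v + data[i + 1][0]) / 2
        g = ((i + 1) / n) * gini(list(left.values())) + \
            ((n - i - 1) / n) * gini(list(right.values()))
        if g < best_g:
            best_v, best_g = cut, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_gini_split(income, cheat))   # (97.5, 0.3): the Taxable Income <= 97 split from the slide
```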
Alternative Splitting Criteria based on INFO
o Entropy at a given node t: Entropy(t) = - sum_j p(j|t) * log2 p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
  - Measures the homogeneity of a node.
  - Maximum (log2 nc) when records are equally distributed among all nc classes, implying least information.
  - Minimum (0.0) when all records belong to one class, implying most information.
  - Entropy-based computations are similar to the GINI index computations.
Examples for computing Entropy: Entropy(t) = - sum_j p(j|t) * log2 p(j|t)
C1: 0, C2: 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = -0*log2(0) - 1*log2(1) = 0 - 0 = 0.
C1: 1, C2: 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6)*log2(1/6) - (5/6)*log2(5/6) = 0.65.
C1: 2, C2: 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6)*log2(2/6) - (4/6)*log2(4/6) = 0.92.
How to determine the Best Split? (recap of the earlier slide: Gain = Before - After; pick the test with the highest gain, where E can be Gini, entropy, or impurity.)
Splitting Based on INFO...
o Information Gain: GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) * Entropy(i)
  Parent node p is split into k partitions; n_i is the number of records in partition i.
  - Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  - Used in ID3 and C4.5.
  - Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
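With this definition, the Car Type test from the "best split" example can be scored in a few lines of Python (a sketch; the resulting gain of roughly 0.62 follows from the slide's counts but is not stated on the slide itself):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c)

def info_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(parent_counts)
    after = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - after

# "Car Type?" from the best-split example: parent 10/10, children 1/3, 8/0, 1/7
print(round(info_gain([10, 10], [[1, 3], [8, 0], [1, 7]]), 2))   # ~0.62
```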
Splitting Criteria based on Classification Error
o Classification error at a node t: Error(t) = 1 - max_i P(i|t)
o Measures the misclassification error made by a node.
  - Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying least interesting information.
  - Minimum (0.0) when all records belong to one class, implying most interesting information.
Comment: same as impurity, which is (1 - Purity)!
Example: Error((0.4, 0.2, 0, 0.1, 0.3)) = 1 - Purity((0.4, 0.2, 0, 0.1, 0.3)) = 1 - 0.4 = 0.6
Examples for Computing Error: Error(t) = 1 - max_i P(i|t)
C1: 0, C2: 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 - max(0, 1) = 1 - 1 = 0.
C1: 1, C2: 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
C1: 2, C2: 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.
News September 10, 2024
o An updated version of Task1 has been posted.
o There will be a lab taught by Janet on Sept. 17 starting 11:50a which discusses Python Data Science Basics and Task2. Task2 will be posted on the course website later this week; read it before the Sept. 17 lab!
o GHC: Group A will present today, and groups B (at 11:30a) and C will present on Sept. 17 and 19 (for tasks see webpage).
o Dr. Eick's office hour today has been cancelled; there is a makeup office hour tomorrow, Wednesday, 1-2p in MS Teams.
o Raunak will teach the lecture on Sept. 24, which centers on Neural Networks.
o Today's topics: a. Continue the discussion of decision trees and DS basics; b. Group A GHC presentation; c. Brief discussion of Task1; d. Overfitting / Underfitting.
How to determine the Best Split? (re-shown after the announcements; see the earlier slide: Gain = Before - After, pick the test with the highest gain.)
Information Gain vs. Gain Ratio
Data (custId, car, age, city, newCar):
c1 taurus 27 sf yes | c2 van 35 la yes | c3 van 40 sf yes | c4 taurus 22 sf yes | c5 merc 50 la no | c6 taurus 25 la no
Class distribution: D = (2/3, 1/3).
Split on city: city=sf -> D1(1, 0); city=la -> D2(1/3, 2/3).
  Gain(D, city) = H(1/3,2/3) - 1/2*H(1,0) - 1/2*H(1/3,2/3) = 0.45;  G_Ratio_pen(city) = H(1/2,1/2) = 1.
Split on car: car=taurus -> D2(2/3, 1/3); car=van -> D3(1, 0); car=merc -> D1(0, 1).
  Gain(D, car) = H(1/3,2/3) - 1/2*H(2/3,1/3) - 1/3*H(1,0) - 1/6*H(0,1) = 0.45;  G_Ratio_pen(car) = H(1/2,1/3,1/6) = 1.45.
Split on age: six singleton partitions (age = 22, 25, 27, 35, 40, 50), each pure.
  Gain(D, age) = H(1/3,2/3) - 6*(1/6)*0 = 0.90;  G_Ratio_pen(age) = log2(6) = 2.58.
Result with I_Gain: age > car = city.  Result with I_Gain_Ratio: city > age > car.
Entropy and Gain Computations
Assume we have m classes in our classification problem. A test S subdivides the examples D = (p1, ..., pm) into n subsets D1 = (p11, ..., p1m), ..., Dn = (pn1, ..., pnm). The quality of S is evaluated using Gain(D,S) (ID3) or GainRatio(D,S) (C5.0):
Let H(D = (p1, ..., pm)) = sum_{i=1..m} pi * log2(1/pi)   (called the entropy function)
Gain(D, S) = H(D) - sum_{i=1..n} (|Di| / |D|) * H(Di)
Gain_Ratio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Dn|/|D|)
Remarks: |D| denotes the number of elements in set D. D = (p1, ..., pm) implies that p1 + ... + pm = 1 and indicates that, of the |D| examples, p1*|D| examples belong to the first class, p2*|D| examples belong to the second class, ..., and pm*|D| belong to the m-th (last) class.
H(0,1) = H(1,0) = 0; H(1/2,1/2) = 1; H(1/4,1/4,1/4,1/4) = 2; H(1/p, ..., 1/p) = log2(p).
C5.0 selects the test S with the highest value for Gain_Ratio(D,S), whereas ID3 picks the test S for the examples in set D with the highest value for Gain(D,S).
Entropy Function H Revisited
H((p1, ..., pm)) = sum_{i=1..m} f(pi)  with  f(p) = IF p = 0 THEN 0 ELSE p * log2(1/p)
H is called the entropy function.
Example Entropy Function H Computations
H((p1, ..., pm)) = sum_{i=1..m} f(pi)  with  f(p) = IF p = 0 THEN 0 ELSE p * log2(1/p)
H((1/2, 1/4, 1/8, 1/8, 0)) = 1/2*log2(2) + 1/4*log2(4) + 2*(1/8)*log2(8) + 0 = 1/2 + 1/2 + 2*3/8 + 0 = 1.75
Remark: the above formula is the same as the formula given earlier and in most textbooks because -log2(p) = log2(1/p). For example: -log2(1/16) = -(-4) = 4 = log2(16).
R-Bloggers link: https://www.r-bloggers.com/2020/02/how-is-information-gain-calculated/
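A Python sketch of H, Gain, and Gain_Ratio as defined above, replaying the city/car/age example from the earlier slide (the split distributions come from that slide; rounding here gives 0.46 where the slide writes 0.45):

```python
from math import log2

def H(dist):
    """Entropy of a distribution (p1, ..., pm); terms with p = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in dist if p > 0)

def gain(parent_dist, children):
    """children = list of (weight |Di|/|D|, class distribution of Di)."""
    return H(parent_dist) - sum(w * H(d) for w, d in children)

def gain_ratio(parent_dist, children):
    return gain(parent_dist, children) / H([w for w, _ in children])

D = (2/3, 1/3)   # 4 of the 6 customers buy a new car

city = [(1/2, (1, 0)), (1/2, (1/3, 2/3))]                      # sf, la
car  = [(1/2, (2/3, 1/3)), (1/3, (1, 0)), (1/6, (0, 1))]       # taurus, van, merc
age  = [(1/6, (1, 0))] * 4 + [(1/6, (0, 1))] * 2               # six pure singleton partitions

for name, split in [("city", city), ("car", car), ("age", age)]:
    print(name, round(gain(D, split), 2), round(gain_ratio(D, split), 2))
# city 0.46 0.46   car 0.46 0.31   age 0.92 0.36
# -> I_Gain ranks age > car = city; I_Gain_Ratio ranks city > age > car
```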
Splitting Based on INFO...
o Gain Ratio: GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - sum_{i=1..k} (n_i / n) * log2(n_i / n)
  Parent node p is split into k partitions; n_i is the number of records in partition i.
  - Adjusts information gain by the entropy of the partitioning (SplitINFO). Tests which generate more nodes are penalized!
  - Used in C4.5.
  - Designed to overcome the disadvantage of information gain not considering the number of nodes generated by a split.
Information Gain vs. Gain Ratio (example re-shown): applying the formulas above to the car/age/city data gives I_Gain: age > car = city, but I_Gain_Ratio: city > age > car; the SplitINFO penalty demotes the many-valued age attribute.
Comparison among Splitting Criteria For a 2-class problem:
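For a 2-class problem, all three impurity measures are functions of p alone, the fraction of records in one class. A small sketch (assuming matplotlib is available) reproduces the usual comparison plot:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.0, 1.0, 201)
gini = 1 - p**2 - (1 - p)**2
error = 1 - np.maximum(p, 1 - p)
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = np.nan_to_num(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

plt.plot(p, entropy, label="Entropy")
plt.plot(p, gini, label="Gini")
plt.plot(p, error, label="Misclassification error")
plt.xlabel("p (fraction of records in class 1)")
plt.ylabel("impurity")
plt.legend()
plt.show()
# All three measures are 0 for pure nodes (p = 0 or 1) and peak at p = 0.5;
# entropy reaches 1, Gini 0.5, and misclassification error 0.5 there.
```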