Data Mining Techniques: Classification and Supervised Learning Overview

Dive into the world of data mining techniques with a focus on classification and supervised learning. Explore decision trees, KNN, rule-based models, and more, along with understanding the differences between supervised and unsupervised learning methods. Gain insights into classifiers, algorithms like Naïve Bayes, and practical examples like determining scholarship applications based on various features. Uncover the principles of K Nearest Neighbors (KNN) and distance formulas like Euclidean distance, essential in predictive analytics and pattern recognition.

  • Data Mining
  • Classification
  • Supervised Learning
  • Decision Trees
  • K Nearest Neighbors

Presentation Transcript


  1. 1 CS 43105 Data Mining Techniques Chapter 7 Classification (2) Xiang Lian Department of Computer Science Kent State University Email: xlian@kent.edu Homepage: http://www.cs.kent.edu/~xlian/

  2. 2 Outline Decision Tree KNN for Classification Rule-based Classification Model Evaluation and Selection

  3. 3 Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

  4. 4 Supervised Learning and Classification Given: a dataset of instances with known categories Goal: using the knowledge in the dataset, classify a given instance, i.e., predict the category of the given instance that is rationally consistent with the dataset

  5. 5 Classifiers [Diagram: feature values X1, X2, X3, ..., Xn feed into a Classifier, which outputs a category Y; the classifier is built from a DB, a collection of instances with known categories]

  6. 6 Algorithms Decision Trees K Nearest Neighbors (kNN) Naïve Bayes Many others (support vector machines, neural networks, genetic algorithms, etc.)

  7. 7 K Nearest Neighbors For a given instance T, get the top k dataset instances that are nearest to T Select a reasonable distance measure Inspect the category of these k instances, choose the category C that represents the most instances Conclude that T belongs to category C
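
  A minimal Python sketch of this procedure (not from the slides; the dataset, feature values, and k below are hypothetical):

    import math
    from collections import Counter

    def euclidean(a, b):
        # square root of the sum of squared feature differences
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(dataset, instance, k=3, distance=euclidean):
        # top-k dataset instances nearest to the given instance
        neighbors = sorted(dataset, key=lambda pair: distance(pair[0], instance))[:k]
        # category represented by the most of these k instances
        votes = Counter(category for _, category in neighbors)
        return votes.most_common(1)[0][0]

    # (income in millions of pesos, number of siblings, QPI grade) -> decision
    data = [((0.3, 4, 3.8), "award"), ((2.5, 1, 2.1), "deny"), ((0.5, 3, 3.5), "award")]
    print(knn_classify(data, (0.4, 2, 3.6), k=3))   # "award"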

  8. 8 Example 1 Determining decision on scholarship application based on the following features: Household income (annual income in millions of pesos) Number of siblings in family High school grade (on a QPI scale of 1.0-4.0) Intuition (reflected in the data set): award scholarships to high-performers and to those with financial need

  9. 9 Distance Formula Euclidean distance: square root of sum of squares of differences; for two features: √((Δx)² + (Δy)²) Intuition: similar samples should be close to each other May not always apply (example: quota and actual sales)

  10. 10 Incomparable Ranges The Euclidean distance formula has the implicit assumption that the different dimensions are comparable Features that span wider ranges affect the distance value more than features with limited ranges

  11. 11 Example Revisited Suppose household income was instead indicated in thousands of pesos per month and that grades are given on a 70-100 scale Note different results produced by kNN algorithm on the same dataset

  12. 12 Non-Numeric Data Feature values are not always numbers Example Boolean values: Yes or no, presence or absence of an attribute Categories: Colors, educational attainment, gender How do these values factor into the computation of distance?

  13. 13 Dealing with Non-Numeric Data Boolean values => convert to 0 or 1 Applies to yes-no/presence-absence attributes Non-binary characterizations Use natural progression when applicable; e.g., educational attainment: GS, HS, College, MS, PHD => 1,2,3,4,5 Assign arbitrary numbers but be careful about distances; e.g., color: red, yellow, blue => 1,2,3 How about unavailable data? (0 value not always the answer)

  14. 14 Preprocessing Your Dataset Dataset may need to be preprocessed to ensure more reliable data mining results Conversion of non-numeric data to numeric data Calibration of numeric data to reduce effects of disparate ranges Particularly when using the Euclidean distance metric
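
  A small preprocessing sketch (illustrative only; the column names, category ordering, and values are assumptions), converting non-numeric values to numbers and rescaling each numeric feature to [0, 1] so that wide-ranging features do not dominate the Euclidean distance:

    def encode_education(level):
        # natural progression for an ordered category
        order = {"GS": 1, "HS": 2, "College": 3, "MS": 4, "PhD": 5}
        return order[level]

    def min_max_scale(values):
        # calibrate a numeric column to the range [0, 1]
        lo, hi = min(values), max(values)
        if hi == lo:                      # constant column: avoid division by zero
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    incomes = [120, 450, 90, 300]         # e.g., monthly income in thousands of pesos
    grades = [78, 95, 82, 88]             # e.g., grades on a 70-100 scale
    print(min_max_scale(incomes))
    print(min_max_scale(grades))
    print(encode_education("MS"))         # 4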

  15. 15 k-NN Variations Value of k Larger k increases confidence in prediction Note that if k is too large, decision may be skewed Weighted evaluation of nearest neighbors Plain majority may unfairly skew decision Revise algorithm so that closer neighbors have greater vote weight Other distance measures

  16. 16 Other Distance Measures City-block distance (Manhattan dist) Add absolute value of differences Cosine similarity Measure angle formed by the two samples (with the origin) Jaccard distance Determine percentage of exact matches between the samples (not including unavailable data) Others
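
  Minimal sketches of these alternative measures (standard definitions; the Jaccard variant here treats a sample as a tuple of attribute values and skips unavailable entries, marked None):

    import math

    def manhattan(a, b):
        # city-block distance: sum of absolute differences
        return sum(abs(x - y) for x, y in zip(a, b))

    def cosine_similarity(a, b):
        # cosine of the angle formed by the two samples with the origin
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def jaccard_distance(a, b):
        # one minus the fraction of attribute positions that match exactly,
        # ignoring positions where either value is unavailable (None)
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not pairs:
            return 1.0
        matches = sum(1 for x, y in pairs if x == y)
        return 1.0 - matches / len(pairs)

    print(manhattan((1, 2), (4, 0)))                                  # 5
    print(round(cosine_similarity((1, 0), (1, 1)), 3))                # 0.707
    print(jaccard_distance(("red", "yes", None), ("red", "no", "x"))) # 0.5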

  17. 17 k-NN Time Complexity Suppose there are m instances and n features in the dataset Nearest neighbor algorithm requires computing m distances Each distance computation involves scanning through each feature value Running time complexity is proportional to m × n

  18. 18 Using IF-THEN Rules for Classification Represent the knowledge in the form of IF-THEN rules R: IF age = youth AND student = yes THEN buys_computer = yes Rule antecedent/precondition vs. rule consequent Assessment of a rule: coverage and accuracy n_covers = # of tuples covered by R n_correct = # of tuples correctly classified by R coverage(R) = n_covers / |D| /* D: training data set */ accuracy(R) = n_correct / n_covers
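
  A short sketch of these two measures (the data, attribute names, and rule encoding are hypothetical; a rule is a pair of antecedent predicate and consequent class):

    def coverage_and_accuracy(rule, data):
        antecedent, consequent = rule
        covered = [(x, y) for x, y in data if antecedent(x)]     # tuples covered by R
        correct = sum(1 for x, y in covered if y == consequent)  # correctly classified
        coverage = len(covered) / len(data)
        accuracy = correct / len(covered) if covered else 0.0
        return coverage, accuracy

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    rule = (lambda x: x["age"] == "youth" and x["student"] == "yes", "yes")
    data = [({"age": "youth", "student": "yes"}, "yes"),
            ({"age": "youth", "student": "no"}, "no"),
            ({"age": "senior", "student": "yes"}, "yes")]
    print(coverage_and_accuracy(rule, data))   # (0.333..., 1.0)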

  19. 19 Using IF-THEN Rules for Classification (cont'd) If more than one rule is triggered, need conflict resolution Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., with the most attribute tests) Class-based ordering: decreasing order of prevalence or misclassification cost per class Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts

  20. 20 Rule Extraction from a Decision Tree Rules are easier to understand than large trees One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction Rules are mutually exclusive and exhaustive [Decision tree: root splits on age? (<=30, 31..40, >40); the <=30 branch splits on student? (no => no, yes => yes); the 31..40 branch is a yes leaf; the >40 branch splits on credit rating? (excellent => no, fair => yes)]

  21. 21 Rule Extraction from a Decision Tree (cont'd) Example: Rule extraction from our buys_computer decision tree IF age = young AND student = no THEN buys_computer = no IF age = young AND student = yes THEN buys_computer = yes IF age = mid-age THEN buys_computer = yes IF age = old AND credit_rating = excellent THEN buys_computer = no IF age = old AND credit_rating = fair THEN buys_computer = yes

  22. 22 Rule Induction: Sequential Covering Method Sequential covering algorithm: extracts rules directly from training data Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes

  23. 23 Rule Induction: Sequential Covering Method (cont'd) Steps: Rules are learned one at a time Each time a rule is learned, the tuples covered by the rule are removed Repeat the process on the remaining tuples until a termination condition is met, e.g., when no more training examples remain or when the quality of a rule returned is below a user-specified threshold Compare with decision-tree induction, which learns a set of rules simultaneously

  24. 24 Sequential Covering Algorithm while (enough target tuples left): generate a rule; remove positive target tuples satisfying this rule [Diagram: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
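
  A compact sketch of this loop (illustrative only; learn_one_rule and enough_left stand in for the rule-growing and stopping logic described on the following slides):

    def sequential_covering(positives, negatives, learn_one_rule, enough_left):
        rules = []
        while enough_left(positives):
            # generate a rule for the target class
            rule = learn_one_rule(positives, negatives)
            if rule is None:
                break
            rules.append(rule)
            # remove the positive target tuples satisfying this rule
            positives = [x for x in positives if not rule(x)]
        return rules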

  25. 25 Rule Generation To generate a rule: while (true): find the best predicate p; if foil-gain(p) > threshold then add p to the current rule, else break [Diagram: the rule condition grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, covering more of the positive examples and fewer of the negative examples]

  26. 26 How to Learn-One-Rule? Start with the most general rule possible: condition = empty Add new attribute tests by adopting a greedy depth-first strategy: pick the one that most improves the rule quality

  27. 27 Rule-Quality Measures Rule-quality measures: consider both coverage and accuracy Foil-gain (in FOIL & RIPPER): assesses the info_gain obtained by extending the condition: FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg))), where pos/neg are the # of positive/negative tuples covered by the current rule R and pos'/neg' by the extended rule R' It favors rules that have high accuracy and cover many positive tuples Rule pruning is based on an independent set of test tuples: FOIL_Prune(R) = (pos − neg) / (pos + neg), where pos/neg are the # of positive/negative tuples covered by R If FOIL_Prune is higher for the pruned version of R, prune R
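
  A small sketch of the two measures reconstructed above (the counts passed in are hypothetical):

    import math

    def foil_gain(pos, neg, pos_prime, neg_prime):
        # information gained about the positive class by extending the rule's condition;
        # pos/neg are covered by the current rule, pos'/neg' by the extended rule
        before = math.log2(pos / (pos + neg))
        after = math.log2(pos_prime / (pos_prime + neg_prime))
        return pos_prime * (after - before)

    def foil_prune(pos, neg):
        # pruning heuristic evaluated on an independent set of test tuples
        return (pos - neg) / (pos + neg)

    print(foil_gain(pos=50, neg=50, pos_prime=40, neg_prime=10))   # about 27.1
    print(foil_prune(pos=40, neg=10))                              # 0.6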

  28. 28 Model Evaluation and Selection Evaluation metrics: How can we measure accuracy? Other metrics to consider? Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy Methods for estimating a classifier's accuracy: Holdout method, random subsampling Cross-validation Bootstrap Comparing classifiers: Confidence intervals Cost-benefit analysis and ROC curves

  29. 29 Classifier Evaluation Metrics: Confusion Matrix
  Confusion matrix (actual class \ predicted class):
    actual C1, predicted C1: True Positives (TP); actual C1, predicted ¬C1: False Negatives (FN)
    actual ¬C1, predicted C1: False Positives (FP); actual ¬C1, predicted ¬C1: True Negatives (TN)
  Example confusion matrix (actual class \ predicted class):
    buy_computer = yes: 6954 predicted yes, 46 predicted no, 7000 total
    buy_computer = no: 412 predicted yes, 2588 predicted no, 3000 total
    Total: 7366 predicted yes, 2634 predicted no, 10000 total
  Given m classes, an entry CM_i,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j May have extra rows/columns to provide totals
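
  A tiny sketch of building a two-class confusion matrix from labels (the label lists are hypothetical):

    def confusion_matrix(actual, predicted, positive="yes"):
        tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
        fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
        fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
        tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
        return tp, fn, fp, tn

    actual    = ["yes", "yes", "no", "no", "no"]
    predicted = ["yes", "no", "no", "yes", "no"]
    print(confusion_matrix(actual, predicted))   # (1, 1, 1, 2)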

  30. 30 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
  [Confusion matrix with row totals P (actual positives), N (actual negatives), and All]
  Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive, with a significant majority of the negative class and a minority of the positive class
  Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified: Accuracy = (TP + TN) / All
  Error rate: 1 − accuracy, or Error rate = (FP + FN) / All
  Sensitivity: true positive recognition rate: Sensitivity = TP / P
  Specificity: true negative recognition rate: Specificity = TN / N

  31. 31 Classifier Evaluation Metrics: Precision and Recall, and F-measures Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive? Recall (completeness): what % of positive tuples did the classifier label as positive? Perfect score is 1.0 Inverse relationship between precision & recall F measure (F1 or F-score): harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall) Fβ: weighted measure of precision and recall that assigns β times as much weight to recall as to precision: Fβ = (1 + β²) × precision × recall / (β² × precision + recall)

  32. 32 Classifier Evaluation Metrics: Example
    cancer = yes: 90 predicted yes, 210 predicted no, 300 total, recognition 30.00% (sensitivity)
    cancer = no: 140 predicted yes, 9560 predicted no, 9700 total, recognition 98.56% (specificity)
    Total: 230 predicted yes, 9770 predicted no, 10000 total, recognition 96.40% (accuracy)
  Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
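
  Checking the worked example above with a short precision/recall/F1 sketch (F1 is not given on the slide; it follows from the formula on the previous slide):

    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)    # % of predicted positives that are actually positive
        recall = tp / (tp + fn)       # % of actual positives labeled positive
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # cancer example: TP = 90, FP = 140, FN = 210
    p, r, f1 = precision_recall_f1(tp=90, fp=140, fn=210)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")   # precision=0.3913 recall=0.3000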

  33. 33 Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods Holdout method Given data is randomly partitioned into two independent sets Training set (e.g., 2/3) for model construction Test set (e.g., 1/3) for accuracy estimation Random subsampling: a variation of holdout Repeat holdout k times, accuracy = avg. of the accuracies obtained Cross-validation (k-fold, where k = 10 is most popular) Randomly partition the data into k mutually exclusive subsets, each of approximately equal size At the i-th iteration, use Di as the test set and the others as the training set Leave-one-out: k folds where k = # of tuples, for small-sized data *Stratified cross-validation*: folds are stratified so that the class dist. in each fold is approx. the same as that in the initial data
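
  A bare-bones k-fold cross-validation sketch (illustrative; train_and_predict is any function that builds a model on the training tuples and predicts labels for the test features):

    import random

    def k_fold_accuracy(data, k, train_and_predict):
        data = data[:]
        random.shuffle(data)                                   # random partition
        folds = [data[i::k] for i in range(k)]                 # k mutually exclusive subsets
        accuracies = []
        for i in range(k):
            test = folds[i]                                    # use D_i as the test set
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            predictions = train_and_predict(train, [features for features, _ in test])
            correct = sum(1 for (_, label), pred in zip(test, predictions) if label == pred)
            accuracies.append(correct / len(test))
        return sum(accuracies) / k                             # average over the k folds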

  34. 34 Evaluating Classifier Accuracy: Bootstrap Bootstrap Works well with small data sets Samples the given training tuples uniformly with replacement i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set Several bootstrap methods, and a common one is .632 bootstrap A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368) Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = (1/k) Σ_i (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
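
  A sketch of this procedure (assuming the standard .632 combination; evaluate is a placeholder that trains on its first argument and reports accuracy on its second):

    import random

    def bootstrap_632_accuracy(data, k, evaluate):
        scores = []
        for _ in range(k):
            # sample d tuples with replacement to form the training set
            train = [random.choice(data) for _ in range(len(data))]
            chosen = {id(x) for x in train}
            # tuples never selected form the test set (about 36.8% of the data)
            test = [x for x in data if id(x) not in chosen]
            acc_test = evaluate(train, test)
            acc_train = evaluate(train, train)
            scores.append(0.632 * acc_test + 0.368 * acc_train)
        return sum(scores) / k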

  35. 35 Model Selection: ROC Curves ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models Originated from signal detection theory Shows the trade-off between the true positive rate and the false positive rate The area under the ROC curve is a measure of the accuracy of the model Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model The vertical axis represents the true positive rate, the horizontal axis the false positive rate; the plot also shows a diagonal line A model with perfect accuracy will have an area of 1.0
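
  A brief ROC/AUC sketch (illustrative; the scores and labels at the end are hypothetical): rank the test tuples by decreasing score and trace the true positive rate against the false positive rate:

    def roc_points(scores, labels, positive="yes"):
        # most-likely-positive tuples appear at the top of the ranked list
        ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
        P = sum(1 for _, y in ranked if y == positive)
        N = len(ranked) - P
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, label in ranked:
            if label == positive:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))    # (false positive rate, true positive rate)
        return points

    def auc(points):
        # area under the ROC curve by the trapezoidal rule; 1.0 means perfect ranking
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    pts = roc_points([0.9, 0.8, 0.4, 0.3], ["yes", "no", "yes", "no"])
    print(auc(pts))   # 0.75 for this tiny ranking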

  36. 36 Issues Affecting Model Selection Accuracy classifier accuracy: predicting class label Speed time to construct the model (training time) time to use the model (classification/prediction time) Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
