Descriptive Data Mining

Descriptive data mining analyzes historical data to find patterns, relationships, and anomalies, aiding in decision-making. Unsupervised learning and examples of techniques like clustering are explored, showcasing the power of data analysis in business.

Uploaded on Dec 24, 2023 | 1 Views

Presentation Transcript

Descriptive Data Mining Herbert F. Lewis, Ph.D. College of Business SBU Math Camp 2023

Descriptive Data Mining Descriptive data mining involves analyzing historical data to identify patterns and relationships. Descriptive data mining produces summaries and visualizations of the data. The increased use of data-mining techniques in business has occurred due to: the explosion in the amount of data being produced and electronically tracked, the ability to electronically warehouse these data, and the decrease in the cost of computer power to analyze the data.

Unsupervised Learning Observation (record): A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database. Unsupervised learning: A descriptive data-mining technique used to identify relationships between observations. Thought of as high-dimensional descriptive analytics. There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results.

Examples of Descriptive Data Mining Examples of descriptive data mining include clustering, association rule mining, and anomaly detection. Clustering involves grouping similar objects together. Association rule mining involves identifying relationships between different items in a dataset. Anomaly detection involves identifying unusual patterns or outliers in the data.

Clustering The goal of clustering is to segment observations into similar groups based on observed variables. Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration. Commonly used in marketing to divide customers into different homogeneous groups; this is known as market segmentation. Also used to identify outliers.

Clustering Categorical Variables: Matching Coefficient Matching coefficient: Measure of similarity between observations based on the number of matching values of categorical variables. Matching distance: Measure of dissimilarity between observations based on the matching coefficient.
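As a minimal sketch of the two definitions above, the matching coefficient can be computed as the share of variables on which two observations agree. The two passenger records are hypothetical, and taking the matching distance to be one minus the coefficient is an assumption (the slide only says the distance is based on the coefficient).

```python
def matching_coefficient(u, v):
    """Share of categorical variables on which two observations match."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


def matching_distance(u, v):
    """Dissimilarity, assumed here to be 1 - matching coefficient."""
    return 1 - matching_coefficient(u, v)


# Two hypothetical passengers described by 16 binary variables.
p1 = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
p2 = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(matching_coefficient(p1, p2))  # 12 of the 16 variables match
print(matching_distance(p1, p2))
```

Note that matches on 0 count just as much as matches on 1, which is the property Jaccard's coefficient removes.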

Clustering Categorical Variables: Matching Coefficient Matching Coefficients for Titanic Passenger Data [Table: pairwise matching coefficients for 10 passengers, computed over 16 binary variables: FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges, Survived.]

Clustering Categorical Variables: Jaccard's Coefficient Jaccard's coefficient: Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries. Jaccard distance: Measure of dissimilarity between observations based on Jaccard's coefficient.
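A small sketch of Jaccard's coefficient for binary observations: count the variables where both observations are 1 and divide by the number of variables where at least one of them is 1. The records are hypothetical, the distance is assumed to be one minus the coefficient, and raising an error for two all-zero observations is a choice made here, not something the slide specifies.

```python
def jaccard_coefficient(u, v):
    """Matches on 1 divided by the variables that are not 0 in both observations."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    considered = len(u) - both_zero
    if considered == 0:
        # Two all-zero observations: the ratio is undefined (handling is an assumption).
        raise ValueError("Jaccard's coefficient is undefined for two all-zero observations")
    return both_one / considered


def jaccard_distance(u, v):
    """Dissimilarity, assumed here to be 1 - Jaccard's coefficient."""
    return 1 - jaccard_coefficient(u, v)


# Two hypothetical passengers described by 16 binary variables.
p1 = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
p2 = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(jaccard_coefficient(p1, p2))  # 2 shared ones out of 6 variables with any one
```

Because shared zeros are ignored, these two records look far less similar here than under the matching coefficient.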

Clustering Categorical Variables: Jaccard's Coefficient Jaccard's Coefficients for Titanic Passenger Data [Table: pairwise Jaccard's coefficients for the same 10 passengers, computed over 16 binary variables: FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges, Survived.]

Euclidean Distance Euclidean distance: Geometric measure of dissimilarity between observations based on the Pythagorean theorem. For two observations, $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$, the Euclidean distance is:

$d_{uv} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$

Manhattan Distance Manhattan distance: Measure of dissimilarity between two observations based on the sum of the absolute differences in each variable dimension. For two observations, $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$, the Manhattan distance is:

$d_{uv} = |u_1 - v_1| + |u_2 - v_2| + \cdots + |u_q - v_q|$
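The Euclidean and Manhattan definitions can be compared side by side in a few lines. This is only an illustrative sketch; the two points are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute differences across variables."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical observations with two variables each.
u = (5, 0)
v = (2, 4)

print(euclidean_distance(u, v))  # straight-line distance
print(manhattan_distance(u, v))  # distance along the axes
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of observations.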

Standardized Distance Both Euclidean and Manhattan distance are highly influenced by the scale on which variables are measured. It is common to replace each variable $u_j$ of observation $u$ with its z-score $z_j$. The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the distance between observations.
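A brief sketch of the z-score conversion: each value has the variable's mean subtracted and is divided by the variable's standard deviation. The income and age values are hypothetical, and using the sample standard deviation (dividing by n - 1) is an assumption; the slide does not say which version is intended.

```python
import statistics


def z_scores(values):
    """Convert one variable's values to z-scores."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(x - mean) / sd for x in values]


# Hypothetical variables measured on very different scales.
incomes = [30_000, 50_000, 70_000]  # dollars
ages = [25, 35, 45]                 # years

print(z_scores(incomes))
print(z_scores(ages))
```

After standardization both variables are on the same scale, so neither dominates a Euclidean or Manhattan distance simply because of its units.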

Clustering Numerical Data: k-Means k-means clustering: Process of organizing observations into one of k groups based on a measure of similarity (typically Euclidean distance). Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid. The centroid update and reassignment steps typically repeat until no observation changes cluster.

Clustering Numerical Data: k-Means (2 Clusters) Initial Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq (min)
1    5  0  A        7.72          12.24         7.72
2    5  2  B        6.12          4.24          4.24
3    3  1  A        0.32          4.04          0.32
4    0  4  B        13.52         11.24         11.24
5    2  1  A        0.52          5.44          0.52
6    4  2  B        2.32          1.64          1.64
7    2  2  A        0.72          2.44          0.72
8    2  3  B        2.92          1.44          1.44
9    1  3  A        5.12          4.84          4.84
10   5  4  B        12.52         4.24          4.24

Centroid A: X = 2.6, Y = 1.4
Centroid B: X = 3.2, Y = 3
SSE = 36.92

Clustering Numerical Data: k-Means (2 Clusters) Iteration 1 Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq (min)
1    5  0  A        5             13.69444      5
2    5  2  B        5             5.694444      5
3    3  1  A        0             4.027778      0
4    0  4  B        18            9.027778      9.027778
5    2  1  A        1             4.694444      1
6    4  2  B        2             2.361111      2
7    2  2  A        2             1.694444      1.694444
8    2  3  B        5             0.694444      0.694444
9    1  3  B        8             3.361111      3.361111
10   5  4  B        13            5.694444      5.694444

Centroid A: X = 3, Y = 1
Centroid B: X = 2.833333, Y = 3
SSE = 33.47222

Clustering Numerical Data: k-Means (2 Clusters) Iteration 2 Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq (min)
1    5  0  A        2.88          19.24         2.88
2    5  2  A        2.08          10.44         2.08
3    3  1  A        0.68          5.84          0.68
4    0  4  B        22.28         4.64          4.64
5    2  1  A        3.28          4.84          3.28
6    4  2  A        0.68          5.44          0.68
7    2  2  B        3.88          1.44          1.44
8    2  3  B        6.48          0.04          0.04
9    1  3  B        11.08         1.04          1.04
10   5  4  B        9.28          9.64          9.28

Centroid A: X = 3.8, Y = 1.2
Centroid B: X = 2, Y = 3.2
SSE = 26.04

Clustering Numerical Data: k-Means (2 Clusters) Iteration 3 Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq (min)
1    5  0  A        3.777778      23.0625       3.777778
2    5  2  A        1.111111      15.0625       1.111111
3    3  1  A        1.444444      7.0625        1.444444
4    0  4  B        21.44444      2.5625        2.5625
5    2  1  A        4.444444      4.5625        4.444444
6    4  2  A        0.111111      8.5625        0.111111
7    2  2  B        4.111111      1.5625        1.5625
8    2  3  B        5.777778      0.5625        0.5625
9    1  3  B        10.77778      0.0625        0.0625
10   5  4  A        6.444444      15.0625       6.444444

Centroid A: X = 4, Y = 1.666667
Centroid B: X = 1.25, Y = 3
SSE = 22.08333

Clustering Numerical Data: k-Means (2 Clusters) Clusters [Figure: scatter plot of the ten facility location points (X from 0 to 6, Y from 0 to 4.5), each labeled with its final cluster, A or B.]
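The two-cluster iterations above can be retraced with a short pure-Python sketch. It starts from the alternating A/B assignment used in the facility location example, recomputes centroids, reassigns each observation to its nearest centroid, and stops when the assignment no longer changes. The function name and return values are illustrative choices, not part of the presentation.

```python
def kmeans_from_assignment(points, labels, max_iter=100):
    """k-means that starts from a given cluster assignment.

    Returns the final labels, the final centroids, and the SSE recorded at each
    pass (sum of squared distances from each point to its nearest centroid).
    """
    labels = list(labels)
    sse_trace = []
    centroids = {}
    for _ in range(max_iter):
        # Centroid of each cluster is the mean of its current members.
        centroids = {}
        for c in sorted(set(labels)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
        # Reassign every point to the cluster with the closest centroid.
        new_labels, sse = [], 0.0
        for p in points:
            d2 = {c: sum((a - b) ** 2 for a, b in zip(p, m)) for c, m in centroids.items()}
            best = min(d2, key=d2.get)
            new_labels.append(best)
            sse += d2[best]
        sse_trace.append(sse)
        if new_labels == labels:  # no observation changed cluster
            break
        labels = new_labels
    return labels, centroids, sse_trace


# Facility location data from the slides.
points = [(5, 0), (5, 2), (3, 1), (0, 4), (2, 1), (4, 2), (2, 2), (2, 3), (1, 3), (5, 4)]
initial = list("ABABABABAB")

labels, centroids, sse_trace = kmeans_from_assignment(points, initial)
print("".join(labels))
print(centroids)
print([round(s, 5) for s in sse_trace])
```

Passing `list("ABCABCABCA")` as the initial assignment runs the same procedure for three clusters. Different starting assignments can lead to different final clusters, which is why k-means is often run from several starting points.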

k-Means Optimization (2 Clusters)

$(x_j, y_j)$ are the coordinates of observation $j$, $j = 1, 2, \ldots, n$
$(x_A, y_A)$ is the centroid of cluster A
$(x_B, y_B)$ is the centroid of cluster B
$a_j = 1$ if observation $j$ is in cluster A, 0 otherwise, $j = 1, 2, \ldots, n$
$b_j = 1$ if observation $j$ is in cluster B, 0 otherwise, $j = 1, 2, \ldots, n$

Minimize $\sum_{j=1}^{n} (a_j d_{Aj} + b_j d_{Bj})$

Subject to
$d_{Aj} = (x_j - x_A)^2 + (y_j - y_A)^2$, $j = 1, 2, \ldots, n$
$d_{Bj} = (x_j - x_B)^2 + (y_j - y_B)^2$, $j = 1, 2, \ldots, n$
$a_j + b_j = 1$, $j = 1, 2, \ldots, n$
$a_j, b_j \in \{0, 1\}$, $j = 1, 2, \ldots, n$
$x_A, y_A, x_B, y_B \ge 0$
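For a data set as small as the ten facility location points, the optimization model above can be explored by brute force. The sketch below enumerates every assignment of observations to clusters A and B and evaluates the objective. It is not a solver for the model as stated: it relies on the fact that, for a fixed assignment, the sum of squared distances is minimized by placing each centroid at the mean of its members, so the centroids are computed rather than searched over. Function and variable names are hypothetical.

```python
from itertools import product

# Facility location data from the slides.
points = [(5, 0), (5, 2), (3, 1), (0, 4), (2, 1), (4, 2), (2, 2), (2, 3), (1, 3), (5, 4)]


def sse_for_assignment(points, labels):
    """Objective value for a fixed assignment, with centroids at cluster means."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total


# Enumerate all 2^10 = 1024 assignments and keep the one with the smallest SSE.
best_sse, best_labels = min(
    (sse_for_assignment(points, labels), labels)
    for labels in product("AB", repeat=len(points))
)

kmeans_sse = sse_for_assignment(points, tuple("AAABAABBBA"))
print("best assignment found:", "".join(best_labels), "SSE:", round(best_sse, 5))
print("k-means final assignment SSE:", round(kmeans_sse, 5))
```

Comparing the two printed values shows whether the k-means run reached the best possible two-cluster partition or only a local optimum. Enumeration grows as 2^n, so this approach is only practical for very small n.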

Clustering Numerical Data: k-Means (3 Clusters) Initial Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq to C  Dist Sq (min)
1    5  0  A        10.25         8             9.444444      8
2    5  2  B        4.25          4             5.444444      4
3    3  1  C        2.25          1             1.111111      1
4    0  4  A        11.25         13            11.11111      11.11111
5    2  1  B        3.25          2             1.444444      1.444444
6    4  2  C        1.25          1             1.777778      1
7    2  2  A        1.25          1             0.444444      0.444444
8    2  3  B        1.25          2             1.444444      1.25
9    1  3  C        4.25          5             3.777778      3.777778
10   5  4  A        6.25          8             9.444444      6.25

Centroid A: X = 3, Y = 2.5
Centroid B: X = 3, Y = 2
Centroid C: X = 2.666667, Y = 2
SSE = 38.27778

Clustering Numerical Data: k-Means (3 Clusters) Iteration 1 Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq to C  Dist Sq (min)
1    5  0  B        14.5          2.125         20.3125       2.125
2    5  2  B        4.5           1.125         14.3125       1.125
3    3  1  B        6.5           1.625         5.3125        1.625
4    0  4  C        12.5          25.625        3.8125        3.8125
5    2  1  C        8.5           5.125         2.8125        2.8125
6    4  2  B        2.5           0.625         7.8125        0.625
7    2  2  C        4.5           5.625         0.8125        0.8125
8    2  3  A        2.5           8.125         0.8125        0.8125
9    1  3  C        6.5           13.625        0.3125        0.3125
10   5  4  A        2.5           8.125         16.3125       2.5

Centroid A: X = 3.5, Y = 3.5
Centroid B: X = 4.25, Y = 1.25
Centroid C: X = 1.25, Y = 2.5
SSE = 16.5625

Clustering Numerical Data: k-Means (3 Clusters) Iteration 2 Clusters for Facility Location Data

Obs  X  Y  Cluster  Dist Sq to A  Dist Sq to B  Dist Sq to C  Dist Sq (min)
1    5  0  B        16            2.125         19.72         2.125
2    5  2  B        4             1.125         13.32         1.125
3    3  1  B        13            1.625         5.12          1.625
4    0  4  C        25            25.625        3.92          3.92
5    2  1  C        18            5.125         2.92          2.92
6    4  2  B        5             0.625         7.12          0.625
7    2  2  C        13            5.625         0.72          0.72
8    2  3  C        10            8.125         0.52          0.52
9    1  3  C        17            13.625        0.32          0.32
10   5  4  A        0             8.125         14.92         0

Centroid A: X = 5, Y = 4
Centroid B: X = 4.25, Y = 1.25
Centroid C: X = 1.4, Y = 2.6
SSE = 13.9

Clustering Numerical Data: k-Means (3 Clusters) Clusters [Figure: scatter plot of the ten facility location points (X from 0 to 6, Y from 0 to 4.5), each labeled with its final cluster, A, B, or C.]

Hierarchical Clustering Hierarchical clustering: Process of agglomerating observations into a series of nested groups based on a measure of similarity. Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster. There are several ways to obtain a cluster similarity measure: single linkage, complete linkage, group average linkage, median linkage, or centroid linkage.
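The agglomerative procedure described above can be sketched in pure Python. This version uses single linkage as the cluster dissimilarity and records each merge together with the distance at which it happened, which is the information a dendrogram displays. The data points and function names are hypothetical, and the quadratic search over cluster pairs is chosen for readability, not efficiency.

```python
import math


def single_linkage(c1, c2):
    """Distance between the two closest observations of two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerate(points, linkage=single_linkage):
    """Repeatedly merge the two most similar clusters until one remains."""
    clusters = [[p] for p in points]  # every observation starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical one-dimensional observations.
data = [(0,), (1,), (5,), (6,), (20,)]
for distance, members in agglomerate(data):
    print(distance, members)
```

Reading the merge list from top to bottom gives the nested sequence of clusters; cutting it at a chosen distance yields a particular number of clusters.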

Hierarchical Clustering: Linkage Single linkage: Measure of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters. Complete linkage: Measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters. Group average linkage: Measure of calculating dissimilarity between clusters by considering the distance between each pair of observations between two clusters. Median linkage: Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters. Centroid linkage: Method of calculating dissimilarity between clusters by considering the two centroids of the respective clusters.
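To make the differences between these linkage measures concrete, the sketch below evaluates each one on the same pair of small clusters. The clusters are hypothetical. Median linkage follows the definition given on this slide (median of the pairwise dissimilarities), which may differ from what a given software package calls "median" linkage.

```python
import math
from statistics import mean, median


def pair_distances(c1, c2):
    """Euclidean distance between every cross-cluster pair of observations."""
    return [math.dist(p, q) for p in c1 for q in c2]


def centroid(cluster):
    """Coordinate-wise mean of a cluster's observations."""
    return tuple(mean(coord) for coord in zip(*cluster))


def linkage_measures(c1, c2):
    d = pair_distances(c1, c2)
    return {
        "single": min(d),            # two most similar observations
        "complete": max(d),          # two most dissimilar observations
        "group_average": mean(d),    # average over all cross-cluster pairs
        "median": median(d),         # median over all cross-cluster pairs
        "centroid": math.dist(centroid(c1), centroid(c2)),
    }


# Two hypothetical clusters of two observations each.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (4, 2)]

measures = linkage_measures(cluster_a, cluster_b)
for name, value in measures.items():
    print(name, round(value, 4))
```

Single linkage always gives the smallest value and complete linkage the largest, so the choice of linkage changes which clusters look closest and therefore which are merged first.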

Hierarchical Clustering: Methods Ward's method: Procedure that partitions observations in a manner to obtain clusters with the least amount of information loss due to the aggregation. McQuitty's method: Measure that computes the dissimilarity introduced by merging clusters A and B by, for each other cluster C, averaging the distance between A and C and the distance between B and C and summing these average distances. Dendrogram: A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.

Hierarchical Clustering versus k-Means Clustering Hierarchical Clustering Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters. Convenient method if you want to observe how clusters are nested. k-Means Clustering Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations). Partitions the observations, which is appropriate if trying to summarize the data with k average observations that describe the data with the minimum amount of error.

Association Rules Association rule: An if-then statement describing the relationship between item sets. Antecedent: The item set corresponding to the "if" portion of an if-then association rule. Consequent: The item set corresponding to the "then" portion of an if-then association rule. Support: The percentage of transactions in which a collection of items occurs together in a transaction data set.

Association Rules: Confidence and Lift Ratio Confidence: The conditional probability that the consequent of an association rule occurs given the antecedent occurs. Lift ratio: The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction.
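As a worked sketch of these definitions, the function below computes support, confidence, and lift from four counts. The example uses the FirstClass → Survived counts reported in this presentation's Titanic data (216 first-class passengers, 136 first-class survivors, 342 survivors, 891 records). The function name and argument names are hypothetical.

```python
def rule_metrics(n_records, n_antecedent, n_consequent, n_both):
    """Support, confidence, and lift ratio for the rule antecedent -> consequent."""
    support = n_both / n_records                    # share of records containing both
    confidence = n_both / n_antecedent              # P(consequent | antecedent)
    lift = confidence / (n_consequent / n_records)  # confidence vs. baseline rate
    return support, confidence, lift


# Rule: if FirstClass then Survived, with counts from the Titanic passenger data.
support, confidence, lift = rule_metrics(
    n_records=891, n_antecedent=216, n_consequent=342, n_both=136
)
print(round(support, 6), round(confidence, 6), round(lift, 6))
```

A lift ratio above 1 means the consequent is more likely among records satisfying the antecedent than among records in general; a value below 1 means it is less likely.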

Association Rules: 1 Antecedent 1 Consequent Supports, Confidences, and Lift Ratios for Titanic Passenger Data

Consequent: Survived (consequent support 342; records 891)

Antecedent   Antecedent Support  Antecedent and Consequent Support  Confidence   Lift
FirstClass   216                 136                                0.62962963   1.640351
SecondClass  184                 87                                 0.472826087  1.231836
ThirdClass   491                 119                                0.242362525  0.631418
Female       314                 233                                0.742038217  1.933205
AgeMissing   177                 52                                 0.293785311  0.765388
Child        113                 61                                 0.539823009  1.406381
Adult        559                 216                                0.386404293  1.006685
Elderly      42                  13                                 0.30952381   0.806391
IsSolo       481                 130                                0.27027027   0.704125
IsCouple     181                 93                                 0.513812155  1.338616
IsTriplet    101                 66                                 0.653465347  1.702449
IsGroup      128                 53                                 0.4140625    1.078742
HasChild     210                 103                                0.49047619   1.27782
HasElderly   76                  41                                 0.539473684  1.405471
NoAges       146                 40                                 0.273972603  0.713771

Association Rules: 2 Antecedents 1 Consequent Antecedents Supports for Titanic Passenger Data [Table: number of records satisfying each pair of antecedents, for pairs drawn from FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges.]

Association Rules: 2 Antecedents 1 Consequent Antecedents and Consequent Supports for Titanic Passenger Data [Table: number of records satisfying each pair of antecedents and the consequent Survived, for pairs drawn from FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges.]

Association Rules: 2 Antecedents 1 Consequent Confidences for Titanic Passenger Data [Table: confidence of the rule "antecedent pair → Survived" for each pair of antecedents drawn from FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges; pairs with no supporting records are shown as "-".]

Association Rules: 2 Antecedents 1 Consequent Lift Ratios for Titanic Passenger Data Consequent support (Survived): 342 of 891 records. [Table: lift ratio of the rule "antecedent pair → Survived" for each pair of antecedents drawn from FirstClass, SecondClass, ThirdClass, Female, AgeMissing, Child, Adult, Elderly, IsSolo, IsCouple, IsTriplet, IsGroup, HasChild, HasElderly, NoAges; pairs with no supporting records are shown as "-".]

Text Mining Text mining: The process of extracting useful information from text data. Unstructured data: Data, such as text, audio, or video, that cannot be stored in a traditional structured database. Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis.

Text Mining Terminology Document: A piece of text, which can range from a single sentence to an entire book depending on the scope of the corresponding corpus. Corpus: A collection of documents to be analyzed. Term: The most basic unit of text comprising a document, typically corresponding to a word or word stem.

Bag of Words Bag of words: An approach for processing text into a structured row-column data format in which documents correspond to row observations and words (or more specifically, terms) correspond to column variables.

Preprocessing Text Data for Analysis Tokenization: The process of dividing text into separate terms, referred to as tokens. Term normalization: A set of natural language processing techniques to map text into a standardized form. Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase. Different forms of the same word and synonyms probably should not be considered as distinct terms. Stemming: The process of converting a word to its stem or root word. Stopwords: Common words in a language that are removed in the preprocessing of text.
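The preprocessing steps above can be illustrated with a small pipeline: lowercase the text, strip punctuation while tokenizing, drop stopwords, and stem what remains. Everything here is a simplified stand-in. The stopword list is a short hypothetical sample, and `naive_stem` only strips a few suffixes; real projects would normally use an established stemmer and stopword list.

```python
import re

# Short, hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "it", "my", "no", "of", "the", "was", "with"}


def naive_stem(token):
    """Very crude stemmer: strip a few common suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, normalize, remove stopwords, and stem a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("My flight was delayed 2 hours for no apparent reason."))
```

The order of the steps matters: stopwords are removed before stemming here so that a short word such as "was" is never reduced to an unrelated stem.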

Binary Document-Term Matrix Binary document-term matrix (Presence/absence document-term matrix): A matrix with the rows representing documents (units of text) and the columns representing terms (words or word roots), and the entries in the columns indicating either the presence or absence of a particular term in a particular document (1 = present and 0 = not present).

Binary Document-Term Matrix Example Data for Airline Passenger Concerns

Comments
1. The wi-fi service was horrible. It was slow and cut off several times.
2. My seat was uncomfortable.
3. My flight was delayed 2 hours for no apparent reason.
4. My seat would not recline.
5. The man at the ticket counter was rude. Service was horrible.
6. The flight attendant was rude. Service was bad.
7. My flight was delayed with no explanation.
8. My drink spilled when the guy in front of me reclined his seat.
9. My flight was canceled.
10. The arm rest of my seat was nasty.

Binary Document-Term Matrix Example Binary Document-Term Matrix for Airline Passenger Concerns

Document  Delayed  Flight  Horrible  Recline  Rude  Seat  Service
1         0        0       1         0        0     0     1
2         0        0       0         0        0     1     0
3         1        1       0         0        0     0     0
4         0        0       0         1        0     1     0
5         0        0       1         0        1     0     1
6         0        1       0         0        1     0     1
7         1        1       0         0        0     0     0
8         0        0       0         1        0     1     0
9         0        1       0         0        0     0     0
10        0        0       0         0        0     1     0
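The binary document-term matrix for the airline comments can be rebuilt with a few lines of Python. The sketch uses the ten comments and seven terms from the example. Matching a term when a token starts with it is a crude substitute for stemming (so that "reclined" counts as "recline") and is an assumption of this sketch, not the method used in the presentation.

```python
import re

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]


def binary_dtm(documents, terms):
    """One row per document, one column per term; 1 = term present, 0 = absent."""
    matrix = []
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Prefix match as a rough stand-in for stemming (assumption).
        row = [int(any(tok.startswith(term) for tok in tokens)) for term in terms]
        matrix.append(row)
    return matrix


dtm = binary_dtm(comments, terms)
for doc_id, row in enumerate(dtm, start=1):
    print(doc_id, row)
```

Each row of the result is a binary observation, so the matching and Jaccard measures discussed earlier can be applied to documents directly.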

Binary Document-Term Matrix Example Hierarchical Clustering for Airline Passenger Concerns

Cluster 1: {1, 5, 6} = documents discussing service issues
Cluster 2: {2, 4, 8, 10} = documents discussing seat issues
Cluster 3: {3, 7, 9} = documents discussing schedule issues

Frequency Document-Term Matrix Frequency document-term matrix: A matrix whose rows represent documents (units of text) and columns represent terms (words or word roots), and the entries in the matrix are the number of times each term occurs in each document.
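A frequency document-term matrix differs from the binary version only in that it counts occurrences. A minimal sketch, using two hypothetical review snippets and the standard-library `Counter`:

```python
import re
from collections import Counter


def frequency_dtm(documents, terms):
    """One row per document; each entry is how often the term occurs in it."""
    rows = []
    for doc in documents:
        counts = Counter(re.findall(r"[a-z]+", doc.lower()))
        rows.append([counts[t] for t in terms])
    return rows


# Hypothetical review snippets.
reviews = [
    "Great cast, great stunts, great fun.",
    "Terrible plot and terrible pacing, but a great soundtrack.",
]
terms = ["great", "terrible"]

print(frequency_dtm(reviews, terms))
```

A `Counter` returns 0 for terms that never occur, so absent terms need no special handling.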

Frequency Document-Term Matrix Example Data for Movie Reviews A new action film has been released, and we have a sample of 10 reviews from movie critics. Using preprocessing techniques, we have reduced the number of tokens to only two: great and terrible. Sentiment analysis: The process of clustering/categorizing comments or reviews as positive, negative, or neutral.

Frequency Document-Term Matrix Example Frequency Document-Term Matrix for Movie Reviews

Document  Great  Terrible
1         5      0
2         5      1
3         5      1
4         3      3
5         5      1
6         0      5
7         4      1
8         5      3
9         1      3
10        1      2

Frequency Document-Term Matrix Example k-Means Clustering for Movie Reviews

Frequency Document-Term Matrix Example Distance Measurement for Movie Reviews Cosine distance: A measure of dissimilarity between two observations often used on frequency data derived from text because it is unaffected by the magnitude of the frequency and instead measures differences in frequency patterns. [Figure comparing Review 10 and Review 2.]
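A short sketch of cosine distance, taken here as one minus the cosine of the angle between two frequency vectors (a common definition, assumed rather than stated on the slide). It uses the (Great, Terrible) counts of Review 2 and Review 10 from the movie review matrix, plus a hypothetical scaled copy of Review 10 to show that magnitude does not matter.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1 - dot / (norm_u * norm_v)


review_2 = (5, 1)    # (Great, Terrible) counts from the movie review matrix
review_10 = (1, 2)
review_10_doubled = (2, 4)  # hypothetical: same pattern, twice the length

print(cosine_distance(review_2, review_10))
print(cosine_distance(review_10, review_10_doubled))
```

Review 10 and its doubled copy have essentially zero cosine distance even though their Euclidean distance is not zero, which is why cosine distance suits documents of different lengths.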