Characterizing Content of Clusters: Methods and Insights
In clustering projects, understanding the contents of the different clusters is crucial for gaining insight. Two main methods are discussed: fitting a decision tree model that predicts cluster labels and deriving rules from it, and comparing boxplots of attributes within clusters against boxplots of the whole dataset to identify unique cluster characteristics.
Concerning Task 4 (Christoph F. Eick)
Characterizing the Content of Clusters
In every clustering project you want to characterize what kind of objects specific clusters contain, to better understand the different groupings a clustering algorithm produced. This is actually a classification task rather than a clustering task, but it is important in almost any clustering project. Different approaches exist to characterize and distinguish the content of different clusters; here we discuss only two:
1. Fit a not-too-complex decision tree model that predicts the cluster labels for the dataset, then create rules from the decision tree. Rules should have a confidence above 95% when predicting cluster memberships.
2. Create boxplots for each attribute, both for the different clusters and for the whole dataset. Compare cluster boxplots with dataset boxplots, trying to identify cluster boxplots that are quite different from the dataset boxplot. Also compare the boxplots of different clusters, trying to identify pairs of boxplots that are quite different.
Use the findings of these two steps to identify the unique characteristics of each cluster that distinguish it from the other clusters. Not all clusters might have such unique characteristics.
Decision Tree Approach
[Figure: a decision tree whose root tests A>3; the "no" branch tests C>2 and the "yes" branch tests B>4, with leaves labeled with clusters C1, C2, and C3.]
Cluster membership rules derived from the tree:
If not(A>3) and not(C>2) then cluster C3
If not(A>3) and C>2 then cluster C1
If A>3 and B>4 then cluster C1
If A>3 and not(B>4) then cluster C2
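The decision-tree approach can be sketched in a few lines. The snippet below is an illustrative sketch only: the toy data, the attribute names a and b, and the use of scikit-learn's DecisionTreeClassifier are my assumptions, not taken from the slides.

```python
# Hypothetical sketch: fit a shallow decision tree that predicts the
# cluster labels, then read candidate rules off the tree's paths.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up 2-attribute dataset with cluster labels, e.g. produced by k-means.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8],
              [0.8, 0.9], [0.1, 0.9], [0.2, 0.8]])
cluster_labels = np.array([0, 0, 1, 1, 2, 2])

# Keep the tree "not too complex" (low depth) so the rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, cluster_labels)

# Each root-to-leaf path is a candidate cluster-membership rule.
print(export_text(tree, feature_names=["a", "b"]))
```

In practice one would then compute each rule's confidence on the dataset and keep only rules above the chosen threshold (e.g. 95%).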
Task X: Characterizing 5 Clusters
Cluster 1: a>0.65 AND b>0.6
Cluster 2: d>0.35
Cluster 3: f>0.38
Cluster 4: no interesting observation
Cluster 5: a<0.44 AND b<0.45
[Table: per-cluster properties derived from decision-tree rules; the rule expressions (threshold conditions on attributes combined with AND/OR) were garbled in transcription. No rule was found for cluster 4.]
Remark: As we use k-means, almost everybody should have obtained different clusters and summaries.
Boxplot Approaches: Comparing Dataset Boxplots with Cluster Boxplots, Centering on the Box Position
Case 1: [Figure: cluster boxplot and dataset boxplot for attribute A, with the boxes in similar positions.]
Conclusion: Attribute A is not useful to identify unique characteristics of a particular cluster.
Case 2: [Figure: cluster boxplot and dataset boxplot for attribute A, with the cluster box shifted upward.]
Conclusion: Attribute A is useful to identify unique characteristics of a particular cluster; comparing the two boxplots indicates that the cluster contains high values of attribute A.
Case 3: [Figure: cluster boxplot and dataset boxplot for attribute A.]
Conclusion: The cluster seems to contain a lot of objects with medium values for attribute A.
Case 4: [Figure: cluster boxplot and dataset boxplot for attribute A.]
Conclusion: The cluster contains fewer medium values for attribute A, in comparison with the whole dataset.
Comparing Boxes in Boxplots of Attributes Between Different Clusters
Try to find boxes in the attribute boxplots of a particular cluster that are significantly different from the boxes in the boxplots of the other clusters. Goal: find rules/logical statements which uniquely describe the content of a particular cluster. If you find such boxes, use them to describe the unique characteristics of the particular cluster, e.g. "Cluster 5 contains very high values of attribute B and very low values of attribute D that are not common in other clusters", or, more quantitatively, "Cluster 5 contains values of attribute B above 0.5 and values of attribute D below -0.7 that are not common in other clusters". However, this analysis sometimes has to be made more sophisticated in the case that an attribute boxplot is significantly different from the boxplots of some clusters but not all clusters.
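The visual comparison of a cluster's box against the dataset's box can also be automated. The helper below is a minimal sketch of one possible criterion (not from the slides): flag an attribute when the cluster's box (Q1..Q3) lies entirely above or below the dataset's median, as in Case 2. The function name and the toy values are made up for illustration.

```python
# Minimal sketch: where does the cluster's box sit relative to the
# dataset's boxplot for one attribute?
import numpy as np

def box_shift(cluster_values, dataset_values):
    """Return 'high', 'low', or None for one attribute."""
    c_q1, c_q3 = np.percentile(cluster_values, [25, 75])
    d_median = np.median(dataset_values)
    if c_q1 > d_median:
        return "high"   # cluster contains unusually high values (Case 2)
    if c_q3 < d_median:
        return "low"    # cluster contains unusually low values
    return None         # box overlaps the dataset's center: not useful

dataset = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
cluster = np.array([0.7, 0.8, 0.9])     # hypothetical cluster values
print(box_shift(cluster, dataset))      # -> high
```

Running this per attribute and per cluster yields a shortlist of candidate attributes whose boxplots are then inspected by eye.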
Using Purity for External Cluster Evaluation
Assume the ground truth consists of 3 classes C1, C2, and C3, and we have a 4-cluster clustering X:
Cluster 0 (outliers): C3, C3
Cluster 1: C1, C2, C1
Cluster 2: C2, C3
Cluster 3: C1, C3, C3, C3
We compute the overall purity of the clustering as: (majority-class examples) / (total number of examples in clusters). Remark: that is, we exclude outliers from the purity computation!
Purity(X) = (2+1+3)/9 = 0.67
Outlier(X) = 2/11 = 0.18
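The purity computation above is easy to state in code. The sketch below follows the slide's convention (cluster 0 holds the outliers, which are excluded from purity but reported as a separate rate); the dictionary representation is my own choice for illustration.

```python
# Purity with outlier handling, as on the slide: cluster 0 is the
# outlier cluster and is excluded from the purity computation itself.
from collections import Counter

def purity_and_outliers(clustering):
    """clustering maps cluster id -> list of ground-truth class labels."""
    total = sum(len(v) for v in clustering.values())
    outliers = len(clustering.get(0, []))
    # For each non-outlier cluster, count its majority class.
    majority = sum(max(Counter(v).values())
                   for cid, v in clustering.items() if cid != 0 and v)
    return majority / (total - outliers), outliers / total

# The example clustering X from the slide:
X = {1: ["C1", "C2", "C1"], 2: ["C2", "C3"],
     3: ["C1", "C3", "C3", "C3"], 0: ["C3", "C3"]}
purity, outlier_rate = purity_and_outliers(X)
print(round(purity, 2), round(outlier_rate, 2))   # -> 0.67 0.18
```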
Student Result for Complex9 [figure]
Ideal Complex9 Clustering [figure]
K-means for Complex8 [figure]: in general, the turquoise and the pink clusters are bad, whereas the brown and green clusters are okay.
Optimal Complex8 DBSCAN Clustering [figure]
Optimal DBSCAN Clustering for Complex8
For the complex8 dataset, the best results are as follows:
Purity = 1
Outliers = 0.4704038%
Number of clusters = 19 (20, if we include cluster 0, the outliers)
Eps = 12.8
MinPts = 3
Remark: 3 students found purity-100% clusterings (one extra point for that; results still need to be verified).
[Table: contingency table of the 20 clusters versus the ground-truth classes; garbled in transcription, but consistent with each non-outlier cluster containing objects of a single class.]
Observations from a Similar Project 8 Years Ago
Assuming purity is used as the evaluation measure:
DBSCAN outperformed k-means quite significantly on the Complex8 dataset, as k-means was not able to detect the natural clusters; on the other hand, for the Yeast dataset k-means obtained better results than DBSCAN. In general, DBSCAN seems to either create one very big cluster or obtain a clustering with a lot of outliers, and it seemed very difficult (or even impossible) to obtain solutions that lie between these extremes.
A lot of students failed to clearly observe that k-means fails to identify the natural clusters in the Complex8 dataset.
For the purity function, some code ignored the assumption that outliers are in cluster zero and obtained incorrect results, e.g. by including the objects in cluster 0 in purity computations of DBSCAN results, or by excluding cluster 1 when computing purity for k-means clusterings.
For task d the main goal was to characterize the objects in clusters 1-5; a lot of students did not put enough focus on this task, e.g. they provided a general analysis of boxplots rather than analyzing the boxplots with respect to separating the 5 clusters and with respect to differences between the distribution in a particular cluster and the distribution in the dataset.
About 35% of the students provided quite sophisticated search procedures to find good DBSCAN parameter settings; unfortunately, I had a very hard time understanding most of the chosen approaches, due to a lack of explanation and of examples illustrating the approach.
There were quite dramatic differences with respect to the amount of work and the quality of the approaches/solutions obtained for Tasks 4 and 6. Overall, some really good work was done by some students for Tasks 4 and/or 6 (score = 9 or higher).
Challenges for devising a search procedure include:
Finding a range of parameter values for which DBSCAN creates at least acceptable results.
Deciding how to search for good solutions within that range.
Another observation: if we maximize purity, using a large number of clusters might be beneficial for obtaining better results; however, how to embed this knowledge into the search procedure is a challenge.
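One possible shape for such a search procedure is a plain grid search over eps and MinPts. The sketch below is under explicit assumptions that are not from the slides: it uses scikit-learn's DBSCAN (which labels outliers -1 rather than cluster 0), a made-up two-blob toy dataset, and an ad-hoc score of purity minus outlier rate to penalize solutions that declare everything an outlier.

```python
# Hedged sketch: grid-search DBSCAN parameters, scoring each clustering
# by purity (outliers excluded) minus the outlier rate.
import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN

def purity(labels, truth):
    mask = labels != -1                  # drop outliers, as on the slides
    if not mask.any():
        return 0.0
    majority = sum(max(Counter(truth[labels == c]).values())
                   for c in set(labels[mask]))
    return majority / mask.sum()

def search(X, truth, eps_grid, minpts_grid):
    best_params, best_score = None, -1.0
    for eps in eps_grid:
        for minpts in minpts_grid:
            labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
            outlier_rate = np.mean(labels == -1)
            score = purity(labels, truth) - outlier_rate  # ad-hoc trade-off
            if score > best_score:
                best_params, best_score = (eps, minpts), score
    return best_params, best_score

# Two well-separated toy blobs (made up for illustration):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
truth = np.array([0] * 20 + [1] * 20)
print(search(X, truth, eps_grid=[0.1, 0.5, 1.0], minpts_grid=[3, 5]))
```

Note that maximizing purity alone would favor many tiny clusters; the outlier-rate penalty here is only one crude way to counteract that, and embedding such knowledge properly remains the open challenge mentioned above.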