
CLUTO Clustering Toolkit by G. Karypis, presented by Andrea Tagarelli
Explore CLUTO, the clustering toolkit developed by G. Karypis (UMN), in this presentation by Andrea Tagarelli (Univ. of Calabria). CLUTO is designed for clustering large, high-dimensional, sparse datasets. It offers several clustering algorithms, options for choosing the clustering criterion function to optimize, and tools for identifying the features that characterize each cluster and for visualizing the result. Learn how to use the vcluster and scluster programs to examine the relations between clusters, objects, and features.
Presentation Transcript
CLUTO Clustering Toolkit by G. Karypis, UMN
Andrea Tagarelli, Univ. of Calabria, Italy

What is CLUTO?
- CLUstering TOolkit for very large, high-dimensional, and sparse datasets
- Download: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
- Main characteristics
  - Seeks to optimize a particular clustering criterion function
  - Identifies the features that best describe and discriminate each cluster
  - Allows visual examination of the relations between clusters, objects, and features
  - Handles sparsity and requires memory roughly linear in the input size
- Analysis goals
  - To understand the relations between the objects assigned to each cluster and the relations between the different clusters
  - To visualize the discovered clustering solution
- Distributions
  - Stand-alone programs (vcluster and scluster)
  - Library through which an application program can access the CLUTO algorithms

Clustering algorithms
- Programs
  - vcluster: takes as input a multidimensional representation of the objects to be clustered
  - scluster: takes as input the object similarity graph
- Parameter: -clmethod=string
  - Partitional
    - Direct k-way clustering (direct)
    - Bisecting k-way clustering (rb, rbr)
  - Agglomerative hierarchical (agglo)
  - Partitional-based agglomerative hierarchical (bagglo)
  - Graph-partitioning-based (graph)

Usage
- vcluster [optional parameters] MatrixFile NClusters
- scluster [optional parameters] GraphFile NClusters
  - MatrixFile: the file that stores the objects to be clustered
  - GraphFile: the file that stores the adjacency matrix of the object similarity graph
  - NClusters: the number of clusters
- Optional parameters
  - Specified using -paramname or -paramname=value
  - Categorized into three groups
    1. Control various aspects of the clustering algorithm
    2. Control the type of analysis and reporting performed on the computed clusters
    3. Control the visualization of the clusters
- Output: the clustering solution is stored in a file named File.clustering.NClusters, where File is the input MatrixFile or GraphFile

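As a rough illustration of the usage line above, the following Python sketch assembles a vcluster command line and derives the name of the solution file. The input name docs.mat and both helper names are hypothetical, and the sketch only builds the argument list; it does not run the program.

```python
def build_vcluster_command(matrix_file, n_clusters, clmethod="rb", extra_options=()):
    """Return the argument list for a vcluster run (nothing is executed here)."""
    if n_clusters < 1:
        raise ValueError("NClusters must be a positive integer")
    command = ["vcluster", f"-clmethod={clmethod}"]
    # Further optional parameters, e.g. "-showfeatures" or "-rclassfile=docs.rclass"
    command.extend(extra_options)
    # Positional arguments come last: MatrixFile NClusters
    command.extend([str(matrix_file), str(n_clusters)])
    return command


def expected_solution_file(input_file, n_clusters):
    """Name of the clustering solution file: File.clustering.NClusters."""
    return f"{input_file}.clustering.{n_clusters}"


# Hypothetical input file and settings, for illustration only.
cmd = build_vcluster_command("docs.mat", 10, clmethod="rbr",
                             extra_options=["-showfeatures"])
print(" ".join(cmd))
print(expected_solution_file("docs.mat", 10))
```

The returned list could be handed to subprocess.run on a machine where vcluster is installed and on the PATH.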
Input file format: matrix file
- Plain text with n+1 lines storing the data matrix for n m-dimensional objects
- Dense format
  - Metadata (in the first line): #rows, #columns
  - Each remaining line contains space-separated float values
- Sparse format
  - Metadata (in the first line): #rows, #columns, #nonzero entries

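The dense matrix format described above can be sketched in Python as a small writer and reader. The function names and the 3 x 2 example data are hypothetical, and only the dense layout (a header line followed by one row per line) is covered.

```python
import os
import tempfile


def write_dense_matrix(path, rows):
    """Write rows (equal-length lists of numbers) in the dense matrix format."""
    n_cols = len(rows[0]) if rows else 0
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of columns")
    with open(path, "w") as handle:
        handle.write(f"{len(rows)} {n_cols}\n")  # metadata line: #rows #columns
        for row in rows:
            handle.write(" ".join(repr(float(v)) for v in row) + "\n")


def read_dense_matrix(path):
    """Read a dense matrix file and check the body against the header."""
    with open(path) as handle:
        n_rows, n_cols = (int(tok) for tok in handle.readline().split())
        rows = [[float(tok) for tok in line.split()] for line in handle if line.strip()]
    if len(rows) != n_rows or any(len(row) != n_cols for row in rows):
        raise ValueError("matrix body does not match the header line")
    return rows


# Hypothetical data: n = 3 objects, m = 2 dimensions.
example = [[1.0, 0.5], [0.0, 2.0], [3.5, 0.0]]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.mat")
    write_dense_matrix(path, example)
    with open(path) as handle:
        file_text = handle.read()
    round_trip = read_dense_matrix(path)
print(file_text, end="")
```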
Input file format: graph file
- Plain text with n+1 lines storing the adjacency matrix of the graph that specifies the similarity between the n objects
- Dense format
  - Metadata (in the first line): #vertices (n)
  - Each of the remaining n lines stores n space-separated floating-point values, such that the ith value corresponds to the similarity to the ith vertex of the graph
- Sparse format
  - Metadata (in the first line): #vertices (n) and #edges
  - Each of the remaining n lines contains, for each adjacent vertex, the index of that vertex followed by the similarity of the corresponding edge

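To make the dense graph format concrete, here is a minimal Python sketch that turns a few object vectors into the text of a dense graph file. Cosine similarity is an assumption chosen for illustration (the format itself does not prescribe a similarity measure), and the function names and example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_graph_text(vectors):
    """Return the dense graph file content: n, then n lines of n similarities."""
    n = len(vectors)
    lines = [str(n)]  # metadata line: #vertices
    for i in range(n):
        sims = [cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        lines.append(" ".join(f"{s:.4f}" for s in sims))
    return "\n".join(lines) + "\n"


# Hypothetical objects, for illustration only.
objects = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
graph_text = dense_graph_text(objects)
print(graph_text, end="")
```

A file with this content would be the kind of input scluster expects, whereas vcluster would take the object vectors themselves as a matrix file.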
Input file format: labels
- Row label file (-rlabelfile param): stores the label for each of the rows of the matrix (objects)
- Column label file (-clabelfile param): stores the label for each of the columns of the matrix (attributes)
- Row class label file (-rclassfile param): stores the class label for each of the rows of the matrix (objects)

Output file format
- Clustering solution file
  - n lines, with a single number per line
  - The ith line contains the cluster number that the ith object/row/vertex belongs to
  - Cluster numbers run from zero to the number of clusters minus one
  - If -zscores is specified, each line contains two additional numbers right after the cluster number: the internal z-score and the external z-score
- Tree file
  - Produced by performing AHC (agglomerative hierarchical clustering) on top of a k-way clustering solution
  - Stored in the form of a parent array: 2k-1 lines, such that the ith line contains the parent of the ith node of the tree
  - For the root node, which is stored in the last line of the file, the parent is set to -1

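A short Python sketch of how these two output files might be read. The sample contents are invented for illustration, object and node ids are taken to be 0-based (an assumption), and the root is identified by its position on the last line rather than by the value stored there.

```python
from collections import defaultdict


def parse_clustering(text):
    """Cluster number per object; extra columns such as z-scores are ignored."""
    assignments = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            assignments.append(int(fields[0]))
    return assignments


def group_by_cluster(assignments):
    """Map each cluster number to the list of object indices assigned to it."""
    clusters = defaultdict(list)
    for obj, cluster in enumerate(assignments):
        clusters[cluster].append(obj)
    return dict(clusters)


def children_from_parent_array(parents):
    """Invert a parent array; the last entry belongs to the root and is skipped."""
    children = defaultdict(list)
    for node, parent in enumerate(parents[:-1]):
        children[parent].append(node)
    return dict(children)


# Hypothetical solution file for 4 objects and 3 clusters, with z-score columns.
solution_text = "1 0.91 -0.40\n0 1.20 0.15\n1 0.35 0.80\n2 0.00 0.00\n"
assignments = parse_clustering(solution_text)
print(group_by_cluster(assignments))

# Hypothetical tree file for k = 3: 2k-1 = 5 nodes; the last line is the root,
# and its parent marker is not inspected by this sketch.
tree_parents = [3, 3, 4, 4, -1]
print(children_from_parent_array(tree_parents))
```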
Output example
- Matrix/graph information
- Settings
- Clustering/cluster quality statistics
- Timing information

External clustering quality
- Comparison with a reference classification (via -rclassfile)
- Overall entropy and purity
- For each cluster
  - Local entropy and purity
  - Object distribution over the classes

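Entropy and purity against a reference classification can be sketched as follows. The definitions used here are the common ones and are stated as assumptions: the entropy of a cluster is computed over its class distribution and normalized by the logarithm of the number of classes, the purity is the fraction of the cluster taken by its largest class, and the overall values are size-weighted averages. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_quality(clusters, classes):
    """Return (overall entropy, overall purity, {cluster: (size, entropy, purity)})."""
    n = len(clusters)
    n_classes = len(set(classes))
    members = {}
    for cluster, label in zip(clusters, classes):
        members.setdefault(cluster, Counter())[label] += 1

    report = {}
    overall_entropy = 0.0
    overall_purity = 0.0
    for cluster, counts in sorted(members.items()):
        size = sum(counts.values())
        probs = [count / size for count in counts.values()]
        entropy = sum(p * math.log(1.0 / p) for p in probs)
        if n_classes > 1:
            entropy /= math.log(n_classes)  # assumed normalization to [0, 1]
        purity = max(counts.values()) / size
        report[cluster] = (size, entropy, purity)
        overall_entropy += size / n * entropy  # size-weighted averages
        overall_purity += size / n * purity
    return overall_entropy, overall_purity, report


# Toy example: 6 objects, 2 clusters, 2 reference classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "b"]
entropy, purity, per_cluster = cluster_quality(cluster_ids, class_labels)
print(f"entropy={entropy:.4f} purity={purity:.4f}")
for cluster, (size, e, p) in per_cluster.items():
    print(cluster, size, f"{e:.4f}", f"{p:.4f}")
```

Lower entropy and higher purity both indicate clusters that agree more closely with the reference classes.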
Cluster description
- Determine the best set of descriptive and discriminating features for each cluster (via -showfeatures)
- For each cluster
  - Top-L most descriptive features, with the % of the within-cluster similarity
  - Top-L most discriminating features, with the % of the dissimilarity between the cluster and the rest of the objects

Cluster tree (1/2)
- Via -showtree
- Displayed in a rotated fashion: the first column is the root, and the tree grows from left to right
- The leaves (the clusters) are numbered from 0 to NClusters-1, and the internal nodes from NClusters to 2*NClusters-2
- If -rclassfile is specified: prints information about how the objects of the various classes are distributed in each cluster

Cluster tree (2/2)
- Via -showtree and -labeltree
- Further statistics on each of the clusters
  - Size
  - Isim
  - Xsim: average similarity between the objects of each pair of clusters that are children of the same node of the tree
  - Gain: change in the value of a particular clustering criterion function
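The Xsim statistic above can be illustrated with a small Python sketch that averages the similarities between the objects of two sibling clusters. For contrast, the sketch also computes an internal similarity, taken here to be the average over distinct pairs within one cluster; that reading of Isim, the exclusion of self-similarities, the function names, and the similarity matrix are all assumptions made for illustration.

```python
from itertools import combinations, product


def internal_similarity(sim, members):
    """Average similarity over distinct pairs of objects inside one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        raise ValueError("need at least two objects to average pairwise similarity")
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def external_similarity(sim, left, right):
    """Average similarity between the objects of two (sibling) clusters."""
    pairs = list(product(left, right))
    if not pairs:
        raise ValueError("both clusters must be non-empty")
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


# Hypothetical symmetric similarity matrix for 4 objects.
sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
left, right = [0, 1], [2, 3]  # two clusters that are children of the same tree node
isim_left = internal_similarity(sim, left)
isim_right = internal_similarity(sim, right)
xsim = external_similarity(sim, left, right)
print(f"{isim_left:.2f} {isim_right:.2f} {xsim:.2f}")
```

When two sibling clusters have a cross-similarity close to their internal similarities, merging them loses little, which is one way to read the tree statistics.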