
K-Means Clustering Methods for Document Analysis
"Explore the K-Means clustering method used by IN-SPIRE for document clustering, where observations are partitioned into clusters based on nearest means. Learn about the iterative refinement technique, cluster variations, and the importance of running the algorithm multiple times for better results."
Presentation Transcript
k-means clustering
Method IN-SPIRE uses to cluster documents
OPA_A#956_Mar-09-2016 Office of Portfolio Analysis
K-means clustering
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Method
Uses an iterative refinement technique: given an initial, randomly assigned set of k means, the algorithm proceeds by alternating between two steps.
Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS).
Update step: Calculate the new means to be the centroids of the observations in the new clusters.
The algorithm has converged when the assignments no longer change.
As it is a heuristic* algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters, so it is common to run k-means clustering multiple times to identify variability.
* The objective of a heuristic is to produce a solution in a reasonable time frame that is good enough for solving the problem at hand. This solution may not be the best of all the actual solutions to this problem, or it may simply approximate the exact solution. But it is still valuable because finding it does not require a prohibitively long time.
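The two alternating steps can be sketched in a few lines of Python. This is a minimal illustration, not IN-SPIRE's implementation: the function names (`k_means`, `wcss_per_seed`), the choice of k randomly picked observations as the initial means, the `max_iter` cap, and the sample points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Hypothetical helper: returns (assignments, means, wcss)."""
    rng = random.Random(seed)
    # Assumption: the initial means are k observations chosen at random.
    means = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        # Converged when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each mean becomes the centroid of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = mean_point(members)
    wcss = sum(squared_distance(p, means[a]) for p, a in zip(points, assignments))
    return assignments, means, wcss


def wcss_per_seed(points, k, seeds):
    """Run k-means once per seed to see how much the result varies."""
    return {seed: k_means(points, k, seed=seed)[2] for seed in seeds}


# Illustrative data only: two small, well-separated groups of 2-D points.
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for seed, wcss in wcss_per_seed(sample_points, k=2, seeds=range(3)).items():
    print(seed, wcss)
```

On data with less obvious structure, the WCSS values and the assignments would typically differ between seeds, which is the variability the slide refers to.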
IN-SPIRE clustering
Example of cluster variation as a result of k-means clustering starting from different points.
Clusters merge in 3 and 4
Clusters merge in 2, 3 and 4
Cluster labels change
Cluster centroids move
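One way to tell a run in which only the labels changed from a run in which clusters actually merged is to compare how the two runs group the documents, ignoring what each cluster is called. The sketch below assumes each run is available as a list with one cluster identifier per document; `same_partition` and `rand_index` are hypothetical helper names, the example runs are made-up data, and none of this is something IN-SPIRE is known to expose.

```python
from itertools import combinations


def same_partition(labels_a, labels_b):
    """True when both runs group the items identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same items")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two runs agree (together vs. apart)."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two runs over the same items, at least two items")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Made-up runs over six documents.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same grouping, labels renamed
run_3 = [0, 0, 1, 1, 1, 1]  # two of the clusters have merged

print(same_partition(run_1, run_2), rand_index(run_1, run_2))
print(same_partition(run_1, run_3), rand_index(run_1, run_3))
```

A Rand index of 1.0 means the two runs differ at most in labelling; lower values indicate that documents were grouped differently.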
IN-SPIRE clustering
Option 3
Separate screens that can be toggled through, following a general statement that re-running the clustering gives slightly different results. There are 4 different clusterings; number 5 is the same as 4, but the labelling changes slightly. Could even make it a video.