
Understanding K-Means Clustering Algorithms
Explore the concepts behind the K-Means, K-Means++, and K-Means Parallel (K-Means||) algorithms, including the process of choosing k centers, updating center matrices, and calculating distances. Learn about seeding, clustering methods, and the iterative steps involved in finding good cluster centers.
Presentation Transcript
K-means++ and K-means Parallel (K-means||), by Jun Wang
Review of K-means: it is simple and fast. Choose k centers randomly, assign each point to its nearest center, update the centers, and repeat the last two steps until the centers stop moving.
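A minimal sketch of this loop in Python with NumPy (the function name, signature, and convergence check are illustrative, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-means (Lloyd's algorithm) with random seeding."""
    rng = np.random.default_rng(rng)
    # Choose k centers randomly from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```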
K-means++: you are probably already using it. It spends more time on choosing the k centers (seeding) in order to save time on clustering.
K-means++ algorithm. Seeding: choose one center from X at random; then, k-1 times, sample one new center from X with probability p_i and update the center matrix. Clustering: proceed as in plain K-means.
Calculating p_i: let D = d_1^2 + d_2^2 + ... + d_n^2 and p_i = d_i^2 / D, so the p_i sum to 1. Points further away from the already-chosen (red) center have a better chance of being chosen.
Keep doing the following until k centers are found: update the center matrix, recalculate the d_i^2, and recalculate the p_i.
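A sketch of the full K-means++ seeding loop described above, again in NumPy with illustrative names; the clustering phase afterwards is the plain K-means loop sketched earlier:

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=None):
    """K-means++ seeding: sample each new center with probability p_i = d_i^2 / D."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]             # first center: uniform random point of X
    for _ in range(k - 1):
        # d_i^2: squared distance from each point to its nearest already-chosen center.
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        D = d2.sum()                           # D = d_1^2 + d_2^2 + ... + d_n^2
        p = d2 / D                             # the p_i sum to 1
        centers.append(X[rng.choice(n, p=p)])  # far-away points are more likely
    return np.asarray(centers)
```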
K-means|| algorithm. Seeding: choose a small subset C from X, assign a weight to each point in C, and cluster C to get the k centers. Clustering: proceed as in plain K-means.
Choosing the subset C from X: let D = sum of squared distances = d_1^2 + d_2^2 + ... + d_n^2, and let L be some f(k), e.g. 0.2k or 1.5k. For about ln(D) rounds, pick each point in X with Bernoulli probability P(chosen) = L*d_i^2/D and add the picked points to C.
How many points end up in C? There are ln(D) rounds, and in each round the expected number of picked points is 1*p_1 + 1*p_2 + ... + 1*p_n = L, so C contains about ln(D)*L points in total.
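A sketch of this oversampling step under the same NumPy assumptions; the min(1, ·) clamp on the Bernoulli probability is an addition of mine to keep it a valid probability, and the round count is a rough ln(D):

```python
import numpy as np

def kmeans_parallel_seeding(X, k, L=None, rng=None):
    """K-means|| oversampling (sketch): build the candidate subset C from X.

    L is the oversampling factor f(k), e.g. 0.2*k, 1.5*k, or 2*k."""
    rng = np.random.default_rng(rng)
    L = 2 * k if L is None else L
    n = len(X)
    C = X[rng.integers(n)][None, :]                 # start C with one random point
    d2 = ((X - C[0]) ** 2).sum(axis=1)              # d_i^2 to the current C
    D = d2.sum()                                    # D = sum of squared distances
    for _ in range(max(1, int(round(np.log(D))))):  # roughly ln(D) rounds
        p = np.minimum(1.0, L * d2 / D)             # Bernoulli P(chosen) = L*d_i^2/D
        picked = rng.random(n) < p                  # about L points picked per round
        if picked.any():
            C = np.vstack([C, X[picked]])
            # Refresh d_i^2 against the enlarged C, then refresh D.
            new_d2 = ((X[:, None, :] - X[picked][None, :, :]) ** 2).sum(axis=2)
            d2 = np.minimum(d2, new_d2.min(axis=1))
            D = d2.sum()
    return C                                        # roughly ln(D)*L candidate points
```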
Clustering the subset C: the red points in the figure are the points of subset C.
Clustering the sample C: calculate the distances from point A to the other points in C and find the smallest one; in this case it is d_c1.
Clustering the sample C: calculate the distances between point A and all points in X, giving the d_xi.
Clustering the sample C: compare each d_xi to d_c1 and let W_A = the number of points with d_xi < d_c1. Doing this for every point in C gives the weights W. Cluster C with these weights into k clusters to get the k centers.
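A sketch of this weighting step, following the counting rule exactly as described in the slides (function name is illustrative); the weighted candidate set C is then clustered into k centers:

```python
import numpy as np

def candidate_weights(X, C):
    """Weights for the candidate set C, using the slides' counting rule:
    W_A = number of points x_i in X with d_xi < d_c1, where d_c1 is the
    distance from A to its nearest other candidate in C."""
    dCC = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    np.fill_diagonal(dCC, np.inf)            # ignore each candidate's distance to itself
    d_c1 = dCC.min(axis=1)                   # nearest-other-candidate distance, per A
    dCX = np.linalg.norm(C[:, None, :] - X[None, :, :], axis=2)
    W = (dCX < d_c1[:, None]).sum(axis=1)    # count of d_xi < d_c1 for each A
    return W                                 # cluster C with these weights to get k centers
```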
Difference among the three methods (seeding step): K-means chooses the k centers randomly; K-means++ chooses the k centers proportionally to the squared distances; K-means|| chooses a subset C and gets the k centers from C. The clustering step is the same iterative assignment/update in all three.
Hypothesis:
Seeding: K-means (choose k centers randomly) is fast; K-means++ (choose k centers proportionally) is slow; K-means|| (choose subset C and get k centers from C) is slower.
Clustering: K-means is slow; K-means++ is fast; K-means|| is faster.
Test the hypothesis on four data sets: toy data one (very small), cloud data (small), spam data (moderate), and toy data two (very large).
Simple data set: N=100, d=2, k=2, 100 iterations. [Scatter plot of the simple data set]
Execution time (seconds): [bar chart comparing K-means, K-means++, and K-means||; labelled values 0.364668, 0.076458, and 0.067244]
Cloud data: 1024 points in 10 dimensions, k=6. [Scatter plot of the cloud data]
[Three scatter plots of the cloud data]
Execution time (in seconds): [bar chart comparing K-means, K-means++, and K-means||]
Total scatter: [bar chart comparing K-means, K-means++, and K-means||]
The Spambase data set represents features available to an e-mail spam detection system; it consists of 4601 points in 58 dimensions, with k=10.
Execution time (seconds): [bar chart comparing K-means, K-means++, and K-means||]
Total scatter: [bar chart comparing K-means, K-means++, and K-means||]
Complicated data set: N=20,000, d=40, k=40. [Scatter plot of the complicated data set]
Execution time (seconds): [bar chart comparing K-means, K-means++, and K-means||]
Clustered figure with the true labels. [Scatter plot]
Clustered figure with the computed labels. [Scatter plot]
Summary:
Small data: K-means is fast, K-means++ is very fast, K-means|| is slow.
Moderate and large data: K-means is slow, K-means++ is very slow, K-means|| is fast.
Selecting L: the choice does not matter much when the data set is small, so it was tried on the large data set with L=0.2k, L=1k, L=1.5k, and L=2k. [Bar chart comparing the four settings]