Practical K-Means Clustering with Image-based Visualization


Explore K-Means Clustering through practical examples using R programming. Visualize clustering results with scatter plots and understand the clustering process step by step. Improve your understanding of clustering algorithms with real-world data analysis.

  • K-Means Clustering
  • Data Visualization
  • R Programming
  • Clustering Algorithms
  • Data Analysis




Presentation Transcript


  1. Clustering Practice (atjahyanto@gmail.com)

  2. K-Means Clustering: 4 points

     R code:

       datax.df <- read.csv("dataset-abcd.csv")
       plot(datax.df$pHindex ~ datax.df$Weight, xlab = "Weight", ylab = "pHindex",
            xlim = c(0, 6))
       text(x = datax.df$Weight, y = datax.df$pHindex, labels = datax.df$Titik,
            pos = 4, cex = 0.8, srt = 20, offset = 0.2)

       # alternative with ggplot
       library(ggplot2)
       ggplot(datax.df, aes(y = pHindex, x = Weight)) +
         geom_point() +
         geom_text(aes(label = paste(" ", Titik)), size = 4, hjust = 0.0, angle = 15) +
         ylim(0, 6) + xlim(0, 6)

       # k-means
       x <- datax.df[,-1]
       centers <- 2
       km <- kmeans(x, centers, iter.max = 10, nstart = 1)
       km

     Data (dataset-abcd.csv):

       Titik,Weight,pHindex
       A,1,1
       B,2,1
       C,4,3
       D,5,4

     Output:

       > km
       K-means clustering with 2 clusters of sizes 2, 2

       Cluster means:
         Weight pHindex
       1    4.5     3.5
       2    1.5     1.0

       Clustering vector:
       [1] 2 2 1 1

       Within cluster sum of squares by cluster:
       [1] 1.0 0.5
        (between_SS / total_SS = 91.0 %)

     The two within-cluster sums of squares are the per-cluster SSE values (SSE1 and SSE2).
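The numbers in the km output above can be checked by hand. The deck's code is in R; as a language-neutral check, here is a minimal Python sketch that recomputes the cluster means, the within-cluster sums of squares, and the between_SS / total_SS ratio from the final assignment {C, D} / {A, B} reported by kmeans():

```python
# the four points from dataset-abcd.csv (Weight, pHindex)
pts = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# final clusters in km's order (sizes 2, 2)
clusters = [["C", "D"], ["A", "B"]]

def mean(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# centroid of each cluster, and within-cluster sum of squares
centers = [mean([pts[n] for n in c]) for c in clusters]
wss = [sum(sq_dist(pts[n], ctr) for n in c) for c, ctr in zip(clusters, centers)]

# between_SS / total_SS, with total_SS taken around the grand mean
grand = mean(list(pts.values()))
total_ss = sum(sq_dist(p, grand) for p in pts.values())
between_pct = 100 * (total_ss - sum(wss)) / total_ss

print(centers)                # [(4.5, 3.5), (1.5, 1.0)]
print(wss)                    # [1.0, 0.5]
print(round(between_pct, 1))  # 91.0
```

All three values match the km output.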

  3. K-Means Clustering: 4 points (worked iterations)

     Pairwise Euclidean distance matrix of the four points:

                1         2         3         4
       1 0.000000  1.000000  3.605551  5.000000
       2 1.000000  0.000000  2.828427  4.242641
       3 3.605551  2.828427  0.000000  1.414214
       4 5.000000  4.242641  1.414214  0.000000

     Calculation, iteration by iteration:

     Iteration 1 (initial centroids = points 1 and 2, i.e. (1, 1) and (2, 1)):
       distance to centroid 1: 0.000000  1.000000  3.605551  5.000000
       distance to centroid 2: 1.000000  0.000000  2.828427  4.242641
       new centroids: (1, 1) and (3.666667, 2.666667)

     Iteration 2:
       distance to centroid 1: 0.000000  1.000000  3.605551  5.000000
       distance to centroid 2: 3.144660  2.357023  0.471405  1.885618
       new centroids: (1.5, 1.0) and (4.5, 3.5)

     Iteration 3:
       distance to centroid 1: 0.500000  0.500000  3.201562  4.609772
       distance to centroid 2: 4.301163  3.535534  0.707107  0.707107
       centroids unchanged: (1.5, 1.0) and (4.5, 3.5), so the algorithm has converged

     Per-point SSE terms printed during the run (cluster: squared terms, then SSE1 + SSE2):

       Iteration 1:
         [1] "1: (1 - 1)^2 (1 - 1)^2 "
         [1] "2: (2 - 3.66667)^2 (1 - 2.66667)^2 "
         [1] "2: (4 - 3.66667)^2 (3 - 2.66667)^2 "
         [1] "2: (5 - 3.66667)^2 (4 - 2.66667)^2 "
         [1] "0 + 9.33333 = 9.33333"
       Iteration 2:
         [1] "1: (1 - 1.5)^2 (1 - 1)^2 "
         [1] "1: (2 - 1.5)^2 (1 - 1)^2 "
         [1] "2: (4 - 4.5)^2 (3 - 3.5)^2 "
         [1] "2: (5 - 4.5)^2 (4 - 3.5)^2 "
         [1] "0.5 + 1 = 1.5"
       Iteration 3: identical to iteration 2 (converged).
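The iteration trace above is Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its members, and stop when nothing changes. A small Python sketch that reproduces the trace, assuming (as the trace suggests) that points A and B serve as the initial centroids:

```python
import math

pts = [(1, 1), (2, 1), (4, 3), (5, 4)]  # points A, B, C, D
centers = [(1, 1), (2, 1)]              # assumed initial centroids: points A and B

for _ in range(10):
    # assignment step: each point joins its nearest centroid
    assign = [min(range(2), key=lambda k: math.dist(p, centers[k])) for p in pts]
    # update step: each centroid moves to the mean of its members
    new = []
    for k in range(2):
        members = [p for p, a in zip(pts, assign) if a == k]
        new.append((sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members)))
    if new == centers:
        break                           # converged
    centers = new

# cluster labels here are 0/1; R's kmeans() happens to report them as 2/1
print(assign)   # [0, 0, 1, 1]
print(centers)  # [(1.5, 1.0), (4.5, 3.5)]
```

The run converges in three iterations, matching the trace: centroids move from (1, 1) and (2, 1) to (1, 1) and (3.666667, 2.666667), then to (1.5, 1.0) and (4.5, 3.5).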

  4. K-Means Clustering: 6 points, with cluster ellipses

     R code:

       library(plotrix)
       datax.df <- read.csv("aglomerative.csv")
       plot(datax.df$X2 ~ datax.df$X1, xlab = "X1", ylab = "X2",
            xlim = c(0, 6), ylim = c(0, 6))
       text(x = datax.df$X1, y = datax.df$X2, labels = datax.df$Titik,
            pos = 4, cex = 0.8, srt = 20, offset = 0.2)

       # alternative with ggplot
       library(ggplot2)
       ggplot(datax.df, aes(y = X2, x = X1)) +
         geom_point() +
         geom_text(aes(label = paste(" ", Titik)), size = 4, hjust = 0.0, angle = 15) +
         ylim(0, 6) + xlim(0, 6)

       # k-means
       x <- datax.df[,-1]
       centers <- 2
       km <- kmeans(x, centers, iter.max = 10, nstart = 1)
       km

       # mark the centroids and draw an ellipse around each cluster
       par(new = TRUE)
       plot(km$centers[,1], km$centers[,2], pch = 3, lwd = 2, xlab = "", ylab = "",
            xlim = c(0, 6), ylim = c(0, 6))
       for (i in c(1:dim(km$centers)[1])) {
         draw.ellipse(km$centers[i,1], km$centers[i,2], a = (i + 0.7), b = 1,
                      border = 1, angle = 45, lty = 1)
       }
       par(new = FALSE)

     Data (aglomerative.csv):

       Titik,X1,X2
       A,1,1
       B,1.5,1.5
       C,5,5
       D,3,4
       E,4,4
       F,3,3.5

     Output:

       > km
       K-means clustering with 2 clusters of sizes 2, 4

       Cluster means:
           X1    X2
       1 1.25 1.250
       2 3.75 4.125

       Clustering vector:
       [1] 1 1 2 2 2 2

       Within cluster sum of squares by cluster:
       [1] 0.2500 3.9375
        (between_SS / total_SS = 82.2 %)

  5. K-Means Clustering: which cluster does point (2, 2) belong to?

     R code:

       library(plotrix)
       datax.df <- read.csv("aglomerative.csv")
       plot(datax.df$X2 ~ datax.df$X1, xlab = "X1", ylab = "X2",
            xlim = c(0, 6), ylim = c(0, 6))
       text(x = datax.df$X1, y = datax.df$X2, labels = datax.df$Titik,
            pos = 4, cex = 0.8, srt = 20, offset = 0.2)

       # alternative with ggplot
       library(ggplot2)
       ggplot(datax.df, aes(y = X2, x = X1)) +
         geom_point() +
         geom_text(aes(label = paste(" ", Titik)), size = 4, hjust = 0.0, angle = 15) +
         ylim(0, 6) + xlim(0, 6)

       # k-means
       x <- datax.df[,-1]
       centers <- 2
       km <- kmeans(x, centers, iter.max = 10, nstart = 1)
       km

       # simulate a new point (2, 2): which cluster does it fall into?
       xt <- 2
       yt <- 2
       pchx <- rep(3, dim(km$centers)[1])
       colx <- rep(1, dim(km$centers)[1])

       # distance from the new point to every centroid
       xc <- km$centers[,1]
       yc <- km$centers[,2]
       dx <- c()
       for (i in c(1:length(xc))) {
         d <- sqrt((xc[i] - xt)^2 + (yc[i] - yt)^2)
         dx <- append(dx, d)
       }

       # highlight the nearest centroid
       idx <- which(dx == min(dx))
       pchx[idx] <- 10
       colx[idx] <- 2

       par(new = TRUE)
       plot(km$centers[,1], km$centers[,2], col = colx, pch = pchx, lwd = 2,
            xlab = "", ylab = "", xlim = c(0, 6), ylim = c(0, 6))
       for (i in c(1:dim(km$centers)[1])) {
         draw.ellipse(km$centers[i,1], km$centers[i,2], a = (i + 0.7), b = 1,
                      border = 1, angle = 45, lty = 1)
       }
       par(new = FALSE)
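The R code above assigns the new point by computing its distance to every centroid and picking the minimum. Using the cluster means reported by km for the six-point dataset, a compact Python check (the deck itself uses R) of where (2, 2) lands:

```python
import math

# cluster means from the km output on aglomerative.csv
centers = {1: (1.25, 1.25), 2: (3.75, 4.125)}
new_pt = (2, 2)

# distance from the new point to each centroid; nearest wins
dists = {i: math.dist(c, new_pt) for i, c in centers.items()}
nearest = min(dists, key=dists.get)
print(nearest)  # 1
```

The point (2, 2) is about 1.06 from centroid 1 and about 2.75 from centroid 2, so it joins cluster 1 (the {A, B} cluster).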

  6. K-Means Clustering: Sum of Squared Errors (SSE)

     SSE measures how homogeneous the data within a cluster are. Homogeneity is
     measured by the error, i.e. the distance between each data point and its
     centroid: the more homogeneous the data in a cluster, the smaller the
     distance between each point and the centroid. Summed over all clusters:

       SSE = sum over i = 1..K of sum over x in Ci of dist(mi, x)^2

     where
       K  = number of clusters
       Ci = the i-th cluster
       mi = centroid of the i-th cluster
       x  = a data point belonging to that cluster

     R code to compute the SSE:

       sse <- 0
       for (j in c(1:dim(x)[1])) {
         i <- km$cluster[j]
         cx <- km$centers[i,1]
         cy <- km$centers[i,2]
         sse <- sse + (cx - x[j,1])^2 + (cy - x[j,2])^2
         print(sprintf("(%g - %g)^2", x[j,1], cx))
       }
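Applying this formula to the six-point dataset, with the clustering vector and cluster means reported by km, reproduces the per-cluster values 0.25 and 3.9375 from the output. A minimal Python check:

```python
# points A..F from aglomerative.csv
pts = [(1, 1), (1.5, 1.5), (5, 5), (3, 4), (4, 4), (3, 3.5)]
assign = [1, 1, 2, 2, 2, 2]                    # clustering vector from km
centers = {1: (1.25, 1.25), 2: (3.75, 4.125)}  # cluster means from km

# SSE = sum of squared distances from each point to its own centroid
sse = sum((centers[c][0] - x) ** 2 + (centers[c][1] - y) ** 2
          for (x, y), c in zip(pts, assign))
print(sse)  # 4.1875
```

The total 4.1875 is exactly 0.25 + 3.9375, the two within-cluster sums of squares in the km output.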

  7. Dataset ABCDEF

     R code:

       datax.df <- read.csv("aglomerative.csv")

       # plot the points
       plot(datax.df$X2 ~ datax.df$X1, xlab = "X1", ylab = "X2", xlim = c(0, 6))
       text(x = datax.df$X1, y = datax.df$X2, labels = datax.df$Titik,
            pos = 4, cex = 0.8, srt = 20, offset = 0.2)

       # alternative with ggplot
       library(ggplot2)
       ggplot(datax.df, aes(y = X2, x = X1)) +
         geom_point() +
         geom_text(aes(label = paste(" ", Titik)), size = 4, hjust = 0.0, angle = 15) +
         ylim(0, 6) + xlim(0, 6)

       # set row names to the Titik column, then remove that column
       row.names(datax.df) <- datax.df[,1]
       datax.df <- datax.df[,-1]

       # compute Euclidean distance
       # (to compute other distance measures, change the value in method = )
       d <- dist(datax.df, method = "euclidean")

       # in hclust() set argument method =
       # to "ward.D", "single", "complete", "average", "median", or "centroid"
       hc1 <- hclust(d, method = "single")
       plot(hc1, hang = -1, ann = FALSE)

       hc2 <- hclust(d, method = "average")
       plot(hc2, hang = -1, ann = FALSE)

     Data (aglomerative.csv):

       Titik,X1,X2
       A,1,1
       B,1.5,1.5
       C,5,5
       D,3,4
       E,4,4
       F,3,3.5

     Output:

       > d
                 A         B         C         D         E
       B 0.7071068
       C 5.6568542 4.9497475
       D 3.6055513 2.9154759 2.2360680
       E 4.2426407 3.5355339 1.4142136 1.0000000
       F 3.2015621 2.5000000 2.5000000 0.5000000 1.1180340

  8. Dataset ABCDEF: distance matrix (rounded to two decimals)

          A     B     C     D     E
       B 0.71
       C 5.66  4.95
       D 3.61  2.92  2.24
       E 4.24  3.54  1.41  1.00
       F 3.20  2.50  2.50  0.50  1.12

  9. Dataset ABCDEF with normalization

     R code:

       datax.df <- read.csv("aglomerative.csv")

       # set row names to the Titik column, then remove that column
       row.names(datax.df) <- datax.df[,1]
       datax.df <- datax.df[,-1]

       # normalize input variables (z-scores)
       datax.df.norm <- sapply(datax.df, scale)

       # restore the row names
       row.names(datax.df.norm) <- row.names(datax.df)

       # compute normalized distance based on X1 (column 1) and X2 (column 2)
       d.norm <- dist(datax.df.norm[,c(1,2)], method = "euclidean")

       # in hclust() set argument method =
       # to "ward.D", "single", "complete", "average", "median", or "centroid"
       hc1 <- hclust(d.norm, method = "single")
       plot(hc1, hang = -1, ann = FALSE)

       hc2 <- hclust(d.norm, method = "average")
       plot(hc2, hang = -1, ann = FALSE)

     Output:

       > d.norm
                 A         B         C         D         E
       B 0.4613843
       C 3.6910748 3.2296904
       D 2.3308872 1.8808246 1.4797946
       E 2.7683061 2.3069217 0.9227687 0.6679047
       F 2.0780208 1.6202898 1.6421172 0.3183573 0.7398973
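R's scale() standardizes each column to z-scores using the sample standard deviation (dividing by n - 1). To see what d.norm is made of, a small Python sketch (the deck itself uses R) that reproduces the normalized A to B distance of 0.4613843 from the output above:

```python
from statistics import mean, stdev

x1 = [1, 1.5, 5, 3, 4, 3]    # X1 column of aglomerative.csv (points A..F)
x2 = [1, 1.5, 5, 4, 4, 3.5]  # X2 column

def zscores(v):
    # stdev divides by n - 1, matching R's scale()
    m, s = mean(v), stdev(v)
    return [(u - m) / s for u in v]

z1, z2 = zscores(x1), zscores(x2)

# Euclidean distance between A and B in z-score space
d_ab = ((z1[0] - z1[1]) ** 2 + (z2[0] - z2[1]) ** 2) ** 0.5
print(round(d_ab, 4))  # 0.4614
```

The same two points are 0.7071 apart on the raw scale (slide 7), so normalization changes the geometry, and with it the dendrogram.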

  10. Dataset Public Utilities

     Observe the resulting scatterplot.

     R code:

       utilities.df <- read.csv("PublicUtilities-brief.csv")
       plot(utilities.df$Fuel_Cost ~ utilities.df$Sales, xlab = "Sales",
            ylab = "Fuel Cost", xlim = c(2000, 20000))
       text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
            labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)

       # alternative with ggplot
       library(ggplot2)
       ggplot(utilities.df, aes(y = Fuel_Cost, x = Sales)) +
         geom_point() +
         geom_text(aes(label = paste(" ", Company)), size = 4, hjust = 0.0, angle = 15) +
         ylim(0.25, 2.25) + xlim(3000, 18000)

     Data (PublicUtilities-brief.csv):

       Company,Fixed,RoR,Cost,Load,Demand,Sales,Nuclear,Fuel_Cost
       Arizona Public Service,1.06,9.2,151,54.4,1.6,9077,0,0.628
       Boston Edison Co.,0.89,10.3,202,57.9,2.2,5088,25.3,1.555
       Central Louisiana Co.,1.43,15.4,113,53,3.4,9212,0,1.058
       Commonwealth Edison Co.,1.02,11.2,168,56,0.3,6423,34.3,0.7
       Consolidated Edison Co. (NY),1.49,8.8,192,51.2,1,3300,15.6,2.044
       Florida Power & Light Co.,1.32,13.5,111,60,-2.2,11127,22.5,1.241
       Hawaiian Electric Co.,1.22,12.2,175,67.6,2.2,7642,0,1.652
       Idaho Power Co.,1.1,9.2,245,57,3.3,13082,0,0.309
       Kentucky Utilities Co.,1.34,13,168,60.4,7.2,8406,0,0.862
       Madison Gas & Electric Co.,1.12,12.4,197,53,2.7,6455,39.2,0.623
       Nevada Power Co.,0.75,7.5,173,51.5,6.5,17441,0,0.768
       New England Electric Co.,1.13,10.9,178,62,3.7,6154,0,1.897
       Northern States Power Co.,1.15,12.7,199,53.7,6.4,7179,50.2,0.527
       Oklahoma Gas & Electric Co.,1.09,12,96,49.8,1.4,9673,0,0.588
       Pacific Gas & Electric Co.,0.96,7.6,164,62.2,-0.1,6468,0.9,1.4
       Puget Sound Power & Light Co.,1.16,9.9,252,56,9.2,15991,0,0.62
       San Diego Gas & Electric Co.,0.76,6.4,136,61.9,9,5714,8.3,1.92
       The Southern Co.,1.05,12.6,150,56.7,2.7,10140,0,1.108
       Texas Utilities Co.,1.16,11.7,104,54,-2.1,13507,0,0.636
       Wisconsin Electric Power Co.,1.2,11.8,148,59.9,3.5,7287,41.1,0.702
       United Illuminating Co.,1.04,8.6,204,61,3.5,6650,0,2.116
       Virginia Electric & Power Co.,1.07,9.3,174,54.3,5.9,10093,26.6,1.306

  11. Distance Matrix (using Euclidean distance)

     R code:

       utilities.df <- read.csv("PublicUtilities-brief.csv")

       # set row names to the utilities column, then remove that column
       row.names(utilities.df) <- utilities.df[,1]
       utilities.df <- utilities.df[,-1]

       # compute Euclidean distance
       # (to compute other distance measures, change the value in method = )
       d <- dist(utilities.df, method = "euclidean")

     Output (first rows):

       > d
                       Arizona     Boston    Central Commonwealth         NY    Florida
       Boston       3989.40808
       Central       140.40286 4125.04413
       Commonwealth 2654.27763 1335.46650 2789.75967
       NY           5777.16767 1788.06803 5912.55291   3123.15322
       Florida      2050.52944 6039.68908 1915.15515   4704.36310 7827.42921
       Hawaiian     1435.26502 2554.28716 1571.29540   1219.56001 4342.09380 3485.67156

  12. Euclidean Distance

     Distance between Arizona and Boston, computed over all eight numeric columns.
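The value 3989.40808 reported for this pair in the distance matrix can be verified directly from the two dataset rows; note how thoroughly the Sales difference (9077 - 5088 = 3989) dominates the result. A quick Python check (the deck itself uses R):

```python
import math

# rows for Arizona Public Service and Boston Edison Co.
# (Fixed, RoR, Cost, Load, Demand, Sales, Nuclear, Fuel_Cost)
arizona = [1.06, 9.2, 151, 54.4, 1.6, 9077, 0, 0.628]
boston  = [0.89, 10.3, 202, 57.9, 2.2, 5088, 25.3, 1.555]

# Euclidean distance over all eight coordinates
d = math.dist(arizona, boston)
print(round(d, 5))  # 3989.40808
```

This dominance by the largest-scale variable is exactly the problem the next slide's normalization addresses.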

  13. Normalizing Numerical Measurements

     Variables with larger scales (e.g., Sales) have a much greater influence over
     the total distance. Remedy: first normalize the measurements, then compute
     the Euclidean distance between each pair.

     R code:

       utilities.df <- read.csv("PublicUtilities-brief.csv")

       # set row names to the utilities column, then remove that column
       row.names(utilities.df) <- utilities.df[,1]
       utilities.df <- utilities.df[,-1]

       # normalize input variables (z-scores)
       utilities.df.norm <- sapply(utilities.df, scale)

       # restore the row names
       row.names(utilities.df.norm) <- row.names(utilities.df)

       # compute normalized distance based on Sales (column 6) and Fuel_Cost (column 8)
       d.norm <- dist(utilities.df.norm[,c(6,8)], method = "euclidean")

     Output (first rows):

       > d.norm
                      Arizona    Boston   Central Commonwealth        NY   Florida
       Boston       2.0103293
       Central      0.7741795 1.4657027
       Commonwealth 0.7587375 1.5828208 1.0157104
       NY           3.0219066 1.0133700 2.4325285    2.5719693
       Florida      1.2444219 1.7923968 0.6318918    1.6438566 2.6355728
       Hawaiian     1.8852481 0.7402833 1.1560922    1.7460268 1.4116954 1.2288047

  14. Displaying Clusters Using a Dendrogram

     R code:

       # in hclust() set argument method =
       # to "ward.D", "single", "complete", "average", "median", or "centroid"
       hc1 <- hclust(d.norm, method = "single")
       plot(hc1, hang = -1, ann = FALSE)

       hc2 <- hclust(d.norm, method = "average")
       plot(hc2, hang = -1, ann = FALSE)
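The two dendrograms differ because single and average linkage define the distance between clusters differently: the closest cross-cluster pair versus the mean over all cross-cluster pairs. As a hypothetical illustration on the ABCDEF points (the deck itself uses R), here is the distance from the cluster {D, F}, the first merge at d = 0.5, to point E under each rule:

```python
import math

# points A..F from aglomerative.csv
pts = {"A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
       "D": (3, 4), "E": (4, 4), "F": (3, 3.5)}

def single_linkage(c1, c2):
    # distance between the closest cross-cluster pair
    return min(math.dist(pts[a], pts[b]) for a in c1 for b in c2)

def average_linkage(c1, c2):
    # mean over all cross-cluster pairs
    ds = [math.dist(pts[a], pts[b]) for a in c1 for b in c2]
    return sum(ds) / len(ds)

print(round(single_linkage(["D", "F"], ["E"]), 2))   # 1.0
print(round(average_linkage(["D", "F"], ["E"]), 2))  # 1.06
```

Single linkage uses only the D to E distance (1.0), while average linkage also folds in the longer F to E distance, so merges happen at different heights in the two dendrograms.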
