K-means Clustering Algorithms for Big Data Lecture
This lecture covers K-means clustering for big data: the objective and its hardness, Lloyd's heuristic, dimension reduction via the Johnson-Lindenstrauss lemma, and the K-means++ and K-means|| initialization algorithms, with an eye toward clustering large datasets.
Presentation Transcript
CIS 700: Algorithms for Big Data
Lecture 11: K-means
Slides at http://grigory.us/big-data-class.html
Grigory Yaroslavtsev, http://grigory.us
K-means Clustering
Given $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$, find a set of centers $C = (c_1, \dots, c_k)$ that minimizes
$$\sum_{i \in [n]} \min_{j \in [k]} \|x_i - c_j\|_2^2.$$
This is an NP-hard problem. A popular heuristic is local search (Lloyd's algorithm): for a fixed partitioning $X_1, \dots, X_k$, set each center to the centroid of its cluster,
$$c_j = \frac{1}{|X_j|} \sum_{x \in X_j} x,$$
then reassign each point to its nearest center and repeat.
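A minimal NumPy sketch of Lloyd's local search, to make the alternation between assignment and centroid updates concrete. This code is illustrative and not part of the original slides; the function name and the uniform initialization are my own choices.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's local-search heuristic for k-means.

    X: (n, d) array of points, k: number of centers.
    Initialization here is plain uniform sampling; the k-means++
    seeding discussed later in the lecture works better in practice.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the centroid of its cluster.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```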
Dimension Reduction for K-means
Let $OPT_k(X) = \inf_P \, cost_{k,P}(X)$ over all partitions $P$ of $X$ into $k$ clusters.
For $0 < \epsilon < 1/2$, let $\pi \colon \mathbb{R}^d \to \mathbb{R}^m$ be such that for all $x, y \in X$:
$$(1-\epsilon)\,\|x - y\|_2^2 \;\le\; \|\pi(x) - \pi(y)\|_2^2 \;\le\; (1+\epsilon)\,\|x - y\|_2^2.$$
Lemma. If $P$ is a $\gamma$-approximate clustering for $\pi(X)$ and $P^*$ is an optimal clustering for $X$, then
$$cost_P(X) \;\le\; \gamma\,\frac{1+\epsilon}{1-\epsilon}\, OPT_k(X).$$
By the JL lemma, $m = O\!\left(\frac{\log n}{\epsilon^2}\right)$ suffices.
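To make the reduction concrete, here is a hedged sketch of a Gaussian random projection in the spirit of the JL lemma. The constant used for the target dimension and the function name are illustrative assumptions, not values from the slides.

```python
import numpy as np

def jl_project(X, eps, seed=0):
    """Random Gaussian projection into m = O(log n / eps^2) dimensions.

    A sketch of JL-style dimension reduction; the constant 8 below is
    an illustrative choice, not a tight bound from the lecture.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))
    m = min(m, d)  # never project up
    G = rng.normal(size=(d, m)) / np.sqrt(m)
    return X @ G

# Usage idea: cluster the projected points, then evaluate the resulting
# partition on the original points; the lemma bounds the extra cost by
# a (1 + eps) / (1 - eps) factor.
```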
Dimension Reduction for K-means: Proof Sketch
Fix a partition $P = P_1, \dots, P_k$ with centroids $c_j = \frac{1}{|P_j|}\sum_{x \in P_j} x$. Then
$$cost_P(X) = \sum_{j \in [k]} \sum_{x \in P_j} \|x - c_j\|_2^2 = \sum_{j \in [k]} \frac{1}{2|P_j|} \sum_{x, y \in P_j} \|x - y\|_2^2,$$
where the second equality follows by expanding $\|x - c_j\|_2^2 = \|x\|_2^2 - 2\langle x, c_j\rangle + \|c_j\|_2^2$ and averaging over the cluster.
So the cost of a fixed partition depends only on pairwise squared distances, and a map $\pi$ that distorts them by at most $(1 \pm \epsilon)$ distorts $cost_P$ by at most the same factor:
$$(1-\epsilon)\,cost_P(X) \;\le\; cost_P(\pi(X)) \;\le\; (1+\epsilon)\,cost_P(X).$$
Hence, if $P$ is a $\gamma$-approximation for $\pi(X)$ and $P^*$ is optimal for $X$:
$$cost_P(X) \le \frac{cost_P(\pi(X))}{1-\epsilon} \le \frac{\gamma\, OPT_k(\pi(X))}{1-\epsilon} \le \frac{\gamma\, cost_{P^*}(\pi(X))}{1-\epsilon} \le \frac{\gamma(1+\epsilon)}{1-\epsilon}\, OPT_k(X).$$
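The centroid/pairwise-distance identity above is easy to sanity-check numerically. The following snippet is an illustrative check, not part of the original slides:

```python
import numpy as np

# Check: sum_{x in P} ||x - c||^2 == (1 / (2|P|)) * sum_{x,y in P} ||x - y||^2
rng = np.random.default_rng(1)
P = rng.normal(size=(50, 10))          # a random cluster of 50 points in R^10
c = P.mean(axis=0)                     # its centroid
lhs = ((P - c) ** 2).sum()
pairwise = ((P[:, None, :] - P[None, :, :]) ** 2).sum()
rhs = pairwise / (2 * len(P))
assert np.isclose(lhs, rhs)
```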
K-means++ Algorithm
Pick the first center uniformly at random from $X$.
For a set of centers $C$, let $d^2(x, C) = \min_{c \in C} \|x - c\|_2^2$.
Fix the current set of centers $C$. Pick each subsequent center to be $x_i$ with probability
$$\frac{d^2(x_i, C)}{\sum_j d^2(x_j, C)}.$$
This gives an $O(\log k)$-approximation to OPT in expectation.
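A short NumPy sketch of the K-means++ seeding rule just described; the function name and signature are my own, but the sampling follows the $D^2$ distribution from the slide.

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    """k-means++ seeding: first center uniform, then each subsequent
    center sampled with probability proportional to d^2(x, C)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                     # first center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)           # d^2(x, C)
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # Update squared distances to the nearest chosen center.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```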
K-means|| Algorithm
Pick the first center $c$ by sampling a point uniformly at random; the initial cost is $\psi = \sum_i d^2(x_i, \{c\})$.
For $O(\log \psi)$ rounds do:
In parallel, let $C'$ = sample each $x_i \in X$ independently with probability $\frac{\ell \cdot d^2(x_i, C)}{\sum_j d^2(x_j, C)}$, where $\ell$ is the oversampling factor.
Set $C = C \cup C'$.
For each $c \in C$, let $w_c$ be the number of points whose closest center is $c$.
Cluster the weighted points in $C$ into $k$ clusters.
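Below is a hedged sketch of the K-means|| initialization just described, assuming an oversampling factor `ell` supplied by the caller; the number of rounds and the final weighted re-clustering step are simplified relative to the full algorithm, and all names are my own.

```python
import numpy as np

def kmeans_parallel_init(X, k, ell, seed=0):
    """Sketch of k-means||-style initialization: oversample candidate
    centers over several rounds, then weight them for re-clustering."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = X[[rng.integers(n)]]                     # first center, uniform
    d2 = ((X - C[0]) ** 2).sum(axis=1)
    psi = d2.sum()                               # initial cost
    rounds = int(np.ceil(np.log(psi))) if psi > 1 else 1
    for _ in range(rounds):
        # Sample each point independently with prob. ell * d^2(x, C) / sum d^2.
        probs = np.minimum(1.0, ell * d2 / d2.sum())
        sampled = X[rng.random(n) < probs]
        if len(sampled) == 0:
            continue
        C = np.vstack([C, sampled])
        # Recompute distances to the enlarged candidate set.
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    # Weight each candidate by the number of points closest to it; the
    # final step (weighted clustering of C into k clusters, e.g. with
    # weighted k-means++) is omitted here.
    closest = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    weights = np.bincount(closest, minlength=len(C))
    return C, weights
```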
K-means|| Algorithm
Oversampling factor: $\ell$. Number of points in $C$: $O(\ell \log \psi)$ in expectation.
Thm. If an $\alpha$-approximation algorithm is used in the last (re-clustering) step, then k-means|| obtains an $O(\alpha)$-approximation to k-means.
If $\psi$ and $\psi'$ are the costs of the clustering before and after one outer-loop iteration, then
$$E[\psi'] = O(OPT) + \frac{\psi}{e}.$$
K-means|| Analysis
For a set of points $A = \{a_1, \dots, a_T\}$, the centroid is $c_A = \frac{1}{T} \sum_{t} a_t$.
Order $a_1, \dots, a_T$ in increasing order of distance from $c_A$.
Fix a cluster $A$ in OPT, fix the center set $C$ prior to the iteration, and let
$$\phi_A = d^2(A, C) = \sum_{a \in A} d^2(a, C).$$
Let $q_t = \frac{\ell \, d^2(a_t, C)}{\sum_x d^2(x, C)}$ be the probability of selecting $a_t$.
The probability that $a_t$ is the closest-to-centroid point of $A$ that gets chosen is
$$p_t = q_t \prod_{s=1}^{t-1} (1 - q_s).$$
K-means|| Analysis
If $a_t$ is the closest selected point, we can either assign all points of $A$ to $a_t$ or keep their current assignment, so the new cost of $A$ is at most
$$s_t = \min\left( \sum_{a \in A} \|a - a_t\|_2^2,\; \phi_A \right),$$
and
$$E[\text{new cost of } A] \le \sum_{t=1}^{T} p_t s_t + p_{T+1} \, \phi_A,$$
where $p_{T+1}$ is the probability that no point of $A$ is selected.
Simplifying assumption: consider the case when all $q_t = q$ (mean-field analysis); then $p_t = q (1-q)^{t-1}$ is a decreasing sequence.
K-means|| Analysis
$s_t$ is an increasing sequence (the $a_t$ are ordered by distance from the centroid), while $p_t$ is decreasing, so by the Chebyshev sum inequality
$$\sum_{t=1}^{T} p_t s_t \;\le\; \left(\sum_{t=1}^{T} p_t\right) \cdot \frac{1}{T} \sum_{t=1}^{T} s_t \;=\; (1 - p_{T+1}) \cdot \frac{1}{T} \sum_{t=1}^{T} s_t.$$
Using $\frac{1}{T} \sum_{t=1}^{T} \sum_{a \in A} \|a - a_t\|_2^2 = 2 \sum_{a \in A} \|a - c_A\|_2^2 = 2\, \phi^*(A)$, this gives
$$E[\text{new cost of } A] \;\le\; (1 - p_{T+1}) \cdot 2\, \phi^*(A) + p_{T+1} \, \phi_A.$$