Clustering and Cleaning of Dirty Data: A Unified Approach

Learn how to turn waste into wealth by simultaneously clustering and cleaning dirty data with density-based techniques. Discover the benefits of repairing noisy data points for improved clustering results.

  • Data Cleaning
  • Data Clustering
  • Density-based Techniques
  • Data Repair
  • Unsupervised Learning


Presentation Transcript


  1. Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data. Shaoxu Song, Chunping Li, Xiaoquan Zhang, Tsinghua University

  2. Motivation. Dirty data commonly exist, often as a (very) large portion, e.g., in GPS readings. Density-based clustering, such as DBSCAN, successfully identifies noise, grouping non-noise points into clusters and discarding noise points. (KDD 2015)

  3. [Figure: the interplay of mining and cleaning — cleaning makes useless data valuable, while mining guides what to find and fix.]

  4. [Figure: mining + repairing — repair knowledge (constraints, rules) is applied to the (dirty) data, while the density information discovered by mining guides the repair.]

  5. Discarding vs. Repairing. Simply discarding a large number of dirty points (as noise) could greatly affect the clustering results. [Figure: (a) clusters in the ground truth; (b) clusters in dirty data without repairing; (c) clusters in dirty data with repairing.] We propose to repair the noise and utilize it to support clustering. Basic idea: simultaneously repair noise points w.r.t. the density of the data during the clustering process.

  6. Density-based Cleaning. Both the clustering and repairing tasks benefit. Clustering gains more support from the repaired noise points; repairing is guided by density information that is already embedded in the data, rather than by manually specified knowledge.

  7. Basics. DBSCAN: density-based identification of noise points, with a distance threshold ε and a density threshold η. Two points are ε-neighbors if their distance is within ε. A noise point is a point with fewer than η ε-neighbors that is also not an ε-neighbor of any point having at least η ε-neighbors (a core point).
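The ε/η definitions above can be sketched directly; a minimal pure-Python illustration (the function and variable names, such as `classify` and `eta`, are my own, not from the slides):

```python
import math

def eps_neighbors(points, i, eps):
    """Indices of the eps-neighbors of points[i] (other points within distance eps)."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]

def classify(points, eps, eta):
    """Label each point 'core', 'border', or 'noise', following the DBSCAN notions above."""
    neigh = [eps_neighbors(points, i, eps) for i in range(len(points))]
    core = {i for i in range(len(points)) if len(neigh[i]) >= eta}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neigh[i]):
            labels.append("border")  # eps-neighbor of a core point
        else:
            labels.append("noise")   # fewer than eta neighbors, none of them core
    return labels
```

For example, with eps=0.5 and eta=2, a point with a single neighbor next to a dense group becomes a border point, while an isolated point is noise.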

  8. Modification Repair [SIGMOD 05] [ICDT 09]. A repair over a set of points P is a mapping μ : P → P. We denote by μ(p_i) the location of point p_i after repairing. The ε-neighbors of μ(p_i) after repairing are C_ε(μ(p_i)) = { p_j ∈ P | dist(μ(p_i), μ(p_j)) ≤ ε }.
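The repaired neighborhood C_ε(μ(p_i)) can be computed straight from this definition; a small sketch, representing μ as a dict from point index to new location (an illustrative choice of mine, with unrepaired points mapped to themselves):

```python
import math

def repaired_neighbors(points, mu, i, eps):
    """C_eps(mu(p_i)): indices j != i with dist(mu(p_i), mu(p_j)) <= eps."""
    loc = lambda j: mu.get(j, points[j])  # mu defaults to the identity mapping
    return [j for j in range(len(points))
            if j != i and math.dist(loc(i), loc(j)) <= eps]
```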

  9. Repair Cost. Following the minimum-change principle in data cleaning. Intuition: systems and humans try to minimize mistakes in practice, so a repair close to the input is preferred. The repair cost Δ(μ) is defined as Δ(μ) = Σ_i w(p_i, μ(p_i)), where w(p_i, μ(p_i)) is the cost of repairing point p_i to the new location μ(p_i), e.g., by counting the modified data points.
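The cost Δ(μ) = Σ_i w(p_i, μ(p_i)) is a one-liner to compute; a sketch with the counting cost from the slide as the default (the helper name `repair_cost` is mine):

```python
def repair_cost(points, mu, w=None):
    """Delta(mu) = sum_i w(p_i, mu(p_i)); by default, count the modified points."""
    if w is None:
        w = lambda p, q: 0 if p == q else 1  # minimum change: count modifications
    return sum(w(points[i], mu.get(i, points[i])) for i in range(len(points)))
```

Passing w=math.dist instead would charge each point the Euclidean distance it is moved.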

  10. Problem Statement. Given a set of data points P, a distance threshold ε and a density threshold η, the Density-based Optimal Repairing and Clustering (DORC) problem is to find a repair (a mapping μ : P → P) such that (1) the repair cost Δ(μ) is minimized, and (2) each repaired μ(p_i) is either a core point or a border point, i.e., either |C_ε(μ(p_i))| ≥ η (core point), or |C_ε(μ(p_j))| ≥ η for some p_j with dist(μ(p_i), μ(p_j)) ≤ ε. All the points are utilized; no noise remains.
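Constraint (2) of DORC — every repaired point is core or border — can be checked mechanically; a sketch, again representing μ as a dict from index to new location (an illustrative encoding, not the paper's):

```python
import math

def is_feasible(points, mu, eps, eta):
    """DORC constraint (2): each repaired point has at least eta eps-neighbors
    (core), or lies within eps of some core point (border); no noise remains."""
    loc = [mu.get(i, p) for i, p in enumerate(points)]
    neigh = [[j for j in range(len(loc))
              if j != i and math.dist(loc[i], loc[j]) <= eps]
             for i in range(len(loc))]
    core = {i for i in range(len(loc)) if len(neigh[i]) >= eta}
    return all(i in core or any(j in core for j in neigh[i])
               for i in range(len(loc)))
```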

  11. Technical Concern. Simply repairing only the noise points to the closest clusters is not sufficient; e.g., repairing all the noise points to C1 does not help in identifying the second cluster C2. Indeed, it should be considered that dirty points may possibly form clusters after repairing (i.e., C2).

  12. Problem Solving. No additional parameters are introduced for DORC beyond the distance and density requirements ε and η already needed for clustering. ILP formulation: efficient solvers can be applied. Quadratic-time approximation via LP relaxation. Trade-off between effectiveness and efficiency by locally grouping data points into several partitions.
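To make the optimization objective concrete, here is a brute-force solver for tiny instances only; it assumes, as a simplification, that points may only be repaired onto existing point locations, and it is not the paper's ILP or QDORC algorithm:

```python
import math
from itertools import product

def dorc_bruteforce(points, eps, eta):
    """Exhaustive DORC for tiny inputs: try every mapping of points onto point
    locations, keep the cheapest feasible repair (cost = number of moved points)."""
    n = len(points)
    best = None
    for assign in product(range(n), repeat=n):  # mu(p_i) = points[assign[i]]
        loc = [points[a] for a in assign]
        cost = sum(1 for i in range(n) if loc[i] != points[i])
        if best is not None and cost >= best[0]:
            continue  # cannot beat the best repair found so far
        neigh = [[j for j in range(n)
                  if j != i and math.dist(loc[i], loc[j]) <= eps]
                 for i in range(n)]
        core = {i for i in range(n) if len(neigh[i]) >= eta}
        if all(i in core or any(j in core for j in neigh[i]) for i in range(n)):
            best = (cost, loc)
    return best  # (minimum cost, repaired locations) or None
```

On a four-point example with one outlier, the optimum moves exactly one point; the paper's contribution is solving this objective at scale rather than by enumeration.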

  13. Experimental Results. The experiments answer the following questions: (1) by utilizing dirty data, can more accurate clusters be formed? (2) by simultaneous repairing and clustering, is the repairing accuracy improved in practice compared with existing data repairing approaches? (3) how do the approaches scale? Criteria: clustering accuracy measured by purity and NMI; repairing accuracy measured by the root-mean-square (RMS) error between the truth and the repair results.
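The two hand-rolled criteria can be stated precisely in a few lines; a sketch of purity and RMS error (NMI is available off-the-shelf, e.g. scikit-learn's `normalized_mutual_info_score`):

```python
import math
from collections import Counter

def purity(pred, truth):
    """Clustering purity: each predicted cluster votes for its majority true
    label; purity is the fraction of points covered by those majorities."""
    clusters = {}
    for c, t in zip(pred, truth):
        clusters.setdefault(c, []).append(t)
    return sum(Counter(ts).most_common(1)[0][1]
               for ts in clusters.values()) / len(pred)

def rms_error(repaired, truth):
    """Root-mean-square distance between repaired points and their ground truth."""
    return math.sqrt(sum(math.dist(p, q) ** 2
                         for p, q in zip(repaired, truth)) / len(truth))
```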

  14. Artificial Data Set. Compared to the existing methods without repairing, DBSCAN and OPTICS, the proposed DORC variants (ILP and the quadratic-time approximation) show higher clustering purity. [Figure: (a) clustering purity and (b) repairing error vs. dirty rate at ε=15, η=6, comparing DBSCAN, OPTICS, ILP, QDORC, LDORC(t/ε=1/5), LDORC(t/ε=1/10).]

  15. Real GPS Data. Errors are naturally embedded and manually labelled. Compared to the Median Filter (MF), a filtering technique for cleaning noisy data in time-space correlated time series, DORC is better than MF+DBSCAN. [Figure: clustering purity and repairing error vs. dirty rate at ε=8E-5, η=8, comparing DBSCAN, MF+DBSCAN, QDORC, MF+QDORC, OPTICS.]
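For reference, the MF baseline boils down to a sliding-window median; a one-dimensional sketch (window size and edge handling are my assumptions, not the exact setup of the experiment):

```python
def median_filter(series, k=3):
    """Replace each value with the median of a window of size k (odd) around it;
    a standard filter for cleaning noise in time-correlated series."""
    h = k // 2
    out = []
    for i in range(len(series)):
        window = sorted(series[max(0, i - h): i + h + 1])  # truncated at the edges
        out.append(window[len(window) // 2])
    return out
```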

  16. Restaurant Data. Tabular data with artificially injected noises, widely considered in conventional data cleaning. Compared to FD, a repairing approach under integrity constraints (functional dependencies), e.g., [name, address] → city. [Figure: clustering purity and repairing error vs. dirty rate at ε=0.22, η=16, comparing DBSCAN, FD+DBSCAN, QDORC, FD+QDORC.]
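A functional dependency such as [name, address] → city is violated when tuples that agree on the left-hand side disagree on the right; a small detector sketch (the function name and dict-based rows are illustrative, and it only detects violations rather than repairing them as the FD baseline does):

```python
def fd_violations(rows, lhs, rhs):
    """Return the rows violating the FD lhs -> rhs: groups that share the same
    lhs values but carry more than one distinct rhs value."""
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in lhs), set()).add(r[rhs])
    bad = {key for key, vals in groups.items() if len(vals) > 1}
    return [r for r in rows if tuple(r[a] for a in lhs) in bad]
```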

  17. More Results. Two labelled, publicly available benchmark data sets, Iris and Ecoli, from UCI, evaluated with normalized mutual information (NMI) clustering accuracy. Similar results are observed: DORC shows higher accuracy than DBSCAN and OPTICS. [Figure: (a) NMI vs. dirty rate on GPS (DBSCAN, MF+DBSCAN, QDORC, MF+QDORC, OPTICS); (b) NMI on Iris and Ecoli (DBSCAN, OPTICS, QDORC).]

  18. Summary. Preliminary density-based clustering can successfully identify noisy data, but does not clean them. Existing constraint-based repairing relies on external constraint knowledge, without utilizing the density information embedded inside the data. With the happy marriage of the clustering and repairing advantages, both the clustering and repairing accuracies are significantly improved.

  19. References (data repairing)
     [SIGMOD 05] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143-154, 2005.
     [TODS 05] J. Wijsen. Database repairing using updates. ACM Trans. Database Syst., 30(3):722-768, 2005.
     [PODS 08] W. Fan. Dependencies revisited for improving data quality. In PODS, pages 159-170, 2008.
     [ICDT 09] S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53-62, 2009.

  20. Thanks
