
Stream Data Cleaning under Speed Constraints - SIGMOD 2015
Learn about the challenges of cleaning stream data under speed constraints, including issues with erroneous sensor readings and data extraction accuracy. Explore constraint-based methods for data cleaning and the global and local optimum approaches in handling data anomalies. SIGMOD 2015 research findings are discussed in this informative presentation.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
SCREEN: Stream Data Cleaning under Speed Constraints 1/19 Shaoxu Song1, Aoqian Zhang1, Jianmin Wang1, Philip S. Yu2 1Tsinghua University, China 2University of Illinois at Chicago, USA SIGMOD 2015
Outline 2/19 Motivation Cleaning Methods Preliminary Global Optimum Local Optimum Experiments Conclusion SIGMOD 2015
Stream Data Erroneous 3/19 Stream data are often dirty Unreliable sensor reading Erroneous extraction Stock and Flight Accuracy of Stock in Yahoo! Finance is 0.93[1] Accuracy of Travelocity is 0.95[1] SALVEPAR(SY) is misused as the price of SYBASE(SY) Reasons Ambiguity in information extraction Unit error Pure mistake dirty 3 2.5 2 1.5 1 0.5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 dirty 1. X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97-108, 2012. SIGMOD 2015
Data Cleaning 4/19 Traditional smoothing filter Segmentation in sliding windows Modify almost all the data values Can not handle out-of-order arrival (Must re-compute) 1.5 3 2.5 2 1 3 Constraint based methods Hard to provide online computing 2 0.5 2.5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 1.5 dirty smooth holistic Screen SCREEN Stream Data Cleaning under Speed Constraints The first constraint-based approach for cleaning stream data 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 1 0.5 dirty smooth SIGMOD 2015
Outline 5/19 Motivation Cleaning Methods Preliminary Global Optimum Local Optimum Experiments Conclusion SIGMOD 2015
Preliminary 6/19 Speed constraint ? = (????,????) with window size ? Speed between ?,? is ?? ?? ?? ?? ?? ?? Repair distance ?,? = ?? ?|?? ?? | SIGMOD 2015
Global Optimum 7/19 Problem All data in windows should satisfy the speed constraint Minimize the repair distance Computation LP problem with ? ?3.5? Advantage Fast High accuracy Disadvantage Online computing Can be faster 3 2.5 2 1.5 1 0.5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 dirty Screen SIGMOD 2015
Local Optimum 8/19 Problem Focus on one point, locally satisfies speed constraint s in the window starting from it, with minimized repair distance 3 3 2.5 2.5 2 2 1.5 Local 1.5 Global 1 1 0.5 0.5 0 0 1 3 5 7 9 1113151719212325272931 14 15 16 17 18 19 20 21 22 dirty Screen(Global) dirty Screen(Local) Computation Using Median Principle with ? ?? time SIGMOD 2015
Local Optimum Repair 9/19 Border repair can be represented by ?? , if ?? is settled ?? Reformulation[1] with only one variable, i.e., ?? ?? [1] http://ise.thss.tsinghua.edu.cn/sxsong/doc/screen.pdf SIGMOD 2015
Possible Repairs 10/19 Local Solution ? = ?2+ ?4 ? = ?2 ??+ ?4 ??+ ??= G ?? ? = ? ??+ ??= ? ? = ? ??+ ??+ ??= R + ?? ?? ?? ?? ?? ?? SIGMOD 2015
Possible Repairs cont. 11/19 ?+ ?4 ?= 1.75 ?+ ?2 ?= 1.5 ? = ?2 ? = ?1 ?+ ?2 ?= 1.5 ?+ ?2 ?+ ?3 ?= 1.6 ? = ?1 ? = ?1 SIGMOD 2015
Monotonicity 12/19 Monotonicity (Proposition 5) ?+ ?4 ?+ ?2 ?+ ?2 ?+ ?2 ?= 1.75 ?= ??+ ?2 ??+ ?4 ??= G ??= 1.5 ?= ?1 ?+ ?3 ? = ?2 ? = ?1 ?+ ??+ ?2 ?= ?1 ? ??= ? = 1.5 ?+??+?2 ? = ?1 ? ??+??= R + ??= 1.6 ? = ?1 Repair Distance 5 4 3 2 1 0 1 2(xG) 3(xO) 4 5 6 7 Repair Distance SIGMOD 2015
Median Principle 13/19 Local Solution Projections ?? ??? ?? ??? ?? ?? Repair Distance 5 4 3 2 1 0 1 2 3 4 5 6 7 Repair Distance ???= ??????(?? ??? ?? ??? ??), Proposition 3 ?? SIGMOD 2015
Outline 14/19 Motivation Cleaning Methods Preliminary Global Optimum Local Optimum Experiments Conclusion SIGMOD 2015
Experiment Setting 15/19 Real datasets STOCK1: daily prices of AIP.L from 1984-09 to 2010-02, with 12826 data points. FUEL:29750 data points collected in a period of 32 days GPS: 368/3441 points in the trajectory Errors Injected errors in the STOCK and FUEL Manually identified in the GPS Criteria Normalized accuracy 1. http://finance.yahoo.com/q/hp?s=AIP.L+Historical+Prices SIGMOD 2015
Baseline 16/19 Smoothing based SWAB1 WMA and EWMA2 Constraint based Holistic Repair3 Sequential Dependency4 1. E. J. Keogh, S. Chu, D. M. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289-296, 2001. E. S. Gardner Jr. Exponential smoothing: The state of the art{part ii. International Journal of Forecasting, 22(4):637-666, 2006. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458-469, 2013. L. Golab, H. J. Karlo, F. Korn, A. Saha, and D. Srivastava. Sequential dependencies. PVLDB, 2(1):574-585, 2009. 2. 3. 4. SIGMOD 2015
Effectiveness and Efficiency 17/19 SIGMOD 2015
Outline 18/19 Motivation Cleaning Methods Preliminary Global Optimum Local Optimum Experiments Conclusion SIGMOD 2015
Conclusion 19/19 Clean the stream data With small distance from original values In linear time with on violations Handle the out-of-order data A significantly higher repair accuracy Up to 4 orders of magnitude improvement in time performance compared to the state-of-art data cleaning methods SIGMOD 2015
Q & A Thanks 20/19 SIGMOD 2015