Data Postdiction Pipeline: Clustering, Outlier Detection, Accuracy Tuning


Explore the intricacies of data postdiction in a comprehensive pipeline focusing on clustering, outlier detection, and accuracy tuning. Learn about the tools and models used to recover and analyze past data effectively.


Uploaded on Apr 12, 2025


Presentation Transcript

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010
Remembering the Forgotten: Clustering, Outlier Detection, and Accuracy Tuning in a Postdiction Pipeline
Anna Baskin, Scott Heyman, Brian T. Nixon, Constantinos Costa, Panos K. Chrysanthis
Department of Computer Science, University of Pittsburgh, USA
Rinnoco Ltd., Limassol, Cyprus
{afb39, sth66, btn10}@pitt.edu, costa.c@rinnoco.com, panos@cs.pitt.edu
https://db.cs.pitt.edu
27th European Conference on Advances in Databases and Information Systems

Outline
  What is Postdiction?
  Postdiction Pipeline
  Future Work
Forecast [IDC 22]: data will grow at a compound annual growth rate of 21.2% to reach more than 221,000 exabytes by 2026
then you'd have a stack of discs that can get you to the moon 29 times

What is Data Postdiction?
  Prediction: aims to make a statement about the future value of a tuple
  Postdiction: aims to make a statement about the past value of a tuple which was deleted to free up space [IEEE MDM '18]
Dataset: Ulianova, S.: Cardiovascular disease dataset (2019)

Postdiction Pipeline
[Pipeline diagram. Labels: Input, Data, Error Tolerance / Accuracy, Outlier Detection, Clustering, ML, Accuracy Tuning, Data Decay, Output, Outlier Table, Recovery Table, ML Models, Storage Size, Recovered Accuracy]
Current implementation:
  Outlier Detection: Z-score, Isolation Forest
  Clustering: K-means, Density-based spatial clustering (DBSCAN), Gaussian Mixture Modelling (GMM)
  Machine Learning: LSTM

Outlier Detection
Input table:
  ID  Height  Weight
  1   168     62
  2   181     59
  3   156     85
  4   133     60
  5   178     90
  6   175     91
  7   166     68
  8   159     72
  9   166     62
  10  170     83
  11  171     95
  12  140     65
Outlier Table (columns: ID, Height, Weight)
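As a rough illustration of the Z-score option named in the pipeline, the sketch below scores one column and flags values far from the column mean. The function name `zscore_outliers` and the threshold of 2.0 are assumptions made for this example; the slides do not state the threshold or which columns are scored, so the rows flagged here need not match the deck's Outlier Table.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return indices whose |z-score| exceeds threshold (population std dev)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing can be flagged
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Height column as read from the slide (IDs 1..12).
heights = [168, 181, 156, 133, 178, 175, 166, 159, 166, 170, 171, 140]

# threshold=2.0 is an assumed value, picked only because the sample is tiny.
flagged = zscore_outliers(heights, threshold=2.0)
print([i + 1 for i in flagged])  # convert 0-based indices to 1-based IDs
```

With so few rows a single-column score is a blunt tool; a real pipeline would more likely score the relationship between the kept and the deleted column.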

Clustering
Outlier Table:
  ID  Height  Weight
  2   181     59
  11  171     95
First cluster:
  ID  Height  Weight
  3   156     85
  5   178     90
  6   175     91
  7   166     68
  8   159     72
  13  164     68
Second cluster:
  ID  Height  Weight
  1   168     62
  4   133     60
  9   166     62
  10  170     83
  12  140     65
Results (Outlier detection: Z-score; Clustering: K-means):
  Pipeline                               Num. Models  Num. Outliers  Recovered Accuracy
  No Clustering / Outlier Detection      1            0              18.7425
  Clustering Only                        2            0              30.97082
  Outlier Detection Before Clustering    2            14301          47.43767
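The clustering step can be sketched with a small K-means in plain Python. This is a minimal version under stated assumptions: centroids are seeded from the first k points (a hypothetical, deterministic choice), distance is Euclidean on raw Height and Weight, and the name `kmeans` is invented for the example. The cluster assignments shown on the slide are illustrative, so this sketch is not expected to reproduce them.

```python
import math

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = [tuple(points[i]) for i in range(k)]  # assumed seeding: first k points
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# (Height, Weight) of the rows left after outlier detection, as read from the slide.
rows = [(168, 62), (156, 85), (133, 60), (178, 90), (175, 91), (166, 68),
        (159, 72), (166, 62), (170, 83), (140, 65), (164, 68)]
labels, centroids = kmeans(rows, k=2)
print(labels)
```

One model is then trained per cluster, which is why the results table reports two models once clustering is enabled.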

Accuracy Tuning
First cluster:
  ID  Height  Weight  Pred. Error
  3   156     85      4%
  5   178     90      4%
  6   175     91      1%
  7   166     68      3%
  8   159     72      1%
  13  164     68      2%
Second cluster:
  ID  Height  Weight  Pred. Error
  1   168     62      4%
  4   133     60      10%
  9   166     62      4%
  10  170     83      2%
  12  140     65      18%
Outlier Table:
  ID  Height  Weight
  2   181     59
  11  171     95
Rows moved to the Outlier Table by accuracy tuning:
  ID  Height  Weight
  4   133     60
  12  140     65
Results:
  Pipeline                                                        Num. Models  Num. Outliers  Recovered Accuracy
  No Clustering / Outlier Detection                               1            0              18.7425
  Clustering + Accuracy Tuning                                    2            18036          100.00
  Outlier Detection Before & After Clustering + Accuracy Tuning   2            28064          100.00
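Accuracy tuning, as the slide presents it, compares each row's prediction error against an error tolerance and moves rows that exceed it into the Outlier Table. The sketch below applies that rule to the per-row errors read from the slide. The 5% tolerance is an assumption (the deck does not state the value used), and the names `relative_error` and `accuracy_tune` are hypothetical.

```python
def relative_error(actual, predicted):
    """Relative prediction error of one value, as a fraction of the actual value."""
    return abs(predicted - actual) / abs(actual)

def accuracy_tune(pred_error, tolerance):
    """Split row IDs into (kept, moved_to_outliers) by comparing error to tolerance."""
    kept = sorted(i for i, e in pred_error.items() if e <= tolerance)
    moved = sorted(i for i, e in pred_error.items() if e > tolerance)
    return kept, moved

# Row ID -> prediction error, as read from the slide's two cluster tables.
pred_error = {3: 0.04, 5: 0.04, 6: 0.01, 7: 0.03, 8: 0.01, 13: 0.02,
              1: 0.04, 4: 0.10, 9: 0.04, 10: 0.02, 12: 0.18}

TOLERANCE = 0.05  # assumed value; the slides do not give the tolerance

kept, moved = accuracy_tune(pred_error, TOLERANCE)
print("stay in recovery tables:", kept)
print("moved to outlier table:", moved)
```

Under this assumed tolerance the moved rows are IDs 4 and 12, which agrees with the rows the slide adds to the Outlier Table.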

Final Output
Recovery Tables:
  ID  Height
  3   156
  5   178
  6   175
  7   166
  8   159
  13  164

  ID  Height
  1   168
  9   166
  10  170
Outlier Table:
  ID  Height  Weight
  2   181     59
  11  171     95
  4   133     60
  12  140     65
Machine Learning Models
Output Statistics:
  Recovered Accuracy: 100%
  Percent Outliers: 30.77%
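The Percent Outliers figure follows directly from the example: four of the thirteen rows end up in the Outlier Table. A small check, with `percent_outliers` as a hypothetical helper name:

```python
def percent_outliers(num_outliers, num_rows):
    """Share of rows stored verbatim in the outlier table, in percent."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    return 100.0 * num_outliers / num_rows

outlier_ids = [2, 11, 4, 12]  # final Outlier Table from the slide
total_rows = 13               # rows 1..13 in the running example

print(f"{percent_outliers(len(outlier_ids), total_rows):.2f}%")  # prints 30.77%
```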

Multi-Column Deletion
Database Schema:
  Age (16 bits)
  Height (16 bits)
  Weight (32 bits)
  AP_HI (16 bits)
  AP_LO (16 bits)
  Cholesterol (2 bits)
  Gluc (2 bits)
  Smoke (1 bit)
  Alco (1 bit)
  Active (1 bit)
  Cardio (1 bit)
Question: For each column, how many other columns can it predict with an outlier table under 25% of the original table size?
Best result: AP_HI estimating Age and Height
  Outlier Table and Recovery Table: 82.59% of the original table size
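To make the storage question concrete, the sketch below adds up the column widths from the schema and estimates the combined size of the Recovery and Outlier Tables relative to the original table. The accounting is an assumption for illustration only: recovery rows drop the deleted columns, outlier rows keep every column, and the size of the ML models and of any keys or indexes is ignored. The deck does not say how its 82.59% figure was computed, and `storage_ratio` is a hypothetical name.

```python
# Column widths in bits, copied from the schema on the slide.
SCHEMA_BITS = {
    "Age": 16, "Height": 16, "Weight": 32, "AP_HI": 16, "AP_LO": 16,
    "Cholesterol": 2, "Gluc": 2, "Smoke": 1, "Alco": 1, "Active": 1, "Cardio": 1,
}

def row_bits(columns):
    """Total width in bits of the given columns."""
    return sum(SCHEMA_BITS[c] for c in columns)

def storage_ratio(deleted_columns, outlier_fraction):
    """Assumed accounting: size of recovery + outlier tables over the original size."""
    full = row_bits(SCHEMA_BITS)                 # bits per original row
    kept = full - row_bits(deleted_columns)      # bits per recovery-table row
    return ((1 - outlier_fraction) * kept + outlier_fraction * full) / full

print(row_bits(SCHEMA_BITS))                     # bits per full row
print(storage_ratio(["Age", "Height"], 0.25))    # example with a 25% outlier share
```

Under this accounting the saving shrinks linearly as the outlier share grows, which is why the size of the Outlier Table is the quantity the slide's question bounds.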

Applicability of Postdiction & Scaling
[Figure; recoverable label: Inflection Point]

Future Work
Extending the functionality of the postdiction pipeline:
  Enrich the clustering
  Enrich outlier detection
  Incorporate more machine learning models
    LSTM is an extremely bulky model, used for time series data
  Automate the selection of clustering/outlier detection/ML
  Enable postdiction chaining
  Efficient selection of the optimal subset of features for postdiction

Chaining Postdiction
  ID      Height  Weight  AP_HI  AP_LO  Age  Cholesterol  Glucose
  1       168     62      110    80     33   1            1
  2       181     95      130    90     24   1            2
  75,000  164     68      110    60     67   2            3
Remaining Features: Height, Age

Thank You! Any Questions?
