
Interactive Exploration of Big Datasets with AIDE Models
Explore big datasets effectively with AIDE models by eliminating query formulation steps, reducing reviewing overhead, and offering an interactive experience. Active learning and relevance feedback enhance user involvement in data exploration.
Kyriaki Dimitriadou, Brandeis University; Olga Papaemmanouil, Brandeis University; Yanlei Diao, UMass Amherst
Human-in-the-loop application: explore big datasets to discover interesting data
Example: astrophysical data (e.g., SDSS) [SDSS DR7]
  Discovery of interesting sky objects
  Manual iterative exploration:
    1. Query formulation
    2. Query processing
    3. Result reviewing (and back to step 1)
  Challenges:
    Ad-hoc queries: correct predicates are unknown a priori
    Labor intensive: thousands of objects to review
    Resource intensive: execution of long query sequences on big data
AIDE's exploration model
  Relies on the user's relevance feedback on data samples
  Eliminates the query formulation step
  Navigates the user through the data space
  Reduces result reviewing overhead
AIDE's performance goals
  Effectiveness: captures user interests with high accuracy
  Efficiency: minimizes reviewing effort and offers an interactive experience
[Figure: AIDE framework. Components: Relevance Feedback (relevant and irrelevant samples), Data Classification (user model), Query Formulation (SQL data extraction query), Space Exploration (sampling queries that return new samples)]
Which data samples to show to the user?
  User interests are unknown a priori
  Labeled samples are collected in real time
  Active learning:
    Domain specific (e.g., image retrieval, document ranking)
    Examines all samples in the database
    Data samples are provided by a different party
How to offer interactive exploration times?
  Sample extraction cost is significant in big datasets
  Accuracy vs. efficiency trade-off
AIDE couples model learning and sample acquisition
Assumption: user interests form N relevant areas
Target queries: disjunctive or conjunctive (range) queries
[Figure: relevant areas in the red wavelength vs. green wavelength space]
1. Relevant Object Discovery: discover relevant objects
2. Misclassified Sample Exploitation: identify relevant areas
3. Boundary Exploitation: refine relevant areas
[Figure: data samples (x) in the red wavelength vs. green wavelength space]
[Figure: data samples (*) in the red wavelength vs. green wavelength space]
Labeled samples:
  Sample    Red    Green  Relevant
  Object A  13.67  12.34  Yes
  Object B  15.32  14.50  No
  ..        ..     ..     ...
  Object X  14.21  13.57  Yes
Decision tree classifier:
  red > 14.82: Irrelevant
  red <= 14.82:
    red < 13.55: Irrelevant
    red >= 13.55:
      green > 13.74: Irrelevant
      green <= 13.74: Relevant
Data extraction query:
  SELECT * FROM galaxy WHERE red <= 14.82 AND red >= 13.55 AND green <= 13.74
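The step from a decision tree to a data extraction query can be sketched as follows. This is a minimal illustration, not AIDE's actual implementation: the nested-tuple encoding of the tree and the helper names `relevant_paths` and `tree_to_sql` are assumptions made for this example, and the thresholds are the ones shown on the tree in the slide. Each root-to-leaf path that ends in a Relevant leaf becomes one conjunction, and several such paths would be joined with OR.

```python
# Hypothetical encoding of the slide's tree: a leaf is a label string,
# an internal node is (attribute, [(operator, threshold, child), ...]).
TREE = ("red", [
    (">", 14.82, "Irrelevant"),
    ("<=", 14.82, ("red", [
        ("<", 13.55, "Irrelevant"),
        (">=", 13.55, ("green", [
            (">", 13.74, "Irrelevant"),
            ("<=", 13.74, "Relevant"),
        ])),
    ])),
])


def relevant_paths(node, path=()):
    """Yield the predicate list of every path that ends in a Relevant leaf."""
    if isinstance(node, str):
        if node == "Relevant":
            yield list(path)
        return
    attribute, branches = node
    for operator, threshold, child in branches:
        predicate = f"{attribute} {operator} {threshold}"
        yield from relevant_paths(child, path + (predicate,))


def tree_to_sql(tree, table="galaxy"):
    """Turn the Relevant paths into one query: AND within a path, OR across paths."""
    conjunctions = [" AND ".join(p) for p in relevant_paths(tree)]
    if not conjunctions:
        # No relevant leaf at all: return a query that selects nothing.
        return f"SELECT * FROM {table} WHERE 1 = 0"
    where = " OR ".join(f"({c})" for c in conjunctions)
    return f"SELECT * FROM {table} WHERE {where}"


print(tree_to_sql(TREE))
```

With a single relevant leaf the result is one conjunctive range query; a tree with several relevant leaves would yield a disjunction of such ranges.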
[Figure: the predicted area in the red wavelength vs. green wavelength space, with false negatives (relevant samples outside the predicted area) and a false positive (an irrelevant sample inside it)]
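One way to identify false negatives and false positives is to compare the user's labels against the currently predicted areas. The sketch below assumes rectangular areas stored as attribute ranges; the function names and the sample values are made up for illustration and are not taken from the presentation.

```python
def in_area(sample, area):
    """True if the sample lies inside a rectangular area.

    `area` maps each attribute name to an inclusive (low, high) range.
    """
    return all(low <= sample[attr] <= high for attr, (low, high) in area.items())


def find_misclassified(labeled_samples, predicted_areas):
    """Split user-labeled samples into false negatives and false positives."""
    false_negatives, false_positives = [], []
    for sample, is_relevant in labeled_samples:
        predicted_relevant = any(in_area(sample, a) for a in predicted_areas)
        if is_relevant and not predicted_relevant:
            false_negatives.append(sample)   # relevant, but outside every predicted area
        elif predicted_relevant and not is_relevant:
            false_positives.append(sample)   # irrelevant, but inside a predicted area
    return false_negatives, false_positives


# Made-up data for illustration only.
predicted_areas = [{"red": (13.55, 14.82), "green": (0.0, 13.74)}]
labeled_samples = [
    ({"red": 14.21, "green": 13.57}, True),   # inside, relevant: correct
    ({"red": 15.10, "green": 13.00}, True),   # outside, relevant: false negative
    ({"red": 14.00, "green": 12.00}, False),  # inside, irrelevant: false positive
    ({"red": 15.32, "green": 14.50}, False),  # outside, irrelevant: correct
]
fn, fp = find_misclassified(labeled_samples, predicted_areas)
print("false negatives:", fn)
print("false positives:", fp)
```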
[Figure: sampling areas in the red wavelength vs. green wavelength space]
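A sampling area can be expressed as a range query over the exploration attributes. The sketch below assumes that a sampling area is a box centered on a misclassified sample; the `half_width` parameter and the function name `sampling_area_query` are hypothetical, and the presentation does not state how AIDE sizes or places these areas.

```python
def sampling_area_query(point, half_width, table="galaxy"):
    """Build a range query for a box of side 2 * half_width centered on `point`.

    `point` maps attribute names to values. Attributes are emitted in sorted
    order so the generated SQL is deterministic.
    """
    predicates = []
    for attr in sorted(point):
        low = round(point[attr] - half_width, 4)
        high = round(point[attr] + half_width, 4)
        predicates.append(f"{attr} BETWEEN {low} AND {high}")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(predicates)


# Made-up false negative used as the center of the sampling area.
query = sampling_area_query({"red": 14.0, "green": 13.0}, half_width=0.5)
print(query)
```

The query returns every object in the box; picking a few of them to show to the user is left out of this sketch.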
[Figure: clusters and sampling areas in the red wavelength vs. green wavelength space]
1) Eliminate irrelevant attributes
2) Refine the areas
[Figure: sampling areas in the red wavelength vs. green wavelength space]
SDSS dataset (10 GB to 100 GB)
Our target queries: based on the SDSS query workload
Effectiveness: F-measure = (2 × precision × recall) / (precision + recall)
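The F-measure used as the effectiveness metric can be computed as in the sketch below. It assumes that the predicted and the truly relevant objects are given as collections of object identifiers; the function name and the example identifiers are illustrative only.

```python
def f_measure(predicted, relevant):
    """Harmonic mean of precision and recall over two sets of object ids."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Made-up ids: 2 of 4 predicted objects are relevant (precision 1/2),
# and 2 of 6 relevant objects are found (recall 1/3).
score = f_measure({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(round(score, 3))
```

In this example precision is 1/2 and recall is 1/3, so the F-measure is (2 × 1/2 × 1/3) / (1/2 + 1/3) = 0.4.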
[Figure: user effort (# of samples) vs. F-measure (40% to 100%) for AIDE-Large, AIDE-Medium, and AIDE-Small]
A small number of samples is enough to predict common conjunctive queries.
[Figure: time (sec) vs. F-measure (40% to 100%) for AIDE-Large, AIDE-Medium, and AIDE-Small]
User wait time is <6 sec for conjunctive queries and <8 sec for complex disjunctive queries.
[Figure: F-measure (%) vs. user effort (# of samples, 250 to 500) for the 10G, 50G, and 100G datasets]
Scaling to larger datasets has no significant impact on effectiveness.
[Figure: user effort (# of samples) vs. number of areas (1, 3, 5, 7) for 2D, 3D, 4D, and 5D spaces; F-measure >70%]
AIDE identifies irrelevant attributes in N-dimensional spaces with small overhead.
User study with 7 real users
  AuctionMark dataset
  Exploration for good deals
  Manual exploration: query formulation, query processing, review results, and repeat
1. select i_initial_price, i_current_price from ITEM order by i_initial_price;
2. select i_initial_price, i_current_price from ITEM where i_initial_price < i_current_price order by i_initial_price;
3. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 2 order by i_initial_price;
4. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 2 and i_current_price > 1000 order by i_initial_price;
5. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 3 order by i_initial_price;
6. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 3 and i_current_price > 1000 order by i_initial_price;
7. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 4 order by i_initial_price;
8. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 10 order by i_initial_price;
9. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 50 order by i_initial_price;
10. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 70 order by i_initial_price;
11. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 90 order by i_initial_price;
12. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 3 and i_num_bids > 90 order by i_initial_price;
13. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 2 order by i_initial_price;
14. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 2 and i_days_to_close > 0 order by i_initial_price;
15. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 3 and i_days_to_close > 0 order by i_initial_price;
16. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 3 and i_days_to_close > 5 order by i_initial_price;
17. select i_initial_price, i_current_price, i_num_comments from ITEM where i_current_price > i_initial_price * 3 order by i_initial_price;
AIDE outperformed manual exploration:
  66% average reduction in user effort
  42% average reduction in exploration time
N. Kamat et al. Distributed Interactive Cube Exploration. ICDE 2014
Leilani Battle et al. Dynamic Reduction of Query Result Sets for Interactive Visualization. BigDataVis Workshop 2013
A. Kalinin et al. Interactive Data Exploration using Semantic Windows. SIGMOD 2014
M. Drosou et al. YMALDB: exploring relational databases via result-driven recommendations. VLDB Journal 2013
Neophytou et al. AstroShelf: Understanding the Universe through Scalable Navigation of a Galaxy of Annotations. SIGMOD 2012
A. Albarrak et al. SAQR: An Efficient Scheme for Similarity-Aware Query Refinement. DASFAA 2014
AIDE (Automatic Interactive Data Exploration):
  Assists the user in discovering interesting data objects
  Eliminates ad-hoc exploratory queries
Highly efficient and effective exploration:
  Captures user interests with high accuracy
  Requires low reviewing effort from the user
  Offers an interactive experience
[Figure: number of samples vs. F-measure (40% to 100%) for 1, 3, 5, and 7 areas]
A small number of samples is enough to accurately predict complex disjunctive queries.
[Figure: number of samples for AIDE and AIDE-Clustering under NoSkew and Skew exploration space distributions]
[Figure: time improvement (%) vs. number of areas (1, 3, 5, 7) for the 10G, 50G, and 100G datasets]
Running our exploration on a sample database led to an 85-98% improvement in time. Time per iteration: 2-7 sec.
[Figure: number of samples vs. F-measure (40% to 100%) for Random-Grid, Random-Grid+Misclassified, and AIDE]
Predicting areas (accuracy >70%):
  Both random solutions fail for small areas (>5,450 samples)
  For medium and large areas, Random needs >1,180 samples and Random-Grid needs >1,275 samples
[Figure: number of samples for large, medium, and small areas; AIDE, Random, and Random-Grid]
Predicting multiple areas (accuracy >60%):
  AIDE needs <500 samples in all cases
  Random and Random-Grid need >1,000 samples in all cases
[Figure: number of samples vs. number of areas (1, 3, 5, 7) for AIDE, Random, and Random-Grid]