
New Paradigm for Approximate Query Processing - Insights from DAQ
Explore the world of approximate query processing and its benefits in supporting real-time decisions, with a focus on Sampling-based Approximate Querying (SAQ) and Deterministic Approximate Querying (DAQ). Understand confidence intervals, probability distributions, and the challenges associated with interpreting them. Learn about the limitations and shortcomings of the existing methodologies in this domain.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015
Outline Approximate Query Processing SAQ DAQ Bitwise DAQ Scheme Evaluation Conclusion 2
Approximate Query Processing Data volume is growing exponentially Queries are interactive to support real-time decisions Decisions are resilient to small errors Quick Approximate Answer eg: Average Revenue estimate $12M is about as good as $12,345,678 Exploratory analysis is better than demands responsiveness Slow Exact Answer 3
SAQ Sampling-based Approximate Querying Run query on a small random subset of data Error in estimate presented as confidence interval Avg revenue = $12.3 0.1 million with 95% confidence Can be online error eventually shrinks to zero => exact estimate 4
Confidence Intervals Avg revenue = $12.3 0.1 million with 95% confidence What does this mean? Probability Distribution of Average Revenue With 95% probability, true average revenue lies in 12.3 0.1 million With 5% probability, true average revenue lies outside 12.3 0.1 million 95% 12.2 12.3 12.4 5 5%
Confidence Intervals eg: Avg revenue = $12.3 0.1 million with 95% confidence How should we interpret the tails? Probability Distribution of Average Revenue Tails occur 5% of the time. In Avg Regional Revenue for 100 states, 5 estimates are in the tails There is no bound on error in the tail. The Avg Revenue could be as small as $1M or as large as $100M 95% 12.2 12.3 12.4 6 5%
SAQ: Shortcomings Semantics of the tails of confidence intervals are hard to interpret. Intervals are very broad for outlier aggregates like MAX or Top 100. Intervals are hard to manipulate. No closed algebra. Confidence interval bounds are unintuitive Need to see more of the data to find outliers. Slow convergence. Is 100 10 greater than 90 20? How do we add these intervals? 95% 12.2 12.3 12.4 7 5%
DAQ Deterministic Approximate Querying 5,321,656 Pop quiz! Estimate the sum. + 3,151,709 + 1,362,296 _________________________ = ? ? ? ? ? = 9,???,??? = 9,7??,??? = 9,83?,??? a. Approximately 2.4 million? a. Approximately 2.4 million? b. Approximately 9.7 million? b. Approximately 9.7 million? c. Approximately 13.8 million? c. Approximately 13.8 million? d. Approximately 17.0 million? d. Approximately 17.0 million? 8
DAQ Deterministic Approximate Querying Use deterministic intervals instead of probabilistic (confidence) intervals Guaranteed upper and lower bounds Avg revenue = $12.3 0.2 million Can be online Error interval eventually becomes degenerate => exact estimate DAQ UI Mockup 9
SAQ vs DAQ (at a glance) SAQ DAQ Complex semantics using confidence intervals due to the tail . Simple semantics using deterministic intervals as there is no tail . Slow for outlier aggregates like MAX or Top 100 and heavy-tailed data. Fast for outlier aggregates like MAX or Top 100 and heavy-tailed data. No closed algebra. No clear semantics for predicates and arithmetic operations on estimates. Closed relational algebra. Clear semantics for predicates and arithmetic ops using interval algebra. 10
Conceptual DAQ Scheme Hierarchically partition the attribute s domain Estimates are represented as intervals [a,b] 150 0 150 0 45 126 150 0 20 45 100 133 126 150 0 7 20 27 45 55 75 100 120 133 11
Conceptual DAQ Scheme Hierarchically partition the attribute s domain Estimates are represented as intervals [a,b] e.g., Count B-Tree 12
Interval Algebra Predicate evaluation Interval representation for relations City Shire Rivendell Gondor Est. Population [110,120] [ 70, 90] [ 80,120] Which cities have population > 100? City Shire Rivendell Gondor Est. Population [110,120] [ 70, 90] [ 80,120] City Shire Rivendell Gondor Est. Population [110,120] [ 70, 90] [ 80,120] , Certainly > 100 Potentially > 100 13
Formal Definition Interval estimates for attributes and relations Operators consume and produce intervals Closed relational algebra Online DAQ scheme monotonically shrinking intervals converges to exact estimate 14
Bitwise DAQ Scheme Similar to the decimal digit-wise sum example Uses Bitsliced Index representation 15
Bitwise DAQ Scheme Use most significant m bits for evaluation Remaining n-m bits set to all-0 and all-1 for bounds Error bound decreases exponentially: 2n-m 16
Bitwise DAQ Scheme Algorithms Average aggregation upto m bits 17
Bitwise DAQ vs. Baseline Exponentially decreasing error bounds in estimating Avg 18
Bitwise DAQ Scheme Algorithms Less Than predicate upto m bits 19
Bitwise DAQ vs. Baseline Predicate evaluation: 6x speedup using 8 bits for < 1% error 20
Bitwise DAQ vs. Baseline Top 100: 3.5x speedup for < 1% error on Uniform, Zipf data 21
Bitwise DAQ vs. SAQ Top 100: DAQ performs better for heavy-tailed data (Zipf) 22
Bitwise DAQ Summary Suffers for Simpler aggregates (sum, avg) Uniform data Excels at Predicates Outlier aggregates Heavy-tailed data Compression improves performance further Can operate directly on compressed bitvectors > 8x speedup for Top 100 on Zipf data Embarrassingly parallelizable NUMA-aware parallelization 23
Future Work B-tree like indices Extension to other operators (Group By, Join) Extension to other data types Query optimization Combining SAQ and DAQ 24