Uncertain Data Range Queries: Techniques and Applications
Explore techniques for handling range queries on uncertain data points, including uncertain point probabilities and cumulative distribution functions. Learn how to compute probabilities using CDF and solve range query problems efficiently.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Range Queries on Uncertain Data Range Queries on Uncertain Data Jian Li, Tsinghua University Haitao Wang, Utah State University ISAAC 2014
One dimensional range queries One dimensional range queries Input: a set of points on a line Given a query interval ?, return the points in the interval ? A trivial solution: balanced binary search tree
An uncertain point An uncertain point p p p can appear in different locations with probabilities Give a query interval ?, Pr[? ?]: the probability of p in ?, called the ?-probability of p 0.1 0.3 0.2 0.4 ? Pr[? ?] = 0.5
An uncertain point An uncertain point p p: A general case : A general case The location of p is specified by its PDF (probability density function) ??? , which is a step function or histogram Give a query interval ?, Pr[? ?]: the ?-probability of p ??? 0.25 0.22 0.2 0.15 ? ?
The cumulative distribution function (CDF) The cumulative distribution function (CDF) ? CDF: ??? = a piecewise linear function ??? ?? ??? 1 ??? ??? 0 ? ?
Computing the Computing the ?- -probability using CDF probability using CDF ? CDF: ??? = a piecewise linear function A query interval ? =[??, ??] Pr[p ?]= ??(??) ???? ??? ??? ?? ??(??) ???? ? ?? ??
Range query problems on uncertain Range query problems on uncertain points points Input: a set P of n uncertain points For any query interval ?: top-1 query: return the point in P with largest ?-probability top-k query: return the k points in P with largest ?- probabilities threshold query: given any threshold t, return the points in P with ?-probabilities t Goal: build data structures on P to quickly answer these queries
An application on deterministic data An application on deterministic data
An application on deterministic data An application on deterministic data (cont.) (cont.) A query interval ?=[7,+ ) top-1 query: find the movie whose total percentage of the ratings 7 is the largest top-k query: find the top-k movies whose total percentages of the ratings 7 are the largest threshold query: e.g., for t = 0.8, find the movies whose total percentages of the ratings 7 are 80%
Previous work: only on threshold Previous work: only on threshold queries queries A heuristic solution using R-trees, Cheng et al. VLDB 04 fast in practice, but O(n) time in the worst case Theoretical results: Agarwal et al. PODS 09 preprocessing: O(n log2n) space and O(n log3n) expected time query: O(m+log3n) time, where m is the output size A special case: t is fixed for all queries, preprocessing: O(n) space and O(n log n) time query: O(m + log n) time, where m is the output size Heuristic solutions in 2-D or higher-D, Tao et al. 2005 O(n) time in the worst case
An application on deterministic data An application on deterministic data (cont.) (cont.) A query interval ?=[7,+ ) top-1 query: find the movie whose total percentage of the ratings 7 is the largest top-k query: find the top-k movies whose total percentages of the ratings 7 are the largest threshold query: e.g., for t = 0.8, find the movies whose total percentages of the ratings 7 are 80%
Variations Variations The query interval ? =[??, ??] unbounded query: either ??= or ??= + bounded query: otherwise ??? : PDF of each uncertain point p uniform distribution: ??? has only one interval histogram distribution: otherwise four variations ??? ?
Our results: Our results: uniform unbounded uniform unbounded preprocessing time O(n log n) space query time top-1 O(n) O(log n) top-k O(n log n) O(n) O(k + log n) threshold O(n log n) O(n) O(m + log n)
Our results: Our results: histogram unbounded histogram unbounded preprocessing time O(n log n) space query time top-1 O(n) O(log n) top-k O(n log n) O(n) T threshold O(n log n) O(n) O(m + log n) T=O(k) if k = (log n loglog n) and O(log n + k log k) otherwise
Our results: Our results: uniform bounded uniform bounded preprocessing time O(n log n) space query time top-1 O(n) O(log n) top-k O(n log2n) O(n log n) T threshold O(n log2n) O(n log n) O(m + log n) T=O(k) if k = (log n loglog n) and O(log n + k log k) otherwise
Future work: histogram Future work: histogram bounded bounded No new results Previous work only on threshold queries, P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009 preprocessing: O(n log2n) space and O(n log3n) expected time query: O(m+log3n) time, where m is the output size
The The ?- -probability: unbounded probability: unbounded Given a query interval ? =[??, ??]: Pr[p ?]= ??(??) ???? If ??= , ???? = 0 and Pr[p ?]= ??(??) This is why the unbounded case is easier ??? ??(??) ???? ??(??) ? ?? ??
The arrangement of CDFs The arrangement of CDFs Key: the intersections of all CDFs with line ? ?? top-1: the highest intersection top-k: the highest k intersections threshold: the intersections above the threshold t ?(??) ??? t ? ??
Top Top- -1: unbounded 1: unbounded Preprocessing: compute the upper envelop of all CDFs Query: find the intersection of ? ?? with the upper envelop ??? ? ??
Difficulty for top Difficulty for top- -k queries k queries Arrangements of segments: difficult! Arrangements of lines: much easier! Uniform case: change each CDF to a line ??? 1 0 ? ?(??)
Uniform unbounded Uniform unbounded Given an arrangement of n lines, for any query vertical line ? ?? top-k: return the top k intersections threshold: return the intersections above t ??? t ? ??
A A half half- -plane range reporting data plane range reporting data structure structure Problem: Given a line arrangement, for any query point q, return the lines above q Data structure: Partition lines into layers: each layer consists of lines in the upper envelop after removing the previous layers
Threshold query: uniform unbounded Threshold query: uniform unbounded Given ? =( , ??] and the threshold t determine the intersections of ?(??) and the upper envelops above t for each such intersection, walk along the envelop towards left and right to find the lines that intersect ?(??) above t query time: O(log n + m) ?(??) t ??
Top Top- -k query: k query: uniform unbounded uniform unbounded Use a heap: O(log n + k log k) query time Observation: largest k elements in O(k) sorted arrays a selection algorithm on sorted matrices, Frederickson and Johnson, 82 ----> O(log n + k) time ?(??) ??
Our results: uniform unbounded Our results: uniform unbounded preprocessing time O(n log n) space query time top-1 O(n) O(log n) top-k O(n log n) O(n) O(k + log n) threshold O(n log n) O(n) O(m + log n)
Uniform Uniform bounded bounded Transform the problem to the unbounded case If ?? the left endpoint of the blue interval Pr[p ?] = Pr[p ? ] for ? =( , ??] It becomes the unbounded case! ? ?? ?? ?
Uniform Uniform bounded bounded (cont.) (cont.) Classify blue intervals into three types L-type: left endponits ?? R-type: right endponits ?? M-type: each contains ? ? ?? ??
Uniform Uniform bounded bounded (cont.) (cont.) Top-1 queries: L-type and R-type: use a persistent data structure to maintain O(n) upper envelops in the preprocessing M-type: transform to segment dragging queries in 2D Top-k queries: L-type and R-type: use a binary tree T, and on each node, build a data structure as in the unbounded case build a fractional cascading structure on T M-type: transform to a range query in 3D Threshold queries: Similar as for top-k queries
Histogram unbounded Histogram unbounded A segment query problem Given a set of n segments, for any point q, return all segments vertically above q P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009 preprocessing: O(n) space and O(n log n) time query: O(log n + m) time q
Thank you for your attention! Thank you for your attention!