Complexity of Parallel Query Processing in Shared-Nothing Architectures
The challenges of parallel query processing on big data in shared-nothing architectures such as MapReduce, focusing on communication costs and number of rounds. Explore the MPC model and compute conjunctive queries efficiently with tight bounds for relations of equal size. Discover results for skew-free and skewed input databases, along with insights on simple joins and hypergraphs.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
SKEW IN PARALLEL QUERY PROCESSING Paraschos Koutris Paul Beame Dan Suciu University of Washington PODS 2014
MOTIVATION Understand the complexity of parallel query processing on big data on shared-nothing architectures (e.g. MapReduce) even in the presence of data skew Dominating parameters of computation: Communication cost Number of communication rounds 2
THE MPC MODEL Computation proceeds in synchronous rounds Local Computation Global Communication #bits received at each rounds L 1 1 1 2 2 2 INPUT (size= M) round 1 round r OUTPUT . . . . . . . . . . . . p p p 3
THE MPC MODEL maximum load The data is evenly distributed Maximizes parallelism Equivalent to sequential computation No parallelism What is the minimum load L of an MPC algorithm that computes a Conjunctive Query Q in one round? [Beame, K, Suciu, PODS 2013] Tight upper and lower bounds for relations of equal size (M bits) and no skew 4
RESULTS Computing a Conjunctive Query Q in the MPC model in one round for relations with different sizes and skew Matching upper and lower bounds for any skew-free input database and different relation sizes Almost matching upper and lower bounds in the presence of skew Matching bounds in the case of simple joins 5
CONJUNCTIVE QUERIES Full Conjuctive Queries w/o self-joins: Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query] The hypergraph of the query Q: Variables as vertices Atoms as hyperedges x T R y z S 6
EXAMPLE: CARTESIAN PRODUCT The cartesian product: Q(x,y) = S1(x), S2(y) with cardinalities m1, m2 ALGORITHM Organize the p servers in a rectangle The load will be To minimize L choose S2(y) (*, h2(y)) S1(x) (h1(x), *) The algorithm is optimal 7
LOWER BOUNDS (1) For a cartesian product Q = S1 S2 Su the lower bound for load is For a Conjunctive Query Q(x1, , xk) = S1( ), , Sl( ) any subset of relations Sj1, Sj2, , Sju without shared variables (an edge packing for the hypergraph of Q) gives a lower bound for the load The lower bound also holds with any fractional edge packing 8
LOWER BOUNDS (2) Theorem For a Conjunctive Query Q, where relation Sj has size Mj (in bits), any MPC algorithm that computes Q in one round with maximum load L must satisfy for some constant c and for any fractional edge packing u: Proof techniques: Using entropy to bound knowledge Friedgut s inequality to bound the maximum size of a query 9
HYPERCUBE ALGORITHM Q(x1, , xk) = S1( ), , Sl( ) For each variable xi define the share to be an integer pi such that: p = p1 .. pk Assign each of the p servers to a point on the k- dimensional hypercube: [p] = [p1] [pk] Hash each tuple to the appropriate subcube e.g. S3(x3, x4) (* , *, h3(x3), h4(x4), *, ) 10
EXAMPLE: THE TRIANGLE QUERY Algorithm: [Ganguly 92, Afrati 10, Suri 11] The p servers form a cube: [p1/3] [p1/3] [p1/3] Send each tuple to servers: R(a, b) (hx(a), hy(b), - ) S(b, c) (-, hy(b), hz(c) ) each tuple replicated p1/3times T(c, a) (hx(a), -, hz(c) ) (hx(a), hy(b), hz(c)) 11
ANALYSISOF HYPERCUBE (1) For a vector of shares p = (p1, , pk), how is relation Sj distributed to the servers? Ideally, each server receives tuples Example: relation R(x, y) of the triangle query Ideal load L = M / #cells = M/p2/3 If R has a single value in the x-column, the load will instead be M/p1/3 The load will be O(M/p2/3) if each value appears in the x and y columns at most M/p1/3 times p1/3 p1/3 12
ANALYSISOF HYPERCUBE (2) In general, a relation Sj is skew-free w.r.t. to p if for any subset of variables x of vars(Sj), every value appears at most If every relation is skew-free w.r.t. p then the maximum load of the HYPERCUBE algorithm is: 13
ANALYSISOF HYPERCUBE (3) The maximum load of the HYPERCUBE algorithm is always bounded by Join with shares px = py = pz = p1/3 For a skew-free database, the load is O(M/p2/3) Otherwise, the load is always bounded by O(M/p1/3) 14
COMPUTING THE SHARES The optimal shares Linear Program (LP) are computed by solving a 15
ANALYSISOF HYPERCUBE By using an LP duality argument, we can prove that the load matches the lower bound Theorem For a conjunctive query Q, where relation Sj has size Mj and is skew-free, there exist shares such that the HYPERCUBE algorithm runs with maximum load pk(Q) = set of all fractional edge packings 16
EDGE PACKINGS FOR THE TRIANGLE Q(x, y, z) = R(x, y), S(y, z), T(z, x) x T R y z S Egde packing u Load (asymptotic) (1/2, 1/2, 1/2) (MRMSMT)1/3/p2/3 (1,0,0) MR/p (0,1,0) MS/p (0,0,1) MT/p 17
THE PRESENCEOF SKEW A simple join Q(x,y,z) = S1(x, z), S2(y, z) Optimal shares px = py = 1, pz = p Standard parallel hash-join If the database has no skew, L = O(max{M1, M2} /p) If it is skewed, the load can be as bad as O(M) (all tuples are sent to the same server) For any value h of z, mj(h) = frequency of h in Sj 18
SKEW-AWARE JOIN (1) Q(x,y,z) = S1(x, z), S2(y, z) Idea: identify the heavy hitters and treat them differently h is a heavy hitter in Sj if mj(h) > Mj/p h is light otherwise CASE 1 (LIGHT) For all light values h, run the HyperCube algorithm (hash- join on z) on all p servers 19
SKEW-AWARE JOIN (2) CASE 2 (HEAVY) For any heavy hitter h (either in S1 or S2) Compute the residual query (a cartesian product) Q[z\h] = S1(x, h), S2(y, h) using ph exclusive servers. Choose ph such that The sum of the ph is O(p) The load for every residual query Q[z\h] is the same 20
SKEW: SIMPLE JOIN Theorem Any MPC algorithm that computes the join queryin one round must satisfy: The skew-aware join achieves the above optimal load 21
SKEW IN CONJUNCTIVE QUERIES For any conjunctive query Q, our algorithm computes the light values using HYPERCUBE Since there is no skew, this part is optimal For the heavy hitters, it considers the residual queries and assigns appropriately an exclusive number of servers The values of the heavy hitters and their frequency must be known to the algorithm 22
CONCLUSION Summary Upper and lower bounds for computing Conjunctive Queries in the MPC model in the presence of skew Open Problems What is the load L when we consider more rounds? How do other classes of queries behave? 23
Thank you ! 24
DUALITY: EDGE PACKING Fractional edge packing: assign uj to Sj such that for each variable xi, the sum of edges that contain it is at most 1 x T R 1/2 q(x, y, z) = R(x, y), S(y, z), T(z, x) 1/2 y z 1/2 S By duality, the minimum value of the LP is equal to the maximum value, over all edge packings pk(q), of 25