Optimizing Distributed Query Processing with Semijoin and Hybrid Approaches

group 6 n.w
1 / 20
Embed
Share

Learn about semijoin-based and hybrid approaches for optimizing distributed query processing. Explore how the hill-climbing algorithm minimizes communication costs in semijoin-based approach, and the challenges of cost estimation in distributed systems with the hybrid approach.

  • Semijoin
  • Query Processing
  • Distributed Systems
  • Optimization
  • Cost Estimation

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Group #6 Varsha Bondugula #2 Soumika Kallem #5 Achi Reddy kurri #12 Kalyan chakravarthy Ala #17

  2. OUTLINE Semijoin-based Approach Hybrid Approach SQ-Allocation Conclusion

  3. 8.4.3 Semijoin-based Approach The semijoin-based approach with the algorithm of takes full advantage of the semijoin to minimize communication cost. The query optimization algorithm is derived from an earlier method called the hill-climbing algorithm. The algorithm does not use semijoins, nor does it assume data replication and fragmentation.

  4. The hill-climbing algorithm proceeds as follows: It involves computing the cost of all the execution strategies that transfer all the required relations to a single candidate result site, and then choosing the least costly strategy. Let us denote this initial strategy as ES0. Then the optimizer splits ES0 into two strategies, ES1 followed by ES2, where ES1 consists of sending one of the relations involved in the join to the site of the other relation.

  5. The two relations are joined locally and the resulting relation is transmitted to the chosen result site. If the cost of executing strategies ES1 and ES2, plus the cost of local join processing, is less than that of ES0, then ES0 is replaced in the schedule by ES1 and ES2.

  6. The process is then applied recursively to ES1 and ES2 until no more benefit can be gained. The hill-climbing algorithm is a greedy algorithm which start with an initial feasible solution and iteratively improve it. The problem is that strategies with higher initial cost, which could nevertheless produce better overall benefits, are ignored.

  7. 8.4.4 Hybrid Approach The static and dynamic distributed optimization approaches have the same advantages and disadvantages as in centralized systems. The problems of accurate cost estimation and comparison of QEPs at compile-time are much more severe in distributed systems. In addition, relations (or relation fragments) may be replicated at several sites.

  8. Thus, site and copy selection should be done at runtime to increase availability and load balancing of the system. The hybrid query optimization technique using dynamic QEPs. Several hybrid techniques have been proposed to optimize queries in distributed systems.

  9. They essentially rely on the following two-step approach: 1. At compile time, generate a static plan that specifies the ordering of operations and the access methods, without considering where relations are stored. 2. At startup time, generate an execution plan by carrying out site and copy selection and allocating the operations to the sites.

  10. Consider the following query expressed in relational algebra: s(R1) 1 R2 1 R3 Figure below show a 2-step plan for this query: The first step can be done by a centralized query optimizer. The second step carries out site and copy selection, possibly in addition to choose-plan operator execution.

  11. we illustrate this second step based on the seminal paper by Carey and Lu [1986] on two-step query optimization. We consider a distributed database system with a set of sites S = {s1, ..,sn}. A query Q is represented as an ordered sequence of subqueries Q = {q1,..,qm}. The average load of the system is

  12. The average load of the system is defined as: The balance of the system for a given allocation of subqueries to sites can be measured as the variance of the site loads using the following unbalance factor

  13. 8.5 Conclusion We discussed the basic concepts and techniques for distributed query optimization. The details of the environment (centralized versus distributed) are captured by the search space and the cost model. The search space describes the equivalent execution plans for the input query. These plans differ on the execution order of operations and their implementation, and therefore on performance.

  14. The cost model is key to estimating the cost of a given execution plan. To be accurate, the cost model must have good knowledge about the distributed execution environment. The search strategy explores the search space and selects the best plan, using the cost model. The most popular search strategy is dynamic programming which enumerates all equivalent execution plans with some pruning.

  15. However, it may incur a high optimization cost for queries involving large number of relations. Thus, it is best suited when optimization is static (done at compile time) and amortized over multiple executions. Randomized strategies, such as Iterative Improvement and Simulated Annealing, have received much attention.

  16. They do not guarantee that the best solution is obtained, but avoid the high cost of optimization. Thus, they are appropriate for ad-hoc queries which are not repetitive. Mostly we join queries for two reasons: join queries are the most frequent queries in the relational framework and they have been studied extensively.

  17. Furthermore, the number of joins involved in queries expressed in languages of higher expressive power than relational calculus can be extremely large, making the join ordering more crucial. Distributing unions over joins is a simple and good approach since the query can be reduced as a union of join subqueries, which are optimized individually.

Related


More Related Content