Efficient Query Plan Compilation for Modern Hardware

Explore the efficient compilation of query plans for modern hardware, optimizing query performance on the CPU, with a detailed look at the Iterator/Volcano, Materialization, and Vectorized/Batch processing models.

  • Query Optimization
  • Modern Hardware
  • Processing Models
  • Database Performance
  • SQL Queries


Presentation Transcript


  1. Efficiently Compiling Efficient Query Plans for Modern Hardware. Thomas Neumann. Presenter: Nitin Vinod

  2. Introduction When main memory is large enough, query performance is determined by the CPU rather than disk I/O. The classic iterator style is simple but does not perform well; a hand-written execution plan is much faster. How do we run queries? Basically, we put data in and ask questions about it: a SQL query is converted into a query plan, which can be viewed as a tree representing the flow of data.

  3. Query Plan Data flows from the leaves to the root, and the output of the root is the required result. The flow of data is represented by the red arrows. The processing model answers the questions: How is the query plan executed? How is the execution of operators organized? How is the data moved around?

  4. Processing Model There are three approaches: Iterator/Volcano style processing, the materialization model, and the vectorized/batch model.

  5. Iterator Model Processing is top-down in this model. Each operator exposes a next function, which either fetches a tuple itself or calls its children to retrieve tuples. This is carried out from the root of the tree down to its leaves, driven by a loop at the root. Image taken from slides prepared by Prof Andy Pavlo, CMU DB group

  6. Iterator Model next is called on each child successively until the bottom of the tree is reached. The retrieved values are then passed back up to the calling function; this is how data reaches the root. Image taken from slides prepared by Prof Andy Pavlo, CMU DB group
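The pull-based interface described above can be sketched as follows. This is a minimal illustration with invented names (Operator, Scan, Select) and single-column integer tuples, not the code of any system discussed here:

```cpp
#include <optional>
#include <vector>

// Iterator (Volcano) model sketch: one virtual next() call per tuple.
struct Operator {
    virtual std::optional<int> next() = 0;
    virtual ~Operator() = default;
};

// Leaf operator: produces tuples straight from a base table.
struct Scan : Operator {
    const std::vector<int>& table;
    size_t pos = 0;
    explicit Scan(const std::vector<int>& t) : table(t) {}
    std::optional<int> next() override {
        if (pos == table.size()) return std::nullopt;
        return table[pos++];
    }
};

// Selection: pulls from its child until a qualifying tuple is found,
// then pushes that one tuple back up to its caller.
struct Select : Operator {
    Operator& child;
    int key;
    Select(Operator& c, int k) : child(c), key(k) {}
    std::optional<int> next() override {
        while (auto t = child.next())
            if (*t == key) return t;
        return std::nullopt;
    }
};

// The root drives execution with a loop of next() calls.
inline std::vector<int> run(Operator& root) {
    std::vector<int> out;
    while (auto t = root.next()) out.push_back(*t);
    return out;
}
```

Note how every tuple costs at least one virtual call per operator in the pipeline, which is exactly the overhead the paper targets.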

  7. Iterator Model One tuple is passed through a pipeline of multiple operators: all processing on it is done before it is passed up. This approach comes from disk-based systems, where there may not be space to hold intermediate results in memory; if a tuple is in memory, all the processing it requires is finished right then, because otherwise there is a good chance the data spills to disk. The idea is to do the maximum amount of work on a tuple while it is in memory. Examples are SQLite, MySQL, and Oracle. The model works well with a limited number of tuples and is meant for general-purpose use.

  8. Materialization Model Processing is bottom-up in this model. Each operator processes all of its input at once and emits all of its output in one shot. This output is materialized as a single result and stored in an output buffer; the data then moves upward one operator at a time. Image taken from slides prepared by Prof Andy Pavlo, CMU DB group

  9. Materialization Model Better for transactional workloads, because the amount of data touched is smaller; not good for analytical workloads. The output buffers then stay relatively small. Examples are VoltDB and MonetDB.
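The materialization model can be sketched in a few lines: each operator consumes its entire input and returns a fully materialized output buffer. The helper names are illustrative, and single-column integer tuples are assumed:

```cpp
#include <vector>

// One "tuple buffer": a fully materialized intermediate result.
using Buffer = std::vector<int>;

// Leaf: materialize the entire base table as one output buffer.
inline Buffer scanAll(const Buffer& table) { return table; }

// Selection: processes all tuples at once and emits all output in one
// shot; the parent sees nothing until the buffer is complete.
inline Buffer selectAll(const Buffer& in, int key) {
    Buffer out;
    for (int t : in)
        if (t == key) out.push_back(t);
    return out;
}
```

For OLTP, these buffers stay small; for analytical queries over millions of tuples, each intermediate buffer can become very large, which is the drawback noted above.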

  10. Vectorization Model Inside the output buffer the data is cut into batches of a fixed size, and multiple tuples are sent out at a time. Each batch is then moved up, and every iteration of the loop starts with a fresh batch of data. Used in OLAP systems. Examples are Peloton and Presto. Image taken from slides prepared by Prof Andy Pavlo, CMU DB group
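The vectorized model sits between the previous two: like the iterator model it pulls, but each call returns a batch of tuples, amortizing the per-call overhead. A hedged sketch with an illustrative batch size and names:

```cpp
#include <algorithm>
#include <vector>

// Illustrative batch size; real systems tune this to cache sizes.
constexpr size_t kBatchSize = 1024;

struct VectorScan {
    const std::vector<int>& table;
    size_t pos = 0;
    explicit VectorScan(const std::vector<int>& t) : table(t) {}
    // Returns up to kBatchSize tuples; an empty batch signals end of input.
    std::vector<int> nextBatch() {
        size_t n = std::min(kBatchSize, table.size() - pos);
        std::vector<int> batch(table.begin() + pos, table.begin() + pos + n);
        pos += n;
        return batch;
    }
};

// A selection processes the whole batch in a tight loop before
// handing the surviving tuples up to its parent.
inline std::vector<int> filterBatch(const std::vector<int>& batch, int key) {
    std::vector<int> out;
    for (int t : batch)
        if (t == key) out.push_back(t);
    return out;
}
```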

  11. Disadvantages The iterator model has poor code locality and complex bookkeeping: if a simple scan is done over a compressed stream, the scan operator has to remember where in the stream the current tuple is. Pure pipelining is not possible in the iterator model when an operator produces more than one tuple per call. Some data has to be materialized before it can be used at all, for example in a join. Hand-written programs are much faster than most vectorized systems as well as the other models.

  12. Proposed Query Compilation Strategy The idea is to keep data in CPU registers as long as possible. Data is not pulled by the operators but pushed towards them, which improves data locality. Queries are compiled into native machine code using the LLVM compiler framework. (Using LLVM allows the generated code to match hand-written programs, at times even beating them. Working at the assembler level gives more control over the generated code than a high-level language like C++. Other benefits include inheriting future improvements in compilers and code optimization for free.)

  13. Example Iterator or block-oriented execution models are not well suited here because they break the pipeline whenever they produce tuples beyond register capacity. An algebraic operator is a pipeline breaker if it takes a tuple out of the CPU registers or materializes incoming tuples. The data-flow control is reversed: instead of being pulled, tuples are pushed towards their consumer operators, from one pipeline breaker to the next.

  14. Query Processing Architecture Iterator or block-oriented execution models are not well suited here because they break the pipeline whenever they produce tuples beyond register capacity. An algebraic operator is a pipeline breaker if it takes a tuple out of the CPU registers or materializes incoming tuples. The data-flow control is reversed: instead of being pulled, tuples are pushed towards their consumer operators, from one pipeline breaker to the next.

  15. Example Query If this query followed iterator-style execution, it would work as described in the previous slides. Instead, queries are compiled in such a manner that all pipelining operations happen inside the CPU (without materialization). Fully in-memory computation is assumed.

  16. Example Query contd. There are four code fragments that correspond to the pipeline fragments. The first fragment filters tuples from R1 and places them in the hash table of the join a=b. The second does the same for R2 and the aggregation Γ. The third transfers results from Γ into the hash table of the join z=c. The fourth and final fragment passes the tuples of R3 along the join hash tables and produces the result.

  17. Compiling Algebraic Expressions The first fragment combines the scan of R1, the selection x=7, and the build part of the join a=b into one code fragment (the boundary between operators is blurred: compilation is data-centric, not operator-centric). Each code fragment performs all actions that can be done within one part of the execution pipeline before materializing the result into the next pipeline breaker. This leads to a highly irregular code structure spread over multiple fragments.

  18. Compiling Algebraic Expressions contd. Now near-optimal assembly code can be generated; the complex operator logic aids this, as opposed to the iterator style, which has a simplified interface but loses out because of virtual function calls and frequent memory accesses. The operator interface offers two functions, produce and consume.

  19. Compiling Algebraic Expressions contd. Conceptually, the produce function asks an operator to produce its result tuples, which are then pushed towards the consuming operator by calling its consume function. For our running example, the query would be executed by calling join(a=b).produce. This produce function would in turn call the selection's produce (x=7.produce) to fill its hash table, and the selection would call R1.produce to access the relation.

  20. Compiling Algebraic Expressions contd. R1 is a leaf in the operator tree, i.e., it can produce tuples on its own. Therefore it scans the relation R1 and, for each tuple, loads the required attributes and calls x=7.consume(attributes, R1) to hand the tuple to the selection. The selection filters each tuple and, if it qualifies, passes it on by calling join(a=b).consume(attributes, x=7).

  21. Compiling Algebraic Expressions contd. The join sees that it gets tuples from its left side and thus stores them in the hash table. After all tuples from R1 are produced, the control flow goes back to the join, which will call join(c=z).produce to get the tuples from the probe side, and so on. The final algebraic plan is thus converted into an imperative program instead of being interpreted as physical algebra.
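The produce/consume walkthrough above can be sketched for one pipeline of the example: scan R1, filter x=7, and build the join hash table. This is an illustrative hand-written version of what the generated code looks like conceptually (the names and the use of std::unordered_multiset are assumptions, not the paper's actual code):

```cpp
#include <unordered_set>
#include <vector>

// Pipeline breaker: the build side of the join materializes every
// incoming tuple into its hash table.
struct HashJoinBuild {
    std::unordered_multiset<int> hashTable;
    void consume(int tuple) { hashTable.insert(tuple); }
};

// Selection x = key: qualifying tuples are pushed on to the consumer,
// all others are dropped without leaving the registers.
struct Selection {
    HashJoinBuild& parent;
    int key;
    void consume(int tuple) { if (tuple == key) parent.consume(tuple); }
};

// Leaf: produce() is the tight loop over the base table; control flows
// downward while tuples are pushed upward via consume().
struct ScanR1 {
    const std::vector<int>& r1;
    Selection& parent;
    void produce() { for (int t : r1) parent.consume(t); }
};
```

The key point is that the whole pipeline collapses into one loop over R1, with no per-tuple function-call boundary between scan, selection, and hash-table build.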

  22. Compiling Algebraic Expressions contd. Initially, generating C++ code was tried but scrapped due to optimization difficulties; C++ also did not offer total control over the generated code. Instead, the Low Level Virtual Machine (LLVM) compiler framework is used to generate portable assembler code, which can then be executed directly using an optimizing JIT compiler provided by LLVM. The LLVM assembler is portable across machine architectures, as only the LLVM JIT compiler translates it into architecture-dependent machine code.

  23. Compiling Algebraic Expressions contd. LLVM is strongly typed, which caught many bugs in the original textual C++ code generation. Writing assembler is difficult compared to writing C++, and a lot of database logic is already written in C++ as well. Therefore C++ methods may be called from LLVM and vice versa, which results in a mixed execution model.

  24. Complex Operators A complex query would not be efficient if it were translated into a single function. Instead, LLVM code may call C++ code, which then takes over the control flow. For good performance, however, 99% of the code executed per tuple must be pure LLVM.

  25. Performance Tuning Compilation increases speed so dramatically that code fragments unexpectedly end up becoming bottlenecks. Example: hashing, on which more than 50% of the time was spent in one test. Branch prediction is very important; a branch probability of 50% predicts poorly, so the query compiler must produce code that allows good branch prediction.

  26. Performance Tuning contd. In the first version, the while loop mixes up checking for a matching entry with checking whether the end of the collision list has been reached. The first condition will almost always be true, as the hash table is expected to be well filled, and the second almost always false. The second version restructures the code to be more prediction-friendly, improving lookups by 20%.
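A prediction-friendly chain lookup in the spirit of the fix described above might look like this. The hand-rolled chained hash table is an illustration, not the paper's code: the almost-always-true "entry != nullptr" continuation check is the sole loop condition, while both rare outcomes (match found, chain exhausted) are exits:

```cpp
#include <cstddef>

// One entry in a bucket's collision chain.
struct Entry { int key; int value; Entry* next; };

inline Entry* lookup(Entry* const* buckets, size_t nBuckets, int key) {
    Entry* e = buckets[size_t(key) % nBuckets];
    // Hot loop: the continuation branch is taken nearly every iteration
    // when the table is well filled, so it predicts well.
    for (; e; e = e->next)
        if (e->key == key)
            return e;    // rare exit: match found
    return nullptr;      // rare exit: end of the collision chain
}
```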

  27. Advanced Parallelization Technique Tuples need not be processed one at a time. Multiple tuples may be processed together as long as the block can be kept in registers. Using SIMD registers enables this and allows branching to be delayed. Parallelizing across multiple cores can also be supported with almost no code changes.
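One way delayed branching shows up in practice is a branch-free selection: every tuple is stored unconditionally and the output cursor advances only on a match, so there is no data-dependent branch in the loop body and compilers can auto-vectorize it with SIMD instructions. This is a hedged sketch (hypothetical helper, not the paper's implementation):

```cpp
#include <cstddef>

// Branch-free selection: returns the number of qualifying tuples
// written to out. out must be at least as large as the input.
inline size_t selectBranchFree(const int* in, size_t n, int key, int* out) {
    size_t outPos = 0;
    for (size_t i = 0; i < n; ++i) {
        out[outPos] = in[i];        // unconditional store
        outPos += (in[i] == key);   // predicated advance, no branch
    }
    return outPos;
}
```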

  28. Evaluation The techniques above were implemented both in the HyPer main-memory DBMS and in a disk-based DBMS. All experiments were run on a dual Intel X5570 quad-core CPU with 64 GB main memory, running Red Hat Enterprise Linux 5.4. The systems compared were MonetDB, Ingres VectorWise, and DB X, a commercial system. HyPer is designed as a hybrid OLTP and OLAP system; it can handle both workloads concurrently.

  29. Systems Comparison TPC-C transactions were executed for OLTP (single-threaded); 12 warehouses were loaded, without client wait times. Hand-written C++ code fragments were used to compare performance against LLVM. Transactions per second are slightly higher for LLVM, and compiling the C++ version is slower by more than a factor of 10.

  30. Systems Comparison (OLAP) Five queries were run and warm execution times were measured. DB X is much slower because it is a general-purpose disk-based system. HyPer with LLVM is 2-4 times faster than the C++ version; for Q5 (joins) the difference is negligible, but for Q1 (aggregation) the differences are large. Compile time for C++ is not acceptable but is reasonable for LLVM (1556 vs. 16).

  31. Code Quality Branch and cache effects were studied using the callgrind tool of valgrind. The table shows the number of branches, branch mispredictions, level 1 instruction cache misses (I1), level 1 data cache misses (D1), level 2 data cache misses (L2d), and the number of executed instructions.

  32. Observations MonetDB is expected to have a low number of branch mispredictions because it performs its operations in tight loops, yet LLVM shows an even lower number of branches and mispredictions than MonetDB. Q2 is an anomaly caused by HyPer. For the D1 and L2d cache misses, the numbers are almost the same: if data is not in the level 1 cache, it is probably not in the level 2 cache either. Overall, LLVM shows good performance compared to MonetDB (except for Q2).

  33. Conclusion Data-centric query processing is efficient. LLVM performance is comparable to that of hand-written C++. Using a mainstream compilation framework derives value from future compiler and processor improvements without re-engineering the query engine.

  34. References Efficiently Compiling Efficient Query Plans for Modern Hardware, Thomas Neumann, TU Munich. CMU Database Group lectures and slides, Prof Andy Pavlo, CMU.
