
Discovering Queries Based on Example Tuples and Complex Database Challenges
Explore how to formulate SQL queries for complex databases, analyze keyword searches, propose project join queries, and determine candidate query generation in database systems based on provided examples and challenges.
Uploaded on | 0 Views
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Discovering Queries based on Example Tuples Yanyan Shen1, Kaushik Chakrabati2, Surajit Chaudhuri2, Bolin Ding2, Lev Novik2 1National University of Singapore, 2Microsoft Corporation
Complex Database Schema Source #tables #columns #text columns #foreign-key references IMDB 21 101 42 22 Axon (customer-support) 100 1263 614 63 CRM (customer- relationship) 347 5595 1074 586
Challenge: Querying Complex Databases SQL SELECTCustName, DevName, AppName Target schema FROMCustomer, Sales, Device, App WHERESales.CustId=Custom.CustId AND Sales.DevId=Device.DevId AND Sales.AppId=App.AppId Relevant tables Join path Any help to formulate a SQL query?
Can Keyword Search Help? Sales Customer Input: Mike ThinkPad Office SId CustId DevId AppId CustId CustName s1 c1 d1 a1 c1 Mike Jones *search for sales tuples s2 c2 d2 a2 c2 Mary Smith s3 c3 d3 a3 c3 Bob Evans Output:matched rows App Employee Device Mike Jones s1 ThinkPad X1 Office 2013 EmpId EmpName AppId AppName DevId DevName e1 Mike Stone d1 ThinkPad X1 a1 Office 2013 e2 Mary Lee d2 iPad Air a2 Evernote e3 Bob Nash d3 Nexus 7 a3 Dropbox Mike Stone o1 ThinkPad X1 Office 2013 Owner ESR OId EmpId DevId AppId ESRId EmpIdAppId Desc o1 e1 d1 a1 sr1 e1 a1 Office crash Where is schema information? o2 e2 d3 a3 sr2 e2 a3 Dropbox can t sync o3 e3 d2 a2 Ambiguity
Our Proposal *Who bought which product with which app installed. Input (Example table) Office Mike Mary ThinkPad iPad Bob Dropbox Output(Project join query) A B App C Customer Device CustId CustName AppIdAppName DevIdDevName Sales DevId SId CustId AppId
Roadmap Motivation & proposal Problem statement Solution Candidate query generation Candidate query verification VerifyAll SimplePrune Filter Experimental results Conclusion
Problem Statement Input: an example table T Office Mike Mary Bob ThinkPad iPad Dropbox Output: project join query such that (valid): every row ? in T is present in the query result (minimal): removing any edges or nodes from the join tree will lead to an invalid query Not minimal minimal Developer
Solution Overview Candidate Query Generation Candidate Query Verification Candidate Projection Column Retrieval Schema Graph Traversal Example Table Result Queries IR Engine maintaining inverted index on text columns (CI) Database Schema Database Instance
Roadmap Motivation & proposal Problem statement Solution Candidate query generation Candidate query verification VerifyAll SimplePrune Filter Experimental results Conclusion
Candidate Query Generation Office Mike Mary Bob ThinkPad iPad Dropbox Candidate Projection Column Retrieval For each column in the example table, find candidate projection columns in the database satisfying column constraint: contain all the keywords in the column Input column Candidate projection columns A Customer.CustName Employee.EmpName B Device.DevName C App.AppName ESR.Desc
Candidate Query Generation Office Mike Mary Bob ThinkPad iPad Dropbox Candidate Query Enumeration Follow candidate network generation algorithm[1] No join is required! CQ1 CQ2 Owner CQ3 Owner Sales A B Device B A C A B C Device Employee C Customer Device App App Employee ESR CQ4 CQ5 Owner Owner 1. Generate join tree ? 2. Generate mapping ? 3. Check minimal: - Every leaf node contains a column that is mapped by an input column C B A B App Device Employee Device App C ESR A ESR Employee [1] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. VLDB 2002.
Roadmap Motivation & proposal Problem statement Solution Candidate query generation Candidate query verification VerifyAll SimplePrune Filter Experimental results Conclusion
Algorithm 1: VerifyAll Iterate over candidate queries in outer loop and rows in ET in inner loop (or vice versa) and verify whether a candidate query ?? contains a row ? in its output. A candidate is valid iff it contains all the rows in ET. VerifyAll is wasteful as most candidate queries are invalid! Performing (CQ,r)-verification is expensive! ThinkPad Office Mike Mary Bob iPad Dropbox Non-empty result implies ??2satisfies row 1 Empty result implies ??2fails for row 2 ??2,2 -verification: SELECT * TOP 1 FROM Owner,Employee,Device,App WHERE Owner.EmpId=Employee.EmpId AND Owner.DevId=Device.DevId AND Owner.AppId=App.AppId ??2,1 -verification: SELECT * TOP 1 FROM Owner,Employee,Device,App WHERE Owner.EmpId=Employee.EmpId AND Owner.DevId=Device.DevId AND Owner.AppId=App.AppId AND CONTAINS(EmpName, Mike ) AND CONTAINS(DevName, ThinkPad ) AND CONTAINS(AppName, Office ) AND CONTAINS(EmpName, Mary ) AND CONTAINS(DevName, iPad )
Opportunity of Pruning Office Mike Mary Bob ThinkPad iPad Dropbox (CQ2,2) fails implies (CQ5, 2) fails ??2,2 -verification: SELECT * TOP 1 FROM Owner,Employee,Device,App WHERE Owner.EmpId=Employee.EmpId AND Owner.DevId=Device.DevId AND Owner.AppId=App.AppId AND CONTAINS(EmpName, Mary ) AND CONTAINS(DevName, iPad ) Failure dependency Verifying candidates with smaller join trees is more beneficial! ??5,2 -verification: SELECT * TOP 1 FROM Owner,Employee,Device,App, ESR WHERE Owner.EmpId=Employee.EmpId AND Owner.DevId=Device.DevId AND Owner.AppId=App.AppId AND ESR.AppId=App.AppId AND CONTAINS(EmpName, Mary ) AND CONTAINS(DevName, iPad )
Algorithm 2: SimplePrune Order candidate queries in increasing join tree size Keep a list of CQ-row verifications performed so far that failed Iterate over ordered candidate queries in the outer loop and rows in the inner loop. When verify candidate Q, check if its failure result can be implied by the verifications in the list. If so, prune Q immediately. Otherwise, verify Q for all the rows.
Observation Office Mike Mary Bob ThinkPad iPad Dropbox limited pruning!
Opportunity Office Mike Mary Bob ThinkPad iPad Dropbox Evaluating common sub-structure on certain row may prune multiple invalid candidates!
Filter F2 F1 Owner Owner A B A B Employee Device Employee Device ? (A)= Employee.EmpName ? (B)= Device.DevName ? (C)= * ?(A)= Employee.EmpName ?(B)= Device.DevName ?(C)= App.AppName ? (A)= Employee.EmpName ? (B)= Device.DevName ? (C)= * A B C A B C Mike Thinkpad Office Mary iPad Filter success and failure Filter evaluation query ?1succeeds, ?2 fails
Dependency Properties of Filters Filter-candidate dependency ?1 fails implies ??2 is invalid F1 Owner Inter-filter failure dependency B A Employee Device F2 Owner A B C C B A ?1 fails implies ?2 fails Mary iPad Employee Device A App B C Mary iPad Inter-filter success dependency ?2 succeeds implies ?1 succeeds
Adaptive Filter Selection J3 J1 J4 J2 Owner Owner Owner C App C C B A B A ESR App Employee Device Employee App Device A Employee (J1,1) (J1,2) (J1,3) (J2,1) (J2,2) (J2,3) (J3,1) (J3,2) (J3,3)(J4,1) (J4,2) (J4,3) CQ2 CQ3 CQ4 5 evaluations!
Adaptive Filter Selection J3 J1 J4 J2 Owner Owner Owner C App C C B A B A ESR App Employee Device Employee App Device A Employee (J1,1) (J1,2) (J1,3) (J2,1) (J2,2) (J2,3) (J3,1) (J3,2) (J3,3)(J4,1) (J4,2) (J4,3) CQ2 CQ3 CQ4 2 evaluations!
Filter Selection Problem Given the set of filters for all the candidate queries, select a set of filters with minimized cost such that all the candidate queries are verified as valid/invalid after evaluating the selected filters. Cost of ??: # of joins in the join tree of ?? Problem Complexity: NP-hard Greedy algorithm: approx. ratio:
Roadmap Motivation Problem statement Solution Candidate query generation Candidate query verification VerifyAll SimplePrune Filter Experimental results Conclusion
Experiment Settings Dataset: IMDB Example table generation Parameters: #rows, #columns, sparsity, value length for non-empty cells Implementations VerifyAll SimplePrune Filter Weave[2] Measures Number of verifications performed Execution time [2] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-drive schema mapping. SIGMOD 2012.
Results on Various Example Tables Vary #rows Filter performs 5X fewer verifications than VerifyAll and 2X fewer than SimplePrune Filter is robust to #rows, i.e. requires similar #verifications Filter runs 4X faster than VerifyAll and 3X faster than SimplePrune
Comparison with Weave Filter requires 10X fewer verifications Filter runs 4X faster than Weave
Conclusion Develop a new search interface for discovering queries Address challenges in query discovery Verify candidate queries efficiently Filter selection problem Greedy strategy
Thanks! Q&A