Data Integration with Dependent Sources: Query Answering System
An exploration of a system called IDS for integrating dependent sources in data processing. The paper addresses theoretical challenges, focusing on query answering with dependent sources. Investigates source selection, computation, coverage, configuration, cost, and more. Proposes solutions for choosing data sources for query answering, determining query order optimization, and calculating the captured tuple fraction for a subset of sources.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Data Integration with Dependent Sources Anish Das Sarma, Xin (Luna) Dong, Alon Halevy Yahoo! Research, AT&T Labs-Research, Google Inc. 1 March 7, 2025
Best guess, based on 7 websites Query Answering in Data Integration A = Ui Ai Ai s may have conflicts: D1,D2,D3: Paris D4,D5: Inria Q: France capital Mediated Schema Q1 A1 A5 Count number of sources Q5 A2A3A4 Q2 Assuming independence in counting! Sources can copy from each other Q4 D1 D5 Q3 Consider number of independent sources for each answer D2 D4 D3 2 March 7, 2025
Motivation [Solomon project, Luna Dong et. al.] Web data consists of an ecosystem of dependent sources Information extracted from AbeBooks.com Data from 877 bookstores 465 pairs of sources involved copying 314 copiers, 202 copy from a single source Some copy all tuples, some copy a fraction 3 March 7, 2025
Goal Build a system, IDS, for Integrating Dependent Sources This paper: Proposed a system for query answering with dependent sources Address theoretical challenges in building such a system Note: detecting dependencies is part of other work [Solomon] 4 March 7, 2025
Q Source Selection Computation Computation Coverage Data Sources Configuration Cost Source Ordering Query Answering Answer 5 March 7, 2025
1) Given a query Q, which data sources to use for answering this query? (cost-coverage tradeoff) Q Source Selection Computation Computation Coverage Data Sources Configuration Cost Source Ordering Query Answering Answer 6 March 7, 2025
2) In which order should we query a set S of sources to get answers as soon as possible? Q Source Selection Computation Computation Coverage Data Sources Configuration Cost Source Ordering Query Answering Answer 7 March 7, 2025
3) For a subset S of sources, what fraction of the total set of tuples is captured by tuples in S? Coverage is core part of previous problems. Q Source Selection Computation Computation Coverage Data Sources Configuration Cost Source Ordering Query Answering Answer 8 March 7, 2025
Answers to source selection and ordering problems depend on cost model Q Source Selection Computation Computation Coverage Data Sources Configuration Cost Source Ordering Query Answering Answer 9 March 7, 2025
Next Formal problem definitions Dependency model Cost model for query answering Coverage and optimization problems Algorithms and complexity result summary Coverage problem (CP) Cost minimization problem (CMP) Maximum coverage problem (MCP) Source ordering problem (SOP) 10 March 7, 2025
Next Formal problem definitions Dependency model Cost model for query answering Coverage and optimization problems Algorithms and complexity results summary Coverage problem (CP) Cost minimization problem (CMP) Maximum coverage problem (MCP) Source ordering problem (SOP) 11 March 7, 2025
Dependency Model: Example #tuples provided independently (1) Fraction-copying: S6 copies a random 0.8 fraction of tuples from S2 Edges depict copying 12 March 7, 2025
Dependency Model: Example (2) Selection- copying: S2 copies all tuples with A<4 from S1. S1 (3) Histogram-copying combines selection- and fraction-copying S3 S2 S4 13 March 7, 2025
Query Answering This talk: query to find all tuples (``select * ) Given set S= {S1, ,Sn}, we want Q(S) = UQ(Si) Technical point: Assume each tuple t annotated with the source S providing it; i.e., ``tuple is (t,S) Extension of results for other queries in paper Selections, projections, joins 15 March 7, 2025
Cost Model Given set S={S1, ,Sn} of sources to query, we consider three models for cost of querying T: Linear cost model: Cost = i |Si| Data stored locally, and scanning (I/O) cost dominates Number-of-sources cost model: Cost = |T| When ``charged for every source (e.g., web services) Arbitrary source cost model: Cost = i ci Each source has an arbitrary cost ci 16 March 7, 2025
Next Formal problem definitions Dependency model Cost model for query answering Coverage and optimization problems Algorithms and complexity results summary Coverage problem (CP) Cost minimization problem (CMP) Maximum coverage problem (MCP) Source ordering problem (SOP) 17 March 7, 2025
Coverage Problem Coverage Problem:What fraction of total tuples are covered by a subset? Example: What is the coverage of {S4,S5}? Total #tuples = 300 S5 gets all tuples from S1, S2 S4 provides 50 new tuples Coverage = 250/300 18 March 7, 2025
Cost Minimization Problem Cost Min. Problem: Which sources to query to: (1) get all tuples, (2) minimize total cost? Linear Cost Model: {S3,S4,S5} (cost = 100 + 100 + 200 = 400) Num-sources Cost Model: {S5,S6} (cost = 2) 19 March 7, 2025
Maximum Coverage Problem Max. Coverage Problem: Given max. cost bound, what sources to query to: (1) get max. tuples, (2) be within cost bound? Cost Bound = 1 (num-sources model): Query S6: Get 255 tuples: -80 from S2 -50 each from S3 and S4 -75 from S1 (through S3 and S4: because of independence of copying) 20 March 7, 2025
Source Ordering Problem Source Ordering Problem: What order to query sources to: obtain tuples ``as-fast-as-possible . (Intuitively, max-area-under-curve plotting cost versus tuples retrieved) Example: Query S6 -> S5 -> other sources - S6 gives 255 tuples - S5 gives remaining 45 tuples Query S1 -> S2 -> S3 -> S4 -> S5 -> S6 - S1 gives 100 tuples - S2 gives 100 tuples - S3 gives 50 tuples - S4 gives 50 tuples S5 Num. tuples S4 S6 S3 S2 S1 Source ordering 21
Next Formal problem definitions Dependency model Cost model for query answering Coverage and optimization problems Algorithms and complexity results summary Coverage problem (CP) Cost minimization problem (CMP) Maximum coverage problem (MCP) Source ordering problem (SOP) 25 March 7, 2025
Summary of results Paper presents detailed theoretical investigation: All four problems: coverage, cost-minimization, max- coverage, source-ordering All cost models: linear, num-sources, arbitrary Hardness results, polynomial-time algorithms, tractable sub-classes, approximations Next: 1-2 slides on each of the problems, highlighting main results Focus on fraction-copying model 26 March 7, 2025
Coverage Problem #P-complete in general Reduction from the problem of evaluating the number of satisfying assignments in a monotone 2-DNF formula 27 March 7, 2025
Coverage Problem PTIME when each copy-fraction is 0 or 1 Algorithm-A: Compute exact coverage of the subset of sources Compute total number of tuples among all sources 28 March 7, 2025
Coverage Problem PTIME with select-copying Lot of sources copy by applying selection queries Knowledge of selection predicate makes problem tractable 29 March 7, 2025
Coverage Problem PTIME randomized approximation algorithm 30 March 7, 2025
Coverage: Randomized Algorithm Polynomial-time randomized (MC) algorithm Algorithm runs in: O(NE log(1/d) / e2) Coverage is within error e with probability (1-d) Randomized algorithm: Randomly include/omit each edge with probability based on copy-fraction of edge Run Algorithm-A on the resulting graph Final coverage: average of multiple iterations 31 March 7, 2025
Coverage: Randomized Algorithm Coverage of S6: -Retain edges based on fraction -Compute coverage -Correct answer = 255/300 tuples Iteration-1: #covered tuples = 300 Iteration-2: #covered tuples = 200 Compute average coverage 32 March 7, 2025
Next Formal problem definitions Dependency model Cost model for query answering Coverage and optimization problems Algorithms and complexity results summary Coverage problem (CP) Cost minimization problem (CMP) Maximum coverage problem (MCP) Source ordering problem (SOP) 33 March 7, 2025
CMP, MCP, SOP: Complete Results a Number-of-sources cost model b Linear cost-model and arbitrary source cost model c With PTIME coverage algorithm Next: Sample of results from the table above Greedy algorithm for picking sources 34 March 7, 2025
Other Results a Number-of-sources cost model b Linear cost-model and arbitrary source cost model c With PTIME coverage algorithm All problems intractable in general 35 March 7, 2025
Greedy Algorithm Greedy algorithm for cost-minimization, maximum coverage, source ordering: Pick the source S maximizing I(S)/c(S) I(S) = total number of new tuples c(S) = cost of querying S Greedy algorithm is optimal when all sources: Copy from at most one source Copy all or zero tuples (i.e., fraction=1) 36 March 7, 2025
Greedy Algorithm Results a Number-of-sources cost model b Linear cost-model and arbitrary source cost model c With PTIME coverage algorithm Optimal for Single-source copying: Copy from at most one source Copy all or zero tuples (i.e., fraction=1) 37 March 7, 2025
Greedy Algorithm Results a Number-of-sources cost model b Linear cost-model and arbitrary source cost model c With PTIME coverage algorithm Approximations in the general-case: Log-approx (in size of largest source) for cost-minimization (1-1/e)-approx for maximum-coverage 2-approx for source-ordering 38 March 7, 2025
Summary Contributions: Studied query answering with dependent sources Simple model to capture dependencies Four important problems: coverage, cost- minimization, maximum-coverage, source ordering Algorithms and complexity results for the problems Future: Build a system based on our theoretical foundations Specific open problems laid out in the paper 41 March 7, 2025
Thanks! 42 March 7, 2025