Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching
This paper presents the Falcon system for hands-off crowdsourced entity matching, aimed at building cloud services. It reviews recent crowdsourced EM work and its limitations, and describes Falcon's contributions to scalability and efficiency in large-scale entity matching.
Presentation Transcript
Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services

Paul Suganthan G. C., University of Wisconsin-Madison
Joint work with Sanjib Das, AnHai Doan, Jeffrey F. Naughton (University of Wisconsin-Madison) and Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, Youngchoon Park (@WalmartLabs)
Entity Matching

Table A:
  Name       | City      | State
  Dave Smith | Madison   | WI
  Joe Wilson | San Jose  | CA
  Dan Smith  | Middleton | WI

Table B:
  Name            | City      | State
  David D. Smith  | Madison   | WI
  Daniel W. Smith | Middleton | WI
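The tables above illustrate why exact string comparison fails for EM: "Dave Smith" and "David D. Smith" refer to the same person but are not string-equal. A minimal sketch of matching a pair via token-set Jaccard similarity (the field names and the 0.2 threshold are illustrative, not from the paper):

```python
# Minimal sketch: entity matching as fuzzy comparison of a tuple pair.
# Threshold and matching logic are illustrative only.

def jaccard(s1, s2):
    """Jaccard similarity over lowercase word tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

def match(a, b, threshold=0.2):
    # Require city/state agreement plus name similarity above the threshold.
    return (a["city"] == b["city"] and a["state"] == b["state"]
            and jaccard(a["name"], b["name"]) >= threshold)

a = {"name": "Dave Smith", "city": "Madison", "state": "WI"}
b = {"name": "David D. Smith", "city": "Madison", "state": "WI"}
print(jaccard(a["name"], b["name"]))  # 0.25: only "smith" is shared
print(match(a, b))                    # True
```

Note that an exact-equality check on the names would wrongly reject this pair.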
Recent Crowdsourced EM Work

Example: verifying predicted matches. Blocking on tables A and B produces candidate pairs such as (a,d), (b,e), (c,d), (c,e); a matcher predicts which pairs match; the crowd then verifies each predicted match with a Y/N label, leaving the verified matches, e.g. (a,d) and (c,e).

Limitations:
- crowdsources only parts of the workflow
- needs a developer to execute the remaining parts
- does not work when doing many EM tasks: there are not enough developers
Our Recent Solution: Corleone [SIGMOD-14]

Introduces the idea of hands-off crowdsourcing: it crowdsources the entire EM workflow, requiring no developers. Given tables A and B, a Blocker produces candidate tuple pairs, a Matcher produces predicted matches, a Difficult Pairs Locator finds hard pairs, and an Accuracy Estimator returns the predicted matches together with accuracy estimates (precision P and recall R) to the user.
Limitations of Corleone

Does not scale to large tables (e.g., 50K-1M tuples): it executes a mostly single-machine, in-memory EM workflow and takes weeks to match. An EM service must serve many kinds of users, e.g., our developers, domain scientists at UW-Madison (self-service EM), and crowd workers.
Contributions of This Paper

- Scales up Corleone to tables of millions of tuples: matches tables of 1-2.5M tuples for only $54-66 in 2-14 hours.
- Recently deployed as a cloud service by Yash Govind, Erik Paulson, and Mukilan Ashok.
- Used extensively at multiple organizations, e.g., Johnson Controls, Marshfield Clinic, a non-profit organization, WalmartLabs, and different UW-Madison departments.
Contributions of This Paper

Our technical solution:
- defines basic operators and uses them to model the EM workflow of Corleone as a DAG
- scales up the operators using MapReduce, e.g., executing complex blocking rules over large tables
- optimizes within and across operators, e.g., uses the crowd time of one operator to mask the machine time of another

The results can potentially apply to other contexts, e.g.:
- scaling up rules that involve string similarity measures
- scaling up any DAG that involves complex rules, crowdsourcing, and machine learning
Consider a Simplified Corleone Workflow

Tables A and B flow through a Blocker (producing candidate tuple pairs), a Matcher (producing predicted matches), a Difficult Pairs Locator, and an Accuracy Estimator, which returns the predicted matches and accuracy estimates (P, R) to the user.
Blocking

Many solutions exist; hash-based blocking is very popular:
- e.g., only consider tuple pairs where A.state = B.state
- easy to understand and implement, highly scalable

In practice, however, it often does not work well due to dirty data, variations in data values, and missing values, resulting in low recall (i.e., dropping many true matches).

Corleone instead uses rule-based blocking. Sample rules for blocking tables of books:
  jaccard(a.title, b.title) < 0.7 -> drop(a, b)
  exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 -> drop(a, b)

Rule-based blocking is more powerful, gives higher recall, and subsumes hash-based blocking, but is far more difficult to learn and scale.
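The two sample book-blocking rules above can be sketched as executable predicates. A pair survives blocking only if no rule fires; the field names and tuple representation are assumptions for illustration:

```python
# Sketch of the two sample blocking rules as Python predicates.
# Field names (title, year, price) follow the slide; dicts stand in for tuples.

def jaccard(s1, s2):
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def rule1(a, b):
    # jaccard(a.title, b.title) < 0.7 -> drop(a, b)
    return jaccard(a["title"], b["title"]) < 0.7

def rule2(a, b):
    # exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 -> drop(a, b)
    return a["year"] != b["year"] and abs(a["price"] - b["price"]) > 10

def survives_blocking(a, b, rules=(rule1, rule2)):
    # A pair reaches the matcher only if no rule drops it.
    return not any(rule(a, b) for rule in rules)

a = {"title": "The Big Short", "year": 2015, "price": 12.99}
b = {"title": "The Big Short", "year": 2015, "price": 9.99}
print(survives_blocking(a, b))  # True: identical titles, matching years
```

The blocker applies such rules to prune A x B down to a candidate set before the (expensive) matcher runs.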
Blocking in Corleone

- Takes a sample S from A x B (without materializing A x B).
- Trains a random forest F on S (to match tuple pairs) using active learning, where the crowd labels pairs.
- Example random forest F for matching books: one tree tests isbn_match (N -> No; Y -> test #pages_match), another tests title_match (N -> No; Y -> test publisher_match).
- Extracts candidate blocking rules from F:
  (isbn_match = N) -> No
  (isbn_match = Y) AND (#pages_match = N) -> No
  (title_match = N) -> No
  (title_match = Y) AND (publisher_match = N) -> No
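Each candidate blocking rule is the conjunction of tests along a root-to-leaf path that ends in a "No" (non-match) leaf. A minimal sketch of that extraction over a toy tree (the nested-tuple tree encoding is a stand-in, not Corleone's actual forest representation):

```python
# Sketch: extracting candidate blocking rules from decision-tree paths.
# A tree is (feature, yes_subtree, no_subtree) or a leaf label string.

tree = ("isbn_match",
        ("#pages_match", "Yes", "No"),   # isbn_match = Y branch
        "No")                            # isbn_match = N branch

def extract_rules(node, path=()):
    """Return every conjunctive path that ends in a 'No' (drop) leaf."""
    if isinstance(node, str):
        return [path] if node == "No" else []
    feature, yes_branch, no_branch = node
    return (extract_rules(yes_branch, path + ((feature, "Y"),)) +
            extract_rules(no_branch, path + ((feature, "N"),)))

for rule in extract_rules(tree):
    print(" AND ".join(f"({f} = {v})" for f, v in rule), "-> No")
```

For this toy tree the two printed rules correspond to the first two rules on the slide: (isbn_match = Y) AND (#pages_match = N) -> No, and (isbn_match = N) -> No.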
Blocking in Corleone

Uses the crowd to evaluate the precision of the extracted rules:
  (isbn_match = N) -> No
  (isbn_match = Y) AND (#pages_match = N) -> No
  (title_match = N) -> No
  (title_match = Y) AND (publisher_match = N) -> No

Then selects a subset of the rules and executes them on A and B:
  (isbn_match = N) -> No
  (title_match = N) -> No
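One way to turn crowd labels into a rule-selection step is to estimate each rule's precision from labeled samples and keep only rules above a cutoff. The labels and the 0.95 cutoff below are illustrative assumptions, not Falcon's actual selection policy:

```python
# Sketch: select rules whose crowd-estimated precision clears a threshold.
# crowd_labels[i] is True if the crowd agreed the rule correctly dropped
# the i-th sampled pair (i.e., the pair really is a non-match).

def rule_precision(crowd_labels):
    return sum(crowd_labels) / len(crowd_labels)

candidate_rules = {
    "isbn_match = N":   [True] * 19 + [False],        # 95% precise
    "title_match = N":  [True] * 20,                  # 100% precise
    "#pages_match = N": [True] * 10 + [False] * 10,   # 50% precise
}

selected = [rule for rule, labels in candidate_rules.items()
            if rule_precision(labels) >= 0.95]
print(selected)  # ['isbn_match = N', 'title_match = N']
```

This mirrors the slide's outcome: imprecise rules are discarded, and only high-precision rules are executed on the full tables.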
Modeling the Blocking Step as a DAG of Basic Operators

Given tables A and B: take a sample S, generate features, actively learn a matcher M, get blocking rules R from M, evaluate the rules (producing evaluated rules E), select an optimal rule sequence F, then execute the blocking rules on A and B to produce the candidate set C.
Modeling Both Blocking and Matching as a DAG of Operators

The blocking DAG is extended with matching operators: generate features for the candidate set C, actively learn a matcher N, and apply the matcher to produce the matches.

- Eight basic operators
- Involve complex rules, crowdsourcing, and ML
- Can be used to compose a variety of EM workflows
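The workflow above can be sketched as an explicit dependency graph over named operators. The operator names follow the slides; the toy scheduler below (Python's standard-library topological sorter) is an illustration, not Falcon's execution engine:

```python
# Sketch: an EM workflow as a DAG of operators with explicit dependencies.
from graphlib import TopologicalSorter

# node -> list of operators that must run before it
dag = {
    "sample":           [],
    "feature_gen":      ["sample"],
    "active_learn":     ["feature_gen"],
    "get_rules":        ["active_learn"],
    "evaluate_rules":   ["get_rules"],
    "select_sequence":  ["evaluate_rules"],
    "execute_blocking": ["select_sequence"],
    "apply_matcher":    ["execute_blocking", "active_learn"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)
```

Making the DAG explicit is what lets the system scale operators individually and overlap their machine and crowd phases, as the optimization slides describe.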
Scaling Up the Operator Executing Blocking Rules

This operator takes tables A and B and a set of rules, and produces the candidate set C. Sample rules for blocking tables of books:
  jaccard(a.title, b.title) < 0.7 -> drop(a, b)
  exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 -> drop(a, b)

Existing solutions do not work well: they either treat rules as black boxes and hence enumerate the entire A x B, or consider only simple single-predicate rules.
Our Solution

- Observe that rules often contain well-known similarity functions, e.g., edit distance, Jaccard, cosine.
- Exploit properties of these functions: build index-based filters, then use them to avoid enumerating A x B.

Example of filtering using size, for the rule
  jaccard(a.title, b.title) < 0.7 -> drop(a, b)
(equivalently, jaccard(a.title, b.title) >= 0.7 -> keep(a, b)):

For the string "The Big Short" (3 tokens), any kept title must have between 3 * 0.7 and 3 / 0.7 tokens. So "The Big Lebowski" (3 tokens) is kept as a candidate, while "Spirited Away" (2 tokens) is not.
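The size filter follows from two bounds: if jaccard(x, y) >= t, then |y| >= t * |x| and |y| <= |x| / t (since the intersection can be no larger than either set and the union no smaller). A small sketch of the bound computation for the slide's example:

```python
# Size filter for jaccard(x, y) >= t: a string y with jaccard(x, y) >= t
# must have between ceil(n * t) and floor(n / t) tokens, where n = |x|.
import math

def size_bounds(n_tokens, t):
    return math.ceil(n_tokens * t), math.floor(n_tokens / t)

lo, hi = size_bounds(3, 0.7)  # "The Big Short" has 3 tokens, t = 0.7
print((lo, hi))                                               # (3, 4)
print(len("The Big Lebowski".split()) in range(lo, hi + 1))   # True  -> keep
print(len("Spirited Away".split()) in range(lo, hi + 1))      # False -> don't keep
```

The filter only prunes: pairs that survive it still undergo the actual Jaccard computation, so no true candidate is lost.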
Our Solution

For the rules
  jaccard(a.title, b.title) < 0.7 -> drop(a, b)
  exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 -> drop(a, b)

build three indexes I1, I2, I3 over A, one per predicate. For each tuple b in B, compute the candidate set C = C1 ∩ (C2 ∪ C3), where Ci is the set of tuples of A returned by index Ii, then apply the rules to each pair (a, b) with a ∈ C.

We developed four MapReduce implementations:
- all the indexes may not fit in the memory of a mapper
- the implementations balance between the memory available for indexes at the mappers and the amount of work done at the reducers
- a rule-based optimizer chooses among the implementations
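Combining the per-predicate lookups follows the rules' semantics: a pair survives rule 1 if the titles may be similar (C1), and survives rule 2 if the years match (C2) or the prices are close (C3). A toy sketch, where the lookup functions stand in for indexes I1-I3 and sets of tuple ids stand in for table A:

```python
# Sketch: combining index lookups into one candidate set for a tuple b.
# I1, I2, I3 are hypothetical stand-ins for Falcon's indexes over table A.

def candidates_for(b, I1, I2, I3):
    C1 = I1(b)  # ids of a's that may satisfy jaccard(a.title, b.title) >= 0.7
    C2 = I2(b)  # ids of a's with a.year == b.year
    C3 = I3(b)  # ids of a's with |a.price - b.price| <= 10
    # Survive rule 1 AND rule 2:
    return C1 & (C2 | C3)

# Toy indexes returning fixed id sets.
I1 = lambda b: {1, 2, 3}
I2 = lambda b: {2}
I3 = lambda b: {3, 4}
print(candidates_for(None, I1, I2, I3))  # {2, 3}
```

Only the pairs in this (typically small) candidate set are then checked against the actual rules, avoiding the enumeration of A x B.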
Optimizing the EM DAG

Key idea: use crowd time to mask machine time. While the crowd labels pairs during active learning, the system builds indexes, speculatively executes blocking rules, masks pair selection, and speculatively executes the matcher.
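The masking idea can be illustrated with a background thread: machine work runs concurrently with the wait for crowd answers, so it adds little or nothing to the total elapsed time. The function names and timings below are illustrative stand-ins:

```python
# Sketch: overlap machine work with crowd wait time using a background thread.
import threading
import time

def build_indexes():
    time.sleep(0.2)      # stand-in for index construction (machine time)

def wait_for_crowd_labels():
    time.sleep(0.5)      # stand-in for crowd latency

start = time.monotonic()
worker = threading.Thread(target=build_indexes)
worker.start()           # machine work proceeds in the background ...
wait_for_crowd_labels()  # ... while we wait on the crowd
worker.join()
elapsed = time.monotonic() - start
print(f"elapsed ~{elapsed:.2f}s")  # ~0.5s: the 0.2s of machine time was masked
```

Run sequentially, the two steps would take about 0.7s; overlapped, the total is dominated by the crowd wait alone, which is exactly the effect Falcon exploits at workflow scale.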
Empirical Evaluation

Data sets:
  Data Set  | Table A   | Table B   | # Matches | Description
  Products  | 2,554     | 22,074    | 1,154     | Electronic products at Amazon and Walmart
  Songs     | 1,000,000 | 1,000,000 | 1,292,023 | Songs within a single table
  Citations | 1,823,978 | 2,512,927 | 558,787   | Citations in Citeseer and DBLP

Ran Hadoop on a 10-node cluster; each node has an 8-core Intel Xeon 2.1GHz processor and 8GB RAM.

Mechanical Turk settings:
- turker qualifications: 100 approved HITs with a 95% approval rate
- payment: 2 cents per question
Overall Performance

  Data Set  | P (%) | R (%) | F1 (%) | Cost (# Questions) | Total Time | Machine Time | Crowd Time | Candidate Set Size
  Products  | 90.9  | 74.5  | 81.9   | $57.6 (960)        | 13h 25m    | 52m          | 13h 7m     | 536K - 11.4M
  Songs     | 96.0  | 99.3  | 97.6   | $54 (900)          | 11h 58m    | 2h 7m        | 11h 25m    | 1.6M - 51.4M
  Citations | 92.0  | 98.5  | 95.2   | $65.5 (1087)       | 14h 37m    | 2h 32m       | 13h 33m    | 654K - 1.06M

Blocking took 2m to 1h 13m. Optimization reduces total machine time by 11-70%.
Falcon in the Wild (CloudMatcher.io)

Matching drug descriptions for Marshfield Clinic:
- tables of size 453K vs 451K; the data is sensitive
- labeled by a scientist: 1h 37m to label 830 pairs
- 2h 10m of machine time; optimization reduces it by 49% to 1h 6m

Matching organizations for a team of economists at UW, supervised by Yash Govind and Mukilan Ashok:
- tables of size 2,616 vs 21,530
- with the crowd: cost $61, precision >= 94%, recall >= 98%, machine time 12m, crowd time 23h 12m
- when labeled by an economist: label time 50m, machine time 11m
- can match in an hour, vs. weeks when using a developer (at $15/hour)
Related Work

- Parallel execution of DAGs of operators: treats each operator as a black box, e.g., [F. Hueske et al., ICDE '13; I. Gog et al., EuroSys '15]
- RDBMS-style solutions for data cleaning: define operators for EM at coarse granularities, e.g., [Wisteria, VLDB '15; BigDansing, SIGMOD '15]
- Crowdsourcing:
  - crowdsourced query processing [CrowdDB, SIGMOD '11]
  - verifying predicted matches [Wang et al., VLDB '12]
  - minimizing the number of questions [Whang et al., VLDB '13]
  - finding the best UI to pose questions [Marcus et al., VLDB '11]
  - minimizing crowd latency [D. Haas et al., VLDB '16]
Conclusions

- Corleone is ideally suited for building EM cloud services, but does not scale to large tables.
- Proposed Falcon to scale up Corleone:
  - matches tables of 1-2.5M tuples for $54-66 in 2-14 hours
  - scales up DAGs with complex rules, crowdsourcing, and ML
- Future directions: explore new optimizations, consider more complex workflows, and provide CloudMatcher.io as a public service.