Enhancing Record Linkage Quality for Integrated Data with Value Diversity

linking records with value diversity n.w
1 / 47
Embed
Share

Pei Li from the University of Milan Bicocca explores the challenges of linking records with value diversity, aiming to improve data integration quality. The study delves into the complexities of entity resolution, examining the evolution of values over time and issues arising from erroneous data sources. The goal is to enhance linkage quality for temporal records with high diversity, focusing on records within the same group.

  • Record Linkage
  • Data Integration
  • Value Diversity
  • Entity Resolution
  • Data Quality

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Linking Records with Value Diversity Pei Li University of Milan Bicocca Advisor : Andrea Maurino Supervisors@ AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava October, 2012

  2. Some Statistics from DBLP -How many Wei Wang s are there? -What are their authoring histories? 2

  3. Some Statistics from YellowPages -Are there any business chains? -If yes, which businesses are their members? 3

  4. Record Linkage What is record linkage (entity resolution)? Input: a set of records Output: clustering of records A critical problem in data integration and data cleaning A reputation for world-class quality is profitable, a business maker . William E. Winkler Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]) : assume that records of the same entities are consistent often focus on different representations of the same value e.g., IBM and International Business Machines 4

  5. New Challenges In reality, we observe value diversity of entities Values can evolve over time Catholic Healthcare (1986 - 2012) Dignity Health (2012 -) Different records of the same group can have local values ID 001 Name F.B. Insurance Address Vernon 76384 TX Phone 877 635-4684 URL txfb-ins.com 002 F.B. Insurance #1 Lufkin 75901 TX 936 634-7285 txfb.org 003 F.B. Insurance #5 Cibolo 78108 TX 877 635-4684 Some sources may provide erroneous values ID 001 Name Meekhof Tire Sales & Service Inc URL www.meekhoftire.com Source Src. 1 002 Meekhof Tire Sales & Service Inc www.napaautocare.com Src. 2 5 5

  6. My Goal To improve the linkage quality of integrated data with fairly high diversity linking temporal records [VLDB 11] [VLDB 12 demo][FCS Journal 12] linking records of the same group [Under preparation for SIGMOD 13] 6

  7. Outline Motivation Linking temporal records Decay Temporal clustering Demo Linking records of the same group Related work Conclusions & Future work 7

  8. r1: Xin Dong R. Polytechnic Institute r4: Xin Luna Dong University of Washington r2: Xin Dong University of Washington r5: Xin Luna Dong AT&T Labs-Research r6: Xin Luna Dong AT&T Labs-Research r3: Xin Dong University of Washington -How many authors? -What are their authoring histories? 1991 1991 1991 1991 1991 2004 2005 2006 2007 2008 2009 2010 2011 r11: Dong Xin Microsoft Research r8:Dong Xin University of Illinois r12: Dong Xin Microsoft Research r9: Dong Xin Microsoft Research r10: Dong Xin University of Illinois r7: Dong Xin University of Illinois 8

  9. r1: Xin Dong R. Polytechnic Institute r4: Xin Luna Dong University of Washington r2: Xin Dong University of Washington r5: Xin Luna Dong AT&T Labs-Research r6: Xin Luna Dong AT&T Labs-Research r3: Xin Dong University of Washington -Ground truth 1991 1991 1991 1991 1991 2004 2005 2006 2007 2008 2009 2010 2011 r11: Dong Xin Microsoft Research r8:Dong Xin University of Illinois 3 authors r12: Dong Xin Microsoft Research r9: Dong Xin Microsoft Research r10: Dong Xin University of Illinois r7: Dong Xin University of Illinois 9

  10. r1: Xin Dong R. Polytechnic Institute r4: Xin Luna Dong University of Washington r2: Xin Dong University of Washington r5: Xin Luna Dong AT&T Labs-Research r6: Xin Luna Dong AT&T Labs-Research r3: Xin Dong University of Washington -Solution 1: -requiring high value consistency 1991 1991 1991 1991 1991 2004 2005 2006 2007 2008 2009 2010 2011 r11: Dong Xin Microsoft Research 5 authors false negative r8:Dong Xin University of Illinois r12: Dong Xin Microsoft Research r9: Dong Xin Microsoft Research r10: Dong Xin University of Illinois r7: Dong Xin University of Illinois 10

  11. r1: Xin Dong R. Polytechnic Institute r4: Xin Luna Dong University of Washington r2: Xin Dong University of Washington r5: Xin Luna Dong AT&T Labs-Research r6: Xin Luna Dong AT&T Labs-Research r3: Xin Dong University of Washington -Solution 2: -matching records w. similar names 1991 1991 1991 1991 1991 2004 2005 2006 2007 2008 2009 2010 2011 r11: Dong Xin Microsoft Research 2 authors false positive r8:Dong Xin University of Illinois r12: Dong Xin Microsoft Research r9: Dong Xin Microsoft Research r10: Dong Xin University of Illinois r7: Dong Xin University of Illinois 11

  12. Opportunities Continuity of history Smooth transition ID r1 r2 r7 r3 r4 r8 r9 r10 r11 r5 r6 r12 Name Xin Dong Xin Dong Dong Xin Xin Dong Xin Luna Dong Dong Xin Dong Xin Dong Xin Dong Xin Xin Luna Dong Xin Luna Dong Dong Xin Affiliation R. Polytechnic Institute University of Washington University of Illinois University of Washington University of Washington University of Illinois Microsoft Research University of Illinois Microsoft Research AT&T Labs-Research AT&T Labs-Research Microsoft Research Co-authors Wozny Halevy, Tatarinov Han, Wah Halevy Halevy, Yu Wah Wu, Han Ling, He Chaudhuri, Ganti Das Sarma, Halevy Naumann He Year 1991 2004 2004 2005 2007 2007 2008 2009 2009 2009 2010 2011 Seldom erratic changes 12

  13. Intuitions ID r1 r2 r7 r3 r4 r8 r9 r10 r11 r5 r6 r12 Name Xin Dong Xin Dong Dong Xin Xin Dong Xin Luna Dong Dong Xin Dong Xin Dong Xin Dong Xin Xin Luna Dong Xin Luna Dong Dong Xin Affiliation R. Polytechnic Institute University of Washington University of Illinois University of Washington University of Washington University of Illinois Microsoft Research University of Illinois Microsoft Research AT&T Labs-Research AT&T Labs-Research Microsoft Research Co-authors Wozny Halevy, Tatarinov Han, Wah Halevy Halevy, Yu Wah Wu, Han Ling, He Chaudhuri, Ganti Das Sarma, Halevy Naumann He Year 1991 2004 2004 2005 2007 2007 2008 2009 2009 2009 2010 2011 Less reward on the same value over time Less penalty on different values over time Consider records in time order for clustering 13

  14. Outline Motivation Linking temporal records Decay Temporal clustering Demo Linking records of the same group Related work Conclusions & Future work 14

  15. Disagreement Decay Intuition: different values over a long time is not a strong indicator of referring to different entities. University of Washington (01-07) AT&T Labs-Research (07-date) Definition (Disagreement decay) Disagreement decay of attribute A over time t is the probability that an entity changes its A-value within time t. 15

  16. Agreement Decay Intuition: the same value over a long time is not a strong indicator of referring to the same entities. Adam Smith: (1723-1790) Adam Smith: (1965-) Definition (Agreement decay) Agreement decay of attribute A over time t is the probability that different entities share the same A-value within time t. 16

  17. Decay Curves Decay curves of address learnt from European Patent data 1 0.9 0.8 0.7 Decay 0.6 0.5 0.4 Patent records: 1871 0.3 Real-world inventors: 359 0.2 0.1 In years: 1978 - 2003 0 0 5 10 Year 15 20 25 Disagreement decay Agreement decay 17

  18. Applying Decay E.g. r1 <Xin Dong, Uni. of Washington, 2004> r2 <Xin Dong, AT&T Labs-Research, 2009> No decayed similarity: w(name)=w(affi.)=.5 sim(r1, r2)=.5*1+.5*0=.5 Decayed similarity w(name, t=5)=1-dagree(name , t=5)=.95, w(affi., t=5)=1-ddisagree(affi. , t=5)=.1 sim(r1, r2)=(.95*1+.1*0)/(.95+.1)=.9 Un-match Match 18

  19. Applying Decay ID r1 r2 r7 r3 r4 r8 r9 r10 r11 r5 r6 r12 Name Xin Dong Xin Dong Dong Xin Xin Dong Xin Luna Dong Dong Xin Dong Xin Dong Xin Dong Xin Xin Luna Dong Xin Luna Dong Dong Xin Affiliation R. Polytechnic Institute University of Washington University of Illinois University of Washington University of Washington University of Illinois Microsoft Research University of Illinois Microsoft Research AT&T Labs-Research AT&T Labs-Research Microsoft Research Co-authors Wozny Halevy, Tatarinov Han, Wah Halevy Halevy, Yu Wah Wu, Han Ling, He Chaudhuri, Ganti Das Sarma, Halevy Naumann He Year 1991 2004 2004 2005 2007 2007 2008 2009 2009 2009 2010 2011 All records are merged into the same cluster!! Able to detect changes! 19

  20. Decayed Similarity & Traditional Clustering PARTITION CENTER MERGE DECAY 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 F-1 Precision Recall Patent records: 1871 Decay improves recall over baselines by 23-67% Real-world inventors: 359 In years: 1978 - 2003 20

  21. Outline Motivation Linking temporal records Decay Temporal clustering Demo Linking records of the same group Related work Conclusions & Future work 21

  22. Early Binding Compare a new record with existing clusters Make eager merging decision for each record Maintain the earliest/latest timestamp for its last value 22

  23. Early Binding ID Name Affiliation Co-authors From To C1 r1 Xin Dong R. P. Institute Wozny 1991 1991 2004 r2 r3 Xin Dong Xin Dong Univ. of Washington Univ. of Washington Halevy, Tatarinov Halevy 2004 2004 2005 r4 Xin Luna Dong Univ. of Washington Halevy, Yu 2004 2007 r10 Dong Xin University of Illinois Ling, He 2009 2009 C2 ID r7 r8 Name Dong Xin University of Illinois Dong Xin University of Illinois Avoid a lot of false positives! Affiliation Co-authors Han, Wah Wah From 2004 2004 To 2004 2007 earlier mistakes prevent later merging!! r9 Dong Xin Microsoft Research Wu, Han 2008 2008 r11 Dong Xin Microsoft Research Chaudhuri, Ganti 2008 2009 r12 Dong Xin Microsoft Research He 2008 2011 C3 ID Name Affiliation Co-authors From To r5 Xin Luna Dong AT&T Labs-Research Das Sarma, Halevy 2009 2009 r6 Xin Luna Dong AT&T Labs-Research Naumann 2009 2010 23

  24. Adjusted Binding Compare earlier records with clusters created later Proceed in EM-style 1. Initialization: Start with the result of initialized clustering 2. Estimation: Compute record-cluster similarity 3. Maximization: Choose the optimal clustering 4. Termination: Repeat until the results converge or oscillate 24

  25. Adjusted Binding Compute similarity by Consistency: consistency in evolution of values Continuity: continuity of records in time sim(r, C)=cont(r, C)*cons(r, C) Case 1: r.t C.early C.late Case 2: C.late r.t C.early Case 3: r.t C.early C.late Case 4: C.late r.t C.early record time stamp cluster time stamp 25

  26. Adjusted Binding r7 C3 Once r8 is merged to C4, r7 has higher continuity with C4 DongXin@UI -2004 r8 r8 has higher continuity with C4 DongXin@UI -2007 C4 r9 DongXin@MSR -2008 r10 DongXin@UI -2009 r11 DongXin@MSR -2009 r12 DongXin@MSR -2011 C5 r10 has higher continuity with C4 26

  27. Adjusted Binding ID r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 Name Xin Dong Xin Dong Xin Dong Xin Luna Dong Xin Luna Dong Xin Luna Dong Dong Xin Dong Xin Dong Xin Dong Xin Dong Xin Dong Xin Affiliation R. Polytechnic Institute University of Washington University of Washington University of Washington AT&T Labs-Research AT&T Labs-Research University of Illinois University of Illinois Microsoft Research University of Illinois Microsoft Research Microsoft Research Co-authors Wozny Halevy, Tatarinov Halevy Halevy, Yu Das Sarma, Halevy Naumann Han, Wah Wah Wu, Han Ling, He Chaudhuri, Ganti He Year 1991 2004 2005 2007 2009 2010 2004 2007 2008 2009 2009 2011 C1 C2 Correctly cluster all records C3 27

  28. Temporal Clustering Full algorithm has the best result PARTITION CENTER MERGE DECAY ADJUST FULL ALGO. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 F-1 Precision Recall Adjusted Clustering improves recall without reducing precision much Patent records: 1871 Real-world inventors: 359 In years: 1978 - 2003 28

  29. Experimental Results Data sets: #Records 1871 72 738 #Entities 359 8 18+potpourri Years 1978-2003 1991-2010 1992-2011 Patent DBLP-XD DBLP-WW PARTITION CENTER MERGE FULL ALGO. PARTITION CENTER MERGE FULL ALGO. 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 F-1 (b) Results of WW data Precision Recall F-1 Precision Recall (a) Results of XD data 29

  30. Demonstration CHRONOS: Facilitating History Discovery by Linking Temporal Records ITIS Lab http://www.itis.disco.unimib.it 30

  31. Outline Motivation Linking temporal records Decay Temporal clustering Demo Linking records of the same group Related work Conclusions & Future work 31

  32. -Are there any business chains? -If yes, which businesses are their members? 32

  33. 2 chains -Ground Truth 33

  34. 0 chain -Solution 1: -Require high value consistency 34

  35. 1 chain -Solution 2: -Match records w. same name 35

  36. Challenges Erroneous values ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa Scalability 6.8M Records phone state URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com AL AL AL 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com Different local values 36

  37. Two-Stage Linkage Stage I Stage I: Identify cores containing listings very likely to belong to the same chain Require strong robustness in presence of possibly erroneous values Graph theory High Scalability ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa phone state AL AL AL URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com 37

  38. Two-Stage Linkage Stage II Stage II: Cluster cores and remaining records into chains. Collect strong evidence from cores and leverage in clustering No penalty on local values Reward strong evidence ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa phone state AL AL AL URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com 38

  39. Two-Stage Linkage Stage II Stage II: Cluster cores and remaining records into chains. Collect strong evidence from cores and leverage in clustering No penalty on local values Reward strong evidence ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa phone state AL AL AL URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com 39

  40. Two-Stage Linkage Stage II Stage II: Cluster cores and remaining records into chains. Collect strong evidence from cores and leverage in clustering No penalty on local values Apply weak evidence ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa phone state AL AL AL URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com 40

  41. Two-Stage Linkage Stage II Stage II: Cluster cores and remaining records into chains. Collect strong evidence from cores and leverage in clustering No penalty on local values No penalty on local values ID r1 r2 r3 name Taco Casa Taco Casa Taco Casa phone state AL AL AL URL domain tacocasa.com tacocasa.com tacocasa.com, tacocasatexas.com 900 900 r4 r5 r6 r7 r8 r9 r10 Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Taco Casa Elva s Taco Casa 900 900 701 702 703 704 AL AL TX TX TX TX TX tacocasatexas.com tacocasatexas.com tacocasatexas.com tacodemar.com 41

  42. Experimental Evaluation Data set 6.8M records from YellowPages.com Effectiveness: Precision / Recall / F-measure (avg.): .96 / .96 / .96 Efficiency: 6.9 hrs for single-machine solution 40 mins for Hadoop solution 80K chains and 1M records in chains Chain name USPS - United States Post Office12,776 SUBWAY State Farm Insurance McDonald's Edward Jones # Stores 11,278 8,711 7,450 6,781 42

  43. Experimental Evaluation II Sample Random AI UB FBIns #Records 2062 2446 322 1149 #Chains 30 1 7 14 Chain size [2, 308] 2446 [2, 275] [33, 269] #Single-biz records 503 0 5 0 ITIS Lab http://www.itis.disco.unimib.it 43

  44. Related Work Record similarity: Probabilistic linkage Classification-based approaches: classify records by probabilistic model [Felligi, 69] Deterministic linkage Distance-base approaches: apply distance metric to compute similarity of each attribute, and take the weighted sum as record similarity [Dey,08] Rule-based approaches: apply domain knolwedge to match record [Hernandez,98] Record clustering Transitive rule [Hernandez,98] Optimization problem [Wijaya,09] 44

  45. Conclusions In some applications record linkage needs to be tolerant with value diversity When linking temporal records, time decay allows tolerance on evolving values When linking group members, two-stage linkage allows leveraging strong evidence and allows tolerance on different local values 45

  46. Future Work Data Integration Data Quality Temporal Database 46

  47. Thanks! 47

Related


More Related Content