Detecting Nearly Duplicated Records in Location Datasets


Location datasets often contain nearly duplicated records, causing challenges in data management and user confusion. This study by Yu Zheng, Xing Xie, Shuang Peng, and James Fu from Microsoft Research Asia proposes a machine learning-based approach to infer similarity between location entities based on multiple fields such as name, address, coordinates, and categories. By analyzing real datasets and considering features like name similarity, address similarity, and category similarity, the method aims to improve the accuracy of identifying duplicate points of interest (POIs). The research delves into the issues caused by variations in entity presentations from different resources and channels, ultimately offering a solution to enhance data quality in geographic services.

  • Location datasets
  • Duplicate records
  • Machine learning
  • Geographic data
  • Data management

Uploaded on Mar 07, 2025



Presentation Transcript


  1. Detecting Nearly Duplicated Records in Location Datasets
     • Yu Zheng, Xing Xie, Shuang Peng, James Fu
     • Microsoft Research Asia, Search Technology Center

  2. Background
     • Web maps and local search engines are frequently used.
     • The quality of these services depends on geographic data.

  3. Background
     • Points of interest (POI): collected by people holding GPS-enabled devices in the physical world; accurate GPS coordinates, less accurate addresses.
     • Yellow pages (YP): entered by people in a cyber environment, e.g., online; accurate addresses, inaccurate GPS coordinates (translated by geocoding).

     Name | Address | GPS Position | Phone Num. | Category | Type
     The Matt's Bar | 701 5th Ave, Seattle, WA | 116.325, 35.364 | 1-56987452 | Café | YP
     Silver Cloud Inn | 314 7th Ave, Redmond, WA | 116.451, 35.209 | 1-25698716 | Hotel | POI

  4. Problem
     • Nearly duplicated POIs: the same entity in the physical world, with slightly different presentations of name, address, etc.
     • Caused by multiple resources: different vendors and channels; different types (POI and YP).
     • Results: trouble for data management and confusion for users.
     • Example: "Seattle Premier Outlet Mall" vs. "Seattle Premium Outlet".

  5. What we do
     • Infer the similarity between two location entities using a machine-learning-based approach.
     • Consider multiple fields: name, address, coordinates, categories.
     • Identify some useful features.
     • Evaluate our method using real datasets.

  6. Methodology
     • Similarities between two entities: name similarity, address similarity, category similarity.
     • Train an inference model, using these similarities as features and a small human-labeled training set.
     • Apply the model to a large-scale dataset.
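The train-then-apply idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the experiments slide names "decision tree + bagging", and here depth-1 decision trees (stumps) stand in for the full trees to keep the example short. The function names, the toy feature vectors, and the parameters `n_estimators` and `seed` are hypothetical, not taken from the slides.

```python
import random

def fit_stump(X, y):
    """Fit a depth-1 tree: pick (feature, threshold, polarity) with fewest training errors."""
    best = None
    for f in range(len(X[0])):
        # float("inf") allows a constant prediction when a sample has one class only.
        thresholds = sorted({row[f] for row in X}) + [float("inf")]
        for t in thresholds:
            for pol in (1, -1):
                pred = [1 if pol * (row[f] - t) >= 0 else 0 for row in X]
                err = sum(p != label for p, label in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, t, pol)
    return best[1:]

def predict_stump(stump, row):
    f, t, pol = stump
    return 1 if pol * (row[f] - t) >= 0 else 0

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagging(stumps, row):
    """Majority vote over the stumps: 1 = duplicate, 0 = non-duplicate."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Toy labeled pairs (hypothetical): each row is a similarity feature vector.
X = [(5.0, 0.1), (4.5, 0.3), (4.8, 0.2), (0.5, 3.0), (0.3, 2.5), (0.8, 2.8)]
y = [1, 1, 1, 0, 0, 0]
model = fit_bagging(X, y)
```

Once trained on the small labeled set, `predict_bagging(model, features)` could be applied to every candidate pair in the large dataset.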

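  7. Name similarity
     • Edit distance does not work.
     • Uses the concept of IDF (inverse document frequency).
     • Shared part: $W_s = \mathrm{Same}(r_1.\mathrm{name}, r_2.\mathrm{name}) = \{w_1, w_2, \ldots, w_n\}$
     • Different part: $W_d = \mathrm{Diff}(r_1.\mathrm{name}, r_2.\mathrm{name}) = \{w'_1, w'_2, \ldots, w'_m\}$
     • Output $S_1$ and $S_2$ as features: $S_1 = \sum_{w_i \in W_s} \mathrm{IDF}(w_i)$; $S_2$ is computed from the different part $W_d$.
     • $\mathrm{IDF}(w_i)$ follows the usual inverse-document-frequency form: the logarithm of the total number of records divided by $|\{r_j : w_i \in r_j.\mathrm{name}\}|$, the number of records whose name contains $w_i$.

     Record names | Same part | Difference | Edit Dist. | Result
     Galaxies Cafe / Galaxies Coffee House | Galaxies | Cafe / Coffee House | 9 | Same
     Espresso Darer / Espresso Diana Darer | Darer Espresso | Diana | 4 | Diff.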
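A minimal sketch of the shared-part and different-part features, assuming whitespace tokenization and the standard IDF definition. Scoring the different part by its summed IDF is an assumption made for illustration, since the slide does not give a readable formula for S2. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(name):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return name.lower().split()

def build_idf(names):
    """IDF(w) = log(N / number of names containing w)."""
    n = len(names)
    df = Counter()
    for name in names:
        df.update(set(tokenize(name)))
    return {w: math.log(n / c) for w, c in df.items()}

def name_features(name1, name2, idf, default_idf=0.0):
    """Return (s1, s2): summed IDF of shared words and of differing words."""
    t1, t2 = set(tokenize(name1)), set(tokenize(name2))
    shared = t1 & t2          # W_s
    different = t1 ^ t2       # W_d (words in exactly one of the names)
    s1 = sum(idf.get(w, default_idf) for w in shared)
    # Assumption: S2 is scored like S1 but over the different part.
    s2 = sum(idf.get(w, default_idf) for w in different)
    return s1, s2

names = [
    "Galaxies Cafe",
    "Galaxies Coffee House",
    "Espresso Darer",
    "Espresso Diana Darer",
]
idf = build_idf(names)
s1, s2 = name_features("Galaxies Cafe", "Galaxies Coffee House", idf)
```

Rare words such as a distinctive business name carry a high IDF, so sharing them counts for more than sharing a generic word like "cafe" would in a realistic corpus.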

  8. Address similarity
     • Example: the same building with two different address presentations: "79 Beaver St, New York, NY 10005-2812" and "92 Water St, New York, NY 10005-3511".
     • The geospatially closer two records are located, the higher the probability that they are near duplicates.
     • City structure (by level):
       City: New York City (1xxxx)
       Borough: Queens (113xxx), Manhattan (100xxx)
       Area: Upper East (1002x), Lower East (1000x)
       Street: 5th Street, Wall Street

  9. Address similarity
     • Insert YP data into the city structure according to their addresses.
     • Calculate the mean coordinates of each leaf node.
     • Insert POI data into the city structure according to their coordinates.
     • Find the co-parent node of the two records in the structure.
     [Figure: panels A), B), and C), each showing two records R1 and R2 and their co-parent node np.]
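The four steps above could be sketched as follows. This is an illustration under assumptions: each leaf is represented by a hypothetical path tuple (city, borough, area, street), a POI is attached to the leaf whose mean coordinates are closest in plain squared Euclidean distance, and the depth of the co-parent node is the length of the common path prefix. The sample paths and coordinates are made up.

```python
from collections import defaultdict

def leaf_means(yp_records):
    """yp_records: iterable of (path, (x, y)). Returns mean coordinates per leaf path."""
    groups = defaultdict(list)
    for path, coord in yp_records:
        groups[path].append(coord)
    return {
        path: (sum(c[0] for c in coords) / len(coords),
               sum(c[1] for c in coords) / len(coords))
        for path, coords in groups.items()
    }

def nearest_leaf(coord, means):
    """Attach a POI to the leaf with the closest mean coordinates."""
    return min(
        means,
        key=lambda p: (means[p][0] - coord[0]) ** 2 + (means[p][1] - coord[1]) ** 2,
    )

def co_parent_depth(path1, path2):
    """Number of leading levels two leaf paths share (deeper = more similar)."""
    depth = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        depth += 1
    return depth

# Hypothetical YP records placed by address.
wall = ("New York City", "Manhattan", "Lower East", "Wall Street")
fifth = ("New York City", "Manhattan", "Upper East", "5th Street")
yp = [(wall, (0.0, 0.0)), (wall, (2.0, 2.0)), (fifth, (10.0, 10.0))]
means = leaf_means(yp)
poi_leaf = nearest_leaf((1.5, 0.5), means)
```

Here a POI at (1.5, 0.5) lands on the Wall Street leaf, and comparing it with a YP record on 5th Street gives a co-parent at the borough level.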

  10. Category similarity
     • Map each entity to a category hierarchy.
     • Find the co-parent node of the two entities.
     • The lower the level of the co-parent node, the higher the similarity.
     • E.g., some shops provide coffee, lunch, and wine simultaneously, so different people classify them into different categories.
     • Category hierarchy (by level):
       Level 1: Education, Entertainment
       Level 2: Restaurant, Cinema
       Level 3: Italian Restaurant, Chinese Restaurant
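A small sketch of the co-parent lookup in a category hierarchy, stored as a child-to-parent map. The parent links are assumptions: the slide lists the categories by level but the extracted text does not say which Level 1 node Restaurant and Cinema hang under, so Entertainment is used here purely for illustration.

```python
# Hypothetical parent links; None marks a Level 1 category.
PARENT = {
    "Education": None,
    "Entertainment": None,
    "Restaurant": "Entertainment",
    "Cinema": "Entertainment",
    "Italian Restaurant": "Restaurant",
    "Chinese Restaurant": "Restaurant",
}

def path_from_root(category):
    """Return the list of categories from Level 1 down to `category`."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path[::-1]

def co_parent_level(cat1, cat2):
    """Level of the deepest shared ancestor; 0 means nothing is shared."""
    level = 0
    for a, b in zip(path_from_root(cat1), path_from_root(cat2)):
        if a != b:
            break
        level += 1
    return level
```

With these links, `co_parent_level("Italian Restaurant", "Chinese Restaurant")` is 2 (they meet at Restaurant), while a restaurant and a cinema only meet at Level 1.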

  11. Experiments - Settings
     • Beijing dataset: 0.7 million entities in total (0.3 million POIs and 0.4 million YPs), human labeled.
     • Model: decision tree + bagging.
     • Baselines: exact match; rule-based (edit distance and geo-distance).

     Dataset | Training Set | Test Set | Total
     D1 | 200 | 200 | 400
     D2 | 400 | 400 | 800
     D3 | 600 | 600 | 1200
     D4 | 800 | 800 | 1600
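For reference, a rule-based baseline of the kind named above (edit distance plus geo-distance) might look like the sketch below. The thresholds `max_edit` and `max_dist_m`, the record keys, and the use of the haversine formula for geo-distance are assumptions; the slides do not state the actual rules.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6371 km."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def rule_based_duplicate(rec1, rec2, max_edit=3, max_dist_m=100.0):
    """Flag a pair as duplicate when names and coordinates are both close."""
    close_names = edit_distance(rec1["name"].lower(), rec2["name"].lower()) <= max_edit
    close_coords = haversine_m(rec1["lat"], rec1["lon"],
                               rec2["lat"], rec2["lon"]) <= max_dist_m
    return close_names and close_coords
```

A fixed edit-distance threshold is exactly what the name-similarity slide argues against: "Cafe" and "Coffee House" are 9 edits apart even though the two records describe the same place.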

  12. Experiments - Results
     • Single-feature study: S1 and S2 are name similarity; S3 denotes address similarity; S4 represents category similarity.
     [Figure: two plots, precision and recall of S1, S2, S3, and S4 against the number of entity pairs (400, 800, 1200, 1600).]

  13. Experiments - Results
     • Feature combination

     Features | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy
     S1 + S2 | 0.860 | 0.857 | 0.852 | 0.864 | 0.858
     S1 + S3 | 0.800 | 0.767 | 0.746 | 0.819 | 0.782
     S1 + S2 + S3 | 0.864 | 0.859 | 0.853 | 0.869 | 0.861
     S1 + S2 + S4 | 0.864 | 0.859 | 0.853 | 0.869 | 0.861
     S1 + S2 + S3 + S4 | 0.885 | 0.866 | 0.858 | 0.891 | 0.875

  14. Experiments - Results

     Method | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy
     Exact match | 1 | 0.183 | 0.558 | 0.100 | 0.598
     Rule-based method | 0.780 | 0.701 | 0.736 | 0.808 | 0.755
     Our approach | 0.885 | 0.866 | 0.858 | 0.891 | 0.875

     [Figure: performance measures (precision (Y), recall (Y), precision (N), recall (N), overall) for datasets D1, D2, D3, and D4.]
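The measures reported in the result tables (precision and recall for each class, plus overall accuracy) can be computed as in this sketch. The label encoding (1 = duplicated, 0 = non-duplicated), the function names, and the toy predictions are assumptions for illustration, not the authors' evaluation code or data.

```python
def class_metrics(y_true, y_pred, label):
    """Precision and recall for one class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

def overall_accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted label matches the human label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy example: 1 = duplicated, 0 = non-duplicated.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
dup_precision, dup_recall = class_metrics(y_true, y_pred, 1)
non_precision, non_recall = class_metrics(y_true, y_pred, 0)
accuracy = overall_accuracy(y_true, y_pred)
```

Reporting both classes matters here: a method that only flags exact matches can reach perfect precision on the duplicated class while missing most true duplicates.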

  15. Conclusion
     • A classification model using name similarity, address similarity, and category similarity.
     • Determines nearly duplicated location data with an overall accuracy of 0.89.

  16. Thanks!
     • Yu Zheng, Microsoft Research Asia
     • yuzheng@microsoft.com
