Matching Tabular Data to Knowledge Graph Using Probability Models


"Discover how MTab facilitates the conversion of tabular data into a structured knowledge graph by leveraging probability models. Learn about its key ideas, assumptions, and framework, along with steps for pre-processing and estimating entity candidates and column types."

  • Knowledge Graph
  • Tabular Data
  • Probability Models
  • Semantic Web




Presentation Transcript


  1. MTab: Matching Tabular Data to Knowledge Graph Using Probability Models. Phuc Nguyen, Natthawut Kertkeidkachorn, Ryutaro Ichise, Hideaki Takeda. Semantic Web Challenge, 30 October 2019

  2. Matching Tables to DBpedia. Challenge website: http://www.cs.ox.ac.uk/isg/challenges/sem-tab

  3. MTab: Assumptions
     1. DBpedia is complete and correct
     2. Tables are vertical relational tables
     3. Tables are independent of one another
     4. All cells in a column share the same entity type and data type
     5. The table header is in the first row of the table

  4. MTab: Key Ideas. MTab combines a voting algorithm with probability models. It tackles two major problems:
     - Entity lookup: queries that return no result
     - Literal matching: cell values that cannot be matched exactly to the KG

  5. MTab Framework

  6. Step 1: Pre-Processing
     1. Text decoding: use fix text for you (ftfy) [1] to correct noisy textual data
     2. Language prediction: use a pre-trained fastText model (Facebook) [2]
     3. Data type prediction: use Duckling (Facebook) [3]
     4. Entity type prediction: use spaCy [4]
     5. Entity lookup:
        - Query: each cell, or neighboring cells in the same row
        - Targets: 1) DBpedia Lookup, 2) DBpedia endpoint, 3) Wikipedia, 4) Wikidata
        - Parameters: language, limit on the number of ranked results

     [1] Speer, R.: ftfy. Zenodo (2019), version 5.5
     [2] Joulin et al.: Bag of tricks for efficient text classification. In: EACL 2017, pp. 427-431. ACL (April 2017)
     [3] Duckling: https://github.com/facebook/duckling
     [4] Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017)
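In MTab, each pre-processing step is delegated to an external tool (ftfy, fastText, Duckling, spaCy). As an illustration only, the data-type prediction step could be approximated with a few regular expressions; the function below is a hypothetical, much-simplified stand-in, not Duckling's actual behavior:

```python
import re

# Hypothetical, simplified stand-in for Duckling-style data-type prediction:
# classify a cell value into a coarse data type with regular expressions.
def predict_data_type(cell: str) -> str:
    cell = cell.strip()
    if re.fullmatch(r"[+-]?\d+", cell):
        return "integer"
    if re.fullmatch(r"[+-]?\d*\.\d+", cell):
        return "float"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", cell):
        return "date"
    return "text"
```

A real system would also need to handle units, currencies, and free-form dates, which is exactly what Duckling provides out of the box.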

  7. Step 2 (Cell): Estimate Entity Candidates
     - Estimate entity candidates based on lookup results
     - Aggregate scores across the lookup services
     - Normalize those scores
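Step 2 can be sketched as follows, assuming each lookup service returns a ranked list of candidate entities. The rank-based scoring below is an assumption for illustration; MTab's actual aggregation is probabilistic and more involved:

```python
from collections import defaultdict

def estimate_entity_candidates(lookup_results):
    """lookup_results: {service_name: [entity, ...]}, ranked best-first.

    Returns a normalized score per candidate entity (scores sum to 1).
    """
    scores = defaultdict(float)
    for service, ranking in lookup_results.items():
        for rank, entity in enumerate(ranking):
            # Simple reciprocal-rank score: earlier rank -> higher score.
            scores[entity] += 1.0 / (rank + 1)
    total = sum(scores.values())
    # Normalize so the candidate scores form a probability distribution.
    return {entity: s / total for entity, s in scores.items()}
```

An entity returned near the top by several services (e.g. both DBpedia Lookup and Wikidata) accumulates score from each, which is the voting intuition behind MTab.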

  8. Step 3 (Column): Estimate Type Candidates. Use the data types predicted by Duckling and spaCy to categorize columns into two types:
     - Numerical columns
     - Textual columns

  9. Step 3 (Column): Estimate Type Candidates for Numerical Columns
     - Use EmbNum [1] to find relevant relations (the result is a ranking of relevant relations)
     - Infer the domain of each relation to find the corresponding types

     [1] Nguyen et al.: EmbNum: Semantic labeling for numerical values with deep metric learning. JIST 2018

  10. Step 3 (Column): Estimate Type Candidates for Textual Columns
      1. Type candidate signals from the numerical columns
      2. Aggregated signals from entity lookup types, over all cells in the column
      3. Aggregated signals from spaCy entity types, over all cells in the column
      4. The normalized Levenshtein distance between the table header and DBpedia class labels
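Signal 4 above relies on the normalized Levenshtein distance between the header string and a class label. A generic implementation (not MTab's exact code) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    # Distance scaled to [0, 1] by the length of the longer string,
    # so 0.0 means identical strings and 1.0 means maximally different.
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```

Normalizing by the longer string's length makes the signal comparable across headers and class labels of different lengths.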

  11. Step 4 (Column-Column): Estimate Relation Candidates
      - Estimate relation scores for each pair of cells in the same row
      - Aggregate these scores over all rows
      - Applies to both entity-entity columns and entity-literal columns
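The aggregation in Step 4 can be sketched as follows, assuming a per-row scorer that returns candidate KG relations with scores for a pair of cell values. The input format and scorer are hypothetical; MTab's actual scoring is probabilistic:

```python
from collections import defaultdict

def estimate_relation_candidates(rows, score_pair):
    """rows: list of (cell_a, cell_b) pairs from two columns.
    score_pair: callable (a, b) -> {relation: score} for one row.

    Returns relations ranked by their score aggregated over all rows.
    """
    totals = defaultdict(float)
    for a, b in rows:
        for relation, score in score_pair(a, b).items():
            totals[relation] += score  # sum the per-row evidence
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Summing over rows means a relation only needs to be weakly supported in each row to win, as long as it is supported consistently.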

  12. Step 5: Re-estimate Entity Candidates
      1. Entity candidates given the lookup results (Step 2)
      2. Entity candidates given the entity types (Step 3)
      3. Entity candidates given cell values and entity labels:
         - Normalized Levenshtein distance
         - Heuristic abbreviation rules
      4. Entity candidates given the other cell values in the same row (Step 4)
      The candidate with the highest estimation score is the output of the CEA task.

      Steps 6, 7: Re-estimate type and relation candidates with majority voting based on the CEA results
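Once the entities are fixed per cell (CEA), Steps 6 and 7 re-estimate the column type and relation by majority voting over the per-cell annotations. A minimal sketch, with an input format assumed for illustration:

```python
from collections import Counter

def majority_vote(annotations):
    """annotations: list of per-cell annotations, e.g. the DBpedia type
    of the entity matched to each cell in a column (or the relation
    implied by each row for a column pair).

    Returns the most frequent annotation as the column-level answer.
    """
    counts = Counter(annotations)
    winner, _ = counts.most_common(1)[0]
    return winner
```

For example, if most cells in a column resolve to entities of type dbo:City, the column type is annotated as dbo:City even when a few cells disagree.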

  13. Results (Primary Score)

      SemTab  | CEA (F1) | CTA (AH)   | CPA (F1)
      Round 1 | 1.000    | 1.000 (F1) | 0.987
      Round 2 | 0.911    | 1.414      | 0.881
      Round 3 | 0.970    | 1.956      | 0.844
      Round 4 | 0.983    | 2.012      | 0.832

  14. Summary
      Novelty:
      - MTab is built on top of multiple lookup services.
      - MTab adopts many new (literal) signals from table elements.
      Limitations:
      - Accuracy relies strongly on the lookup results.
      - Computation-intensive.
      - MTab is built on specific assumptions.
      Future work:
      - Improve efficiency: match only some parts of tables.
      - Improve effectiveness: relax the MTab assumptions.
