Capturing Semantics for Imputation with Pre-trained Language Models
This ICDE 2021 presentation introduces imputation techniques that use pre-trained language models to fill in missing data. Instead of relying only on symbolic similarity metrics, the authors capture token-level semantics and propose two approaches, IPM-Multi and IPM-Binary, which are reported to improve accuracy. The motivation is that existing methods that infer corrections from co-occurring attribute values suffer from sparsity. The slides also summarize the experimental setup and results.
Presentation Transcript
Capturing Semantics for Imputation with Pre-trained Language Models
Yinan Mei (1), Shaoxu Song (1), Chenguang Fang (1), Haifeng Yang (2), Jingyun Fang (2), Jiang Long (2)
(1) BNRist, Tsinghua University, China
(2) Data Governance Innovation Lab, HUAWEI Cloud BU, China
ICDE 2021
Outline
1. Background
2. Motivation
3. Overview
4. Proposal: IPM-Multi
5. Proposal: IPM-Binary
6. Experiments
1. Background: Missing Data
Missing data are prevalent. Typical causes:
- Optional inputs in information collection systems
- Different schemas when integrating heterogeneous data sources
2. Motivation
Many existing imputation techniques depend on symbolic similarity metrics, such as string edit distance and term Jaccard similarity. However, relying solely on symbolic similarity does not always lead to accurate results.
Example: k-NN (k=3) with term Jaccard similarity on relation r
r | title | category | modelno | price
t1 | tribeca varsity jacket hard shell case for ipod touch dallas cowboys | NaN (mp3 accessories) | fva3778 | 18.54
t2 | oklahoma sooners iphone 4 case black shell | electronics general | fva3161 | 29.99
t3 | belkin sport armband for ipod nano black | mp3 accessories | f8z514tt064 | 18.88
t4 | case logic 11.6 hard shell netbook sleeve | electronics general | 154722 | 19.99
t5 | fellowes hd precision cordless mouse | mice | 98904 | 44.84
... | ... | ... | ... | ...
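The k-NN example on this slide can be reproduced with a short sketch. It assumes that only the title attribute is compared, that titles are tokenized on whitespace, and that the value in parentheses for t1 (mp3 accessories) is the true category. The function names are illustrative and do not come from the presentation.

```python
from collections import Counter

# Observed tuples of relation r: id -> (title, category).
OBSERVED = {
    "t2": ("oklahoma sooners iphone 4 case black shell", "electronics general"),
    "t3": ("belkin sport armband for ipod nano black", "mp3 accessories"),
    "t4": ("case logic 11.6 hard shell netbook sleeve", "electronics general"),
    "t5": ("fellowes hd precision cordless mouse", "mice"),
}
T1_TITLE = "tribeca varsity jacket hard shell case for ipod touch dallas cowboys"


def term_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def knn_impute(title: str, observed: dict, k: int = 3):
    """Return the majority category among the k most similar tuples and their ids."""
    ranked = sorted(
        observed.items(),
        key=lambda item: term_jaccard(title, item[1][0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(category for _, (_, category) in neighbors)
    return votes.most_common(1)[0][0], [tuple_id for tuple_id, _ in neighbors]


prediction, neighbor_ids = knn_impute(T1_TITLE, OBSERVED)
print(prediction, neighbor_ids)
```

Under these assumptions the three nearest neighbors are t4, t2 and t3. Two of them are labeled electronics general, so the majority vote disagrees with mp3 accessories, which is the failure the slide points out.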
2. Motivation
Some imputation techniques [1] infer corrections from co-occurring attribute values and therefore suffer from the sparsity problem: co-occurrences w.r.t. values are not always available.
(Relation r as on the previous slide: tuples t1 to t5 with attributes title, category, modelno and price.)
[1] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow., 10(11):1190-1201, 2017.
2. Motivation
Co-occurrences w.r.t. tokens are widely available, from two sources:
- External knowledge (corpus for pre-training)
- To-impute data (model fine-tuning)
Idea: impute the missing values by further considering semantics w.r.t. the fine-grained tokens.
r | title | category | modelno | price
t1 | tribeca varsity jacket hard shell case for ipod touch dallas cowboys | NaN (mp3 accessories) | ... | 18.54
t2 | oklahoma sooners iphone 4 case black shell | electronics general | ... | 29.99
t3 | belkin sport armband for ipod nano black | mp3 accessories | ... | 18.88
... | ... | ... | ... | ...
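The contrast between value-level and token-level co-occurrence can be illustrated with a small count over the example tuples. This is only a sketch: it assumes whitespace tokenization of the title and uses raw counts, whereas the presentation relies on a pre-trained language model rather than on counting. The variable names are illustrative.

```python
from collections import Counter, defaultdict

# Observed (title, category) pairs for t2..t5 of the example relation r.
OBSERVED = [
    ("oklahoma sooners iphone 4 case black shell", "electronics general"),
    ("belkin sport armband for ipod nano black", "mp3 accessories"),
    ("case logic 11.6 hard shell netbook sleeve", "electronics general"),
    ("fellowes hd precision cordless mouse", "mice"),
]
T1_TITLE = "tribeca varsity jacket hard shell case for ipod touch dallas cowboys"

# Value level: categories that co-occur with t1's exact title value.
value_cooccurrence = Counter(
    category for title, category in OBSERVED if title == T1_TITLE
)

# Token level: categories that co-occur with each individual title token.
token_cooccurrence = defaultdict(Counter)
for title, category in OBSERVED:
    for token in set(title.split()):
        token_cooccurrence[token][category] += 1

# Evidence available for the tokens that appear in t1's title.
token_evidence = {
    token: dict(token_cooccurrence[token])
    for token in T1_TITLE.split()
    if token in token_cooccurrence
}

print(value_cooccurrence)
print(token_evidence)
```

The exact title of t1 never occurs in another tuple, so the value-level counter is empty, while several of its tokens (for example ipod) do co-occur with category values. Raw token counts alone do not settle the answer here, which is consistent with the slide's call for semantics from pre-training and fine-tuning.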
3. Overview
IPM consists of two alternatives:
- IPM-Multi: multiclass classification; the direct solution
- IPM-Binary: binary classification; requires less training data
4. Proposal: IPM-Multi
Imputation <-> Multiclass classification, with $\#\text{class} = |\mathrm{dom}(A)|$
Step 1: Fine-tuning in IPM-Multi
Unsupervised -> masked entry recovery. Example: the category of t3 (mp3 accessories) is replaced by [MASK], and the pre-trained language model receives [belkin, ..., ipod, ..., 18.88].
Step 2: Imputing in IPM-Multi
The fine-tuned pre-trained language model receives the incomplete tuple t1, [tribeca, ..., ipod, ..., 18.54], whose category is NaN (mp3 accessories).
Figure: in both steps the model outputs one probability per class (mp3 accessories, electronic general, mice); the values shown are 0.01, 0.21, 0.65 and 0.01, 0.11, 0.82.
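Masked entry recovery can be sketched as a corpus-building step: every tuple whose target attribute is observed becomes a labeled training example, and tuples with a missing target become inputs to impute. The serialization (joining the remaining values with spaces) and all names below are assumptions for illustration; the slides do not specify how tuples are turned into text.

```python
# Example relation r; None marks the missing category of t1.
ROWS = [
    {"title": "tribeca varsity jacket hard shell case for ipod touch dallas cowboys",
     "category": None, "modelno": "fva3778", "price": "18.54"},
    {"title": "oklahoma sooners iphone 4 case black shell",
     "category": "electronics general", "modelno": "fva3161", "price": "29.99"},
    {"title": "belkin sport armband for ipod nano black",
     "category": "mp3 accessories", "modelno": "f8z514tt064", "price": "18.88"},
    {"title": "case logic 11.6 hard shell netbook sleeve",
     "category": "electronics general", "modelno": "154722", "price": "19.99"},
    {"title": "fellowes hd precision cordless mouse",
     "category": "mice", "modelno": "98904", "price": "44.84"},
]


def serialize(row: dict, target: str) -> str:
    """Join every non-missing value except the target attribute into one text."""
    return " ".join(
        value for attr, value in row.items() if attr != target and value is not None
    )


def build_multiclass_corpus(rows: list, target: str):
    """Mask the target attribute: observed rows give (text, class id) pairs."""
    domain = sorted({row[target] for row in rows if row[target] is not None})
    label_of = {value: index for index, value in enumerate(domain)}
    train = [
        (serialize(row, target), label_of[row[target]])
        for row in rows
        if row[target] is not None
    ]
    to_impute = [serialize(row, target) for row in rows if row[target] is None]
    return domain, train, to_impute


domain, train, to_impute = build_multiclass_corpus(ROWS, "category")
print(len(domain), "classes;", len(train), "training examples")
```

The number of classes equals the number of distinct observed values of the target attribute, which is why the classifier's output layer grows with the domain size.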
4. Proposal: IPM-Multi
Model architecture
Figure: the tuple t3 with its category replaced by [MASK] is serialized as "[CLS] belkin sport ... 18.88 [SEP]". The tokens are mapped to embeddings, passed through stacked Transformer layers to obtain contextualized embeddings, and fed to a multiclass classifier that outputs one probability per class (mp3 accessories, electronic general, mice); the values shown are 0.01, 0.21, 0.65.
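The classifier on top of the Transformer layers can be sketched as a linear layer followed by a softmax. This assumes the common setup in which a single contextualized vector (typically the one for [CLS]) is classified; the slides do not state which vector is used. The embedding, weights and bias below are made-up toy numbers, not values from the presentation.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding: list, weights: list, bias: list, classes: list):
    """Linear layer + softmax: one probability per class in the domain."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], dict(zip(classes, probs))


# Toy stand-in for a contextualized embedding (hypothetical, 3 dimensions).
CLASSES = ["mp3 accessories", "electronic general", "mice"]
embedding = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0], [-1.0, 1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

label, probabilities = classify(embedding, weights, bias, CLASSES)
print(label, probabilities)
```

The weight matrix has one row per class, so its size scales with the domain size of the attribute being imputed.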
4. Proposal: IPM-Multi
Limitation: low redundancy and a large domain size mean limited training data, which leads to over-fitting.
Dataset | # Tuples | Domain Size (# Class) | # Avg samples
Flipkart | 8884 | 3453 | 8884/3453 = 2.57
Figure: the same architecture as on the previous slide, with a multiclass classifier over 3453 classes (example output probability: 0.01).
5. Proposal: IPM-Binary
Imputation <-> Binary classification: generate candidates -> identify the correct filling
Step 1: Candidates generation in IPM-Binary
r | title | category | modelno | price
t1 | tribeca varsity jacket hard shell case for ipod touch dallas cowboys | NaN (mp3 accessories electronics general) | ... | 18.54
t2 | oklahoma sooners iphone 4 case black shell | electronics general | ... | 29.99
t3 | belkin sport armband for ipod nano black | mp3 accessories | ... | 18.88
t4 | case logic 11.6 hard shell netbook sleeve | electronics general | ... | 19.99
$NN(t_1) = \{t_2, t_3, t_4\}$
$\mathrm{can}(t_1[\text{category}]) = \{t_j[\text{category}] \mid t_j \in NN(t_1)\} = \{\text{electronics general}, \text{mp3 accessories}\}$
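Candidate generation can be sketched as collecting the distinct category values of a tuple's neighbors. The slide does not say how the neighbors are found; the sketch below assumes term Jaccard similarity on whitespace-tokenized titles, and it adds t5 from the earlier example relation so that the neighbor search has something to exclude. All function names are illustrative.

```python
# Observed tuples: id -> (title, category); t5 comes from the earlier example.
OBSERVED = {
    "t2": ("oklahoma sooners iphone 4 case black shell", "electronics general"),
    "t3": ("belkin sport armband for ipod nano black", "mp3 accessories"),
    "t4": ("case logic 11.6 hard shell netbook sleeve", "electronics general"),
    "t5": ("fellowes hd precision cordless mouse", "mice"),
}
T1_TITLE = "tribeca varsity jacket hard shell case for ipod touch dallas cowboys"


def term_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def generate_candidates(title: str, observed: dict, k: int = 3):
    """Return the k nearest tuple ids and their distinct category values."""
    ranked = sorted(
        observed,
        key=lambda tuple_id: term_jaccard(title, observed[tuple_id][0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    candidates = []
    for tuple_id in neighbors:
        value = observed[tuple_id][1]
        if value not in candidates:  # keep first occurrence, drop duplicates
            candidates.append(value)
    return neighbors, candidates


neighbors, candidates = generate_candidates(T1_TITLE, OBSERVED)
print(neighbors, candidates)
```

With these assumptions the neighbors are t2, t3 and t4 and the candidate set has two values, matching the sets on the slide. The classifier then only has to judge two fillings instead of the whole domain.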
5. Proposal: IPM-Binary
Step 2: Fine-tuning in IPM-Binary (model architecture)
Figure: the input sequence is built from [CLS] and [SEP] markers, the complete attribute values of a tuple (for t3: belkin sport armband for ipod nano black, 18.88) and a candidate filling (mp3 accessories). The token embeddings pass through stacked Transformer layers, and a binary classifier outputs the probability of the candidate being correct (0.75 in the example).
5. Proposal: IPM-Binary
Step 2: Fine-tuning in IPM-Binary (build training corpus)
Unsupervised -> masked entry recovery
Positive sample: $\{(\text{belkin ... ipod ... 18.88},\ \text{mp3 accessories},\ 1)\}$
Negative samples: $\{(\text{belkin ... ipod ... 18.88},\ v',\ 0) \mid v' \neq \text{mp3 accessories}\}$
Negative sampling from candidates: $\{(\text{belkin ... ipod ... 18.88},\ v',\ 0) \mid v' \neq \text{mp3 accessories},\ v' \in \mathrm{can}(t_3[\text{category}])\}$
r | title | category | modelno | price
t2 | oklahoma sooners iphone 4 case black shell | electronics general | ... | 29.99
t3 | belkin sport armband for ipod nano black | [MASK] (mp3 accessories) | ... | 18.88
t5 | fellowes hd precision cordless mouse | mice | 98904 | 44.84
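The training corpus for the binary classifier can be sketched as follows: the masked true value yields one positive sample, and every other candidate yields a negative sample. The context string and the candidate list are assumptions chosen to match the tuples shown on the slide (t2 and t5 supply the other category values), and the function name is illustrative.

```python
def build_binary_samples(context: str, true_value: str, candidates: list) -> list:
    """Return (context, filling, label) triples: 1 for the true value, 0 otherwise."""
    samples = [(context, true_value, 1)]
    samples += [
        (context, value, 0) for value in candidates if value != true_value
    ]
    return samples


# Complete attribute values of t3 with its category masked (assumed serialization).
T3_CONTEXT = "belkin sport armband for ipod nano black 18.88"
# Candidate fillings taken from the other tuples on the slide (t2 and t5).
T3_CANDIDATES = ["electronics general", "mice"]

samples = build_binary_samples(T3_CONTEXT, "mp3 accessories", T3_CANDIDATES)
for sample in samples:
    print(sample)
```

Restricting negatives to the candidate set keeps the number of training pairs per tuple small and focuses the classifier on the fillings it will actually have to tell apart at imputation time.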
5. Proposal: IPM-Binary
Step 3: Imputing in IPM-Binary
Select the candidate with the highest probability.
Figure: for the incomplete tuple t1 (category NaN, true value mp3 accessories), each candidate is paired with the tuple, giving [tribeca, ..., ipod, ..., electronics general] and [tribeca, ..., ipod, ..., mp3 accessories]. The fine-tuned pre-trained language model outputs the probabilities 0.9 and 0.4, and mp3 accessories is selected.
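The selection rule can be sketched as an argmax over candidate scores. The scoring function below is a stub that returns the two probabilities shown on the slide; pairing 0.9 with mp3 accessories is inferred from the fact that mp3 accessories is the selected filling. In practice the score would come from the fine-tuned binary classifier, and the names here are illustrative.

```python
from typing import Callable


def impute_binary(context: str, candidates: list, score: Callable[[str, str], float]):
    """Score every candidate filling and return the most probable one."""
    scored = {candidate: score(context, candidate) for candidate in candidates}
    best = max(scored, key=scored.get)
    return best, scored


# Stub scores standing in for the fine-tuned model's outputs (assumed pairing).
SLIDE_SCORES = {"mp3 accessories": 0.9, "electronics general": 0.4}


def stub_score(context: str, candidate: str) -> float:
    """Look up a fixed probability instead of running a real model."""
    return SLIDE_SCORES[candidate]


T1_CONTEXT = "tribeca varsity jacket hard shell case for ipod touch dallas cowboys 18.54"
filling, scores = impute_binary(
    T1_CONTEXT, ["electronics general", "mp3 accessories"], stub_score
)
print(filling, scores)
```

Because each candidate is scored independently, the probabilities do not need to sum to one, as the slide's 0.9 and 0.4 illustrate.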
6. Experiments
Datasets
Dataset | #Tuples | # Avg Class
Restaurant | 864 | 49
Walmart | 2300 | 251
Amazon | 13797 | 939
Buy | 651 | 62
Housing | 5000 | 966
Phone | 4312 | 335
Zomato | 50279 | 93
Flipkart | 8884 | 3453
Comparison methods
Category | Name
Neighbor | kNNE, MIBOS
Clustering | CMI
Statistical Model | ERACER, DLM, HoloClean
Generative Model | MIDAS, MIWAE, HI-VAE, DataWig
Error Correction | Baran
6. Experiments
Comparison with Existing Approaches
6. Experiments
Varying Amount of Training Data
Empirically, IPM-Multi may perform better than IPM-Binary if $\frac{\text{Data Size}}{\text{Domain Size}} > 10$.
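The rule of thumb can be checked against the dataset statistics from the experiments slide. This sketch assumes that the ratio in the rule is the number of tuples divided by the domain size (the "# Avg samples" quantity computed for Flipkart on the IPM-Multi slide) and that the "# Avg Class" row gives the domain size. It only applies the empirical threshold and does not reproduce any reported result.

```python
# (#Tuples, # Avg Class) per dataset, as listed on the experiments slide.
DATASETS = {
    "Restaurant": (864, 49),
    "Walmart": (2300, 251),
    "Amazon": (13797, 939),
    "Buy": (651, 62),
    "Housing": (5000, 966),
    "Phone": (4312, 335),
    "Zomato": (50279, 93),
    "Flipkart": (8884, 3453),
}
THRESHOLD = 10  # empirical threshold quoted on the slide

# Average number of tuples per class for each dataset.
ratios = {name: tuples / classes for name, (tuples, classes) in DATASETS.items()}

# Datasets where the rule of thumb would suggest trying IPM-Multi first.
multi_suggested = sorted(name for name, ratio in ratios.items() if ratio > THRESHOLD)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
print("ratio > 10:", multi_suggested)
```

For Flipkart the ratio is about 2.57, matching the figure on the IPM-Multi slide, so under this reading the rule would point toward IPM-Binary for that dataset.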
Q & A