
Efficient Set Containment Join in Computer Science and Engineering
Explore TT-Join, a method to efficiently perform set containment join in the field of Computer Science and Engineering. The research discusses the problem statement, existing solutions, and experimental studies, offering insights into intersection-oriented and union-oriented approaches for set containment join.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
TT-Join: Efficient Set Containment Join Computer Science and Engineering Jianye Yang1 , Wenjie Zhang1 , Shiyu Yang1 , Ying Zhang2 , Xuemin Lin1 1 The University of New South Wales, Australia 2 University of Technology, Sydney, Australia
Outline Set Containment Join Existing Solutions Our Approach Experimental Studies Conclusion 2
Set Containment Join Job Seekers Job Advertisements People ID Acquired Skills Job ID Required Skills ?1 ?1,?2,?3,?5 ?1 ?1,?2,?3 ?2 ?1,?2,?4 ?2 ?1,?2,?4 ?3 ?1,?3,?6 ?3 ?1,?3,?4 ?4 ?2,?4,?5 ?4 ?2,?5 Who is qualified for each position? Human Resource 3
Set Containment Join Problem Statement (Set Containment Join) Given two collections and ? of records, the set containment join between and ?, denoted by ?, is to find all pairs ?,? , such that ? , ? ?, and ? ?. That is ? ? ?= = ?,? |? ?, ? ? ?, and ? ? . 4
Existing Solutions Existing solutions Based on the computing paradigms, we classify them into two categories, namely intersection-oriented methods and union-oriented methods. 5
Existing Solutions: intersection-oriented Key Idea Build inverted index on ? ?, and then apply the intersection operator for each ? ?to calculate ?. (e.g., [SIGMOD 2013], [DASFAA 2005], [ICDE 2015], [KAIS 2015], [SSDBM 2016]) 6
Existing Solutions: intersection-oriented People ID Acquired Skills Job ID Required Skills ?1 ?1,?2,?3,?5 ?1 ?1,?2,?3 ?2 ?1,?2,?4 ?2 ?1,?2,?4 ?3 ?1,?3,?6 ?3 ?1,?3,?4 ?4 ?2,?4,?5 ?4 ?2,?5 ?1: ??(??) ??(??) ???? = ?? Result pair of ?1: ??,?? Result pair of ?2: ??,?? Result pair of ?3: Result pair of ?4: ??,??, ??,?? 7
Existing Solutions: union-oriented Key Idea Buildinverted index for record signatures on ?, generate candidate records for each? ? ?, and then verify candidate pairs to calculate ?. (e.g., [VLDB 1997], [VLDB 2000], [EDBT 2002], [TODS 2003], [ICDE 2015]) 8
Existing Solutions: union-oriented People ID Acquired Skills Job ID Required Skills ?1 ?1,?2,?3,?5 ?1 ?1,?2,?3 ?2 ?1,?2,?4 ?2 ?1,?2,?4 ?3 ?1,?3,?6 ?3 ?1,?3,?4 ?4 ?2,?4,?5 ?4 ?2,?5 Example: Bitmap signature with ? = 4 To set a bit: ? ??? ? for?? Signature People ID Signature Job ID ?1 0111 ?1 0111 ?2 1110 ?2 1110 ?3 0111 ?3 1101 ?4 1110 ?4 0110 9
Existing Solutions: union-oriented Job ID Skills Signature People ID Skills Signature ?1 ?1,?2,?3 0111 ?1 ?1,?2,?3,?5 0111 ?2 ?1,?2,?4 1110 ?2 ?1,?2,?4 1110 ?3 ?1,?3,?4 1101 ?3 ?1,?3,?6 0111 ?4 ?2,?5 0110 ?4 ?2,?4,?5 1110 Generate candidates Build inverted index Step 1: Enumerate all subsets of the signature Step 2: Take the union of corresponding inverted lists ?1: ??(????) ??(????) ?????? = ??,?? Verify candidates Result pair of ?1: ??,??, ??,?? 10
Existing Solutions: union-oriented Job ID Skills Signature People ID Skills Signature ?1 ?1,?2,?3 0111 ?1 ?1,?2,?3,?5 0111 ?2 ?1,?2,?4 1110 ?2 ?1,?2,?4 1110 ?3 ?1,?3,?4 1101 ?3 ?1,?3,?6 0111 ?4 ?2,?5 0110 ?4 ?2,?4,?5 1110 Generate candidates Build inverted index ?3: ??(????) ??(????) ?????? = ??,?? Verify candidates Result pair of ?3: Issues: Large amount of subsets & False positives 11
Existing Solutions: properties Advantage Limit Intersection- Oriented Large size of inverted lists 1. Need to enumerate all subsets 2. Need to verify all candidate pairs Verification free Union- Oriented Small size of inverted list 12
Our Approach: Overview Motivation Design a new union-oriented such that (i) more effective signatures are employed; (ii) we need not verify all candidate pairs. IS-Join TT-Join Exploit data distribution and use the least frequent element as the signature of the record. Same index size as IS-Join; Same pruning power as ?IS-Join; Naturally validate a large number of candidates. ?IS-Join To enhance the pruning power, use the ?least frequent elements as the signature of the record. 13
Our Approach: IS-Join Observation Many real-life data are skewed, i.e., some elements appear more frequently than others. ???= (??)2 ? ?2 ? ???= (??)2 ? ?2 ? ?? 1+ ???? ? 14
Our Approach: IS-Join Job ID Skills Signature People ID Skills Signature ?1 ?1,?2,?3 ?3 ?1 ?1,?2,?3,?5 ?1,?2,?3,?5 ?2 ?1,?2,?4 ?4 ?2 ?1,?2,?4 ?1,?2,?4 ?3 ?1,?3,?4 ?4 ?3 ?1,?3,?6 ?1,?3,?6 ?4 ?2,?5 ?5 ?4 ?2,?4,?5 ?2,?4,?5 Build inverted index Generate candidates ?1: ??(?1) ??(?2) ??(?3) ???5 = ??,?? Verify candidates Result pair of ?1: ??,??, ??,?? 15
Our Approach: ?IS-Join (? = 2) Job ID Skills Signature People ID Skills Signature ?1 ?1,?2,?3 ?2,?3 ?1 ?1,?2,?3,?5 ?1,?2,?3,?5 ?2 ?1,?2,?4 ?2,?4 ?2 ?1,?2,?4 ?1,?2,?4 ?3 ?1,?3,?4 ?3,?4 ?3 ?1,?3,?6 ?1,?3,?6 ?4 ?2,?5 ?2,?5 ?4 ?2,?4,?5 ?2,?4,?5 Build inverted index Generate candidates ?2: ??(?1) ??(?2) ???4 = ??:?,??:?,??:?,??:? 16
Our Approach: ?IS-Join (? = 2) Job ID Skills Signature People ID Skills Signature ?1 ?1,?2,?3 ?2,?3 ?1 ?1,?2,?3,?5 ?1,?2,?3,?5 ?2 ?1,?2,?4 ?2,?4 ?2 ?1,?2,?4 ?1,?2,?4 ?3 ?1,?3,?4 ?3,?4 ?3 ?1,?3,?6 ?1,?3,?6 ?4 ?2,?5 ?2,?5 ?4 ?2,?4,?5 ?2,?4,?5 Build inverted index Generate candidates ?2: ??(?1) ??(?2) ???4 = ??:?,??:?,??:?,??:? Verify candidates ??,?? Result pair of ?2: ??,?? Issue: Pruning cost increases when k increases 17
Our Approach: TT-Join (? = 2) ?-length least frequent prefix Given a record ? = ?1, ,??, we define ??, ,?? ?+1 as its ?-length least frequent prefix, denoted by ????? . Job ID Skills ????? ?1 ?1,?2,?3 ?3,?2 ?2 ?1,?2,?4 ?4,?2 ?3 ?1,?3,?4 ?4,?3 ?4 ?2,?5 ?5,?2 ?-length least frequent prefix tree on 18
Our Approach: TT-Join (? = 2) Main Idea Traverse ?? following a depth-first strategy. On each node, we switch to traverse ? by matching corresponding elements. Verification is executed when reaching a leaf node of ? . prefix tree on ? ????-tree on 19
Our Approach: TT-Join (? = 2) prefix tree on ? ????-tree on 20
Our Approach: TT-Join (? = 2) ?1 prefix tree on ? ????-tree on 21
Our Approach: TT-Join (? = 2) No matching ?1 prefix tree on ? ????-tree on 22
Our Approach: TT-Join (? = 2) No matching ?2 prefix tree on ? ????-tree on 23
Our Approach: TT-Join (? = 2) ?1 Find matching ?3 prefix tree on ? ????-tree on 24
Our Approach: TT-Join (? = 2) Switch to ? ?1 ?3 ?2 prefix tree on ? ????-tree on 25
Our Approach: TT-Join (? = 2) Check if exists in prefix ?1 ?3 ?2 prefix tree on ? ????-tree on 26
Our Approach: TT-Join (? = 2) ?2 Dose exist in prefix ?1 ?3 ?2 prefix tree on ? ????-tree on 27
Our Approach: TT-Join (? = 2) Verify?1 and ?3.??? = ?1,?2,?3 ?3 prefix tree on ? ????-tree on 28
Our Approach: TT-Join (? = 2) Generate result for?3: ?1 ?3 Result: ?1 prefix tree on ? ????-tree on 29
Our Approach: TT-Join (? = 2) Pass result to child nodes Result: ?1 prefix tree on ? ????-tree on 30
Our Approach: TT-Join (? = 2) Generate result pairs ??,?? Result: ?1 prefix tree on ? ????-tree on 31
Our Approach: Cost Comparison ???= (??)2 ? ?2 ? ?? ?+ ???? IS-Join: ? ????= (??)2 ? ?2 ? ?? ?+ ???? ?IS-Join: ? ???= (??)2 ? ?2 ? ?? ?+ ?? ???+ ???? TT-Join: ? 32
Experimental: Algorithms Algorithm TT-Join LIMIT PIEJoin PRETTI+ PTSJ DivideSkip Adapt FreqSet Description Our approach (k=4 under all settings) Intersection-oriented [KAIS 2015] Intersection-oriented [SSDBM 2016] Intersection-oriented [ICDE 2015] Union-oriented [ICDE 2015] Adapted algorithm [ICDE 2008] Adapted algorithm [SIGMOD 2012] Adapted algorithm [SIGMOD 2010] 34
Experimental: Performance Tuning Effect of ? on running time 35
Conclusion We classify the existing solutions into two categories and show the advantages and limits of the methods in each category. We propose a new union-oriented method, namely TT- Join. Our comprehensive experiments on 20 real-life datasets demonstrate that our TT-Join significantly outperforms the state-of-the-art algorithms on most of the datasets, and can achieve up to two orders of magnitude speedup. 37
Thank you! Questions? 38