Prefix-Based Filtering Algorithm for String Similarity Search

a pivotal prefix based filtering algorithm n.w

1 / 55

Embed Share

Explore a pivotal prefix-based filtering algorithm designed for efficient string similarity search. Dive into the importance of search, speed, and handling dirty data, along with similarity search queries, edit distance calculations, and challenging time complexities. Lastly, discover the applications, challenges, and a filter-and-verification framework for enhancing search accuracy.

gwynev Follow

Uploaded on Mar 19, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

A Pivotal Prefix Based Filtering Algorithm for String Similarity Search Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng

Search is Important Google Searches per Year Source: http://www.internetlivestats.com/google-search-statistics/

Speed Matters Source:

Data is Dirty DBLP Complete Search Typos Argyrios Zymnis Argyris Zymnis relaxed Typo in title related

Similarity Search Query All the strings similar to the query String Dataset

Edit Distance ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. For example: ED(sigcom, sigmod) = 2 sigcom substitute c with m sigmom substitute m with d sigmod

Problem Definition Query string s = yotubecom and = 2 ed(s, r4) <= 2 output r4as a result string dataset R

Application Spell Checking Copy Detection Entity Linking Bioinformatic .

Challenge Na ve Method Time complexity: for each query ? |?||?|

Filter-and-Verification Framework Query string s No Filter: Yes Verify: ED(r,s) ? Index Results Signature(s) Signature(r) = ? Dataset R Threshold

Preliminary: q-gram q-gram of the substring with length q youtbecom yoouuttbbeeccoom 2-gram

Preliminary: q-gram 1 edit operation destroies at most q grams. yout ecom yoouutteeccoom d dd edit operations destroy at most q grams. if r and s have more than q mismatch grams, ED(r, s)> .

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) suffix(r) Pre( ) is the prefix of q( ) |Pre( )|= q +1 Pre(s) q(s): The sorted q-gram set of string s Prefix Filter: If pre(r) pre(s) = , ED(r,s) >

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) suffix(r) g1 g2 g5 g6 g11g12 g13 >g10>g10>g10>g10>g10>g10 Pre( ) is the prefix of q( ) |Pre( )|= q +1 g3 g4 g7 g8 g9 g10 g12 Pre(s) q(s): The sorted q-gram set of string s Prefix Filter: If pre(r) pre(s) = , ED(r,s) >

Preliminary: disjoint q-gram One edit operation destroies at most 1 disjoint gram. yout ecom yout d e d om edit operations destroy at most disjoint grams. if r and s have more than mismatch disjoint grams, ED(r, s)>

Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) suffix(r) Piv(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= +1 and the q-grams in Piv( ) are disjoint Piv(s) Pre(s) q(s): The sorted q-gram set of string s If piv(s) pre(r) = and piv(r) pre(s) = , ED(r,s) >

Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) suffix(r) last(r) g5 g8 g10 Piv(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= +1 and the q-grams in Piv( ) are disjoint >g10>g10>g10>g10>g10>g10>g10 Piv(s) g1 g3 g6 g9 g11g13 last(s) Pre(s) q(s): The sorted q-gram set of string s Pivotal Prefix Filter: If last(s)> last(r) and piv(r) pre(s) = , ED(r,s) >

Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) suffix(r) last(r) g1 g4 g6 g9 g12g13 >g10>g10>g10>g10>g10>g10>g10 Piv(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= +1 and the q-grams in Piv( ) are disjoint Piv(s) g3 g7 g10g11 last(s) Pre(s) q(s): The sorted q-gram set of string s Pivotal Prefix Filter: If last(r)> last(s) and piv(s) pre(r) = , ED(r,s) >

Pivotal Prefix Filter If last(r)> last(s) and piv(s) pre(r) = , ED(r,s) > If last(s)> last(r) and piv(r) pre(s) = , ED(r,s) > Existence: There must exist +1 disjoint grams in the prefix The Pivotal Prefix is a subset of the Prefix The pivotal prefix filter dominates the prefix filter Signature size are O( ) and O(q ) respectively

Related Work Method |Sig(r)| |Sig(s)| Prefix Filter O(q ) O(q ) Mismatch Filter O(q ) O(q ) O(l) Qchunk Filter O( ) Pivotal Prefix Filter O( ) O(q ) Mismatch Filter [Xiao VLDB08] : Shorten prefix length, but still O(q ) Qchunk Filter[Qin SIGMOD11] : Shorten one to O( ) but increased the other one to O(l) Adaptive Prefix[Wang SIGMOD12] Increase prefix length to reduce candidate number Orthogonal and can be integrated into our method Flamingo[Li ICDE08] Based on count filter. Accelerating counting process. Orthogonal and can be integrated into our method

Pivotal Search Algorithm Indexing Build inverted indexes for both the prefix and the pivotal prefix of the data strings Querying Generate prefix and pivotal prefix for the query string Probe the prefix index with the pivotal prefix of the query Probe the pivotal prefix index with the prefix of the query Verify the candidates and output results

Pivotal Prefix Selection Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have. min ???(?) ????? ?? ???????? ???? ?? ? For query string: ? ???(?) min ???(?) ????????? ?? ? For data string: ? ???(?)

Optimal Pivotal Prefix Selection Dynamic Programming: Object: Select m= +1 optimal pivotal q-grams from the first n=q +1 grams in the prefix Select m-1 optimal pivotal q-grams from the first n-1 q-grams in prefix Select as last pivotal q-gram

Optimal Pivotal Prefix Selection Dynamic Programming: Select m-1 optimal pivotal q-grams from the first n-2 q-grams Select as last pivotal q-gram

Optimal Pivotal Prefix Selection Dynamic Programming: Select m-1 optimal pivotal q-grams from the first m-1 q-grams Select as last pivotal q-gram Recursive formula: ? ?,? = min 1 ? ?( ???? ? ?? + ? ?,? 1) ???? ? ?? ????? ?? ???????? ???? ??? ????? ??? ????????? ??? ???? ??????

Filter-and-Verification Framework Query string s No Filter: Yes Verify: Index Results Signature(s) Signature(r) = ? alignment filter? If yes, ED(r,s) ? Dataset R Threshold Complexity Improvement: Improved from ?(min ? , ? ) to ?(? 2)

Alignment Filter Intuition of Alignment Filter: suppose in the best case we need erriedit operations to transform ??to a substring of r, then ED r,s > ?=1 +1???? +1????> ,ED r,s > If ?=1

Alignment Filter Substring edit distance (sed) ??? ??,? is the minimum edit distance between ??and any substring of r. Alignment filter: If ?=1 +1???(??,?) > , ?? ?,? >

Alignment Filter Accelerating Calculation: The computation complexity of sed(??, r) is O(q|r|). By position filter, ??can only align to a substring xiof r where |xi|<2 + ?. Thus if ?=1 The complexity reduced to O q . +1???(?? ,??) > , ED(?, ?)> . Complexity Improvement: Improved from ?(min ? , ? ) to ?(? 2)

Experiments Settings: C++, g++ 4.8.2 with -O3 flags 64bit Ubuntu Server 12.04 LTS version Intel Xeon E5-2650 2.00GHz processor and 16GB memory.

Evaluating Pivotal Prefix Filter Average Search Time Mismatch: From EDJoin CrossFiler: Cross Filter PivotalFilter: PivotalFilter CrossSelect: CrossFilter + Pivotal Prefix Selection PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Pivotal Prefix Filter Candidate Number Mismatch: From EDJoin CrossFiler: Cross Filter PivotalFilter: PivotalFilter CrossSelect: CrossFilter + Pivotal Prefix Selection PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Alignment Filter Average Search Time NoFilter: without any filter ContentFilter: From EDJoin AlignFilter: Alignment Filter

Evaluating Alignment Filter Candidate Number NoFilter: without any filter ContentFilter: From EDJoin AlignFilter: Alignment Filter Real: Number of results

Comparison with State-of-the-arts PivotalSearch: Our method Adaptive: [Wang2012] Flamingo: [Li2008] Qchunk: [Qin 2011]

Scalability

Conclusion Pivotal prefix filter Pivotal search algorithm Optimal pivotal prefix selection Alignment filter

Project hompage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html THANK YOU Q & A

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Motivation and Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Complexity Space Complexity: ?(? |?|) Time Complexity:

Pivotal Prefix Selection Existence of Pivotal Prefix: There must exist at least +1 disjoint q-grams in the prefix pre(r) for any string r Evaluating Different Pivotal Prefixes: The longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is. |?+[?]| min ???(?) For query string: ? ???(?) min ???(?) |?????????[?]| For data string: ? ???(?)

Complexity Space Complexity: Prefix Inverted Index Size: ? (? + 1) ? Pivotal Prefix Inverted Index Size: ? ( + 1) ? Query Time Complexity: Preprocess Query s: ? ? + ? ??? ? + ? Probing Inverted Indexes: ? ? ?? where ??is the average length of probed prefix inverted lists Verification Complexity: ? ? ? where c is the number of candidates and l is average string length

Complexity Space Complexity: Prefix Inverted Index Size: ? (? + 1) ? Pivotal Prefix Inverted Index Size: ? ( + 1) ? Query Time Complexity: Preprocess Query s: ? ? + ? ??? ? + ? Probing Inverted Indexes: ? ? ?? where ??is the average length of probed prefix inverted lists Verification Complexity: ? ? ? where c is the number of candidates and l is average string length

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf q(r) : The sorted q-gram set of string r Pre(r) g1 g2 g5 g6 g9 g10 g11 Pre( ) is the prefix of q( ) |Pre( )|= q +1 >g10>g10>g10>g10>g10>g10>g10 g3 g4 g7 g8 g11g12 g13 Pre(s) q(s): The sorted q-gram set of string s Prefix Filter: If pre(r) pre(s) = , ED(r,s) >

Alignment Filter non-consecutive errors: youtubecom yoytupecxm q=3, the 3 non-consecutive errors destroy 8 q-grams consecutive errors: youtubecom youtzpxcom q=3, the 3 consecutive errors only destroy 5 q-grams

Indexing Fix a global gram order =2 q=2 We use gram frequency ascending order im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 4 Global gram order

Prefix-Based Filtering Algorithm for String Similarity Search

Download Presentation

Presentation Transcript

Related

More Related Content