
Locally Consistent Decomposition for Edit Distance Sketching
Explore the concepts of locally consistent decomposition of strings in the context of edit distance sketching. Learn about Hamming distance, edit distance, longest common subsequence, and variants of edit distance algorithms. Dive into computing edit distance using various approaches and algorithms like Wagner-Fischer, Masek-Paterson, and more.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Locally consistent decomposition of strings with applications to edit distance sketching Michal Kouck Charles U./Rutgers U. Based on joint work with: Sudatta Bhattacharya and Mike Saks
Comparing two strings ? a b c z d e f i j k l m n ? a b c d e f z i j k l m n Questions: Are they equal? Are they similar? How similar are they?
Hamming distance ? a b c z d e f i j k l m n ? a b c d e f z i j k l m n Hamming distance ???(?,?): the number of bit flips that take ? into ?.
Hamming distance ? a b a b a b a b a b a b a ? b a b a b a b a b a b a b ? and ? look similar but ???(?,?) = ?
Edit distance ? a b c z d e f w h i k l m ? a b c d e f g h i j k l m Edit distance ??(?,?): that transform ? into ?. the number of 1) bit flips/substitutions, 2) insertions, and 3) deletions
Longest common subsequence ? a b c z d e f w h i k l m ? a b c d e f g h i j k l m Longest common subsequence = maximum non-crossing matching. dual problem to edit distance.
Variants of edit distance Levenshtein distance: vanilla edit distance. Ulam distance: large alphabet, each symbol appears at most once. Edit distance with moves: additional operation block move.
Computing edit distance Wagner-Fischer 74, Masek-Paterson 80, Grabowski 16 ?(?2 / log2?) Ukkonen 85 Myers 86, Landau-Vishkin 88, Landau-Myers-Schmidt 98 ?(??) ?(? + ?2) and many others Backurs-Indyk 15 (? + ?2) (if SETH is true) ? = ??(?,?)
Computing edit distance ? a b c z d e f w h i k l m ? a b c d e f g h i j k l m Algorithm: Try all possible splits of ? and ?, and recurse on sub-problems. ? ?2 algorithm
Fine-grained complexity Backurs-Indyk 15: An algorithm for edit distance in time ?(?2 ?) implies an algorithm for SAT in time 2(1 ?)?. (contradicting Strong Exponential Time Hypothesis (SETH).) formula instance of edit distance (?,?) 2?/2 length ? variables
Fine-grained complexity Backurs-Indyk 15: An algorithm for edit distance in time ?(?2 ?) implies an algorithm for SAT in time 2(1 ?)?. (contradicting Strong Exponential Time Hypothesis (SETH).) Abboud-Hansen-Vassilevska Williams-Williams'16, Abboud-Backurs-Vassilevska Williams'15, Bringmann-K nnemann'15: ?(?2) algorithms for edit distance imply circuit lower bounds.
Approximating edit distance Landau-Myers-Schmidt 98, ..., Andoni-Krauthgamer-Onak 10, , Chakraborty-Das- Goldenberg-K.-Saks 18, ? ?1+?-time ?(1)-approximation Andoni-Nosatzki 20 Abboud-Backurs 17: (1 + 1/???? log)-inapprox. in time ?2 ?
Finding large alignment [Saha 14] a b c z d e f w h i k l m ? ? a b c d e f g h i j k l m With high probability alignment of size between ? 20?2 and ? ?. Time ? ? .
Finding large alignment [K.-Saks 23] a b c z d e f w h i k l m ? ? a b c d e f g h i j k l m Deterministic deletion rule for Saha s algorithm: Delete: 1x ?, 3x ?, 5x ?, 7x ?, 9x ?, 11x ?, Alignment of size between ? ?2 and ? ?.
Edit distance of multiple strings ? ? ? ? ?,?, ,? ? ? ? ??(?,?) Want: Compute edit distance for each pair of strings.
Preprocessing ? digest ED(?,?) ? digest ED(?,?) ? digest Goldenberg-Rubinstein-Saha 20: Preprocessing time per string Exact edit computation per pair of strings ?(?) ? ?2
Edit distance sketches ? sketch? ??(?,?) ? sketch? Goal: Minimize the sketch size.
Edit distance sketches ? sketch? ??(?,?) ? sketch? Question: How big does the sketch has to be? Variants: Threshold ? edit distance vs. approximate edit distance?
Threshold ? edit distance sketches ? sketch? ??(?,?) ? sketch? Sketch for threshold ? edit distance of size ?(?2log ? log log ?).
Threshold ? Hamming distance sketches Sketch ?,? ? ???(?) 0? ??? ? ???(?) ? ???(?) Error correct ? ? ,?(?) ? ? ??? ? ???(?) ? systematic linear error correcting code for correcting up-to ? errors. Sketch for threshold ? Hamming distance of size ?(?).
Threshold ? edit distance sketches ? sketch? ??(?,?) ? sketch? sketch size ?(?8) ?(?3) Belazzougui-Zhang 16: Jin-Nelson-Wu 21: Kociumaka-Porat-Starikovskaya 21: ?(?2) Bhattacharya-K. 23: (?) ?(?2) ?(?2log ? log log ?) K.-Saks 24:
Embedding edit distance into Hamming distance ? ? a b c z d e C D F B J C K ? ? a b c d e f C D I B J C K ??(?,?) ???(?(?),?(?)) Goal: Minimize the distortion of the embedding.
Embedding edit into 1 distance Embedding distortion ?(?2/3) ?(log ? log ?) 2?( log ? log log ? ) ?(?) Bar-Yossef-Jayram-Krauthgamer-Kumar 04 Cormode-Muthukrishnan 02 (with moves) Ostrovsky-Rabani 07 Chakraborty-Goldenberg-K. 16 (random) Lower bounds 3/2 ( log ?1/2 ?(1)) (log ?) Andoni-Deza-Gupta-Indyk-Raskhodnikova 03 Knot-Naor 05 Krauthgamer-Rabani 09
Algorithm for embedding ?? to ??? Chakraborty-Goldenberg-K. 16: Input:? 0,1? and random bits ? 0,1?. Interpret ? as hash functions 1, 2, 3?:{0,1} {0,1}. ? 1 ? For ? 1 to 3? do 1. If ? ? then ? 0 1 0 0 1 0 1 Output ?? ? ? + ?(??) Else Output 0 ?(?) 0 0 1 1 1 0 0 0 1 0 0 1 1 2. ? ?(??)
Why it works ? a b c z d e f g h i k l m a a b c c z z d d e f l m a a b c c d e e f f f l m ? a b c d e f g h i j k l m
Threshold ? edit distance sketches ? sketch? ??(?,?) ? sketch? Sketch for threshold ? edit distance of size ?(?2log ? log log ?).
Workshop DIMACS Workshop on Efficient Algorithms for High Dimensional Metrics: New Tools Dates: May 6-9, 2024 Organizers: Alexandr Andoni, Michal Kouck , Barna Saha, Mike Saks
Locally consistent decomposition of strings with Sudatta Bhattacharya
String decomposition ? a b c d e e f g h i k l m f g h i k l m ? a b c d e e f g h i k l m f g h i j k l m Blocks of ? and ? match each other except for blocks with edits. Blocks are small.
String decomposition ? ??, ? ??, ? ? ? ? ??? ?1 ?2 ?? 1 1. Each block has grammar of size ?(?). ? ??? ? ? ?? 1 ?1 ?2 2. For a pair of strings ?, ?: ? ? ? ? ?? ?? 1 ?1 ?2 ? ?) ?,???? ?? ?? ?,? = ??(???? ?? ?=1
Warmup: Random strings ? 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 0 1 ? 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 ? chosen at random, and ? chosen so that ?? ?,? ?. random function : 0,1log ? {0, ,8?log?} split whenever ?????? = 0.
Warmup: Random strings ? 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 0 1 ? 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 ? chosen at random, and ? chosen so that ?? ?,? ?. random function : 0,1log ? {0, ,8?log?} split 011 = 0.
Warmup: Random strings ? 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 0 1 ? 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 ? chosen at random, and ? chosen so that ?? ?,? ?. random function : 0,1log ? {0, ,8?log?} split 011 = 0.
Warmup: Random strings ? 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 0 1 ? 0 1 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 random function : 0,1log ? {0, ,8?log?} Pr ?????? ???? ????????? 4? log ? 8? log ?=1 2 expected size of each block 8?log?.
Non-random strings ? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 random function : 0,1log ? {0, ,8?log?} split whenever ?????? = 0. low entropy ? no split.
Compression ? 0 0 0 0 0 0 0 0 1 2 0 2 0 1 2 1 0 0 0 0 08 04 ?1,2 ?2,0 ?1,2 0 1 1. Compress repetitions 2. Compress singletons in pairs pairs/triples chosen using Cole-Vishkin. working alphabet
Locally consistent parsing ? a b c d e l m h i j k ? b c d e h i j k l m Partition ? and ? into pairs/triples so that stretches of symbols that look the same are partitioned the same way.
Locally consistent parsing ? a b c d e l m h i j k ? b c d e h i j k l m Partition ? and ? into pairs/triples so that stretches of symbols that look the same are partitioned the same way.
Color reduction procedure [Cole-Vishkin84] ? = ? a b c d e l m h i j k ? 1 2 1 3 2 2 1 2 3 1 3 Input: Proper coloring ?: 1, ,? {1, ,?} Output: Proper coloring ?: 1, ,? 1,2,3 such that ? ? depends only on ? ? ? , ,?(? + ?), where ? = log ?, and 1 ? ? ,? ? + 1 ,? ? + 2 , for all ? ? 2. Start a new pair/triple at each position ? where ? ? = 1.
Compression ? 0 0 0 0 0 0 0 0 1 2 0 2 0 1 2 1 0 0 0 0 08 04 ?1,2 ?2,0 ?1,2 0 1 1. Compress repetitions 2. Compress singletons in pairs pairs/triples chosen using Cole-Vishkin. working alphabet
String decomposition algorithm Repeat: Split Compress until each block of size 2 : 2 {0, ,100?log?} c: 2 Compression builds a grammar for each block Decomposition of ? into grammars, each of size ? ? .
Applications Edit distance sketch of size ? ?2. ? ??? ? ? ?? 1 ?1 ?2 ? ? ? ? ?? ?? 1 ?1 ?2 Apply sketch for Hamming distance of size ?(?) [Clifford-Kociumaka-Porat 19] ? = ? ?2
Threshold ? edit distance sketch of size ?(?2log ? log log ?) with Mike Saks
Tree of grammars grammar size ? 1 1 1 ??1 ? ?1 ?2 ?? 1 2 2 2 2 2 ??2 2 2 2 ?/2 ?2 ?1 ?3 ?4 ?5 ?? 3 ?? 2 ?? 1 ?/4 ?(1)
Tree of grammars grammar size ? 1 1 1 ??1 ? ?1 ?2 ?? 1 2 2 2 2 2 ??2 2 2 2 ?/2 ?2 ?1 ?3 ?4 ?5 ?? 3 ?? 2 ?? 1 ?/4 ?(1)
Encoding the grammars Grammar: {? ??, ? ??, } 3 Encode each grammar by 0-1 vector of length | 3|. 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ?(?). ??(?,?) ??? ???? ? ,???? ? Watermarking a grammar ? multiplying it by ??.
Fingerprints Karp-Rabin fingerprint ?: 0,1 N ?,?? = ? ? ? = ? ? ? ? ? ? ?(?) Ostrovsky-Rabani fingerprint ?: 0,1 N ?,??? ?,? ? ?? ?,? > ?2log ? log log ? ? ? ?(?) ? ? = ? ? 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Tree of grammars grammar size ? 1 1 1 ??1 ? ?1 ?2 ?? 1 2 2 2 2 2 ??2 2 2 2 ?/2 ?2 ?1 ?3 ?4 ?5 ?? 3 ?? 2 ?? 1 ?/4 ?(1)
Hierarchical mismatch recovery scheme capacity Load ?= min(??, ?) 0 ?1 ?2 00 01 Load of leaves - # of differences ? 000 001 010 011 ?? ? Want: Sketch ? and ? to recover differences in leaves that do not pass via 1/?- overloaded nodes on the way to root.
Tree of grammars grammar size ? 1 1 1 ??1 ? ?1 ?2 ?? 1 2 2 2 2 2 ??2 2 2 2 ?/2 ?2 ?1 ?3 ?4 ?5 ?? 3 ?? 2 ?? 1 ?/4 ?(1)