String Matching Algorithms and Cross-Correlation Techniques
Efficient algorithms like Knuth-Morris-Pratt and Boyer-Moore for string matching. Delve into more complex string matching problems like mismatches and wildcards. Understand cross-correlation methods without reversal and with unequal lengths. Discover how to compute correlations of vectors and analyze cross-correlation patterns.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Algorithms in Action Fast Fourier Transform Haim Kaplan, Uri Zwick Tel Aviv University March 2016 Last updated: March 28, 2017 1
String Matching abraabracadabracadabraabara abracadabra abracadabra Given a text of length ? and a pattern of length ?, find all occurrences of the pattern in the text. The na ve algorithm runs in ? ?? time. Several classical algorithms run in ? ? + ? time. [Knuth-Morris-Pratt (1977)] [Boyer-Moore (1977)] 2
More String Matching Problems abraabracadabracadabraabara abracadabra abracadabra Count the number of matches/mismatches in each alignment of the pattern with the text. (Find all aligments with at most ? mismatches.) Allow a wildcard ( don t care ) ( ) that match any (single) symbol in the pattern and/or text. Traditional string matching techniques are not so efficient for these extensions. 3
(Cross-)Correlation ?0 ?1 ?2 ?3 ?0 ?1 ?2 ?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 ?1= ?1?0+ ?2?1+ ?3?2 ?2= ?2?0+ ?3?1 ?3= ?3?0 4
(Cross-)Correlation A convolution without the initial reversal, with a shift of indices. ??+???= ? ?? ??= ???? ?= ?+? 1 ? ? ? = (? 1), ,? 1. The correlation of two vectors of length ? can be computed in ? ?log? time.
(Cross-)Correlation (unequal lengths) ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 6
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 7
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 8
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 9
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 ?1= ?1?0+ ?2?1+ ?3?2+ ?4?3 10
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 ?1= ?1?0+ ?2?1+ ?3?2+ ?4?3 ?2= ?2?0+ ?3?1+ ?4?2+ ?5?3 ?3= ?3?0+ ?4?1+ ?5?2 11
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 ?1= ?1?0+ ?2?1+ ?3?2+ ?4?3 ?2= ?2?0+ ?3?1+ ?4?2+ ?5?3 ?3= ?3?0+ ?4?1+ ?5?2 ?4= ?4?0+ ?5?1 12
(Cross-)Correlation ?0?1?2?3?4?5 ?0?1?2?3 ? 3= ?0?3 ? 2= ?0?2+ ?1?3 ? 1= ?0?1+ ?1?2+ ?2?3 ?0= ?0?0+ ?1?1+ ?2?2+ ?3?3 ?1= ?1?0+ ?2?1+ ?3?2+ ?4?3 ?2= ?2?0+ ?3?1+ ?4?2+ ?5?3 ?3= ?3?0+ ?4?1+ ?5?2 ?4= ?4?0+ ?5?1 ?5= ?5?0 13
(Cross-)Correlation ??+???= ? ?? ??= ???? ?= ?+? 1 ? ? If ? is of length ? and ? of length ?, where ? ?, then ? = (? 1), ,? 1. Sometimes, only the values ? = 0, ,? ?, corresponding to a full overlap of ? with a shift of ?, are of interest. Exercise: The correlation of two vectors of length ? and ?, where ? ?, can be computed in ? ?log? time.
Counting mismatches [Fischer-Paterson (1974)] Let be the alphabet of the pattern and text. We may assume that ? + 1. (Why?) For every ? create two Boolean strings: ??? = 1 iff ? ? = ? ??? = 1 iff ? ? ? Correlation of ??and ??counts mismatches involving ?.
Counting mismatches abraabracadabracadabraabara abracadabra 011001101010110101011001010 10010101001 16
Counting mismatches abraabracadabracadabraabara abracadabra 011001101010110101011001010 10010101001 abraabracadabracadabraabara abracadabra 011001101010110101011001010 10010101001 17
Counting mismatches Let be the alphabet of the pattern and text. We may assume that ? + 1. (Why?) For every ? create two Boolean strings: ??? = 1 iff ? ? = ? ??? = 1 iff ? ? ? Correlation of ??and ??counts mismatches involving ?. Summing over all ? we get the total no. of mismatches. Complexity: ?( ?log?) word operations. (Each word assumed to hold log? bits.) Fast only if is small.
Counting mismatches with wildcards [Fischer-Paterson (1974)] For every ? create two Boolean strings: ??? = 1 iff ? ? = ? ??? = 1 iff ? ? ? and ? ? Complexity: ?( ?log?) word operations. 19
Counting mismatches with wildcards abraabraca*abracadabraabara abracada*ra 011001101000110101011001010 10010101001 abraabra*adabracadabraabara abracada*ra 011001100010110101011001010 10010101001 20
Counting mismatches with wildcards If we only want to find exact matches, replace each character ? by a specific log2| | bit string 21
Counting mismatches with wildcards a b r a b r a c a 001 010 011 001 010 011 001 100 001 Count mismatches of the binary strings as before (2 convolutions) A result of 0 corresponds to a match Complexity drops to ?(log ?log?). Can we get rid of the dependence on | | ? 22
?2-matching [Lipsky-Porat (2011)] Standard string matching uses the Hamming distance. Two characters either match or they do not. ? is not closer to ? than to ?. Suppose that each character is a real number. We want to find approximate matches. For each ? = 0,1, ,? ? we want to compute ? 1 2 ??= ?? ??+? ?=0 2 ? 1?? ?? ?2-distance: ? ?2= ?=0 23
?2-matching [Lipsky-Porat (2011)] ? 1 ? 1 ? 1 ? 1 2= 2 2 2 ?? ??+? ?? ????+?+ ??+? ?=0 ?=0 ?=0 ?=0 Constant. ?(?) time. Correlation. ? ?log? time. Easy in ? ? time. ?2-matching can be computed in ?(?log?) time. 24
Exact matches with wildcards [Clifford-Clifford (2007)] Replace each character by a positive integer. Replace the wildcard by 0. For each ? = 0,1, ,? ? compute ? 1 2 ??= ????+??? ??+? ?=0 There is an exact match at position ? iff ??= 0. 25
Exact matches with wildcards [Clifford-Clifford (2007)] ? 1 2 ??= ????+??? ??+? ?=0 ? 1 ? 1 ? 1 3??+? 2 3 2??+? 2 = ?? ?? + ????+? ?=0 ?=0 ?=0 Compute three correlations of appropriate sequences in ? ?log? time. Running time is independent of | | ! Assuming that each character fits in an log? -bit word and that operations on such words takes constant time.