Efficient Multi-word Parameterized Matching on Compressed Text

efficient multi word parameterized matching n.w
1 / 21
Embed
Share

Explore the concept of parameterized multi-word matching on compressed text, focusing on efficient algorithms and compressed pattern and text processing. Learn about the benefits of compressed parameterized matching and its applications in various fields like software maintenance and plagiarism detection.

  • Efficient Matching
  • Multi-word
  • Parameterized
  • Compressed Text
  • Algorithm

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Efficient Multi-word Parameterized Matching on Compressed Text Rajesh Prasad, Rama Garg IEEE 6th International Conference on Adaptive Science & Technology (ICAST), Lagos, Nigeria, Oct 29~ Oct 31, 2014 Speaker: Kuan-Chih Chen Date: 2018.06.26

  2. 2

  3. Abstract (1/2) Searching set of patterns ?1,?2,?3, ,??,? 1, inside body of a text T[1...n] is called multi-pattern matching problem. This matching is said to be parameterized match (p-match), if one can be transformed into the other via some bijective mapping. It is mainly used in software maintenance, plagiarism detection and detecting isomorphism in a graph. In the compressed parameterized matching problem, our task is to find all the parameterized occurrences of a pattern (set of patterns) in the compressed text, without decompressing it. Compressing the text before matching reduces the size and minimizes the matching time also. 3

  4. Abstract (2/2) In this paper, we develop an efficient algorithm for parameterized multi-word matching problem on the compressed text, where both patterns and text are compressed before actual matching is performed and pattern is treated as word. For compressing the pattern and text, we use efficient compression code: Word Based Tagged Code (WBTC) and bit-parallel algorithm is used for searching purpose. Experimental results show that our algorithm is up to three times faster than the search on the uncompressed text. 4

  5. Definition String Matching Problem Finding all the occurrences of a pattern P of length ? in a text T of length ? (? ?) T: ABACBBABA ? = 9 P: ABA ? = 3 Ans: True, at position 0 and 6 5

  6. Definition Multi-pattern Matching Problem Finding set of patterns ?1,?2, ,??? 1 of different length in a text T of length ? T: ABACBBABA P: ABAC, CBB Ans: True, at position 0(ABAC),3(CBB) 6

  7. Definition Multi-word Parameterized Matching Problem Finding set of patterns ?1,?2, ,??? 1 of different length(?1,?2, ,??,?? ? ) in a text T T is the set of words, patterns are also treated as words Parameterized Matching(p-match): DEDF ABAC Bijection mapping A D B E C F 7

  8. Definition Multi-word Parameterized Matching Problem Finding set of patterns ?1,?2, ,??? 1 of different length(?1,?2, ,??,?? ? ) in a text T T: ABA CBB ABA P: ABAC, CBB P : ABA, CBB Ans: ? 1,? 2can be found in T 8

  9. Compression - Word Based Tagged Code(WBTC) The code always ends with either 01 or 10 Add 00 and 11 as prefix repeatedly 01 10 0001 0010 1101 1110 000001 000010 T = AATT CCGG GATA CGAC AATT GATA GATA ACAC Sort by frequency: Encode: Index 0 1 2 3 4 4 4 Index 0 1 2 3 3 Word AATT CCGG GATA CGAC ACAC ACAC ACAC Word GATA AATT CCGG CGAC CGAC Frequency 2 1 3 1 1 1 1101 Frequency 3 2 1 1 0010 Index 0 1 2 Word GATA AATT CCGG Code 01 10 0001 9

  10. Predecessor String For String S = ?1?2, ,??, ??, Pred(S) =?1?2, ,??, ?? if ?? ?? = 0 ??= ? ? ( ? index,? < ?) S = A C C A T G A G S A 0 C 0 C 1 A 3 T 0 G 0 A 3 G 2 Pred(S) If word ?1 p-matches word ?2, then ???? ?1 = ???? ?2 Pred(ACCATGAG) = Pred(CAACTGCG) = 00130032 , ACCATGAG p-matches CAACTGCG 10

  11. Their algorithm Multi-word Parameterized Matching (MPM) algorithm Preprocessing Text Pred Compress Patterns Compress Searching Bit-parallel algorithm 11

  12. Preprocessing of Text ABAA BABAA BCBB ACBB DEDEB T: = {A, B, C, D, E} ABAA {BABA ABAA} BCBB ACBB {DEDE EDEB} Modified: Pred: 0021 0022 0021 0021 0021 0022 0020 Vocabulary Table: Word 0021 0022 Word 0021 0022 0020 0020 Index 9 10 Index 9 10 8 8 Frequency 5 2 1 0001 Codes 01 10 5? 18? 01 10 01 01 01 10 0001 01 (Compressed Text) T : 12

  13. Preprocessing of Text ABAA BABAA BCBB ACBB DEDEB T: = {A, B, C, D, E} ABAA {BABA ABAA} BCBB ACBB {DEDE EDEB} Modified: Pred: 0021 0022 0021 0021 0021 0022 0020 Sort by frequency: Encode: Vocabulary Table: Word 0021 0022 Word 0021 0022 0020 0020 Index 9 10 Index 9 10 8 8 Frequency 5 2 1 0001 Codes 01 10 5? 18? 01 10 01 01 01 10 0001 01 (Compressed Text) T : 14

  14. Preprocessing of Patterns ABAA, BABAA, BBAAB P: ABAA, BABA, BBAA Modified: 0021(9) 0022(10) 0101(17) Pred(index): index table word [?] Word 0021 0022 0020 Index 9 10 8 Codes 01 10 0001 15

  15. Build bit-table: The length of B is ? (max length of codes) Word ABAA Word 0021 0022 0020 Index 9 10 8 Pred 0021 Frequency 5 2 1 Codes (P) 01 B[Code] ?1[0]=1110 ?1[1]=1101 ?2[0]=1101 ?2[1]=1110 ?3[0]=1000 ?3[1]=0111 B[0]=1000 B[1]=0100 BABA 0022 10 ABAD? 0020 0001 Final Bit Table if P[i] = c, ?? bit of ?[c] = 0 (from right to left) ?1= 01 ?10 = 1110 ?3= 0001 ?30 = 1000 each bit uses AND operation to get final bits B[0] = 1110 & 1101 & 1000 = 1000 B[1] = 1101 & 1110 & 0111 = 0100 16

  16. Searching Bit-parallel algorithm: Vector D of length ? Each bit is set to 1 initially For each word of T Updating of D: D = (D << 1) | B[T[i]] Finally, D (m-1) = 0, P T (i-m) 01 10 01 01 01 10 0001 01 T: B[0]=1000 B[1]=0100 D = 1110 | B[0] = 1110 | 1000 = 1110 D = 1100 | B[1] = 1100 | 0100 = 1100 1st: 2nd: P(code:01, ABAA) T 17

  17. Searching Bit-parallel algorithm: Vector D of length ? Each bit is set to 1 initially For each word of T Updating of D: D = (D << 1) | B[T[i]] Finally, D (m-1) = 0, P T (i-m) 01 10 01 01 01 10 0001 01 T: B[0]=1000 B[1]=0100 D = 1110 | B[1] = 1110 | 0100 = 1110 D = 1100 | B[0] = 1100 | 1000 = 1100 1st: 2nd: P(code:10, BABA) T 18

  18. Searching Bit-parallel algorithm: Vector D of length ? Each bit is set to 1 initially For each word of T Updating of D: D = (D << 1) | B[T[i]] Finally, D (m-1) = 0, P T (i-m) 01 10 01 01 01 10 0001 01 T: B[0]=1000 B[1]=0100 D = 1110 | B[0] = 1110 | 1000 = 1110 D = 1000 | B[0] = 1000 | 1000 = 1000 1st: 3rd: D = 1100 | B[0] = 1100 | 1000 = 1100 D = 0000 | B[1] = 0000 | 0100 = 0100 2nd: 4th: 19

  19. Build bit-table: The length of B is ? (max length of codes) Word ABAA Word 0021 0022 0020 Index 9 10 8 Pred 0021 Frequency 5 2 1 Codes (P) 01 B[Code] ?1[0]=1110 ?1[1]=1101 ?2[0]=1101 ?2[1]=1110 B[0]=1100 B[1]=1100 BABA 0022 10 Final Bit Table each bit uses AND operation to get final bits B[0] = 1110 & 1101 = 1100 B[1] = 1101 & 1110 = 1100 T: 01 10 01 01 01 10 0001 01 D = 1110 | B[0] 1st: = 1110 | 1100 = 1110 D = 1000 | B[0] = 1000 | 1100 = 1100 3rd: D = 1100 | B[0] = 1100 | 1100 = 1100 D = 0000 | B[1] = 0000 | 1100 = 1100 2nd: 4th: 20

  20. Experimental results: Alphabet size 4 2 256 Text composition A, C, G, T 0, 1 size of file 16KB 12KB 4KB 21

  21. Experimental results: 22

Related


More Related Content