Efficient Suffix Tree Construction Algorithm

a space economical suffix tree construction n.w
1 / 37
Embed
Share

Explore the space-efficient suffix tree construction algorithm presented by Edward M. McCreight in 1976. Learn about the properties of suffix trees, definitions related to suffixes and prefixes, construction stages, and analysis of the brute-force search method. Discover how McCreight's algorithm enhances search performance by incorporating suffix links.

  • Suffix Tree
  • Algorithm
  • Space-Efficient
  • Edward M. McCreight
  • String Search

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. A Space-Economical Suffix Tree Construction Algorithm Edward M. McCreight Journal of the Association for Computing Machinery, Vol. 23, No. 2, April 1976, pp. 262-272. (1976) Presenter Jyun-Hao Lai Date 104 / 11 / 24

  2. Abstract A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching. This algorithm has the same asymptotic running time bound as previously published algorithms, but is more economical in space. Some implementation considerations are discussed, and new work on the modification of these search trees in response to incremental changes in the strings they index is presented.

  3. Suffix Tree S = bcabc$ Tree $ bcabc$S1 cabc$S2 abc$S3 bc$S4 c$S5 abc$ bc c abc$ $ $ abc$

  4. Suffix Tree Properties of a Suffix Tree Tree $ Each tree edge is labeled by a substring of S. abc$ Each internal node has at least 2 children. bc c There are n leaves. abc$ $ $ Sibling edges begin with different characters abc$

  5. Definitions Sufi: The suffix of S beginning at cgaracter position i. Headi: longest prefix of S(i, n) which is a prefix of S(j, n) for some j < i BCABC Taili: sufi- headi abc$ $ bc c Suf4 = BC Head4=BC Tail4= $ $ abc$ $ abc$

  6. Construction: Brute Force Stage i: insert S(i,n) Find extended locus of headiin Tree Si-1 If string is not headi, break edge and insert internal node Add a new node for taili, from locus of headi S= S=bcabc bcabc$ $ S4 insert bc$ abc$ cabc$ cabc$ abc$ bc Head = bc Tail = $ bcabc$ $ abc$

  7. Analysis Brute-force search: start from root and trace edges Worst-case: AAAAAAA$. |headi|=n-i for 1< i < n, n-i comparisons for each search (n2) McCreight suffix links improve performance

  8. Algorithm M Edward Edward Meyers scientist. Meyers McCreight McCreight is an American computer He co-invented the B-tree with Rudolf Bayer while at Boeing,and improved Weiner's algorithm to compute the suffix tree of a string.

  9. Suffix Links In the graph below, which is the suffix tree for ABABABC, the yellow circle is the root, the grey, blue and green ones are internal nodes, and the small black ones are leaves.

  10. Suffix Links About suffix links: If the string s leading up to some internal node X is longer than 1 character, the same string minus the first character (call this s-1) must be in the tree, too (it's a suffix tree, after all, so the suffix of any of its strings must be in the tree, too). Example Let S = ABAB, the string leading to the blue. Then after removing the first character, S-1 is BAB. And indeed that string is found in the tree, too. It leads to the green node

  11. Suffix Links The dotted lines indicate the suffix links ABAB -> BAB -> AB -> B (blue) (green) (gray1) (gray2)

  12. Another terminology Locus: Node k is called the locus k denotes uv. locus of the string uv, if the path from the root to Path(k) we denote the concatenation of the edge labels on the path from the root of T to the node k.

  13. Locus Path(k) = uv k=uv root u The locus of uv u v k uv

  14. Extended Locus The Extended Locus Extended Locus of a string u is the locus of the shortest extension of u, uw .s.t. uw is a node in T. Example: u=hello, w=p root The extended locus of u hell op uw

  15. Contracted Locus The Contracted Locus Contracted Locus of a string u is the locus of the longest prefix of u, x (x is possibly empty), s.t. x is a node in T. Example: u=hello, x=hell root The contracted locus of u hell op uw

  16. Lemma If headi-1 = xu for some character x and some string u then u is a prefix of headi.

  17. Algorithm M To build the suffix tree for ababc mcc inserts every step i the sufiinto tree Ti-1: Step1 ababc$ Step2 babc$ Step3 abc$ Step4 bc$ Step5 c$

  18. Algorithm M To do this we have to insert every step sufiwithout duplicating its prefix in the tree, so we need to find its longest prefix in the tree. Example: S = ababc Suf3=abc. Since we already have the word ab in the tree thus we need to start from there bulding our new suffix. Note that indeed ab=head3, taili=c. babc ababc c So what we do is finding the extended locus of headiin Ti-1and its incoming edge is split by a new node which spawns a new edge labeled taili.

  19. Algorithm M Notice that headiis the longest prefix of sufithat its extended locus exists within Ti-1. root babc ababc T2 The extended locus of head3 We have entered 2 suffixes by now.

  20. Algorithm M If Algorithm M could find the extended locus of headiin Ti-1in constant time, in average over all steps, then Algorithm M is linear in n. headi Ti-1 extended locus

  21. Algorithm M Properties P1: in Tievery internal node, except perhaps the locus of headi, has a valid suffix link. P2: in step i mcc visits the contracted locus of headiin Ti-1 . P2 yields that we can use the contracted locus of headi-1 to jump with the suffix link to some prefix of headi. P1 assure us that there is such suffix link. ( headi-1 suffix link some prefix of headi)

  22. Algorithm M Algorithm M SubStep A, B, C

  23. SubStep A headi-1 Identify 3 strings: xuw s.t. 1. headi-1= xuw 2. xu is the contracted locus of headi-1in Ti-2. 3. If the contracted locus of headi-1in Ti-2is the root then u= . 4. |x| 1. x= only if headi-1= .

  24. SubStep A Illustrating substep A in the ithstep: the move from Ti-1to Ti Contracted locus of xuw root root u u xu xu root c u xu w wy wv taili-1 y w Ti-2 Headi-1= xuw Ti-1 taili-1 Headi= uwv y Step i-1 Ti Step i

  25. SubStep A Our goal here is to go directly to the locus of u in the tree so that we could seach for w (substep B) and then for v (substep C). SubStep A u

  26. SubStep A Notice that: In the previous step headi-1was found. Since |x| 1 then by lemma : headi= uwv for some, yet to be discovered, string v. By P2, Algorithm M visited xu in the previous step (i-1), hence it can identify xu.

  27. SubStep A If u= then c root (suffix link root) else, c {suffix link of xu} (note that c=u) explanation: u thus by definition xu existed in Ti-2hence by P1: the internal node xu has a suffix link. By P2 we remember xu from step i-1 and we can now follow its suffix link.

  28. SubStep A uwv = headihence from the definition of head, the extended locus of uw exists in Ti-1. Now we can start going down the edges, from u to find the extended locus of uw. u headi

  29. 1 SubStepA suffix link suffix root suffix link

  30. SubStep B: Rescanning To rescan w: Find the edge that starts with the first character of w. Denote the edge s label z and the node it leads to f. If |w| > |z| then start a recursive rescan of w-z (or w|z|) from f. If |w| |z| , then w is a prefix of z , and we found the extended locus of uw. Construct a new node (if needed): uw . ( ) d uw

  31. SubStep B: Rescanning Illustrating substep B in the ithstep: rescanning substring w. root In this case uw already exists u root xu u c xu z c w f w w wv z f d y y v Ti-1 Ti Step i Headi-1= xuw Headi= uwv

  32. SubStep B:ReScanning root Make the suffix link of xuw point to d . Hence, we have defined a suffix link to the node constructed in step i-1. u xu c w w f d xuw v Ti Headi= uwv

  33. 2 SubStepB u w w Headi-1= xuw suffix link Headi-1 P1

  34. SubStep C - Scanning Scan the edges from d in order to find the extended locus of uwv. Since we don t know yet what is v we must scan each character in the path from d downward, comparing it to taili-1. The last node in this trek is the contracted locus of headiin Ti-1, which proves P2. When we reach the extended locus of uwv we construct the new node uwv, if needed. Construct the new leaf edge taili.

  35. SubStep C: Scanning root u Scanning for the requested v. xu w w Comparing each character of the downward path beginning at d to taili. When the comparison fails we have reached headi. d v t vt taili Ti Headi= uwv

  36. 3 SubStepC w w v v taili suffix(i)

  37. Analysis Using suffix links and SubStep B we can quickly find where to insert the next suffix in our current tree. We are not using more space than time, so the space usage is the same. By amortized analysis, the total running time becomes linear.

Related


More Related Content