Efficient Read Alignment Algorithms

Slide Note

Given a long reference sequence and short reads, the goal is to find the best matching location for each read in the reference while minimizing the number of mismatches. Pre-processing the reference with suffix trees allows for quick identification of matching locations. Suffix links in suffix trees aid in handling mismatches efficiently, enabling continuation from where the process gets stuck. This approach significantly reduces the space required by suffix trees and enhances the indexing process.

bvaa Follow

Uploaded on Mar 05, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Read Alignment Algorithms www.strandls.com

The Problem Given a very long reference sequence of length n and given several short strings (reads) of length m each, m << n Find the best matching location for each read in the reference Where the best location is that which minimizes the number of mismatches We ignore insertions and deletions for the moment; those will come later Provided the number of mismatches is at most, say 5% of m 2 www.strandls.com

Indexing the Reference What if we do not allow any mismatches at all? Pre-process the reference sequence so Each query find the best matching location of a read can be identified in time proportional to m and independent of n The resulting data structure is called an index Suffix trees are one possible index A trie of all suffixes of the reference sequence, with a $ marker at the end 3 www.strandls.com

Suffix Trees The Reference T C C G A C G C G A T T A C G A C Query C C G 4 www.strandls.com

Space Required by Suffix Trees n-1 internal nodes plus n leaves, so 2n-1 nodes 2n-2 tree pointers + n pointers into the reference So ~3n pointers 36GB! Can we make this smaller? 5 www.strandls.com

Indexing the Reference with Mismatches What if we allow mismatches? So we put the query through the suffix tree but get struck can t proceed further Next, resume by dropping the first character, but without redoing the work already done How? 6 www.strandls.com

Suffix Links in Suffix Trees The Reference T C C G A C G C G A T T A C G A C Query G G C 7 www.strandls.com

Indexing with Mismatches (Contd) For an internal node A with string x leading down from the root to that node and branching into xa and xb Let x=cy Then there exists a node B with string y leading down from the root to that node The suffix link from A leads to this node B Such a node exists So if you get stuck, you follow the suffix link in constant time and continue from where you left off, to find the longest perfect-match substring starting at each position in the read Or alternatively, find all substrings of a certain minimum length that match Check explicitly for the number of mismatches at each of these locations 8 www.strandls.com

Space Required by Suffix Trees & Links n-1 internal nodes plus n leaves plus n-1 suffix links, so 3n-1 nodes 3n-3 tree pointers + n pointers into the reference So ~4n pointers 48GB! Can we make this smaller? Can we fit this tree into an array? 9 www.strandls.com

A Succinct Data Structure The Reference C G A C $ All circular shifts, sorted lexicographically A C C G $ C G $ A C $ A C C G C C G $ A G $ A C C Burrows- Wheeler Transform Store only the first and last columns and the links back to the reference Used in bzip 10 www.strandls.com

A Succinct Data Structure The Reference C G A C $ 2 0 3 1 4 A C C G $ C G $ A C $ A C C G C C G $ A G $ A C C A G C $ G $ The reference can be reconstructed from the first and last columns Claim: The ith G in the first column corresponds to the ith G in the last column! Likewise for A,C,G,T. 11 www.strandls.com

Proof of Claim yG<xG if and only if Gy<Gx; That s it! So given a G in the first column, say corresponding to the string Gx It s rank r is trivial to find because the first column is sorted, just store counts for all 4 characters We need to locate the corresponding G in the last column In other words, the index of the string xG in the table Which is the rth G in the last column [The Select Query] So given a G in the last column, say corresponding to the string xG Find it s rank ramong G s in the last column [The Rank Query] We need to locate the corresponding G in the first column In other words, the index of the string Gx in the table Which is the rth G in the first column, trivial to find 12 www.strandls.com

Select and Rank Queries Given a binary array SELECT: Given index i, find the ith 1 RANK: Given index i, find how many 1s precede this location Use a separate array for each of the 4 characters RANK is easy, just keeps counts at milestones and answer queries by traversing to the nearest milestone in time 4n/ bytes of storage, O( ) time SELECT needs a bit more, keep counts for -rank milestones Go to the nearest rank milestone and traverse from there May need to traverse quite a bit though So need an extra data structure to get to the next 1, which you store at milestones So 8n/ bits storage, O( ) time Of course we need the 4 n-bit binary arrays as well So 4n bits + 48n/ bytes and O( ) time 13 www.strandls.com

String Matching using Rank-Selects Given a string Gx Assume inductively we have the band B of indices in the table corresponding to suffixes that begin with x We want the band B that begins with Gx Take the band B, take the last column, identify the rank of the first and last G in the last column, find their corresponding first column indices; that s the band All doable using RANK alone At the end you have the band containing all suffixes which begin with Gx Unless of course, there are none, in which case the band will vanish at some point We can use this to find matches for say all length 16 substrings of a read So 4n+48n/ bytes and O(m ) time per read 14 www.strandls.com

Indentifying Indices in the Reference We still have to go from a band in the table to indices in the reference 4n bits if we store explicitly We can use the same trick, store explicitly at milestones Then, if we have index i with string Gx, then we can go to index i+1 with string xG and so on till we get to a milestone 4n/ bytes storage Time per index is O( ) 15 www.strandls.com

Sorting Circular Shifts It remains to describe the construction of the table in the first place Given a string S=x0 x1 x2 .$ Consider string S =(x0 x1 x2) (x1 x2 x3) (x3 x4 x5) (x4 x5 x6) . Note (x2 x3 x4) and other triplets starting at 2 mod 3 are missing Rename S so identical tuples get the same number and distinct tuples get different numbers Recursively sort S How does x0 x1 x2 compare to x1 x2 x3 ? Already available from recursion How does x0 x1 x2 compare to x2 x3 x4 ? Compare x0 , x2 and then x1 x2 , x3 x4 We have info for comparing all pairs of suffixes! Sort the 2 mod 3 suffixes and then merge them in Time T(n)= 2T(n/3)+O(n) 16 www.strandls.com

A Generalization: Difference Covers Set D of indices mod v v 2v 3v Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D Time taken to create this string is O(n |D|) This string has size |D|n/v 17 www.strandls.com

A Generalization: Difference Covers x<v x<v For any 2 indices i and j i-j mod v is the distance between some two beads in D D is a Difference Cover if distances between beads in D generate 0,1 ,v-1 www.strandls.com