On Genome Assembly

Slide Note

Genome assembly involves deciphering the complex puzzle of DNA sequences to reconstruct the complete genetic information of an organism. Techniques like genome sequencing and assembly examples are explored, along with strategies to characterize and select the most probable assemblies from a set of reads. The concept of overlap graphs and their role in identifying overlaps between reads for assembly are discussed.

kbra Follow

Uploaded on Mar 03, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

On Genome Assembly Abhiram Ranade IIT Bombay

Genome Constituent of living cells that determines hereditary characteristics a.k.a. DNA Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine ACCTGGA Human genome: 3 billion nucleobases Knowledge of sequence is very useful

Genome Sequencing Biochemical techniques can read genomes of length ~ 700 nucleobases Make many copies of the genome. Break the copies randomly into pieces of length ~ 700. Read the pieces. Try to infer what original genome must have been. Genome Assembly

Assembly Example Input pieces Reads : Assembly: Input pieces Reads : abcd, cdefghi, hijkl, hicd Assembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl abcd, cdefghi, hijkl abcdefghijkl

Strategy Characterize all possible assemblies of the given read set Assign a probability to each assembly Pick the assembly with the highest probability

All possible assemblies A = valid assembly of reads if each read appears in A at least once. Nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping faction Can we compute/represent the set {A | A is a valid assembly} Overlap/String Graph!

Overlap Graph: Intuition Vertices = Reads. Edge from read u to read v if read u, read v likely to overlap in assembly. e.g. abcd, cdef => abcdef Assembly = walk in the graph: will encourage overlaps

Overlap graph Vertices: reads + empty read Edges: (ri, rj) : if long suffix of ri = prefix of rj abcd cdefghi Long = ? Real genomes: 50? 100? Any value that indicates overlap is not coincidental Edge label: portion of rj not belonging to overlap. abcd (Long = 2) efghi cdefghi

Overlap graph, long=2 efghi jkl abcd cdefghi hijkl hicd

Overlap graph, long=2 efghi jkl abcd cdefghi hijkl hicd

Walk => Assembly Assembly = Walk in the overlap graph which Starts at , Ends at Passes through every vertex at least once. Passes through every edge at least once? Assembled sequence = concatenation of labels along the walk. Every read appears in the sequence Walk revisits : reconstruction is incomplete, in several pieces.

Assembly => Walk Input: Assembly A Output: Walk which will generate A Visit vertices in the order of appearance in A Overlap graph characterizes assemblies. Variations on graph also studied.

Approaches to assembly Occam s Razor: Most likely = Shortest Shortest walk that visits every vertex at least once: NP-hard Shortest walk that visits every edge at least once: Chinese Postman problem. Polytime. Pragmatic: Use some greedy approach to find above. Model probability more accurately

A Twist: pair constraints Sequencing process may give additional constraints: distance from ri to rj in assembly is about D Example: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl

Systematic estimation of probability of a given assembly

Algebraic representation of walks Walk is cyclic: number of times vertex entered = number of times it is exited. Walk = fluid flow Total fluid coming in = Total fluid going out Xij = fluid going from i to j. = number of times walk goes from i to j Formulate conditions on Xij and solve

Algebraic representation: Xij,j Xij = Number of times walk goes from i to j. j = Number of times walk goes over j Lij = Length of label of edge (i,j) Length of genome L = L may be known.

Maximum likelihood reconstruction (Medvedev-Brudno 08) Goal: Find assembly A most likely given the observations. maximize Pr(A | r1,r2, rn) =Pr(A, r1,r2, ,rn) / Pr(r1,r2, ,rn) =Pr(r1,r2, ,rn|A) * Pr(A) / Pr(r1,r2, ,rn) Standard assumption: Unconditional probability Pr(A) same for all A. Maximize Pr(r1,r2, ,rn|A) and output A that maximizes.

Computing Pr(r1,r2,,rn|A) A = abcdefghicdefghijkl r1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd Process of generating reads: Pick a random starting point. Pick a length at random Pr(r2) = ? = 2/Length * Pr(read length = 7) = 2/Length * Pr(read length = 7)

Computing Pr(r1,r2,,rn|A) Generative model: For i=1 to n Pick starting point for ri Pick length Li Probability of generating ri: = Number of times ri appears in A/Length of A * Probability of getting the correct length = i/L * Pr(Li)

Computing Pr(r1,r2,,rn|A) Pr = i i/L * Pr(Li) We want to pick A for which this probability is maximum Score(A) = i i/L * Pr(Li) Best A will have max value of i i/L So now we have a program

Finding the best assembly Maximize i i/L s.t. L known approximately. L = Lge Lg(1+ ) Lg, Lij : constants. Solve for Xij 0, j 0, Convex optimization

Concluding Remarks Experiments seem to indicate our approach works well. www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf Computationally intensive, but well founded. May not be useful for large genomes linear time algorithms only! How to handle pair constraints: important open problem. Graphs are everywhere!

On Genome Assembly

Download Presentation

Presentation Transcript

Related

More Related Content