
Understanding Genome Sequencing and Assembly Methods
Explore different methods of genome sequencing and assembly such as shotgun sequencing, clone-by-clone sequencing, and whole genome shotgun approach. Learn how researchers break down the genome, sequence its parts, and put them back together to understand genetic information. Discover the significance of library creation, automated detectors, and fluorescent labeling in the process.
Presentation Transcript
CSE182-L16 LW statistics/Assembly
Silly Quiz: Who are these people, and what is the occasion?
Sequencing: A break at T is shown here. Measuring the fragment lengths using electrophoresis allows us to get the position of each T. The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides. May 25 Bafna
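The idea of reading a sequence off fragment lengths can be sketched in a few lines of Python. This is only an illustration: the function name and the input layout (a mapping from each base to the lengths of the fragments that terminate at that base) are assumptions, not something the slides specify.

```python
def read_from_ladder(fragment_lengths):
    """Reconstruct a read from terminated-fragment lengths.

    fragment_lengths maps a base ('A', 'C', 'G', 'T') to the lengths of the
    fragments that terminate at that base (hypothetical input format).
    """
    # A fragment of length n ending in base b means position n of the read is b.
    calls = sorted(
        (length, base)
        for base, lengths in fragment_lengths.items()
        for length in lengths
    )
    return "".join(base for _, base in calls)


ladder = {"A": [1, 5], "C": [2], "G": [3], "T": [4, 6]}
print(read_from_ladder(ladder))
```

Sorting by length mirrors what electrophoresis does physically: shorter fragments migrate further, so the order of the bands gives the order of the bases.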
Automated detectors read the terminating bases. The signal decays after 1000 bases.
Sequencing Genomes: Clone by Clone. Clones are constructed to span the entire length of the genome. These clones are ordered and oriented correctly (mapping). Each clone is sequenced individually.
Library: Create vectors of the sequence and introduce them into bacteria. As the bacteria multiply, you will have many copies of the same clone.
Whole Genome Shotgun: Break up the entire genome into pieces. Sequence the ends, and assemble using a computer. LW statistics and repeats argue against the success of such an approach. Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together.
Shotgun Sequencing: Shotgun sequencing of clones was considered viable for small genomes. However, researchers in 1999 proposed shotgunning the entire genome.
Massively parallel sequencing: Sanger sequencing allows us to sequence <=1000 bp in one lane, up to 96 lanes, in one run. Today, we can sequence many Mbp in a single run.
Questions: Algorithmic: How do you put the genome back together from the pieces? Statistical: How many pieces do you need to sequence, etc.? The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.
Lander Waterman Statistics: The fragments fall randomly on the genome. Overlapping fragments form islands of contiguous sequence. Ideally, we want one island for each chromosome. How many fragments should we sequence?
Lander Waterman Statistics: $G$ = genome length; $L$ = fragment length; $N$ = number of fragments; $T$ = required overlap; $c$ = coverage $= LN/G$; $\alpha = N/G$; $\theta = T/L$; $\sigma = 1 - \theta$.
LW statistics: questions. As the coverage $c$ increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island of overlapping contigs. Q1: What is the expected number of islands? The number should increase at first, and gradually decrease.
Analysis: Expected Number of Islands. Computing the expected # islands: let $X_i = 1$ if an island ends at position $i$, and $X_i = 0$ otherwise. Number of islands $= \sum_i X_i$. Expected # islands $= E\left(\sum_i X_i\right) = \sum_i E(X_i)$.
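Prob. of an island ending at $i$: $E(X_i)$ = Prob(island ends at pos. $i$) = Prob(a clone began at position $i-L+1$ AND no clone began in the next $L-T$ positions) $= \alpha (1-\alpha)^{L-T} \approx \alpha e^{-c\sigma}$, because $(1-\alpha)^{L-T} \approx e^{-\alpha(L-T)}$ and $\alpha(L-T) = (N/G)\,L\,(1 - T/L) = c\sigma$. Expected # islands $= \sum_i E(X_i) \approx G \alpha e^{-c\sigma} = N e^{-c\sigma}$.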
Computing # islands: As the coverage $c$ increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. Q1: What is the expected number of islands? Ans: $N e^{-c\sigma}$. The number increases at first, and gradually decreases.
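To see the rise and fall numerically, here is a minimal sketch that evaluates the island formula at several coverages. The function name is hypothetical, and the values of G, L and T are assumed example values (G and L match the human-scale figures used later in the lecture).

```python
import math


def expected_islands(G, L, T, c):
    """Lander-Waterman expected number of islands, N * exp(-c * sigma)."""
    N = c * G / L          # number of fragments implied by coverage c = L*N/G
    sigma = 1 - T / L      # sigma = 1 - theta, with theta = T/L
    return N * math.exp(-c * sigma)


# Assumed example values: 3000 Mb genome, 500 bp fragments, 50 bp overlap.
G, L, T = 3_000_000_000, 500, 50
for c in (0.25, 0.5, 1, 2, 4, 8):
    print(f"c = {c:<4}  expected islands = {expected_islands(G, L, T, c):,.0f}")
```

Since N = cG/L, the expression is (G/L) c exp(-c sigma), which peaks at c = 1/sigma and falls off after that.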
Expected # of clones in an island $= e^{c\sigma}$. Q: How? Why do we care? Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.
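One possible reading of the remark about genome length, offered as an interpretation and not as the lecture's stated procedure: treat the observed average number of clones per island as an estimate of exp(c sigma), solve for c, and then use c = LN/G to get G. The function and parameter names below are hypothetical.

```python
import math


def estimate_genome_length(N, L, T, num_islands):
    """Estimate G from an observed island count, assuming LW statistics hold."""
    sigma = 1 - T / L
    clones_per_island = N / num_islands       # observed estimate of exp(c * sigma)
    c = math.log(clones_per_island) / sigma   # invert exp(c * sigma)
    return L * N / c                          # rearranged from c = L*N/G
```

For example, 4.8e7 fragments of 500 bp with a 50 bp overlap threshold that fall into about 36,000 islands would suggest a genome of roughly 3e9 bp under this model.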
Problem 1: size of contigs. Islands might simply be too small in length. $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, $c = 8$. #Islands $= N e^{-c\sigma}$ = 36K. Size of an island = 54K. Not enough to make it an acceptable assembly! PLUS, there is the problem of repeats, chimerism, etc.
Assembly Basics: Three main components: overlap, layout, consensus.
Overlap: Given a pair of fragments s1 and s2, do they belong together? Yes, if a prefix of s2 matches a suffix of s1. How would you compute such a match?
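Overlap: S[i,j] = optimum score of an alignment of s1[1..i] against a substring of s2 that starts anywhere but ends at j, i.e. s2[*..j]. The best prefix-suffix alignment is given by $\max_i \{ S[i,n] \}$, where n is the length of s2; as defined, this pairs a prefix of s1 with a suffix of s2.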
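The recurrence can be sketched as a small dynamic program. This is an illustrative implementation under assumed scoring parameters (match, mismatch and gap values are not given in the slides), and the function name is hypothetical.

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, i) for the best alignment of prefix s1[:i] with a suffix of s2."""
    m, n = len(s1), len(s2)
    # S[i][j]: best score aligning s1[:i] against a substring of s2 ending at j.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap  # s1[:i] aligned entirely against gaps
    # Row 0 stays 0: the substring of s2 may start anywhere at no cost.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = S[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # The alignment must reach the end of s2, so scan the last column.
    best_i = max(range(m + 1), key=lambda i: S[i][n])
    return S[best_i][n], best_i


print(best_prefix_suffix_overlap("ACGTACGGGG", "TTTTACGTAC"))
```

The free first row is what lets the match start anywhere in s2, and restricting the answer to column n is what forces it to run to the end of s2.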
Overlap Detection: Compute the best prefix-suffix alignments between each pair of fragments. Keep the high-scoring ones as evidence of true overlap. What is the problem?
Overlap detection problem: Consider the number of fragments. The LW statistics say that we need good coverage (c = 8, 10) to get most of the base pairs. G = 3000 Mb, L = 500. Coverage LN/G = 10, so $N = 10 \times 3 \times 10^{9} / 500 = 6 \times 10^{7}$. Number of comparisons needed $= 3.6 \times 10^{15}$. Number of alignments per minute = 6. Number of compute nodes = 100. Time needed (number of years) $= 3.6 \times 10^{15} / (6 \times 60 \times 24 \times 365 \times 100) \approx 11$M. Not good! (Only a small fraction are true overlaps.)
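The arithmetic on this slide can be reproduced directly. The variable names are illustrative; the figures are the ones quoted on the slide.

```python
G = 3_000_000_000          # genome length (3000 Mb)
L = 500                    # read length
coverage = 10

N = coverage * G // L      # number of reads
comparisons = N * N        # all-against-all pairwise alignments

per_minute, nodes = 6, 100           # alignment rate per node and node count
minutes_per_year = 60 * 24 * 365
years = comparisons / (per_minute * minutes_per_year * nodes)

print(f"N = {N:.1e}, comparisons = {comparisons:.1e}, years = {years:.2e}")
```

The result is on the order of ten million years, which is why the all-pairs alignment has to be replaced by a filter that only examines likely overlaps.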
k-mer based overlap (pigeonhole principle again): Consider a 25 bp sequence. Expected number of occurrences in the genome $= 3 \times 10^{9} \times 4^{-25} \approx 2.7 \times 10^{-6}$. A 25 bp sequence is therefore expected to be unique in the genome! Two overlapping sequences should share a 25-mer. Two non-overlapping sequences should not!
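A quick check of the k-mer estimate, assuming a genome of uniformly random bases (real genomes contain repeats, so this is optimistic). The function name is hypothetical.

```python
def expected_kmer_occurrences(genome_length, k):
    """Expected count of one fixed k-mer in a uniformly random genome."""
    return genome_length * 4.0 ** (-k)


G = 3_000_000_000
print(f"k = 25: {expected_kmer_occurrences(G, 25):.2e}")

# Smallest k at which a fixed k-mer is expected less than once.
smallest_k = next(k for k in range(1, 64) if expected_kmer_occurrences(G, k) < 1)
print("smallest k with expectation < 1:", smallest_k)
```

Under this model 25 is comfortably above the minimum k, which leaves room for sequencing errors and biased base composition.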
Sorting k-mers: Build a list of k-mers that appear in the sequences and their reverse complements. Create a record with 4 entries: k-mer, sequence number, position in the sequence, reverse-complementation flag. Sort a vector of these according to k-mer. How many records per k-mer are expected? If the number of records exceeds a threshold, discard (why?).
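The record-and-sort scheme might look like the following sketch. Function names, the tuple layout and the `max_records` threshold are illustrative assumptions; the comment on discarding frequent k-mers gives one plausible answer to the slide's "why?", not necessarily the intended one.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Yield (kmer, read_id, position, is_reverse) for reads and their reverse complements."""
    for read_id, read in enumerate(reads):
        for is_reverse, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                yield (seq[pos:pos + k], read_id, pos, is_reverse)


def candidate_pairs(reads, k, max_records=50):
    """Return pairs of read ids that share at least one k-mer."""
    records = sorted(kmer_records(reads, k))   # sorting groups equal k-mers together
    pairs = set()
    for _, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # assumed reason: over-represented k-mers likely come from repeats
        ids = sorted({r[1] for r in group})
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs


print(candidate_pairs(["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"], k=5))
```

Only the pairs returned here would be passed on to the expensive alignment step.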
Alignment module: Coalesce k-mer hits into longer, gap-free partial alignments. These extended k-mer hits are saved. For each pair of sequences, form a directed graph. For each maximal path in the graph, construct an alignment. Refine the alignment via banded DP.
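The first step, coalescing k-mer hits into gap-free segments, can be sketched by grouping hits on the same diagonal. This covers only that step (not the graph or the banded DP); the function name and the hit format `(pos_a, pos_b)` are assumptions.

```python
from collections import defaultdict


def coalesce_hits(hits, k):
    """Merge k-mer hits (pos_a, pos_b) on the same diagonal into gap-free segments.

    Returns a sorted list of (start_a, start_b, length) tuples.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diag, starts in by_diagonal.items():
        starts.sort()
        seg_start = prev = starts[0]
        for s in starts[1:]:
            if s > prev + k:  # hits no longer touch or overlap: close the segment
                segments.append((seg_start, seg_start - diag, prev + k - seg_start))
                seg_start = s
            prev = s
        segments.append((seg_start, seg_start - diag, prev + k - seg_start))
    return sorted(segments)


print(coalesce_hits([(0, 3), (1, 4), (2, 5), (10, 13), (4, 0)], k=4))
```

Hits on one diagonal have a constant offset between the two sequences, so merging them cannot introduce a gap.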
Problem 2: Size. Islands might simply be too small in length. $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, $c = 8$. #Islands $= N e^{-c\sigma}$ = 36K. Size of an island = 54K. Not enough to make it an acceptable assembly! PLUS, there is the problem of repeats, chimerism, etc.
Solution 2: Clones can have mate-pairs. Recall that we sequence about 1000 bp at the end of a clone. If we sequence both ends, we get extra information, particularly if we know the length of the original clone.
Mate Pairs: Mate-pairs allow you to merge islands (contigs) into super-contigs.
Super-contigs are quite large: Make clones of truly predictable length. E.g., 3 sets can be used: 2 Kb, 10 Kb and 50 Kb. The variance in these lengths should be small. Use the mate-pairs to order and orient the contigs, and make super-contigs.
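As a rough illustration of why predictable clone lengths matter, the sketch below estimates the gap between two contigs from mate pairs that span them. It assumes the contigs are already ordered and oriented the same way, and that each link is described by the clone length, the length of the first contig, where the clone starts in the first contig, and where it ends in the second; all names and the input layout are hypothetical.

```python
from statistics import mean


def gap_estimate(insert_len, contig1_len, start_in_contig1, end_in_contig2):
    """Gap implied by one mate pair spanning contig1 -> contig2."""
    covered_in_contig1 = contig1_len - start_in_contig1
    return insert_len - covered_in_contig1 - end_in_contig2


def scaffold_gap(links):
    """Average the gap estimates from several mate pairs linking the same contigs."""
    return mean(gap_estimate(*link) for link in links)


links = [
    (2_000, 5_000, 4_200, 700),     # a 2 Kb clone
    (10_000, 5_000, 1_000, 5_400),  # a 10 Kb clone
]
print(scaffold_gap(links))
```

The smaller the variance of the clone lengths, the tighter these per-link estimates agree, which is the point the slide makes about predictable lengths.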
Repeats & Chimerisms: 40-50% of the human genome is made up of repetitive elements. Repeats can cause great problems in the assembly! Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.
Repeat detection: Lander Waterman strikes again! The expected number of clones in a repeat-containing island is MUCH larger than in a non-repeat-containing island (contig). Thus, every contig can be marked as unique or non-unique. In the first step, throw away the non-unique islands.
Detecting Repeat Contigs 1: Read Density. Compute the log-odds ratio of two hypotheses: H1: The contig is from a unique region of the genome. H2: The contig is from a region that is repeated at least twice.
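A minimal sketch of such a log-odds score, under assumptions the slide does not spell out: read starts follow a Poisson model, and H2 is simplified to exactly two collapsed copies. The function and parameter names are hypothetical.

```python
import math


def unique_vs_repeat_log_odds(contig_len, reads_in_contig, total_reads, genome_len):
    """ln P(k | unique) - ln P(k | two copies) under a Poisson read-start model."""
    lam_unique = contig_len * total_reads / genome_len  # expected reads if unique
    lam_repeat = 2 * lam_unique                         # expected reads if two copies collapse
    k = reads_in_contig
    # Poisson log-likelihood ratio; the log(k!) terms cancel.
    return (lam_repeat - lam_unique) + k * (math.log(lam_unique) - math.log(lam_repeat))


# A 10 Kb contig at 10x coverage with 500 bp reads expects about 200 reads.
print(unique_vs_repeat_log_odds(10_000, 200, 60_000_000, 3_000_000_000))
print(unique_vs_repeat_log_odds(10_000, 400, 60_000_000, 3_000_000_000))
```

A positive score favours H1 (unique) and a negative score favours H2 (repeat); the expression simplifies to the expected read count minus k ln 2.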
Detecting Chimeric reads: Chimeric reads: reads that contain sequence from two genomic locations. Good overlaps: G(a,b) if a, b overlap with a high score. Transitive overlap: T(a,c) if G(a,b) and G(b,c). Find a point x across which only transitive overlaps occur. x is a point of chimerism.
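One simplified way to look for such a point x on a single read, given here as an interpretation and not the lecture's algorithm: record which part of the read each good overlap covers, then report junctions that no single good overlap spans. The function name and the half-open `(start, end)` interval format are assumptions.

```python
def chimeric_breakpoints(read_len, overlap_intervals):
    """Return junctions x (0 < x < read_len) not spanned by any good overlap.

    overlap_intervals holds half-open (start, end) ranges of the read that are
    covered by a good overlap with some other read.
    """
    spanned = [False] * (read_len + 1)
    for start, end in overlap_intervals:
        for x in range(start + 1, end):  # junctions strictly inside the overlap
            spanned[x] = True
    return [x for x in range(1, read_len) if not spanned[x]]


print(chimeric_breakpoints(10, [(0, 5), (5, 10)]))   # overlaps meet but none crosses 5
print(chimeric_breakpoints(10, [(0, 6), (4, 10)]))   # every junction is spanned
```

If reads on the left of x and reads on the right of x are connected only through this read, the read is a candidate chimera with its junction at x.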
Whole genome shotgun: Input: shotgun sequence fragments (reads) and mate pairs. Output: a single sequence created by consensus of overlapping reads. The first generation of assemblers did not include mate-pairs (Phrap, CAP, ...). Second generation: CA, Arachne, Euler.
Assembly: Use k-mers to detect potential overlaps. Use alignments to build contig graphs. Decide the unique contigs based on LW statistics. Discard repeat contigs. Break chimeric contigs. Use mate-pairs to build scaffolds. Fill gaps.
Consensus Derivation: The consensus sequence is created by converting pairwise read alignments into multiple-read alignments.
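A deliberately simplified sketch of consensus calling: it assumes the layout is already known as an offset per read and that the reads align without indels, so the multiple alignment reduces to stacking reads in columns and taking a majority vote. Real assemblers have to handle gaps; the function name and input format are hypothetical.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority-vote consensus from (offset, sequence) pairs in a gap-free layout."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    # Columns covered by no read are reported as 'N'.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


layout = [(0, "ACGTAC"), (2, "GTTCGG"), (4, "ACGGTT")]
print(consensus(layout))
```

In the example, the second read disagrees with the other two at one column, and the vote resolves it in favour of the majority base.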
Summary: Whole genome shotgun is now routine: Human, Mouse, Rat, Dog, Chimpanzee, ... Many prokaryotes (one can be sequenced in a day). Plant genomes: Arabidopsis, rice. Model organisms: worm, fly, yeast. A lot is not known about genome structure, organization and function. Comparative genomics offers low-hanging fruit.
Final exam syllabus: Take home. The entire course, but emphasis will be given to post-midterm lectures: HMMs, gene-finding, mass spectrometry, micro-array analysis, genome sequencing and assembly.