
Bioinformatics Algorithms: Understanding Sequence Alignment and CFTR Gene
Explore the world of bioinformatics algorithms with a focus on sequence alignment, dynamic programming, and the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Learn about the significance of mutation analysis in cystic fibrosis patients and how bioinformatics aids in genetic diagnostics. Dive into the applications of dynamic programming in solving complex biological problems.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Dynamic Programming I Dynamic Programming I Definition of Dynamic Programming Sequence Alignment Longest Common Subsequence (LCS) Problem Global Alignment Scoring Matrices Local Alignment
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Definition of Dynamic Programming (Planning) Dynamic Programming: Solve the subproblems (they may overlap with each other) and save their solutions in a table, and then use the table to solve the original problem. Example: Compute Fibonacci number f(0)=0, f(1 =1 f(n)=f(n-1)+f(n-2) f(0) 0 f(1) 1 f(2) 1 f(3) 2 f(4) 3 f(n) Procedure FibonacciNum(n) f(0) = 0 f(1) = 1 for i = 2 to n f(i) = f(i-1) + f(i-2) Return f(n) Time complexity = O(n)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Sequence Alignment Example: Cystic Fibrosis Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands (abnormally high level of mucus in glands). CF primarily affects the respiratory systems in children. Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva
An Introduction to Bioinformatics Algorithms Cystic Fibrosis: Finding the Gene www.bioalgorithms.info CFTR: Cystic fibrosis transmembrane conductance regulator
An Introduction to Bioinformatics Algorithms Cystic Fibrosis: Mutation Analysis www.bioalgorithms.info If a high % of cystic fibrosis (CF) patients have a certain mutation in the gene and the normal patients don t, then that could be an indicator of a mutation that is related to CF A certain mutation was found in 70% of CF patients, convincing evidence that it is a predominant genetic diagnostics marker for CF
An Introduction to Bioinformatics Algorithms Cystic Fibrosis and CFTR Gene : www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Cystic Fibrosis and the CFTR Protein CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein is acting in the cell membrane of epithelial cells that secrete mucus These cells line the airways of the nose, lungs, the stomach wall, etc.
An Introduction to Bioinformatics Algorithms Aligning DNA Sequences www.bioalgorithms.info n = 8 m = 7 V = ATCTGATG matches mismatches insertions deletions 4 1 2 2 W = TGCATAC match mismatch A T C T G A T G A V W T G C A T C deletion indels insertion
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Longest Common Subsequence (LCS) Alignment without Mismatches Given two sequences v = v1 v2 vm and w = w1 w2 wn The LCS of v and w is a sequence of positions in v: 1 < i1 < i2< < it < m and a sequence of positions in w: 1 < j1 < j2< < jt < n such that it -th letter of v equals to jt-letter of w and t is maximal
An Introduction to Bioinformatics Algorithms LCS: Example i coords: 0 www.bioalgorithms.info 1 2 2 3 3 4 5 6 7 8 elements of v -- -- A T C T G A T C elements of w -- 0 -- 5 -- 6 T 1 G 2 C 3 A 4 T 5 A 6 C 7 0 j coords: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7) positions in v: positions in w: 2 < 3 < 4 < 6 < 8 Matches shown in red 1 < 3 < 5 < 6 < 7
An Introduction to Bioinformatics Algorithms Edit Graph for LCS Problem www.bioalgorithms.info A T C T G A T C j 0 1 2 3 4 5 6 7 8 Every path is a common subsequence. i 0 T 1 Every diagonal edge adds an extra element to common subsequence G 2 C 3 A 4 T LCS Problem: Find a path with maximum number of diagonal edges 5 A 6 C 7
An Introduction to Bioinformatics Algorithms Computing LCS www.bioalgorithms.info Let vi = prefix of v of length i: v1 vi and wj = prefix of w of length j: w1 wj The length of LCS(vi,wj) is computed by: si-1,j + 0 si,j -1 + 0 si-1,j -1 + 1, if vi = wj si,j = max i-1,j i-1,j -1 0 1 0 i,j -1 i,j
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Alignments as LCS (without mismatch) and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5. A T C G T A C 1 2 3 4 5 6 7 0 0 A T G 1 2 Every path in the edit graph corresponds to an alignment: 3 T 4 T 5 A 6 T 7
An Introduction to Bioinformatics Algorithms Alignment as LCS www.bioalgorithms.info A T C G T A C Old Alignment 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 1 2 3 4 5 6 7 0 0 A T G T 1 2 3 New Alignment 0122345677 v= AT_GTTAT_ w= ATCG_TA_C 0123445667 4 T A 5 6 T 7
An Introduction to Bioinformatics Algorithms Alignment: Dynamic Programming www.bioalgorithms.info si,j = si-1, j-1+1 if vi = wj { max si-1, j si, j-1
An Introduction to Bioinformatics Algorithms Dynamic Programming Example www.bioalgorithms.info A T C G T A C Initialize 1st row and 1st column to be all zeroes. 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 A 1 0 T G 2 0 Or, to be more precise, initialize 0th row and 0th column to be all zeroes. 3 0 T 4 0 T 5 0 A 6 0 T 7 0
An Introduction to Bioinformatics Algorithms Dynamic Programming Example www.bioalgorithms.info A T C G T A C 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 A 1 0 1 1 1 1 1 1 1 Si,j = Si-1, j-1 { value from NW +1, if vi = wj value from North (top) value from West (left) T G 2 max Si-1, j Si, j-1 1 0 3 0 1 T 4 0 1 T 5 0 1 A 6 0 1 1 T 7 0
An Introduction to Bioinformatics Algorithms Alignment: Backtracking www.bioalgorithms.info Arrows show where the score originated from. if from the top if from the left if vi = wj
An Introduction to Bioinformatics Algorithms Backtracking Example www.bioalgorithms.info Find a match in row and column 2. A T C G T A C 1 2 3 4 5 6 7 0 i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). 0 0 0 0 0 0 0 0 0 A 1 0 1 1 1 1 1 1 1 T G 2 1 0 2 2 2 2 2 2 Since vi = wj, si,j = si-1,j-1 +1 3 0 1 2 T 4 s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1 s4,2 = [s3,1 = 1] + 1 s5,2 = [s4,1 = 1] + 1 s7,2 = [s6,1 = 1] + 1 0 1 2 T 5 0 1 2 A 6 0 1 2 1 T 7 0 2
An Introduction to Bioinformatics Algorithms Backtracking Example www.bioalgorithms.info A T C G T A C 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 A Continuing with the dynamic programming algorithm gives this result. 1 0 1 1 1 1 1 1 1 T G 2 1 0 2 2 2 2 2 2 3 0 1 2 2 3 3 3 3 T 4 0 1 2 2 3 4 4 4 T 5 0 1 2 2 3 4 4 4 A 6 0 1 2 2 3 4 5 5 1 T 7 0 2 2 3 4 5 5
An Introduction to Bioinformatics Algorithms LCS Algorithm www.bioalgorithms.info 1. 1. LCS( LCS(v,w 2. for for i 1 to n 3. si,0 0 4. for for j 1 to m 5. s0,j 0 6. for for i 1 to n 7. for for j 1 to m 8. si-1,j 9. si,j max si,j-1 10. si-1,j-1 + 1, if vi = wj 11. if return return (sn,m, b b) v,w) ) { if si,j = si-1,j if si,j = si,j-1 if si,j = si-1,j-1 + 1 { bi,j if if
An Introduction to Bioinformatics Algorithms Now What? www.bioalgorithms.info A T C G T A C LCS(v,w) created the alignment grid 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 A 1 0 1 1 1 1 1 1 1 Now we need a way to read the best alignment of v and w 2 T 1 0 2 2 2 2 2 2 3 G T 1 0 2 2 3 3 3 3 4 1 2 0 2 3 4 4 4 5 T 0 1 2 2 Follow the arrows backwards from sink 3 4 4 4 A 6 1 2 2 0 3 4 5 5 1 T 7 2 2 3 4 0 5 5
An Introduction to Bioinformatics Algorithms Printing LCS: Backtracking www.bioalgorithms.info 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. else 11. PrintLCS 1. PrintLCS if if i = 0 or j = 0 return if if bi,j= PrintLCS print else if if bi,j= else PrintLCS(b b,v, v,i,j-1) PrintLCS(b,v b,v,i,j) return PrintLCS(b b,v, v,i-1,j-1) print vi else PrintLCS PrintLCS(b b,v, v,i-1,j)
An Introduction to Bioinformatics Algorithms LCS Runtime www.bioalgorithms.info It takes O(nm) time to fill in the nxm dynamic programming matrix. Why O(nm)? The pseudocode consists of a nested for loop inside of another for loop to set up a nxm matrix.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info From LCS to Alignment: Change up the Scoring The Longest Common Subsequence (LCS) problem the simplest form of sequence alignment allows only insertions and deletions (no mismatches). In the LCS Problem, we scored 1 for matches and 0 for indels Consider penalizing indels and mismatches with negative scores Simplest scoring schema: +1 : match premium - : mismatch penalty - : indel penalty
An Introduction to Bioinformatics Algorithms Simple Scoring www.bioalgorithms.info When mismatches are penalized by , indels are penalized by , and matches are rewarded with +1, the resulting score is: #matches (#mismatches) (#indels)
An Introduction to Bioinformatics Algorithms The Global Alignment Problem www.bioalgorithms.info Find the best alignment between two strings under a given scoring schema Input : Strings v and w and a scoring schema Output : Alignment of maximum score = - = 1 if match = - if mismatch : mismatch penalty : indel penalty si-1,j-1 +1 if vi = wj si,j= max s i-1,j-1 - if vi wj s i-1,j - s i,j-1 - {
An Introduction to Bioinformatics Algorithms Scoring Matrices www.bioalgorithms.info To generalize scoring, consider a (4+1) x(4+1) scoring matrix . In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score for comparison of a gap character - . This will simplify the algorithm as follows: si-1,j-1 + (vi, wj) si,j = max s i-1,j + (vi, -) s i,j-1 + (-, wj) {
An Introduction to Bioinformatics Algorithms Measuring Similarity www.bioalgorithms.info Measuring the extent of similarity between two sequences Based on percent sequence identity Based on conservation
An Introduction to Bioinformatics Algorithms Percent Sequence Identity www.bioalgorithms.info The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G A C G T G A C C T G A G A G A C G T G G C A G A G G C A G mismatch indel 70% identical
An Introduction to Bioinformatics Algorithms Making a Scoring Matrix www.bioalgorithms.info Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein s function, therefore some penalties, (vi , wj), will be less harsh than others.
An Introduction to Bioinformatics Algorithms Scoring Matrix: Example www.bioalgorithms.info A R N K Notice that although R and K are different amino acids, they have a positive score. A 5 -2 -1 -1 R - 7 -1 3 N - - 7 0 K - - - 6 Why? They are both positively charged amino acids will not greatly change function of protein. AKRANR KAAANK 1) + ( 2) + 5 + 7 + 3 = 11
An Introduction to Bioinformatics Algorithms Conservation Amino acid changes that tend to preserve the physico-chemical properties of the original residue Polar to polar aspartate glutamate Nonpolar to nonpolar alanine valine Similarly behaving residues leucine to isoleucine(side chain with the structure CH2CH(CH3)2, an essential amino acid, meaning you must obtain it from food, as opposed to being able to make it from other molecules, which your cells can do with the majority of the amino acids.) www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms Scoring matrices www.bioalgorithms.info Amino acid substitution matrices PAM BLOSUM DNA substitution matrices DNA is less conserved than protein sequences Less effective to compare coding regions at nucleotide level
An Introduction to Bioinformatics Algorithms PAM www.bioalgorithms.info Point Accepted Mutation (Dayhoff et al.) 1 PAM = PAM1 = 1% average change of all amino acid positions After 100 PAMs of evolution, not every residue will have changed some residues may have mutated several times some residues may have returned to their original state some residues may not changed at all
An Introduction to Bioinformatics Algorithms PAMX www.bioalgorithms.info PAMx = PAM1x PAM250 = PAM1250 PAM250 is a widely used scoring matrix: Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10
An Introduction to Bioinformatics Algorithms BLOSUM www.bioalgorithms.info Blocks Substitution Matrix Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins Matrix name indicates evolutionary distance BLOSUM62 was created using sequences sharing no more than 62% identity
An Introduction to Bioinformatics Algorithms The Blosum50 Scoring Matrix www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms Local vs. Global Alignment www.bioalgorithms.info The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i , j ) in the edit graph. In the edit graph with negatively-scored edges, Local Alignmet may score higher than Global Alignment
An Introduction to Bioinformatics Algorithms Local vs. Global Alignment(cont d) www.bioalgorithms.info Global Alignment --T -CC-C-AGT -TATGT-CAGGGGACACG A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG T-CAGAT--C Local Alignment better alignment to find conserved segment tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc
An Introduction to Bioinformatics Algorithms Local Alignment: Example www.bioalgorithms.info Compute a mini Global Alignment to get Local Local alignment Global alignment
An Introduction to Bioinformatics Algorithms Local Alignments: Why? www.bioalgorithms.info Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence
An Introduction to Bioinformatics Algorithms The Local Alignment Problem www.bioalgorithms.info Goal: Find the best local alignment between two strings Input : Strings v, w and scoring matrix Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignment of all possible substrings
An Introduction to Bioinformatics Algorithms Local Alignment: Free Rides www.bioalgorithms.info Yeah, a free ride! Vertex (0,0) The dashed edges represent the free rides from (0,0) to every other node.
An Introduction to Bioinformatics Algorithms The Local Alignment Recurrence www.bioalgorithms.info The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence: Notice there is only this change from the original recurrence of a Global Alignment 0 si,j = max si-1,j-1 + (vi, wj) s i-1,j + (vi, -) s i,j-1 + (-, wj) {
An Introduction to Bioinformatics Algorithms The Local Alignment Recurrence www.bioalgorithms.info The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence: Power of ZERO: there is only this change from the original recurrence of a Global Alignment - since there is only one free ride edge entering into every vertex 0 si,j = max si-1,j-1 + (vi, wj) s i-1,j + (vi, -) s i,j-1 + (-, wj) {
An Introduction to Bioinformatics Algorithms Scoring Indels: Naive Approach www.bioalgorithms.info A fixed penalty is given to every indel: - for 1 indel, -2 for 2 consecutive indels -3 for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels
An Introduction to Bioinformatics Algorithms Affine Gap Penalties www.bioalgorithms.info In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: ATA__GC ATATTGC ATAG_GC AT_GTGC Normal scoring would give the same score for both alignments This is more likely. This is less likely.
An Introduction to Bioinformatics Algorithms Accounting for Gaps www.bioalgorithms.info Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -( + x) where >0 is the penalty for introducing a gap: gap opening penalty will be large relative to : gap extension penalty because you do not want to add too much of a penalty for extending the gap.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Summary of Algorithms for LCS, Global Alignment and Local Alignment