Understanding Evolutionary Distances and Nucleotide Substitution Models

Explore key concepts in evolutionary distances, nucleotide substitutions, and Markov chains in the context of genetic evolution. Dive into the intricacies of substitution models and rates, shedding light on the impact of neutral substitution rates on genetic sequences over generations.

  • Evolutionary Distances
  • Substitution Models
  • Nucleotide Substitutions
  • Genetic Evolution
  • Markov Chains

Presentation Transcript


  1. Substitution models and evolutionary distances
     Xuhua Xia, xxia@uottawa.ca, http://dambe.bio.uottawa.ca

  2. Key Concepts
       • Multiple substitutions at the same site
       • Markov chains
       • Transition probability matrix
       • Rate matrix
       • Rate and frequency parameters, equilibrium frequencies
       • Time-reversible
       • Stationary
       • Evolutionary distances derived from Markov chains

  3. Nucleotide substitutions
     [Figure, from WHL: the ancestral sequence ACACTCGGATTAGGCT evolves along two lineages, with sites labelled by substitution type: single, multiple, coincidental, parallel, and convergent.]
     Observed sequences:
         ATACTCAGGTTAAGCT
         ACAATCCGGTTAAGCT
     Actual number of substitutions during the evolution of the two daughter sequences: 12.
     Observed number of substitutions between the two daughter sequences: 3.
     Substitution models are for correcting multiple hits.

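  4. Markov Chain
     Markov property (memoryless): $P(X_{n+1}=j \mid X_n=i \text{ and } A) = P(X_{n+1}=j \mid X_n=i)$
     Stationary property: $P(X_{n+1}=j \mid X_n=i)$ is constant for all $n > 0$.
     General chain with states $S_1, S_2, \ldots, S_N$ and frequencies $\{\pi_1, \pi_2, \ldots, \pi_N\}$:
     $$Q = \begin{array}{c|cccc} & S_1 & S_2 & \cdots & S_N \\ \hline S_1 & q_{11} & q_{12} & \cdots & q_{1N} \\ S_2 & q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_N & q_{N1} & q_{N2} & \cdots & q_{NN} \end{array}$$
     Nucleotide substitution model with frequencies $\{\pi_A, \pi_C, \pi_G, \pi_T\}$:
     $$Q = \begin{array}{c|cccc} & A & G & C & T \\ \hline A & - & a & b & c \\ G & g & - & d & e \\ C & h & i & - & f \\ T & j & k & l & - \end{array}$$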

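  5. Markov Chain
     Nucleotide frequencies after one generation:
     $$[A_{t+1}\ G_{t+1}\ C_{t+1}\ T_{t+1}] = [A_t\ G_t\ C_t\ T_t] \begin{bmatrix} P_{AA} & P_{AG} & P_{AC} & P_{AT} \\ P_{GA} & P_{GG} & P_{GC} & P_{GT} \\ P_{CA} & P_{CG} & P_{CC} & P_{CT} \\ P_{TA} & P_{TG} & P_{TC} & P_{TT} \end{bmatrix}$$
     that is, $P(t+1) = P(t)\,M$.
     Example:
     $$M = \begin{bmatrix} 0.97 & 0.01 & 0.01 & 0.01 \\ 0.01 & 0.97 & 0.01 & 0.01 \\ 0.01 & 0.01 & 0.97 & 0.01 \\ 0.01 & 0.01 & 0.01 & 0.97 \end{bmatrix}$$
     $$[1\ 0\ 0\ 0]\,M = [0.97\ 0.01\ 0.01\ 0.01]$$
     $$[0.97\ 0.01\ 0.01\ 0.01]\,M = [0.9412\ 0.0196\ 0.0196\ 0.0196]$$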
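
The update $P(t+1) = P(t)\,M$ can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `step` is a hypothetical helper, and the nucleotide order A, G, C, T is taken from the slide.

```python
# One-generation update P(t+1) = P(t) * M, nucleotide order A, G, C, T.
M = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]


def step(freqs, matrix):
    """Multiply the row vector `freqs` by `matrix` (one generation)."""
    size = len(matrix)
    return [
        sum(freqs[i] * matrix[i][j] for i in range(size))
        for j in range(size)
    ]


p0 = [1.0, 0.0, 0.0, 0.0]  # the site starts as A
p1 = step(p0, M)           # frequencies after one generation
p2 = step(p1, M)           # frequencies after two generations

print([round(x, 4) for x in p1])
print([round(x, 4) for x in p2])
```

Each call to `step` advances the chain by one generation, so calling it repeatedly traces how the probability of still observing A decays.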

  6. Markov Chain
     $P(t) = P(0)\,M^t$
     $$[1\ 0\ 0\ 0] \begin{bmatrix} 0.97 & 0.01 & 0.01 & 0.01 \\ 0.01 & 0.97 & 0.01 & 0.01 \\ 0.01 & 0.01 & 0.97 & 0.01 \\ 0.01 & 0.01 & 0.01 & 0.97 \end{bmatrix}^{100} = [0.2626\ 0.2458\ 0.2458\ 0.2458]$$
     If the substitution rate were really as high as 0.01 per site per generation, the sequence would almost reach full substitution saturation in just about 100 generations. How far back can we trace history with a realistic neutral substitution rate?

  7. Realistic neutral substitution rate
     $$[1\ 0\ 0\ 0] \begin{bmatrix} 1-3\times10^{-8} & 10^{-8} & 10^{-8} & 10^{-8} \\ 10^{-8} & 1-3\times10^{-8} & 10^{-8} & 10^{-8} \\ 10^{-8} & 10^{-8} & 1-3\times10^{-8} & 10^{-8} \\ 10^{-8} & 10^{-8} & 10^{-8} & 1-3\times10^{-8} \end{bmatrix}^{100000000} = [0.2638\ 0.2454\ 0.2454\ 0.2454]$$
     With a neutral substitution rate of $10^{-8}$, the sequence reaches almost full substitution saturation in about 100,000,000 generations.
     Early simple organisms likely had many generations per year, and may well have had a higher substitution rate because of less efficient DNA repair mechanisms. In particular, the neutral substitution rate is typically much higher than $10^{-8}$.
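
Multiplying the matrix by itself 100,000,000 times one step at a time is impractical, so the sketch below uses exponentiation by squaring to evaluate $P(t) = P(0)\,M^t$. It is an illustration under stated assumptions, not the method used in the slides: the helper names (`mat_mul`, `mat_pow`, `uniform_matrix`, `frequencies_after`) are hypothetical, and the diagonal is assumed to be one minus three times the per-site rate so that each row sums to 1.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices (lists of lists)."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(m, power):
    """Return m**power using repeated squaring (about log2(power) products)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def uniform_matrix(rate):
    """Transition matrix with `rate` off the diagonal and 1 - 3*rate on it."""
    return [[1.0 - 3.0 * rate if i == j else rate for j in range(4)] for i in range(4)]


def frequencies_after(rate, generations):
    """Frequencies after `generations` steps, starting from [1, 0, 0, 0]."""
    # Starting from a pure-A site, P(t) is simply the first row of M^t.
    return mat_pow(uniform_matrix(rate), generations)[0]


fast = frequencies_after(0.01, 100)
slow = frequencies_after(1e-8, 100_000_000)
print([round(x, 4) for x in fast])
print([round(x, 4) for x in slow])
```

Both cases end close to the equilibrium value of 1/4 per nucleotide, which is what "almost full substitution saturation" refers to.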

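  8. Substitution Models: JC69
     Equilibrium frequencies: $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$
     Rate matrix (rows and columns in the order A, G, C, T):
     $$Q = \begin{bmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{bmatrix}$$
     Transition probabilities:
     $$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}, \qquad P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$
     Distance: $D = 3\alpha t$, so $\alpha t = D/3$ and
     $$p_{diff} = \frac{3}{4}\left(1 - e^{-4\alpha t}\right) = \frac{3}{4}\left(1 - e^{-4D/3}\right), \qquad D = -\frac{3}{4} \ln\left(1 - \frac{4 p_{diff}}{3}\right)$$
     How to obtain the transition probabilities? Three methods (following slides).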
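
The JC69 relations $P_{ii}(t) = \frac14 + \frac34 e^{-4\alpha t}$, $p_{diff} = \frac34(1 - e^{-4D/3})$ and $D = -\frac34 \ln(1 - \frac{4 p_{diff}}{3})$ translate directly into code. The Python sketch below is an illustration only; the function names are hypothetical and are not taken from the slides or from DAMBE.

```python
import math


def jc69_p_same(alpha_t):
    """P_ii(t): probability that a site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha_t)


def jc69_p_other(alpha_t):
    """P_ij(t): probability of one specific different nucleotide after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha_t)


def jc69_p_diff(distance):
    """Expected proportion of differing sites for a distance D = 3 * alpha * t."""
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))


def jc69_distance(p_diff):
    """JC69 distance from an observed proportion of differing sites."""
    if not 0.0 <= p_diff < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p_diff < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


d = 0.5  # arbitrary illustrative distance
print(jc69_p_diff(d), jc69_distance(jc69_p_diff(d)))
```

The guard in `jc69_distance` reflects that the logarithm is undefined once the observed difference reaches the saturation value of 3/4.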

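  9. Method 1: solve rate equations (JC69)
     $$\frac{dP_A(t)}{dt} = -3\alpha P_A(t) + \alpha P_G(t) + \alpha P_C(t) + \alpha P_T(t) = -3\alpha P_A(t) + \alpha[1 - P_A(t)] = -4\alpha P_A(t) + \alpha$$
     Solving with $P_A(0) = 1$ gives $P_A(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$, so
     $$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}, \qquad P_{ij}(t) = \frac{1 - P_{ii}(t)}{3} = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$
     The equilibrium frequency is obtained when $dP_A(t)/dt = 0$, i.e., $-4\alpha P_A(t) + \alpha = 0$, so $P_A(t) = 1/4$.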
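
The step from the rate equation $dP_A(t)/dt = -4\alpha P_A(t) + \alpha$ to its solution can be spelled out with a change of variable. The worked derivation below is a sketch added for clarity; the auxiliary variable $u(t)$ is introduced here and does not appear in the slides.

```latex
\begin{align*}
\frac{dP_A(t)}{dt} &= \alpha - 4\alpha P_A(t) \\
u(t) = P_A(t) - \tfrac{1}{4}
  \;&\Rightarrow\; \frac{du}{dt} = -4\alpha\,u
  \;\Rightarrow\; u(t) = u(0)\,e^{-4\alpha t} \\
P_A(0) = 1:\quad & P_A(t) = \tfrac{1}{4} + \tfrac{3}{4}\,e^{-4\alpha t} \quad (= P_{ii}(t)) \\
P_A(0) = 0:\quad & P_A(t) = \tfrac{1}{4} - \tfrac{1}{4}\,e^{-4\alpha t} \quad (= P_{ij}(t))
\end{align*}
```

The two initial conditions correspond to a site that starts as A and a site that starts as some other nucleotide.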

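  10. Method 2: probability thinking
     [Diagrams: substitutions among the nucleotides A, G, C, and T.]
     Each nucleotide has a rate $\alpha$ of being substituted by any of the 4 nucleotides, itself included.
     After time $t$, the expected number of substitutions is $4\alpha t$.
     Poisson distribution: $P(x=0, \alpha, t) = e^{-4\alpha t}$, $P(x>0, \alpha, t) = 1 - e^{-4\alpha t}$.
     Because we have four nucleotides, each gets $1/4$ of $P(x \ge 1, \alpha, t)$:
     $$p_{ij}(t) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$$
     A site shows the same nucleotide if nothing changed or if it changed to itself:
     $$p_{ii}(t) = e^{-4\alpha t} + \frac{1}{4}\left(1 - e^{-4\alpha t}\right) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$
     As a nucleotide (say A) can change to 3 others, we have
     $$p_{diff} = 3\,p_{ij}(t) = \frac{3}{4}\left(1 - e^{-4\alpha t}\right) = \frac{3}{4}\left(1 - e^{-4D/3}\right), \qquad D = 3\alpha t = -\frac{3}{4} \ln\left(1 - \frac{4 p_{diff}}{3}\right)$$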

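  11. Method 3 (general): matrix exponential
         # Maple: JC69 rate matrix, rows and columns in the order A, G, C, T
         with(LinearAlgebra);
         Q := Matrix([[-3*a,a,a,a],[a,-3*a,a,a],[a,a,-3*a,a],[a,a,a,-3*a]]);
         # e^Q; with no time argument, a plays the role of (rate * time)
         MatrixExponential(Q);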
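
For readers without Maple, the same matrix-exponential idea can be checked numerically. The Python sketch below is an assumption-laden illustration, not the slides' code: `mat_exp` is a hypothetical helper that approximates $e^{Qt}$ with a truncated Taylor series plus scaling and squaring, and the values of `alpha` and `t` are arbitrary.

```python
import math


def mat_mul(a, b):
    """Plain matrix product of two square matrices (lists of lists)."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_exp(m, terms=20, squarings=10):
    """Approximate e^m with a Taylor series plus scaling and squaring."""
    size = len(m)
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # After this line, term holds scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [
            [result[i][j] + term[i][j] for j in range(size)]
            for i in range(size)
        ]
    for _ in range(squarings):
        # Undo the scaling: (e^(m / 2^s))^(2^s) = e^m
        result = mat_mul(result, result)
    return result


def jc69_rate_matrix(alpha):
    """JC69 rate matrix: alpha off the diagonal, -3*alpha on it."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


alpha, t = 0.1, 2.0  # arbitrary illustrative values
Qt = [[q * t for q in row] for row in jc69_rate_matrix(alpha)]
P = mat_exp(Qt)

closed_same = 0.25 + 0.75 * math.exp(-4.0 * alpha * t)
closed_other = 0.25 - 0.25 * math.exp(-4.0 * alpha * t)
print(P[0][0], closed_same)
print(P[0][1], closed_other)
```

The printed pairs compare the numerical exponential with the closed-form JC69 probabilities; in practice a tested routine such as `scipy.linalg.expm` would be the more robust choice.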

  12. Method 1: K80
     $$\frac{dP_A(t)}{dt} = -(\alpha + 2\beta) P_A(t) + \alpha P_G(t) + \beta P_C(t) + \beta P_T(t); \quad \frac{dP_G(t)}{dt} = \ldots; \quad \frac{dP_C(t)}{dt} = \ldots; \quad \frac{dP_T(t)}{dt} = \ldots$$
     Solving this set of four equations yields:
     $$P_A(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta)t}$$
     $$P_G(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t} = P$$
     $$P_C(t) = P_T(t) = P_Y(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t} = \frac{Q}{2}$$
     The equilibrium frequency (i.e., when the frequencies no longer change) is obtained when
     $$\frac{dP_A(t)}{dt} = \frac{dP_G(t)}{dt} = \frac{dP_C(t)}{dt} = \frac{dP_T(t)}{dt} = 0$$
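
The K80 solutions can be sanity-checked by integrating the rate equations numerically. The Python sketch below is an illustration under assumptions: it writes out all four rate equations from the K80 rate matrix (transitions A↔G and C↔T at rate α, transversions at rate β), uses a classical fourth-order Runge-Kutta integrator, and all function names and parameter values are hypothetical.

```python
import math


def k80_derivatives(p, alpha, beta):
    """Right-hand sides dP/dt for the state order A, G, C, T."""
    a, g, c, t = p
    diag = -(alpha + 2.0 * beta)
    return [
        diag * a + alpha * g + beta * c + beta * t,
        alpha * a + diag * g + beta * c + beta * t,
        beta * a + beta * g + diag * c + alpha * t,
        beta * a + beta * g + alpha * c + diag * t,
    ]


def rk4_integrate(p0, alpha, beta, t_end, steps=1000):
    """Integrate the rate equations from 0 to t_end with classical RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = k80_derivatives(p, alpha, beta)
        k2 = k80_derivatives([p[i] + 0.5 * h * k1[i] for i in range(4)], alpha, beta)
        k3 = k80_derivatives([p[i] + 0.5 * h * k2[i] for i in range(4)], alpha, beta)
        k4 = k80_derivatives([p[i] + h * k3[i] for i in range(4)], alpha, beta)
        p = [
            p[i] + h / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i])
            for i in range(4)
        ]
    return p


def k80_closed_form(alpha, beta, t):
    """Closed-form P_A, P_G, P_C, P_T for a site that starts as A."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    return [
        0.25 + 0.25 * e1 + 0.5 * e2,
        0.25 + 0.25 * e1 - 0.5 * e2,
        0.25 - 0.25 * e1,
        0.25 - 0.25 * e1,
    ]


alpha, beta, t_end = 0.2, 0.05, 3.0  # arbitrary illustrative values
numeric = rk4_integrate([1.0, 0.0, 0.0, 0.0], alpha, beta, t_end)
exact = k80_closed_form(alpha, beta, t_end)
print(numeric)
print(exact)
```

Agreement between the two printed vectors supports the closed-form expressions; any sign error in the exponents would show up as a visible mismatch.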

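  13. K80
     $$P_A(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta)t}$$
     $$P_G(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t} = P$$
     $$P_C(t) = P_T(t) = P_Y(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t} = \frac{Q}{2}$$
     $P_A(t) + P_G(t) + 2P_Y(t) = 1$; otherwise the derivations must be wrong.
     Solving for the rate parameters:
     $$\alpha t = -\frac{1}{2} \ln(1 - 2P - Q) + \frac{1}{4} \ln(1 - 2Q), \qquad \beta t = -\frac{1}{4} \ln(1 - 2Q)$$
     $$D_{K80} = \alpha t + 2\beta t = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4}$$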
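
A round trip makes the K80 inversion concrete: compute the expected transition proportion $P$ and transversion proportion $Q$ from chosen values of $\alpha t$ and $\beta t$, then recover them with the logarithmic formulas. The Python sketch below is an illustration only; the function names and the input values are hypothetical.

```python
import math


def k80_expected_p_q(alpha_t, beta_t):
    """Expected transition (P) and transversion (Q) proportions under K80."""
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p = 0.25 + 0.25 * e1 - 0.5 * e2
    q = 0.5 - 0.5 * e1
    return p, q


def k80_estimates(p, q):
    """Return (alpha*t, beta*t, D_K80) from observed proportions P and Q."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K80 distance is undefined for these P and Q")
    alpha_t = -0.5 * math.log(a) + 0.25 * math.log(b)
    beta_t = -0.25 * math.log(b)
    return alpha_t, beta_t, alpha_t + 2.0 * beta_t


p, q = k80_expected_p_q(0.3, 0.1)  # arbitrary illustrative alpha*t and beta*t
alpha_t, beta_t, distance = k80_estimates(p, q)
print(p, q)
print(alpha_t, beta_t, distance)
```

The inversion works because $1 - 2P - Q = e^{-2(\alpha+\beta)t}$ and $1 - 2Q = e^{-4\beta t}$, so each logarithm isolates one exponent.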

  14. Method 2: probability thinking
     [Diagrams: substitutions among the nucleotides A, G, C, and T.]
     After time $t$, the expected number of changes is $2(\alpha + \beta)t$.
     Focus on nucleotide A:
       • Event 1 (e1): A has a rate $\beta$ of being substituted by any of the 4 nucleotides.
       • Event 2 (e2): A has an additional rate $\alpha - \beta$ of changing to G or to itself.
     Poisson distribution:
     $$P(e_1, e_2 = 0, t) = e^{-2(\alpha+\beta)t}, \qquad P(e_1 \ge 1, t) = 1 - e^{-4\beta t}$$
     $$P(e_2 \ge 1, e_1 = 0, t) = 1 - P(e_1, e_2 = 0, t) - P(e_1 \ge 1, t) = e^{-4\beta t} - e^{-2(\alpha+\beta)t}$$
     Transversions:
     $$Q = 2 \cdot \frac{P(e_1 \ge 1, t)}{4} = \frac{1}{2} - \frac{1}{2} e^{-4\beta t}$$
     Transitions:
     $$P = \frac{P(e_1 \ge 1, t)}{4} + \frac{P(e_2 \ge 1, e_1 = 0, t)}{2} = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t}$$
     Distance:
     $$\beta t = -\frac{\ln(1 - 2Q)}{4}, \qquad D = \alpha t + 2\beta t = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4}$$

  15. Method 3 (general): matrix exponential
         # Maple: K80 rate matrix, rows and columns in the order A, G, C, T
         # a = transition rate, b = transversion rate
         with(LinearAlgebra);
         Q := Matrix([[-(a+2*b),a,b,b],[a,-(a+2*b),b,b],[b,b,-(a+2*b),a],[b,b,a,-(a+2*b)]]);
         MatrixExponential(Q);

  16. Calculating distance
         SP1 AAG CCT CGG GGC CCT TAT TTT TTG
             ||  |   ||| ||| |   ||| ||| ||
         SP2 AAT CTC CGG GGC CTC TAT TTT TTT
     What are $P$ and $Q$? $P = 4/24$, $Q = 2/24$.
     $$D_{K80} = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4} = 0.31507864$$
     Comparison of distances, with $p_{diff} = 0.25$: $D_{JC69} = 0.304099$, $D_{K80} = 0.3150786$, where
     $$D_{JC69} = -\frac{3}{4} \ln\left(1 - \frac{4 p_{diff}}{3}\right)$$
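
The worked example can be reproduced by counting transitions and transversions directly from the two aligned sequences. The Python sketch below is an illustration, not DAMBE's implementation; the function names are hypothetical and the sequences are the slide's SP1 and SP2 with the spaces between codons removed.

```python
import math

PURINES = {"A", "G"}  # the remaining nucleotides, C and T, are pyrimidines


def count_differences(seq1, seq2):
    """Return (transitions, transversions, sites) for two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions, len(seq1)


def jc69_distance(p_diff):
    """JC69 distance from the proportion of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


def k80_distance(p, q):
    """K80 distance from transition (p) and transversion (q) proportions."""
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)


sp1 = "AAGCCTCGGGGCCCTTATTTTTTG"
sp2 = "AATCTCCGGGGCCTCTATTTTTTT"

ts, tv, n = count_differences(sp1, sp2)
P, Q = ts / n, tv / n
d_jc69 = jc69_distance(P + Q)
d_k80 = k80_distance(P, Q)
print(ts, tv, n)
print(round(d_jc69, 6), round(d_k80, 8))
```

Because JC69 treats every difference alike, it uses only the total proportion $P + Q$, whereas K80 keeps transitions and transversions separate.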

  17. TN93 Distance
     By solving equations (1), (2), (3), we obtain $\alpha_1 t$, $\alpha_2 t$, and $\beta t$ in terms of the observed proportions $P$ and $Q$ and the frequencies $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$, $\pi_R$, $\pi_Y$; combining them gives $D_{TN93}$.
     [Equations for $\alpha_1 t$, $\alpha_2 t$, $\beta t$, and $D_{TN93}$.]
