De Novo Motif Discovery with Gibbs Sampler Tool by Xuhua Xia

motif identification with gibbs sampler a tool n.w
1 / 35
Embed
Share

"Explore motif identification using Gibbs Sampler, a de novo motif discovery tool developed by Xuhua Xia. Learn about its applications in molecular biology and sequence analysis. Discover true motifs we want to find and understand their characteristics. Delve into the background of Gibbs, a pioneer in Monte Carlo algorithms."

  • Motif Identification
  • Gibbs Sampler
  • De Novo Discovery
  • Molecular Biology
  • Sequence Analysis

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Motif identification with Gibbs Sampler A tool for de novo motif discovery Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca Molecular biology+BLAST/FASTA+PWM

  2. Background uofologo uofologo Named after Josiah Willard Gibbs (February 11, 1839 April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901. One of Monte Carlo algorithms Biological applications Identification of regulatory sequences of genes and functional motifs in proteins Pairwise sequence alignment and multiple sequence alignment. Classification of biological images 2

  3. True motifs we want to find uofologo uofologo SNC1 GTAAGTACAGAAAGCCACAGAGTACCATCTAGGAAATTAACATTATACTAACTTTCTACATCGTTGATACTTATGCGTATACATTCATATA... EFB1 GTATGTTCCGATTTAGTTTACTTTATAGATCGTTGTTTTTCTTTCTTTTTTTTTTTTCCTATGGTTACATGTAAAGGGAAGTTAACTAATA... TFC3 GTATGTTCATGTCTCATTCTCCTTTTCGGCTCCGTTTAGGTGATAAACGTACTATATTGTGAAAGATTATTTACTAACGACACATTGAAG YBL111C GCATGTGTGCTGCCCAAGTTGAGAAGAGATACTAACAAAATGACCGCGGCTCTCAAAAATAATTGACGAGCTTACGGTGATACGCTTACCG... SCS22 GTATGTTTGACGAGAATTGCTAGTGTGCGGGAAACTTTGCTACCTTTTTTGGTGCGATGCAACAGGTTACTAATATGTAATACTTCAG RPL23A GTATGTTAAAATTTTTATTTTCCACAATGCAATTTGGTTAAATTGATCATAAAGTAAAGTTCCAAGATTTCATTTTGCTGGGTACAACAGA... YBL059C-A GTAAGTATCCAGATTTTACTTCATATATTTGCCTTTTTCTGTGCTCCGACTTACTAACATTGTATTCTCCCCTTCTTCATTTTAG YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG SEC17 GTATGTAGTAGGGAAATATATCAAAGGAACAAAATGAAAGCTATGTGATTCCGTAATTTACGAAGGCAAATTACTAACATTGAAATACGGG... ERD2 GTATGTTACTATTTGGAGTTTCATGAGGCTTTTCCCGCCGTAGATCGAACCCAATCTTACTAACAGAGAAAGGGCTTTTTCCCGACCATCA... RPL19B GTATGTTTAACAGTGATACTAAATTTTGAACCTTTCACAAGATTTATCTTTAAATATGTTATGAATGTCATCCTTTGGAGAGAAATAGATA... LSM2 GTATGTTCATAATGATTTACATCGGAATTCCCTTTGATACAAGAAAACTAACGGGTATCGTACATCAATTTTTGAAAAAAGTCAAGTACTA... POP8 GTATGTATATTTTTGACTTTTTGAGTCTCAACTACCGAAGAGAAATAAACTACTAACGTACTTTAATATTTATAG RPS11B GTATGAAAGAATTATAACCTGAATGAGGTAATCAATGAAATATTCAGTACGGAAAGGAAAATTGCTCGAGGTAATATTATAATTTTAATGG... ...... Conventional gene name vs systematic name Introns and splicing signals: 5'ss, 3'ss, BPS Binding sites of transcription factors. 3

  4. uofologo Where are they & what they are like? uofologo SNC1 GTAAGTACAGAAAGCCACAGAGTACCATCTAGGAAATTAACATTATACTAACTTTCTACATCGTTGATACTTATGCGTATACATTCATATA... EFB1 GTATGTTCCGATTTAGTTTACTTTATAGATCGTTGTTTTTCTTTCTTTTTTTTTTTTCCTATGGTTACATGTAAAGGGAAGTTAACTAATA... TFC3 GTATGTTCATGTCTCATTCTCCTTTTCGGCTCCGTTTAGGTGATAAACGTACTATATTGTGAAAGATTATTTACTAACGACACATTGAAG YBL111C GCATGTGTGCTGCCCAAGTTGAGAAGAGATACTAACAAAATGACCGCGGCTCTCAAAAATAATTGACGAGCTTACGGTGATACGCTTACCG... SCS22 GTATGTTTGACGAGAATTGCTAGTGTGCGGGAAACTTTGCTACCTTTTTTGGTGCGATGCAACAGGTTACTAATATGTAATACTTCAG RPL23A GTATGTTAAAATTTTTATTTTCCACAATGCAATTTGGTTAAATTGATCATAAAGTAAAGTTCCAAGATTTCATTTTGCTGGGTACAACAGA... YBL059C-A GTAAGTATCCAGATTTTACTTCATATATTTGCCTTTTTCTGTGCTCCGACTTACTAACATTGTATTCTCCCCTTCTTCATTTTAG YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG SEC17 GTATGTAGTAGGGAAATATATCAAAGGAACAAAATGAAAGCTATGTGATTCCGTAATTTACGAAGGCAAATTACTAACATTGAAATACGGG... ERD2 GTATGTTACTATTTGGAGTTTCATGAGGCTTTTCCCGCCGTAGATCGAACCCAATCTTACTAACAGAGAAAGGGCTTTTTCCCGACCATCA... RPL19B GTATGTTTAACAGTGATACTAAATTTTGAACCTTTCACAAGATTTATCTTTAAATATGTTATGAATGTCATCCTTTGGAGAGAAATAGATA... LSM2 GTATGTTCATAATGATTTACATCGGAATTCCCTTTGATACAAGAAAACTAACGGGTATCGTACATCAATTTTTGAAAAAAGTCAAGTACTA... POP8 GTATGTATATTTTTGACTTTTTGAGTCTCAACTACCGAAGAGAAATAAACTACTAACGTACTTTAATATTTATAG RPS11B GTATGAAAGAATTATAACCTGAATGAGGTAATCAATGAAATATTCAGTACGGAAAGGAAAATTGCTCGAGGTAATATTATAATTTTAATGG... ...... ...... Xuhua Xia 4

  5. Variables uofologo uofologo Numeric Continuous: Body height E-value (0, ) nucleotide Frequency: (0,1) Discrete: number of student, sequence length Categorical: Gender: male, female, ... Canadian party affiliation: NDP, Liberal, Conservertive, None Amino acid: A, C, D, ..., V Dinucleotide: AA, AC, ..., TT Triplet: AAA, AAC, ..., TTT. Motif of length 7: ACCGATC, TACGCGT, ... 5

  6. Motif variable and values uofologo uofologo 12345678901234567890123456789012345678901234567890123456789012345678901234567890 GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG GGACUGGCUGGGC GACUGGCUGGGCG ACUGGCUGGGCGA ............. For a sequence of length 80 and a motif length of 13, the values of the motif variable are GGACUGGCUGGGC, GACUGGCUGGGCG, ACUGGCUGGGCGA, ... There are (80-13+1)=68 site-specific 13mers. These are possible values a motif (variable) can take Given a sequence of length M and a motif length of L, the number of values a motif variable can take is (M- L+1) Effective length (EL) 6

  7. What we want uofologo uofologo SNC1 GTAAGTACAGAAAGCCACAGAGTACCATCTAGGAAATTAACATTATACTAACTTTCTACATCGTTGATACTTATGCGTATACATTCATATA... EFB1 GTATGTTCCGATTTAGTTTACTTTATAGATCGTTGTTTTTCTTTCTTTTTTTTTTTTCCTATGGTTACATGTAAAGGGAAGTTAACTAATA... TFC3 GTATGTTCATGTCTCATTCTCCTTTTCGGCTCCGTTTAGGTGATAAACGTACTATATTGTGAAAGATTATTTACTAACGACACATTGAAG YBL111C GCATGTGTGCTGCCCAAGTTGAGAAGAGATACTAACAAAATGACCGCGGCTCTCAAAAATAATTGACGAGCTTACGGTGATACGCTTACCG... SCS22 GTATGTTTGACGAGAATTGCTAGTGTGCGGGAAACTTTGCTACCTTTTTTGGTGCGATGCAACAGGTTACTAATATGTAATACTTCAG RPL23A GTATGTTAAAATTTTTATTTTCCACAATGCAATTTGGTTAAATTGATCATAAAGTAAAGTTCCAAGATTTCATTTTGCTGGGTACAACAGA... YBL059C-A GTAAGTATCCAGATTTTACTTCATATATTTGCCTTTTTCTGTGCTCCGACTTACTAACATTGTATTCTCCCCTTCTTCATTTTAG YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG SEC17 GTATGTAGTAGGGAAATATATCAAAGGAACAAAATGAAAGCTATGTGATTCCGTAATTTACGAAGGCAAATTACTAACATTGAAATACGGG... ERD2 GTATGTTACTATTTGGAGTTTCATGAGGCTTTTCCCGCCGTAGATCGAACCCAATCTTACTAACAGAGAAAGGGCTTTTTCCCGACCATCA... RPL19B GTATGTTTAACAGTGATACTAAATTTTGAACCTTTCACAAGATTTATCTTTAAATATGTTATGAATGTCATCCTTTGGAGAGAAATAGATA... LSM2 GTATGTTCATAATGATTTACATCGGAATTCCCTTTGATACAAGAAAACTAACGGGTATCGTACATCAATTTTTGAAAAAAGTCAAGTACTA... POP8 GTATGTATATTTTTGACTTTTTGAGTCTCAACTACCGAAGAGAAATAAACTACTAACGTACTTTAATATTTATAG RPS11B GTATGAAAGAATTATAACCTGAATGAGGTAATCAATGAAATATTCAGTACGGAAAGGAAAATTGCTCGAGGTAATATTATAATTTTAATGG... ...... Suppose we have 100 intron sequences with length M1, M2, ..., M100, and we wish to find if there is a shared motif of length 7 in each sequence Intron 1 has (M1 7 + 1) 7mers EL1 Intron 2 has (M2 7 + 1) 7mers EL2 ... If we randomly take one 7mer from each sequence, then we have a sample of 100 7mer There are EL1*EL2*...*EL100 such samples. Only one sample contains all true 7mer motifs. How to find this particular sample containing the desired set of motifs? 7

  8. True motifs we want to find uofologo uofologo SNC1 GTAAGTACAGAAAGCCACAGAGTACCATCTAGGAAATTAACATTATACTAACTTTCTACATCGTTGATACTTATGCGTATACATTCATATA... EFB1 GTATGTTCCGATTTAGTTTACTTTATAGATCGTTGTTTTTCTTTCTTTTTTTTTTTTCCTATGGTTACATGTAAAGGGAAGTTAACTAATA... TFC3 GTATGTTCATGTCTCATTCTCCTTTTCGGCTCCGTTTAGGTGATAAACGTACTATATTGTGAAAGATTATTTACTAACGACACATTGAAG YBL111C GCATGTGTGCTGCCCAAGTTGAGAAGAGATACTAACAAAATGACCGCGGCTCTCAAAAATAATTGACGAGCTTACGGTGATACGCTTACCG... SCS22 GTATGTTTGACGAGAATTGCTAGTGTGCGGGAAACTTTGCTACCTTTTTTGGTGCGATGCAACAGGTTACTAATATGTAATACTTCAG RPL23A GTATGTTAAAATTTTTATTTTCCACAATGCAATTTGGTTAAATTGATCATAAAGTAAAGTTCCAAGATTTCATTTTGCTGGGTACAACAGA... YBL059C-A GTAAGTATCCAGATTTTACTTCATATATTTGCCTTTTTCTGTGCTCCGACTTACTAACATTGTATTCTCCCCTTCTTCATTTTAG YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG SEC17 GTATGTAGTAGGGAAATATATCAAAGGAACAAAATGAAAGCTATGTGATTCCGTAATTTACGAAGGCAAATTACTAACATTGAAATACGGG... ERD2 GTATGTTACTATTTGGAGTTTCATGAGGCTTTTCCCGCCGTAGATCGAACCCAATCTTACTAACAGAGAAAGGGCTTTTTCCCGACCATCA... RPL19B GTATGTTTAACAGTGATACTAAATTTTGAACCTTTCACAAGATTTATCTTTAAATATGTTATGAATGTCATCCTTTGGAGAGAAATAGATA... LSM2 GTATGTTCATAATGATTTACATCGGAATTCCCTTTGATACAAGAAAACTAACGGGTATCGTACATCAATTTTTGAAAAAAGTCAAGTACTA... POP8 GTATGTATATTTTTGACTTTTTGAGTCTCAACTACCGAAGAGAAATAAACTACTAACGTACTTTAATATTTATAG RPS11B GTATGAAAGAATTATAACCTGAATGAGGTAATCAATGAAATATTCAGTACGGAAAGGAAAATTGCTCGAGGTAATATTATAATTTTAATGG... ...... The sample that is interesting SNC1 TACTAAC TFC3 TACTAAC YBL111C TACTAAC SCS22 TACTAAT ...... ...... The sample that is boring: SNC1 AGAAAGC TFC3 TTCTCCT YBL111C GGCTCTC SCS22 AGAATTG A criterion is needed to keep the desired sample and discard the boring sample. Scan new sequences for new intron signals. PWM 8

  9. Criterion: Objective function uofologo uofologo p p N m Code = ij log F C 2 ij = = 1 1 i j i Kullback-Leibler information Kullback Leibler divergence Large-deviation rate function Information content YBL059W ...TACTAAC... YCL012C ...TACTAAC... HMRA1_1 ...TACTAAC... HMRA1_2 ...TACTAAC... TAN1 ...TACTAAC... AML1_2 ...TACTAAC... QCR10 ...TACTAAC... YHR199C-A ...TACTAAC... YIL156W-B ...TACTAAC... DID4 ...TACTAAC... YLR211C ...GACTAAC... TAD3_1 ...AACTAAC... TAD3_2 ...AACTAAC... YOL047C ...TACTAAC... Xuhua Xia 9

  10. Criterion: Objective function uofologo uofologo 1 2 0 1 2 3 0 4 0 0 0 5 6 7 0 p p N m Code = A C G T 14 14 14 ij log F C 0 0 0 14 0 0 0 0 0 0 14 2 ij = = 1 1 i j i 0 0 0 0 11 14 14 14 14 14 14 14 14 pij pi 0 0.3279 1 0.1915 0 0.2043 0 0.2763 A 0.1429 C G 0.0714 T 0.7857 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 YBL059W ...TACTAAC... YCL012C ...TACTAAC... HMRA1_1 ...TACTAAC... HMRA1_2 ...TACTAAC... TAN1 ...TACTAAC... AML1_2 ...TACTAAC... QCR10 ...TACTAAC... YHR199C-A ...TACTAAC... YIL156W-B ...TACTAAC... DID4 ...TACTAAC... YLR211C ...GACTAAC... TAD3_1 ...AACTAAC... TAD3_2 ...AACTAAC... YOL047C ...TACTAAC... A -2.4011 22.4650 0 0 22.4650 22.4650 0 C G -1.5181 T 16.5332 0 0 33.2840 0 0 0 0 0 0 0 0 33.2840 0 0 0 0 25.9118 0 0 F = 172.4886 We need to add pseudocounts to avoid taking logarithm of 0. Instead of using pij/pi, we may use (pij+0.00025)/(pi+0.001) Xuhua Xia 10

  11. Criterion: Objective function uofologo uofologo 1 2 0 1 2 3 0 4 0 0 0 5 6 7 0 p p N m Code = A C G T 14 14 14 ij log F C 0 0 0 14 0 0 0 0 0 0 14 2 ij = = 1 1 i j i 0 0 0 0 11 14 14 14 ??1 ?? 0.1429 0.3279 14 14 14 14 14 ??1log2 = 2log2 pij pi 0 0.3279 1 0.1915 0 0.2043 0 0.2763 A 0.1429 C G 0.0714 T 0.7857 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 YBL059W ...TACTAAC... YCL012C ...TACTAAC... HMRA1_1 ...TACTAAC... HMRA1_2 ...TACTAAC... TAN1 ...TACTAAC... AML1_2 ...TACTAAC... QCR10 ...TACTAAC... YHR199C-A ...TACTAAC... YIL156W-B ...TACTAAC... DID4 ...TACTAAC... YLR211C ...GACTAAC... TAD3_1 ...AACTAAC... TAD3_2 ...AACTAAC... YOL047C ...TACTAAC... A -2.4011 22.4650 0 0 22.4650 22.4650 0 C G -1.5181 T 16.5332 0 0 33.2840 0 0 0 0 0 0 0 0 33.2840 0 0 0 0 25.9118 0 0 F = 172.4886 We need to add pseudocounts to avoid taking logarithm of 0. Instead of using pij/pi, we may use (pij+0.00025)/(pi+0.001) Xuhua Xia 11

  12. Calculation of F uofologo uofologo ????? ? ??? ?? ? = ???log2 ?=1 ?=1 Cij 1 2 9 3 7 6 4 5 1 2 1 1 1 3 1 1 2 4 5 1 A C G T 12 10 13 10 10 35 18 10 11 10 40 8 1 2 2 2 19 8 14 13 40 7 12 10 40 18 1 10 40 10 40 37 40 36 40 2 19 40 40 40 pi and pij pi pi1 pi2 pi3 pi4 pi5 pi1 pi2 pi3 pi4 pi5 A 0.25 C 0.25 G 0.25 T 0.25 0.300 0.250 0.200 0.250 0.225 0.250 0.275 0.250 0.175 0.150 0.350 0.325 0.325 0.250 0.175 0.250 0.250 0.200 0.300 0.250 0.875 0.025 0.050 0.050 0.025 0.025 0.025 0.925 0.025 0.025 0.050 0.900 0.450 0.050 0.450 0.050 0.025 0.475 0.025 0.475 Fi1 Fi2 Fi3 Fi4 Fi5 Fi1 Fi2 Fi3 Fi4 Fi5 Cijlog2(pij/pi) A C G T 3.156 -1.368 -3.602 0.000 0.000 -4.422 -2.575 1.513 0.000 0.000 4.921 0.000 -2.575 0.000 63.257 -3.322 -4.644 -4.644 69.838 66.528 -3.322 -3.322 -3.322 -3.322 15.264 -3.322 -4.644 15.264 -3.322 17.594 -3.322 17.594 215.545 -4.644 6.796 -3.602 4.921 F = 3.156 0.000 6.318 0.000 -4.644 F =

  13. Rationale of Gibbs sampler uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG YHR199C-A GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG YIL156W-B GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG DID4 GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG YLR211C GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG TAD3_1 GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG TAD3_2 GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG YOL047C GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG Name YBL059W YCL012C HMRA1_1 HMRA1_2 TAN1 AML1_2 QCR10 Objective to achieve: find the k-mers from N sequences that maximize F Homework: Compile the red 6mers in the same way as in the previous slide and compute F, with pij/pi replaced by (pij+0.0001)/(pi+0.0004). Use pi = 0.25 as background frequencies. p p N m Code = ij log F C 2 ij = = 1 1 i j i

  14. A function of two variables uofologo uofologo Conventional way of finding the maximum of a function Z = F(x,y): Take partial derivatives of Z with respect to x and y, set the two partial derivatives to 0 and solve the simultaneous equations. We do not have a continuously distributed function to take partial derivatives from. 14

  15. Step 1: A random sample of 6mers uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG YHR199C-A GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG YIL156W-B GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG DID4 GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG YLR211C GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG TAD3_1 GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG TAD3_2 GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG YOL047C GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG Name YBL059W YCL012C HMRA1_1 HMRA1_2 TAN1 AML1_2 QCR10 Randomly sample a 6mer from each sequence: Ai: 27, 44, 30, 35, 1, 37, 7, 8, 35, 18, 32, 42, 46, 50 ( M-6+1) Progressively pick better 6mers to replace the current ones until no better 6mers to pick. The final 6mers represent a motif. How do I know which 6mer is better than an existing one in each sequence? Relevant review questions How many possible 6mers in YML059W with sequence length of 69? What information is needed to compute a PWM? Given a PWM, how to compute PWMS?

  16. Step 2: count Cij and Ci uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG YHR199C-A GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG YIL156W-B GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG DID4 GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG YLR211C GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG TAD3_1 GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG TAD3_2 GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG YOL047C GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG Name YBL059W YCL012C HMRA1_1 HMRA1_2 TAN1 AML1_2 QCR10 Ci1 4 2 2 6 Ci2 5 2 2 5 Ci3 4 3 1 6 Ci4 5 5 1 3 Ci5 3 4 2 5 Ci6 6 2 0 6 Ci 274 130 127 253 A C G T

  17. Step 3: Predictive update uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG Name YBL059W YCL012C YOL047C Ci1 4 2 2 6 Ci2 5 2 2 5 Ci3 4 3 1 6 Ci4 5 5 1 3 Ci5 3 4 2 5 Ci6 6 2 0 6 Ci 274 130 127 253 A C G T Take CATACT in YBL059W out of Cij and put into Ci Ci1 4 1 2 6 13 Ci2 4 2 2 5 13 Ci3 4 3 1 5 13 Ci4 4 5 1 3 13 Ci5 3 3 2 5 13 Ci6 6 2 0 5 13 Ci 276 132 127 255 790 A C G T sum PWM pij = Cij/13 + 0.0001 pi = Ci/790 + 0.0004 PWMij = log2(pij/pi) PWMS for a 6mer. Pseudocount to avoid log2(0) A -0.1844 -0.1844 -0.1844 -0.1844 -0.5993 0.4004 C -1.1207 -0.1216 0.4630 1.1997 0.4630 -0.1216 G -0.0661 -0.0661 -1.0651 -1.0651 -0.0661-10.6543 T 0.5144 0.2514 0.2514 -0.4853 0.2514 0.2514

  18. Step 3: Predictive update uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG GTATGC PWMS TATGCA PWMS ATGCAT PWMS ...... Name YBL059W PWM A -0.1844 -0.1844 -0.1844 -0.1844 -0.5993 0.4004 C -1.1207 -0.1216 0.4630 1.1997 0.4630 -0.1216 G -0.0661 -0.0661 -1.0651 -1.0651 -0.0661-10.6543 T 0.5144 0.2514 0.2514 -0.4853 0.2514 0.2514

  19. Step 3: Predictive update uofologo uofologo YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG Pr ? ?? Pr ? ?0 Site 1 2 3 4 27 45 64 6mer GTATGC TATGCA ATGCAT TGCATA CATACT TTACTA ACATAG PWMS OddsRatio -0.672 0.380 -0.146 1.379 -0.524 2.433 -12.229 Pnorm 0.005 0.015 0.009 0.040 ???? = log2 0.628 1.301 0.904 2.601 Pr ? ?? Pr ? ?0 = 2???? ???? ????? = ???? ????? (???? ?????) ?????= 0.006 Originally picked 0.695 Largest odds ratio, the 6-mer to replace the originally picked 0.116 5.400 0.000 0.000 69 6 + 1 = 64 Two ways to pick a new 6mer: 1. The one with the largest PWMS 2. Imagine a dart-board with target areas proportional to Pnorm. Randomly throw a dart (using a computer of course) and pick up the corresponding L-mer. Note that large Pnorm value will have a high chance of being hit). Take the new 6mer out of Ci and put into Cij, and perform predictive update for 2nd sequence.

  20. Step 3: Predictive update uofologo uofologo YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG Pr(?|??) Pr ? ?0 Site 1 2 3 4 27 45 64 6mer GTATGC TATGCA ATGCAT TGCATA CATACT TTACTA ACATAG PWMS OddsRatio -0.672 0.380 -0.146 1.379 -0.524 2.433 -12.229 Pnorm 0.005 0.015 0.009 0.040 ???? = log2 0.628 1.301 0.904 2.601 Pr(?|??) Pr ? ?0 = 2???? ???? ????? = ???? ????? (???? ?????) ?????= 0.006 Originally picked 0.695 Largest odds ratio, the 6-mer to replace the originally picked 0.116 5.400 0.000 0.000 69 6 + 1 = 64 Two ways to pick a new 6mer: Counts with the new 6mer (TTACTA) added to Cij 1. The one with the largest PWMS A C G T 4 1 2 7 4 2 2 6 5 3 1 5 4 6 1 3 3 3 2 6 7 2 0 5 274 131 127 252 784 2. Imagine a dart-board with target areas proportional to Pnorm. Randomly throw a dart (using a computer of course) and pick up the corresponding L-mer. Note that large Pnorm value will have a high chance of being hit). 14 14 14 14 14 14 Take the new 6mer out of Ci and put into Cij, and perform predictive update for 2nd sequence.

  21. Predictive update summary uofologo uofologo YCL012C GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG A C G T 4 1 2 7 4 2 2 6 5 3 1 5 4 6 1 3 3 3 2 6 7 2 0 5 274 131 127 252 784 Compute F1. Repeat updating from 1st sequence to the last sequence until F does not increase. 14 14 14 14 14 14 Remove the original 6mer (TCAAAT) from the count matrix Compute a new PWM to scan the next sequence, and so on with 3rd , 4th ,..., last sequence A C G T 4 1 2 6 4 1 2 6 7 2 0 4 277 132 127 254 790 4 3 1 5 3 6 1 3 2 3 2 6 14 14 14 14 14 14 Pick the winning 6mer (with the largest PWMS) to update the counting matrix. Scan all 6mers in the 2nd sequence (M= 67): 62 PWMS for 62 6mers PWM Xuhua Xia 21

  22. One complete cycle uofologo uofologo Randomly pick up a set of 6mers, one from each sequence. Perform predictive update For each sequence, take the current 6mer out of Cij and put into Ci. Generate a PWM Scan with the current PWM to pick up a new 6mer. Use the new 6mer to replace the original to obtain a new count matrix Continue with the next sequence. Each time we updated sequences 1 to N, we compute F If a new F is not greater than a previous F, stop and record the final F as F1.Max. Because it is faster to compute natural logarithm than logarithm of base 2 in most programming languages, PWM entries often use ln instead of log2. Consequent, the odds ratio would be ePWMS instead of 2PWMS. Pr ? ?? Pr ? ?0 ;Odds ratio = ePWMS ???? = ln Xuhua Xia 22

  23. Multiple cycles uofologo uofologo After completing the first cycle and having recorded F as F1.Max, we restart from the very beginning by randomly generating a new set of 6mer and go through the same process as the first cycle. Upon completing the second cycle, we record the final F as F2.Max. Stop the Gibbs sampler (typically) after two new Fi.Max values are not greater than what we got before. Report results associated with the largest Fi.Max: PWM, and PWMS for each final 6mer The aligned motifs Associated statistics Xuhua Xia 23

  24. uofologo uofologo Summary: To find a motif of length L Randomly pick up an L-mer from each sequence. Record the position of these L-mers as Ai() Obtain site-specific counts of the four nucleotides (or 20 aa) from the L-mers and global counts from the sequences excluding the L-mers. Record as Cij()and Ci(). Predictive update: This is the key step that would pick better L-mers Take the L-mer from S1 out of the Cij() and put it into Ci(). Generate a PWM from Cij() and Ci() Use PWM to scan S1 to obtain PWMS and Odds-ratio from all L-mers. Use one of two methods to choose the new L-mer L-mer with the highest PWMS Use the odds-ratios to compute Pnorm. Imagine a dart-board with target areas proportional to Pnorm. Randomly throw a dart and pick up the corresponding L-mer. Note that large Pnorm value will have a high chance of being hit). Take this new L-mer out of Ci() and into Cij(). Repeat this with S2, S3, , SN. Compute F Repeat this with S1, S2, , SN. Compute F until F does not increase. Record the final F as F1.Max. Start from the very beginning, i.e., "Randomly pick up " and record the final F as F2.Max. Stop after typically 2 rounds of iterations that do not yield larger Fi.Max. Report the results associated with the largest Fi.Max. PWM and PWMS for each final 6mer The aligned motifs Associated statistics If you suspect that there might be other motifs, mask the found motif and run Gibbs sampler again.

  25. Final Report: Final Frequency uofologo uofologo Final site-specific frequencies: A C G U 1 0.03687 0.03630 0.05062 0.87621 2 0.97973 0.00059 0.01848 0.00121 3 0.00116 0.97201 0.00419 0.02264 4 0.00116 0.00059 0.00062 0.99764 5 0.98687 0.00059 0.01133 0.00121 6 0.99759 0.00059 0.00062 0.00121 7 0.00116 0.97916 0.00776 0.01192 Q0 0.32059 0.16029 0.17864 0.34049 Code Count Freq ------------------------------- A 21314 0.3238 C 10793 0.1640 G 11434 0.1737 U 22279 0.3385 Final site-specific counts: Site A C G U 1 10 10 14 245 2 274 0 5 0 3 0 272 1 6 4 0 0 0 279 5 276 0 3 0 6 279 0 0 0 7 0 274 2 3 Final PWM [log2(Qij/Q0)]: A C G U 1 -3.12017 -2.14261 -1.81924 1.36367 2 1.61166 -8.09644 -3.27318 -8.13780 3 -8.11480 2.60033 -5.41330 -3.91082 4 -8.11480 -8.09644 -8.16958 1.55091 5 1.62214 -8.09644 -3.97821 -8.13780 6 1.63772 -8.09644 -8.16958 -8.13780 7 -8.11480 2.61089 -4.52422 -4.83577 Optimal BPS: UACUAAC The effect of background frequencies (contrast the red values in the red box and green values in the green box) Xuhua Xia 25

  26. Motif alignment uofologo uofologo

  27. Motif scores uofologo uofologo SeqName Motif Start PWMS SNC1 UACUAAC 45 12.9973 EFB1 UACUAAC 326 12.9973 TFC3 UACUAAC 71 12.9973 YBL111C UACUAAC 29 12.9973 SCS22 UACUAAU 67 5.5507 RPL23A UACUAAC 454 12.9973 YBL059C-A UACUAAC 51 12.9973 YBL059W UACUAAC 45 12.9973 SEC17 UACUAAC 71 12.9973 ERD2 UACUAAC 57 12.9973 RPL19B UACUAAC 337 12.9973 LSM2 UACUAAC 86 12.9973 POP8 UACUAAC 50 12.9973 RPS11B UACUAAC 486 12.9973 YBR062C AACUAAC 43 8.5135 ECM33 UACUAAC 300 12.9973 UBC4 UACUAAC 64 12.9973 RPL19A AACUAAC 486 8.5135 YBR090C UACUAAC 338 12.9973 SUS1 UACUGAC 46 7.3970 ...... Xuhua Xia 27

  28. Site sampler and motif sampler uofologo uofologo 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG YCL012C GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG HMRA1_1 GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG HMRA1_2 GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG TAN1 GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG AML1_2 GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG QCR10 GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG YHR199C-A GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG YIL156W-B GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG DID4 GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG YLR211C GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG TAD3_1 GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG TAD3_2 GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG YOL047C GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG 1 2 3 4 5 6 123456789012345678901234567890123456789012345678901234567890123456789 Name Name YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG YCL012C GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG HMRA1_1 GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG HMRA1_2 GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG TAN1 GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG AML1_2 GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG QCR10 GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG YHR199C-A GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG YIL156W-B GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG DID4 GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG YLR211C GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG TAD3_1 GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG TAD3_2 GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG YOL047C GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG Motif sampler Site sampler 28

  29. Motif sampler output SeqName N 1 2 3 Seq1 2 10(TTATAA,93.4541) 18(TTATCA,163.6602) Seq2 1 22(CGGTCA,14.5511) Seq3 1 14(CTATCA,101.8203) Seq4 0 Seq5 1 16(TGATTA,12.9266) Seq6 1 18(CTATCT,90.7790) Seq7 1 20(TTATCA,163.6602) Seq8 2 2(TTATCA,163.6602) 24(CCATCA,10.2098) Seq9 1 17(CTATAA,58.1420) Seq10 3 14(CTATCT,90.7790) 28(ATATCT,41.4438) 32(CTGTCT,37.7888) Seq11 1 21(TGGTCA,23.3886) Seq12 2 3(TGGTCA,23.3886) 33(TTGTAA,38.9024) Seq13 1 20(TTATCT,145.9129) Seq14 1 2(TTATCT,145.9129) Seq15 3 1(TTATTT,33.5700) 10(TTATCA,163.6602) 36(TTCTCT,17.7407) Seq25 1 17(TTATCT,145.9129) Seq26 1 15(CTATCG,21.2368) Seq27 3 19(TTATCA,163.6602) 25(CTTTCT,13.3635) 32(TTATCA,163.6602) Seq28 1 15(CTATCT,90.7790) Xuhua Xia Seq29 2 2(UUGUCA,68.1272) 15(TGATAA,32.0835) 29

  30. Heuristic and exhaustive search uofologo uofologo There are different approaches in search for the best solution Exhaustive Branch and bound (N/A for Gibbs sampler) Heuristic/Greedy: BLAST/FASTA, Gibbs sampler A trade-off between accuracy and speed 30

  31. Illustration uofologo uofologo 31

  32. Exhaustive search for motifs uofologo uofologo Suppose an extremely small problem A set of 3 sequences with lengths 6, 6, and 5, respectively: S1 ACAGTT S2 CCAGTC S3 GCAAT You want to find a motif of length 4 The effective length of the sequences are 3, 3, and 2, respectively, which are the number of 4mers in each sequence: S1: ACAG, CAGT, AGTT S2: CCAG, CAGT, AGTC S3: GCAA, CAAT A set of three 4mer is one sample, we have 3*3*2 samples 32

  33. Exhaustive search uofologo uofologo All 18 possible samples S1: ACAG, CAGT, AGTT S11, S12, S13 S2: CCAG, CAGT, AGTC S21, S22, S23 S3: GCAA, CAAT S31,S32 S12 CAGT S21 CCAG S31 GCAA S13 AGTT S21 CCAG S31 GCAA S11 ACAG S21 CCAG S31 GCAA S12 CAGT S21 CCAG S32 CAAT S13 AGTT S21 CCAG S32 CAAT S11 ACAG S21 CCAG S32 CAAT A total of 18 (=3*3*2) samples. Find the sample with the strongest motif. S12 CAGT S22 CAGT S31 GCAA S13 AGTT S22 CAGT S31 GCAA S11 ACAG S22 CAGT S31 GCAA The red sample contains the strongest motif. S12 CAGT S22 CAGT S32 CAAT S13 AGTT S22 CAGT S32 CAAT S11 ACAG S22 CAGT S32 CAAT S12 CAGT S23 AGTC S31 GCAA S13 AGTT S23 AGTC S31 GCAA S11 ACAG S23 AGTC S31 GCAA S12 CAGT S23 AGCT S32 CAAT S13 AGTT S23 AGCT S32 CAAT S11 ACAG S23 AGCT S32 CAAT 33

  34. Exhaustive search uofologo uofologo The number of samples increases rapidly with number of sequences (n) and sequence length (M) Suppose M1, M2, ...Mn, and motif length 6. We take one 6mer from each sequence for a sample of n 6mers. There are (?? ? + 1) possible sets of n 6mers. We evaluate all them of to find one set with the highest F. Running time is O(Mn) If you have 200 sequences with sequences of ~100 nt, then you have 100200 samples which is already a huge number. In practical biological problems, you typically have longer and more sequences. Gibbs sampler is not an exhaustive search algorithm. It is heuristic but typically produces a satisfying solution. 34

  35. Objectives of this lecture uofologo uofologo Understand the algorithm of Gibbs sampler and its applications The Monte Carlo sampling approach is a way of thinking. This way of thinking can be applied to solving many different scientific problems. Understand that PWM is used as a component in Gibbs sampler and that all these algorithms can be used as building blocks of more complicated bioinformatics tools Recognize that we almost always have to make compromise between accuracy and speed. Popularity of bioinformatics tools depends mainly on its practical utility. 35

Related


More Related Content