
Understanding Profile Hidden Markov Models (PHMM) in Bioinformatics
Learn about Profile Hidden Markov Models (PHMM), a powerful tool in bioinformatics that overcomes limitations of Hidden Markov Models (HMM). PHMM is used for aligning related sequences, generating models, and scoring sequences efficiently.
Presentation Transcript
A Full Frontal View of Profile Hidden Markov Models, by Mark Stamp.
Hidden Markov Models. Here, we assume you know about HMMs; if not, see A Revealing Introduction to Hidden Markov Models. Executive summary of HMMs: an HMM is a machine learning technique (and a discrete hill climb); we train a model based on an observation sequence, and can then score any given sequence to see how closely it matches the model. There are efficient algorithms and many useful applications.
HMM Notation. Recall that an HMM is denoted λ = (A, B, π), and the observation sequence is O. The notation is summarized below.
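For reference, the standard notation (roughly following Stamp's HMM tutorial) is:

```latex
\begin{align*}
T &= \text{length of the observation sequence} \\
N &= \text{number of states in the model} \\
M &= \text{number of distinct observation symbols} \\
Q &= \{q_0, q_1, \ldots, q_{N-1}\} = \text{the hidden states of the model} \\
V &= \{0, 1, \ldots, M-1\} = \text{the set of possible observations} \\
A &= \text{state transition probabilities } (N \times N) \\
B &= \text{observation probability matrix } (N \times M) \\
\pi &= \text{initial state distribution} \\
O &= (O_0, O_1, \ldots, O_{T-1}) = \text{the observation sequence}
\end{align*}
```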
Hidden Markov Models. Among the many uses for HMMs: speech recognition and analysis, music search engines, malware detection, intrusion detection systems (IDS), and more all the time.
Limitations of HMMs. Positional information is not considered: an HMM doesn't care where it is in the sequence and makes no explicit use of positional information. With an HMM there are no insertions or deletions, only the overall statistical properties of O. These are serious limitations in some applications, e.g., bioinformatics sequence comparison, where alignment and insertions/deletions are critical.
Profile HMM. A profile HMM (PHMM) is designed to overcome the limitations on the previous slide. In some ways a PHMM is easier than an HMM; in some ways it is more complex. The basic idea of a PHMM? Define a B matrix for every position, which is almost like having a (very simple) HMM for each position in the sequence.
PHMM. In bioinformatics, we often start by aligning multiple related sequences, producing a multiple sequence alignment (MSA), which is comparable to the training phase for an HMM. We then generate a PHMM based on the MSA; this is easy once the MSA is known (again, the hard part is generating the MSA). We score sequences using the PHMM with the forward algorithm, similar to the HMM case.
Training: PHMM vs HMM. Training a PHMM: determining the MSA is challenging, while determining the PHMM matrices is easy. Training an HMM: appending the training sequences is trivial, while determining the HMM matrices is challenging. PHMM and HMM are, in this sense, almost opposites.
Generic View of PHMM. A PHMM has delete, insert, and match states. Match states correspond to HMM states; delete and insert states are new. Arrows are possible transitions, and each transition has a probability. Transition probabilities are in the A matrix, and emission probabilities are in the B matrices. In a PHMM, observations are called emissions, and match and insert states have emissions.
PHMM without Gaps. If there are no insert or delete states, the PHMM is simple: just a chain of match states, where Mi is the ith match state. This diagram ignores the B matrix (matrices) and the observations. Recall that in a PHMM there is a distinct B matrix for each match state.
PHMM with Insertions. If we also allow insert states, the diagram allows for multiple insertions between match states.
PHMM with Deletions. If instead we allow delete states, we obtain a diagram in which a deletion skips over the corresponding match state.
Generic View of PHMM. In the generic diagram, circles are delete states, diamonds are insert states, and squares are match states. Note the many possible transitions.
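As a minimal sketch of how these pieces fit together in code, assuming the generic match/insert/delete architecture described above (all names here are illustrative, not taken from the slides):

```python
# A minimal sketch of a PHMM data structure, following the generic
# architecture described above (match, insert, and delete states).
# All names here are illustrative, not taken from the slides.
from dataclasses import dataclass, field

@dataclass
class ProfileHMM:
    n_match: int          # number of match states M1..MN
    alphabet: list        # emission symbols
    # A: transition probabilities keyed by (from_state, to_state),
    # e.g. A[("M1", "M2")], A[("M1", "I1")], A[("M1", "D2")]
    A: dict = field(default_factory=dict)
    # E: emission probabilities for match and insert states,
    # e.g. E["M1"]["E"]; delete states emit nothing
    E: dict = field(default_factory=dict)

    def states(self):
        """Yield state names in the usual begin/match/insert/delete layout."""
        yield "begin"
        yield "I0"
        for j in range(1, self.n_match + 1):
            yield f"M{j}"
            yield f"I{j}"
            yield f"D{j}"
        yield "end"
```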
PHMM Notation. The notation used for PHMMs is summarized below.
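Roughly following the notation in Stamp's treatment of PHMMs (the exact symbols on the original slide may differ slightly):

```latex
\begin{align*}
X &= (x_1, x_2, \ldots, x_T) = \text{emitted (observed) sequence} \\
N &= \text{number of match states} \\
M_i,\ I_i,\ D_i &= \text{match, insert, and delete state } i \\
a_{M_i,M_{i+1}} &= \text{transition probability from } M_i \text{ to } M_{i+1}
  \ (\text{similarly } a_{M_i,I_i},\ a_{M_i,D_{i+1}}, \ldots) \\
e_{M_i}(k) &= \text{probability of emitting symbol } k \text{ at match state } M_i \\
q_k &= \text{background (random model) probability of symbol } k \\
\lambda &= (A, E, \pi) = \text{the PHMM}
\end{align*}
```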
PHMM. Match state probabilities are easily determined from the MSA: aMi,Mi+1 gives transitions between match states, and eMi(k) is the emission probability of symbol k at match state Mi. There are many other transition probabilities, for example aMi,Ii and aMi,Di+1, and there are emissions at all match and insert states. Remember, emission == observation.
Multiple Sequence Alignment. First we consider MSA construction, which is the difficult part. There are lots of ways to do this, and the best way depends on the specific problem. Then we construct the PHMM from the MSA; this is the easy part, and there is a standard way to generate a PHMM from an MSA. How do we score a sequence? With the forward algorithm, similar to the HMM case, but more complex due to the insert and delete states.
MSA. How do we construct an MSA? First, construct pairwise alignments; then combine these into an MSA. We allow gaps to be inserted to make better matches, but gaps tend to weaken PHMM scoring, so there is a tradeoff between the number of gaps (better match) and the strength of the score.
Global vs Local Alignment. In these pairwise alignment examples, '-' is a gap, '|' means the elements are aligned, and '*' marks omitted beginning/ending symbols.
Global vs Local Alignment. Global alignment is lossless, but gaps tend to proliferate, and they increase further when alignments are combined into an MSA. More gaps mean more random matches, and the model becomes less useful for scoring. Usually we want to consider local alignment, that is, omit the ends for a better alignment. For simplicity, we do global alignment in the examples presented here (and in the book).
Pairwise Alignment. We allow gaps when aligning. How do we score an alignment? Based on an n x n substitution matrix S, where n is the number of symbols. What algorithm(s) do we use to align sequences? Usually dynamic programming; sometimes an HMM is used; other approaches are possible. Local alignment? Additional issues arise.
Pairwise Alignment Example. There is a tradeoff between gaps and misaligned elements, which depends on the matrix S and the gap penalty function.
Substitution Matrix. For example, consider masquerade detection, that is, detecting an imposter using someone else's computer account. Consider 4 different operations: E == send email, G == play games, C == C programming, J == Java programming. How similar are these to each other?
Substitution Matrix. Consider the 4 operations E, G, C, and J, and a possible substitution matrix. Diagonal entries are exact matches and get high positive scores. Which other pairs are most similar? J and C, so substituting C for J gets a high score. Game playing and programming are very different, so substituting G for C gets a negative score. A hypothetical matrix consistent with this description is sketched below.
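The actual values are not given here, so the following matrix is purely illustrative: high scores on the diagonal, a positive score for the C/J pair, and negative scores for dissimilar pairs.

```python
# Hypothetical substitution matrix for the operations E, G, C, J.
# The values are illustrative only; they follow the description above
# (high diagonal, C and J similar, G vs. programming very different).
OPS = ["E", "G", "C", "J"]

S = {
    ("E", "E"): 9,  ("G", "G"): 9,  ("C", "C"): 9,  ("J", "J"): 9,
    ("C", "J"): 7,  ("J", "C"): 7,   # similar operations
    ("E", "G"): -1, ("G", "E"): -1,
    ("E", "C"): -3, ("C", "E"): -3,
    ("E", "J"): -3, ("J", "E"): -3,
    ("G", "C"): -4, ("C", "G"): -4,  # very different
    ("G", "J"): -4, ("J", "G"): -4,
}
```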
Substitution Matrix. Depending on the problem, it might be easy or very difficult to find a useful S matrix. Consider masquerade detection based on UNIX commands: it is difficult to say how close two commands are. Suppose instead we are aligning DNA sequences: there are biologically valid reasons behind the S matrix, so it is easier to determine a sensible one.
Gap Penalty. Generally we must allow gaps to be inserted, but gaps make the alignment more generic and less useful for scoring, so we penalize them. How should gaps be penalized? Two popular options: a linear gap penalty function, g(x) = ax (a constant penalty for every gap), and an affine gap penalty function, g(x) = a + b(x - 1), where a is the gap-opening penalty and b is a constant penalty for each extension of an existing gap.
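As a small illustration of the two penalty functions (the parameter values are arbitrary placeholders):

```python
# Linear gap penalty: the same cost a for every gap position.
def linear_gap_penalty(x, a=2):
    """Penalty g(x) = a*x for a run of x consecutive gaps."""
    return a * x

# Affine gap penalty: cost a to open a gap, plus b per extension.
def affine_gap_penalty(x, a=3, b=1):
    """Penalty g(x) = a + b*(x - 1) for a run of x >= 1 consecutive gaps."""
    return a + b * (x - 1)
```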
Pairwise Alignment Algorithm. We'll use dynamic programming, based on the S matrix and the gap penalty function. Notation: F(i, j) denotes the score of the best alignment of the first i symbols of one sequence against the first j symbols of the other.
Pairwise Alignment DP. Assume a linear gap penalty, g(x) = ax. We initialize the edges of F using gap penalties and then fill in F recursively; the best score is in F(n, m). The affine gap penalty case is more complex.
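A sketch of a standard global-alignment dynamic program (Needleman-Wunsch style) under the linear gap penalty assumption; the function name, parameter values, and the exact form of the recursion are assumptions rather than taken from the slides.

```python
# A sketch of the pairwise-alignment dynamic program with a linear
# gap penalty g(x) = a*x. S is a substitution-score lookup such as
# the hypothetical matrix above; penalties are subtracted from scores.
def align_score(x, y, S, a=2):
    """Return F, where F[i][j] is the best score aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Initialization: aligning a prefix against nothing costs only gap penalties.
    for i in range(1, n + 1):
        F[i][0] = -a * i
    for j in range(1, m + 1):
        F[0][j] = -a * j

    # Recursion: substitute, or insert a gap in either sequence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])],  # align the two symbols
                F[i - 1][j] - a,                            # gap in y
                F[i][j - 1] - a,                            # gap in x
            )
    return F  # the best global alignment score is F[n][m]
```

For example, align_score("EEGC", "EGCJ", S) scores two operation sequences using the hypothetical S above.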
MSA from Pairwise Alignments. Given the pairwise alignments, how do we construct the MSA? Generally we use progressive alignment: select one pairwise alignment, select another and combine it with the first, and continue adding more until all are used. This is relatively easy (good), but gaps proliferate and the process is unstable (bad).
MSA from Pairwise Alignments. There are lots of ways to improve on generic progressive alignment; here we mention one such approach, which is not necessarily the best (or even the most popular). Feng-Doolittle progressive alignment: compute scores for all pairs of the n sequences; select n - 1 alignments that (a) connect all sequences and (b) maximize the pairwise scores; then generate a minimum spanning tree; for the MSA, add sequences in the order that they appear in the spanning tree.
MSA Construction. Create the pairwise alignments using dynamic programming, after specifying the gap function g and the matrix S. Then use the pairwise alignments to make the MSA: first construct a spanning tree (using Prim's algorithm, for example), then add sequences in spanning-tree order, from highest score, inserting gaps and/or accepting misalignments as needed; a sketch of the spanning-tree step appears below.
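A minimal sketch of the spanning-tree step, assuming the pairwise alignment scores are already available in a dictionary (the function name and data layout are assumptions):

```python
# Build a spanning tree over the sequences from their pairwise
# alignment scores, greedily adding the best-scoring edge that
# connects a new sequence (Prim-style), and record the merge order.
def msa_order(scores, start):
    """scores: dict mapping frozenset({i, j}) -> pairwise alignment score."""
    seqs = {s for pair in scores for s in pair}
    in_tree = {start}
    order = []  # edges, in the order their sequences join the MSA
    while in_tree != seqs:
        best = max(
            ((i, j) for i in in_tree for j in seqs - in_tree
             if frozenset({i, j}) in scores),
            key=lambda e: scores[frozenset(e)],
        )
        order.append(best)
        in_tree.add(best[1])
    return order
```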
MSA Example. Suppose we have 10 sequences with the given pairwise alignment scores.
MSA Example: Spanning Tree. The spanning tree is based on the scores, so we process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9).
MSA Snapshot. An intermediate step and the final MSA: we use '+' for a neutral symbol and '-' for gaps in the MSA. Note the increase in gaps.
PHMM from MSA. In a PHMM, we determine the match and insert states, and their probabilities, from the MSA. Conservative columns, where half or less of the symbols are gaps, are match states. Other columns, where a majority of the symbols are gaps, are insert states. Delete states are a separate issue, which should become clear from the example.
PHMM States from MSA. Consider a simple MSA. Columns 1, 2, and 6 are match states 1, 2, and 3, respectively, since they are less than half gaps. Columns 3, 4, and 5 are combined to form insert state 2, since they are more than half gaps. We can loop in an insert state, with match states between inserts.
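A minimal sketch of this column-classification rule, assuming the MSA is a list of equal-length strings with '-' marking gaps (the encoding and function name are assumptions):

```python
# Classify MSA columns: a column is "conservative" (part of a match
# state) if half or fewer of its symbols are gaps; otherwise it
# contributes to an insert state.
def classify_columns(msa, gap="-"):
    """msa: list of equal-length aligned strings. Returns 'M' or 'I' per column."""
    labels = []
    for c in range(len(msa[0])):
        column = [seq[c] for seq in msa]
        n_gaps = sum(1 for s in column if s == gap)
        labels.append("M" if n_gaps <= len(column) / 2 else "I")
    return labels
```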
Probabilities from MSA. Emission probabilities are determined by the symbol distribution in the match and insert states. State transition probabilities are based on the transitions in the MSA.
Probabilities from MSA. Emission probabilities: zero probabilities are bad, since the model then overfits the data. So we use the add-one rule: add one to each numerator, and add the total to the denominators.
Probabilities from MSA. More emission probabilities: zero probabilities are still bad, since the model overfits the data, so again we use the add-one rule, adding one to each numerator and the total to the denominators.
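A sketch of the add-one rule for emission probabilities; the function name, alphabet, and symbol counts below are illustrative, not taken from the slides.

```python
# Emission probabilities with the add-one rule: add one to each
# symbol count and add the number of symbols in the alphabet to
# the denominator, so no probability is ever zero.
def emission_probabilities(observed, alphabet):
    """observed: non-gap symbols seen in a match (or insert) state."""
    denom = len(observed) + len(alphabet)
    return {a: (observed.count(a) + 1) / denom for a in alphabet}

# For example, if match state M1 sees the symbols ['E', 'E', 'C'] over the
# alphabet {E, G, C, J}, then e_M1(E) = 3/7, e_M1(C) = 2/7,
# and e_M1(G) = e_M1(J) = 1/7.
```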
Probabilities from MSA. Transition probabilities: we look at some examples, noting that '-' corresponds to a delete state. First, consider the begin state; again, we use the add-one rule.
Probabilities from MSA. Transition probabilities: when there is no information in the MSA, we set the probabilities to uniform. For example, I1 does not appear in the MSA, so the transition probabilities out of I1 are set to be uniform.
Probabilities from MSA. Transition probabilities, another example: what about transitions from state D1? In the MSA we can only go from D1 to M2, so again we use the add-one rule.
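As a worked illustration of these last two cases, assuming (as described above) no observed transitions out of I1 and a single observed transition from D1 to M2:

```latex
\begin{align*}
&\text{No data for } I_1 \text{ (uniform):} &
  a_{I_1,M_2} = a_{I_1,I_1} = a_{I_1,D_2} &= \tfrac{1}{3} \\
&\text{One observed transition } D_1 \to M_2 \text{ (add-one):} &
  a_{D_1,M_2} = \tfrac{1+1}{1+3} = \tfrac{1}{2}, \quad
  a_{D_1,I_1} = a_{D_1,D_2} &= \tfrac{0+1}{1+3} = \tfrac{1}{4}
\end{align*}
```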
PHMM Emission Probabilities. Emission probabilities for the given MSA, using the add-one rule.
PHMM Transition Probabilities. Transition probabilities for the given MSA, using the add-one rule.
PHMM Summary. Determine all pairwise alignments, typically using dynamic programming. Use the pairwise alignments to construct the MSA (there are lots of ways to do this). Using the MSA, determine the emission probabilities and the state transition probabilities. Then we have trained our PHMM. Now what?
PHMM Scoring. How do we score an emission sequence X against a PHMM λ = (A, E, π)? That is, we want P(X | λ). As with an HMM, there is a brute-force way and an efficient way. First, an example to illustrate the brute-force method; then, the efficient way.
PHMM Scoring: Brute Force. Consider a PHMM with 2 emitted symbols, (x, y), and suppose that N = 2. The table lists all possible paths.
PHMM Scoring: Brute Force (continued).
PHMM Scoring. We want to efficiently score sequences to see how closely they match the PHMM. How did we score with an HMM? The forward algorithm. How do we score sequences with a PHMM? The forward algorithm (surprise!). But PHMM scoring is more complex, due to the more complex state transitions and the (slightly) more complex B matrices.
Forward Algorithm. Notation: the indices i and j are columns in the MSA; xi is the ith observation (emission) symbol; qxi is the distribution of xi in the random model. The base case initializes the recursion, and the forward variable gives the score of x1, ..., xi up to state j (note that in a PHMM, i and j may not agree). Some states are undefined, and undefined states are ignored in the calculation.
Forward Algorithm. We compute P(X | λ) recursively. Note that each forward value at a match state depends on the corresponding match, insert, and delete values one step back, and also on the state transition probabilities.
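As a reference sketch, a standard (Durbin-style) log-odds form of the PHMM forward recursion is roughly the following; the exact notation on the original slides may differ:

```latex
\begin{align*}
\alpha^M_j(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\!\Big( a_{M_{j-1},M_j}\,e^{\alpha^M_{j-1}(i-1)}
              + a_{I_{j-1},M_j}\,e^{\alpha^I_{j-1}(i-1)}
              + a_{D_{j-1},M_j}\,e^{\alpha^D_{j-1}(i-1)} \Big) \\
\alpha^I_j(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\!\Big( a_{M_j,I_j}\,e^{\alpha^M_j(i-1)}
              + a_{I_j,I_j}\,e^{\alpha^I_j(i-1)}
              + a_{D_j,I_j}\,e^{\alpha^D_j(i-1)} \Big) \\
\alpha^D_j(i) &= \log\!\Big( a_{M_{j-1},D_j}\,e^{\alpha^M_{j-1}(i)}
              + a_{I_{j-1},D_j}\,e^{\alpha^I_{j-1}(i)}
              + a_{D_{j-1},D_j}\,e^{\alpha^D_{j-1}(i)} \Big)
\end{align*}
```

The base case sets the score at the begin state to 0, and any undefined terms are simply omitted from the sums, as noted above.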