Protein Domain Analysis and Gene Finding via HMMs

Slide Note

This content delves into protein domain analysis using Hidden Markov Models (HMMs) and gene finding techniques. It discusses the use of profiles, scoring matrices, and regular expressions to identify protein sequence motifs. Additionally, it explores the application of Psi-BLAST for sequence similarity searches and highlights the extension of profiles using HMMs for improved sensitivity in domain analysis.

kani_vy Follow

Uploaded on Feb 17, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

CSE182-L8 Protein domain analysis via HMMs Gene finding February 25

Regular expressions as Protein sequence motifs C-X-[DE]-X{10,12}-C-X-C--[STYLV] Fam(B) A C E F Problem: if there is a mis-match, the sequence is not accepted. February 17, 2025

Representation 2: Profiles Profiles versus regular expressions Regular expressions are intolerant to an occasional mis-match. The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members. Profiles capture some of these ideas. February 25

Profiles Start with an alignment of strings of length m, over an alphabet A, Build an |A| X m matrix F=(fki) Each entry fki represents the frequency of symbol k in position i 0.71 0.14 0.28 0.14 February 25

Scoring matrices Given a sequence s, does it belong to the family described by a profile? We align the sequence to the profile, and score it Let S(si,j) be the score of aligning position j of the profile to residue si The score of an alignment is the sum of column scores. j s si February 25

Scoring Profiles k S(si, j)= M rk,sj fkj Scoring Matrix j k fkj s February 25 i

Domain analysis via profiles Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. What if the sequence matches some other sequences weakly (using BLAST), but does not match any known profile? February 25

Psi-BLAST idea Seq Db --In the next iteration, the red sequence will be thrown out. --It matches the query in non-essential residues Iterate: Find homologs using Blast on query Discard very similar homologs Align, make a profile, search with profile. Why is this more sensitive? February 25

EXTENDING PROFILES USING HIDDEN MARKOV MODELS February 25

QUIZ! Question: Your friend likes to gamble. She tosses a coin: HEADS, she gives you a dollar. TAILS, you give her a dollar. Usually, she uses a fair coin, but once in a while , she uses a loaded coin. Can you say what fraction of the times she loads the coin? February 25

Representation 3: HMMs Building good profiles relies upon good alignments. Difficult if there are gaps in the alignment. Psi-BLAST/BLOCKS etc. work with gapless alignments. An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework. Also allows for position specific gap scoring. V February 25

The generative model Think of each column in the alignment as generating symbols according to a distribution. For each column, build a node that outputs an a.a. with the appropriate probability 0.71 Pr[F]=0.71 Pr[Y]=0.14 0.14 February 25

A simple Profile HMM Connect nodes for each column into a chain. Thie chain generates random sequences. What is the probability of generating FKVVGQVILD? In this representation Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S] What is the difference with Profiles? February 25

HMMs with indels February 25

Profile HMMs can handle gaps The match states are the same as on the previous page. Insertion and deletion states help introduce gaps. When in an insert state, generate any amino-acid When in delete, generate a - A sequence may be generated using different paths. February 25

Example A L - L A I V L A I - L Probability [ALIL] is part of the family? Note that multiple paths can generate this sequence. Go to M1, and generate A Go to I1 and generate L Go to M2 and generate I Go to M3 and generate L 1 2 3 OR 1 2 3 4 4 Go to M1, and generate A Go to M2 and generate L Go to I2 and generate I Go to M3 and generate L February 25

Example A L - L A I V L A I - L Probability [ALIL] is part of the family? Note that multiple paths can generate this sequence. M1I1M2M3 M1M2I2M3 In order to compute the probabilities, we must assign probabilities of transition between states February 25

Profile HMMs Directed Automaton M with nodes and edges. Nodes emit symbols according to emission probabilities Transition from node to node is guided by transition probabilities Joint probability of seeing a sequence S, and path P Pr[S,P|M] = Pr[S|P,M] Pr[P|M] Pr[ALIL AND M1I1M2M3|M] = Pr[ALIL| M1I1M2M3,M] Pr[M1I1M2M3|M] Pr[ALIL | M] = ? February 25

HMM Q is a set of states is the probability distribution on initial state T is a matrix of transition probabilities T[j,k]: probability of moving from state j to state k is a set of symbols ej(S) is the probability of emitting S while in state j. T[j,k] k j i Automaton M=(Q,T, , ,e) At first, M goes to initial state j with probability j In state j, M emits a symbol from according to ej, and moves to state k with probability T[j,k]. February 17, 2025

Two solutions An unknown (hidden) path is traversed to produce (emit) the sequence S. The probability that M emits S can be either The sum over the joint probabilities over all paths. Pr(S|M) = P Pr(S,P|M) OR, it is the probability of the most likely path Pr(S|M) = maxP Pr(S,P|M) Both are appropriate ways to model, and have similar algorithms to solve them. February 25

Viterbi Algorithm for HMM A L - L A I V L A I - L Let Pmax(i,j|M) be the probability of the most likely solution that emits S1 Si, and ends in state j (is it sufficient to compute this?) February 25

Viterbi and sum algorithm for testing membership S1 Si-1 k Si j Let Pmax(i,j|M) be the probability of the most likely solution that emits S1 Si, and ends in state j (is it sufficient to compute this?) Pmax(i,j|M) = max k Pmax(i-1,k) T[k,j] ej(Si) (Viterbi) Psum(i,j|M) = k (Psum(i-1,k) T[k,j]) ej(Si) February 17, 2025

Profile HMM membership A L - L A I V L A I - L A L I L Path:M1 M2 I2 M3 We can use the Viterbi/Sum algorithm to compute the probability that the sequence belongs to the family. Backtracking can be used to get the path, which allows us to give an alignment February 25

HMM fair-coin example 0.6 0.6 1 0.4 0.4 EF(H)=0.5 EL(H)=0.1 February 17, 2025

0.6 0.6 1 0.4 0.4 EF(H)=0.5 EL(H)=0.1 H H T T T is the observed sequence P(1,F) 0.6 1.5e-1 4.5e-2 1.3e-2 5.8e-3 0.5 1 0 0.4 1.6e-2 0 2e-2 5.4e-2 2.9e-2 February 17, 2025

Summary HMMs allow us to model position specific gap penalties, and allow for automated training to get a good alignment. Patterns/Profiles/HMMs allow us to represent families and foucs on key residues Each has its advantages and disadvantages, and needs special algorithms to query efficiently. February 25

Protein Domain databases HMM A number of databases capture proteins (domains) using various representations Each domain is also associated with structure/function information, parsed from the literature. Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function 3D February 25

Biology In our discussion of BLAST, we alternated between looking at DNA, and protein sequences, treating them as strings. DNA, RNA, and proteins are the 3 important molecules What is the relation between the three? February 25

February 25

Transcription and translation We define a gene as a location on the genome that codes for proteins. The genic information is used to manufacture proteins through transcription, and translation. There is a unique mapping from triplets to amino- acids February 25

Gene We define a gene as a location on the genome that codes for proteins. The genic information is used to manufacture proteins through transcription, and translation. There is a unique mapping from triplets to amino- acids 2/17/2025 CSE 182

Transcription February 25

Translation February 25

Transcription start ATAGATGATGTACGATGAGAATGTGATTAATG Translation start Donor Acceptor

Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon. 2/17/2025 CSE 182

Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon. Given a DNA sequence, how many ways can you translate it? 2/17/2025 CSE 182

Gene Features The gene can lie on any strand (relative to the reference genome) The code can be in one of 3 frames. Frame 1 Frame 2 Frame 3 S R V * W R V Q Y S G * S I V D AGTAGAGTATAGTGGACG TCATCTCATATCACCTGC -ve strand February 25

Eukaryotic gene structure 2/17/2025 CSE 182

Gene Features ATG 5 UTR 3 UTR exon intron Translation start Acceptor Donor splice site Transcription start 2/17/2025 CSE 182

Gene identification Eukaryotic gene definitions: Location that codes for a protein The transcript sequence(s) that encodes the protein The protein sequence(s) Suppose you want to know all of the genes in an organism. This was a major problem in the 70s. PhDs, and careers were spent isolating a single gene sequence. All of that changed with better reagents and the development of high throughput methods like EST sequencing 2/17/2025 CSE 182

End of L9 February 25

Protein Domain Analysis and Gene Finding via HMMs

Download Presentation

Presentation Transcript

Related

More Related Content