Sequence Analysis Concepts
Delve into the important concepts of homology, sequence similarity, limits of alignment detection, and determining homology in sequence analysis. Learn about the twilight zone, database similarity searching, and sequence database searching. Explore the nuances of inferring homologous relationships and the statistical significance of alignments.
Uploaded on Feb 21, 2025 | 0 Views
Presentation Transcript
Homology vs. similarity again
A reminder of an important concept in sequence analysis: homology. Homology is a conclusion about a common ancestral relationship, drawn from sequence similarity. Sequence similarity is a direct observation from the sequence alignment. Similarity can be quantified as a percentage, but homology cannot: two sequences either share a common ancestor or they do not. It is important to understand this difference between homology and similarity. If the similarity is high enough, a common evolutionary relationship can be inferred.
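To make "similarity can be quantified" concrete, here is a minimal sketch of a percent-identity calculation over two already aligned sequences. The function name and the choice of denominator (aligned columns without gaps) are assumptions made for illustration; other conventions divide by the full alignment length or by the shorter sequence.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of non-gap alignment columns whose residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Ungapped alignment of the example query against part of a database sequence.
identity = percent_identity("LGQALWGQIWW", "YITALYGRINC")
print(f"{identity:.1f}% identical")
```

The number describes the alignment only; whether the two sequences are homologous is a separate inference.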
Limits of alignment detection
But how high is high enough? How many mutations can occur before the differences make two sequences unrecognizable? Intuitively, at some point two homologous sequences become so divergent that they no longer align well.
Twilight zone
The level of similarity at which a homologous relationship can be inferred depends on the type of sequence (protein or nucleic acid) and on the length of the alignment. Two unrelated DNA sequences are expected to be identical at 25% or more of their positions by chance alone. For proteins the figure is about 5%. If gaps are allowed, this percentage can increase to 10-20%. The shorter the sequence, the higher the chance that an alignment can be attributed to random chance. This suggests that shorter sequences require a higher cutoff for inferring homology than longer sequences.
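The 25% and 5% figures follow from a simple calculation: if residues are drawn independently, the chance that two of them are identical is the sum of the squared residue frequencies. The sketch below assumes uniform frequencies (an idealization; real sequences are not uniform) and adds one hypothetical skewed DNA composition to show why 25% is a lower bound.

```python
def chance_identity(frequencies):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in frequencies)


dna = chance_identity([0.25] * 4)        # four equally frequent nucleotides
protein = chance_identity([0.05] * 20)   # twenty equally frequent amino acids
# Hypothetical AT-rich composition (A, C, G, T), chosen only for illustration.
skewed_dna = chance_identity([0.4, 0.1, 0.1, 0.4])

print(f"DNA: {dna:.2f}, protein: {protein:.2f}, skewed DNA: {skewed_dna:.2f}")
```

Any departure from uniform composition raises the chance identity, which is one reason fixed percentage cutoffs are only a rough guide.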
Sequences with identity above the following levels will be considered homologous:
more than 30% for proteins at least 100 residues long
more than 70% for nucleotides
Source: Xiong, Essential Bioinformatics
Determining homology
It must be stressed that percentage identity values provide only tentative guidance for identifying homology. They are not a precise rule for determining sequence relationships, especially for sequences in the twilight zone. A statistically more rigorous approach to determining homologous relationships exists: the statistical significance of the alignment (i.e., of its score) can be tested. However, I will not cover this advanced topic in this lecture.
Sequence database searching
[Figure: a query sequence is compared by pairwise alignment with each target sequence in the database; closely related matches are returned.]
Database searching requirements
sensitivity: the ability to find as many correct hits (true positives, TP) as possible
selectivity (specificity): the ability to exclude incorrect hits (false positives, FP)
speed
Ideally: high sensitivity, high specificity, high speed.
In reality: an increase in sensitivity leads to a decrease in specificity, and an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
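As a small illustration of these two measures, the sketch below uses the standard definitions based on true/false positives and negatives. The counts are hypothetical numbers invented for the example, not results from any real search.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the true homologs that the search actually reported."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of the unrelated sequences that the search correctly excluded."""
    return tn / (tn + fp)


# Hypothetical search: 100 true homologs and 1000 unrelated sequences in the database.
strict = {"tp": 70, "fn": 30, "tn": 990, "fp": 10}    # stringent settings
relaxed = {"tp": 95, "fn": 5, "tn": 900, "fp": 100}   # permissive settings

for name, c in (("strict", strict), ("relaxed", relaxed)):
    print(name, sensitivity(c["tp"], c["fn"]), specificity(c["tn"], c["fp"]))
```

The two hypothetical settings show the trade-off described above: the relaxed search finds more true homologs but also lets more false positives through.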
Types of algorithms
exhaustive: uses a rigorous algorithm to find the exact solution to a particular problem by examining all mathematical combinations (example: dynamic programming)
heuristic: a computational strategy to find an empirical or near-optimal solution by using rules of thumb
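For comparison with the heuristic methods, here is a minimal sketch of an exhaustive dynamic programming search: a local alignment score in the style of Smith-Waterman. The scoring scheme (+1 match, -1 mismatch, -1 per gap position) is an assumption picked for readability; real searches use substitution matrices and affine gap penalties.

```python
def local_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best local alignment score; every cell of the len(a) x len(b) table is filled."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("GGACGTCC", "TTACGTAA"))
```

The work grows with the product of the two sequence lengths, which is what makes this approach expensive when the second sequence is an entire database.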
Heuristic algorithms
Heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. Currently, there are two major algorithms:
FASTA
BLAST (Basic Local Alignment Search Tool), the "Google of the sequence world"
They are not guaranteed to find the optimal alignment or true homologs, but they are 50-100 times faster than dynamic programming (DP). The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990 Oct 5;215(3):403-10.
Finding homologues
Sequences that look alike probably have the same structure and function. Use a sequence as a search query to find homologous sequences in a database. The function of a novel sequence can then be inferred from the data already available for its homologs.
Two components of BLAST
BLAST consists of two components: a search algorithm and an evaluation of the quality of the solutions.
BLAST strategy (Basic Local Alignment Search Tool)
Find short stretches (words) of identical or nearly identical letters in the two sequences. The basic assumption is that two related sequences must have at least one word in common. Once word matches have been identified, a longer alignment can be obtained by extending the similarity regions outward from the words. Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.
How BLAST works: 1st step
Divide the query sequence into overlapping words of length W (W = 3 for proteins, W = 11 for nucleic acids).
Query: LGQALWGQIWW
Words: LGQ GQA QAL ALW LWG WGQ GQI QIW IWW
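The word-splitting step can be sketched with a sliding window. The function name is hypothetical; the query and the word length W = 3 come from the slide.

```python
def split_into_words(query: str, w: int = 3) -> list:
    """All overlapping words of length w, in order of their start position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = split_into_words("LGQALWGQIWW", w=3)
print(" ".join(words))
```

A query of length L yields L - W + 1 words, so the 11-residue example produces 9 words.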
How BLAST works: 1st step (continued)
For each of these words, a list of similar words is created using a substitution matrix (default: BLOSUM62). Example for the word LWG from the query LGQALWGQIWW, which scores 4 + 11 + 6 = 21 against itself:
LWG 21
IWG 19
MWG 19
VWG 18
FWG 17
LYG 12
LFG 11
FWS 11
AWS 10
Threshold T = 11: only words scoring above T are kept.
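The scores in this list can be reproduced by summing substitution-matrix entries position by position. The sketch below hard-codes only the handful of BLOSUM62 entries needed for these words and scores only the candidates shown on the slide; a real implementation would enumerate all possible words against the full matrix. Keeping words that score strictly above T is an assumption based on which words appear in the next step.

```python
# Subset of BLOSUM62 covering only the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("W", "W"): 11, ("G", "G"): 6,
    ("L", "I"): 2, ("L", "M"): 2, ("L", "V"): 1, ("L", "F"): 0, ("L", "A"): -1,
    ("W", "Y"): 2, ("W", "F"): 1, ("G", "S"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum of per-position substitution scores between two words of equal length."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


candidates = ["LWG", "IWG", "MWG", "VWG", "FWG", "LYG", "LFG", "FWS", "AWS"]
T = 11
scores = {w: word_score("LWG", w) for w in candidates}
kept = [w for w in candidates if scores[w] > T]
print(scores)
print(kept)
```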
How BLAST works: 2nd step
Scan the database sequences for exact matches with the high-scoring words:
LWG IWG MWG VWG FWG LYG
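A minimal sketch of the scanning step: slide a window over a database sequence and report every position where the window equals one of the high-scoring words. The function name is hypothetical, and the set lookup stands in for the indexed lookup structures a real implementation would use. The database sequence is the one used in the extension example.

```python
def find_word_hits(words, database_seq):
    """Return (word, 0-based position) for every exact word match in the sequence."""
    word_set = set(words)
    w = len(words[0])  # assumes a non-empty list of equal-length words
    hits = []
    for pos in range(len(database_seq) - w + 1):
        window = database_seq[pos:pos + w]
        if window in word_set:
            hits.append((window, pos))
    return hits


high_scoring_words = ["LWG", "IWG", "MWG", "VWG", "FWG", "LYG"]
hits = find_word_hits(high_scoring_words, "WTDFGYITALYGRINC")
print(hits)
```

Each hit is a seed: it fixes the relative offset between query and database sequence from which the extension starts.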
How BLAST works: 3rd step
Extend the exact matches to a high-scoring segment pair (HSP). BLAST 2.0 additionally enables the explicit treatment of gaps.
Word hit: LYG in the database sequence WTDFGYITALYGRINC, matching the query word LWG.
query sequence:     L  G  Q  A  L  W  G  Q  I  W  W
database sequence:  Y  I  T  A  L  Y  G  R  I  N  C
score:             -1 -4 -1  4  4  2  6  1  4 -4 -2
The slide build-up shows S = 12 for the initial word hit, then S = 17, S = 20 and S = 12 as the segment is extended. The final HSP is reported with S = 20.
BLAST: when to stop the extension
Example (a match scores +1, a mismatch scores -1):
The quick brown fox jump
The quiet brown cat purr
123 45654 56789 876 5654  <- score
000 00012 10000 123 4345  <- drop-off score
The slide marks two stopping points on this alignment: one for X = 2 and one for X = 4.
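The drop-off rule in this example can be sketched as a one-directional extension that tracks the running score and the best score seen so far. Two assumptions are made here: spaces are removed before scoring (the score rows skip them), and extension stops when the drop is strictly greater than X. An implementation that stops as soon as the drop reaches X would halt earlier for the same X.

```python
def extend_right(a: str, b: str, x_drop: int, match: int = 1, mismatch: int = -1):
    """Extend left to right; return (best score, length of the trimmed extension)."""
    score = best = 0
    best_len = 0
    for i, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score has fallen too far below the maximum
    # Trim back to the position where the maximum score was reached.
    return best, best_len


a = "The quick brown fox jump".replace(" ", "")
b = "The quiet brown cat purr".replace(" ", "")
print(extend_right(a, b, x_drop=4))
print(extend_right(a, b, x_drop=1))
```

With the larger X the extension survives the two mismatches in "quick"/"quiet" and reaches the end of "brown"; with the smaller X it gives up after "qui".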
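BLAST: when to stop the extension
[Plot: score against the length of the extension. The extension stops once the score has dropped by X below its maximum; the alignment is then trimmed back to the maximum.]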
How BLAST works
Under certain conditions, HSPs can be joined to extend the alignment:
overlapping HSPs
HSPs that are not too distant from each other
1. list: the query sequence is cut into words of length W; for each word, a list of similar words is created using a substitution matrix
2. scan: the database sequences are scanned for matches to these words
3. extend: the similarity is extended on both sides of the word to give a high-scoring pair
BLAST parameters
W: word size; find W-mers in target/query; 2-3 (3) for proteins, 6-11 (28) for nucleic acids
T: neighborhood word score threshold; focus on word pairs scoring more than T; usually 11-13
X: drop-off; stop extending when the score loss is higher than X
S: score; the final score of an HSP
BLAST parameters
Adjusting T and W controls both the speed and the sensitivity (TP) of BLAST.
When T is raised, the search is faster, but fewer hits are registered, so distantly related database matches may be missed.
When T is lowered, the search proceeds more slowly, but many more word hits are evaluated, so sensitivity increases.
To speed up BLASTN, increase W (T is not used in BLASTN; words are always identical).
To speed up BLASTP, set W = 3 and T to a large value.
W and T are better than X for controlling speed.
Which sequence to search?
The choice of sequence type also influences the sensitivity of the search. Protein sequences have a clear advantage in detecting homologs.
If the input sequence is a protein-encoding DNA sequence, use BLASTX, which translates the query in all six reading frames before the sequence comparisons.
If you're looking for protein homologs encoded in newly sequenced genomes, you may use TBLASTN. This may help to identify protein-coding genes that have not yet been annotated.
If a DNA sequence is to be used as the query against a DNA database, a protein-level comparison can be done with TBLASTX.
TBLASTN and TBLASTX are very computationally intensive, and the search can be very slow.
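The choice of program can be summarized as a lookup on the query type and the database type. The helper below is a hypothetical convenience function written for this summary, not part of any BLAST distribution; only the program names themselves are real.

```python
def choose_blast_program(query_type: str, db_type: str, protein_level: bool = False) -> str:
    """Pick a BLAST program from sequence types ('protein' or 'nucleotide')."""
    if query_type == "nucleotide" and db_type == "nucleotide":
        # tblastx translates both query and database in six frames.
        return "tblastx" if protein_level else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # translated query vs protein database
        ("protein", "nucleotide"): "tblastn",  # protein query vs translated database
    }
    return table[(query_type, db_type)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```

The `protein_level` flag only matters when both inputs are nucleotide sequences, which is the one case where two programs are available.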
E-value (expected value)
The E-value estimates the expected number of records in the database that would be returned by chance with a score as good as or better than the score of the record under scrutiny. An E-value of 1 means that in a database of the current size one might expect to see 1 match with a similar score simply by chance. A value close to zero means that you would expect practically no unrelated sequence to score as high against your query sequence.
The interpretation of E-value
The primary use of the E-value is to help answer the question "Is this alignment statistically meaningful?", not whether it has biological meaning.
What is the highest E-value that I should consider significant? There is no definite answer; it depends on your goals and sequences. Generally, the lower the better. For example, NCBI uses (in its internal processes) a cutoff of 1e-6 for blastp against the BLAST nr database. In some cases, however, this may be too restrictive for you.
Bit score
A typical BLAST output reports E-values and scores. There are two kinds of scores: raw scores and bit scores. Raw scores are calculated from the substitution matrix and the gap penalty parameters. The bit score S is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Bit scores from different alignments, even those employing different scoring matrices in separate BLAST searches, can therefore be compared. E-values cannot be compared when searching in different databases, because they depend on the database size. The bit scores, however, remain the same.
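The relationship between raw score, bit score, and E-value can be sketched with the Karlin-Altschul formulas: bit score = (lambda * raw - ln K) / ln 2, and E = m * n * 2^(-bit score), where m is the query length and n the database length. The lambda and K values, the raw score, and the sequence lengths below are illustrative assumptions; real BLAST reports the parameters it used and works with effective (length-corrected) values of m and n.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score with the scoring system's statistical parameters."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only (assumed, not taken from the lecture).
LAMBDA, K = 0.267, 0.041
raw = 120
bits = bit_score(raw, LAMBDA, K)

small_db = e_value(bits, query_len=250, db_len=10**8)
large_db = e_value(bits, query_len=250, db_len=10**9)
print(f"bit score {bits:.1f}, E = {small_db:.2e} (small db), E = {large_db:.2e} (large db)")
```

The same alignment keeps its bit score in both searches, while its E-value grows tenfold in the database that is ten times larger.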