Ensembles of Hidden Markov Models in Bioinformatics Challenges

using ensembles of hidden markov models for grand n.w
1 / 68
Embed
Share

Explore the use of Ensemble of Hidden Markov Models in addressing key bioinformatics challenges including phylogenetic placement, multiple sequence alignment, and gene family assignment. Discover the application of Profile Hidden Markov Models in DNA sequence alignment and the need for novel techniques in multiple sequence alignment for scalability and accuracy. Simulation studies provide insights into comparing true and estimated trees and alignments.

  • Bioinformatics
  • Hidden Markov Models
  • Sequence Alignment
  • Ensemble Learning
  • DNA Analysis

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Department of Computer Science The University of Illinois at Urbana-Champaign

  2. Four Problems Phylogenetic Placement Multiple sequence alignment Metagenomic taxon identification Gene family assignment and homology detection A unifying technique is the Ensemble of Hidden Markov Models (introduced by Mirarab et al., 2012)

  3. Profile HMMs Generative model for representing a MSA Consists of: Set of states (Match, insertion, and deletion) Transition probabilities Emission probabilities

  4. Profile Hidden Markov Model for DNA sequence alignment

  5. Profile Hidden Markov Models Probabilistic model to represent a family of sequences, represented by a multiple sequence alignment Introduced for sequence analysis in Krogh et al. 1994. Popularized in Eddy 1996 and textbook Durbin et al. 1998 Fundamental part of HMMER and other protein databases Used for: homology detection, protein family assignment, multiple sequence alignment, phylogenetic placement, protein structure prediction, alignment segmentation, etc.

  6. Multiple Sequence Alignment (MSA): an important grand challenge1 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- Sn = -------TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

  7. Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 Compare S2 S3 S4 S3 True tree and alignment Estimated tree and alignment

  8. 1kp: Thousand Transcriptome Project N. Matasci iPlant T. Warnow, UIUC S. Mirarab, UCSD N. Nguyen UCSD J. Leebens-Mack U Georgia N. Wickett Northwestern G. Ka-Shu Wong U Alberta Plus many many other people Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) Challenge: Alignments and trees on > 100,000 sequences

  9. PASTA: even better than SAT -2 RNASim 0.20 0.15 Clustal Omega Tree Error (FN Rate) Muscle Mafft Starting Tree 0.10 SATe2 PASTA Reference Alignment 0.05 0.00 10000 50000 100000 200000 PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015 Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish)

  10. 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary Mean:317 Median:266 12000 10000 8000 Counts 6000 4000 2000 0 1000 Length 0 500 1500 2000

  11. 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary Mean:317 Median:266 12000 10000 8000 Counts 6000 All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments. 4000 2000 0 1000 Length 0 500 1500 2000

  12. HMMs for MSA Given seed alignment (e.g., in PFAM) and a collection of sequences for the protein family: Represent seed alignment using HMM Align each additional sequence to the HMM Use transitivity to obtain MSA

  13. HMMs for MSA Given seed alignment (e.g., in PFAM) and a collection of sequences for the protein family: Represent seed alignment using HMM Align each additional sequence to the HMM Use transitivity to obtain MSA Can we do something like this without a seed alignment?

  14. UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

  15. UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs

  16. Simple idea (not UPP) Select random subset of sequences, and build backbonealignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

  17. PASTA: even better than SAT -2 Starting tree is based on UPP(simple): one profile HMM for Backbone alignment RNASim 0.20 0.15 Clustal Omega Tree Error (FN Rate) Muscle Mafft Starting Tree 0.10 SATe2 PASTA Reference Alignment 0.05 0.00 10000 50000 100000 200000 PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015 Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish)

  18. This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise. Select random subset of sequences, and build backbonealignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

  19. One Hidden Markov Model for the backbone alignment? HMM 1

  20. Or 2 HMMs? HMM 1 HMM 2

  21. Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

  22. Or all 7 HMMs? HMM 1 HMM 2 HMM 4 HMM 7 HMM 3 HMM 5 HMM 6

  23. UPP Algorithmic Approach 1. Select random subset of full-length sequences, and build backbonealignment 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

  24. Evaluation Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim complex RNA sequence evolution simulation 1000-sequence nucleotide datasets from SAT papers 5000-sequence AA datasets (from FastTree paper) 10,000-sequence Indelible nucleotide simulation Biological datasets: Proteins: largest BaliBASE and HomFam RNA: 3 CRW datasets up to 28,000 sequences

  25. RNASim Million Sequences: tree error Using 12 processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

  26. UPP is very robust to fragmentary sequences Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Mean alignment error 0.6 0.4 0.2 0.0 0 12.5 25 50 UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution. % Fragmentary PASTA UPP(Default) (a) Average alignment error Delta FN tree error 0.4 0.2 0.0 0 12.5 25 50 % Fragmentary PASTA UPP(Default) (b) Average tree error Performance on fragmentary datasets of the 1000M2 model condition

  27. UPP Running Time Wall clock align time (hr) 15 10 5 0 100000 50000 150000 200000 Number of sequences UPP(Fast) Wall-clock time used (in hours) given 12 processors

  28. Other Applications of the Ensemble of HMMs SEPP (phylogenetic placement, Mirarab, Nguyen, and Warnow PSB 2014) TIPP (metagenomic taxon identification, Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) HIPPI (protein classification and remote homology detection), RECOMB-CG 2016 and BMC Genomics 2016 (Nguyen, Nute, Mirarab, and Warnow)

  29. Metagenomic taxonomic identification and phylogenetic profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

  30. Basic Questions 1. What is this fragment? (Classify each fragment as well as possible.) 2.What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.) 3.What are the organisms in this metagenomic sample doing together?

  31. SEPP and TIPP SEPP (PSB 2012): SAT -enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs) TIPP (Bioinformatics 2014): Applications of the eHMM technique to metagenomic abundance classification Both available at https://github.com/smirarab/sepp

  32. Phylogenetic Placement Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene) Output: Placement of query sequences on backbone tree Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.

  33. Marker-based Taxon Identification Fragmentary sequences from some gene Full-length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . . . ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

  34. Input S1 S2 S3 S4 Q1 = -AGGCTATCACCTGACCTCCA-AA = TAG-CTATCAC--GACCGC--GCA = TAG-CT-------GACCGC--GCT = TAC----TCAC--GACCGACAGCT = TAAAAC S1 S2 S3 S4

  35. Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A AAAC S1 S2 S3 S4

  36. Place Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A AAAC S1 S2 S3 S4 Q1

  37. Phylogenetic Placement in 2011 Align each query sequence to backbone alignment HMMER (Finn et al., NAR 2011) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA solve same problem (maximum likelihood placement under standard sequence evolution models)

  38. HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

  39. One Hidden Markov Model for the entire alignment?

  40. Or 2 HMMs? HMM 1 HMM 2

  41. Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

  42. SEPP Parameter Exploration Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP: Small alignment subset sizes best Large placement subset size best 10% rule (both subset sizes 10% of backbone) had best overall performance

  43. SEPP (10%-rule) on simulated data 0.0 0.0 Increasing rate of evolution

  44. Metagenomic Taxon Identification Objective: classify short reads in a metagenomic sample

  45. Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: The distribution of the sample at the species-level is: 50% species A 20% species B 15% species C 14% species D 1% species E

  46. TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler s marker genes TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree.

  47. TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler s marker genes TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment

  48. TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler s marker genes TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment

  49. TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler s marker genes TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignments

More Related Content