Enhancing Phylogenetic Estimation Through SuperFine Technology


SuperFine is a technique that boosts supertree methods to enable large-scale phylogenetic estimation. This presentation covers what a phylogeny is, why the Tree of Life matters to biology, how DNA sequences evolve over millions of years, and how trees are estimated from multiple genes, either by concatenation or with supertree methods.

  • Phylogenetic
  • Technology
  • DNA Sequences
  • Tree of Life
  • Evolution




Presentation Transcript


  1. SuperFine, Enabling Large-Scale Phylogenetic Estimation. Shel Swenson, University of Southern California and Georgia Institute of Technology

  2. Phylogeny (evolutionary tree). Taxa shown: Orangutan, Human, Gorilla, Chimpanzee. "Nothing in Biology makes sense except in the light of evolution" (Dobzhansky). Images 1-3 from the Tree of Life Website, University of Arizona

  3. Tree of Life, Importance to Biology
     • Biomedical applications
     • Mechanisms of evolution
     • Tracking ancient migrations
     • Protein structure and function
     • Drug design
     Figure label: "We are here". Image credits: 1) Nature Reviews (Genetics) 2) Howard Hughes Medical Institute (BioInteractive) 3) 1000 Genomes Project

  4. DNA sequence evolution (idealized)
     • -3 million yrs: AAGACTT
     • -2 million yrs: AAGGCCT, TGGACTT
     • -1 million yrs: AGGGCAT, TAGCCCT, AGCACTT
     • today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

  5. Phylogeny Problem. Input: sequences for the taxa U, V, W, X, Y (in slide order: AGACTA, TGGACA, TGCGACT, AGGTCA, AGATTA). Output: a tree with leaves U, V, W, X, Y

  6. Two basic approaches for tree estimation on multi-gene datasets
     • Apply phylogeny estimation methods to concatenated (combined) sequence alignments for different genes
     • Compute trees on individual genes and apply a supertree method
     This talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems

  7. Using multiple genes
     • gene 1: S1 TCTAATGGAA, S2 GCTAAGGGAA, S3 TCTAAGGGAA, S4 TCTAACGGAA, S7 TCTAATGGAC, S8 TATAACGGAA
     • gene 2: S4 GGTAACCCTC, S5 GCTAAACCTC, S6 GGTGACCATC, S7 GCTAAACCTC
     • gene 3: S1 TATTGATACA, S3 TCTTGATACC, S4 TAGTGATGCA, S7 TAGTGATGCA, S8 CATTCATACC

  8. Concatenation (? marks missing data)
     Taxon  gene 1      gene 2      gene 3
     S1     TCTAATGGAA  ??????????  TATTGATACA
     S2     GCTAAGGGAA  ??????????  ??????????
     S3     TCTAAGGGAA  ??????????  TCTTGATACC
     S4     TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
     S5     ??????????  GCTAAACCTC  ??????????
     S6     ??????????  GGTGACCATC  ??????????
     S7     TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
     S8     TATAACGGAA  ??????????  CATTCATACC
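The concatenation step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GENES` dictionary and the `concatenate` helper are hypothetical names, and the taxon-to-sequence assignments are a reading of the flattened slide text rather than data confirmed by the author.

```python
# Toy per-gene alignments (taxon -> aligned sequence), as read from the slide.
GENES = {
    "gene 1": {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
               "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    "gene 2": {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
               "S7": "GCTAAACCTC"},
    "gene 3": {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
               "S7": "TAGTGATGCA", "S8": "CATTCATACC"},
}


def concatenate(genes, missing="?"):
    """Join per-gene alignments into one supermatrix, padding absent taxa."""
    taxa = sorted({taxon for alignment in genes.values() for taxon in alignment})
    rows = {taxon: [] for taxon in taxa}
    for name, alignment in genes.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are not aligned to one length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon with no sequence for this gene gets a block of '?'.
            rows[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


supermatrix = concatenate(GENES)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Every row of the result has the same length, and taxa sampled for few genes (S2, S5, S6 here) are mostly missing data, which is one of the motivations the talk gives for supertree methods.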

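  9. Two competing approaches. [Figure: a species-by-gene data matrix (gene 1, gene 2, ..., gene k) is either concatenated and analyzed as a whole, or each gene is analyzed separately and the resulting trees are combined with a supertree method.]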

  10. Why use supertree methods?
     • Missing data
     • Large dataset sizes
     • Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
     • Unavailable sequence data (only trees)

  11. Many Supertree Methods
     • Matrix Representation with Parsimony (MRP): most commonly used and among the most accurate
     • Others: weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more

  12. Quantifying Error
     • FN: false negative (missing edge)
     • FP: false positive (incorrect edge)
     Example shown: 50% error rate
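The two error types can be computed by comparing the bipartition sets of the true and estimated trees. The sketch below is an illustration, not the evaluation code behind the talk: `split` and `error_rates` are hypothetical helper names, the five-taxon trees are made up, and normalizing FN by the number of true edges and FP by the number of estimated edges is an assumption (a common convention that the slide does not spell out).

```python
def split(side, taxa):
    """Orientation-free form of the bipartition side | rest."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])


def error_rates(true_splits, estimated_splits):
    """Return (FN rate, FP rate) for two sets of non-trivial bipartitions."""
    true_splits, estimated_splits = set(true_splits), set(estimated_splits)
    missing = true_splits - estimated_splits      # true edges not recovered
    incorrect = estimated_splits - true_splits    # estimated edges that are wrong
    fn_rate = len(missing) / len(true_splits) if true_splits else 0.0
    fp_rate = len(incorrect) / len(estimated_splits) if estimated_splits else 0.0
    return fn_rate, fp_rate


# Made-up example: the estimate recovers ab|cde but replaces abc|de with abd|ce.
taxa = "abcde"
true_tree = {split("ab", taxa), split("abc", taxa)}
estimate = {split("ab", taxa), split("abd", taxa)}
fn_rate, fp_rate = error_rates(true_tree, estimate)
print(f"FN rate {fn_rate:.0%}, FP rate {fp_rate:.0%}")
```

When both trees are fully resolved they have the same number of internal edges, so the two rates coincide; they differ once the estimate contains polytomies, which is the situation SuperFine exploits.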

  13. FN rate: MRP vs. Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP and Concatenation.] Concatenation is not always an option. We need better supertree methods

  14. FN Rate: SuperFine vs. MRP and Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine, and Concatenation.]

  15. Running Time: SuperFine vs. MRP (Concatenation is much slower). [Plots: running time in minutes against Scaffold Density (%) for MRP and SuperFine, three panels. Annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

  16. Idea behind SuperFine (Swenson et al., Systematic Biology, 2011)
     1. Construct a supertree with low false positive rate
     2. Reduce false negatives by resolving areas of uncertainty using a supertree method (e.g., Quartet Max Cut)

  17. Bipartitions and refinement. Let $B(T)$ denote the set of (non-trivial) bipartitions induced by the edges of $T$. $T$ refines $T'$ if $B(T') \subseteq B(T)$. Example: $T'$ contains a polytomy and has $B(T') = \{ab|cdef,\ abc|def\}$; its refinement $T$ has $B(T) = \{ab|cdef,\ abc|def,\ abcd|ef\}$
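The refinement test reduces to a subset check on bipartition sets. The sketch below assumes a hypothetical tree encoding (nested tuples of leaf names, read as an unrooted tree) and hypothetical function names; the two trees are written to induce exactly the bipartition sets listed on the slide.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of leaf-name strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions induced by the internal edges of the tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        rest = all_leaves - clade
        # Keep a split only if both sides have at least two leaves.
        if len(clade) >= 2 and len(rest) >= 2:
            found.add(frozenset([clade, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def refines(fine, coarse):
    """True if every bipartition of `coarse` also appears in `fine`."""
    return bipartitions(coarse) <= bipartitions(fine)


# Trees with the slide's bipartition sets: one has a polytomy on d, e, f.
with_polytomy = (("a", "b"), "c", ("d", "e", "f"))
resolved = (("a", "b"), "c", ("d", ("e", "f")))
print(refines(resolved, with_polytomy), refines(with_polytomy, resolved))
```

Storing each bipartition as an unordered pair of leaf sets makes the comparison independent of where the nested-tuple encoding happens to place the root.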

  18. Idea behind SuperFine
     1. Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)
     2. Reduce FN by resolving each polytomy using a supertree method (e.g., Quartet Max Cut)

  19. Strict Consensus Merger (SCM). [Figure: SCM worked example on source trees with leaves drawn from a-j.]

  20. Property of SCM: Bipartitions in SCM tree correspond to bipartitions in the source trees (Swenson, Ph.D. Thesis, 2009). [Figure: the same example trees with leaves drawn from a-j.]

  21. Performance of SCM
     • Low false positive (FP) rate (estimated supertree has few false edges)
     • High false negative (FN) rate (estimated supertree is missing many true edges)
     • Runs in polynomial time (in the number of source trees and total number of species)

  22. Idea behind SuperFine
     1. Construct a supertree with low FP using SCM
     2. Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP or Quartet Max Cut)

  23. Resolving a single polytomy, v
     • Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d = degree(v)
     • Step 2: Apply MRP to the collection of reduced trees, to produce a tree t on leafset {1,2,...,d}
     • Step 3: Replace the star tree at v by tree t
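Step 2 hands the reduced trees to MRP, which starts by encoding them as a character matrix: one column per internal edge of each source tree, with 1 and 0 for the two sides of the edge and ? for labels absent from that tree. The sketch below covers only this encoding and leaves out the parsimony search that MRP then runs on the matrix. The nested-tuple tree format, the function names, and the two reduced trees are hypothetical and do not come from the slides.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of label strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def mrp_matrix(source_trees):
    """Encode trees as an MRP matrix: label -> string over '0', '1', '?'."""
    labels = sorted(frozenset().union(*(leaves(tree) for tree in source_trees)))
    columns = []
    for tree in source_trees:
        present = leaves(tree)
        seen = set()

        def visit(node):
            if isinstance(node, str):
                return
            clade = leaves(node)
            key = frozenset([clade, present - clade])
            # One column per distinct non-trivial bipartition of this tree.
            if 2 <= len(clade) <= len(present) - 2 and key not in seen:
                seen.add(key)
                columns.append({x: "1" if x in clade else "0" for x in present})
            for child in node:
                visit(child)

        visit(tree)
    # Labels missing from a source tree get '?' in that tree's columns.
    return {x: "".join(col.get(x, "?") for col in columns) for x in labels}


# Two made-up reduced trees on labels from {1, ..., 6}.
reduced_trees = [
    (("1", "4"), "5", "6"),
    (("2", "3"), "1", ("4", "6")),
]
matrix = mrp_matrix(reduced_trees)
for label, row in matrix.items():
    print(label, row)
```

Because each polytomy is resolved on only d labels instead of the full taxon set, the matrices passed to the parsimony search are much smaller, which matches the talk's explanation of where the speed-up comes from.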

  24. Back to Our Example. [Figure: the example trees on leaves a-j, annotated with the labels 1-6.]

  25. Where We Use the Property. [Figure: the example source trees annotated with the labels 1-6.]

  26. Step 1: Reduce each source tree to a tree on the set {1,2,...,d}. [Figure: the example source trees reduced to trees on numeric labels.]

  27. Step 2: Apply MRP to the collection of reduced trees. [Figure: MRP applied to the reduced trees, producing a tree on the labels 1-6.]

  28. Replace polytomy using tree from MRP. [Figure: the tree on leaves a-j with the polytomy replaced by the MRP tree.]

  29. FN Rate: SuperFine vs. MRP and Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine, and Concatenation.]

  30. Running Time: SuperFine vs. MRP (Concatenation is much slower). [Plots: running time in minutes against Scaffold Density (%) for MRP and SuperFine, three panels. Annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

  31. SuperFine: Boosting supertree methods
     • SuperFine+MRP vs. MRP (Swenson et al. 2011): SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time. The speed-up results from the re-encoding of source trees as smaller trees.
     • SuperFine+QMC vs. QMC (quartet-based): QMC (Snir 2008) runs in polynomial time but is infeasible for 500+ taxa; SuperFine+QMC runs where QMC cannot (Swenson et al. 2010).
     • SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012): SuperFine+MRL is faster and more accurate, with similar likelihood scores.
     • DACTAL (Nelesen et al. 2012): boosts concatenation methods and uses SuperFine in its divide-and-conquer strategy.
