Fast and Accurate Supertree Estimation Methods

parallel superfine a tool for fast and accurate n.w
1 / 27
Embed
Share

This content explores the use of parallelism and high-performance computing in phylogeny estimation, focusing on supertree methods like Superfine. It discusses the benefits and limitations of supertree methods, the importance of parallelism in computation, and the three-stage strategy used in Superfine for fast and accurate supertree estimation.

  • - Supertree Methods - Parallelism - High-performance Computing - Phylogeny Estimation - Superfine

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Parallel SuperFineA tool for fast and accurate supertree estimation: Features and limitations Neves and Sobral, 2017 Chloe Makdad, CS 581

  2. Parallel and High-Performance Computing in Phylogeny Estimation Getting quality results can take a long time Large amounts of data Expensive methods Could use less expensive algorithms Might produce lower quality estimations Could use HPC techniques Improve running time, preserve estimation quality

  3. Parallelism in Phylogeny Estimation* Many tools that we ve talked about support some parallelism RAxML, MrBayes, BEAST2, and others GRAPPA: parallel tool for parsimony 2500x faster than Sankoff on single processor 1000000x fast on 512 processor cluster BEAGLE: likelihood calculations Bayesian methods Performance optimization *David A Bader and Kamesh Madduri. High-performance phylogenetic inference. Bioinformatics and Phylogenetics, pages 39 45, 2019.

  4. Parallelism and Supertree Methods Parallelism can also be applied to supertree methods Superfine (2017) Guide Tree Merger (2022, hopefully!)

  5. Why Supertree Methods? Tree of Life computation Increasingly large amount of biological data available to analyze Data sets could have thousands of taxa (or more!) Supertree methods: Construct trees on subsets of the taxa and assemble Allows for divide-and-conquer approach

  6. Superfine* Three-stage strategy Stage 1: parsing the input Stage 2: produce a conservative, highly unresolved supertree estimate Stage 3: refine the tree with base supertree method and source tree *M Shel Swenson, Rahul Suri, C Randal Linder, and Tandy Warnow. Superfine: Fast and Accurate Supertree Estimation. Systematic Biology, 61(2):214, 2012.

  7. Superfine Stage 2: Strict Consensus Merger Merge the source trees 2 at a time with SCM Repeat the process until all source trees have been merged into one Note: the order in which the trees are merged has an impact! The paper suggests using maximum backbone number as the criterion for choosing the order

  8. Superfine Stage 3: Refinement The SCM tree is (usually) not fully resolved, requiring the refinement of polytomies Some notation: T: a set of source trees T: an SCM supertree on T v:be a polytomy of T (a node of degree at least 4)

  9. Superfine Stage 2: Refinement 1. Root T at v. Let v1, ,vdbe the children of v Let T1, ,Td be the subtrees rooted at v1, ,vd 2. Compute a set of re-encoded source trees Tvbased on these subtrees 3. Apply the base supertree method to obtain a tree T*with taxon labeled 1, ,d 4. Construct the final supertree T by attaching each Ti to T*

  10. Superfine Performance Originally tested with MRP and QMC as base supertree methods Superfine+MRP and Superfine+QMC outperform MRP or QMC alone Both outperform other supertree methods Both approach the accuracy of CA-ML Running time of the refinement step: Largely determined by the max degree of a node in the SCM tree Max degree being small means refinement is quite fast Even when the base supertree method is expensive!

  11. Parallel Superfine Paper Contributions Paper proposes a few improvements/changes to Superfine Extending Superfine to use FastTree as an inference tool FastTree is faster and topologically more accurate that RAxML, which Superfine uses as a base supertree method for ML analyses Parallel Superfine

  12. Parallel Superfine Authors use the EPIC framework for multilevel parallelism Parallelization occurs in the parse phase and the refinement phase No parallelization of Strict Consensus Merger

  13. EPIC Framework Supertree methods have several stages Aligning sequences Estimating trees from each alignment Estimating a supertree that each may rely on third-party tools Each relies on third-party tools that may not be controlled by the main program This makes it difficult to efficiently exploit parallelism at each level EPIC provides abstraction to help resolve this

  14. Parallelization of Parse Phase Each source tree is independent, so we can exploit inter-tree parallelism when parsing them There isn t enough pay-off to try to exploit intra-tree parallelism (parsing each individual tree in parallel) Each source tree can be parsed on one thread The scheduler will ensure the longest Newick strings are parsed first While it is simple to implement, parallelizing parse has negligible impact on overall performance

  15. Parallelization of Refinement Phase The most expensive phase of Superfine Idea: The order in which the polytomies are refined does not affect the quality of the output Refine them in parallel! Embarrassingly parallel! Challenges Load balancing Third-party tools Multilevel parallelization Effificent usage of available cores

  16. Nave Parallelization Creates a priority queue with all polytomies Polytomies sorted in descending order by degree Refinement of each polytomy processed in parallel Performs well on simulated data Disappointing on biological data

  17. Leveraging the EPIC Framework Allows us to leverage inter- and intra-polytomy parallelism Inter-polytomy parallelism Processes distinct polytomies in parallel Intra-polytomy parallelism Allows for the processing of high degree polytomies in parallel Should be exploited in all polytomies with significantly high degree

  18. A Note on Chunking Using multiple threads leads to overheads Chunking can help Several units of work packaged and assigned to a thread EPIC supports chunking, but the authors did not find it advantageous They claim this time is negligible when compared to the time required to refine a polytomy

  19. Experiments Setups: Sequential Intra-parallel Inter-parallel Hybrid-parallel Thread-counts Data sets 1000-taxon simulated with birth/death, missing data Several biological datasets (CPL, Marsupials, Placental Mammals, Seabirds, Legumes)

  20. Simulated Data RF error rate (%) Scaffold Factor

  21. Running Time Improvements (Biological Data) Running Time (sec)

  22. Running Time Improvements Speedup is relative to the Seq_SuperFine+PAUP* setup

  23. Running Time Allocation

  24. Maximum Likelihood Analyses Marsupials data set and Hybrid_Par- SuperFine+FastTree setup The Marsupials data set has one polytomy of degree 199; all of the rest are below 15 Poses challenges with allocation

  25. Limitations Load balancing Lack of parallelism in SCM phase Third-party ML softwares parallel strategies FastTree and RAxML Intra-polytomy parallelism under MP criterion lacks support Due to third-party software

  26. My Project Parallelize another supertree method: Guide Tree Merger Uses a Disjoint Tree Merger approach Computes constraint trees and uses a guide tree to determine how to connect these trees

  27. Parallelizing GTM Computation of constraint trees is the most expensive part of the pipeline Clear place where parallelism could have high impact

Related


More Related Content