Fast and Accurate Supertree Estimation Methods

parallel superfine a tool for fast and accurate n.w

1 / 27

Embed Share

This content explores the use of parallelism and high-performance computing in phylogeny estimation, focusing on supertree methods like Superfine. It discusses the benefits and limitations of supertree methods, the importance of parallelism in computation, and the three-stage strategy used in Superfine for fast and accurate supertree estimation.

arak510 Follow

Uploaded on Mar 17, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Parallel SuperFineA tool for fast and accurate supertree estimation: Features and limitations Neves and Sobral, 2017 Chloe Makdad, CS 581

Parallel and High-Performance Computing in Phylogeny Estimation Getting quality results can take a long time Large amounts of data Expensive methods Could use less expensive algorithms Might produce lower quality estimations Could use HPC techniques Improve running time, preserve estimation quality

Parallelism in Phylogeny Estimation* Many tools that we ve talked about support some parallelism RAxML, MrBayes, BEAST2, and others GRAPPA: parallel tool for parsimony 2500x faster than Sankoff on single processor 1000000x fast on 512 processor cluster BEAGLE: likelihood calculations Bayesian methods Performance optimization *David A Bader and Kamesh Madduri. High-performance phylogenetic inference. Bioinformatics and Phylogenetics, pages 39 45, 2019.

Parallelism and Supertree Methods Parallelism can also be applied to supertree methods Superfine (2017) Guide Tree Merger (2022, hopefully!)

Why Supertree Methods? Tree of Life computation Increasingly large amount of biological data available to analyze Data sets could have thousands of taxa (or more!) Supertree methods: Construct trees on subsets of the taxa and assemble Allows for divide-and-conquer approach

Superfine* Three-stage strategy Stage 1: parsing the input Stage 2: produce a conservative, highly unresolved supertree estimate Stage 3: refine the tree with base supertree method and source tree *M Shel Swenson, Rahul Suri, C Randal Linder, and Tandy Warnow. Superfine: Fast and Accurate Supertree Estimation. Systematic Biology, 61(2):214, 2012.

Superfine Stage 2: Strict Consensus Merger Merge the source trees 2 at a time with SCM Repeat the process until all source trees have been merged into one Note: the order in which the trees are merged has an impact! The paper suggests using maximum backbone number as the criterion for choosing the order

Superfine Stage 3: Refinement The SCM tree is (usually) not fully resolved, requiring the refinement of polytomies Some notation: T: a set of source trees T: an SCM supertree on T v:be a polytomy of T (a node of degree at least 4)

Superfine Stage 2: Refinement 1. Root T at v. Let v1, ,vdbe the children of v Let T1, ,Td be the subtrees rooted at v1, ,vd 2. Compute a set of re-encoded source trees Tvbased on these subtrees 3. Apply the base supertree method to obtain a tree T*with taxon labeled 1, ,d 4. Construct the final supertree T by attaching each Ti to T*

Superfine Performance Originally tested with MRP and QMC as base supertree methods Superfine+MRP and Superfine+QMC outperform MRP or QMC alone Both outperform other supertree methods Both approach the accuracy of CA-ML Running time of the refinement step: Largely determined by the max degree of a node in the SCM tree Max degree being small means refinement is quite fast Even when the base supertree method is expensive!

Parallel Superfine Paper Contributions Paper proposes a few improvements/changes to Superfine Extending Superfine to use FastTree as an inference tool FastTree is faster and topologically more accurate that RAxML, which Superfine uses as a base supertree method for ML analyses Parallel Superfine

Parallel Superfine Authors use the EPIC framework for multilevel parallelism Parallelization occurs in the parse phase and the refinement phase No parallelization of Strict Consensus Merger

EPIC Framework Supertree methods have several stages Aligning sequences Estimating trees from each alignment Estimating a supertree that each may rely on third-party tools Each relies on third-party tools that may not be controlled by the main program This makes it difficult to efficiently exploit parallelism at each level EPIC provides abstraction to help resolve this

Parallelization of Parse Phase Each source tree is independent, so we can exploit inter-tree parallelism when parsing them There isn t enough pay-off to try to exploit intra-tree parallelism (parsing each individual tree in parallel) Each source tree can be parsed on one thread The scheduler will ensure the longest Newick strings are parsed first While it is simple to implement, parallelizing parse has negligible impact on overall performance

Parallelization of Refinement Phase The most expensive phase of Superfine Idea: The order in which the polytomies are refined does not affect the quality of the output Refine them in parallel! Embarrassingly parallel! Challenges Load balancing Third-party tools Multilevel parallelization Effificent usage of available cores

Nave Parallelization Creates a priority queue with all polytomies Polytomies sorted in descending order by degree Refinement of each polytomy processed in parallel Performs well on simulated data Disappointing on biological data

Leveraging the EPIC Framework Allows us to leverage inter- and intra-polytomy parallelism Inter-polytomy parallelism Processes distinct polytomies in parallel Intra-polytomy parallelism Allows for the processing of high degree polytomies in parallel Should be exploited in all polytomies with significantly high degree

A Note on Chunking Using multiple threads leads to overheads Chunking can help Several units of work packaged and assigned to a thread EPIC supports chunking, but the authors did not find it advantageous They claim this time is negligible when compared to the time required to refine a polytomy

Experiments Setups: Sequential Intra-parallel Inter-parallel Hybrid-parallel Thread-counts Data sets 1000-taxon simulated with birth/death, missing data Several biological datasets (CPL, Marsupials, Placental Mammals, Seabirds, Legumes)

Simulated Data RF error rate (%) Scaffold Factor

Running Time Improvements (Biological Data) Running Time (sec)

Running Time Improvements Speedup is relative to the Seq_SuperFine+PAUP* setup

Running Time Allocation

Maximum Likelihood Analyses Marsupials data set and Hybrid_Par- SuperFine+FastTree setup The Marsupials data set has one polytomy of degree 199; all of the rest are below 15 Poses challenges with allocation

Limitations Load balancing Lack of parallelism in SCM phase Third-party ML softwares parallel strategies FastTree and RAxML Intra-polytomy parallelism under MP criterion lacks support Due to third-party software

My Project Parallelize another supertree method: Guide Tree Merger Uses a Disjoint Tree Merger approach Computes constraint trees and uses a guide tree to determine how to connect these trees