Guide Tree Merger Evaluation for Phylogenetic Analysis


"Explore the effectiveness of Guide Tree Merger with alternative tree estimation methods in single-gene phylogenetic analysis. Assessing performance, tradeoffs, and significance to enhance accessibility for large datasets with limited resources."

  • Phylogenetic Analysis
  • Guide Tree Merger
  • Tree Estimation
  • Computational Genomics
  • Algorithmic




Presentation Transcript


  1. Evaluating Guide Tree Merger with Alternative Tree Estimation Methods
     Final Project Presentation by Utkarsh Sharma
     CS 581: Algorithmic Computational Genomics (Spring 2025)

  2. Project Overview
     Goal: Evaluate Guide Tree Merger (GTM) with alternatives to maximum likelihood for single-gene phylogenetic analysis.
     Key Questions:
       • Can GTM work effectively with faster distance-based methods?
       • How important is guide tree quality for GTM performance?
       • What are the accuracy/runtime tradeoffs of different GTM configurations?
     Significance: Make GTM more accessible for large datasets with limited computational resources.

  3. Methodology and Experiment Design
     Compare five approaches on all datasets:
       • Baseline: FastTree-2 and NJ on the full dataset
       • Original GTM: FastTree-2 guide tree + FastTree-2 subset trees
       • Hybrid 1: FastTree-2 guide tree + NJ-LogDet subset trees
       • Hybrid 2: NJ-LogDet guide tree + FastTree-2 subset trees
       • Hybrid 3: NJ-LogDet guide tree + NJ-LogDet subset trees
     Evaluation Metrics: RF distance to reference trees, runtime, memory usage

  4. Datasets and Implementation
     Datasets: all are single-gene phylogenetic (gene tree) datasets
     Simulated:
       • GTR+Gamma simulated DNA (1K-10K taxa) from Smirnov & Warnow (2020)
       • INDELible datasets from the PASTA paper (Mirarab et al., 2015)*
     Computational Environment: UIUC Campus Cluster (CS instructional queue)
       • Runtime limit: 12 hours
       • Memory limit: 64 GB

  5. GTM Pipeline Implementation
     Guide Tree Merger workflow, in order:
       • Generate guide tree (FastTree or NJ-LogDet)
       • Decompose dataset using guide tree (subset size: 250 taxa)
       • Estimate subset trees (FastTree or NJ-LogDet)
       • Merge subset trees using GTM with "convex" merge mode
       • Evaluate against reference tree
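The decomposition step of the workflow above can be illustrated with a small sketch. The slides do not say which decomposition rule was used, so the centroid-edge split below (repeatedly cutting the edge that divides the leaves most evenly until every subset is small enough) is an assumption, as are the nested-tuple tree representation and the function names. The toy call uses a maximum subset size of 2 rather than the 250 taxa used in the project.

```python
# Sketch only: the slides do not specify the decomposition rule.
# Centroid-edge splitting is ASSUMED; trees are nested tuples of leaf names.

def leaves(tree):
    """Return the leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def clades(tree):
    """Yield every proper subtree; each one corresponds to an edge of the tree."""
    if not isinstance(tree, str):
        for child in tree:
            yield child
            yield from clades(child)

def prune(tree, clade):
    """Return a copy of tree without clade, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree
    kept = [prune(child, clade) for child in tree if child != clade]
    return kept[0] if len(kept) == 1 else tuple(kept)

def decompose(tree, max_size):
    """Split the leaf set into disjoint subsets of at most max_size leaves."""
    n = len(leaves(tree))
    if n <= max_size:
        return [sorted(leaves(tree))]
    # Centroid edge: the edge whose removal splits the leaves most evenly.
    centroid = min(clades(tree), key=lambda c: abs(n - 2 * len(leaves(c))))
    return decompose(centroid, max_size) + decompose(prune(tree, centroid), max_size)

# Hypothetical 8-taxon guide tree; leaf names must be unique.
guide = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
subsets = decompose(guide, max_size=2)
```

Each resulting subset would then be passed to the subset tree estimator (FastTree or NJ-LogDet) before the merge step.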

  6. Results - Overall Performance
     Table 1: Summary of method performance on the 1000M1 and 1000M4 datasets.

  7. RF Distance Comparison
     Figure 1: Comparison of normalized Robinson-Foulds distances across methods and models. Error bars represent standard deviation.
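For readers unfamiliar with the metric in Figure 1, here is a minimal sketch of how a Robinson-Foulds (RF) comparison can be computed from the non-trivial bipartitions of two trees. The nested-tuple tree representation, the function names, and the normalization (dividing by the total number of non-trivial bipartitions in both trees) are assumptions for illustration. The slides do not state which tool or normalization was used.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples."""
    def leaf_names(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_names(c) for c in node))

    everything = leaf_names(tree)
    anchor = min(everything)  # fixed leaf used to pick one canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(c) for c in node))
        # Skip trivial splits (a single leaf on one side, or the whole tree).
        if 1 < len(clade) < len(everything) - 1:
            # Store the side that does not contain the anchor leaf.
            splits.add(everything - clade if anchor in clade else clade)
        return clade

    visit(tree)
    return splits

def rf_metrics(reference, estimated):
    """Raw RF, normalized RF, and false-negative rate against the reference."""
    ref, est = bipartitions(reference), bipartitions(estimated)
    fn = len(ref - est)   # reference splits missing from the estimate
    fp = len(est - ref)   # estimated splits absent from the reference
    rf = fn + fp
    return {
        "rf": rf,
        "normalized_rf": rf / ((len(ref) + len(est)) or 1),
        "fn_rate": fn / (len(ref) or 1),
    }

# Hypothetical 6-taxon trees that differ in one split.
reference = ((("a", "b"), "c"), ("d", ("e", "f")))
estimated = ((("a", "c"), "b"), ("d", ("e", "f")))
metrics = rf_metrics(reference, estimated)
```

When both trees are fully resolved on the same leaf set, the false-negative rate and the normalized RF distance under this normalization coincide.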

  8. Comparing GTM Variants
     Figure 2: Comparison of GTM methods. Error bars represent standard deviation.

  9. Key Finding 1 - Guide vs. Subset Tree Quality
     Finding: Both guide tree and subset tree quality matter, but subset tree quality has the greater impact.
     Evidence:
       • Moving from high- to low-quality subset trees: ~109% increase in error
       • Moving from high- to low-quality guide trees: ~12% increase in error
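The two percentages above can be reproduced from the RF values quoted on the Practical Implications slide (219.80 for Original GTM, 246.20 for Hybrid GTM 2, and 459.85 as the lowest value reported for NJ-LogDet subset trees). The sketch below is only a worked check of that arithmetic. The variable and function names are illustrative, and pairing 459.85 with the "low-quality subset trees" case is an assumption based on how that slide words it.

```python
def pct_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline

# RF values as quoted in the slides.
RF_ORIGINAL_GTM = 219.80    # FastTree guide + FastTree subsets
RF_HYBRID_2 = 246.20        # NJ-LogDet guide + FastTree subsets
RF_NJ_SUBSET_MIN = 459.85   # lowest RF quoted for NJ-LogDet subset trees (assumed pairing)

subset_effect = pct_increase(RF_ORIGINAL_GTM, RF_NJ_SUBSET_MIN)  # about 109%
guide_effect = pct_increase(RF_ORIGINAL_GTM, RF_HYBRID_2)        # about 12%
ratio = subset_effect / guide_effect                             # about 9

print(f"subset tree effect: {subset_effect:.1f}%")
print(f"guide tree effect:  {guide_effect:.1f}%")
print(f"ratio:              {ratio:.1f}x")
```

The ratio of the two effects is where the "roughly nine times greater impact" statement in the conclusions comes from.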

  10. Key Finding 2 - Dataset Complexity Effects
      Finding: The absolute performance gap widens on more challenging datasets.
      Evidence (FN = false-negative rate):
        • 1000M1 (difficult): NJ-LogDet vs. FastTree GTR, FN 0.25 vs. 0.10 (absolute gap 0.15, relative gap 150%)
        • 1000M4 (moderate): NJ-LogDet vs. FastTree GTR, FN 0.15 vs. 0.05 (absolute gap 0.10, relative gap 200%)
      Hybrid GTM 2 maintains strong performance across both datasets:
        • 1000M1: only 12% higher FN than Original GTM (0.12 vs. 0.11, rounded)
        • 1000M4: only 20% higher FN than Original GTM (0.06 vs. 0.05)
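Because the quoted percentages are relative gaps while the finding is easier to see in absolute terms, it can help to compute both from the rounded FN rates on this slide. The sketch below does only that. The dictionary layout and function name are illustrative assumptions.

```python
# Rounded false-negative (FN) rates as quoted on the slide.
fn_rates = {
    "1000M1": {"nj_logdet": 0.25, "fasttree": 0.10, "hybrid2": 0.12, "original_gtm": 0.11},
    "1000M4": {"nj_logdet": 0.15, "fasttree": 0.05, "hybrid2": 0.06, "original_gtm": 0.05},
}

def gaps(worse, better):
    """Return (absolute gap, relative gap in percent) between two error rates."""
    absolute = worse - better
    return round(absolute, 2), round(100 * absolute / better)

for dataset, r in fn_rates.items():
    nj_abs, nj_rel = gaps(r["nj_logdet"], r["fasttree"])
    hy_abs, hy_rel = gaps(r["hybrid2"], r["original_gtm"])
    print(f"{dataset}: NJ-LogDet vs FastTree gap {nj_abs:.2f} ({nj_rel}%), "
          f"Hybrid 2 vs Original GTM gap {hy_abs:.2f} ({hy_rel}%)")
```

The absolute NJ-LogDet gap is larger on 1000M1 (0.15 vs. 0.10), while the relative gap is larger on 1000M4 because its baseline error is smaller. The rounded 1000M1 rates for Hybrid GTM 2 give about 9% rather than 12%. The 12% figure matches the unrounded RF values quoted elsewhere in the slides (246.20 vs. 219.80).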

  11. Practical Implications
      For accuracy-focused applications:
        • Use maximum likelihood methods throughout the GTM pipeline
        • Original GTM maintains FastTree-level accuracy (RF 219.80 vs. 211.30)
      For balanced accuracy/speed:
        • Hybrid GTM 2 (NJ guide + FastTree subset) is an excellent compromise
        • Only 12% higher error than Original GTM (RF 246.20 vs. 219.80)
        • Uses a faster guide tree generation method
      Avoid using NJ-LogDet for subset trees regardless of guide tree quality:
        • Over 100% higher error rates (RF 459.85+ vs. 219.80)

  12. Limitations and Future Work
      Current Limitations:
        • Tested on a limited dataset range (only 1000M1 and 1000M4)
        • Fixed subset size (250 taxa) for all experiments
        • Focused on accuracy metrics over runtime/memory
      Future Directions:
        • Evaluate on larger datasets
        • Test parameter sensitivity (vary subset size)
        • Explore additional guide tree approaches (UPGMA)
        • Comprehensive runtime and memory analysis

  13. Conclusions
        • GTM performance depends heavily on both guide tree and subset tree quality
        • Original GTM (FastTree + FastTree) stays close to the accuracy of running FastTree directly on the full dataset
        • Subset tree quality has roughly 9× greater impact than guide tree quality (~109% vs. ~12% increase in error)
        • Hybrid GTM 2 (NJ guide + FastTree subset) offers the best compromise for limited resources
        • These findings provide practical guidelines for large-scale phylogenetic analysis with GTM

  14. Acknowledgements and Questions
      Thanks to:
        • Prof. Tandy Warnow
        • Eleanor Wedell
        • Minhyuk Park
        • GTM Paper Authors (Park et al., 2021)
      Questions?

  15. References
        • Desper, R. and Gascuel, O. (2002). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of Computational Biology, 9(5):687-705.
        • Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., and Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934):1561-1564.
        • Lefort, V., Desper, R., and Gascuel, O. (2015). FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10):2798-2800.
        • Mirarab, S., Nguyen, N., Guo, S., Wang, L.-S., Kim, J., and Warnow, T. (2015). PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology, 22(5):377-386.
        • Molloy, E. K. and Warnow, T. (2019). TreeMerge: A new method for improving the scalability of species tree estimation methods. Bioinformatics, 35(13):2300-2307.
        • Nelesen, S., Liu, K., Wang, L.-S., Linder, C. R., and Warnow, T. (2012). DACTAL: Divide-and-conquer trees (almost) without alignments. Bioinformatics, 28(12):i274-i282.
        • Park, M., Zaharias, P., and Warnow, T. (2021). Disjoint tree mergers for large-scale maximum likelihood tree estimation. Algorithms, 14(5):148.
        • Price, M. N., Dehal, P. S., and Arkin, A. P. (2010). FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS ONE, 5(3):e9490.
        • Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406-425.

