
BAli-Phy: Innovative Software for Phylogenetic Analysis
"Discover BAli-Phy, a unique software from 2005 that co-estimates alignments and phylogenies, offering detailed results with uncertainties. Learn about its advantages, limitations, and comparison with other methods like MAFFT and PASTA."
Presentation Transcript

Scaling BAli-Phy to Large Datasets
June 16, 2016
Michael Nute

BAli-Phy: Brief Summary
What is BAli-Phy? (Redelings & Suchard, 2005)
Software from 2005 that takes unaligned sequences as input and co-estimates the alignment and the phylogeny in a way that accounts for indels.
Output can be a multiple sequence alignment, a phylogeny, or both, and can give an estimate of uncertainty in each one.
Why is it interesting?
The statistical model is unique and detailed, so given enough time it might find a better optimum than other methods.
Experimental evidence has shown that it gives more accurate multiple sequence alignments than more common methods (Liu et al., 2012).
1. Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401-418.
2. Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561-1564.

BAli-Phy: Brief Summary
What is the disadvantage?
Software cannot handle more than 200 taxa due to suspected numerical instability.
Computation is very slow: most publications using it have run for several weeks.
(Gaya et al., 2011): 68 sequences ran in 3 weeks.
Largest data set we have found is 117 sequences (McKenzie et al., 2014).

BAli-Phy: Quick Look at Results (1 of 2)
[Figure: Alignment error for MAFFT, PASTA, and BAli-Phy, averaged over 10 replicates. Panels: False Negative % (1 - SP-Score) and False Positive % (1 - Modeler Score). Datasets: Indelible (DNA) and RNAsim (RNA), each with 100 and 200 taxa.]

BAli-Phy: Quick Look at Results (2 of 2)
[Figure: Alignment accuracy (Total-Column Score) for MAFFT, PASTA, and BAli-Phy, averaged over 10 replicates. Datasets: Indelible (DNA) and RNAsim (RNA), each with 100 and 200 taxa.]
Total-Column Score: Percentage of columns from the reference alignment that are fully reproduced by the estimated alignment.
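
The three metrics used on these two slides can be made concrete with a small sketch. It assumes alignments are given as lists of equal-length gapped strings with "-" as the gap character and the sequences in the same order in both alignments; the function names are hypothetical and this is not the scoring tool used in the talk. It follows the slide's definition of the Total-Column Score literally (all reference columns count), whereas published tools may handle columns with a single residue differently.

```python
from itertools import combinations


def _columns(alignment):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for chars in zip(*alignment):
        col = []
        for i, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        columns.append(tuple(col))
    return columns


def _pairs(columns):
    """Set of aligned residue pairs ((seq_a, pos_a), (seq_b, pos_b))."""
    pairs = set()
    for col in columns:
        present = [(i, p) for i, p in enumerate(col) if p is not None]
        pairs.update(combinations(present, 2))
    return pairs


def alignment_scores(reference, estimated):
    """Return false negative rate, false positive rate, and total-column score.

    Assumes both alignments contain at least one aligned residue pair.
    """
    ref_cols, est_cols = _columns(reference), _columns(estimated)
    ref_pairs, est_pairs = _pairs(ref_cols), _pairs(est_cols)
    shared = len(ref_pairs & est_pairs)
    est_col_set = set(est_cols)
    reproduced = sum(col in est_col_set for col in ref_cols)
    return {
        "false_negative": 1 - shared / len(ref_pairs),  # 1 - SP-Score
        "false_positive": 1 - shared / len(est_pairs),  # 1 - Modeler Score
        "total_column": reproduced / len(ref_cols),
    }
```

For example, with reference ["AC-G", "A-TG"] and estimate ["ACG", "ATG"], both reference pairs are recovered (false negative rate 0), one of the three estimated pairs is wrong (false positive rate 1/3), and two of the four reference columns are reproduced exactly (total-column score 0.5).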

SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree.
Use tree to compute new alignment.
Estimate ML tree on new alignment.
Repeat until termination condition, and return the alignment/tree pair with the best ML score.
[Diagram: cycle between Tree and Alignment.]
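
The loop on this slide can be sketched as follows. This is a minimal illustration, not the SATé or PASTA code: the four callables (`initial_align`, `estimate_tree`, `realign`, `ml_score`) are hypothetical placeholders for external tools, and a fixed iteration count stands in for the unspecified termination condition.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_tree,
                               realign, ml_score, max_iterations=3):
    """Alternate between re-aligning and re-estimating the tree.

    Returns the (alignment, tree) pair with the best ML score seen.
    All four callables are assumptions supplied by the caller.
    """
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best_score = ml_score(alignment, tree)
    best_pair = (alignment, tree)

    for _ in range(max_iterations):
        # Use the current tree to compute a new alignment.
        alignment = realign(sequences, tree)
        # Estimate an ML tree on the new alignment.
        tree = estimate_tree(alignment)
        score = ml_score(alignment, tree)
        if score > best_score:
            best_score = score
            best_pair = (alignment, tree)

    return best_pair
```

Keeping the best-scoring pair, rather than the last one, matters because a later iteration is not guaranteed to improve the ML score.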

PASTA Algorithm
Input: unaligned sequences
1) Get initial alignment
2) Estimate tree on current alignment
3) Break into subsets according to tree
4) Use external aligner to align subsets
5) Use external profile aligner to merge subset alignments
6) Use transitivity to merge subset pairs into a full alignment, scrap the old tree
(repeat)
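
Step 3 is the least self-explanatory, so here is a hedged sketch of one way to break taxa into subsets according to a tree. The slide does not say how the split is done; this sketch assumes a rule of repeatedly cutting the edge that divides the taxa most evenly until every subset is small enough. The function name, the edge-list input format, and the brute-force search are all illustrative choices, not PASTA's implementation.

```python
from collections import defaultdict


def decompose_by_tree(edges, taxa, max_size):
    """Split taxa into subsets of at most max_size leaves.

    edges: list of (node, node) pairs describing an unrooted tree.
    taxa: names of the leaf nodes that carry sequences.
    Assumed rule: recursively cut the most balanced edge.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    taxa = set(taxa)

    def reachable(start, blocked):
        """Nodes reachable from start without crossing the blocked edge."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if {node, nxt} != blocked and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def split(nodes):
        leaves = sorted(nodes & taxa)
        if len(leaves) <= max_size:
            return [leaves]
        best = None  # (imbalance, edge, nodes on one side of the edge)
        for a in sorted(nodes):
            for b in sorted(adj[a]):
                side = reachable(a, {a, b})
                imbalance = abs(len(leaves) - 2 * len(side & taxa))
                if best is None or imbalance < best[0]:
                    best = (imbalance, (a, b), side)
        _, (a, b), side = best
        adj[a].discard(b)  # cut the chosen edge permanently
        adj[b].discard(a)
        return split(side) + split(nodes - side)

    return split(set(adj))
```

On a four-taxon tree with edges A-x, B-x, x-y, C-y, D-y and a maximum subset size of 2, the sketch cuts the central edge x-y and returns the subsets [A, B] and [C, D]. Because each subset is a connected piece of the tree, its sequences tend to be closely related, which is what makes the subset alignments in step 4 easier.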

Divide and Conquer with BAli-Phy
QUESTION: Since we saw that BAli-Phy gives better alignments on small numbers of taxa, could we get better alignments on large data sets if we used BAli-Phy on subsets?
ANSWER: Yes, but it takes a lot of computing resources.
To get the most out of PASTA+BAli-Phy, start with the tree from the LAST iteration of default PASTA.

Methods to Compare
PASTA: All default settings.
  Iterations 1-3: MAFFT (Subset Size 200) (all default settings)
PASTA+BAli-Phy: Takes advantage of faster MAFFT on early iterations where subsets are more diverse.
  Iterations 1-3: MAFFT (Subset Size 200)
  Iteration 4: BAli-Phy (Subset Size 100)
PASTA+MAFFT: Helpful to identify whether any gain is from the extra iteration or because BAli-Phy was used.
  Iterations 1-3: MAFFT (Subset Size 200)
  Iteration 4: MAFFT (Subset Size 100)
MAFFT L-INS-i: Useful comparison to benchmark difficulty of alignment. MAFFT is popular and L-INS-i is the most accurate version.
  Default MAFFT (v7.273) using the mafft-linsi command

Error Reduction (1000 Sequences)
[Figure: Alignment error (smaller is better) for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), and MAFFT L-INS-i. Panels: False Negative % (i.e. 1 - SP-Score) and False Positive % (i.e. 1 - Modeler Score). Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1.]

Accuracy Gain (1000 Sequences)
[Figure: Total Column Score for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), and MAFFT L-INS-i. Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1.]

Tree Error Relative to ML(Reference Alignment)
[Figure: Delta-RF (RAxML) for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), and MAFFT L-INS-i. Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1.]

Accuracy Gain (1000 Sequences, Detail)
[Figure: Plots of PASTA against PASTA+BAli-Phy, with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Tree Error: Delta RF (RAxML), Tree Error: Delta RF (FastTree-2), False Negative %, False Positive %, Total Column Score. Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1.]

Scaling to 10,000 Sequences
We can use UPP (Nguyen et al., 2015) to extend an alignment to larger numbers of sequences:
1) Take a random backbone subset (i.e. 1,000 sequences from previous slides)
2) Align the backbone
3) Align all remaining sequences to the backbone via HMMs

Scaling to 10,000 Sequences
Accuracy of full alignment tends to track the accuracy of the backbone:

Data          Backbone         FP %    FN %    TC      Delta-RF
Indelible M2  PASTA            3.8%    6.4%    2.6%    0.77%
Indelible M2  PASTA+BAli-Phy   2.2%    4.4%    4.3%    0.54%
Indelible M2  PASTA+MAFFT      2.7%    5.0%    3.2%    0.62%
RNAsim        PASTA            9.2%    9.5%    0.5%    0.77%
RNAsim        PASTA+BAli-Phy   8.6%    9.0%    0.6%    0.67%
RNAsim        PASTA+MAFFT      10.6%   10.9%   0.5%    0.67%

Scaling to 10,000 Sequences
[Figure: Plots of PASTA against PASTA+BAli-Phy, with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Total Column Score, Recall (SP-Score), Tree Error: Delta RF. Datasets: Indelible M2, RNAsim.]

Hypothetical Running Time Comparison
Data: 1,000 taxa
Goal: Run PASTA (1 iteration, maximum subset size 100)
Resource: 1 server, 32 cores
Question: How long does each subset-alignment method take?
Answer: With 1,000 taxa and subsets no larger than 100, we'll have approximately 15 subsets (between 10 and 16).
MAFFT: Takes 1 core approx. 10-20 minutes to do 1 subset. Can do all subsets in parallel. Total: 10-20 minutes.
BAli-Phy: Takes 24 hours for all 32 cores to do 1 subset. Can't run subsets in parallel since we only have 32 cores. Total: 15 days.
These calculations are hypothetical but representative.
If we have multiple servers, we can run the BAli-Phy subsets in parallel in less time.
Still, if we want to run BAli-Phy, it makes the most sense to run several iterations with MAFFT first.
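
The arithmetic behind these totals can be written out as a short sketch. The function name and parameters are hypothetical, and it assumes every subset takes the same time and that jobs run in synchronized rounds, which is a simplification of real scheduling.

```python
import math


def wall_clock_hours(n_subsets, hours_per_subset, cores_per_job, total_cores):
    """Estimate wall-clock time when subsets are aligned in rounds.

    Assumes identical per-subset run times and that each job holds
    cores_per_job cores for its whole duration.
    """
    concurrent_jobs = max(1, total_cores // cores_per_job)
    rounds = math.ceil(n_subsets / concurrent_jobs)
    return rounds * hours_per_subset


# MAFFT: 1 core per subset, so all 15 subsets fit on 32 cores at once.
mafft_hours = wall_clock_hours(15, 20 / 60, cores_per_job=1, total_cores=32)

# BAli-Phy: one subset uses all 32 cores, so the 15 subsets run one by one.
baliphy_hours = wall_clock_hours(15, 24, cores_per_job=32, total_cores=32)

# With 15 servers of 32 cores each, all BAli-Phy subsets run at once.
baliphy_cluster_hours = wall_clock_hours(15, 24, cores_per_job=32,
                                         total_cores=15 * 32)

print(f"MAFFT: {mafft_hours * 60:.0f} minutes")
print(f"BAli-Phy, 1 server: {baliphy_hours / 24:.0f} days")
print(f"BAli-Phy, 15 servers: {baliphy_cluster_hours:.0f} hours")
```

Under these assumptions the single-server BAli-Phy run takes 15 rounds of 24 hours, which is the 15 days quoted on the slide, while 15 servers bring it down to a single 24-hour round.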

Summary
BAli-Phy provides more accurate alignments than MAFFT on small data, which can translate to more accurate alignments on up to 1,000 taxa by boosting with PASTA.
Although the running time is longer, this allows BAli-Phy to be scaled in a parallel way, so we can align 1,000 sequences in the time it takes to align 100.
Alignment accuracy translates to improved tree accuracy on this data.
Alignments can be further extended to 10,000 sequences using UPP.
Code at: http://github.com/mgnute/pasta

Acknowledgements
Special Thanks: This is joint work with my advisor, Prof. Tandy Warnow. Special thanks to Nam Nguyen for early collaboration on this project and for teaching me the codebase for PASTA and UPP, and to Erin Molloy for several helpful discussions and suggestions over the course of this research.
NSF: This work was funded by NSF grant III:AF:1513629.
Blue Waters: This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

References
[1] Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401-418.
[2] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561-1564.
[3] Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059-3066.
[4] Mirarab, S., Nguyen, N., Guo, S., Wang, L.-S., Kim, J., & Warnow, T. (2015). PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. Journal of Computational Biology, 22(5), 377-386.
[5] McKenzie, S. K., Oxley, P. R., & Kronauer, D. J. C. (2014). Comparative genomics and transcriptomics in ants provide new insights into the evolution and function of odorant binding and chemosensory proteins. BMC Genomics, 15(1), 718.
[6] Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J., Stamatakis, A. P., & Linder, C. R. (2012). SATé-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Systematic Biology, 61(1), 90-106.
[7] Nguyen, N. D., Mirarab, S., Kumar, K., & Warnow, T. (2015). Ultra-large alignments using phylogeny-aware profiles. Genome Biology, 16(1), 124.