
Enhancing Multiple Sequence Alignment with PASTA and Parallelism
PASTASpark brings multiple sequence alignment (MSA) to the Big Data world. This presentation covers the cost of MSA and the role of High Performance Computing, the PASTA methodology and where it can be parallelized, the limitations of PASTA's existing parallelism, and how Apache Spark is used to distribute the work across a cluster.
Presentation Transcript
PASTASpark: multiple sequence alignment meets Big Data (Abuín, Pena, Pichel, 2017)
Presented by Vikram Ramavarapu, CS 581
High Performance Computing in Multiple Sequence Alignment
MSA is expensive: large amounts of data, computationally expensive methods.
Improvements on methods: more efficient algorithms (which can be less accurate), or HPC and parallelism.
MSA tools that support parallelism: MAFFT, M2Align, ClustalW.
PASTA (in words this time)
Phase I: The sequence set S is divided into disjoint subsets S1, ..., Sm, each with at most 200 sequences, using the current guide tree and the centroid decomposition technique from SATé-II. A spanning tree T on the subsets is obtained.
Phase II: MSAs on each Si are computed using an existing MSA tool (MAFFT by default).
Phase III: Every node in T is labeled by an alignment subset for which we have a type 1 subalignment from the previous step. For every edge (v, w) in T, OPAL [22] is used to align the type 1 subalignments at v and w. The final alignment is computed through a sequence of pairwise mergers using transitivity.
Phase IV: If an additional iteration (or a tree on the alignment) is desired, FastTree-2 is used to estimate a maximum likelihood tree on the MSA produced in the previous phase, and the process is repeated.
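As a structural outline only, one PASTA iteration can be sketched in Python as below. Every callable passed in (decompose, build_spanning_tree, align, merge, estimate_tree) is a hypothetical placeholder standing in for the tools named above (SATé-II-style decomposition, MAFFT, OPAL, FastTree-2); this is not PASTA's actual API.

```python
# Structural sketch of one PASTA iteration, with the real tools replaced
# by caller-supplied placeholders (decompose, build_spanning_tree, align,
# merge, estimate_tree). Illustrative only, not PASTA's code.

def pasta_iteration(sequences, guide_tree,
                    decompose, build_spanning_tree, align, merge, estimate_tree,
                    max_subset_size=200):
    # Phase I: split S into disjoint subsets of at most 200 sequences using
    # centroid decomposition on the current guide tree, and build a spanning
    # tree T over the subsets.
    subsets = decompose(sequences, guide_tree, max_subset_size)
    spanning_tree = build_spanning_tree(subsets, guide_tree)

    # Phase II: align each subset independently (the type 1 subalignments).
    subalignments = {subset_id: align(seqs) for subset_id, seqs in subsets.items()}

    # Phase III: merge subalignments pairwise along the edges of T (OPAL),
    # then extend to the full alignment using transitivity.
    msa = merge(spanning_tree, subalignments)

    # Phase IV: estimate a maximum likelihood tree on the new MSA (FastTree-2);
    # the caller can pass this tree back in for another iteration.
    new_tree = estimate_tree(msa)
    return msa, new_tree
```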
Parallelizability of PASTA
The most expensive phase in terms of computational time is P2 (subset alignment).
The subset alignments are independent of one another, so they can be computed in parallel.
PASTA Already Uses Parallelism
Default number of threads = number of cores on the machine; the user can also specify it.
Process (sketched in code below):
1. Subsets are stored as files (MAFFT input).
2. Each Python thread forks a child process.
3. The child processes each run MAFFT.
4. Outputs are stored for the next phase.
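A minimal sketch of that pattern, assuming MAFFT is installed on the PATH and the subset FASTA files already exist on disk; it illustrates the thread-plus-child-process approach, not PASTA's actual source.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count

def run_mafft(subset_fasta):
    """Fork a child process that runs MAFFT on one subset file."""
    out_path = subset_fasta + ".aln"
    with open(out_path, "w") as out:
        subprocess.run(["mafft", "--auto", subset_fasta], stdout=out, check=True)
    return out_path

def align_subsets(subset_files, n_threads=None):
    # Default thread count = number of cores, as in PASTA; the user may override it.
    n_threads = n_threads or cpu_count()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each thread blocks while its MAFFT child process runs, so the threads
        # do not contend for locks, but all work stays on one shared-memory machine.
        return list(pool.map(run_mafft, subset_files))
```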
Issues
Threads launch child processes to run MAFFT, which avoids locking during parallel processing.
But the Python libraries add overhead when creating processes, and the number of processes is bounded by the number of cores.
PASTA is limited to shared-memory machines, which limits the number of cores that can be used.
Apache Spark
A cluster computing framework.
Supports parallelization, task distribution, and fault tolerance.
Computations are expressed as a DAG of operations.
The Anatomy of a Spark Application
Spark Driver: the central coordinator.
Spark Executors: many workers running independent sets of processes.
SparkContext: an object on the driver that coordinates the executors; it connects to a cluster manager (e.g., Hadoop YARN) to schedule executors.
Resilient Distributed Datasets (RDDs): read-only data that is partitioned across the workers.
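A minimal PySpark example of these pieces (written against the RDD API of the Spark 1.6-era versions listed later): the driver creates a SparkContext, parallelizes data into an RDD spread across the executors, and collects the results back. The application name is illustrative.

```python
from pyspark import SparkConf, SparkContext

# The driver creates the SparkContext; the cluster manager (e.g., Hadoop YARN)
# is chosen when the job is launched with spark-submit.
sc = SparkContext(conf=SparkConf().setAppName("spark-anatomy-demo"))

# An RDD is read-only and partitioned across the executors.
rdd = sc.parallelize(range(1000), 8)

# map() is a lazy transformation added to the DAG; collect() triggers
# execution on the executors and returns the results to the driver.
squares = rdd.map(lambda x: x * x).collect()
print(sum(squares))
sc.stop()
```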
PASTASpark
P1, P3, and P4 run on the Spark Driver; P2 runs on the Spark Executors.
The input RDD holds (key, value) pairs: key = job identifier, value = unaligned subset.
spark-submit launches the worker nodes, map() executes MAFFT on the subsets, and collect() retrieves the outputs to the host.
Writing the outputs is also parallelized.
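A hedged sketch of that flow, assuming the unaligned subset files sit on a filesystem visible to every executor and MAFFT is installed on each worker; the paths and function names are illustrative, and this is not PASTASpark's actual code.

```python
import subprocess
from pyspark import SparkConf, SparkContext

def run_mafft_on_subset(pair):
    """Runs on an executor: align one unaligned subset with MAFFT."""
    job_id, subset_fasta = pair
    aligned = subprocess.check_output(["mafft", "--auto", subset_fasta])
    out_path = subset_fasta + ".aln"
    # Each executor writes its own output, so writing is parallelized as well.
    with open(out_path, "wb") as f:
        f.write(aligned)
    return (job_id, out_path)

sc = SparkContext(conf=SparkConf().setAppName("pastaspark-p2-sketch"))

# Key = job identifier, value = path to an unaligned subset (Phase II input).
jobs = [(i, "subsets/subset_%d.fasta" % i) for i in range(100)]
rdd = sc.parallelize(jobs)

# map() runs MAFFT on the executors; collect() brings the (key, output path)
# pairs back to the driver, where Phase III merges the subalignments.
aligned_subsets = dict(rdd.map(run_mafft_on_subset).collect())
sc.stop()
```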
The Data
Datasets chosen from the PASTA paper:
Subsamples of 10k, 20k, 50k, 100k, and 200k sequences from RNASim.
Biological datasets from the Comparative Ribosomal Website (CRW).
The Tools
Hadoop: 2.7.1 HDP (CESGA), 2.7.2 (AWS)
Spark: 1.6.1 (CESGA), 1.6.2 (AWS)
Java: 1.8
Computational Platforms
CESGA (Galicia Supercomputing Center, Spain): 12 nodes, 8 cores and 54.5 GB RAM per node, Intel Xeon E5-2620 v3 at 2.40 GHz.
AWS EC2 cluster with r3.4xlarge instances: 9 nodes, 16 cores and 122 GB RAM per node, Intel Xeon E5-2670 at 2.5 GHz.
Limitations and Future Directions
As the number of cores increases, P4 (tree computation) becomes the bottleneck; integrating P4 into the Spark framework is a natural next step.
Improvements on the study design:
Analysis on the other datasets used in the original PASTA paper.
Analysis of the portions of compute time (how did they change with parallelism?).
The paper has no figures showing that accuracy was not affected.
My Project
MAGUS shares P1 and P2 in common with PASTA.
Apply the same parallelization method to MAGUS.