Parallelization Opportunities in Proteomics Mass Spectra Search

Explore the challenges and opportunities in protein identification through mass spectrometry, focusing on the need for robust software to handle the vast amount of data generated daily. Learn about peptides, amino acids, mass spectra, and tools like SEQUEST in this informative journey towards efficient parallelization in proteomics research.

Uploaded on Mar 21, 2025

Presentation Transcript

Proteomics Mass-Spectra Search: Parallelization Opportunities Majdi Maabreh 04/22/2015 3/21/2025

Introduction
Protein identification: why? There are many reasons; some are listed in [1]:
- Protein identification is important at all stages of drug development.
- Food and feed applications, such as hypoallergenic (baby) foods.
- Some proteins are involved in programmed cell death (apoptosis).
- Much of the fabric of our body is built from protein molecules: muscle, cartilage, ligaments, skin, and hair.

Problem Statement
Protein identification: what is the problem, and where does it lie?
- A huge amount of data is generated daily: a Thermo Fusion spectrometer produces more than 24 GB per day.
- The computational side is still the bottleneck in this field [2].
- There is a need for software that is robust, efficient, and built on state-of-the-art algorithms [3].

Definitions (1)
What is a protein? One or more long chains of amino acids. Proteins are an essential part of all living organisms.
What is a peptide? A chain of amino acids.

Definitions (2)
What is an amino acid? A simple organic compound containing both a carboxyl (-COOH) and an amino (-NH2) group.
What does amino acid mass mean? A predefined value from biochemistry: each amino acid has a fixed, known mass.

Definitions (3)
What is the peptide precursor mass? Simply, the sum of the masses of the amino acids that form the peptide.
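
The definition above can be sketched in a few lines. This is a minimal illustration, not code from the presentation: it assumes monoisotopic residue masses (standard reference values), and it adds the mass of one water molecule for the peptide's terminal groups, which is the usual convention when residue masses are summed. The names `RESIDUE_MASS`, `WATER`, and `peptide_mass` are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard reference values).
# Assumption: the slides do not say which mass table is used.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one H2O for the N- and C-terminal groups


def peptide_mass(sequence):
    """Return the neutral monoisotopic mass of a peptide sequence."""
    try:
        residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    except KeyError as exc:
        raise ValueError(f"unknown amino acid: {exc.args[0]}") from None
    return residues + WATER


mass = peptide_mass("GPFNA")  # the example peptide used later in the deck
```

Summing residue masses alone gives the mass of the chain without its terminal H and OH; whether the search tool stores masses with or without the water term is an implementation detail not covered in the slides.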

Definitions (4)
What is a mass spectrum? The mass spectrum of a sample is a pattern representing the distribution of ions by mass (more precisely, by mass-to-charge ratio, m/z) in the sample.

SEQUEST
SEQUEST is one of the earliest programs for mass-spectra search and, according to [2], the most widely used. The first release of SEQUEST did not include a peptide index; instead, it rescanned the database file for each new observed spectrum. This approach needs little memory and disk space, but it runs slowly [2].

Crux: Tide-Search (Big Picture)
SEQUEST-style searching with improvements (sequential). The stages shown in the diagram:
1. Read the observed spectrum and its precursor mass M.
2. Normalize the observed spectrum.
3. Query the database for candidate peptides whose masses are within ±3 of the observed M.
4. Generate theoretical spectra for the candidates.
5. Compare and rank.
6. Output the ordered results.
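
The candidate query can be sketched as a range lookup over a mass-sorted peptide index. This is a minimal sketch under assumptions: the index layout (two parallel lists sorted by mass) and the names `build_index` and `candidates_in_window` are hypothetical, and the peptide labels and masses in the usage lines are made up. Only the window of 3 comes from the slide.

```python
from bisect import bisect_left, bisect_right


def build_index(peptide_masses):
    """Sort (peptide, mass) pairs by mass and return two parallel lists."""
    ordered = sorted(peptide_masses.items(), key=lambda item: item[1])
    peptides = [pep for pep, _ in ordered]
    masses = [mass for _, mass in ordered]
    return masses, peptides


def candidates_in_window(masses, peptides, observed_mass, window=3.0):
    """Return peptides whose mass lies in [observed - window, observed + window]."""
    lo = bisect_left(masses, observed_mass - window)
    hi = bisect_right(masses, observed_mass + window)
    return peptides[lo:hi]


# Usage with made-up labels and masses.
masses, peptides = build_index(
    {"pep_a": 500.0, "pep_b": 502.9, "pep_c": 503.1, "pep_d": 497.0, "pep_e": 496.9}
)
hits = candidates_in_window(masses, peptides, observed_mass=500.0)
```

With a sorted index, each query costs two binary searches plus the size of the result, and a wider window only lengthens the returned slice.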

Normalizing the Observed Spectrum
A set of bins, each of width 1.0005079 Da/charge, is laid over the full range of m/z values. Each peak is placed into its nearest bin, and each bin retains only its highest peak. [Formula garbled in the source.] The binned spectrum is then divided into 10 equally sized regions, and each region is scaled so that its maximum peak is 50.
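
The binning and region scaling can be sketched as follows. This is a minimal sketch, not the Crux implementation: the function name, the nearest-bin rule via `round`, and the way the 10 regions are cut are assumptions. The optional `take_sqrt` flag reflects the square-root intensity transform commonly described for SEQUEST-style preprocessing; whether that is the step missing from the slide is also an assumption, so it is off by default.

```python
import math

BIN_WIDTH = 1.0005079  # Da/charge, from the slide
NUM_REGIONS = 10       # from the slide
REGION_MAX = 50.0      # from the slide


def normalize_spectrum(peaks, take_sqrt=False):
    """peaks: list of (mz, intensity). Returns a list of binned intensities."""
    if not peaks:
        return []
    n_bins = max(round(mz / BIN_WIDTH) for mz, _ in peaks) + 1
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        if take_sqrt:  # assumed SEQUEST-style transform, not stated on the slide
            intensity = math.sqrt(intensity)
        idx = round(mz / BIN_WIDTH)
        bins[idx] = max(bins[idx], intensity)  # each bin keeps its highest peak
    region_size = math.ceil(n_bins / NUM_REGIONS)
    for start in range(0, n_bins, region_size):
        end = min(start + region_size, n_bins)
        top = max(bins[start:end])
        if top > 0:
            for i in range(start, end):
                bins[i] = bins[i] * REGION_MAX / top  # region maximum becomes 50
    return bins


# Usage with made-up peaks: two fall into the same bin, one shares a region.
vector = normalize_spectrum([(100.0, 4.0), (100.1, 16.0), (60.0, 4.0), (500.0, 8.0)])
```

Scaling each region separately keeps a single dominant peak from flattening the rest of the spectrum.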

Normalized Observed Spectrum
[Figure: normalized observed spectrum, used as is from [4].]

Generate Theoretical Spectrum [5]
A mass spectrometer typically breaks a peptide p1 p2 ... pn at different peptide bonds and detects the masses of the resulting partial N-terminal and C-terminal peptides. Example: the peptide GPFNA may be broken into the N-terminal peptides G, GP, GPF, GPFN and the C-terminal peptides PFNA, FNA, NA, A. The N-terminal fragments are called b-ions and the C-terminal fragments y-ions.
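
The GPFNA example can be reproduced with a short sketch that enumerates the fragments produced by breaking each peptide bond once. The function name `fragment_peptide` is hypothetical, and the sketch lists only fragment sequences, not their masses.

```python
def fragment_peptide(peptide):
    """Return (N-terminal prefixes, C-terminal suffixes) for every bond break."""
    cut_points = range(1, len(peptide))  # one cut between each adjacent pair
    n_terminal = [peptide[:i] for i in cut_points]  # b-ion sequences
    c_terminal = [peptide[i:] for i in cut_points]  # y-ion sequences
    return n_terminal, c_terminal


b_fragments, y_fragments = fragment_peptide("GPFNA")
# b_fragments == ["G", "GP", "GPF", "GPFN"]
# y_fragments == ["PFNA", "FNA", "NA", "A"]
```

A peptide of length n has n - 1 bonds, so it yields n - 1 fragments of each kind.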

Generate Theoretical Spectrum [2]
The theoretical mass of each of these ions is then placed into bins of width 1.0005079 Da/charge, just as for the observed spectrum. Intensity values are assigned to the bins: 50 for each b- and y-ion bin, and 25 or 10 for other bins according to additional rules.
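
A minimal sketch of the intensity assignment, under stated assumptions: the slide does not spell out the rules for the 25 and 10 values, so this sketch assumes that 25 goes to the bins directly flanking a b- or y-ion bin (a rule commonly described for SEQUEST-style scoring) and leaves the 10-valued bins out entirely. The function name and the input m/z values are hypothetical.

```python
BIN_WIDTH = 1.0005079  # Da/charge, same bin width as the observed spectrum


def theoretical_vector(ion_mzs, n_bins, main=50.0, flank=25.0):
    """Build a binned intensity vector from b- and y-ion m/z values."""
    vector = [0.0] * n_bins
    for mz in ion_mzs:
        idx = round(mz / BIN_WIDTH)
        if not 0 <= idx < n_bins:
            continue  # ion falls outside the binned range
        vector[idx] = max(vector[idx], main)
        for neighbor in (idx - 1, idx + 1):  # assumed flanking-bin rule
            if 0 <= neighbor < n_bins:
                vector[neighbor] = max(vector[neighbor], flank)
    return vector


# Usage with two made-up ion m/z values.
v = theoretical_vector([58.03, 155.08], n_bins=200)
```

Using `max` keeps a bin at 50 when it is both an ion bin and the neighbor of another ion.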

Comparing and the PSM Process
Comparison is based on peak-intensity vectors. Vector U contains the peak intensities of the observed spectrum; vector V contains the peak intensities of the theoretical spectrum. Both vectors have length N, where N is the number of bins. For each spectrum, the peptide-spectrum match (PSM) with the highest XCorr score is reported to the user.
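
The slide names XCorr without defining it. The sketch below assumes the cross-correlation form commonly described for SEQUEST-style scoring: the dot product of U and V at zero offset, minus the mean dot product over offsets from -75 to +75. The offset range, the function names, and the tiny vectors in the usage lines are assumptions for illustration, not details taken from the slides.

```python
def dot_shifted(u, v, tau):
    """Dot product of u with v shifted by tau bins, ignoring out-of-range bins."""
    n = len(u)
    return sum(u[i] * v[i + tau] for i in range(max(0, -tau), min(n, n - tau)))


def xcorr(u, v, max_shift=75):
    """Zero-offset correlation minus the mean correlation over nearby offsets."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same number of bins")
    offsets = range(-max_shift, max_shift + 1)
    background = sum(dot_shifted(u, v, tau) for tau in offsets) / len(offsets)
    return dot_shifted(u, v, 0) - background


def best_match(u, candidates, max_shift=75):
    """Return the candidate name whose theoretical vector scores highest."""
    return max(candidates, key=lambda name: xcorr(u, candidates[name], max_shift))


# Usage with made-up four-bin vectors.
observed = [0.0, 50.0, 0.0, 0.0]
winner = best_match(observed, {"pep_a": [0.0, 50.0, 0.0, 0.0],
                               "pep_b": [0.0, 0.0, 0.0, 50.0]}, max_shift=1)
```

Subtracting the background term penalizes candidates that match the observed spectrum equally well at any offset, which is what a random match tends to do.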

Parallelization Opportunities
From this point on, please feel free to add to, correct, or challenge any idea.

PO#1
[Figure: the Tide-search pipeline diagram from the big-picture slide (Observed Spectrum, Read Precursor Mass, Normalize, Query Database, Candidate peptides with masses within ±3 of observed M, Generate Theoretical Spectra, Compare & Rank, Ordered Results).]

PO#2
[Figure: the same Tide-search pipeline diagram.]

PO#3
[Figure: the same Tide-search pipeline diagram.]

PO#4
[Figure: the same Tide-search pipeline diagram.]
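
As one illustration (an assumption, not necessarily one of the four opportunities above): observed spectra are independent of one another, so the per-spectrum search can be mapped over a worker pool. The sketch below uses hypothetical names, and a plain dot product stands in for the real scoring function. It uses a thread pool to stay short; for CPU-bound scoring in Python, a process pool or native code would be needed to see an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def score(u, v):
    """Placeholder score: plain dot product, standing in for XCorr."""
    return sum(a * b for a, b in zip(u, v))


def search_one(observed, candidates):
    """Return the best-scoring candidate name for one observed spectrum."""
    return max(candidates, key=lambda name: score(observed, candidates[name]))


def search_all(spectra, candidates, workers=4):
    """Search every observed spectrum; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: search_one(s, candidates), spectra))


# Usage with made-up three-bin vectors.
candidates = {"pep_a": [50.0, 0.0, 0.0], "pep_b": [0.0, 0.0, 50.0]}
results = search_all([[10.0, 0.0, 1.0], [0.0, 2.0, 30.0]], candidates)
```

Because no spectrum depends on another's result, the only shared state is the read-only candidate set, which keeps this stage free of locking.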

Improvements and Research Ideas
- Why generate the theoretical spectra every time a new observed spectrum arrives? Can all theoretical spectra be generated ahead of time, so that the same work is not repeated?
- What about the window used to retrieve candidate peptides from the database? Why 3? Why not 5, for example? How should this value be chosen under parallel computing, given that computing time will decrease?
- The two primary postprocessors, Percolator and Barista, offer more substantial differences. Both use a target-decoy machine learning approach, but they behave differently. Which approach performs better in practice is an open question that deserves further exploration.
- Is the normalization described earlier the best way to prepare the spectra for comparison? Think of it as comparing two 2D arrays.
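
The first idea, generating theoretical spectra once rather than per observed spectrum, can be sketched with memoization. This is an illustration under assumptions: the function name is hypothetical, the fragment list stands in for a full binned intensity vector, and the call counter exists only to show that repeated requests do not redo the work.

```python
from functools import lru_cache

GENERATION_CALLS = {"count": 0}  # demonstration only: counts real generations


@lru_cache(maxsize=None)
def theoretical_spectrum(peptide):
    """Generate (once) and cache a stand-in theoretical spectrum for a peptide."""
    GENERATION_CALLS["count"] += 1
    cut_points = range(1, len(peptide))
    prefixes = tuple(peptide[:i] for i in cut_points)  # N-terminal fragments
    suffixes = tuple(peptide[i:] for i in cut_points)  # C-terminal fragments
    return prefixes + suffixes


first = theoretical_spectrum("GPFNA")   # generated
second = theoretical_spectrum("GPFNA")  # served from the cache
```

The trade-off mirrors the one noted for the first SEQUEST release: avoiding precomputed data keeps memory and disk use low but costs time, while caching or precomputing spends memory to save time.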

Conclusion & Future Work
- There are many parallelization opportunities in Crux Tide-search.
- There is a good chance of building a fully parallel solution from the sequential Tide-search in Crux.
- Spectra search for protein identification has been a hot topic for many years and still needs a lot of research to improve the performance of its applications.
- In the near future, we will try in practice to push the limits of Tide-search.

References
1) http://www.eurosequence.nl
2) Diament B. J. and Noble W., "Faster SEQUEST Searching for Peptide Identification from Tandem Mass Spectra". Journal of Proteome Research. 2011.
3) *Sean McIlwain et al., "Crux: Rapid Open Source Protein Tandem Mass Spectrometry Analysis". Journal of Proteome Research. 2014.
4) Eng J. K., McCormack A., and Yates J. R., "An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database". Journal of the American Society for Mass Spectrometry. 1994.
5) Jones N. C. and Pevzner P. A., "An Introduction to Bioinformatics Algorithms". The MIT Press. 2004.
* Not the full list of authors.