Marine Metagenomics Data Architecture

Slide Note

Marine metagenomics data architecture involves three tiers of database curation, plus storage, data transfers, and analysis pipelines for marine genetic information. The database tiers are MarRef (Tier 1) for complete, manually curated genomes, MarDB (Tier 2) for all marine genomes whether complete or not, and MarCat (Tier 3) for annotations of assembled metagenomics and metatranscriptomics reads. The data storage architecture comprises a reference database, Spark, HDFS, a web GUI, a REST API, and gridFTP, and data have been transferred from ENA to Tromsø. The EMG/MGP pipeline is being ported to the cloud and META-pipe is being adapted to Apache Spark, alongside work on defining benchmarking tools and data standards.


Uploaded on Mar 16, 2025 | 2 Views


Presentation Transcript

WP6: Marine metagenomics
Inge Alexander Raknes, Giacomo Tartari (ELIXIR-NO)
ELIXIR All Hands, 8-9 March 2016, Barcelona, Spain
ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures programme of Horizon 2020, grant agreement number 676559.
www.elixir-europe.org/excelerate

Use case architecture

Database
Tier 1 - MarRef: Gold standard, built upon complete marine prokaryotic, eukaryotic and virus genomes available in the UniProt proteome database. Manually curated.
Tier 2 - MarDB: Includes all prokaryotic, eukaryotic and virus genomes, whether or not they are complete. Manually curated at the beginning; standards will be introduced later to avoid manual curation.
Tier 3 - MarCat: Based upon annotation of assembled marine metagenomics and metatranscriptomics reads.
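The tier rules above can be sketched as a small classification function. This is only an illustration in plain Python: the `Record` type, its field names, and `tiers_for` are hypothetical and not taken from the MarRef/MarDB/MarCat implementation. It also assumes that a curated complete genome is listed in MarDB as well as MarRef, which the slides imply ("includes all ... genomes") but do not state outright.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical record; field names are assumptions for illustration."""
    accession: str
    kind: str               # "genome", "metagenome" or "metatranscriptome"
    complete: bool = False  # is the genome complete?
    curated: bool = False   # has it been manually curated?


def tiers_for(record: Record) -> List[str]:
    """Return the database tiers a record would belong to under the stated rules."""
    if record.kind in ("metagenome", "metatranscriptome"):
        # Tier 3: annotation of assembled metagenomics/metatranscriptomics reads.
        return ["MarCat"]
    if record.kind != "genome":
        raise ValueError(f"unknown record kind: {record.kind!r}")
    # Tier 2: all genomes, whether or not they are complete.
    tiers = ["MarDB"]
    # Tier 1: complete and manually curated genomes (assumed to stay in MarDB too).
    if record.complete and record.curated:
        tiers.insert(0, "MarRef")
    return tiers


if __name__ == "__main__":
    print(tiers_for(Record("EXAMPLE_1", "genome", complete=True, curated=True)))
    print(tiers_for(Record("EXAMPLE_2", "genome")))
    print(tiers_for(Record("EXAMPLE_3", "metagenome")))
```

The accessions in the usage lines are placeholders, not real ENA identifiers.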

Tier 1: MarRef (gold standard, complete genomes)
Sources: ENA/GenBank/DDBJ, RefSeq
Processing: manual curation and enrichment
Database: MarRef (MarRef Nucleotide, MarRef Protein)

Tier 2: MarDb (marine genome database)
Sources: genome projects, ENA/GenBank/DDBJ
Database: MarineDb (MarDb Nucleotide, MarDb Protein)

Tier 3: Marine gene catalogue
Inputs: marine metagenomics reads (EBI metagenomics, ENA), marine metatranscriptome reads (ENA), Tier 1 database, Tier 2 database
Processing: META-pipe
Database: MarCat gene catalogue (MarCat Nucleotide, MarCat Protein)

Data Storage Architecture
Components: Reference DB, Spark, HDFS (big data), web GUI, REST API, gridFTP, SQL metadata, NorStore backup
Users: Curator, Scientist, Admin

Data Transfers
Transferred 36 projects/studies from ENA to Tromsø
Temporarily parked data on NorStore staging area
Thanks to Tony Wildish and Thierry Toutain
Transfer speed lower than expected; investigation in progress

Pipelines
EMG/MGP: porting to cloud (Embassy cloud or Amazon EC2)
META-pipe: adapting to Apache Spark
Defining set of tools for benchmarking
Defining data standards

Meta-pipe architecture
Front ends: Web front-end, CLI Tool
Public API
Auth (Elixir AAI): tokens, authentication events
Storage: inputs / uploads, outputs / downloads
Job Service: job queue, execution status
Execution environments: Execution Manager (Stallo), Execution Manager (CSC), Execution Manager (ICE-2), Execution Manager (anywhere else?)
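The job service's two responsibilities named on the slide, a job queue and execution status, can be sketched as an in-memory model. This is an assumption-laden illustration in plain Python, not the Meta-pipe API: the class `JobService`, its methods (`submit`, `claim`, `finish`, `status`), and the status strings are all hypothetical. It assumes execution managers pull work from the queue; the slides do not say whether jobs are pulled or pushed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions."""
    job_id: str
    input_ref: str                  # reference to an uploaded input in storage
    status: str = "QUEUED"          # QUEUED -> RUNNING -> DONE / FAILED
    executor: Optional[str] = None  # execution manager that claimed the job


class JobService:
    """In-memory sketch of a job queue that also tracks execution status."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._jobs: Dict[str, Job] = {}

    def submit(self, job_id: str, input_ref: str) -> Job:
        """Register a new job and append it to the queue."""
        if job_id in self._jobs:
            raise ValueError(f"duplicate job id: {job_id}")
        job = Job(job_id, input_ref)
        self._jobs[job_id] = job
        self._queue.append(job_id)
        return job

    def claim(self, executor: str) -> Optional[Job]:
        """Hand the oldest queued job to an execution manager, if any."""
        if not self._queue:
            return None
        job = self._jobs[self._queue.popleft()]
        job.status = "RUNNING"
        job.executor = executor
        return job

    def finish(self, job_id: str, ok: bool = True) -> None:
        """Record the outcome of a running job."""
        job = self._jobs[job_id]
        if job.status != "RUNNING":
            raise ValueError(f"job {job_id} is not running")
        job.status = "DONE" if ok else "FAILED"

    def status(self, job_id: str) -> str:
        return self._jobs[job_id].status


if __name__ == "__main__":
    service = JobService()
    service.submit("job-1", "uploads/sample-1")
    claimed = service.claim("stallo")
    service.finish(claimed.job_id)
    print(claimed.job_id, service.status(claimed.job_id))
```

A pull model like this would let an execution manager at any site take work without the service knowing about that site in advance, which fits the "anywhere else?" entry on the slide, but that design choice is a guess.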

Spark Meta-pipe
Currently have a set of tools that are individually submitted to Torque
Implement the workflow execution of Meta-pipe in Spark
Already have most of the Meta-pipe codebase written in Scala
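The move from individually submitted tools to a single workflow can be illustrated without Spark. The sketch below is plain Python (3.9 or later, for `graphlib`), not Spark and not the Meta-pipe Scala code. It shows only the general idea of declaring the tool steps and their dependencies once and letting a driver run them in dependency order. The step names `filter`, `assemble`, and `annotate`, and the toy functions behind them, are hypothetical placeholders rather than the actual Meta-pipe tools.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

Step = Callable[[List[Any]], Any]


def run_workflow(
    steps: Dict[str, Step],
    deps: Dict[str, List[str]],
    data: Any,
) -> Tuple[List[str], Dict[str, Any]]:
    """Run every step after the steps it depends on.

    `deps` maps a step name to the names of the steps whose outputs it
    consumes. Steps without dependencies receive the raw input data.
    """
    order = list(TopologicalSorter(deps).static_order())
    results: Dict[str, Any] = {}
    for name in order:
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = steps[name](upstream if upstream else [data])
    return order, results


# Hypothetical toy steps standing in for real tools.
steps: Dict[str, Step] = {
    "filter": lambda inputs: [read for read in inputs[0] if len(read) >= 4],
    "assemble": lambda inputs: "".join(inputs[0]),
    "annotate": lambda inputs: {"length": len(inputs[0])},
}
deps = {"filter": [], "assemble": ["filter"], "annotate": ["assemble"]}

if __name__ == "__main__":
    order, results = run_workflow(steps, deps, ["ACGT", "AC", "GGCCA"])
    print(order)
    print(results["annotate"])
```

In a Spark version the steps would presumably become transformations on distributed datasets, with Spark handling scheduling instead of this loop; the slides do not describe that design in detail.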

Cloud Deployment
Use cPouta as a computational backend for Meta-pipe
Other environments could be Amazon, etc.
Looking into technologies like AppImage to make it more easily deployable