Virtual Atomic and Molecular Data Centre: Federating Heterogeneous Databases
The Virtual Atomic and Molecular Data Centre (VAMDC) federates 28 heterogeneous databases serving fields such as plasma sciences, astrophysics, atmospheric physics, fusion technologies, and lighting technologies. It gives data producers a large dissemination platform and gives users a single access point: queries are submitted using a standard vocabulary, and results are returned in a standard XML file format (XSAMS). The technical organization includes a VAMDC Registry in which each node is registered as a resource, so that clients can dispatch a single query across the independent databases.
Presentation Transcript
VAMDC use-case for the RDA Data Citation Working Group
C.M. Zwölf and VAMDC consortium
6th RDA Plenary, Paris, September 2015
The Virtual Atomic and Molecular Data Centre
Federates 28 heterogeneous databases: http://portal.vamdc.org/
The "V" of VAMDC stands for Virtual, in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapper that exposes a set of heterogeneous databases in a unified way.
VAMDC: single and unique access to heterogeneous A+M databases.
High-quality scientific data come from different physical and chemical communities: plasma sciences, lighting technologies, astrophysics, health and clinical sciences, atmospheric physics, fusion technologies, environmental sciences.
The consortium is politically organized around a Memorandum of Understanding (15 international members have signed the MoU, 1 November 2014).
VAMDC provides data producers with a large dissemination platform.
The VAMDC infrastructure technical organization
Each existing independent A+M database is wrapped by a VAMDC wrapping layer; together they form a VAMDC Node. Every node accepts queries expressed in a standard vocabulary and returns results formatted as a standard XML file (XSAMS).
Each node is registered as a resource in the VAMDC Registry.
VAMDC clients (Portal, SpecView, SpectCol) ask the Registry for the available resources, dispatch a unique A+M query to all the registered resources, and receive a set of XSAMS files in return.
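
The registry lookup and query dispatch described on this slide can be sketched as follows. This is a minimal in-memory illustration, not the VAMDC software: the names Node, Registry, dispatch and fake_wrapper, the query string, and the XML returned are all placeholders invented for the example (real results follow the XSAMS schema).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a VAMDC node (database plus wrapping layer)."""
    name: str
    # Plays the role of the wrapping layer: query string in, XML string out.
    run_query: Callable[[str], str]


@dataclass
class Registry:
    """Hypothetical in-memory registry of the available nodes."""
    nodes: Dict[str, Node] = field(default_factory=dict)

    def register(self, node: Node) -> None:
        self.nodes[node.name] = node

    def available_resources(self) -> List[Node]:
        return list(self.nodes.values())


def dispatch(registry: Registry, query: str) -> Dict[str, str]:
    """Send one query to every registered node; collect one document per node."""
    return {node.name: node.run_query(query)
            for node in registry.available_resources()}


def fake_wrapper(db_name: str) -> Callable[[str], str]:
    # Placeholder output only; it is NOT a valid XSAMS document.
    return lambda query: f"<XSAMSData source='{db_name}'/>"


registry = Registry()
registry.register(Node("node_a", fake_wrapper("node_a")))
registry.register(Node("node_b", fake_wrapper("node_b")))

# The query text is an opaque placeholder for a query in the standard vocabulary.
results = dispatch(registry, "placeholder query")
```

The client never needs to know how many databases exist or how each one stores its data: it only talks to the registry and to the uniform node interface.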
Trying to implement the recommendations: Tagging Datasets with IDs (Relational Database case)
From spring 2014 to spring 2015: A first study showed that the internal VAMDC standards (XSAMS format) and protocols could be extended to implement dataset identification (no blocking technological issues). Atomic and molecular data have no intrinsic meaning outside a given context (the definition of the zero-energy state, the molecular symmetries, etc.). This context naturally defines the dataset perimeter.
From spring 2015 to present: The issue is more anthropological than technological, since each database provider (VAMDC node owner; recall that VAMDC federates 28 heterogeneous databases) has its own understanding and a priori idea of what a dataset is (some examples are on the Working Group wiki page, VAMDC use-case section). Indeed, a dataset is not uniquely defined and understood by VAMDC members. Depending on the definition, the result of a single query may be a combination of a multitude of datasets. In that case, how can datasets be used for citing data if one has to cite hundreds of different datasets for a single query? We need to find a common understanding.
The current proposal under discussion: Keeping in mind that this evolution is meant for reproducing a request at a later time and for sustainable data citation, we introduced the notion of Version rather than DataSet. A Version is the snapshot of an entire database at a given (timestamped) time. Each evolution of the database, even a minimal one, will be associated with a new snapshot Version. All the data extracted from VAMDC will be attached to (i.e. will refer to) a specific Version. We are internally discussing this approach and evaluating its implementation cost.
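
The Version idea can be illustrated with a small sketch. It assumes a toy key/value database and stores a full copy of the data for each Version; the class and method names (VersionedDatabase, update, version_at, extract) and the sample key are hypothetical, and a real node would more likely use the versioning facilities of its own database engine.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    timestamp: datetime
    data: Tuple[Tuple[str, float], ...]  # frozen copy of the whole database


class VersionedDatabase:
    def __init__(self) -> None:
        self._current: Dict[str, float] = {}
        self._versions: List[Version] = []

    def update(self, key: str, value: float, at: datetime) -> Version:
        """Apply a change and record a new snapshot Version, however small the change."""
        if self._versions and at <= self._versions[-1].timestamp:
            raise ValueError("update times must be strictly increasing")
        self._current[key] = value
        snapshot = tuple(sorted(self._current.items()))
        version = Version(len(self._versions) + 1, at, snapshot)
        self._versions.append(version)
        return version

    def version_at(self, date: datetime) -> Version:
        """Return the Version that was current on the given date."""
        index = bisect_right([v.timestamp for v in self._versions], date)
        if index == 0:
            raise LookupError("no Version exists at that date")
        return self._versions[index - 1]

    def extract(self, key: str, date: datetime) -> Tuple[float, int]:
        """Return a value together with the number of the Version it refers to."""
        version = self.version_at(date)
        return dict(version.data)[key], version.number


db = VersionedDatabase()
db.update("level_energy", 1.00, datetime(2015, 1, 1))
db.update("level_energy", 1.02, datetime(2015, 6, 1))

# Data extracted "as of" March still refer to the first Version.
value, version_number = db.extract("level_energy", datetime(2015, 3, 1))
```

Copying the whole database on every change is the simplest possible scheme and also the most expensive one, which is one reason the implementation cost needs evaluating.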
Trying to implement the recommendations: Query Store
Considering the distributed architecture of VAMDC, many questions arose when trying to apply the Query Store (QS) strategy to VAMDC:
Do we need a QS on each node?
Do we need an additional QS on the central portal?
Since the portal acts as a relay between the users and the nodes, how can we coordinate the generation of IDs for queries in this distributed context?
We are prototyping an implementation based on a central service that collects logs from each element of the VAMDC infrastructure.
Schema of the proposed architecture
Components: Client 1, Client 2, Client 3; VAMDC Nodes (with versioning); Central Log Service.
Each client notifies the Central Log Service: "A given user is using Client 1 at time t, from a given IP, to submit a given request to the infrastructure."
Each VAMDC node notifies the Central Log Service: "I am receiving at time t a given request from a user running a given client from a given IP."
Communications with the Central Log Service are non-blocking, to avoid bottleneck effects.
From the raw information on the log service, we will be able to identify unique queries (that have been virtually multiplied by the infrastructure) with unique IDs and assign timestamps.
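
One possible way to recover unique queries from the raw log records is sketched below. The slides do not say how records are correlated, so the grouping rule here is an assumption made for illustration: records that share the same client, IP and query text, and that arrive within a short time window of each other, are treated as one user request. The record fields, the function name identify_queries and the "Q1", "Q2" ID scheme are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    source: str   # infrastructure element that reported the record (client or node)
    client: str   # client software used to submit the request
    ip: str
    query: str
    time: float   # seconds since an arbitrary epoch


def identify_queries(records: List[LogRecord], window: float = 5.0) -> Dict[str, dict]:
    """Group raw records into logical queries (assumed rule: same client, IP and
    query text, at most `window` seconds after the previous record of the group)."""
    queries: Dict[str, dict] = {}
    last_seen: Dict[Tuple[str, str, str], Tuple[str, float]] = {}
    for rec in sorted(records, key=lambda r: r.time):
        key = (rec.client, rec.ip, rec.query)
        previous = last_seen.get(key)
        if previous is not None and rec.time - previous[1] <= window:
            query_id = previous[0]  # same user request, multiplied by the infrastructure
        else:
            query_id = f"Q{len(queries) + 1}"
            queries[query_id] = {"query": rec.query, "timestamp": rec.time, "sources": []}
        queries[query_id]["sources"].append(rec.source)
        last_seen[key] = (query_id, rec.time)
    return queries


raw_log = [
    LogRecord("portal", "portal", "10.0.0.1", "placeholder query", 100.0),
    LogRecord("node_a", "portal", "10.0.0.1", "placeholder query", 100.4),
    LogRecord("node_b", "portal", "10.0.0.1", "placeholder query", 100.9),
    LogRecord("portal", "portal", "10.0.0.1", "placeholder query", 500.0),
    LogRecord("node_a", "portal", "10.0.0.1", "placeholder query", 500.3),
]
unique_queries = identify_queries(raw_log)
```

In this sample the five raw records collapse into two logical queries, each stamped with the time of its earliest record.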
A proposed API for the query store
Architecture of the query store: Central Log Service plus versioning on databases.
Web service: takes a date and a query; returns a result identical to the one that would have been obtained by submitting the query on the provided date.
Web service: takes a query and a date; returns the associated query ID.
Web service: takes a query ID; returns the query and the associated timestamp.
Web service: takes a query ID; returns the associated results.
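
The four services can be sketched as methods of a single class. This is an illustration under assumptions, not the proposed implementation: the method names are invented, the versioned databases are replaced by a hypothetical backend function, and the query ID is derived from a hash of the query and date, whereas the slides derive IDs from the central log service.

```python
import hashlib
from datetime import datetime
from typing import Callable, Dict, Tuple

# Hypothetical backend: evaluates a query against the database Version valid on a date.
Backend = Callable[[str, datetime], str]


class QueryStore:
    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._by_id: Dict[str, Tuple[str, datetime]] = {}

    def result_at(self, query: str, date: datetime) -> str:
        """Service 1: the result as it would have been obtained on `date`."""
        return self._backend(query, date)

    def query_id(self, query: str, date: datetime) -> str:
        """Service 2: the ID associated with (query, date). Hash-based ID is an assumption."""
        payload = f"{date.isoformat()}|{query}".encode("utf-8")
        identifier = hashlib.sha256(payload).hexdigest()[:16]
        self._by_id.setdefault(identifier, (query, date))
        return identifier

    def describe(self, identifier: str) -> Tuple[str, datetime]:
        """Service 3: the query and its timestamp for a given ID."""
        return self._by_id[identifier]

    def results(self, identifier: str) -> str:
        """Service 4: the results for a given ID, re-evaluated at the stored timestamp."""
        query, date = self._by_id[identifier]
        return self.result_at(query, date)


def fake_backend(query: str, date: datetime) -> str:
    # Placeholder: pretends the database changed once, on 1 June 2015.
    version = 1 if date < datetime(2015, 6, 1) else 2
    return f"result of '{query}' on version {version}"


store = QueryStore(fake_backend)
qid = store.query_id("placeholder query", datetime(2015, 3, 1))
stored_query, stored_date = store.describe(qid)
old_result = store.results(qid)
```

Because the fourth service re-evaluates the stored query at the stored timestamp, it depends entirely on database versioning: without it, the same ID could return different results over time.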
Concluding remarks / open questions about query store
How to deal with the confidentiality of the information? Do we need an authentication/authorization policy on the query store? Is the sketched log service compliant with EU law on confidentiality?
We are providing users with the tools to cite our dynamic data efficiently, but how can we be sure that they will use them for citing our data? In other words, how can we enforce the citation instinct in our end users? We are thinking of proposing a reverse approach: we may cite the users who access our data. They will accept these terms, which will be explained in the conditions of use of the VAMDC services.
How to prevent plagiarism? A user might extract data, modify them, and cite them as the original extracted data. Do we have tools for preventing such behavior? An MD5 of the extracted data on the query store?
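
The MD5 idea raised in the last question can be sketched with Python's standard hashlib module. The function names and the sample document are hypothetical; the sketch assumes the query store keeps the digest of the data exactly as they were served, so that a copy presented later can be compared against it.

```python
import hashlib
import hmac


def md5_of(data: bytes) -> str:
    """Hex MD5 digest of the extracted data, as it could be stored in the query store."""
    return hashlib.md5(data).hexdigest()


def is_unmodified(data: bytes, stored_digest: str) -> bool:
    """True if `data` still matches the digest recorded at extraction time."""
    return hmac.compare_digest(md5_of(data), stored_digest)


# Placeholder payload; a real extraction would be an XSAMS file.
extracted = b"<XSAMSData>placeholder</XSAMSData>"
digest_in_query_store = md5_of(extracted)

tampered = extracted.replace(b"placeholder", b"modified")
check_original = is_unmodified(extracted, digest_in_query_store)
check_tampered = is_unmodified(tampered, digest_in_query_store)
```

A digest detects accidental or naive modification, but MD5 is not collision-resistant; hashlib.sha256 could be substituted in the same way if deliberate tampering is a concern.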