The Metadata Groups Unpacking the Elements

Metadata plays a crucial role in describing and contextualizing data for users, software, and computing resources. It goes beyond mere data description and discovery to facilitate the coupling of various resources, including a Virtual Research Environment. The FAIR principles emphasize the importance of making data findable, accessible, interoperable, and reusable through rich metadata and standardized protocols. Explore the key elements and principles governing metadata and data management in research and information systems.

  • Metadata
  • Principles
  • FAIR
  • Data Management
  • Research

Uploaded on Feb 15, 2025 | 0 Views


Presentation Transcript


  1. The Metadata Groups Unpacking the Elements - Keith G Jeffery

  2. RDA Metadata Principles
       • The only difference between metadata and data is mode of use.
       • Metadata is not just for data; it is also for users, software services, and computing resources.
       • Metadata is not just for description and discovery; it is also for contextualisation (relevance, quality, restrictions (rights, costs)) and for coupling users, software and computing resources to data (to provide a Virtual Research Environment).
       • Metadata must be machine-understandable as well as human-understandable for autonomicity (formalism).
       • Management (meta)data is also relevant (research proposal, funding, project information, research outputs, outcomes, impact).

  3. FAIR Principles
     To be Findable:
       • F1. (meta)data are assigned a globally unique and eternally persistent identifier.
       • F2. data are described with rich metadata.
       • F3. (meta)data are registered or indexed in a searchable resource.
       • F4. metadata specify the data identifier.
     To be Accessible:
       • A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
       • A1.1. the protocol is open, free, and universally implementable.
       • A1.2. the protocol allows for an authentication and authorization procedure, where necessary.
       • A2. metadata are accessible, even when the data are no longer available.

  4. FAIR Principles (continued)
     To be Interoperable:
       • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
       • I2. (meta)data use vocabularies that follow FAIR principles.
       • I3. (meta)data include qualified references to other (meta)data.
     To be Re-usable:
       • R1. (meta)data have a plurality of accurate and relevant attributes.
       • R1.1. (meta)data are released with a clear and accessible data usage license.
       • R1.2. (meta)data are associated with their provenance.
       • R1.3. (meta)data meet domain-relevant community standards.

  5. Metadata Element Set - and URL of unpacking (1)
       • Unique Identifier (for later use including citation) {http://bit.ly/2ryRr12}
       • Location (URL) {http://bit.ly/2rujALv}
       • Description {http://bit.ly/2ss2CwH}
       • Keywords (terms) {http://bit.ly/2se44QX}
       • Temporal coordinates {http://bit.ly/2sdVKAR}
       • Spatial coordinates {http://bit.ly/2ru6kGt}
       • Originator (organisation(s) / person(s)) {http://bit.ly/2ruFCgZ}
       • Project {http://bit.ly/2rukIid}

  6. Metadata Element Set - and URL of unpacking (2)
       • Facility / equipment {http://bit.ly/2sdEj3h}
       • Quality {http://bit.ly/2svs0Cc}
       • Availability (licence, persistence) {http://bit.ly/2t56LEy}
       • Provenance {http://bit.ly/2se59Z1}
       • Citations {http://bit.ly/2se9efQ}
       • Related publications (white or grey) {http://bit.ly/2rjHFR5}
       • Related software {http://bit.ly/2rutPzn}
       • Schema {http://bit.ly/2srMUl3}
       • Medium / format {http://bit.ly/2svtEEe}

  7. Unique Identifier
       • Identifies the digital object of interest.
       • It might be useful to constrain representations of identifiers of well-known types.
       • Should it be a PUUID (permanent/persistent universal unique ID)? Managed by a system or generated?
       • How to handle PUUIDs of versions and fragments of a digital object?
       • All metadata elements should be referentially and functionally related to the PUUID of the digital object. This:
           • allows elements to have formal structure (syntax) and terms to have declared meaning (semantics);
           • ensures elements have a relationship to the digital object represented by the PUUID.
       • E.g. an organisation or person exists independently of the dataset of which they are the owner/creator/manager.

  8. Location
       • URL (locator, not identifier URI)
       • Atomic or unpacked semantically:
           • Protocol (http, ftp, mailto, jdbc, etc.)
           • Subdomain (www or other)
           • Domain name (name.com, name.de, etc.)
           • Port number (80 or other)
           • Directory (path to the page; if none is provided, the server uses the root web directory)
           • Page (if no page is provided, the server uses the default page)
       • How to handle locations of versions, fragments?
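The semantic unpacking of a URL listed on this slide can be sketched with Python's standard `urllib.parse`. This is only an illustration, not part of the presentation: the function name `unpack_location` and the returned field names are hypothetical, and the subdomain/domain split naively assumes the registrable domain is the last two host labels (which is wrong for hosts such as `name.co.uk`).

```python
from urllib.parse import urlsplit
import posixpath


def unpack_location(url: str) -> dict:
    """Split a URL into the components named on the Location slide.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive assumption: the last two labels form the domain name,
    # anything before them is the subdomain.
    subdomain = ".".join(labels[:-2]) if len(labels) > 2 else ""
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    # Separate the directory from the final path segment (the page).
    directory, page = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "port": parts.port,            # None when the URL gives no port
        "directory": directory or "/",  # root directory when no path is given
        "page": page,                   # empty when the server picks a default
    }


if __name__ == "__main__":
    print(unpack_location("http://www.example.com:8080/data/sets/index.html"))
```

Keeping the port as `None` when absent, rather than filling in 80, leaves the "default" decision to whoever consumes the metadata, since the default depends on the protocol.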

  9. Description
       • Is dataset name or title part of description?
       • Abstract: text that needs a qualifier of the language.
       • Keywords (terms): see that sheet.
       • Classification: is this keywords? Schema or controlled vocabulary needs to be declared. Examples:
           • Dewey
           • Library of Congress
           • National Library of Medicine
           • Universal Decimal Classification
       • Multilingual versions?

  10. Keywords
       • Keywords may come from controlled or uncontrolled vocabularies.
       • Are controlled keywords classification? (Note there may be other classification systems used for other elements, e.g. for quality, media.)
       • For this the meaning of a term must be machine-understandable. This means a dereferenceable ID is needed.
       • The keyword should have a relationship identifier, or role of the term to the object, e.g. aboutness.
       • This means that a set of standard relations should be defined.
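One way to picture a keyword that carries both a dereferenceable ID and a declared relation to the object is the minimal sketch below. Everything in it is an assumption made for illustration: the class name `Keyword`, the relation names in `STANDARD_RELATIONS`, and the `example.org` vocabulary URI are hypothetical, not terms defined by the presentation or by any standard.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of standard relations between a term and an object;
# the slide only says that such a set should be defined.
STANDARD_RELATIONS = {"about", "mentions", "method"}


@dataclass(frozen=True)
class Keyword:
    label: str                      # human-readable term
    relation: str = "about"         # role of the term with respect to the object
    term_uri: Optional[str] = None  # dereferenceable ID, if the term is controlled

    def __post_init__(self) -> None:
        # Reject relations outside the agreed set so roles stay machine-understandable.
        if self.relation not in STANDARD_RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")

    @property
    def is_controlled(self) -> bool:
        """A keyword counts as controlled here when it has a term URI."""
        return self.term_uri is not None


if __name__ == "__main__":
    controlled = Keyword("seismology", "about", "http://example.org/vocab/seismology")
    free_text = Keyword("field notes")
    print(controlled.is_controlled, free_text.is_controlled)
```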

  11. Temporal Coordinates
       • Time scales (depend on research fields):
           • < nanoseconds (particle physics)
           • Seconds / minutes
           • Hours (climate/weather)
           • Days, years (history)
           • Millennia (geoscience)
           • See: http://standards.sedris.org/18026/text/ISOIEC_18026E_TEMPORAL_CS.HTM
       • Forms:
           • Date
           • Timestamp
           • Time interval (start/end time)
           • Historical period
           • Period
       • Format (UTC?)
       • Units
       • Errors on time stamp and period
       • Does a date/time interval represent all the forms?
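The closing question, whether a single date/time interval can stand for all the listed forms, can be explored with a small sketch. The class `TimeInterval` and its constructors are hypothetical names chosen for illustration. Note the limits of the assumption: Python's `datetime` only covers years 1 to 9999 at microsecond resolution, so this sketch would not reach the sub-nanosecond or geological time scales mentioned on the slide.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta, timezone


@dataclass(frozen=True)
class TimeInterval:
    """A start/end pair in UTC with an optional error on both bounds."""
    start: datetime
    end: datetime
    uncertainty: timedelta = timedelta(0)

    def __post_init__(self) -> None:
        if self.end < self.start:
            raise ValueError("end precedes start")

    @classmethod
    def from_timestamp(cls, ts: datetime,
                       uncertainty: timedelta = timedelta(0)) -> "TimeInterval":
        # A timestamp is modelled as a zero-length interval.
        return cls(ts, ts, uncertainty)

    @classmethod
    def from_date(cls, d: date) -> "TimeInterval":
        # A calendar date is modelled as the 24 hours starting at 00:00 UTC.
        start = datetime.combine(d, time.min, tzinfo=timezone.utc)
        return cls(start, start + timedelta(days=1))

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


if __name__ == "__main__":
    print(TimeInterval.from_date(date(2017, 6, 1)).duration)
```

Historical periods fit the same shape only if someone supplies agreed start and end bounds for them, which is itself a classification question.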

  12. Spatial Coordinates
       • In the context of astronomy: IVOA recommendation, Space-Time Coordinate Metadata for the Virtual Observatory (Version 1.33) [ http://www.ivoa.net/documents/REC/DM/STC-20071030.pdf ]
       • Geospatial: ISO 19115 (ISO 19139 for XML linearization)
       • Note it has itself complex elements with structured attributes.

  13. Originator (organisation, person)
       • Originator is a role of agents (people, corporate bodies and computational agents) in the data creation process.
       • Role: creator, publisher, author, ?
       • What is the difference between creator and author in the research data context?
       • Who is the publisher: the researcher who deposits data, or the organization that maintains the repository? If it is the latter, we suggest the information be generated automatically.
       • See PRO, the Publishing Roles Ontology (http://www.sparontologies.net/ontologies/pro/source.html#d4e599), as an example of how such roles are described for publication.
       • Does a roles ontology exist for research data? If not, should we create one? Research roles include data collector, analyst, etc., and creator does not sufficiently reflect those functions.
       • Need to define the relationship (role and temporal duration) between dataset and originator with defined role terms.

  14. Project
       • Project name (full)
       • Project name (abbreviated)
       • Grant number
       • Program?
       • Funder (see Originator?)
           • Name
           • ID (e.g. FundRef)
       • Grant beginning date
       • Grant end date
       • Investigators (see Originator)
           • Principal
           • Co
       • Project URI: of limited value since ephemeral? (CERIF has detail here that might be useful.)
       • When the title of the data set is different from the project name, where should the title information be recorded? See comments under Description.
       • Note: many of these unpacked sub-elements require relationships.
       • Multilinguality

  15. Facility/Equipment
       • Facilities are owned or run by organisations, e.g. a research vessel, an analytical facility, a space or ground-based telescope.
       • A virtual facility could be a data sharing network.
       • Relationships: facility/equipment, and each to organisation, person, publications.
       • Facility: a Facility provides a capability via the provision of services to serve a specific function. Facilities can be physical or virtual. Facilities, like equipment, are artifacts designed, built, operating or installed to serve a specific function affording a convenience or service.

  16. Facility/Equipment (continued)
       • Does this include the instruments installed on a facility? Yes. So if you separate the concepts, you need to have facility as possible metadata attached to a piece of equipment.
       • A piece of equipment would normally (but not always) be contained within a facility (e.g. maybe not for chemical or biology equipment which provides measures).
       • Consider substituting "Instrument" for "Equipment". This would allow including items such as surveys in the metadata.
       • Equipment: a physical item used within a research process for a specific purpose, say for preparation of a sample or taking of measurements. Or a computing system.

  17. Data Quality
       • From TeD-T: http://smw-rda.esc.rzg.mpg.de/index.php/Data_Quality
           • Data quality (DQ) is a multi-dimensional construct: a perception and/or a judgment of data's fitness or trustworthiness to serve intended research uses in a given context.
       • From DUL: https://www.w3.org/2005/Incubator/ssn/wiki/DUL_ssn#Quality
       • Data quality could be categorized by pre-defined values or classification from a particular vocabulary or scheme. RDA-DQV (Data Quality Vocabulary)?
       • The problem here is how to categorize a perception!! :-O
       • Perception is a process of recognizing and interpreting sensory stimuli (G).

  18. Data Quality (continued)
       • Data quality review often includes matching (statistics on) the data to the metadata. For instance, are there missing values? Are the missing values accounted for? This is a stage in data curation.
       • However, researchers have a different concept of what data quality entails, and this element should be renamed to avoid confusion and improve user-friendliness.
       • Quality may include:
           • availability (persistence, access), see next slide
           • contextualisation
           • provenance, see later slide

  19. Availability (licence, persistence)
       • Persistence includes:
           • backup, recovery, mirroring, fragmentation, media migration
           • versioning, provenance
       • Licence includes:
           • right to read, copy, write
           • usually (in open science) with acknowledgement/citation

  20. Provenance
       • The PROV-O ontology is a good starting point for provenance concepts: Entity, Activity, Agent (note relationships).
       • Maybe in specific areas the provenance description can be simplified.
       • Relationship of provenance to metadata catalogs? Is provenance within the catalog or separate?
       • Relationship of provenance to logs? Does provenance rely on logs or replicate them (with more contextual information)?
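To make the Entity, Activity, Agent relationships concrete, here is a minimal sketch that stores provenance as plain triples. The four relation names are PROV-O property names; everything else (the class `ProvRecord`, the method `sources_of`, and the example identifiers) is hypothetical and only illustrates one simplified way such a record might be queried. It is not an RDF implementation.

```python
from dataclasses import dataclass, field

# PROV-O properties used in this sketch:
#   entity   wasGeneratedBy    activity
#   activity used              entity
#   activity wasAssociatedWith agent
#   entity   wasAttributedTo   agent
PROV_RELATIONS = {"wasGeneratedBy", "used", "wasAssociatedWith", "wasAttributedTo"}


@dataclass
class ProvRecord:
    triples: set = field(default_factory=set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        if relation not in PROV_RELATIONS:
            raise ValueError(f"unsupported relation: {relation!r}")
        self.triples.add((subject, relation, obj))

    def sources_of(self, entity: str) -> set:
        """All entities that `entity` depends on, following
        wasGeneratedBy -> used links transitively."""
        found = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            for subj, rel, activity in self.triples:
                if subj != current or rel != "wasGeneratedBy":
                    continue
                for subj2, rel2, used in self.triples:
                    if subj2 == activity and rel2 == "used" and used not in found:
                        found.add(used)
                        stack.append(used)
        return found


if __name__ == "__main__":
    prov = ProvRecord()
    prov.add("clean.csv", "wasGeneratedBy", "cleaning-run")
    prov.add("cleaning-run", "used", "raw.csv")
    prov.add("cleaning-run", "wasAssociatedWith", "analyst-1")
    print(prov.sources_of("clean.csv"))
```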

  21. Citation
       • An open question that came up is whether we should support a defined way to communicate the reason for the citation, i.e. the role between the object being cited and the object citing, with attendant relationship (e.g. person).
       • There is substantial overlap with the related publications and related software documents/considerations.
       • How should citations in different styles (MLA, ALA...) be referred to? Classification?
       • Classical: there are well-established ways to cite scientific publications, classically by specifying attributes like authors, title, publisher, journal name, volume & issue number, pages, and respective variants for books, book chapters, thesis types.
           • How to cite datasets from other objects? The ID is probably OK, but what about the role?
           • Fragments of datasets? Versions?
       • Identifier based: More recently it became common to refer to (sometimes in the context of citation also)

  22. Related Publications (white or grey)
       • Suggestion to use essentially DC or DCAT metadata.
       • Problem of referential and functional integrity (experience in EPOS with DataCite).
       • The related publication is not referentially or functionally dependent on the dataset or other digital object; it exists independently.
       • It is all about relationships: role, temporal duration.
       • RDF version of DC or DCAT could be used.

  23. Related Software
       • There may be many kinds of software related to a Dataset:
           • Software generating it (especially simulation)
           • Software processing it (especially analytics)
           • Software validating it
       • Software:
           • Software on which the software object is dependent, e.g. libraries
           • Software dependent on the software object
           • Software of the infrastructure to execute the software object (operating system)

  24. Schema
       • Used for validation of the dataset (constraints)
       • Equivalent for software object: validating source code
       • Used to connect the dataset to executing software
       • Data structure constraints

  25. Medium/Format
       • Medium
           • Versions of a digital object may be on different media
           • Kinds of medium: classification system / enumerated list of terms
       • Format
           • The structure and encoding of the digital object
           • May be implicit in schema (but not all digital objects have a schema)

  26. Overall Remarks
       • Still lacking detail despite a lot of work
       • More work to do
       • Some characteristics emerging:
           • Relationships between and within elements
           • Need for classification on many elements:
               • for properties of the entity (e.g. medium)
               • for roles in relationships of the entity with others, e.g. dataset <-> person (owner)
