Data Type Registries and Assumptions in Data Sharing

This presentation explains why the assumptions embedded in data should be made explicit and shared through Type Registries. Data must be characterized precisely enough that both humans and machines can understand and reuse it. The goal is a functioning Registry system that can be configured for different data assumptions and characteristics, can be evaluated by organizations before adoption, and allows federation between multiple Registry instances. Data Types and Type Registries address the questions that file formats and MIME types leave open, such as units, variable names, and packaging.

  • Data Type Registries
  • Assumptions
  • Data Sharing
  • Type Registries
  • Data Types

Uploaded on Mar 10, 2025 | 0 Views


Presentation Transcript


  1. Data Type Registries #2
     RDA Chairs Mtg, Gothenburg, June 2017
     Co-Chairs: Larry Lannom (CNRI), Tobias Weigel (DKRZ)

  2. Problem: Implicit Assumptions in Data
     • Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
     • How do we do this now?
       • For documents, formats are enough, e.g., PDF, and then the document explains itself to humans
       • This doesn't work well with data: numbers are not self-explanatory
       • What does the number 7 mean in cell B27?
     • Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
     • Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation
     Corporation for National Research Initiatives

  3. Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries
     • Evaluate and identify a few assumptions in data that can be codified and shared in order to
       • Produce a functioning Registry system that can easily be evaluated by organizations before adoption
       • Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
       • Supports several Type record dissemination variations
       • Design for allowing federation between multiple Registry instances
     • The emphasis is not on
       • Identifying every possible assumption and data characteristic applicable for all domains
       • Technology

  4. What is a Data Type?
     • A unique and resolvable identifier
       • Which resolves to characterization of structures, conventions, semantics, and representations of data
     • Serves as a shortcut for humans and machines to understand and process data
     • File formats and MIME types have solved the representation problem at a unit level
     • Examples of problems we aim to solve with data types:
       • It is a number in cell A3, but is it temperature? If so, in Celsius?
       • It is a dataset consisting of location, temperature, and time, but what variable names should I look for? Is it all packaged as CSV or NetCDF? And as a single unit or a collection of units?
     • Type record structure will continue to evolve: not finished, but functioning
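
The idea of a type as a resolvable identifier that carries a value's unit and representation can be sketched in Python. This is only an illustration: the `TypeRecord` fields and the identifier `example/air-temperature-celsius` are hypothetical and are not taken from the DTR data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeRecord:
    """Hypothetical, minimal type record (not the DTR data model)."""
    identifier: str          # unique, resolvable ID (made up for this sketch)
    name: str                # human-readable name
    representation: str      # how the value is encoded, e.g. "float"
    unit: Optional[str] = None


# Hypothetical record answering "is cell A3 a temperature, and in Celsius?"
AIR_TEMPERATURE = TypeRecord(
    identifier="example/air-temperature-celsius",
    name="air temperature",
    representation="float",
    unit="degC",
)


def describe(value: float, record: TypeRecord) -> str:
    """Render a bare number together with the assumptions its type makes explicit."""
    unit = f" {record.unit}" if record.unit else ""
    return f"{value}{unit} ({record.name}, type {record.identifier})"


print(describe(7.0, AIR_TEMPERATURE))
```

Without the record, `7.0` is just a number; with it, a reader or program can recover what is measured and in which unit.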

  5. What is a Data Type Registry?
     • A low-level infrastructure with wide applicability to record and disseminate type records
       • Not an immediate ROI application
     • Assigns unique and resolvable identifiers to type records
     • Enforces and validates common data model & expression for interoperation between multiple instances of Registries
     • API for machine consumption
     • UI for human use
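
A registry's two core duties named above, assigning identifiers and validating records against a common model, might look roughly like the following in-memory sketch. The class `TypeRegistry`, the required fields, and the prefix are assumptions made for illustration; a real registry would persist records and expose an API and UI.

```python
import uuid

# Hypothetical common data model: every type record must carry these fields.
REQUIRED_FIELDS = {"name", "description"}


class TypeRegistry:
    """In-memory stand-in for one registry instance (illustration only)."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._records = {}

    def register(self, record: dict) -> str:
        """Validate a record, assign it a unique identifier, and store it."""
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        identifier = f"{self.prefix}/{uuid.uuid4().hex}"
        self._records[identifier] = {**record, "identifier": identifier}
        return identifier

    def resolve(self, identifier: str) -> dict:
        """Return the stored record; raises KeyError for unknown identifiers."""
        return self._records[identifier]


registry = TypeRegistry("example-prefix")
pid = registry.register({"name": "timestamp", "description": "ISO 8601 date-time"})
print(registry.resolve(pid)["name"])
```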

  6. Process Use Case
     [Diagram: Users; Federated Set of Type Registries; Typed Data (ID, Type, Payload); Data Set; Services (Dissemination, Visualization, Data Processing, Rights, Terms); numbered arrows 1-4]
     1 Client (process or people) encounters unknown data type.
     2 Resolved to Type Registry.
     3 Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally:
     4 Typed data or reference to typed data can be sent to service provider.
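
The four steps of the process use case can be sketched as a lookup across a federated set of registries. Everything here is hypothetical: the registries are plain dictionaries, and the type identifiers and the service URL are invented for the example.

```python
from typing import Optional

# Two hypothetical registry instances, each mapping type IDs to type records.
REGISTRY_A = {
    "example-a/temperature": {
        "name": "temperature",
        "properties": {"unit": "degC"},
        "services": [],
    },
}
REGISTRY_B = {
    "example-b/bounding-box": {
        "name": "spatial bounding box",
        "properties": {"fields": ["west", "south", "east", "north"]},
        "services": ["https://example.org/visualize"],  # made-up service pointer
    },
}
FEDERATION = [REGISTRY_A, REGISTRY_B]


def resolve(type_id: str, federation: list) -> Optional[dict]:
    """Steps 2-3: ask each registry in turn and return the first matching record."""
    for registry in federation:
        record = registry.get(type_id)
        if record is not None:
            return record
    return None


def handle(type_id: str, payload: bytes) -> str:
    """Steps 1 and 4: decide what to do with data of a previously unknown type."""
    record = resolve(type_id, FEDERATION)
    if record is None:
        return "unknown type: cannot interpret payload"
    if record["services"]:
        return f"send {len(payload)} bytes to {record['services'][0]}"
    return f"process locally as {record['name']}"


print(handle("example-a/temperature", b"101"))
print(handle("example-b/bounding-box", b"10100"))
```

The sketch always prefers a service pointer when one exists; on the slide that hand-off is optional, so a real client would make the choice itself.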

  7. Discovery Use Case
     [Diagram: Users; Federated Set of Type Registries; Typed Data (ID, Type, Payload); Repositories and Metadata Registries; numbered arrows 1-4]
     1 Clients (process or people) look for types that match their criteria for data, e.g., types that combine location, temperature, and date-time stamp.
     2 Type Registry returns matching types.
     3 Clients look up in repositories and metadata registries for data sets matching those types.
     4 Appropriate typed data is returned.
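
The discovery steps above amount to two lookups: first types by criteria, then data sets by type. A minimal sketch, assuming invented type identifiers, concept names, and data set names:

```python
# Hypothetical type records: each type lists the concepts it combines.
TYPES = {
    "example/station-observation": {"location", "temperature", "date-time"},
    "example/stream-gauge": {"location", "water level", "date-time"},
}

# Hypothetical repository index: data set name -> type identifier.
DATASETS = {
    "stations-2016.csv": "example/station-observation",
    "gauges-2016.nc": "example/stream-gauge",
    "stations-2017.csv": "example/station-observation",
}


def find_types(required: set) -> list:
    """Steps 1-2: return IDs of types that cover all required concepts."""
    return sorted(tid for tid, concepts in TYPES.items() if required <= concepts)


def find_datasets(type_ids: list) -> list:
    """Steps 3-4: return data sets whose registered type is one of type_ids."""
    wanted = set(type_ids)
    return sorted(name for name, tid in DATASETS.items() if tid in wanted)


matches = find_types({"location", "temperature", "date-time"})
print(matches)
print(find_datasets(matches))
```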

  8. Type Registry History
     • Handle Types
       • 0.Type/SomeType
       • Good idea, limited applicability
     • Profiles
       • Type the whole set of handle/type/value triples
       • No traction
     • Sloan Grant: 2012-2014
       • Generic Registry system using Type Registry as a use case
     • NSF Grant: 2013-2014
       • Included support for Type Registry
     • Research Data Alliance (RDA) Data Type Registries Working Group
       • One of first two WGs approved at Plenary 1, March 2013
       • International representation, > 50 members
       • Co-chairs Lannom (CNRI), Broeder (Max Planck Psycholinguistics Institute)
       • Approved as RDA Recommendation (2015)
     • DTR Phase 2: Follow-on Group
       • Focus on data type records
       • Help data producers create useful record types
       • Co-chairs Lannom (CNRI), Weigel (DKRZ)

  9. Current State
     • A prototype is at: http://typeregistry.org/
     • Multiple other implementations/projects
       • DKRZ, Vermont Monitoring Coop, RPID project
     • Implementation supports notions of primitives and derived types
       • Primitives are fundamental types that we expect humans and software to parse and understand
         • Integer, floating point, boolean value, string, date, timestamp, etc.
       • Derived types depend on primitives to describe something complex
         • Stream gauge, Lidar, Spatial bounding box, etc.
     • Registered types are assigned unique identifiers
     • ISO Study Group
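
One way to picture how derived types depend on primitives is to flatten a derived type until only primitives remain. The type names and field layouts below are hypothetical and do not come from typeregistry.org; the sketch also assumes type definitions contain no cycles.

```python
PRIMITIVES = {"integer", "float", "boolean", "string", "date", "timestamp"}

# Hypothetical derived types: field name -> type name (primitive or derived).
DERIVED = {
    "spatial-bounding-box": {
        "west": "float", "south": "float", "east": "float", "north": "float",
    },
    "stream-gauge-reading": {
        "site": "string",
        "observed": "timestamp",
        "extent": "spatial-bounding-box",
        "level": "float",
    },
}


def expand(type_name: str, prefix: str = "") -> dict:
    """Flatten a type into {field path: primitive name}, resolving nested derived types."""
    if type_name in PRIMITIVES:
        return {prefix or "value": type_name}
    if type_name not in DERIVED:
        raise KeyError(f"unregistered type: {type_name}")
    flat = {}
    for field_name, field_type in DERIVED[type_name].items():
        path = f"{prefix}.{field_name}" if prefix else field_name
        flat.update(expand(field_type, path))
    return flat


for path, primitive in expand("stream-gauge-reading").items():
    print(path, primitive)
```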

  10. ISO Study Group

  11. ISO Study Group Activity on Data Type Records: Background
      • ISO-IEC/JTC1/SC32/WG2 is currently working on a meta model for dataset description.
        • That meta model covers data elements "about" datasets versus "internal details" of datasets.
        • We refer to the record that captures internal details as a "data type record".
      • WG2 was receptive to the idea of exploring the "data type record" space.
      • A study group was authorized Nov 2016, with CNRI as lead: ISO-IEC/JTC1/SC32/WG2/SG
        • Present use case(s) from existing RDA members and see what fields are needed to describe the internals of the datasets pertaining to the use case(s).
        • Evaluate what existing ISO standards cover and what they do not.
        • If applicable, recommend a technical report or technical specification or standard.
      • The study group terminates June 2017

  12. ISO Study Group Activity on Data Type Records: Work and Possible Outcomes
      • ISO members referenced portions of existing ISO standards that are applicable to the proposed data type record structure (11179-3, 11179-7, 19763-12, and 11404)
      • CNRI will create a UML diagram that, wherever applicable, references the pieces from each of those standards.
      • Feb 10th: "Data Type Record Elements for Characterizing Data" submitted to WG. Has been posted to DTR site
        • Based in part on CMIP6 and Vermont Monitoring Coop examples
        • For discussion purposes, CNRI introduced the notion of a simple data type and a complex data type.
          • Simple data type describes characteristics of a single value (e.g., a cell in a spreadsheet)
          • Complex data type is an aggregate of simple data types (e.g., a row in a spreadsheet can be described using a complex data type)
      • June 17 Decision
        • Technical Report: documentation on how to use existing standards for the data type use case
        • Technical Specification: intermediate spec, possible future standard, still under development
        • Technical Standard: something new and useful, fully standardized
      • Good news: we get an ISO number, no matter what happens
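
The simple/complex distinction can be sketched as parsing one spreadsheet row: each cell is read with a simple type, and the row as a whole is described by a complex type that aggregates them. The column names and the `OBSERVATION_ROW` layout are assumptions for illustration, not part of the study group's document.

```python
from datetime import date

# Hypothetical simple types: name -> function that parses one cell's text.
SIMPLE_TYPES = {
    "integer": int,
    "float": float,
    "string": str,
    "date": date.fromisoformat,
}

# Hypothetical complex type: ordered (column name, simple type) pairs for one row.
OBSERVATION_ROW = [("station", "string"), ("day", "date"), ("temp_c", "float")]


def parse_row(cells: list, complex_type: list) -> dict:
    """Parse one row of cell strings according to a complex type."""
    if len(cells) != len(complex_type):
        raise ValueError(f"expected {len(complex_type)} cells, got {len(cells)}")
    row = {}
    for cell, (column, simple_type) in zip(cells, complex_type):
        # The parser raises ValueError when the cell text does not fit the type.
        row[column] = SIMPLE_TYPES[simple_type](cell)
    return row


print(parse_row(["B27", "2017-06-17", "7"], OBSERVATION_ROW))
```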

  14. Data Type Example

  15. DATA TYPE REGISTRY
      DATA SET DESCRIPTIONS FOR AUTOMATION
      STEPHEN M RICHARD, IEDA, EARTHCUBE

  16. USE CASES
      • Document the meaning of entities and attributes in data.
      • Re-use of data type and attribute definitions
      • Machine-assisted data integration: matching attribute content.
      • Validation of data instances against a type definition.
      • Tools that spin up a UI for a particular data type.
      • Link software to data sources that it can use
      • Support file introspection to assist with deep data registration

  17. PROGRESS
      • JSON schema implemented for data type model
        • https://github.com/usgin/digital-crust-LDR
        • DataTypeJSON.json
      • Initial testing with Cordra
      • Next steps
        • How to get model compilations from spreadsheet or RDF to Cordra
        • Where to deploy

  18. Enrich (enrich.cordra.org)
      • DTR Service enhances dataset metadata
        • Syntactic nature (Decimal values between 0.0 and 24.0)
        • Semantic information (Concept time in unit hours)
      • Currently applying the DTR at an attribute level (Concept and Unit)
      • Expand to the dataset level
        • Dataset Metadata (row/column count, maintainer, description, etc.)
        • Dataset Schema (columns information: constraints, formats, semantic information, etc.)
      • Utilizing DTR mapping, in addition to other metadata, to drive a dataset recommendation system.
      • Continuing to collaborate with CNRI on this.
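
An attribute-level type that pairs a syntactic constraint with semantic information could be modeled as below. The slide does not say how Enrich implements this, so `AttributeType` and `fraction_matching` are hypothetical names and the matching rule is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    """Hypothetical attribute-level type: syntax (range) plus semantics (concept, unit)."""
    concept: str
    unit: str
    minimum: float
    maximum: float

    def accepts(self, value: float) -> bool:
        """Check the syntactic constraint only: value within the closed range."""
        return self.minimum <= value <= self.maximum


# Mirrors the slide's example: decimal values between 0.0 and 24.0, concept time, unit hours.
HOURS_OF_DAY = AttributeType(concept="time", unit="hours", minimum=0.0, maximum=24.0)


def fraction_matching(column: list, attribute_type: AttributeType) -> float:
    """Share of column values that fit the type's syntactic constraint."""
    if not column:
        return 0.0
    return sum(attribute_type.accepts(v) for v in column) / len(column)


print(fraction_matching([0.0, 12.5, 24.0, 30.0], HOURS_OF_DAY))
```

A score like this could be one input when suggesting a concept and unit for an untyped column, though a range check alone cannot distinguish hours from any other quantity that happens to fall between 0 and 24.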

  20. Concept as it was at P8
      [Diagram labels: netCDF-Files; Agent; Collection; script; Processing service (WPS); <Metadata> (xml); well-defined ways to publish it (automatically); possible repacking into a new collection?; output (third-party input): multiple types, e.g. netcdf, xml, linked data, text reports, PROV record; described in DTR]
      HTTPS://RD-ALLIANCE.ORG/ - HTTPS://TWITTER.COM/RESDATALL

  21. Pathway for climate data processing services
      • Multiple upcoming angles for attaching a DT solution:
        • CMIP6 data
          • Handles, Handle records, potential PID Kernel Information profile
        • Copernicus Data Services
          • Re-using existing WPS-based services
          • Handles? Depends on success and acceptance of CMIP6 solution and available effort
        • Climate Analytics Service (DKRZ/CMCC)
          • Server-side data processing with PID support
          • Data sources: CMIP6, possibly Copernicus, ...

  22. Typing angles
      a) Service typing: automated discovery or at least verification
      b) Data input and output typing (coarse): to distinguish the data sources and help with management
      c) Typing of data internals (fine): traditional DTR, enable machines to understand meaning of data
