
Assumptions in Data Sharing and Registry System Development
Explore the challenges of data sharing, the need for precise data characterization, and the goals of evaluating assumptions in data to develop a functioning registry system easily adoptable by organizations. This involves addressing issues related to self-explanatory data, federation of registry instances, and configurable assumptions for different domains or organizations.
Presentation Transcript
Data Type Registries Breakout
Co-chairs: Larry Lannom, Tobias Weigel
P10, Montreal, September 2017

Agenda
11:30 - 11:35  Welcome & Intros, Agenda Bashing
11:35 - 11:40  Larry Lannom, State of the WG & Brief DTR Overview
11:40 - 11:50  Tobias Weigel, Climate Data Processing
11:50 - 12:00  Ulrich Schwardmann, ePIC DTR
12:00 - 12:10  Wo Chang, Common Access Protocol, IEEE BDGMM
12:10 - 12:20  Rob Quick, RPID Test Bed
12:20 - 12:25  Steve Richard, EarthCube (remote)
12:25 - 12:30  Andres Ferreyra, AgGateway (remote)
12:30 - 12:40  Giridhar Manepalli, ISO WG plus Data Models (remote)
12:40 - 13:00  Tobias Weigel, Discussion: Next Steps, Goals for P11

What is the Issue?
Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data.
How do we do this now?
  For documents, formats are enough, e.g., PDF, and then the document explains itself to humans.
  This doesn't work well with data: numbers are not self-explanatory.
  What does the number 7 mean in cell B27?
Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
We need a way to precisely characterize those assumptions so that they can be identified by humans and machines that were not closely involved in the data's creation.

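To make the "number 7 in cell B27" problem concrete, here is a minimal sketch of a value that carries a reference to a type record stating the producer's assumptions. The class names, fields, and the identifier "example/air-temperature" are hypothetical and chosen only for illustration; they are not taken from any actual DTR schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeRecord:
    """Hypothetical record that makes a producer's assumptions explicit."""
    type_id: str                      # identifier of the registered type (made up here)
    variable_name: str
    unit: str
    coordinate_system: Optional[str] = None


@dataclass(frozen=True)
class TypedValue:
    """A raw value paired with the type record that explains it."""
    value: float
    type_record: TypeRecord

    def describe(self) -> str:
        r = self.type_record
        return f"{r.variable_name} = {self.value} {r.unit} (type {r.type_id})"


# Illustrative data only: the meaning assigned to cell B27 is an assumption.
air_temp = TypeRecord(
    type_id="example/air-temperature",
    variable_name="air_temperature",
    unit="degC",
)
cell_b27 = TypedValue(7, air_temp)
print(cell_b27.describe())
```

With the type record attached, a reader or program that did not create the data can tell that 7 is a temperature in degrees Celsius rather than, say, a count or an index.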
DTR Usage Example
[Diagram: users and typed data sets (ID, Type, Payload) interact with a federated set of Type Registries and with typed data services for dissemination, rights ("Terms: I Agree"), processing, and visualization.]
1  Client (process or people) encounters data of an unknown type.
2  The type is resolved against a Type Registry.
3  Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally,
4  typed data or a reference to typed data can be sent to a service provider.

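The four steps in this usage example can be sketched roughly as follows. This is an in-memory illustration only: each registry is modelled as a plain dictionary, and the function names, record fields ("name", "services"), and identifiers are assumptions, not the interface of any real type registry.

```python
from typing import Dict, List, Optional

# Assumption: a registry is modelled as a dict from type ID to a type record.
Registry = Dict[str, dict]


def resolve_type(type_id: str, registries: List[Registry]) -> Optional[dict]:
    """Step 2: ask each registry in the federation until one knows the type."""
    for registry in registries:
        record = registry.get(type_id)
        if record is not None:
            return record
    return None


def handle_data(type_id: str, registries: List[Registry]) -> str:
    """Steps 1-4: decide what to do with data whose type is initially unknown."""
    record = resolve_type(type_id, registries)
    if record is None:
        return "unknown type: cannot interpret payload"
    # Step 3: the record may carry service pointers alongside the definition.
    services = record.get("services", [])
    if services:
        # Step 4: optionally hand the typed data to a service provider.
        return f"send payload to {services[0]}"
    return f"process locally as {record['name']}"


# Hypothetical federation of two registries with made-up identifiers.
registry_a = {"example/temperature": {"name": "temperature"}}
registry_b = {"example/grid": {"name": "grid", "services": ["example-visualizer"]}}
federation = [registry_a, registry_b]

print(handle_data("example/temperature", federation))
print(handle_data("example/grid", federation))
print(handle_data("example/missing", federation))
```

Searching the registries in order is only one possible federation strategy; the slides do not specify how a federated lookup is routed.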
Goal of the WG
Evaluate and identify a few assumptions in data that can be codified and shared, in order to produce a functioning Registry system that:
  can easily be evaluated by organizations before adoption
  is highly configurable for changing the scope of captured and shared assumptions, depending on the domain or organization
  supports several Type record dissemination variations
  is designed to allow federation between multiple Registry instances
The emphasis is not on:
  identifying every possible assumption and data characteristic applicable to all domains
  technology

Status of the WG
A prototype is at: http://typeregistry.org/
Multiple other implementations/projects, including multiple schemas
Implementation supports notions of primitives and derived types
  Primitives are fundamental types that we expect humans and software to parse and understand
  Derived types depend on primitives to describe something complex
  Registered types are assigned unique identifiers
Initial WG output published as ICT Technical Standard
ISO Study Group in process

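The relationship between primitives, derived types, and identifiers can be sketched as below. This is not the prototype's data model: the classes, the "example/" identifier prefix, and the use of random UUIDs are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataType:
    name: str
    # IDs of the types this one is built from; empty for a primitive.
    depends_on: List[str] = field(default_factory=list)

    @property
    def is_primitive(self) -> bool:
        return not self.depends_on


class TypeRegistry:
    """Hypothetical in-memory registry that assigns a unique ID on registration."""

    def __init__(self) -> None:
        self._types: Dict[str, DataType] = {}

    def register(self, data_type: DataType) -> str:
        # A derived type may only refer to types that are already registered.
        for dep in data_type.depends_on:
            if dep not in self._types:
                raise KeyError(f"unregistered dependency: {dep}")
        type_id = f"example/{uuid.uuid4()}"  # made-up identifier scheme
        self._types[type_id] = data_type
        return type_id

    def primitives_of(self, type_id: str) -> List[str]:
        """Expand a type recursively into the names of its primitives."""
        data_type = self._types[type_id]
        if data_type.is_primitive:
            return [data_type.name]
        names: List[str] = []
        for dep in data_type.depends_on:
            names.extend(self.primitives_of(dep))
        return names


registry = TypeRegistry()
float_id = registry.register(DataType("float"))
string_id = registry.register(DataType("string"))
measurement_id = registry.register(
    DataType("measurement", depends_on=[float_id, string_id])
)
print(registry.primitives_of(measurement_id))
```

Here "measurement" is a derived type composed of a float value and a string unit, so expanding it yields the two primitives it depends on.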
Initial Adopters
EarthCube: Steve Richard
Vermont Monitoring Cooperative: Mike Finnegan
DKRZ: Tobias Weigel
ePIC: Ulrich Schwardmann
NIST, Common Access Platform: Wo Chang
CNRI: multiple projects
Ongoing ISO Study Group

Expected Impact of the Deliverable
Best case scenario: agreed-upon set of standard schemas; ISO standard
  Wide use of types for data sharing and workflow automation
  Significant use of federation of distributed set of type registries
  Extended use of typed attribute/value pairs in PID resolution
Worst case scenario: no agreed-upon set of schemas, no further standardization
  General concept influences multiple communities in the direction of clearer data syntax and semantics
  ICT Tech Standard remains
  Existing use of typed attribute/value pairs in PID resolution

Expected Impact of the Deliverable
Before
  Data sets difficult to impossible to parse, understand, and re-use unless you created them, know who did, or there exists detailed public documentation.
  Search criteria for data sets restricted to keywords and sources.
  Standardization across data sets fairly arbitrary, concentrated in small groups and narrow communities.
After
  Data sets can be typed at a fine level of granularity, those types can be registered in a public registry, and those type records can contain sufficient information to make detailed and accurate use of the data sets so typed.
  Search criteria for data sets can include type information, yielding easier comparisons and mash-ups.
  Greater chance of standards developing across data set construction.