Data Management and Data Fabric: Organizational Model for Terminology in RDA Efforts

Explore a comprehensive model for organizing data management terms within the RDA framework, addressing issues in terminology and proposing strategies for improving clarity and consistency in data concepts.

  • Data Management
  • Data Fabric
  • RDA
  • Terminology
  • Organization

Presentation Transcript


  1. RDA Terminology: Data Management and Data Fabric
     DFT Goal: Describe a basic, abstract (but clear) data organization model that systematizes the already large body of definition work on data management terms, especially as involved in RDA's efforts.
     Terminology Issue: What do we expect from RDA?
     • Adopt one language or build our own?
     • Spend years on terminology debates?
     • Build our own language stepwise?
     • Other, such as cooperating with other efforts?
     Prepared for the RDA 6th Plenary, Paris, Sept. 23, 2015. Gary Berg-Cross, Co-Chair DFT IG, Co-organizing Chair for DF IG.

  2. Topics - RDA DFT is about clarifying and labeling concepts and Terminology Strategy
     Comment from Franco Zoppi: "The document seems to suffer from a problem in the used terminology. Terms are sometimes unclear (in many cases definitions would help) or even wrong or misused. I guess that most of these problems could be avoided with a correct use of Computer Science/ICT well established and consolidated terminology. This is particularly evident in Sections 2.2, 2.3 and 2.6."
     • Broadening discussion beyond a core to wider Data Management
     • Including suggested concepts with candidate terminology
     • Current strategy is to:
       • Clarify and update existing terms (Digital Objects need IDs, but what and how as part of data management? etc.)
       • Improve supporting models with conceptual relations (a big job)
       • Provide practical guidance (technical and policy views)

  3. Broadening the Discussion (Stepwise or Scope-wise)
     • Data Management (and use) is broader still
     • Digital Data Management, including unregistered data, is a broader concept
     • Digital Object Management (registered, digital data)
     • Where are datasets?

  4. Integrate Concepts: Policy-based Digital Data Management Concept Graph (Reagan Moore)
     Based on practical principles, Policy defines when in a workflow a PID is created, as well as other curation activities. These definitions are linked.
     [Concept graph; the layout is not recoverable from the transcript. Node labels include: Purpose, Collection, Digital Object, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Client Action, Enforcement Point, Persistent State Information, Periodic Assessment Criteria, Access control, Sharing, Publication, Preservation, Integrity, Authenticity, Completeness, Correctness, Consensus, Consistency, Replication Policy, Checksum Policy, Quota Policy, Data Type Policy, DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM, GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Relation labels: Defines, Has, Isa, SubType, HasFeature, Updates, Controls, Chains, Invokes.]

  5. Including suggested concepts with candidate terminology: Examples
     • Data practice is the actual application/use of ideas & methods (as opposed to theories) about how data are collected, created, stored (maintained), curated, used, shared and released (disseminated).
     • Data principles are rules that provide guidance across data management and use for such things as data acquisition, data lifecycle control, data policy & ownership, metadata practices, data quality etc.
     • Common data solutions are agreed upon, easily available, tested & approved approaches to widely occurring problems in data management and use.
     • Data discovery is a process of query and/or search to find (research) data of interest.
     • Database cracking features incremental partial indexing and/or sorting of the data. It combines features of automatic index selection and partial indexes. It reorganizes data within the query operators, integrating the re-organization effort (occasionally invoking creation or removal of indexes on tables and views based on use) into query execution. It shifts the cost of index maintenance from updates to query processing.
     • Adaptive indexing is characterized by the partial creation and refinement of preliminary or fixed DB indexes as side effects to support efficient query execution. (after http://www.vldb.org/pvldb/vol4/p586-idreos.pdf)
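
The database cracking definition above can be illustrated with a toy example. This is a minimal sketch in Python, not the implementation from the cited VLDB paper: the class name `CrackedColumn` and its methods are hypothetical, and the column is a plain in-memory list. Each range query partitions only the piece of the column that contains its bounds, so the data becomes incrementally more ordered as a side effect of querying.

```python
from bisect import bisect_left


class CrackedColumn:
    """Toy cracked column (hypothetical): queries reorganize the data."""

    def __init__(self, values):
        self.data = list(values)  # physically reorganized copy of the column
        self.pivots = []          # sorted pivot values cracked on so far
        self.bounds = []          # bounds[i]: first position with value >= pivots[i]

    def _crack(self, pivot):
        """Partition around pivot; return first position with value >= pivot."""
        i = bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.bounds[i]  # already cracked on this value, no work
        # Only the piece that contains the pivot is touched.
        lo = self.bounds[i - 1] if i > 0 else 0
        hi = self.bounds[i] if i < len(self.bounds) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.bounds.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return values v with low <= v < high, cracking as a side effect."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]
```

Calling `range_query(5, 13)` on a fresh column scans the whole list once. A later `range_query(1, 4)` only touches the piece of values below 5, which is the sense in which index maintenance cost moves into query processing.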

  6. Clarifying Concepts: we discussed other organizing model ideas
     • Link data management principles to the actual workflow of generating data (Data Management Workflow).
     • Structured Object includes provenance, versioning, and output MD (from PP).
     • Digital Object (aka Digital Entity): A digital object is composed of a structured sequence of bits/bytes. As an object it is named. This bit sequence can be identified & accessed by a unique and persistent identifier or by use of referencing attributes describing its properties.
     • Note: the Digital Entity definition from the ITU X.1255 standard is "machine-independent data structure consisting of one or more elements in digital form that can be parsed by different information systems; the structure helps to enable interoperability among diverse information systems in the Internet."
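
The Digital Object definition above names two access paths: a persistent identifier, or attributes that describe the object. A minimal sketch of that idea follows, assuming an in-memory store. `DigitalObject`, `Registry`, and the identifier strings are hypothetical names chosen for illustration; they are not part of X.1255 or of any RDA specification.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A named, structured bit sequence with a persistent identifier."""
    name: str
    bits: bytes
    pid: str                                        # hypothetical identifier string
    attributes: dict = field(default_factory=dict)  # referencing attributes


class Registry:
    """Hypothetical in-memory registry; a stand-in for a real PID service."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        if obj.pid in self._objects:
            raise ValueError(f"PID already registered: {obj.pid}")
        self._objects[obj.pid] = obj

    def resolve(self, pid):
        """Access path 1: by unique, persistent identifier."""
        return self._objects[pid]

    def find(self, **wanted):
        """Access path 2: by attributes describing the object's properties."""
        return [
            obj for obj in self._objects.values()
            if all(obj.attributes.get(k) == v for k, v in wanted.items())
        ]
```

For example, an object registered with `attributes={"format": "csv"}` can be retrieved either with `resolve(pid)` or with `find(format="csv")`.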

  7. Clarifying and updating existing terms: adding practicality
     Comments on the DF White Paper include challenges to the idea that Internal/External properties is a useful distinction for DOs:
     • Internal property refers to the properties, making up an internal structure, that allow one to interpret the content of a DO.
     • "The statement 'we need to distinguish the external characteristics from the internal characteristics to ensure that we really can separate common data management tasks from discipline specific heterogeneity' seems not appropriate... many such things considered external for data management vary by discipline too, e.g. search by sample type or Dx."
     • "I think that the assignment of PIDs to single data is unfeasible. Therefore you need search and query capabilities to find the required data contained in datasets/databases identified by the PIDs."
     [Figure: two example records. One has the properties Patient, Age, Symptom, Dx, ...; the other has Sample type, UoM, Obs., Precision. Each also carries ID, creation date, ...]
     Common Management for these External Properties? Part is identification, but part is for discoverability.

  8. Improving conceptual relations
     Concept map overview of Core Terms
     • How should data stored in a repository that has complex internal structure and that is subject to change be identified/cited?
     • How is some part of a database or dataset to be identified/cited?
     • We will need smarter resolvers that offer additional services beyond getting from an identifier to an object location.

  9. Providing Practical Guidance (Tech, Policy & Strategy)
     When should a PID be assigned to be useful with dynamic data?
     • "If you build up a clinical trial database you will continuously add and change data. There is no PID necessary because here you have the audit trail which stores all actions. A PID should be assigned, for example, when the database is cleaned and frozen, which is a definite working step in the workflow of clinical trials." (Christian Ohmann, Wolfgang Kuchinke, Steve Canham)
     • "PIDs should be assigned at the level of granularity (data sets) appropriate for a functional use that is envisaged." (Costantino Thanos)
     Responses
     • "Scalability is an issue, so the management of objects & identifiers should work through the same mechanisms as much as possible. To enable management of objects beyond a view focusing on single items, adequate mechanisms should, for example, be able to select objects by their most important characteristics or aggregate them at multiple levels of granularity and provide basic CRUD operations on such object collections." (Tobias Weigel, Michael Lautenschlager)
     • For added-value services, registries at the resolver level are also needed and should be maintained by recognized international organizations.
     • Publishers will rely on the DOI system because there has been major investment.
     • What highly available and scalable PID system is feasible?
     • We should develop a strategy built upon what exists and on what can be done for those cases where currently no PID is used.
     • Etc.
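
The clinical trial guidance above (audit trail while the data changes, PID only once the database is cleaned and frozen) can be sketched as follows. This is an illustration under stated assumptions, not a description of any real PID system: `TrialDatabase` is a hypothetical class, and deriving the identifier from a SHA-256 hash of the frozen content is an assumption made to keep the example self-contained. Real systems such as Handle or DOI mint identifiers through a registration service.

```python
import hashlib
import json


class TrialDatabase:
    """Hypothetical trial database: audit trail while live, PID at freeze."""

    def __init__(self):
        self.records = {}
        self.audit_trail = []  # every change is logged while the data is live
        self.pid = None        # no identifier until the database is frozen

    def put(self, key, value):
        if self.pid is not None:
            raise RuntimeError("database is frozen; changes need a new version")
        self.audit_trail.append(("put", key, value))
        self.records[key] = value

    def freeze(self, prefix="example"):
        """Freeze the cleaned database and assign its identifier once."""
        if self.pid is None:
            snapshot = json.dumps(self.records, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(snapshot).hexdigest()
            # Assumption: a content hash stands in for a minted PID.
            self.pid = f"{prefix}/{digest[:16]}"
        return self.pid
```

The identifier is assigned at the data set level, which matches the granularity recommended above, and any `put` after `freeze` is rejected, so the identifier always refers to the same content.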
