Data Quality and DDI Metadata Overview


This presentation delves into data quality, trustworthiness, usability, fitness-for-purpose, and the importance of rich metadata and provenance in the context of DDI standards.

  • Data Quality
  • DDI Metadata
  • Trustworthiness
  • Usability
  • Provenance

Uploaded on Feb 18, 2025 | 0 Views



Presentation Transcript


  1. Data Quality and DDI Metadata
      DDI Alliance Training Library, Version 1.0
      DDI Alliance, DDI Training Working Group
      This work is licensed under a Creative Commons Attribution 4.0 International License.

  2. Outline
      • Goals
      • What is Data Quality?
      • Approaches to Data Quality
      • Rich Metadata and Provenance
      • Organizational Certification
      • Data Quality Standards and Frameworks
      • SIMS Example Using DDI
      • Summary

  3. Goals
      • To understand the breadth of issues around data quality
      • To understand the different audiences for data quality metrics
      • To understand different requirements and current approaches
      • To understand how DDI standards can help
      This presentation does not attempt to be comprehensive, but rather to introduce the various aspects of data quality which may be of interest; the topic is simply too broad!

  4. What is Data Quality?

  5. Understanding the Question: What is Data Quality?
      • Literally: data quality is the determination of whether data is good or not
      • This depends entirely on what is meant by "good"!
          • Can the data be trusted?
          • Is the data useable?
          • Is it fit for purpose?
          • Was it collected in accordance with agreed processes and practices?

  6. Trust
      • Whether data can be trusted is an important consideration
      • The provenance of data is very important
      • The people and organizations involved with the data are important:
          • Who collected it?
          • Who funded the data collection?
          • Who is archiving and disseminating the data?
      • The methods used to collect, process, and manage the data are important

  7. Useability and Fitness-for-Purpose
      • The useability of data depends on how well-documented the data is, and whether it has been responsibly managed
      • The fitness-for-purpose of data depends almost entirely on the research for which it will be used
      • Some key factors must be considered:
          • Temporal coverage and granularity
          • Geographic coverage and granularity
          • Topical coverage
          • Methodologies for data collection and processing
          • Combinability of data with other sources being used
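
The coverage factors on slide 7 can be checked mechanically once they are recorded as metadata. The following is a minimal sketch under stated assumptions: the `Coverage` class, its field names, and the sample values are hypothetical and are not DDI elements. It only compares temporal, geographic, and topical coverage; methodology and combinability still need human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Hypothetical coverage summary for a dataset or a research need."""
    start_year: int
    end_year: int
    regions: frozenset
    topics: frozenset


def coverage_gaps(data: Coverage, need: Coverage) -> list:
    """Return the coverage factors on which `data` falls short of `need`."""
    gaps = []
    if data.start_year > need.start_year or data.end_year < need.end_year:
        gaps.append("temporal coverage")
    if not need.regions <= data.regions:
        gaps.append("geographic coverage")
    if not need.topics <= data.topics:
        gaps.append("topical coverage")
    return gaps


# Invented example values, for illustration only.
survey = Coverage(2010, 2020, frozenset({"DK", "SE"}),
                  frozenset({"employment", "income"}))
project = Coverage(2012, 2018, frozenset({"DK", "NO"}),
                   frozenset({"income"}))

print(coverage_gaps(survey, project))
```

The same dataset can pass this check for one project and fail it for another, which is the sense in which fitness-for-purpose depends on the intended research.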

  8. Agreed Processes and Practices
      • Governmental and supra-governmental organizations have a legal obligation to perform due diligence: they must take appropriate and regular actions when performing an activity
      • All researchers and organizations have an ethical and reputational obligation to perform due diligence
      • Measuring compliance with accepted practices and procedures is not a direct measure of data quality
          • It is a practical measure
          • It supports due diligence
          • Avoid labeling data as absolutely good or bad; focus on describing how it was created
          • Relies on agreed standards, frameworks, and methods

  9. Approaches to Data Quality

  10. Overview of Current Approaches
      • Rich Metadata and Provenance
          • Data Documentation Initiative (DDI)
          • World Wide Web Consortium (W3C)
          • Other Standards
      • Organizational Certification
          • CoreTrustSeal
          • FAIR Assessment Tools
          • Statistical System Audits (Eurostat Peer Review)
      • Quality Standards and Frameworks
          • International Monetary Fund (IMF) Data Quality Assessment Framework (DQAF)
          • OECD Statistical Data Quality
          • Eurostat Single Integrated Metadata Structure (SIMS; combines ESQRS and ESMS)

  11. Relative Maturity
      • Different efforts around data quality are more or less advanced
          • Some approaches involve more effort due to their complexity
          • Some communities are less regulated/disciplined in their approaches to data (e.g., statistical agencies vs. academic researchers)
      • Attempts to provide rich metadata and provenance information are on-going, but the problem is difficult to solve (a moving target)
      • Certification of repositories and other organizations is relatively advanced
      • Use of quality frameworks and standards in official statistics is an established (and growing) practice

  12. Rich Metadata
      • Rich metadata covers all levels of metadata:
          • Study-level (dataset-level) metadata
          • Variable-level (datum-level) metadata
          • Full definition/description of structures, concepts, representations, methods, organizations, citation, identification, versioning, etc.
      • Supported by standards such as DDI (for Social, Behavioral, and Economic sciences)
          • All versions of DDI (DDI-Codebook, DDI-Lifecycle, DDI-CDI) support this
          • Other standards are used in other domains (e.g., OMOP CDM for clinical data, SDMX for official statistics aggregates and time series)
      • Not as common as it needs to be!
      • Emphasized by the FAIR principles

  13. Provenance
      • Describes the lineage of data
          • Who was involved? (Data collection, processing, funding, etc.)
          • Why? What purpose? To support what analysis or goal?
          • What methods and processes were used?
          • What resources were involved?
          • How was it edited/transformed, recoded, etc.?
          • How were variables derived?
      • DDI provides support for this material
          • DDI-Codebook has descriptive material at the study level, and some specific process fields to hold processing code
          • DDI-Lifecycle has a greatly expanded capacity for doing this, including detail on survey questionnaires for data collection
          • DDI-CDI provides a framework for detailed provenance description, building on other standards (including other DDI versions)
      • Other standards also support this
          • W3C PROV-O, Structured Data Transformation Language (SDTL), PROV-ONE, BPMN
      • On-going efforts to identify common standards for provenance information through RDA, CODATA, etc.
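
The provenance questions on slide 13 (who, what method, how were variables derived) can be illustrated with a very small lineage record. This is a sketch under stated assumptions, not DDI, PROV-O, or SDTL syntax: the `Step` class, its fields, and the example variables are invented for illustration, and the derivation chain is assumed to contain no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One processing activity; all field names are illustrative."""
    output: str    # variable produced by the step
    inputs: tuple  # variables the step used
    method: str    # what was done
    agent: str     # who did it


# Invented derivation history for two variables.
steps = [
    Step("age", ("birth_year", "interview_year"),
         "interview_year - birth_year", "data processing unit"),
    Step("age_group", ("age",),
         "recode into 10-year bands", "data processing unit"),
]


def lineage(variable, steps):
    """Return the steps that led to `variable`, earliest first."""
    by_output = {s.output: s for s in steps}
    ordered = []

    def visit(name):
        step = by_output.get(name)
        if step is None:  # a collected (source) variable: nothing to trace
            return
        for parent in step.inputs:
            visit(parent)
        if step not in ordered:
            ordered.append(step)

    visit(variable)
    return ordered


for s in lineage("age_group", steps):
    print(f"{s.output} <- {', '.join(s.inputs)} ({s.method}; {s.agent})")
```

A real implementation would express the same information in one of the standards named on the slide so that it can be exchanged between systems.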

  14. Organizational Certification
      • The CoreTrustSeal from the World Data System (WDS) and the Dutch Data Archive (DANS) is the most significant example
          • https://www.coretrustseal.org/
          • Repeated 3-year certification of repositories according to a detailed set of requirements and tools
          • More than 100 trusted repositories have been certified
      • FAIR Assessment Tools and Implementation
          • Applies to all data-producing or disseminating organizations
          • GO FAIR, the Research Data Alliance (RDA), and others have groups focused on the FAIRification process, including assessment tools
          • Not yet mature, but demonstrable progress
          • Rich metadata, especially around provenance, may be lacking
      • Statistical System Audits (Eurostat Peer Review in ESS)
          • Started in 2006
          • Recurring examination of production systems (every 5 years)

  15. Data Quality Standards and Frameworks
      • Mostly from the official statistics world
      • Official data meets a high minimum standard for quality
          • Enforced through the regulations driving data reporting and collection
          • Specific concepts reported on a specific schedule using specific classifications
      • Quality consists of knowing the details about how these requirements were met by each data reporter/collector
      • The quality frameworks and standards provide a consistent technique for collecting and publishing these details
      • Each of these reporters/collectors uses a standard set of information so that differences between data contributing to a single combined data set can be understood (and footnoted)
          • Differences in methodology
          • Differences in national practice
          • Etc.

  16. Real-World Standards and Frameworks
      • International Monetary Fund (IMF) Data Quality Assessment Framework (DQAF)
          • https://dsbb.imf.org/dqrs/DQAF
          • Applied to many important international data sets (e.g., National Accounts)
          • Aggregate statistics (based on SDMX)
      • OECD Statistical Data Quality
          • https://www.oecd.org/sdd/qualityframeworkforoecdstatisticalactivities.htm#:~:text=For%20an%20international%20organisation%2C%20the,dissemination%20of%20data%20and%20metadata.
          • Aggregate statistics (based on SDMX), broader in scope than IMF
      • Eurostat Single Integrated Metadata Structure (SIMS)
          • https://ec.europa.eu/eurostat/ramon/statmanuals/files/SIMS_Manual_2014.pdf
          • Based on SDMX, but covers aggregate and microdata
          • Implementation based on DDI-Lifecycle (Statistics Denmark)
          • Based on earlier related efforts (ESQRS, ESMS) which SIMS has harmonized
      • All of the quality frameworks are explicitly aligned

  17. SIMS: A Quality Reporting Framework Example
      Eurostat Single Integrated Metadata Structure: Using DDI for a Quality Framework

  18. Quality Standards (1)
      • What is a quality standard?
          • Metadata items describing specific aspects of a data set
          • A formal description of a quality standard, and the quality concepts which it requires (DDI 3.3)
          • Can be used by reference at various points in the life-cycle
          • May be based on some official standard, but may also be set by the organization itself
      • In the European Statistical System (ESS) there are three different quality standards:
          • SIMS: the Single Integrated Metadata Structure
          • ESQRS: the ESS Standard for Quality Reports Structure
          • ESMS: the EURO-SDMX Metadata Structure
      • SIMS is a harmonized structure which is now being primarily implemented.

  19. Quality Standards (2)
      • A Quality Standard has: Name; Label; Description; Standard Used (citation); List of concepts (compliance definition)
      • Quality Standard example:
          • Name: SIMS
          • Label: the Single Integrated Metadata Structure
          • Description: SIMS is the dynamic inventory of statistical concepts used for quality and metadata reporting in the ESS
          • Reference to the Standard used: https://ec.europa.eu/eurostat/documents/64157/4373903/03-Single-Integrated-Metadata-Structure-and-its-Technical-Manual.pdf
          • List of quality concepts that are measured:
              S.1 Contact
                S.1.1 Contact organization
                S.1.2 Contact organization unit
                S.1.3 Contact name
                Etc.
              S.2 Introduction
              S.3 Metadata Update
                S.3.1 Metadata last certified
                S.3.2 Metadata last posted
                S.3.3 Metadata last update
              S.4 Statistical presentation
              Etc.
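
The structure on slide 19 (name, label, description, cited standard, list of concepts) maps naturally onto a small record type. The sketch below is an illustration only: `QualityStandard` and `sub_concepts` are hypothetical names chosen here, not DDI 3.3 element names, and the concept list is limited to the codes shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class QualityStandard:
    """Illustrative container; attributes follow the slide, not a DDI schema."""
    name: str
    label: str
    description: str
    reference: str
    concepts: dict = field(default_factory=dict)  # concept code -> concept name

    def sub_concepts(self, code):
        """Codes nested under `code`, for example S.1 -> S.1.1, S.1.2, S.1.3."""
        prefix = code + "."
        return [c for c in self.concepts if c.startswith(prefix)]


sims = QualityStandard(
    name="SIMS",
    label="Single Integrated Metadata Structure",
    description=("Dynamic inventory of statistical concepts used for "
                 "quality and metadata reporting in the ESS"),
    reference=("https://ec.europa.eu/eurostat/documents/64157/4373903/"
               "03-Single-Integrated-Metadata-Structure-and-its-Technical-Manual.pdf"),
    concepts={
        "S.1": "Contact",
        "S.1.1": "Contact organization",
        "S.1.2": "Contact organization unit",
        "S.1.3": "Contact name",
        "S.2": "Introduction",
        "S.3": "Metadata update",
        "S.3.1": "Metadata last certified",
        "S.3.2": "Metadata last posted",
        "S.3.3": "Metadata last update",
        "S.4": "Statistical presentation",
    },
)

print(sims.sub_concepts("S.3"))
```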

  20. Quality Concepts
      • Each Quality Standard consists of quality concepts (constituting a Concept Set) defined by that standard
      • When implementing several Quality Standards, some quality concepts may be the same across standards
      • Describe quality concepts in DDI once and reuse them within different Quality Standards wherever they are needed
          • If the definitions of the quality concepts are the same
      • However, the quality concepts may be defined slightly differently
          • In this case, it is necessary to create a mapping at the level of concepts and/or definitions and link them to the common concept that characterizes them

  21. Quality Concepts - Reuse
      • In this example, the terms common to the 3 standards are Contact and Confidentiality
      • Are these concepts the same?
          • If the answer is yes, we can create the mapping
      • Are there terms that might mean the same thing?
          • Sometimes concepts match even though the terms do not (e.g., Introduction vs. General Description)
      • Note: Self-defined standards are custom concepts used inside an institution, not required by a shared framework
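
The two questions on slide 21 (which terms are shared, and which differently named terms denote the same concept) can be sketched as follows. The term lists assigned to each standard are assumptions made for this example; only the terms themselves appear in the slides. The `same_concept` mapping stands for a decision a person makes after comparing definitions; it cannot be derived automatically from the terms.

```python
# Hypothetical term lists per standard (illustration only).
standards = {
    "Quality Standard X": {"Contact", "Confidentiality", "Introduction"},
    "Quality Standard Y": {"Contact", "Confidentiality", "General Description"},
    "Self-defined standard": {"Contact", "Confidentiality", "Release policy"},
}

# Terms that literally appear in every standard.
shared_terms = set.intersection(*standards.values())

# Curated mapping: different term, same underlying concept.
same_concept = {"General Description": "Introduction"}


def common_concept(term):
    """Map a standard-specific term to the common concept it denotes."""
    return same_concept.get(term, term)


def standards_using(concept):
    """Names of the standards that contain `concept` under any term."""
    return sorted(
        name
        for name, terms in standards.items()
        if concept in {common_concept(t) for t in terms}
    )


print(sorted(shared_terms))
print(standards_using("Introduction"))
```

Matching on the term alone finds Contact and Confidentiality; the mapping additionally shows that Introduction is covered by two of the three standards.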

  22. Quality Concepts according to SIMS (example)
      Concept: Description
      • Contact: Individual or organisational contact points for the data/metadata, incl. information on how to reach the contact points.
      • Contact organisation: The name of the organisation of the contact points for the data/metadata.
      • Contact name: The name of the contact points for the data/metadata.
      • Introduction: A general description of the statistical process and its outputs, and their evolution over time.
      • Metadata update: The date on which the metadata element was inserted or modified in the database.
      • Metadata last update: Date of last update of the content of the metadata.
      • Statistical presentation: Description of the disseminated data which can be displayed to users as tables, graphs or maps.
      • Sector coverage: Main economic or other sectors covered by the statistics.
      • Time coverage: The time period covered by the data set.
      • Etc.
      (SIMS: Single Integrated Metadata Structure)

  23. Quality Statement
      • A document in which an organisation establishes what quality concepts are considered relevant and necessary at a certain point of the life-cycle
      • May be related to an external standard or contain a simple statement of the internal quality goals or expectations
      • May correspond to one specific quality standard, but does not necessarily do so (some concepts are shared between standards)
      • Diagram: a Quality Statement selects Quality Concepts from a Quality Standard; a Quality Report provides values for those Quality Concepts
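
The relationship on slide 23 between a Quality Statement (which selects concepts) and a Quality Report (which provides values for them) can be sketched as two small types. The class names mirror the slide, but the attributes, the `missing` method, and the sample values are assumptions made for this illustration and are not taken from the DDI schemas.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QualityStatement:
    """Names the quality concepts considered relevant at some life-cycle point."""
    name: str
    concepts: tuple


@dataclass
class QualityReport:
    """Provides values for the concepts selected by a statement."""
    statement: QualityStatement
    values: dict = field(default_factory=dict)

    def missing(self):
        """Selected concepts for which no value has been reported yet."""
        return [c for c in self.statement.concepts if c not in self.values]


# Hypothetical statement selecting two SIMS concepts.
statement = QualityStatement(
    name="Dissemination checks",
    concepts=("Contact", "Metadata update"),
)

# Hypothetical report that has filled in only one of them.
report = QualityReport(statement, {"Contact": "statistics office help desk"})

print(report.missing())
```

Keeping the statement separate from the report lets the same selection of concepts be reported against repeatedly, for example once per data release.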

  24. Quality Statements/Reports
      • Quality Reports:
          • Primarily address processes and steps that are taken to ensure quality within those processes
          • Allow for either:
              • the identification of an external standard plus a statement regarding compliance with that standard, or
              • a general statement of steps taken to ensure quality for a given process or activity

  25. DDI: Reusing Concepts in Quality Statements
      • Total list of Quality Concepts: Contact, Metadata update, Statistical presentation, Confidentiality, Introduction, etc.
      • In DDI, each quality concept with the same information content can be described only once and reused in as many Quality Statements as necessary.
      • It doesn't matter if the concepts in the Quality Statement are based on the same Quality Standard or not; they can be combined.
      • Diagram: Quality Statements A, B, and C draw their used concepts (Metadata update, Statistical presentation, Confidentiality, Accuracy and reliability, Contact, Introduction, Release policy) from a self-defined quality standard, Quality Standard X, and Quality Standard Y
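
The "describe once, reuse everywhere" idea on slide 25 amounts to storing each concept in one place and having Quality Statements refer to it by identifier. The sketch below assumes hypothetical identifiers and an invented assignment of concepts to statements A, B, and C; the two full definitions are the SIMS wording quoted on slide 22.

```python
# Each concept is described exactly once (identifiers are hypothetical).
concepts = {
    "contact": ("Individual or organisational contact points for the "
                "data/metadata, incl. information on how to reach the "
                "contact points."),
    "metadata_update": ("The date on which the metadata element was inserted "
                        "or modified in the database."),
    "confidentiality": "(definition not given in the slides)",
}

# Statements hold references, not copies (assignment invented for illustration).
statements = {
    "Quality Statement A": ["contact", "metadata_update"],
    "Quality Statement B": ["contact", "confidentiality"],
    "Quality Statement C": ["metadata_update", "confidentiality"],
}


def resolve(statement_name):
    """Look up the single shared definition of every referenced concept."""
    return {cid: concepts[cid] for cid in statements[statement_name]}


def used_in(concept_id):
    """Statements that reference a given concept."""
    return [name for name, ids in statements.items() if concept_id in ids]


print(used_in("contact"))
print(list(resolve("Quality Statement C")))
```

Because statements store only references, correcting a definition in `concepts` is immediately visible in every statement that uses it.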

  26. Summary

  27. Data Quality is Complicated!
      • There are many different understandings and approaches for measuring data quality
          • This depends on the organization and the intended audience for the data being described
      • Even though it may be difficult, it is very, very important!
      • DDI supports these approaches:
          • By providing rich metadata and provenance information
          • By helping organizations meet certification requirements
          • By directly citing quality standards and supporting quality standards, statements, and concepts

  28. Credits: DDI Training Working Group
      Florio Orocio Arguillas, Alina Danciu, Adrian Dusa, Jane Fry, Martine Gagnon, Dan Gillman, Arofan Gregory, Taras Günther, Lea Sztuk Haahr, Chifundo Kanjala, Kaia Kulla, Kathryn Lavender, Amber Leahey, Marta Limmert, Jared Lyle, Alexandre Mairot, Lucie Marie, Hayley Mills, Laura Molloy, Hilde Orten, Anja Perry, Knut Wenzig
