
Using Synthetic Data to Reduce Disclosure Risk in Municipal Health Surveys
This presentation discusses the use of synthetic data to mitigate disclosure risks in municipal health surveys. The project objectives include evaluating disclosure risks, implementing mitigation solutions, and assessing risk reduction post-mitigation. The approach involves evaluating high-risk survey records and identifying key variables for protection. Overall, the focus is on safeguarding public-use data and reducing the potential for disclosure risks in releasing sensitive information.
Presentation Transcript
USING SYNTHETIC DATA TO REDUCE DISCLOSURE RISK IN MUNICIPAL HEALTH SURVEYS
Wen Qin Deng, New York City Department of Health and Mental Hygiene
Joint work with Stephen Immerwahr, Tashema Bholanath (NYC DOHMH) and Jingchen (Monika) Hu (Vassar College)
Federal Computer Assisted Survey Information Collection Workshops, April 17th, 2024
CONTENTS
01 BACKGROUND
02 DISCLOSURE RISK ASSESSMENT
03 MITIGATION SOLUTIONS & RESULTS
04 SUMMARY & TAKEAWAYS
01 BACKGROUND
PUBLIC-USE DATA: USEFULNESS AND CHALLENGES
Public-use data files are extremely valuable.
Disclosure risks may exist in the public release of record-level survey data, e.g., potential linkage to an administrative database (vaccine, etc.).
PROJECT OBJECTIVES
Systematic methods to:
  Evaluate disclosure risks
  Implement mitigation solutions
  Evaluate utility and risk reduction post mitigation
02 DISCLOSURE RISK ASSESSMENT
APPROACH OVERVIEW
Evaluate the disclosure risk of all confidential CHS (Community Health Survey) 2021 survey records.
Identity disclosure risk: a malicious intruder might correctly identify a target in the NYC population using information (such as identifying variables) from the public-use CHS dataset.
Assume the intruder knows a combination of identifying variables for each record.
Flag survey records as high-risk using identifying categorical variables and sampling weights.
Weighted N (WgtN) and 95% confidence intervals (CIs): focus on population total estimates and their standard errors.
SELECTION OF IDENTIFYING VARIABLES
CORE VARIABLES
  Demographic variables that are easily knowable: age group, sex, race/ethnicity
  Geography (borough)
  High priority to be kept unmodified for use by analysts
KEY VARIABLES
  Additional demographic and health-related variables that are easily knowable
  Contain sensitive information and are subject to mitigation solutions
IDENTIFYING HIGH-RISK RECORDS
2021 CHS: 25 key variables identified as carrying elevated risk of re-identification.
Cross-tabulate the core variables with one key variable at a time: Age Group x Sex x Race/ethnicity x Borough x Key Variable A.
Records in a combination whose Weighted N has a 95% CI lower bound below 100 are flagged as high-risk, i.e., the estimated NYC population sharing that combination is fewer than 100.
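The flagging rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name, the record layout (a list of dicts with a `weight` field), and the variance estimator are all assumptions. In particular, the standard error here uses a simple with-replacement, single-stratum linearization, whereas a production analysis would use the survey's actual design information (strata, replicate weights, and so on).

```python
import math
from collections import defaultdict

def flag_high_risk(records, cell_vars, weight_key="weight", threshold=100.0, z=1.96):
    """Return one True/False high-risk flag per record (hypothetical helper)."""
    n = len(records)
    # Group record indices by their combination of identifying variables.
    cells = defaultdict(list)
    for i, rec in enumerate(records):
        cells[tuple(rec[v] for v in cell_vars)].append(i)

    flags = [False] * n
    for idx in cells.values():
        # Weighted N: estimated number of people in the population in this cell.
        wgt_n = sum(records[i][weight_key] for i in idx)
        # Assumed variance estimator: n/(n-1) * sum((z_i - mean_z)^2), where
        # z_i is the weight for records in the cell and 0 for all other records.
        mean_z = wgt_n / n
        ss = sum((records[i][weight_key] - mean_z) ** 2 for i in idx)
        ss += (n - len(idx)) * mean_z ** 2
        se = math.sqrt(n / (n - 1) * ss) if n > 1 else 0.0
        # Flag every record in the cell if the 95% CI lower bound is below 100.
        if wgt_n - z * se < threshold:
            for i in idx:
                flags[i] = True
    return flags
```

Running this once per key variable, with `cell_vars` set to the four core variables plus that key variable, would give one set of flags per key variable, matching the "one key at a time" design on the slide.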
RISK ASSESSMENT RESULTS
Of the 25 key variables selected:
  4%-24% of all observations had elevated risk of re-identification
  400-2,442 high-risk observations out of 10,271 records
03 MITIGATION SOLUTIONS & RESULTS
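As a quick arithmetic check, the reported counts and percentages are consistent with each other. The variable names below are illustrative only; the numbers come from the slide.

```python
total_records = 10_271
high_risk_min, high_risk_max = 400, 2_442

# Share of all records flagged as high-risk, for the key variable with the
# fewest flags and the key variable with the most flags.
share_min = high_risk_min / total_records
share_max = high_risk_max / total_records
print(f"{share_min:.1%} to {share_max:.1%}")
```

The shares come to about 3.9% and 23.8%, which round to the 4%-24% range reported.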
OVERVIEW OF OUR APPROACH
Mitigation solution
  Partial data synthesis with DPMPM (Dirichlet process mixture of products of multinomials), a Bayesian latent class model
Apply the mitigation solution to a subset of the high-risk records
  Determine a threshold for this subset (e.g., 5% of all records)
  Select the subset through a randomized process
Evaluate data utility
  Compare the 95% confidence interval overlap (CIO) of prevalence estimates of selected health measures between the confidential data and the protected public-use data
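The subset-selection step can be sketched as follows. The slides say only that the subset is capped by a threshold and chosen by a randomized process, so the details here are assumptions: simple random sampling without replacement, a cap computed as a floor of the share times the record count, and the hypothetical function name `select_for_synthesis`.

```python
import random

def select_for_synthesis(high_risk_flags, max_share=0.05, seed=None):
    """Pick indices of high-risk records to synthesize, capped at max_share of all records."""
    rng = random.Random(seed)
    n = len(high_risk_flags)
    cap = int(max_share * n)  # e.g., at most 5% of all records (rounded down)
    candidates = [i for i, flagged in enumerate(high_risk_flags) if flagged]
    if len(candidates) <= cap:
        # Fewer high-risk records than the cap: synthesize all of them.
        return sorted(candidates)
    # Otherwise draw a simple random sample of the high-risk records.
    return sorted(rng.sample(candidates, cap))
```

Only the selected records would then have their key-variable values replaced with draws from the fitted synthesis model. The remaining records, including any unselected high-risk ones, are released unchanged, which is why some records remain classified as high-risk after mitigation.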
RISK RESULTS AFTER DPMPM SYNTHESIS
Before synthesis: 4% to 24% of all observations are high-risk.
After synthesis (synthesis-at-most-5% approach): at most 21% of the dataset remains classified as high-risk (i.e., at least 79% protection).
Note: among 25 key variables selected in the 2021 CHS.
DATA UTILITY RESULTS AFTER DPMPM SYNTHESIS
With synthesis-at-most-5%: 94% overlap in the 95% confidence intervals of important health measures, on average.
SYNTHESIS thresholds: 5%, 10%, 20%
95% CONFIDENCE INTERVAL OVERLAP cutoffs: 50%, 75%, 90%
HEALTH MEASURES: 10 commonly reported
Best in balancing risk reduction and utility preservation
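The slides do not spell out how the confidence interval overlap is computed. The sketch below assumes a definition commonly used in the synthetic data literature: the length of the intersection of the two intervals, expressed as a fraction of each interval's length, then averaged. The function names are hypothetical.

```python
def ci_overlap(conf_ci, synth_ci):
    """Overlap between two (lower, upper) intervals; 1.0 means identical intervals."""
    lo_c, hi_c = conf_ci
    lo_s, hi_s = synth_ci
    # Length of the intersection (negative if the intervals do not overlap).
    inter = min(hi_c, hi_s) - max(lo_c, lo_s)
    # Average the intersection's share of each interval's length.
    return 0.5 * (inter / (hi_c - lo_c) + inter / (hi_s - lo_s))

def mean_ci_overlap(pairs):
    """Average overlap across health measures; pairs is a list of (conf_ci, synth_ci)."""
    return sum(ci_overlap(c, s) for c, s in pairs) / len(pairs)
```

For example, `ci_overlap((0.10, 0.14), (0.12, 0.16))` is about 0.5, because half of each interval is shared. Under this definition disjoint intervals yield a negative value, which some analysts truncate to zero.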
RISK AND UTILITY TRADE-OFF
DPMPM: higher utility at the price of slightly higher disclosure risks.
[Figure: risk-utility plot for Variables A, B, C and D. Y-axis: disclosure risk, the % of records that are high-risk after mitigation (smaller means lower risk), with ticks from 18% to 22%. X-axis: utility, the 95% confidence interval overlap of health outcomes before and after mitigation, with ticks from 91% to 97%.]
04 SUMMARY AND TAKEAWAYS
SUMMARY AND KEY TAKEAWAYS
01 Any mitigation presents a utility-risk tradeoff
02 Multiple considerations in setting parameters
CONSIDERATIONS AND BEST PRACTICES
TIMING FOR MITIGATION IMPLEMENTATION
  Where in the lifecycle to implement mitigation? At the very start? Only to public-use data?
SCOPE OF DATA FOR MITIGATION
  To which datasets to apply mitigation solutions? Only to large survey datasets? To all datasets?
RETROACTIVE MITIGATION
  Should we retroactively apply mitigation solutions? Retroactive application to all datasets? Internal datasets? Public-use datasets?
HANDLING USER INQUIRIES
  How to respond to user inquiries when they obtain different estimates using the synthetic data vs. DOHMH's publications?
Thank You! Questions?
Wen Qin Deng, NYC DOHMH
wdeng@health.nyc.gov