Introduction to Differential Privacy
This presentation reviews traditional disclosure limitation methods and introduces differential privacy as a more robust alternative. It covers techniques such as cell suppression and complementary suppression, highlighting the challenges of safeguarding sensitive data in statistical tables.
Introduction to Differential Privacy
Privacy Day, January 31, 2018
Simson L. Garfinkel
Senior Computer Scientist for Confidentiality and Data Access
Associate Directorate for Research and Methodology, U.S. Census Bureau

Disclaimer: The views expressed in this talk are those of the author, and not necessarily those of the U.S. Census Bureau.
Outline for today's talk
1. Traditional Disclosure Limitation
2. Introducing Differential Privacy
3. Microdata
4. Conclusion
Statistical tables can disclose information about individual people and firms.

Sensitive data value: a data item which a data user could utilize to calculate another individual's or establishment's data. Sensitive data values must not be disclosed.
Cell suppression: a disclosure avoidance technique in which the sensitive data in the publication is replaced with a (D).
Tables with marginal totals require complementary suppression. In an unsuppressed row, the total is simply the sum of the cells:

    173,536 = 14,566 + 45,105 + 113,865

But a marginal total reveals a lone suppressed cell in its row:

    84,842 = 5,413 + (D) + 63,252, so (D) = 84,842 - 5,413 - 63,252 = 16,177

The fix is to suppress additional cells, here the totals themselves: TOTALS (D) (D) (D).

Complementary suppression: suppressions which prevent a user from deriving a sensitive data value from additive table relationships.
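To make the arithmetic concrete, here is a minimal Python sketch of that derivation; the variable names are illustrative, not from the slide:

```python
# Recovering a suppressed cell (D) from the additive relationship
# row_total = cell_1 + (D) + cell_2, using the figures on the slide.
row_total = 84_842
published_cells = [5_413, 63_252]

suppressed_value = row_total - sum(published_cells)
print(suppressed_value)  # 16177, the value cell suppression tried to protect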
Problems with cell suppression:
1. It's hard to get right. This table is not adequately suppressed!
Solution
Problems with cell suppression:
1. It's hard to get right.
2. It's vulnerable to external data.

Even with the cells published as (D), a business that knows SIC 2 in MSA 2 can calculate SIC 1 in MSA 2.
Problems with cell suppression:
1. It's hard to get right.
2. It's vulnerable to external data.
3. It doesn't work well with time-series data.

January microdata and statistical tabulation:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Harper   Happy    100

    Students: 4 | Percent Happy: 50% | Average Grade: 65

February microdata and statistical tabulation:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Emerson  Sad      90
    Harper   Happy    100

    Students: 5 | Percent Happy: 40% | Average Grade: 70

Even if we just publish the tabulations, it's pretty easy to figure out that the new kid is sad and has a 90:

    NEWKID = (70 * 5) - (65 * 4) = 90
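A minimal sketch of this differencing attack: nothing beyond the two published tabulations is needed. The dictionary layout is mine; the figures are from the slide:

```python
# Differencing attack on time-series tabulations: the two published
# releases alone reveal the newcomer's record.
jan = {"students": 4, "pct_happy": 0.50, "avg_grade": 65}
feb = {"students": 5, "pct_happy": 0.40, "avg_grade": 70}

# Sum of grades = count * average, so the new student's grade is the
# difference of the two published sums.
new_grade = feb["students"] * feb["avg_grade"] - jan["students"] * jan["avg_grade"]

# Likewise, the count of happy students went from 2 of 4 to 2 of 5,
# so the newcomer must be sad.
new_happy = feb["students"] * feb["pct_happy"] - jan["students"] * jan["pct_happy"]

print(new_grade)  # 90
print(new_happy)  # 0.0, so the new kid is sad
```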
The Census Bureau has used these kinds of rules to protect privacy for decades. Every time a vulnerability is discovered, another rule is added. That's why we call these rules ad hoc.

    Sensitive dataset -> Ad hoc Rules -> Privacy-Preserving Data Release

The rules are designed to protect the release from a data intruder. Key assumptions:
1. Data available to intruders is known and bounded.
2. Intruders do not collude.
3. Future data sets will not be released that can be linked with previously released data.
4. Adversaries have limited resources to pursue re-identification attacks.
Outline for today's talk
1. Traditional Disclosure Limitation
2. Introducing Differential Privacy
3. Microdata
4. Conclusion
Differential privacy's core idea: create uncertainty regarding the presence of any person in the dataset. Noise is added to mask an individual's contribution.

January:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Harper   Happy    100

    Statistical Tabulation:          Students: 4 | Percent Happy: 50% | Average Grade: 65
    Statistical Tabulation + noise:  Students: 4 | Percent Happy: 45% | Average Grade: 50

February:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Emerson  Sad      90
    Harper   Happy    100

    Statistical Tabulation:          Students: 5 | Percent Happy: 40% | Average Grade: 70
    Statistical Tabulation + noise:  Students: 5 | Percent Happy: 60% | Average Grade: 75

This is called output noise infusion because the noise is added on output.
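As a minimal sketch of output noise infusion: the Laplace distribution and the scale value below are my illustrative choices; the slide does not specify the noise distribution.

```python
import numpy as np

grades = [30, 50, 80, 100]  # January's confidential grades

def noisy_average(values, scale):
    """Output noise infusion: compute the exact statistic first,
    then perturb it once, just before release."""
    return float(np.mean(values)) + float(np.random.laplace(loc=0.0, scale=scale))

print(noisy_average(grades, scale=10.0))  # e.g. 58.3; every release differs
```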
If we ran the statistics multiple times, we would get different results. Three noisy releases of the same January data:

    Statistical Tabulation + noise:  Students: 4 | Percent Happy: 45% | Average Grade: 50
    Statistical Tabulation + noise:  Students: 4 | Percent Happy: 55% | Average Grade: 75
    Statistical Tabulation + noise:  Students: 4 | Percent Happy: 51% | Average Grade: 60

In this example, a policy decision requires that the number of students be accurately reported.
Data users understand that noise has been added: the real value has a given probability of being within a given range.

    Students: 3 | Percent Happy: 40% | Average Grade: 50    What were the original data? 30% .. 50% (p = .95)
    Students: 6 | Percent Happy: 45% | Average Grade: 45    What were the original data? 35% .. 55% (p = .95)
    Students: 5 | Percent Happy: 51% | Average Grade: 60    What were the original data? 41% .. 61% (p = .95)

The original value was 50%.

Note: this example shows results from 3 different universes. In practice, a differentially private system must remember the answer to every query.
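If the added noise were Laplace with a known scale (an assumption; the slide leaves the distribution unspecified), the p = .95 range around a release can be computed directly. The scale below is chosen so the half-width is about 10 points, matching the slide:

```python
import math

def laplace_interval(released_value, scale, confidence=0.95):
    """Symmetric interval that contains the true value with the given
    probability when Laplace(scale) noise was added on output:
    P(|noise| <= t) = 1 - exp(-t / scale)."""
    half_width = -scale * math.log(1.0 - confidence)
    return (released_value - half_width, released_value + half_width)

print(laplace_interval(40.0, scale=3.34))  # roughly (30.0, 50.0), as on the slide
```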
Noise can be added in two places: 1) when data are collected, and 2) when statistics are produced.

Input noise infusion:

    Name     Affect          Grade
    Alex     Sad + NOISE     30 + NOISE
    Bobbie   Sad + NOISE     50 + NOISE
    Casey    Happy + NOISE   80 + NOISE
    Harper   Happy + NOISE   100 + NOISE

    Statistical Tabulation:  Students: 4 | Percent Happy: 30..70 | Average Grade: 50..80

    Advantages:
    - Tabulator need not be trusted.
    - More statistics do not pose additional privacy threats.

Output noise infusion:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Harper   Happy    100

    Statistical Tabulation:  Students: 4 | Percent Happy: 40..60 | Average Grade: 60..70

    Advantages:
    - More accurate for the same level of privacy.
    - Allows uses of confidential data that do not involve publication.

A sketch of one classic input-noise mechanism follows below.
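One classic input-noise mechanism is randomized response (my example; the slide does not name it): each respondent reports the truth only with some probability, so no single report is conclusive, yet the aggregate can still be debiased.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the truth with probability p_truth, the opposite otherwise.
    No single noisy report reveals the respondent's real answer."""
    return true_answer if random.random() < p_truth else not true_answer

# Each "happy?" answer is noised at collection time; the tabulator
# never sees the true values.
true_answers = [False, False, True, True]  # Sad, Sad, Happy, Happy
reports = [randomized_response(a) for a in true_answers]

# Debias the aggregate: E[reported rate] = p*q + (1 - p)*(1 - q),
# so the true rate q can be estimated from the reported rate.
p = 0.75
reported_rate = sum(reports) / len(reports)
estimated_rate = (reported_rate - (1 - p)) / (2 * p - 1)
print(estimated_rate)  # noisy estimate of the 50% happy rate
```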
The amount of noise added depends on two factors.
Factor #1: The sensitivity of the query.
Factor #2: The tradeoff between privacy loss and accuracy.

These factors are based on a mathematical model for privacy. The model is based on the concept of adjacent datasets.

Dataset D (without Emerson):

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Harper   Happy    100

Dataset D' (with Emerson):

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Emerson  Sad      90
    Harper   Happy    100

What is the difference of any function f() run over these two datasets? (i.e., |f(D) - f(D')|)
Factor #1: The sensitivity of the query: how much can f() change for any two adjacent datasets? Using datasets D (without Emerson) and D' (with Emerson) from above:

    Counting queries:  Δ(f) = 1
    Average Grade:     Δ(f) = 100 / n

The amount of noise added is ~ Δ(f).
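To make Factor #1 concrete, here is a tiny brute-force check over the slide's pair of adjacent datasets (the variable names are mine; true sensitivity is the worst case over all adjacent pairs, not just this one):

```python
# How much does f() change across the slide's adjacent datasets?
D1 = [30, 50, 80, 100]      # grades without Emerson
D2 = [30, 50, 80, 90, 100]  # grades with Emerson

count_change = abs(len(D2) - len(D1))                    # counting query: 1
avg_change = abs(sum(D2) / len(D2) - sum(D1) / len(D1))  # 70 - 65 = 5
print(count_change, avg_change)  # 1 5.0; for grades in 0..100 the bound is 100/n
```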
Factor #2: The tradeoff between privacy loss and accuracy.

Differential privacy uses the parameter ε (epsilon) to describe the privacy/accuracy tradeoff:

    ε = 0:  No accuracy, full privacy
    ε = ∞:  No privacy, full accuracy
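Putting the two factors together: a common calibration, the Laplace mechanism, draws noise with scale Δ(f)/ε. The slide does not name a specific mechanism, so this is an illustrative sketch:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace noise of scale sensitivity/epsilon,
    the standard calibration combining Factor #1 and Factor #2."""
    return true_value + float(np.random.laplace(loc=0.0, scale=sensitivity / epsilon))

# Counting query over the February class: sensitivity 1.
print(laplace_mechanism(5, sensitivity=1, epsilon=1.0))

# Average grade on a 0..100 scale with n = 5: sensitivity 100 / 5 = 20.
for eps in (0.1, 1.0, 10.0):  # smaller epsilon: more privacy, more noise
    print(eps, round(laplace_mechanism(70, sensitivity=20, epsilon=eps), 1))
```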
Differential privacy inverts the model of privacy-preserving data releases. Previously we started with rules and hoped that they protected privacy:

    Sensitive dataset -> Ad hoc Rules -> Privacy-Preserving Data Release

Differential privacy starts with a formal privacy definition:

    Sensitive dataset + Privacy Parameters -> Formal Privacy Definition -> Methods that implement the privacy definition -> Privacy-Preserving Data Release
Outline for today's talk
1. Traditional Disclosure Limitation
2. Introducing Differential Privacy
3. Microdata
4. Conclusion
Microdata
Final problem: what do we do about microdata? Let's say we want to publish this microdata:

    Name     Affect   Grade
    Alex     Sad      30
    Bobbie   Sad      50
    Casey    Happy    80
    Emerson  Sad      90
    Harper   Happy    100

as the de-identified release below. It is NOT DIFFERENTIALLY PRIVATE:

    ID#   Affect   Grade
    1     Sad      30
    2     Sad      50
    3     Happy    80
    4     Sad      90
    5     Happy    100

Now say Emerson's report card is lost on the way home: EMERSON 90. As a result of the data release, Emerson's affect can be determined from the microdata: record #4 is the only one with a grade of 90, so Emerson is sad. The only solution is to add noise to the microdata or produce synthetic microdata.
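A minimal sketch of one way to produce synthetic microdata (my illustrative construction; the slide does not prescribe a method): histogram the attribute domain, add noise to every cell, then sample synthetic rows from the noisy counts.

```python
import numpy as np

rng = np.random.default_rng()

# The confidential microdata from the slide (names dropped).
records = [("Sad", 30), ("Sad", 50), ("Happy", 80), ("Sad", 90), ("Happy", 100)]
affects = ["Sad", "Happy"]
grade_buckets = [(0, 49), (50, 100)]  # coarse, purely illustrative buckets

# Histogram over the full (affect, grade-bucket) domain.
counts = {(a, b): 0 for a in affects for b in grade_buckets}
for affect, grade in records:
    for lo, hi in grade_buckets:
        if lo <= grade <= hi:
            counts[(affect, (lo, hi))] += 1

# A disjoint histogram has sensitivity 1, so Laplace(1/epsilon) noise per
# cell suffices; clamp negatives and round to get usable counts.
epsilon = 1.0
noisy = {cell: max(0, round(count + float(rng.laplace(scale=1.0 / epsilon))))
         for cell, count in counts.items()}

# Each noisy count becomes that many synthetic rows sampled from its cell.
synthetic = [(affect, int(rng.integers(lo, hi + 1)))
             for (affect, (lo, hi)), count in noisy.items()
             for _ in range(count)]
print(synthetic)
```

Small cells become unreliable under this much noise, which is exactly the problem the next slide raises.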
The problem: so much noise needs to be added that the microdata are virtually useless. Recall the input noise infusion example:

    Name     Affect          Grade
    Alex     Sad + NOISE     30 + NOISE
    Bobbie   Sad + NOISE     50 + NOISE
    Casey    Happy + NOISE   80 + NOISE
    Harper   Happy + NOISE   100 + NOISE

    Statistical Tabulation:  Students: 4 | Percent Happy: 30..70 | Average Grade: 50..80

    Advantages:
    - Tabulator need not be trusted.
    - More statistics do not pose additional privacy threats.
Outline for today's talk
1. Traditional Disclosure Limitation
2. Introducing Differential Privacy
3. Microdata
4. Conclusion
Conclusion
Differential privacy was invented in 2006 by Dwork, McSherry, Nissim and Smith. Differential privacy is just 11 years old. Today's public key cryptography was invented in 1976-1978. Remember public key cryptography in 1989?
- No standardized implementations.
- No SSL/TLS.
- No S/MIME or PGP.
- Very few people knew how to build systems that used crypto.
Other choices for policy makers:
- Where should the accuracy be spent?
- What values should be reported exactly (with no privacy)?
- What are the possible bounds (sensitivity) of a person's data? E.g., if reporting average student age, can students be 5..18 or 5..115?
- How do we convey privacy guarantees to the public?