
Data Privacy and Anonymization Techniques
Explore data privacy and anonymization techniques in this collection of slides. Learn about the importance of privacy laws, the OECD guidelines, and methods for anonymizing data that protect individuals' information while still permitting data sharing and retention. The slides cover definitions of privacy, privacy laws in the EU and the US, and privacy protection methods such as k-anonymity, l-diversity, and differential privacy, highlighting the balance between sharing data and safeguarding privacy in today's digital landscape.
Presentation Transcript
SADET Module D5: Data Privacy
Dr. Balaji Palanisamy, Associate Professor, School of Computing and Information, University of Pittsburgh (bpalan@pitt.edu)
Slides courtesy of Prof. James Joshi (University of Pittsburgh). Many slides in this lecture are adapted from the SIGMOD 2009 tutorial "Anonymized Data: Generation, Models, Usage" by Cormode & Srivastava, and from Indrajit Roy et al., NSDI 2010.
Outline
- Introduction to privacy
- Data privacy
- Anonymization techniques
- Differential privacy
What is privacy?
Privacy is hard to define. A classic definition: "Privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others." (Alan Westin, Privacy and Freedom, 1967)
OECD Guidelines on the Protection of Privacy (1980)
- Collection limitation
- Data quality
- Purpose specification
- Use limitation
- Security safeguards
- Openness
- Individual participation
- Accountability
http://www.oecd.org/document/18/0,3343,en_2649_34255_1815186_1_1_1_1,00.html#part2
Privacy Laws
- EU: comprehensive; the European Directive on Data Protection
- US: sector-specific laws:
  - HIPAA (Health Insurance Portability and Accountability Act of 1996): protects individually identifiable health information
  - COPPA (Children's Online Privacy Protection Act of 1998): addresses the collection of personal information from children under 13, including how to seek verifiable parental consent
  - GLB (Gramm-Leach-Bliley Act of 1999): requires financial institutions to provide consumers with a privacy policy notice covering what information is collected, where it is shared (affiliates and nonaffiliated third parties), how it is used, how it is protected, opt-out options, etc.
  - Fair Credit Reporting Act
Why anonymize, and how?
- For data sharing: give real(istic) data to others to study without compromising the privacy of individuals in the data
- For data retention and usage: various requirements prevent companies from retaining customer information indefinitely
Anonymization methods covered here: k-anonymity, l-diversity, differential privacy.
(Source: "Anonymized Data: Generation, Models, Usage", Cormode & Srivastava)
Tabular Data Example
A course record dataset recording scores and demographics. Releasing the (Student ID, Score) association violates an individual's privacy: Student ID is an identifier and Score is a sensitive attribute (SA).

Student ID | DOB     | Sex | ZIP   | Score
75835      | 9/28/96 | M   | 15213 | 70
14792      | 9/29/96 | F   | 15213 | 70
87593      | 1/21/95 | F   | 15212 | 80
87950      | 9/28/96 | M   | 15212 | 80
38833      | 5/25/92 | M   | 15206 | 90
68054      | 1/13/92 | F   | 15206 | 70
99316      | 7/28/92 | M   | 15207 | 80
51589      | 1/13/92 | F   | 15207 | 80
14941      | 1/13/98 | F   | 15232 | 90
22563      | 7/28/99 | M   | 15232 | 90
90652      | 1/22/99 | M   | 15231 | 90
12386      | 2/23/98 | F   | 15231 | 90
Tabular Data Example: De-Identification
Remove Student ID from the course record to create a de-identified table. Does the de-identified table preserve an individual's privacy? That depends on what other information an attacker knows.

DOB     | Sex | ZIP   | Score
9/28/96 | M   | 15213 | 70
9/29/96 | F   | 15213 | 70
1/21/95 | F   | 15212 | 80
9/28/96 | M   | 15212 | 80
5/25/92 | M   | 15206 | 90
1/13/92 | F   | 15206 | 70
7/28/92 | M   | 15207 | 80
1/13/92 | F   | 15207 | 80
1/13/98 | F   | 15232 | 90
7/28/99 | M   | 15232 | 90
1/22/99 | M   | 15231 | 90
2/23/98 | F   | 15231 | 90
Tabular Data Example: Linking Attack
Publicly available data known to the attacker:

Student ID | DOB     | Sex
75835      | 9/28/96 | M
51589      | 1/13/92 | M

De-identified private data + publicly available data: matching on DOB leaves two candidate rows for each date of birth, so the attacker cannot uniquely identify either individual's score. DOB is a quasi-identifier (QI).
Tabular Data Example: Linking Attack
Publicly available data known to the attacker:

Student ID | DOB     | Sex
75835      | 9/28/96 | M
22563      | 7/28/99 | M

De-identified private data + publicly available data: matching on (DOB, Sex) uniquely identifies one individual's score (only one row matches 7/28/99, M), but not the other's (two rows match 9/28/96, M). DOB and Sex are quasi-identifiers (QI).
Tabular Data Example: Linking Attack ZIP 15213 15232 Student ID 75835 22563 DOB 9/28/96 7/28/99 Sex M M DOB 9/28/96 9/29/96 1/21/95 9/28/96 5/25/92 1/13/92 7/28/92 1/13/92 1/13/98 7/28/99 1/22/99 2/23/98 Sex M F F M M F M F F M M F ZIP 15213 15213 15212 15212 15206 15206 15207 15207 15232 15232 15231 15231 Score 70 70 80 80 90 70 80 80 90 90 90 90 De-identified private data + publicly available data Uniquely identified both individuals scores [DOB, Sex, ZIP] is unique for lots of US residents [Sweeney 02] 12
k-Anonymization [Samarati, Sweeney 98]
A 4-anonymization of the course table:

DOB   | Sex | ZIP   | Score
96-95 | *   | 1521* | 70
96-95 | *   | 1521* | 70
96-95 | *   | 1521* | 80
96-95 | *   | 1521* | 80
92    | *   | 1520* | 90
92    | *   | 1520* | 70
92    | *   | 1520* | 80
92    | *   | 1520* | 80
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90
98-99 | *   | 1523* | 90

k-anonymity: a table T* satisfies k-anonymity with quasi-identifier QI iff each tuple in (the multiset) T*[QI] appears at least k times. This protects against the linking attack.
k-anonymization: a table T* is a k-anonymization of T if T* is a generalization/suppression of T, and T* satisfies k-anonymity. (A checker sketch follows below.)
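As a quick illustration of the definition, here is a minimal Python sketch (the function name and dict-of-rows representation are my own, not from the slides) that checks whether a table satisfies k-anonymity for a given QI:

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """Check whether every QI combination appears at least k times.

    rows: list of dicts mapping column name -> value.
    This checks the definition; it does not produce an anonymization.
    """
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in counts.values())

# Example: one QI group from the 4-anonymized table above.
table = [
    {"DOB": "96-95", "Sex": "*", "ZIP": "1521*", "Score": 70},
    {"DOB": "96-95", "Sex": "*", "ZIP": "1521*", "Score": 70},
    {"DOB": "96-95", "Sex": "*", "ZIP": "1521*", "Score": 80},
    {"DOB": "96-95", "Sex": "*", "ZIP": "1521*", "Score": 80},
]
print(is_k_anonymous(table, ["DOB", "Sex", "ZIP"], k=4))  # True
```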
k-Anonymization and Uncertainty
Intuition: a k-anonymized table T* represents the set of all possible-world tables Ti such that T* is a k-anonymization of Ti. The table T from which T* was originally derived is one of these possible worlds.
Another possible world consistent with the same 4-anonymization:

DOB     | Sex | ZIP   | Score
9/28/96 | M   | 15217 | 70
9/29/96 | F   | 15213 | 70
1/21/95 | F   | 15212 | 80
9/28/96 | F   | 15212 | 80
5/25/92 | M   | 15206 | 90
1/13/92 | F   | 15206 | 70
7/28/92 | M   | 15207 | 80
1/13/92 | F   | 15207 | 80
1/13/98 | M   | 15232 | 90
7/28/99 | M   | 15232 | 90
1/22/99 | M   | 15232 | 90
2/23/98 | F   | 15231 | 90

(Many) other tables are also possible.
Homogeneity Attack [Machanavajjhala+ 06]
Issue: k-anonymity requires each tuple in (the multiset) T*[QI] to appear k times, but says nothing about the SA values. If (almost) all SA values in a QI group are equal, privacy is lost! The problem is with the choice of grouping, not the data.
In the 4-anonymization above, the (98-99, 1523*) QI group is not OK: all four scores in the group are 90.
Homogeneity Attack [Machanavajjhala+ 06]
For some groupings there is no loss of privacy. An alternative grouping of the same data:

DOB   | Sex | ZIP   | Score
95-99 | *   | 152** | 70
95-99 | *   | 152** | 70
95-99 | *   | 152** | 80
95-99 | *   | 152** | 80
92    | *   | 1520* | 90
92    | *   | 1520* | 70
92    | *   | 1520* | 80
92    | *   | 1520* | 80
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90
95-99 | *   | 152** | 90

This grouping is OK: there are at least 3 distinct SA values in each QI group, i.e., 3-diversity with 3 distinct values per QI group.
Homogeneity and Uncertainty
Example: the attacker's public data shows (Student ID 90652, DOB 1/22/99, Sex M, ZIP 15231), which falls in the (98-99, 1523*) QI group of the 4-anonymization, where every score is 90.
Intuition: a k-anonymized table T* represents the set of all possible-world tables Ti such that T* is a k-anonymization of Ti. Lack of diversity in the SA values implies that in a large fraction of possible worlds some fact is true, which can violate privacy.
l-Diversity [Machanavajjhala+ 06]
Intuition: the most frequent SA value should not appear too often compared to the less frequent values in a QI group.
l-Diversity principle: a table is l-diverse if each of its QI groups contains at least l "well-represented" values for the SA. "Well-represented" has several instantiations:
- Distinct l-diversity (simplest definition): at least l distinct values are represented in each QI group g
- Entropy l-diversity: for each QI group g, entropy(g) >= log(l)
- Recursive (c,l)-diversity: for each QI group g with m SA values, where ri is the i-th highest frequency, r1 < c(rl + rl+1 + ... + rm)
(Checker sketches for the first two definitions follow below.)
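A minimal Python sketch of the first two definitions (function names and the row representation are my own; this checks a table rather than producing one):

```python
import math
from collections import Counter, defaultdict

def _sa_by_group(rows, qi_columns, sa_column):
    """Collect the SA values of each QI group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi_columns)].append(row[sa_column])
    return groups

def distinct_l_diversity(rows, qi_columns, sa_column):
    """Smallest number of distinct SA values over all QI groups."""
    groups = _sa_by_group(rows, qi_columns, sa_column)
    return min(len(set(values)) for values in groups.values())

def entropy_l_diversity(rows, qi_columns, sa_column):
    """Smallest exp(entropy(g)) over all QI groups; the table is
    entropy l-diverse for any l with log(l) <= that minimum entropy."""
    groups = _sa_by_group(rows, qi_columns, sa_column)
    worst = float("inf")
    for values in groups.values():
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n)
                       for c in Counter(values).values())
        worst = min(worst, math.exp(entropy))
    return worst

# Example: the "OK" grouping from the previous slide has 3 distinct
# scores in every QI group, so distinct l-diversity is 3.
table = (
    [{"DOB": "95-99", "ZIP": "152**", "Score": s}
     for s in [70, 70, 80, 80, 90, 90, 90, 90]]
    + [{"DOB": "92", "ZIP": "1520*", "Score": s}
       for s in [90, 70, 80, 80]]
)
print(distinct_l_diversity(table, ["DOB", "ZIP"], "Score"))  # 3
```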
Background: Differential Privacy
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.
Cynthia Dwork. Differential Privacy. ICALP 2006.
Differential Privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.
[Figure: a database with records A, B, C fed to a mechanism F(x), yielding an output distribution]
Cynthia Dwork. Differential Privacy. ICALP 2006.
Differential Privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.
[Figure: databases {A, B, C, D} and {A, B, C} yield similar output distributions F(x)]
Bounded risk for D if she includes her data!
Cynthia Dwork. Differential Privacy. ICALP 2006.
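For reference, the standard formal statement behind this intuition (Dwork's definition, not spelled out on these slides): a randomized mechanism K gives ε-differential privacy if for all pairs of datasets D1, D2 differing in at most one element, and for all subsets S of Range(K),

```latex
\Pr[K(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[K(D_2) \in S]
```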
Achieving Differential Privacy
A simple differentially private mechanism: the analyst asks "tell me f(x)"; the curator holding the private inputs x1, ..., xn answers with f(x) + noise.
How much noise should one add?
Achieving Differential Privacy
Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy.
Example: computing the average height of the people in this room has low sensitivity, since any single person's height does not affect the final average by much. Calculating the maximum height has high sensitivity.
Example: SUM over input elements X1, X2, X3, X4 drawn from [0, M]. The maximum effect of any single input element is M, so sensitivity = M.
Achieving Differential Privacy
A simple differentially private mechanism: the curator answers "tell me f(x)" with f(x) + Lap(Δ(f)), where Lap is the Laplace distribution and Δ(f) is the sensitivity of f.
Intuition: this is the noise needed to mask the effect of a single input. (A sketch of this mechanism, with the ε-scaling formalized on the next slides, follows below.)
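Below is a minimal Python sketch of this mechanism, assuming numpy is available; the function name and the score data are illustrative, and the noise scale Δf/ε follows the calibration given on the next slides:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return f(x) + Lap(sensitivity / epsilon).

    Adding Laplace noise with scale sensitivity/epsilon yields
    epsilon-differential privacy for a numeric query f.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: SUM over scores bounded by M = 100, so the sensitivity is 100.
scores = [70, 70, 80, 80, 90, 70, 80, 80, 90, 90, 90, 90]
noisy_sum = laplace_mechanism(sum(scores), sensitivity=100, epsilon=1.0)
print(noisy_sum)  # the true sum (980) plus Laplace noise of scale 100
```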
Sensitivity of a Function f
How much can f(DB + Me) exceed f(DB - Me)?
Recall: K(f, DB) = f(DB) + noise. The question asks: what difference must the noise obscure?
Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)|
e.g., for Count, Δf = 1.
Calibrate Noise to Sensitivity
Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)|
Theorem: to achieve ε-differential privacy, use scaled symmetric noise with density proportional to exp(-|x|/R), with R = Δf/ε.
[Figure: Laplace density with scale R, ticks at -4R, ..., 5R]
Writing f- = f(DB - Me) and f+ = f(DB + Me):
Pr[K(f, DB - Me) = t] / Pr[K(f, DB + Me) = t] = exp(-(|t - f-| - |t - f+|)/R) <= exp(Δf/R)
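A worked version of this bound (a standard argument, not verbatim from the slide), using the triangle inequality |t - f+| - |t - f-| <= |f+ - f-| <= Δf:

```latex
\frac{\Pr[K(f,\,\mathrm{DB}-\mathrm{Me}) = t]}{\Pr[K(f,\,\mathrm{DB}+\mathrm{Me}) = t]}
  = \frac{\exp(-|t - f^-|/R)}{\exp(-|t - f^+|/R)}
  = \exp\!\left(\frac{|t - f^+| - |t - f^-|}{R}\right)
  \le \exp\!\left(\frac{\Delta f}{R}\right)
  = e^{\varepsilon}
  \quad \text{for } R = \Delta f / \varepsilon .
```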