Seminar on Differential Privacy in Fall 19/20 - Insights on Data Privacy Challenges


Explore the challenges of data privacy in the context of differential privacy as discussed in the Seminar on Differential Privacy in Fall 19/20. Learn about reidentification linkage attacks, the anonymization dream, k-Anonymity, and more.

  • Seminar
  • Privacy
  • Data Privacy
  • Challenges
  • Differential Privacy




Presentation Transcript


  1. Seminar on differential privacy, fall 19/20. Haim Kaplan

  2. Data Privacy: The Problem. Individuals contribute their data to a trusted server/agency. Users send queries and receive answers; the users may be government, researchers, businesses, or a malicious adversary A.

  3. The Anonymization Dream. A trusted server turns the database into an anonymized database: it removes identifying information (name, address, SSN, ...) and replaces identities with random identifiers. The idea is hard-wired into practices, regulations, and thought, and has many uses. Reality: it fails, as documented in both the academic and public literature.

  4. Reidentification Linkage Attacks [Sweeney 2000]. Two data sources were linked:
     Anonymized GIC data: the Group Insurance Commission released patient-specific data on about 135,000 patients (roughly 100 attributes per encounter), including ethnicity, visit date, diagnosis, procedure, medication, total charge, and also ZIP code, birth date, and sex.
     Anonymized voter registration of Cambridge, MA: public records, open for inspection by anyone, containing name, address, date registered, party affiliation, date last voted, and also ZIP code, birth date, and sex.
     The shared attributes (ZIP code, birth date, sex) link the two sources.

  5. Reidentification Linkage Attacks via quasi-identifiers. William Weld (governor of Massachusetts at the time): according to the Cambridge voter list, six people had his particular birth date, of which three were men, and he was the only one in his 5-digit ZIP code!

  6. k-Anonymity [SS98, S02] Def: A database is said to be k-anonymized if every combination of values of the protected columns appears at least k times. Example: a 2-anonymized data set:

     Zip   Age
     4217  34
     4217  34
     1742  77
     1742  77
     4217  34

  7. k-Anonymity [SS98, S02] Def: A database is said to be k-anonymized if every combination of values of the protected columns appears at least k times. Example: are the following two databases 2-anonymized? (A quick way to check is sketched below.)

     Zip   Age        Zip   Age
     4217  34         4217  34
     4217  34         4217  34
     1742  34         1742  77
     1742  77         1743  77
     4217  34         4217  34

     Intuitively, k-anonymity prevents re-identification linkage attacks.
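
The definition above is easy to test mechanically. Below is a minimal sketch (not from the slides, with an illustrative record format) that counts how often each combination of protected-column values occurs; applied to the two databases above it shows that neither is 2-anonymized, since each contains value combinations that appear only once.

```python
from collections import Counter

def is_k_anonymous(rows, protected_cols, k):
    """True iff every combination of protected-column values appears at least k times."""
    counts = Counter(tuple(row[c] for c in protected_cols) for row in rows)
    return all(c >= k for c in counts.values())

# The two candidate databases from the slide, as (zip, age) records.
left  = [(4217, 34), (4217, 34), (1742, 34), (1742, 77), (4217, 34)]
right = [(4217, 34), (4217, 34), (1742, 77), (1743, 77), (4217, 34)]
to_rows = lambda pairs: [{"zip": z, "age": a} for z, a in pairs]

print(is_k_anonymous(to_rows(left), ["zip", "age"], 2))   # False: (1742, 34) and (1742, 77) are unique
print(is_k_anonymous(to_rows(right), ["zip", "age"], 2))  # False: (1742, 77) and (1743, 77) are unique
```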

  8. k-Anonymity (difficulties) Not clear how to choose the protected columns (quasi-identifiers). Not clear how to choose k.

  9. k-Anonymizing Generalization (original table on the left, generalized on the right):

     Zip   Age        Zip   Age
     4217  34         4217  34-39
     4217  39         4217  34-39
     1742  79         1***  75-79
     1691  75         1***  75-79
     4217  34         4217  34-39

     Suppression (the record that cannot be grouped is dropped):

     Zip   Age        Zip   Age
     4217  34         4217  34-39
     4217  39         4217  34-39
     1742  79         1***  75-79
     1691  75         1***  75-79
     9755  13         (suppressed)
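
As a toy illustration of the two operations (the bucketing rules here are chosen for the example and are not the slide's exact ranges), one can coarsen each record and then drop any record whose coarsened group is still too small:

```python
from collections import Counter

def generalize(row):
    """Illustrative coarsening: bucket age into 10-year ranges, mask the ZIP to its first digit."""
    lo = (row["age"] // 10) * 10
    return (str(row["zip"])[0] + "***", f"{lo}-{lo + 9}")

def anonymize(rows, k):
    """Generalize every row, then suppress rows whose generalized group is still smaller than k."""
    gen = [generalize(r) for r in rows]
    counts = Counter(gen)
    return [g for g in gen if counts[g] >= k]

rows = [{"zip": 4217, "age": 34}, {"zip": 4217, "age": 39},
        {"zip": 1742, "age": 79}, {"zip": 1691, "age": 75},
        {"zip": 4217, "age": 34}, {"zip": 9755, "age": 13}]
print(anonymize(rows, 2))  # the lone (9755, 13) record is suppressed
```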

  10. k-Anonymity -- algorithms It is NP-hard to find the optimal k-anonymization (for many notions of optimality). Approximation algorithms have been developed, and there are some publicly available tools. For references and more information, see this blog post by Desfontaines.

  11. Respond to queries Do not publish an anonymized data set; only answer queries. Do not answer queries on small subsets, only on large aggregates.

  12. Differencing attack I know you have done an HIV test during the second week of October 2019. Consider the queries: How many HIV positives are in the database that were registered by the end of September 2019? How many HIV positives are in the database today?
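
A minimal sketch of the attack on illustrative data: both queries are large aggregates, yet their difference isolates the one person registered between the two cut-off dates.

```python
from datetime import date

# Hypothetical database: one row per patient, with registration date and HIV status.
db = [
    {"registered": date(2019, 9, 12), "hiv_positive": True},
    {"registered": date(2019, 9, 20), "hiv_positive": False},
    {"registered": date(2019, 10, 9), "hiv_positive": True},   # the targeted individual
]

def count_positives(db, up_to):
    """Aggregate query: number of HIV-positive rows registered on or before `up_to`."""
    return sum(r["hiv_positive"] for r in db if r["registered"] <= up_to)

before = count_positives(db, date(2019, 9, 30))   # 1
today = count_positives(db, date.today())         # 2
# If the target is the only person registered in between, the difference reveals their status.
print("target is HIV positive" if today - before == 1 else "inconclusive")
```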

  13. Randomization Add some random noise to your aggregated answers. Example: return sum(rows that satisfy some criteria) + noise.
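
A minimal sketch of the idea; the slide does not fix a noise distribution, so Laplace noise (a standard choice in differential privacy) is assumed here.

```python
import random

def noisy_count(rows, predicate, scale=1.0):
    """Return the number of rows satisfying `predicate`, plus Laplace noise of the given scale."""
    true_count = sum(1 for r in rows if predicate(r))
    # The difference of two independent exponential samples is Laplace-distributed.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

rows = [{"age": a} for a in (34, 39, 79, 75, 34)]
print(noisy_count(rows, lambda r: r["age"] >= 70))  # roughly 2, perturbed by noise
```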

  14. Reconstruction attacks (you cannot allow too many queries) Consider a single binary attribute A, one bit per user:

     user  A
     1     1
     2     0
     3     1
     4     1
     5     0
     6     1

     Query: ask for the sum of A over some subset of the users. You can think of a query as a binary vector q ∈ {0,1}^n, and the answer is the inner product ⟨q, A⟩. For example, q = (0,1,1,0,0,1) and A = (1,0,1,1,0,1) give ⟨q, A⟩ = 2.
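
The slide's example as a one-liner, viewing a query as a 0/1 vector and the answer as an inner product:

```python
A = [1, 0, 1, 1, 0, 1]   # the binary attribute, one entry per user
q = [0, 1, 1, 0, 0, 1]   # subset query: sum of A over users 2, 3 and 6

answer = sum(qi * ai for qi, ai in zip(q, A))
print(answer)  # 2, as in the slide's example
```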

  15. Reconstruction attacks Thm: If M can answer all possible queries with additive error at most E (that is, for every q we get an answer r(q) such that |r(q) − ⟨q, A⟩| ≤ E), then we can find A′ such that ‖A − A′‖₁ ≤ 4E. Proof: Ask all possible queries and rule out every A′ such that there exists a q for which |r(q) − ⟨q, A′⟩| > E. Return an A′ that was not ruled out (there must be one, since the true A will not be ruled out). Correctness: consider the query q₀ that has ones wherever A is 0.

  16. Reconstruction attacks

     user  A  q₀
     1     1  0
     2     0  1
     3     1  0
     4     1  0
     5     0  1
     6     1  0

     We have |r(q₀) − ⟨q₀, A⟩| ≤ E, since we get answers with error at most E. We also have |r(q₀) − ⟨q₀, A′⟩| ≤ E, since A′ was not ruled out. So (triangle inequality) |⟨q₀, A⟩ − ⟨q₀, A′⟩| ≤ 2E.

  17. Reconstruction attacks

     user  A  q₀  q₁
     1     1  0   1
     2     0  1   0
     3     1  0   1
     4     1  0   1
     5     0  1   0
     6     1  0   1

     A similar argument works for q₁ (ones wherever A is 1): |r(q₁) − ⟨q₁, A⟩| ≤ E and |r(q₁) − ⟨q₁, A′⟩| ≤ E, and therefore |⟨q₁, A⟩ − ⟨q₁, A′⟩| ≤ 2E. Hence ‖A − A′‖₁ = |⟨q₀, A⟩ − ⟨q₀, A′⟩| + |⟨q₁, A⟩ − ⟨q₁, A′⟩| ≤ 4E.
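
A brute-force sketch of the argument on the slide's six-bit example: simulate an answer oracle with additive error at most E, ask every subset query, and return any candidate vector consistent with all the answers. The search is exponential in n, exactly as in the proof, so this is only feasible for tiny n.

```python
import itertools, random

n, E = 6, 1
A = [1, 0, 1, 1, 0, 1]                        # the true secret vector
queries = list(itertools.product([0, 1], repeat=n))

def oracle(q):
    """Answer a subset-sum query with additive error at most E."""
    true = sum(qi * ai for qi, ai in zip(q, A))
    return true + random.randint(-E, E)

answers = {q: oracle(q) for q in queries}

def consistent(cand):
    """Candidate survives if it disagrees with no answer by more than E."""
    return all(abs(answers[q] - sum(qi * ci for qi, ci in zip(q, cand))) <= E for q in queries)

recovered = next(c for c in itertools.product([0, 1], repeat=n) if consistent(c))
dist = sum(abs(a - c) for a, c in zip(A, recovered))
print(recovered, "L1 distance from A:", dist)  # guaranteed to be at most 4 * E
```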

  18. Summary so far We will add noise and we will restrict the # of queries BUT, we still do not have a definition of privacy

  19. Differential Privacy [Dwork McSherry Nissim Smith 06] Real world: Data → Analysis (Computation) → Outcome. Ideal world: Data with individual s's info removed → Analysis (Computation) → Outcome. The two outcomes should be (ε, δ)-similar.

  20. Differential Privacy [Dwork McSherry Nissim Smith 06 / J2016] A (randomized) algorithm M : X^n → Y satisfies (ε, δ)-differential privacy if for all datasets x, x′ ∈ X^n that differ on one entry, and for all subsets S of the outcome space Y: Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ. The parameter ε measures leakage or harm. For small ε: e^ε ≈ 1 + ε ≈ 1. Think ε ≈ 1/100 or ε ≈ 1/10. A meaningful δ should be small (δ ≪ 1/n), usually cryptographically small. Think δ ≈ 2^−60, or zero (pure differential privacy: δ = 0).
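
For a mechanism with finitely many outcomes the definition can be checked directly. The sketch below (not from the slides) handles the pure case δ = 0, where it suffices to compare the probabilities of single outcomes; randomized response, defined a few slides below, is used as the example mechanism.

```python
import math

def pure_dp_epsilon(p, q):
    """Given the output distributions of M on two neighboring databases
    (dicts: outcome -> probability, both assumed to put positive mass on every outcome),
    return the smallest eps such that Pr[M(x) = z] <= e^eps * Pr[M(x') = z]
    for every outcome z, in both directions."""
    outcomes = set(p) | set(q)
    return max(abs(math.log(p[z] / q[z])) for z in outcomes)

# Randomized response with gamma = 0.25 on neighboring inputs x = 0 and x' = 1:
p = {0: 0.75, 1: 0.25}   # distribution of RR(0)
q = {0: 0.25, 1: 0.75}   # distribution of RR(1)
print(pure_dp_epsilon(p, q))  # ln(3) ≈ 1.10, so this mechanism is ln(3)-DP
```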

  21. Why Differential Privacy? DP is a quantifiable, composable mathematical privacy guarantee. Provably resilient to known and unknown attack modes! Opens the door to algorithmic development. Theoretically, DP enables many computations with personal data while preserving personal privacy. Practicality is in the first stages of validation. Has a natural interpretation.

  22. Interpreting Differential Privacy [Kasiviswanathan and Smith J2014] Thm: Suppose you have some prior distribution on the database x, and you see an answer M(x) to a query. Then your posterior distribution on x is almost the same no matter whether user i's data is in the database or not: the two posteriors p and p̄ satisfy e^−2ε · p̄(x) ≤ p(x) ≤ e^2ε · p̄(x). Proof: homework (use Bayes' rule + the definition).

  23. Are you HIV positive? Toss a coin. If heads, tell the truth. If tails, toss another coin: if heads say yes, if tails say no.

  24. Randomized response [Warner 65] For x ∈ {0,1}: RR(x) = x with probability 1/2 + γ, and RR(x) = 1 − x with probability 1/2 − γ. Claim: setting γ = (1/2) · (e^ε − 1)/(e^ε + 1), RR(x) is ε-differentially private. Proof: The neighboring databases are x = 0 and x′ = 1, and Pr[RR(0) = 0] / Pr[RR(1) = 0] = [ (1/2)(1 + (e^ε − 1)/(e^ε + 1)) ] / [ (1/2)(1 − (e^ε − 1)/(e^ε + 1)) ] = e^ε. For small ε, e^ε ≈ 1 + ε, so we get γ ≈ ε/4.
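
A minimal sketch of the mechanism with the γ from the claim, plus an empirical check that the ratio Pr[RR(0) = 0] / Pr[RR(1) = 0] is close to e^ε:

```python
import math, random

def rr(x, eps):
    """Randomized response: report the true bit with probability 1/2 + gamma,
    its negation otherwise, with gamma = (e^eps - 1) / (2 (e^eps + 1))."""
    gamma = (math.exp(eps) - 1) / (2 * (math.exp(eps) + 1))
    return x if random.random() < 0.5 + gamma else 1 - x

eps, trials = 1.0, 200_000
p00 = sum(rr(0, eps) == 0 for _ in range(trials)) / trials
p10 = sum(rr(1, eps) == 0 for _ in range(trials)) / trials
print(p00 / p10, math.exp(eps))   # the two numbers should be close
```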

  25. Can we make use of randomized response? Individuals hold bits x₁, x₂, …, xₙ; the analyzer A is untrusted.

  26. The local model Each individual i sends the randomized report yᵢ = RR(xᵢ): y₁ = RR(x₁), y₂ = RR(x₂), …, yₙ = RR(xₙ) → Analyzer (untrusted) A.

  27. The local model The untrusted analyzer A sees only the randomized reports y₁ = RR(x₁), y₂ = RR(x₂), …, yₙ = RR(xₙ), never the true bits xᵢ.

  28. The local model Each report satisfies Pr[yᵢ = xᵢ] = 1/2 + γ, so E[yᵢ] = 1/2 + γ(2xᵢ − 1). Put ẑᵢ = (yᵢ − 1/2 + γ) / (2γ). Then E[ẑᵢ] = xᵢ by construction, but Var[ẑᵢ] = (1/4 − γ²)/(4γ²), i.e. on the order of 1/γ²: high! Summing over all users, E[Σᵢ ẑᵢ] = Σᵢ xᵢ and Var[Σᵢ ẑᵢ] = n(1/4 − γ²)/(4γ²), so the standard deviation is on the order of √n/γ. Useful when √n/γ ≪ n (i.e. n ≫ 1/γ²).
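
A minimal sketch of the whole pipeline (illustrative data): each user randomizes locally, and the untrusted analyzer debiases each report and sums. With n ≫ 1/γ² the estimate tracks the true sum, as the variance calculation above predicts.

```python
import random

def local_sum_estimate(xs, gamma):
    """Local model: each user reports y_i = x_i w.p. 1/2 + gamma (its negation otherwise);
    the untrusted analyzer debiases each report so that E[z_i] = x_i, then sums."""
    ys = [x if random.random() < 0.5 + gamma else 1 - x for x in xs]   # user-side randomization
    zs = [(y - 0.5 + gamma) / (2 * gamma) for y in ys]                 # unbiased per-user estimate
    return sum(zs)

n, gamma = 100_000, 0.05
xs = [random.randint(0, 1) for _ in range(n)]
print(sum(xs), round(local_sum_estimate(xs, gamma)))  # true sum vs. noisy estimate
```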

  29. Basic properties of differential privacy: post-processing Let M : X^n → Y be (ε, δ)-differentially private and let A : Y → T be any algorithm applied to M's output. Claim: A∘M is (ε, δ)-differentially private. Proof: Let x, x′ be neighboring databases and S a subset of T. Let S′ = {y ∈ Y : A(y) ∈ S} be the preimage of S under A. Then Pr[A(M(x)) ∈ S] = Pr[M(x) ∈ S′] ≤ e^ε · Pr[M(x′) ∈ S′] + δ = e^ε · Pr[A(M(x′)) ∈ S] + δ.

  30. Basic properties of differential privacy: basic composition [DMNS06, DKMMN06, DL09] Let M₁ be (ε₁, δ₁)-differentially private, let M₂ be (ε₂, δ₂)-differentially private (with independent randomness), and let M(x) = (M₁(x), M₂(x)). Claim: M is (ε₁ + ε₂, δ₁ + δ₂)-differentially private. Proof (for the case δ₁ = δ₂ = 0): Let x, x′ be neighboring databases and S a subset of Y₁ × Y₂. Then
     Pr[M(x) ∈ S] = Σ_{(z₁,z₂) ∈ S} Pr[M₁(x) = z₁ ∧ M₂(x) = z₂]
                  = Σ_{(z₁,z₂) ∈ S} Pr[M₁(x) = z₁] · Pr[M₂(x) = z₂]
                  ≤ Σ_{(z₁,z₂) ∈ S} e^ε₁ · Pr[M₁(x′) = z₁] · e^ε₂ · Pr[M₂(x′) = z₂]
                  = e^(ε₁+ε₂) · Pr[M(x′) ∈ S].

  31. Basic properties of DP: group privacy Let M be (ε, δ)-differentially private: for all datasets x, x′ ∈ X^n that differ on one entry and for all subsets S of the outcome space Y, Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ. Claim: for all databases x, x′ ∈ X^n that differ on t entries and for all subsets S of the outcome space Y, Pr[M(x) ∈ S] ≤ e^(tε) · Pr[M(x′) ∈ S] + t·e^(tε)·δ. (Proof idea: move from x to x′ through a chain of t databases, each differing from the next on one entry, and apply the definition t times.)
