Understanding Linkage Measures in Data Analysis

Mary Scott, 16th August 2019

Explore the concept of linkage in data analysis, including measures such as re-identifiability and joinability. Learn about innovative techniques like KHyperLogLog and K-Minimum Values to estimate uniqueness in datasets. Dive into programming for KHLL in Python and analyze uniqueness distributions post-implementation. Quantify linkage risks through re-identifiability analysis to enhance data privacy.

  • Data Analysis
  • Data Linkage
  • Data Privacy
  • Python Programming
  • Unique Data


Presentation Transcript


  1. Mary Scott, 16th August 2019

  2. INTRODUCTION TO LINKAGE
     Linkage: when an adversary finds two datasets containing enough common information about the same people that the two datasets can be merged.

  3. MEASURES OF LINKAGE
     Re-identifiability: the potential for individuals' identities to be recovered by combining a supposedly anonymous dataset with another dataset.
     Joinability: the extent to which datasets can be linked through unexpected join keys.

  4. KHYPERLOGLOG (KHLL)
     HyperLogLog (HLL):
     - Measures the number of unique customer IDs associated with each movie
     - Estimates cardinality from the maximum number of trailing zeros among the hashed values
     Low cardinality:
     - Uses a sparse representation
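
To make the trailing-zero idea concrete, here is a minimal, self-contained Python sketch of a single-register HLL-style counter with a sparse fallback for low cardinalities. It illustrates the idea on this slide only; it is not the package used in the project, and the SHA-1 hash and `sparse_limit` cut-off are arbitrary choices for the example.

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in a non-zero integer."""
    return (x & -x).bit_length() - 1

class TinyHLL:
    """Toy single-register HLL-style counter: remembers the maximum number of
    trailing zeros seen across hashed items.  A real HLL keeps many registers
    and averages them; this is only meant to illustrate the slide."""

    def __init__(self, sparse_limit: int = 256):
        self.max_zeros = 0
        self.sparse = set()               # sparse (exact) representation while cardinality is low
        self.sparse_limit = sparse_limit  # arbitrary cut-off for this example

    def add(self, item: str) -> None:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        self.max_zeros = max(self.max_zeros, trailing_zeros(h))
        if self.sparse is not None:
            self.sparse.add(h)
            if len(self.sparse) > self.sparse_limit:
                self.sparse = None        # switch to the probabilistic estimate

    def estimate(self) -> float:
        if self.sparse is not None:
            return float(len(self.sparse))   # exact while in sparse mode
        return float(2 ** self.max_zeros)    # crude Flajolet-Martin style estimate
```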

  5. KHYPERLOGLOG (KHLL)
     K Minimum Values (KMV):
     - Estimate the number of unique customer IDs associated with each movie
     - Consider a representative sample of movies of size K
     KHLL:
     - Contains K HLL structures corresponding to the K smallest hashes of the movies
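
A minimal sketch of the KHLL structure described here, reusing the TinyHLL counter from the previous sketch: keep the K movies with the smallest hash values, and attach to each one a counter of the customer IDs seen with it. The truncated 64-bit SHA-1 hash and the linear scan for the current largest hash are simplifications for the example.

```python
import hashlib

def hash64(value: str) -> int:
    """64-bit hash used to pick the K smallest movie hashes."""
    return int(hashlib.sha1(value.encode()).hexdigest()[:16], 16)

class KHLLSketch:
    """K HLL structures keyed by the K smallest movie hashes (a KMV sample),
    using the TinyHLL counter from the previous sketch for each movie."""

    def __init__(self, k: int = 100):
        self.k = k
        self.table = {}   # movie hash -> (movie_id, TinyHLL of customer IDs)

    def add(self, movie_id: str, customer_id: str) -> None:
        mh = hash64(movie_id)
        if mh in self.table:                      # movie already in the sample
            self.table[mh][1].add(customer_id)
        elif len(self.table) < self.k:            # sample not yet full
            counter = TinyHLL()
            counter.add(customer_id)
            self.table[mh] = (movie_id, counter)
        elif mh < max(self.table):                # evict the current largest hash
            del self.table[max(self.table)]
            counter = TinyHLL()
            counter.add(customer_id)
            self.table[mh] = (movie_id, counter)

    def per_movie_estimates(self) -> dict:
        """Estimated number of unique customer IDs per sampled movie."""
        return {movie: hll.estimate() for movie, hll in self.table.values()}
```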

  6. WRITING A PROGRAM FOR KHLL IN PYTHON
     - Makes use of an existing Python package for HLL, extended to implement KHLL
     - Input: a large random dataset of ~480k (movie, customer) pairs based on the 2006 Netflix Prize dataset
     - The user selects a value of K; only pairs where the movie's hash value falls below the corresponding threshold are considered
     - The corresponding customer IDs are inserted into that movie's HLL to obtain an estimate of their cardinality
     - The total cardinality estimate is the sum of the HLL lengths
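
A sketch of how such a pipeline could look, assuming a CSV of (movie_id, customer_id) pairs at a hypothetical path and the `hyperloglog` package from PyPI (the slides do not name the HLL package actually used). The slide's filter of keeping pairs whose movie "has a hash value below K" is read here as keeping roughly the K smallest movie hashes by thresholding a normalised hash.

```python
import csv
import hashlib
import hyperloglog   # pip install hyperloglog; one possible HLL package, not necessarily the one used

def movie_hash(movie_id: str) -> float:
    """Hash a movie ID into [0, 1) so 'hash value below a threshold' is well defined."""
    return int(hashlib.sha1(movie_id.encode()).hexdigest()[:16], 16) / 2**64

def build_khll(pairs_path: str, k: int, n_movies: int) -> dict:
    """Keep only pairs whose movie falls in the smallest k/n_movies fraction of the
    hash range (roughly the K smallest movie hashes) and feed each kept movie's
    customer IDs into its own HLL."""
    threshold = k / n_movies
    hlls = {}                                   # movie_id -> HLL of customer IDs
    with open(pairs_path, newline="") as f:
        for movie_id, customer_id in csv.reader(f):
            if movie_hash(movie_id) < threshold:
                hll = hlls.setdefault(movie_id, hyperloglog.HyperLogLog(0.01))
                hll.add(customer_id)
    return hlls

# Usage (hypothetical file name; 17,770 is the movie count of the original
# Netflix Prize data and should be adjusted to the actual dataset):
# hlls = build_khll("netflix_pairs.csv", k=100, n_movies=17770)
# total_estimate = sum(len(h) for h in hlls.values())   # sum of the HLL lengths, as on the slide
```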

  7. UNIQUENESS DISTRIBUTION
     - Plot the uniqueness distribution of the dataset after implementing KHLL (top)
     - Compare with the uniqueness distribution of the original dataset (bottom)
     - The example in the diagram uses K=100; I also investigated K=10 and K=1000.
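
As a rough illustration of how the two panels of such a figure could be produced (matplotlib is an assumption; the slides do not say how the diagram was drawn, and `exact_counts_per_movie` in the usage line is a hypothetical dict of true per-movie counts):

```python
import matplotlib.pyplot as plt

def plot_uniqueness(khll_estimates, exact_counts, k=100):
    """Top: per-movie uniqueness estimated by KHLL; bottom: the same quantity
    computed directly from the original dataset."""
    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.hist(list(khll_estimates), bins=50)
    top.set_title(f"Uniqueness distribution after KHLL (K={k})")
    bottom.hist(list(exact_counts), bins=50)
    bottom.set_title("Uniqueness distribution of the original dataset")
    bottom.set_xlabel("unique customer IDs per movie")
    for ax in (top, bottom):
        ax.set_ylabel("number of movies")
    fig.tight_layout()
    plt.show()

# e.g. plot_uniqueness([len(h) for h in hlls.values()], exact_counts_per_movie.values())
```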

  8. RE-IDENTIFIABILITY ANALYSIS
     Quantify linkage risks:
     - Re-identifiability by uniqueness
     - Joinability by containment
     Cumulative uniqueness distributions:
     - Estimate how much data will be lost when applying a k-anonymity threshold
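
One way to turn a cumulative uniqueness distribution into the "how much data would be lost" figure mentioned here is sketched below, under the assumption that the per-movie uniqueness estimates (e.g. the HLL lengths) play the role of anonymity-class sizes; the set of thresholds is arbitrary.

```python
def data_loss_at_threshold(class_sizes, k_threshold):
    """Fraction of records that would have to be suppressed to enforce
    k-anonymity: everything that falls in a class smaller than k_threshold."""
    total = sum(class_sizes)
    lost = sum(size for size in class_sizes if size < k_threshold)
    return lost / total if total else 0.0

def cumulative_data_loss(class_sizes, thresholds=(2, 5, 10, 20, 50, 100)):
    """Cumulative view: estimated data loss as the k-anonymity threshold grows."""
    return {k: data_loss_at_threshold(class_sizes, k) for k in thresholds}

# e.g. cumulative_data_loss([len(h) for h in hlls.values()])
```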

  9. JOINABILITY ANALYSIS
     Quantify linkage risks:
     - Re-identifiability by uniqueness
     - Joinability by containment
     Containment:
     - After KMV has been implemented, find the union of each pair of HLL sketches
     - A cardinality estimate for their intersection is calculated using the inclusion-exclusion principle
     - Containment is estimated by dividing this intersection estimate by the estimated cardinality of the relevant dataset
     - Find the sum over all K of these cardinalities
     - A value close to 0 indicates a low risk of linkage; a value close to 1 indicates a high risk
     My dataset showed low containment, which is what I expected given the small overlap.
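
The containment arithmetic on this slide reduces to inclusion-exclusion over cardinality estimates; a minimal sketch follows. The union estimate itself would come from merging the two sketches, which is omitted here, and the numbers in the usage comment are hypothetical.

```python
def containment(card_a: float, card_b: float, card_union: float) -> float:
    """Estimate the containment of dataset A in dataset B from cardinality estimates:
    inclusion-exclusion gives |A ∩ B| ≈ |A| + |B| - |A ∪ B|, and
    containment(A, B) = |A ∩ B| / |A|.  Values near 0 suggest low linkage risk,
    values near 1 suggest high linkage risk."""
    card_intersection = max(0.0, card_a + card_b - card_union)
    return card_intersection / card_a if card_a else 0.0

# Hypothetical numbers: two columns with an estimated 12,000 and 15,000 unique IDs
# and an estimated union of 26,400 give containment(12_000, 15_000, 26_400) == 0.05,
# i.e. only about 5% of the first column's IDs also appear in the second.
```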

  10. WHAT I CHANGED: CARDINALITY ESTIMATES
     I adopted a different method of estimating the cardinality from the one suggested in the literature. Instead of taking the maximum HLL length and dividing by K to estimate the density, I simply added up all of the HLL lengths. The new cardinality estimate was significantly closer to the true value: on average, the percentage error was below 2%, as opposed to over 10%.
     Next step: find a way to scalably measure linkage risk from an arbitrary combination of columns.

  11. THANK YOU FOR LISTENING Any questions?
