
Membership Inference Attacks Against Language Models
Discover how membership inference attacks target fine-tuned language models and analyze token-level membership signals. Explore methodologies and observations on privacy leakage in language modeling, uncovering disparities in prediction difficulty among tokens.
Presentation Transcript
Not All Tokens Are Equal: Membership Inference Attacks Against Fine-tuned Language Models
Changtian Song, Dongdong Zhao, Jianwen Xiang (Wuhan University of Technology)
Membership Inference Attacks
Membership inference attacks (MIAs) aim to determine whether a given data point belongs to the training set of the target machine learning model.
[Figure: member and non-member data points relative to the model's training set.]
Language Models
Language modeling: we focus on causal language models (CLMs), which predict the next token for a given sequence.
Privacy leakage of training data: as the pretraining-finetuning paradigm has become mainstream, we focus on privacy leakage during the fine-tuning phase.
Typical MIAs on LMs
Metric-based MIAs: ... Reference-free methods often perform poorly, while reference-based methods usually rest on impractical assumptions.
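To make the distinction concrete, here is a minimal sketch of the two flavors of metric-based MIA; the function names and the threshold tau are illustrative, not taken from the slides:

```python
def reference_free_mia(avg_nll_target: float, tau: float) -> bool:
    # Reference-free: an unusually low loss (average negative log-likelihood)
    # under the target model alone is taken as evidence of membership.
    return avg_nll_target < tau

def reference_based_mia(avg_nll_target: float, avg_nll_ref: float, tau: float) -> bool:
    # Reference-based: calibrate against a reference model's loss on the same
    # text -- this requires an extra, similarly-trained model, which is the
    # impractical assumption noted above.
    return (avg_nll_ref - avg_nll_target) > tau
```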
Motivation
Existing methods lack an analysis of token-level membership signals: different tokens may contribute unequally to MIA.
[Figure: per-token probabilities p_t(x) from the target model for the example text "The 15th Miss Universe Thailand pageant was held at Royal Paragon Hall...", varying widely across tokens (from 0.01 to 0.93).]
Methodology
[Figure: per-token probabilities p_t(x) for the same text under the target model and a reference model, converted by grouping and scaling into token-level scores s_t(x).]
We use the scaled likelihood ratio s_t(x) between the target model and a reference model as the token-level membership signal, and furthermore assign a different weight to each token.
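A minimal sketch of how the per-token signals can be extracted with Hugging Face transformers; the fine-tuned model path is a placeholder, and the paper's grouping and quantile-scaling steps are omitted here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_log_probs(model, tokenizer, text):
    """Per-token log-probabilities log p(x_t | x_<t) under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("path/to/finetuned-gpt2").eval()  # placeholder path
reference = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The 15th Miss Universe Thailand pageant was held at Royal Paragon Hall"
# Token-level (log-)likelihood ratio between target and reference model.
ratio = token_log_probs(target, tokenizer, text) - token_log_probs(reference, tokenizer, text)
```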
Observation
For tokens that are more difficult to predict, members and non-members exhibit greater distributional differences.
Weight-Enhanced Membership Inference Attack (WEL-MIA)
Divide tokens into groups based on their prediction difficulty, and apply quantile scaling to the token-level likelihood ratios. Then assign a different weight to each group while introducing a smoothing function f. Our attack model combines the smoothed, weighted group-level signals into a single membership score (a sketch follows below).
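The transcript omits the formula itself; the following is a hypothetical sketch of the structure described above (quantile-based grouping by difficulty, a smoothing function, per-group weights), with tanh standing in for the unspecified smoothing f and all names invented here:

```python
import numpy as np

def wel_mia_score(ratios, difficulty, weights, num_groups=4):
    """Hypothetical attack score: group tokens by prediction difficulty,
    smooth each group's mean likelihood ratio, and combine the groups
    with per-group weights (see the optimal-weight slide)."""
    edges = np.quantile(difficulty, np.linspace(0.0, 1.0, num_groups + 1))
    groups = np.searchsorted(edges[1:-1], difficulty, side="right")  # 0..num_groups-1
    score = 0.0
    for g in range(num_groups):
        r = ratios[groups == g]
        if r.size:
            score += weights[g] * np.tanh(r.mean())  # tanh stands in for f(.)
    return score  # classify as "member" when the score exceeds a threshold
```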
Optimal Weights
We maximize the mean discrepancy between member and non-member samples, which turns the computation of the weights into an optimization problem. Combining this with our earlier observation, a closed-form formula for the optimal weights can be derived (one plausible formalization is sketched below).
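One plausible formalization, assumed here rather than copied from the slides: maximize the weighted gap between the mean group-level signals of members and non-members under a norm constraint; by Cauchy-Schwarz the optimum weights each group in proportion to its gap, so harder-to-predict groups (with larger gaps, per the observation above) receive larger weights:

```latex
\max_{\mathbf{w}}\ \sum_{g=1}^{G} w_g\,\bigl(\mu_g^{\mathrm{mem}}-\mu_g^{\mathrm{non}}\bigr)
\quad \text{s.t.}\quad \lVert \mathbf{w}\rVert_2 = 1
\qquad\Longrightarrow\qquad
w_g^{\star}\ \propto\ \mu_g^{\mathrm{mem}}-\mu_g^{\mathrm{non}}
```

Here \mu_g^{\mathrm{mem}} and \mu_g^{\mathrm{non}} denote the mean signal of group g over member and non-member samples, respectively.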
Experiments
Settings
Models: GPT-2-Base, GPT-2-Medium, GPT-2-Large; GPT-Neo-125M; Pythia-160M
Datasets: AGNews, Wikitext-103, XSum
Ablation Study
Decomposing WEL-MIA; the size of the target dataset.
Ablation Study
Text length; the ratio of members to non-members.
Defense
Protect the target model with the Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm; the standard update step is recalled below. We then evaluate the performance of WEL-MIA under DP-SGD.
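For reference, the standard DP-SGD update (the textbook formulation of Abadi et al., not copied from the slides): per-example gradients are clipped to norm C and Gaussian noise of scale \sigma C is added before averaging over the batch \mathcal{B}:

```latex
\tilde{g} \;=\; \frac{1}{|\mathcal{B}|}\Bigl(\,\sum_{i\in\mathcal{B}} \operatorname{clip}\bigl(\nabla_{\theta}\,\ell(\theta, x_i),\, C\bigr) \;+\; \mathcal{N}\bigl(0,\ \sigma^{2} C^{2} \mathbf{I}\bigr)\Bigr),
\qquad
\operatorname{clip}(g, C) \;=\; g \cdot \min\Bigl(1,\ \frac{C}{\lVert g\rVert_2}\Bigr)
```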
Generalizability to the Visual Modality
A simple attempt: by treating image patches as tokens in language models, we adapt WEL-MIA to masked image modeling, such as Masked Autoencoders (MAE).
Limitations
The generalization of WEL-MIA remains to be explored; better defense strategies against WEL-MIA could still be designed.