Baselines

We provide four baseline Membership Inference Attacks (MIAs) as reference implementations: Loss, MinK%Prob, SURP, and PAC.

Each baseline applies a distinct technique to infer whether a given sample was part of a model’s training dataset. For detailed explanations of the underlying methods, we refer the reader to the corresponding original publications.

The reference implementations were developed by Cosmin Vasilescu, Ísak Jónsson, and Roham Koohestani as part of the Research Project course (CSE3000) at TU Delft. Each attack extends an abstract MIAttack class. You may reuse this structure or any part of the provided implementations; however, doing so is not mandatory.

👉GitHub repository👈
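To illustrate the structure (the interface below is a sketch with assumed names; the exact class definition lives in the repository linked above), the Loss baseline might be organized roughly as follows:

```python
from abc import ABC, abstractmethod

import torch


class MIAttack(ABC):
    """Abstract base class for a membership inference attack.

    Hypothetical sketch for illustration only; see the linked
    repository for the actual interface.
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    @abstractmethod
    def score(self, text: str) -> float:
        """Return a membership score (higher = more likely a member)."""


class LossAttack(MIAttack):
    """Loss baseline: training members tend to incur lower loss."""

    def score(self, text: str) -> float:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            loss = self.model(**inputs, labels=inputs["input_ids"]).loss
        return -loss.item()  # negate so higher means "more likely member"
```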

Dataset

The dataset was created by sampling 500,000 files from both The Stack V2 and The Heap. For these 500,000 files, a bag-of-words (BOW) model was trained and combined with a logistic regression classifier to predict whether a file belongs to The Heap or The Stack V2. The final datasets uploaded for the competition consist of all the files that were misclassified by this BOW approach. This allows us to remove “easy” samples, which can be identified as either members or non-members based on a temporal shift, such as dates from the future (relative to the creation time of The Stack V2), changed or newly released libraries, or differences in licenses included in code comments. This is similar to the approach presented here.
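As a rough sketch of this filtering step, assuming a scikit-learn pipeline (the actual features, hyperparameters, and train/test protocol are not specified here), the selection of misclassified files could look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: the real inputs are the 500,000 sampled files per source.
files = ["print('hello')", "import os", "fn main() {}", "SELECT 1;"]
labels = [1, 1, 0, 0]  # hypothetical: 1 = The Heap, 0 = The Stack V2

# Bag-of-words features + logistic regression classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(files)
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Keep only the files the BOW classifier gets wrong; these lack the
# "easy" temporal signals and form the competition dataset.
predictions = clf.predict(X)
hard_samples = [f for f, y, p in zip(files, labels, predictions) if y != p]
```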

As a result, the provided dataset yields worse performance for all MIAs we have tested (compared to a randomly sampled dataset); however, it also ensures that the attacks rely on more than just a temporal shift in the data.

The evaluation dataset is generated in the same way; however, it consists of only 5,000 samples per language.

👉HuggingFace Dataset👈

Evaluation

Participants operate in a white‑box membership inference setting. The target models are open‑weights: their architectures and parameters are fully accessible. Participants may inspect and utilize any information available through the HuggingFace Transformers library, including (but not limited to) model weights, layer outputs and activations, logits, attention maps, and other runtime state.
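For example, a minimal sketch of pulling these signals through the standard Transformers API (the model name is a placeholder, not the actual target model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder2-3b"  # placeholder; not the held-out target
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("def hello():\n    print('hi')", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

logits = outputs.logits                # per-token vocabulary logits
hidden_states = outputs.hidden_states  # tuple: one tensor per layer
attentions = outputs.attentions        # tuple: attention maps per layer
```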

Participants are permitted to perform arbitrary forward and backward computations on the target model and may leverage any signals derived from these computations, such as intermediate activations, gradients, Hessians, or other parameter‑dependent quantities. Temporary, non‑persisting instrumentation, such as forward or backward hooks, probes, or local copies of weights used for analysis, is allowed.
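Continuing the sketch above, a temporary forward hook and a backward pass might be used as follows (the module path `model.model.layers` assumes a typical decoder-only layout and varies by architecture):

```python
# Reuses `model` and `inputs` from the previous sketch.
activations = {}

def save_activation(module, args, output):
    # Decoder layers return a tuple; the first element is the hidden states.
    activations["layer0"] = output[0].detach()

# NOTE: module path is architecture-dependent (assumed decoder-only layout).
handle = model.model.layers[0].register_forward_hook(save_activation)

outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # populates p.grad for every parameter

# Example parameter-dependent signal: the global gradient norm.
sq = sum(p.grad.float().pow(2).sum() for p in model.parameters() if p.grad is not None)
grad_norm = sq.sqrt().item()

handle.remove()    # tear down the hook so no instrumentation persists
model.zero_grad()  # clear gradients after analysis
```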

To assess generalization and robustness, evaluation is performed on a held‑out target model and a held‑out evaluation set that are not accessible during development. Participants may tune and validate their methods on the provided development models and datasets, but final scores are computed exclusively on the unseen model and dataset. You may assume the target model is a HuggingFace CausalLM built with Transformers version 4.52.xx, running in bfloat16, and smaller than 7B parameters. The dataset is a HuggingFace dataset similar to the one provided.

Submissions are evaluated using the Area Under the Receiver Operating Characteristic curve (AUC‑ROC). AUC‑ROC is a standard metric for membership inference attacks and is widely used in prior work. We choose AUC‑ROC over metrics such as True Positive Rate at a fixed False Positive Rate (TPR@xFPR) because there is no established consensus on what constitutes an acceptable false‑positive rate in this setting. By integrating performance across all possible decision thresholds, AUC‑ROC provides a threshold‑agnostic and comprehensive measure of attack effectiveness.
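Concretely, given per-sample membership scores and ground-truth labels, AUC‑ROC can be computed with scikit-learn (toy values below):

```python
from sklearn.metrics import roc_auc_score

# labels: 1 = member, 0 = non-member; scores: higher = predicted member
labels = [1, 0, 1, 0, 1]
scores = [-0.8, -2.1, -0.5, -1.9, -1.2]  # e.g. negated loss values

auc = roc_auc_score(labels, scores)
print(f"AUC-ROC: {auc:.4f}")
```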