Background

LLMs often exhibit memorization, reproducing verbatim content from their training data. This poses a challenge for evaluation: if test data was seen during training, performance metrics become unreliable due to data contamination, which can falsely suggest better generalization.

For models with publicly available training data, contamination can be reduced by deduplicating the test set against the training corpus, a strategy we used in building The Heap, a benchmark dataset for evaluating LLMs on code tasks.
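As an illustration, the sketch below shows one simple way to perform such deduplication: exact matching on hashes of whitespace-normalized file contents. The directory layout and function names are assumptions made for this example, not the actual pipeline used to build The Heap; in practice, near-duplicate detection (e.g. MinHash) is usually layered on top of exact matching.

```python
# Minimal exact-deduplication sketch: drop test files whose normalized
# contents also appear in a public training corpus. Paths and the
# normalization scheme are illustrative assumptions.
import hashlib
from pathlib import Path

def content_hash(text: str) -> str:
    # Collapse whitespace so trivially reformatted copies still collide.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_test_set(train_dir: str, test_dir: str) -> list[Path]:
    # Hash every training file once, then keep only test files whose
    # hash never occurs in the training corpus.
    train_hashes = {
        content_hash(p.read_text(errors="ignore"))
        for p in Path(train_dir).rglob("*") if p.is_file()
    }
    return [
        p for p in Path(test_dir).rglob("*")
        if p.is_file()
        and content_hash(p.read_text(errors="ignore")) not in train_hashes
    ]
```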

However, this approach is not feasible for closed-source models. In such cases, we need alternative methods to assess whether a test file may have been part of the training data.

One approach is to adapt membership inference techniques, which exploit a model’s tendency to memorize training data in order to determine whether a specific input was part of the training set. Applied here, these methods estimate the likelihood that a test file was seen during training.
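As a concrete illustration, the sketch below implements the simplest such signal, a loss-threshold attack: a low average next-token loss (i.e. low perplexity) on a file is treated as weak evidence that the model saw it during training. The model name and threshold are placeholder assumptions, and stronger attacks (e.g. Min-K% probability or reference-model calibration) follow the same overall pattern.

```python
# Loss-threshold membership inference sketch using Hugging Face transformers.
# The model name and threshold are placeholders, not the competition setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"  # example target code model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def avg_token_loss(code: str) -> float:
    # Mean next-token cross-entropy of the file under the model;
    # passing labels=input_ids makes the model return it directly.
    ids = tokenizer(code, return_tensors="pt", truncation=True, max_length=2048)
    return model(**ids, labels=ids["input_ids"]).loss.item()

def predict_member(code: str, threshold: float = 1.5) -> bool:
    # Files the model reproduces easily score below the (tuned) threshold
    # and are flagged as likely members of the training data.
    return avg_token_loss(code) < threshold
```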

Goal

Develop techniques that can detect data contamination in language models that do not publicly release their training data.

Objective

This competition invites participants to develop and improve techniques for membership inference in LLMs4Code. Given a dataset composed of a mixture of files, some belonging to a target model’s training data and others not, participants will design techniques to classify each file accordingly.

Submissions will be evaluated on accuracy against a held-out test set, and on their ability to generalize by applying the techniques to a held-out test model. We also provide a baseline against which contestants can compare their results.
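For reference, a minimal sketch of the accuracy computation, assuming each submission assigns a boolean member/non-member label to every file; the data structures here are illustrative, not the official evaluation harness.

```python
# Accuracy of predicted membership labels against ground truth.
# Dictionaries map file identifiers to True (member) / False (non-member).
def accuracy(predictions: dict[str, bool], ground_truth: dict[str, bool]) -> float:
    correct = sum(predictions[f] == label for f, label in ground_truth.items())
    return correct / len(ground_truth)
```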

Impact

We will use the most accurate and generalizable method from the competition to audit The Heap dataset, detecting and labeling potential training-data overlaps with closed models. The result is a stronger, contamination-free benchmark for the community. The competition will:

  • Encourage the development of practical tools to assess contamination in data-opaque models.
  • Spark discussion on best practices for LLM4Code research with undisclosed training data.
  • Promote transparency and reproducibility in LLM4Code evaluation.