Motivated by the hypothesis that a VLM can recognize images from its training data, we propose DIS-CO, a novel approach for inferring whether copyrighted content was included during the model's development.
Large Vision-Language Models (VLMs) are trained on vast datasets scraped from the web, often with little transparency regarding data sources. This raises ethical and legal concerns, particularly when copyrighted content is suspected of having been used.
So how can we verify whether a VLM has seen a specific copyrighted work without access to its training data?
In a black-box setting, where model attributes like token probabilities are inaccessible, the most reliable way to determine whether a model has seen specific content is to make it reveal knowledge that goes beyond general understanding.
To achieve this, we need a task designed with two key properties:

1. It should be difficult to solve using general knowledge alone, so that a model which never saw the content rarely succeeds.
2. It should become substantially easier for a model that was actually trained on the content, so that unusually high performance signals exposure.
If you are still unsure about this idea, you'll find an interactive challenge below where you can try to perform the same task the models did. Can you score better than GPT-4o? 👀
Whether you completed the full quiz or just explored a few examples, we hope the key idea came across: this task isn't easy! And for the cases you got right, think about it: had you seen those movies before? 🤔
What we found particularly intriguing is that even with these highly neutral images, the models still manage to recognize them pretty regularly.
This surprising ability led us to investigate the phenomenon more systematically, ultimately shaping our approach and leading to the development of DIS-CO.
We begin by assembling our MovieTection benchmark, a collection of 14,000 movie frames categorized into main and neutral types to introduce varying levels of difficulty. For each frame, we also generate a corresponding caption using the Qwen2-VL 7B model.
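To make the captioning step concrete, here is a minimal sketch of generating a caption for a single frame with Qwen2-VL 7B through Hugging Face `transformers`; the prompt wording, file path, and decoding settings are illustrative assumptions, not necessarily the exact setup used to build MovieTection.

```python
# Sketch: caption one movie frame with Qwen2-VL 7B (Hugging Face transformers).
# The prompt text and generation settings below are illustrative assumptions.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "frames/movie_x/frame_001.jpg"},  # hypothetical path
        {"type": "text", "text": "Describe this movie frame in one detailed sentence."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding so only the generated caption remains.
caption = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```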
Models are then queried with both the image frames and their corresponding captions, generating free-form predictions for each. We can then refine our detection of suspect content by excluding cases where the image-based predictions overlap with caption-based ones.
Our rationale is that when a model correctly identifies a movie based solely on its caption, the frame is likely highly representative of the movie: even a textual description provides enough clues for the model to make an accurate guess from general knowledge acquired through public data (e.g., OpenSubtitles), rather than from memorization.
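To illustrate the overlap-removal idea, below is a minimal sketch of one way the ⌊DIS-CO⌋ filtering could be implemented; the `match` helper, the per-movie aggregation, and the choice of denominator are our assumptions rather than the paper's exact procedure.

```python
def filtered_recognition_rate(image_preds, caption_preds, true_title, match):
    """Sketch of a ⌊DIS-CO⌋-style score: credit image-based hits only on frames
    whose caption-based prediction did NOT already name the correct movie.

    image_preds / caption_preds: free-form model answers, one per frame.
    match(pred, title): returns True if the prediction names the target movie
    (e.g., fuzzy title matching) -- left abstract here on purpose.
    """
    kept = [
        img_pred
        for img_pred, cap_pred in zip(image_preds, caption_preds)
        if not match(cap_pred, true_title)  # caption alone was enough -> exclude frame
    ]
    if not kept:
        return 0.0
    hits = sum(match(img_pred, true_title) for img_pred in kept)
    return hits / len(kept)  # assumption: rate over the remaining frames
```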
Finally, when deciding whether a suspect movie was part of a model's training data, we compare its task performance against a baseline value reflecting general movie knowledge. If a movie's recognition rate is significantly higher than expected, especially after removing cases where captions alone were sufficient for identification, it strongly suggests the model was exposed to that content during training.
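As a rough sketch of this decision step, the snippet below flags a movie when its frame-level recognition rate is significantly above a baseline rate of general movie knowledge; the one-sided binomial test, the significance level, and the example numbers are illustrative choices, not the paper's exact rule.

```python
from scipy.stats import binomtest

def is_suspect(n_recognized, n_frames, baseline_rate, alpha=0.05):
    """Flag a movie as a likely training-set member if its recognition rate is
    significantly higher than the baseline expected from general movie knowledge.
    Illustrative one-sided binomial test; the actual decision rule may differ."""
    test = binomtest(n_recognized, n_frames, p=baseline_rate, alternative="greater")
    return test.pvalue < alpha

# Example with made-up numbers: 30 of 80 frames recognized vs. a 10% baseline.
print(is_suspect(30, 80, baseline_rate=0.10))  # True -> flagged as suspect
```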
| Method | GPT-4o | Gemini-1.5 Pro | LLaMA-3.2 90B | Qwen2-VL 72B |
|---|---|---|---|---|
| Captions | 0.128 | 0.079 | 0.078 | 0.075 |
| MCQA | 0.721 | 0.550 | 0.540 | 0.617 |
| ⌊DIS-CO⌋ | 0.226 | 0.152 | 0.134 | 0.122 |
| DIS-CO | 0.338 | 0.209 | 0.176 | 0.176 |

*Note.* The two DIS-CO variants differ in whether the overlap between caption-based and image-based predictions is removed: ⌊DIS-CO⌋ applies this removal, while DIS-CO considers all frames.
@misc{duarte2025disco,
title={{DIS-CO: Discovering Copyrighted Content in VLMs Training Data}},
author={André V. Duarte and Xuandong Zhao and Arlindo L. Oliveira and Lei Li},
year={2025},
eprint={2502.17358},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.17358},
}