André V. Duarte

I am a second-year PhD student in the Dual Degree program between Carnegie Mellon University and Instituto Superior Técnico, supervised by Prof. Lei Li and Prof. Arlindo Oliveira.
Before starting my PhD, I completed my undergraduate studies in Electrical and Computer Engineering and a master's in Data Science, both at Instituto Superior Técnico.

My research is mainly focused on the Security and Privacy of GenAI models, with an emphasis on Membership Inference Attacks. In other words: trying to figure out whether specific data was used to train a certain model!

I'm always happy to connect and collaborate. If you work in a similar area or see potential for us to work together, feel free to reach out :)


News
I will be giving a talk on Copyrighted Data Detection, where I will introduce DIS-CO, our latest work on the topic!
Priberam Machine Learning Seminar - University of Lisbon: Instituto Superior Técnico
Mar 2025
I will attend EMNLP 2024 to present LumberChunker. Come check out our poster!
Poster Session: Information Retrieval and Text Mining 3, 16:00-17:30, November 14.
Nov 2024
I will be at ICML 2024 to present DE-COP, our work on detecting copyrighted books in LLM training data 📚
Poster Session: Hall C 4-9, 13:30-15:00, 25 July.
Jul 2024
Education
  • Carnegie Mellon University and Instituto Superior Técnico
    Dual Degree Ph.D. in Language Technologies and Computer Science
    Feb. 2024 - present
    Currently in Lisbon, Portugal
  • Instituto Superior Técnico
    MSc. in Data Science and Engineering
    Sep. 2020 - Dec. 2022
    Lisbon, Portugal
  • Instituto Superior Técnico
    BSc. in Electrical and Computer Engineering
    Sep. 2017 - Dec. 2020
    Lisbon, Portugal
Selected Publications
DIS-CO: Discovering Copyrighted Content in VLMs Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Under review, 2025

DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting, even when the frames are highly challenging, which suggests prior exposure during training.

LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Findings EMNLP 2024

LumberChunker is a document segmentation method using LLMs to enhance retrieval by creating contextually coherent, variable-sized content chunks.

DE-COP: Detecting Copyrighted Content in Language Models Training Data

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum

DE-COP is a novel method to identify copyrighted content present in LLM training datasets. It works by showing that a model can recognize exact text excerpts if they were seen during training. It is applicable to models with or without logit outputs.
