Publications - André V. Duarte

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2025

DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting. This holds even for highly challenging frames, which suggests prior exposure during training.

[Website] [Paper] [Code] [Dataset]

LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Findings of EMNLP 2024

LumberChunker is a document segmentation method that uses LLMs to split text into contextually coherent, variable-sized chunks, improving retrieval.

[Paper] [Code] [Dataset]

DE-COP: Detecting Copyrighted Content in Language Models Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum

DE-COP is a novel method to identify copyrighted content present in LLM training datasets. It works by showing that a model can recognize exact text excerpts if they were seen during training. It applies to models with or without logit outputs.

[Paper] [Code] [Dataset]

Improving Embeddings for High-Accuracy Transformer-Based Address Matching Using a Multiple in-Batch Negatives Loss

André V. Duarte, Arlindo L. Oliveira

ICMLA 2023

We solve the Portuguese address matching problem using a custom BERT model trained from scratch and subsequently fine-tuned with an in-batch negatives loss.

[Paper]

Improving Address Matching using Siamese Transformer Networks

André V. Duarte, Arlindo L. Oliveira

EPIA 2023

We explore the use of pre-trained BERT models to perform address matching on Portuguese postal data.

[Paper]
