2025

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Under review, 2025

DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting, even when the frames are highly challenging, which suggests prior exposure during training.

2024

LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Findings EMNLP 2024

LumberChunker is a document segmentation method using LLMs to enhance retrieval by creating contextually coherent, variable-sized content chunks.

DE-COP: Detecting Copyrighted Content in Language Models Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum

DE-COP is a method for identifying copyrighted content in LLM training data. It works by showing that a model can recognize exact text excerpts if they were seen during training. It applies to models with or without logit outputs.

2023

Improving Embeddings for High-Accuracy Transformer-Based Address Matching Using a Multiple in-Batch Negatives Loss

André V. Duarte, Arlindo L. Oliveira

ICMLA 2023

We solve the Portuguese address matching problem using a custom BERT model trained from scratch and subsequently fine-tuned with an in-batch negatives loss.

Improving Address Matching using Siamese Transformer Networks

André V. Duarte, Arlindo L. Oliveira

EPIA 2023

We explore the use of pre-trained BERT models to perform address matching on Portuguese postal data.
