André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Under review, 2025
DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting, even when the frames are highly challenging, which suggests prior exposure during training.
André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira
Findings of EMNLP 2024
LumberChunker is a document segmentation method using LLMs to enhance retrieval by creating contextually coherent, variable-sized content chunks.
André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum
DE-COP is a method for identifying copyrighted content in LLM training datasets. It works by showing that a model can recognize exact text excerpts if they were seen during training. It applies to models with or without logit outputs.
André V. Duarte, Arlindo L. Oliveira
ICMLA 2023
We solve the Portuguese address matching problem using a custom BERT model trained from scratch and subsequently fine-tuned with an in-batch negatives loss.
André V. Duarte, Arlindo L. Oliveira
EPIA 2023
We explore the use of pre-trained BERT models to perform address matching on Portuguese postal data.