André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li
Proceedings of ICML 2026
Human peer reviews differ from AI-generated ones in claim diversity, even after LLM refinement. We operationalize this insight in Sem-Detect, a classifier for peer-review authorship detection robust to LLM-refined human reviews.
João D. S. Marques*, André Vicente Duarte*, André Mendes Marques de Carvalho, Gil Rocha, Bruno Martins, Arlindo L. Oliveira (* equal contribution)
Proceedings of EMNLP Industry Track 2025
We present a real-world deployment of AI-assisted evaluation in two government funding programs, showing that AI can reduce reviewer workload and accelerate large-scale application processing while maintaining low error rates.
André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li
Under review, 2025
RECAP is a new method for extracting memorized data from LLMs. It uses iterative feedback and jailbreak prompts. Evaluated on the EchoTrace benchmark of books and papers, RECAP greatly outperforms previous methods in extracting verbatim passages.
André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Proceedings of ICML 2025
DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting, even when the frames are highly challenging, which suggests prior exposure during training.

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira
Findings of EMNLP 2024
LumberChunker is a document segmentation method using LLMs to enhance retrieval by creating contextually coherent, variable-sized content chunks.
André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum
DE-COP is a novel method to identify copyrighted content present in LLM training datasets. It works by showing that a model can recognize exact text excerpts if they were seen during training. It is applicable to models with or without logit outputs.

André V. Duarte, Arlindo L. Oliveira
ICMLA 2023
We solve the Portuguese address matching problem using a custom BERT model trained from scratch, subsequently fine-tuned with an in-batch negatives loss.

André V. Duarte, Arlindo L. Oliveira
EPIA 2023
We explore the use of pre-trained BERT models to perform address matching for Portuguese postal data.