2026

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2026

Human peer reviews differ from AI-generated ones in claim diversity, even after LLM refinement. We operationalize this insight in Sem-Detect, a classifier for peer-review authorship detection robust to LLM-refined human reviews.

2025

Leveraging LLMs to Streamline the Review of Public Funding Applications

João D. S. Marques*, André Vicente Duarte*, André Mendes Marques de Carvalho, Gil Rocha, Bruno Martins, Arlindo L. Oliveira (* equal contribution)

Proceedings of EMNLP Industry Track 2025

We present a real-world deployment of AI-assisted evaluation in two government funding programs, showing that AI can reduce reviewer workload and accelerate large-scale application processing while maintaining low error rates.

RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li

Under review, 2025

RECAP is a new method for extracting memorized data from LLMs. It uses iterative feedback and jailbreak prompts. Evaluated on the EchoTrace benchmark of books and papers, RECAP greatly outperforms previous methods in extracting verbatim passages.

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2025

DIS-CO identifies copyrighted content in VLM training data by showing that models can link movie frames to their titles in a free-form text generation setting, even when the frames are highly challenging, which suggests prior exposure during training.

2024

LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Findings of EMNLP 2024

LumberChunker is a document segmentation method that uses LLMs to improve retrieval by creating contextually coherent, variable-sized content chunks.

DE-COP: Detecting Copyrighted Content in Language Models Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Proceedings of ICML 2024; Best Scientific Paper Award at Responsible AI Forum

DE-COP is a novel method to identify copyrighted content present in LLM training datasets. It works by showing that a model can recognize exact text excerpts if they were seen during training. It is applicable to models with or without logit outputs.

2023

Improving Embeddings for High-Accuracy Transformer-Based Address Matching Using a Multiple in-Batch Negatives Loss

André V. Duarte, Arlindo L. Oliveira

ICMLA 2023

We solve the Portuguese address matching problem using a custom BERT model trained from scratch and subsequently fine-tuned with an in-batch negatives loss.

Improving Address Matching using Siamese Transformer Networks

André V. Duarte, Arlindo L. Oliveira

EPIA 2023

We explore the use of pre-trained BERT models to perform address matching on Portuguese postal data.
