If you find LumberChunker useful in your research, please consider citing:
@inproceedings{duarte-etal-2024-lumberchunker,
    title = "{L}umber{C}hunker: Long-Form Narrative Document Segmentation",
    author = "Duarte, Andr{\'e} V. and
      Marques, Jo{\~a}o DS and
      Gra{\c{c}}a, Miguel and
      Freire, Miguel and
      Li, Lei and
      Oliveira, Arlindo L.",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.377/",
    doi = "10.18653/v1/2024.findings-emnlp.377",
    pages = "6473--6486",
    abstract = "Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content{'}s semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 ``needle in a haystack'' type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37{\%} in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro."
}