LumberChunker: Long-Form Narrative Document Segmentation

Blog created by Raymond Jiang³

¹INESC-ID / Instituto Superior Técnico, ²NeuralShift AI, ³Carnegie Mellon University

We present LumberChunker, a method for semantically segmenting long-form narrative documents that achieves state-of-the-art retrieval performance while requiring significantly fewer embedding computations than existing approaches.

Introduction

Long-form narratives (novels, memoirs, transcripts) don’t break cleanly at fixed token counts. If we split them poorly, Retrieval-Augmented Generation (RAG) pipelines surface the wrong passages and LLMs start guessing. Structure-only chunkers (fixed tokens, paragraphs) are fast but blind to scene and topic flow; purely similarity-driven heuristics can fragment dialogue, miss coreference, and drift as context grows.

So how do we preserve the story’s flow and still keep chunking practical?

The Key Idea

Treat segmentation as boundary finding. LumberChunker reads a rolling window of consecutive paragraphs (up to a token count θ) and asks an LLM to return the earliest paragraph where the content clearly shifts from what came before. That paragraph marks a boundary; the next chunk starts there. Repeat until the document ends.

This simple prompt design yields reliable, human-like cuts because:

  • Low false positives when no shift exists. Without a genuine topic/scene turn, the model rarely flags a boundary, so chunks don’t fracture mid-thought.
  • High hit-rate when a shift does exist. When the narrative pivots (new scene, entity, objective), the model consistently identifies the earliest turning point, preserving coherence.

In practice, sweeping the window size shows a sweet spot around θ ≈ 550 tokens: enough context to recognize transitions, but not so much that the signal gets diluted.


The LumberChunker Method

LumberChunker treats document segmentation as a boundary-finding problem. Instead of cutting by fixed tokens or paragraphs, we ask an LLM to read a rolling window of consecutive paragraphs and return the first paragraph where the content clearly shifts. That cut becomes the end of the current chunk and the start of the next, which yields variable-length, semantically coherent segments that track narrative flow.

1) Document Paragraph Extraction

Cleanly split the book into paragraphs and assign stable IDs (p1, p2, …). This preserves the document’s natural discourse units and gives us safe candidate boundaries.

Example: From a novel, we extract:

p1: "The morning sun filtered through the dusty windows..."
p2: "She walked slowly to the door, hesitating..."
p3: "Meanwhile, across town, Detective Morrison reviewed the case files..."
p4: "The previous night's events had left him puzzled..."

Each paragraph gets a unique ID for tracking boundaries.
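
To make this step concrete, here is a minimal Python sketch of paragraph extraction. The blank-line splitting rule and the book.txt file name are assumptions for illustration; the paper's exact cleaning pipeline may differ.

# Minimal sketch: split a cleaned book into paragraphs with stable IDs.
# Blank-line splitting is an assumption; real books may need extra cleaning.
import re

def extract_paragraphs(text: str) -> list[dict]:
    blocks = re.split(r"\n\s*\n", text)                    # blank-line separated blocks
    blocks = [b.strip() for b in blocks if b.strip()]      # drop empty blocks
    return [{"id": f"p{i + 1}", "text": b} for i, b in enumerate(blocks)]

if __name__ == "__main__":
    book_text = open("book.txt", encoding="utf-8").read()  # hypothetical input file
    paragraphs = extract_paragraphs(book_text)
    print(len(paragraphs), paragraphs[0]["id"], paragraphs[0]["text"][:60])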

2) ID Grouping (Gi) for the LLM

Build a rolling group Gi by appending paragraphs until the group’s length reaches a token budget θ. This provides enough context for the model to judge when a topic/scene actually shifts.

Example: With θ = 550 tokens, we build:

G1 = [p1, p2, p3, p4, p5, p6]

This window contains ~550 tokens spanning multiple paragraphs, giving the LLM enough context to detect where the narrative shifts—perhaps from the woman's scene to the detective's investigation.
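
A small sketch of the grouping step, assuming the paragraph list produced above and a crude whitespace token count; a real tokenizer (for example, the embedding model's own) would be a better fit in practice.

# Sketch: build the rolling group Gi by appending paragraphs until the running
# token count reaches theta. Whitespace counting is a rough stand-in for a real
# tokenizer and is an assumption made only for this illustration.
def count_tokens(text: str) -> int:
    return len(text.split())                     # crude proxy; swap in a real tokenizer

def build_group(paragraphs: list[dict], start: int, theta: int = 550) -> list[dict]:
    group, total = [], 0
    for para in paragraphs[start:]:
        group.append(para)
        total += count_tokens(para["text"])
        if total >= theta and len(group) >= 2:   # keep at least two candidate paragraphs
            break
    return group

G1 = build_group(paragraphs, start=0)            # paragraphs from the previous sketch
print([p["id"] for p in G1])                     # e.g. ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']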

3) Gemini Query

Prompt the model with the paragraphs in Gi and ask it to return the ID of the first paragraph (excluding the group's opening one) where the content clearly changes relative to what came before. Use that ID as the chunk boundary; start the next group at that paragraph and repeat to the end of the book.

Example: Given G1 = [p1, p2, p3, p4, p5, p6], the LLM responds:

"The content shifts at p3, where we transition from the woman's morning routine to Detective Morrison's investigation."

4) Answer Extraction

We extract p3 as the boundary. This creates:

Chunk 1: [p1, p2] (the woman's scene)
Next group starts at p3 (Detective Morrison's investigation)
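
Steps 3 and 4 together amount to a loop: build a group, ask the model for the shift, cut there, and continue. The sketch below reuses extract_paragraphs and build_group from the earlier sketches; the prompt wording, the ask_llm wrapper, and the regex answer extraction are assumptions, not the paper's exact Gemini prompt or parser.

# Sketch of the query-and-extract loop (steps 3 and 4), under the assumptions above.
import re

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM client (e.g. a Gemini call) here")

def find_boundary_id(group: list[dict]) -> str:
    numbered = "\n".join(f"{p['id']}: {p['text']}" for p in group)
    prompt = (
        "Below are consecutive paragraphs of a book. Excluding the first one, "
        "return the ID of the first paragraph whose content clearly shifts "
        "away from what came before.\n\n" + numbered
    )
    reply = ask_llm(prompt)
    match = re.search(r"p\d+", reply)             # pull a paragraph ID out of the reply
    return match.group(0) if match else group[1]["id"]

def lumber_chunk(paragraphs: list[dict], theta: int = 550) -> list[list[dict]]:
    position = {p["id"]: i for i, p in enumerate(paragraphs)}
    chunks, start = [], 0
    while start < len(paragraphs):
        group = build_group(paragraphs, start, theta)
        if len(group) < 2:                        # tail of the book: emit it and stop
            chunks.append(group)
            break
        boundary = position[find_boundary_id(group)]
        boundary = max(boundary, start + 1)       # guard against a degenerate answer
        chunks.append(paragraphs[start:boundary]) # current chunk ends just before the shift
        start = boundary                          # next group begins at the shift
    return chunks

Because each paragraph is sent to the model roughly once, the cost grows about linearly with the number of paragraphs, which is what keeps the method practical for whole books.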

Choosing the context size (θ)

We sweep θ ∈ [450, 1000] tokens and find that θ ≈ 550 consistently maximizes retrieval quality: large enough for context, small enough to keep the model focused on the current turn in the story.

(Window sizes tested: 450, 550, 650, and 1000 tokens.)

Why this works

Narratives are organized around topic turns, scene changes, and discourse shifts, not uniform token distances. By explicitly locating the earliest meaningful change inside a window, LumberChunker produces variable-length chunks that keep entities and events intact, improving retrieval quality downstream.

Granularity that “feels right”

Average chunk sizes (tokens) on the 100-book corpus: Paragraph ~79, Semantic ~185, Recursive ~399, LumberChunker ~334, Proposition ~12. Even with θ=550, the model frequently finds earlier shifts, yielding compact, on-topic chunks and reducing “lost-in-the-middle” effects.

Table 10: Average number of tokens per chunk and total number of chunks after segmenting each book in GutenQA.

Method               Avg. Tokens / Chunk   Total Chunks
Semantic Chunking    185                   191,059
Paragraph-Level      79                    248,307
Recursive Chunking   399                   31,787
Proposition-Level    12                    914,493
LumberChunker        334                   36,917

GutenQA: Movie-style results, but for books

To evaluate chunking where it matters, we introduce GutenQA, a benchmark of 100 carefully cleaned public-domain books paired with 3,000 needle-in-a-haystack QA items (short, verifiable answers). This lets us measure passage retrieval precisely and then see how that lift translates into RAG QA.

Retrieval: LumberChunker leads ⭐

Across DCG@k and Recall@k, LumberChunker ranks first. At k=20, it reaches DCG ≈ 62.1 and Recall ≈ 77.9%, outperforming strong baselines like Recursive, Paragraph, Semantic, and Proposition chunking.

Retrieval Performance Comparison (DCG@k on GutenQA)

Method               k=1     k=2     k=5     k=10    k=20
Semantic Chunking    29.50   35.31   40.67   43.14   44.74
Paragraph-Level      36.54   42.11   45.87   47.72   49.00
Recursive Chunking   39.04   45.37   50.66   53.25   54.72
HyDE                 33.47   39.74   45.06   48.14   49.92
Proposition-Level    36.91   42.42   44.88   45.65   46.19
LumberChunker        48.28   54.86   59.37   60.99   62.09

Recall@k follows the same ordering, with LumberChunker reaching ≈ 77.9% at k = 20.
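
For readers who want to reproduce this kind of table, here is one way DCG@k and Recall@k could be computed in the single-gold-passage setting of GutenQA. The toy ranks, the matching criterion, and the percentage scaling are assumptions for illustration, not the paper's exact evaluation script.

# Sketch: DCG@k and Recall@k when each question has exactly one gold passage.
# `rank` is the 1-based position of the gold chunk among retrieved chunks,
# or None if it was not retrieved at all.
import math

def dcg_at_k(rank: int | None, k: int) -> float:
    # One relevant item per question: gain 1 at its rank, discounted by log2.
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0

def recall_at_k(rank: int | None, k: int) -> float:
    return 1.0 if rank is not None and rank <= k else 0.0

ranks = [1, 3, None, 2, 15]                      # toy ranks for five questions
for k in (1, 2, 5, 10, 20):
    dcg = 100 * sum(dcg_at_k(r, k) for r in ranks) / len(ranks)
    rec = 100 * sum(recall_at_k(r, k) for r in ranks) / len(ranks)
    print(f"k={k:>2}  DCG@{k}={dcg:5.2f}  Recall@{k}={rec:5.1f}")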

Downstream QA: targeted retrieval beats giant context

When we plug the chunks into a standard RAG pipeline (on autobiographies), RAG-LumberChunker surpasses RAG-Recursive and trails only RAG-Manual (hand-segmented ground truth). Notably, an "open-book" non-retrieval setting with huge context windows still underperforms RAG: targeted passages beat raw context size.

Closer to human boundaries

Compared against manual chunks, LumberChunker achieves ROUGE-L ≈ 0.709 vs. ≈ 0.689 for Recursive chunks, evidence that its boundaries align more closely with how readers naturally perceive topic shifts.

Table 2: Average ROUGE-L scores of methods compared to Manual Chunks.

Method             Avg. ROUGE-L
Recursive Chunks   0.689
LumberChunker      0.709
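
As a rough way to run a similar comparison yourself, the sketch below scores automatic chunks against manual ones with the rouge-score package (pip install rouge-score). Pairing each manual chunk with its best-matching automatic chunk is an illustrative choice, not necessarily the exact protocol used in the paper.

# Sketch: average ROUGE-L of automatic chunks against manual reference chunks.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def avg_rouge_l(manual_chunks: list[str], auto_chunks: list[str]) -> float:
    per_chunk = []
    for ref in manual_chunks:
        # Best-matching automatic chunk for each manual chunk (illustrative pairing).
        best = max(scorer.score(ref, hyp)["rougeL"].fmeasure for hyp in auto_chunks)
        per_chunk.append(best)
    return sum(per_chunk) / len(per_chunk)

manual = ["The morning sun filtered through the dusty windows. She walked slowly to the door."]
auto = ["The morning sun filtered through the dusty windows.", "She walked slowly to the door."]
print(f"average ROUGE-L: {avg_rouge_l(manual, auto):.3f}")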

Conclusion

LumberChunker shows that LLM-guided narrative segmentation can strike a rare balance: preserving story flow without demanding massive compute or retraining. By letting an LLM detect where meaning actually shifts inside rolling paragraph windows, we obtain chunks that feel natural to humans and perform better for machines.

On the GutenQA benchmark, LumberChunker consistently improves retrieval and downstream QA over traditional fixed-size and recursive methods, approaching the quality of manual, human-curated segmentations. Its efficiency, roughly linear in paragraph count, makes it practical for large-scale preprocessing in RAG pipelines.

As RAG systems continue to scale, effective document segmentation will remain a key frontier. LumberChunker offers a practical step forward—one that respects both meaning and efficiency, making long-form understanding more accessible to modern language models.

Citation

If you find LumberChunker useful in your research, please consider citing:

@inproceedings{duarte-etal-2024-lumberchunker,
    title = "{L}umber{C}hunker: Long-Form Narrative Document Segmentation",
    author = "Duarte, Andr{\'e} V.  and
      Marques, Jo{\~a}o DS  and
      Gra{\c{c}}a, Miguel  and
      Freire, Miguel  and
      Li, Lei  and
      Oliveira, Arlindo L.",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.377/",
    doi = "10.18653/v1/2024.findings-emnlp.377",
    pages = "6473--6486",
    abstract = "Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content{'}s semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 ``needle in a haystack'' type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37{\%} in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro."
}