Reproducibility study of document chunking strategies for retrieval, accepted at SIGIR 2026
This repository provides a pipeline to chunk, encode, and evaluate documents using multiple chunking strategies and embedding models, reproducing and extending the results from the original paper.
Results that could not fit in the paper due to page constraints are provided below.
| Dataset | Method | Pre-C Orig (Jina-v3) | Pre-C Repro (Jina-v3) | Con-C Orig (Jina-v3) | Con-C Repro (Jina-v3) | Pre-C Orig (Nomic) | Pre-C Repro (Nomic) | Con-C Orig (Nomic) | Con-C Repro (Nomic) |
|---|---|---|---|---|---|---|---|---|---|
| SciFact | Fixed-size | 0.718 | 0.717 | 0.732 | 0.730 | 0.707 | 0.703 | 0.706 | 0.707 |
| | Sentence | 0.714 | 0.716 | 0.732 | 0.734 | 0.713 | 0.715 | 0.714 | 0.712 |
| | Semantic | 0.712 | 0.710 | 0.724 | 0.723 | 0.704 | 0.704 | 0.705 | 0.705 |
| NFCorpus | Fixed-size | 0.356 | 0.355 | 0.367 | 0.368 | 0.353 | 0.348 | 0.353 | 0.351 |
| | Sentence | 0.358 | 0.357 | 0.366 | 0.367 | 0.347 | 0.350 | 0.355 | 0.355 |
| | Semantic | 0.361 | 0.360 | 0.366 | 0.367 | 0.353 | 0.351 | 0.303 | 0.353 |
| FiQA | Fixed-size | 0.333 | 0.468 | 0.338 | 0.479 | 0.370 | 0.386 | 0.383 | 0.387 |
| | Sentence | 0.304 | 0.433 | 0.339 | 0.480 | 0.351 | 0.362 | 0.377 | 0.380 |
| | Semantic | 0.303 | 0.440 | 0.337 | 0.322 | 0.348 | 0.356 | 0.369 | 0.266 |
| TRECCOVID | Fixed-size | 0.730 | 0.739 | 0.772 | 0.766 | 0.729 | 0.758 | 0.750 | 0.750 |
| | Sentence | 0.724 | 0.714 | 0.765 | 0.769 | 0.742 | 0.747 | 0.768 | 0.779 |
| | Semantic | 0.747 | 0.747 | 0.762 | 0.699 | 0.743 | 0.743 | 0.761 | 0.730 |
(Figure: correlation between chunk size and retrieval performance for all four models.)
This project evaluates on two dataset groups:
- Narrative (in-document retrieval): GutenQA_Paragraphs
- BEIR (in-corpus retrieval, processor `beir`): `trec-covid`, `nfcorpus`, `fiqa`, `arguana`, `scidocs`, `scifact`
- Download the datasets and place them under `src/data/`.
- For GutenQA, create a subfolder `src/data/GutenQA`.
- For BEIR, unzip each dataset into `src/data/` (e.g. `src/data/nfcorpus/`).
Install dependencies:

```shell
pip install -r requirements.txt
```

The repository is organized into three core modules:
| Module | Description |
|---|---|
| Chunker | Splits documents into chunks using various strategies |
| Encoder | Transforms chunks into embeddings (Regular or Late/Contextualized) |
| Evaluator | Computes ranking metrics (nDCG, DCG, Recall) for evaluation |
Supported chunking strategies:

| Category | Methods |
|---|---|
| Structure-based | ParagraphChunker, SentenceChunker, FixedSizeChunker |
| Semantic/LLM | SemanticChunker, LumberChunker, Proposition |
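To illustrate the simplest of these strategies, here is a minimal sketch of fixed-size chunking. It splits on whitespace tokens with an optional overlap; the repository's `FixedSizeChunker` may differ (e.g. it may count model-tokenizer tokens rather than words), so treat this as an assumption-laden reference, not the actual implementation.

```python
def fixed_size_chunks(text: str, chunk_size: int = 128, overlap: int = 0):
    """Split `text` into windows of `chunk_size` whitespace tokens.

    Consecutive windows share `overlap` tokens (must be < chunk_size).
    This is a sketch of the idea behind a fixed-size chunker, not the
    repository's FixedSizeChunker itself.
    """
    tokens = text.split()
    step = chunk_size - overlap  # stride between window starts
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```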
Supported embedding models:

| Model | Short Name |
|---|---|
| jinaai/jina-embeddings-v2-small-en | Jina-v2 |
| jinaai/jina-embeddings-v3 | Jina-v3 |
| nomic-ai/nomic-embed-text-v1 | Nomic |
| intfloat/multilingual-e5-large-instruct | E5-Large |
Run all modules end-to-end using the provided shell scripts:
```shell
# Step 1: Chunk documents
nohup ./run_chunker.sh > run_chunker.log 2>&1 < /dev/null &

# Step 2: Encode chunks and queries
nohup ./run_encoder.sh > run_encoder.log 2>&1 < /dev/null &

# Step 3: Evaluate retrieval performance
nohup ./run_evaluator.sh > run_evaluator.log 2>&1 < /dev/null &
```

Note: Each step depends on the outputs of the previous one. Before running the Encoder, update `QUERY_ID_BY_DATASET` in `run_encoder.sh` with the query IDs generated by the Chunker. Similarly, verify encoder run IDs before running the Evaluator.
Splits documents into smaller units for encoding and retrieval. Supports strategies from LumberChunker and Late Chunking.
Example:
```shell
python -m src.runner chunker \
  --processor_name beir \
  --dataset_name nfcorpus \
  --data_folder src/data \
  --chunker ParagraphChunker \
  --output_folder src/outputs \
  --query
```

Arguments:
| Argument | Description |
|---|---|
| `--processor_name` | Data processor (`GutenQA`, `beir`) |
| `--dataset_name` | Dataset to process |
| `--data_folder` | Path to dataset folder |
| `--chunker` | Chunking strategy (e.g. `ParagraphChunker`, `SentenceChunker`) |
| `--output_folder` | Output directory for chunks |
| `--query` | Enable query mode (saves queries alongside chunks) |
| `--sample` | Optional: number of documents to sample |
Transforms text chunks into vector embeddings. Two encoding strategies are supported:
- RegularEncoder: Encodes each chunk independently.
- LateEncoder: Concatenates chunks from the same document, encodes jointly, then splits embeddings back (contextualized chunking).
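The pooling step behind the LateEncoder can be sketched as follows. Token embeddings are computed once over the whole concatenated document (so every token sees full document context), then mean-pooled back into one vector per chunk span. In the real pipeline the per-token vectors come from a long-context transformer; here `token_embs` is just a stand-in list of vectors, so this only illustrates the split-back step.

```python
def late_pool(token_embs, chunk_spans):
    """Mean-pool per-token vectors into one vector per (start, end) span.

    token_embs: list of equal-length vectors, one per token of the
    concatenated document. chunk_spans: token index ranges, end-exclusive.
    """
    chunk_vecs = []
    for start, end in chunk_spans:
        span = token_embs[start:end]
        dim = len(span[0])
        # Average each dimension across the chunk's tokens.
        chunk_vecs.append([sum(v[d] for v in span) / len(span)
                           for d in range(dim)])
    return chunk_vecs
```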
Example:
```shell
python -m src.runner encoder \
  --encoder_name RegularEncoder \
  --dataset_name nfcorpus \
  --chunk_run_id SentenceChunker \
  --backbone JinaaiV2 \
  --model_name jinaai/jina-embeddings-v2-small-en \
  --batch_size 10 \
  --output_folder src/outputs \
  --query \
  --query_run_id 20250921-183217-beir-8f3497a6
```

Arguments:
| Argument | Description |
|---|---|
| `--encoder_name` | Encoder class (`RegularEncoder`, `LateEncoder`) |
| `--dataset_name` | Dataset name |
| `--chunk_run_id` | ID of the chunking run to encode |
| `--backbone` | Embedding backbone (e.g. `JinaaiV2`, `JinaaiV3`) |
| `--model_name` | HuggingFace model ID |
| `--batch_size` | Texts per batch |
| `--output_folder` | Output directory (default: `src/outputs`) |
| `--query` | Enable query encoding mode |
| `--query_run_id` | ID of the query run to encode |
To add a custom encoder, create a class under `src/encoders/` inheriting from `BaseEncoder`. For custom embedding models, add a class under `src/models/embedding/` inheriting from `BaseEmbeddingModel`.
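A custom encoder might look like the sketch below. The actual `BaseEncoder` interface in `src/encoders/` is not documented in this README, so the stand-in base class and its single `encode()` hook are assumptions made purely for illustration; the toy `HashEncoder` just shows where encoding logic would plug in.

```python
class BaseEncoder:  # stand-in for the real src/encoders/ base class (assumed API)
    def encode(self, texts):
        raise NotImplementedError


class HashEncoder(BaseEncoder):
    """Toy encoder: maps each text to a small deterministic vector.

    A real subclass would call an embedding model here.
    """

    def __init__(self, dim: int = 8):
        self.dim = dim

    def encode(self, texts):
        vecs = []
        for text in texts:
            v = [0.0] * self.dim
            # Fold byte values into a fixed-size vector, scaled to [0, 1].
            for i, byte in enumerate(text.encode("utf-8")):
                v[i % self.dim] += byte / 255.0
            vecs.append(v)
        return vecs
```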
Computes ranking-based metrics (nDCG, DCG, Recall) to benchmark chunking and encoding configurations.
Example:
```shell
python -m src.runner eval \
  --chunk_run_id 20250902-171846-ParagraphChunker-GutenQA-c78b6f37 \
  --query_run_id 20250902-171849-GutenQA-ed7846b6 \
  --chunk_embedding_run_id 20250902-175213-RegularEncoder-Qwen3-Qwen3-Embedding-0.6B-77072742 \
  --query_embedding_run_id 20250902-175652-RegularEncoder-Qwen3-13d17bce \
  --dataset_name GutenQA \
  --scope document \
  --source_path src/outputs
```

Arguments:
| Argument | Description |
|---|---|
| `--chunk_run_id` | ID of the chunking run |
| `--query_run_id` | ID of the query run |
| `--chunk_embedding_run_id` | ID of the chunk embedding run |
| `--query_embedding_run_id` | ID of the query embedding run |
| `--dataset_name` | Dataset used for evaluation |
| `--scope` | `document` (within-document) or `corpus` (full corpus) |
| `--source_path` | Path to outputs (default: `src/outputs`) |
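For reference, the headline metric can be computed as below. This is the textbook nDCG@k with a log2 discount; the Evaluator's exact gain/discount variant is not spelled out in this README, so treat this as a sketch rather than the repository's implementation.

```python
import math


def dcg_at_k(ranked_ids, rels, k):
    """DCG@k: graded relevance discounted by log2 of the 1-based rank + 1."""
    return sum(rels.get(doc, 0) / math.log2(i + 2)
               for i, doc in enumerate(ranked_ids[:k]))


def ndcg_at_k(ranked_ids, rels, k):
    """nDCG@k: DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg_at_k(ranked_ids, rels, k) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a highly relevant document below a marginal one lowers the score.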




