Every RAG tutorial gives you the same advice: split your documents into ~512-character chunks, use a decent embedding model, and you’re done.
I believed this for months. Then I built a tool to actually measure retrieval quality — and discovered that for specialized domains, this “best practice” can cost you 7% recall. That’s the difference between a helpful AI assistant and one that misses critical information.
The Often-Overlooked Problem
When you build a RAG system, you’re making decisions that directly impact whether your LLM gets the right context:
- Chunk size: How big should each text segment be?
- Top-k: How many chunks should you retrieve?
- Embedding model: Which one actually works for your data?
Most teams guess. They copy settings from a tutorial, ship to production, and hope for the best. There’s no easy way to know if your retrieval layer is actually working — until users complain about bad answers.
I wanted to change that.
What I Built
RagTune is a CLI tool that benchmarks your RAG retrieval layer with real metrics:
- Recall@K: What percentage of relevant documents did you actually retrieve?
- MRR: How well-ranked is the first relevant result?
- Coverage: What percentage of queries found any relevant document?
It’s framework-agnostic — works with any vector store, any embedding model. The goal is measurement, not magic.
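For concreteness, here is what those three metrics compute, in plain Python. These are the standard definitions and the function names are mine for illustration; RagTune's exact implementation details (for example, whether MRR is cut off at k) may differ, so treat this as a sketch rather than the tool's source.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the ground-truth relevant doc IDs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def coverage(all_retrieved, all_relevant, k=5):
    """Fraction of queries for which at least one relevant doc made the top k."""
    hits = sum(
        1 for retrieved, relevant in zip(all_retrieved, all_relevant)
        if set(retrieved[:k]) & set(relevant)
    )
    return hits / len(all_relevant) if all_relevant else 0.0
```

Each function takes the ranked list of retrieved document IDs for a query plus the set of ground-truth relevant IDs.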
The Experiment
I ran RagTune against two datasets:
- HotpotQA — General knowledge Wikipedia paragraphs (200 multi-hop questions)
- CaseHOLD — Legal case holdings from Harvard Law (500 queries)
For each dataset, I tested three chunk sizes: 256, 512, and 1024 characters.
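To make "chunk size" concrete: the documents are split into fixed-size character windows before embedding. Below is a minimal sketch of that kind of chunker; the chunk_text name and the overlap parameter are mine, not RagTune's, and whether the tool respects sentence boundaries is a detail I'm not asserting here.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows, optionally overlapping."""
    step = max(chunk_size - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# The same document becomes very different retrieval units:
# chunk_text(doc, 256) tends to isolate one idea per chunk, while
# chunk_text(doc, 1024) packs several distinct passages into a single embedding.
```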
ragtune compare --collections legal-256,legal-512,legal-1024 \
--queries ./queries.json --top-k 5
The Results
General Knowledge (HotpotQA):
All chunk sizes work great. Differences are negligible. The tutorials were right — for Wikipedia-style text.
Legal Text (CaseHOLD):
Completely different story:
- Overall recall dropped from the ~99% seen on HotpotQA to 66%
- Chunk size caused a 7% swing in retrieval quality
- The “safe default” of 1024 was the worst performer
Why Legal Text Breaks the Rules
Legal language operates as a specialized register (dense, precise, and self-referential). Unlike conversational text, it assumes readers already know the statutory context. What works for Wikipedia fails when coherence depends on that shared frame:
- Small word changes have big legal implications
- Similar-sounding clauses mean different things
- Generic embeddings miss domain nuance
When you use large chunks on legal text, you’re forcing the embedding model to summarize complex concepts into a single vector. It can’t. Smaller chunks (256) preserve the precision that legal retrieval requires.
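You can probe this yourself with any general-purpose embedding model. The snippet below uses sentence-transformers and all-MiniLM-L6-v2 purely as a convenient local stand-in (the article's experiments used nomic-embed-text and text-embedding-3-small), and the two holdings are invented examples that differ by a single word with opposite legal meaning.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Invented example holdings: lexically near-identical, legally opposite.
holding_a = ("The court held that the statute of limitations was tolled "
             "during the defendant's absence from the state.")
holding_b = ("The court held that the statute of limitations was not tolled "
             "during the defendant's absence from the state.")

emb = model.encode([holding_a, holding_b])
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"cosine similarity: {cos:.3f}")
# A score close to 1.0 despite the opposite holdings is exactly the kind of
# nuance a generic embedder can gloss over.
```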
The Takeaway
For production RAG, measure. Don’t guess.
Your domain might behave like Wikipedia — or it might behave like legal contracts. The only way to know is to benchmark with real queries and ground truth.
Update: What Happens at Scale?
After running these experiments, I wondered: do these findings hold at production scale?
I scaled HotpotQA from 400 documents to 5,000 documents (12.5x larger) and re-ran the same tests. The results surprised me:
| Scale | Best Chunk Size | Recall@5 |
| --- | --- | --- |
| 400 docs | 256 (smallest) | 0.995 |
| 5,000 docs | 1024 (largest) | 0.893 |
The optimal chunk size flipped. At small scale, smaller chunks won. At production scale, larger chunks performed best.
Even more striking: recall dropped from 99% to 89% — a 10% degradation just from having more documents competing for attention in the vector space.
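The distractor effect is easy to reproduce in miniature with purely synthetic vectors. The toy simulation below is not the HotpotQA benchmark and its numbers only illustrate the direction of the effect: the relevant document's similarity to the query stays the same, but the chance that some random distractor outranks it grows with corpus size.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_recall_at_5(n_docs: int, dim: int = 64, n_queries: int = 500) -> float:
    """One relevant doc per query; every other doc is a random distractor."""
    hits = 0
    for _ in range(n_queries):
        relevant = rng.standard_normal(dim)
        relevant /= np.linalg.norm(relevant)
        noise = rng.standard_normal(dim)
        noise /= np.linalg.norm(noise)
        query = relevant + 2.3 * noise          # a noisy view of the relevant doc
        query /= np.linalg.norm(query)
        distractors = rng.standard_normal((n_docs - 1, dim))
        distractors /= np.linalg.norm(distractors, axis=1, keepdims=True)
        # The relevant doc makes the top 5 only if fewer than 5 distractors beat it.
        if np.sum(distractors @ query > query @ relevant) < 5:
            hits += 1
    return hits / n_queries

print(toy_recall_at_5(400))     # small corpus: recall@5 stays high
print(toy_recall_at_5(5_000))   # 12.5x more distractors: recall@5 drops
```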
What this means:
- Tutorial-scale benchmarks can be misleading
- The “distractor” problem is real — more documents = more noise
- You need to benchmark at your production scale, not demo scale
I’ll explore this scale phenomenon in more depth in a future article. For now, the lesson is clear: benchmark at the scale you’ll actually operate at.
A note on methodology: These results are from single runs on specific datasets. The small-scale tests used nomic-embed-text (Ollama); the 5K scale test used OpenAI text-embedding-3-small. Your results may vary depending on your data distribution, embedding model, and query patterns. The code and datasets are open source — I encourage you to reproduce and extend these experiments on your own data.
Try It Yourself
RagTune is open source: github.com/metawake/ragtune
Install (30 seconds):
brew install metawake/tap/ragtune
# or: curl -sSL https://raw.githubusercontent.com/metawake/ragtune/main/install.sh | bash
Reproduce article results (10 minutes):
git clone https://github.com/metawake/ragtune
cd ragtune
# Generate the datasets
cd benchmarks/casehold-500 && pip install datasets && python prepare.py
cd ../hotpotqa-1k && python prepare.py
cd ../..
# Run the legal benchmark with different chunk sizes
ragtune ingest ./benchmarks/casehold-500/corpus --collection legal-256 --chunk-size 256
ragtune ingest ./benchmarks/casehold-500/corpus --collection legal-1024 --chunk-size 1024
ragtune simulate --collection legal-256 --queries ./benchmarks/casehold-500/queries.json
ragtune simulate --collection legal-1024 --queries ./benchmarks/casehold-500/queries.json
You should see the 7% recall difference between chunk sizes on legal text.