I Built a RAG Tuning Tool and Discovered Intuition Fails on Legal Text

Every RAG tutorial gives you the same advice: split your documents into ~512-character chunks, use a decent embedding model, and you’re done.

I believed this for months. Then I built a tool to actually measure retrieval quality — and discovered that for specialized domains, this “best practice” can cost you 7% recall. That’s the difference between a helpful AI assistant and one that misses critical information.

The Often-Overlooked Problem

When you build a RAG system, you’re making decisions that directly impact whether your LLM gets the right context:

  • Chunk size: How big should each text segment be?
  • Top-k: How many chunks should you retrieve?
  • Embedding model: Which one actually works for your data?

Most teams guess. They copy settings from a tutorial, ship to production, and hope for the best. There’s no easy way to know if your retrieval layer is actually working — until users complain about bad answers.

I wanted to change that.

What I Built

RagTune is a CLI tool that benchmarks your RAG retrieval layer with real metrics:

  • Recall@K: What percentage of relevant documents did you actually retrieve?
  • MRR (Mean Reciprocal Rank): How highly does the first relevant result rank?
  • Coverage: What percentage of queries found any relevant document?

It’s framework-agnostic — works with any vector store, any embedding model. The goal is measurement, not magic.
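
These metrics are simple to compute once you have, for each query, the ranked list of retrieved chunk IDs and the set of IDs judged relevant. Here’s a minimal sketch in Python, handy for sanity-checking the numbers from any retrieval tool; the function names and data shapes below are illustrative, not RagTune’s internals:

# Minimal sketch of the three metrics. `results` maps query ID -> ranked list
# of retrieved doc IDs; `relevant` maps query ID -> set of relevant doc IDs.

def recall_at_k(retrieved, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved[:k]) & relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(results, relevant, k=5):
    recalls, rrs, covered = [], [], 0
    for qid, retrieved in results.items():
        recalls.append(recall_at_k(retrieved, relevant[qid], k))
        rrs.append(reciprocal_rank(retrieved, relevant[qid]))
        covered += any(doc_id in relevant[qid] for doc_id in retrieved)
    n = len(results)
    return {
        "recall@k": sum(recalls) / n,
        "mrr": sum(rrs) / n,
        "coverage": covered / n,  # share of queries with at least one relevant hit
    }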

The Experiment

I ran RagTune against two datasets:

  1. HotpotQA — General knowledge Wikipedia paragraphs (200 multi-hop questions)
  2. CaseHOLD — Legal case holdings from Harvard Law (500 queries)

For each dataset, I tested three chunk sizes: 256, 512, and 1024 characters.

ragtune compare --collections legal-256,legal-512,legal-1024 \
--queries ./queries.json --top-k 5

The Results

General Knowledge (HotpotQA):

All chunk sizes work great. Differences are negligible. The tutorials were right — for Wikipedia-style text.

Legal Text (CaseHOLD):

Completely different story:

  • Overall recall fell to 66%, versus 99% on HotpotQA
  • Chunk size caused a 7% swing in retrieval quality
  • The “safe default” of 1024 was the worst performer

Why Legal Text Breaks the Rules

Legal language operates as a specialized register (dense, precise, and self-referential). Unlike conversational text, it assumes readers already know the statutory context. What works for Wikipedia fails when coherence depends on that shared frame:

  • Small word changes have big legal implications
  • Similar-sounding clauses mean different things
  • Generic embeddings miss domain nuance

When you use large chunks on legal text, you’re forcing the embedding model to summarize complex concepts into a single vector. It can’t. Smaller chunks (256) preserve the precision that legal retrieval requires.
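
For concreteness, “chunk size” here means a fixed character window over the document text. The sketch below is a generic character chunker with a small overlap (not necessarily how RagTune splits text internally); it shows why a 1024-character window folds several distinct legal points into one embedding, while 256-character windows give each point its own vector:

def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into fixed-size character windows with a small overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size].strip()]

# Stand-in for a long case holding; real holdings pack many distinct clauses together.
holding = "The court held that the statute applies only where notice was given. " * 100
print(len(chunk_text(holding, chunk_size=256)))   # many small, focused chunks
print(len(chunk_text(holding, chunk_size=1024)))  # a handful of large, mixed chunks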

The Takeaway

For production RAG, measure. Don’t guess.

Your domain might behave like Wikipedia — or it might behave like legal contracts. The only way to know is to benchmark with real queries and ground truth.

Update: What Happens at Scale?

After running these experiments, I wondered: do these findings hold at production scale?

I scaled HotpotQA from 400 documents to 5,000 documents (12.5x larger) and re-ran the same tests. The results surprised me:

Scale         Best Chunk Size    Recall@5
400 docs      256 (smallest)     0.995
5,000 docs    1024 (largest)     0.893

The optimal chunk size flipped. At small scale, smaller chunks won. At production scale, larger chunks performed best.

Even more striking: recall dropped from 99% to 89%, a 10-point degradation, just from having more documents competing for attention in the vector space.

What this means:

  • Tutorial-scale benchmarks can be misleading
  • The “distractor” problem is real: more documents = more noise (see the toy illustration after this list)
  • You need to benchmark at your production scale, not demo scale
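
You can see the distractor effect even in a toy setup: keep one query and one relevant document fixed, add progressively more unrelated documents, and measure how often the relevant one survives in the top-k. The sketch below uses random vectors and cosine similarity, so it illustrates the mechanism rather than reproducing the benchmark:

import numpy as np

rng = np.random.default_rng(0)
dim, k, trials = 384, 5, 200

def top_k_hit_rate(n_distractors):
    """How often the single relevant doc stays in the top-k as distractors grow."""
    hits = 0
    for _ in range(trials):
        query = rng.normal(size=dim)
        relevant = query + rng.normal(scale=6.0, size=dim)   # weak signal: a very noisy copy of the query
        distractors = rng.normal(size=(n_distractors, dim))  # unrelated documents
        corpus = np.vstack([relevant, distractors])
        sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
        hits += 0 in np.argsort(-sims)[:k]                   # index 0 is the relevant doc
    return hits / trials

for n in (400, 5000):
    print(f"{n} distractors -> hit rate {top_k_hit_rate(n):.2f}")

In runs like this the hit rate drops as the pool of unrelated vectors grows: nothing about the relevant document changed, there is simply more competition near the query. That is the same qualitative pattern as the 400 to 5,000 document jump above.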

I’ll explore this scale phenomenon in more depth in a future article. For now, the lesson is clear: benchmark at the scale you’ll actually operate at.

A note on methodology: These results are from single runs on specific datasets. The small-scale tests used nomic-embed-text (Ollama); the 5K scale test used OpenAI text-embedding-3-small. Your results may vary depending on your data distribution, embedding model, and query patterns. The code and datasets are open source — I encourage you to reproduce and extend these experiments on your own data.

Try It Yourself

RagTune is open source: github.com/metawake/ragtune

Install (30 seconds):

brew install metawake/tap/ragtune
# or: curl -sSL https://raw.githubusercontent.com/metawake/ragtune/main/install.sh | bash

Reproduce article results (10 minutes):

git clone https://github.com/metawake/ragtune
cd ragtune
# Generate the datasets
cd benchmarks/casehold-500 && pip install datasets && python prepare.py
cd ../hotpotqa-1k && python prepare.py
cd ../..
# Run the legal benchmark with different chunk sizes
ragtune ingest ./benchmarks/casehold-500/corpus --collection legal-256 --chunk-size 256
ragtune ingest ./benchmarks/casehold-500/corpus --collection legal-1024 --chunk-size 1024
ragtune simulate --collection legal-256 --queries ./benchmarks/casehold-500/queries.json
ragtune simulate --collection legal-1024 --queries ./benchmarks/casehold-500/queries.json

You should see the 7% recall difference between chunk sizes on legal text.