Show HN: RAG chunk size "best practices" failed on legal text – I benchmarked it (medium.com)

Author here. Built RagTune to stop guessing at RAG configs.
Surprising findings:
1. On legal text (CaseHOLD), 1024 chunks scored WORST (0.618). The "small" 256 chunks won (0.664). That's a ~7% relative swing.
2. On Wikipedia text? All chunk sizes hit ~99%. No difference.
3. Plot twist: At 5K docs, optimal chunk size FLIPPED from 256→1024. Scale changes everything.
Code is MIT: github.com/metawake/ragtune
Happy to discuss methodology.
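If you want the shape of the experiment without reading the repo, here's a stripped-down sketch of the sweep. This is generic retrieval code, not ragtune's actual internals: it assumes sentence-transformers for embeddings, uses naive character chunking, and the `docs`/`queries` at the bottom are toy placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, size):
    # Naive fixed-size character chunks; real chunkers respect tokens/sentences.
    return [text[i:i + size] for i in range(0, len(text), size)]

def recall_at_k(queries, docs, chunk_size, k=5):
    # queries: list of (question, relevant_doc_id); docs: list of (doc_id, text).
    chunks, owners = [], []
    for doc_id, text in docs:
        for c in chunk(text, chunk_size):
            chunks.append(c)
            owners.append(doc_id)
    C = model.encode(chunks, normalize_embeddings=True)   # (n_chunks, dim)
    hits = 0
    for question, rel_doc in queries:
        q = model.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(-(C @ q))[:k]                    # top-k by cosine sim
        hits += any(owners[i] == rel_doc for i in top)
    return hits / len(queries)

# Sweep chunk sizes over the same corpus and query set (toy placeholders):
docs = [("d1", "Some long legal opinion text ..."), ("d2", "Another opinion ...")]
queries = [("What did the court hold?", "d1")]
for size in (256, 512, 1024):
    print(size, recall_at_k(queries, docs, size))
```

The benchmark is just this loop with real datasets, more chunk sizes, and MRR alongside Recall@5.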
Now that you have 5K docs, can you try estimating the statistical uncertainty of the Recall@5 and MRR metrics when they're measured on smaller datasets? Just draw several different 400-document subsets of the full 5K HotpotQA dataset and recalculate the metrics for each.
Great suggestion! This is exactly the right methodology for establishing confidence intervals.
I've added this to the roadmap as `--bootstrap N`:
The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std:

```
ragtune simulate --queries queries.json --bootstrap 5
# Output:
# Recall@5: 0.664 ± 0.012 (n=5)
# MRR:      0.533 ± 0.008 (n=5)
```

This also enables detecting real regressions vs. noise: e.g. "Recall dropped 3% ± 0.8%" is actionable, "dropped 3%" alone isn't.
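Under the hood the loop would be something like this (a minimal sketch; `evaluate` is a hypothetical stand-in for the existing eval pass, and the flag doesn't exist yet):

```python
import random
import statistics

def bootstrap_metrics(queries, evaluate, n=5, subset_size=400, seed=0):
    """Re-run the eval on n random query subsets and report mean/std.

    `evaluate` is a hypothetical callable returning e.g.
    {"recall@5": 0.66, "mrr": 0.53} for a list of queries.
    """
    rng = random.Random(seed)
    runs = [evaluate(rng.sample(queries, subset_size)) for _ in range(n)]
    return {
        metric: (statistics.mean(run[metric] for run in runs),
                 statistics.stdev(run[metric] for run in runs))
        for metric in runs[0]
    }
```

Strictly speaking that's subsampling without replacement rather than a classic bootstrap, but for a quick mean ± std over n=5 runs the distinction barely matters; with cheap eval runs you'd bump n and report percentile intervals instead.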
Will ship this in the next few weeks. Thanks for the push toward more rigorous methodology; it's exactly what's missing from most RAG benchmarks.