Show HN: RAG chunk size "best practices" failed on legal text – I benchmarked it

medium.com

2 points by metawake 8 days ago · 3 comments


metawake (OP) 8 days ago

Author here. Built RagTune to stop guessing at RAG configs.

Surprising findings:

1. On legal text (CaseHOLD), 1024-size chunks scored WORST (Recall@5 0.618). The "small" 256 chunks won (0.664), a ~7% relative swing.

2. On Wikipedia text? All chunk sizes hit ~99%. No difference.

3. Plot twist: At 5K docs, optimal chunk size FLIPPED from 256→1024. Scale changes everything.

Code is MIT: github.com/metawake/ragtune

Happy to discuss methodology.
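
The metrics in play are Recall@5 and MRR. A minimal sketch of the standard definitions (illustrative names, not necessarily RagTune's exact code):

    def recall_at_k(ranked, relevant, k=5):
        # Fraction of the relevant docs that appear in the top-k results.
        hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
        return hits / len(relevant)

    def mrr(ranked, relevant):
        # Reciprocal rank of the first relevant doc; 0.0 if none is retrieved.
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                return 1 / rank
        return 0.0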

  • patrakov 8 days ago

    Now that you have 5K docs, can you try estimating the statistical uncertainty of the Recall@5 and MRR metrics measured via smaller datasets? Just make some different 400-document subsets of the whole 5K HotpotQA dataset and recalculate the metrics.

    • metawake (OP) 8 days ago

      Great suggestion! This is exactly the right methodology for establishing confidence intervals.

      I've added this to the roadmap as `--bootstrap N`:

          ragtune simulate --queries queries.json --bootstrap 5
          
          # Output:
          # Recall@5:  0.664 ± 0.012 (n=5)
          # MRR:       0.533 ± 0.008 (n=5)
      
      The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std.
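
      Roughly what I have in mind (a sketch; `run_eval` and the other names are illustrative, not the final API):

          import random
          import statistics

          def bootstrap_metrics(queries, run_eval, n_runs=5, subset_size=400, seed=0):
              # run_eval(subset) -> {"Recall@5": ..., "MRR": ...}; hypothetical hook.
              rng = random.Random(seed)
              runs = [run_eval(rng.sample(queries, subset_size))
                      for _ in range(n_runs)]
              summary = {}
              for metric in runs[0]:
                  vals = [r[metric] for r in runs]
                  # Mean ± sample std across the N independent runs.
                  summary[metric] = (statistics.mean(vals), statistics.stdev(vals))
              return summary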

      This also makes it possible to tell real regressions from noise: "Recall dropped 3% ± 0.8%" is actionable; "dropped 3%" alone isn't.
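
      For the comparison itself, a crude check along these lines would do (sketch; the 2σ threshold is a common convention, nothing RagTune-specific yet):

          def is_real_regression(baseline, candidate, sigma=2.0):
              # baseline and candidate are (mean, std) pairs from bootstrap runs.
              # Treat a drop as real only if it exceeds `sigma` combined std devs.
              drop = baseline[0] - candidate[0]
              noise = sigma * (baseline[1] ** 2 + candidate[1] ** 2) ** 0.5
              return drop > noise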

      Will ship this in the next few weeks. Thanks for the push toward more rigorous methodology; this is exactly what's missing from most RAG benchmarks.
