Seeking Datasets for Evaluating File Chunking Strategies for RAG/LLM

1 point by chiccomagnus 2 years ago · 3 comments

Hi everyone,

I’m delving into optimizing RAG ingest pipelines for textual documents such as PDFs, Office documents (Word, PowerPoint, etc.), HTML files, emails, and plain text. My aim is to benchmark file chunking strategies across these formats and evaluate their impact on the accuracy of the RAG process itself. For this, I need a dataset that includes:

- Textual Documents: A diverse set of PDFs, Office files, HTML documents, emails, and plain texts to test chunking strategies.

- Associated Questions: A set of questions or queries tailored to the content of these documents, to assess how well the RAG process retrieves and generates accurate information based on the chunks.

- Evaluation Metrics or Ground Truth: Ideally, the dataset would come with a benchmark or ground truth for the answers to these questions, allowing for a clear assessment of the RAG's accuracy and performance.
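To make the kind of benchmark I have in mind concrete, here is a minimal sketch of how such a dataset could be used to score chunking strategies against each other. Everything here is hypothetical: the dataset format (document text, question, ground-truth answer triples) is an assumption, and the keyword-overlap retriever is a toy stand-in for a real embedding-based retriever.

```python
def chunk_fixed(text, size=200, overlap=50):
    """Fixed-size character chunks with overlap (one strategy under test)."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

def chunk_paragraphs(text):
    """Paragraph-based chunking: split on blank lines (a second strategy)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def retrieve(chunks, question, k=1):
    """Toy retriever: rank chunks by word overlap with the question.
    In a real benchmark this would be an embedding similarity search."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

def hit_rate(dataset, chunker, k=1):
    """Fraction of questions whose ground-truth answer string
    appears in one of the top-k retrieved chunks."""
    hits = 0
    for doc, question, answer in dataset:
        top = retrieve(chunker(doc), question, k)
        if any(answer.lower() in c.lower() for c in top):
            hits += 1
    return hits / len(dataset)
```

Running `hit_rate` once per chunking strategy over the same (documents, questions, ground truth) triples gives a directly comparable retrieval-accuracy number per strategy, which is the core of the evaluation I'm trying to set up.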

If anyone has come across datasets fitting this description or has experience creating or using similar datasets for RAG accuracy evaluation, I’d greatly appreciate your insights. Additionally, recommendations for tools, frameworks, or methodologies for conducting these evaluations would be incredibly valuable.

Thanks in advance for any help or direction you can provide!

matthewddy 2 years ago

Same question :)

  • Yisz 2 years ago

    We have a number of production RAG datasets (human-verified) and have custom-generated evaluation datasets for a variety of companies.

    Feel free to reach out if you are interested: yi@relari.ai
