rag to riches: synthetic data for training rag agents - cgft blog


gk, yh, ap & tr Mar 13, 2026 7 min read


tl;dr

  • training a great rag agent with rl requires a dataset of hard, grounded questions, and almost nobody has one sitting around.
  • but most teams do have a corpus: docs, support articles, internal wikis. we can leverage those resources to generate a dataset automatically.
  • we built a pipeline that goes from corpus → quality synthetic qa dataset, inspired by sage, a search-augmented generator + a search agent that filters and refines.
  • this means that with just a corpus, teams can quickly and easily train a fine-tuned agentic rag model that retrieves faster and more accurately than a general-purpose model

why qa dataset generation matters


reinforcement learning (rl) is what turns a small model into a great retrieval agent. as we showed in our previous post, rl-finetuned models can outperform competitors many times their size on domain-specific retrieval tasks. however, rl training requires a dataset of questions that are grounded in your actual corpus and diverse enough to generalize.

almost nobody has that dataset ready to go. collecting human-annotated question-answer (qa) pairs is slow, expensive, and requires domain experts who understand both your content and what “hard” looks like for a retrieval agent. most teams that want to train a rag model are stuck before they start.

but here’s the thing: most teams do have a corpus. docs, support articles, internal wikis, api references — the raw material is already there. the real issue isn’t whether to generate synthetic data, it’s how to craft questions that are firmly grounded and demanding enough to force real retrieval.

in this post, we explore three different approaches to qa dataset generation — naive, sage, and cgft — analyzing metrics like question quality and training costs in an effort to build a robust pipeline for training great agents.

the naive approach


you can sample a chunk, hand it to an llm, and ask it to write a question-answer pair. no need for linking or verification. it’s fast and cheap, and the questions come from your corpus, not thin air.

these questions, however, are shallow. the llm writes what the chunk suggests: a single-hop fact. there’s no incentive to make the questions challenging, no check that retrieval is actually required to answer them, and no mechanism for connecting ideas across documents. a rag agent trained on these questions learns to pattern-match rather than retrieve and reason.

naive generation is a fine starting point for understanding your corpus. it’s not sufficient for training.
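a minimal sketch of the naive scheme, assuming a hypothetical `llm` callable (any chat-completion wrapper would do; here it returns a canned pair so the sketch runs standalone):

```python
import random

def llm(prompt: str) -> dict:
    # hypothetical stand-in for a real llm call; a real implementation
    # would hit a chat-completion api and parse the response into a dict
    return {"question": "what port does the service listen on by default?",
            "answer": "8080"}

def naive_qa_pair(corpus: list[str]) -> dict:
    """sample one chunk and ask the llm for a qa pair grounded in it.
    no linking, no verification: fast, cheap, and shallow."""
    chunk = random.choice(corpus)
    pair = llm(f"write one question-answer pair answerable from this chunk:\n{chunk}")
    pair["seed_chunk"] = chunk  # keep provenance for any later filtering
    return pair

corpus = ["the service listens on port 8080 by default and logs to stdout."]
pair = naive_qa_pair(corpus)
print(pair["question"], "->", pair["answer"])
```

note that nothing here forces the question to require retrieval — that's exactly the gap the later approaches close.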

the sage approach: search-augmented generation


our starting point was sage, from google ai. instead of asking an llm to write from a chunk in isolation, you give it access to search tools such as bm25. the “generator” llm then actively searches for related content and reasons across multiple documents. from there, it produces multi-hop question-answer pairs.

sage introduces a second component: a search agent that acts as a judge. for each generated question, it searches the corpus and tries to independently verify the answer. if the agent answers the question too easily or can’t answer it at all, it sends feedback to the generator llm to try again. the two iterate until the questions meet a target difficulty.

the sage approach showed meaningful gains over naive baselines. but there’s a cost: every question potentially involves dozens of llm calls across both the generator and the judge. at scale, that’s slow and expensive.
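the generator-judge loop can be sketched like this, under heavy assumptions (the real sage prompts, search tools, and stopping rule are far more involved; `generate`, `judge_difficulty`, and the call counter are illustrative names, and the difficulty logic is a toy):

```python
def generate(feedback: str, calls: list) -> dict:
    # generator: in sage this searches the corpus (e.g. bm25) and writes
    # a multi-hop qa pair; here, feedback "raises" the hop count
    calls.append("generate")
    hops = 3 if feedback == "too easy" else 1
    return {"question": f"{hops}-hop question", "hops": hops}

def judge_difficulty(qa: dict, calls: list) -> str:
    # judge: in sage this searches the corpus and tries to answer
    # the question independently, possibly over many tool calls
    calls.append("judge")
    return "too easy" if qa["hops"] < 2 else "ok"

def sage_loop(max_rounds: int = 5):
    calls, feedback, qa = [], "", None
    for _ in range(max_rounds):
        qa = generate(feedback, calls)
        feedback = judge_difficulty(qa, calls)
        if feedback == "ok":
            break
    return qa, len(calls)

qa, n_calls = sage_loop()
print(qa["question"], "accepted after", n_calls, "llm calls")
```

even this toy version takes four calls to land one question; with real search tools inside both roles, each round fans out into many more.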


our approach: the same insight, fewer calls


the key observation from sage is that search tool access is what makes good questions. the generator explores the corpus to build multi-hop questions. the judge searches to verify difficulty. that iterative searching is the mechanism that makes quality rise. unfortunately, it makes costs rise as well.

so we asked: can we preserve the mechanism without the call count?

on the generation side

instead of having the llm actively search the corpus mid-generation, we separated chunk linking into a dedicated pre-step. a single llm-guided tool call finds related chunks — parents, siblings, semantically similar content — before the generator ever writes a question. we also generate a lightweight corpus profile upfront: a summary, example queries, extracted entities, domain terminology. this makes the tool call smarter without adding per-question overhead.

the generator then receives a seed chunk plus its linked neighbors as context and writes a multi-hop question grounded in all of them. essentially, we cap the iterative searching to one llm call.
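a sketch of the pre-linking idea, with hypothetical `embed` and `llm` helpers (the embedding is a toy bag-of-words; a real pipeline would use a vector model and the corpus profile). the point is that neighbor discovery happens once, before generation, not iteratively during it:

```python
def embed(text: str) -> set:
    # toy embedding: bag of words; a real pipeline would use a vector model
    return set(text.lower().split())

def link_chunks(seed: str, corpus: list[str], k: int = 2) -> list[str]:
    """one retrieval pass that finds related chunks (here by word overlap)
    before the generator ever writes a question."""
    scored = sorted((c for c in corpus if c != seed),
                    key=lambda c: len(embed(seed) & embed(c)),
                    reverse=True)
    return scored[:k]

def llm(prompt: str) -> dict:
    # hypothetical single generation call over seed + neighbors
    return {"question": "multi-hop question grounded in all given chunks",
            "answer": "(grounded answer)"}

corpus = ["the sdk retries failed requests three times",
          "retries use exponential backoff starting at 200ms",
          "billing is invoiced monthly"]
seed = corpus[0]
neighbors = link_chunks(seed, corpus)
qa = llm("write a multi-hop qa pair from:\n" + "\n".join([seed] + neighbors))
print(neighbors)
```

the generation cost per question is now fixed: one linking pass plus one llm call, regardless of how many hops the final question spans.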

on the filtering side

the sage judge calls the search tool multiple times to simulate a retrieval agent verifying difficulty. we again replaced this with a single tool call that retrieves the top results for the question, then asks a lightweight llm judge whether those results make the question trivially answerable.

two things this catches: questions answerable from common knowledge (no retrieval needed at all), and questions where the seed chunk appears directly in the top results (the question leaks its own answer). both get flagged and either are refined or rejected.
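the single-call filter can be sketched as follows; `retrieve_top_k` is a stand-in ranked by word overlap, and `judge_trivial` is a toy heuristic where a real pipeline would use a lightweight llm judge:

```python
def retrieve_top_k(question: str, corpus: list[str], k: int = 3) -> list[str]:
    # stand-in retriever: rank chunks by word overlap with the question
    q = set(question.lower().split())
    return sorted(corpus, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def judge_trivial(question: str, results: list[str]) -> bool:
    # hypothetical lightweight llm judge; toy heuristic: very short
    # questions are assumed answerable from common knowledge
    return len(question.split()) < 4

def passes_filter(qa: dict, corpus: list[str]) -> bool:
    """one tool call + one judge check, replacing sage's
    multi-call agentic verification."""
    top = retrieve_top_k(qa["question"], corpus)
    if qa["seed_chunk"] in top:
        return False  # seed chunk surfaces directly: the question leaks its answer
    if judge_trivial(qa["question"], top):
        return False  # no retrieval needed at all
    return True

corpus = ["alpha deploy steps require approval",
          "beta flags toggle features",
          "gamma docs cover billing",
          "delta release notes"]
qa_leak = {"question": "what do alpha deploy steps require",
           "seed_chunk": "alpha deploy steps require approval"}
qa_ok = {"question": "how do beta flags interact with gamma billing docs",
         "seed_chunk": "delta release notes"}
print(passes_filter(qa_leak, corpus), passes_filter(qa_ok, corpus))
```

flagged pairs go back for refinement or get dropped; either way, only one retrieval call was spent deciding.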

we also added a separate grounding filter that answers one question: does the generated answer actually follow from the source chunks? this runs independently and catches hallucinated or unsupported answers before they enter the dataset.
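the grounding check is independent of retrieval; a toy version, with the llm entailment judge replaced by a token-support heuristic:

```python
def grounded(answer: str, source_chunks: list[str]) -> bool:
    """toy grounding check: every content word of the answer must appear
    somewhere in the source chunks. a real filter would instead ask an
    llm judge whether the answer is entailed by the chunks."""
    support = set(" ".join(source_chunks).lower().split())
    content = [w for w in answer.lower().split() if len(w) > 3]
    return all(w in support for w in content)

chunks = ["the sdk retries failed requests three times",
          "retries use exponential backoff starting at 200ms"]
print(grounded("retries use exponential backoff", chunks))  # supported by chunks
print(grounded("retries are disabled by default", chunks))  # unsupported claim
```

because it only compares answer to sources, this filter composes cleanly with the retrieval-difficulty filter above it.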

the result is a filtering chain where every qa pair gets validated for retrieval necessity and answer faithfulness, the two properties that matter most for training rag agents.

what the questions actually look like


naive generation produces single-hop lookups. useful, but a rag agent trained on them is little more than a pattern-matching bot, limited in scope.

the sage approach produces intricate, practical questions that feel almost as if they come from a senior engineer. this question requires synthesizing across the rest api reference and three separate sdk guides simultaneously. impressive, but generating it took over 50 llm calls, with the generator and judge iterating back and forth across multiple doc sections.

our approach targets the space in between: questions that require real retrieval and synthesis, without the cost of full agentic search. this question spans the feature flag configuration docs, the sdk migration guide, and the node sdk changelog — three distinct sources, all of which are necessary to answer it. yet it took a fraction of the llm calls sage required, more than 10x fewer.

what we found


our approach is substantially better than naive single-chunk generation. the retrieval difficulty filter alone eliminates a large fraction of easy questions that would otherwise pollute the dataset. the corpus-grounded linking produces genuine multi-hop structure rather than superficial questions that technically mention two facts.


the full sage pipeline produced higher-fidelity questions, particularly for complex multi-hop cases where iterative search let the generator discover non-obvious connections. our approach doesn't fully replicate that ceiling. however, for most production use cases, where you need thousands of questions rather than dozens, the cost and latency of full agentic generation make it impractical. ours generates at scale, staying affordable without sacrificing question quality, making it the most balanced option.

what this means


you don’t need pre-curated training pairs to rl-tune a retrieval agent. all you need is a corpus.

give us your docs, support articles, internal wiki, and we’ll generate a synthetic qa dataset, train an agentic rag model against it, and ship you something that retrieves faster and more accurately than a general-purpose model, all without you having to label a single example.

want to go from corpus to trained retrieval agent? join the waitlist for early access.