
tl;dr
- we trained a small 4b model for domain-specific agentic rag.
- with reinforcement learning, the small model outperforms gpt-5.2 on domain-specific retrieval-heavy tasks.
- this means faster, cheaper inference with better retrieval quality - teams can ship search-centric ai features that feel instant & scale w/o crazy latency or cost.
retrieval quality & latency now dominate llm workloads
retrieval is core to llm systems. even the best models need grounding. they need to query external knowledge bases and pull in fresh information to produce reliable answers.
over the last year, retrieval has shifted from one-shot similarity search (bm25, embedding lookups, etc.) to agentic retrieval. instead of issuing a single query, models plan and search multiple times in a loop. this shift makes latency a bottleneck: every reasoning step and tool call adds delay. so while retrieval is getting smarter, it's also getting more expensive.
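the loop above can be sketched in a few lines. everything here is hypothetical scaffolding (the `plan_query` / `is_sufficient` components stand in for model decisions), just to show where each extra tool call, and its latency, comes from:

```python
def agentic_retrieve(question, search, plan_query, is_sufficient, max_turns=4):
    """Loop: plan a query, search, accumulate evidence, stop when sufficient."""
    evidence = []
    for _ in range(max_turns):
        query = plan_query(question, evidence)   # model decides the next query
        evidence.extend(search(query))           # each tool call adds latency
        if is_sufficient(question, evidence):    # model decides to stop
            break
    return evidence

# toy usage with stubbed components
docs = {"net sales fy23": "chunk-a", "net sales fy24": "chunk-b"}
hits = agentic_retrieve(
    "net sales growth",
    search=lambda q: [docs[q]] if q in docs else [],
    plan_query=lambda q, ev: ["net sales fy23", "net sales fy24"][len(ev)],
    is_sufficient=lambda q, ev: len(ev) >= 2,
)
```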
today, devs rely on large general-purpose llms to power these retrieval agents. these models are optimized for broad reasoning, not fast iterative search. effective retrieval is also domain-specific: a model must understand the structure, terminology, and implicit signals of the knowledge base it queries.
rl enables fast agentic rag
reinforcement learning helps smaller, faster models become expert retrieval agents for specific domains. the result is specialized models that match/exceed the retrieval performance of larger models while running far more efficiently.
this blogpost walks through how we applied this approach to a concrete financial use case, training an agentic rag model specialized for finance-focused retrieval and demonstrating how RL can turn even compact models into high-performance search agents.
want to train retrieval agents like this on your own data? we’ll soon be launching a platform that lets you rl fast retrieval systems for your own docs & datasets. join the waitlist to get free early beta access.
dataset & setup
dataset
we use FinDer, a dataset for financial q/a over sec 10-K filings. some notes on the dataset:
- queries include fact lookups, calculations & more open-ended analysis. queries are also domain-specific & require multi-hop reasoning.
- queries are ambiguous, keyword-style & technical (e.g. fy23-fy24 net sales growth for azo).
- the dataset comes with both ground truth answers and golden reference chunks, i.e. the exact piece(s) of source text from the original docs that contain the info needed to reach the ground truth answer.
- we used the split of the dataset that evaluates quantitative reasoning.
chunking & indexing
the search tool available to the agent was strictly bm25 rather than embedding search: embedding-based retrieval can add noise during rl training, since small changes in query wording can shift retrieval results.
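to make the determinism point concrete, here's a minimal self-contained bm25 scorer (classic okapi formula, whitespace tokenization; parameters k1/b are the usual defaults, not values from the post). the same query string always produces the same ranking, which is exactly the property you want for a stable rl reward:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against a whitespace-tokenized query with classic bm25."""
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)                 # term frequency in this doc
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["azo net sales grew in fy24", "total assets were flat", "net sales declined"]
scores = bm25_scores("net sales fy24", docs)
ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
```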
train/validation split
to check for generalization across entities, we split at the stock-ticker level: tickers that appear in the train set never appear in the val set, enforcing domain reasoning rather than memorization.
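a sketch of that split, assuming each example carries a `ticker` field (field name and split fraction are illustrative, not from the post):

```python
import random

def split_by_ticker(examples, val_frac=0.2, seed=0):
    """Split q/a examples so no stock ticker appears in both train and val."""
    tickers = sorted({ex["ticker"] for ex in examples})
    rng = random.Random(seed)           # seeded for a reproducible split
    rng.shuffle(tickers)
    n_val = max(1, int(len(tickers) * val_frac))
    val_tickers = set(tickers[:n_val])
    train = [ex for ex in examples if ex["ticker"] not in val_tickers]
    val = [ex for ex in examples if ex["ticker"] in val_tickers]
    return train, val

data = [{"ticker": t, "q": f"q{i}"}
        for i, t in enumerate(["AZO", "AZO", "AAPL", "MSFT", "TSLA"])]
train, val = split_by_ticker(data)
```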
reward function
retrieval quality was judged using these 3 metrics:
- llm-as-judge rubric to evaluate correctness of the final answer → we compare the ground truth answer with the llm-generated answer to check that content, numbers, etc. are accurate
- conciseness of final answer → we don’t want the model to keep yapping, so we have another rubric that guards against that.
- percentage of reference chunks retrieved across tool calls
the last metric serves as one safeguard against reward hacking because it evaluates how many ground-truth reference chunks were actually retrieved, rather than relying solely on an llm’s subjective judgment.
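one way to combine the three signals into a scalar reward (the weights and score ranges here are illustrative assumptions, not the values we actually used):

```python
def reward(correctness, conciseness, retrieved_ids, gold_ids, w=(0.6, 0.2, 0.2)):
    """Blend the three reward signals.

    correctness / conciseness: llm-judge scores in [0, 1].
    chunk recall: fraction of golden reference chunks retrieved across all
    tool calls -- the objective part that's hard to reward-hack.
    """
    recall = len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)
    return w[0] * correctness + w[1] * conciseness + w[2] * recall
```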
results

with rl-finetuning, the 4b model was able to produce answers matching the ground truth ~35% more often than gpt-5.2, a model that’s likely at least 100x larger.

at the same time, pass@8 increased by ~63% during training. pass@8 measures the probability that at least one of eight sampled attempts solves the task. the rise here indicates the model is actually learning to solve more questions over time (instead of just getting better at producing the same answers more reliably).
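for reference, pass@k is usually computed with the unbiased estimator from chen et al. (2021): the probability that at least one of k samples drawn from n attempts (c of which are correct) succeeds.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total sampled attempts, c: correct attempts, k: samples drawn.
    """
    if n - c < k:          # can't pick k all-incorrect samples -> certain pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```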
early training behavior - query echoing
initially, the model tended to simply search for the exact user query once and stop. over time, rl pushed the model towards more multi-turn behavior - issuing multiple searches when information was incomplete and terminating appropriately when enough info was gathered.
randomizing judge prompts reduces reward hacking
llm-as-judge rubrics can introduce subtle failure modes where the model learns to exploit quirks in the judge rather than actually optimizing for the task at hand. one example: the model discovered that sprinkling emojis increased the conciseness score.
to mitigate this, we construct a small set of semantically equivalent judge prompts and sample one at reward-computation time. this keeps the model from overfitting to the quirks of any single prompt wording.
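the mechanism is simple enough to show directly. the prompt paraphrases below are made up for illustration; `judge` stands in for whatever llm call scores the answer:

```python
import random

JUDGE_PROMPTS = [  # semantically equivalent paraphrases (illustrative)
    "Is the answer concise and free of filler? Score 0-1.",
    "Rate 0-1 how brief and to-the-point the answer is.",
    "Score 0-1: does the answer avoid unnecessary verbosity?",
]

def conciseness_reward(answer, judge, rng=random):
    """Sample one judge prompt per reward computation so the policy can't
    overfit to the quirks of any single prompt wording."""
    prompt = rng.choice(JUDGE_PROMPTS)
    return judge(prompt, answer)
```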
train-inference mismatch
rl training has two components: a rollout engine and a trainer. we sample a few completions from the rollout engine, compute rewards for each of them & then the trainer pushes the model towards generating completions with higher rewards.
for higher throughput, we use sglang as our rollout/inference engine. because the rollout engine and the trainer compute token probabilities with different kernels and numerics, the distributions seen during sampling can diverge from those computed during training. this resulted in very unstable training for us.
we tried a variety of methods (tis, mis, etc.) and noticed a fundamental tradeoff between convergence speed and stability: masking aggressively increased stability but prevented the model from exploring novel paths & slowed convergence.
we ended up picking dppo, proposed by qi et al. → the method ensures low-probability tokens aren't overly penalized (so the model keeps trying new approaches) while also constraining cases where there's a high mismatch between the trainer ↔ rollout components.
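to give a feel for the mechanics (this is a toy ratio-clip-plus-mask sketch of the general idea, not the actual dppo rule; thresholds are made up): the per-token importance ratio between trainer and rollout log-probs is truncated so low-probability tokens aren't over-penalized, and tokens where the two engines disagree wildly are masked out of the loss entirely.

```python
import math

def mismatch_weights(train_lp, rollout_lp, clip=2.0, mask_thresh=5.0):
    """Per-token loss weights from trainer vs rollout log-probs (toy sketch)."""
    weights = []
    for t, r in zip(train_lp, rollout_lp):
        if abs(t - r) > mask_thresh:      # extreme mismatch: drop the token
            weights.append(0.0)
            continue
        ratio = math.exp(t - r)           # importance ratio pi_train / pi_rollout
        weights.append(min(ratio, clip))  # one-sided truncation
    return weights

# small mismatch kept, big ratio clipped, extreme mismatch masked
w = mismatch_weights([-1.0, -1.0, -9.0], [-1.2, -4.0, -1.0])
```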
your turn
in short, smaller models can
- learn domain-specific retrieval behaviors via rl,
- outperform larger models on real tasks,
- and do it at a fraction of the cost and latency.
want to train retrieval agents like this on your own data? we’ll soon be launching a platform that lets you rl fast retrieval systems for your own docs & datasets. join the waitlist to get free early beta access.