Your RAG Pipeline Has No Brakes


Prompts Are Suggestions. Checkpoints Are Guarantees.

Payal Singh

You’ve built a RAG system. Your chunking is solid, your embeddings are tuned, your hybrid search returns relevant results. You put “only answer based on the provided context” in the system prompt. The demo goes well.

Then a user asks a question about last quarter’s revenue, and the LLM confidently reports a number that appears nowhere in any of your source documents.

The prompt told it to stick to the context. It didn’t.

This is the norm, not the exception. Prompts are inputs to a probabilistic system. They influence behavior but they don’t constrain output. Telling an LLM to be grounded is like telling a user to only submit valid form data. It helps, but it isn’t enforcement.

What if you could put actual enforcement between the LLM and the user? Not more prompting. Not a second LLM checking the first. Deterministic validation that the LLM physically cannot bypass.


Where RAG Grounding Stands Right Now

If you look at most RAG tutorials or reference architectures, the pipeline is almost the same everywhere: ingest documents, chunk them, embed them, store them. At query time, retrieve relevant chunks, pass them to the LLM as context, return the response.

All the engineering effort sits on the left side of that pipeline. Better chunking. Better embeddings. Hybrid search with BM25 plus vector similarity. Reranking with cross-encoders. There’s real value in all of it.

But the right side? The part where the LLM takes those carefully retrieved chunks and generates a response? Almost nobody validates that before it reaches the user.

The working assumption is: give the LLM good context and a clear prompt, and it will produce grounded answers. Often it does. Sometimes it doesn’t. And you have no reliable way to tell which happened until a user catches a wrong answer.

Think about how that would fly anywhere else in your stack. “We validate user input carefully, but we trust the database to write correct data so we skip read validation.” Nobody would accept that. For LLM output, it’s standard practice.

Treat LLM Output as Untrusted Data

The fix is conceptually simple. Stop treating the LLM as the final authority. Treat it as a drafting engine whose output needs validation before anyone sees it.

Between “the LLM produced a response” and “the user sees the response,” insert a series of validation checkpoints. Each checkpoint is a function that takes the LLM’s output, runs a deterministic check, and returns pass or fail. These are database lookups, vector similarity computations, string pattern matching. Not more LLM calls. Code.

The LLM cannot skip a checkpoint. It cannot argue with a cosine similarity score. It cannot talk its way past a missing database record. If a check fails, the response either gets sent back for revision or the user gets a clear “I don’t have enough information to answer that” instead of a confident hallucination.

Same principle as input validation, schema constraints, foreign key checks. Just applied to a new kind of untrusted input: LLM output.

The Four Checkpoints

These four checkpoints run in production in a system I built that generates presentation decks from ingested technical documents. Each checkpoint is a function that returns a boolean and an error list. The LLM’s output cannot proceed to the user without passing all four.

The approach is general enough to apply to any RAG system, whether you’re using Postgres, OpenSearch, Pinecone, or anything else with a search index and embeddings.
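Concretely, every checkpoint in this article shares one return shape. Here is a minimal sketch of that interface; the CheckResult name and fields are illustrative, not taken from the reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CheckResult:
    # Every checkpoint returns the same shape: did the check pass,
    # and if not, which specific problems were found.
    passed: bool
    errors: list[str] = field(default_factory=list)
```

The later snippets reuse this type.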


1. Retrieval Quality

Runs before the LLM sees anything.

Did the search return enough relevant chunks? Is the top result’s relevance score above a minimum threshold?

If this check fails, the LLM never gets called. The user gets a direct answer: “I don’t have enough information in the available documents to answer that.” No generation. No hallucination risk.

This single check eliminates a large category of bad answers. Most hallucination in RAG happens when the retrieved context is sparse or only loosely related to the question. The LLM fills the gaps from its training data, and the user can’t tell the difference. Catch the bad retrieval upfront and that failure mode disappears.

Implementation is minimal: check that you got at least N chunks (say, 3) and that the top relevance score exceeds a threshold (say, 0.65). Exact numbers depend on your embedding model and data. Start conservative. Loosen based on observed false refusals.
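As a sketch, assuming retrieval returns (chunk, score) pairs sorted by descending score, and reusing the CheckResult type from above (the thresholds mirror the numbers in the text and need tuning for your setup):

```python
MIN_CHUNKS = 3        # require at least this many retrieved chunks
MIN_TOP_SCORE = 0.65  # the best result must clear this relevance score

def check_retrieval_quality(results: list[tuple[dict, float]]) -> CheckResult:
    """Checkpoint 1: runs before the LLM is called."""
    errors = []
    if len(results) < MIN_CHUNKS:
        errors.append(f"only {len(results)} chunks retrieved, need {MIN_CHUNKS}")
    if not results or results[0][1] < MIN_TOP_SCORE:
        top = results[0][1] if results else 0.0
        errors.append(f"top relevance score {top:.2f} is below {MIN_TOP_SCORE}")
    return CheckResult(passed=not errors, errors=errors)
```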

2. Citation Existence

Runs after the LLM responds.

The prompt instructs the LLM to cite specific chunk or document IDs from the context it received. After the response comes back, look up each cited ID in your index. Does that chunk actually exist?

If your documents have versions (S3 objects with version IDs, for example), take it a step further: is the cited chunk from the current version of the source document, or has the document been updated since that chunk was extracted?

This catches two failure modes. First, fabrication: LLMs sometimes cite source IDs that look plausible but don’t correspond to anything real. Second, staleness: if the underlying document was updated and re-chunked, old citations point at content that may no longer be accurate.

The check itself is a batch lookup. In OpenSearch, an mget. In Postgres, an EXISTS query. Computationally trivial.
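The Postgres variant might look like this sketch using psycopg, assuming a chunks table keyed by chunk_id (table and column names are illustrative):

```python
import psycopg

def check_citations_exist(cited_ids: list[str], conn: psycopg.Connection) -> CheckResult:
    """Checkpoint 2: every cited chunk ID must exist in the index."""
    if not cited_ids:
        return CheckResult(passed=False, errors=["response cited no sources"])
    # One round trip: fetch the subset of cited IDs that actually exist.
    rows = conn.execute(
        "SELECT chunk_id FROM chunks WHERE chunk_id = ANY(%s)",
        (cited_ids,),
    ).fetchall()
    found = {row[0] for row in rows}
    errors = [f"cited chunk {cid} does not exist" for cid in cited_ids if cid not in found]
    return CheckResult(passed=not errors, errors=errors)
```

The version check is one more column in the SELECT and a comparison against the current document version.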

3. Semantic Grounding

Runs after the LLM responds. After citation existence passes.

This is the strongest checkpoint. It catches the case where the LLM cites a real source but says something the source doesn’t actually support.

Here is how it works:

  1. Split the LLM’s response into individual claims or sentences.
  2. Embed each claim using the same embedding model you use for document chunks.
  3. For each claim, compute cosine similarity against the embeddings of the specific chunks it cites. Not all chunks in the index. Just the ones the response says it’s drawing from.
  4. If the similarity between a claim and its best-matching cited chunk falls below a threshold, flag that claim as ungrounded.
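In code, the whole loop is short. A sketch, assuming an embed(texts) helper that wraps your embedding model and returns one vector per input string (a batched implementation of that helper appears later in this article):

```python
import re
import numpy as np

GROUNDING_THRESHOLD = 0.65  # tune per embedding model and content type

def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_semantic_grounding(response: str, cited_chunk_texts: list[str]) -> CheckResult:
    """Checkpoint 3: every claim must be close to at least one cited chunk."""
    if not cited_chunk_texts:
        return CheckResult(passed=False, errors=["no cited source text to ground against"])
    # Step 1: naive sentence split; swap in a real splitter for production.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    # Step 2: embed claims and cited chunks with the same model.
    claim_vecs = embed(claims)
    chunk_vecs = embed(cited_chunk_texts)
    errors = []
    # Steps 3 and 4: flag any claim whose best match falls below the threshold.
    for claim, cv in zip(claims, claim_vecs):
        best = max(cosine(cv, kv) for kv in chunk_vecs)
        if best < GROUNDING_THRESHOLD:
            errors.append(f"ungrounded claim (best similarity {best:.2f}): {claim!r}")
    return CheckResult(passed=not errors, errors=errors)
```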


The threshold controls how tightly the LLM must stick to its sources. At 0.85+, only close paraphrases pass. At 0.65, reasonable summarization and synthesis get through but meaning drift gets caught. Tune to your use case.

Important: this does not require verbatim reproduction. The LLM can rephrase, condense, and combine freely. “Revenue grew 14% year over year” and “Year-over-year revenue was up 14%” score high similarity. They mean the same thing. But “Revenue grew 40% year over year” would score meaningfully lower against a source that says 14%, because the embedding captures the semantic content, not the surface phrasing.

One limitation to be honest about: embeddings handle topical drift and factual contradictions well but can miss small numerical differences. “14%” vs “15%” are semantically very close and may not be flagged. If exact numerical precision matters in your domain, add a targeted extraction step that pulls out numbers and compares them literally. For the general case of “is the LLM saying something its sources actually support,” embedding similarity works.
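That extraction step can be simple. A sketch that pulls numeric tokens out of a claim and requires each to appear somewhere in its cited sources (illustrative; real documents need normalization for units, commas, and percent formatting):

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def check_numbers(claim: str, cited_chunk_texts: list[str]) -> CheckResult:
    """Literal comparison for the numerical cases embeddings miss."""
    source_numbers = set()
    for text in cited_chunk_texts:
        source_numbers.update(NUMBER_RE.findall(text))
    errors = [
        f"number {n} in claim not found in cited sources"
        for n in NUMBER_RE.findall(claim)
        if n not in source_numbers
    ]
    return CheckResult(passed=not errors, errors=errors)
```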

4. Response Structure

Runs after the LLM responds. Can run in parallel with checkpoints 2 and 3.

The simplest check. Does the response include citations at all? Is it suspiciously short or empty? Does it contain signals that the LLM fell back to its training data instead of the provided context? (“Based on my training data,” “As an AI model,” “I don’t have access to that information.”)

String matching and length checks. It catches the easy, structural failures that the more sophisticated checks might miss because they don’t look for them.
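A sketch, using the fallback phrases listed above (extend the list and the length floor as you observe failures in your own traffic):

```python
FALLBACK_PHRASES = [
    "based on my training data",
    "as an ai model",
    "i don't have access to that information",
]
MIN_RESPONSE_CHARS = 40  # illustrative floor for "suspiciously short"

def check_response_structure(response: str, cited_ids: list[str]) -> CheckResult:
    """Checkpoint 4: cheap structural signals, no embeddings needed."""
    errors = []
    if len(response.strip()) < MIN_RESPONSE_CHARS:
        errors.append("response is suspiciously short or empty")
    if not cited_ids:
        errors.append("response contains no citations")
    lowered = response.lower()
    for phrase in FALLBACK_PHRASES:
        if phrase in lowered:
            errors.append(f"training-data fallback signal: {phrase!r}")
    return CheckResult(passed=not errors, errors=errors)
```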

You Need to Own the Last Mile

There’s an architectural requirement behind all of this: your code must sit between the LLM and the user. If the LLM’s response streams directly to the frontend with nothing in between, you cannot validate anything.


If you own the backend (the most common setup for production chatbots), this is straightforward. Your API endpoint calls the LLM, gets the raw response, runs the four checks, returns the validated result. The LLM never talks to the user directly. You control the last mile.
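Chained together, the last mile might look like this sketch. The check functions are the ones from earlier; retrieve, call_llm, and extract_citations are assumed helpers standing in for your own retrieval, generation, and citation-parsing code:

```python
REFUSAL = "I don't have enough information in the available documents to answer that."

def answer(question: str, conn) -> str:
    results = retrieve(question)  # assumed: [(chunk, score), ...] sorted by score
    if not check_retrieval_quality(results).passed:
        return REFUSAL            # checkpoint 1 failed: the LLM is never called
    response = call_llm(question, [chunk for chunk, _ in results])
    cited_ids = extract_citations(response)
    cited_texts = [c["text"] for c, _ in results if c["id"] in cited_ids]
    checks = [
        check_citations_exist(cited_ids, conn),            # checkpoint 2
        check_semantic_grounding(response, cited_texts),   # checkpoint 3
        check_response_structure(response, cited_ids),     # checkpoint 4
    ]
    if all(c.passed for c in checks):
        return response
    # Fail closed: refuse, or loop the errors back to the LLM for one revision pass.
    return REFUSAL
```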

If you’re using a hosted chat UI with MCP tools (like Claude Desktop with your custom MCP server), you have a gap. MCP tools run during the retrieval phase, before the LLM generates its response. Checkpoint 1 (retrieval quality) works fine inside an MCP tool. But checkpoints 2, 3, and 4 need the LLM’s output, which goes straight to the user. You can expose a “validate this draft” tool and instruct the LLM to call it before presenting its final answer. Claude is fairly reliable at following that pattern. But the LLM can skip the step. It’s a strong advisory, not hard enforcement.
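The advisory version, as a sketch using the FastMCP helper from the official Python MCP SDK. run_output_checks is a hypothetical helper bundling checkpoints 2 through 4:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("rag-validation")

@mcp.tool()
def validate_draft(draft: str, cited_chunk_ids: list[str]) -> str:
    """Check a draft answer against its cited sources before presenting it."""
    # Advisory only: the prompt tells the LLM to call this, but nothing forces it.
    result = run_output_checks(draft, cited_chunk_ids)  # hypothetical: checkpoints 2-4
    if result.passed:
        return "PASS: all claims verified against cited sources."
    return "FAIL: revise before answering. " + "; ".join(result.errors)

if __name__ == "__main__":
    mcp.run()
```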

If you’re using an open-source chat UI like LibreChat, there’s a clean workaround. Instead of pointing the UI at the LLM provider directly, point it at your own service that speaks the OpenAI-compatible API. Your service receives the request, calls the real LLM, runs validation, and returns the checked response. The chat UI doesn’t know anything happened. You get the nice interface without giving up output control.
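A minimal sketch of that shim with FastAPI, speaking just enough of the OpenAI chat-completions shape for the UI to accept it (non-streaming; answer and conn come from the orchestrator sketch above, and the rest of the API surface is omitted):

```python
import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> dict:
    body = await request.json()
    question = body["messages"][-1]["content"]
    # Run the full checkpointed pipeline instead of proxying the LLM directly.
    validated = answer(question, conn)
    # Return the minimal OpenAI-compatible envelope the chat UI expects.
    return {
        "id": "chatcmpl-validated",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "validated-rag"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": validated},
            "finish_reason": "stop",
        }],
    }
```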

The streaming question. Hard enforcement means buffering the complete response before validating and displaying it. That adds latency and kills the typewriter effect users expect from chat interfaces. The alternative: stream the response to the user, run validation in parallel, and append a confidence signal at the end (“Sources verified” vs “Some claims could not be verified”). Financial and medical applications should probably buffer. Internal knowledge base chatbots can stream and signal. Pick based on what a wrong answer costs you.

What This Doesn’t Solve

Embedding similarity is not exact match. It catches meaning drift but can miss small numerical errors. “$4.2M” vs “$4.3M” will likely not be flagged. If that level of precision matters, add targeted number extraction.

Thresholds require tuning. The right similarity threshold depends on your embedding model, chunk size, and content type. There is no universal number. Start at 0.65 and adjust as you observe false positives and negatives in your actual data.

Latency goes up. The grounding check embeds each claim and runs similarity computations. For a five-sentence response, that’s five embedding calls plus vector comparisons. Typically adds 1 to 3 seconds. Batch the embedding calls to reduce round trips.
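Batching is a small change if your embedding client accepts a list of inputs, as OpenAI-style APIs do. This sketch is also one way to implement the embed helper assumed in the grounding snippet (model name illustrative):

```python
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    # One request for the whole batch instead of one request per claim.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]
```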

Bad sources produce validated bad answers. If your source documents are wrong, these checks will happily confirm the LLM’s response against wrong sources. The checkpoints verify consistency between response and sources. They say nothing about whether the sources themselves are correct.

Retrieval still matters. Checkpoints 2, 3, and 4 validate what the LLM did with the context. If retrieval missed the most relevant chunks entirely, the LLM is working with incomplete information. Checkpoint 1 catches the worst cases (too few results, scores too low) but it can’t tell you if the right chunks were retrieved. Good retrieval engineering is still the foundation.

The Point

You have two tools for keeping LLM output grounded. Prompts and checkpoints. Prompts improve first-draft quality by telling the LLM what you expect. Checkpoints guarantee that bad drafts don’t reach the user by running deterministic validation the LLM cannot influence.

Most RAG systems use only the first tool. Adding the second is a small amount of work. A retrieval quality check is ten lines of code. Citation existence is a database lookup. Semantic grounding is an embedding call plus cosine similarity. None of it requires new infrastructure.

If your RAG pipeline doesn’t have a validation layer between the LLM and the user, you have a pipeline with no brakes.

The reference implementation for this approach, including the database validation functions, MCP wrapper, and the orchestrator that enforces the checkpoint sequence, is at github.com/payals/pg_rag_slide_generator.