Aditya V Kashyap, AI and Innovation Leader, driving enterprise transformations through trusted strategy, governance and bold leadership.

Every few months, the enterprise AI conversation resets around the same flawed premise: that better models solve the problem. When large language models hallucinate, the instinct is to reach for a newer version. When outputs drift from reality, the fix must be a smarter architecture, a tighter prompt or a different foundation model.
Retrieval-augmented generation (RAG) arrived as the practical answer to this cycle, promising to ground LLMs in real organizational knowledge. The problem is that many RAG systems fail, and the model is rarely the reason.
Why RAG Fails
I have watched organizations deploy RAG across compliance workflows, internal knowledge bases and client-facing operations. The pattern is consistent. The model performs brilliantly in isolation. The moment it connects to enterprise data through a retrieval pipeline, the results become unpredictable, expensive to debug and genuinely difficult to trust. The failure does not announce itself. It arrives quietly, wrapped in fluent, confident prose.
The misconception begins at the design table. Leadership views RAG as a model-enhancement strategy. If GPT-4 hallucinates, attach a knowledge store. If the model does not know your regulatory policy, feed it in at inference time. This framing is intuitive but wrong. It treats retrieval as a passive conduit, a pipe that simply moves data closer to the model. In practice, retrieval is a system with its own failure modes, its own latency budget and its own quality requirements. Getting the model right is the easy part.
What actually breaks is everything that happens before the model sees a single token of retrieved content. Document chunking is the first and most underestimated problem. Splitting a 40-page policy document into fixed-size chunks with no awareness of semantic boundaries is the data engineering equivalent of answering a question by reading random paragraphs out of the wrong book.
The model cannot compensate for context that was never coherently structured. It will generate something, and it will sound right, but it will be wrong in ways that are difficult to trace back to a chunking decision made during indexing.
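The alternative to fixed-size splitting is to respect the document's own boundaries. A minimal sketch of this idea, in plain Python and assuming paragraphs are separated by blank lines, packs whole paragraphs into chunks up to a size limit rather than cutting text at arbitrary offsets:

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters,
    so no chunk starts or ends mid-thought. A single paragraph longer than
    max_chars still becomes its own oversized chunk in this sketch."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Production chunkers also handle headings, tables and overlap between chunks, but even this simple boundary awareness avoids the worst failure: a retrieved passage that begins in the middle of one policy clause and ends in the middle of another.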
Embedding quality compounds this. Generic embeddings trained on broad web corpora perform poorly against the dense, domain-specific language of financial services, legal compliance or medical documentation. The semantic distance between how a compliance officer phrases a question and how a regulation is actually written can be large enough that the retrieval layer simply returns the wrong documents.
Precision and recall are in constant tension. Tune for recall and you flood the context window with noise. Tune for precision and you miss the exact passage the model needed. No default configuration resolves this. It requires deliberate calibration against the actual query distribution of the system in production.
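That calibration starts with measuring both sides of the tension. A minimal sketch, using hypothetical document IDs, computes precision and recall at a cutoff k for a single query so the trade-off becomes visible as k changes:

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision@k and recall@k for one query: precision is the share of
    the top-k results that are relevant; recall is the share of all
    relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With retrieved results `["d1", "d2", "d3", "d4", "d5"]` and relevant set `{"d2", "d9"}`, raising k from 3 to 5 leaves recall stuck at 0.5 while precision falls from 0.33 to 0.2: more context window spent, no additional relevant material found. Averaged over a representative query log, these two numbers are what the calibration should actually target.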
The Importance Of Data, Architecture And Latency
The data problem runs deeper than most organizations are prepared to accept. RAG is not just constrained by data quality. It is constrained by data freshness and data structure simultaneously. Retrieval layers built on documents that were accurate six months ago will generate outputs that are confidently grounded in obsolete information.
In regulated industries, that is not an accuracy problem. It is a liability problem. Enterprises routinely underestimate the engineering effort required to maintain a retrieval layer that stays current, clean and semantically coherent as the underlying knowledge base evolves.
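One defensible mitigation is to make freshness an explicit, enforced property of the retrieval layer. A sketch of that idea, assuming each indexed chunk carries a hypothetical `last_verified` metadata field, filters out anything outside the freshness window before it ever reaches the model:

```python
from datetime import date, timedelta

def filter_stale(
    results: list[dict], today: date, max_age_days: int = 180
) -> list[dict]:
    """Drop retrieved chunks whose source document has not been re-verified
    within the freshness window. The 'last_verified' key is an assumed
    metadata field maintained by the ingestion pipeline."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in results if r["last_verified"] >= cutoff]
```

The filter is trivial; the hard part is the ingestion discipline behind it, which is exactly the ongoing engineering effort organizations underestimate.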
Architecture decisions matter more than model selection at almost every operational threshold. In my experience, hybrid retrieval, combining dense vector search with sparse keyword matching, consistently outperforms either approach in isolation across complex enterprise use cases.
The organizations getting durable results are investing in vector database design, indexing strategy and pre-computation pipelines, not in cycling through foundation models. Caching frequently retrieved context reduces latency and cost. Smart chunking with semantic overlap reduces context fragmentation. These are engineering decisions with direct, measurable impact on output quality.
Latency deserves more attention than it receives in RAG discussions. Multistage pipelines introduce retrieval delays, re-ranking overhead and additional API calls that accumulate into user-facing delays real applications cannot absorb. There is a genuine trade-off between retrieval depth and response speed, and it cannot be resolved by throwing more compute at it. The organizations building production RAG systems are making explicit architectural choices about where accuracy yields to performance, and where it cannot.
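Making that trade-off explicit starts with a written-down budget. A minimal sketch, with illustrative stage names and latency figures rather than measurements from any real system, sums per-stage latencies against a user-facing target:

```python
def within_budget(stage_ms: dict[str, float], budget_ms: float) -> tuple[float, bool]:
    """Total a multistage pipeline's per-stage latencies and check them
    against the end-to-end budget. Stage names and numbers are
    illustrative, not benchmarks."""
    total = sum(stage_ms.values())
    return total, total <= budget_ms

pipeline = {
    "embed_query": 40.0,
    "vector_search": 120.0,
    "rerank": 250.0,
    "llm_generation": 900.0,
}
```

With a 1,000 ms budget, this hypothetical pipeline is over by 310 ms before any network variance, which forces exactly the architectural question the paragraph above describes: which stage gives up depth so the whole system stays responsive.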
The Silent Failure Mode
The failure mode that concerns me most is the one that looks like success. Retrieval-augmented systems fail silently in a way that hallucinating models do not. When a model hallucinates, the output is often detectably wrong. When a RAG system retrieves plausible but subtly irrelevant context, the model produces output that is coherent, well-structured and grounded in the wrong data. It passes a surface-level review. It sounds authoritative. And in a compliance or risk context, it is more dangerous than a visible error because it is harder to catch.
Most leaders still believe RAG is a shortcut out of the hallucination problem. It is not. It is a trade for a harder, less visible class of problems. The gap between a compelling pilot and a reliable production system is almost entirely an engineering gap. Pilots run on clean, curated document sets. Production systems ingest messy, contradictory, unevenly maintained organizational knowledge. That gap does not close by upgrading the model.
The organizations that will build durable, trustworthy AI systems are not the ones waiting on the next generation of foundation models. They are the ones treating retrieval as a first-class engineering discipline, investing in data infrastructure and accepting that reasoning over imperfect, dynamic organizational knowledge at scale is a systems problem that cannot be delegated to the model layer. The future of enterprise AI will not be won by whoever has the most capable model. It will be won by whoever has the most reliable retrieval system underneath it.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.