Retrieval-Augmented Generation (RAG) is a design pattern where you retrieve relevant documents from an external data source and inject them into a language model’s prompt to ground its responses in specific information. The canonical tutorial goes something like this: chunk your documents, generate vector embeddings, store them in a vector database, and at query time, retrieve the most relevant chunks and stuff them into the context window.
For a thorough overview, see IBM’s explanation of RAG, AWS’s RAG documentation, or Anthropic’s guide to RAG.
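For concreteness, a stripped-down version of that canonical pipeline might look like the sketch below. It uses chromadb purely as a stand-in for “a vector database”; the docs directory, chunk size, and example question are placeholder assumptions, not recommendations.

```python
# A minimal sketch of the canonical RAG pipeline: chunk, embed, index, retrieve.
# chromadb's default embedding function handles the embedding step here.
from pathlib import Path
import chromadb

def load_documents(path: str = "internal-docs/") -> list[tuple[str, str]]:
    # Placeholder loader: (doc_id, text) pairs from a directory of Markdown files.
    return [(p.name, p.read_text()) for p in Path(path).glob("**/*.md")]

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines argue endlessly about this step.
    return [text[i:i + size] for i in range(0, len(text), size)]

client = chromadb.Client()
collection = client.create_collection("docs")

# Ingest: chunk every document and let the vector store embed and index the pieces.
for doc_id, text in load_documents():
    pieces = chunk(text)
    collection.add(documents=pieces, ids=[f"{doc_id}-{i}" for i in range(len(pieces))])

# Query time: retrieve the most relevant chunks and stuff them into the prompt.
question = "How do I request PTO?"
results = collection.query(query_texts=[question], n_results=5)
context = "\n\n".join(results["documents"][0])
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```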
Here’s the thing: most of you don’t need any of this anymore.
When RAG was popularized in 2023, context windows were small. GPT-3.5 had 4,096 tokens. You literally couldn’t fit a long document into a single prompt. Chunking and retrieval weren’t just convenient — they were necessary.
That constraint is gone. As of early 2026, Gemini 2.0 offers 2 million tokens. Claude Sonnet 4 supports up to 1 million tokens. Llama 4 Scout shipped with a 10 million token context window. Even the smaller models routinely handle 128K–200K tokens. A million tokens is roughly 750,000 words — that’s longer than the entire Lord of the Rings trilogy.
The typical RAG use case — “chat with your internal docs,” “answer questions about our company wiki,” “query our knowledge base” — involves a corpus that is, at most, a few megabytes of text. That fits comfortably in a single prompt. No chunking strategy, no embedding pipeline, no retrieval step. You just put the documents in the prompt and ask your question.
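In code, “put the documents in the prompt” is about as unglamorous as it sounds. Here is a minimal sketch using the Anthropic Python SDK; the docs directory, question, and model ID are placeholders for whatever long-context model and corpus you actually have.

```python
# Context stuffing: no chunking, no embeddings, no retrieval step.
# Read the corpus, put it in the prompt, ask the question.
from pathlib import Path
import anthropic

corpus = "\n\n---\n\n".join(
    p.read_text() for p in Path("internal-docs/").glob("**/*.md")
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any long-context model works
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            f"Here is our internal documentation:\n\n{corpus}\n\n"
            "Using only the documentation above, answer: How do I request PTO?"
        ),
    }],
)
print(response.content[0].text)
```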
Yes, there are tradeoffs. Longer contexts cost more per query and increase latency. Research from Meibel and others has shown that model attention degrades over very long sequences, and output token generation slows as input length grows. But for the vast majority of RAG deployments — internal chatbots, document Q&A, knowledge base search — the corpus is small enough that these tradeoffs are negligible compared to the engineering overhead of maintaining a full RAG pipeline.
The other major RAG use case is searching over large, existing datasets — social media posts, support tickets, product catalogs, forum threads. The pitch is that vector search gives you semantic retrieval: you can find documents by meaning, not just keywords.
But platforms with large datasets already have search infrastructure. Reddit has search. Twitter has search. Your company probably runs Elasticsearch or Solr or something similar. These systems have been tuned over years for relevance ranking, filtering, faceting, and scale.
The RAG approach says: duplicate all of that data into a vector database, build a separate ingestion pipeline to keep it in sync, and query the vector index for semantic similarity. But this creates real operational overhead. You now have a synchronization problem between your source of truth and your vector index. You have duplicated storage costs. You have a new system to monitor, maintain, and debug. And you have a new failure mode where your vector index is stale or inconsistent with the primary data.
The alternative is simpler: query your existing search infrastructure, take the top N results, and put them in the context window. The language model handles the semantic understanding. You get 80–90% of the benefit of semantic retrieval with zero additional infrastructure. If you need better ranking, you can rerank the search results with an LLM — still no vector database required.
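As a rough sketch, assuming an existing Elasticsearch cluster (the index name, field names, and host are made up for illustration):

```python
# Use the search engine you already run, then hand the hits to the model.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve(question: str, n: int = 10) -> list[str]:
    # Plain keyword relevance from the existing index; no vector store involved.
    resp = es.search(index="support-tickets", query={"match": {"body": question}}, size=n)
    return [hit["_source"]["body"] for hit in resp["hits"]["hits"]]

question = "Why are checkout requests timing out?"
candidates = retrieve(question)

# Stuff the candidates into the prompt; the language model does the semantic work.
prompt = (
    "Here are the most relevant support tickets:\n\n"
    + "\n\n---\n\n".join(candidates)
    + f"\n\nBased on these tickets, answer: {question}"
)

# Optional reranking: ask the model to order `candidates` by relevance to the
# question and keep the top few before building the final prompt. Still no
# vector database involved.
```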
This brings us to the core argument. What do Pinecone, Weaviate, Qdrant, and the other vector databases actually do? They take vectors, build an approximate nearest neighbor (ANN) index, and return the closest matches to a query vector.
That’s search. That’s what search engines do.
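Strip away the branding and the core operation is a few lines of linear algebra; the ANN index exists to approximate this brute-force scan quickly at scale. A toy sketch with random stand-in embeddings:

```python
# What a "vector database" fundamentally does, minus the ANN speedup:
# rank stored vectors by similarity to a query vector.
import numpy as np

corpus = np.random.rand(100_000, 768)  # one row per document (random stand-ins)
query = np.random.rand(768)            # a real one would come from an embedding model

# Cosine similarity, then take the top matches. HNSW/IVF indexes exist to
# approximate this argsort without scanning every row.
scores = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[-5:][::-1]
print(top_k, scores[top_k])
```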
Calling it a “database” implies it’s your source of truth: that it has durability guarantees and ACID properties, that it’s where your data lives. But nobody uses Pinecone as their primary data store. The actual data lives in Postgres, S3, your CMS, wherever. The vector database is just an index sitting on top of it.
Better yet, your existing database probably already supports this. PostgreSQL has pgvector, an open-source extension that adds vector similarity search with HNSW and IVFFlat indexing right alongside your relational data, no separate system required. Elasticsearch supports dense vector fields and has offered native approximate kNN search since version 8.0, so you can add semantic search to your existing cluster without spinning up anything new.
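A sketch of the pgvector route, assuming Postgres with the extension installed; the table name, vector dimension, and connection string are illustrative, and the embeddings come from whatever model you already use:

```python
# Vector search inside the database you already run, via pgvector.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id        bigserial PRIMARY KEY,
        body      text,
        embedding vector(1536)
    );
""")
# HNSW index for approximate nearest-neighbour lookups, the same structure a
# dedicated vector database would build (pgvector 0.5.0+).
cur.execute(
    "CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx "
    "ON doc_chunks USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()

# Query: cosine distance (<=>) against a query embedding, right next to the
# relational data in the same database.
query_embedding = [0.0] * 1536  # stand-in; use a real embedding here
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute(
    "SELECT id, body FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT 5",
    (vector_literal,),
)
print(cur.fetchall())
```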
Vector search is a feature of your existing stack, not a product you need to buy separately. The dedicated vector database market depends on the assumption that you need specialized infrastructure for this workload. For the vast majority of use cases, you don’t.
To be fair, there are real use cases where dedicated vector infrastructure makes sense:
Multimodal search. If you’re searching images, audio, or video by semantic meaning — “find photos similar to this one” — there’s no legacy text search to fall back on. You need embeddings and you need to search over them.
Recommendation systems at scale. Spotify-style or TikTok-style “find similar items” across hundreds of millions of candidates with low latency. You can’t stuff that into a context window, and keyword search doesn’t capture the learned relationships.
Cross-lingual search. Embeddings cluster semantically similar content regardless of language. Traditional search can’t do this natively.
High-volume cost optimization. If you’re making thousands of queries per minute, stuffing full documents into context windows every time gets expensive. Retrieval of targeted chunks can be significantly cheaper at that scale.
Notice what all of these have in common: they’re either genuinely large-scale problems or genuinely novel retrieval problems. None of them are “I want my chatbot to answer questions about our internal docs” — which is approximately 80% of what people are building RAG for today.
In 2026, for most workloads:
If your corpus is small (internal docs, knowledge bases, wikis): skip RAG entirely. Put the documents in the context window. It’s simpler, more accurate, and easier to maintain.
If your corpus is large (millions of records, social media scale): use your existing search infrastructure to retrieve candidates and let the language model handle semantic understanding. You avoid duplicating data and maintaining a parallel system.
If you need vector search: add it as a feature to your existing database. pgvector for Postgres, dense vectors for Elasticsearch. You don’t need a separate product.
Save yourself the chunking strategy debates, the embedding model selection, the retrieval pipeline maintenance, and the vector database vendor lock-in. The context window is big enough now. Your search engine is good enough now. Use them.