What I Learned from Building Search at Google and How It Inspires RAG Development


Jiang Chen

From 2019 to 2023, I worked at Google on Search, which the public often considers a “solved” problem. Far from being solved, Search is a constantly evolving technology that requires a complex system and methodology to develop. Later, I moved to a startup to work on Retrieval Augmented Generation (RAG), which, in my opinion, is a new and more accessible form of search powered by Large Language Models (LLMs) and other AI tools. I realized that the insights gained from Search are still incredibly valuable in today’s RAG development.

RAG is based on a very intuitive idea: to reduce LLM hallucination, it retrieves the most relevant information from a knowledge base and uses it to ground the LLM’s answer. The information retrieval part resembles Search, a system that strives to find the most relevant and highest-quality information from a pool of sources. Just as Search is a complex system, a minimum viable RAG encompasses many components: a document parser, an embedding model, a vector database, a prompt, and an LLM. While developers focus on adopting various optimization techniques to improve RAG quality, most fail to realize that this is a system problem. Optimizing one component does not guarantee good overall performance unless there is a scientific way to test end-to-end quality.
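To make the component list concrete, here is a minimal sketch of such a pipeline, assuming the document parser has already chunked documents into a populated Milvus collection named "docs"; the model names, collection name, and wiring are illustrative assumptions, not a reference implementation.

```python
# Minimal RAG sketch: embedding model + vector database + prompt + LLM.
# Assumes a parser has already chunked documents into the "docs" collection.
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model (assumed choice)
vector_db = MilvusClient("rag_demo.db")             # vector database (Milvus Lite file)
llm = OpenAI()                                      # LLM used to generate the answer

def answer(question: str) -> str:
    # 1. Retrieve: embed the question and search the knowledge base.
    query_vec = embedder.encode(question).tolist()
    hits = vector_db.search(
        collection_name="docs", data=[query_vec], limit=3, output_fields=["text"]
    )
    context = "\n".join(hit["entity"]["text"] for hit in hits[0])

    # 2. Ground the LLM: put the retrieved chunks into the prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Even in this toy version, answer quality depends on every stage at once: how documents were chunked, which embedding model is used, how many chunks are retrieved, and how the prompt is phrased.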

There are other examples of how the “old science” of search can help today’s AI developers. I will share a few in this article.


How the RAG tech stack resembles Search

Measure It Before Optimizing It

To the disappointment of many, a hello-world RAG system doesn’t always return satisfying answers, and many developers are eager to find optimizations that improve answer quality. However, a “good answer” is highly subjective. Unless an answer to a factual question is outright wrong, people judge it on a finer scale: its completeness, its helpfulness, and whether it directly addresses the user’s intent. What users need is not a system that does extremely well on one query and fails on many others; the goal is a system that performs consistently well across all possible queries. In practice this is hard, because improving performance on one query can hurt others. Therefore, the system must be optimized in a way that balances its performance across the whole range of queries.

At Google Search, this is addressed by a process called “quality evaluation”. As explained in How Search Works, a team of trained human raters serves as the ultimate judge. Following detailed guidelines, they rate the quality of results for a search query by comparing the results with and without an experimental change. Hundreds or more queries are tested, and engineers and analysts study the statistics to avoid judgments biased toward a small set of special cases. This process helps improve the overall quality of search results.

This methodology is highly effective for RAG development as well. If you want better than out-of-the-box RAG performance, quality evaluation is crucial. Instead of relying on magical techniques found in a random paper, you need to quantitatively measure the impact of any candidate change to ensure it improves the system overall. Fortunately, LLMs like GPT-4 have judgment close to that of an ordinary human, so you don’t need to build a team of trained raters just to score Q&A pairs. There are emerging projects that automate the scoring process, such as ragas, TruLens, and Phoenix, but you still need a dataset of queries labeled with good answers to run these evaluations on.
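As an illustration, here is a minimal sketch of LLM-based scoring over a labeled query set. The judging prompt, the 1–5 scale, the model name, and the example query are assumptions, not the rubric of any of the frameworks above.

```python
# Sketch of automated "quality evaluation" with an LLM judge over a labeled set.
from openai import OpenAI

judge = OpenAI()

# Labeled evaluation set: each query paired with a known-good reference answer.
eval_set = [
    {"query": "What is the default Milvus port?", "reference": "19530"},
    # ... more labeled queries
]

def judge_answer(query: str, reference: str, candidate: str) -> int:
    """Ask an LLM to grade a candidate answer against the reference (1-5)."""
    prompt = (
        "Rate the candidate answer from 1 (wrong) to 5 (complete and correct), "
        "judging correctness, completeness, and helpfulness.\n"
        f"Question: {query}\nReference answer: {reference}\n"
        f"Candidate answer: {candidate}\nReply with a single digit."
    )
    resp = judge.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return int(resp.choices[0].message.content.strip()[0])

def evaluate(rag_answer_fn) -> float:
    """Average judge score over the whole eval set, so a candidate change is
    measured on many queries rather than one favorable case."""
    scores = [
        judge_answer(ex["query"], ex["reference"], rag_answer_fn(ex["query"]))
        for ex in eval_set
    ]
    return sum(scores) / len(scores)
```

Running `evaluate()` before and after a candidate change gives a single number to compare, which is exactly the role the human-rater statistics play at Google.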

Use a Comprehensive Dataset

As mentioned above, the performance of a system can’t be judged on a single good or bad case; it’s the cumulative score on a wide variety of cases that is truly representative. The blend of questions should be designed for a robust evaluation, including cases that can be answered by a short paragraph in the document as well as those that require summarizing several pieces (which are harder to answer). Furthermore, the questions need to cover the diverse range of topics that users may ask about in the real world.

Curating such a dataset can be extremely costly. The best approach is, of course, sampling real user queries from your production system, since no model represents production better than a sample of it. However, obtaining and labeling sample queries from production traffic may not be feasible for everyone. It is especially unrealistic for someone who has just started developing RAG, where production traffic doesn’t exist yet.

There are a few ready-to-use datasets that provide comprehensive cases covering a wide variety of question types and topic domains. MS MARCO and BEIR are the most widely adopted. However, there is a risk that most publicly available RAG components, open-source embedding models for example, have been overfitted to them, as their training data and evaluation methods tend to skew heavily toward these two datasets.

Your Mileage May Vary

Information retrieval is a complex problem, and there is no one-size-fits-all solution. The requirements for searching for a beautiful hat on macys.com differ greatly from those for searching for a software bug on stackoverflow.com. The same principle applies to RAG applications. I have observed vastly different use cases of RAG, from asking about the interactions between characters in a fictional book to asking how a specific API works based on the technical documentation of an open-source project. What works well for someone else’s specific use case, or even a general one, might not work as effectively for your particular situation.

This is why Google has many Search products, each with its own evaluation dataset. (Yes, Google Search is not a monolithic product.) The basic intuition: recipe Search and shopping Search are designed for completely different queries.

To reliably reflect how well a RAG system will perform in your target scenario, the best approach is to curate your own dataset, or at least include a blend of indicative real-world cases in the evaluation. For example, use a general dataset such as MS MARCO as the main test, but also test on a private dataset of a few dozen questions tailored to your use case, as sketched below.
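A rough sketch of such a blend might look like the following; the file paths, field names, and sample sizes are assumptions, and the private file is whatever hand-labeled questions you have collected for your domain.

```python
# Sketch: blend a slice of a public benchmark with a small domain-specific set.
import json
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# General coverage: a sample of queries from a public benchmark (e.g. an
# MS MARCO export), already converted to {"query": ..., "reference": ...}.
general = random.sample(load_jsonl("msmarco_sample.jsonl"), 200)

# Domain coverage: a few dozen hand-labeled questions from your own use case,
# mixing short fact lookups with questions that require summarizing several chunks.
private = load_jsonl("my_domain_eval.jsonl")

eval_set = general + private
print(f"{len(general)} general + {len(private)} domain-specific queries")
```

The general slice guards against regressions on broad retrieval quality, while the private slice catches failures that only show up on your own documents and question style.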

Build a Continuous Cycle with Feedback

Developing a successful information retrieval system requires a systematic quality evaluation and review process, and it’s essential to integrate that process into the development lifecycle. Quality is not the outcome of a single magic technique; it is the result of many small improvements. In 2022, Google ran over 800,000 experiments that resulted in more than 4,000 improvements to Search. In addition, live traffic experiments are conducted to gather real user feedback on experimental improvements, and only the best candidates are kept.

A RAG service that aims to continuously improve its quality can follow the same process. First, an improvement idea is evaluated offline using self-curated datasets. If it passes the offline evaluation, a small portion of online traffic is served with the new change. The effectiveness of the change is judged again based on user feedback metrics, such as engagement or like/dislike ratings. This process can be run repeatedly on different ideas to constantly improve the quality of the production system.
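The loop can be sketched roughly as follows; the offline threshold, the 5% traffic slice, and the like/dislike metric are illustrative assumptions rather than recommended values.

```python
# Sketch of the offline-gate-then-online-experiment loop described above.
import random

OFFLINE_THRESHOLD = 0.0    # candidate must not regress the offline eval score
EXPERIMENT_TRAFFIC = 0.05  # fraction of live traffic served by the candidate

def offline_gate(baseline_score: float, candidate_score: float) -> bool:
    """Step 1: only candidates that beat the baseline on the curated
    eval set move on to a live experiment."""
    return candidate_score - baseline_score > OFFLINE_THRESHOLD

def route(query: str, baseline_fn, candidate_fn) -> dict:
    """Step 2: serve a small slice of production traffic with the candidate,
    tagging each response so user feedback can be attributed to a variant."""
    if random.random() < EXPERIMENT_TRAFFIC:
        return {"variant": "candidate", "answer": candidate_fn(query)}
    return {"variant": "baseline", "answer": baseline_fn(query)}

def decide(feedback: list) -> bool:
    """Step 3: keep the candidate only if its like-rate beats the baseline.
    `feedback` is a list of {"variant": ..., "liked": bool} records."""
    def like_rate(variant: str) -> float:
        votes = [f["liked"] for f in feedback if f["variant"] == variant]
        return sum(votes) / max(len(votes), 1)
    return like_rate("candidate") > like_rate("baseline")
```

Each idea that survives both gates becomes the new baseline, and the cycle repeats, which is how many small wins compound into large quality gains.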


Continuous Quality Improvement Cycle

RAG is an exciting technology that has revolutionized search. For a long time, the only search technology available to the open-source community was based on keyword matching. Now, with general-purpose embedding models and vector database technologies, semantic search is no longer the privilege of large companies with dedicated machine learning teams. Open-source tools make it possible for every developer to build private search with semantic understanding beyond mechanical keyword matching. However, returning accurate and relevant results across any type of document is still a challenge. To overcome it, it’s worth learning how web search technology has evolved.

With these lessons learned, at Zilliz we partnered with evaluation frameworks to test the quality of different RAG implementations on the market (example). We also applied this idea of continuous quality evaluation and improvement to build a retrieval API service, Zilliz Cloud Pipelines, offering state-of-the-art retrieval quality and DevOps ease to developers. In the future, we will publish a series of articles on zilliz.com/learn to share our practice of RAG development and quality evaluation. Please let me know if you would like to read more content like this!