LLMs as Retrieval and Recommendation Engines — Part 1


Moein Hasani

There is also a YouTube video walking through this blog post.

Motivation

For the past year, I’ve been exploring how LLMs can be used as retrieval and recommendation agents, and I put together this blog post as a simple introduction to the topic. My goal is to keep it clear and accessible, so even if you don’t have a technical background, don’t worry — this post is written for you too.

I have tried to explain the core concepts as simply as possible, use intuitive examples from different research papers, and, on top of that, provide a short script with code showing how to use LLMs as retrieval agents with vLLM and Hugging Face models (it’s in part 2).


Fig 1. Image generated with ChatGPT: A robot selecting the most relevant documents from a large corpus.

What is retrieval and why does it matter?

Retrieval is the process of finding and returning the most relevant items from a large collection, given a query. That collection could be all the webpages on the internet, the books in a library, or the products in an online store.

Examples are everywhere: Google surfaces relevant URLs from billions of pages in response to your search query, YouTube and Spotify pull up videos or songs you may like from massive catalogs, and Amazon or Instacart return the products that best match what you typed in the search bar or what you may want to purchase today.

Retrieval is the first stage in both search and recommendation engines. Retrieval engines usually return tens or hundreds of candidates, selected from millions or even billions of items, and a ranker then decides the final order presented to the user.

When a user searches for something, related items are retrieved and then ranked before being shown. In recommendation systems, search doesn’t happen explicitly, but the process is the same: based on the user’s past history, we retrieve the items most relevant to the user, and then a ranker orders the retrieved results.

A brief review of retrieval methods (pre-LLMs)

Traditionally, retrieval engines have relied mostly on keyword matching or vector similarity.

Sparse Retrieval (Keyword-Based, the era before Deep Learning):

  • Examples: BM25, TF-IDF
  • How it works: Finds documents that share the most words with the query.
  • Pros: Very fast, no training required, and easy to interpret.
  • Cons: Lacks semantic understanding and struggles with synonyms.
  • Example: If you search ‘How to bake croissants,’ the system might completely miss a document called ‘Making pastries at home’ — even though croissants are pastries — because the query and the title don’t share any words.
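The croissant example above can be sketched in a few lines. This is a toy stand-in for TF-IDF/BM25 (real systems also weight terms by frequency and rarity), scoring each document by how many words it shares with the query:

```python
# Toy keyword-based scoring: count shared words between query and document.
# Real sparse retrievers (TF-IDF, BM25) additionally weight terms, but the
# core weakness shown here is the same: zero overlap means zero score.
def keyword_score(query: str, document: str) -> int:
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)

docs = [
    "How to bake croissants at home",
    "Making pastries at home",
]
query = "how to bake croissants"

for doc in docs:
    print(doc, "->", keyword_score(query, doc))
# The 'pastries' document scores 0 despite being relevant.
```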

Dense Retrieval (Vector-Based, Deep Learning):

  • Examples: DPR, ANCE, ColBERT
  • How it works: Neural encoders embed queries and documents into the same vector space, and retrieval is done through similarity search. As shown in figure 2, there are usually two towers (encoders): one learns embeddings for the query and the other for the document, so that related pairs have a higher dot-product score.
  • Pros: Captures semantic meaning beyond exact keywords; handles synonyms and paraphrasing much better.
  • Cons: Requires storing dense vectors for every document, which can consume massive amounts of memory. Also if you update the encoder (tower), you have to rebuild the entire index, which is pretty costly. In addition, as queries and documents get longer, the performance of embedding-based retrieval systems tends to decline.
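To make the two-tower idea concrete, here is a minimal sketch with hand-made 3-d vectors standing in for what trained query and document encoders would actually produce. The point is that scoring is just a dot product, so semantically related pairs can rank high even with zero word overlap:

```python
# Toy dense retrieval: relevance is a dot product in a shared vector space.
# These tiny vectors are made up for illustration; real encoders output
# hundreds of dimensions learned from data.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Pretend this came from the query tower for "how to bake croissants".
query_vec = [0.9, 0.1, 0.0]

# Pretend these came from the document tower.
doc_vecs = {
    "Making pastries at home": [0.8, 0.2, 0.1],    # semantically close
    "Fixing a flat bicycle tire": [0.0, 0.1, 0.9], # unrelated
}

ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
print(ranked[0])  # the pastry document ranks first despite no shared words
```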


Fig 2. An example of a two-tower dense retrieval architecture. From “Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering”, Zhu et al., 2021

Generative Information Retrieval

The new approach is Generative Information Retrieval (GenIR). Instead of relying on keyword or vector similarity, we directly use an LLM to generate the titles or identifiers (DocIDs) of relevant items. The idea is simple: feed the model a user query or profile (purchase history, past interactions, etc.), then instruct it to output the top-k most related items. Given access to the full corpus of items, the LLM can retrieve the most related items directly.
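A hypothetical GenIR prompt might look like the sketch below. The wording and the helper name are illustrative assumptions, not taken from any specific paper; the point is simply that the LLM is asked to emit item titles directly rather than score candidates:

```python
# Hypothetical GenIR prompt builder: the model is instructed to generate
# the titles of the top-k relevant items directly. The exact wording is
# an illustrative assumption, not a prescribed template.
def build_genir_prompt(query: str, k: int = 5) -> str:
    return (
        "You are a retrieval engine for an online store.\n"
        f"User query: {query}\n"
        f"Return the titles of the {k} most relevant catalog items, "
        "one per line, and nothing else."
    )

print(build_genir_prompt("waterproof trail running shoes", k=3))
```

In practice, this string would be sent to an LLM (e.g., via vLLM, as in part 2), ideally with the constrained decoding described later so outputs stay inside the catalog.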

LLM as the recommendation agent


Fig 3. Different ways of using an LLM as a recommender. From “Recommender Systems in the Era of Large Language Models (LLMs)”, Zhao et al., 2024

In recommendation systems, there is always one or more ranking steps after retrieval, and the same applies when using LLMs. There are different ways of using LLMs in recommendation systems:

  • LLM for retrieval + a separate ranker: The LLM retrieves candidate items, and a separate ranker (which could also be an LLM) orders them by relevance.
  • LLM for both retrieval and ranking: A single LLM retrieves items and directly ranks them, unifying both steps. This end-to-end approach is an emerging research trend aimed at simplifying recommendation pipelines.

One example of using LLMs for item recommendation is shown in figure 4: the user’s past purchase history is given to the LLM as a list, and the model is tasked with predicting the next item. In this way, the LLM directly outputs the product most likely to be purchased next. To make the model more familiar with the full catalog of items, a fine-tuning step is often helpful.
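A figure-4-style prompt could be assembled along these lines. This is a sketch, not the exact CALRec prompt; the function name and wording are assumptions made for illustration:

```python
# Sketch of a next-item-prediction prompt: the purchase history goes in as
# a numbered list and the model is asked for a single next item. Wording
# is illustrative, not the exact prompt from the CALRec paper.
def build_next_item_prompt(purchase_history: list[str]) -> str:
    history = "\n".join(
        f"{i + 1}. {item}" for i, item in enumerate(purchase_history)
    )
    return (
        "A user has purchased the following items, in order:\n"
        f"{history}\n"
        "Predict the title of the item they are most likely to buy next. "
        "Answer with the title only."
    )

print(build_next_item_prompt(["running shoes", "sports socks", "water bottle"]))
```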


Fig 4. An example prompt from an LLM-based recommendation agent that suggests products to a user based on their purchase history, adapted from “CALRec: Contrastive Alignment of Generative LLMs for Sequential Recommendation” (Li et al., 2024).

Semantic IDs vs Titles

Instead of generating full document or product titles, LLMs can also retrieve using Semantic IDs. These are structured identifiers made up of a combination of numbers (or tokens), where each part of the sequence encodes specific attributes of an item.

For example, a DocID consisting of 3 numbers for a product catalog:

  • The first number might indicate the category (shoes),
  • The second the brand (Nike),
  • The third the color (orange).

These IDs aren’t assigned randomly — they are usually learned by quantizing semantic embeddings of items (e.g., from a pretrained text encoder like Sentence-BERT or T5). Methods such as Residual Quantized Variational Autoencoders (RQ-VAE) or product quantization map each item’s embedding into a tuple of discrete codes, which form its Semantic ID.

This hierarchical structure makes Semantic IDs compact, consistent, and easier for models to generate than long free-text titles. However, to use them effectively, the LLM typically needs to be trained on these IDs — either during pretraining or through later supervised finetuning — so it learns the mapping between queries and the correct ID sequences.
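A heavily simplified sketch of how an embedding becomes a Semantic ID: at each level, pick the nearest codebook centroid, subtract it, and quantize the residual. Real systems learn the codebooks end-to-end (e.g., with an RQ-VAE); the tiny 2-d codebooks here are made up purely for illustration:

```python
# Simplified residual quantization: each level picks the nearest centroid
# from its codebook and passes the residual to the next level. The codes
# chosen at each level form the item's Semantic ID.
def nearest(vec, codebook):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))

def semantic_id(embedding, codebooks):
    codes, residual = [], list(embedding)
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

# Toy codebooks: coarser attributes first, finer ones later.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],  # level 1 (e.g., category)
    [[0.2, 0.0], [0.0, 0.2]],  # level 2 (e.g., brand)
]
print(semantic_id([1.1, 0.2], codebooks))  # -> (0, 1)
```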


Fig 5. From “Recommender Systems with Generative Retrieval”, Rajput et al., 2023

Semantic IDs have a caveat: new items in the catalog must first be assigned IDs, and unless the model is updated or fine-tuned with those IDs, it may fail to retrieve them. Titles, by contrast, are more flexible, since LLMs can often generalize to unseen text.

Handling large item corpora and hallucinations

For the LLM to output items from our corpus, it either needs to have seen them during training, or we have to provide the full list in the prompt.

If fine-tuning is required, that means we can’t just use the LLM out of the box — we’d need to train it on our own data, which takes extra effort. And even if we skip fine-tuning, there’s still the problem of context limits. Some models now support million-token prompts, but it’s not practical to cram an entire catalog into the context. Even if we do, the model might still hallucinate and output items that don’t exist.

The solution is Constrained Decoding. By restricting the output space at each generation step, we ensure the model can only produce valid sequences that correspond to real item titles (or DocIDs). One common method is using a prefix tree (trie) built on the catalog. For instance, in a shoe store catalog, the tree ensures the model generates only valid product titles like “Nike Running Shoes” or “Adidas Sandals”, never random or made-up ones.

Think of Constrained Decoding as putting the model on train tracks: it can move forward in different directions, but only along rails you’ve already laid down.
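The train-tracks idea can be sketched with a small prefix tree over the catalog. Following the figure's simplification, each word is treated as one token; at every decoding step, only the continuations present in the trie would be allowed (in a real decoder, the LLM's next-token distribution is masked accordingly):

```python
# Minimal trie-based constrained decoding sketch over a shoe catalog.
# Each word is treated as one token for simplicity, as in the figure.
def build_trie(titles):
    trie = {}
    for title in titles:
        node = trie
        for token in title.split():
            node = node.setdefault(token, {})
    return trie

def allowed_next_tokens(trie, prefix_tokens):
    # Walk the trie along the tokens generated so far, then return the
    # only tokens the model would be allowed to emit next.
    node = trie
    for token in prefix_tokens:
        node = node.get(token, {})
    return sorted(node)

catalog = ["Nike Running Shoes", "Nike Sandals", "Adidas Sandals"]
trie = build_trie(catalog)

print(allowed_next_tokens(trie, []))        # ['Adidas', 'Nike']
print(allowed_next_tokens(trie, ["Nike"]))  # ['Running', 'Sandals']
```

At generation time, tokens outside this allowed set would get their probability masked to zero, so the model can only complete real catalog titles.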


Fig 6. An example of a prefix tree built on the product catalog of a shoe store. For simplicity, each word is treated as one token here.

Putting It All Together

In summary, we can provide the LLM with the user’s query and/or profile, and have it generate valid item titles only from the constrained output space. This way, retrieval stays efficient and firmly grounded in the actual catalog.

Moreover, many research works show that fine-tuning the LLM further boosts performance. This usually involves creating training pairs of (query, relevant item) and applying supervised finetuning so the model learns retrieval behavior more reliably.

Yes, But

  • Not always a replacement: Think of it this way: if you just type ‘Toronto weather today,’ a classical search engine is more than enough. But if you write a long, nuanced request like ‘Show me some places for a two-day trip in Toronto with family-friendly activities near a lake,’ that’s where GenIR really shines.
  • Model size matters: Bigger models, like GPT-5, can often beat traditional retrieval and recommendation systems on nuanced tasks. But a smaller model — say, a 1B-parameter LLaMA — won’t automatically outperform well-tuned classical methods.
  • Fine-tuning helps: Training on query–document pairs can significantly boost retrieval accuracy. Practical tip: Use a powerful LLM (that performs well in your evaluations) to generate labels, then distill knowledge into a smaller, cheaper model for production.

Parting thoughts

Using LLMs as retrieval and recommendation engines is still a new and fast-growing area of research, but already shows strong results. For instance, Bevilacqua et al.’s SEAL generative retriever outperforms many dual-encoder baselines on various KILT tasks, and Rajput et al.’s TIGER framework achieves up to 29% improvement in NDCG@5 on the Amazon Beauty dataset.

Several major companies — including Netflix, Google, and Meta — are actively exploring Generative Retrieval and unified retrieval-ranking approaches. Recent studies (such as the ones mentioned in this blog post) show that these methods often outperform other retrieval and recommendation systems, while also being more flexible in handling complex queries and recommendation tasks.

We’re still early in this shift, but it’s clear retrieval is moving from keywords and embeddings toward LLM-powered generation. The ideas we’ve seen in research are now being tested in production, making this the perfect time for you to start experimenting with GenIR.

A few essential reads:

One of the early papers discussing GenIR and introducing constrained generation with LLMs (SEAL framework):

  • Bevilacqua, Michele, et al. “Autoregressive search engines: Generating substrings as document identifiers.” Advances in Neural Information Processing Systems 35 (2022): 31668–31683.

This paper discusses the use of DocIDs and finetuning LLMs for next product recommendation (TIGER framework).

  • Rajput, Shashank, et al. “Recommender systems with generative retrieval.” Advances in Neural Information Processing Systems 36 (2023): 10299–10315.

A comprehensive survey going through most important works in this area:

  • Zhao, Zihuai, et al. “Recommender systems in the era of large language models (LLMs).” IEEE Transactions on Knowledge and Data Engineering 36.11 (2024): 6889–6907.

Next: code tutorial

In part 2, I’ll actually build a toy retrieval engine with an open-weight LLM, so you can see these ideas in action. So make sure to check it out!