Multistage RAG with LlamaIndex and Cohere Reranking: A Step-by-Step Guide


Michael Ryaboy


Retrieval Augmented Generation (RAG) is a powerful technique that allows language models to draw upon relevant information from external knowledge bases when generating responses. However, the effectiveness of RAG depends heavily on the quality of the retrieved results. In this article, we’ll explore an advanced multistage RAG architecture that uses LlamaIndex for indexing and retrieval, and Cohere for semantic reranking. We’ll provide a detailed, step-by-step guide to implementing this architecture, complete with code snippets from the accompanying Colab notebook.

To provide the best possible context for our LLM, we need to surface the most relevant snippets we can. When the document store is large, it’s difficult to retrieve highly relevant documents in a single step. To remedy this, we will retrieve in two stages: first searching individual sentences to find any documents relevant to our query vector, and then reranking the wider context in which each sentence was found. Luckily, the SentenceWindowNodeParser from LlamaIndex allows us not only to split our documents into sentences, but also to attach metadata, in this case the three sentences on each side of the target sentence. This will come in handy in our reranking step!

Here’s a full image of our pipeline. Don’t be intimidated! We can achieve this with very little code:

[Diagram: the full multistage RAG pipeline]

Step 1: Set Up the Environment

First, let’s install the necessary libraries:

!pip install cohere spacy llama-index kdbai_client llama-index-vector-stores-kdbai llama-index-embeddings-fastembed

Then, import the required modules:

from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
from fastembed import TextEmbedding
import kdbai_client as kdbai
import pandas as pd
import cohere
import time

Step 2: Data Preparation

We’ll be using the Paul Graham Essay Dataset as our knowledge corpus. Download the dataset:

!llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
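
The snippets that follow assume the essays have already been loaded into a docs list. Here is a minimal sketch of that loading step, assuming the CLI’s default output layout of a rag_dataset.json file plus a source_files/ directory under ./data (adjust the paths if your download differs):

# Load the Paul Graham essays that the CLI downloaded into ./data.
# The paths below assume the default layout (rag_dataset.json + source_files/).
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
docs = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
print(f"Loaded {len(docs)} document(s)")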

Step 3: KDB.AI Setup

We’re using KDB.AI here due to its fast insertion speeds and support for metadata filtering. However, if you have only a few thousand documents, you might not need multistage retrieval or even a vector database — Cohere reranking on its own can be a perfectly reasonable solution.
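
To make that concrete, here is a minimal rerank-only sketch for a small corpus, with no vector database involved (the corpus, API key, and top_n below are placeholder choices of ours, not from the notebook):

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key

# With only a few thousand chunks, reranking all of them against the query
# can be good enough on its own.
small_corpus = [
    "Essay chunk one ...",
    "Essay chunk two ...",
    "Essay chunk three ...",
]

results = co.rerank(
    model="rerank-english-v3.0",
    query="How do you decide what to work on?",
    documents=small_corpus,
    top_n=3,
)
for r in results.results:
    print(r.index, round(r.relevance_score, 3))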

You can sign up for KDB.AI Server for free here: https://trykdb.kx.com/kdbai/signup/

Create a KDB.AI session and table to store the embeddings:

# Start a session with KDB.AI Server
session = kdbai.Session(endpoint="http://localhost:8082")

KDBAI_TABLE_NAME = "company_data"

# Schema definition
schema = [
    {"name": "document_id", "type": "bytes"},
    {"name": "text", "type": "bytes"},
    {"name": "embedding", "type": "float64s"},
]

# Index definition for the embedding column
indexes = [
    {
        "name": "embedding_index",                 # name of the index
        "type": "flat",                            # flat (brute-force) index type
        "params": {"dims": 384, "metric": "L2"},   # dimensions and distance metric
        "column": "embedding",                     # the column the index refers to
    }
]

# Reference the 'default' database
database = session.database("default")

# Ensure no table called "company_data" already exists
try:
    for t in database.tables:
        if t.name == KDBAI_TABLE_NAME:
            t.drop()
            time.sleep(5)
except kdbai.KDBAIException:
    pass

# Create the table with the specified schema and index definition
table = database.create_table(KDBAI_TABLE_NAME, schema=schema, indexes=indexes)

Step 4: Parsing Documents into Sentences

The core idea behind our multistage RAG approach is to index and retrieve at the granularity of individual sentences, while providing the language model with a broader sentence window as context for generation.

We use LlamaIndex’s SentenceWindowNodeParser to parse documents into individual sentence nodes, while preserving metadata about the surrounding sentence window.

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(docs)
parsed_nodes = [node.to_dict() for node in nodes]

Here, we use a window_size of 3, meaning that for each sentence we keep the 3 sentences before and the 3 sentences after as its "window". The window is stored in the node’s metadata.
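
To see what a node carries, we can print one sentence and its surrounding window (the output will of course depend on your corpus):

# Peek at a parsed node: the target sentence plus its surrounding window.
example = nodes[0]
print("Sentence:", example.metadata["original_text"])
print("Window:  ", example.metadata["window"])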

Here is an example pipeline for Sentence Window Retrieval without reranking:

[Diagram: Sentence Window Retrieval pipeline without reranking]

It’s worth noting that sentence window parsing is just one type of small-to-big retrieval. Another approach is to use smaller child chunks that refer back to bigger parent chunks. That strategy isn’t included in this notebook, but here is a diagram of chunking-based small-to-big retrieval, followed by a rough sketch of the idea:

[Diagram: Chunking-Based Small-to-Big Retrieval Pipeline]
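
As a rough, plain-Python sketch of that variant (the chunk sizes and the split_text helper are illustrative assumptions of ours, not LlamaIndex APIs):

def split_text(text: str, size: int) -> list[str]:
    # Naive fixed-size splitter, purely for illustration.
    return [text[i:i + size] for i in range(0, len(text), size)]

parent_chunks = {}   # parent_id -> large chunk of text
child_records = []   # (small_chunk, parent_id) pairs to embed and index

for doc_id, doc in enumerate(docs):
    for p_idx, parent in enumerate(split_text(doc.text, 2048)):
        parent_id = f"{doc_id}-{p_idx}"
        parent_chunks[parent_id] = parent
        for child in split_text(parent, 256):
            child_records.append((child, parent_id))

# At query time: search over the small child chunks, then pass the matching
# parent chunks to the reranker/LLM, exactly as we do with sentence windows.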

Step 5: Indexing and Storing Embeddings

Next, we generate embeddings for each sentence node using FastEmbed and store them in our KDB.AI table. FastEmbed is a fast, lightweight library for generating embeddings that supports many popular text models; its default Flag Embedding model produces 384-dimensional vectors, matching the dims we set on our index. Before embedding, we pair each sentence with an id for its parent window and keep a lookup from that id back to the window text, so the windows can be recovered after retrieval.

# Build (sentence, parent_id) pairs plus a parent_id -> window-text lookup from
# the parsed nodes, so each retrieved sentence can be traced back to its window.
sentence_parentId = []
parentid_parentTexts = {}
for node in parsed_nodes:
    sentence_parentId.append((node["metadata"]["original_text"], node["id_"]))
    parentid_parentTexts[node["id_"]] = node["metadata"]["window"]

parent_ids = []
sentences = []

embedding_model = TextEmbedding()

for sentence, parent_id in sentence_parentId:
    parent_ids.append(parent_id)
    sentences.append(sentence)

embeddings = [emb.tolist() for emb in embedding_model.embed(sentences)]

# Convert document_id and text to bytes to match the table schema
parent_ids_bytes = [str(parent_id).encode("utf-8") for parent_id in parent_ids]
sentences_bytes = [str(sentence).encode("utf-8") for sentence in sentences]

# Create a DataFrame of records to insert
records_to_insert_with_embeddings = pd.DataFrame({
    "document_id": parent_ids_bytes,
    "text": sentences_bytes,
    "embedding": embeddings,
})

# Insert the DataFrame into the table
table = database.table(KDBAI_TABLE_NAME)
table.insert(records_to_insert_with_embeddings)

Step 6: Querying and Reranking

With our knowledge indexed, we can now query it with natural language questions. The retrieval process has two stages:

  1. Initial sentence retrieval
  2. Reranking based on sentence windows

For the initial retrieval, we generate an embedding for the query and use it to retrieve the 1500 most similar sentences from the vector database. The figure of 1500 is somewhat arbitrary, but it pays to go big here: we don’t want to miss any sentence whose window might turn out to be relevant.

# Embed the query and convert the embedding to a plain list
query = "How do you decide what to work on?"

query_embedding = list(embedding_model.embed([query]))[0].tolist()

# Perform the search against the sentence-level index
search_results = database.table(KDBAI_TABLE_NAME).search(
    vectors={"embedding_index": [query_embedding]},
    n=1500,  # retrieve 1500 preliminary results
)

# The client returns one DataFrame per query vector; we sent a single query
search_results_df = search_results[0]
print(search_results_df)

Performing this first-pass retrieval at the sentence level ensures we don’t miss any potentially relevant windows.

The second stage is where the magic happens. We take the unique sentence windows from the initial retrieval results and rerank them using Cohere’s powerful reranking model. By considering the entire window, the reranker can better assess the relevance to the query in context.

# Collect the unique parent (window) ids from the sentence-level hits
unique_parent_ids = search_results_df["document_id"].unique()

# Map each id back to its window text (ids come back as bytes, so decode first)
texts_to_rerank = []
for pid in unique_parent_ids:
    pid = pid.decode("utf-8") if isinstance(pid, bytes) else pid
    if pid in parentid_parentTexts:
        texts_to_rerank.append(parentid_parentTexts[pid])

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder API key

reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=texts_to_rerank,
    top_n=len(texts_to_rerank),
)

After reranking, the top sentence windows provide high-quality, contextually relevant information to be used for generating the final response.
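
The article doesn’t show the generation call itself. As a minimal sketch of that final step, assuming Cohere’s chat endpoint with the command-r model and a top_k of our choosing (assumptions of ours, not prescriptions from the article), you could pass the top windows in as grounding documents:

# Take the top few reranked windows as grounding context (top_k is our choice).
top_k = 5
top_windows = [texts_to_rerank[r.index] for r in reranked.results[:top_k]]

# Ask the generative model to answer using only those windows.
response = co.chat(
    model="command-r",
    message=query,
    documents=[{"text": window} for window in top_windows],
)
print(response.text)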

This multistage approach offers several key advantages:

  1. Indexing and initial retrieval at the sentence level are fast and memory efficient.
  2. The initial sentence retrieval stage is highly scalable and can support very large knowledge bases.
  3. Reranking based on sentence windows allows incorporating broader context without sacrificing the specificity of the initial retrieval.
  4. Using an external reranking model allows leveraging a larger, more powerful model for assessing relevance, while keeping the main generative model lightweight.
  5. Providing sentence windows as context to the generative model strikes a balance between specificity and sufficient context.

Multistage RAG with LlamaIndex and Cohere showcases the power of thoughtful retrieval architectures for knowledge-intensive language tasks. By indexing at a granular sentence level, performing efficient initial retrieval, and reranking with a powerful model, we can provide high-quality, contextually relevant information to generative language models — enabling them to engage in grounded, information-rich conversations without sacrificing specificity or efficiency.

To learn more about optimizing RAG for production and making the most of vector databases, check out the KDB.AI Learning Hub, which is chock-full of useful resources.

Connect with me on LinkedIn for more AI Engineering tips.

I also encourage you to experiment with this approach on your own datasets and knowledge domains. The full code is available in the accompanying Colab notebook below.

https://colab.research.google.com/drive/1r-4g-r9JphE6qEKX4Vap-DupZ3AWH2nW?usp=sharing