Late Chunking vs Contextual Retrieval: The Math Behind RAG's Context Problem
The Unseen Loss of Context When Chunking and Embedding Documents
In retrieval-augmented generation (RAG), the quality of document embeddings plays a crucial role in retrieval accuracy. While much attention is paid to document segmentation strategies, a critical challenge often goes overlooked: how to create embeddings that preserve the broader context of the document, even when working with individual segments.
To the human eye, a document is rich with interwoven ideas, definitions, and references. When we chunk for AI processing, we're essentially breaking these connections, reducing a holistic narrative to isolated fragments that often miss the big picture. This can lead to retrieval failures, decreased relevance, and lower performance in RAG systems.
We can see this clearly in the following image: when we split this Wikipedia article section into chunks, phrases like 'its' and 'the city' refer to Berlin, but the embedding model has no way of knowing that, so the vector representations of those chunks end up all out of whack!
[Image Source https://jina.ai/news/late-chunking-in-long-context-embedding-models/]
The solution to this problem lies in two advanced embedding strategies: Late Chunking and Contextual Retrieval. Both offer unique ways to preserve context, each approaching the problem from different angles. In this article, we'll mathematically define the problem of chunking in RAG, explore how each of these methods works, and compare their strengths and trade-offs.
The Mathematical Challenge of Adding Context: What We're Losing in Splits
To understand the limitations of naive chunking, let's break down what happens when we embed a document in fragments rather than as a whole.
Imagine a document 𝐷 composed of tokens (𝑑₁, 𝑑₂, ..., 𝑑ₙ), where the full meaning is spread across the entire sequence. We often represent documents as embeddings 𝐸(𝐷) in a high-dimensional vector space to capture their semantic content. When chunked, however, 𝐷 is split into sub-documents {𝐷₁, 𝐷₂, …, 𝐷ₖ}, each with its own embedding 𝐸(𝐷ᵢ).
The Fragmentation Problem
When we split the document into chunks, each chunk's embedding 𝐸(𝐷ᵢ) represents only a fraction of 𝐷's total meaning. As a result, important semantic connections—such as pronoun references, acronyms, or ideas spanning multiple chunks—may be lost. Imagine you are reading a book and see a chapter. You would understand it very differently than seeing the same chapter, having never read the book!
Mathematically, if we define the document embedding 𝐸(𝐷) as an aggregation of the chunk embeddings:

𝐸(𝐷) ≈ Agg(𝐸(𝐷₁), 𝐸(𝐷₂), ..., 𝐸(𝐷ₖ))

the approximation becomes less accurate as chunking reduces coherence. Moreover, for each individual chunk:

𝐸(𝐷ᵢ) ≠ 𝐸ₜᵣᵤₑ(𝐷ᵢ | 𝐷)

where 𝐸ₜᵣᵤₑ(𝐷ᵢ | 𝐷) represents how chunk 𝑖 should be embedded given the full context of document 𝐷. The embedded vectors 𝐸(𝐷ᵢ) may therefore not only lose or distort the information carried by 𝐷 as a whole but also fail to capture the true semantic meaning of the chunks themselves, especially if split mid-sentence or across conceptually linked paragraphs.
This loss of semantic integrity—both at the document and chunk level—is the main challenge in naive chunking strategies, which both Late Chunking and Contextual Retrieval aim to solve.
Strategy 1: Late Chunking—Embedding the Whole Before Splitting
Concept and Mechanism
Late Chunking, introduced by Jina AI, approaches the document coherence problem differently than traditional embedding methods. Rather than splitting the document and processing chunks in isolation, late chunking maintains document-wide context while still producing distinct chunk embeddings.
The process works as follows:
- Initial Chunking: Given a document D, we first analyze the text to determine the desired chunk boundaries (c₁, ..., cₙ) ← Chunker(D, S) using any chosen chunking strategy S (e.g., fixed token length, sentence boundaries). Common choices include LangChain's RecursiveCharacterTextSplitter or a text segmentation library like NLTK or spaCy. Jina's segmentation API uses a very complex regex, which is an unusual approach but seems to work well.
- Full Document Processing: Instead of embedding these chunks separately, the entire document is tokenized into (τ₁, ..., τₘ) with corresponding character lengths (o₁, ..., oₘ). The transformer model then processes all tokens together as one sequence to produce token embeddings (ϑ₁, ..., ϑₘ). Each token embedding ϑᵢ is a high-dimensional vector that represents that token's meaning in the context of the full document.
- Chunk Pooling: Finally, each chunk embedding is obtained by mean pooling the token embeddings that fall within that chunk's boundaries. More generally, for any chunk with start position cueₛₜₐᵣₜ and end position cueₗₐₛₜ, its embedding is the mean of the token embeddings in that span:

𝑒 = (ϑ(cueₛₜₐᵣₜ) + ... + ϑ(cueₗₐₛₜ)) / (cueₗₐₛₜ − cueₛₜₐᵣₜ + 1)
[Image Source https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii/]
To understand why this matters, consider a Wikipedia article about Berlin. When processing a chunk containing the phrase "the city", traditional chunking would embed this phrase in isolation. With late chunking, however, those tokens are embedded with awareness of the earlier mention of "Berlin" since all tokens were processed together. The final pooling step then preserves these contextual relationships while maintaining the desired chunk structure. This results in chunk embeddings that capture both local meaning and broader document context, leading to more semantically accurate representations.
[Image Source https://jina.ai/news/late-chunking-in-long-context-embedding-models/]
Implementation and Advantages
Late Chunking's advantage lies in its simplicity and speed. Embedding the document as a whole requires only one pass through a long-context model, followed by straightforward chunking. By ensuring that chunks are created with full-document knowledge, Late Chunking mitigates semantic loss and supports more meaningful retrieval without excessive computational overhead.
Here's the full algorithm taken straight from the paper, but all you really need to know is that it's a mechanism for injecting the context of the entire document into each chunk's embedding:

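If you'd rather see the mechanism in code, here is a minimal sketch of the core idea (not the paper's reference implementation): embed the whole document in one forward pass with a long-context Hugging Face model, then mean pool the token embeddings that fall inside each chunk's character span. The model name and the character-offset matching are just illustrative assumptions:

import torch
from transformers import AutoModel, AutoTokenizer

def late_chunk(document, chunk_spans, model_name="jinaai/jina-embeddings-v2-base-en"):
    """Sketch of late chunking: one forward pass over the full document,
    then mean pooling of token embeddings per chunk span.
    chunk_spans: list of (char_start, char_end) tuples for each chunk."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

    # Tokenize the ENTIRE document once, keeping character offsets per token
    inputs = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]  # shape: (num_tokens, 2)

    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)

    chunk_embeddings = []
    for char_start, char_end in chunk_spans:
        # Select tokens whose character span overlaps this chunk's span
        mask = (offsets[:, 0] < char_end) & (offsets[:, 1] > char_start)
        # Mean pool context-aware token embeddings into one chunk vector
        chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
    return chunk_embeddings

Because every token embedding was computed while attending to the whole document, each pooled chunk vector inherits document-wide context.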
Late chunking can be applied to any embedding model that supports mean pooling, but the easiest way to get started is the Jina Embedding API, as their newest model supports late chunking. To try it, just set late_chunking to True! We pass a bunch of chunks, and all the chunks concatenated together are treated as the full document.
Here's what an embedding function might look like:
# This function can also be used for contextual chunking--see below:
import requests

JINA_API_KEY = "YOUR_JINA_API_KEY"

def get_embeddings(chunks, late_chunking=False, contexts=None):
    url = 'https://api.jina.ai/v1/embeddings'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINA_API_KEY}'
    }
    # If using contextual chunking, combine contexts with chunks
    if contexts:
        input_texts = [f"{ctx} {chunk}" for ctx, chunk in zip(contexts, chunks)]
    else:
        input_texts = chunks
    data = {
        "model": "jina-embeddings-v3",
        "task": "text-matching",
        "dimensions": 1024,
        "late_chunking": late_chunking,
        "embedding_type": "float",
        "input": input_texts
    }
    response = requests.post(url, headers=headers, json=data)
    return [item["embedding"] for item in response.json()["data"]]
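A quick usage example (the sample chunks are hypothetical; it assumes your API key is set above):

chunks = [
    "Berlin is the capital of Germany.",
    "The city has a population of about 3.8 million."
]
# With late_chunking=True, the inputs are concatenated, embedded as one
# document, and one context-aware vector is returned per chunk
late_vectors = get_embeddings(chunks, late_chunking=True)
print(len(late_vectors), len(late_vectors[0]))  # 2 chunks, 1024 dimensions each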
This method works especially well for relatively long documents or narratives that fit within a large model's context window, with Jina's limit being 8192 tokens.
It should also be noted that we can make our document whatever we want. For example, let's say we wanted to add the abstract of the document as context to an entire paper's chunks. The paper might be longer than 8k tokens, but we can just pass the abstract as the first chunk, and then ignore the first embedding returned!
data = {
    "model": "jina-embeddings-v3",
    "task": "text-matching",
    "dimensions": 1024,
    "late_chunking": True,  # LATE CHUNKING IS SET TO TRUE
    "embedding_type": "float",
    "input": [
        "PAPER ABSTRACT",
        "PAPER PAGE 27 CHUNK 1",
        "PAPER PAGE 27 CHUNK 2",
        "PAPER PAGE 27 CHUNK 3",
        ...
    ]
}
Then we just use response[1:] as our vectors (ignoring the abstract's embedding), and all of our chunk embeddings will carry the added context of the abstract. We can get as creative as we want here, passing any context we would like to each chunk embedding. If we implement late chunking directly instead of using the Jina Embedding API, we can simply get the embeddings for the chunks while ignoring the context.
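In code, the trick is just a matter of prepending the abstract and slicing it off afterwards (abstract and page_chunks are hypothetical variables for illustration):

# Prepend the abstract so every chunk is embedded "in view" of it
embeddings = get_embeddings([abstract] + page_chunks, late_chunking=True)
# Keep only the chunk vectors; the abstract's embedding is discarded
chunk_embeddings = embeddings[1:]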
Here's a link to that implementation:
According to a Jina blog on the topic, late chunking seems to work better as the size of the document approaches the 8k-token limit:
Not bad for what is in many cases zero change to our RAG architecture!
Strategy 2: Contextual Retrieval—Embedding Chunks with Added Context
Concept and Mechanism
Contextual Retrieval is a strategy introduced by Anthropic in which an LLM prepends a short, chunk-specific context to each chunk before it is embedded.
In Contextual Retrieval, each chunk Dᵢ is augmented by concatenating a context summary generated by an LLM: S(D, Dᵢ), which captures information from both the document D and the specific chunk Dᵢ. Once we pass this to an embedding function, we get an enriched chunk embedding E(Dᵢ + S(D, Dᵢ)) that reflects the chunk's local content alongside relevant contextual information from the entire document:
Sorry, enough with the math! Here's a diagram from Anthropic that explains it better than a formula:
While Anthropic separates the vector database from a TF-IDF index in this image, many vector databases support BM25, an improved variant of TF-IDF. So in practice you would keep everything in one place: a vector index over your contextual embeddings alongside a BM25 (keyword) index.
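To make the hybrid idea concrete, here is a rough sketch of score fusion using the rank_bm25 package plus plain cosine similarity; the 0.5/0.5 weighting and min-max normalization are arbitrary choices for illustration, and in practice your vector database's hybrid search would typically do this for you server-side:

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Blend BM25 keyword scores with dense cosine similarity."""
    # Sparse (keyword) scores from BM25 over whitespace-tokenized docs
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))

    # Dense scores: cosine similarity between the query vector and each doc vector
    doc_vecs = np.asarray(doc_vecs)
    query_vec = np.asarray(query_vec)
    dense = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )

    # Min-max normalize each score list so they're comparable, then blend
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)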
Implementation and Benefits
The Anthropic implementation of Contextual Retrieval is much more straightforward than late chunking: for every chunk in a document, pass the full document and the chunk to an LLM, generate a short context, and prepend it to the chunk. However, this method repeats a lot of work, as each chunk is processed independently, even when chunks overlap in content.
One way to improve efficiency here is context caching. With large language models, KV caching stores the attention keys and values, which makes reusing the same long prompt (here, the full document) much cheaper across requests. In the case of Anthropic's models, this caching can make the process up to 90% cheaper. It's also perfectly possible to generate all the chunk contexts in one prompt. If we don't, there is a lot of repeated work, but that can be offset by parallelizing the context generation process.
For practical implementation, we could either:
- Pass each chunk individually to generate context, achieving high-quality context-specific embeddings but at a cost of efficiency.
- Pass multiple chunks with identifiers to a large language model and generate context for each ID simultaneously, which might be cheaper and easier from an engineering perspective but could be less precise, as this is a much less straightforward task for an LLM.
There's a reason Anthropic went with option 1: option 2 is harder to implement and can have all kinds of bugs (like timeouts on long responses, or the LLM misunderstanding the task, especially if a small LLM is used). I'd recommend option 2 only if you are using a larger LLM.
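For completeness, here is a rough sketch of what option 2 could look like: number the chunks, ask the model for a JSON object mapping each ID to its context, and parse the reply. The prompt wording and the model choice are assumptions, and in practice you'd want retries and validation around the JSON parsing:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_contexts_batched(document, chunks, model="gpt-4o"):
    """Option 2: a single call that returns a context for every chunk at once."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "For each numbered chunk, write a short context situating it within the document. Reply with a JSON object mapping each chunk number to its context."},
            {"role": "user", "content": f"<document>\n{document}\n</document>\n\nChunks:\n{numbered}"}
        ],
        temperature=0.3,
    )
    contexts = json.loads(response.choices[0].message.content)
    # Prepend each generated context to its chunk
    return [f"{contexts[str(i)]} {chunk}" for i, chunk in enumerate(chunks)]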
Here's a basic implementation:
chunks = [
    "Germany has a lot of great dishes.",
    "It isn't the largest country in Europe, but it's up there.",
    "I've always wanted to go visit its many museums and try its local cuisine.",
    "When I was a child, I visited once.",
    "It was a fantastic time, I tried schnitzel, which was delicious.",
    "Now as an adult, I would visit again.",
    "I've also visited many other countries in Europe, but Germany is my favorite."
]
document = " ".join(chunks)
First, we create a bunch of chunks. All of these are about Germany, which is clear from the full document, but individual chunks may not mention Germany at all.
We can then define a generate_contexts function that uses an LLM to generate context for each chunk, and then use our embedding function from above to get naive, late, and contextual embeddings for each chunk.
import asyncio
import pandas as pd
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        return f"{context} {chunk}"

    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

df = pd.DataFrame({'text': chunks})
contexts = await generate_contexts(document, chunks)

df['naive_embedding'] = get_embeddings(chunks, late_chunking=False)
df['late_embedding'] = get_embeddings(chunks, late_chunking=True)
df['contextual_embedding'] = get_embeddings(chunks, late_chunking=False, contexts=contexts)
df['context'] = contexts
Now we can insert into our vector database. I'm a Developer Advocate for KDB.AI, so I'll use it here. One bonus of KDB.AI is that we can add multiple indexes, so we don't need to duplicate our data and can have all our embeddings together, which means that we can search using a single index, or a combination of indexes using hybrid search.
Of course a vector db is overkill for this specific example, but if you would like to scale this demo up, it would be extremely easy to run experiments with this setup.
If you would like to follow along, you can get your free cloud instance at KDB.AI.
Let's first instantiate our db:
import kdbai_client as kdbai

database_name = "contextual_rag"
table_name = "contextual_chunks"
KDBAI_ENDPOINT = "KDBAI_ENDPOINT"
KDBAI_API_KEY = "KDBAI_API_KEY"

session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)

# Drop the database if it already exists so the demo can be rerun cleanly
try:
    session.database(database_name).drop()
except kdbai.KDBAIException:
    pass

database = session.create_database(database_name)

schema = [
    {"name": "text", "type": "str"},
    {"name": "context", "type": "str"},
    {"name": "naive_embedding", "type": "float32s"},
    {"name": "late_embedding", "type": "float32s"},
    {"name": "contextual_embedding", "type": "float32s"}
]

indexes = [
    {
        'type': 'qFlat',
        'name': 'embedding_index_naive',
        'column': 'naive_embedding',
        'params': {'dims': 1024, 'metric': 'L2'},
    },
    {
        'type': 'qFlat',
        'name': 'embedding_index_late',
        'column': 'late_embedding',
        'params': {'dims': 1024, 'metric': 'L2'},
    },
    {
        'type': 'qFlat',
        'name': 'embedding_index_contextual',
        'column': 'contextual_embedding',
        'params': {'dims': 1024, 'metric': 'L2'},
    }
]

table = database.create_table(table_name, schema=schema, indexes=indexes)
Here, we use a qFlat index so that our data is stored on disk and searched exactly. For better performance, an approximate index could be used. To learn more about indexes, see our in-depth article on the topic: https://kdb.ai/learning-hub/articles/indexing-basics/
Let's insert our data from earlier:
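With the KDB.AI client, this should be a single call (a minimal sketch, assuming the df built above matches the schema we just defined):

# Insert the chunks, contexts, and all three embedding columns into the table
table.insert(df)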
And finally, we can search using any index we have defined. Let's search using our contextual embeddings:
def search_chunks(table, query, chunking_type="naive", query_embedding=None, top_k=3):
    results = table.search(
        vectors={f"embedding_index_{chunking_type}": query_embedding},
        n=top_k
    )
    return results[0]

query = "Germany"
query_embedding = get_embeddings([query], late_chunking=False)

contextual_results = search_chunks(table, query, "contextual", query_embedding)
Notice how every chunk's context mentions Germany, which was our query? This is a good sign that contextual chunking will improve our retrieval quality. The contexts aren't as good as what a human would write, which suggests the Anthropic prompt isn't perfect and can be tuned to improve their quality.
Contextual Retrieval is especially useful in scenarios where models lack domain-specific fine-tuning or context-specific knowledge. Each chunk's embedding carries a part of the full-document context, making retrieval more accurate for complex queries.
For Retrieval-Augmented Generation (RAG), hybrid search is essential for boosting both retrieval and generation quality. By combining dense vector similarity and sparse keyword matching, hybrid search ensures highly relevant document retrieval, forming a solid foundation for coherent AI-generated responses. This approach is so effective that it's one of the few enhancements you can confidently implement without extensive benchmarking.
Moreover, hybrid search isn't the end of the line--it creates a launchpad for advanced techniques like Contextual Retrieval or Late Chunking, which further improve retrieval precision. Adding reranking on top of these can refine your pipeline even more. While hybrid search is a must-have, layering these advanced strategies on top can amplify your RAG system's performance, improving both retrieval accuracy and the coherence of generation.
Contextual Retrieval has the added advantage of allowing this context to be passed to a large language model. When chunks are presented without context, the language model loses the connections that exist within the full document. With Contextual Retrieval, we can reintroduce this context, improving not only retrieval quality but also the quality of subsequent generation tasks, leading to more coherent responses. But for this to be true, we need to make sure that no hallucinations occur when generating the context.
Contextual Retrieval combines the advantages of dense embeddings with sparse retrieval's keyword specificity, reducing retrieval errors by up to 49%, or 67% with a reranking model.
The metric of failed retrievals is not the most popular for benchmarking retrieval methods, but it makes a lot of sense--in practice we are passing results to an LLM, so we just need to make sure that the correct document is among the ones returned at the end of our search process.
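Measuring it is simple: for each query with a known "gold" chunk, check whether that chunk appears in the top-k results. A tiny sketch (the eval_set structure and the retrieve function are hypothetical placeholders; pick whatever k you actually pass to your LLM):

def failed_retrieval_rate(eval_set, retrieve, top_k=20):
    """Fraction of queries whose gold chunk is missing from the top-k results.
    eval_set: list of (query, gold_chunk_id) pairs
    retrieve: function mapping a query to an ordered list of chunk ids"""
    failures = sum(
        1 for query, gold_id in eval_set
        if gold_id not in retrieve(query)[:top_k]
    )
    return failures / len(eval_set)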
It also shows a significant advantage of Contextual Retrieval: the LLM can work out which information a chunk is missing in order to be retrieved properly and add it back, much like a human annotator would in real-world retrieval scenarios.
But this might not be the best strategy. For instance, does every single chunk need context to be understood? Probably not! There's also a lot of repeated work, since many chunks often need the same context.
In practice, this strategy feels fragile, is hard to iterate on, and should only be implemented if you know what you are doing and have extremely robust evals. Don't take for granted that Contextual Retrieval will improve performance, as that's not a given!
Late Chunking vs. Contextual Retrieval: An Analytical Comparison
Context Preservation
- Late Chunking: Provides a high level of context preservation by embedding the entire document first, creating enriched chunk embeddings. However, it's bounded by the context window of the embedding model. LLMs typically have a larger context than embedding models.
- Contextual Retrieval: Each chunk of a document is embedded along with additional context from surrounding chunks or the entire document. This approach is particularly effective for documents exceeding the context window of a single model. By summarizing or including relevant context, the resulting embeddings are more semantically accurate, especially for queries requiring a document-wide understanding. When we use failed retrievals as a metric, we can get a significant boost--we can think of this like a human annotator adding information to chunks to make them more contextually relevant.

The additional context can be passed to the language model (or directly to the user), resulting in richer responses and improving downstream tasks such as generation or answering complex queries. For RAG, this is the major advantage of Contextual Retrieval over Late Chunking, as both improve retrieval performance.
What's Happening in Practice: Why Is Performance Improving?
Why does adding context or information about the document improve retrieval?
To find out, I (or rather my AI collaborator ChatGPT) ran a PCA analysis on the chunk embeddings from above:
import numpy as np
from sklearn.decomposition import PCA
import plotly.graph_objects as go

# Get all embeddings from DataFrame
naive_embeds = np.array(df['naive_embedding'].tolist())
late_embeds = np.array(df['late_embedding'].tolist())
contextual_embeds = np.array(df['contextual_embedding'].tolist())

# Get the document embedding
document_embedding = get_embeddings([document], late_chunking=False)[0]

# Stack embeddings and add query and document
all_embeds = np.vstack([
    naive_embeds,
    late_embeds,
    contextual_embeds,
    query_embedding,    # This is your query embedding
    document_embedding  # Add the document embedding
])

# Create labels
labels = (
    ['Naive'] * len(naive_embeds) +
    ['Late'] * len(late_embeds) +
    ['Contextual'] * len(contextual_embeds) +
    ['Query'] +
    ['Document']
)

# Perform PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(all_embeds)

# Create scatter plot
fig = go.Figure()

# Split the 2D embeddings
naive_2d = embeddings_2d[:len(naive_embeds)]
late_2d = embeddings_2d[len(naive_embeds):len(naive_embeds) + len(late_embeds)]
contextual_2d = embeddings_2d[len(naive_embeds) + len(late_embeds):-2]
query_2d = embeddings_2d[-2:-1]
document_2d = embeddings_2d[-1:]

# Helper function to adjust text position
def adjust_text_positions(positions, offset):
    return [pos + offset for pos in positions]

# Add each type of embedding separately
fig.add_trace(go.Scatter(
    x=adjust_text_positions(naive_2d[:, 0], 0.1),
    y=adjust_text_positions(naive_2d[:, 1], 0.1),
    mode='markers+text',
    name='Naive',
    text=df['text'],
    textposition="top center",
    marker=dict(size=10, opacity=0.8)
))

fig.add_trace(go.Scatter(
    x=adjust_text_positions(late_2d[:, 0], -0.1),
    y=adjust_text_positions(late_2d[:, 1], -0.1),
    mode='markers+text',
    name='Late',
    text=df['text'],
    textposition="bottom center",
    marker=dict(size=10, opacity=0.8)
))

fig.add_trace(go.Scatter(
    x=adjust_text_positions(contextual_2d[:, 0], 0.15),
    y=adjust_text_positions(contextual_2d[:, 1], -0.15),
    mode='markers+text',
    name='Contextual',
    text=df['text'],
    textposition="top right",
    marker=dict(size=10, opacity=0.8)
))

fig.add_trace(go.Scatter(
    x=query_2d[:, 0],
    y=query_2d[:, 1],
    mode='markers+text',
    name='Query',
    text=['Query'],
    textposition="top center",
    marker=dict(size=15, symbol='star', opacity=1.0)
))

fig.add_trace(go.Scatter(
    x=document_2d[:, 0],
    y=document_2d[:, 1],
    mode='markers+text',
    name='Document',
    text=['Document'],
    textposition="top center",
    marker=dict(size=15, symbol='diamond', opacity=1.0)
))

# Update layout for clarity
fig.update_layout(
    title="PCA Visualization of Different Embedding Strategies",
    xaxis_title="First Principal Component",
    yaxis_title="Second Principal Component",
    width=1000,
    height=800,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5),
    margin=dict(l=50, r=50, t=100, b=50),
)

fig.show()
Whoa, this isn't quite what I expected! Let's unpack:
First, the naive embeddings are all over the place. They’re widely scattered, which makes sense—they don’t consider any context beyond the chunk itself. While this independence captures unique chunk features, it often leads to inconsistencies and misses broader relationships.
Now, look at Late Chunking. The embeddings form a tight cluster near the document. This clustering suggests that Late Chunking effectively applies a transformation that pulls each chunk closer to the document’s overall representation. It’s as if the embeddings are magnetized toward the document, which helps preserve coherence at the document level but might blur the nuances that are crucial for query-specific retrieval.
Then there’s Contextual Retrieval. Its embeddings sit clustered around what looks like an arbitrary point to the right of the document and the query. What’s likely happening here is that the added context reshapes the embeddings, creating a blend of repeated patterns and structure influenced by the prompt. The result is what feels like an arbitrary cluster, which is technically slightly closer to the query, but not in an interpretable way.
But clustering isn’t inherently good. While Late Chunking’s tight clustering aligns with the document, it could overgeneralize, making it less precise for specific queries. On the other hand, Contextual Retrieval’s cluster likely aligns with some semantic habits of the LLM when prompted in this way, which indicates that the prompt is extremely important and not very well-tuned towards transforming in the direction of the document.
This visual makes one thing clear: Contextual Retrieval isn't just about embedding the full document context into every chunk, as the previous section suggested. Instead, it may shine in patching weak chunks--those that lose important meaning due to segmentation. Late Chunking excels at preserving document-level coherence, but my best guess is that Contextual Retrieval fills in the gaps where segmentation fails. Two very different approaches, each with unique strengths.
Retrieval Speed and Efficiency
- Late Chunking: Embeds once, and can be done in a similar time frame to naive embedding. This makes it efficient and fast for large document retrieval tasks where model context windows are sufficient.
- Contextual Retrieval: Embedding each chunk with context incurs higher computational costs due to repeated summary/context generation. However, once the context is generated, we can then use naive embedding to generate embeddings for each chunk, or even Late Chunking if we would like--there is no reason these strategies can't be combined (see the one-liner below). However, in production, it might be better to add context only to chunks that are in danger of being misinterpreted out of context by an LLM.
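With the get_embeddings helper from earlier, combining the two strategies is literally one call: prepend the LLM-generated contexts and let late chunking embed the context-enriched chunks together.

# Contextual Retrieval + Late Chunking: context-prepended chunks, embedded
# with awareness of each other via late chunking
combined_embeddings = get_embeddings(chunks, late_chunking=True, contexts=contexts)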
Use Cases and Practicality
- Late Chunking is ideal for cases where we want a performance boost without changing our architecture. It's comparable to simply using a smarter embedding model while keeping the number of dimensions the same.
- Contextual Retrieval works best when we need to squeeze out every theoretical smidge of performance, and when we want to do something with the context generated by an LLM.
Conclusion: Integrating Context into Chunking for Enhanced Retrieval
Late Chunking and Contextual Retrieval each provide solutions to the context loss inherent in naive chunking methods. Late Chunking uses an embedding-first approach, preserving semantic content across chunks, while Contextual Retrieval enriches chunks with additional document-wide context, enabling a more nuanced RAG process.
In deciding which method to adopt, consider the needs of your RAG system: if embedding speed is paramount, Late Chunking is often the optimal choice. If retrieval accuracy is absolutely critical, Contextual Retrieval's layered approach to chunk context can yield great results if properly implemented. Both methods highlight the importance of context in retrieval, underscoring that effective chunking is about more than splitting text—it's about preserving meaning.
Bonus: Other Ways to Add Context to Your RAG Pipeline
Adding context to your RAG pipeline doesn’t stop at Late Chunking or Contextual Retrieval. Here are some additional strategies you can use to enhance your retrieval quality:
- Manually Review and Add Context: If your dataset is small (e.g., 1,000 chunks), consider manually adding context to the chunks. Human intuition can significantly outperform an LLM, especially when you deeply understand your product or goals. Reviewing chunks yourself ensures higher-quality context tailored to your specific use case.
- Optimize Poorly Performing Queries: Identify queries where retrieval results are weak and augment the corresponding chunks to improve their ranking. One effective approach is to use a reranking model, such as Cohere Rerank, to analyze queries with low-quality results. You can also use an LLM to rate the quality of search results and detect hallucinations. Adjust the chunk context until the results meet your standards. This iterative process fine-tunes your chunks to align better with user queries.
- Use Contextual Retrieval Only When Needed: Not every chunk requires additional context. Classify chunks as either "good" (understandable on their own) or "bad" (requiring additional context), and apply Contextual Retrieval selectively to the "bad" chunks, saving computational resources while maintaining retrieval quality (see the sketch after this list).
- Introduce Context Across Multiple Documents: When chunks need context from multiple documents, explore embedding strategies beyond Late Chunking. For example, small models like cde-small-v1 achieve state-of-the-art performance by leveraging methods described in the Contextual Document Embeddings paper. These strategies enable you to incorporate multi-document context efficiently and effectively.
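As a rough sketch of the "only when needed" idea, you could ask an LLM (or even a cheap heuristic) whether a chunk stands on its own, and only generate context for the ones that don't. The prompt wording and model choice are assumptions:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def needs_context(chunk, model="gpt-4o-mini"):
    """Ask an LLM whether a chunk is understandable without the rest of the document."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer only 'yes' or 'no'."},
            {"role": "user", "content": f"Can this text be understood on its own, without any surrounding document?\n\n{chunk}"}
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("no")

# Only pay for context generation where it's actually needed
bad_chunks = [c for c in chunks if needs_context(c)]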
By combining these approaches, you can tailor your pipeline to achieve better performance, making it robust, scalable, and aligned with your specific needs.