TL;DR – We’re excited to introduce voyage-multimodal-3.5, our next-generation multimodal embedding model built for retrieval over text, images, and videos. Like voyage-multimodal-3, it embeds interleaved text and images (screenshots, PDFs, tables, figures, slides), but now adds explicit support for video frames. It’s also the first production-grade video embedding model to support Matryoshka embeddings for flexible dimensionality. voyage-multimodal-3.5 attains 4.56% higher retrieval accuracy than Cohere Embed v4 across 15 visual document retrieval datasets and 4.65% higher than Google Multimodal Embedding 001 across 3 video retrieval datasets, while matching state-of-the-art text models on pure-text search.
We released voyage-multimodal-3, the industry’s first production-grade multimodal model capable of embedding interleaved texts and images, over a year ago. Since then, voyage-multimodal-3 has enabled numerous customers to build search and retrieval pipelines over text, PDFs, figures, tables, and other documents rich with visuals.
Today, we’re excited to announce voyage-multimodal-3.5, which introduces support for embedding videos while further improving upon voyage-multimodal-3 in terms of retrieval quality.
Model architecture. Similar to voyage-multimodal-3, voyage-multimodal-3.5 adopts a model architecture where both visual and text modalities are passed through a single transformer encoder. This unified architecture preserves contextual relationships between visual and textual information, enabling effective vectorization of interleaved content such as document screenshots, complex PDFs, and annotated images. This stands in contrast to CLIP-based models (such as earlier Cohere multimodal models), which route images and text through separate, independent model towers.
CLIP-like models generate embeddings with a well-documented problem known as the modality gap, which we discussed in our voyage-multimodal-3 blog post. In practice, this means a text query will often retrieve irrelevant text over a highly relevant image simply because text vectors sit closer to other text vectors in the embedding space. By processing all inputs through the same backbone, voyage-multimodal-3.5 embeds text, screenshots, PDFs, figures, and now videos into a shared vector space where similarity reflects semantic meaning rather than modality.
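To make the shared vector space concrete, here is a minimal sketch (the file name chart.png and the query text are placeholders) that embeds a text query and an image with the same model and scores them directly with cosine similarity:

import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# Embed a text query and an image with the same model; both land in one vector space.
query_emb = vo.multimodal_embed([["quarterly revenue growth"]], model="voyage-multimodal-3.5", input_type="query").embeddings[0]
doc_emb = vo.multimodal_embed([[Image.open("chart.png")]], model="voyage-multimodal-3.5", input_type="document").embeddings[0]

# Cosine similarity is meaningful across modalities because there is no modality gap.
cosine = np.dot(query_emb, doc_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
print(cosine)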
Video embeddings. voyage-multimodal-3.5 also explicitly supports embedding videos, enabling accurate text-to-video retrieval over a corpus of scenes. Architecturally, videos are represented as an ordered sequence of frames and input to the model as images. Specifically, every 1120 pixels of a video counts as a token, for a maximum of 32k tokens.
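As a back-of-the-envelope illustration of that budget, the sketch below estimates a video's token count from its frame count and resolution using the pixels-per-token rule above; the frame count and resolution are made-up numbers, and the API's exact accounting may differ slightly:

PIXELS_PER_TOKEN = 1120  # per the rule above
MAX_TOKENS = 32_000      # 32k-token context limit

def estimate_video_tokens(num_frames: int, width: int, height: int) -> int:
    # Total pixels across all sampled frames, divided by pixels per token.
    return num_frames * width * height // PIXELS_PER_TOKEN

# Example: a 2-minute clip sampled at 1 frame per second with 512x512 frames.
tokens = estimate_video_tokens(num_frames=120, width=512, height=512)
print(tokens, tokens <= MAX_TOKENS)  # 28086 True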
We provide a series of best practices for embedding videos below:
- Split long videos into scenes: Consider splitting videos that exceed 32k tokens of context length into segments. Each segment should contain related frames and corresponding transcript text, creating more focused embeddings.
- Align splits with transcript timestamps: If transcripts are available (e.g., from a speech-to-text transcription model), align the beginning and end of your segments with natural breaks in the transcripts. This keeps your scene boundaries in sync with the spoken content, so each segment captures a complete thought or topic.
- Reduce resolution when necessary: If a semantically continuous scene still exceeds the 32k-token context limit, reduce its token count by lowering the frame resolution or the sampling FPS; a sketch of one way to do this follows this list, and Appendix A shows how to get started with video embeddings using the voyageai package.
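Here is a minimal sketch of the frame downsampling described in the last bullet. It uses opencv-python and Pillow, which are assumptions of this example rather than requirements of the voyageai package; the sampling rate, resolution cap, and the file name long_scene.mp4 are illustrative placeholders. The downsampled frames are then passed to the model as ordinary images:

import cv2
import voyageai
from PIL import Image

def sample_frames(path, target_fps=1.0, max_side=512):
    """Sample frames from a video at a reduced FPS and resolution."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / target_fps))  # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            h, w = frame.shape[:2]
            scale = max_side / max(h, w)
            if scale < 1.0:  # only downscale, never upscale
                frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        index += 1
    cap.release()
    return frames

vo = voyageai.Client()
scene_frames = sample_frames("long_scene.mp4", target_fps=0.5, max_side=448)
result = vo.multimodal_embed([scene_frames], model="voyage-multimodal-3.5")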
For more sophisticated video embedding examples, check out our sample notebook.
Matryoshka embeddings and quantization. voyage-multimodal-3.5 supports 2048-, 1024-, 512-, and 256-dimensional embeddings, enabled by Matryoshka learning, as well as multiple quantization options, including 32-bit floating point, signed and unsigned 8-bit integer, and binary precision, all while minimizing quality loss.
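Because the embeddings are Matryoshka-trained, a full-length vector can also be shortened client-side by truncating and re-normalizing. The sketch below shows this for a 512-dimensional target, assuming a full 2048-dimensional output; the file name slide.png is a placeholder:

import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()
full = np.array(vo.multimodal_embed(
    [[Image.open("slide.png")]], model="voyage-multimodal-3.5"
).embeddings[0])  # full-length 2048-dimensional vector

# Matryoshka embeddings front-load the information, so truncating the vector
# and re-normalizing it yields a usable lower-dimensional embedding.
short = full[:512]
short = short / np.linalg.norm(short)
print(short.shape)  # (512,)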
Evaluation Details
Datasets. We evaluate voyage-multimodal-3.5 across 18 multimodal datasets spanning two tasks: visual document retrieval (the ViDoRe, ViDoRe v2, and MIRACL-VISION benchmarks) and video retrieval (the MSR-VTT, YouCook2, and DiDeMo datasets). We also evaluate on a standard text retrieval task spanning 38 datasets across 6 domains (law, finance, conversation, code, web, and tech).
For all datasets, the query is text, while the document can be a figure, photo, text, document screenshot, video, or a combination of these. For each task, we use prior top-performing models as baselines.
Baselines. For the visual document retrieval and standard text retrieval tasks, we evaluate voyage-multimodal-3.5 alongside Cohere Embed v4 (embed-v4.0), Amazon Nova 2 Multimodal Embeddings (amazon.nova-2-multimodal-embeddings-v1:0), and voyage-multimodal-3. For the video retrieval task, we evaluate voyage-multimodal-3.5 alongside Google Multimodal Embedding 001 (multimodalembedding@001).
Metrics. Given a query, we retrieve the top 10 results by cosine similarity and report the NDCG@10.
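For reference, here is a minimal sketch of NDCG@10 for a single query with binary relevance labels; this is the standard formula, not our evaluation harness:

import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: DCG of the ranked results divided by the ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Relevance labels of the ranked results for one query (1 = relevant).
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ~0.92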
Results
Visual document retrieval. As shown in the figure below, voyage-multimodal-3.5 outperforms Google Multimodal Embedding 001, Cohere Embed v4, Amazon Nova 2 Multimodal, and voyage-multimodal-3 on visual document retrieval by 30.57%, 2.26%, 8.38%, and 3.03%, respectively. Note that Google Multimodal Embedding 001 averages 51.78% NDCG@10 on visual document retrieval; for visualization purposes, its bar is shown at 65% in the figure.

Video retrieval. As shown in the figure below, voyage-multimodal-3.5 outperforms Google Multimodal Embedding 001 on video retrieval by 4.65% while being ~6x cheaper per video (512×512 resolution).

Standard text retrieval. As shown in the figure below, voyage-multimodal-3.5 outperforms Cohere Embed v4, Amazon Nova 2 Multimodal, and voyage-multimodal-3 by 3.49%, 8.48%, and 5.22%, respectively. Its performance is within 0.29% of voyage-3-large, putting voyage-multimodal-3.5 very close to the current SoTA text embedding model while being $0.06 cheaper per million tokens.

Try voyage-multimodal-3.5 today!
voyage-multimodal-3.5 is available today with flexible, token-based pricing. Head over to our docs to get started and learn more; the first 200M tokens and 150B pixels are free.
Follow us on X (Twitter) and LinkedIn to stay up-to-date with our latest releases.
Appendix A – Getting started with video embeddings
import voyageai
from voyageai.video_utils import Video
from PIL import Image

# The client automatically uses the environment variable VOYAGE_API_KEY.
# Alternatively, pass the key explicitly: vo = voyageai.Client(api_key="<your secret key>")
vo = voyageai.Client()

# Example input containing a text string, an Image object, and a Video object
inputs = [
    ["This is a banana.", Image.open("banana.jpg"), Video.from_path("banana.mp4", model="voyage-multimodal-3.5")]
]

# Vectorize the inputs
result = vo.multimodal_embed(inputs, model="voyage-multimodal-3.5")
print(result.embeddings)