Leading performance on global benchmarks; best-in-class accuracy for Indian languages.

Introduction
Today, we are introducing Sarvam Vision. We have previously released models and applications across voice and text; with this release, we extend that work to vision. We live in a multimodal world, and vision is a crucial modality for solving perception problems for users and enterprises. Some of these problems involve document intelligence, while others call for general vision ("What am I seeing?") capabilities, among many more.
As part of the sovereign model series, we introduce a 3B-parameter state-space vision-language model. The model is capable of a range of visual understanding tasks, including image captioning, scene text recognition, chart interpretation, and complex table parsing.
A central challenge in vision today is high-accuracy document intelligence, particularly for Indian languages. Much of India's knowledge remains embedded in physical documents, scanned archives, and historical collections. This is knowledge locked in plain sight. Unlocking this material is essential for long-term preservation, access, and reuse across research, governance, and enterprise workflows.
Frontier Vision Language Models have established a high bar for processing modern English documents. However, a significant gap remains in the industry: most global models treat Indian languages as secondary, often resulting in lower accuracy for regional scripts. Along with pushing the frontiers of accuracy, our VLM is an inference-efficient 3B state-space model.
Model Training, Performance, and Benchmarks
At a high level, our document intelligence architecture comprises the sovereign VLM and two harness modules: (a) a semantic layout parser and (b) a reading order network. Our primary advances were in data curation and training algorithms.
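The harness described above could be wired together along the following lines. This is an illustrative sketch only; the class and function names are our own assumptions, not Sarvam's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) pixel coordinates
    kind: str                        # e.g. "heading", "paragraph", "table"

def parse_document(page_image, layout_parser, order_network, vlm) -> str:
    """Run the two harness modules around the VLM and join block outputs."""
    blocks = layout_parser(page_image)   # (a) semantic layout parsing
    ordered = order_network(blocks)      # (b) reading order prediction
    # The VLM transcribes each block; outputs are joined in reading order.
    return "\n\n".join(vlm(page_image, block) for block in ordered)
```

Separating layout and ordering from transcription lets a small 3B VLM focus on one region at a time, which is one plausible way such a harness keeps inference efficient.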
We followed a rigorous data curation process to create high-quality synthetic and real-world document image-text samples for all Indian languages alongside English. The data spanned domains such as scientific literature, financial documents, government bulletins, historical manuscripts, textbooks, magazines, and newspapers. Each domain received data generation tailored to its use case. For chart understanding, for example, the data consisted of chart-text pairs covering a variety of tasks, like structured extraction, description, and analysis. For table parsing, we built datasets focused on recognizing the structure of, and relationships among, table cells.
On the algorithmic side, we performed a round of continual pretraining on the base Sarvam sovereign 3B model, followed by supervised fine-tuning and reinforcement learning with verifiable rewards.
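For OCR-style tasks, a verifiable reward can be computed deterministically by comparing the model's transcription against a reference. The sketch below is a minimal illustration of the idea, assuming a similarity-based reward; the function names and normalization are our assumptions, not Sarvam's training code.

```python
import difflib

def normalize(text: str) -> str:
    """Collapse whitespace so trivially different renderings score equally."""
    return " ".join(text.split())

def ocr_reward(prediction: str, reference: str) -> float:
    """Deterministic, verifiable score in [0, 1]: 1.0 for an exact match,
    otherwise a character-level similarity ratio (illustrative assumption)."""
    pred, ref = normalize(prediction), normalize(reference)
    if pred == ref:
        return 1.0
    return difflib.SequenceMatcher(None, pred, ref).ratio()
```

Because the score depends only on the prediction and a known reference, it can be recomputed and checked by anyone, which is what makes such rewards "verifiable".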
Global Benchmarks
olmOCR-Bench
A benchmark for evaluating document-level OCR through pass-fail unit tests that are simple, unambiguous, and deterministically machine-verifiable. For the evaluation, we filtered the 1,403 total samples down to 1,258 to ensure benchmarking is performed only on English documents (olmOCR-Bench-English). The implementation details can be found in this GitHub repository.
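In this style of benchmark, each document gets a set of deterministic pass-fail checks on the model's OCR output. A minimal sketch of what such unit tests can look like (the check types below are illustrative, not the benchmark's exact schema):

```python
def text_present(output: str, needle: str) -> bool:
    """Pass if a required string appears in the OCR output."""
    return needle in output

def text_absent(output: str, needle: str) -> bool:
    """Pass if a forbidden string (e.g. a repeated header) does not appear."""
    return needle not in output

def reading_order(output: str, first: str, then: str) -> bool:
    """Pass if `first` appears before `then`, checking reading order."""
    i, j = output.find(first), output.find(then)
    return i != -1 and j != -1 and i < j
```

Because each check is a boolean on plain text, results are reproducible and leave no room for judgment calls, unlike similarity-score metrics.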
| Category | Sarvam Vision | Mistral OCR 3 | Chandra | Gemini 3 Pro | PaddleOCR VL 1.5 | PaddleOCR VL | DeepSeek OCR v2 | Gemini 3 Flash | GPT 5.2 |
|---|---|---|---|---|---|---|---|---|---|
| ArXiv Math | 86.5 | 85.4 | 81.4 | 70.6 | 85.4 | 85.4 | 81.9 | 66.5 | 61 |
| Base | 99.6 | 99.9 | 99.8 | 99.8 | 98.8 | 98.6 | 99.8 | 99.8 | 99.8 |
| Hdr/Ftr | 96.3 | 93.8 | 88.8 | 84 | 96.9 | 96.9 | 95.6 | 83.8 | 75.6 |
| TinyTxt | 91 | 88.9 | 91.9 | 90.3 | 80.8 | 80.8 | 88.7 | 88.2 | 62.2 |
| MultCol | 82.2 | 82.1 | 82.9 | 79.2 | 82.6 | 82.5 | 83.6 | 73.7 | 70.2 |
| OldScan | 49.8 | 48.8 | 49.2 | 47.5 | 39.2 | 38.8 | 33.7 | 46 | 34.6 |
| OldMath | 81 | 68.3 | 73.6 | 84.9 | 66.4 | 66.4 | 68.8 | 85.8 | 75.8 |
| Tables | 88.3 | 86.1 | 88.2 | 84.9 | 84.1 | 83.9 | 78.1 | 75.9 | 79 |
olmOCR (Category-wise Performance Comparison)
OmniDocBench V1.5
A comprehensive benchmark for evaluating document parsing, featuring various document and layout types (academic papers, financial reports, and handwritten notes). We report the performance on the official English-only split from the evaluation set which contains 628 samples.
OmniDocBench V1.5 (Category-wise Performance Comparison)
Sarvam Indic OCR Bench
Global benchmarks focus heavily on English document parsing, and to the best of our knowledge there is currently no Indic benchmark of a similar standard. We bridge this gap with Sarvam Indic OCR Bench, which contains 20,267 samples drawn from a wide range of document pages. The samples span all 22 scheduled Indian languages, date from 1800 to the present, and vary in scan quality and content. Furthermore, they are curated at a semantic block level to robustly evaluate character and word accuracy. In this section we report word accuracy, computed as 100 × (1 − WER).
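The word accuracy above follows the standard word error rate (WER) formulation: word-level edit distance divided by the number of reference words, so accuracy is 100 × (1 − WER). This also explains the few negative scores in the table below: when a model makes more word errors than the reference has words, WER exceeds 1 and accuracy goes below zero. A sketch (the benchmark's exact text normalization, e.g. casing and punctuation handling, is not specified here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # rolling row of the DP table
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (cost 0 if words match)
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

def word_accuracy(reference: str, hypothesis: str) -> float:
    return 100.0 * (1.0 - word_error_rate(reference, hypothesis))
```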
Language-wise accuracy on Sarvam Indic OCR Bench across all 22 scheduled Indian languages
| Language | Sarvam Vision | Gemini 3 Pro | GCV | Opus 4.5 | Surya | Gemma3-27B | GPT 5.2 |
|---|---|---|---|---|---|---|---|
| Hindi | 95.91 | 95.12 | 90.94 | 93.08 | 81.85 | 85.57 | 84.86 |
| Bengali | 92.61 | 90.79 | 88.23 | 83.76 | 70.82 | 65.07 | 70.52 |
| Tamil | 93.42 | 92.73 | 89.69 | 89.62 | 75.92 | 77.14 | 61.87 |
| Telugu | 87.70 | 85.32 | 82.58 | 71.28 | 58.77 | 53.88 | 35.70 |
| Marathi | 93.13 | 90.39 | 87.86 | 81.66 | 72.29 | 70.61 | 63.81 |
| Malayalam | 91.60 | 87.10 | 88.30 | 82.88 | 83.80 | 20.03 | 56.66 |
| Kannada | 89.89 | 87.36 | 85.54 | 77.41 | 68.05 | 45.99 | 26.49 |
| Odia | 81.95 | 75.39 | 82.20 | 57.22 | 61.16 | -9.54 | 10.53 |
| Punjabi | 92.28 | 89.29 | 88.10 | 85.91 | 71.75 | 40.83 | 59.98 |
| Gujarati | 90.74 | 88.40 | 81.63 | 77.53 | 68.02 | 62.62 | 53.45 |
| Urdu | 87.01 | 85.76 | 81.17 | 77.89 | 55.17 | 64.97 | 57.49 |
| Sindhi | 90.24 | 86.31 | 86.71 | 71.89 | 61.31 | 56.69 | 49.00 |
| Santhali | 80.32 | 64.02 | 54.79 | 36.62 | 31.24 | 36.37 | 27.44 |
| Sanskrit | 81.65 | 76.62 | 64.90 | 4.25 | 44.77 | 34.85 | -21.22 |
| Nepali | 93.90 | 93.61 | 91.43 | 84.73 | 80.94 | 79.91 | 67.63 |
| Manipuri | 90.11 | 89.33 | 82.50 | 59.03 | 67.09 | 65.68 | 3.26 |
| Maithili | 81.95 | 50.96 | 49.04 | 26.07 | 1.94 | 3.16 | 13.68 |
| Konkani | 91.10 | 89.96 | 83.02 | 78.26 | 71.96 | 53.13 | 35.73 |
| Kashmiri | 55.93 | 44.46 | 33.41 | 29.89 | 9.76 | -18.03 | -0.60 |
| Dogri | 82.61 | 79.73 | 72.46 | 48.92 | 59.41 | 47.38 | 6.08 |
| Bodo | 89.19 | 87.21 | 78.64 | 62.60 | 68.04 | 55.76 | 34.19 |
| Assamese | 88.74 | 85.36 | 84.50 | 77.58 | 75.76 | 39.90 | 52.71 |
Core Document Intelligence Capabilities
Text Extraction ≠ Knowledge Extraction
Sarvam Vision fundamentally rethinks document intelligence as a knowledge extraction problem, whereas most alternatives stop at text extraction. Documents are more than words: they contain tables and visual elements such as complex scientific charts, illustrations, and infographics. To extract all of this knowledge, a document intelligence model must attend to every pixel, not just the text. Sarvam Vision interprets the visual logic that holds this information together. Whether extracting data points from a trend line or preserving a nested table, the model performs high-fidelity knowledge extraction end-to-end.
Illustrations of Various Domains
1. OCR on English + all 22 scheduled Indian languages
2. Complex table parsing
3. Multilingual visual reasoning
Visual components in a document play an important role: charts and illustrations often communicate details that are not present in the extracted text. Sarvam Vision delivers natively multilingual reasoning over such visual elements in a document.
4. Visual data, structured outputs
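When a document model emits tables or chart data as markdown, downstream systems typically want structured records instead. A minimal post-processing sketch (this helper is our own illustration, not part of any Sarvam SDK):

```python
def markdown_table_to_records(md: str) -> list[dict]:
    """Parse a simple pipe-delimited markdown table into row dicts.
    Assumes one header row followed by a |---| separator row."""
    rows = [line.strip() for line in md.strip().splitlines()
            if line.strip().startswith("|")]
    cells = [[c.strip() for c in row.strip("|").split("|")] for row in rows]
    header, body = cells[0], cells[2:]   # cells[1] is the separator row
    return [dict(zip(header, row)) for row in body]
```

Usage: feed it a model-emitted table and iterate over the resulting dicts to load rows into a database or dataframe.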
In-the-Wild OCR and Perception
Sarvam Vision is built on a foundation of general image understanding and multilingual capabilities. While our current efforts are focused on pushing the frontiers of document intelligence, these broader capabilities remain a core part of the model.
Some illustrations of how Sarvam Vision interprets natural image contexts:
Edge Cases
While the model's performance on Indian languages is significantly better than that of other models, it is not perfect. We did find edge cases; a few are shared here. One example: incorrect translation of Bengali script while describing the image.
For the above image, the model was prompted to describe the scene in Santhali (a low-resource Indian language). Instruction following for such long-tail requests can be low quality.
Experience Sarvam Vision & Get Started with Document Intelligence API Today
Sarvam Vision’s Document Intelligence is built to handle real-world, production-grade workloads, and we’re just getting started! To kick things off and accelerate adoption, we’re making the Document Intelligence APIs and the Vision experience completely free for the entire month of February 2026. This is your chance to push the model to its limits, experiment at scale, and start building with zero friction.
Want to try it right away? Jump into our no-code, interactive experience on the Sarvam API Platform at https://dashboard.sarvam.ai/. Simply log in and enjoy unlimited usage for the month of February!
Ready to integrate into your product? Head over to our API Developer Docs for ready-to-use SDKs, clear examples, and everything you need to get production-ready in minutes.
Building something exciting? Join our Discord Developer Community to stay up to date on new releases, share feedback, and collaborate directly with the Sarvam team.
We’re excited to work closely with developers and partners to build on this strong foundation and unlock powerful downstream applications across education, healthcare, video intelligence, and more. Now’s the time to explore, experiment, and build with Sarvam Vision.