Nanonets-OCR-s – OCR model that transforms documents into structured markdown

huggingface.co

361 points by PixelPanda 15 days ago


PixelPanda - 15 days ago

Full disclosure: I work at Nanonets

Excited to share Nanonets-OCR-s, a powerful and lightweight (3B-parameter) VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.). Key features (with a quick usage sketch after the list):

LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.

Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.

Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.

Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.

Smart Checkbox & Radio Button Handling: Converts checkboxes and radio buttons to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.

Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
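For reference, here is a minimal sketch of how you might run the model through the generic Hugging Face image-text-to-text interface. It assumes a recent transformers release and a GPU; the prompt string below is illustrative, and the exact extraction prompt the model was trained with is documented on the model card:

```python
# Minimal sketch: Nanonets-OCR-s via the generic HF image-text-to-text API.
# Assumes recent `transformers`, `torch`, `pillow`, and a CUDA-capable GPU.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Illustrative prompt; see the model card for the exact training prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("page.png")},
        {"type": "text", "text": "Extract this page as structured markdown."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens, keeping only the generated Markdown.
markdown = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(markdown)
```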

Hugging Face / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s

Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...

kordlessagain - 15 days ago

I created a PowerShell script to run this locally on any PDF: https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...

It does work, but it is very slow on my older GPU (an NVIDIA GTX 1080 with 8 GB of VRAM). I'd say it's taking at least 5 minutes per page right now, maybe more.

Edit: If anyone is interested in trying a PDF-to-Markdown conversion utility built on this and hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour, and I'll post a link here when it's ready.
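For anyone wiring up something similar, here's a hypothetical sketch of the per-page pipeline such a script implies: rasterize each PDF page, then hand each image to the model. `ocr_page` is a stand-in for the model call shown earlier in the thread, not a real function from the release:

```python
# Hypothetical per-page pipeline: rasterize PDF pages, then OCR each one.
# Uses PyMuPDF (pip install pymupdf) and Pillow.
import io

import fitz  # PyMuPDF
from PIL import Image

def pdf_pages_as_images(path: str, dpi: int = 200):
    """Yield one PIL image per page, ready to feed to the OCR model."""
    doc = fitz.open(path)
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the given DPI
        yield Image.open(io.BytesIO(pix.tobytes("png")))

# for i, img in enumerate(pdf_pages_as_images("scan.pdf")):
#     markdown = ocr_page(img)  # hypothetical wrapper around the model call
```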

el_don_almighty - 15 days ago

I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that stores and retrieves the elements easily. Amazing work!

mvac - 15 days ago

How does it compare to Datalab/Marker (https://github.com/datalab-to/marker)? We evaluated many PDF-to-Markdown converters, and Marker performed the best, though it is not perfect.

ks2048 - 15 days ago

It's a shame all these models target Markdown and not something with more structure and a specification. There are different flavors of Markdown, and support for footnotes, references, figures, etc. is limited.

temp0826 - 15 days ago

I have a Shipibo (an indigenous Peruvian language) to Spanish dictionary that I've been trying to translate into a Shipibo-to-English dictionary using a couple of different LLMs, but I keep struggling with the formatting (two columns, strange line breaks, and the mix of Shipibo and Spanish in the definitions makes it difficult to grok). On top of that, the scan is pretty poor. May need to give this a try.

ZQ-Dev8 - 15 days ago

How does this compare with Docling (https://github.com/docling-project/docling)?

constantinum - 15 days ago

It would be interesting to know how it compares with LlamaParse, LLMWhisperer, Marker, and Reducto.

silversmith - 15 days ago

I'm curious: how does it do with non-English texts? It's my understanding that LLM-based OCR solutions fall way behind traditional ones once you introduce other languages.

raus22 - 15 days ago

With models like these, when multilingual support isn't mentioned, they will perform really badly on real-life non-English PDFs.

progval - 15 days ago

It's not open-source (nor open-weight): https://huggingface.co/nanonets/Nanonets-OCR-s/discussions/2

Bestora - 15 days ago

How does it handle documents with multi-column or multi-row tables?

e.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-JR1... (page 1 has a rowspan, page 29 a colspan)

nehalem - 15 days ago

How does it do with multi-column text and headers and footers?

tensor - 15 days ago

There are no benchmarks or accuracy measures on a held-out set?

jwr - 14 days ago

Thank you! This is very interesting — I'm just curious, why use such a small model?

I can comfortably run 27B models on my Mac, and I'd much rather process my PDF library with something that is less prone to hallucinations and handles multiple languages better…

b0a04gl - 15 days ago

The interesting bit is that it tags semantics during parsing itself, knowing something is a signature, watermark, or checkbox before layout reconstruction. Most pipelines bolt that on later using heuristics or classifiers.

Curious what that pre-tagging does for downstream simplification, especially for converting into JSON/HTML without extra passes.

Also wondering how they handle ambiguity in visual cues without layout metadata.
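On the extra-passes point, here's a sketch of what the pre-tagging buys you downstream: because the semantic tags come inline in the Markdown (the announcement names <img>, <signature>, and <watermark>), a single regex pass can lift them out with no second layout stage. `extract_tagged` is a hypothetical helper, not part of the release:

```python
# Hypothetical downstream pass: lift the inline semantic tags out of the
# model's Markdown output with plain regexes, no layout reconstruction.
import re

SEMANTIC_TAGS = ("img", "signature", "watermark")  # tags from the announcement

def extract_tagged(markdown: str) -> dict[str, list[str]]:
    """Collect the contents of each semantic tag the model emits inline."""
    return {
        tag: re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, flags=re.DOTALL)
        for tag in SEMANTIC_TAGS
    }

sample = (
    "# Invoice\n"
    "<watermark>CONFIDENTIAL</watermark>\n"
    "Total: $1,200\n"
    "<signature>J. Doe</signature>\n"
)
print(extract_tagged(sample))
# {'img': [], 'signature': ['J. Doe'], 'watermark': ['CONFIDENTIAL']}
```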

nnurmanov - 14 days ago

Are there benchmarks for these kinds of tools? How does it handle tables? Different languages?

huqedato - 14 days ago

Can it extract data from scientific graphs like bar charts, time series, etc.?

Eisenstein - 15 days ago

How does it do with handwriting?

CMCDragonkai - 14 days ago

Can this work on diagrams, like boxes and lines?
