Ask HN: Any PDF Benchmarks?

2 points by nnurmanov 8 months ago · 1 comment · 1 min read


I am testing several PDF parsing libraries, and unfortunately most of them have issues. For example, many struggle with non-English languages, while others can’t reliably handle tables. A standardized PDF benchmark would be incredibly helpful: at the very least, it would let us identify obvious shortcomings without installing and testing each library individually.

Is anyone building such a benchmark?
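
To make the idea concrete, here is a minimal sketch of what the scoring side of such a benchmark might look like, assuming you already have each library's extracted text and a hand-checked reference transcript for the same PDF. It computes a character error rate (edit distance divided by reference length). The function names and the `parser_a` / `parser_b` labels are hypothetical and not taken from any existing benchmark, and table structure is not scored here.

```python
from __future__ import annotations

import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so layout noise is not penalized."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def character_error_rate(extracted: str, reference: str) -> float:
    """Edit distance relative to the reference length (0.0 means identical)."""
    ref = normalize(reference)
    hyp = normalize(extracted)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(hyp, ref) / len(ref)


def score_parsers(outputs: dict[str, str], reference: str) -> dict[str, float]:
    """Score each parser's extracted text against one reference transcript."""
    return {
        name: character_error_rate(text, reference)
        for name, text in outputs.items()
    }


if __name__ == "__main__":
    reference = "Привет мир"
    # Hypothetical outputs: one parser keeps Cyrillic, one replaces it with "?".
    outputs = {"parser_a": "Привет  мир", "parser_b": "?????? ???"}
    for name, cer in score_parsers(outputs, reference).items():
        print(f"{name}: CER = {cer:.2f}")
```

A real benchmark would also need a shared corpus of PDFs with reference transcripts per language, plus a separate metric for tables, since a flat character comparison says nothing about cell structure.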

nnurmanov (OP) 8 months ago

For example, markitdown from Microsoft can't recognize Cyrillic text. When I looked into it, I found that it uses pdfminer.six under the hood, and there is an unresolved issue there about language support.

Docling is OK with tables, but fails with Cyrillic text.

marker-pdf is OK with tables, but it also fails with Cyrillic text.

What other PDF parsing libraries exist? I would prefer an on-premise solution, but if I can't find one that is reliable and accurate, I might consider cloud-based solutions as well.
