Settings

Theme

Ask HN: What's the best way to extract text from information dense pdfs?

1 points by rkwz 4 months ago · 2 comments · 1 min read

Reader

Examples for PDFs: Pitch Decks, Annual Reports etc which have text, charts, tables etc.

incomingpain 4 months ago

Obviously all the big named online AI will do this trivially. Upload file, ask for everything you want in whatever format you want. If I were doing it: https://mistral.ai/news/mistral-ocr

To do it offline due to privacy, vision enabled LLM. Biggest Gemma you can handle, qwen2.5 vl, or Mistral small. I'd probably choose mistral.

Openwebui does pdfs built in. https://docs.openwebui.com/features/document-extraction/

TBH havent tried it myself but I bet it works.

pestatije 4 months ago

pdf2txt

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection