Ask HN: Extracting the same table from PDF layouts?

3 points by oliver236 3 months ago · 0 comments · 1 min read


I’m processing a large volume of PDFs (30–40 pages each) with inconsistent layouts. Each document contains one specific table I need to extract, but every company formats it differently.

Current stack:
– Azure Document Intelligence (prebuilt)
– preprocessing (PyMuPDF → image → filters)
– a multimodal LLM to turn the detected table into clean JSON

Main issue: to localize the table, I currently rely on template-specific configs. At scale, this becomes unmanageable because there may be hundreds of unique layouts.

Has anyone solved this class of problem? Looking for strategies for:
– robust table localization across many templates
– hybrid rule-based + ML approaches
– layout-based detection
– “templateless” methods that generalize better
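To make the "hybrid" idea concrete, here is a rough sketch of the rule-based half: instead of a per-template config, score every table the layout step detects by how well its header cells fuzzy-match a small vocabulary, and keep the best one. It assumes the detection step already yields each candidate table as rows of cell strings. The header vocabulary, the 0.6 threshold, and the function names are all hypothetical placeholders, not something from my actual pipeline.

```python
from difflib import SequenceMatcher

# Hypothetical header vocabulary for the target table (placeholder values).
EXPECTED_HEADERS = ["item", "quantity", "unit price", "amount"]


def _similarity(a, b):
    """Case-insensitive fuzzy similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def header_score(row, expected=EXPECTED_HEADERS):
    """Average, over the expected headers, of the best match among the row's cells."""
    cells = [c for c in row if c and c.strip()]
    if not cells:
        return 0.0
    return sum(max(_similarity(e, c) for c in cells) for e in expected) / len(expected)


def pick_table(candidates, expected=EXPECTED_HEADERS, min_score=0.6, header_rows=3):
    """Return (index, score) of the best-matching candidate table.

    Only the first `header_rows` rows of each table are treated as possible
    header rows. Returns (None, best_score) when nothing reaches `min_score`,
    so the document can be routed to a fallback (e.g. the multimodal LLM).
    """
    best_idx, best = None, 0.0
    for i, table in enumerate(candidates):
        score = max((header_score(r, expected) for r in table[:header_rows]), default=0.0)
        if score > best:
            best_idx, best = i, score
    if best < min_score:
        return None, best
    return best_idx, best


if __name__ == "__main__":
    tables = [
        [["Name", "Address"], ["Foo", "Bar"]],
        [["Item", "Qty", "Unit Price", "Amount"], ["Widget", "2", "3.00", "6.00"]],
    ]
    print(pick_table(tables))
```

The appeal is that one vocabulary replaces hundreds of layout configs, and the `None` case gives an explicit hook for an ML fallback. The obvious weakness is tables whose headers are images, abbreviations, or in another language, which is where I'd expect a learned detector to be needed.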

No comments yet.
