Ask HN: Extracting the same table from PDF layouts?
I’m processing a large volume of PDFs (30–40 pages each) with inconsistent layouts. Each document contains one specific table I need to extract, but every company formats it differently.
Current stack:
– Azure Document Intelligence (prebuilt)
– preprocessing (PyMuPDF → image → filters)
– a multimodal LLM to turn the detected table into clean JSON
Main issue: to localize the table, I currently rely on template-specific configs. At scale this becomes unmanageable because there may be hundreds of unique layouts.
Has anyone solved this class of problem? Looking for strategies for:
– robust table localization across many templates
– hybrid rule-based + ML approaches
– layout-based detection
– “templateless” methods that generalize better
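To make the "templateless" idea concrete, here is a minimal sketch of one rule-based direction: instead of per-template configs, score every table the detector finds by how well its header row fuzzy-matches a list of column synonyms, then keep the best one. Everything named here is an assumption for illustration: the `TARGET_COLUMNS` synonyms, the `Candidate` shape, and the 0.8 / 0.6 thresholds are hypothetical, and the header text is assumed to come from whatever table detector is already in the pipeline. It is not a tested solution, only a shape for the approach.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical synonyms for the columns of the one table we want.
# In practice these would be collected from a sample of real documents.
TARGET_COLUMNS = {
    "description": ["description", "item", "particulars"],
    "quantity": ["quantity", "qty", "units"],
    "amount": ["amount", "total", "value"],
}


@dataclass
class Candidate:
    """One detected table: its page number and the text of its header row."""
    page: int
    header: list[str]


def _similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def score(candidate: Candidate, cell_threshold: float = 0.8) -> float:
    """Fraction of target columns that have a fuzzy match in the header."""
    if not candidate.header:
        return 0.0
    hits = 0
    for synonyms in TARGET_COLUMNS.values():
        best = max(
            _similarity(cell, synonym)
            for cell in candidate.header
            for synonym in synonyms
        )
        if best >= cell_threshold:
            hits += 1
    return hits / len(TARGET_COLUMNS)


def pick_table(candidates: list[Candidate], min_score: float = 0.6) -> Candidate | None:
    """Return the best-scoring candidate, or None if nothing clears min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score)
    return best if score(best) >= min_score else None


candidates = [
    Candidate(page=3, header=["Name", "Address", "Phone"]),
    Candidate(page=17, header=["Item", "Qty.", "Unit price", "Total"]),
]
chosen = pick_table(candidates)
print(chosen.page if chosen else "no match, route to LLM or manual review")
```

The `None` branch is the hybrid part: documents where no header clears the threshold could fall back to the multimodal LLM (or a human), so the rules only need to cover the common cases rather than every layout.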