🔍 RAG PDF Audit
See your documents through an AI's eyes.
A simple tool to check if your PDF will choke your RAG pipeline before you waste hours debugging.
⚠️ First Run Notice
Docling downloads ML models (~2GB) on first startup. This takes 30-60 seconds and only happens once. After that, models are cached and startup is fast.
Quick Start
# 1. System Dependencies (for OCR) sudo apt-get install tesseract-ocr # Linux # brew install tesseract # macOS # 2. Python Dependencies pip install -r requirements.txt # 3. Run streamlit run app.py
Then open http://localhost:8501 and upload a PDF.
What Does It Do?
Side-by-side comparison of two approaches:
- 💀 Naive Approach: Standard pypdf (what many RAG tutorials use)
- 🧠 Intelligent Approach: Docling with layout-awareness & OCR
Why?
Many RAG tutorials ignore the hard parts:
- Scanned PDFs (no text layer → RAG sees nothing)
- Tables (become garbled text soup)
- Multi-column layouts (wrong reading order breaks context)
This tool shows you instantly whether your document is "toxic" for standard RAG.
Output
- ✅ Green: Standard RAG might work (but watch out for layouts)
- ❌ Red: You need layout-aware parsing (e.g., Docling)
How It Works
Naive Parsing:
reader = PdfReader(file) text = page.extract_text() # Hope for the best
Result: Chaos with tables, fails completely on scans.
Intelligent Parsing:
converter = DocumentConverter(do_ocr=True, do_table_structure=True) result = converter.convert(file) markdown = result.document.export_to_markdown() # LLMs love this
Result: Clean markdown with preserved structure, tables, and hierarchy.
Tech Stack
- Frontend: Streamlit (interactive UI)
- Naive Parser: pypdf (standard library)
- Smart Parser: Docling (IBM Research, layout-aware)
- OCR: Tesseract (for scanned documents)
Use Cases
- Before building a RAG system: Check if your documents are suitable
- Debugging RAG quality issues: See what your pipeline actually sees
- Evaluating document ingestion strategies: Compare approaches visually
Extensibility
The parsers are modular and easy to swap. Want to try a different parser?
- Open
app.py - Replace the parsing logic in either column
- The side-by-side comparison makes it easy to evaluate alternatives
Popular alternatives: PyMuPDF, Unstructured, LlamaParse, Azure Document Intelligence
License
MIT

