api-llm-ocr: PDF to markdown using vision LLMs — tables, layouts, and structure preserved


LLM-powered PDF to markdown. uses vision models to actually read your documents — tables, headers, mixed layouts — and outputs clean, structured markdown. not traditional OCR.

curl -X POST "http://localhost:8000/ocr" -F "file=@document.pdf"



demo


NASA Apollo 17 flight docs — mixed orientations, messy layouts — converted to structured markdown.


what it does

  • vision model OCR — understands context, not just character shapes
  • parallel processing — 50-page PDF in seconds, not minutes
  • table preservation — detected and formatted as proper markdown tables
  • smart batching — configurable pages-per-request for speed vs accuracy tradeoff
  • retry with backoff — handles rate limits and timeouts without crashing
  • flexible input — file upload or URL, your choice
  • image descriptions — non-text elements get [Image: description] annotations
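
the parallel processing and smart batching above can be sketched in a few lines. this is a simplified illustration, not the project's actual service code (the real logic lives in swift_ocr/services/); `ocr_batch` stands in for whatever coroutine calls the vision model:

```python
import asyncio

def batch_pages(pages, batch_size):
    """Split rendered pages into batches of at most batch_size."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

async def ocr_all(pages, ocr_batch, batch_size=1, max_concurrent=5):
    """Run ocr_batch over every batch, with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run(batch):
        async with sem:
            return await ocr_batch(batch)

    results = await asyncio.gather(*(run(b) for b in batch_pages(pages, batch_size)))
    return "\n\n".join(results)
```

the semaphore is what keeps a 50-page PDF from firing 50 simultaneous API calls into a rate limit.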

cost

using OpenAI as an example (~1,500 tokens/page average):

| model | cost per 1,000 pages |
| --- | --- |
| GPT-4o | ~$15 |
| GPT-4o mini | ~$8 |
| batch API | ~$4 |

works with any OpenAI-compatible vision API. swap the endpoint and model in config.
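
for example, a .env pointing at a self-hosted OpenAI-compatible server might look like this (the variable names come from the configure section below; the values here are placeholders, and the exact mapping depends on how settings.py resolves the endpoint):

```
AZURE_OPENAI_ENDPOINT=http://localhost:11434/v1
OPENAI_API_KEY=unused-but-required
OPENAI_DEPLOYMENT_ID=your-local-vision-model
```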

install

git clone https://github.com/yigitkonur/api-llm-ocr.git
cd api-llm-ocr

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

configure

create a .env file:

# required
OPENAI_API_KEY=your_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_DEPLOYMENT_ID=your_vision_model_deployment

# optional
OPENAI_API_VERSION=gpt-4o
BATCH_SIZE=1
MAX_CONCURRENT_OCR_REQUESTS=5
MAX_CONCURRENT_PDF_CONVERSION=4

run

# pick one
uvicorn main:app --reload
uvicorn swift_ocr.app:app --reload
python -m swift_ocr
python -m swift_ocr --host 0.0.0.0 --port 8080 --workers 4

API lives at http://127.0.0.1:8000. auto-generated docs at /docs.

usage

upload a file

curl -X POST "http://127.0.0.1:8000/ocr" \
  -F "file=@/path/to/document.pdf"

process from URL

curl -X POST "http://127.0.0.1:8000/ocr" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'

response

{
  "text": "# document title\n\n## section 1\n\nextracted text...",
  "status": "success",
  "pages_processed": 5,
  "processing_time_ms": 1234
}
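
consuming that response from a script is a one-liner-plus-check; for example, writing the markdown to disk (field names as in the JSON above, helper name is illustrative):

```python
import json

def save_markdown(response_body: str, out_path: str) -> int:
    """Parse an /ocr JSON response, write the markdown to out_path,
    and return the number of pages processed."""
    payload = json.loads(response_body)
    if payload["status"] != "success":
        raise RuntimeError(f"OCR failed: {payload}")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(payload["text"])
    return payload["pages_processed"]
```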

health check

curl http://127.0.0.1:8000/health

error codes

| code | meaning |
| --- | --- |
| 200 | success |
| 400 | bad request (no file/URL, or both provided) |
| 422 | validation error |
| 429 | rate limited — retry with backoff |
| 500 | processing error |
| 504 | timeout downloading PDF |
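
on 429, the server expects callers to back off. a client-side sketch of exponential backoff (the server also retries upstream internally; these names are illustrative, not part of the API):

```python
import time

class RateLimited(Exception):
    """Raised when the server answers 429."""

def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on RateLimited, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```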

configuration

| variable | default | description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | API key |
| AZURE_OPENAI_ENDPOINT | (required) | endpoint URL |
| OPENAI_DEPLOYMENT_ID | (required) | vision model deployment ID |
| OPENAI_API_VERSION | gpt-4o | API version |
| BATCH_SIZE | 1 | pages per OCR request (1-10). higher = faster, less accurate |
| MAX_CONCURRENT_OCR_REQUESTS | 5 | parallel OCR calls |
| MAX_CONCURRENT_PDF_CONVERSION | 4 | parallel page renders. match your CPU cores |

tuning

  • high accuracy: BATCH_SIZE=1
  • balanced: BATCH_SIZE=5, MAX_CONCURRENT_OCR_REQUESTS=10
  • max throughput: BATCH_SIZE=10, MAX_CONCURRENT_OCR_REQUESTS=20 (watch rate limits)
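
to reason about a setting before running it: a 50-page PDF at BATCH_SIZE=5 is 10 OCR requests, and with MAX_CONCURRENT_OCR_REQUESTS=10 they all go out in one wave. a tiny (hypothetical) estimator:

```python
import math

def ocr_waves(pages: int, batch_size: int, max_concurrent: int) -> tuple:
    """Return (total OCR requests, sequential waves of requests)."""
    requests = math.ceil(pages / batch_size)   # one request per batch
    waves = math.ceil(requests / max_concurrent)
    return requests, waves
```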

project structure

swift_ocr/
  __init__.py           — package init
  __main__.py           — CLI entry point
  app.py                — FastAPI app factory
  config/
    settings.py         — pydantic settings (type-safe config)
  core/
    exceptions.py       — custom exception hierarchy
    logging.py          — structured logging
    retry.py            — exponential backoff
  schemas/
    ocr.py              — pydantic request/response models
  services/
    ocr.py              — vision model OCR service
    pdf.py              — PDF conversion service
  api/
    deps.py             — dependency injection
    exceptions.py       — FastAPI exception handlers
    router.py           — route aggregation
    routes/
      health.py         — health check endpoints
      ocr.py            — OCR endpoints

troubleshooting

| problem | fix |
| --- | --- |
| missing env vars | check .env has OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, OPENAI_DEPLOYMENT_ID |
| 429 rate limits | reduce MAX_CONCURRENT_OCR_REQUESTS or BATCH_SIZE |
| timeout errors | large PDFs take time — backoff is built in |
| garbled output | make sure your PDF isn't password-protected or corrupted |
| tables misformatted | try BATCH_SIZE=1 for complex tables |
| failed to init client | verify endpoint format: https://your-resource.openai.azure.com/ |

license

AGPL v3 — required by PyMuPDF dependency.

if you want MIT, swap PyMuPDF for pdf2image + Poppler. the rest of the code is yours.
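
the swap point is the page renderer. a sketch of that seam — isolating the AGPL dependency behind one function so either backend can be plugged in (a simplified illustration, not the project's pdf.py; check your installed versions for the exact call signatures):

```python
def render_pages(pdf_path, dpi=200, backend="pymupdf"):
    """Render a PDF to page images, keeping the AGPL dependency
    (PyMuPDF) isolated behind a single swappable function."""
    if backend == "pymupdf":
        import fitz  # PyMuPDF (AGPL)
        doc = fitz.open(pdf_path)
        return [page.get_pixmap(dpi=dpi) for page in doc]
    elif backend == "pdf2image":
        from pdf2image import convert_from_path  # MIT, needs Poppler installed
        return convert_from_path(pdf_path, dpi=dpi)
    raise ValueError(f"unknown backend: {backend}")
```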