GitHub - run-llama/liteparse: A fast, helpful, and open-source document parser

Looking for LiteParse V1? Follow this link to the old code

LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.

Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

Sign up for LlamaParse free

Overview

Fast Text Parsing: Spatial text parsing using PDFium
Flexible OCR System:
- Built-in: Tesseract (zero setup, bundled with the library)
- HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
- Standard API: Simple, well-defined OCR API specification
Complexity Detection: Cheaply check whether a document needs OCR or heavier parsing — route, reject, or estimate cost before a full parse
Screenshot Generation: Generate high-quality page screenshots for LLM agents
Multiple Output Formats: Markdown, JSON, and Text
Markdown Output: Structured Markdown with headings, tables, lists, images, and links — great for feeding LLMs and RAG pipelines
Bounding Boxes: Precise text positioning information
Multi-language: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)
Multi-platform: Linux, macOS (Intel/ARM), Windows

flowchart LR
      subgraph Input["Input Formats"]
          direction TB
          PDF["PDF"]
          DOCX["DOCX"]
          XLSX["XLSX"]
          PPTX["PPTX"]
          IMG["Images"]
      end

      subgraph Core["Rust Core"]
          direction TB
          CONV["Format Conversion\nLibreOffice / Rust image + resvg + usvg crates"]
          EXTRACT["Text Extraction\nPDFium C library"]
          OCR["Selective OCR\nTesseract / HTTP / Custom"]
          MERGE["OCR Merge\nNative text + OCR results"]
          PROJ["Grid Projection\nSpatial layout reconstruction"]
          CONV --> EXTRACT
          EXTRACT --> OCR --> MERGE --> PROJ
          EXTRACT --> MERGE
      end

      subgraph Output[" Output "]
          direction TB
          JSON["Structured JSON\ntext + bounding boxes"]
          TEXT["Plain Text\nlayout-preserved"]
          SCREEN["Screenshots\nPNG rendering"]
      end

      subgraph Bindings["Language Bindings"]
          direction TB
          NAPI["Node.js / TypeScript\nnapi-rs"]
          PYO3["Python\nPyO3"]
          WASM["Browser / WASM\nwasm-bindgen"]
          CLI["CLI\ncargo / npm / pip"]
          NAPI ~~~ PYO3 ~~~ WASM ~~~ CLI
      end

      PDF --> EXTRACT
      DOCX & XLSX & PPTX & IMG --> CONV
      PROJ --> JSON & TEXT & SCREEN
      JSON & TEXT & SCREEN --> Bindings

      style Input fill:#F5F5F5,color:#000000,stroke:#37D7FA,stroke-width:2px
      style Core fill:#F5F5F5,color:#000000,stroke:#3E18F9,stroke-width:2px
      style Output fill:#F5F5F5,color:#000000,stroke:#FF8705,stroke-width:2px
      style Bindings fill:#F5F5F5,color:#000000,stroke:#FF8DF2,stroke-width:2px

      style PDF fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style DOCX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style XLSX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style PPTX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style IMG fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px

      style CONV fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style EXTRACT fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style OCR fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style MERGE fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style PROJ fill:#4B72FE,color:#FFFFFF,stroke:#3E18F9,stroke-width:2px

      style JSON fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
      style TEXT fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
      style SCREEN fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px

      style NAPI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style PYO3 fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style WASM fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style CLI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px

Installation

Install via your preferred package manager. All versions (except WASM) ship with the same lit CLI.

Language	Install	Library Docs
Node.js / TypeScript	`npm i -g @llamaindex/liteparse`	Node.js README
Python	`pip install liteparse`	Python README
Rust	`cargo install liteparse` (CLI) / `cargo add liteparse` (lib)	Rust README (crates.io)
Browser (WASM)	`npm i @llamaindex/liteparse-wasm`	WASM README

Agent Skill

You can use liteparse as an agent skill, downloading it with the skills CLI tool:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

Alternatively, copy the SKILL.md file into your own skills setup.

CLI Usage

The CLI is the same across all installations (npm, pip, cargo install).

Parse Files

# Basic parsing
lit parse document.pdf

# Parse to Markdown — headings, tables, lists, images, links
lit parse document.pdf --format markdown -o output.md

# Parse with specific format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr

# Include page-scoped vector path data in JSON
lit parse document.pdf --format json --extract-vector-graphics

# Include rich per-item PDF text metadata
lit parse document.pdf --format json --extract-text-metadata

# Include page annotations in structured JSON
lit parse document.pdf --format json --extract-annotations

# Include AcroForm widget fields and values (repairs orphaned widgets in memory)
lit parse document.pdf --format json --extract-form-fields

# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
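To make the `--target-pages` syntax concrete, here is a minimal Python sketch of how a spec such as "1-5,10,15-20" expands into individual pages. `expand_page_spec` is a hypothetical helper written for illustration, not part of LiteParse, and it assumes 1-based page numbers and inclusive ranges; the CLI's own parser may treat edge cases differently.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec like "1-5,10,15-20" into a sorted list of page numbers.

    Assumptions (not confirmed by LiteParse): pages are 1-based and
    ranges are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```

Calling `expand_page_spec("1,3,5")` returns `[1, 3, 5]`; overlapping ranges are de-duplicated because pages are collected in a set.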

Markdown Output

LiteParse can render documents directly to Markdown, reconstructing headings, tables, lists, images, and links from the spatial layout, which is ideal for feeding documents to LLMs and RAG pipelines. This mode is purely heuristic and rule-based, so it is fast, but complex documents may not render perfectly.

# Render to Markdown
lit parse document.pdf --format markdown -o output.md

# Strip images instead of emitting placeholders
lit parse document.pdf --format markdown --image-mode off

# Extract embedded images to disk and reference them from the markdown
lit parse document.pdf --format markdown --image-mode embed --extract-images --image-output-dir ./images

# Extract image bytes and metadata without changing Markdown image handling
lit parse document.pdf --format json --extract-images

# Emit link text as plain text (no [text](url) syntax)
lit parse document.pdf --format markdown --no-links

# Include tagged-PDF logical structure in JSON
lit parse document.pdf --format json --extract-structure-tree

Image handling is controlled by --image-mode:

Mode	Behavior
`placeholder` (default)	Emits `![](img_pN_K.png)` references in reading order
`off`	Strips images entirely
`embed`	Emits the same image references as `placeholder`

--extract-images is the only option that enables embedded-image extraction. --image-output-dir requires it and writes the extracted bytes to disk. JSON output contains each image's name, path, page bbox, intrinsic pixel dimensions, rotation, format, and duplicate relationship; pixel bytes are never embedded in JSON. Identical image resources reuse the same output file.

Library callers can opt in with extract_images: true (Rust), extractImages: true (Node/WASM), or extract_images=True (Python). It defaults to false. Markdown image mode controls presentation only; placeholder refs are still discovered without bytes.
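If you post-process the Markdown yourself, the image references can be collected with a small regular expression. The sketch below is illustrative only: `find_image_refs` is a hypothetical helper, and reading `N` as the page number and `K` as a per-page image index is an assumption based on the `img_pN_K.png` naming shown in the table above.

```python
import re

# Matches Markdown image references such as ![](img_p3_0.png).
# Interpreting the two numbers as (page, index) is an assumption.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\((img_p(\d+)_(\d+)\.png)\)")


def find_image_refs(markdown: str) -> list[tuple[str, int, int]]:
    """Return (file name, N, K) for each image reference, in document order."""
    return [
        (match.group(1), int(match.group(2)), int(match.group(3)))
        for match in IMAGE_REF.finditer(markdown)
    ]
```

This can be used, for example, to check that every reference in the Markdown has a matching file in the directory passed to `--image-output-dir`.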

Markdown reconstruction quality varies with document complexity. For the hardest documents (dense tables, multi-column layouts, scans), LlamaParse remains the most accurate option.

Vector Graphics

Vector path output is opt-in because path-heavy PDFs can produce large payloads. Enable it with --extract-vector-graphics, Rust/Python extract_vector_graphics = true, or JavaScript/WASM extractVectorGraphics: true. Each page then includes vector_graphics (vectorGraphics in JavaScript) with:

shapes: path bounding box, stroke/fill paint state and ARGB colors, and whether the path contains a Bezier curve.
lines: compatible horizontal/vertical segments merged using stroke width and paint colors, with top-left 72-DPI viewport coordinates.

The representation follows LlamaParse PDFium path extraction; LiteParse calls the shape rectangle bbox rather than PDFium's coords, and uses width / height rather than w / h. The field is absent (or None/undefined) by default. Diagonal and curved segments are represented by their parent shape but are not emitted as lines.

Tagged PDF structure tree

Enable --extract-structure-tree (Rust/Python extract_structure_tree, JavaScript/WASM extractStructureTree) to add a page-scoped structure_tree. It preserves every root and recursively exposes element type, ID, actual/alternate text, title, typed scalar attributes, marked-content IDs, children, and referenced link annotations. The field is absent by default; when the option is enabled, untagged pages contain roots: [].

Document metadata, content bounds, and XFA packets

Parse results (Rust/Node/Python APIs) carry the document's /Info creator and producer entries when present; these are API-only and never appear in CLI JSON output.

Enable --extract-content-bounds (Rust/Python extract_content_bounds, JavaScript/WASM extractContentBounds) to add a per-page content_bounds: the union bbox of the page's top-level content objects in viewport coords (absent for empty pages).

Enable --extract-xfa-packets (Rust/Python extract_xfa_packets, JavaScript/WASM extractXfaPackets) to add xfa_packets with each raw XFA packet's index, name, byte length, and XML content; non-XFA documents yield an empty list.

All of these are off by default, so default JSON output is unchanged.
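For readers unfamiliar with the term, a "union bbox" is the smallest rectangle that encloses a set of boxes. The following Python sketch shows the idea; it is not LiteParse's implementation, `union_bbox` is a hypothetical name, and the `(x, y, width, height)` tuple layout with a top-left origin is an assumption made for illustration.

```python
from typing import Iterable, Optional

# Assumed box layout: (x, y, width, height) with a top-left origin.
Box = tuple[float, float, float, float]


def union_bbox(boxes: Iterable[Box]) -> Optional[Box]:
    """Return the smallest box enclosing all inputs, or None if there are none."""
    boxes = list(boxes)
    if not boxes:
        return None  # mirrors content_bounds being absent for empty pages
    left = min(x for x, _, _, _ in boxes)
    top = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = max(y + h for _, y, _, h in boxes)
    return (left, top, right - left, bottom - top)
```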

Screenshot raster signals

Screenshots draw AcroForm field appearances (filled values, checkbox states) on top of the page raster, so form data is visible both in the render and to OCR. Each screenshot result reports is_solid_fill, which indicates a blank page after render. With detect_screenshot_rects (detectScreenshotRects in Node), each result also reports rects: solid same-color rectangles and lines found in the raster, in viewport coords. This covers scanned or flattened pages that carry no vector paths.

Check Complexity

Before committing to a full parse, check whether a document actually needs OCR or heavier processing. This is a cheap, text-layer-only pass — useful for routing documents to different pipelines, rejecting ones you can't handle, or estimating cost.

# Print the complexity verdict and per-page JSON
lit is-complex document.pdf

# Use as a shell predicate — only parse with --no-ocr when the document is simple
lit is-complex document.pdf --quiet && lit parse document.pdf --no-ocr

# List the pages that need OCR
lit is-complex document.pdf --compact | jq '[.[] | select(.needs_ocr) | .page_number]'

It always prints per-page JSON to stdout, a human-readable verdict to stderr, and exits non-zero when any page needs OCR. Each page carries a needs_ocr verdict and a list of reasons (scanned, no-text, sparse-text, embedded-images, garbled, vector-text).
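The same routing can be done from a script. The Python sketch below mirrors the jq filter above: it assumes the stdout payload is a JSON array of page objects with `page_number` and `needs_ocr` keys, as that filter implies. `pages_needing_ocr` and `check_document` are hypothetical helper names, and `check_document` requires the `lit` binary on your PATH.

```python
import json
import subprocess


def pages_needing_ocr(report_json: str) -> list[int]:
    """Return the page numbers flagged with needs_ocr in an is-complex report."""
    pages = json.loads(report_json)
    return [page["page_number"] for page in pages if page.get("needs_ocr")]


def check_document(path: str) -> list[int]:
    """Run `lit is-complex` on a file and return the pages that need OCR."""
    # The command exits non-zero when any page needs OCR, so a non-zero
    # status is an expected outcome here and check=True is deliberately omitted.
    result = subprocess.run(
        ["lit", "is-complex", path, "--compact", "--quiet"],
        capture_output=True,
        text=True,
    )
    return pages_needing_ocr(result.stdout)
```

An empty list from `check_document("document.pdf")` means the document can be parsed with `--no-ocr`. If the command fails for another reason (for example, a missing file), stdout will not be valid JSON and `json.loads` will raise.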

Batch Parsing

Parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

CLI Reference

Parse Command

lit parse [OPTIONS] <file>

Options:
  -o, --output <file>          Output file path
      --format <format>        Output format: json|text|markdown [default: text]
      --no-ocr                 Disable OCR
      --ocr-language <lang>    OCR language, Tesseract format [default: eng]
      --ocr-server-url <url>   HTTP OCR server URL (uses Tesseract if not provided)
      --tessdata-path <path>   Path to tessdata directory
      --max-pages <n>          Max pages to parse [default: 1000]
      --target-pages <pages>   Pages to parse (e.g., "1-5,10,15-20")
      --dpi <dpi>              Rendering DPI [default: 150]
      --image-mode <mode>      Markdown image handling: off|placeholder|embed [default: placeholder]
      --extract-images         Extract embedded image bytes and metadata
      --image-output-dir <dir> Write extracted images; requires --extract-images
      --extract-text-metadata  Include rich PDF text metadata in text items
      --extract-vector-graphics Include page vector shapes and merged H/V lines
      --no-links               Emit link anchor text as plain text (no [text](url)) in markdown
      --extract-annotations    Include PDF annotations in page output
      --extract-form-fields    Include AcroForm widget fields and values
      --preserve-small-text    Keep very small text
      --password <password>    Password for encrypted documents
      --num-workers <n>        Concurrent OCR workers [default: CPU cores - 1]
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

Batch Parse Command

lit batch-parse [OPTIONS] <input-dir> <output-dir>

Options:
      --format <format>        Output format: json|text|markdown [default: text]
      --no-ocr                 Disable OCR
      --ocr-language <lang>    OCR language [default: eng]
      --ocr-server-url <url>   HTTP OCR server URL
      --tessdata-path <path>   Path to tessdata directory
      --max-pages <n>          Max pages per file [default: 1000]
      --dpi <dpi>              Rendering DPI [default: 150]
      --recursive              Recursively search input directory
      --extension <ext>        Only process files with this extension (e.g., ".pdf")
      --password <password>    Password for encrypted documents
      --num-workers <n>        Concurrent OCR workers
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

Screenshot Command

lit screenshot [OPTIONS] <file>

Options:
  -o, --output-dir <dir>       Output directory [default: ./screenshots]
      --target-pages <pages>   Pages to screenshot (e.g., "1,3,5" or "1-5")
      --dpi <dpi>              Rendering DPI [default: 150]
      --password <password>    Password for encrypted documents
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

Is-Complex Command

lit is-complex [OPTIONS] <file>

Options:
      --compact                Emit dense, whitespace-free JSON instead of pretty-printed
      --max-pages <n>          Max pages to check [default: 1000]
      --target-pages <pages>   Pages to check (e.g., "1-5,10,15-20")
      --password <password>    Password for encrypted documents
  -q, --quiet                  Suppress the stderr verdict
  -h, --help                   Print help

Prints per-page JSON to stdout and a COMPLEX/SIMPLE verdict to stderr; exits non-zero when any page needs OCR, so it composes as a shell predicate.

OCR Setup

Default: Tesseract

Tesseract is bundled and works out of the box:

lit parse document.pdf                    # OCR enabled by default
lit parse document.pdf --ocr-language fra # Specify language
lit parse document.pdf --no-ocr           # Disable OCR

For offline or air-gapped environments, set TESSDATA_PREFIX to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

Or pass the path directly:

lit parse document.pdf --tessdata-path /path/to/tessdata
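In an air-gapped setup it can help to verify the tessdata directory before parsing. This Python sketch checks that a `.traineddata` file exists for each requested language. `missing_traineddata` is a hypothetical helper, and splitting on `+` follows Tesseract's own convention for combining languages (for example `eng+fra`); whether `--ocr-language` accepts combined codes is an assumption here.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir: str, languages: str) -> list[str]:
    """Return the language codes that have no <code>.traineddata file.

    `languages` uses Tesseract's format, e.g. "eng" or "eng+fra".
    """
    directory = Path(tessdata_dir)
    codes = [code for code in languages.split("+") if code]
    return [
        code
        for code in codes
        if not (directory / f"{code}.traineddata").is_file()
    ]
```

An empty result means every requested language file is present in the directory you plan to pass via TESSDATA_PREFIX or `--tessdata-path`.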

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines such as EasyOCR and PaddleOCR.

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

POST /ocr endpoint
Accepts file and language parameters
Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
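When writing a custom server, it is useful to check responses against this shape before pointing LiteParse at it. The Python sketch below validates only what the summary above states (a `results` list whose items carry `text`, a four-number `bbox`, and a numeric `confidence`). `validate_ocr_response` is a hypothetical helper; value ranges and coordinate units are defined in OCR_API_SPEC.md and are not checked here.

```python
from numbers import Real


def _is_number(value: object) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload: object) -> list[str]:
    """Return a list of problems found in a decoded /ocr JSON response."""
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        return ["top-level object must contain a 'results' list"]
    errors: list[str] = []
    for index, item in enumerate(payload["results"]):
        if not isinstance(item, dict):
            errors.append(f"results[{index}] is not an object")
            continue
        if not isinstance(item.get("text"), str):
            errors.append(f"results[{index}].text must be a string")
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4 and all(_is_number(v) for v in bbox)):
            errors.append(f"results[{index}].bbox must be [x1, y1, x2, y2]")
        if not _is_number(item.get("confidence")):
            errors.append(f"results[{index}].confidence must be a number")
    return errors
```

An empty list means the response matches the shape; otherwise each entry names the offending field.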

Multi-Format Input Support

LiteParse supports automatic conversion of various document formats to PDF before parsing.

Supported Input Formats

Office Documents (via LibreOffice)

Word: .doc, .docx, .docm, .odt, .rtf, .pages
PowerPoint: .ppt, .pptx, .pptm, .odp, .key
Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers

Install LibreOffice for automatic conversion:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows
choco install libreoffice-fresh

On Windows, you may need to add LibreOffice's program directory (usually C:\Program Files\LibreOffice\program) to your PATH.

Images (native support)

Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Note: As of v2.8.0, ImageMagick is no longer required to convert images to PDF. Conversion is handled natively by the Rust code.

Environment Variables

Variable	Description
`TESSDATA_PREFIX`	Path to a directory containing Tesseract `.traineddata` files. Used for offline/air-gapped environments.

Development

The project is a Rust workspace with the core library and language-specific binding crates.

crates/
├── liteparse/          # Core library + CLI binary
├── liteparse-napi/     # Node.js bindings (napi-rs)
├── liteparse-python/   # Python bindings (PyO3)
├── liteparse-wasm/     # WASM bindings (wasm-bindgen)
├── pdfium/             # PDFium Rust wrapper
└── pdfium-sys/         # PDFium FFI bindings
packages/
├── node/               # npm package (TS wrapper + native binary)
├── python/             # PyPI package (Python wrapper + native binary)
└── wasm/               # WASM npm package

Building

# Build the CLI
cargo build --release -p liteparse

# Build Node.js bindings
cd packages/node && npm run build

# Build Python bindings
cd packages/python && maturin develop --release

# Build WASM
cd packages/wasm && npm run build

We provide a fairly detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

License

Apache 2.0

Credits

Built on top of:

PDFium - PDF rendering and text extraction
Tesseract - OCR engine (via tesseract-rs)
EasyOCR - HTTP OCR server (optional)
PaddleOCR - HTTP OCR server (optional)
napi-rs - Node.js native bindings
PyO3 - Python native bindings
wasm-bindgen - WebAssembly bindings