LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.
Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.
Overview
- Fast Text Parsing: Spatial text parsing using PDF.js
- Flexible OCR System:
  - Built-in: Tesseract.js (zero setup, works out of the box!)
  - HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
  - Standard API: Simple, well-defined OCR API specification
- Screenshot Generation: Generate high-quality page screenshots for LLM agents
- Multiple Output Formats: JSON and Text
- Bounding Boxes: Precise text positioning information
- Standalone Binary: No cloud dependencies, runs entirely locally
- Multi-platform: Linux, macOS (Intel/ARM), Windows
Installation
CLI Tool
Option 1: Global Install (Recommended)
Install globally via npm to use the lit command anywhere:
npm i -g @llamaindex/liteparse
Then use it:
lit parse document.pdf
lit screenshot document.pdf
On macOS and Linux, liteparse can also be installed via brew:
brew tap run-llama/liteparse
brew install llamaindex-liteparse
Option 2: Install from Source
You can clone the repo and install the CLI globally from source:
git clone https://github.com/run-llama/liteparse.git
cd liteparse
npm run build
npm pack
npm install -g ./liteparse-*.tgz
Agent Skill
You can use liteparse as an agent skill, downloading it with the skills CLI tool:
npx skills add run-llama/llamaparse-agent-skills --skill liteparse
Alternatively, copy the SKILL.md file into your own skills setup.
Usage
Parse Files
# Basic parsing
lit parse document.pdf

# Parse with specific format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr
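The --target-pages value mixes single pages and inclusive ranges. As a rough illustration of how such a spec expands into page numbers, here is a minimal TypeScript sketch. parseTargetPages is a hypothetical helper, not part of LiteParse, and the tool's own parsing and validation rules may differ.

```typescript
// Hypothetical helper (not a LiteParse API): expands a spec such as
// "1-5,10,15-20" into a sorted list of unique 1-based page numbers.
function parseTargetPages(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas (an assumption)
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) {
      throw new Error(`Invalid page token: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.push(page);
    }
  }
  // Sort ascending, then drop duplicates produced by overlapping ranges.
  pages.sort((a, b) => a - b);
  return pages.filter((page, i) => i === 0 || pages[i - 1] !== page);
}

console.log(parseTargetPages("1-5,10,15-20"));
// → [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

Rejecting malformed tokens up front, rather than silently skipping them, makes a typo in the page spec visible before any parsing work starts.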
Batch Parsing
You can also parse an entire directory of documents:
lit batch-parse ./input-directory ./output-directory
Generate Screenshots
Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.
# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

# Screenshot page range
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
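To get a feel for what --dpi does to image size: PDF page dimensions are measured in points (1 point = 1/72 inch), so pixel dimensions scale linearly with DPI. The sketch below works through that arithmetic. renderedSize is a hypothetical helper, and rounding to whole pixels is an assumption; LiteParse's renderer may round differently.

```typescript
// PDF page sizes are expressed in points; 72 points make one inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Hypothetical helper (not a LiteParse API). Rounding to the nearest
// whole pixel is an assumption made for illustration.
function renderedSize(widthPt: number, heightPt: number, dpi: number): PixelSize {
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
console.log(renderedSize(612, 792, 150)); // { width: 1275, height: 1650 }
console.log(renderedSize(612, 792, 300)); // { width: 2550, height: 3300 }
```

Doubling the DPI quadruples the pixel count, which matters when screenshots are sent to an LLM with image size limits.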
Library Usage
Install as a dependency in your project:
npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse

import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);
Buffer / Uint8Array Input
You can pass raw bytes directly instead of a file path. PDF buffers are parsed with zero disk I/O — no temp files are written:
import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';

const parser = new LiteParse();

// From a file read
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);

// From an HTTP response
const response = await fetch('https://example.com/document.pdf');
const buffer = Buffer.from(await response.arrayBuffer());
const result2 = await parser.parse(buffer);
Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:
const screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);
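Because PDF and non-PDF buffers take different paths (in memory versus a temp directory), it can help to know which kind of bytes you are about to pass in. A well-formed PDF begins with the ASCII header "%PDF-", so a caller can check the first bytes as sketched below. looksLikePdf is a hypothetical helper, and whether LiteParse detects formats the same way is an assumption.

```typescript
// ASCII bytes for "%PDF-", the header that opens a well-formed PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper (not a LiteParse API): true when the buffer starts
// with the PDF header bytes.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}

const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]); // "%PDF-1.7"
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // start of a PNG file

console.log(looksLikePdf(pdfHeader)); // true
console.log(looksLikePdf(pngHeader)); // false
```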
CLI Options
Parse Command
$ lit parse --help
Usage: lit parse [options] <file>
Parse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)
Options:
-o, --output <file> Output file path
--format <format> Output format: json|text (default: "text")
--ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
--no-ocr Disable OCR
--ocr-language <lang> OCR language(s) (default: "en")
--num-workers <n> Number of pages to OCR in parallel (default: CPU cores - 1)
--max-pages <n> Max pages to parse (default: "10000")
--target-pages <pages> Target pages (e.g., "1-5,10,15-20")
--dpi <dpi> DPI for rendering (default: "150")
--no-precise-bbox Disable precise bounding boxes
--preserve-small-text Preserve very small text
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
Batch Parse Command
$ lit batch-parse --help
Usage: lit batch-parse [options] <input-dir> <output-dir>
Parse multiple documents in batch mode (reuses PDF engine for efficiency)
Options:
--format <format> Output format: json|text (default: "text")
--ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
--no-ocr Disable OCR
--ocr-language <lang> OCR language(s) (default: "en")
--num-workers <n> Number of pages to OCR in parallel (default: CPU cores - 1)
--max-pages <n> Max pages to parse per file (default: "10000")
--dpi <dpi> DPI for rendering (default: "150")
--no-precise-bbox Disable precise bounding boxes
--recursive Recursively search input directory
--extension <ext> Only process files with this extension (e.g., ".pdf")
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
Screenshot Command
$ lit screenshot --help
Usage: lit screenshot [options] <file>
Generate screenshots of PDF pages
Options:
-o, --output-dir <dir> Output directory for screenshots (default: "./screenshots")
--target-pages <pages> Page numbers to screenshot (e.g., "1,3,5" or "1-5")
--dpi <dpi> DPI for rendering (default: "150")
--format <format> Image format: png|jpg (default: "png")
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
OCR Setup
Default: Tesseract.js
# Tesseract is enabled by default
lit parse document.pdf

# Specify language
lit parse document.pdf --ocr-language fra

# Disable OCR
lit parse document.pdf --no-ocr
By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the TESSDATA_PREFIX environment variable to a directory containing pre-downloaded .traineddata files:
export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

You can also pass tessdataPath in the library config:
const parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });
Optional: HTTP OCR Servers
For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for two popular OCR engines, EasyOCR and PaddleOCR.
You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).
The API requires:
- POST /ocr endpoint
- Accepts file and language parameters
- Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
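The response shape above can be written down as TypeScript types with a runtime check, which is handy when testing a custom OCR server. This is a minimal sketch based only on the fields listed here: the type names and helper functions are hypothetical, and OCR_API_SPEC.md remains the authoritative definition (it may include fields or constraints not shown).

```typescript
// Field names follow the response format listed above; the type names
// themselves are illustrative and not part of LiteParse.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const bbox = record["bbox"];
  return (
    typeof record["text"] === "string" &&
    typeof record["confidence"] === "number" &&
    Array.isArray(bbox) &&
    bbox.length === 4 &&
    bbox.every((n) => typeof n === "number")
  );
}

// Runtime check that a decoded JSON body matches the documented shape.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as Record<string, unknown>)["results"];
  return Array.isArray(results) && results.every(isOcrResult);
}

// Width and height of a result's box, assuming x2 >= x1 and y2 >= y1.
function bboxSize(result: OcrResult): { width: number; height: number } {
  const [x1, y1, x2, y2] = result.bbox;
  return { width: x2 - x1, height: y2 - y1 };
}

const sample: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
if (isOcrResponse(sample)) {
  for (const item of sample.results) {
    console.log(item.text, bboxSize(item)); // Hello { width: 100, height: 20 }
  }
}
```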
See the example servers in ocr/easyocr/ and ocr/paddleocr/ as templates.
For the complete OCR API specification, see OCR_API_SPEC.md.
Multi-Format Input Support
LiteParse automatically converts various document formats to PDF before parsing, so unlike PDF-only parsing tools it also accepts Office documents and images.
Supported Input Formats
Office Documents (via LibreOffice)
- Word: .doc, .docx, .docm, .odt, .rtf
- PowerPoint: .ppt, .pptx, .pptm, .odp
- Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv
Install LibreOffice and LiteParse will automatically convert these formats to PDF for parsing:
# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows (might require admin permissions)
choco install libreoffice-fresh
On Windows, you might need to add the directory containing the LibreOffice CLI executable (generally C:\Program Files\LibreOffice\program) to the PATH environment variable and restart the machine.
Images (via ImageMagick)
- Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
Just install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):
# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

# Windows (might require admin permissions)
choco install imagemagick.app
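The two format lists in this section amount to a mapping from file extension to the external tool that must be installed. The sketch below encodes that mapping so a caller can check up front which dependency a file needs. converterFor is a hypothetical helper, not a LiteParse API, and matching extensions case-insensitively is an assumption.

```typescript
type Converter = "none" | "libreoffice" | "imagemagick";

// Extensions copied from the supported-format lists in this section.
const OFFICE_EXTENSIONS = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];
const IMAGE_EXTENSIONS = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Hypothetical helper: returns the tool needed to turn the file into a
// PDF, "none" for PDFs, or undefined for extensions not listed above.
function converterFor(fileName: string): Converter | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = fileName.slice(dot).toLowerCase(); // case-insensitive: an assumption
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "libreoffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "imagemagick";
  return undefined;
}

console.log(converterFor("report.docx")); // libreoffice
console.log(converterFor("scan.png"));    // imagemagick
console.log(converterFor("paper.pdf"));   // none
```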
Environment Variables
| Variable | Description |
|---|---|
| TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet. |
| LITEPARSE_TMPDIR | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (os.tmpdir()). Useful in containerized or read-only filesystem environments. |
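The LITEPARSE_TMPDIR fallback described in the table can be pictured as a small lookup. The sketch below is illustrative only: resolveTmpDir is a hypothetical helper, it takes the environment and OS default as parameters so it stays self-contained, and treating an empty value as unset is an assumption.

```typescript
// Hypothetical helper mirroring the documented behavior: LITEPARSE_TMPDIR
// wins when set, otherwise the OS temp directory is used.
function resolveTmpDir(
  env: Record<string, string | undefined>,
  osTmpDir: string
): string {
  const override = env["LITEPARSE_TMPDIR"];
  // Treating an empty or whitespace-only value as "unset" is an assumption.
  if (override !== undefined && override.trim() !== "") {
    return override;
  }
  return osTmpDir;
}

// In Node.js this would typically be called as
// resolveTmpDir(process.env, os.tmpdir()).
console.log(resolveTmpDir({ LITEPARSE_TMPDIR: "/work/tmp" }, "/tmp")); // prints /work/tmp
console.log(resolveTmpDir({}, "/tmp")); // prints /tmp
```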
Configuration
You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.
Config File Example
Create a liteparse.config.json file:
{
"ocrLanguage": "en",
"ocrEnabled": true,
"maxPages": 1000,
"dpi": 150,
"outputFormat": "json",
"preciseBoundingBox": true,
"preserveVerySmallText": false
}

For HTTP OCR servers, just add ocrServerUrl:
{
"ocrServerUrl": "http://localhost:8828/ocr",
"ocrLanguage": "en",
"outputFormat": "json"
}

Use with:
lit parse document.pdf --config liteparse.config.json
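One way to picture how config file values and CLI flags combine is a layered merge. The sketch below uses the keys from the config examples and the defaults from the CLI help text. The precedence shown (CLI flags over config file over built-in defaults) is an assumption, and LiteParseConfig and resolveConfig are illustrative names rather than the library's exported API.

```typescript
// Keys taken from the config file examples; all optional in this sketch.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  preserveVerySmallText?: boolean;
}

// Defaults as stated or implied by the CLI help text.
const DEFAULTS: LiteParseConfig = {
  ocrLanguage: "en",
  ocrEnabled: true,
  maxPages: 10000,
  dpi: 150,
  outputFormat: "text",
  preciseBoundingBox: true,
  preserveVerySmallText: false,
};

// cliFlags should contain only the flags the user actually passed.
// Later spreads win, so the assumed order is defaults < file < CLI.
function resolveConfig(
  fileConfig: LiteParseConfig,
  cliFlags: LiteParseConfig
): LiteParseConfig {
  return { ...DEFAULTS, ...fileConfig, ...cliFlags };
}

const fromFile: LiteParseConfig = { outputFormat: "json", maxPages: 1000 };
const fromCli: LiteParseConfig = { dpi: 300 };
console.log(resolveConfig(fromFile, fromCli));
// outputFormat and maxPages come from the file, dpi from the CLI,
// and everything else from DEFAULTS.
```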
Development
We provide a fairly detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.
# Install dependencies
npm install

# Build TypeScript (Linux/macOS)
npm run build

# Build TypeScript (Windows)
npm run build:windows

# Watch mode
npm run dev

# Test parsing
npm test
License
Apache 2.0
Credits
Built on top of:
- PDF.js - PDF parsing engine
- Tesseract.js - In-process OCR engine
- EasyOCR - HTTP OCR server (optional)
- PaddleOCR - HTTP OCR server (optional)
- Sharp - Image processing