zpdf (alpha stage - early version)
A PDF text extraction library written in Zig.
Features
- Memory-mapped file reading, zero-copy where possible
- Streaming text extraction with efficient arena allocation
- Multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength
- Font encoding support: WinAnsi, MacRoman, ToUnicode CMap
- XRef table and stream parsing (PDF 1.5+)
- Configurable error handling (strict or permissive)
- Structure tree extraction for tagged PDFs (PDF/UA)
- Geometric (Y→X) reading order for non-tagged PDFs
- Markdown export for structured PDFs
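To make the filter names in the list above concrete, here is a minimal Python sketch of what four of them do to stream bytes, following the filter definitions in the PDF specification. It is an illustration only, not zpdf's implementation (which is written in Zig). The helper names `run_length_decode`, `ascii_hex_decode`, and `apply_filter` are hypothetical, LZW is left out for brevity, and predictor parameters are ignored.

```python
import base64
import zlib


def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data as defined in the PDF specification."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex pairs, whitespace ignored, '>' ends the data."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:  # an odd final digit is treated as if followed by 0
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def apply_filter(name: str, data: bytes) -> bytes:
    """Dispatch on the PDF filter name (hypothetical helper for this sketch)."""
    if name == "FlateDecode":
        return zlib.decompress(data)
    if name == "ASCII85Decode":
        return base64.a85decode(data, adobe=True)  # input must end with "~>"
    if name == "ASCIIHexDecode":
        return ascii_hex_decode(data)
    if name == "RunLengthDecode":
        return run_length_decode(data)
    raise ValueError(f"unsupported filter: {name}")
```

A stream can list several filters; in that case they are applied in the order given, each one consuming the previous one's output.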
Benchmark
Text extraction performance on Apple M4 Pro (reading order):
| Document | Pages | zpdf | MuPDF | Speedup |
|---|---|---|---|---|
| Intel SDM | 5,252 | 582ms | 2,152ms | 3.7x |
| Pandas Docs | 3,743 | 640ms | 1,130ms | 1.8x |
| C++ Standard | 2,134 | 438ms | 1,007ms | 2.3x |
| PDF Reference 1.7 | 1,310 | 236ms | 1,481ms | 6.3x |
Build with zig build -Doptimize=ReleaseFast for best performance.
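For readers who want to check the numbers, the Speedup column in the table above is the MuPDF time divided by the zpdf time. The short Python sketch below recomputes it and also derives pages per second; the function names are hypothetical and the timings are copied from the table, not re-measured.

```python
# (document, pages, zpdf_ms, mupdf_ms), copied from the benchmark table
ROWS = [
    ("Intel SDM", 5252, 582, 2152),
    ("Pandas Docs", 3743, 640, 1130),
    ("C++ Standard", 2134, 438, 1007),
    ("PDF Reference 1.7", 1310, 236, 1481),
]


def speedup(zpdf_ms: float, mupdf_ms: float) -> float:
    """How many times faster zpdf is than MuPDF on the same document."""
    return mupdf_ms / zpdf_ms


def pages_per_second(pages: int, elapsed_ms: float) -> float:
    """Throughput in pages per second for one run."""
    return pages / (elapsed_ms / 1000)


if __name__ == "__main__":
    for name, pages, zpdf_ms, mupdf_ms in ROWS:
        print(
            f"{name}: {speedup(zpdf_ms, mupdf_ms):.1f}x, "
            f"{pages_per_second(pages, zpdf_ms):,.0f} pages/s with zpdf"
        )
```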
Requirements
- Zig 0.15.2 or later
Building
    zig build         # Build library and CLI
    zig build test    # Run tests
Usage
Library
    const std = @import("std");
    const zpdf = @import("zpdf");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const doc = try zpdf.Document.open(allocator, "file.pdf");
        defer doc.close();

        var buf: [4096]u8 = undefined;
        var bw = std.fs.File.stdout().writer(&buf);
        const writer = &bw.interface;
        defer writer.flush() catch {};

        for (0..doc.pageCount()) |page_num| {
            try doc.extractText(page_num, writer);
        }
    }
CLI
    zpdf extract document.pdf               # Extract all pages (uses structure tree for reading order)
    zpdf extract -p 1-10 document.pdf       # Extract pages 1-10
    zpdf extract -o out.txt document.pdf    # Output to file
    zpdf info document.pdf                  # Show document info
    zpdf bench document.pdf                 # Run benchmark
Python
    import zpdf

    with zpdf.Document("file.pdf") as doc:
        print(doc.page_count)

        # Single page
        text = doc.extract_page(0)

        # All pages (accuracy mode is default)
        all_text = doc.extract_all()

        # Fast mode (higher throughput, stream-order extraction)
        fast_text = doc.extract_all(mode="fast")

        # Page info
        info = doc.get_page_info(0)
        print(f"{info.width}x{info.height}")

    # Zero-copy memory open (unsafe semantics for other language bindings)
    with zpdf.Document.open_memory_unsafe(open("file.pdf", "rb").read()) as doc:
        print(doc.page_count)
Build the shared library first:
    zig build -Doptimize=ReleaseFast
    PYTHONPATH=python python3 examples/basic.py
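The Python example above extracts either one page or all pages. As a hedged sketch, a page range similar to the CLI's `-p 1-10` could be built on top of `page_count` and `extract_page`. The helpers `parse_page_range` and `extract_pages` are hypothetical and not part of the zpdf bindings. The sketch assumes that range specs are 1-based and inclusive, that `extract_page` takes a zero-based index (as `extract_page(0)` suggests), and that it returns a string.

```python
def parse_page_range(spec: str, page_count: int) -> range:
    """Turn a 1-based inclusive spec such as "1-10" or "7" into zero-based indices."""
    first_text, _, last_text = spec.partition("-")
    first = int(first_text)
    last = int(last_text) if last_text else first
    if not 1 <= first <= last:
        raise ValueError(f"invalid page range: {spec!r}")
    # Clamp the upper bound so a range past the last page is simply truncated.
    return range(first - 1, min(last, page_count))


def extract_pages(doc, spec: str) -> str:
    """Join the text of every page in the range.

    `doc` is any object exposing `page_count` and `extract_page(index)`,
    which is how the zpdf Document is used in the example above.
    """
    indices = parse_page_range(spec, doc.page_count)
    return "\n".join(doc.extract_page(i) for i in indices)
```

With an open document this would be called as `extract_pages(doc, "1-10")`.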
Project Structure
src/
├── root.zig # Document API and core types
├── main.zig # CLI entry point
├── capi.zig # C ABI exports for FFI
├── wapi.zig # WASM API exports
├── parser.zig # PDF object parser
├── xref.zig # XRef table/stream parsing
├── pagetree.zig # Page tree resolution
├── decompress.zig # Stream decompression filters
├── encoding.zig # Font encoding and CMap parsing
├── agl.zig # Adobe Glyph List mappings
├── cff.zig # CFF/Type1 font parsing
├── interpreter.zig # Content stream interpreter
├── structtree.zig # Structure tree parser (PDF/UA)
├── layout.zig # Text layout and bounding boxes
├── markdown.zig # Markdown export
└── simd.zig # SIMD-accelerated parsing
python/zpdf/ # Python bindings (cffi)
examples/ # Usage examples
Reading Order
zpdf extracts text in logical reading order using a three-tier approach:
- Structure Tree (preferred): Uses the PDF's semantic structure for tagged/accessible PDFs (PDF/UA). Correctly handles multi-column layouts, sidebars, tables, and captions.
- Geometric Sort (fallback): When no structure tree exists, sorts text spans by Y→X position to approximate visual reading order.
- Stream Order (last resort): When bounding box extraction fails, falls back to raw PDF content stream order.
| Method | Pros | Cons |
|---|---|---|
| Structure tree | Correct semantic order, handles complex layouts | Only works on tagged PDFs |
| Geometric sort | Works on any PDF, respects visual layout | May fail on complex multi-column layouts |
| Stream order | Always works | May not match visual order |
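The three tiers above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not zpdf's actual code: the `Span` type, the `line_tolerance` value, and the function names are hypothetical, and a span with a missing position stands in for "bounding box extraction failed".

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Span:
    text: str
    x: Optional[float] = None  # left edge; None if no bounding box was recovered
    y: Optional[float] = None  # baseline; PDF y grows upward, so larger is higher


def geometric_order(spans: Sequence[Span], line_tolerance: float = 2.0) -> List[Span]:
    """Y→X sort: group spans into lines top to bottom, then order each line by x."""
    lines: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: -s.y):
        # A span joins the current line if its baseline is close to the line's first span.
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    return [s for line in lines for s in sorted(line, key=lambda s: s.x)]


def reading_order(spans: Sequence[Span],
                  struct_order: Optional[Sequence[int]] = None) -> List[Span]:
    """Pick the best available tier for a page's spans (given in stream order)."""
    if struct_order is not None:  # tier 1: order dictated by the structure tree
        return [spans[i] for i in struct_order]
    if all(s.x is not None and s.y is not None for s in spans):
        return geometric_order(spans)  # tier 2: geometric sort
    return list(spans)  # tier 3: raw content stream order
```

Because the geometric tier merges everything on the same baseline into one line, two side-by-side columns would be interleaved, which is the multi-column weakness noted in the table.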
Comparison
| Feature | zpdf | pdfium | MuPDF |
|---|---|---|---|
| Text Extraction | | | |
| Stream order | Yes | Yes | Yes |
| Tagged/structure tree | Yes | No | Yes |
| Visual reading order | No | No | Yes |
| Word bounding boxes | Yes | Yes | Yes |
| Font Support | | | |
| WinAnsi/MacRoman | Yes | Yes | Yes |
| ToUnicode CMap | Yes | Yes | Yes |
| CID fonts (Type0) | Partial* | Yes | Yes |
| Compression | | | |
| FlateDecode, LZW, ASCII85/Hex | Yes | Yes | Yes |
| JBIG2, JPEG2000 | No | Yes | Yes |
| Other | | | |
| Encrypted PDFs | No | Yes | Yes |
| Rendering | No | Yes | Yes |
*CID fonts: Works when CMap is embedded directly.
Use zpdf when: Batch processing, tagged PDFs (PDF/UA), simple text extraction, Zig integration.
Use pdfium when: Browser integration, full PDF support, proven stability.
Use MuPDF when: Complex visual layouts, rendering needed.
License
CC0 - Public Domain