site2pdf
Generate a single PDF containing all pages of a website. Ideal for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.
Features
- Portability - Combine multiple pages into a single shareable PDF
- AI Integration - Works with Google NotebookLM, ChatGPT GPTs, and other AI tools
- Visual Preservation - Maintains images and formatting for multimodal models
- Concurrent Processing - Processes multiple pages in parallel for faster generation
Quick Start
npx site2pdf-cli https://example.com
Output is saved to ./out/<domain>.pdf.
Installation (from source)
To install the tool globally on your machine from source, run:
git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install
npm run build
npm link
After installation, you can run the tool directly using the site2pdf command from anywhere:
site2pdf <main_url> [url_pattern]
Prerequisites
- Node.js (v18 or later recommended)
Linux Dependencies
Puppeteer requires these system libraries:
sudo apt-get update
sudo apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 \
  libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 \
  libgbm1 libasound2
Note: On newer Ubuntu versions (24.04+), use libasound2t64 instead of libasound2.
Usage
npx site2pdf-cli <main_url> [url_pattern]
| Argument | Description |
|---|---|
| `<main_url>` | The starting URL to crawl and convert |
| `[url_pattern]` | Optional regex to filter which links to include (defaults to same domain) |
URL Pattern Formats
- Plain string: 'https://example.com/docs' - matches URLs containing this string
- Regex literal: '/https:\/\/example\.com\/docs/i' - full regex with flags
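
The two pattern formats above can be told apart mechanically. The following is a minimal sketch of one way to do it, not the tool's actual implementation: `parseUrlPattern` and `escapeRegExp` are hypothetical names, and treating any value of the form `/.../flags` as a regex literal is an assumption.

```typescript
// Hypothetical helpers for illustration; not taken from the site2pdf source.

// Escape regex metacharacters so a plain string is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn a CLI pattern argument into a RegExp.
function parseUrlPattern(pattern: string): RegExp {
  // Assumption: "/body/flags" is a regex literal; anything else is a plain string.
  const literal = /^\/(.+)\/([dgimsuvy]*)$/.exec(pattern);
  if (literal) {
    const [, body = "", flags = ""] = literal;
    return new RegExp(body, flags);
  }
  // Plain string: an unanchored regex of the escaped text behaves like "contains".
  return new RegExp(escapeRegExp(pattern));
}

const docsOnly = parseUrlPattern("https://example.com/docs");
console.log(docsOnly.test("https://example.com/docs/intro")); // true
console.log(docsOnly.test("https://example.com/blog")); // false
```

Under this rule a plain string that both starts and ends with a slash, such as '/docs/', would be read as a regex literal, so a real implementation may need a stricter check.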
Examples
Basic usage (captures all same-domain links):
npx site2pdf-cli https://docs.example.com
Filter to specific section:
npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"
Environment Variables
| Variable | Description |
|---|---|
| `CHROME_PATH` | Path to a custom Chrome/Chromium executable |
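
As a rough illustration of how such a variable could reach the browser, the sketch below builds launch options from an environment object. `launchOptionsFromEnv` and `LaunchOptionsSketch` are hypothetical names. `executablePath` and `headless` are real Puppeteer launch options, but whether site2pdf wires them up exactly this way is an assumption.

```typescript
// Hypothetical helper for illustration; not taken from the site2pdf source.
interface LaunchOptionsSketch {
  headless: boolean;
  executablePath?: string;
}

function launchOptionsFromEnv(
  env: Record<string, string | undefined>,
): LaunchOptionsSketch {
  const options: LaunchOptionsSketch = { headless: true };
  // Only override the bundled browser when CHROME_PATH is set and non-empty.
  const chromePath = env["CHROME_PATH"]?.trim();
  if (chromePath) {
    options.executablePath = chromePath;
  }
  return options;
}

console.log(launchOptionsFromEnv({ CHROME_PATH: "/usr/bin/chromium" }));
```

In a Node.js program one would typically pass `process.env` and hand the result to `puppeteer.launch`. The path `/usr/bin/chromium` is only an example value.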
Troubleshooting
Windows: Sandbox Errors
Grant permissions to the Puppeteer cache:
icacls %USERPROFILE%/.cache/puppeteer/chrome /grant *S-1-15-2-1:(OI)(CI)(RX)
See Puppeteer Windows troubleshooting.
ARM64 Linux: Not Supported
Chrome does not provide ARM64 binaries for Linux. You'll see errors like:
- "Failed to launch the browser process!"
- "chrome-linux64/chrome: 1: Syntax error: "(" unexpected"
See Chrome for Testing ARM64 Support Issue.
How It Works
- Launches headless Chrome via Puppeteer
- Navigates to the main URL and extracts all matching links
- Generates a PDF for each page concurrently
- Merges all PDFs into a single document using pdf-lib
- Saves to ./out/<slugified-url>.pdf
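
The link-extraction step above can be sketched as a pure function over the hrefs found on the main page. This is an illustration under stated assumptions, not the tool's code: `selectLinks` is a hypothetical name, "same domain" is interpreted as "same origin", and fragments are dropped so that in-page anchors do not produce duplicate pages.

```typescript
// Hypothetical helper illustrating the link-selection step; not the tool's code.
function selectLinks(
  hrefs: string[],
  mainUrl: string,
  pattern?: RegExp,
): string[] {
  const origin = new URL(mainUrl).origin;
  const seen = new Set<string>();
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, mainUrl); // resolve relative links against the main URL
    } catch {
      continue; // skip malformed hrefs
    }
    url.hash = ""; // "#section" variants point at the same page
    const normalized = url.href;
    if (pattern) pattern.lastIndex = 0; // avoid stateful matching with g/y flags
    // Assumption: without a pattern, keep only links on the same origin.
    const matches = pattern ? pattern.test(normalized) : url.origin === origin;
    if (matches) seen.add(normalized);
  }
  return Array.from(seen); // Set preserves first-seen order
}

console.log(
  selectLinks(
    ["/docs/a", "/docs/a#top", "https://other.example.org/x", "guide"],
    "https://example.com/docs/",
  ),
);
```

Each URL in the returned list would then be rendered to a PDF and the results merged in order.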
Development
git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install

| Command | Description |
|---|---|
| `npm run dev -- <main_url> [url_pattern]` | Run in development mode with watch |
| `npm run build` | Compile TypeScript |
| `npm test` | Run tests |
| `npx biome lint` | Check for lint issues |
| `npx biome format` | Format code |
Contributing
Issues and pull requests are welcome. Please follow the existing code style and include tests for new features.
License
MIT