Show HN: Site2pdf

github.com

5 points by laiso a year ago · 8 comments · 1 min read

Hello everyone, I created a tool called "site2pdf" over the weekend. This tool converts the main page and sub-pages of a website that match a specified URL pattern into a PDF file. It is particularly suitable for AI-based RAG (Retrieval-Augmented Generation) and QA (Question Answering) tasks.

GitHub: https://github.com/laiso/site2pdf/

# Features

- Generate PDFs of main and sub-pages

- Based on Node.js and Puppeteer

- Easy-to-use CLI tool

I want to make this software available online for my friends, but I'm struggling with the best architecture to use. I want to meet the following requirements:

- Cost-effective

- Use Cloudflare Workers' Browser Rendering API (managed Puppeteer)

- Save to Workers Queue -> R2 bucket

I have already created a prototype, but it encounters ExceededCpu errors when running the consumer for a long time. It seems I need to implement a distributed architecture including merging, which seems challenging. I would appreciate any advice you can give. Thank you!

I look forward to your feedback!

miles a year ago

Thank you for crafting and sharing this.

You mention that the generated PDFs are "particularly suitable for AI-based RAG and QA tasks" - can you please share your preferred method/tools for that?

bosch_mind a year ago

Can you elaborate on the execution failures?

If you're referring to single pages taking too long and wanting to chunk the processing of a single page, consider Cloudflare Durable Objects (DO) as a coordination point for a URL.

You can parallelize chunks and track completion in the DO. When the last chunk is complete, you can merge. Unlike KV, a DO is strongly consistent.
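To make the "track completion, then merge" idea concrete, here is a minimal sketch of the bookkeeping such a coordinator might do. It is plain in-memory TypeScript, not Durable Objects code: `ChunkTracker` and `markDone` are hypothetical names, and a real DO would presumably persist the completed set in its own storage rather than in a field.

```typescript
// Hypothetical helper illustrating the coordination logic only.
// Not part of site2pdf and not a Cloudflare API.
class ChunkTracker {
  private readonly totalChunks: number;
  private readonly done: Set<number> = new Set<number>();

  constructor(totalChunks: number) {
    if (!Number.isInteger(totalChunks) || totalChunks <= 0) {
      throw new RangeError("totalChunks must be a positive integer");
    }
    this.totalChunks = totalChunks;
  }

  // Records that a chunk finished. Returns true only for the call that
  // completes the set, so the merge step is triggered exactly once even
  // if a queue redelivers a message and a chunk reports twice.
  markDone(index: number): boolean {
    if (!Number.isInteger(index) || index < 0 || index >= this.totalChunks) {
      throw new RangeError("chunk index out of range");
    }
    const isNew = !this.done.has(index);
    this.done.add(index);
    return isNew && this.done.size === this.totalChunks;
  }

  // Number of chunks still outstanding.
  remaining(): number {
    return this.totalChunks - this.done.size;
  }
}
```

Each consumer would call `markDone(i)` after writing its partial PDF, and whichever call returns `true` would kick off the merge. The duplicate check matters because queue consumers can generally see the same message more than once.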

unstatusthequo a year ago

Is there a way to simply feed it a list of known URLs directly without it trying to figure out the sub-pages? Complex and large site discovery can be a pain.

  • laiso (OP) a year ago

    Currently, site2pdf does not support feeding it a list of known URLs directly. However, I believe detecting sub-pages from a sitemap.xml file could work well. Thank you for the question!
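The sitemap idea could be sketched roughly as follows. This is an illustration only, not something site2pdf does today: `extractSitemapUrls` is a hypothetical helper, the prefix filter stands in for whatever URL pattern the tool would use, and the regex is a shortcut that ignores sitemap index files and CDATA sections, so a real XML parser would be safer.

```typescript
// Hypothetical helper: pulls <loc> entries out of sitemap XML text and keeps
// those starting with a given prefix. Regex-based, so it is a sketch rather
// than a full XML parser.
function extractSitemapUrls(xml: string, prefix: string): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null = locPattern.exec(xml);
  while (match !== null) {
    // Sitemaps escape "&" as "&amp;" inside URLs; undo that one entity.
    const url = match[1].replace(/&amp;/g, "&");
    if (url.startsWith(prefix)) {
      urls.push(url);
    }
    match = locPattern.exec(xml);
  }
  return urls;
}
```

The resulting array is also exactly the "list of known URLs" asked about above, so the same entry point could accept either a sitemap or a hand-written list.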