Show HN: Site2pdf

5 points by laiso 2 years ago · 8 comments · 1 min read


Hello everyone, I created a tool called "site2pdf" over the weekend. This tool converts the main page and sub-pages of a website that match a specified URL pattern into a PDF file. It is particularly suitable for AI-based RAG (Retrieval-Augmented Generation) and QA (Question Answering) tasks.

GitHub: https://github.com/laiso/site2pdf/

# Features

- Generate PDFs of main and sub-pages

- Based on Node.js and Puppeteer

- Easy-to-use CLI tool

I want to make this software available online for my friends, but I'm struggling with the best architecture to use. I want to meet the following requirements:

- Cost-effective

- Use Cloudflare Workers' Browser Rendering API (managed Puppeteer)

- Save to Workers Queue -> R2 bucket

I have already created a prototype, but it hits ExceededCpu errors when the consumer runs for a long time. It looks like I need a distributed architecture, including a step to merge the results, which seems challenging. I would appreciate any advice you can give. Thank you!

I look forward to your feedback!

miles 2 years ago

Thank you for crafting and sharing this.

You mention that the generated PDFs are "particularly suitable for AI-based RAG and QA tasks" - can you please share your preferred method/tools for that?

laiso (OP) 2 years ago

I particularly like using Google NotebookLM, as it allows me to consolidate books and documents in one place for knowledge search.
https://github.com/user-attachments/assets/3c68298e-265f-410...
Additionally, I have created GPTs that can have conversations about Tauri using PDFs as the source.
https://chatgpt.com/g/g-Pa0nP2mJX-tauri-v2-helper
- miles 2 years ago
  
  Thanks for taking the time to respond. I was thinking of something local, especially in light of:
  Google's Gemini AI caught scanning Google Drive PDF files without permission https://news.ycombinator.com/item?id=40965892 .
  Looks like GPT4All[1] and AnythingLLM[2] are worth exploring. There's also the closed-source macOS app RecurseChat[3,4] which appeared on HN a few months ago[5].
  [1] https://github.com/nomic-ai/gpt4all
  [2] https://github.com/Mintplex-Labs/anything-llm
  [3] https://recurse.chat
  [4] https://recurse.chat/blog/posts/local-docs
  [5] https://news.ycombinator.com/item?id=39532367
  - laiso (OP) 2 years ago
    
    Exploring local solutions like GPT4All and AnythingLLM sounds promising. I'll also look into RecurseChat on macOS. Thanks again for the suggestions and for sharing the insights! Another tool that might be worth considering is Dify.
    https://docs.dify.ai/guides/knowledge-base/create-knowledge-...

bosch_mind 2 years ago

Can you elaborate on the execution failures?

If you’re referring to single pages taking too long and wanting to chunk the processing of a single page, consider Cloudflare Durable Objects as a coordination point for a URL.

You can parallelize chunks and track completion in DO. When the last chunk is complete, you can merge. Unlike KV, DO is strongly consistent.
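
To make the "track completion, merge on the last chunk" idea concrete, here is a minimal sketch of the bookkeeping only. It is an in-memory model, not Durable Objects code: `ChunkTracker` and `markDone` are hypothetical names, and a real Durable Object would keep this state in its own storage rather than in a `Set`.

```typescript
// Hypothetical in-memory model of the coordination state for one URL.
// Assumption: a real Durable Object would persist this state instead.
class ChunkTracker {
  private readonly totalChunks: number;
  private readonly done = new Set<number>();

  constructor(totalChunks: number) {
    if (!Number.isInteger(totalChunks) || totalChunks <= 0) {
      throw new RangeError("totalChunks must be a positive integer");
    }
    this.totalChunks = totalChunks;
  }

  // Records one finished chunk. Returns true only for the call that
  // completes the final outstanding chunk, so the merge is triggered once.
  markDone(chunkIndex: number): boolean {
    if (!Number.isInteger(chunkIndex) || chunkIndex < 0 || chunkIndex >= this.totalChunks) {
      throw new RangeError("chunkIndex out of range");
    }
    if (this.done.has(chunkIndex)) {
      return false; // duplicate report (for example a retried message): ignore
    }
    this.done.add(chunkIndex);
    return this.done.size === this.totalChunks;
  }
}
```

The caller that receives `true` from `markDone` would be the one to start the merge. Ignoring duplicate reports matters because a retried chunk should not trigger a second merge.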

laiso (OP) 2 years ago

I have conducted further investigations following this post.
https://github.com/laiso/site2pdf/issues/6
The merge process using only KV has been successful. The real issue lies in the concurrency limits of the Browser Rendering API.

unstatusthequo 2 years ago

Is there a way to simply feed it a list of known URLs directly without it trying to figure out the sub-pages? Complex and large site discovery can be a pain.

laiso (OP) 2 years ago

Currently, site2pdf does not support feeding it a list of known URLs directly. However, I believe detecting sub-pages from a sitemap.xml file could work well. Thank you for the question!
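
As a rough illustration of the sitemap idea, the sketch below pulls `<loc>` entries out of a sitemap.xml string and keeps those matching a URL pattern. This is not part of site2pdf: `extractSitemapUrls` is a hypothetical helper, and it uses a simple regular expression rather than a real XML parser, so it assumes a plain sitemap without CDATA sections or nested sitemap indexes.

```typescript
// Hypothetical helper, not part of site2pdf.
// Assumption: the sitemap is a flat <urlset> whose URLs sit in <loc> tags.
function extractSitemapUrls(xml: string, pattern: RegExp): string[] {
  const urls: string[] = [];
  const locTag = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = locTag.exec(xml)) !== null) {
    const raw = m[1];
    if (!raw) continue;
    // Sitemaps escape "&" as "&amp;"; undo that for the one common entity.
    const url = raw.replace(/&amp;/g, "&");
    // String.prototype.search ignores the pattern's "g" flag and lastIndex,
    // so repeated checks with the same RegExp stay stateless.
    if (url.search(pattern) !== -1) {
      urls.push(url);
    }
  }
  return urls;
}

const sampleSitemap = `<urlset>
  <url><loc>https://example.com/docs/a</loc></url>
  <url><loc>https://example.com/blog/b</loc></url>
  <url><loc> https://example.com/docs/c?x=1&amp;y=2 </loc></url>
</urlset>`;

const docsUrls = extractSitemapUrls(sampleSitemap, /\/docs\//);
console.log(docsUrls);
```

The resulting list could then be handed to whatever renders each page, which sidesteps link discovery on large sites.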
