Show HN: POC to scrape and structure HTML into JSON for RAG

structured.pages.dev

9 points by nirvanist 10 months ago · 6 comments · 1 min read

Hey all,

I built a quick PoC that scrapes a webpage, sends the content to Gemini Flash, and outputs clean, structured JSON that is ready for RAG workflows.

In my case, I’ll use this structured data to enhance models by integrating external knowledge sources during the generation process.

Curious if you think this has potential or if there are any use cases I might have missed. Happy to share more details if there's interest!

mahi_novice 10 months ago

Do you mind sharing more about the implementation details? Do you have any safeguards for the URLs it accepts?

  • nirvanistOP 10 months ago

    Basically, I use a headless Chromium with Puppeteer to render the page. Then, some logic extracts and cleans the HTML content. Finally, I use Gemini with a specific schema to return a JSON response.
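For anyone who wants to picture the last two steps, here is a rough sketch, not the actual code behind the PoC. It assumes the rendered HTML is already available as a string (the Puppeteer rendering step is not shown), uses Python's standard `html.parser` as a stand-in for the cleaning logic, and only builds a Gemini-style request body rather than sending it. `SKIPPED_TAGS`, `PAGE_SCHEMA`, `clean_html`, and `build_request` are hypothetical names, the schema fields are made up for illustration, and the `generationConfig` field names reflect my understanding of the public Gemini REST API, so check them against the current docs.

```python
import json
from html.parser import HTMLParser

# Assumption: "cleaning" means dropping non-content tags and keeping visible text.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "footer", "svg"}

# Hypothetical schema; the real PoC's schema is not described in the thread.
PAGE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "title": {"type": "STRING"},
        "summary": {"type": "STRING"},
        "sections": {
            "type": "ARRAY",
            "items": {
                "type": "OBJECT",
                "properties": {
                    "heading": {"type": "STRING"},
                    "text": {"type": "STRING"},
                },
            },
        },
    },
    "required": ["title", "sections"],
}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_html(html):
    """Return the page's visible text, one line per text node."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_request(page_text):
    """Build a request body that asks for JSON matching PAGE_SCHEMA."""
    prompt = "Extract the page content into the given JSON schema.\n\n" + page_text
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": PAGE_SCHEMA,
        },
    }


if __name__ == "__main__":
    sample = "<body><nav>Menu</nav><h1>Hello</h1><p>World</p><script>x()</script></body>"
    print(json.dumps(build_request(clean_html(sample)), indent=2))
```

Stripping markup before the model call keeps the prompt small, and constraining the output with a schema means the response can be parsed with `json.loads` instead of being scraped out of free text.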

mahi_novice 10 months ago

How do you plan to use it?
