Ask HN: How to remove Ads from a downloaded HTML file to output an ad free file?

1 point by suramya_tomar a year ago · 5 comments · 1 min read


Is there a tool/script that will let me filter out ads from a page when downloading it with curl, similar to how uBlock Origin works in a browser?

Basically, I am downloading a snapshot of a site using curl, but the sites contain advertisements that I want to filter out. Is there a tool that lets me do that from the command line, so that the output file doesn't have ads in it?

In short, I want something like uBlock Origin, but for HTML files that I will be converting to PDFs or EPUBs. Something like:

curl https://www.google.com | AdRemover.sh | htmltopdf

Most of the solutions I found require you to update the /etc/hosts file to block the ads, but I would rather avoid that if possible.

suramya_tomarOP a year ago

After taking a break and stepping away for a bit, I realized that I was recreating an archiving system for websites and that there are existing solutions that do the same thing.

I found https://github.com/ArchiveBox/ArchiveBox/ which is a self-hosted web archiving system. It covers most of my use cases (and I can extend it for additional functionality), so I am going to set it up and try it out.

Thanks all for the help.

solardev a year ago

Do you have to use curl? It wouldn't render a lot of sites correctly anyway (anything that uses JS for rendering).

Can you run a puppeteer/playwright instance (which control real browsers) and add an ad blocker to that? e.g. https://github.com/ghostery/adblocker or https://github.com/microsoft/playwright-python/issues/782

  • suramya_tomarOP a year ago

    I don't have to use curl, but in the past, when I have set up something that opens browser instances, it has usually been a bit unstable, in the sense that it would crash intermittently.

    I wanted something that I could kick-off as a daily cron job.

    • solardev a year ago

      Hmm, can't you check for correctness somehow and retry on failure? Those headless browsers are often run in automated environments.
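The check-and-retry idea above could be sketched roughly as follows. This is a minimal illustration, not anything posted in the thread: `run_with_retry`, `looks_complete`, and their parameters are hypothetical names, and `task` stands in for whatever function drives the headless browser and returns the page HTML.

```python
import time


def looks_complete(html):
    # Hypothetical correctness check: treat a page as valid only if the
    # closing </html> tag made it into the output.
    return "</html>" in html.lower()


def run_with_retry(task, is_valid=looks_complete, attempts=3, delay_seconds=5.0):
    """Call task() until its result passes is_valid, or give up.

    task is any zero-argument callable, e.g. a function that launches a
    headless browser and returns the rendered HTML as a string.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            if is_valid(result):
                return result
            last_error = ValueError(f"attempt {attempt}: output failed validation")
        except Exception as exc:  # a browser crash surfaces as an exception
            last_error = exc
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Because a final failure is raised as an exception, a script built on this exits non-zero when run uncaught, so a daily cron job would report the failure instead of silently writing a broken file.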

inhumantsar a year ago

Editing /etc/hosts is going to be the easiest option.

The best option would be to use a programming language and a good HTML parser to do the job, e.g., use Python and BeautifulSoup to dig through the tree looking for any HTML tag that references an ad-serving network.
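A rough sketch of that parser-based approach is below. To keep it runnable without third-party installs it swaps in Python's standard-library `html.parser` in place of BeautifulSoup, so it is an approximation of the suggestion rather than the exact tooling named. The `AD_HOSTS` list, `AdStripper`, and `strip_ads` are hypothetical names for illustration; a real blocklist would come from a maintained filter list. It only drops elements whose `src` or `href` points at a listed host, so inline ads and cosmetic filtering are out of scope, and badly malformed markup may confuse it.

```python
import sys
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist for illustration only.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com", "adservice.google.com"}

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_ad_url(url):
    """True if the URL's host is a listed ad host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


class AdStripper(HTMLParser):
    """Re-emits the input HTML, dropping elements that load from ad hosts."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_tag = None   # name of the element currently being dropped
        self.skip_depth = 0    # nesting depth of same-named tags inside it

    def _refs_ad(self, attrs):
        return any(k in ("src", "href") and v and is_ad_url(v) for k, v in attrs)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if self._refs_ad(attrs):
            if tag not in VOID_TAGS:  # void tags have no body to skip
                self.skip_tag, self.skip_depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_tag and not self._refs_ad(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_tag:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_ads(html):
    parser = AdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # Filter mode: read HTML on stdin, write the cleaned HTML to stdout.
    sys.stdout.write(strip_ads(sys.stdin.read()))
```

Saved under a hypothetical name such as strip_ads.py, this would slot into the pipeline from the question as `curl -s <url> | python3 strip_ads.py | <converter>`. With BeautifulSoup, the same idea would amount to finding the matching tags and calling `decompose()` on each.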
