Generate RSS feed for any website using CSS selectors
rss-bridge.org
CSS selectors were more useful before the Tailwind fad of dropping meaningful class names in favor of recreating inline styles, but with abbreviations to memorize. I use μBlock Origin + userStyles a lot, both of which also use CSS selectors, & the last couple of years everything has become a lot harder for the end user to tweak/fix. If you’re lucky now, you’ll have some ARIA attributes to select on.
And it also became harder due to people thinking random ids and class names are totally fine. Super annoyed by that. It feels like they are forcing their vision onto the user, while the user does not want their vision and could not care less.
The web was nicer when you could inspect, learn, & riff off of what others were doing in the industry, like the old music industry used to do when covering & borrowing a phrase was considered homage, not grounds for a lawsuit. It’s now all meant to be closed off & behind build tools that complect the output, where most folks don’t even know how their pipeline works; and this is strange, since the simple tools of HTML, CSS, & JS can construct the web without any build steps at all if you wanted.
Agree completely. As a minor nicety, though, I’ve noticed most sites publish sourcemaps in production now. So, in a few ways it’s become more possible to inspect and study JS, compared to when sourcemaps weren’t there, and you could only see mangled source.
I am not so sure about sourcemaps being an adequate replacement. They are just one feature flag toggle away from disappearing at any given moment. All it takes is one over-zealous tech decision maker to make them disappear on a website. And I know the types that would rather turn them off to shave a few kB off the delivery, instead of rethinking their choice of framework. Or someone suddenly thinking that they need to "protect" (obfuscate) their frontend source code. Source maps are too easy to switch off.
RSSHub[0] is in the same ballpark, but consists of a large library of site-specific code[1][2].
[0]https://github.com/DIYgod/RSSHub/
RSS Bridge also has a large library of site-specific code; CSS is just one of the hundreds of solutions they offer. And there are some other projects collecting and maintaining recipes for scraping data from sites. Calibre for example, and youtube-dl/yt-dlp for videos. Seeing so many projects all doing the same thing, I kinda feel sad that they are not cooperating to maintain a central recipe collection.
It ded.
Archive: https://web.archive.org/web/20230714202418/https://rss-bridg...
Sample feed: https://web.archive.org/web/20230308160413/https://rss-bridg...
List of public instances: https://rss-bridge.github.io/rss-bridge/General/Public_Hosts...
edit: but the few I tried did not have the CSS Selector Bridge enabled so go with the original link or archive of it.
I was always afraid to use one of these. I thought that the css selectors would be too brittle and ultimately break.
I have built my own solution that is automagical at https://awesomegoat.com/ but I am running into the next set of issues, which are various scraping protections. It seems that a reasonable RSS gateway today needs to include a botnet of residential proxies just to read content on the internet.
This is a great tool! Before I learned about nitter, this was my primary way to follow people on Twitter. I love the idea of trying to wrestle unsupported feeds (Twitter, Instagram, etc.) into a standard/open format.
The lack of feed generation is why so many of the latest blog platforms are non-starters in my book. It boggles my mind. Honestly, if you don't generate a feed of some sort, I really can't take you seriously.
I run my own instance of RSS Bridge to keep track of authors that I like on Goodreads.
It works pretty well, although every once in a while Goodreads hiccups, and then RSS bridge gives me a bunch of "new posts" that are actually error messages.
Hey, I wrote the Goodreads bridge for exactly this use case. I’ll try to see if I can filter out the error messages.
Thanks! I've been meaning to play with the code and see if I could figure out how to add a few more features:
* Generate RSS feeds from book series
* Filter out translations
* Filter out compilations (not sure if this one is really plausible)
Any pointers on how I might accomplish some of those?
Huginn is another useful tool that allows you to wrangle CSS selectors and XPath nodes to create RSS feeds.
I use it quite successfully to get data out of undocumented APIs and out into RSS.
This honestly is standard web scraping but these projects always catch my attention.
You're at the mercy of rate-limiting firewalls (so you'll have to rotate proxies if you intend on using this heavily), on top of the standard CloudFront bot detection recaptcha and div-obfuscation (a good example of this is Facebook).
rss-Bridge has decent caching support, customisable on a bridge level, so that comes pre-tuned and works well at low volumes for personal use.
At large scale, like the kind of traffic I started seeing when I ran a public rss-bridge Instagram/Telegram bridge - rate limits are unavoidable.
That's been my experience too. Some of the bridges take into account the rate limits imposed by the platforms, and the steps required to get content without an API key.
So using RSS Bridge to generate feeds from large platforms is often a lot more reliable than the typical scraping script I'd code up myself for other sites.
These days I just let ChatGPT generate a script that scrapes a site and spits out an RSS file. Then I run it with cron.
I’m guessing they paste a portion of the website’s source then tell ChatGPT to generate a script that can generate an RSS feed from that site.
Yeah I just copy the html that's relevant. There's some manual work involved but it doesn't take a lot of time.
Are you not limited by the cut-off date of the content the model is trained on?
1. the script is generated by the llm
2. the user runs the script that does the scraping
these are temporally separate actions
Fine, but it’s subject to HTML selector brittleness, no? Oh, you submit the raw HTML when you need it, maybe?
Here's how I do it.
1. Tell ChatGPT to create a python script that scrapes example.com and generates an RSS file.
2. Paste a snippet of the html and tell it to modify the script to use that.
3. I do some minor tweaks myself to fix the date format.
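For a sense of what this workflow produces, here is a minimal sketch of such a script, using only the standard library. The site structure it targets (article links inside `<h2 class="title">`) is entirely hypothetical; a real script would be tailored to whatever HTML snippet you pasted into the chat.

```python
# Sketch: extract (link, title) pairs from a hypothetical page
# structure and emit a minimal RSS 2.0 document.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ArticleExtractor(HTMLParser):
    """Collects items from <h2 class="title"><a href=...>Title</a></h2>."""
    def __init__(self):
        super().__init__()
        self.in_title_h2 = False
        self.in_link = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "title" in attrs.get("class", "").split():
            self.in_title_h2 = True
        elif tag == "a" and self.in_title_h2:
            self.in_link = True
            self.items.append({"link": attrs.get("href", ""), "title": ""})

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title_h2 = False
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and self.items:
            self.items[-1]["title"] += data

def to_rss(items, site_title, site_url):
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    for item in items:
        el = ET.SubElement(channel, "item")
        ET.SubElement(el, "title").text = item["title"].strip()
        ET.SubElement(el, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

snippet = '<h2 class="title"><a href="https://example.com/post-1">First post</a></h2>'
parser = ArticleExtractor()
parser.feed(snippet)
print(to_rss(parser.items, "Example Blog", "https://example.com"))
```

In practice you'd fetch the page with `urllib.request` (or `requests`) before parsing, and write the output to a file your feed reader polls.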
Other services like this: https://www.fivefilters.org/feed-creator/
I created Feed Creator, so nice to see it mentioned in the comments :)
I've written two blog posts about how we go about using CSS selectors when working with Feed Creator. Might be useful for those looking to do the same with RSS-Bridge.
How to turn a webpage into an RSS feed using Feed Creator
Part 1: https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
Part 2 (using more advanced selectors): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
What's the easiest way to also run a few basic filters on the site/RSS feed's content to make it truly shine vs simplistic scraping, like
- splitting the full feed by theme of the article into separate feeds and at the same time
- remove a few keywords and also
- get article length and split into a long / short feed
- Or maybe get what you used to have on some news sites - subscribe only to a specific author instead of getting bombarded with hundreds of items in a feed
Write a parser for rss-bridge that takes an RSS feed in, does what you need, and spits a feed out
I don't know any service that does that automatically but it's attainable to have a generic way of doing what you need. That's the power of rss-bridge: make the feed you want from content that already exists
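A minimal sketch of such a feed-in/feed-out filter, here doing only the keyword-blacklist part: read an RSS document, drop items whose title matches a blocked keyword, and emit the filtered feed. The feed contents below are made up for illustration.

```python
# Sketch: filter items out of an RSS 2.0 feed by title keywords.
import xml.etree.ElementTree as ET

def filter_feed(rss_xml: str, blocked_keywords: list) -> str:
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    for item in list(channel.findall("item")):
        title = (item.findtext("title") or "").lower()
        if any(kw.lower() in title for kw in blocked_keywords):
            channel.remove(item)  # drop matching items from the feed
    return ET.tostring(root, encoding="unicode")

feed = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Sponsored: buy now</title></item>
<item><title>Actual article</title></item>
</channel></rss>"""
print(filter_feed(feed, ["sponsored"]))
```

The same skeleton extends naturally to the other filters mentioned (length thresholds, per-author feeds): each is just another predicate deciding whether an `<item>` stays in the channel.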
you could start by pushing all articles into a database; have another process quickly label/tag the entries based on the criteria you care about; web or tui app to show you only the entries you care about; slower clean up job for entries you don't care to keep around anymore
Thanks, but I meant: which of the RSS services offers this basic filtering? From a dozen I know of, including paid ones, at most you get keyword black/white lists, which is too limiting. I used to use Huginn for that on Heroku.
I've wondered why people have tried all sorts of cumbersome ways to splice metadata onto HTML, like RDFa, but never tried the obvious approach of basing extraction rules on CSS selectors... Often these work without the cooperation of the target site, so long as they use CSS the way it was supposed to be used (e.g. not tailwind, bootstrap, etc.)
Back in the optimistic 2000s there was the idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff, e.g. microformats, HTML meta, FOAF, etc:
Having learned xpath and a little xslt I've always wondered why it isn't more popular. It seems like a powerhouse for reading and transforming data from XML-type documents. I've found it hard to find decent resources to learn more than the basics (and none for XQuery) because of lack of popularity nowadays, but I do think it's a skill you should have, like SQL and regex. Seems a no brainer.
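For anyone wanting to try the XPath style of extraction without installing anything, Python's standard-library ElementTree supports a small but useful XPath subset. The document below is a made-up XHTML-like snippet; the point is just the selector syntax.

```python
# Sketch: extracting links with ElementTree's limited XPath support.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html>
  <body>
    <div class="post"><h2><a href="/a">Post A</a></h2></div>
    <div class="post"><h2><a href="/b">Post B</a></h2></div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>""")

# .// searches descendants; [@class='post'] is an attribute predicate.
links = doc.findall(".//div[@class='post']/h2/a")
print([(a.get("href"), a.text) for a in links])  # sidebar link excluded
```

Full XPath (axes, functions, `//` anywhere in the path) needs a library like lxml, but the subset above already covers a lot of scraping-style extraction.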
I’ve thought about that. My first take on XSLT was that it was “too complicated”, I got to talking with XSLT enthusiasts later and found out how many good ideas XSLT has in it.
My take is that some specifications can be written out in a linear way where you can start reading at the beginning and work to end and not feel like you need to read ahead.
Some specs have a minor discontinuity, I remember perceiving it in the K and R book on C but it seemed like there was just one kink in it and if you read the book twice you’d do OK.
Books in C++ are worse and have numerous topics that resist being put in the right order. It’s not unusual for “resource acquisition is initialization” to be repeated hundreds of times before it is defined, for instance.
That circularity is both a function of the domain and also a function of the text, I think a certain amount of circularity is inherent to many domains, but frequently you can bootstrap a domain by dividing it into numerous layers and put the circularity into a layer built just to manage the circularity.
XSLT, XMLSchema, and many XML specs have that kind of circular structure; you are left wondering what exact kind of machine is required to implement it, so you look at the spec and have a hard time understanding how to do easy things and no grasp of the hard-looking things that are actually easy. Couple that with numerous sharp edges in XML, such as numeric values not being allowed in ID or IDREF fields (hate to break it to them, but numeric identifiers are rampant in the industry), and it is no wonder people would rather use deeply lame ‘standards’ like JSON that lack comments, aren’t really clear about the semantics of numbers, and don’t have the moral authority to say “quit screwing around and just use ISO 8601 dates.”
Now I finally realized the OWL spec is perfectly clear in the sense that you can understand what it really does by understanding the mapping of OWL axioms to first order logic, but the trouble is that logic is the most treacherous branch of mathematics.
CSS selectors has been common for the scrapers I've been using for years.
I quite like the microformats approach to this. https://developer.mozilla.org/en-US/docs/Web/HTML/microforma...
Sadly the trend does seem to be a move away from semantic CSS. I get the appeal of Tailwind for creating components and custom designs, but it's surprising when you see content heavy sites like the BBC no longer using class attributes in their news articles the way they used to.
For me PolitePol is best, because it doesn't limit the number of feeds and the free plan is pretty good: https://politepol.com
I wonder if this would work better / be more expressive with XPATH-style selectors?
rss-bridge also has xpath-style bridge: https://rss-bridge.org/bridge01/#bridge-XPathBridge
Is there a standalone application that can do something similar, that doesn't require a web server to run? Like an RSS reader you'd run on your desktop or phone? I'd definitely be interested in that.
FreshRSS has XPath scraping.
Does it work for websites that fetch content async? I've had success with https://morss.it instead (which can also be selfhosted)
This is very similar to how you can scrape data from the web with Power Query.
Getting 502 Bad Gateway
yea, suffering from success ...
FreshRSS has this feature built in. But you can use rss-bridge for far more complicated scenarios too
"Generate RSS feed for any website using CSS selectors"
For me, "CSS selectors" always seems like a deceptive term, if it means selecting HTML tag elements. What if the website does not use styling?
I read 1000s of websites, including all HN submissions, without using CSS. When I want to extract information from a website, I focus on patterns in the page. They might be HTML, they might be style elements, but they could be anything. I never assume that all websites will wrap the information I want in certain elements. There is a ridiculous amount of random variation amongst websites.
I'm not sure that CSS actually being used on the page is a requirement. `h1 a` is a valid CSS selector regardless of whether a style sheet styles those elements.
The key here is that it uses selectors, not the style sheets themselves.
You just need to use the same logic and syntax as CSS selectors to pick out content from the page. That's a little different from using CSS to style.
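To make that concrete: selector syntax only reads the document's structure, so it works on a page with no stylesheet at all. A tiny stdlib-only matcher for single `tag` or `tag.class` selectors (a deliberately minimal toy, not what real selector engines do):

```python
# Sketch: matching a CSS-style "tag.class" selector against HTML
# that contains no <style> or <link> elements whatsoever.
from html.parser import HTMLParser

class SelectorMatcher(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.matches.append((tag, dict(attrs)))

page = '<ul><li class="entry">one</li><li>two</li></ul>'
m = SelectorMatcher("li.entry")
m.feed(page)
print(m.matches)  # [('li', {'class': 'entry'})]
```

Real tools delegate this to a proper selector engine (e.g. BeautifulSoup's `select()`), which handles descendant combinators, attribute selectors, and so on, but the underlying principle is the same: the stylesheet is irrelevant, only the markup matters.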
Using CSS selectors, exclusively, is brittle and prone to failure.