Generate RSS feed for any website using CSS selectors
rss-bridge.org
CSS selectors were more useful before the Tailwind fad of dropping meaningful class names in favor of recreating inline styles, but with abbreviations to memorize. I use μBlock Origin + userStyles a lot, both of which also use CSS selectors, & the last couple of years everything has become a lot harder for the end user to tweak/fix. If you’re lucky now, you’ll have some ARIA attributes to select on.
And it also became harder due to people thinking random ids and class names are totally fine. Super annoyed by that. It feels like they are forcing their vision onto the user, while the user does not want their vision and could not care less.
The web was nicer when you could inspect, learn, & riff off of what others were doing in the industry, like the old music industry used to do when covering & borrowing a phrase was considered homage, not grounds for a lawsuit. It’s now all meant to be closed off & behind build tools that complect the output, where most folks don’t even know how their pipeline works; and this is strange, since the simple tools of HTML, CSS, & JS can construct the web without any build steps at all if you wanted.
Agree completely. As a minor nicety, though, I’ve noticed most sites publish sourcemaps in production now. So, in a few ways it’s become more possible to inspect and study JS, compared to when sourcemaps weren’t there, and you could only see mangled source.
I am not so sure about sourcemaps being an adequate replacement. They are just one feature flag toggle away from disappearing at any given moment. All it takes is one over-zealous tech decision maker to make them disappear on a website. And I know the types that would rather turn them off to shave a few kB off the delivery, instead of rethinking their choice of framework. Or someone suddenly thinking that they need to "protect" (obfuscate) their frontend source code. Source maps are too easy to switch off.
RSSHub[0] is in the same ballpark, but consists of a large library of site-specific code[1][2].
[0]https://github.com/DIYgod/RSSHub/
RSS Bridge also has a large library of site-specific code; CSS is just one of the hundreds of solutions they offer. And there are some other projects collecting and maintaining recipes for scraping data from sites. Calibre for example, and youtube-dl/yt-dlp for videos. Seeing so many projects all doing the same thing, I kinda feel sad that they are not cooperating to maintain a central recipe collection.
It ded.
Archive: https://web.archive.org/web/20230714202418/https://rss-bridg...
Sample feed: https://web.archive.org/web/20230308160413/https://rss-bridg...
List of public instances: https://rss-bridge.github.io/rss-bridge/General/Public_Hosts...
edit: but the few I tried did not have the CSS Selector Bridge enabled so go with the original link or archive of it.
I was always afraid to use one of these. I thought that the css selectors would be too brittle and ultimately break.
I have built my own solution that is automagical at https://awesomegoat.com/ but I am running into the next set of issues, which are various scraping protections. It seems that a reasonable RSS gateway today needs to include a botnet of residential proxies just to read content on the internet.
This is a great tool! Before I learned about nitter, this was my primary way to follow people on Twitter. I love the idea of trying to wrestle unsupported feeds (Twitter, Instagram, etc.) into a standard/open format.
The lack of feed generation is why so many of the latest blog platforms are non-starters in my book. It boggles my mind. Honestly, if you don't generate a feed of some sort, I really can't take you seriously.
I run my own instance of RSS Bridge to keep track of authors that I like on Goodreads.
It works pretty well, although every once in a while Goodreads hiccups, and then RSS bridge gives me a bunch of "new posts" that are actually error messages.
Hey, I wrote the Goodreads bridge for exactly this use case. I’ll try to see if I can filter out the error messages.
Thanks! I've been meaning to play with the code and see if I could figure out how to add a few more features:
* Generate RSS feeds from book series
* Filter out translations
* Filter out compilations (not sure if this one is really plausible)
Any pointers on how I might accomplish some of those?
Huginn is another useful tool that allows you to wrangle CSS selectors and XPath nodes to create RSS feeds.
I use it quite successfully to get data out of undocumented APIs and out into RSS.
This honestly is standard web scraping but these projects always catch my attention.
You're at the mercy of rate-limiting firewalls (so you'll have to rotate proxies if you intend on using this heavily), on top of the standard CloudFront bot detection recaptcha and div-obfuscation (a good example of this is Facebook).
rss-Bridge has decent caching support, customisable on a bridge level, so that comes pre-tuned and works well at low volumes for personal use.
At large scale, like the kind of traffic I started seeing when I ran a public rss-bridge Instagram/Telegram bridge - rate limits are unavoidable.
That's been my experience too. Some of the bridges take into account the rate limits imposed by the platforms, and the steps required to get content without an API key.
So using RSS Bridge to generate feeds from large platforms is often a lot more reliable than the typical scraping script I'd code up myself for other sites.
These days I just let ChatGPT generate a script that scrapes a site and spits out an RSS file. Then I run it with cron.
I’m guessing they paste a portion of the website’s source then tell ChatGPT to generate a script that can generate an RSS feed from that site.
Yeah I just copy the html that's relevant. There's some manual work involved but it doesn't take a lot of time.
Are you not limited by the cut-off date of the content the model is trained on?
1. the script is generated by the llm
2. the user runs the script that does the scraping
these are temporally separate actions
Fine, but it’s subject to HTML selector brittleness, no? Oh, you submit the raw HTML when you need it, maybe?
Here's how I do it.
1. Tell ChatGPT to create a python script that scrapes example.com and generates an RSS file.
2. Paste a snippet of the html and tell it to modify the script to use that.
3. I do some minor tweaks myself to fix the date format.
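For a sense of what this workflow produces, here is a minimal sketch of such a script, using only the standard library. The site structure it targets (article links inside `<h2 class="title">`) is entirely hypothetical; a real script would be tailored to whatever HTML snippet you pasted into the chat.

```python
# Sketch: extract (link, title) pairs from a hypothetical page
# structure and emit a minimal RSS 2.0 document.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ArticleExtractor(HTMLParser):
    """Collects items from <h2 class="title"><a href=...>Title</a></h2>."""
    def __init__(self):
        super().__init__()
        self.in_title_h2 = False
        self.in_link = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "title" in attrs.get("class", "").split():
            self.in_title_h2 = True
        elif tag == "a" and self.in_title_h2:
            self.in_link = True
            self.items.append({"link": attrs.get("href", ""), "title": ""})

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title_h2 = False
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and self.items:
            self.items[-1]["title"] += data

def to_rss(items, site_title, site_url):
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    for item in items:
        el = ET.SubElement(channel, "item")
        ET.SubElement(el, "title").text = item["title"].strip()
        ET.SubElement(el, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

snippet = '<h2 class="title"><a href="https://example.com/post-1">First post</a></h2>'
parser = ArticleExtractor()
parser.feed(snippet)
print(to_rss(parser.items, "Example Blog", "https://example.com"))
```

In practice you'd fetch the page with `urllib.request` (or `requests`) before parsing, and write the output to a file your feed reader polls.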
Other services like this: https://www.fivefilters.org/feed-creator/
I created Feed Creator, so nice to see it mentioned in the comments :)
I've written two blog posts about how we go about using CSS selectors when working with Feed Creator. Might be useful for those looking to do the same with RSS-Bridge.
How to turn a webpage into an RSS feed using Feed Creator
Part 1: https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
Part 2 (using more advanced selectors): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
What's the easiest way to also run a few basic filters on the site/RSS feed's content to make it truly shine vs simplistic scraping, like
- splitting the full feed by theme of the article into separate feeds and at the same time
- remove a few keywords and also
- get article length and split into a long / short feed
- Or maybe get what you used to have on some news sites - subscribe only to a specific author instead of getting bombarded with hundreds of items in a feed
Write a parser for rss-bridge that takes an RSS feed in, does what you need, and spits a feed out
I don't know any service that does that automatically but it's attainable to have a generic way of doing what you need. That's the power of rss-bridge: make the feed you want from content that already exists
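A minimal sketch of such a feed-in/feed-out filter, here doing only the keyword-blacklist part: read an RSS document, drop items whose title matches a blocked keyword, and emit the filtered feed. The feed contents below are made up for illustration.

```python
# Sketch: filter items out of an RSS 2.0 feed by title keywords.
import xml.etree.ElementTree as ET

def filter_feed(rss_xml: str, blocked_keywords: list) -> str:
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    for item in list(channel.findall("item")):
        title = (item.findtext("title") or "").lower()
        if any(kw.lower() in title for kw in blocked_keywords):
            channel.remove(item)  # drop matching items from the feed
    return ET.tostring(root, encoding="unicode")

feed = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Sponsored: buy now</title></item>
<item><title>Actual article</title></item>
</channel></rss>"""
print(filter_feed(feed, ["sponsored"]))
```

The same skeleton extends naturally to the other filters mentioned (length thresholds, per-author feeds): each is just another predicate deciding whether an `<item>` stays in the channel.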
you could start by pushing all articles into a database; have another process quickly label/tag the entries based on the criteria you care about; web or tui app to show you only the entries you care about; slower clean up job for entries you don't care to keep around anymore
Thanks, but I meant: which of the RSS services offers this basic filtering? From a dozen I know of, including paid ones, at most you get keyword black/white lists, which is too limiting. I used to use Huginn for that on Heroku.
I've wondered why people have tried all sorts of cumbersome ways to splice metadata onto HTML, like RDFa, but never tried the obvious approach of basing extraction rules on CSS selectors... Often these work without the cooperation of the target site, so long as they use CSS the way it was supposed to be used (e.g. not tailwind, bootstrap, etc.)
Back in the optimistic 2000s there was the idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff, e.g. microformats, HTML meta, FOAF, etc:
Having learned xpath and a little xslt I've always wondered why it isn't more popular. It seems like a powerhouse for reading and transforming data from XML-type documents. I've found it hard to find decent resources to learn more than the basics (and none for XQuery) because of lack of popularity nowadays, but I do think it's a skill you should have, like SQL and regex. Seems a no brainer.
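For anyone wanting to try the XPath style of extraction without installing anything, Python's standard-library ElementTree supports a small but useful XPath subset. The document below is a made-up XHTML-like snippet; the point is just the selector syntax.

```python
# Sketch: extracting links with ElementTree's limited XPath support.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html>
  <body>
    <div class="post"><h2><a href="/a">Post A</a></h2></div>
    <div class="post"><h2><a href="/b">Post B</a></h2></div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>""")

# .// searches descendants; [@class='post'] is an attribute predicate.
links = doc.findall(".//div[@class='post']/h2/a")
print([(a.get("href"), a.text) for a in links])  # sidebar link excluded
```

Full XPath (axes, functions, `//` anywhere in the path) needs a library like lxml, but the subset above already covers a lot of scraping-style extraction.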
I’ve thought about that. My first take on XSLT was that it was “too complicated”, I got to talking with XSLT enthusiasts later and found out how many good ideas XSLT has in it.
My take is that some specifications can be written out in a linear way where you can start reading at the beginning and work to end and not feel like you need to read ahead.
Some specs have a minor discontinuity, I remember perceiving it in the K and R book on C but it seemed like there was just one kink in it and if you read the book twice you’d do OK.
Books in C++ are worse and have numerous topics that resist being put in the right order. It’s not unusual for “resource acquisition is initialization” to be repeated hundreds of times before it is defined, for instance.
That circularity is both a function of the domain and also a function of the text, I think a certain amount of circularity is inherent to many domains, but frequently you can bootstrap a domain by dividing it into numerous layers and put the circularity into a layer built just to manage the circularity.
XSLT, XMLSchema, and many XML specs have that kind of circular structure; you are left wondering what exact kind of machine is required to implement it, so you look at the spec and have a hard time understanding how to do easy things and no grasp of the hard-looking things that are actually easy. Couple that with numerous sharp edges in XML, such as numeric values not being allowed in ID or IDREF fields (hate to break it to them, but numeric identifiers are rampant in the industry), and it is no wonder people would rather use deeply lame ‘standards’ like JSON that lack comments, aren’t really clear about the semantics of numbers, and don’t have the moral authority to say “quit screwing around and just use ISO 8601 dates.”
Now I finally realized the OWL spec is perfectly clear in the sense that you can understand what it really does by understanding the mapping of OWL axioms to first order logic, but the trouble is that logic is the most treacherous branch of mathematics.
CSS selectors has been common for the scrapers I've been using for years.
I quite like the microformats approach to this. https://developer.mozilla.org/en-US/docs/Web/HTML/microforma...
Sadly the trend does seem to be a move away from semantic CSS. I get the appeal of Tailwind for creating components and custom designs, but it's surprising when you see content heavy sites like the BBC no longer using class attributes in their news articles the way they used to.
For me PolitePol is best, because it doesn't limit the number of feeds and the free plan is pretty good: https://politepol.com
I wonder if this would work better / be more expressive with XPATH-style selectors?
rss-bridge also has xpath-style bridge: https://rss-bridge.org/bridge01/#bridge-XPathBridge
Is there a standalone application that can do something similar, that doesn't require a web server to run? Like an RSS reader you'd run on your desktop or phone? I'd definitely be interested in that.
FreshRSS has XPath scraping.
Does it work for websites that fetch content async? I've had success with https://morss.it instead (which can also be selfhosted)
This is very similar to how you can scrape data from the web with Power Query.
Getting 502 Bad Gateway
yea, suffering from success ...
FreshRSS has this feature built in. But you can use rss-bridge for far more complicated scenarios too
"Generate RSS feed for any website using CSS selectors"
For me, "CSS selectors" always seems like a deceptive term, if it means selecting HTML tag elements. What if the website does not use styling?
I read 1000s of websites, including all HN submissions, without using CSS. When I want to extract information from a website, I focus on patterns in the page. They might be HTML, they might be style elements, but they could be anything. I never assume that all websites will wrap the information I want in certain elements. There is a ridiculous amount of random variation amongst websites.
I'm not sure that CSS actually being used on the page is a requirement. `h1 a` is a valid CSS selector regardless of whether a style sheet styles those elements.
The key here is that it uses selectors, not the style sheets themselves.
You just need to use the same logic and syntax as CSS selectors to pick out content from the page. That's a little different from using CSS to style.
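To make that concrete: selector syntax only reads the document's structure, so it works on a page with no stylesheet at all. A tiny stdlib-only matcher for single `tag` or `tag.class` selectors (a deliberately minimal toy, not what real selector engines do):

```python
# Sketch: matching a CSS-style "tag.class" selector against HTML
# that contains no <style> or <link> elements whatsoever.
from html.parser import HTMLParser

class SelectorMatcher(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.matches.append((tag, dict(attrs)))

page = '<ul><li class="entry">one</li><li>two</li></ul>'
m = SelectorMatcher("li.entry")
m.feed(page)
print(m.matches)  # [('li', {'class': 'entry'})]
```

Real tools delegate this to a proper selector engine (e.g. BeautifulSoup's `select()`), which handles descendant combinators, attribute selectors, and so on, but the underlying principle is the same: the stylesheet is irrelevant, only the markup matters.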
Using CSS selectors, exclusively, is brittle and prone to failure.