Ask HN: Simple API to extract web article text?

2 points by friendofafriend 2 years ago · 7 comments · 1 min read

I'm working on a consumer-facing project that involves analyzing text from web articles.

Does anyone know of an API that can handle the text extraction part automatically?

Ideally the API can take in a URL and just return the main text content of a website, even for sites with slightly complex layouts.

For example: https://www.nytimes.com/2024/03/28/technology/personaltech/smart-glasses-ray-ban-meta.html

We're most interested in an API that has a decent free tier + usage-based pricing (at least for overages).

So far, most of our searches have turned up website scrapers that return HTML that needs to be further parsed (ScrapingBot, ScrapingBee, Scrapingdog, etc.), or services that are prohibitively priced (Diffbot).

Next, we're looking into Apify, but maybe we've missed something?

Any recommendations would be greatly appreciated!

timoteostewart 2 years ago

Would you consider rolling your own? Python’s goose3 has worked well for me in article extraction. It seemed to be successful more often than trafilatura and newspaper3k.
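
To give a rough idea of what "rolling your own" means at its simplest, here is a minimal sketch that uses only the Python standard library. It is a naive heuristic (keep the text inside `<p>` tags, drop very short paragraphs), not how goose3, trafilatura, or newspaper3k work internally. The names `ParagraphTextExtractor`, `extract_main_text`, and the `min_words` threshold are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Naive extractor (illustrative only): collects text found inside <p> tags."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []    # finished paragraphs, in document order
        self._buffer = []       # text pieces of the paragraph being read
        self._in_p = False      # True while inside a <p> element
        self._skip_depth = 0    # > 0 while inside script/style/noscript

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            # </p> is optional in HTML, so a new <p> closes the previous one.
            self.flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buffer.append(data)

    def flush(self):
        """Store the buffered paragraph with whitespace collapsed."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_main_text(html: str, min_words: int = 5) -> str:
    """Return paragraphs with at least `min_words` words, separated by blank lines.

    The word-count cutoff is an assumed, crude way to drop menu and footer snippets.
    """
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing paragraph that was never closed
    kept = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return "\n\n".join(kept)
```

You would feed it HTML fetched by whatever means you prefer (for example `urllib.request` or one of the scraping services that return raw HTML). Real article pages with paywalls, JavaScript rendering, or unusual markup will need much more than this, which is where the dedicated libraries earn their keep.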

friendofafriendOP 2 years ago

I was not aware of any of those projects - thank you for pointing me in the right direction!
goose3, trafilatura, newspaper3k (and newspaper4k even) all look like great tools. We were not planning on rolling our own, but that might be the right way to go after all. Thanks again.

cranberryturkey 2 years ago

Brisk.news

friendofafriendOP 2 years ago

Thanks for the suggestion! The site itself works pretty well, but I'm not seeing an API or any such documentation.
- cranberryturkey 2 years ago
  
  You can register an account, log in, and then create an API key. Unfortunately the API is not documented yet, but the code is FOSS (https://github.com/profullstack/hynt-web), so it should be pretty easy to see how it works.
  - friendofafriendOP 2 years ago
    
    Great, I'll check it out. Thank you again!
    
    cranberryturkey 2 years ago
    
    Let me know if you have any questions. Just file a GitHub issue on that repo and I'll add what you need.
