Show HN: robots.txt as a service, check web crawl rules through an API
robotstxt.io

Why a service and not a library?
It looks like a great way for you to discover URLs but like a terribly slow way for people to avoid implementing robots.txt rules.
The project has multiple subprojects, and one of them is the parser. Any developer can compile or extend it without much effort and create a library; you just need to know Kotlin / Java.
The aim of this project is only to check whether a given web resource can be crawled by a user agent, but through an API.
While this looks good, I don't think it's feasible for a web crawler in most cases. Crawlers want to crawl a ton of URLs and it would have to make a request to your service for each and every URL.
What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway) or crawl the index and look for all links and send a request to your service for every URL before crawling it?
I personally think it would be better suited as a library where you can pass it a robots.txt and it'll let you know if you can crawl a URL based on that.
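Something in the spirit of this rough sketch, say in Kotlin: the class and method names here are made up for illustration, not the project's actual parser API, and it ignores wildcards, Allow rules and other edge cases.

    // Minimal sketch of a library-style checker: parse a robots.txt string once,
    // then answer allow/deny questions locally with no network calls.
    class RobotsRules(robotsTxt: String, userAgent: String) {
        private val disallowed: List<String> = run {
            var applies = false
            val rules = mutableListOf<String>()
            for (raw in robotsTxt.lines()) {
                val line = raw.substringBefore('#').trim()
                when {
                    line.startsWith("User-agent:", ignoreCase = true) -> {
                        val agent = line.substringAfter(':').trim()
                        applies = agent == "*" || userAgent.contains(agent, ignoreCase = true)
                    }
                    applies && line.startsWith("Disallow:", ignoreCase = true) -> {
                        val path = line.substringAfter(':').trim()
                        if (path.isNotEmpty()) rules.add(path)
                    }
                }
            }
            rules
        }

        fun isAllowed(path: String): Boolean = disallowed.none { path.startsWith(it) }
    }

    fun main() {
        val rules = RobotsRules("User-agent: *\nDisallow: /private/", "MyUserAgentBot")
        println(rules.isAllowed("/test/user/1"))   // true
        println(rules.isAllowed("/private/data"))  // false
    }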
I implemented this service with the idea of making a network request for each new URL that needs to be crawled. Internally, the service caches requests by base domain and user agent, so responses are very fast once a domain has been checked.
For example, if you want to check the URL https://example.com/test/user/1 with the user agent MyUserAgentBot, the first request can be slow (~730ms), but subsequent requests for different paths on the same base URL, port and protocol will use the cached version (just ~190ms). Note that this version is in alpha and many things can still be optimized. There is a balance to be found between managing these files in each project and the time spent on network requests.
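For reference, a crawler-side check could look roughly like the Kotlin sketch below; the endpoint path, query parameter names and response shape are assumptions for illustration, not the documented API.

    import java.net.URI
    import java.net.URLEncoder
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    fun canCrawl(url: String, userAgent: String): Boolean {
        val query = "url=" + URLEncoder.encode(url, Charsets.UTF_8) +
            "&userAgent=" + URLEncoder.encode(userAgent, Charsets.UTF_8)
        val request = HttpRequest.newBuilder()
            .uri(URI.create("https://robotstxt.io/api/check?$query")) // hypothetical endpoint path
            .GET()
            .build()
        val response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
        // Assumed response shape: a JSON body with an "allowed" boolean field.
        return response.statusCode() == 200 && response.body().contains("\"allowed\":true")
    }

    fun main() {
        println(canCrawl("https://example.com/test/user/1", "MyUserAgentBot"))
    }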
Anyway, anyone can compile the parser module and create a library to check robots.txt rules on their own ;-)
PS: thanks for the feedback
The service is very nice and I understand your reason for developing it. I see this service having more value in helping companies find all the web pages, rather than just the allowed ones.
I understand the unethical nature of the above method, however, I see it happening quite a lot in practice.
Yes, in practice people sometimes don't want to be polite to webmasters and choose not to obey robots.txt rules. Thanks for the suggestion!
Exactly, your service could definitely be used as an alternative to parsing the plain-text robots.txt format yourself, in favor of more standard JSON parsing, along with the advantages that come with making it REST.
I created this project for use in my own projects. It is open source, and you can use it if you are implementing SEO tools or a web crawler. Note that this is a first alpha release.
Give me some feedback!