Ask HN: Fastest Crawl of HN Articles
HN links to over 6 million URLs in stories and comments. Many domains have expired or the content is no longer available. The Internet Archive has much of the content but throttles requests. What's the fastest way to get the historical content?

HN does have a REST API which is quite easy to use: https://github.com/HackerNews/API

I'm not sure what rate-limiting policy is in place, but in theory you can start with a request for maxitem and from that point on just GET every item down to zero until you hit some sort of blocker (a sketch of this loop is appended at the end of the thread).

HN's API doesn't have rate limiting. The Algolia HN API[0], however, does.

To clarify, I'm not asking about HN itself but about the articles linked from HN. As you said, the HN API is great, and there are at least two existing published crawls of it that help a lot.

> To clarify, I'm not asking about HN itself but about the articles linked from HN.

I might not have a clear picture of what you're looking for, but items of type "story" returned by the HN API do have a url field, which I believe corresponds to the submitted link. You can scrape the text field of comment items, but that takes a bit more work. Hopefully this will help.

You're talking about a submission to HN, e.g. a link to a WSJ article complete with its comments section, and OP is talking about the specific WSJ article. The fastest way to get those would probably still be through HN's API: you just take the url field for stories and ignore everything else.

And how do you get the content once you have the URL?

> And how do you get the content once you have the URL?

I don't understand your question. If you have the URL, you just GET it, like any regular URL. Is there something I'm missing?

> Many domains have expired or the content is no longer available.

Use IA more responsibly, perhaps. Instead of scraping it, convert the list of links from HN to point to IA (see the Wayback sketch appended below)? You still have to work with whatever limits the site puts up in any case.

If an HN story is a link to Wikipedia, does the HN API serve the content of the Wikipedia page?

The best way to do it is from Google BigQuery. There's a dataset containing everything: bigquery-public-data.hacker_news.full. You can write SQL against it and it's super fast (a query sketch is appended below). Sample:

    SELECT * FROM `bigquery-public-data.hacker_news.full` LIMIT 1

Maybe https://commoncrawl.org/ (an index-lookup sketch is appended below).
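A minimal sketch of the maxitem walk suggested above, assuming the official Firebase endpoints documented at https://github.com/HackerNews/API and the Python requests library; the helper name iter_story_urls is only illustrative:

    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    def iter_story_urls():
        # maxitem.json returns the id of the newest item created on HN
        max_item = requests.get(f"{BASE}/maxitem.json", timeout=10).json()
        for item_id in range(max_item, 0, -1):
            item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
            if not item:
                continue  # deleted or missing items come back as null
            # only stories carry the submitted link; comments have a text field instead
            if item.get("type") == "story" and item.get("url"):
                yield item["url"]

    for url in iter_story_urls():
        print(url)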
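For the suggestion to point the dead links at IA rather than scraping the original domains, one option is the Wayback Machine availability API (https://archive.org/help/wayback_api.php). A rough sketch under that assumption, with a pause between calls since IA throttles aggressive clients:

    import time
    import requests

    def wayback_snapshot(url, pause=1.0):
        # ask the availability API for the closest archived capture of this URL
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url},
            timeout=30,
        )
        time.sleep(pause)  # space out requests; IA throttles heavy users
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    print(wayback_snapshot("http://example.com/"))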
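For the BigQuery route, a sketch that pulls only the submitted URLs instead of SELECT *, assuming the type and url columns of bigquery-public-data.hacker_news.full, the google-cloud-bigquery client, and a configured Google Cloud project:

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT DISTINCT url
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND url IS NOT NULL
    """
    # one round trip returns every distinct submitted URL
    for row in client.query(sql).result():
        print(row["url"])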
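And for the Common Crawl suggestion, a sketch of checking whether a given URL was captured, via the public CDX index API listed at https://index.commoncrawl.org/; the crawl label CC-MAIN-2024-10 below is only an example, so pick a current one from that listing:

    import json
    import requests

    def commoncrawl_records(url, crawl="CC-MAIN-2024-10"):
        # the index answers with newline-delimited JSON, one record per capture
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl}-index",
            params={"url": url, "output": "json"},
            timeout=30,
        )
        if resp.status_code == 404:
            return []  # no captures of this URL in this crawl
        return [json.loads(line) for line in resp.text.splitlines()]

    # each record carries filename/offset/length, which can be turned into a
    # byte-range request against https://data.commoncrawl.org/ to pull the
    # archived page out of the WARC file
    for rec in commoncrawl_records("example.com/"):
        print(rec["timestamp"], rec["url"], rec["status"])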