Ask HN: Fastest Crawl of HN Articles

8 points by agencies 4 years ago · 13 comments

HN links to over 6 million URLs in stories and comments. Many of those domains have expired, or the content is no longer available. The Internet Archive has much of the content but throttles requests. What's the fastest way to get the historical content?

arinlen 4 years ago

HN does have a REST API which is quite easy to use.

https://github.com/HackerNews/API

I'm not sure what rate-limiting policy is in place, but in theory you can start with a request for maxitem and from that point just GET every item down to zero until you hit some sort of blocker.
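
A minimal sketch of that walk (the maxitem and item endpoints are the documented Firebase ones; everything else here is illustrative):

    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    # Highest item ID assigned so far.
    max_id = requests.get(f"{BASE}/maxitem.json").json()

    # Walk backwards over every item; in practice you'd persist progress
    # and parallelize rather than issue one blocking GET at a time.
    for item_id in range(max_id, 0, -1):
        item = requests.get(f"{BASE}/item/{item_id}.json").json()
        if item is None:  # some IDs come back as null
            continue
        print(item_id, item.get("type"))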

  • krapp 4 years ago

    HN's API doesn't have rate limiting. The Algolia HN API[0] however does.

    [0] https://hn.algolia.com/api
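
    For what it's worth, a quick sketch against Algolia's documented search endpoint (tags and hitsPerPage are documented parameters; the exact rate limit is whatever Algolia currently enforces):

        import requests

        # Page through stories, newest first, via Algolia's HN search API.
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={"tags": "story", "hitsPerPage": 100, "page": 0},
        )
        for hit in resp.json()["hits"]:
            print(hit.get("url"))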

  • agenciesOP 4 years ago

    To clarify, I'm not asking about HN itself but about the articles linked from HN.

    As you said, the HN API is great, and there are at least two existing published crawls of it that help a lot.

    • arinlen 4 years ago

      > To clarify I'm not asking about HN itself but articles linked from HN.

      I might not have a clear picture of what you're looking for, but items of type "story" returned by the HN API do have a URL field, which I believe corresponds to the submitted link.

      You can scrape the text field of comment items, but that takes a bit more work.
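
      A rough sketch of both cases, assuming nothing beyond the official API (the item ID is arbitrary, picked only for illustration):

          import requests
          from html.parser import HTMLParser

          class HrefCollector(HTMLParser):
              """Collects href attributes from <a> tags in comment HTML."""
              def __init__(self):
                  super().__init__()
                  self.links = []

              def handle_starttag(self, tag, attrs):
                  if tag == "a":
                      self.links += [v for k, v in attrs if k == "href"]

          item = requests.get(
              "https://hacker-news.firebaseio.com/v0/item/8863.json"  # arbitrary ID
          ).json()
          if item.get("type") == "story" and "url" in item:
              print("story link:", item["url"])
          elif item.get("type") == "comment":
              collector = HrefCollector()
              collector.feed(item.get("text", ""))
              print("comment links:", collector.links)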

      • lcnPylGDnU4H9OF 4 years ago

        Hopefully this helps: you're talking about a submission to HN, e.g. a link to a WSJ article complete with a comments section, while OP is talking about the specific WSJ article.

    • krapp 4 years ago

      The fastest way to get that would probably still be through HN's API; you just take the URL field for stories and ignore everything else.

      • tedunangst 4 years ago

        And how do you get the content once you have the URL?

        • arinlen 4 years ago

          > And how do you get the content once you have the URL?

          I don't understand your question. If you have the URL, you just GET it, like any regular URL? Is there something that I'm missing?

        • krapp 4 years ago

          Use IA more responsibly, perhaps. Instead of scraping it, convert the list of links from HN to point to IA? You still have to work with whatever limits the site puts up in any case.
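
          One low-volume way to do that mapping, using the Wayback Machine's public availability endpoint (the lookup URL here is just an example):

              import requests

              # Ask the Wayback Machine for its closest archived snapshot.
              resp = requests.get(
                  "https://archive.org/wayback/available",
                  params={"url": "example.com"},
              )
              snap = resp.json().get("archived_snapshots", {}).get("closest")
              if snap:
                  print(snap["url"])  # a web.archive.org/web/... link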

      • agenciesOP 4 years ago

        If an HN story is a link to Wikipedia, the HN API serves the content of the Wikipedia page??

jpcapdevila 4 years ago

The best way to do it is from Google BigQuery.

There's a dataset containing everything: bigquery-public-data.hacker_news.full

You can write SQL and it's super fast. Sample:

SELECT * FROM `bigquery-public-data.hacker_news.full` LIMIT 1
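
A sketch of pulling just the outbound story links with the google-cloud-bigquery client; this assumes a GCP project with default credentials configured, and that the type/url columns are as the dataset documents them:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP credentials/project

    # Only stories carry an outbound link; comments have text instead.
    query = """
        SELECT url
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND url IS NOT NULL
    """
    for row in client.query(query).result():
        print(row.url)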

python273 4 years ago

maybe https://commoncrawl.org/
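
If you go that route, the CDX index API is the usual entry point. A sketch; the crawl ID below is only an example, and the current list is published on commoncrawl.org:

    import requests

    # Look up captures of a URL in one Common Crawl snapshot.
    # CC-MAIN-2024-10 is an example crawl ID; pick a current one from the site.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
        params={"url": "example.com/*", "output": "json"},
    )
    for line in resp.text.splitlines():
        print(line)  # one JSON record per capture: WARC filename, offset, length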
