Ask HN: Fastest Crawl of HN Articles
HN links to over 6 million URLs in stories and comments. Many domains have expired or the content is no longer available. The Internet Archive has much of the content but throttles requests. What's the fastest way to get the historical content?

HN does have a REST API which is quite easy to use: https://github.com/HackerNews/API

I'm not sure what rate-limiting policy is in place, but in theory you can start with a request for maxitem and from that point on just GET every item down to zero until you hit some sort of blocker (a sketch of this loop is appended at the end of the thread).

HN's API doesn't have rate limiting. The Algolia HN API[0], however, does.

To clarify, I'm not asking about HN itself but about the articles linked from HN. As you said, the HN API is great, and there are at least two existing published crawls of it that help a lot.

> To clarify, I'm not asking about HN itself but about the articles linked from HN.

I might not have a clear picture of what you're looking for, but items of type "story" returned by the HN API do have a url field, which I believe corresponds to the submitted link. You can scrape the text field of comment items, but that takes a bit more work. Hopefully this will help.

You're talking about a submission to HN, e.g. a link to a WSJ article complete with its comments section, and OP is talking about the specific WSJ article. The fastest way to get those would probably still be through HN's API: you just take the url field for stories and ignore everything else.

And how do you get the content once you have the URL?

> And how do you get the content once you have the URL?

I don't understand your question. If you have the URL, you just GET it, like any regular URL. Is there something I'm missing?

> Many domains have expired or the content is no longer available.

Use IA more responsibly, perhaps. Instead of scraping it, convert the list of links from HN to point to IA (see the Wayback sketch appended below)? You still have to work with whatever limits the site puts up in any case.

If an HN story is a link to Wikipedia, does the HN API serve the content of the Wikipedia page?

The best way to do it is from Google BigQuery. There's a dataset containing everything: bigquery-public-data.hacker_news.full. You can write SQL against it and it's super fast (a query sketch is appended below). Sample:

    SELECT * FROM `bigquery-public-data.hacker_news.full` LIMIT 1

Maybe https://commoncrawl.org/ (an index-lookup sketch is appended below).
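A minimal sketch of the maxitem walk suggested above, assuming the official Firebase endpoints documented at https://github.com/HackerNews/API and the Python requests library; the helper name iter_story_urls is only illustrative:

    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    def iter_story_urls():
        # maxitem.json returns the id of the newest item created on HN
        max_item = requests.get(f"{BASE}/maxitem.json", timeout=10).json()
        for item_id in range(max_item, 0, -1):
            item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
            if not item:
                continue  # deleted or missing items come back as null
            # only stories carry the submitted link; comments have a text field instead
            if item.get("type") == "story" and item.get("url"):
                yield item["url"]

    for url in iter_story_urls():
        print(url)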
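For the suggestion to point the dead links at IA rather than scraping the original domains, one option is the Wayback Machine availability API (https://archive.org/help/wayback_api.php). A rough sketch under that assumption, with a pause between calls since IA throttles aggressive clients:

    import time
    import requests

    def wayback_snapshot(url, pause=1.0):
        # ask the availability API for the closest archived capture of this URL
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url},
            timeout=30,
        )
        time.sleep(pause)  # space out requests; IA throttles heavy users
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    print(wayback_snapshot("http://example.com/"))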
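For the BigQuery route, a sketch that pulls only the submitted URLs instead of SELECT *, assuming the type and url columns of bigquery-public-data.hacker_news.full, the google-cloud-bigquery client, and a configured Google Cloud project:

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT DISTINCT url
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND url IS NOT NULL
    """
    # one round trip returns every distinct submitted URL
    for row in client.query(sql).result():
        print(row["url"])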
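And for the Common Crawl suggestion, a sketch of checking whether a given URL was captured, via the public CDX index API listed at https://index.commoncrawl.org/; the crawl label CC-MAIN-2024-10 below is only an example, so pick a current one from that listing:

    import json
    import requests

    def commoncrawl_records(url, crawl="CC-MAIN-2024-10"):
        # the index answers with newline-delimited JSON, one record per capture
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl}-index",
            params={"url": url, "output": "json"},
            timeout=30,
        )
        if resp.status_code == 404:
            return []  # no captures of this URL in this crawl
        return [json.loads(line) for line in resp.text.splitlines()]

    # each record carries filename/offset/length, which can be turned into a
    # byte-range request against https://data.commoncrawl.org/ to pull the
    # archived page out of the WARC file
    for rec in commoncrawl_records("example.com/"):
        print(rec["timestamp"], rec["url"], rec["status"])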