Ask HN: Is there a Hacker News takeout to export my comments / upvotes, etc.?
Like the title says, I'm wondering if there is an equivalent of Google Takeout for HN. Or how are you guys doing it? Thanks.
You can export the whole dataset as described here: https://github.com/ClickHouse/ClickHouse/issues/29693 Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

This does not include the user's private data, which looks like what OP is after as well.

Here's a small, crude Scrapy spider, with hardcoded values and all. You can set the value of `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files. It doesn't do upvotes or stories/links submitted (they have the type `story` in the response, as opposed to `text` for comments). You can easily tweak it.

I cleaned up the code a little bit, but I didn't test it. This will have the same limitation as the Python I posted earlier in that you're not authenticated.

I wrote a JS one years ago. It still seems to work, but it might need some more throttling. https://news.ycombinator.com/item?id=34110624 Edit: I see I added a sleep on line 83 a few years ago. Edit 2: I just fixed a big bug; I'm not sure if it was there before. Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated, so it's only useful for certain pages unless you add authentication. https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...

There are a few tests for this script, which isn't packaged: https://github.com/westurner/dlhn/ https://github.com/westurner/dlhn/tree/master/tests https://github.com/westurner/hnlog/blob/master/Makefile Ctrl-F of the one document in a browser tab works, but it isn't regex search (or `grep -i -C`) without a browser extension. Dogsheep / datasette has a SQLite query Web UI.

HackerNews/API: https://github.com/HackerNews/API https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c4... Fetches pages and then converts to JSON.

There might be an HN API now. I know they've wanted one, and I thought I might have seen posts more recently that made me think it now exists, but I haven't looked for it myself.

Hacker News has had an API since 2014[0]. It can be found via the "API" link at the bottom of the page[1].

That's a read-only, unauthenticated API, correct? In other words, it does not show how to get upvotes for a user, which are only visible to them.

Nothing out of the box. There's a copy of the data in BigQuery: https://console.cloud.google.com/bigquery?p=bigquery-public-... But the latest post is from Nov 2022, not sure if/when it gets reloaded.
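For reference, the HackerNews/API endpoints mentioned above are enough to pull all of a user's public data without Scrapy. A minimal sketch with plain `requests` (the username is a placeholder; it writes everything the user submitted to a single JSON file):

import json
import time

import requests

API = 'https://hacker-news.firebaseio.com/v0'  # https://github.com/HackerNews/API
USER = 'your_username_here'  # placeholder: set to your own username

# The user object lists the ids of everything the account has submitted.
user = requests.get(f'{API}/user/{USER}.json').json()

items = []
for item_id in user.get('submitted', []):
    item = requests.get(f'{API}/item/{item_id}.json').json()
    if item:  # skip anything that comes back empty (e.g. removed items)
        items.append(item)
    time.sleep(0.1)  # be polite to the API

with open(f'{USER}.json', 'w') as f:
    json.dump(items, f, indent=2)

As for the ClickHouse route from the first reply, the commands below download the clickhouse binary and then query the public play instance: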
curl https://clickhouse.com/ | sh
./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON
from pathlib import Path

import scrapy
import requests
import html
import json
import os

USER = 'Jugurtha'
LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'


class HNSpider(scrapy.Spider):
    name = "hn"

    def start_requests(self):
        os.makedirs('posts', exist_ok=True)  # output directory for the comment files
        # Fetch the user's profile once to get the ids of everything they submitted.
        submitted = requests.get(LINKS).json()['submitted']
        urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
        for url in urls:
            item = url.split('/item/')[1].split('.json')[0]
            filename = f'{item}.html'
            filepath = Path(f'posts/{filename}')
            if not os.path.exists(filepath):
                yield scrapy.Request(url=url, callback=self.parse)
            else:
                self.log(f'Skipping already downloaded {url}')

    def parse(self, response):
        # The item id is in the URL; the comment body is HTML-escaped in the `text` field.
        item = response.url.split('/item/')[1].split('.json')[0]
        filename = f'{item}.html'
        content = json.loads(response.text).get('text')
        if content is not None:
            text = html.unescape(content)
            filepath = Path(f'posts/{filename}')
            with open(filepath, 'w') as f:
                f.write(text)
            self.log(f'Saved file {filename}')
# Cleaned-up version of the spider above (untested).
from pathlib import Path

import scrapy
import requests
import html
import json
import os

# Set this:
USER = 'Jugurtha'

BASE_URL = 'https://hacker-news.firebaseio.com/v0'  # https://github.com/HackerNews/API
LINKS = f'{BASE_URL}/user/{USER}.json'


class HNSpider(scrapy.Spider):
    name = 'hn'

    def start_requests(self):
        os.makedirs('posts', exist_ok=True)  # output directory for the comment files
        items = requests.get(LINKS).json()['submitted']
        for item in items:
            url = f'{BASE_URL}/item/{item}.json'
            filepath = Path(f'posts/{item}.html')
            if os.path.exists(filepath):
                self.log(f'Skipping already downloaded {url}')
            else:
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        item = response.url.split('/item/')[1].split('.json')[0]
        filename = f'{item}.html'
        content = json.loads(response.text).get('text')
        if content:
            text = html.unescape(content)
            with open(Path(f'posts/{filename}'), 'w') as f:
                f.write(text)
            self.log(f'Saved file {filename}')
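None of the above covers upvotes, since those are only visible while logged in and aren't exposed by the API. A rough, untested sketch of the session-cookie approach, assuming the login form still posts `acct`/`pw` to /login and the upvoted list still lives at /upvoted?id=<user> (the pages come back as plain HTML that you'd still have to parse, and scraping the live site should be throttled):

import time

import requests

USER = 'your_username_here'      # placeholders: your HN username and password
PASSWORD = 'your_password_here'

session = requests.Session()
# Field names assumed from the current login form; a failed login just returns the form again.
session.post('https://news.ycombinator.com/login',
             data={'acct': USER, 'pw': PASSWORD, 'goto': 'news'})

# Walk the paginated "upvoted submissions" list and save the raw HTML.
# That page also links to a separate list of upvoted comments.
for page in range(1, 6):  # widen the range to cover your whole history
    resp = session.get('https://news.ycombinator.com/upvoted',
                       params={'id': USER, 'p': page})
    with open(f'upvoted_{page}.html', 'w') as f:
        f.write(resp.text)
    time.sleep(2)  # throttle, as the other comments suggest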