Ask HN: Is there a Hacker News takeout to export my comments / upvotes, etc.?
Like the title says, I'm wondering if there is an equivalent of Google Takeout for HN. Or how are you guys doing it? Thanks.
You can export the whole dataset as described here: https://github.com/ClickHouse/ClickHouse/issues/29693 Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

This does not include the user's private data, which looks like what OP is after as well.

Here's a small, crude Scrapy spider, with hardcoded values and all. You can set the value of `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files. It doesn't do upvotes or stories/links submitted (they have the type `story` in the response, as opposed to `text` for comments). You can easily tweak it.

I cleaned up the code a little bit, but I didn't test it. This will have the same limitation as the Python I posted earlier in that you're not authenticated.

I wrote a JS one years ago. It still seems to work, but it might need some more throttling. https://news.ycombinator.com/item?id=34110624 Edit: I see I added a sleep on line 83 a few years ago. Edit 2: I just fixed a big bug; I'm not sure if it was there before. Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated, so it's only useful for certain pages unless you add authentication. https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...

There are a few tests for this script, which isn't packaged: https://github.com/westurner/dlhn/ https://github.com/westurner/dlhn/tree/master/tests https://github.com/westurner/hnlog/blob/master/Makefile Ctrl-F of the one document in a browser tab works, but it isn't regex search (or `grep -i -C`) without a browser extension. Dogsheep / datasette has a SQLite query Web UI.

HackerNews/API: https://github.com/HackerNews/API https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c4... Fetches pages and then converts to JSON.

There might be an HN API now. I know they've wanted one, and I thought I might have seen posts more recently that made me think it now exists, but I haven't looked for it myself.

Hacker News has had an API since 2014[0]. It can be found via the "API" link at the bottom of the page[1].

That's a read-only, unauthenticated API, correct? In other words, it does not show how to get upvotes for a user, which are only visible to them.

Nothing out of the box. There's a copy of the data in BigQuery: https://console.cloud.google.com/bigquery?p=bigquery-public-... But the latest post is from Nov 2022, not sure if/when it gets reloaded.
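For reference, the HackerNews/API endpoints mentioned above are enough to pull all of a user's public data without Scrapy. A minimal sketch with plain `requests` (the username is a placeholder; it writes everything the user submitted to a single JSON file):

import json
import time

import requests

API = 'https://hacker-news.firebaseio.com/v0'  # https://github.com/HackerNews/API
USER = 'your_username_here'  # placeholder: set to your own username

# The user object lists the ids of everything the account has submitted.
user = requests.get(f'{API}/user/{USER}.json').json()

items = []
for item_id in user.get('submitted', []):
    item = requests.get(f'{API}/item/{item_id}.json').json()
    if item:  # skip anything that comes back empty (e.g. removed items)
        items.append(item)
    time.sleep(0.1)  # be polite to the API

with open(f'{USER}.json', 'w') as f:
    json.dump(items, f, indent=2)

As for the ClickHouse route from the first reply, the commands below download the clickhouse binary and then query the public play instance: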
curl https://clickhouse.com/ | sh
./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON
from pathlib import Path

import scrapy
import requests
import html
import json
import os

USER = 'Jugurtha'
LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'


class HNSpider(scrapy.Spider):
    name = "hn"

    def start_requests(self):
        os.makedirs('posts', exist_ok=True)  # output directory for the comment files
        # Fetch the user's profile once to get the ids of everything they submitted.
        submitted = requests.get(LINKS).json()['submitted']
        urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
        for url in urls:
            item = url.split('/item/')[1].split('.json')[0]
            filename = f'{item}.html'
            filepath = Path(f'posts/{filename}')
            if not os.path.exists(filepath):
                yield scrapy.Request(url=url, callback=self.parse)
            else:
                self.log(f'Skipping already downloaded {url}')

    def parse(self, response):
        # The item id is in the URL; the comment body is HTML-escaped in the `text` field.
        item = response.url.split('/item/')[1].split('.json')[0]
        filename = f'{item}.html'
        content = json.loads(response.text).get('text')
        if content is not None:
            text = html.unescape(content)
            filepath = Path(f'posts/{filename}')
            with open(filepath, 'w') as f:
                f.write(text)
            self.log(f'Saved file {filename}')
# Cleaned-up version of the spider above (untested).
from pathlib import Path

import scrapy
import requests
import html
import json
import os

# Set this:
USER = 'Jugurtha'

BASE_URL = 'https://hacker-news.firebaseio.com/v0'  # https://github.com/HackerNews/API
LINKS = f'{BASE_URL}/user/{USER}.json'


class HNSpider(scrapy.Spider):
    name = 'hn'

    def start_requests(self):
        os.makedirs('posts', exist_ok=True)  # output directory for the comment files
        items = requests.get(LINKS).json()['submitted']
        for item in items:
            url = f'{BASE_URL}/item/{item}.json'
            filepath = Path(f'posts/{item}.html')
            if os.path.exists(filepath):
                self.log(f'Skipping already downloaded {url}')
            else:
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        item = response.url.split('/item/')[1].split('.json')[0]
        filename = f'{item}.html'
        content = json.loads(response.text).get('text')
        if content:
            text = html.unescape(content)
            with open(Path(f'posts/{filename}'), 'w') as f:
                f.write(text)
            self.log(f'Saved file {filename}')
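None of the above covers upvotes, since those are only visible while logged in and aren't exposed by the API. A rough, untested sketch of the session-cookie approach, assuming the login form still posts `acct`/`pw` to /login and the upvoted list still lives at /upvoted?id=<user> (the pages come back as plain HTML that you'd still have to parse, and scraping the live site should be throttled):

import time

import requests

USER = 'your_username_here'      # placeholders: your HN username and password
PASSWORD = 'your_password_here'

session = requests.Session()
# Field names assumed from the current login form; a failed login just returns the form again.
session.post('https://news.ycombinator.com/login',
             data={'acct': USER, 'pw': PASSWORD, 'goto': 'news'})

# Walk the paginated "upvoted submissions" list and save the raw HTML.
# That page also links to a separate list of upvoted comments.
for page in range(1, 6):  # widen the range to cover your whole history
    resp = session.get('https://news.ycombinator.com/upvoted',
                       params={'id': USER, 'p': page})
    with open(f'upvoted_{page}.html', 'w') as f:
        f.write(resp.text)
    time.sleep(2)  # throttle, as the other comments suggest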