What gets to the front page of Hacker News? A data project
randomshit.dev
So, oddly enough, I've also been looking at HN front-page characteristics, based on the same corpus (the "past" page links). And that whole section on caveats over what that archive represents is something I could have written... The front page, both in its dynamic and archived forms is strongly subject to many influences in complex ways.
A couple of tips:
- It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.
- I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.
- I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I've expanded that to US and globally significant cities, and have been doing some tuple-based ngram analysis, though that gets pretty hairy.
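For anyone who'd rather script the crawl than drive wget, here is a minimal sketch of the same idea: one request per archived day, with a pause between requests. The URL pattern, the 30-second delay, and the function names are my assumptions, not details from the crawl described above.

```python
import time
import urllib.request
from datetime import date, timedelta

# Assumption: each day's archived front page is served at this URL pattern.
BASE_URL = "https://news.ycombinator.com/front?day={day}"


def day_urls(start, end):
    """Yield (day, url) pairs for every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, BASE_URL.format(day=day.isoformat())
        day += timedelta(days=1)


def crawl(start, end, delay_seconds=30.0, fetch=None):
    """Fetch each day's page, sleeping between requests. Returns {day: html}.

    `fetch` is a hypothetical hook so the loop can be exercised offline.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode("utf-8", errors="replace")
    pages = {}
    for day, url in day_urls(start, end):
        pages[day] = fetch(url)
        time.sleep(delay_seconds)  # be polite: one request, then wait
    return pages
```

A daily update is then just `crawl(date.today(), date.today())`, which is why updates finish in seconds once the backlog is done.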
For 2022 (most recent complete year), the top 40 submitted front-page sites are:
2022: Distinct sites: 6446
Site Stories Points ( mean ) Comments ( mean )
------------------------------ ------- ------ ---------- -------- ----------
n/a 432 167275 ( 386.32 ) 125304 ( 289.39 )
youtube.com 105 27243 ( 257.01 ) 12489 ( 117.82 )
nature.com 80 17694 ( 218.44 ) 11716 ( 144.64 )
wikipedia.org 68 12258 ( 177.65 ) 5855 ( 84.86 )
nytimes.com 67 21190 ( 311.62 ) 21765 ( 320.07 )
arstechnica.com 63 18319 ( 286.23 ) 12057 ( 188.39 )
ieee.org 53 9432 ( 174.67 ) 5933 ( 109.87 )
reuters.com 53 28360 ( 525.19 ) 29033 ( 537.65 )
theguardian.com 49 12228 ( 244.56 ) 8677 ( 173.54 )
quantamagazine.org 48 11293 ( 230.47 ) 5519 ( 112.63 )
science.org 47 12485 ( 260.10 ) 7655 ( 159.48 )
economist.com 46 12504 ( 266.04 ) 17324 ( 368.60 )
bloomberg.com 43 20037 ( 455.39 ) 20630 ( 468.86 )
lwn.net 43 10566 ( 240.14 ) 5912 ( 134.36 )
theverge.com 43 16313 ( 370.75 ) 14335 ( 325.80 )
arxiv.org 39 7415 ( 185.38 ) 3559 ( 88.97 )
washingtonpost.com 39 15778 ( 394.45 ) 18117 ( 452.93 )
bbc.com 37 11600 ( 305.26 ) 8696 ( 228.84 )
newyorker.com 37 7577 ( 199.39 ) 6549 ( 172.34 )
wsj.com 36 10920 ( 295.14 ) 11646 ( 314.76 )
wired.com 35 9104 ( 252.89 ) 6738 ( 187.17 )
archive.org 32 8011 ( 242.76 ) 4626 ( 140.18 )
gist.github.com 32 10287 ( 311.73 ) 5456 ( 165.33 )
reddit.com 30 12579 ( 405.77 ) 8457 ( 272.81 )
theregister.com 29 8288 ( 276.27 ) 4586 ( 152.87 )
apple.com 28 13245 ( 456.72 ) 12917 ( 445.41 )
github.blog 26 8398 ( 311.04 ) 4242 ( 157.11 )
cnbc.com 23 8568 ( 357.00 ) 10356 ( 431.50 )
phys.org 23 4918 ( 204.92 ) 2380 ( 99.17 )
theatlantic.com 23 7518 ( 313.25 ) 10643 ( 443.46 )
axios.com 22 8903 ( 387.09 ) 8616 ( 374.61 )
news.mit.edu 22 6181 ( 268.74 ) 2887 ( 125.52 )
smithsonianmag.com 22 4964 ( 215.83 ) 2988 ( 129.91 )
stanford.edu 22 8461 ( 367.87 ) 4720 ( 205.22 )
krebsonsecurity.com 21 6299 ( 286.32 ) 3331 ( 151.41 )
microsoft.com 21 7809 ( 354.95 ) 4392 ( 199.64 )
atlasobscura.com 20 2789 ( 132.81 ) 1637 ( 77.95 )
cnn.com 19 4704 ( 235.20 ) 4252 ( 212.60 )
righto.com 19 2568 ( 128.40 ) 795 ( 39.75 )
simonwillison.net 17 4878 ( 271.00 ) 1553 ( 86.28 )
TechCrunch, BTW, lands at #41:
techcrunch.com 17 8681 ( 482.28 ) 8224 ( 456.89 )
(The "mean" values are the arithmetic mean of points (votes) and comments by domain.)
For 2023, there've only been 10 TechCrunch items (through 21-6-2023), well below trend:
Ubuntu 22.04 LTS servers and phased apt updates
Twitterrific has been discontinued
DuckDB – An in-process SQL OLAP database management system
Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
Nearly 40% of software engineers will only work remotely
Htmx 1.9.0 has been released
Geometry Central: library of data structures, algorithms for geometry processing
Google Authenticator now supports Google Account synchronization
I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
In New Paradox, Black Holes Appear to Evade Heat Death
I'll note that breaking stories down by site will tend to obscure categories, as frequently-submitted sites (say, NY Times) will crowd out many individual blogs. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.
One of the most surprising facts to jump out to me is how much nytimes.com has fallen since 2019. It had previously been in the top-4 submitted sites pretty consistently, and single top for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.
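The per-site breakdown in the 2022 table (story count, total and mean points, total and mean comments) could be computed roughly like this. It's a sketch in Python rather than the awk actually used, and the tuple layout and function name are assumptions for illustration.

```python
from collections import defaultdict


def site_summary(stories):
    """Aggregate front-page stories by submitted site.

    stories: iterable of (site, points, comments); site may be None or ""
    for text posts, which are grouped under "n/a" as in the table.
    Returns rows of (site, stories, points, mean_points, comments,
    mean_comments), sorted by story count (descending), then site name.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [stories, points, comments]
    for site, points, comments in stories:
        row = totals[site or "n/a"]
        row[0] += 1
        row[1] += points
        row[2] += comments
    rows = [(site, n, p, p / n, c, c / n) for site, (n, p, c) in totals.items()]
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows
```

Sorting on story count first is what lets a handful of heavily-submitted news sites crowd individual blogs out of the top of the list.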
I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and all content submitted: <https://whaly.io/posts/hacker-news-2021-retrospective>
I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:
<https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>
That includes a number of findings (and testing/debugging notes), including: mentions of Reddit by year, mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help), trends in votes and comments by story position (interesting IMO), overall submission success rate (a hair under 3%), mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites), a look at the Leaders characteristics, what HN cares about being down, and, well, ... things: <https://toot.cat/@dredmorbius/110454128168815763>
________________________________
Notes:
* "Washington" can of course designate both a city and a state, amongst other things, and it turns out that the string is dominated by references to the Washington Post, much as "New York" is by the New York Times. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" put California well on top.
Edits: Some in situ updates as I think of things. Sorry!
And for an overall activity summary:
Data through 2023-6-21.
Years: 16 Start year: 2007 End year: 2023
Total Stories: 178882 Distinct Sites: 52642 Distinct Submitters: 43648
Year Stories  Points (   mean ) Comments (   mean ) Sites Submitters
---- ------- ------- ---------- -------- ---------- ----- ----------
2007    9382   92264 (   9.83 )    61207 (   6.52 )  2644       1163
2008   10980  294775 (  26.85 )   186339 (  16.97 )  3458       2021
2009   10950  608603 (  55.58 )   303962 (  27.76 )  4157       3017
2010   10950 1062763 (  97.06 )   491718 (  44.91 )  4397       3987
2011   10949 1657004 ( 151.34 )   632724 (  57.79 )  4830       4889
2012   10980 1829402 ( 166.61 )   778634 (  70.91 )  5047       5537
2013   10950 2132819 ( 194.78 )   998387 (  91.18 )  5250       5975
2014   10905 2057628 ( 188.69 )   916438 (  84.04 )  5338       5920
2015   10950 2001269 ( 182.76 )   845719 (  77.23 )  5301       5313
2016   10977 2521394 ( 229.70 )  1137575 ( 103.63 )  5238       5341
2017   10950 2776064 ( 253.52 )  1259899 ( 115.06 )  5236       5274
2018   10950 2762928 ( 252.32 )  1262654 ( 115.31 )  4986       5038
2019   10950 3051011 ( 278.63 )  1447141 ( 132.16 )  5302       5123
2020   10980 3338150 ( 304.02 )  1734703 ( 157.99 )  5938       5564
2021   10950 3376829 ( 308.39 )  1859933 ( 169.86 )  6178       5339
2022   10950 3308025 ( 302.10 )  1986265 ( 181.39 )  6446       5443
2023    5160 1555335 ( 301.42 )   879401 ( 170.43 )  3253       2851
Oh man, this is awesome. I learned a lot from collecting this data and one of the big takeaways for me was how diverse the set of news sources on HN is (to your point, very little "traditional" journalism here). Glad you're doing this!
Drop me a line if you'd like to discuss this / share w/ reports. username at Protonmail.
I'm sort of a Can Haz All The Tables sort of guy, and I'm largely processing via awk (and a few other shell tools). So pasting that here would get a bit tedious...
It's also been interesting to look at how HN has, and hasn't, changed over the years. Your categorical analysis would be an interesting filter to look at over time, especially regarding accusations that HN is drifting in various directions.
The other bit that stands out to me is how constrained a set the front page is (30 slots per day, 10,950 per year, 10,980 in a leap year), as well as how thin submission titles are for gleaning meaning and context (I'm ... somewhat frustrated by this). Though there is clearly signal that gets through.
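The slot arithmetic is simple enough to check directly. A tiny sketch (the function name is mine):

```python
import calendar

SLOTS_PER_DAY = 30  # one archived front page of 30 stories per day


def front_page_slots(year):
    """Maximum number of archived front-page slots in a full calendar year."""
    days = 366 if calendar.isleap(year) else 365
    return SLOTS_PER_DAY * days
```

`front_page_slots(2022)` gives 10950 and `front_page_slots(2020)` gives 10980. The yearly summary shows a few years slightly under that ceiling (10,905 in 2014, for example); the thread doesn't say why.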
I don't have time-of-day granularity, but can look at day-of-week (and have) and month-of-year (not yet) looking for seasonality. DoW has been interesting (usually peaks Tue/Wed, starts trailing off on Fri, Sat & Sun are low points, based on votes/comments, but give higher odds of a given submission landing).
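A day-of-week breakdown of that sort might look like the following sketch, assuming each story carries the date of the front page it appeared on. The record layout and names are illustrative only.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def mean_by_weekday(stories):
    """Mean points and comments per weekday.

    stories: iterable of (day, points, comments), with day a datetime.date.
    Returns {weekday_name: (mean_points, mean_comments)} for the weekdays
    that actually occur in the input.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # [stories, points, comments]
    for day, points, comments in stories:
        row = sums[DAY_NAMES[day.weekday()]]  # Monday is 0
        row[0] += 1
        row[1] += points
        row[2] += comments
    return {name: (p / n, c / n) for name, (n, p, c) in sums.items()}
```

Swapping `day.weekday()` for `day.month` would give the month-of-year view with the same structure.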
You might want to look at Whaly's work as well (I'd edited it into my larger top-level comment above: <https://whaly.io/posts/hacker-news-2021-retrospective>).
I should mention that I clicked on every single link to see the contents before classifying it, which is part of what made this so tedious.
So, some further thoughts on your methodology:
- It's comprehensive. That's ... admirable, but not necessarily efficient in data analysis. There's a lot to be said for both random sampling and inference.
- You might get more mileage by looking at the top-n stories of a given day. I'd suggest 3--5 items. There's a considerable fall-off in activity from storypos 1 to storypos 30 (1st to 30th items on the front page archive), which is one of the dimensions I've looked at.
- The thought that's occurred to me over the past few days is that this seems like a natural area in which LLM / GPT techniques might be used to classify posts given training data.
- Tuple and ngram analysis can also turn up interesting patterns. Here it's useful to have a base corpus from which universal tendencies can be inferred, and to look for statistically improbable terms: those which occur far more often in the HN corpus than in the universal corpus (terms and phrases which HN finds significant), as well as those whose frequency changes over time within the HN corpus.
- Day-of-week and month-of-year analysis can also show interesting patterns, and I've looked at a bit of the first. I'd really like to know if there's an HN "September" (on an annual basis).
- I took a look at your data and ... spreadsheets. Maybe I'm old-school, but flatfiles and gawk are really my style.
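To make the "statistically improbable terms" idea concrete, here is one possible sketch: count ngrams in HN titles and in a base corpus, then rank by the log-ratio of their smoothed relative frequencies. The scoring method (add-one smoothing plus log-ratio) is my choice for illustration, not something taken from the analysis above, and whitespace tokenising is deliberately naive.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Lower-cased word ngrams from a whitespace-split string."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def overrepresented(hn_titles, base_texts, n=2, top=10):
    """Rank ngrams by how much more frequent they are in HN titles.

    Returns up to `top` (ngram, log_ratio) pairs, highest ratio first.
    """
    hn = Counter(g for t in hn_titles for g in ngrams(t, n))
    base = Counter(g for t in base_texts for g in ngrams(t, n))
    hn_total = sum(hn.values())
    base_total = sum(base.values())
    vocab = len(set(hn) | set(base))
    scores = {}
    for gram, count in hn.items():
        # Add-one smoothing so ngrams absent from the base corpus stay finite.
        p_hn = (count + 1) / (hn_total + vocab)
        p_base = (base[gram] + 1) / (base_total + vocab)
        scores[gram] = math.log(p_hn / p_base)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

Running the same function per year, with earlier years as the base corpus, would be one way to look at changing trends within the HN corpus itself.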
There's a thin line between dedication and mania.
I'd probably manually classify domains by topic.
The top 100 domains appear at least 138 times each.
149 domains appear >= 100 times.
The top 500 domains appear >= 35x each. (Number 500 is a personal fave, lowtechmagazine.com).
The top 1,000 sites, >= 17x each.
14,676 sites appear more than once.
37,966 sites appear only once.
25% of FP stories come from 31 sites appearing 400+ times each.
50% of FP stories come from 331 sites appearing 51+ times each.
75%: 2,521 sites, 7+ times.
90%: 7,749 sites, 3+ times.
95%: 11,173 sites, 2+ times.
99%: 13,992 sites, 2+ times.
Pick the degree of completeness you want (your 5% "misc" would require classifying slightly more than 11,000 sites).
I'd probably aim for 50--75% coverage.
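The coverage figures above come from a cumulative count over sites sorted by frequency. A sketch of that calculation, with a hypothetical function name and a plain dict of per-site story counts as input:

```python
def sites_needed(site_counts, coverage):
    """How many of the most frequent sites cover a given share of stories.

    site_counts: {site: number of front-page stories}.
    coverage: fraction of all stories to cover, e.g. 0.5.
    Returns (number_of_sites, appearances_of_the_last_site_included).
    """
    counts = sorted(site_counts.values(), reverse=True)
    target = coverage * sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= target:
            return rank, count
    return len(counts), (counts[-1] if counts else 0)
```

The first number is how many sites you'd have to classify by hand; the second is the "N+ times each" threshold quoted in the list.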
OK, while writing this, I've classified about 10,200 (of 52,642) domains: most of the first 300 manually, a bunch of the rest based on regexes (e.g., .edu, .gov, blogspot, medium.com, substack.com domains, etc.).
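A regex pass of that kind can be sketched as below. The patterns follow the examples named above, but the labels attached to each pattern and the `manual` override table are my assumptions; the actual rules aren't given in the thread.

```python
import re

# Hypothetical rules; the first matching pattern wins.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)(blogspot\.com|medium\.com|substack\.com)$"), "blog"),
]


def classify(domain, manual=None):
    """Return a category for a domain, or None if nothing matches.

    manual: optional {domain: category} of hand-assigned labels, checked
    before the regex rules so manual work always takes precedence.
    """
    domain = domain.lower()
    if manual and domain in manual:
        return manual[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return None
```

Anchoring each pattern at the end of the domain keeps, say, a hypothetical `notmedium.com` from being swept into "blog" by accident.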
By site:
 1  7621 software
 2  1710 blog
 3   535 academic / science
 4   123 government
 5    41 general news
 6    34 ???
 7    31 corporate comm.
 8    30 tech news
 9    15 general interest
10    10 business news
11     8 law
12     6 technology
13     4 social media
14     3 corporate comm
15     3 general magazine
16     2 general information
17     2 science news
18     2 tech discussion
19     2 video
20     1 business education
21     1 corporate comm.
22     1 corporate commm.
23     1 general discussion
24     1 health news
25     1 images
26     1 law
27     1 legal news
28     1 misc
29     1 n/a
30     1 podcast
31     1 tech blog
32     1 tech law
33     1 tech publications
34     1 technology / security
35     1 translation
36     1 videos
37     1 webcomic
Unclassified: 42442
By story count ...
 1 13782 general news
 2 13398 software
 3 10473 tech news
 4  8677 blog
 5  7651 academic / science
 6  7294 n/a
 7  4750 ???
 8  4600 business news
 9  3546 corporate comm.
10  1504 general magazine
11  1291 general information
12  1162 general interest
13  1132 technology
14  1099 videos
15  1073 social media
16   975 government
17   568 corporate comm
18   559 tech discussion
19   505 tech law
20   251 tech publications
21   171 tech blog
22   170 science news
23   136 business education
24   104 corporate comm.
25   103 video
26    99 corporate commm.
27    96 general discussion
28    80 misc
29    71 technology / security
30    61 law
31    59 webcomic
32    49 translation
33    48 health news
34    47 images
35    46 podcast
36    32 law
37     7 legal news
Unclassified: 93213
'???' indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.
Note that I'm classifying by site rather than story, so an NY Times item on, say, quantum computing, would fall under "general news".
Also, very quick ad hoc code here, there are assuredly errors (and I've already fixed a few in stealth edits to this comment).
Having played with classifying sites for much of the past day, I've assigned a classification to just under 30% of them, which classifies just under 64% of all posts.
The remaining unclassified sites average about 1.7 posts each (there are a few with as many as 20 posts), but there are minimal gains for additional classification.
I'm starting now with running an analysis over the full archive to come up with trends-by-classification over years.
The top-20 classifications (by story) are:
 1 64777 36.21% UNCLASSIFIED
 2 22481 12.57% blog
 3 15106  8.44% general news
 4 13769  7.70% tech news
 5 12709  7.10% programming
 6  8459  4.73% academic / science
 7  8200  4.58% corporate comm.
 8  7294  4.08% n/a
 9  5311  2.97% business news
10  3798  2.12% general interest
11  2151  1.20% social media
12  2048  1.14% software
13  1613  0.90% technology
14  1432  0.80% video
15  1144  0.64% general information (wiki)
16  1006  0.56% government
17   724  0.40% misc documents
18   720  0.40% law
19   702  0.39% tech discussion
20   620  0.35% science news
I've got a total of 60 classifications which ... seems a bit high, and I'm looking at ways of slimming that down. It's also a bit confused, as some is classified by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites, and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.
For those interested in the Ongoing Saga of HN Front Page Analytics, I've been posting occasional updates to the above site-based classification (~60% of posts now classified) to the Fediverse: <https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>
(It's a bit much to dump massive tables to HN, I'm trying to keep that to a bearable minimum.)