What gets to the front page of Hacker News? A data project

randomshit.dev

6 points by itunpredictable 3 years ago · 9 comments

dredmorbius 3 years ago

So, oddly enough, I've also been looking at HN front-page characteristics, based on the same corpus (the "past" page links). And that whole section on caveats over what that archive represents is something I could have written... The front page, in both its dynamic and archived forms, is strongly subject to many influences in complex ways.

A couple of tips:

- It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.

- I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.

- I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I've expanded that to US and globally significant cities, and have been doing some tuple-based ngram analysis, though that gets pretty hairy.
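
A minimal sketch of the crawl-and-parse setup in the tips above, written in Python rather than wget and awk. The URL pattern (`https://news.ycombinator.com/front?day=YYYY-MM-DD`), the 30-second default delay, and the `Story` field names are assumptions for illustration, not the tooling actually used:

```python
import time
import urllib.request
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator, Optional

# Assumed URL pattern for the archived ("past") front page of a given day.
PAST_URL = "https://news.ycombinator.com/front?day={day}"


@dataclass
class Story:
    """One front-page row, using the fields listed above (names are hypothetical)."""
    day: date
    position: int          # story position (rank) on the page, 1..30
    site: Optional[str]    # submitted site, None when absent
    points: int
    comments: int
    submitter: str
    title: str


def past_urls(start: date, end: date) -> Iterator[str]:
    """Yield one archive URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield PAST_URL.format(day=day.isoformat())
        day += timedelta(days=1)


def crawl(start: date, end: date, delay_s: float = 30.0) -> Iterator[bytes]:
    """Fetch each day's page, sleeping between requests to stay polite."""
    for url in past_urls(start, end):
        with urllib.request.urlopen(url) as resp:
            yield resp.read()
        time.sleep(delay_s)
```

An incremental update only needs the days since the last crawl, e.g. `past_urls(date(2023, 6, 20), date(2023, 6, 21))`, which is why updates are quick compared with the initial run.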

For 2022 (most recent complete year), the top 40 submitted front-page sites are:

  2022:  Distinct sites:  6446

  Site                            Stories     Points (   mean  )  Comments (   mean  )
  ------------------------------  -------     ------ ----------   -------- ----------
  n/a                                 432     167275 (  386.32 )    125304 (  289.39 )
  youtube.com                         105      27243 (  257.01 )     12489 (  117.82 )
  nature.com                           80      17694 (  218.44 )     11716 (  144.64 )
  wikipedia.org                        68      12258 (  177.65 )      5855 (   84.86 )
  nytimes.com                          67      21190 (  311.62 )     21765 (  320.07 )
  arstechnica.com                      63      18319 (  286.23 )     12057 (  188.39 )
  ieee.org                             53       9432 (  174.67 )      5933 (  109.87 )
  reuters.com                          53      28360 (  525.19 )     29033 (  537.65 )
  theguardian.com                      49      12228 (  244.56 )      8677 (  173.54 )
  quantamagazine.org                   48      11293 (  230.47 )      5519 (  112.63 )
  science.org                          47      12485 (  260.10 )      7655 (  159.48 )
  economist.com                        46      12504 (  266.04 )     17324 (  368.60 )
  bloomberg.com                        43      20037 (  455.39 )     20630 (  468.86 )
  lwn.net                              43      10566 (  240.14 )      5912 (  134.36 )
  theverge.com                         43      16313 (  370.75 )     14335 (  325.80 )
  arxiv.org                            39       7415 (  185.38 )      3559 (   88.97 )
  washingtonpost.com                   39      15778 (  394.45 )     18117 (  452.93 )
  bbc.com                              37      11600 (  305.26 )      8696 (  228.84 )
  newyorker.com                        37       7577 (  199.39 )      6549 (  172.34 )
  wsj.com                              36      10920 (  295.14 )     11646 (  314.76 )
  wired.com                            35       9104 (  252.89 )      6738 (  187.17 )
  archive.org                          32       8011 (  242.76 )      4626 (  140.18 )
  gist.github.com                      32      10287 (  311.73 )      5456 (  165.33 )
  reddit.com                           30      12579 (  405.77 )      8457 (  272.81 )
  theregister.com                      29       8288 (  276.27 )      4586 (  152.87 )
  apple.com                            28      13245 (  456.72 )     12917 (  445.41 )
  github.blog                          26       8398 (  311.04 )      4242 (  157.11 )
  cnbc.com                             23       8568 (  357.00 )     10356 (  431.50 )
  phys.org                             23       4918 (  204.92 )      2380 (   99.17 )
  theatlantic.com                      23       7518 (  313.25 )     10643 (  443.46 )
  axios.com                            22       8903 (  387.09 )      8616 (  374.61 )
  news.mit.edu                         22       6181 (  268.74 )      2887 (  125.52 )
  smithsonianmag.com                   22       4964 (  215.83 )      2988 (  129.91 )
  stanford.edu                         22       8461 (  367.87 )      4720 (  205.22 )
  krebsonsecurity.com                  21       6299 (  286.32 )      3331 (  151.41 )
  microsoft.com                        21       7809 (  354.95 )      4392 (  199.64 )
  atlasobscura.com                     20       2789 (  132.81 )      1637 (   77.95 )
  cnn.com                              19       4704 (  235.20 )      4252 (  212.60 )
  righto.com                           19       2568 (  128.40 )       795 (   39.75 )
  simonwillison.net                    17       4878 (  271.00 )      1553 (   86.28 )

TechCrunch, BTW, lands at #41:

  techcrunch.com                       17       8681 (  482.28 )      8224 (  456.89 )

(The "mean" values are the arithmetic mean of points (votes) and comments by domain.)
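
A rough sketch of that per-domain aggregation, assuming each story has already been reduced to a (site, points, comments) tuple; the function name and the toy input are hypothetical. One thing worth checking: the means printed in the table appear to equal points divided by (stories + 1) rather than by stories (e.g. 27243 / 106 ≈ 257.01 for youtube.com), so either the story counts or the means may be off by one.

```python
from collections import defaultdict


def summarize_by_site(stories):
    """Aggregate (site, points, comments) tuples into per-site totals and means."""
    totals = defaultdict(lambda: [0, 0, 0])  # site -> [stories, points, comments]
    for site, points, comments in stories:
        row = totals[site]
        row[0] += 1
        row[1] += points
        row[2] += comments
    return {
        site: {
            "stories": n,
            "points": p,
            "mean_points": round(p / n, 2),
            "comments": c,
            "mean_comments": round(c / n, 2),
        }
        for site, (n, p, c) in totals.items()
    }


# Toy input, not real HN data.
sample = [("example.com", 100, 40), ("example.com", 50, 10), ("example.org", 30, 5)]
summary = summarize_by_site(sample)
print(summary["example.com"]["mean_points"])  # 75.0

# Cross-check against the youtube.com row above (105 stories, 27243 points):
print(round(27243 / 105, 2))  # 259.46
print(round(27243 / 106, 2))  # 257.01, the value printed in the table
```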

For 2023, there've only been 10 TechCrunch items (through 2023-6-21), well below trend:

  Ubuntu 22.04 LTS servers and phased apt updates
  Twitterrific has been discontinued
  DuckDB – An in-process SQL OLAP database management system
  Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
  Nearly 40% of software engineers will only work remotely
  Htmx 1.9.0 has been released
  Geometry Central: library of data structures, algorithms for geometry processing
  Google Authenticator now supports Google Account synchronization
  I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
  In New Paradox, Black Holes Appear to Evade Heat Death

I'll note that breaking stories down by site will tend to obscure categories, as frequently-submitted sites (say, NY Times) will crowd out many individual blogs. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.

One of the most surprising facts to jump out to me is how much nytimes.com has fallen since 2019. It had previously been in the top-4 submitted sites pretty consistently, and single top for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.

I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and all content submitted: <https://whaly.io/posts/hacker-news-2021-retrospective>

I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:

<https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>

That includes a number of findings (and testing/debugging notes), including: mentions of Reddit by year, mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help), trends in votes and comments by story position (interesting IMO), overall submission success rate (a hair under 3%), mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites), a look at the Leaders characteristics, what HN cares about being down, and, well, ... things: <https://toot.cat/@dredmorbius/110454128168815763>

________________________________

Notes:

* "Washington" can of course designate both a city and a state, amongst other things, and it turns out that the string is dominated by references to the Washington Post, much as "New York" is by the New York Times. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" put California well on top.
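
To illustrate the difference between the naive ranking and one that discounts newspaper names, here is a small sketch. The exclusion lists and sample titles are made up for the example; this is not the code behind the numbers above.

```python
import re

# Hypothetical phrases to discount before counting a place name.
EXCLUDE = {
    "Washington": ["Washington Post"],
    "New York": ["New York Times"],
}


def count_mentions(titles, name, exclude=()):
    """Return (naive, adjusted) counts of titles mentioning name as a whole word."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    naive = adjusted = 0
    for title in titles:
        if not pattern.search(title):
            continue
        naive += 1
        # Strip the excluded phrases, then see whether the name still appears.
        stripped = title
        for phrase in exclude:
            stripped = stripped.replace(phrase, "")
        if pattern.search(stripped):
            adjusted += 1
    return naive, adjusted


# Made-up titles for illustration only.
titles = [
    "Washington Post launches new app",
    "Washington state passes privacy law",
    "New York Times on quantum computing",
]
print(count_mentions(titles, "Washington", EXCLUDE["Washington"]))  # (2, 1)
```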

Edits: Some in situ updates as I think of things. Sorry!

  • dredmorbius 3 years ago

    And for an overall activity summary:

      Years:                16
      Start year:           2007
      End year:             2023
      Total Stories:        178882
      Distinct Sites:       52642
      Distinct Submitters:  43648
    
      Year  Stories     Points (   mean )  Comments (   mean )   Sites  Submitters
      ----  -------     ------ ---------   -------- ---------   -----  ----------
      2007     9382      92264 (   9.83 )     61207 (   6.52 )   2644        1163
      2008    10980     294775 (  26.85 )    186339 (  16.97 )   3458        2021
      2009    10950     608603 (  55.58 )    303962 (  27.76 )   4157        3017
      2010    10950    1062763 (  97.06 )    491718 (  44.91 )   4397        3987
      2011    10949    1657004 ( 151.34 )    632724 (  57.79 )   4830        4889
      2012    10980    1829402 ( 166.61 )    778634 (  70.91 )   5047        5537
      2013    10950    2132819 ( 194.78 )    998387 (  91.18 )   5250        5975
      2014    10905    2057628 ( 188.69 )    916438 (  84.04 )   5338        5920
      2015    10950    2001269 ( 182.76 )    845719 (  77.23 )   5301        5313
      2016    10977    2521394 ( 229.70 )   1137575 ( 103.63 )   5238        5341
      2017    10950    2776064 ( 253.52 )   1259899 ( 115.06 )   5236        5274
      2018    10950    2762928 ( 252.32 )   1262654 ( 115.31 )   4986        5038
      2019    10950    3051011 ( 278.63 )   1447141 ( 132.16 )   5302        5123
      2020    10980    3338150 ( 304.02 )   1734703 ( 157.99 )   5938        5564
      2021    10950    3376829 ( 308.39 )   1859933 ( 169.86 )   6178        5339
      2022    10950    3308025 ( 302.10 )   1986265 ( 181.39 )   6446        5443
      2023     5160    1555335 ( 301.42 )    879401 ( 170.43 )   3253        2851
    
    Data through 2023-6-21.
  • gagejustins 3 years ago

    Oh man, this is awesome. I learned a lot from collecting this data and one of the big takeaways for me was how diverse the set of news sources on HN is (to your point, very little "traditional" journalism here). Glad you're doing this!

    • dredmorbius 3 years ago

      Drop me a line if you'd like to discuss this / share w/ reports. username at Protonmail.

      I'm sort of a Can Haz All The Tables sort of guy, and I'm largely processing via awk (and a few other shell tools). So pasting that here would get a bit tedious...

      It's also been interesting to look at how HN has, and hasn't, changed over the years. Your categorical analysis would be an interesting filter to look at over time, especially regarding accusations that HN is drifting in various directions.

      The other bit that stands out to me is how constrained a set the front page is (30 slots per day, 10,950 per year, 10,980 in a leap year), as well as how thin submission titles are for gleaning meaning and context (I'm ... somewhat frustrated by this). Though there is clearly signal that gets through.
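
      The slot arithmetic is easy to sanity-check. A quick sketch (the function name is hypothetical), which also reproduces the 5,160 stories in the 2023 row of the summary table above (data through 2023-6-21):

      ```python
      from datetime import date

      SLOTS_PER_DAY = 30  # stories on each archived front page


      def expected_slots(start: date, end: date) -> int:
          """Front-page slots from start to end, inclusive."""
          return ((end - start).days + 1) * SLOTS_PER_DAY


      print(expected_slots(date(2022, 1, 1), date(2022, 12, 31)))  # 10950
      print(expected_slots(date(2020, 1, 1), date(2020, 12, 31)))  # 10980
      print(expected_slots(date(2023, 1, 1), date(2023, 6, 21)))   # 5160
      ```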

      I don't have time-of-day granularity, but can look at day-of-week (and have) and month-of-year (not yet) looking for seasonality. DoW has been interesting (usually peaks Tue/Wed, starts trailing off on Fri, Sat & Sun are low points, based on votes/comments, but give higher odds of a given submission landing).

      You might want to look at Whaly's work as well (I'd edited it into my larger top-level comment above: <https://whaly.io/posts/hacker-news-2021-retrospective>).

      • itunpredictableOP 3 years ago

        I should mention that I clicked on every single link to see the contents before classifying it, which is part of what made this so tedious

        • dredmorbius 3 years ago

          So, some further thoughts on your methodology:

          - It's comprehensive. That's ... admirable, but not necessarily efficient in data analysis. There's a lot to be said for both random sampling and inference.

          - You might get more mileage by looking at the top-n stories of a given day. I'd suggest 3--5 items. There's a considerable fall-off in activity from storypos 1 to storypos 30 (1st to 30th items on the front page archive), which is one of the dimensions I've looked at.

          - The thought that's occurred to me over the past few days is that this seems like a natural area in which LLM / GPT techniques might be used to classify posts given training data.

          - Tuple and ngram analysis can also turn up interesting patterns. Here it's useful to have a base corpus from which universal tendencies can be inferred, and to look at statistically improbable terms in the HN subject corpus relative to the universal corpus (terms and phrases which HN finds significant), as well as changing trends over time within the HN corpus.

          - Day-of-week and month-of-year analysis can also show interesting patterns, and I've looked at a bit of the first. I'd really like to know if there's an HN "September" (on an annual basis).

          - I took a look at your data and ... spreadsheets. Maybe I'm old-school, but flatfiles and gawk are really my style.

        • dredmorbius 3 years ago

          There's a thin line between dedication and mania.

          I'd probably manually classify domains by topic.

          The top 100 domains appear at least 138 times each.

          149 domains appear >= 100 times.

          The top 500 domains appear >= 35x each. (Number 500 is a personal fave, lowtechmagazine.com).

          The top 1,000 sites, >= 17x each.

          14,676 sites appear more than once.

          37,966 sites appear only once.

          25% of FP stories come from 31 sites appearing 400+ times each.

          50% of FP stories come from 331 sites appearing 51+ times each.

          75%: 2,521 sites, 7+ times.

          90%: 7,749 sites, 3+ times.

          95%: 11,173 sites, 2+ times.

          99%: 13,992 sites, 2+ times.

          Pick the degree of completeness you want (your 5% "misc" would require classifying slightly more than 11,000 sites).

          I'd probably aim for 50--75% coverage.

          OK, while writing this, I've classified about 10,200 (of 52,642) domains. (most of the first 300 manually, a bunch of the rest based on regexes, e.g., .edu, .gov, blogspot, medium.com, substack.com domains, etc.).
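
          A minimal sketch of that manual-plus-regex approach. The rule-to-label mapping is an assumption for illustration (only the pattern targets .edu, .gov, blogspot, medium.com, and substack.com come from the description above), and the real classification was done with awk and shell tools rather than Python:

          ```python
          import re

          # Hypothetical rules, checked in order; first match wins.
          RULES = [
              (re.compile(r"\.edu$"), "academic / science"),
              (re.compile(r"\.gov$"), "government"),
              (re.compile(r"(^|\.)(blogspot\.com|medium\.com|substack\.com)$"), "blog"),
          ]

          # Hand-assigned labels take priority over the regex rules.
          MANUAL = {"nytimes.com": "general news"}


          def classify(domain):
              """Return a label for domain, or None if it stays unclassified."""
              domain = domain.lower()
              if domain in MANUAL:
                  return MANUAL[domain]
              for pattern, label in RULES:
                  if pattern.search(domain):
                      return label
              return None


          for d in ["stanford.edu", "example.substack.com", "nytimes.com", "righto.com"]:
              print(d, "->", classify(d))
          ```

          Anything that returns None lands in the "Unclassified" bucket.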

          By site:

               1   7621  software
               2   1710  blog
               3    535  academic / science
               4    123  government
               5     41  general news
               6     34  ???
               7     31  corporate comm.
               8     30  tech news
               9     15  general interest
              10     10  business news
              11      8  law
              12      6  technology
              13      4  social media
              14      3  corporate comm
              15      3  general magazine
              16      2  general information
              17      2  science news
              18      2  tech discussion
              19      2  video
              20      1  business education
              21      1  corporate comm. 
              22      1  corporate commm.
              23      1  general discussion
              24      1  health news
              25      1  images
              26      1  law 
              27      1  legal news
              28      1  misc
              29      1  n/a
              30      1  podcast
              31      1  tech blog
              32      1  tech law
              33      1  tech publications
              34      1  technology / security
              35      1  translation
              36      1  videos
              37      1  webcomic
            
            Unclassified: 42442
          
          
          By story count ...

               1  13782  general news
               2  13398  software
               3  10473  tech news
               4   8677  blog
               5   7651  academic / science
               6   7294  n/a
               7   4750  ???
               8   4600  business news
               9   3546  corporate comm.
              10   1504  general magazine
              11   1291  general information
              12   1162  general interest
              13   1132  technology
              14   1099  videos
              15   1073  social media
              16    975  government
              17    568  corporate comm
              18    559  tech discussion
              19    505  tech law
              20    251  tech publications
              21    171  tech blog
              22    170  science news
              23    136  business education
              24    104  corporate comm. 
              25    103  video
              26     99  corporate commm.
              27     96  general discussion
              28     80  misc
              29     71  technology / security
              30     61  law 
              31     59  webcomic
              32     49  translation
              33     48  health news
              34     47  images
              35     46  podcast
              36     32  law
              37      7  legal news
            
            Unclassified: 93213
          
          '???' indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

          Note that I'm classifying by site rather than story, so an NY Times item on, say, quantum computing, would fall under "general news".

          Also, very quick ad hoc code here, there are assuredly errors (and I've already fixed a few in stealth edits to this comment).

          • dredmorbius 3 years ago

            Having played with classifying sites for much of the past day, I've assigned a classification to just under 30% of them, which classifies just under 64% of all posts.

            The remaining unclassified sites average about 1.7 posts each (there are a few with as many as 20 posts), but there are minimal gains for additional classification.

            I'm starting now with running an analysis over the full archive to come up with trends-by-classification over years.

            The top-20 classifications (by story) are:

                 1  64777  36.21%  UNCLASSIFIED
                 2  22481  12.57%  blog
                 3  15106   8.44%  general news
                 4  13769   7.70%  tech news
                 5  12709   7.10%  programming
                 6   8459   4.73%  academic / science
                 7   8200   4.58%  corporate comm.
                 8   7294   4.08%  n/a
                 9   5311   2.97%  business news
                10   3798   2.12%  general interest
                11   2151   1.20%  social media
                12   2048   1.14%  software
                13   1613   0.90%  technology
                14   1432   0.80%  video
                15   1144   0.64%  general information (wiki)
                16   1006   0.56%  government
                17    724   0.40%  misc documents
                18    720   0.40%  law
                19    702   0.39%  tech discussion
                20    620   0.35%  science news
            
            I've got a total of 60 classifications which ... seems a bit high, and I'm looking at ways of slimming that down. It's also a bit confused, as some is classified by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites, and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.
          • dredmorbius 3 years ago

            For those interested in the Ongoing Saga of HN Front Page Analytics, I've been posting occasional updates to the above site-based classification (~60% of posts now classified) to the Fediverse: <https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>

            (It's a bit much to dump massive tables to HN, I'm trying to keep that to a bearable minimum.)
