GitHub - umairnadeem/hn-bot-detector

3 min read Original article ↗

Deployed at https://hn-bot-detector.vercel.app/

Detects LLM-generated comments on Hacker News. Paste a comment URL, ID, or raw text and get a score plus a breakdown of which signals fired.

Built after getting banned from HN for LLM-assisted posting. This is penance.

What it does

  • Comment lookup - paste an HN comment URL, ID, or raw text, get an instant bot score
  • Username analyzer - scores a user's last 50 comments
  • Post scanner (/post/[id]) - scans all comments on a post, sorted by bot score
  • Export - download results as JSON
  • Anthropic (Claude Haiku) and OpenAI (GPT-4o-mini) detection passes when API keys are set

How scoring works

Each comment is scored 0-100. Here's what we check:

Phrase detection

Uses character n-gram TF-IDF vectors rather than exact regex, so it catches paraphrases too (similarity threshold: 0.75). The phrase vocabulary includes:

  • Transition filler: "additionally", "furthermore", "moreover", "that being said", "having said that"
  • Formal hedges: "it is worth noting", "it is crucial", "needless to say", "one could argue"
  • Insight markers: "the real insight here", "the key takeaway", "the key unlock", "at its core"
  • Buzzwords: "leverage", "utilize", "robust", "seamless", "paradigm", "synergy", "delve into"

Structure

  • No contractions in 100+ word comment (+10)
  • Word count between 150-400 (+5)
  • Exactly 3 paragraphs, 1-2 sentences each (+20) - extremely common LLM output shape
  • Classic thesis/body/conclusion structure (+10)
  • Personal anecdotes present (-10)

Unicode signals

These are the subtle ones. Keyboards output straight ASCII quotes. LLMs output typographic Unicode quotes. If you see curly quotes in a plain-text HN comment, the text was written somewhere else and pasted in.

  • Curly/smart quotes " " ' ' (U+201C/D, U+2018/9) - +8 per occurrence, max +20
  • Em dash (U+2014) - +5 per, max +15
  • En-dash (U+2013) as separator - +5 per, max +15 (used when people prompt LLMs to replace em dashes to evade detection)
  • Rightwards arrow -> (U+2192) - +10 per, max +20

List patterns

  • Numbered lists 1. 2. 3. (+15, or +25 if 3+ items)
  • Examples always in threes: "for example X, Y, and Z" (+12)

False personal framing

LLMs fake personal experience before making generic claims:

  • "In practice, I've found..." (+8)
  • "In my experience, the..." (+8)
  • "The question is whether..." (+8)
  • "What remains to be seen..." (+8)

Timing (user-level)

  • More than 5 comments in 24 hours (+20)
  • More than 15 comments in 7 days (+15)
  • Average interval under 30 minutes (+15)

Semantic similarity (user-level)

LLM comments from the same user tend to cluster together in vector space. We compute pairwise TF-IDF cosine similarity across a user's comments:

  • Avg similarity > 0.4: +20 pts
  • Avg similarity > 0.6: +30 pts

LLM detection pass (optional)

Set ANTHROPIC_API_KEY or OPENAI_API_KEY in .env.local to run a Claude Haiku or GPT-4o-mini pass on top of the heuristics. Scores are weighted at 0.6 and 0.5 respectively.

Verdicts

Score Verdict
60-100 LIKELY BOT
30-59 POSSIBLY BOT
0-29 LIKELY HUMAN

Getting started

git clone https://github.com/umairnadeem/hn-bot-detector.git
cd hn-bot-detector
pnpm install
cp .env.example .env.local
pnpm dev

API

GET  /api/analyze/comment?id=<id>        score a single comment by HN ID or URL
POST /api/analyze/comment                score raw text (body: { text: string })
GET  /api/analyze/user?username=<user>   analyze last 50 comments for a user
GET  /api/analyze/post?id=<id>           analyze all comments on a post

All return JSON with full scoring breakdowns.

Stack

  • Next.js 14, TypeScript strict mode
  • Tailwind CSS, HN-style design
  • HN Algolia API
  • TF-IDF cosine similarity (no external dependencies)

Environment variables

Variable Description
ANTHROPIC_API_KEY Claude Haiku detection pass (recommended)
OPENAI_API_KEY GPT-4o-mini detection pass (optional)

License

MIT