LLMinate - LLM CAPTCHA System



Detecting LLMs with Word Substrings: A Novel Approach to CAPTCHA

How a simple linguistic insight can help distinguish humans from AI


The Insight

Here's an observation: when asked to generate random nonsense words, humans and AI behave completely differently.

Humans mash keyboards: ksdjsdksdk, eokdfweinfhfejla, owiewewlkdfk

Large Language Models generate plausible-sounding nonsense: Vesperthrum, Nebulithic, Zorathium

Why? Because LLMs are trained on human language. Even when generating "random" words, they implicitly follow the patterns they have learned: combining real morphemes, respecting English phonotactics, and producing words that feel like they could exist.

The Algorithm

This insight leads to a simple detection algorithm:

  1. Ask the user to generate 3 nonsense words
  2. For each word, extract every substring of 5 or more characters
  3. Check if any substring appears in a dictionary of real English words
  4. If multiple words contain real-word substrings → likely an LLM
  5. If all words contain no real substrings → likely a human
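
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the linked proof of concept: the names `long_substrings`, `contains_real_word`, `classify`, and `TOY_DICTIONARY` are hypothetical, the 5-character minimum comes from the "5+ character" remark later in this essay, "multiple" is assumed to mean two or more, and the case of exactly one matching word (which the steps leave undefined) is reported as inconclusive. A real deployment would load a full English word list in place of the toy set.

```python
from typing import Iterable, Set

MIN_LEN = 5  # assumption: taken from the essay's "5+ character" remark


def long_substrings(word: str, min_len: int = MIN_LEN) -> Set[str]:
    """Return every contiguous substring of `word` with at least `min_len` characters."""
    w = word.lower()
    return {
        w[i:j]
        for i in range(len(w))
        for j in range(i + min_len, len(w) + 1)
    }


def contains_real_word(word: str, dictionary: Set[str]) -> bool:
    """True if any long substring of `word` is a dictionary entry."""
    return not long_substrings(word).isdisjoint(dictionary)


def classify(words: Iterable[str], dictionary: Set[str]) -> str:
    """Apply steps 4 and 5 to the submitted nonsense words."""
    hits = sum(contains_real_word(w, dictionary) for w in words)
    if hits >= 2:  # assumption: "multiple" means two or more
        return "likely LLM"
    if hits == 0:
        return "likely human"
    return "inconclusive"  # exactly one hit: not covered by the steps above


# Tiny stand-in for a real English word list (hypothetical).
TOY_DICTIONARY = {"vesper", "thrum", "lithic"}

print(classify(["Vesperthrum", "Nebulithic", "Zorathium"], TOY_DICTIONARY))
print(classify(["ksdjsdksdk", "eokdfweinfhfejla", "owiewewlkdfk"], TOY_DICTIONARY))
```

With this toy dictionary the first call prints "likely LLM" and the second prints "likely human". Using a set for the dictionary keeps each lookup constant-time, so the cost is dominated by the number of substrings, which grows quadratically with word length.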

Proof of concept code available here

Why This Works

The Human Advantage

Humans produce unstructured noise rather than language-like strings. When asked to make up words, we type patterns like:

  • Keyboard rows: asdfghjk
  • Random characters: kjsdhfkjsdhf
  • Repeated patterns: lolololol

Such strings rarely contain a real English word of 5 or more characters by chance.

The LLM Weakness

LLMs generate structured randomness. Even when trying to be creative, they:

  • Combine real morphemes: Vesper (evening) + thrum (humming)
  • Follow statistical patterns learned from billions of texts
  • Create words with realistic character transitions

These usually contain real-word fragments that a dictionary lookup can detect.
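
To see which fragments give a word away, a small helper can list the dictionary entries hiding inside it. This is a hedged sketch: `real_fragments` and `TOY_FRAGMENT_WORDS` are hypothetical names, the three-entry word set stands in for a real English dictionary, and the 5-character minimum is the threshold mentioned earlier in this essay.

```python
def real_fragments(word, dictionary, min_len=5):
    """List the dictionary words (min_len+ characters) found inside `word`."""
    w = word.lower()
    found = {
        w[i:j]
        for i in range(len(w))
        for j in range(i + min_len, len(w) + 1)
        if w[i:j] in dictionary
    }
    return sorted(found)


# Hypothetical stand-in for a full English word list.
TOY_FRAGMENT_WORDS = {"vesper", "thrum", "lithic"}

for word in ["Vesperthrum", "Nebulithic", "Zorathium", "kjsdhfkjsdhf"]:
    print(word, real_fragments(word, TOY_FRAGMENT_WORDS))
# Vesperthrum ['thrum', 'vesper']
# Nebulithic ['lithic']
# Zorathium []
# kjsdhfkjsdhf []
```

Against this toy set, Zorathium yields no fragment of 5 or more characters, which is one reason to judge a group of words together and not any single word.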
