Detecting LLMs with Word Substrings: A Novel Approach to CAPTCHA
How a simple linguistic insight can help distinguish humans from AI
The Insight
Here's an observation: when asked to generate random nonsense words, humans and AI behave completely differently.
Humans mash keyboards: ksdjsdksdk, eokdfweinfhfejla, owiewewlkdfk
Large Language Models generate plausible-sounding nonsense: Vesperthrum, Nebulithic, Zorathium
Why? Because LLMs are trained on human language. Even when generating "random" words, they implicitly follow the patterns they have learned: combining real morphemes, respecting English phonotactics, and creating words that feel like they could exist.
The Algorithm
This insight leads to a simple detection algorithm:
- Ask the user to generate 3 nonsense words
- For each word, extract all possible substrings
- Check if any substring long enough to be meaningful (5+ characters) appears in a dictionary of real English words
- If multiple words contain real-word substrings → likely an LLM
- If all words contain no real substrings → likely a human
Proof of concept code available here
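Independently of the linked proof of concept, the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the `MIN_LEN` threshold of 5, the toy dictionary, and the "inconclusive" outcome for exactly one matching word are all assumptions made for the example.

```python
from typing import Iterable, Set

MIN_LEN = 5  # assumption: only substrings of 5+ characters count as a hit


def substrings(word: str, min_len: int = MIN_LEN) -> Set[str]:
    """Return every contiguous substring of `word` with at least `min_len` characters."""
    w = word.lower()
    return {w[i:j] for i in range(len(w)) for j in range(i + min_len, len(w) + 1)}


def contains_real_word(word: str, dictionary: Set[str], min_len: int = MIN_LEN) -> bool:
    """True if any sufficiently long substring of `word` is a dictionary word."""
    return not substrings(word, min_len).isdisjoint(dictionary)


def classify(words: Iterable[str], dictionary: Set[str]) -> str:
    """Decision rule: 2+ matching words -> LLM, 0 -> human, exactly 1 -> undecided."""
    hits = sum(contains_real_word(w, dictionary) for w in words)
    if hits >= 2:
        return "likely LLM"
    if hits == 0:
        return "likely human"
    return "inconclusive"  # assumption: the rule as stated does not cover this case


# Toy dictionary for illustration only; a real check would load a full word list.
TOY_DICTIONARY = {"vesper", "thrum", "nebula", "lithic"}

print(classify(["Vesperthrum", "Nebulithic", "Zorathium"], TOY_DICTIONARY))
print(classify(["ksdjsdksdk", "eokdfweinfhfejla", "owiewewlkdfk"], TOY_DICTIONARY))
```

With this toy dictionary the first call reports "likely LLM" (two of the three words contain a dictionary entry) and the second reports "likely human". Results with a real word list will depend on its size and on the chosen minimum length.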
Why This Works
The Human Advantage
Humans produce output with no linguistic structure, which for this purpose behaves like true randomness. When asked to make up words, we type patterns like:
- Keyboard rows: asdfghjk
- Random characters: kjsdhfkjsdhf
- Repeated patterns: lolololol
These rarely contain real 5+ character English words by chance.
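A rough back-of-the-envelope estimate shows why chance matches are rare. The sketch below assumes each character is drawn uniformly and independently from 26 letters, which real keyboard mashing is not, and it uses a hypothetical count of 10,000 five-letter dictionary words rather than a measured one. It only counts exact five-letter matches; longer dictionary words are even less likely to appear by accident.

```python
ALPHABET_SIZE = 26


def expected_chance_hits(word_len: int, dict_words_of_len_k: int, k: int = 5) -> float:
    """Expected number of length-k windows that match a dictionary word,
    assuming every character is uniform and independent."""
    windows = max(word_len - k + 1, 0)  # number of length-k substrings in the word
    p_window = dict_words_of_len_k / ALPHABET_SIZE ** k  # chance one window is a word
    return windows * p_window


# Hypothetical figure: 10,000 five-letter dictionary words (not a measured count).
print(expected_chance_hits(word_len=12, dict_words_of_len_k=10_000))
```

Under these assumptions a 12-character mash has 8 five-letter windows and an expected hit count of roughly 0.007, so most such strings would contain no five-letter dictionary word at all.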
The LLM Weakness
LLMs generate structured randomness. Even when trying to be creative, they:
- Combine real morphemes: Vesper (evening) + thrum (humming)
- Follow statistical patterns learned from billions of texts
- Create words with realistic character transitions
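The morpheme-combination pattern can be made visible with a small helper that looks for ways to cut a word into two dictionary entries. This is an illustrative sketch only: `two_part_splits`, its `min_len` parameter, and the two-entry toy dictionary are hypothetical, and a plain word list is a crude stand-in for a real morpheme inventory.

```python
from typing import List, Set, Tuple


def two_part_splits(word: str, dictionary: Set[str], min_len: int = 3) -> List[Tuple[str, str]]:
    """Return every way to cut `word` into two pieces that are both dictionary entries."""
    w = word.lower()
    return [
        (w[:i], w[i:])
        for i in range(min_len, len(w) - min_len + 1)
        if w[:i] in dictionary and w[i:] in dictionary
    ]


# Toy dictionary for illustration only.
TOY_DICTIONARY = {"vesper", "thrum"}

print(two_part_splits("Vesperthrum", TOY_DICTIONARY))
print(two_part_splits("ksdjsdksdk", TOY_DICTIONARY))
```

Here "Vesperthrum" splits cleanly into "vesper" and "thrum", while the keyboard mash yields no split.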