GitHub - chrishwiggins/shannon-language-model: How a (small) language model walks through its training text: a teaching demo of a bigram Markov chain as a random walk. Live: shannon-language-model.pages.dev

A teaching demo for the simplest possible language model: a bigram Markov chain. A bigram model is a map (each word points to the words seen right after it); generating text is a random walk over that map. The app animates the walk, moving a reading head through the training text one word at a time as each next word is sampled.

Live demo: https://shannon.chrishwiggins.com

What it shows

A bigram model is a map. Build a directed graph whose nodes are words and whose edges go from each word to the words seen right after it. A word's out-degree (its number of distinct next-words) is the branching factor of the model.
Generation is a random walk. Sample each next word in proportion to how often it followed the current one; the reading head moves through the training text as it goes. This is exactly Shannon's 1948 first-order word model: no temperature, no neural net.
A designed training text makes the branching legible. The text is engineered so most words have a small, near-uniform out-degree (mostly two choices), so you can follow the walk by eye. A self-check recomputes the bigram graph from the displayed tokens and proves it matches the designed graph, or blocks generation.
The covering walk. To exhibit every edge in the fewest words, the displayed text is a Guan-route-augmented Eulerian circuit (route inspection + Hierholzer) over the designed graph.
Real text, for contrast. Other tabs build the real bigram graph of pasted text or a bundled public-domain classic, where out-degrees vary wildly (Zipfian), unlike the uniform designed demo.

Running it locally

The designed-demo and paste tabs work from any static server. The standard-texts tab reads bundled files from dat/, and the optional URL-scraping tab uses Python's BeautifulSoup (which cannot run in a browser). The local backend serves everything:

python3 src/server.py
# then open http://localhost:8731/

The only non-stdlib dependency is beautifulsoup4 (pip install beautifulsoup4), and only the URL tab needs it.

To produce the static build that the live site serves:

python3 src/build-static.py # writes out/site/

The static build removes the URL-scraping tab (no Python backend on a static host) and stamps a last-updated time.

The app is behind a client-side password gate (password: monday). This is obfuscation, not security: the password ships in the page source.

Project layout

Path	What it is
`index.html`	The single-page app (markup + CSS + JS).
`src/server.py`	Local dev backend: serves the app and the URL scraper (BeautifulSoup).
`src/build-static.py`	Produces the static deploy build in `out/site/`.
`src/pareto-entropy.py`	Computes the fair-use-vs-branching Pareto front (Shannon conditional entropy vs. vocabulary) over a text.
`dat/*.txt`	Bundled public-domain training texts.
`dat/index.json`	Catalog the Standard-texts tab reads.
`ref/markov-1913-summary.md`	A short note on Markov's 1913 Eugene Onegin chain (the first Markov chain).

The training texts

All bundled texts in dat/ are public domain:

markov-onegin.txt — Pushkin's Eugene Onegin (Henry Spalding's 1881 translation), the text Andrey Markov used in 1913 for the first Markov chain.
bible-genesis.txt — King James Bible, Genesis 1.
shakespeare-hamlet.txt, shakespeare-sonnet18.txt — Shakespeare.
house-jack.txt — "The House That Jack Built" (1755, a cumulative rhyme).
ring-roses.txt — "Ring a Ring o' Roses".
columbia-mission.txt — the Columbia University mission statement.

The designed demo's small vocabulary is drawn from a canon of texts about language, machines, and meaning (Chomsky, Brautigan, Orwell via Strunk & White, the 1955 Dartmouth proposal, Weizenbaum, Searle, Strachey). Those copyrighted sources are referenced by title and used only as the provenance of individual words; none of their text is redistributed here.

Background

A. A. Markov (1913) counted vowel/consonant transitions across 20,000 letters of Eugene Onegin — the first Markov chain. See ref/markov-1913-summary.md.
C. E. Shannon (1948), A Mathematical Theory of Communication — the n-gram language model and conditional entropy this demo animates.

License

MIT (see LICENSE) for the code. The bundled dat/*.txt training texts are public-domain works, included as data.