Pen & Paper LM 160 is a microscopic language model designed to be run by hand.
It does not generate normal prose. It predicts the next token in a small workflow language:
QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
The model is trained to produce simple useful loops:
QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE
It is small enough to run inference on paper, yet it still has the core parts of a neural language model: context, weights, hidden activations, ReLU, logits, and next-token decoding.
Model shape
Pen & Paper LM 160 uses the previous token and the current token as context.
previous token + current token → hidden layer → ReLU → output logits → next token
It has:
12 tokens
2-token context
4 hidden neurons
12 output logits
The first layer is split into two tables:
W1_prev: weights for the previous token
W1_curr: weights for the current token
This is equivalent to a 24 × 4 input matrix, but easier to use by hand.
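The equivalence can be checked mechanically. The sketch below is only an illustration: it fills the two tables with placeholder random integers (not the model's real weights), stacks them into one 24 × 4 matrix, and compares a row lookup against a one-hot matrix product. The helper names `one_hot`, `lookup_sum`, and `matrix_product` are made up for this example.

```python
import random

VOCAB_SIZE = 12  # tokens
HIDDEN = 4       # hidden neurons

# Placeholder weights (NOT the model's real tables); the identity holds for any values.
rng = random.Random(0)
W1_prev = [[rng.randint(-3, 3) for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]
W1_curr = [[rng.randint(-3, 3) for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]

# Stacking the two 12 x 4 tables gives the single 24 x 4 input matrix.
W1 = W1_prev + W1_curr


def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def lookup_sum(prev, curr):
    """By-hand version: add one row from each table."""
    return [p + c for p, c in zip(W1_prev[prev], W1_curr[curr])]


def matrix_product(prev, curr):
    """Matrix version: a 24-element one-hot pair times the 24 x 4 matrix."""
    x = one_hot(prev, VOCAB_SIZE) + one_hot(curr, VOCAB_SIZE)
    return [sum(x[i] * W1[i][j] for i in range(2 * VOCAB_SIZE)) for j in range(HIDDEN)]


# Compare both versions for every (previous, current) pair.
all_equal = all(
    lookup_sum(p, c) == matrix_product(p, c)
    for p in range(VOCAB_SIZE)
    for c in range(VOCAB_SIZE)
)
print(all_equal)  # True
```

The lookup version is what you do on paper: the one-hot input zeroes out every row except the two you would have picked anyway.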
Parameter count
W1_prev: 12 × 4 = 48
W1_curr: 12 × 4 = 48
b1: 4
W2: 4 × 12 = 48
b2: 12
Total = 160 parameters
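The same tally as a short script, in case you want to try other sizes. The names `VOCAB_SIZE`, `HIDDEN`, and `param_counts` are illustrative and do not come from the model itself.

```python
VOCAB_SIZE = 12  # tokens
HIDDEN = 4       # hidden neurons

# One entry per parameter group listed above.
param_counts = {
    "W1_prev": VOCAB_SIZE * HIDDEN,
    "W1_curr": VOCAB_SIZE * HIDDEN,
    "b1": HIDDEN,
    "W2": HIDDEN * VOCAB_SIZE,
    "b2": VOCAB_SIZE,
}
total = sum(param_counts.values())
print(total)  # 160
```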
Token order
Use this order for all output scores:
| Index | Token |
|---|---|
| 0 | QUESTION |
| 1 | TASK |
| 2 | IDEA |
| 3 | FACT |
| 4 | PROBLEM |
| 5 | UNKNOWN |
| 6 | PLAN |
| 7 | DO |
| 8 | CHECK |
| 9 | ASK |
| 10 | ANSWER |
| 11 | DONE |
Inference rule
If the prompt has one token, duplicate it. For example, the prompt
TASK
becomes:
previous = TASK
current = TASK
Then calculate:
h_raw = W1_prev[previous] + W1_curr[current] + b1
h = ReLU(h_raw)
logits = h × W2 + b2
ReLU is simple:
negative numbers become 0
zero and positive numbers stay unchanged
The predicted next token is the token with the highest logit.
If there is a tie, choose the tied token that comes first in the token order.
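The three small rules above (context building, ReLU, and tie-breaking) can be sketched as helper functions. The names `make_context`, `relu`, and `argmax_first` are hypothetical, and using the last two tokens of a longer prompt is an assumption that follows from the 2-token context rather than something stated explicitly.

```python
def make_context(prompt):
    """Return (previous, current) for a list of tokens.

    A one-token prompt is duplicated. For longer prompts this sketch
    assumes the last two tokens are used.
    """
    if not prompt:
        raise ValueError("prompt needs at least one token")
    if len(prompt) == 1:
        return prompt[0], prompt[0]
    return prompt[-2], prompt[-1]


def relu(values):
    """Negative numbers become 0; zero and positive numbers stay unchanged."""
    return [max(0, v) for v in values]


def argmax_first(logits):
    """Index of the highest logit; on a tie, the lowest index wins."""
    return logits.index(max(logits))


print(make_context(["TASK"]))      # ('TASK', 'TASK')
print(relu([0, -2, 2, 4]))         # [0, 0, 2, 4]
print(argmax_first([1, 5, 5, 0]))  # 1
```

`list.index` returns the first matching position, which is what gives the "first in token order" tie-break for free.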
W1 previous-token table
| Previous token | H1 | H2 | H3 | H4 |
|---|---|---|---|---|
| QUESTION | 0 | 4 | -2 | 0 |
| TASK | -1 | -2 | 0 | 0 |
| IDEA | 0 | 0 | 2 | 3 |
| FACT | 0 | 2 | 1 | -1 |
| PROBLEM | 0 | 0 | 2 | 3 |
| UNKNOWN | -1 | 0 | 2 | -1 |
| PLAN | 1 | 1 | -1 | 2 |
| DO | 2 | 1 | 2 | -1 |
| CHECK | -1 | 0 | 0 | -1 |
| ASK | 0 | 0 | 0 | 0 |
| ANSWER | 0 | 0 | 0 | 0 |
| DONE | 2 | 2 | 2 | -1 |
W1 current-token table
| Current token | H1 | H2 | H3 | H4 |
|---|---|---|---|---|
| QUESTION | 0 | 1 | -1 | 3 |
| TASK | 0 | -1 | 1 | 2 |
| IDEA | 3 | 1 | -2 | 2 |
| FACT | -1 | 3 | -2 | -1 |
| PROBLEM | 3 | 1 | -2 | 2 |
| UNKNOWN | 0 | -1 | 3 | -1 |
| PLAN | -1 | -1 | -2 | -2 |
| DO | 2 | 1 | -1 | 2 |
| CHECK | -1 | 2 | 2 | -1 |
| ASK | 0 | 3 | 2 | -1 |
| ANSWER | 2 | 2 | 3 | -1 |
| DONE | 2 | 1 | 2 | -1 |
Hidden bias
One bias per hidden neuron, in the order H1 to H4 (the same values used in the worked example):
b1 = [1, 1, 1, 2]
W2 output matrix
Columns use the token order:
QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
| Hidden row | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H1 | -1 | -1 | -1 | -2 | -2 | -1 | -2 | -2 | 2 | -1 | -2 | 1 |
| H2 | -2 | -2 | -2 | -1 | -2 | -2 | -1 | -1 | 1 | -4 | 3 | 1 |
| H3 | -2 | -2 | -1 | -1 | -1 | -1 | 1 | -1 | -3 | 3 | -2 | 1 |
| H4 | -2 | -1 | -1 | -1 | -1 | -2 | 3 | -1 | 2 | -2 | -2 | -2 |
Output bias
Again, this uses the same token order:
QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]
Worked example
Prompt:
TASK
Because there is only one token, duplicate it:
previous = TASK
current = TASK
Find the two rows:
W1_prev[TASK] = [-1, -2, 0, 0]
W1_curr[TASK] = [ 0, -1, 1, 2]
b1 = [ 1, 1, 1, 2]
Add them:
h_raw = [-1, -2, 0, 0]
+ [ 0, -1, 1, 2]
+ [ 1, 1, 1, 2]
h_raw = [0, -2, 2, 4]
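Apply ReLU:
h = [0, 0, 2, 4]
Now calculate the output logits. H1 and H2 are zero, so only the H3 and H4 rows of W2 contribute:
logits = 2 × W2[H3] + 4 × W2[H4] + b2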
Result:
| Token | Logit |
|---|---|
| QUESTION | -14 |
| TASK | -11 |
| IDEA | -9 |
| FACT | -8 |
| PROBLEM | -8 |
| UNKNOWN | -12 |
| PLAN | 12 |
| DO | 1 |
| CHECK | 1 |
| ASK | -4 |
| ANSWER | -14 |
| DONE | -7 |
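To double-check the table, here is a minimal sketch that redoes the arithmetic for this one example. It copies `h_raw`, the H3 and H4 rows of W2, and b2 from the text above; the variable names are illustrative.

```python
TOKENS = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
          "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]

# Values copied from the tables and the worked example above.
h_raw = [0, -2, 2, 4]
W2_H3 = [-2, -2, -1, -1, -1, -1, 1, -1, -3, 3, -2, 1]
W2_H4 = [-2, -1, -1, -1, -1, -2, 3, -1, 2, -2, -2, -2]
b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]

# ReLU: negatives become 0.
h = [max(0, v) for v in h_raw]

# H1 and H2 are zero after ReLU, so only the H3 and H4 rows contribute.
logits = [h[2] * w3 + h[3] * w4 + b for w3, w4, b in zip(W2_H3, W2_H4, b2)]

for token, logit in zip(TOKENS, logits):
    print(f"{token:8} {logit}")

best = TOKENS[logits.index(max(logits))]
print(best)  # PLAN
```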
The highest logit is:
PLAN = 12
So the model predicts:
TASK → PLAN
Continue the same way, sliding the two-token context forward one step at a time (previous current → next):
TASK PLAN → DO
PLAN DO → CHECK
DO CHECK → DONE
Full output:
TASK → PLAN → DO → CHECK → DONE
Verified behavior
With greedy decoding, the model produces:
QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE
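If you would rather let a computer do the checking, the following is a minimal sketch of the whole model with greedy decoding. The weights are copied from the tables above, with b1 taken from the worked example. The function names `predict_next` and `generate`, the stop-at-DONE rule, and the `max_steps` guard are assumptions made for this sketch.

```python
TOKENS = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
          "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]

W1_PREV = {
    "QUESTION": [0, 4, -2, 0], "TASK": [-1, -2, 0, 0],
    "IDEA": [0, 0, 2, 3], "FACT": [0, 2, 1, -1],
    "PROBLEM": [0, 0, 2, 3], "UNKNOWN": [-1, 0, 2, -1],
    "PLAN": [1, 1, -1, 2], "DO": [2, 1, 2, -1],
    "CHECK": [-1, 0, 0, -1], "ASK": [0, 0, 0, 0],
    "ANSWER": [0, 0, 0, 0], "DONE": [2, 2, 2, -1],
}
W1_CURR = {
    "QUESTION": [0, 1, -1, 3], "TASK": [0, -1, 1, 2],
    "IDEA": [3, 1, -2, 2], "FACT": [-1, 3, -2, -1],
    "PROBLEM": [3, 1, -2, 2], "UNKNOWN": [0, -1, 3, -1],
    "PLAN": [-1, -1, -2, -2], "DO": [2, 1, -1, 2],
    "CHECK": [-1, 2, 2, -1], "ASK": [0, 3, 2, -1],
    "ANSWER": [2, 2, 3, -1], "DONE": [2, 1, 2, -1],
}
B1 = [1, 1, 1, 2]
W2 = [
    [-1, -1, -1, -2, -2, -1, -2, -2, 2, -1, -2, 1],   # H1
    [-2, -2, -2, -1, -2, -2, -1, -1, 1, -4, 3, 1],    # H2
    [-2, -2, -1, -1, -1, -1, 1, -1, -3, 3, -2, 1],    # H3
    [-2, -1, -1, -1, -1, -2, 3, -1, 2, -2, -2, -2],   # H4
]
B2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]


def predict_next(previous, current):
    """One forward pass: row lookups, ReLU, output layer, first-max argmax."""
    h_raw = [p + c + b for p, c, b in zip(W1_PREV[previous], W1_CURR[current], B1)]
    h = [max(0, v) for v in h_raw]
    logits = [sum(h[i] * W2[i][j] for i in range(4)) + B2[j] for j in range(12)]
    return TOKENS[logits.index(max(logits))]  # ties go to the earlier token


def generate(start, max_steps=10):
    """Greedy decoding from a one-token prompt.

    Assumption: stop as soon as DONE is produced; max_steps is a safety guard.
    """
    sequence = [start]
    previous = current = start  # one-token prompt is duplicated
    for _ in range(max_steps):
        nxt = predict_next(previous, current)
        sequence.append(nxt)
        if nxt == "DONE":
            break
        previous, current = current, nxt
    return sequence


for start in ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN", "DONE"]:
    print(" → ".join(generate(start)))
```

Running it should print the seven sequences listed above, one per line.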
Pen & Paper LM 160 is not useful because it is powerful. It is useful because the whole model is visible.
You can inspect every weight, run every step by hand, and change the model’s behavior directly. Raising weights toward CHECK makes it more cautious. Raising weights toward DO makes it more action-oriented. Raising weights toward DONE makes it finish sooner.
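As a rough illustration of that last point: b2 is added directly to the logits, so raising the DONE entry of b2 by some amount raises the DONE logit by the same amount in every context. The sketch below applies such a boost to the logits from the worked example (previous = TASK, current = TASK). The function name `predict_with_done_boost` and the boost values are hypothetical, and other contexts would need different amounts.

```python
TOKENS = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
          "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]

# Logits for previous = TASK, current = TASK, copied from the worked example.
task_logits = [-14, -11, -9, -8, -8, -12, 12, 1, 1, -4, -14, -7]


def predict_with_done_boost(logits, boost):
    """Prediction after adding `boost` to the DONE entry of b2."""
    adjusted = list(logits)
    adjusted[TOKENS.index("DONE")] += boost
    return TOKENS[adjusted.index(max(adjusted))]  # ties go to the earlier token


print(predict_with_done_boost(task_logits, 0))   # PLAN
print(predict_with_done_boost(task_logits, 19))  # PLAN (tie at 12; PLAN comes first)
print(predict_with_done_boost(task_logits, 20))  # DONE
```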
It is a complete language model small enough to fit in a notebook.