🧠 MicroGPT Explained
This code is essentially "GPT in its purest, most naked form". Karpathy's goal:
👉 "Show how GPT works using only Python, without PyTorch or TensorFlow"
Let me break it down step by step.
🧩 1️⃣ Big Picture: What Does This Code Do?
This file does the following:
- Loads a text dataset (list of names)
- Creates a character-level tokenizer
- Builds a small GPT model
- Implements its own autograd system
- Trains with backprop + Adam
- Generates new names
In other words:
Mini PyTorch + Mini GPT + Mini Trainer = One File
📂 2️⃣ Dataset Part
```python
docs = [ ... ]  # list of names
random.shuffle(docs)
```
Downloads names from GitHub:
emma
olivia
noah
liam
...
These are the training data.
🔤 3️⃣ Tokenizer (Character-Level)
```python
uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1
```
Here:
Each letter = token
Example:
a → 0
b → 1
c → 2
...
z → 25
BOS → 26
BOS = Beginning Of Sequence
The start token.
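The tokenizer logic above can be sketched as a tiny runnable snippet (the names list and the `char_to_id`/`encode` helpers here are illustrative, not copied from the actual file):

```python
# Minimal character-level tokenizer sketch (illustrative data).
docs = ["emma", "olivia", "noah"]

uchars = sorted(set(''.join(docs)))          # unique characters, sorted
char_to_id = {ch: i for i, ch in enumerate(uchars)}
BOS = len(uchars)                            # extra id for Beginning Of Sequence
vocab_size = len(uchars) + 1

def encode(name):
    # Prepend BOS so the model knows where a name starts.
    return [BOS] + [char_to_id[ch] for ch in name]

print(encode("emma"))
```

With these three names there are 9 unique characters, so BOS gets id 9 and the vocabulary size is 10.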
🧮 4️⃣ Autograd System (Value Class)
This is the most critical part.
This class is:
👉 a mini version of a PyTorch Tensor, holding a single scalar.
Each number has:
- data → its value
- grad → its gradient
- children → where it came from
Example:
```python
a = Value(2)
b = Value(3)
c = a * b
```
c.data = 6
But also:
c → knows it came from a and b
Then when you call `c.backward()`, it does:
Chain rule derivatives (backprop).
So:
Built its own PyTorch.
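The mechanics above can be sketched as a tiny runnable class. This is a simplified sketch in the spirit of the file's `Value` class, not the real thing: only multiplication is implemented here.

```python
# Minimal autograd sketch: each Value remembers its data, gradient,
# and the children it was computed from.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a = Value(2.0)
b = Value(3.0)
c = a * b
c.backward()
print(c.data, a.grad, b.grad)  # 6.0 3.0 2.0
```

After `c.backward()`, `a.grad` is 3 (the value of `b`) and `b.grad` is 2 (the value of `a`): exactly the chain rule for multiplication.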
🧠 5️⃣ Model Parameters
```python
n_embd = 16
n_head = 4
n_layer = 1
block_size = 8
```
These four numbers set the model size.
It's kept very small so it runs on a CPU.
Where are the parameters?
```python
state_dict = {
    'wte': ...,
    'wpe': ...,
    'lm_head': ...,
}
```
These are:
| Name | What |
|---|---|
| wte | token embedding |
| wpe | position embedding |
| attn_wq | query |
| attn_wk | key |
| attn_wv | value |
| attn_wo | output |
| mlp_fc1 | MLP layer 1 |
| mlp_fc2 | MLP layer 2 |
| lm_head | output |
Together, these make up the complete GPT architecture.
🏗️ 6️⃣ Forward Pass (GPT Function)
```python
def gpt(token_id, pos_id, keys, values):
```
This function:
"Input a token → output logits"
The heart of the model ❤️
a) Embedding
```python
tok_emb = state_dict['wte'][token_id]
pos_emb = state_dict['wpe'][pos_id]
x = tok_emb + pos_emb
```
Token + Position embedding.
b) Attention
```python
q = linear(x, Wq)
k = linear(x, Wk)
v = linear(x, Wv)
```
Standard Transformer:
Q, K, V
Then the attention itself (softmax of the scaled dot products q·k/√d, used to weight the values) is calculated manually.
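That manual attention step can be sketched on plain Python lists. The `attend` helper and the 2-dimensional toy vectors below are illustrative, not the file's actual code, but the math is the standard scaled dot-product attention over a cache of past keys/values:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    d = len(q)
    # Scaled dot-product score against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the cached values.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = attend(q, keys, values)
```

Because `q` matches the first key more closely, the output is pulled toward the first value vector.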
c) MLP
```python
x = linear(...)
x = relu(x) ** 2   # squared ReLU
x = linear(...)
```
Feedforward network.
d) Output
```python
logits = linear(x, lm_head)
```
Score for each character.
📉 7️⃣ Loss Calculation
```python
loss_t = -probs[target_id].log()
```
This is:
Cross Entropy Loss
Meaning:
"How low of a probability did you assign to the correct character?"
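The cross-entropy computation above can be sketched with plain floats (the logit values are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 0.5, -1.0]   # model scores for 3 characters (illustrative)
target_id = 0               # the correct next character

probs = softmax(logits)
loss = -math.log(probs[target_id])
# Low probability on the right character => large loss; probability 1 => loss 0.
```

Here the model already favors the correct character, so the loss is small; if `target_id` were 2 (probability ≈ 0.04), the loss would be much larger.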
🔁 8️⃣ Training Loop
```python
for step in range(num_steps):
```
At each step:
1️⃣ Get data
```python
doc = docs[step % len(docs)]
```
2️⃣ Forward
3️⃣ Backward
4️⃣ Adam update
The optimizer is also implemented manually.
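One manual Adam update for a single scalar parameter can be sketched as follows (the `adam_step` helper and the hyperparameter defaults are typical Adam values, not necessarily the exact ones in the file):

```python
import math

def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (running mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (running mean of grad^2)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=2.0, m=m, v=v, t=1)
```

On the very first step the bias correction undoes the zero-initialization of `m` and `v`, so the update is roughly `lr * sign(grad)`.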
🤖 9️⃣ Inference (Text Generation)
```python
for sample_idx in range(20):
```
After training finishes:
Generates names from the model.
Steps:
- Start with BOS
- Predict next-character probabilities
- Sample one token at random, weighted by those probabilities
- Append the new token to the context
- Repeat until the sequence ends
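The control flow of that loop can be sketched as below. To keep it self-contained, `fake_probs` is a stand-in for the real `gpt` forward pass, and here token id 2 plays the role of BOS / end marker:

```python
import random

def fake_probs(token_id):
    # Illustrative stand-in for the model's predicted distribution
    # over a 3-token vocabulary.
    return [0.1, 0.2, 0.7]

BOS = 2
random.seed(1)          # fixed seed so the sketch is reproducible
token_id = BOS          # start from the BOS token
out = []
while True:
    probs = fake_probs(token_id)
    # Random selection, weighted by the model's probabilities.
    token_id = random.choices(range(len(probs)), weights=probs)[0]
    if token_id == BOS or len(out) >= 10:
        break           # BOS doubles as the "end of name" signal
    out.append(token_id)
```

The sampling (rather than always taking the argmax) is what makes each generated name different.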
🎯 Summary: What's the Purpose of This Code?
This code is:
❌ Not for production ❌ Not fast ❌ Doesn't use GPU
But:
✅ Educational ✅ Transparent ✅ Shows the math ✅ Reveals "What GPT actually does"
🧠 Big Picture
Think of it this way:
| Normal GPT | This Code |
|---|---|
| PyTorch | Pure Python |
| CUDA | CPU |
| 100B param | ~10K param |
| Complex | Simple |
This is:
"The skeleton of GPT"
📌 My Recommendation (For Learning)
If you really want to understand:
Go in this order:
1️⃣ First, fully grasp these parts:
- Value class
- backward()
- linear()
- softmax()
2️⃣ Then:
- Draw the Attention part as you read
- Sketch out the Q K V matrices on paper
3️⃣ Finally:
- Training loop
Want to dive deeper into:
👉 "Explain just the Attention part line by line" 👉 "Explain Value class with examples" 👉 "Convert this to PyTorch"
Which one interests you?