GitHub - onurkanbakirci/exp-of-microgpt


🧠 MicroGPT Explained

This code is essentially "GPT in its purest, most naked form". Karpathy's goal:

👉 "Show how GPT works using only Python, without PyTorch or TensorFlow"

Let me break it down step by step.


🧩 1️⃣ Big Picture: What Does This Code Do?

This file does the following:

  1. Loads a text dataset (list of names)
  2. Creates a character-level tokenizer
  3. Builds a small GPT model
  4. Implements its own autograd system
  5. Trains with backprop + Adam
  6. Generates new names

In other words:

Mini PyTorch + Mini GPT + Mini Trainer = One File


📂 2️⃣ Dataset Part

docs = [ ... ]  # list of names
random.shuffle(docs)

The names are downloaded from GitHub:

emma
olivia
noah
liam
...

These are the training data.


🔤 3️⃣ Tokenizer (Character-Level)

uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1

Here:

Each letter = token

Example:

a → 0
b → 1
c → 2
...
z → 25
BOS → 26

BOS = Beginning Of Sequence

The start token.
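The tokenizer above can be sketched in a few lines. This is a minimal, self-contained version following the article's variable names (`uchars`, `BOS`, `vocab_size`); the `ctoi`/`itoc` lookup dicts are the standard way to complete it:

```python
# Character-level tokenizer sketch; `docs` stands in for the downloaded names.
docs = ["emma", "olivia", "noah", "liam"]

uchars = sorted(set(''.join(docs)))           # unique characters, sorted
ctoi = {c: i for i, c in enumerate(uchars)}   # char -> token id
itoc = {i: c for c, i in ctoi.items()}        # token id -> char
BOS = len(uchars)                             # one extra id: Beginning Of Sequence
vocab_size = len(uchars) + 1

# Encoding a name: prepend BOS, then map each character to its id.
tokens = [BOS] + [ctoi[c] for c in "emma"]
```

With these four names the unique characters are `a e h i l m n o v`, so `vocab_size` is 10 and "emma" encodes to `[9, 1, 5, 5, 0]`.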


🧮 4️⃣ Autograd System (Value Class)

This is the most critical part.

This class:

👉 Mini version of PyTorch Tensor.

Each number has:

  • data → its value
  • grad → its gradient
  • children → where it came from

Example:

a = Value(2)
b = Value(3)
c = a * b

c.data = 6

But also:

c → knows it came from a and b

Then when you call:

c.backward()

It applies the chain rule backwards through the graph (backprop) and fills in a.grad and b.grad.

So:

Karpathy effectively built his own mini PyTorch.
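Here is a minimal sketch of such a micrograd-style `Value` class, not Karpathy's exact code, but the same idea: each node stores its data, its grad, and the children it was built from, and `backward()` walks the graph applying the chain rule (only multiplication is shown):

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # how to push grad to the children

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # chain rule: d(out)/d(self) = other.data, and vice versa
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then apply the chain rule from the output back
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b
c.backward()
# a.grad == 3.0 (= b.data), b.grad == 2.0 (= a.data)
```

Exactly the article's example: `c.data` is 6, and after `backward()` each input knows how much it influenced the output.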


🧠 5️⃣ Model Parameters

n_embd = 16
n_head = 4
n_layer = 1
block_size = 8

These hyperparameters determine the model size.

Kept very small so it runs on CPU.


Where are the parameters?

state_dict = {
  'wte': ...
  'wpe': ...
  'lm_head': ...
}

These are:

| Name | What |
| --- | --- |
| wte | token embedding |
| wpe | position embedding |
| attn_wq | query |
| attn_wk | key |
| attn_wv | value |
| attn_wo | output |
| mlp_fc1 | MLP layer 1 |
| mlp_fc2 | MLP layer 2 |
| lm_head | output |

= Complete GPT architecture


🏗️ 6️⃣ Forward Pass (GPT Function)

def gpt(token_id, pos_id, keys, values):

This function:

"Input a token → output logits"

The heart of the model ❤️


a) Embedding

tok_emb = state_dict['wte'][token_id]
pos_emb = state_dict['wpe'][pos_id]
x = tok_emb + pos_emb

Token + Position embedding.


b) Attention

q = linear(x, Wq)
k = linear(x, Wk)
v = linear(x, Wv)

Standard Transformer:

Q, K, V

Then the attention itself:

attn = softmax(q · k / √head_dim)
out = attn-weighted sum of v

Calculated manually.
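A plain-Python sketch of that manual attention step (the real code does the same arithmetic with `Value` objects so gradients flow; the function names `softmax` and `attend` here are mine):

```python
import math

def softmax(xs):
    # numerically stable softmax over a plain list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    # q: query for the current position; keys/values: one vector per past position
    head_dim = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
              for k in keys]
    weights = softmax(scores)   # how much to attend to each past position
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(head_dim)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = attend(q, keys, values)
```

The query matches the first key more strongly, so the output is pulled toward the first value vector, a weighted blend rather than a hard pick.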


c) MLP

x = linear(...)
x = relu(x) ** 2
x = linear(...)

Feedforward network.
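A minimal sketch of that MLP block with the squared-ReLU nonlinearity between the two linear layers (the `linear` and `mlp` helpers here are mine, working on plain lists rather than `Value` objects):

```python
def linear(x, W):
    # W is a list of rows, one row per output unit: a matrix-vector product
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def mlp(x, W1, W2):
    h = linear(x, W1)
    h = [max(0.0, hi) ** 2 for hi in h]  # relu(x) ** 2: squared ReLU
    return linear(h, W2)

# With identity weights: [2.0, -1.0] -> relu**2 -> [4.0, 0.0]
out = mlp([2.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Negative activations are zeroed by the ReLU, positive ones are squared, then projected back out.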


d) Output

logits = linear(x, lm_head)

Score for each character.


📉 7️⃣ Loss Calculation

loss_t = -probs[target_id].log()

This is:

Cross Entropy Loss

Meaning:

"How low of a probability did you assign to the correct character?"


🔁 8️⃣ Training Loop

for step in range(num_steps):

At each step:

1️⃣ Get data

doc = docs[step % len(docs)]

2️⃣ Forward

Run the document's tokens through gpt() and accumulate the loss.

3️⃣ Backward

Call backward() on the loss so every parameter's grad is filled in.

4️⃣ Adam

Update each parameter by hand: a manual optimizer.
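One manual Adam update for a single parameter looks like this. These are the standard Adam formulas; the hyperparameter values here are common defaults, not necessarily the ones the file uses, and `adam_step` is my name for the helper:

```python
import math

def adam_step(p, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # 2nd moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# One step on a parameter with value 1.0 and gradient 2.0
p, m, v = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

After bias correction on the very first step, the update is effectively `lr * sign(grad)`, so the parameter moves from 1.0 to about 0.99.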


🤖 9️⃣ Inference (Text Generation)

for sample_idx in range(20):

After training finishes:

Generates names from the model.

Steps:

  1. Start with BOS
  2. Predict the next-character probabilities
  3. Sample one token at random from them
  4. Append the new token
  5. Repeat until done
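The steps above can be sketched as a loop. This is a hedged sketch, assuming BOS also doubles as the stop marker (common in these minimal setups, though the file may stop differently); `model_probs` is a stand-in for the real model, and `generate` is my name for the loop:

```python
import random

def generate(model_probs, BOS, itoc, max_len=16):
    tokens = [BOS]                                   # 1. start with BOS
    while len(tokens) < max_len:
        probs = model_probs(tokens)                  # 2. next-token distribution
        next_id = random.choices(range(len(probs)),  # 3. random selection
                                 weights=probs)[0]
        if next_id == BOS:                           # sampled the stop marker
            break
        tokens.append(next_id)                       # 4. add new token, 5. repeat
    return ''.join(itoc[t] for t in tokens[1:])
```

With a toy "model" that always predicts `'a'` three times and then BOS, this produces the name `"aaa"`.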

🎯 Summary: What's the Purpose of This Code?

This code is:

❌ Not for production ❌ Not fast ❌ Doesn't use GPU

But:

✅ Educational ✅ Transparent ✅ Shows the math ✅ Reveals "What GPT actually does"


🧠 Big Picture

Think of it this way:

| Normal GPT | This Code |
| --- | --- |
| PyTorch | Pure Python |
| CUDA | CPU |
| ~100B params | ~10K params |
| Complex | Simple |

This is:

"The skeleton of GPT"


📌 My Recommendation (For Learning)

If you really want to understand:

Go in this order:

1️⃣ First, fully grasp these parts:

  • Value class
  • backward()
  • linear()
  • softmax()

2️⃣ Then:

  • Draw the Attention part as you read
  • Sketch out the Q K V matrices on paper

3️⃣ Finally:

  • Training loop

Want to dive deeper into:

👉 "Explain just the Attention part line by line" 👉 "Explain Value class with examples" 👉 "Convert this to PyTorch"

Which one interests you?