🧠 MicroGPT Explained
This code is essentially "GPT in its purest, most naked form". Karpathy's goal:
👉 "Show how GPT works using only Python, without PyTorch or TensorFlow"
Let me break it down step by step.
🧩 1️⃣ Big Picture: What Does This Code Do?
This file does the following:
- Loads a text dataset (list of names)
- Creates a character-level tokenizer
- Builds a small GPT model
- Implements its own autograd system
- Trains with backprop + Adam
- Generates new names
In other words:
Mini PyTorch + Mini GPT + Mini Trainer = One File
📂 2️⃣ Dataset Part
```python
docs = [ ... ]  # list of names
random.shuffle(docs)
```
Downloads names from GitHub:
emma
olivia
noah
liam
...
These are the training data.
🔤 3️⃣ Tokenizer (Character-Level)
```python
uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1
```
Here:
Each letter = token
Example:
a → 0
b → 1
c → 2
...
z → 25
BOS → 26
BOS = Beginning Of Sequence
The start token.
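The tokenizer logic above can be sketched as a tiny runnable snippet (the names list and the `char_to_id`/`encode` helpers here are illustrative, not copied from the actual file):

```python
# Minimal character-level tokenizer sketch (illustrative data).
docs = ["emma", "olivia", "noah"]

uchars = sorted(set(''.join(docs)))          # unique characters, sorted
char_to_id = {ch: i for i, ch in enumerate(uchars)}
BOS = len(uchars)                            # extra id for Beginning Of Sequence
vocab_size = len(uchars) + 1

def encode(name):
    # Prepend BOS so the model knows where a name starts.
    return [BOS] + [char_to_id[ch] for ch in name]

print(encode("emma"))
```

With these three names there are 9 unique characters, so BOS gets id 9 and the vocabulary size is 10.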
🧮 4️⃣ Autograd System (Value Class)
This is the most critical part.
This class is:
👉 a mini version of a PyTorch Tensor, holding a single scalar.
Each number has:
- data → its value
- grad → its gradient
- children → where it came from
Example:
```python
a = Value(2)
b = Value(3)
c = a * b
```
c.data = 6
But also:
c → knows it came from a and b
Then when you call `c.backward()`, it does:
Chain rule derivatives (backprop).
So:
Built its own PyTorch.
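The mechanics above can be sketched as a tiny runnable class. This is a simplified sketch in the spirit of the file's `Value` class, not the real thing: only multiplication is implemented here.

```python
# Minimal autograd sketch: each Value remembers its data, gradient,
# and the children it was computed from.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a = Value(2.0)
b = Value(3.0)
c = a * b
c.backward()
print(c.data, a.grad, b.grad)  # 6.0 3.0 2.0
```

After `c.backward()`, `a.grad` is 3 (the value of `b`) and `b.grad` is 2 (the value of `a`): exactly the chain rule for multiplication.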
🧠 5️⃣ Model Parameters
```python
n_embd = 16
n_head = 4
n_layer = 1
block_size = 8
```
These four numbers set the model size.
It's kept very small so it runs on a CPU.
Where are the parameters?
```python
state_dict = {
    'wte': ...,
    'wpe': ...,
    'lm_head': ...,
}
```
These are:
| Name | What |
|---|---|
| wte | token embedding |
| wpe | position embedding |
| attn_wq | query |
| attn_wk | key |
| attn_wv | value |
| attn_wo | output |
| mlp_fc1 | MLP layer 1 |
| mlp_fc2 | MLP layer 2 |
| lm_head | output |
Together, these make up the complete GPT architecture.
🏗️ 6️⃣ Forward Pass (GPT Function)
```python
def gpt(token_id, pos_id, keys, values):
```
This function:
"Input a token → output logits"
The heart of the model ❤️
a) Embedding
```python
tok_emb = state_dict['wte'][token_id]
pos_emb = state_dict['wpe'][pos_id]
x = tok_emb + pos_emb
```
Token + Position embedding.
b) Attention
```python
q = linear(x, Wq)
k = linear(x, Wk)
v = linear(x, Wv)
```
Standard Transformer:
Q, K, V
Then the attention itself (softmax of the scaled dot products q·k/√d, used to weight the values) is calculated manually.
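That manual attention step can be sketched on plain Python lists. The `attend` helper and the 2-dimensional toy vectors below are illustrative, not the file's actual code, but the math is the standard scaled dot-product attention over a cache of past keys/values:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    d = len(q)
    # Scaled dot-product score against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the cached values.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = attend(q, keys, values)
```

Because `q` matches the first key more closely, the output is pulled toward the first value vector.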
c) MLP
```python
x = linear(...)
x = relu(x) ** 2   # squared ReLU
x = linear(...)
```
Feedforward network.
d) Output
```python
logits = linear(x, lm_head)
```
Score for each character.
📉 7️⃣ Loss Calculation
```python
loss_t = -probs[target_id].log()
```
This is:
Cross Entropy Loss
Meaning:
"How low of a probability did you assign to the correct character?"
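The cross-entropy computation above can be sketched with plain floats (the logit values are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 0.5, -1.0]   # model scores for 3 characters (illustrative)
target_id = 0               # the correct next character

probs = softmax(logits)
loss = -math.log(probs[target_id])
# Low probability on the right character => large loss; probability 1 => loss 0.
```

Here the model already favors the correct character, so the loss is small; if `target_id` were 2 (probability ≈ 0.04), the loss would be much larger.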
🔁 8️⃣ Training Loop
```python
for step in range(num_steps):
```
At each step:
1️⃣ Get data
```python
doc = docs[step % len(docs)]
```
2️⃣ Forward
3️⃣ Backward
4️⃣ Adam update
The optimizer is also implemented manually.
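One manual Adam update for a single scalar parameter can be sketched as follows (the `adam_step` helper and the hyperparameter defaults are typical Adam values, not necessarily the exact ones in the file):

```python
import math

def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (running mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (running mean of grad^2)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=2.0, m=m, v=v, t=1)
```

On the very first step the bias correction undoes the zero-initialization of `m` and `v`, so the update is roughly `lr * sign(grad)`.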
🤖 9️⃣ Inference (Text Generation)
```python
for sample_idx in range(20):
```
After training finishes:
Generates names from the model.
Steps:
- Start with BOS
- Predict next-character probabilities
- Sample one token at random, weighted by those probabilities
- Append the new token to the context
- Repeat until the sequence ends
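The control flow of that loop can be sketched as below. To keep it self-contained, `fake_probs` is a stand-in for the real `gpt` forward pass, and here token id 2 plays the role of BOS / end marker:

```python
import random

def fake_probs(token_id):
    # Illustrative stand-in for the model's predicted distribution
    # over a 3-token vocabulary.
    return [0.1, 0.2, 0.7]

BOS = 2
random.seed(1)          # fixed seed so the sketch is reproducible
token_id = BOS          # start from the BOS token
out = []
while True:
    probs = fake_probs(token_id)
    # Random selection, weighted by the model's probabilities.
    token_id = random.choices(range(len(probs)), weights=probs)[0]
    if token_id == BOS or len(out) >= 10:
        break           # BOS doubles as the "end of name" signal
    out.append(token_id)
```

The sampling (rather than always taking the argmax) is what makes each generated name different.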
🎯 Summary: What's the Purpose of This Code?
This code is:
❌ Not for production ❌ Not fast ❌ Doesn't use GPU
But:
✅ Educational ✅ Transparent ✅ Shows the math ✅ Reveals "What GPT actually does"
🧠 Big Picture
Think of it this way:
| Normal GPT | This Code |
|---|---|
| PyTorch | Pure Python |
| CUDA | CPU |
| 100B param | ~10K param |
| Complex | Simple |
This is:
"The skeleton of GPT"
📌 My Recommendation (For Learning)
If you really want to understand:
Go in this order:
1️⃣ First, fully grasp these parts:
- Value class
- backward()
- linear()
- softmax()
2️⃣ Then:
- Draw the Attention part as you read
- Sketch out the Q K V matrices on paper
3️⃣ Finally:
- Training loop
Want to dive deeper into:
👉 "Explain just the Attention part line by line" 👉 "Explain Value class with examples" 👉 "Convert this to PyTorch"
Which one interests you?