LLMs are not magic brains. They are prediction machines built from a few repeatable parts: tokens, vectors, attention, memory-like feed-forward layers, and a loop that keeps choosing the next likely piece of text.
- The whole idea in one minute
- 1. Tokens: the model's alphabet is not your alphabet
- 2. Embeddings: IDs become meaning-shaped numbers
- 3. Position: the model needs word order
- 4. Attention: tokens decide what to pay attention to
- 5. Multi-head attention: many views at once
- 6. Feed-forward networks: where a lot of learned structure lives
- 7. Residual stream and normalization: keeping deep models trainable
- 8. Next-token prediction: the answer is built one piece at a time
- 9. Architecture vs weights: why models feel different
- 10. GPT-2 and MoE: two useful milestones
  - GPT-2: scaling the next-token game
  - MoE: not every token needs the whole building
- 11. The AI ecosystem: MCP, tools, RAG, agents, and evals
  - MCP: the USB-C idea for AI tools
  - RAG: giving the model an open book
  - Agents: the loop around the model
- A friendly checklist for understanding any LLM answer
- Further reading
Source note: this is an original, beginner-friendly rewrite inspired by Kato's article How LLMs Actually Work, with extra examples, code, tables, and Notion-native structure.
The whole idea in one minute
An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.
That is the simple version. The useful version is this:
- Your prompt is split into tokens, which are small text pieces.
- Each token becomes a vector, which is a list of numbers that carries learned meaning.
- The model adds information about order, because "dog bites man" and "man bites dog" do not mean the same thing.
- Attention lets each token decide which earlier tokens matter.
- A feed-forward network does deeper processing for each token.
- Residual connections and normalization keep the many layers stable.
- The model outputs scores for the next possible token.
- One token is chosen, added to the text, and the loop repeats.
flowchart LR
A["You type a prompt"] --> B["Tokenizer<br>text pieces"]
B --> C["Embeddings<br>meaning as numbers"]
C --> D["Position signal<br>word order"]
D --> E["Attention<br>what should matter?"]
E --> F["Feed-forward layer<br>deeper processing"]
F --> G["Next-token scores"]
G --> H["Pick one token"]
H --> I["Add it to the text"]
I --> E
A good mental model: an LLM is like an autocomplete system that has read a massive library and learned incredibly subtle patterns about what usually follows what.
| Part | Plain-English job | Why it matters |
|---|---|---|
| Tokens | Break text into pieces | The model cannot read raw words or letters directly. |
| Embeddings | Turn pieces into meaning-shaped numbers | Similar ideas can sit near each other in number-space. |
| Position | Tell the model where each piece appears | Order changes meaning. |
| Attention | Let tokens look at useful previous tokens | This is how context flows through the sentence. |
| Feed-forward network | Process each token more deeply | A lot of learned structure lives here. |
| Next-token prediction | Score likely continuations | This is the generation loop behind every answer. |
1. Tokens: the model's alphabet is not your alphabet
Models do not see your sentence the way you do. You see words. The model sees token IDs.
A tokenizer splits a sentence into pieces and assigns each piece an integer ID.
Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.
Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.
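To make the text-to-integers step concrete, here is a minimal sketch of a tokenizer. Everything in it is an assumption made for illustration: the vocabulary, the ID numbers, and the greedy longest-match rule. Real tokenizers learn their pieces from data (for example with byte-pair encoding) and will split text differently.

```python
# Toy vocabulary: the pieces and ID numbers are invented for illustration.
TOY_VOCAB = {
    "straw": 101, "berry": 102, "un": 103, "believ": 104, "able": 105,
    " ": 1,
    # Single letters act as a fallback so any lowercase word can be encoded.
    **{ch: 200 + i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")},
}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into (piece, id) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces


tokens = tokenize("unbelievable strawberry")
print([piece for piece, _ in tokens])     # ['un', 'believ', 'able', ' ', 'straw', 'berry']
print([token_id for _, token_id in tokens])  # [103, 104, 105, 1, 101, 102]
```

Notice that "strawberry" arrives as two IDs rather than ten letters, while a word missing from the vocabulary, such as "cat", falls back to one ID per letter.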
Slightly technical: why the strawberry counting problem happens
When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. That means character-level questions can be awkward unless the model deliberately reasons about spelling.
2. Embeddings: IDs become meaning-shaped numbers
A token ID by itself is just a label. ID 11205 does not mean robot unless the model has a learned table that says what vector should represent that token.
That table is called the embedding matrix. Think of it as a huge spreadsheet:
- Every token ID gets one row.
- Every row contains many numbers.
- Those numbers are learned during training.
- The row becomes the token's starting representation.
If two tokens are used in similar situations, their vectors often end up close together. Words like doctor, nurse, and hospital tend to live near related medical concepts. This was not hand-labeled by a person; it emerges because those relationships help the model predict text.
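The spreadsheet picture can be sketched in a few lines. The table below is hypothetical: three rows of four hand-typed numbers, standing in for a learned matrix that would really have tens of thousands of rows and far more columns. Cosine similarity is one common way to measure how close two rows are.

```python
import math

# Hypothetical embedding table: token ID -> row of numbers.
# Real values are learned during training, not typed in by hand.
EMBEDDINGS = {
    0: [0.9, 0.8, 0.1, 0.0],  # pretend this ID means "doctor"
    1: [0.8, 0.9, 0.2, 0.1],  # pretend this ID means "nurse"
    2: [0.0, 0.1, 0.9, 0.8],  # pretend this ID means "guitar"
}


def embed(token_ids):
    """Look up one row per token ID."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doctor, nurse, guitar = embed([0, 1, 2])
print(cosine(doctor, nurse) > cosine(doctor, guitar))  # True
```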
Embeddings are not definitions. They are coordinates learned from usage. The model learns that concepts are related because they appear in related contexts.
Slightly technical: vector arithmetic
An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like king - man + woman ≈ queen can sometimes work. It is geometry, not a dictionary.
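The arithmetic itself is easy to try on a toy. The vectors below are hand-built, not learned: each one has three made-up axes (royalty, male, female), which is far cleaner than anything a real model produces. Treat it as a picture of the geometry, not evidence about any particular model.

```python
# Hand-built vectors with invented axes: (royalty, male, female).
TOY_VECTORS = {
    "king":  [1, 1, 0],
    "queen": [1, 0, 1],
    "man":   [0, 1, 0],
    "woman": [0, 0, 1],
    "child": [0, 0, 0],
}


def analogy(a, b, c, table=TOY_VECTORS):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [word for word in table if word not in (a, b, c)]

    def squared_distance(word):
        return sum((p - q) ** 2 for p, q in zip(table[word], target))

    return min(candidates, key=squared_distance)


print(analogy("king", "man", "woman"))  # queen
```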
3. Position: the model needs word order
A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:
The dog chased the boy.
The boy chased the dog.
The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use RoPE, short for Rotary Position Embeddings, where position is represented by rotating parts of the vector.
You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.
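For readers who do want a peek at the rotation idea, here is a minimal sketch on a single pair of numbers. The frequency value and the example vectors are assumptions; real RoPE rotates many pairs per vector, each at a different frequency. The property worth seeing is that the match score between two tokens depends on how far apart they are, not on where they sit in absolute terms.

```python
import math


def rotate(pair, position, freq=0.5):
    """Rotate a 2-number slice of a vector by an angle proportional to position."""
    x, y = pair
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    qx, qy = rotate(q, q_pos)
    kx, ky = rotate(k, k_pos)
    return qx * kx + qy * ky


q, k = (1.0, 0.5), (0.3, 0.8)

# Same gap of 2 tokens, very different absolute positions: same score.
print(abs(score(q, k, 5, 3) - score(q, k, 105, 103)) < 1e-9)  # True

# Different gap (5 tokens instead of 2): the score changes.
print(abs(score(q, k, 5, 3) - score(q, k, 5, 0)) > 0.1)  # True
```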
Practical takeaway: important context usually works best near the start or end of a long prompt. Many models are weaker at using information buried in the middle.
Slightly technical: why long context is still hard
Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.
4. Attention: tokens decide what to pay attention to
Attention is the heart of the transformer. It lets each token ask: which previous tokens should shape my current meaning?
For each token, the model creates three learned views:
| Name | Question it answers | Everyday analogy |
|---|---|---|
| Query | What am I looking for? | A search request |
| Key | What do I match with? | A label on stored information |
| Value | What information should be passed along? | The content you copy after finding a match |
Imagine the sentence:
The cat that I saw yesterday was sleeping.
When the model reaches was, it needs to know what was sleeping. Attention can give more weight to cat than to yesterday, because cat is more useful for understanding the verb.
GPT-style models use causal masking: while predicting the next token, they can look backward but not forward. Future text is hidden because it has not been generated yet.
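Query, key, value, and the causal mask fit together in a short function. This is a single-head sketch of scaled dot-product attention in plain Python; the query, key, and value vectors are made-up numbers, whereas a real model would compute them from each token's vector using learned weights.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: token i scores tokens 0..i and never sees later ones.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Blend the visible value vectors using the attention weights.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens with invented 2-number queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(Q, K, V)
print([len(w) for w in weights])  # [1, 2, 3]
```

The growing weight lists show the mask at work: the first token can only attend to itself, the second to two tokens, and so on.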
5. Multi-head attention: many views at once
One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.
Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.
Slightly technical: heads are learned projections, not fixed slices
Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. It learns its own way to view the whole token representation.
The model then combines the outputs from all heads and sends the result onward.
A practical detail: during generation, the model stores old key and value vectors in a KV cache. That way it does not need to recompute the entire conversation every time it adds one new token.
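The caching idea can be sketched as a small class. The name `KVCache` and its `step` method are hypothetical, and the vectors are made up; the point is only that each new token appends one key and one value, then attends over everything already stored instead of rebuilding it.

```python
import math


class KVCache:
    """Hypothetical single-head cache of past key and value vectors."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Store this token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        dim = len(key)
        scores = [
            sum(a * b for a, b in zip(query, k)) / math.sqrt(dim)
            for k in self.keys
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [
            sum(w * v[c] for w, v in zip(weights, self.values))
            for c in range(len(value))
        ]


cache = KVCache()
for q, k, v in [
    ([1.0, 0.0], [1.0, 0.0], [5.0, 6.0]),
    ([0.0, 1.0], [0.0, 1.0], [7.0, 8.0]),
    ([1.0, 1.0], [1.0, 1.0], [9.0, 1.0]),
]:
    output = cache.step(q, k, v)

print(len(cache.keys))  # 3
```

Each call computes one new key and one new value. Without the cache, generating token N would mean recomputing keys and values for all N tokens again.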
6. Feed-forward networks: where a lot of learned structure lives
After attention mixes information between tokens, each token goes through a feed-forward network.
Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.
The rough pattern is:
- Expand the vector into a larger space.
- Apply a non-linear function.
- Compress it back down.
The non-linear step matters because it lets the model learn richer patterns. Without it, the expand and compress steps would multiply out into a single linear transformation, no matter how wide the middle is.
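Here is the expand, non-linearity, compress pattern on a toy. The weights are hand-picked assumptions, chosen so the network computes the absolute value of each input number, which no purely linear map can do. ReLU stands in for whatever activation a given model actually uses.

```python
def relu(x):
    """Keep positive numbers, replace negative ones with zero."""
    return max(0.0, x)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def feed_forward(x, w_up, w_down, activation=relu):
    hidden = matvec(w_up, x)                  # expand: 2 numbers -> 4
    hidden = [activation(h) for h in hidden]  # non-linear step
    return matvec(w_down, hidden)             # compress: 4 numbers -> 2


# Hand-picked weights (an assumption for illustration, not learned values).
W_UP = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W_DOWN = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

print(feed_forward([3.0, -2.0], W_UP, W_DOWN))  # [3.0, 2.0]
print(feed_forward([-3.0, 2.0], W_UP, W_DOWN))  # [3.0, 2.0]

# Remove the non-linearity and the two matrices cancel into a linear map.
print(feed_forward([3.0, -2.0], W_UP, W_DOWN, activation=lambda h: h))  # [0.0, 0.0]
```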
A lot of model parameters live in feed-forward layers. This is one reason they are often discussed as the model's learned store of patterns, facts, and associations.
Slightly technical: dense models vs mixture of experts
In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.
7. Residual stream and normalization: keeping deep models trainable
A modern LLM can have dozens of layers, and the largest stack more than a hundred. If each layer simply replaced the previous representation, training would be fragile.
Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.
This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.
Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.
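Both ideas fit in a few lines. This sketch makes two simplifying assumptions: the normalization skips the learned scale and shift that real layer norm has, and the block normalizes before the sublayer (the "pre-norm" arrangement), which is only one of the orderings used in practice. The sublayer here is a made-up stand-in for attention or a feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_block(stream, sublayer):
    """Normalize, transform, then ADD the result back onto the stream."""
    update = sublayer(layer_norm(stream))
    return [s + u for s, u in zip(stream, update)]


def small_refinement(h):
    """Stand-in sublayer: nudges each number by 10% of its normalized value."""
    return [0.1 * v for v in h]


stream = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):
    stream = residual_block(stream, small_refinement)

print([round(v, 2) for v in stream])
```

Because each block only adds to the stream, a sublayer that outputs zeros leaves the representation untouched, which is part of why deep stacks stay trainable.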
The boring-sounding parts matter. Residual connections and normalization are major reasons very deep transformer stacks can actually train.
8. Next-token prediction: the answer is built one piece at a time
At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. A softmax converts them into probabilities.
Then a decoding strategy chooses one token.
| Setting | Plain-English effect | When useful |
|---|---|---|
| Temperature | Controls randomness | Lower for precise answers, higher for creative drafts |
| Top-k | Only considers the k most likely tokens | Prevents very unlikely choices |
| Top-p | Considers the smallest set of most likely tokens whose probabilities add up to at least p | Flexible sampling without fixed k |
That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: given everything so far, what token should come next?
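The three settings in the table can be sketched directly. The logits and the four-word vocabulary are invented, and production decoders differ in details such as tie-breaking and the order in which filters are applied, so read this as an illustration of the ideas only.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out everything not in `keep`, then rescale to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def filter_top_k(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def filter_top_p(probs, p):
    """Keep the smallest set of likely tokens whose total reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = [], 0.0
    for i in order:
        keep.append(i)
        running += probs[i]
        if running >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng):
    """Pick one token index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


VOCAB = ["Paris", "London", "banana", "xyz"]  # made-up vocabulary
LOGITS = [2.0, 1.0, 0.1, -1.0]                # made-up scores

probs = softmax(LOGITS, temperature=0.7)
choice = sample(filter_top_k(probs, k=2), random.Random(0))
print(VOCAB[choice])
```

With `k=2`, the sampler can only ever return "Paris" or "London", however the random draw falls.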
This also explains hallucinations. The base training objective rewards plausible continuation, not guaranteed truth. Post-training, retrieval, tool use, and evaluation are added to make outputs more useful and reliable.
9. Architecture vs weights: why models feel different
Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:
- Training data: what they learned from.
- Scale: how many layers, heads, and parameters the model has, and how many tokens it was trained on.
- Architecture choices: dense or mixture-of-experts, attention variants, context length, tokenizer.
- Post-training: instruction tuning, preference training, safety behavior, tool use, and product-level rules.
So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.
Slightly technical: modern transformer vocabulary
RoPE: position through vector rotation.
RMSNorm: a cheaper normalization variant used in many modern open models.
SwiGLU: a popular activation/feed-forward design.
GQA: grouped-query attention, which reduces KV-cache memory.
MoE: mixture of experts, where only selected expert networks run for each token.
10. GPT-2 and MoE: two useful milestones
Two research threads make the mechanics above feel more concrete. GPT-2 showed how far plain next-token prediction could go when scaled. Mixture of Experts shows how a model can grow more capable without forcing every token to use every parameter.
Plain-English mental model: GPT-2 is like one very large generalist team. MoE is like a building with specialist rooms, where a router sends each token to only the rooms that seem useful.
GPT-2: scaling the next-token game
OpenAI's 2019 paper Language Models are Unsupervised Multitask Learners made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.
- It was autoregressive: it generated left to right, one token at a time.
- It was dense: every token passed through the same model weights.
- It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.
MoE: not every token needs the whole building
A dense transformer usually runs every token through the same feed-forward network. In a Mixture-of-Experts model, a small router chooses only a few expert networks for each token. The model can have many more total parameters, while each token activates only a subset.
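A minimal sketch of the routing step, under stated assumptions: four toy "experts" that just scale their input, a hand-written router matrix, and a softmax taken over only the selected experts' scores. Real MoE layers use learned routers and full feed-forward experts, and they differ in how gate weights are computed and balanced.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, top_k=2):
    """Score every expert, keep the top_k, and normalize their gate weights."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_layer(token, router_weights, experts, top_k=2):
    """Run only the chosen experts and blend their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, gate in route(token, router_weights, top_k):
        expert_output = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output


calls = []  # records which experts actually ran


def make_expert(name, scale):
    """Hypothetical expert: multiplies the token by a fixed number."""
    def expert(token):
        calls.append(name)
        return [scale * v for v in token]
    return expert


EXPERTS = [make_expert(f"expert_{i}", s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
ROUTER = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # invented weights

output = moe_layer([2.0, 0.5], ROUTER, EXPERTS)
print(calls)  # ['expert_0', 'expert_1']
```

Four experts exist, but only two ran for this token. That gap between total and active parameters is the whole appeal.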
| Concept | Dense LLM | MoE LLM |
|---|---|---|
| Work per token | Uses the same main blocks | Uses selected experts |
| Analogy | One big generalist team | Router plus specialist teams |
| Tradeoff | Simpler to train and serve | More capacity, more routing complexity |
Slightly technical: where the MoE papers fit
Important nuance: MoE does not automatically mean smarter. Data quality, routing balance, training stability, inference hardware, and post-training still matter.
11. The AI ecosystem: MCP, tools, RAG, agents, and evals
The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.
Plain-English map: the LLM is the text brain, tools are the hands, RAG is the open-book notes, MCP is a standard plug for external systems, agents are the loop that decides what to do next, and evals are the tests that tell you if any of it works.
| Term | Simple meaning | What it helps with | Watch out for |
|---|---|---|---|
| Prompt | Instructions and context | Steering behavior without changing weights | Vague prompts create vague answers |
| Tool calling | The model asks your app to run a function | Weather, search, payments, calendars, databases | Validate every argument before doing anything real |
| MCP | A shared protocol for connecting AI apps to tools/data | Reusable integrations across different hosts | Permissions, auth, and tool descriptions matter |
| RAG | Retrieve relevant documents before answering | Fresh facts and private knowledge | Bad retrieval creates confident wrong answers |
| Embeddings | Meaning as searchable vectors | Semantic search and clustering | Similar does not always mean correct |
| Agent | A model inside a task loop | Planning, tool use, retries, handoffs | Needs limits, logs, and stop conditions |
| Fine-tuning | Training on examples of desired behavior | Style, format, classification, repeated edge cases | Do evals first; do not use it as a fact database |
| Evals | Tests for model behavior | Comparing prompts, tools, models, and releases | Tiny demo tests miss real-world messiness |
MCP: the USB-C idea for AI tools
MCP stands for Model Context Protocol. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.
The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.
Security rule: treat tools like real permissions, not decorations. If a tool can send email, delete files, spend money, or publish content, the app should require clear approval, scoped access, logging, and argument validation.
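One way to picture that rule is a gate that every model-proposed tool call must pass before anything runs. The tool names, schemas, and approval set below are all hypothetical, and the type checks are deliberately simple. A real app would add authentication, scoped credentials, and logging on top.

```python
# Hypothetical tool registry: tool name -> expected argument names and types.
ALLOWED_TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str, "body": str},
}

# Tools with real-world side effects need an explicit human yes.
NEEDS_APPROVAL = {"send_email"}


def check_tool_call(name, arguments, approved=False):
    """Return (ok, reason) for a tool call proposed by the model."""
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        return False, f"unknown tool: {name}"
    if set(arguments) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        if not isinstance(arguments[key], expected_type):
            return False, f"{key} must be {expected_type.__name__}"
    if name in NEEDS_APPROVAL and not approved:
        return False, "user approval required"
    return True, "ok"


print(check_tool_call("get_weather", {"city": "Oslo"}))
# (True, 'ok')
print(check_tool_call("send_email", {"to": "a@example.com", "subject": "Hi", "body": "Hello"}))
# (False, 'user approval required')
```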
RAG: giving the model an open book
RAG means Retrieval-Augmented Generation. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.
- Split documents into chunks.
- Turn each chunk into an embedding vector.
- Store those vectors in a search index or vector database.
- When the user asks something, search for similar chunks.
- Put the best chunks into the model context and ask for a grounded answer.
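The steps above can be sketched end to end with a stand-in for the embedding model. Here `embed` just counts words, which is an assumption made so the example runs without any external service; a real system would call an embedding model and store the vectors in an index. The chunks and the prompt wording are invented too.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system would call a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_n=2):
    """Rank chunks by similarity to the question and keep the best ones."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_n]


def build_prompt(question, chunks, top_n=2):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, chunks, top_n))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]

print(retrieve("how long do refunds take", CHUNKS, top_n=1))
# ['Refunds are issued within 14 days of purchase.']
```

Word counting only matches exact words, so it misses paraphrases that a learned embedding would catch. That gap is the reason real RAG systems use embedding models.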
Agents: the loop around the model
An agent is not a new kind of brain. It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.
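The loop is small enough to write out. In this sketch the "model" is a scripted function, which is an assumption so the example runs on its own; in a real agent `choose_step` would be an LLM call. The action format and the `add` tool are hypothetical as well.

```python
def run_agent(goal, choose_step, tools, max_steps=5):
    """Ask for the next step, run a tool if requested, stop when finished."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = choose_step(history)  # in a real agent, an LLM call
        if action["type"] == "finish":
            return action["answer"], history
        result = tools[action["tool"]](**action["args"])
        history.append((action["tool"], result))
    return None, history  # step limit reached without an answer


def scripted_policy(history):
    """Stand-in for the model: call one tool, then report its result."""
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The sum is {history[-1][1]}"}


TOOLS = {"add": lambda a, b: a + b}

answer, history = run_agent("add 2 and 3", scripted_policy, TOOLS)
print(answer)  # The sum is 5
```

The `max_steps` limit is the stop condition: a policy that never finishes gets cut off instead of looping forever.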
Evals are what turn AI from a cool demo into an engineering system. Before shipping a new prompt, model, tool, or agent flow, test it on examples that represent real users, failure cases, and edge cases.
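At its simplest, an eval is a list of inputs paired with checks. The harness below is a minimal sketch: `fake_model` is a stand-in that returns canned answers, and the three cases are invented. Real eval sets are much larger and often use graded rubrics rather than string checks.

```python
def run_evals(model_fn, cases):
    """Run every case and return (pass rate, inputs that failed)."""
    results = [(case["input"], case["check"](model_fn(case["input"]))) for case in cases]
    passed = sum(1 for _, ok in results if ok)
    failures = [case_input for case_input, ok in results if not ok]
    return passed / len(results), failures


def fake_model(prompt):
    """Stand-in for a real model call; returns canned answers."""
    canned = {"2+2": "4", "capital of France": "Paris"}
    return canned.get(prompt, "I don't know")


CASES = [
    {"input": "2+2", "check": lambda out: out.strip() == "4"},
    {"input": "capital of France", "check": lambda out: "Paris" in out},
    {"input": "capital of Peru", "check": lambda out: "Lima" in out},
]

score, failures = run_evals(fake_model, CASES)
print(round(score, 2), failures)  # 0.67 ['capital of Peru']
```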
Slightly technical: how these pieces fit in one product
A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.
A friendly checklist for understanding any LLM answer
- Did the model receive the right information in the prompt?
- Was the important context near the beginning or end?
- Is the task asking for facts, reasoning, creativity, or formatting?
- Would retrieval or a tool make the answer more grounded?
- Should the output be checked against a source before trusting it?
If you remember one thing, remember this: LLMs transform text into numbers, let those numbers exchange context through attention, and then predict the next token again and again until an answer appears.
Further reading
- Kato, How LLMs Actually Work
- Vaswani et al., Attention Is All You Need
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts
- Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2)
- Fedus, Zoph, and Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Artetxe et al., Efficient Large Scale Language Modeling with Mixtures of Experts
- Jiang et al., Mixtral of Experts
- Model Context Protocol, Architecture overview
- OpenAI, Function calling / tool calling guide
- OpenAI, Introducing text and code embeddings
- OpenAI Agents SDK, Agents guide
- OpenAI, Supervised fine-tuning guide