How LLMs Actually Work: A Friendly Map for Humans • oreoro

🧭

LLMs are not magic brains. They are prediction machines built from a few repeatable parts: tokens, vectors, attention, memory-like feed-forward layers, and a loop that keeps choosing the next likely piece of text.

The whole idea in one minute
1. Tokens: the model's alphabet is not your alphabet
2. Embeddings: IDs become meaning-shaped numbers
3. Position: the model needs word order
4. Attention: tokens decide what to pay attention to
5. Multi-head attention: many views at once
6. Feed-forward networks: where a lot of learned structure lives
7. Residual stream and normalization: keeping deep models trainable
8. Next-token prediction: the answer is built one piece at a time
9. Architecture vs weights: why models feel different
10. GPT-2 and MoE: two useful milestones
    GPT-2: scaling the next-token game
    MoE: not every token needs the whole building
11. The AI ecosystem: MCP, tools, RAG, agents, and evals
    MCP: the USB-C idea for AI tools
    RAG: giving the model an open book
    Agents: the loop around the model
A friendly checklist for understanding any LLM answer
Further reading
Interlinked Content

✍️

Source note: this is an original, beginner-friendly rewrite inspired by Kato's article How LLMs Actually Work, with extra examples, code, tables, and Notion-native structure.

The whole idea in one minute

An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.

That is the simple version. The useful version is this:

Your prompt is split into tokens, which are small text pieces.
Each token becomes a vector, which is a list of numbers that carries learned meaning.
The model adds information about order, because dog bites man and man bites dog do not mean the same thing.
Attention lets each token decide which earlier tokens matter.
A feed-forward network does deeper processing for each token.
Residual connections and normalization keep the many layers stable.
The model outputs scores for the next possible token.
One token is chosen, added to the text, and the loop repeats.

flowchart LR
    A["You type a prompt"] --> B["Tokenizer<br>text pieces"]
    B --> C["Embeddings<br>meaning as numbers"]
    C --> D["Position signal<br>word order"]
    D --> E["Attention<br>what should matter?"]
    E --> F["Feed-forward layer<br>deeper processing"]
    F --> G["Next-token scores"]
    G --> H["Pick one token"]
    H --> I["Add it to the text"]
    I --> C

💡

A good mental model: an LLM is like an autocomplete system that has read a massive library and learned incredibly subtle patterns about what usually follows what.

Part	Plain-English job	Why it matters
Tokens	Break text into pieces	The model cannot read raw words or letters directly.
Embeddings	Turn pieces into meaning-shaped numbers	Similar ideas can sit near each other in number-space.
Position	Tell the model where each piece appears	Order changes meaning.
Attention	Let tokens look at useful previous tokens	This is how context flows through the sentence.
Feed-forward network	Process each token more deeply	A lot of learned structure lives here.
Next-token prediction	Score likely continuations	This is the generation loop behind every answer.

1. Tokens: the model's alphabet is not your alphabet

Models do not see your sentence the way you do. You see words. The model sees token IDs.

A tokenizer splits a sentence into small pieces and assigns each piece an integer ID.

Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.
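To make the text-to-integers step concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the ID numbers, and the greedy longest-match rule are all invented for illustration; real tokenizers (for example, byte-pair encoding) learn their pieces from data and use different IDs.

```python
# Toy vocabulary: pieces and IDs are made up for illustration only.
TOY_VOCAB = {"straw": 0, "berry": 1, "un": 2, "believ": 3, "able": 4, " ": 5}


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match split; unknown characters get a per-character ID."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that starts at position i.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            # Fallback (an assumption of this sketch): one ID per unknown character.
            ids.append(1000 + ord(text[i]))
            i += 1
    return ids


print(toy_tokenize("strawberry", TOY_VOCAB))    # [0, 1]
print(toy_tokenize("unbelievable", TOY_VOCAB))  # [2, 3, 4]
```

Notice that the model would receive two integers for strawberry, not ten letters. That is the root of the letter-counting awkwardness.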

Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.

Slightly technical: why the strawberry counting problem happens

When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. That means character-level questions can be awkward unless the model deliberately reasons about spelling.

2. Embeddings: IDs become meaning-shaped numbers

A token ID by itself is just a label. ID 11205 does not mean robot unless the model has a learned table that says what vector should represent that token.

That table is called the embedding matrix. Think of it as a huge spreadsheet:

Every token ID gets one row.
Every row contains many numbers.
Those numbers are learned during training.
The row becomes the token's starting representation.

If two tokens are used in similar situations, their vectors often end up close together. Words like doctor, nurse, and hospital tend to live near related medical concepts. This was not hand-labeled by a person; it emerges because those relationships help the model predict text.
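A small sketch of the lookup-and-compare idea, assuming a hand-written three-number embedding table. Real tables are learned and have hundreds or thousands of numbers per row; the token IDs and values below are hypothetical.

```python
import math

# Hypothetical embedding table: token ID -> row of numbers (hand-picked, not learned).
EMBEDDINGS = {
    0: [0.9, 0.8, 0.1],  # "doctor"
    1: [0.8, 0.9, 0.2],  # "nurse"
    2: [0.1, 0.2, 0.9],  # "guitar"
}


def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doctor, nurse, guitar = EMBEDDINGS[0], EMBEDDINGS[1], EMBEDDINGS[2]
print(cosine(doctor, nurse) > cosine(doctor, guitar))  # True
```

Looking up a row is just indexing by token ID. "Close together" is usually measured with something like this cosine score.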

🧠

Embeddings are not definitions. They are coordinates learned from usage. The model learns that concepts are related because they appear in related contexts.

Slightly technical: vector arithmetic

An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like king - man + woman ≈ queen can sometimes work. It is geometry, not a dictionary.

3. Position: the model needs word order

A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:

The dog chased the boy.

The boy chased the dog.

The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use RoPE, short for Rotary Position Embeddings, where position is represented by rotating parts of the vector.

You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.
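For the curious, here is a rough sketch of the rotation idea behind RoPE, reduced to a single pair of numbers. The angle step `theta=0.5` and the vectors are arbitrary choices for this example; real models rotate many pairs at different frequencies.

```python
import math


def rotate_pair(vec, position, theta=0.5):
    """Rotate a 2-number slice of a vector by an angle proportional to position."""
    x, y = vec
    angle = position * theta
    return [
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5]  # toy query slice
k = [0.3, 0.8]  # toy key slice

near = dot(rotate_pair(q, 5), rotate_pair(k, 3))         # positions 5 and 3
shifted = dot(rotate_pair(q, 105), rotate_pair(k, 103))  # same gap, later in the text
far = dot(rotate_pair(q, 5), rotate_pair(k, 1))          # a larger gap

print(math.isclose(near, shifted, abs_tol=1e-9))  # True
```

The match score depends on how far apart the two tokens are, not on where they sit in absolute terms. That is the property that makes rotation a convenient position signal.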

📌

Practical takeaway: important context usually works best near the start or end of a long prompt. Many models are weaker at using information buried in the middle.

Slightly technical: why long context is still hard

Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.

4. Attention: tokens decide what to pay attention to

Attention is the heart of the transformer. It lets each token ask: which previous tokens should shape my current meaning?

For each token, the model creates three learned views:

Name	Question it answers	Everyday analogy
Query	What am I looking for?	A search request
Key	What do I match with?	A label on stored information
Value	What information should be passed along?	The content you copy after finding a match

Imagine the sentence:

The cat that I saw yesterday was sleeping.

When the model reaches was, it needs to know what was sleeping. Attention can give more weight to cat than to yesterday, because cat is more useful for understanding the verb.

🔒

GPT-style models use causal masking: while predicting the next token, they can look backward but not forward. During training the later tokens exist but are masked out; during generation they simply have not been produced yet.
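Here is a minimal single-head sketch of attention with a causal mask, in plain Python. It assumes the query, key, and value vectors are already given; in a real model they come from learned projections of each token's vector, and the numbers below are toy values.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where token i only sees tokens 0..i."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        # Causal mask: score only the current token and the ones before it.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / scale
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted mix of the visible value vectors.
        outputs.append(
            [sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(width)]
        )
    return outputs


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = causal_attention(vectors, vectors, vectors)
print(outputs[0])  # [1.0, 0.0]: the first token can only attend to itself
```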

5. Multi-head attention: many views at once

One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.

Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.

Slightly technical: heads are learned projections, not fixed slices

Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. It learns its own way to view the whole token representation.

The model then combines the outputs from all heads and sends the result onward.

A practical detail: during generation, the model stores old key and value vectors in a KV cache. That way it does not need to recompute the entire conversation every time it adds one new token.
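A sketch of the caching idea, assuming a single attention head and toy vectors. The `KVCache` class and `attend_new_token` function are hypothetical names for this example, not any library's API.

```python
import math


class KVCache:
    """Stores one key and one value vector per token already processed."""

    def __init__(self):
        self.keys = []
        self.values = []


def attend_new_token(query, key, value, cache):
    """Attention output for the newest token, reusing cached keys and values."""
    # Only the new token's key and value are added; old ones are not recomputed.
    cache.keys.append(key)
    cache.values.append(value)
    scale = math.sqrt(len(key))
    scores = [sum(q * k for q, k in zip(query, old)) / scale for old in cache.keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(value)
    return [sum(w * v[c] for w, v in zip(weights, cache.values)) for c in range(width)]


cache = KVCache()
first = attend_new_token([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], cache)
second = attend_new_token([0.0, 1.0], [0.0, 1.0], [0.0, 1.0], cache)
print(len(cache.keys))  # 2
```

The trade is memory for speed: the cache grows by one key and one value per token, per layer, per head. That is why long conversations use a lot of memory.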

6. Feed-forward networks: where a lot of learned structure lives

After attention mixes information between tokens, each token goes through a feed-forward network.

Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.

The rough pattern is:

Expand the vector into a larger space.
Apply a non-linear function.
Compress it back down.

The non-linear step matters because it lets the model learn richer patterns. Without it, many stacked layers would collapse into something much simpler.
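The expand, activate, compress pattern can be sketched in a few lines. The weights below are hand-picked toy values, and ReLU is used as the non-linear function for simplicity; many modern models use other activations such as GELU or SwiGLU.

```python
def linear(x, weights, bias):
    """One output number per weight row: a dot product plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, w_up, b_up, w_down, b_down, activation=relu):
    hidden = activation(linear(x, w_up, b_up))  # expand, then apply the non-linearity
    return linear(hidden, w_down, b_down)       # compress back down


# Toy weights: 2 numbers -> 4 numbers -> 2 numbers.
W_UP = [[1, 0], [-1, 0], [0, 1], [0, -1]]
B_UP = [0.0, 0.0, 0.0, 0.0]
W_DOWN = [[1, 1, 0, 0], [0, 0, 1, 1]]
B_DOWN = [0.0, 0.0]

print(feed_forward([-2.0, 3.0], W_UP, B_UP, W_DOWN, B_DOWN))  # [2.0, 3.0]
```

With ReLU in the middle, this tiny network computes an absolute value, which no purely linear layer can do. Swap the activation for the identity function (`lambda h: h`) and the same weights produce `[0.0, 0.0]` for every input: the two linear steps have collapsed into one.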

🧱

A lot of model parameters live in feed-forward layers. This is one reason they are often discussed as the model's learned store of patterns, facts, and associations.

Slightly technical: dense models vs mixture of experts

In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.

7. Residual stream and normalization: keeping deep models trainable

A modern LLM can have dozens or even hundreds of layers. If each layer simply replaced the previous representation, training would be fragile.

Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.

This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.

Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.
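A minimal sketch of both ideas together. It assumes the "pre-norm" arrangement (normalize first, then add the result back) and leaves out the learned scale and shift that real layer normalization includes; `sublayer` stands in for an attention or feed-forward block.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_block(x, sublayer):
    """Add the sublayer's output to x instead of replacing x."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


x = [100.0, 200.0, 300.0]
print(layer_norm(x))  # values close to -1.22, 0.0, 1.22 regardless of the input scale

# A sublayer that contributes nothing leaves the stream untouched.
unchanged = residual_block(x, lambda h: [0.0] * len(h))
print(unchanged == x)  # True
```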

🛠️

The boring-sounding parts matter. Residual connections and normalization are major reasons very deep transformer stacks can actually train.

8. Next-token prediction: the answer is built one piece at a time

At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. A softmax converts them into probabilities.

Then a decoding strategy chooses one token.

Setting	Plain-English effect	When useful
Temperature	Controls randomness	Lower for precise answers, higher for creative drafts
Top-k	Only considers the k most likely tokens	Prevents very unlikely choices
Top-p	Considers the smallest set of most likely tokens whose probabilities add up to at least p	Flexible sampling without fixed k

That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: given everything so far, what token should come next?
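The decoding settings above can be sketched as one small function. This is an illustrative version under simple assumptions (temperature must be greater than zero, top-k is applied before top-p); real inference libraries differ in details, and the function names are made up for this example.

```python
import math
import random


def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Turn raw scores into a filtered, renormalized {token_id: probability} map."""
    scaled = [score / temperature for score in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set reaching probability p
                break
        order = kept

    kept_total = sum(probs[i] for i in order)
    return {i: probs[i] / kept_total for i in order}


def sample(probs, rng):
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.1]
print(next_token_probs(logits, top_k=1))            # {0: 1.0}
print(sorted(next_token_probs(logits, top_p=0.8)))  # [0, 1]
print(sample(next_token_probs(logits, temperature=0.7), random.Random(0)) in (0, 1, 2))  # True
```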

⚠️

This also explains hallucinations. The base training objective rewards plausible continuation, not guaranteed truth. Post-training, retrieval, tool use, and evaluation are added to make outputs more useful and reliable.

9. Architecture vs weights: why models feel different

Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:

Training data: what they learned from.
Scale: how many layers, heads, and parameters the model has, and how many training tokens it saw.
Architecture choices: dense or mixture-of-experts, attention variants, context length, tokenizer.
Post-training: instruction tuning, preference training, safety behavior, tool use, and product-level rules.

So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.

Slightly technical: modern transformer vocabulary

RoPE: position through vector rotation.

RMSNorm: a cheaper normalization variant used in many modern open models.

SwiGLU: a popular activation/feed-forward design.

GQA: grouped-query attention, which reduces KV-cache memory.

MoE: mixture of experts, where only selected expert networks run for each token.

10. GPT-2 and MoE: two useful milestones

Two research threads make the mechanics above feel more concrete. GPT-2 showed how far plain next-token prediction could go when scaled. Mixture of Experts shows how a model can grow more capable without forcing every token to use every parameter.

🧩

Plain-English mental model: GPT-2 is like one very large generalist team. MoE is like a building with specialist rooms, where a router sends each token to only the rooms that seem useful.

GPT-2: scaling the next-token game

OpenAI's 2019 paper Language Models are Unsupervised Multitask Learners made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.

It was autoregressive: it generated left to right, one token at a time.
It was dense: every token passed through the same model weights.
It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.

MoE: not every token needs the whole building

A dense transformer usually runs every token through the same feed-forward network. In a Mixture-of-Experts model, a small router chooses only a few expert networks for each token. The model can have many more total parameters, while each token activates only a subset.

Concept	Dense LLM	MoE LLM
Work per token	Uses the same main blocks	Uses selected experts
Analogy	One big generalist team	Router plus specialist teams
Tradeoff	Simpler to train and serve	More capacity, more routing complexity
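A rough sketch of top-k routing, assuming the router scores are simply handed in. In a real MoE layer those scores come from a small learned layer applied to the token's vector, and the experts are feed-forward networks rather than the toy functions below.

```python
import math


def top_k_route(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(router_scores[i] - router_scores[chosen[0]]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs by router weight."""
    output = [0.0] * len(x)
    used = []
    for index, weight in top_k_route(router_scores, k):
        expert_output = experts[index](x)  # the other experts never run
        output = [o + weight * e for o, e in zip(output, expert_output)]
        used.append(index)
    return output, used


# Four toy "experts" (hypothetical stand-ins for feed-forward networks).
experts = [
    lambda x: [v * 2 for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
output, used = moe_layer([1.0, 2.0], experts, [1.0, -1.0, 3.0, 0.5], k=2)
print(used)  # [2, 0]
```

Only two of the four experts did any work for this token, which is the whole point: total capacity grows, per-token compute does not grow with it.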

Slightly technical: where the MoE papers fit

⚖️

Important nuance: MoE does not automatically mean smarter. Data quality, routing balance, training stability, inference hardware, and post-training still matter.

11. The AI ecosystem: MCP, tools, RAG, agents, and evals

The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.

🗺️

Plain-English map: the LLM is the text brain, tools are the hands, RAG is the open-book notes, MCP is a standard plug for external systems, agents are the loop that decides what to do next, and evals are the tests that tell you if any of it works.

Term	Simple meaning	What it helps with	Watch out for
Prompt	Instructions and context	Steering behavior without changing weights	Vague prompts create vague answers
Tool calling	The model asks your app to run a function	Weather, search, payments, calendars, databases	Validate every argument before doing anything real
MCP	A shared protocol for connecting AI apps to tools/data	Reusable integrations across different hosts	Permissions, auth, and tool descriptions matter
RAG	Retrieve relevant documents before answering	Fresh facts and private knowledge	Bad retrieval creates confident wrong answers
Embeddings	Meaning as searchable vectors	Semantic search and clustering	Similar does not always mean correct
Agent	A model inside a task loop	Planning, tool use, retries, handoffs	Needs limits, logs, and stop conditions
Fine-tuning	Training on examples of desired behavior	Style, format, classification, repeated edge cases	Do evals first; do not use it as a fact database
Evals	Tests for model behavior	Comparing prompts, tools, models, and releases	Tiny demo tests miss real-world messiness

MCP: the USB-C idea for AI tools

MCP stands for Model Context Protocol. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.

The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.

🔐

Security rule: treat tools like real permissions, not decorations. If a tool can send email, delete files, spend money, or publish content, the app should require clear approval, scoped access, logging, and argument validation.
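As a sketch of what that rule can look like in application code: the tool names, argument specs, and approval flags below are hypothetical and are not part of MCP or any specific SDK. The point is only that the app checks a model-requested call before running anything.

```python
# Hypothetical tool registry kept by the app, not by the model.
TOOLS = {
    "get_weather": {"args": {"city": str}, "needs_approval": False},
    "send_email": {"args": {"to": str, "body": str}, "needs_approval": True},
}


def check_tool_call(name, args, approved=False):
    """Return (ok, reason) for a tool call the model asked for."""
    spec = TOOLS.get(name)
    if spec is None:
        return False, "unknown tool"
    if set(args) != set(spec["args"]):
        return False, "unexpected or missing arguments"
    for key, expected_type in spec["args"].items():
        if not isinstance(args[key], expected_type):
            return False, f"argument {key!r} has the wrong type"
    if spec["needs_approval"] and not approved:
        return False, "needs user approval"
    return True, "ok"


print(check_tool_call("get_weather", {"city": "Oslo"}))                  # (True, 'ok')
print(check_tool_call("send_email", {"to": "a@example.com", "body": "hi"}))  # (False, 'needs user approval')
```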

RAG: giving the model an open book

RAG means Retrieval-Augmented Generation. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.

Split documents into chunks.
Turn each chunk into an embedding vector.
Store those vectors in a search index or vector database.
When the user asks something, search for similar chunks.
Put the best chunks into the model context and ask for a grounded answer.
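The steps above can be sketched end to end with a toy retriever. The `embed` function here is a stand-in (a bag-of-words counter) for a real embedding model, the list of chunks plays the role of the vector database, and the sample documents are invented.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an embedding model: counts of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_n=1):
    """Rank chunks by similarity to the question and keep the best ones."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:top_n]


def build_prompt(question, chunks):
    context = "\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


CHUNKS = [
    "refunds are processed within five business days",
    "our office is closed on public holidays",
    "shipping is free for orders over fifty dollars",
]
print(retrieve("how long do refunds take", CHUNKS))
# ['refunds are processed within five business days']
```

The prompt returned by `build_prompt` is what would be sent to the model. If retrieval picks the wrong chunk, the model will answer confidently from the wrong notes.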

Agents: the loop around the model

An agent is not a new kind of brain. It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.
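Here is a bare-bones sketch of such a loop. The `choose_step` callback stands in for the LLM, and the action format (a dict with `type`, `tool`, `args`, `answer`) is invented for this example rather than taken from any framework.

```python
def run_agent(goal, choose_step, tools, max_steps=5):
    """Loop: ask the model for a step, run the tool, record the result, repeat."""
    history = []
    for _ in range(max_steps):
        action = choose_step(goal, history)
        if action["type"] == "finish":
            return action["answer"], history
        result = tools[action["tool"]](**action["args"])
        history.append((action["tool"], result))  # log every tool call
    return None, history  # stop condition: step budget used up


def scripted_model(goal, history):
    """Stand-in for an LLM: call the add tool once, then finish."""
    if not history:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The result is {history[-1][1]}"}


answer, log = run_agent("add 2 and 3", scripted_model, {"add": lambda a, b: a + b})
print(answer)  # The result is 5
```

The `max_steps` limit and the `history` log are the important parts: without a budget and a trace, a confused agent can loop forever and leave no clue why.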

🧪

Evals are what turn AI from a cool demo into an engineering system. Before shipping a new prompt, model, tool, or agent flow, test it on examples that represent real users, failure cases, and edge cases.
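At its simplest, an eval is a list of cases and a score. The sketch below uses exact-match grading, which is a deliberate simplification (real evals often need fuzzier graders), and `toy_system` is a hypothetical stand-in for a prompt-plus-model pipeline.

```python
def exact_match_eval(system, cases):
    """Return (score, failing_inputs) for a system under test."""
    failures = [case["input"] for case in cases if system(case["input"]) != case["expected"]]
    score = 1 - len(failures) / len(cases)
    return score, failures


CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "", "expected": "Please ask a question."},  # edge case: empty input
]


def toy_system(prompt):
    """Stand-in for the real pipeline being tested."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "I don't know")


score, failures = exact_match_eval(toy_system, CASES)
print(failures)  # ['']
```

The happy-path cases pass and the empty-input edge case fails, which is exactly the kind of thing a demo never shows you.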

Slightly technical: how these pieces fit in one product

A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.

A friendly checklist for understanding any LLM answer

Did the model receive the right information in the prompt?

Was the important context near the beginning or end?

Is the task asking for facts, reasoning, creativity, or formatting?

Would retrieval or a tool make the answer more grounded?

Should the output be checked against a source before trusting it?

✅

If you remember one thing, remember this: LLMs transform text into numbers, let those numbers exchange context through attention, and then predict the next token again and again until an answer appears.