LLMs don’t read text; they read lists of floating-point numbers. Here is a visual guide to the evolution, linear algebra, and code behind how AI captures human concepts.
Introduction: The Fundamental Problem
Computers are calculators. They understand numbers perfectly but are oblivious to nuance, irony, or synonyms. Humans, on the other hand, communicate almost entirely in nuance.
To build a system like ChatGPT, or even a simple “Chat with PDF” (RAG) tool, you must solve the Semantic Gap: How do you translate the fuzzy, qualitative world of human language into the precise, quantitative world of machine computation?
The answer is Vector Embeddings. An embedding is a translation layer that maps a discrete concept (like a word) to a continuous point in a high-dimensional mathematical space.
The Evolution of “Meaning”
To appreciate how modern Transformers work, we must first understand the problems with older methods.
1. The Old Way: The “Bag of Words” (Counting)
Before deep learning, computers treated language as a “Bag of Words”: text was reduced to word counts built on one-hot vectors. With a vocabulary of 10,000 words, each word became a vector of 10,000 numbers, all of them zero except one.
The word “Apple” was just a 1 in a specific column.
The Problem: This is incredibly inefficient. Worse, it encodes no meaning. The mathematical distance between “Apple” and “Orange” is exactly the same as the distance between “Apple” and “Carburetor.” The computer only knows they are different indices.
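To make the distance claim concrete, here is a minimal sketch with a hypothetical four-word vocabulary (the words and their index order are made up for illustration). Every pair of distinct one-hot vectors ends up the same distance apart, and their dot product is always zero.

```python
import numpy as np

# Hypothetical toy vocabulary; the index assigned to each word is arbitrary.
vocab = ["apple", "orange", "carburetor", "king"]

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

apple = one_hot("apple")
orange = one_hot("orange")
carburetor = one_hot("carburetor")

# The Euclidean distance between any two different one-hot vectors is sqrt(2).
dist_apple_orange = np.linalg.norm(apple - orange)
dist_apple_carburetor = np.linalg.norm(apple - carburetor)

print(dist_apple_orange, dist_apple_carburetor)
```

Both distances come out identical, so nothing in this representation says that an apple is more like an orange than like a carburetor.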
2. The Breakthrough: Word2Vec (Context)
Around 2013, researchers found a way to create dense vectors. Instead of 10,000 zeros, “King” became a compact list of maybe 300 numbers:
[0.12, -0.45, 0.88, ...]
How? By training a neural network on an idea from the linguist J. R. Firth: “You shall know a word by the company it keeps.”
The model looks at a sliding window of text. If “King” and “Queen” frequently appear surrounded by similar context words like “throne,” “crown,” and “ruled,” the network learns to place their mathematical representations close together.
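The sliding window can be sketched in a few lines. The snippet below generates (center word, context word) pairs, which is roughly the kind of training example used by the skip-gram variant of Word2Vec. The sentence, the function name `context_pairs`, and the window size are illustrative assumptions, and the actual training step (the neural network itself) is not shown.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king ruled from the throne".split()
pairs = context_pairs(sentence, window=2)

# Collect every context word seen around "king".
king_contexts = [ctx for center, ctx in pairs if center == "king"]
print(king_contexts)  # ['the', 'ruled', 'from']
```

If “queen” shows up with a similar set of context words across a large corpus, training pushes its vector toward the vector for “king.”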
3. The Modern Era: Transformers (Nuance)
Word2Vec had a flaw: the word “bank” had one static vector, regardless of whether you meant a “river bank” or a “financial bank.”
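Modern Transformer models (like BERT and GPT) generate contextual embeddings. Instead of looking up one fixed vector per word, they compute each token’s vector from the tokens around it (BERT attends to the whole sentence, GPT to the tokens that came before). As a result, the vector for the word “bank” changes depending on the other words in the sentence.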
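The idea can be illustrated with a toy self-attention pass in numpy. This is a deliberately stripped-down sketch, not how BERT or GPT are implemented: there are no learned query/key/value weights, no multiple heads or layers, and the starting vectors are random placeholders. It only shows that mixing in the neighbors gives the same word a different vector in a different sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical static (Word2Vec-style) vectors: one fixed vector per word.
static = {w: rng.normal(size=dim) for w in ["river", "money", "bank"]}

def contextualize(tokens):
    """Single self-attention pass with Q = K = V = the static vectors."""
    x = np.stack([static[t] for t in tokens])      # shape (n_tokens, dim)
    scores = x @ x.T / np.sqrt(dim)                # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ x                             # context-mixed vectors

bank_in_river = contextualize(["river", "bank"])[1]
bank_in_money = contextualize(["money", "bank"])[1]

# Same word, same starting vector, different result after attention.
print(np.allclose(bank_in_river, bank_in_money))
```

Each output row is a weighted blend of every token in the sentence, so “bank” next to “river” and “bank” next to “money” no longer share a vector.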
The Linear Algebra of Meaning
Because these concepts are now just lists of numbers, we can perform math on them. This is where the true magic of embeddings is revealed. It leads to the most famous example in all of Natural Language Processing:
King - Man + Woman ≈ Queen
This isn’t a metaphor; it’s literal arithmetic. If you take the vector for “King,” subtract the vector for “Man” (effectively removing the “masculinity” component), and add the vector for “Woman” (adding the “femininity” component), the result lands close to the vector for “Queen.” In the classic Word2Vec demonstrations, “Queen” is the nearest word once the three input words themselves are excluded.
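Here is the arithmetic on hand-made vectors. The two dimensions (royalty, gender) and all the values are assumptions chosen so the result is easy to follow; real embeddings have hundreds of dimensions with no such readable labels, and the match is approximate rather than exact.

```python
import numpy as np

# Hand-made 2D vectors: [royalty, gender], with gender +1 = male, -1 = female.
words = {
    "king":  np.array([0.9,  1.0]),
    "queen": np.array([0.9, -1.0]),
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([0.0,  0.1]),
}

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = words["king"] - words["man"] + words["woman"]

# Exclude the three input words before searching for the nearest neighbor.
candidates = {w: v for w, v in words.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```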
The Ruler (Cosine Similarity)
Now, how do we use this for something practical like search? If a user queries for “Puppy,” how does the AI know to return a document about “Dog Health”?
It needs a ruler to measure distance in this multi-dimensional space. We don’t use standard “as-the-crow-flies” distance (Euclidean distance) because the length of a document’s vector can vary based on how many words it has.
Why Euclidean Distance Fails
Your first instinct might be high-school geometry distance (Euclidean distance).
The problem is that magnitude (the length of the arrow) can be misleading. A long document about apples might have a very long vector, while a short sentence about apples has a short vector. They point in the same direction (same meaning), but the Euclidean distance between their tips is huge.
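A quick numeric check of this, using two hypothetical 2D vectors where the “long document” is simply the “short sentence” scaled by 10:

```python
import numpy as np

# Hypothetical vectors: same direction, but the document is 10x longer.
short_sentence = np.array([1.0, 2.0])
long_document = 10 * short_sentence

euclidean = np.linalg.norm(long_document - short_sentence)
cosine = (short_sentence @ long_document) / (
    np.linalg.norm(short_sentence) * np.linalg.norm(long_document)
)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine similarity:  {cosine:.3f}")
```

The Euclidean distance is large even though the two vectors point the same way. If vectors are normalized to unit length first, both measures produce the same ranking.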
Enter Cosine Similarity
Instead, we measure the angle between two vectors. This metric is called Cosine Similarity.
- If two vectors point in roughly the same direction, the angle between them is small, and their cosine similarity is close to 1. This indicates a high conceptual match.
- If they point in unrelated directions, the angle is large (approaching 90 degrees), and their cosine similarity is close to 0. This indicates they are conceptually unrelated. (Vectors pointing in opposite directions score -1.)
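The metric itself is short enough to write by hand: the dot product of the two vectors divided by the product of their lengths. The sketch below checks it on three simple 2D cases, including a third one, vectors pointing in opposite directions, which gives -1. The vector names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # angle of 0 degrees
perpendicular = np.array([0.0, 2.0])    # angle of 90 degrees
opposite = np.array([-1.0, 0.0])        # angle of 180 degrees

print(cosine_similarity(reference, same_direction))  # 1.0
print(cosine_similarity(reference, perpendicular))   # 0.0
print(cosine_similarity(reference, opposite))        # -1.0
```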
This is the core mechanism of semantic search. The AI calculates the cosine similarity between your query’s vector and the vectors of every document in its database, then returns the ones with the highest scores.
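As a rough sketch of that search step, the function below scores a query against a whole matrix of document vectors at once and returns the top matches. The function name `top_k`, the document titles, and all vector values are hypothetical.

```python
import numpy as np

def top_k(query, doc_matrix, k=2):
    """Return (indices, scores) of the k rows most similar to `query`."""
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]

# Hypothetical 3D document vectors, one row per document.
doc_titles = ["Dog Health", "Truck Repair", "Pet Adoption"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.1],
])
query_vector = np.array([1.0, 0.2, 0.0])  # stand-in for the query "Puppy"

indices, scores = top_k(query_vector, doc_vectors, k=2)
print([doc_titles[i] for i in indices])  # ['Dog Health', 'Pet Adoption']
```

Scanning every row like this is fine for small collections. Larger systems typically rely on an approximate nearest-neighbor index instead of comparing against every document.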
The Code (From toy example to reality)
Let’s solidify this with Python. We will start with a manual, low-dimensional example to see the math work, and then look at what real-world embedding data looks like.
A Toy 3D Search Engine
We will use numpy and scikit-learn to build a semantic search engine in just a few lines of code. We will define a hypothetical 3D concept space:
[Is_Animal, Is_Domesticated, Has_Wheels]
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 1. Define our "Database" of concepts as manual vectors
# Dimensions represent: [Is_Animal, Is_Domesticated, Has_Wheels]
database = {
    "Dog": np.array([0.9, 0.8, 0.0]),    # High animal, high domesticated
    "Cat": np.array([0.8, 0.9, 0.0]),    # Similar to dog
    "Wolf": np.array([0.9, 0.1, 0.0]),   # Animal, but NOT domesticated
    "Truck": np.array([0.0, 0.0, 0.9]),  # Not animal, has wheels
}

# 2. Define a User Query: "Puppy"
# We intuitively know a puppy is a domesticated animal.
query_vector = np.array([0.95, 0.9, 0.0])

# 3. Perform the Search
print("--- Searching for 'Puppy' ---")
results = {}
for word, vector in database.items():
    # Reshape vectors to 1x3 matrices for scikit-learn functions
    v1 = query_vector.reshape(1, -1)
    v2 = vector.reshape(1, -1)
    # Calculate Cosine Similarity
    score = cosine_similarity(v1, v2)[0][0]
    results[word] = score

# Sort results by highest score first
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for word, score in sorted_results:
    print(f"Similarity to {word:<6}: {score:.4f}")
--- Searching for 'Puppy' ---
Similarity to Dog   : 0.9995
Similarity to Cat   : 0.9963
Similarity to Wolf  : 0.7975
Similarity to Truck : 0.0000
The math successfully captured our intuition. “Dog” is the closest match. “Wolf” is related (it’s an animal), but the angle is wider because the “domesticated” dimension doesn’t align. “Truck” is mathematically irrelevant.
Real-World Vectors (1536 Dimensions)
What do these vectors look like in production? They aren’t nice, readable numbers like 0.9. They are a dense block of abstract floats.
Here is a snippet of Python code using langchain and OpenAI to fetch a real vector for the sentence "Hello world".
Note: You need an OPENAI_API_KEY set in your environment to run this.
from langchain_openai import OpenAIEmbeddings

# Initialize the embedding model. The model is named explicitly rather than
# relying on the library default, which may differ between versions.
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
text = "Hello world"
vector = embeddings_model.embed_query(text)
print(f"Text: '{text}'")
print(f"Vector Dimensions: {len(vector)}")
print(f"First 10 dimensions: {vector[:10]}")
Text: 'Hello world'
Vector Dimensions: 1536
First 10 dimensions: [-0.00692, -0.0053, -0.0009, -0.0133, -0.0072, 0.0107, -0.0239, -0.0031, -0.0088, -0.0215]
That list of 1,536 numbers is how the AI “understands” the concept of “Hello world”. Every single piece of text you send to a RAG system is converted into one of these lists before any searching happens.
Conclusion: The Foundation of Applied AI
Vector embeddings are the fundamental data structure of modern AI. They are the bridge that connects the fuzzy, qualitative world of human language to the precise, quantitative world of machines.
Here is a summary of what we’ve covered:
- Evolution of Meaning: We moved from simple word counting (sparse vectors) to context-aware representations (dense vectors) that capture nuance.
- Vector Arithmetic: Relationships between concepts can be expressed as simple algebraic equations (e.g., King - Man + Woman ≈ Queen).
- Cosine Similarity: This is the “ruler” we use to measure conceptual distance, allowing us to build powerful semantic search and RAG systems.
Understanding these core concepts is the difference between just using an AI library and truly understanding how to build and debug intelligent applications.