TLDR: VLOOKUP only works when values match exactly. Fuzzy matching helps with typos, but fails when "Microsoft Corporation" needs to match "MSFT" or "Apple Inc." needs to match "Apple Computer Company". This tutorial introduces deep merge: using AI to understand that two different strings refer to the same entity. Learn how to join datasets even when there is no common identifier and names are spelled, abbreviated, or formatted completely differently.
Every data engineer, myself included, has tried to join two tables that don't share a key (a problem also called entity resolution). Maybe it's companies, people's names, addresses, or parts. The strings never match well enough, yet pandas.merge() or VLOOKUP requires exact matches:
# This only works when "company" columns match exactly
merged = pd.merge(left_table, right_table, on="company")
Consider these variations that represent the same company:
| Table A | Table B |
|---|---|
| Microsoft Corporation | Microsoft Corp. |
| Apple Inc. | AAPL |
| Meta | Facebook |
A standard merge would miss all of these. So next I would reach for fuzzy merge libraries, like fuzzywuzzy or rapidfuzz, which compute string similarity scores:
from rapidfuzz import fuzz
similarity = fuzz.ratio("Microsoft Corporation", "Microsoft Corp.")
# Returns ~78 (similarity on a 0-100 scale)
This effectively computes an edit distance between the strings (like the Levenshtein distance) and normalizes it into a similarity score (0 to 1, or 0 to 100 depending on the library). A similarity at or above a chosen threshold then qualifies as a fuzzy match.
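As a minimal sketch of this threshold rule (not the cascade used later), rapidfuzz's process.extractOne can do the lookup; the 85-point cutoff here is purely illustrative:
from rapidfuzz import fuzz, process

def fuzzy_lookup(name, candidates, threshold=85):
    # Return the best-scoring candidate, but only if it clears the threshold
    match = process.extractOne(name, candidates, scorer=fuzz.ratio, score_cutoff=threshold)
    return match[0] if match else None

fuzzy_lookup("Micposoft", ["Microsoft", "Apple", "Meta"])  # -> "Microsoft" (a typo is bridged)
fuzzy_lookup("Facebook", ["Microsoft", "Apple", "Meta"])   # -> None (no string similarity to "Meta")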
But this won't work even on my three-company example. There is no threshold that matches "Facebook" to "Meta" without also introducing a flood of false positives. And if an optimal threshold exists at all, it is different for every dataset.
If you really want high accuracy, like I do, world knowledge is sometimes required, for example to link "McGraw Hill" ↔ "S&P Global" (McGraw Hill Financial was renamed S&P Global in 2016).
LLMs make much higher quality semantic matching possible. First, you can use embeddings, which convert text into high-dimensional vectors where semantically similar entities cluster together. This works well for simple relationships, and the OpenAI embeddings API is extremely cheap. But the quality still isn't great, and, just as with fuzzy string matching, you have to pick the right threshold.
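As a rough sketch of the embedding route (assuming the current OpenAI Python client; the model choice is just an example), you embed both columns and take the most similar pair, which again needs a threshold to reject non-matches:
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # One embedding vector per input string
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

left = embed(["Microsoft Corporation", "Apple Inc.", "Meta"])
right = embed(["MSFT", "AAPL", "Facebook"])

# Cosine similarity between every left/right pair, then the best match per left row
sims = left @ right.T / (np.linalg.norm(left, axis=1, keepdims=True) * np.linalg.norm(right, axis=1))
best = sims.argmax(axis=1)  # still requires a similarity threshold to reject non-matches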
Full-powered LLMs can fundamentally do entity resolution as well as humans, and LLM web agents can do it as well as humans with access to Google. This matters when the data is dynamic, ambiguous, or simply not (yet) present in the weights / training data of the LLM.
Playing around with LLM web agents, I immediately noticed that they are overkill for most rows in most datasets. I needed to use the simplest method that works for each row, so I implemented a hierarchical cascade, escalating to a more expensive method only when it looks like the cheaper ones would fail (sketched after the list below):
- Exact string match: Basically a lookup-table
- Fuzzy string match: Levenshtein-based with a high threshold of 0.9
- LLM merge: A cheap LLM, run in batches, finds a match if it is confident
- LLM-Web-Agent merge: Browse the web and try to find which entities really correspond to each other
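A stripped-down sketch of this cascade for a single row could look as follows; llm_match is a hypothetical stand-in for the batched LLM step and the web-agent fallback:
from rapidfuzz import fuzz

def cascade_match(name, candidates, fuzzy_threshold=90):
    # 1. Exact string match: basically a lookup table
    if name in candidates:
        return name, "exact"
    # 2. Fuzzy string match: similarity score with a high threshold
    best = max(candidates, key=lambda c: fuzz.ratio(name, c))
    if fuzz.ratio(name, best) >= fuzzy_threshold:
        return best, "fuzzy"
    # 3. Escalate: cheap LLM first, web agent only if the LLM is not confident
    return llm_match(name, candidates), "llm"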
Cost And Accuracy of the Progressive Intelligence Approach
Let's look at a real-world example.
(The full code of this tutorial is available in the accompanying Jupyter notebook)
I used a dataset of 438 S&P 500 companies with columns for company name, ticker, fair value, price, market cap, shares, and CEO. This gave us ground truth for validation.
data = pd.read_csv("data/companies.csv")
# Columns: company, ticker, fair_value, price, mkt_cap, shares, CEO
For the experiments, I split the data into two tables, corrupted it, and removed columns, so the original dataset also serves as ground truth for each experiment.
I ran three experiments, each testing a different merging challenge:
- Fuzzy matching with noise from corrupt company names
- LLM merge without common columns by matching companies to tickers
- Web-based merge for association by matching CEOs to companies
Note that this third one is not a typical join: CEOs and companies are not the same entity under different names, so the merge builds a semantic relationship between two different entities. LLMs are strictly needed for this.
My solution is now part of the everyrow.io SDK, so re-doing this experiment is pretty straightforward:
from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum
async with create_session(name="Merge companies") as session:
    task = await merge(
        session=session,
        task="...",  # Natural language description of what you're merging (helps the LLM, not necessary in general)
        left_table=left_df,  # table to merge into, as a pandas DataFrame
        right_table=right_df,  # table to merge from, as a pandas DataFrame
        merge_on_left="...",  # Optional: specify left merge column
        merge_on_right="...",  # Optional: specify right merge column
        merge_model=LLMEnum.GEMINI_3_FLASH_MINIMAL,  # Optional: specify LLM model (default: GEMINI_3_FLASH_MINIMAL)
    )
    result = await task.await_result()  # result contains the dataframe with the merged tables and some metadata
The output table includes a research column with justifications for each match, allowing you to audit the matching logic.
You will also see a link to your session in the everyrow.io UI, where you can watch the merge process in real time.
Experiment 1: Fuzzy String Matching Under Noise
In our first experiment, we progressively corrupt the data to simulate real-world data quality issues, replacing characters in the company names with a simple randomization function:
import random
import string

def randomize_string(text, p=0.1):
    """Replace each character with probability p"""
    result = []
    for char in text:
        if random.random() < p:
            result.append(random.choice(string.ascii_letters))
        else:
            result.append(char)
    return ''.join(result)

def randomize_string_column(df, column, p=0.1):
    """Apply randomize_string to every value in a column"""
    df = df.copy()
    df[column] = df[column].apply(lambda s: randomize_string(s, p))
    return df

left_table = data.copy()
left_table.drop(columns=["fair_value"], inplace=True)
right_table = data[["company", "fair_value"]]
right_table = randomize_string_column(right_table, "company", p=0.1)
The noise p is the probability that each character gets corrupted (replaced with a random letter). With p=0.1 (10%), "Microsoft" might become "Micposoft".
We run the experiment at several noise levels, using these arguments:
task = await merge(
    session=session,
    task="Merge the tables on company name",
    left_table=left_table,
    right_table=right_table,
    merge_on_left="company",
    merge_on_right="company",
)
result = await task.await_result()
Let us look at the distribution of matching methods, accuracy, and price across all rows.
| Noise Level | Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | Price |
|---|---|---|---|---|---|---|---|
| 0% | 100% | 100% | 0% | 0% | 0% | 100% | $0.13 |
| 5% | 100% | 49.8% | 30.6% | 19.6% | 0% | 100% | $0.32 |
| 10% | 100% | 26.5% | 30.8% | 42.7% | 0% | 100% | $0.44 |
(The $0.13 at zero noise, where no LLM matching is needed, is pure overhead: my tool still uses LLMs to set up the problem.) At 10% character corruption, exact string matching handles only 27% of rows. Fuzzy matching catches another 31%, but 43% of rows require LLM matching to resolve.
Let us look at some examples where fuzzy matching failed but LLMs succeeded:
| company name | company name (with noise) |
|---|---|
| Universal Health Services | Univebsal Health ServucQv |
| Boston Scientific | BostonNSchentifia |
| FactSet | FuctSet |
As we see, for short company names ("FactSet"), a single wrong letter is enough to get below the fuzzy string threshold of 0.9.
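You can check this directly with rapidfuzz (scores are on a 0-100 scale, so the 0.9 threshold corresponds to 90):
from rapidfuzz import fuzz

fuzz.ratio("FactSet", "FuctSet")  # ~86: one substituted letter in a 7-character name already falls below 90
fuzz.ratio("Universal Health Services", "Univebsal Health ServucQv")  # also below 90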
Despite the heavy corruption, we achieved 100% accuracy with 0% false positives across all noise levels. The cascade automatically escalated to more sophisticated methods as needed, just costing more.
Link to public session (p=0)
Link to public session (p=0.05)
Link to public session (p=0.1)
Experiment 2: LLM Merge for Semantic Relationships
Here we look at a more complex scenario: the two tables share no column or identifier, and we have to match purely on the table contents.
Left table:
| company | price | mkt_cap |
|---|---|---|
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |
Right Table:
| ticker | fair_value |
|---|---|
| AAPL | 1593.98 |
| MSFT | 1392.16 |
| GOOGL | 1428.66 |
| ... | ... |
This is where semantic matching can shine. We can give the LLM a hint that the ticker is an important column to identify the company:
task = await merge(
    session=session,
    task="Merge the tables based on company name and ticker",
    left_table=left_table,
    right_table=right_table,
)
result = await task.await_result()
Here we left out the merge columns, which automatically skips the exact and fuzzy string matching steps. Instead, the merge cascade goes straight to LLM reasoning and, if that fails, falls back to web verification.
| Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | False Positives | Price |
|---|---|---|---|---|---|---|---|
| 100% | 0% | 0% | 99.8% | 0.2% | 100% | 0% | $1.00 |
437 of 438 rows matched using pure LLM reasoning. One edge case required web verification.
This should not be surprising: company tickers are stable, and LLMs have memorized them well.
Link to public session
Experiment 3: Web-based Merge for Dynamic Data
Let us now look at a case where the data is dynamic and web verification is needed for several entities.
From our dataset, we keep only the CEO names in the right table and match them to the companies.
Left table:
| company | price | mkt_cap |
|---|---|---|
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |
Right table:
| CEO |
|---|
| Tim Cook |
| Satya Nadella |
| Sundar Pichai |
| ... |
In this case, we again do not specify merge columns, and we adapt the prompt slightly:
task = await merge(
    session=session,
    task="Merge the CEO to the company information, use web search if unsure",
    left_table=left_table,
    right_table=right_table,
)
result = await task.await_result()
What is interesting to vary here, however, is the LLM model: the ability to pay attention to a single row in a large table (the "needle in a haystack" problem) differs between models, as do their confidence when matching and their ability to follow the user prompt.
Therefore, let us compare Gemini-3-Flash, GPT-5, and Claude-4.5-Sonnet, both with minimal chain-of-thought and with more thinking allowed. Note, however, that these models only do the semantic matching; the web search itself is done by web agents with fixed LLM choices.
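To run the comparison, the only change per run is the merge_model argument (here using the enum member shown earlier; the member names for the other models depend on your SDK version):
task = await merge(
    session=session,
    task="Merge the CEO to the company information, use web search if unsure",
    left_table=left_table,
    right_table=right_table,
    merge_model=LLMEnum.GEMINI_3_FLASH_MINIMAL,  # swap in the other LLMEnum members to compare models
)
result = await task.await_result()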
This time, the results show that the task was considerably harder for the web agents:
| LLM | Matched | LLM Match | Web Match | Accuracy | False Positives | Cost |
|---|---|---|---|---|---|---|
| GEMINI-3-FLASH | 100% | 89.9% | 1.1% | 87.7% | 12.3% | $0.75 |
| GEMINI-3-FLASH - thinking | 100% | 96.8% | 3.2% | 98.2% | 1.8% | $3.97 |
| GPT-5-MINI | 95.7% | 59.8% | 35.8% | 96.7% | 3.2% | $3.69 |
| GPT-5.2 | 98.9% | 79.0% | 19.9% | 98.2% | 0.7% | $3.76 |
| CLAUDE-4.5-SONNET | 99.5% | 69.4% | 30.1% | 89.3% | 10.3% | $7.12 |
| CLAUDE-4.5-SONNET - thinking | 99.8% | 74.0% | 25.8% | 90.4% | 9.4% | $5.97 |
We see that the agents now use web support in a significant number of cases, but the behavior differs vastly by LLM choice, as do their confidence to match rows, their accuracy, and their rate of false positives. The best performing LLMs for this case study were GPT-5.2 and GEMINI-3-FLASH (thinking).
Link to public session (GEMINI-3-FLASH)
Link to public session (GEMINI-3-FLASH - thinking)
Link to public session (GPT-5-MINI)
Link to public session (GPT-5.2)
Link to public session (CLAUDE-4.5-SONNET)
Link to public session (CLAUDE-4.5-SONNET - thinking)
Conclusion
We showed how web agents can complement complex merging tasks and how a cascade of merging methods is the most efficient approach.
Note that we have neither trained any models, learned any hyperparameters, nor adapted the architecture or the prompts with respect to the data. Additionally, further engineering tricks that we have not yet exploited, like collision resolution and multi-step matching, are likely to increase the accuracy further.
You (or Claude Code) can try this easily yourself, or learn more, via the everyrow.io SDK docs.