Part 1 of 6 in the series “LLMs Need Knowledge Graphs. Use RDF or End Up Rebuilding It.”
The Big Picture: Knowledge graphs triple LLM accuracy on enterprise data. But here’s what nobody tells you upfront: every knowledge graph converges on the same patterns, the same solutions. This series reveals why RDF isn’t just one option among many — it’s the natural endpoint of knowledge representation. By Post 6, you’ll see real enterprises learning this lesson at great cost — or great savings.
The Knowledge Layer Revolution
Your AI is struggling with your data. You know this because you’ve watched it happen: confident answers that are completely wrong, hallucinations about basic facts, an inability to connect information from different systems.
You’re not alone. When large language models try to answer business questions using enterprise SQL databases, errors are common. Without additional context and structure, LLMs often struggle to interpret schemas and relationships correctly.
But something remarkable happens when you add a knowledge layer between your data and your AI. When that same data is transformed into a knowledge graph, accuracy more than triples.
This finding comes from research my colleagues (Juan Sequeda and Dean Allemang) and I published together (“Benchmarking the Abilities of LLMs for Supporting Enterprise Knowledge Graph Construction from Relational Databases”, 2023). We discovered that LLMs perform dramatically better with knowledge graphs — the structure aligns naturally with how they process information.
When teams embark on building a knowledge layer, they face a critical early decision: use the established RDF standards, or build something custom. Many choose to build their own solution, viewing RDF as overly complex or academic. They start with property graphs, custom schemas, or proprietary platforms that promise quick wins.
But I’ve spent years working at the intersection of knowledge representation and AI, watching these projects evolve. The pattern is remarkably consistent. Teams that choose not to use RDF inevitably find themselves rebuilding its core features: global identifiers for entities, protocols for data federation, ways to express relationships and metadata consistently. What starts as “we’ll keep it simple” becomes “we need a canonical ID system” becomes “we’re building our own semantic layer.”
Uber discovered this after building their own graph system. Neo4j reversed course after years of positioning against RDF. The market has spoken: you need these capabilities. The only question is whether you’ll build them yourself or use what already exists.
This series reveals why RDF isn’t just another technology choice — it’s the natural endpoint of knowledge representation. Not because of ideology or standards bodies, but because the problems of representing knowledge at scale force convergent evolution.
Let me show you why, starting with the most fundamental challenge every knowledge layer must solve.
Why LLMs Struggle with Traditional Databases
LLMs are pattern-matching machines trained on natural language. When they encounter a SQL schema, they’re forced to:
- Guess what cust_id vs customer_id vs custID mean
- Infer relationships from cryptic foreign key names
- Navigate ambiguous table names (is orders for customer orders or supply orders?)
- Understand domain-specific abbreviations without context
The result is poor performance — not because LLMs are bad at reasoning, but because SQL schemas optimize for storage efficiency rather than semantic clarity.
You can improve SQL schemas for semantic clarity — using descriptive names, normalizing relationships properly, maintaining clean metadata. But this requires constant discipline, adds significant overhead, and fights against SQL’s natural optimization patterns. Database administrators rightfully focus on performance and maintainability, leading to denormalization, cryptic but efficient column names, and other practices that prioritize machine efficiency over semantic clarity.

Even with perfect discipline, SQL’s fundamental separation of data (in tables) from metadata (in schemas) makes it harder for AI systems to understand how the model evolves. When your knowledge representation is spread across DDL statements, foreign key constraints, and actual data, LLMs struggle to build a coherent semantic picture.
Knowledge graphs, on the other hand, are organized the way we actually think about facts and relationships. They represent knowledge directly, not as a “projection” into tables and columns. While you can store facts in relational databases, you’re always forcing a graph-shaped understanding into a table-shaped container.
The Pattern Every Enterprise Follows When Building a Knowledge Graph
Watch for this progression in your organization:
- “We need a knowledge graph for our AI”
- “RDF seems too complex, let’s use property graphs”
- “We need global identifiers for our merger”
- “How do we federate queries across departments?”
- “Our custom solution is becoming unmaintainable”
- “Maybe we should have used RDF from the start”
This series will show you why this pattern is inevitable — and how to skip to the end.
Why Knowledge Graphs Change Everything
Knowledge graphs represent information the way LLMs (and humans) “think”:
- Explicit relationships: No guessing what foreign keys mean
- Rich context: Every entity and relationship can be described
- Natural language alignment: Triples mirror subject-verb-object sentences
- Semantic clarity: Types, hierarchies, and constraints are explicit
As Dan Bennett explains in his excellent primer on knowledge graphs, “We can state anything about anything using this model” — and crucially, “A single row is meaningful. It contains a single fact.” This isn’t just a technical preference — it’s about fundamental representation. Knowledge graphs store the atomic truths about your business directly, while relational databases require reconstructing those truths from scattered pieces. When an LLM can traverse relationships explicitly rather than inferring them from column names, accuracy triples. The knowledge graph becomes a bridge between human meaning and machine processing.
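To see what “a single row is a single fact” looks like, here is one triple in Turtle syntax, reading like a subject-verb-object sentence (the identifiers are the same illustrative ones used later in this post):

# One triple, one atomic fact: "Alice Johnson reports to Bob Smith"
tc:employee-alice-johnson org:reportsTo tc:employee-bob-smith .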
The Knowledge Graph Gold Rush… and Its Hidden Challenge
The 3x accuracy improvement has triggered a gold rush. Enterprises are racing to build knowledge graphs. But here’s what the research papers don’t always mention: building a production knowledge graph requires solving fundamental problems that have existed since humans started organizing information.
And this is where our story really begins.
The First Problem: Identity
Knowledge graphs must answer a deceptively simple question: “How do we know two things are the same thing?”
It starts innocently enough. Customer #12345 in your sales system needs to match up with cust_12345 in your support system. But then it gets messier:
- When an LLM sees “Apple” in your data, is it the fruit or the company? (see the sketch just after this list)
- Is employee “A. Johnson” the same as “Alice Johnson” in HR?
- When you reference Database → Schema → Table → Column, which specific column across all your systems?
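To make that first question concrete, here is how explicit identifiers settle it. This minimal sketch uses DBpedia’s public resource IRIs; the class names are illustrative:

# Two different "Apple"s, two different IRIs (class names illustrative)
<http://dbpedia.org/resource/Apple> a :Fruit .
<http://dbpedia.org/resource/Apple_Inc.> a :Company .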
Without solving identity, you get:
- Data silos that refuse to talk to each other
- Integration projects that never truly end
- LLMs hallucinating because they can’t distinguish between entities
Every graph database, every knowledge graph platform, every enterprise data mesh must solve this. And RDF solved it 25 years ago by building on the architecture of the most successful distributed system ever created — the World Wide Web.
Enter IRIs: The Web’s Gift to Data
The solution has been staring us in the face since the invention of the web itself: Internationalized Resource Identifiers (IRIs). Just as URLs gave us a way to uniquely identify any document on the web, IRIs give us a way to uniquely identify anything at all.
Here’s what this looks like in practice:
# IRIs provide globally unique identifiers
tc:employee-alice-johnson a :Employee ;
:name "Alice Johnson" ;
:employeeId "E12345" .

# Different system, same person - unified by IRI
dir:staff-ajohnson
owl:sameAs tc:employee-alice-johnson .
Notice how this reads almost like English sentences? That’s not an accident: RDF’s triple structure mirrors how we naturally express facts.
The keen-eyed reader might notice these identifiers don’t look like typical URLs. We’re using prefixed names (like tc:employee-alice-johnson) that expand to full IRIs (like <http://timecard.example.com/employee-alice-johnson>). Think of it like using domain names instead of IP addresses—both point to the same place, but one is much easier for humans to work with.
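For the curious, the expansion is driven by prefix declarations at the top of the file. A minimal sketch, assuming the timecard namespace from above (the dir: namespace is likewise illustrative):

# Prefix declarations bind short names to namespace IRIs (namespaces illustrative)
@prefix tc: <http://timecard.example.com/> .
@prefix dir: <http://directory.example.com/> .

# tc:employee-alice-johnson now expands to
# <http://timecard.example.com/employee-alice-johnson>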
The magic isn’t in the syntax; it’s in the properties:
Global Uniqueness: By using domain-based namespacing, collisions become virtually impossible. Your customer #12345 at data.example.com will never be confused with someone else's customer #12345.
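A quick sketch of why collisions don’t happen, using two illustrative namespaces:

# Same local number, different namespaces: two unambiguous identifiers
<http://data.example.com/customer/12345> a :Customer .
<http://partner.example.org/customer/12345> a :Customer .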
Dereferenceable: IRIs can be designed to return more information when accessed, following web architecture principles. While not automatic, making your IRIs dereferenceable is a semantic web best practice that elegantly bridges your knowledge graph with the existing infrastructure of the web. Just as clicking a link can take you to a webpage, systems can potentially follow well-designed IRIs to discover more context.
Hierarchical: IRIs naturally organize into hierarchies (/customer/12345/orders/...). These structured IRIs are invaluable for humans (and AI!) to quickly understand what they represent. But, and this is crucial, you should never parse them programmatically. The hierarchical structure is a scheme for generating meaningful identifiers and making them readable, but machines should treat them as opaque strings.
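For illustration, here is a hierarchical IRI a human can read at a glance while a machine treats the whole string as one opaque identifier (the order path and the :placedBy property are hypothetical):

# Readable hierarchy for humans; one opaque identifier for machines
<http://data.example.com/customer/12345/orders/67890> a :Order ;
:placedBy <http://data.example.com/customer/12345> .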
International: Unlike traditional URIs, IRIs support the full range of Unicode characters. Your customers in Tokyo, Moscow, and Cairo can all have identifiers in their own scripts.
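For example, an identifier in Japanese script is a perfectly valid IRI (the namespace is illustrative):

# Full Unicode: "顧客" is Japanese for "customer"
<http://data.example.jp/顧客/12345> a :Customer .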
The Build-vs-Buy Moment Every Enterprise Faces
At this point, you might be thinking: “We don’t need all this. We’ll just build a simple mapping table.”
Let me save you three years and several million dollars. Here’s how it actually plays out:
Year 1: “We’ll just map customer IDs between systems” ($500K, 2 engineers)
- Build a mapping table
- Works great for 2–3 systems
- The solution seems complete
Year 2: “We need to handle entities beyond customers” ($2M, 5 engineers)
- Extend to products, employees, locations
- Mapping tables multiply
- Performance degrades
- Hire more engineers
Year 3: “We need globally unique identifiers” ($5M total, still not done)
- Invent your own URI scheme
- Build a resolution service
- Handle international characters
- Use or end up reinventing IRIs
The BBC chose differently. They adopted RDF from the start. During the 2010 World Cup, their semantic web platform automatically generated over 700 pages — far more than manual curation would have allowed. By the 2012 Olympics, they expected 10 million page views per day across 10,000 Olympic pages. The result? Dramatically reduced costs while delivering richer content experiences.
I’ve watched this pattern play out several times firsthand, gone through it myself, and heard the same story from veterans with decades of experience. The ending is always the same: organizations converge on globally unique, hierarchical, dereferenceable identifiers. Also known as… IRIs.
Back to Our LLM Problem
Consider this SQL query an LLM might need to construct:
-- LLM has to guess: are these the same customer?
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN crm_records r ON r.cust_num = c.customer_number

The LLM has to infer that customer_id, id, cust_num, and customer_number might refer to the same entity. It’s making educated guesses based on naming patterns. Sometimes it’s right. Usually (84% of the time, according to the research) it’s not.
Now look at the same information in RDF:
# In RDF, identity is explicit
tc:employee-alice-johnson
org:worksIn facilities:building-west-tower ;
org:reportsTo tc:employee-bob-smith ;
foaf:account it:users-ajohnson .
# No guessing needed!

The relationships are explicit. The identities are unambiguous. The LLM doesn’t need to infer; it can simply follow the links.
From Theory to Practice
Starting with IRIs doesn’t require a massive transformation. You can begin simply:
tc:employee-alice-johnson a :Employee ;
:email "alice.johnson@techcorp.com" ;
:employeeId "E12345" ;
:department tc:dept-engineering .

As your system grows, you can connect to other identifiers:
# Link internal and external identifiers
tc:employee-alice-johnson
owl:sameAs hr:employee-alice-johnson ;
owl:sameAs dir:staff-ajohnson ;
rdfs:seeAlso <https://linkedin.com/in/alice-johnson> .

Suddenly, your data can connect to your CRM, to social media, to any system that uses IRIs. No integration project required; just shared identity.
Why This Matters for Your LLM Initiative
This accuracy jump isn’t just about having more data. It’s about having unambiguous data. Here’s what proper identity gives LLMs:
Disambiguation: When the LLM sees “Johnson” in a query, it can determine whether you mean alice-johnson, bob-johnson, or other employees with that surname. No guessing required.
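A minimal sketch of that disambiguation (the :surname property and the bob-johnson identifier are hypothetical):

# Shared surname, distinct identifiers: no ambiguity for the LLM
tc:employee-alice-johnson :surname "Johnson" .
tc:employee-bob-johnson :surname "Johnson" .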
Context Traversal: The LLM can follow relationships confidently. “What projects does Alice’s manager oversee?” becomes a simple graph traversal instead of a complex inference problem. Each step of inference is an opportunity for hallucination: even a small error rate compounds dramatically when multiplied across multiple hops. By making these relationships explicit in the graph, we turn risky inference into deterministic traversal.
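Here is that question as two explicit hops in the graph (the :oversees property and the project identifier are hypothetical):

# "What projects does Alice's manager oversee?" = two deterministic hops
tc:employee-alice-johnson org:reportsTo tc:employee-bob-smith .
tc:employee-bob-smith :oversees tc:project-ai-initiative .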
Source Attribution: Every fact can specify its origin. The LLM can qualify its answers: “According to the HR system, Alice reports to Bob, but the project management system shows her working directly with the CTO on the AI initiative.”
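One way to attach that origin is RDF-star’s annotation syntax, sketched below; named graphs or reification would work as well (the :statedBy property and the :hr-system identifier are hypothetical):

# RDF-star: the quoted triple itself becomes the subject of a provenance fact
<< tc:employee-alice-johnson org:reportsTo tc:employee-bob-smith >>
:statedBy :hr-system .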
The Payoff: Intelligence Emerges
When you solve identity properly, something magical happens:
LLMs can traverse relationships confidently. No more ambiguity about which “customer” or “product” you mean. The IRI is the answer.
Federated queries become natural. IRIs work across system boundaries by design. Your data can live anywhere and still connect.
Knowledge accumulates automatically. New facts enhance rather than confuse. Every system can contribute to the growing understanding.
Provenance is built-in. Every fact can specify who said it, when, and with what confidence. Critical for AI explainability.
This is why knowledge graphs triple LLM accuracy. It’s not about the graph structure alone; it’s about solving identity in a way that eliminates ambiguity.
The Inevitable Convergence
Here’s the uncomfortable truth: complex data systems eventually build these same features:
What You’ll Call It:
- “Entity Resolution Pipeline”
- “Master Data Management”
- “Canonical ID Service”
- “Universal Resource Registry”
What You’re Actually Building:
- Globally unique identifiers (IRIs)
- Namespace management (IRI prefixes)
- Entity equivalence (owl:sameAs)
- Distributed resolution (HTTP dereferencing)
The only difference? You’ll spend 2–3 years and millions of dollars building a worse version of what RDF gives you for free.
This isn’t speculation. Look at any mature data platform:
- Uber spent years building “algebraic property graphs” to avoid RDF, then presented it as a cautionary tale
- Neo4j went from “RDF is too complex” to maintaining comprehensive RDF toolkits
- Google’s Knowledge Graph uses RDF under the hood
- Major platforms converge on the same patterns
Organizations need an identity system. The question becomes whether to build one that works at web scale from day one, or one that will need to be rebuilt when your data outgrows its original scope.
The Choice: Build on RDF or Rebuild RDF?
The proven approach? Start with RDF. Use the battle-tested solution that powers DBpedia, Wikidata, and enterprise knowledge graphs worldwide.
As Juan Sequeda wisely advises in his foreword to the Neo4j whitepaper Knowledge Graphs — Data in Context:
“One of my mantras is don’t boil the ocean. This means that your knowledge graph journey should start simple, be practical, and focus on the business return…”
Source (Neo4j Whitepaper PDF)
But do start with the right foundation. Because those identifiers determine everything else.
Tim Berners-Lee’s first rule of Linked Data couldn’t be simpler:
“Use URIs as names for things.”
Source (W3C Linked Data Principles)
Twenty-five years later, enterprises are still learning this lesson the hard way.
Dean Allemang, reflecting on their research showing 3x improvement in LLM accuracy, summed it up perfectly:
“The bottom line is it works three times better, and that’s pretty cool.”
Source (Knowledge Graph Insights Podcast)
Three times better. That’s the difference between an LLM that frustrates users and one that delivers value. All because you solved identity properly.
The question isn’t whether you’ll build these features. Most enterprises do.
The question is whether you’ll choose to start with the solution that already exists.
Key Takeaways
- LLMs triple their accuracy with knowledge graphs: From SQL to knowledge graphs (Sequeda et al., 2023)
- Identity is the foundation problem: Every knowledge graph must solve “are these the same thing?”
- RDF/IRIs solved this 25 years ago: Global uniqueness, dereferenceability, no central authority
- You’ll build these features anyway: Mature data platforms converge on IRI-like solutions
- Understanding foundations enables implementation: This series equips you to understand RDF before diving into LLM integration
Next: RDF Triples: Smallest Atom of Meaning, Largest Scope of Use — How do you represent knowledge once you can identify anything? Enter the RDF triple: the atom of meaning that scales to the universe.