GraphRAG Part 2: Minimum Viable GraphRAG

In Part 1, we explored why GraphRAG matters. Now let's build a minimum viable GraphRAG with three extraction modes:

Mode	LLM Calls	Best For
Heuristic (default)	0 per chunk	Fast indexing, structured markdown
Hybrid	1 per document	Balance of speed and quality
LLM	2 per chunk	Maximum entity quality

All modes use:

DuckDB for unified storage (vectors + graph in a single file)
BM25 + BERT hybrid search via RRF fusion
Ollama for synthesis and optional batch classification (zero API costs)

Series Navigation:

Part 1: GraphRAG Fundamentals
Part 2: Minimum Viable GraphRAG (this article)

Code: Mostlylucid.GraphRag on GitHub

Architecture Overview

flowchart LR
    subgraph Indexing
        MD[Markdown Files] --> CH[Chunker]
        CH --> EMB[BERT Embeddings]
        CH --> EXT[Entity Extractor]
        EXT --> |heuristics + links| ENT[Entities]
        ENT --> REL[Relationships]
        REL --> COM[Communities]
    end
    
    subgraph Storage
        EMB --> DB[(DuckDB)]
        ENT --> DB
        REL --> DB
        COM --> DB
    end
    
    subgraph Query
        Q[Query] --> CLASS{Classify}
        CLASS --> |local| HS[Hybrid Search]
        CLASS --> |global| CS[Community Search]
        CLASS --> |drift| BOTH[Both + Synthesis]
        HS --> LLM[Ollama]
        CS --> LLM
        BOTH --> LLM
    end
    
    DB --> HS
    DB --> CS
    
    style MD stroke:#22c55e,stroke-width:2px
    style DB stroke:#3b82f6,stroke-width:2px
    style LLM stroke:#a855f7,stroke-width:2px

Why DuckDB?

Microsoft's GraphRAG uses separate storage for vectors (LanceDB), entities (Parquet), and relationships (more Parquet). DuckDB simplifies this:

Single .duckdb file for everything
Native HNSW vector search via VSS extension
SQL for both vector search and graph traversal
Zero deployment complexity

DuckDB is not a graph database - and that's the point. This isn't Neo4j. Traversals are shallow, SQL-based, and deliberate. That constraint keeps the system debuggable and cheap.

Storage Schema

The schema uses join tables for provenance - we can query "which chunks mention entity X?" directly:

erDiagram
    documents ||--o{ chunks : contains
    chunks ||--o{ entity_mentions : has
    chunks ||--o{ relationship_mentions : has
    entities ||--o{ entity_mentions : mentioned_in
    entities ||--o{ relationships : source
    entities ||--o{ relationships : target
    relationships ||--o{ relationship_mentions : mentioned_in
    communities ||--o{ community_members : contains
    entities ||--o{ community_members : belongs_to
    
    chunks {
        varchar id PK
        varchar document_id FK
        text text
        float[] embedding
    }
    
    entities {
        varchar id PK
        varchar name
        varchar type
        int mention_count
    }
    
    entity_mentions {
        varchar entity_id FK
        varchar chunk_id FK
    }

Key design decision: no VARCHAR[] for provenance. Join tables (entity_mentions, relationship_mentions) enable efficient queries like "get all chunks mentioning Docker".

Vector Search: The HNSW Gotcha

DuckDB's HNSW index only triggers with array_cosine_distance + ORDER BY + LIMIT:

// GraphRagDb.cs - SearchChunksAsync
cmd.CommandText = $"""
    SELECT id, document_id, text, chunk_index, 
           array_cosine_distance(embedding, $1::FLOAT[{_dim}]) as distance
    FROM chunks 
    WHERE embedding IS NOT NULL
    ORDER BY distance
    LIMIT $2
    """;
// Convert distance to similarity: 1.0f - distance

Using array_cosine_similarity instead won't trigger the HNSW index. On non-trivial corpora, that turns a ~5ms indexed query into a full table scan.

This is where we diverge from Microsoft's approach. Instead of the LLM-per-chunk extraction passes used in Microsoft's reference GraphRAG pipeline, we use IDF-based statistical extraction. The goal isn't perfect entities - it's stable, corpus-relative signals that don't require an LLM to produce. This trades some recall for determinism, auditability, and predictable cost - a deliberate choice for technical corpora:

flowchart TB
    subgraph "Phase 1: Signal Collection"
        TEXT[All Chunks] --> IDF[Compute IDF Scores]
        TEXT --> STRUCT[Structural Signals]
        STRUCT --> HEAD[Headings]
        STRUCT --> CODE[Inline Code]
        STRUCT --> LINKS[Links]
        IDF --> RARE[High-IDF = Rare Terms]
        RARE --> CAND[Candidates]
        HEAD --> CAND
        CODE --> CAND
        LINKS --> |explicit rels| LINKREL[Link Relationships]
    end
    
    subgraph "Phase 2: Dedup"
        CAND --> EMBED[BERT Embeddings]
        EMBED --> SIM[Similarity > 0.85]
        SIM --> MERGE[Merge Duplicates]
    end
    
    subgraph "Phase 3: Classify"
        MERGE --> LLM{LLM Available?}
        LLM --> |yes| BATCH[Single Batch Call]
        LLM --> |no| HEUR[Heuristic Types]
    end
    
    style IDF stroke:#f59e0b,stroke-width:2px
    style BATCH stroke:#a855f7,stroke-width:2px

Why IDF, not Hardcoded Lists?

The naive approach is a hardcoded HashSet<string> KnownTech = { "Docker", "Kubernetes", ... }. This breaks for:

New technologies (you'd need to update the list)
Domain-specific terms (different corpus = different entities)
Misspellings and variations

IDF (Inverse Document Frequency) solves this statistically. A term's IDF is:

$\text{IDF}(t) = \log\frac{N}{df(t)}$

Where:

$N$ = total chunks
$df(t)$ = chunks containing term $t$

High IDF = rare term = likely an entity. "Docker" appearing in 5 of 100 chunks has higher IDF than "the" appearing in 100 of 100.
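
As a rough illustration of the formula, here is a minimal sketch that counts chunk frequencies and computes IDF with the natural log. The names `Idf` and `DocumentFrequencies` are hypothetical and the whitespace tokenizer is a deliberate simplification; none of this is taken from the repository.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// IDF(t) = log(N / df(t)), where N is the chunk count and df(t) is the
// number of chunks containing the term.
static double Idf(int totalChunks, int chunksWithTerm) =>
    Math.Log((double)totalChunks / chunksWithTerm);

// Hypothetical helper: count, per term, how many chunks contain it.
// Tokenization is a naive lowercase split on spaces, for illustration only.
static Dictionary<string, int> DocumentFrequencies(IEnumerable<string> chunkTexts)
{
    var counts = new Dictionary<string, int>();
    foreach (var text in chunkTexts)
    {
        var terms = text.ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Distinct(); // each term counts once per chunk
        foreach (var term in terms)
        {
            counts.TryGetValue(term, out var seen); // seen is 0 for a new term
            counts[term] = seen + 1;
        }
    }
    return counts;
}

var corpus = new[]
{
    "docker runs the container",
    "the build pipeline",
    "the test suite",
    "the deployment notes",
};
var frequencies = DocumentFrequencies(corpus);

foreach (var word in new[] { "docker", "the" })
{
    var idf = Idf(corpus.Length, frequencies[word]);
    Console.WriteLine($"{word}: df={frequencies[word]}, idf={idf}");
}
```

In this toy corpus "docker" appears in 1 of 4 chunks (IDF = ln 4, about 1.39) while "the" appears in all 4 (IDF = 0). For the example in the text, 5 of 100 chunks gives ln 20, roughly 3.0.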

For more on TF-IDF and BM25, see my post on hybrid search with BM25.

Structural Signals

Markdown structure tells us what's important:

Headings (## Docker Setup) → entity
Inline code (`docker-compose`) → entity
Links ([Docker](https://docker.com)) → entity + relationship

// EntityExtractor.cs - structural signal extraction
private void ExtractStructuralEntities(string chunk, string chunkId)
{
    // Headings: ## Docker Compose Setup → "Docker Compose Setup"
    foreach (Match m in Regex.Matches(chunk, @"^#{1,3}\s+(.+)$", RegexOptions.Multiline))
    {
        var heading = m.Groups[1].Value.Trim();
        AddCandidate(heading, chunkId, weight: 2.0); // Higher weight
    }
    
    // Inline code: `docker-compose` → "docker-compose"
    foreach (Match m in Regex.Matches(chunk, @"`([^`]+)`"))
    {
        AddCandidate(m.Groups[1].Value, chunkId, weight: 1.5);
    }
}

Markdown links provide explicit relationships that don't require LLM inference:

// EntityExtractor.cs - ExtractLinks  
foreach (Match m in Regex.Matches(chunk, @"\[([^\]]+)\]\((/blog/[^)]+)\)"))
{
    var linkText = m.Groups[1].Value;  // "semantic search"
    var slug = m.Groups[2].Value;       // "/blog/semantic-search-with-qdrant"
    yield return new Relationship(linkText, $"blog:{slug}", "references", chunkId);
}

Deduplication via BERT Embeddings

Entity names like "Docker Compose", "docker-compose", and "DockerCompose" should be merged. We use BERT embeddings to detect semantic similarity:

// EntityExtractor.cs - DeduplicateAsync
var embeddings = await _embedder.EmbedBatchAsync(candidates.Select(c => c.Name), ct);

for (int i = 0; i < candidates.Count; i++)
{
    for (int j = i + 1; j < candidates.Count; j++)
    {
        var similarity = CosineSimilarity(embeddings[i], embeddings[j]);
        if (similarity > 0.85)
        {
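            // Merge into canonical entity (keep higher mention count)
            var (canonical, duplicate) = candidates[i].MentionCount >= candidates[j].MentionCount
                ? (candidates[i], candidates[j])
                : (candidates[j], candidates[i]);
            canonical.MentionCount += duplicate.MentionCount;
            canonical.ChunkIds.UnionWith(duplicate.ChunkIds);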
        }
    }
}

This step is O(n²) in the number of candidates, but that number is bounded by IDF filtering and structural signals - not corpus size. For details on BERT embeddings, see semantic search with ONNX and BERT.
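
The excerpt above calls a `CosineSimilarity` helper that isn't shown. A minimal sketch of what such a helper could look like follows; the signature and the toy 3-dimensional vectors are assumptions for illustration, not the repository's implementation.

```csharp
using System;

// Hypothetical stand-in for the CosineSimilarity helper used in the dedup loop.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Guard against zero vectors so we never divide by zero.
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy vectors; real BERT embeddings have hundreds of dimensions.
var dockerCompose = new float[] { 0.9f, 0.1f, 0.3f };
var dockerComposeHyphenated = new float[] { 0.8f, 0.2f, 0.3f };
var kubernetes = new float[] { 0.1f, 0.9f, 0.2f };

Console.WriteLine(CosineSimilarity(dockerCompose, dockerComposeHyphenated) > 0.85); // True
Console.WriteLine(CosineSimilarity(dockerCompose, kubernetes) > 0.85);              // False
```

If the embedder already returns L2-normalized vectors, the two norms are 1 and the dot product alone would suffice.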

The CLI supports three extraction modes via --extraction-mode:

Heuristic Mode (Default)

dotnet run --project Mostlylucid.GraphRag -- index ./Markdown --extraction-mode heuristic

Uses IDF + structural signals for entity detection, with optional LLM batch classification. Zero per-chunk LLM calls - only ~1 call per 50 entities for type classification.

Hybrid Mode (Recommended)

dotnet run --project Mostlylucid.GraphRag -- index ./Markdown --extraction-mode hybrid

Best of both worlds:

Heuristic detection: IDF + structural signals find entity candidates (deterministic)
LLM enhancement: One call per document validates entities and extracts semantic relationships

flowchart LR
    subgraph "Per Document"
        CHUNKS[Document Chunks] --> HEUR[Heuristic Extraction]
        HEUR --> CAND[30 Candidates]
        CAND --> LLM[Single LLM Call]
        LLM --> ENT[Validated Entities]
        LLM --> REL[Semantic Relationships]
    end
    
    style HEUR stroke:#22c55e,stroke-width:2px
    style LLM stroke:#a855f7,stroke-width:2px

For 5 documents with 62 chunks, hybrid mode makes 5 LLM calls (vs 124 for full LLM mode). You get:

Deterministic entity coverage from heuristics
LLM-quality relationship extraction (semantic, not just co-occurrence)
Descriptions and validated types

LLM Mode (Microsoft-Style)

dotnet run --project Mostlylucid.GraphRag -- index ./Markdown --extraction-mode llm

Full Microsoft GraphRAG approach: 2 LLM calls per chunk (entity extraction + relationship extraction). Most expensive, but highest quality for unstructured text.

When to Use Each

Mode	LLM Calls	Best For
Heuristic	~1 per 50 entities	Fast indexing, well-structured markdown
Hybrid	1 per document	Balance of coverage and quality
LLM	2 per chunk	Unstructured prose, maximum quality

For technical documentation, start with hybrid mode. It gives you semantic relationships without the per-chunk cost. Fall back to heuristic for pure speed, or llm for narrative text.

Hybrid Search: BM25 + BERT

Hybrid search combines two complementary approaches:

Dense (BERT): Understands meaning. "Docker containers" matches "containerization".
Sparse (BM25): Matches exact terms. "HNSW" only matches "HNSW".

flowchart LR
    Q[Query] --> BERT[BERT Embedding]
    Q --> BM25[BM25 Tokenize]
    
    BERT --> DENSE[Dense Search<br/>HNSW Index]
    BM25 --> SPARSE[Sparse Search<br/>TF-IDF Scoring]
    
    DENSE --> RRF[RRF Fusion]
    SPARSE --> RRF
    
    RRF --> TOP[Top K Results]
    TOP --> ENR[Enrich with<br/>Entities + Rels]
    
    style RRF stroke:#f59e0b,stroke-width:2px

What is BM25?

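BM25 (Best Match 25) scores a document $D$ against a query $Q$ using the frequency $f(q_i, D)$ of each query term in the document, the document length $|D|$, and the average document length $avgdl$; $k_1$ and $b$ are tuning parameters. The formula: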

$\text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$

Key intuitions:

IDF term: Rare words matter more ("HNSW" > "the")
TF saturation: A word appearing 10x isn't 10x more relevant than a word appearing once
Length normalization: Long documents don't get an unfair advantage
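
To make those intuitions concrete, here is a small sketch of the scoring formula. It assumes the plain $\log(N/df)$ IDF from earlier in this article and the commonly cited defaults $k_1 = 1.2$ and $b = 0.75$; the function name and parameters are hypothetical, and production implementations often use a smoothed IDF instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// BM25 score of one tokenized document for a tokenized query.
// Assumptions: plain log(N/df) IDF, k1 = 1.2, b = 0.75.
static double Bm25Score(
    string[] queryTerms, string[] docTerms,
    IReadOnlyDictionary<string, int> docFreq, int totalDocs, double avgDocLength,
    double k1 = 1.2, double b = 0.75)
{
    double score = 0;
    foreach (var term in queryTerms.Distinct())
    {
        if (!docFreq.TryGetValue(term, out var df) || df == 0) continue;
        int tf = docTerms.Count(t => t == term);
        if (tf == 0) continue;

        double idf = Math.Log((double)totalDocs / df);
        // Longer-than-average documents get a larger denominator.
        double lengthNorm = 1 - b + b * (docTerms.Length / avgDocLength);
        score += idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
}

var frequencies = new Dictionary<string, int> { ["hnsw"] = 5 };
var once = new[] { "hnsw", "index", "build", "notes" };
var repeated = new[] { "hnsw", "hnsw", "hnsw", "hnsw" };
var query = new[] { "hnsw" };

double scoreOnce = Bm25Score(query, once, frequencies, totalDocs: 100, avgDocLength: 4);
double scoreRepeated = Bm25Score(query, repeated, frequencies, totalDocs: 100, avgDocLength: 4);

Console.WriteLine(scoreRepeated > scoreOnce);     // True: more occurrences still score higher
Console.WriteLine(scoreRepeated < 4 * scoreOnce); // True: 4x the occurrences is not 4x the score
```

The second check is TF saturation in action: with these parameters, four occurrences score about 1.7x a single occurrence rather than 4x.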

For a full BM25 implementation, see hybrid search and indexing.

Reciprocal Rank Fusion (RRF)

RRF merges rankings from different retrieval systems. Each rank position gets a score:

$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$

Where $R$ is the set of rankings, $r(d)$ is the 1-based rank of document $d$ in ranking $r$, and $k$ (typically 60) prevents overweighting the top result. Documents appearing in both rankings get boosted:

// SearchService.cs - RRF fusion
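const int k = 60;
var scores = new Dictionary<string, double>(); // chunk ID -> fused RRF score

// rank is 0-based here, so +1 converts it to the 1-based rank in the formula
foreach (var (chunk, rank) in denseResults.Select((c, i) => (c, i)))
    scores[chunk.Id] = 1.0 / (k + rank + 1);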

foreach (var (chunk, rank) in sparseResults.Select((c, i) => (c, i)))
{
    var rrfScore = 1.0 / (k + rank + 1);
    if (scores.TryGetValue(chunk.Id, out var existing))
        scores[chunk.Id] = existing + rrfScore;  // Boost for appearing in both!
    else
        scores[chunk.Id] = rrfScore;
}

Example: A document ranked #1 in dense and #3 in sparse:

Dense: $1/(60+1) = 0.0164$
Sparse: $1/(60+3) = 0.0159$
Combined: 0.0323 (higher than either alone)
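
The same arithmetic can be checked with a self-contained sketch that fuses any number of ranked ID lists. `FuseRrf` and the chunk IDs are hypothetical names for illustration; the logic follows the formula above with 1-based ranks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fuse ranked ID lists with RRF. The first item in each list has rank 1.
static Dictionary<string, double> FuseRrf(IEnumerable<IReadOnlyList<string>> rankings, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in rankings)
    {
        for (int i = 0; i < ranking.Count; i++)
        {
            scores.TryGetValue(ranking[i], out var existing); // 0 when the ID is new
            scores[ranking[i]] = existing + 1.0 / (k + i + 1);
        }
    }
    return scores;
}

var dense = new[] { "chunk-a", "chunk-b", "chunk-c" };  // chunk-a is #1
var sparse = new[] { "chunk-b", "chunk-d", "chunk-a" }; // chunk-a is #3

var fused = FuseRrf(new[] { dense, sparse });
foreach (var kv in fused.OrderByDescending(e => e.Value))
    Console.WriteLine($"{kv.Key}: {kv.Value}");
```

Here chunk-a gets 1/61 + 1/63, matching the worked example. Note that chunk-b (ranked #2 and #1) edges it out with 1/62 + 1/61: consistently high placement in both lists beats a single top spot.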

Query Modes

flowchart TB
    Q[Query] --> CLASS[Classify Query]
    
    CLASS --> |"How do I use X?"| LOCAL[Local Search]
    CLASS --> |"What are the themes?"| GLOBAL[Global Search]  
    CLASS --> |"How does X relate to Y?"| DRIFT[DRIFT Search]
    
    LOCAL --> HS[Hybrid Search] --> CTX1[Chunk + Entity Context]
    GLOBAL --> CS[Community Summaries] --> MAP[Map-Reduce]
    DRIFT --> BOTH[Local + Communities] --> SYN[Synthesize]
    
    CTX1 --> LLM[LLM Answer]
    MAP --> LLM
    SYN --> LLM
    
    style LOCAL stroke:#22c55e,stroke-width:2px
    style GLOBAL stroke:#3b82f6,stroke-width:2px
    style DRIFT stroke:#a855f7,stroke-width:2px

Query Classification

// QueryEngine.cs
private static QueryMode ClassifyQuery(string query)
{
    var q = query.ToLowerInvariant();
    if (q.Contains("main theme") || q.Contains("summarize") || q.Contains("overview"))
        return QueryMode.Global;
    if (q.Contains("relate") || q.Contains("connect") || q.Contains("compare"))
        return QueryMode.Drift;
    return QueryMode.Local;
}

This classifier is deliberately simple - and easy to replace with a small intent model later. If no entities match, the system degrades cleanly to pure hybrid retrieval.

CLI Usage

Indexing

# Heuristic mode (default) - fast, no per-chunk LLM
dotnet run --project Mostlylucid.GraphRag -- index ./test-markdown

# LLM mode - Microsoft-style per-chunk extraction
dotnet run --project Mostlylucid.GraphRag -- index ./test-markdown --extraction-mode llm

GraphRAG Indexer
  Source: test-markdown
  Database: graphrag.duckdb
  Model: llama3.2:3b
  Extraction: Heuristic (IDF + signals)

Initializing...
Indexing docker-development-deep-dive.md: 0%
Indexing docker-swarm-cluster-guide.md: 40%
Indexing dockercomposedevdeps.md: 80%
Indexing complete: 100%
Classifying entities...: 0%
Extracted 168 entities, 315 rels (4 LLM calls): 100%
Found 10 communities: 100%
Summarizing c_0_2 (12 entities): 20%
Summarizing c_0_8 (4 entities): 80%

────────────────── Indexing Complete ───────────────────
┌───────────────┬───────┐
│ Metric        │ Count │
├───────────────┼───────┤
│ Documents     │ 5     │
│ Chunks        │ 62    │
│ Entities      │ 168   │
│ Relationships │ 312   │
│ Communities   │ 10    │
└───────────────┴───────┘

Querying

dotnet run --project Mostlylucid.GraphRag -- query "How do I use Docker Compose?"

──────────────────── Local Search ────────────────────

Query: How do I use Docker Compose?

╭─Answer────────────────────────────────────────────────╮
│ To run the services defined in the                    │
│ devdeps-docker-compose.yml file, you need to run the  │
│ following command in the same directory as the file:  │
│                                                       │
│ docker compose -f .\devdeps-docker-compose.yml up -d  │
│                                                       │
│ This command will start the containers in detached    │
│ mode.                                                 │
╰───────────────────────────────────────────────────────╯

Related Entities: Docker, container, services, image

Sources: 5 chunks (top score: 0.016)

Stats

dotnet run --project Mostlylucid.GraphRag -- stats

─────────────── GraphRAG Database Stats ────────────────
┌───────────────┬───────┐
│ Metric        │ Count │
├───────────────┼───────┤
│ Documents     │     5 │
│ Chunks        │    62 │
│ Entities      │   168 │
│ Relationships │   312 │
│ Communities   │    10 │
└───────────────┴───────┘

Database size: 7.76 MB

Cost Comparison

For 100 blog posts (100 documents, ~500 chunks):

Operation	MSFT GraphRAG	Heuristic	Hybrid	LLM
Entity extraction	1,000 calls	0	0	1,000 calls
Document enhancement	-	-	100 calls	-
Classification	Included	~4 batch	-	~4 batch
Community summaries	~20	~20	~20	~20
Total LLM calls	~1,020	~24	~120	~1,024
Relationship quality	Semantic	Co-occurrence	Semantic	Semantic
Cost (gpt-4o-mini)	~$5-10	~$0.15	~$0.75	~$5-10
Cost (Ollama)	N/A	$0	$0	$0

Hybrid mode is the sweet spot for most technical content: you get semantic relationships (not just co-occurrence) at ~10% of MSFT's cost.

These costs are rough order-of-magnitude estimates; exact figures depend on chunk size and prompt shape.
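
The call counts in the table can be reproduced with a back-of-envelope sketch. The function below is hypothetical; it assumes ~200 extracted entities (so ~4 classification batches at 50 entities per call) and 20 communities, which are the values implied by the table rather than measured figures.

```csharp
using System;

// Rough LLM call counts per extraction mode, mirroring the rows of the table above.
// Entity and community counts are assumed inputs, not outputs of the pipeline.
static int EstimateLlmCalls(string mode, int documents, int chunks, int entities, int communities)
{
    int classificationBatches = (int)Math.Ceiling(entities / 50.0); // ~1 call per 50 entities
    return mode switch
    {
        "heuristic" => classificationBatches + communities,
        "hybrid"    => documents + communities,
        "llm"       => 2 * chunks + classificationBatches + communities,
        _ => throw new ArgumentException($"Unknown mode: {mode}"),
    };
}

foreach (var name in new[] { "heuristic", "hybrid", "llm" })
{
    int calls = EstimateLlmCalls(name, documents: 100, chunks: 500, entities: 200, communities: 20);
    Console.WriteLine($"{name}: ~{calls} calls");
}
// heuristic: ~24 calls
// hybrid: ~120 calls
// llm: ~1024 calls
```

The takeaway is in the scaling terms: heuristic and hybrid grow with entities and documents, while LLM mode grows with chunks.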

Tradeoffs

Aspect	Heuristic	Hybrid	LLM	MSFT GraphRAG
Entity detection	IDF + structure	IDF + structure	LLM per chunk	LLM per chunk
Relationships	Co-occurrence	LLM-inferred	LLM-inferred	LLM-inferred
LLM calls (100 docs)	~24	~120	~1,024	~1,020
Relationship quality	Low	High	High	High
Works offline	Yes	Yes (Ollama)	Yes (Ollama)	API required
Best for	Speed-critical	Recommended	Unstructured prose	Unstructured text

Conceptually, this is the same pipeline as DocSummarizer: build structure first, then let an LLM narrate it.

Where this breaks down:

Fiction or narrative text without structural markup
Implicit relationships with no lexical signal
Highly ambiguous entity names that require world knowledge to disambiguate

For those cases, use LLM mode or Microsoft's full approach.

Code

The implementation is minimal - ~2,000 lines across these files:

Mostlylucid.GraphRag/
├── Storage/GraphRagDb.cs              # DuckDB with HNSW + provenance
├── Services/EmbeddingService.cs       # ONNX BERT wrapper
├── Services/OllamaClient.cs           # LLM client
├── Extraction/
│   ├── IEntityExtractor.cs            # Extractor interface
│   ├── EntityExtractor.cs             # Heuristic mode
│   ├── HybridEntityExtractor.cs       # Hybrid mode (recommended)
│   └── LlmEntityExtractor.cs          # Full LLM mode
├── Search/SearchService.cs            # BM25 + BERT hybrid
├── Graph/CommunityDetector.cs         # Leiden + summarization
├── Query/QueryEngine.cs               # Local/Global/DRIFT
├── Indexing/MarkdownIndexer.cs        # Chunking
├── GraphRagPipeline.cs                # Orchestration
├── Models.cs                          # Shared types + ExtractionMode enum
└── Program.cs                         # CLI

Source: Mostlylucid.GraphRag/

Part 1: GraphRAG Fundamentals - Why knowledge graphs improve RAG
Hybrid Search with BM25 - Deep dive into BM25 scoring and indexing
Semantic Search with ONNX and BERT - BERT embeddings in .NET
DocSummarizer Tool - My document summarization tool that shares code with this

External Resources

DuckDB VSS Extension - HNSW vector search
Leiden Algorithm Paper - Community detection
RRF Paper - Reciprocal Rank Fusion