GraphRAG, 27,000 Aviation Accidents, and What OpenClaw Found


Retrieval-Augmented Generation gave large language models access to external knowledge at query time, but it has a structural ceiling: it retrieves flat text chunks ranked by vector similarity. Ask a question that requires connecting facts spread across different documents, and it usually fails. The model sees pieces, not relationships.

GraphRAG replaces the chunk store with a knowledge graph — entities, relationships, attributes — extracted from the corpus. Retrieval traverses this graph, following edges from entity to entity, surfacing paths of connected facts. “What aircraft make appears most frequently in carburetor icing accidents during winter approaches?” requires connecting accident reports → aircraft types → failure modes → conditions → seasons. A vector database can’t do that. A knowledge graph can.

The question is no longer whether to use graph-structured retrieval. It’s how to build the graph, detect the structure within it, and make retrieval efficient at scale.

Youtu-GraphRAG is a system from Tencent’s Youtu Lab (Dong et al., 2025; accepted at ICLR 2026) that treats graph construction, indexing, retrieval, and generation as a vertically unified architecture rather than separate bolted-together pipelines. Its key innovations are schema-guided entity extraction (bounding the graph to analytically relevant types), a dual community detection algorithm called FastTreeComm that fuses graph topology with node semantics, and an iterative retrieval-reasoning loop (IRCoT) that enables multi-hop queries.

Dong, J., An, S., Yu, Y., Zhang, Q.W., Luo, L., Huang, X., Wu, Y., Yin, D., & Sun, X. (2025). Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning. arXiv:2508.19855.

Most GraphRAG implementations assume a single model does everything — construction, retrieval, reasoning, and synthesis. This project uses a three-tier architecture where each layer does what it’s best at.

Tier 1 — Local open-weight models (the labor layer). Qwen2.5-72B-Instruct (AWQ 4-bit, ~40GB) handles knowledge graph construction, IRCoT retrieval, and initial query answering. GTE-Qwen2-7B handles embeddings. Both run locally on a bare-metal NVIDIA GH200 Grace Hopper Superchip — 96GB HBM3 GPU memory unified with 480GB LPDDR5X CPU memory. This tier processes 27,378 documents, builds 313K nodes, runs community detection, and executes multi-hop retrieval loops. It needs to be local because the volume makes API calls prohibitively expensive, and because the architecture extends to domains where data sensitivity requires that nothing leaves institutional infrastructure.

Tier 2 — Frontier model via OpenClaw (the reasoning and orchestration layer). Claude Sonnet 4.6, accessed through OpenClaw, provides the judgment that sits above graph retrieval: interpreting results, formulating follow-up queries, deciding what to investigate next, synthesizing findings across multiple graph outputs, and directing the entire research pipeline. OpenClaw is a personal AI infrastructure platform — a persistent agentic execution environment with file system access, shell execution, process management, sub-agent spawning, cron scheduling, and memory, all mediated through a typed tool API. The human interface is a Slack DM. OpenClaw is what makes the pipeline operable by someone who isn’t going to write and debug Python by hand — it’s not convenience, it’s the infrastructure that lowers the barrier from “requires a data engineering team” to “requires a Slack message.”

Tier 3 — External search (the validation layer). OpenClaw’s Brave Search API skill grounds graph outputs against published literature in real time. When the graph surfaces a pattern — say, a causal chain structure in VFR-into-IMC accidents — OpenClaw can immediately search for NTSB studies, FAA advisory circulars, and published safety analyses that corroborate or challenge the finding. Neither the local model nor the graph can do this alone. This tier closes the loop between corpus-internal findings and corpus-external validation.

The three tiers communicate through OpenClaw. A typical analytical cycle: the operator poses a question over Slack → Claude (Tier 2) formulates a query and sends it to the youtu-graphrag API running on Qwen (Tier 1) → the graph returns a retrieval path and answer → Claude interprets the result, OpenClaw searches published literature via Brave Search API (Tier 3) to validate or contextualize it → Claude synthesizes the graph output and external evidence into a finding → Claude decides what to ask next.

The project started as a single message: “Can we load the NTSB aviation accident dataset into this?”

Within one conversation, OpenClaw:

  1. Searched GitHub for the youtu-graphrag repository and read the documentation

  2. Cloned the repo, installed dependencies, created a virtual environment

  3. Downloaded 27,378 NTSB accident narratives and formatted them as the expected corpus JSON

  4. Authored a domain-specific schema (schemas/ntsb.json) defining entity types: cause_factor, pilot, aircraft, phase_of_flight, injury_outcome, maintenance_issue, location, regulatory_violation, weather_condition

  5. Configured the system for local LLM endpoints

  6. Launched the 27-hour knowledge graph extraction in a background nohup session

No terminal was opened by a human. The entire setup — dependency resolution, schema authoring, configuration, and job launch — happened through natural language over Slack.
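Step 4's schema is what bounds extraction to analytically relevant types. A rough sketch of what such a file can contain — the entity names come from this article and the relation names from the triples retrieved later, but the surrounding structure is an assumption, not the exact youtu-graphrag schema format:

```python
import json

# Hypothetical shape for schemas/ntsb.json; the real youtu-graphrag
# format may differ. Entity types are the ones listed above; relation
# types are ones that appear in the retrieved triples.
ntsb_schema = {
    "entity_types": [
        "cause_factor", "pilot", "aircraft", "phase_of_flight",
        "injury_outcome", "maintenance_issue", "location",
        "regulatory_violation", "weather_condition",
    ],
    "relation_types": [
        "caused_by", "contributed_to", "preceded_by",
        "resulted_in", "has_attribute",
    ],
}

print(json.dumps(ntsb_schema, indent=2))
```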

Before the engineering details or query results, here’s what this effort actually produced and what it didn’t.

What’s not novel: The top-line aviation findings — LOC-I, VFR-into-IMC, fuel mismanagement, approach and landing accidents — are well-established in existing safety literature. Reproducing the consensus findings is validation, not discovery.

What is genuinely novel:

1. The methodology applied to this domain. Building a 313,647-node knowledge graph over the full NTSB corpus and running multi-hop agentic reasoning across it is new. The NTSB has its own databases, but they’re built from coded fields — investigator-assigned categories. This graph is built from narrative text, extracting structure that coded fields miss.

2. Causal chain interior mapping at corpus scale. Identifying the intermediate steps between known cause categories and crash outcomes — across thousands of reports simultaneously — is not something existing NTSB publications provide.

3. Quantified interaction effects from narrative evidence. The night + VFR-only pilot + mountainous terrain lethal triad is a named risk combination in the graph. Existing statistics track each factor individually. The graph makes the intersection retrievable and characterizable across the full corpus.

4. Systematic surfacing of narrative-resident factors. Substance involvement, cockpit distraction, and recency gaps appear in narrative text at rates that coded databases undercount. Building a graph from narrative text makes this implicit data explicit and queryable.

5. An engineering playbook for large-scale GraphRAG deployment. The five problems encountered and solved (detailed below) represent practical knowledge absent from the youtu-graphrag documentation or academic papers. Any GraphRAG deployment on 27,000+ documents will hit these.

6. A three-tier architecture for AI-directed research. Local open-weight models for compute-intensive graph operations, a frontier model for orchestration and reasoning, and external search for real-time validation — coordinated through a persistent agentic execution environment. Most GraphRAG implementations assume a single model. This architecture assigns each tier the work it’s best suited for: volume and privacy to the local model, judgment and synthesis to the frontier model, grounding to external search. The entire pipeline is operable through natural language over a messaging interface.

The youtu-graphrag codebase was tested on small-to-medium graphs. The NTSB corpus produced 313,647 nodes and 855,414 edges. At that scale, several assumptions in the original code became performance catastrophes.

Problem 1: memory corruption in community detection.

Symptom: FastTreeComm crashed with glibc "double free or corruption" memory errors on the full graph.

Root cause: FastTreeComm computes dense N×N similarity matrices, triggering heavily multithreaded linear algebra via OpenBLAS. The system’s precompiled OpenBLAS library had threading contention issues under high thread counts — a well-known class of OpenBLAS bugs in concurrent workloads.

Fix: Setting OPENBLAS_NUM_THREADS=1 eliminated the crash. Additionally, _compute_jaccard_matrix_vectorized was rewritten to use sparse matrix operations (scipy.sparse) instead of materializing the full dense adjacency matrix:

# Before: dense N×N matrix — crashes at scale
intersection = sub_adj.dot(sub_adj.T).toarray()
union = row_sums[:, None] + row_sums - intersection
jaccard_matrix = intersection / (union + 1e-9)

# After: sparse intermediate, only densify final result
intersection = sub_adj.dot(sub_adj.T)  # stays sparse
# ... convert to dense only for the small subgraph window
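For concreteness, here is a self-contained version of the sparse Jaccard computation on a toy adjacency matrix — an illustrative sketch, not the project's exact `_compute_jaccard_matrix_vectorized`:

```python
import numpy as np
from scipy import sparse

def jaccard_matrix_sparse(adj: sparse.csr_matrix) -> np.ndarray:
    """Pairwise Jaccard similarity of node neighborhoods.

    adj is a binary adjacency matrix. The intersection counts are
    computed as a sparse product; densification happens only at the
    end (in the real code, only for a small subgraph window).
    """
    adj = adj.astype(np.float64)
    inter = adj.dot(adj.T)                # stays sparse
    row_sums = np.asarray(adj.sum(axis=1)).ravel()
    inter_dense = inter.toarray()         # densify small window only
    union = row_sums[:, None] + row_sums[None, :] - inter_dense
    return inter_dense / (union + 1e-9)

# Toy 3-node graph: nodes 0 and 1 each have two neighbors, sharing node 2.
A = sparse.csr_matrix(np.array([
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
], dtype=np.float64))
J = jaccard_matrix_sparse(A)
```

Note that `OPENBLAS_NUM_THREADS=1` must be set in the environment before NumPy loads; it is the threading fix, while the sparse rewrite is the memory fix.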

Problem 2: hub-node explosion in multi-hop retrieval.

Symptom: A single query took 71–244 seconds for retrieval alone.

Root cause: _get_3hop_neighbors() performs BFS expansion from query entities. Hub nodes like “pilot error” caused expansion to 525,000+ triples in a single step.

Fix: A hard cap of max_neighbors=200 in the BFS loop — sufficient to surface relevant connections without combinatorial explosion. Query times dropped to 4–5 seconds per hop.
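The idea can be sketched with a plain adjacency-list BFS. The function and parameter names below are illustrative, not the project's actual `_get_3hop_neighbors`:

```python
from collections import deque

def bounded_khop(adj, seeds, hops=3, max_neighbors=200):
    """k-hop BFS with a hard per-node fan-out cap, so a hub node such
    as "pilot error" cannot expand into hundreds of thousands of
    triples in a single step."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= hops:
            continue
        # The cap is the whole fix: take at most max_neighbors edges.
        for nbr in adj.get(node, [])[:max_neighbors]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Toy graph: with max_neighbors=2, node "d" is never expanded from "a".
adj = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f"]}
reached = bounded_khop(adj, ["a"], hops=2, max_neighbors=2)
```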

Problem 3: linear scans in the retrieval hot path.

Symptom: Even after the BFS fix, queries took 30+ minutes end-to-end.

Root cause: Three functions performed linear scans whose cost grew with graph size, adding up to billions of operations per query.

Fix: Pre-built O(1) lookup structures at index load time:

def _build_fast_lookup_indices(self):
    # Reverse FAISS map: node_id → faiss_idx (was O(N) scan per node)
    self._node_id_to_faiss_idx = {
        node_id: int(idx)
        for idx, node_id in self.faiss_retriever.node_map.items()
    }
    
    # Schema type → [node_ids] (was O(N) graph scan per schema type)
    self._schema_type_to_nodes = {}
    for node_id, node_data in self.graph.nodes(data=True):
        schema_type = node_data.get('properties', {}).get('schema_type', '')
        self._schema_type_to_nodes.setdefault(schema_type, []).append(node_id)
    
    # Adjacency caches: node → [(u, rel, v)] (was O(|E|) edge scan per node)
    self._node_out_triples = {}
    self._node_in_triples = {}
    for u, v, data in self.graph.edges(data=True):
        self._node_out_triples.setdefault(u, []).append(
            (u, data.get('relation',''), v))
        self._node_in_triples.setdefault(v, []).append(
            (u, data.get('relation',''), v))

Total warm query time collapsed from 30+ minutes to ~60 seconds — a 30× improvement.

Problem 4: cold-start retriever loading.

Symptom: Every query reloaded the full retriever from scratch; the first query after service start took 15+ minutes.

Fix: A _retriever_cache dictionary in backend.py and a startup_event() pre-warm that loads the retriever in the background on service start.

Problem 5: community detection crash after extraction.

The 27-hour extraction completed, but community detection crashed before saving its output. A standalone recovery script (run_community_detection.py) was written to load the checkpoint, reconstruct the NetworkX graph, run FastTreeComm independently, and save the final graph without re-running extraction.
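The recovery pattern generalizes: persist extraction output as a checkpoint, then make every downstream stage restartable from it. A stdlib-only sketch with a hypothetical checkpoint format (the real script rebuilds a NetworkX graph):

```python
import json

def recover_graph(checkpoint_path):
    """Rebuild an adjacency structure from an extraction checkpoint,
    so community detection can rerun without repeating the 27-hour
    extraction. The checkpoint format here is assumed, not the real
    one: {"nodes": [...], "edges": [[u, relation, v], ...]}."""
    with open(checkpoint_path) as f:
        ckpt = json.load(f)
    adj = {node: [] for node in ckpt["nodes"]}
    for u, rel, v in ckpt["edges"]:
        adj[u].append((rel, v))
    return adj
```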

With the system stable, OpenClaw planned and ran 30 queries designed to exploit the difference between what NTSB databases code and what the narrative text contains. Standard NTSB statistics report the final link in a causal chain. The knowledge graph encodes the full narrative. Each query exercised the full three-tier cycle: Claude formulated the question, Qwen’s IRCoT retrieval traversed the graph, and Claude interpreted the results — pulling in external sources where validation was needed.

Each of the following examples shows the actual query, the real triples and accident narratives the system retrieved, the external source that validates the known component, and what the graph added that the external source doesn’t capture.

Query: “What is the step-by-step causal chain in accidents where VFR pilots continued flight into IMC conditions? Describe the sequence of events from initial weather encounter through loss of control and impact.”

What the graph retrieved — the actual triples, with schema types, that formed the reasoning context:

[pilot]          --contributed_to→  "Continued VFR Flight into IMC Conditions"  [score: 0.754]
[pilot]          --led_to→          [cause_factor: "Improper Decision to Continue VFR into IMC"]
[pilot]          --failed_to→       "avoid IMC"
[pilot]          --violated→        [regulatory_violation: "VFR Flight into IMC"]
[pilot]          --caused_by→       "Continued Flight Under VFR into IMC"        [score: 0.740]
[weather_condition: "VFR to IMC"] --has_attribute→ weather_type
[cause_factor: "Controlled Flight into Terrain"] --preceded_by→ "Pilot's Decision to Continue VFR into IMC"
[weather_condition: "IMC"] --preceded_by→ "Pilot's Decision to Continue Flight"
[accident: "ANC21FA069"] --caused_by→ "Pilot's decision to continue VFR flight into IMC"
[accident: "WPR09FA076"] --caused_by→ [weather_condition: "VFR flight into IMC conditions"]

The system grounded its answer in three accident narrative chunks (WPR11FA241, ERA12FA526, ANC21FA069). WPR11FA241 is representative — the narrative shows the full causal interior that the triple above compresses into a single edge:

“The pilot filed an IFR clearance. About 44 minutes into the flight, the pilot canceled the IFR clearance and continued VFR. About 16 minutes later, the pilot amended the destination to a closer airport, and about 19 minutes after that, radar service was terminated. About 1.5 hours after radar service was terminated, the airplane impacted mountainous terrain at 6,400 feet. The accident site was located beyond the amended destination — the pilot had overflown it and may have been attempting to reach the original destination. Toxicology showed positive results for amphetamine and methamphetamine. Review of medical records indicated the pilot had been treated for a skin infection the day prior and may have been distracted by his medical condition.” — WPR11FA241

The system’s synthesized answer (verbatim):

“(1) Initial Weather Encounter: pilot operating VFR encounters IMC — reduced visibility, clouds, precipitation. (2) Failure to Avoid IMC: pilot fails to divert, often due to poor preflight planning or overconfidence. (3) Continued VFR into IMC: pilot makes the improper decision to continue, violating regulatory requirements. (4) Spatial Disorientation: pilot loses orientation and control due to absence of visual references. (5) Loss of Control: inability to maintain attitude and heading leads to stall, spin, or other LOC event. (6) Controlled Flight into Terrain: aircraft collides with terrain at high descent rate. (7) Impact and Damage: significant damage, potential loss of life.”

External validation: The AOPA Air Safety Institute reports that VFR-into-IMC carries an 86% fatality rate in non-commercial fixed-wing aircraft — the worst fatality rate of any weather-related accident category. The FAA GAJSC has an active safety enhancement specifically targeting step 4: “FAA and Industry to revise teaching and training the UIMC escape response maneuver to include an initial climb before any heading change.” (gajsc.org/se)

What the graph adds: AOPA names the category. GAJSC targets one step. The graph narrows to the interior — WPR11FA241 shows a pilot who had already filed IFR, accepted an IFR clearance, then canceled it 44 minutes in and continued VFR. The triple [pilot] --canceled→ [IFR clearance] followed by [pilot] --continued→ [VFR into IMC] is in the narrative, not in any coded field. That specific decision node — canceling a clearance that would have been protective — appears across multiple retrieved narratives. GAJSC targets the cloud boundary. The graph finds the departure gate.

Query: “What is the fatality rate and survivability profile of accidents involving the combination of night conditions, VFR-only pilots, and mountainous terrain? List specific accident report identifiers that matched this combination.”

What the graph retrieved:

[pilot]          --impaired_by→    [weather_condition: "Dark Night Visual Meteorological Conditions"]
[pilot]          --impaired_by→    [weather_condition: "Night Visual Meteorological Conditions"]
[pilot]          --failed_to→      "Assess Terrain Suitability"
"Crash in Mountainous Terrain"  --resulted_in→  "Injury Severity: FATL"   [×2, score: 0.570–0.592]
[pilot]          --has_attribute→  "pilot_experience: 11.3 total night hours in 2007, 0.2 hours of night 5 days prior"
[weather_condition: "terrain impact"] --has_attribute→ injury_outcome: fatal

Specific reports matched:

  • MIA08FA080: “The noninstrument-rated pilot’s failure to maintain terrain clearance at night in marginal visual flight conditions. Contributing to the accident was the dark night, and low cloud ceilings.” Fatal.

  • ERA12FA193: “The non-night-qualified pilot’s improper decision to depart in dark night visual meteorological conditions, which resulted in his subsequent spatial disorientation, loss of control, and impact with water.” Fatal.

The system’s answer (verbatim): “The provided knowledge context does not contain specific fatality rate statistics, but the following accident reports match the combination: MIA08FA080 and ERA12FA193. Both resulted in fatalities. The pattern indicates spatial disorientation leading to loss of control and terrain or water impact.”

External validation: The FAA GAJSC CFIT working group identifies night flight in mountainous terrain as a priority safety enhancement, recommending that instrument-current pilots back up nighttime VFR mountain approaches with lateral and vertical navigation guidance. NTSB documents night VFR CFIT as a recurring fatal accident type. Each individual risk factor — night, VFR-only, terrain — is tracked separately in published statistics.

What the graph adds: The system surfaced the pilot_experience attribute: “11.3 total night hours in 2007, 0.2 hours of night in the 5 days prior.” That’s in the accident narrative. No coded NTSB field reports “hours of night flight in the preceding 5 days.” The graph extracted it as a node attribute during ingestion. Across hundreds of matching accidents, this attribute is queryable — making “legally night-current but operationally inexperienced at night” a measurable risk subgroup, not just an intuition. The individual factor statistics don’t capture it. The graph does.

Query: “In how many accident narratives do toxicology findings mention marijuana, prescription medications, or OTC drugs? In how many of those cases were substances coded as a causal or contributing factor versus only appearing in the narrative text? List specific report identifiers.”

What the graph retrieved:

[medical_condition: "Toxicology"]  --has_attribute→  contribution_to_accident: "no"
[cause_factor: "toxicological testing"]  --revealed→  "marijuana"
[contributing_factor: "Pilot's Medication"]  --has_attribute→  contribution_to_accident: "unlikely to have contributed"
[cause_factor: "Ethanol"]  --has_attribute→  contribution_to_accident: "unlikely"
[toxicology_conclusion: "Unlikely Contribution to Accident"]  --has_attribute→  toxicology_conclusion: "Unlikely Contribution"

Specific reports with substances in narrative text:

  • ERA13FA219: Diphenhydramine (OTC antihistamine) present in toxicology. Coded cause: loss of control. Substance listed in narrative, marked “unlikely to have contributed.”

  • ERA22FA009: Multiple medications present. Coded cause: not substance-related. Medications appear in narrative only.

  • ERA18FA005: Sertraline and desmethylsertraline (antidepressants). Coded cause: not substance-related.

The system’s answer (verbatim): “None of the substances were explicitly coded as causal or contributing factors in the provided narratives. Substances only appearing in narrative text: ERA13FA219 (diphenhydramine), ERA22FA009 (several medications), ERA18FA005 (sertraline and desmethylsertraline).”

External validation: A 2017 NTSB study of 2,231 fatal civil aviation accidents found that the prevalence of potentially impairing substances in pilot toxicology doubled from 11.2% to 23.0% between 1990 and 2012. The FAA’s own research has documented the gap between toxicological findings and coded attribution — investigators routinely note substances but cannot determine impairing dose from postmortem results, leading to “unlikely to have contributed” classifications even when presence is confirmed.

What the graph adds: The graph encodes two separate nodes that standard coded data collapses into one: [toxicological testing] --revealed→ "marijuana" and [Pilot's Medication] --has_attribute→ contribution_to_accident: "unlikely". These are different things — substance presence and substance attribution — and the graph keeps them distinct. The triple contribution_to_accident: "no" appearing as a node attribute means the graph can now answer: “In how many accidents were substances present but not attributed?” That query runs in seconds. It has never been systematically answered from the full 27,000-report corpus, because coded databases record only the attribution, not the presence-without-attribution pattern.
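As a sketch of why that query is cheap once the graph exists: with presence and attribution stored as separate facts, the question reduces to a set intersection over triples. The triples and report IDs below are placeholders for illustration, not real retrieval output:

```python
# Hypothetical (subject, relation, object) triples mirroring the
# presence-vs-attribution split described above. Report IDs are
# placeholders, not real NTSB identifiers.
triples = [
    ("XX11FA001", "revealed", "diphenhydramine"),
    ("XX11FA001", "contribution_to_accident", "unlikely to have contributed"),
    ("XX12FA002", "revealed", "sertraline"),
    ("XX12FA002", "contribution_to_accident", "no"),
    ("XX13FA003", "revealed", "ethanol"),
    ("XX13FA003", "contribution_to_accident", "causal"),
]

NOT_ATTRIBUTED = {"no", "unlikely", "unlikely to have contributed"}

def present_but_not_attributed(triples):
    # Substance presence and causal attribution are separate edges,
    # so "present but not attributed" is a set intersection.
    present = {s for s, rel, _ in triples if rel == "revealed"}
    cleared = {s for s, rel, obj in triples
               if rel == "contribution_to_accident"
               and obj in NOT_ATTRIBUTED}
    return present & cleared

matches = present_but_not_attributed(triples)
```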

Query: “How often do accident narratives describe a first, manageable problem that consumed pilot attention, followed by a second problem that became fatal because the pilot’s resources were already committed?”

What the graph returned (from the earlier forensic batch, run against the same corpus):

Three specific instances with identified transition points:

  • ERA11FA414: First failure — unsecured oil gauge rod. Pilot notices oil pressure dropping, continues toward destination. Second failure — total engine seizure from oil starvation. The intervention window was the moment the pilot noticed the pressure drop: at that point, a precautionary landing was a complete solution. Five minutes later, the engine was gone.

  • CEN18FA236: First failure — weather navigation decisions consume pilot attention. Fuel state goes unmonitored. Second failure — fuel exhaustion, total loss of engine power. Each failure was independently recoverable; together they weren’t.

  • DEN08LA053: First failure — fuel valve left OFF, engine loses power on initial climb. Pilot manages obstacle clearance. Second failure — airspeed bleeds off during forced landing setup, inadvertent stall. Two recoverable events, one outcome.

External validation: James Reason’s Swiss Cheese model (University of Manchester, early 1990s) describes accidents as the alignment of holes through multiple defensive layers. “Aviate, Navigate, Communicate” exists as a trained priority framework precisely for the moment when multiple problems compete for attention. The dual-failure pattern is documented in aviation human factors literature under the label of attentional tunneling.

What the graph adds: Reason’s model names the structure abstractly. Existing training teaches the priority list. What the graph adds is the transition point — the specific moment in each accident where intervention was still possible and the pilot did not take it. In ERA11FA414, that moment is when the oil pressure gauge first moved. That event is in the narrative text; the graph encodes it as a node in the accident chain. No coded NTSB field captures “pilot noticed precursor, assessed it as manageable, and continued.” The graph does — and running the pattern across 27,000 cases shows that the transition point almost always comes well before the point of no return.

Aviation safety research has been constrained by the gap between what investigators write and what databases store. Investigators write detailed, nuanced narratives. Databases store coded categories. GraphRAG bridges that gap by treating the full narrative corpus as a structured, queryable knowledge base.

The NTSB has 27,378 narratives in this dataset alone. Patterns in causal chain interiors, factor interactions, and narrative-resident contributing factors have been sitting in that text for decades — known to individual investigators who read the reports, but invisible at corpus scale to systematic query. This project demonstrates that the analysis is now feasible on a single machine, operable by one person, and capable of making queryable what was previously locked in unstructured text.

The three-tier architecture is what makes this extensible beyond a single project. The immediate next steps:

Strengthening Tier 3 (external validation). The current external search capability uses general web search. The next iteration will connect to richer, more authoritative sources — NTSB full docket files, FAA advisory circulars, published safety studies, and peer-reviewed literature databases — so that the validation layer reads primary sources rather than whatever surfaces in a general search. The goal is automated validation: the system surfaces a pattern from the graph, searches the literature for corroboration, and flags findings where the graph output has no published precedent — those are the candidates for genuinely novel discovery.

Temporal analysis. The NTSB dataset spans decades. Partitioning the graph by era and querying whether causal chain structures have shifted as cockpit technology evolved — whether spatial disorientation chains look different in the glass cockpit era versus the steam gauge era — would be a methodologically novel contribution that the current graph can support with minimal additional engineering.

Ontology-driven node normalization. The current graph has a vocabulary problem. Because entities are extracted from free-text narratives by an LLM, the same real-world concept often occupies multiple nodes: “spatial disorientation,” “pilot became spatially disoriented,” “SD event,” and “loss of spatial orientation” all refer to the same failure mode but live in different parts of the graph with separate edge sets. The same is true for cause factors, weather conditions, and phase-of-flight descriptions. This means that causal links which should be reinforcing each other across thousands of reports are instead fragmented across synonymous nodes — the graph knows less than it should because it can’t recognize that two differently-worded nodes are the same thing.

The fix is a structured ontology layer that normalizes extracted entities to canonical concepts before they enter the graph. Healthcare has solved a version of this problem: SNOMED CT maps hundreds of thousands of clinical terms to a formal hierarchy, so that “heart attack,” “myocardial infarction,” and “acute MI” all resolve to the same concept code and inherit the same parent relationships.

This matters for pattern recognition at a fundamental level. Right now, if you query the graph for causal chains involving spatial disorientation, you get results from whichever nodes the LLM happened to label with those exact words. Reports where the investigator wrote “the pilot lost attitude awareness” or “the aircraft entered an unusual attitude consistent with the pilot’s failure to transition to instruments” describe the same phenomenon but don’t get linked into the same causal chain structures. The graph sees them as unrelated. An ontology-normalized graph would collapse all of these into a single canonical node, and every edge — every causal relationship extracted from every report that touches that concept — would converge on that node. The result is dramatically higher edge density on the concepts that actually matter, which means community detection algorithms find stronger clusters, multi-hop retrieval follows more complete causal paths, and quantitative queries (“how many accident chains pass through spatial disorientation?”) return accurate counts instead of undercounts fragmented across synonyms.

Building this ontology is itself a non-trivial project. The next iteration of this pipeline will use existing aviation taxonomies as a scaffold, extend them with LLM-assisted term mapping to handle the long tail of natural language variation, and enforce normalization at extraction time so that the graph is built on canonical concepts from the start rather than requiring post-hoc deduplication of a quarter-million nodes.
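A minimal sketch of extraction-time normalization: a synonym table resolving surface forms to one canonical concept ID, with the mapping entries taken from the examples above. A real table would be scaffolded from aviation taxonomies plus LLM-assisted term mapping, not hand-written:

```python
# Synonym table: surface form (lowercased) -> canonical concept ID.
# Entries are the examples from the text; real coverage would come
# from an aviation taxonomy plus LLM-assisted mapping.
CANONICAL = {
    "spatial disorientation": "spatial_disorientation",
    "pilot became spatially disoriented": "spatial_disorientation",
    "sd event": "spatial_disorientation",
    "loss of spatial orientation": "spatial_disorientation",
    "the pilot lost attitude awareness": "spatial_disorientation",
}

def normalize_entity(surface: str) -> str:
    """Resolve an extracted entity to its canonical node ID, so every
    edge touching the concept converges on one node. Unknown terms
    fall through to a slugified form pending ontology review."""
    key = surface.strip().lower()
    return CANONICAL.get(key, key.replace(" ", "_"))
```

Enforcing this at extraction time means the graph is built on canonical concepts from the start, instead of deduplicating a quarter-million nodes after the fact.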

Full analysis report and source code available in the project repository.

Hardware: NVIDIA GH200 Grace Hopper Superchip (96GB HBM3 + 480GB LPDDR5X). Tier 1: Qwen2.5-72B-Instruct (AWQ 4-bit) + GTE-Qwen2-7B, local. Tier 2: Claude Sonnet 4.6 via OpenClaw. Tier 3: Brave Search API via OpenClaw. GraphRAG: youtu-graphrag (ICLR 2026).
