Agentic Search

Carlos Chinchilla

This is the kind of question where retrieval starts to break:

“Find multi-family properties in Austin that sold below assessed value in the last 12 months, are in census tracts where median income grew more than 8% year-over-year, have fewer than 20 units, and aren’t within 500 feet of a commercial zone.”

Hand that to a standard RAG pipeline and it will retrieve the most semantically similar chunks to the query, stuff them into a prompt, and give you a best-effort answer based on whatever happened to land in the context window. Maybe it surfaces a few listings. Maybe it nails the unit count but ignores the zoning constraint. You’ll never know what it missed, because single-shot retrieval doesn’t know what it doesn’t know.

Now watch what an agentic system does with the same question:

  1. Queries the county appraisal district for multi-family sales in the last 12 months, comparing sale price against assessed value
  2. Pulls census tract data for each property, checks ACS income estimates for year-over-year growth
  3. Filters to tracts where median income grew more than 8%
  4. Checks property records for unit counts, drops anything with 20 or more
  5. Queries zoning maps for commercial zone boundaries, calculates distance from each remaining property
  6. Eliminates properties within 500 feet of a commercial zone
  7. Cross-references the surviving set against listing histories and tax lien records for additional context
  8. Returns a shortlist with documentation — sale price vs. assessed value, income growth figures, unit count, zoning clearance — for each property

That’s more than retrieval. It’s research. And the gap between the two is the entire point of this post.
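The filtering half of those steps can be sketched as a pipeline. This is a hedged illustration only: the `Property` record and its fields are hypothetical stand-ins for the county appraisal, census, and zoning lookups, hard-coded here so the sketch is runnable.

```python
from dataclasses import dataclass

# Hypothetical, simplified record standing in for data joined from the
# county appraisal district, ACS census estimates, and zoning maps.
@dataclass
class Property:
    address: str
    sale_price: float
    assessed_value: float
    units: int
    tract_income_growth: float   # year-over-year, as a fraction (0.08 = 8%)
    feet_to_commercial: float    # distance to nearest commercial zone

def shortlist(properties):
    """Apply each constraint from the query as an explicit filter step."""
    steps = [
        lambda p: p.sale_price < p.assessed_value,   # sold below assessed value
        lambda p: p.tract_income_growth > 0.08,      # tract income grew > 8% YoY
        lambda p: p.units < 20,                      # fewer than 20 units
        lambda p: p.feet_to_commercial > 500,        # not within 500 ft of commercial
    ]
    survivors = list(properties)
    for step in steps:
        survivors = [p for p in survivors if step(p)]
    return survivors
```

The point is not the filters themselves but that each one is an explicit, checkable step, so you can document exactly why each property survived.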

I’ve been building search systems for a while now, and honestly, it took me longer than I’d like to admit to see this clearly. I kept trying to make RAG smarter instead of stepping back and asking what research actually looks like when you break it down.

How we got here

Every generation of search solved one problem and exposed the next.

Inverted indexes (Lucene, Elasticsearch) gave us millisecond keyword lookup across massive corpora — but you had to know the exact words. Search “heart attack” and you’d never find “myocardial infarction” unless someone hand-built a synonym list. I spent more time than I want to remember maintaining those synonym files by hand.

NLP annotation (spaCy, Stanford NER, MetaMap) solved extraction: pull structured entities — medications, dollar amounts, legal clauses — out of unstructured text. Medical notes became queryable FHIR records. SEC filings became structured financial data. Resumes became skills vectors. But these tools were extractors, not searchers. Someone still had to build the search layer on top, usually Elasticsearch.

Semantic search (Word2Vec, BERT) closed the vocabulary gap. “Heart attack” and “myocardial infarction” finally lived near each other in vector space. But semantic similarity is fuzzy — it finds things that are similar, not things that are right. Search “agencies with Nike experience” and you get results about athletic wear and sneaker culture.

RAG glued semantic search to LLMs. Retrieve chunks, stuff them into context, generate an answer. And yes — people have pushed RAG well beyond the naive version. Multi-query retrieval, reranking, query decomposition, graph-based retrieval, SQL + vector hybrids, hybrid keyword/semantic search. These are real and they work. But even advanced RAG pipelines tend to be retrieval-centered: they get better at finding chunks, not at investigating a question.

Each of these layers still works. I use all of them. But none of them, alone, can do what I showed you above with the real estate query unless it’s running on a very specialized search engine.

Agentic Search

Put simply: agentic search is a system that plans, executes, and revises a multi-step investigation across tools and data sources, rather than relying on a single retrieval pass.

It isn’t a replacement for any of those layers. It’s an autonomous system that orchestrates all of them.

An agentic search system:

  1. Decomposes your goal into sub-questions you didn’t think to ask
  2. Expands your query using synonym maps and domain knowledge
  3. Plans which indexes, annotations, and data sources to consult
  4. Executes dozens of lookups across structured indexes, full-text search, and annotated metadata
  5. Reasons over intermediate results, identifies gaps, and goes back for more
  6. Cross-references results across dimensions that no single retrieval step would catch
  7. Synthesizes a final answer with provenance

The architecture is a reasoning loop with tools — but the power isn’t in the loop. It’s in the combination: an LLM that can read a taxonomy to expand its own query, grep a corpus for exact identifiers, run vector search for conceptual matches, cross-reference structured indexes, and decide what to do next based on what it found. The orchestration — knowing when to pivot, when to go deeper, when to stop — is what makes it research instead of retrieval.
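That loop is small enough to sketch. Here `plan_next` stands in for the LLM planner (any callable with this shape works, which is what makes the sketch runnable), and `tools` is a plain dict of named callables; both names are my own, chosen for illustration.

```python
def agentic_search(question, tools, plan_next, max_steps=10):
    """Plan -> execute -> observe, until the planner decides it has enough.

    `plan_next(question, notes)` returns either
      {"tool": <name>, "query": <str>}  to run a tool, or
      {"tool": "finish", "answer": ...} to stop.
    """
    notes = []  # the agent's scratchpad: every call and result, for provenance
    for _ in range(max_steps):
        action = plan_next(question, notes)              # decide what to do next
        if action["tool"] == "finish":                   # planner says: enough
            return action["answer"], notes
        result = tools[action["tool"]](action["query"])  # execute the chosen tool
        notes.append({"tool": action["tool"],
                      "query": action["query"],
                      "result": result})                 # observe, keep the trail
    return None, notes  # budget exhausted without a confident answer
```

The `max_steps` budget matters: it is the crude but necessary guard against the loop spiraling on a weak intermediate result.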

Every domain where NLP extraction has been producing structured data becomes dramatically more powerful with an agent on top.

Expanding the System

The toolbox

An agentic search system doesn’t rely on a single retrieval method. It has a toolbox packed with multiple strategies it can deploy depending on the query, the corpus, and what it learns along the way:

  • Exact match: grep and regex over raw text, for identifiers, codes, and known phrases
  • Keyword relevance: BM25 over inverted indexes
  • Semantic similarity: vector search over embeddings
  • Structured lookup: SQL, annotated metadata, taxonomy traversal
  • Computation: specialized models and simulators that generate answers rather than retrieve them

The last row is where things get really interesting. The toolbox isn’t limited to search — it can include tools that generate answers, not just retrieve them. AlphaFold can predict a protein’s 3D structure from its amino acid sequence. DFT solvers (like the ones behind Google’s GNoME) can simulate material properties at the atomic level. Molecular docking tools can score how well a candidate drug binds to a target. These aren’t search tools in the traditional sense, but an agentic system can call them as part of a research pipeline — “find all known kinase inhibitors in the literature, then run docking simulations against this target, then cross-reference results with clinical trial outcomes.” The agent decides when to search and when to compute.

In materials science the toolbox might include VASP or Quantum ESPRESSO for DFT calculations, LAMMPS for molecular dynamics, the Materials Project API for known material properties, and AFLOW or OQMD for crystal structure databases. An agent could search the literature for thermoelectric candidates, query their band structures, run a DFT simulation on a promising variant, and cross-reference with synthesis feasibility — all in one loop.

In bioinformatics, the tools look different but the pattern is the same: BLAST for sequence alignment, UniProt and PDB for protein databases, AlphaFold for structure prediction, STRING for protein-protein interaction networks, KEGG and Reactome for pathway data, and ClinVar or OMIM for disease associations. An agent researching a gene variant could align sequences, predict structural impact, check known pathway interactions, and pull clinical significance — without a human stitching those queries together.

The agent doesn’t pick one tool. It runs multiple strategies in parallel. While vector search is finding semantically similar documents, grep is hunting for exact identifiers, BM25 is scoring keyword relevance, a specialized model might be running a simulation, and an LLM in the loop is assessing next steps or synthesizing the answer.
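Running strategies in parallel is mechanically simple. A minimal sketch, assuming each strategy is a callable returning `(doc_id, score)` hits; the vote-counting fusion at the end is a deliberately naive stand-in for a real reranker:

```python
from concurrent.futures import ThreadPoolExecutor

def multi_strategy_search(query, strategies):
    """Run several retrieval strategies concurrently and merge their hits.

    `strategies` maps a name ("grep", "bm25", "vector", ...) to a callable
    taking the query and returning a list of (doc_id, score) pairs.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in strategies.items()}
        results = {name: f.result() for name, f in futures.items()}
    # Naive fusion: a document surfaced by more strategies ranks higher.
    votes = {}
    for hits in results.values():
        for doc_id, _score in hits:
            votes[doc_id] = votes.get(doc_id, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked, results  # keep per-strategy results for provenance
```

Returning the per-strategy results alongside the fused ranking is what lets the agent (or a human) see which tool found what.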

Taxonomies as directories

One of the simplest and most effective approaches I’ve found in agentic search is structuring taxonomies as file systems. I believe OpenClaw uses a similar approach. There’s no need for a database: just organize knowledge in directories and .md files. Databases (SQL, NoSQL, Neo4j) could also work, though the directory tree is itself an implicit graph, and I haven’t explored them for this purpose yet. What I have found is that LLMs are remarkably adept at navigating and extracting information directly from file-based structures.

taxonomies/
├── medical/
│   ├── conditions/
│   │   ├── cardiovascular/
│   │   │   ├── _index.md        # "cardiovascular" synonyms, related terms
│   │   │   ├── heart-attack.md  # myocardial infarction, MI, cardiac arrest...
│   │   │   ├── arrhythmia.md
│   │   │   └── heart-failure.md
│   │   ├── oncology/
│   │   │   ├── _index.md
│   │   │   ├── breast-cancer.md
│   │   │   └── lung-cancer.md
│   │   └── endocrine/
│   │       ├── diabetes.md      # T1D, T2D, insulin resistance, A1C...
│   │       └── thyroid.md
│   ├── medications/
│   │   ├── _index.md
│   │   ├── metformin.md         # Glucophage, biguanide, dosages, interactions
│   │   └── lisinopril.md
│   └── procedures/
│       ├── surgical/
│       │   └── cabg.md          # coronary artery bypass graft, open heart...
│       └── diagnostic/
│           └── echocardiogram.md
├── financial/
│   ├── instruments/
│   │   ├── equity.md
│   │   ├── fixed-income.md
│   │   └── derivatives.md
│   └── filings/
│       ├── 10-k.md
│       ├── 10-q.md
│       └── 8-k.md
└── legal/
    ├── contract-types/
    │   ├── nda.md
    │   ├── msa.md
    │   └── sow.md
    └── clauses/
        ├── indemnification.md
        └── limitation-of-liability.md

Each .md file is a knowledge node:

# Heart Attack

## Synonyms
- myocardial infarction
- MI
- cardiac arrest (related but distinct)
- heart event
- coronary event

## Related concepts
- [[cardiovascular/_index]]
- [[medications/aspirin]]
- [[procedures/surgical/cabg]]

## Regex patterns
- "heart attack"
- "MI" (when preceded by "acute" or "recent")
- /suffered a (cardiac|heart) event/i

## Notes
Distinguish from cardiac arrest (electrical) vs. heart attack (blockage).

Why this works:

  1. Agents can navigate it — The directory structure is a traversable graph. Ask about “heart conditions” and the agent can ls taxonomies/medical/conditions/cardiovascular/ to discover what's there.
  2. Grep works on it — Need all synonyms for “diabetes”? grep -r "diabetes" taxonomies/ catches it everywhere it appears.
  3. It’s version-controlled — Taxonomies evolve. Git tracks every change, every addition, every correction.
  4. It compounds — Every search that discovers a missing synonym can append it. The taxonomy grows with use (more on this below).

The _index.md files at each directory level provide roll-up definitions, parent-child relationships, and scope notes. The agent reads these to understand where it is in the hierarchy and what's available below.
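The two primitives an agent needs over such a tree are small: list a directory, and parse the `## Synonyms` section of a node file. A minimal sketch; the parsing rules (hyphen bullets, parenthetical qualifiers dropped) are assumptions matching the example node above, not a fixed format.

```python
from pathlib import Path

def ls_node(dir_path):
    """What an agent sees when it 'ls' a taxonomy directory."""
    return sorted(p.name for p in Path(dir_path).iterdir())

def synonyms(node_path):
    """Extract the bullet items under '## Synonyms' in a taxonomy node."""
    terms, in_section = [], False
    for line in Path(node_path).read_text().splitlines():
        if line.startswith("## "):
            # Entering a new section; only collect inside '## Synonyms'.
            in_section = line.strip() == "## Synonyms"
        elif in_section and line.startswith("- "):
            # Drop parenthetical qualifiers like "(related but distinct)".
            terms.append(line[2:].split(" (")[0].strip())
    return terms
```

From here, query expansion is just `synonyms()` on whichever node the agent navigated to, and discovery is `ls_node()` plus reading the local `_index.md`.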

Auto-annotation: improving the corpus as you search

As the system processes results, it can be prompted to extract patterns:

  • Phrases that appear frequently in high-relevance hits
  • Paragraph structures that signal important content (e.g., “In conclusion,” “Key findings,” “Risk factors include”)
  • Relationships between concepts that the original extraction pipeline missed

The agent proposes these as new annotations — a new synonym, a missing taxonomy link, a regex pattern that would have caught something the system missed. But it doesn’t write them blindly. This is an active learning loop: the agent surfaces candidates, and a domain expert approves or discards them. An oncologist glances at a proposed synonym and says “yes, that’s a valid alias” or “no, that’s a different concept.” The taxonomy grows with human-validated precision.

For higher-throughput domains where expert review is a bottleneck, you can use LLMs as judges — a second model evaluates whether the proposed annotation is consistent with the existing taxonomy, checks for contradictions, and auto-approves high-confidence suggestions while routing ambiguous ones to a human.
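The routing logic itself is simple enough to sketch. Here `judge` is whatever scorer you plug in (an LLM-as-judge in practice); the 0-to-1 confidence interface and the threshold values are my assumptions for illustration:

```python
def route_annotation(proposal, judge, threshold=0.9):
    """Triage one proposed annotation: auto-approve, human review, or reject.

    `judge(proposal)` returns a 0-1 confidence that the proposal is
    consistent with the existing taxonomy.
    """
    confidence = judge(proposal)
    if confidence >= threshold:
        return "auto-approve"
    if confidence >= 0.5:
        return "human-review"   # ambiguous: route to a domain expert
    return "reject"

def triage(proposals, judge, threshold=0.9):
    """Bucket a batch of proposals by routing decision."""
    buckets = {"auto-approve": [], "human-review": [], "reject": []}
    for p in proposals:
        buckets[route_annotation(p, judge, threshold)].append(p)
    return buckets
```

The threshold is a dial: lower it and more annotations land automatically; raise it and more flow through the expert queue.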

The corpus gets smarter every time you search it — but only with annotations that have been vetted.

Learning from search paths

Every search is a trace through the system: which tools were called, in what order, with what parameters, and what results came back. This trace is valuable.

The agent logs the full decision tree — not just the final answer, but every step:

  • Query decomposition steps
  • Which tools were selected and why
  • Intermediate results that led to pivots
  • Dead ends that were abandoned
  • The final synthesis

In practice, I build these playbooks/templates through simulation — it’s not glamorous work, it’s brute forcing. I take a set of known inputs and expected outputs, run the agentic loop with different prompts and tool configurations, and observe which paths actually find the information. Then I iteratively ask the system to propose its own index structure — “given what you’ve seen, how would you organize this corpus so you could find it faster next time?” — and refine the prompt that guides the exploration loop.

This produces two artifacts: a search playbook (the strategy that worked) and an index proposal (the structure that would make future searches faster). Both feed back into the system. When a similar query arrives, the agent doesn’t start from scratch — it starts from a playbook tuned to the corpus and the domain. And the index proposals, once reviewed, reshape the taxonomy itself.
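A trace logger that distills a playbook can be as small as this sketch. The `TraceStep` fields and the "drop steps that returned nothing" heuristic are my assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    tool: str
    query: str
    n_results: int
    led_to_pivot: bool = False   # did this result change the plan?

@dataclass
class SearchTrace:
    question: str
    steps: list = field(default_factory=list)

    def log(self, **kwargs):
        self.steps.append(TraceStep(**kwargs))

    def playbook(self):
        """Distill the trace into a reusable strategy: the tool sequence
        that actually produced results, with dead ends dropped."""
        return [s.tool for s in self.steps if s.n_results > 0]

    def to_json(self):
        """Serialize the full decision tree for later analysis."""
        return json.dumps(asdict(self))
```

Persisting the JSON traces is what makes the simulation work possible: you can replay them, compare prompt variants, and ask the model to propose index structures from what it has seen.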

The system isn’t just answering questions. It’s learning how to answer them better.

What this doesn’t solve

Agentic search has real costs. Each reasoning step adds token cost and latency — an eight-step research pipeline like the one above is slower and more expensive than a single vector lookup. Agent loops can spiral when the model chooses the wrong tool or follows a weak intermediate result for too long. Evaluation is still hard: how do you benchmark a system whose value often comes from finding what a single retrieval step would never surface?

None of that changes the core argument. Single-shot retrieval can’t do multi-step research. A well-tooled agentic system can.

Where this is going

The layers I described — inverted indexes, NLP extraction, semantic search, RAG, specialized AI tools — aren’t going away. They’re becoming the infrastructure that agentic systems orchestrate. The value is shifting from the retrieval mechanism to the orchestration logic: how to decompose a question, which tools to use when, how to cross-reference results, and how to know when you have enough.

If you’re building search today, you’re probably already using two or three of these layers. The step to agentic search isn’t a rewrite — it’s adding an orchestrator that knows how to use what you’ve already built. The taxonomy files, the annotation pipelines, the vector indexes — they’re all more valuable with an agent on top than they ever were alone.

The corpus gets smarter. The playbooks get sharper. The system incrementally compounds.