# DuoRAG
Imagine a corpus of scientist biographies. Traditional RAG works fine until you ask questions like:

- "Who was born before 1800?"
- "How many are mathematicians?"
- "List names and birthdays for mathematicians"

These questions return incomplete answers, because top-k retrieval surfaces only a fixed number of chunks, and nothing signals that the answer is incomplete.
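The failure mode can be seen without any retrieval stack at all. Below is a minimal sketch (the `top_k_retrieve` function is a hypothetical stand-in for a vector store, not part of DuoRAG):

```python
# Illustrative only: why fixed top-k retrieval breaks aggregate questions.
# With 12 matching biographies but k=5, a count computed over the retrieved
# chunks is silently wrong.
corpus = [{"name": f"Scientist {i}", "occupation": "mathematician"} for i in range(12)]

def top_k_retrieve(docs, k=5):
    """Stand-in for a vector store: returns at most k hits,
    no matter how many documents actually match."""
    return docs[:k]

hits = top_k_retrieve(corpus, k=5)
answer = len(hits)   # what a naive RAG pipeline would report
truth = len(corpus)  # the correct count

print(answer, truth)  # 5 vs 12, and nothing signals the shortfall
```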
For a fixed corpus, this can be mitigated by extracting metadata for a predetermined set of fields. That approach has two problems:

- You have to predict, upfront, every question that might be asked against the corpus.
- The prediction must be constantly revised as the documents change, e.g. Nobel Prizes are added later, or the document set is extended to include artists.
DuoRAG solves both problems with:

- Initial metadata (schema) discovery before the first ingestion
- Schema self-updates with candidate fields whenever a question cannot be answered

A periodic "backfill" run then extracts and populates the candidate fields, or prunes them if the information is not contained in the corpus. If backfill runs nightly, a question that fails today gets answered correctly tomorrow.
## How It Works
DuoRAG maintains two parallel stores for every ingested document:
| Store | Backend | Used for |
|---|---|---|
| Vector store | ChromaDB | Semantic similarity search |
| Relational store | SQLite | Aggregation, filtering, counting |
At query time, an LLM uses tool-calling to decide which backend to hit — or both — based on the question. The LLM calls `semantic_search` (ChromaDB) or `run_sql` (SQLite) with the full table schema embedded in the tool description, so it can write correct SQL automatically.
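The routing itself requires an LLM, but the relational half of the idea can be sketched with the standard library alone: once extracted metadata lives in a SQL table, aggregate questions become exact queries instead of top-k retrieval. The table and column names below are illustrative, not DuoRAG's actual internal schema:

```python
# Minimal sketch of the relational store, using stdlib sqlite3 only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT, occupation TEXT, year_of_birth INTEGER)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("Ada Lovelace", "mathematician", 1815),
        ("Marie Curie", "physicist", 1867),
        ("Carl Gauss", "mathematician", 1777),
    ],
)

# "How many are mathematicians?" -- exact and complete, no top-k cutoff.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM documents WHERE occupation = 'mathematician'"
).fetchone()
print(count)  # 2

# "Who was born before 1800?"
rows = conn.execute("SELECT name FROM documents WHERE year_of_birth < 1800").fetchall()
print([r[0] for r in rows])  # ['Carl Gauss']
```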
### Schema

DuoRAG allows you to define an initial schema if you prefer to cover the most predictable fields and let the rest evolve based on user queries. If you skip the schema, DuoRAG auto-discovers one by sampling a configurable subset of documents.
## Key Features
- Dual-store routing — LLM picks the right backend per question
- Auto schema discovery — infer fields from document samples when no schema is provided
- Schema evolution — detect missing fields mid-session (`evolve=True`) and add them live
- Controlled fallback — blocks incomplete semantic search answers for aggregate questions; SQL failures prompt schema evolution instead of silent top-k fallback
- Incremental ingestion — hash-based deduplication skips unchanged documents
- Backfill — populate newly added fields from already-stored chunks without re-ingesting
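The incremental-ingestion behavior can be sketched as a content-hash ledger; DuoRAG's internal bookkeeping may differ, but the classification into new/changed/unchanged follows the same logic:

```python
# Sketch of hash-based incremental ingestion, assuming a simple
# path -> content-hash ledger.
import hashlib

def classify(path, content, ledger):
    """Return 'new', 'changed', or 'unchanged' and update the ledger."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    if path not in ledger:
        status = "new"
    elif ledger[path] != digest:
        status = "changed"
    else:
        status = "unchanged"
    ledger[path] = digest
    return status

ledger = {}
print(classify("curie.txt", "Marie Curie was a physicist.", ledger))   # new
print(classify("curie.txt", "Marie Curie was a physicist.", ledger))   # unchanged
print(classify("curie.txt", "Marie Curie discovered radium.", ledger)) # changed
```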
## Installation

```shell
pip install duo-rag
# or
uv add duo-rag
```

Set your OpenAI API key:

```shell
export OPENAI_API_KEY=sk-...  # or add it to a .env file in your project root
```
## Quick Start

```python
from duo_rag import DuoRAG, MetadataField

# 1. Define a schema (or omit for auto-discovery)
schema = [
    MetadataField(name="birthplace", type="text", description="Person's place of birth"),
    MetadataField(name="occupation", type="text", description="Primary occupation or field"),
    MetadataField(name="year_of_birth", type="integer", description="Year the person was born"),
]

# 2. Initialize
rag = DuoRAG(
    llm_model="gpt-5-mini",
    schema=schema,
    data_dir="./my_data",
)

# 3. Ingest documents (incremental — safe to call repeatedly)
stats = rag.ingest("./documents/")
print(stats)  # {"new": 10, "changed": 0, "unchanged": 0}

# 4. Query — routing is automatic
print(rag.query("What did Marie Curie discover?"))         # → semantic search
print(rag.query("How many people were born after 1800?"))  # → SQL
print(rag.query("What is the most common occupation?"))    # → SQL aggregation
```
### Auto schema discovery

```python
# No schema provided — DuoRAG infers fields from a document sample on first ingest
rag = DuoRAG(llm_model="gpt-5-mini", data_dir="./my_data")
rag.ingest("./documents/")

print([f.name for f in rag.schema.fields])
# e.g. ["name", "birthplace", "occupation", ...]
```
### Schema evolution

```python
# By default, if SQL can't answer a question (missing column), DuoRAG won't
# fall back to semantic search — avoiding misleading partial answers.
answer = rag.query("How many people died before 1900?", evolve=True)
# → explains the data isn't available yet as structured metadata
# → [Schema Gap Detected] 'year_of_death' has been added. Run backfill() to populate it.

# Populate the new field from all stored chunks
result = rag.backfill()
print(result)  # {"populated": ["year_of_death"], "pruned": []}

# Now the same query works precisely via SQL
answer = rag.query("How many people died before 1900?")

# If you prefer partial answers over no answer, enable fallback:
answer = rag.query("How many people died before 1900?", fallback=True)
# → returns top-k semantic results with a warning about incompleteness
```
### Add a field manually

```python
rag.add_field(MetadataField(
    name="nationality",
    type="text",
    description="Person's nationality",
))
rag.backfill()
```
## Running the Example

The repository includes an example script that ingests a set of biographical `.txt` files and opens an interactive query loop.

```shell
# Ingest documents and enter interactive mode
python examples/example_usage.py
# or: uv run python examples/example_usage.py

# Run pre-defined demo queries first, then enter interactive mode
python examples/example_usage.py --test

# Also print the generated SQL alongside each answer
python examples/example_usage.py --test --verbose
```
Interactive mode commands:

| Input | Action |
|---|---|
| Any question | Query with `evolve=True` — schema gaps detected automatically |
| `/backfill` | Populate newly added fields from stored chunks |
| `/ingest` | Re-ingest documents from `examples/documents/` |
| `quit` / `exit` / `q` | Exit |
## Evaluation

The repository includes an evaluation suite that tests DuoRAG's core capabilities — semantic search, SQL generation, schema evolution, backfill, and conversational follow-ups — using a combination of LLM-judge scoring and deterministic checks.

### How it works
Tests are defined in `examples/eval_tests.yaml` as a sequence of stages, each containing one or more test cases. Every test specifies a question, expected behavior (e.g. SQL usage, expected keywords), and judge criteria. The eval runner:

- Ingests the sample documents from `examples/documents/`
- Executes each test stage in order (basic queries → schema evolution → backfill → conversation)
- Scores each answer on two axes:
  - LLM judge (0.0–1.0) — evaluates correctness, completeness, and relevance
  - Deterministic checks — validates SQL usage, expected/excluded keywords, schema gap detection, etc.
- A test passes if the judge score is ≥ 0.7 and all deterministic checks pass
### Running the eval

```shell
# Install eval dependencies
pip install duo-rag[eval]
# or: uv add duo-rag --group eval

# Basic run (reuses existing eval_data if present)
python examples/run_eval.py

# Clean run — delete eval_data and start fresh
python examples/run_eval.py --reset

# Verbose — print answers and judge reasoning
python examples/run_eval.py --verbose

# Save a detailed JSON report
python examples/run_eval.py --save-report eval_report.json
```
### Test stages
| Stage | What it tests | Example question |
|---|---|---|
| Basic queries | Semantic search and SQL routing | "Who was born after 1800?" |
| Schema evolution | Gap detection and field auto-addition | "Who has died after 1900?" |
| Backfill | Populating newly added fields | Re-asks post-backfill query |
| Conversation | Multi-turn context preservation | "Total mathematicians?" → "Who are they?" |
### Adding or modifying tests

Edit `examples/eval_tests.yaml`. Each test case supports:

```yaml
- id: my_test
  question: "How many scientists were born in England?"
  type: quantitative        # factual | quantitative | schema_evolution | conversational
  evolve: false             # trigger schema evolution?
  judge_criteria: "Should return the correct count"
  expect_sql: true          # assert SQL was used
  expected_keywords: ["England"]
  expected_names: ["Newton", "Darwin", "Faraday"]
  excluded_names: []
  expect_gap_detected: false
  save_history: false       # save conversation history for follow-up tests
  continues_from: ""        # test id to continue conversation from
```
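The deterministic side of scoring can be sketched against a test case like the one above. A plain dict stands in for the parsed YAML, and `run_checks` is an illustrative helper, not the runner's actual API:

```python
# Sketch of deterministic checks for a single test case.
case = {
    "question": "How many scientists were born in England?",
    "expect_sql": True,
    "expected_keywords": ["England"],
    "expected_names": ["Newton", "Darwin", "Faraday"],
    "excluded_names": ["Curie"],
}

def run_checks(case, answer, used_sql):
    """Return a list of failure descriptions; empty means all checks pass."""
    failures = []
    if case.get("expect_sql") and not used_sql:
        failures.append("expected SQL to be used")
    for kw in case.get("expected_keywords", []):
        if kw.lower() not in answer.lower():
            failures.append(f"missing keyword: {kw}")
    for name in case.get("expected_names", []):
        if name not in answer:
            failures.append(f"missing name: {name}")
    for name in case.get("excluded_names", []):
        if name in answer:
            failures.append(f"forbidden name present: {name}")
    return failures

answer = "Three scientists were born in England: Newton, Darwin, and Faraday."
print(run_checks(case, answer, used_sql=True))  # [] — all checks pass
```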
## API Reference

### DuoRAG

```python
DuoRAG(
    llm_model: str = "gpt-5-mini",
    extraction_model: str = "gpt-5-mini",
    schema: list[MetadataField] | None = None,
    data_dir: str = "./duo_rag_data",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    vector_store: VectorStore | None = None,
    relational_store: RelationalStore | None = None,
    prompts: PromptConfig | None = None,
)
```
| Parameter | Default | Description |
|---|---|---|
| `llm_model` | See `__init__.py` | OpenAI model for query routing and answering |
| `extraction_model` | See `__init__.py` | OpenAI model used for metadata extraction during ingestion |
| `schema` | `None` | List of `MetadataField`; auto-discovered on first ingest if omitted |
| `data_dir` | `"./duo_rag_data"` | Directory for ChromaDB and SQLite persistence |
| `chunk_size` | `1000` | Max characters per text chunk |
| `chunk_overlap` | `200` | Character overlap between consecutive chunks |
| `vector_store` | `None` | Custom `VectorStore` (default created in `data_dir` if omitted) |
| `relational_store` | `None` | Custom `RelationalStore` (default created in `data_dir` if omitted) |
| `prompts` | `None` | Custom `PromptConfig` for overriding system prompts |
### ingest(path, on_progress=None) → dict

Ingest a file path or list of paths. Skips unchanged documents (hash-based). Auto-discovers schema if none is set.

- `path` — `str` or `list[str]`
- `on_progress` — optional `(current: int, total: int) -> None` callback

Returns `{"new": int, "changed": int, "unchanged": int}`.
### query(question, evolve=False, history=None, fallback=False) → str

Ask a question. The LLM routes to semantic search, SQL, or both.

- `evolve=True` — after answering, check for schema gaps and add detected fields automatically
- `history` — list of prior `{"role": ..., "content": ...}` messages for multi-turn conversation
- `fallback=False` — when SQL fails (missing column or no rows), block fallback to semantic search to avoid incomplete answers. Set to `True` to allow the fallback with an incompleteness warning appended

Use `rag.last_history` to get the updated history after each call for follow-up questions. Use `rag.last_sql` to inspect the SQL that was generated (if any).
### backfill(on_progress=None) → dict

Extract values for all unpopulated fields from already-stored chunks. Prunes fields that remain entirely NULL after backfill.

- `on_progress` — optional `(current: int, total: int) -> None` callback

Returns `{"populated": [field names], "pruned": [field names]}`.
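The pruning rule — drop a candidate field whose values are still entirely NULL after backfill — can be sketched with stdlib `sqlite3`. Table and column names here are illustrative, and the helper only plans the prune rather than altering the table:

```python
# Sketch of the backfill pruning rule, assuming a flat documents table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT, year_of_death INTEGER, shoe_size INTEGER)")
conn.executemany(
    "INSERT INTO documents (name, year_of_death) VALUES (?, ?)",
    [("Marie Curie", 1934), ("Isaac Newton", 1727)],
)

def prune_plan(conn, table, candidates):
    """Classify candidate columns: populated if any non-NULL value, pruned otherwise."""
    populated, pruned = [], []
    for col in candidates:
        # COUNT(col) counts only non-NULL values in that column
        (n,) = conn.execute(f"SELECT COUNT({col}) FROM {table}").fetchone()
        (populated if n else pruned).append(col)
    return {"populated": populated, "pruned": pruned}

print(prune_plan(conn, "documents", ["year_of_death", "shoe_size"]))
# {'populated': ['year_of_death'], 'pruned': ['shoe_size']}
```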
### add_field(field: MetadataField) → None

Add a new field to the live schema and the SQLite database. Call `backfill()` afterward to populate it from existing documents.