# RagTune

Debug, benchmark, and monitor your RAG retrieval layer. `EXPLAIN ANALYZE` for production RAG.
Quickstart • Commands • Why RagTune • Concepts • FAQ
| I want to... | Command |
|---|---|
| Debug a single query | `ragtune explain "my query" --collection prod` |
| Run batch evaluation | `ragtune simulate --collection prod --queries queries.json` |
| Set up CI/CD quality gates | `ragtune simulate --ci --min-recall 0.85` |
| Detect regressions | `ragtune simulate --baseline runs/latest.json --fail-on-regression` |
| Compare embedders | `ragtune compare --embedders ollama,openai --docs ./docs` |
| Quick health check | `ragtune audit --collection prod --queries queries.json` |
## Quickstart
```sh
# 1. Start vector store
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 2. Ingest documents
ragtune ingest ./docs --collection my-docs --embedder ollama

# 3. Debug retrieval
ragtune explain "How do I reset my password?" --collection my-docs
```
No API keys needed with Ollama (runs locally).
**Already using PostgreSQL with pgvector?** Skip Docker entirely and use your existing database:
```sh
ragtune ingest ./docs --collection my-docs --embedder ollama \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb

ragtune explain "How do I reset my password?" --collection my-docs \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb
```

## Build Your Test Suite
```sh
# Save queries as you debug
ragtune explain "How do I reset my password?" --collection my-docs --save
ragtune explain "What are the rate limits?" --collection my-docs --save

# Run evaluation once you have 20+ queries
ragtune simulate --collection my-docs --queries golden-queries.json
```
Each `--save` adds the query to `golden-queries.json`.
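The file's exact schema is defined by the CLI; as a rough sketch only (the field names here are assumptions, not the documented format), each saved entry pairs a query with the source files it is expected to retrieve:

```sh
# Hypothetical shape of golden-queries.json -- field names are assumptions;
# check the CLI Reference for the documented schema.
cat golden-queries.json
# [
#   {
#     "query": "How do I reset my password?",
#     "expected_sources": ["docs/auth/password-reset.md"]
#   }
# ]
```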
## What You'll See
### `explain` — Debug a Query
Query: "How do I reset my password?"
[1] Score: 0.8934 | Source: docs/auth/password-reset.md
Text: To reset your password: 1. Click "Forgot Password"...
[2] Score: 0.8521 | Source: docs/auth/account-security.md
Text: Account Security ## Password Management...
DIAGNOSTICS
Score range: 0.7234 - 0.8934 (spread: 0.1700)
✓ Strong top match (>0.85): likely high-quality retrieval
### `simulate` — Batch Metrics
```
Running 50 queries...

Recall@5: 0.82   MRR: 0.76   Coverage: 0.94
Latency: p50=45ms p95=120ms

FAILURES: 3 queries with Recall@5 = 0
  ✗ "How do I configure SSO?"
    Expected: [sso-guide.md], Retrieved: [api-keys.md...]

💡 Run `ragtune explain "<query>"` to debug
```
## Commands
| Command | Purpose |
|---|---|
| `ingest` | Load documents into vector store |
| `explain` | Debug retrieval for a single query |
| `simulate` | Batch benchmark with metrics + CI mode |
| `compare` | Compare embedders or chunk sizes |
| `audit` | Quick health check (pass/fail) |
| `report` | Generate markdown reports |
| `import-queries` | Import queries from CSV/JSON |
See CLI Reference for all flags and options.
## CI/CD Quality Gates
```yaml
# .github/workflows/rag-quality.yml
- name: RAG Quality Gate
  run: |
    ragtune ingest ./docs --collection ci-test --embedder ollama
    ragtune simulate --collection ci-test --queries tests/golden-queries.json \
      --ci --min-recall 0.85 --min-coverage 0.90 --max-latency-p95 500
```
The command exits with code 1 if any threshold fails. See `examples/github-actions.yml` for a complete setup.
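You can dry-run the same gate locally before wiring it into CI. Since `--ci` makes the command exit non-zero when a threshold fails, plain shell short-circuiting is enough to surface it:

```sh
# Run the gate locally; a non-zero exit status means a threshold failed
ragtune simulate --collection ci-test --queries tests/golden-queries.json \
  --ci --min-recall 0.85 --min-coverage 0.90 --max-latency-p95 500 \
  || echo "quality gate failed"
```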
## Regression Testing
Compare against a baseline to catch regressions before they reach production:
```sh
# Compare current run against baseline
ragtune simulate --collection prod --queries golden.json \
  --baseline runs/baseline.json --fail-on-regression
```

Output shows deltas for each metric:
```
BASELINE COMPARISON
Comparing against: 2026-01-15T12:00:00Z
─────────────────────────────────────────────────────────────
Recall@5:      0.900 → 0.850   ↓ 5.6%    (REGRESSED)
MRR:           0.800 → 0.820   ↑ 2.5%    (improved)
Coverage:      0.950 → 0.950   = 0.0%    (unchanged)
Latency p95:   100ms → 120ms   ↑ 20.0%   (REGRESSED)
─────────────────────────────────────────────────────────────
❌ REGRESSION DETECTED
The following metrics regressed: [Recall@5, Latency p95]
```
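A simple promotion workflow falls out of this (a sketch; it assumes `simulate` writes its latest run to `runs/latest.json`, as the quick-reference table at the top suggests): once a run looks good, copy it into place as the new baseline.

```sh
# Promote a reviewed run to be the baseline for future comparisons.
# Assumes simulate writes run files under runs/ (see the quick-reference table).
cp runs/latest.json runs/baseline.json
ragtune simulate --collection prod --queries golden.json \
  --baseline runs/baseline.json --fail-on-regression
```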
## Why RagTune?
RAG retrieval is a configuration problem: chunk size, embedding model, index type, top-k. Most teams tune by intuition. RagTune provides the measurement layer to make these decisions empirically, using standard IR metrics (Recall@k, MRR, NDCG) on your actual data.
| What Matters | Impact |
|---|---|
| Domain-appropriate chunking | 7%+ recall difference |
| Embedding model choice | ~5% recall difference |
| Continuous monitoring | Catches data drift before users do |
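For reference, the two headline metrics are the standard IR definitions. Recall@k asks what fraction of the expected documents land in the top k results; MRR rewards ranking the first relevant document high:

```math
\mathrm{Recall@}k = \frac{|\,\text{relevant} \cap \text{top-}k\ \text{retrieved}\,|}{|\,\text{relevant}\,|}
\qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

where rank_i is the position of the first relevant document for query i. So the Recall@5 of 0.82 in the `simulate` example above means that, averaged over queries, 82% of the expected documents appear in the top five results.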
### RagTune vs. Other Tools
RagTune focuses on retrieval debugging, monitoring, and benchmarking, not end-to-end answer evaluation.
| | RagTune | Ragas / DeepEval | misbahsy/RAGTune |
|---|---|---|---|
| Focus | Retrieval layer | Full pipeline | Full pipeline |
| LLM calls | None required | Required | Required |
| Interface | CLI (CI/CD-native) | Python library | Streamlit UI |
| Speed | Fast (embedding only) | Slow (LLM inference) | Slow |
| CI/CD | First-class | Manual setup | None |
**Use RagTune when:** debugging retrieval, gating CI/CD on quality, comparing embedders, or running deterministic benchmarks.

**Use other tools when:** evaluating LLM answer quality, or when you need answer-level metrics such as `answer_relevancy`.
## Signs You Need This
Retrieval failures are silent. No error, no exception. Just gradually worse answers.
- Users complaining about "wrong answers" but you can't reproduce it
- No idea if that embedding change made things better or worse
- Retrieval was "good" in dev, failing in production
- You added documents but answers got worse
- Can't tell if the LLM is hallucinating or retrieval is broken
If any of these sound familiar:
ragtune explain "the query that's failing" --collection prodInstallation
```sh
# Homebrew (macOS/Linux)
brew install metawake/tap/ragtune

# Go install
go install github.com/metawake/ragtune/cmd/ragtune@latest

# Or download a binary from GitHub Releases
```
**Prerequisites:** Docker (for Qdrant) and either Ollama or an API key for embeddings.
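If you don't have Ollama yet, the standard install is a one-liner. Which embedding model ragtune pulls by default isn't specified here, so the model name below is an assumption; check the CLI Reference:

```sh
# Install Ollama (Linux; macOS users can grab the app from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull an embedding model -- the name here is an assumption; substitute
# whatever the CLI Reference lists as ragtune's default
ollama pull nomic-embed-text
```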
## Embedders
| Embedder | Setup | Best For |
|---|---|---|
| `ollama` | Local, no API key | Development, privacy |
| `openai` | `OPENAI_API_KEY` | General purpose |
| `voyage` | `VOYAGE_API_KEY` | Legal, code (domain-tuned) |
| `cohere` | `COHERE_API_KEY` | Multilingual |
| `tei` | Docker container | High throughput |
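Because embedder choice is worth several points of recall on some corpora (see the table under Why RagTune?), it pays to measure on your own documents rather than guess. The `compare` command from the quick-reference table does exactly that:

```sh
# Benchmark two embedders against the same corpus
ragtune compare --embedders ollama,openai --docs ./docs
```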
## Vector Stores
| Store | Setup |
|---|---|
| Qdrant (default) | `docker run -p 6333:6333 qdrant/qdrant` |
| pgvector | `--store pgvector --pgvector-url postgres://...` |
| Weaviate | `--store weaviate --weaviate-host localhost:8080` |
| Chroma | `--store chroma --chroma-url http://localhost:8000` |
| Pinecone | `--store pinecone --pinecone-host HOST` |
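To try a non-default store locally, each has an official Docker image. A minimal sketch for Chroma (the image name is an assumption; check Chroma's docs for the current tag), reusing the flags from the table above:

```sh
# Start a local Chroma server (image name is an assumption)
docker run -d -p 8000:8000 chromadb/chroma

# Point ragtune at it with the flags from the table above
ragtune ingest ./docs --collection my-docs --embedder ollama \
  --store chroma --chroma-url http://localhost:8000
```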
## Included Benchmarks
| Dataset | Documents | Purpose |
|---|---|---|
| `data/` | 9 | Quick testing |
| `benchmarks/hotpotqa-1k/` | 398 | General knowledge |
| `benchmarks/casehold-500/` | 500 | Legal domain |
| `benchmarks/synthetic-50k/` | 50,000 | Scale testing |
```sh
# Try it
ragtune ingest ./benchmarks/hotpotqa-1k/corpus --collection demo --embedder ollama
ragtune simulate --collection demo --queries ./benchmarks/hotpotqa-1k/queries.json
```

## Documentation
| Guide | Description |
|---|---|
| Concepts | RAG basics, metrics explained |
| CLI Reference | All commands and flags |
| Quickstart | Step-by-step setup guide |
| Benchmarking Guide | Scale testing, runtimes |
| Deployment Patterns | CI/CD, production |
| FAQ | Common questions |
| Troubleshooting | Common issues and fixes |
## Contributing
Contributions welcome. Please open an issue first to discuss significant changes.
## License
MIT
