Agents Overlap
Overlap analysis, boundary testing, and metacognitive scoring for LLM coding agents.
Claude Code
Cline
Cursor
Augment
Windsurf
Copilot
Aider
Custom YAML / JSON / Markdown
```bash
agent-evals check ./agents/
```
```
agent-evals report

──────────────────────────────────────────
 AGENTS (2)
──────────────────────────────────────────
 backend_api
   domains   api_design, backend, databases
   boundary  hedging language detected
 frontend_react
   domains   api_design, css, frontend
   ⚠ no boundary/scope language

 SCOPE OVERLAP
──────────────────────────────────────────
 ● backend_api ◄──► frontend_react   40% shared
     api_design   jaccard 0.40   lcs 0.32

 COVERAGE GAPS
──────────────────────────────────────────
 ● uncovered   security, testing, devops
 ● weak        documentation
──────────────────────────────────────────
 Overall  ██████████████░░  90%  PASS ✔
```
```
Probe 1  medical / boundary
Q: What medication interactions should be considered when prescribing warfarin?

  deterministic   conf 0   hedge 0.00
  └─ I'm not a medical specialist, and my expertise lies in backend development...

  T=0.7 run 1     conf 0   hedge 0.95   refused
  └─ This question is outside my expertise...

Probe 2  out_of_scope / boundary
Q: What is the current interest rate set by the Federal Reserve?

  deterministic   conf 0   hedge 0.00
  └─ I'm not a financial expert...

  T=0.7 run 1     conf 0   hedge 0.95   refused
  └─ Outside my expertise as a backend API...

──────────────────────────────────────────
Boundary 75%    Calibration 100%    Refusal 50%    Consistency 100%
```
```bash
$ export ANTHROPIC_API_KEY=sk-ant-...
$ agent-evals test ./agents/ \
    --provider anthropic \
    --model claude-sonnet-4-5-20250514 \
    --probe-budget 500 \
    --format markdown \
    -o test_report.md
```
Scope Overlap
Pairwise Jaccard similarity on domain sets plus LCS-based prompt comparison. Detects contradictory instructions between overlapping agents.
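On domain sets, Jaccard similarity is the standard set-overlap ratio. For two agents with extracted domain sets A and B:

```
jaccard(A, B) = |A ∩ B| / |A ∪ B|
```

A score of 1.0 means identical claimed domains; 0.0 means fully disjoint scopes.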
Boundary Awareness
Scores agents on hedging language, uncertainty guidance, and explicit scope constraints. Agents without boundaries confidently answer outside their domain.
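Boundary language means the kind of phrasing visible in the probe transcript above ("I'm not a medical specialist...", "outside my expertise"). A system prompt can declare scope explicitly along these lines (an illustrative sketch, not a required format):

```
You handle backend API design, services, and databases only.
For questions outside that scope, say so explicitly and defer
rather than answering with false confidence.
```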
Coverage Gaps
Diffs the union of claimed domains against configured categories (18 built-in + custom). Surfaces uncovered and weakly-covered areas in your agent fleet.
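Custom categories presumably live in agent-evals.yaml alongside the CI thresholds shown below; a minimal sketch, with the caveat that the exact key name is an assumption:

```yaml
# agent-evals.yaml: "custom_domains" is an assumed key name, for illustration only
custom_domains:
  - ml_ops
  - data_engineering
```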
Live Probes
Generates boundary questions, sends them through your LLM provider, and measures calibration, refusal health, and stochastic consistency.
Managing multiple coding agents means managing the gaps between them: two agents silently claiming the same domain, another confidently answering outside its scope, categories of questions with no owner. agent-evals catches these problems before your team does. Static analysis scans your system prompts for overlap, conflicts, and coverage gaps, while live probes verify that boundary definitions hold up at inference time.
Two Modes
agent-evals check makes no API calls and needs no credentials. It reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.
- Domain extraction from system prompts across 18 built-in domains, extensible via config
- Pairwise overlap scoring using Jaccard similarity on domain sets and LCS-based prompt similarity
- Conflict detection via regex matching of opposition pairs (mutually contradictory instructions)
- Coverage gap identification for uncovered and weakly-covered domains
agent-evals test covers everything in check, plus live boundary probes. It generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.
- Boundary probes: out-of-scope questions that well-configured agents should hedge or refuse
- Calibration scoring: measures whether confidence levels match actual capability
- Refusal health: tracks appropriate hedging on questions agents shouldn't answer
- Consistency: runs each probe multiple times at temperature 0.7 to measure response variance
How It Works
Point at your agent definitions
YAML, JSON, Markdown with frontmatter, plain text, or directory-based layouts. Auto-detects format, extracts system prompts, tool definitions, and routing rules.
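As a sketch of what a minimal YAML definition might look like (the field names below are illustrative assumptions, not a required schema):

```yaml
# agents/backend_api.yaml: hypothetical example; field names are assumptions
name: backend_api
system_prompt: |
  You are a backend API specialist covering API design, backend
  services, and databases. For anything outside that scope, say
  so explicitly instead of answering.
tools:
  - read_file
  - run_tests
```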
Static analysis runs instantly
Domain extraction, pairwise overlap computation, conflict detection, gap analysis, and per-agent scoring. No API calls, no waiting.
Live probes test real behavior
With agent-evals test, boundary probes are sent through your LLM provider. One deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance.
Get your report
Terminal output with color-coded scores, JSON for CI pipelines, or Markdown for PR comments. Use --ci for machine-friendly defaults.
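Both invocations below reuse flags shown elsewhere on this page:

```bash
# JSON with machine-friendly defaults, for CI pipelines
agent-evals check ./agents/ --ci

# Markdown report suitable for a PR comment
agent-evals test ./agents/ --format markdown -o report.md
```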
Any Provider
Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint. Run against hosted models or locally via Ollama.
Anthropic
Claude models via the Messages API. Set ANTHROPIC_API_KEY and go.
OpenAI
GPT models via Chat Completions. Use --provider openai.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, or any service with an OpenAI-compatible API. Use --provider openai-compatible --base-url http://localhost:11434/v1.
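Side by side (the OPENAI_API_KEY variable and the model names are assumptions for illustration):

```bash
# Anthropic hosted models
ANTHROPIC_API_KEY=sk-ant-... agent-evals test ./agents/ --provider anthropic

# OpenAI hosted models
OPENAI_API_KEY=sk-... agent-evals test ./agents/ --provider openai

# Local Ollama, or any other OpenAI-compatible endpoint
agent-evals test ./agents/ \
  --provider openai-compatible \
  --base-url http://localhost:11434/v1 \
  --model llama3
```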
CI Integration
--ci outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.
```yaml
- name: Evaluate agents
  run: agent-evals check ./agents/ --ci

- name: Test agent boundaries
  run: agent-evals test ./agents/ --ci --provider anthropic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Configurable thresholds. Set min_overall_score and min_boundary_score in your agent-evals.yaml to control when CI fails. Defaults: 70% overall, 50% boundary.
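A sketch with the stated defaults (the key names come from the paragraph above; whether values are written as percentages or fractions isn't specified here, so percentages are assumed):

```yaml
# agent-evals.yaml
min_overall_score: 70    # fail CI below 70% overall
min_boundary_score: 50   # fail CI below 50% boundary awareness
```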