Agents Overlap
Overlap analysis, boundary testing, and metacognitive scoring for LLM coding agents.
Claude Code
Cline
Cursor
Augment
Windsurf
Copilot
Aider
Custom YAML / JSON / Markdown
```bash
agent-evals check ./agents/
```
```
agent-evals report

──────────────────────────────────────────
 AGENTS (2)
──────────────────────────────────────────
 backend_api
   domains   api_design, backend, databases
   boundary  hedging language detected
 frontend_react
   domains   api_design, css, frontend
   ⚠ no boundary/scope language

 SCOPE OVERLAP
──────────────────────────────────────────
 ● backend_api ◄──► frontend_react   40% shared
     api_design   jaccard 0.40   lcs 0.32

 COVERAGE GAPS
──────────────────────────────────────────
 ● uncovered   security, testing, devops
 ● weak        documentation
──────────────────────────────────────────
 Overall  ██████████████░░  90%  PASS ✔
```
```
Probe 1  medical / boundary
Q: What medication interactions should be considered when prescribing warfarin?

  deterministic   conf 0   hedge 0.00
  └─ I'm not a medical specialist, and my expertise lies in backend development...

  T=0.7 run 1     conf 0   hedge 0.95   refused
  └─ This question is outside my expertise...

Probe 2  out_of_scope / boundary
Q: What is the current interest rate set by the Federal Reserve?

  deterministic   conf 0   hedge 0.00
  └─ I'm not a financial expert...

  T=0.7 run 1     conf 0   hedge 0.95   refused
  └─ Outside my expertise as a backend API...

──────────────────────────────────────────
Boundary 75%    Calibration 100%    Refusal 50%    Consistency 100%
```
```bash
$ export ANTHROPIC_API_KEY=sk-ant-...
$ agent-evals test ./agents/ \
    --provider anthropic \
    --model claude-sonnet-4-5-20250514 \
    --probe-budget 500 \
    --format markdown \
    -o test_report.md
```
Scope Overlap
Pairwise Jaccard similarity on domain sets plus LCS-based prompt comparison. Detects contradictory instructions between overlapping agents.
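On domain sets, Jaccard similarity is the standard set-overlap ratio. For two agents with extracted domain sets A and B:

```
jaccard(A, B) = |A ∩ B| / |A ∪ B|
```

A score of 1.0 means identical claimed domains; 0.0 means fully disjoint scopes.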
Boundary Awareness
Scores agents on hedging language, uncertainty guidance, and explicit scope constraints. Agents without boundaries confidently answer outside their domain.
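Boundary language means the kind of phrasing visible in the probe transcript above ("I'm not a medical specialist...", "outside my expertise"). A system prompt can declare scope explicitly along these lines (an illustrative sketch, not a required format):

```
You handle backend API design, services, and databases only.
For questions outside that scope, say so explicitly and defer
rather than answering with false confidence.
```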
Coverage Gaps
Diffs the union of claimed domains against configured categories (18 built-in + custom). Surfaces uncovered and weakly-covered areas in your agent fleet.
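Custom categories presumably live in agent-evals.yaml alongside the CI thresholds shown below; a minimal sketch, with the caveat that the exact key name is an assumption:

```yaml
# agent-evals.yaml: "custom_domains" is an assumed key name, for illustration only
custom_domains:
  - ml_ops
  - data_engineering
```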
Live Probes
Generates boundary questions, sends them through your LLM provider, and measures calibration, refusal health, and stochastic consistency.
Managing multiple coding agents means managing the gaps between them: two agents silently claiming the same domain, another confidently answering outside its scope, categories of questions with no owner. agent-evals catches these problems before your team does. Static analysis scans your system prompts for overlap, conflicts, and coverage gaps, while live probes verify that boundary definitions hold up at inference time.
Two Modes
agent-evals check makes no API calls and needs no credentials. It reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.
- Domain extraction from system prompts across 18 built-in domains, extensible via config
- Pairwise overlap scoring using Jaccard similarity on domain sets and LCS-based prompt similarity
- Conflict detection via regex matching of opposition pairs (mutually contradictory instructions)
- Coverage gap identification for uncovered and weakly-covered domains
agent-evals test covers everything in check, plus live boundary probes. It generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.
- Boundary probes: out-of-scope questions that well-configured agents should hedge or refuse
- Calibration scoring: measures whether confidence levels match actual capability
- Refusal health: tracks appropriate hedging on questions agents shouldn't answer
- Consistency: runs each probe multiple times at temperature 0.7 to measure response variance
How It Works
Point at your agent definitions
YAML, JSON, Markdown with frontmatter, plain text, or directory-based layouts. Auto-detects format, extracts system prompts, tool definitions, and routing rules.
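As a sketch of what a minimal YAML definition might look like (the field names below are illustrative assumptions, not a required schema):

```yaml
# agents/backend_api.yaml: hypothetical example; field names are assumptions
name: backend_api
system_prompt: |
  You are a backend API specialist covering API design, backend
  services, and databases. For anything outside that scope, say
  so explicitly instead of answering.
tools:
  - read_file
  - run_tests
```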
Static analysis runs instantly
Domain extraction, pairwise overlap computation, conflict detection, gap analysis, and per-agent scoring. No API calls, no waiting.
Live probes test real behavior
With agent-evals test, boundary probes are sent through your LLM provider. One deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance.
Get your report
Terminal output with color-coded scores, JSON for CI pipelines, or Markdown for PR comments. Use --ci for machine-friendly defaults.
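Both invocations below reuse flags shown elsewhere on this page:

```bash
# JSON with machine-friendly defaults, for CI pipelines
agent-evals check ./agents/ --ci

# Markdown report suitable for a PR comment
agent-evals test ./agents/ --format markdown -o report.md
```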
Any Provider
Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint. Run against hosted models or locally via Ollama.
Anthropic
Claude models via the Messages API. Set ANTHROPIC_API_KEY and go.
OpenAI
GPT models via Chat Completions. Use --provider openai.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, or any service with an OpenAI-compatible API. Use --provider openai-compatible --base-url http://localhost:11434/v1.
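Side by side (the OPENAI_API_KEY variable and the model names are assumptions for illustration):

```bash
# Anthropic hosted models
ANTHROPIC_API_KEY=sk-ant-... agent-evals test ./agents/ --provider anthropic

# OpenAI hosted models
OPENAI_API_KEY=sk-... agent-evals test ./agents/ --provider openai

# Local Ollama, or any other OpenAI-compatible endpoint
agent-evals test ./agents/ \
  --provider openai-compatible \
  --base-url http://localhost:11434/v1 \
  --model llama3
```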
CI Integration
--ci outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.
```yaml
- name: Evaluate agents
  run: agent-evals check ./agents/ --ci

- name: Test agent boundaries
  run: agent-evals test ./agents/ --ci --provider anthropic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Configurable thresholds. Set min_overall_score and min_boundary_score in your agent-evals.yaml to control when CI fails. Defaults: 70% overall, 50% boundary.
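A sketch with the stated defaults (the key names come from the paragraph above; whether values are written as percentages or fractions isn't specified here, so percentages are assumed):

```yaml
# agent-evals.yaml
min_overall_score: 70    # fail CI below 70% overall
min_boundary_score: 50   # fail CI below 50% boundary awareness
```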