Understudy is a scenario-driven testing framework for AI agents. It simulates realistic multi-turn users, runs those scenes against an agent through a simple app adapter, records a structured execution trace of messages, tool calls, and handoffs, and then evaluates behavior with deterministic checks, optional LLM judges, and run reports.
How It Works
Testing with understudy takes four steps:
- Wrap your agent — Adapt your agent (ADK, LangGraph, HTTP) to understudy's interface
- Mock your tools — Register handlers that return test data instead of calling real services
- Write scenes — YAML files defining what the simulated user wants and what you expect
- Run and assert — Execute simulations, check traces, generate reports
The key insight: assert against the trace, not the prose. Don't check what the agent said—check what it did (tool calls).
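To make that concrete, here is a minimal, self-contained sketch of the idea. The `Trace` and `ToolCall` classes below are illustrative stand-ins, not understudy's actual types:

```python
# Illustrative sketch (not understudy internals): a trace records what
# the agent *did*, so assertions target tool calls rather than wording.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    tool_calls: list[ToolCall] = field(default_factory=list)

    def called(self, name: str) -> bool:
        # True if any recorded tool call matches the given name
        return any(c.name == name for c in self.tool_calls)


trace = Trace(tool_calls=[ToolCall("lookup_order", {"order_id": "ORD-10031"})])

# Brittle: depends on phrasing the agent happens to use.
#   assert "your order has shipped" in agent_reply
# Robust: depends on behavior.
assert trace.called("lookup_order")
assert not trace.called("issue_refund")
```

A prose check breaks whenever the model rewords its answer; a trace check only breaks when the agent's behavior actually changes.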
See real examples:
- Example scene — YAML defining a test scenario
- ADK test file — pytest assertions against traces
- LangGraph test file — same tests, different framework
- Example report — HTML report with metrics and transcripts
Installation
```bash
pip install understudy[all]
```
Quick Start
1. Wrap your agent
```python
from understudy.adk import ADKApp
from my_agent import agent

app = ADKApp(agent=agent)
```
2. Mock your tools
Your agent has tools that call external services. Mock them for testing:
```python
from understudy.mocks import MockToolkit

mocks = MockToolkit()

@mocks.handle("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "items": [...], "status": "delivered"}

@mocks.handle("create_return")
def create_return(order_id: str, item_sku: str, reason: str) -> dict:
    return {"return_id": "RET-001", "status": "created"}
```
3. Write a scene
Create `scenes/return_backpack.yaml`:

```yaml
id: return_eligible_backpack
description: Customer wants to return a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
  Goal: Return the hiking backpack from order ORD-10031.
  - Provide order ID when asked
  - Return reason: too small
persona: cooperative
max_turns: 15
expectations:
  required_tools:
    - lookup_order
    - create_return
  forbidden_tools:
    - issue_refund
```
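The `expectations` block maps to deterministic checks on the trace. Here is a self-contained sketch of the rule it encodes (illustrative only; understudy's evaluator may differ, and `check_expectations` is a hypothetical helper):

```python
# Illustrative check: every required tool must appear among the
# trace's tool calls, and no forbidden tool may appear.
def check_expectations(called: set[str],
                       required: list[str],
                       forbidden: list[str]) -> bool:
    return all(t in called for t in required) and not any(t in called for t in forbidden)


# Matches the scene above: both required tools called, no refund issued.
assert check_expectations(
    {"lookup_order", "create_return"},
    required=["lookup_order", "create_return"],
    forbidden=["issue_refund"],
)

# Fails if the agent skipped a required tool or issued a refund.
assert not check_expectations(
    {"lookup_order", "issue_refund"},
    required=["lookup_order", "create_return"],
    forbidden=["issue_refund"],
)
```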
4. Run simulation
```python
from understudy import Scene, run

scene = Scene.from_file("scenes/return_backpack.yaml")
trace = run(app, scene, mocks=mocks)

assert trace.called("lookup_order")
assert trace.called("create_return")
assert not trace.called("issue_refund")
```
Or with pytest (define `app` and `mocks` fixtures in `conftest.py`):

```bash
pytest test_returns.py -v
```
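A minimal sketch of what that `conftest.py` could look like, reusing the Quick Start setup above. The session scope and lazy imports are my choices here, not understudy requirements:

```python
# conftest.py -- shared pytest fixtures so every test file can simply
# request `app` and `mocks` as arguments.
import pytest


@pytest.fixture(scope="session")
def app():
    # Imported inside the fixture so test collection does not require
    # the agent's dependencies to be importable.
    from understudy.adk import ADKApp
    from my_agent import agent
    return ADKApp(agent=agent)


@pytest.fixture(scope="session")
def mocks():
    from understudy.mocks import MockToolkit

    toolkit = MockToolkit()

    @toolkit.handle("lookup_order")
    def lookup_order(order_id: str) -> dict:
        return {"order_id": order_id, "items": [], "status": "delivered"}

    @toolkit.handle("create_return")
    def create_return(order_id: str, item_sku: str, reason: str) -> dict:
        return {"return_id": "RET-001", "status": "created"}

    return toolkit
```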
Suites and Batch Runs
Run multiple scenes with multiple simulations per scene:
```python
from understudy import Suite, RunStorage

suite = Suite.from_directory("scenes/")
storage = RunStorage()

# Run each scene 3 times and tag for comparison
results = suite.run(
    app,
    mocks=mocks,
    storage=storage,
    n_sims=3,
    tags={"version": "v1"},
)
print(f"{results.pass_count}/{len(results.results)} passed")
```
Simulation and Evaluation
Understudy separates simulation (generating traces) from evaluation (checking traces). Use them together or separately:
Combined (most common)
```bash
understudy run \
  --app mymodule:agent_app \
  --scene ./scenes/ \
  --n-sims 3 \
  --junit results.xml
```
Separate workflows
Generate traces only:
```bash
understudy simulate \
  --app mymodule:agent_app \
  --scenes ./scenes/ \
  --output ./traces/ \
  --n-sims 3
```
Evaluate existing traces:
```bash
understudy evaluate \
  --traces ./traces/ \
  --output ./results/ \
  --junit results.xml
```
Python API:
```python
from understudy import simulate_batch, evaluate_batch

# Generate traces
traces = simulate_batch(
    app=agent_app,
    scenes="./scenes/",
    n_sims=3,
    output="./traces/",
)

# Evaluate later
results = evaluate_batch(
    traces="./traces/",
    output="./results/",
)
```
CLI Commands
```bash
# Run simulations
understudy run --app mymodule:app --scene ./scenes/
understudy simulate --app mymodule:app --scenes ./scenes/
understudy evaluate --traces ./traces/

# View results
understudy list
understudy show <run_id>
understudy summary

# Compare runs by tag
understudy compare --tag version --before v1 --after v2

# Generate reports
understudy report -o report.html
understudy compare --tag version --before v1 --after v2 --html comparison.html

# Interactive browser
understudy serve --port 8080

# HTTP simulator server (for browser/UI testing)
understudy serve-api --port 8000

# Cleanup
understudy delete <run_id>
understudy clear
```
LLM Judges
For qualities that can't be checked deterministically:
```python
from understudy.judges import Judge

empathy_judge = Judge(
    rubric="The agent acknowledged frustration and was empathetic while enforcing policy.",
    samples=5,
)
result = empathy_judge.evaluate(trace)
assert result.score == 1
```
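The `samples` parameter suggests the judge queries the grader model several times per trace. One common way to aggregate repeated pass/fail verdicts is a majority vote, sketched below. This is illustrative only, and `majority_score` is a hypothetical helper, not part of understudy:

```python
# Illustrative: sample the grader model several times and take a
# majority vote over 0/1 verdicts to reduce single-call variance.
from collections import Counter


def majority_score(verdicts: list[int]) -> int:
    """Return the most common 0/1 verdict across judge samples."""
    return Counter(verdicts).most_common(1)[0][0]


# 4 of 5 samples pass -> overall pass
assert majority_score([1, 1, 0, 1, 1]) == 1
```

Sampling an odd number of times (like 5) avoids ties between pass and fail.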
Built-in rubrics:
```python
from understudy.judges import (
    TOOL_USAGE_CORRECTNESS,
    POLICY_COMPLIANCE,
    TONE_EMPATHY,
    ADVERSARIAL_ROBUSTNESS,
    TASK_COMPLETION,
)
```
Report Contents
The `understudy summary` command shows:
- Pass rate — percentage of scenes that passed all expectations
- Avg turns — average conversation length
- Tool usage — distribution of tool calls across runs
- Agents — which agents were invoked
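For concreteness, the first two metrics can be defined as follows. This is an illustrative sketch of the definitions, not understudy's implementation:

```python
# Illustrative definitions of the summary metrics.
def pass_rate(results: list[dict]) -> float:
    """Fraction of runs whose expectations all passed."""
    return sum(1 for r in results if r["passed"]) / len(results)


def avg_turns(results: list[dict]) -> float:
    """Mean conversation length in turns."""
    return sum(r["turns"] for r in results) / len(results)


runs = [
    {"passed": True, "turns": 6},
    {"passed": True, "turns": 10},
    {"passed": False, "turns": 14},
]
assert round(pass_rate(runs), 2) == 0.67   # 2 of 3 runs passed
assert avg_turns(runs) == 10.0             # (6 + 10 + 14) / 3
```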
The HTML report (`understudy report`) includes:
- All metrics above
- Full conversation transcripts
- Tool call details with arguments
- Expectation check results
- Judge evaluation results (when used)
Documentation
See the full documentation for:
- Installation guide
- Writing scenes
- ADK integration
- LangGraph integration
- HTTP client for deployed agents
- API reference
License
MIT