Understudy is a scenario-driven testing framework for AI agents. It simulates realistic multi-turn users, runs those scenes against an agent through a simple app adapter, records a structured execution trace of messages, tool calls, and handoffs, and then evaluates behavior with deterministic checks, optional LLM judges, and run reports.
How It Works
Testing with understudy takes four steps:
- Wrap your agent — Adapt your agent (ADK, LangGraph, HTTP) to understudy's interface
- Mock your tools — Register handlers that return test data instead of calling real services
- Write scenes — YAML files defining what the simulated user wants and what you expect
- Run and assert — Execute simulations, check traces, generate reports
The key insight: assert against the trace, not the prose. Don't check what the agent said—check what it did (tool calls).
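To make that concrete, here is a minimal, self-contained sketch of the idea. The `Trace` and `ToolCall` classes below are illustrative stand-ins, not understudy's actual types:

```python
# Illustrative sketch (not understudy internals): a trace records what
# the agent *did*, so assertions target tool calls rather than wording.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    tool_calls: list[ToolCall] = field(default_factory=list)

    def called(self, name: str) -> bool:
        # True if any recorded tool call matches the given name
        return any(c.name == name for c in self.tool_calls)


trace = Trace(tool_calls=[ToolCall("lookup_order", {"order_id": "ORD-10031"})])

# Brittle: depends on phrasing the agent happens to use.
#   assert "your order has shipped" in agent_reply
# Robust: depends on behavior.
assert trace.called("lookup_order")
assert not trace.called("issue_refund")
```

A prose check breaks whenever the model rewords its answer; a trace check only breaks when the agent's behavior actually changes.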
See real examples:
- Example scene — YAML defining a test scenario
- ADK test file — pytest assertions against traces
- LangGraph test file — same tests, different framework
- Example report — HTML report with metrics and transcripts
Installation
```bash
pip install understudy[all]
```
Quick Start
1. Wrap your agent
```python
from understudy.adk import ADKApp
from my_agent import agent

app = ADKApp(agent=agent)
```
2. Mock your tools
Your agent has tools that call external services. Mock them for testing:
```python
from understudy.mocks import MockToolkit

mocks = MockToolkit()

@mocks.handle("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "items": [...], "status": "delivered"}

@mocks.handle("create_return")
def create_return(order_id: str, item_sku: str, reason: str) -> dict:
    return {"return_id": "RET-001", "status": "created"}
```
3. Write a scene
Create `scenes/return_backpack.yaml`:

```yaml
id: return_eligible_backpack
description: Customer wants to return a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
  Goal: Return the hiking backpack from order ORD-10031.
  - Provide order ID when asked
  - Return reason: too small
persona: cooperative
max_turns: 15
expectations:
  required_tools:
    - lookup_order
    - create_return
  forbidden_tools:
    - issue_refund
```
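The `expectations` block maps to deterministic checks on the trace. Here is a self-contained sketch of the rule it encodes (illustrative only; understudy's evaluator may differ, and `check_expectations` is a hypothetical helper):

```python
# Illustrative check: every required tool must appear among the
# trace's tool calls, and no forbidden tool may appear.
def check_expectations(called: set[str],
                       required: list[str],
                       forbidden: list[str]) -> bool:
    return all(t in called for t in required) and not any(t in called for t in forbidden)


# Matches the scene above: both required tools called, no refund issued.
assert check_expectations(
    {"lookup_order", "create_return"},
    required=["lookup_order", "create_return"],
    forbidden=["issue_refund"],
)

# Fails if the agent skipped a required tool or issued a refund.
assert not check_expectations(
    {"lookup_order", "issue_refund"},
    required=["lookup_order", "create_return"],
    forbidden=["issue_refund"],
)
```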
4. Run simulation
```python
from understudy import Scene, run

scene = Scene.from_file("scenes/return_backpack.yaml")
trace = run(app, scene, mocks=mocks)

assert trace.called("lookup_order")
assert trace.called("create_return")
assert not trace.called("issue_refund")
```
Or with pytest (define `app` and `mocks` fixtures in `conftest.py`):

```bash
pytest test_returns.py -v
```
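A minimal sketch of what that `conftest.py` could look like, reusing the Quick Start setup above. The session scope and lazy imports are my choices here, not understudy requirements:

```python
# conftest.py -- shared pytest fixtures so every test file can simply
# request `app` and `mocks` as arguments.
import pytest


@pytest.fixture(scope="session")
def app():
    # Imported inside the fixture so test collection does not require
    # the agent's dependencies to be importable.
    from understudy.adk import ADKApp
    from my_agent import agent
    return ADKApp(agent=agent)


@pytest.fixture(scope="session")
def mocks():
    from understudy.mocks import MockToolkit

    toolkit = MockToolkit()

    @toolkit.handle("lookup_order")
    def lookup_order(order_id: str) -> dict:
        return {"order_id": order_id, "items": [], "status": "delivered"}

    @toolkit.handle("create_return")
    def create_return(order_id: str, item_sku: str, reason: str) -> dict:
        return {"return_id": "RET-001", "status": "created"}

    return toolkit
```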
Suites and Batch Runs
Run multiple scenes with multiple simulations per scene:
```python
from understudy import Suite, RunStorage

suite = Suite.from_directory("scenes/")
storage = RunStorage()

# Run each scene 3 times and tag for comparison
results = suite.run(
    app,
    mocks=mocks,
    storage=storage,
    n_sims=3,
    tags={"version": "v1"},
)
print(f"{results.pass_count}/{len(results.results)} passed")
```
Simulation and Evaluation
Understudy separates simulation (generating traces) from evaluation (checking traces). Use them together or separately:
Combined (most common)
```bash
understudy run \
  --app mymodule:agent_app \
  --scene ./scenes/ \
  --n-sims 3 \
  --junit results.xml
```
Separate workflows
Generate traces only:
```bash
understudy simulate \
  --app mymodule:agent_app \
  --scenes ./scenes/ \
  --output ./traces/ \
  --n-sims 3
```
Evaluate existing traces:
```bash
understudy evaluate \
  --traces ./traces/ \
  --output ./results/ \
  --junit results.xml
```
Python API:
```python
from understudy import simulate_batch, evaluate_batch

# Generate traces
traces = simulate_batch(
    app=agent_app,
    scenes="./scenes/",
    n_sims=3,
    output="./traces/",
)

# Evaluate later
results = evaluate_batch(
    traces="./traces/",
    output="./results/",
)
```
CLI Commands
```bash
# Run simulations
understudy run --app mymodule:app --scene ./scenes/
understudy simulate --app mymodule:app --scenes ./scenes/
understudy evaluate --traces ./traces/

# View results
understudy list
understudy show <run_id>
understudy summary

# Compare runs by tag
understudy compare --tag version --before v1 --after v2

# Generate reports
understudy report -o report.html
understudy compare --tag version --before v1 --after v2 --html comparison.html

# Interactive browser
understudy serve --port 8080

# HTTP simulator server (for browser/UI testing)
understudy serve-api --port 8000

# Cleanup
understudy delete <run_id>
understudy clear
```
LLM Judges
For qualities that can't be checked deterministically:
```python
from understudy.judges import Judge

empathy_judge = Judge(
    rubric="The agent acknowledged frustration and was empathetic while enforcing policy.",
    samples=5,
)
result = empathy_judge.evaluate(trace)
assert result.score == 1
```
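The `samples` parameter suggests the judge queries the grader model several times per trace. One common way to aggregate repeated pass/fail verdicts is a majority vote, sketched below. This is illustrative only, and `majority_score` is a hypothetical helper, not part of understudy:

```python
# Illustrative: sample the grader model several times and take a
# majority vote over 0/1 verdicts to reduce single-call variance.
from collections import Counter


def majority_score(verdicts: list[int]) -> int:
    """Return the most common 0/1 verdict across judge samples."""
    return Counter(verdicts).most_common(1)[0][0]


# 4 of 5 samples pass -> overall pass
assert majority_score([1, 1, 0, 1, 1]) == 1
```

Sampling an odd number of times (like 5) avoids ties between pass and fail.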
Built-in rubrics:
```python
from understudy.judges import (
    TOOL_USAGE_CORRECTNESS,
    POLICY_COMPLIANCE,
    TONE_EMPATHY,
    ADVERSARIAL_ROBUSTNESS,
    TASK_COMPLETION,
)
```
Report Contents
The `understudy summary` command shows:
- Pass rate — percentage of scenes that passed all expectations
- Avg turns — average conversation length
- Tool usage — distribution of tool calls across runs
- Agents — which agents were invoked
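For concreteness, the first two metrics can be defined as follows. This is an illustrative sketch of the definitions, not understudy's implementation:

```python
# Illustrative definitions of the summary metrics.
def pass_rate(results: list[dict]) -> float:
    """Fraction of runs whose expectations all passed."""
    return sum(1 for r in results if r["passed"]) / len(results)


def avg_turns(results: list[dict]) -> float:
    """Mean conversation length in turns."""
    return sum(r["turns"] for r in results) / len(results)


runs = [
    {"passed": True, "turns": 6},
    {"passed": True, "turns": 10},
    {"passed": False, "turns": 14},
]
assert round(pass_rate(runs), 2) == 0.67   # 2 of 3 runs passed
assert avg_turns(runs) == 10.0             # (6 + 10 + 14) / 3
```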
The HTML report (`understudy report`) includes:
- All metrics above
- Full conversation transcripts
- Tool call details with arguments
- Expectation check results
- Judge evaluation results (when used)
Documentation
See the full documentation for:
- Installation guide
- Writing scenes
- ADK integration
- LangGraph integration
- HTTP client for deployed agents
- API reference
License
MIT