# mimiq

Cypress integration for end-to-end testing of agentic applications.
Testing AI agents is hard: manual testing is slow, real users are expensive, and LLM non-determinism makes assertions brittle. mimiq addresses this with simulated users that follow scripted conversation plans, plus deterministic checks on tool calls and terminal states.
## Overview

mimiq is a complete TypeScript solution for testing AI agents with simulated users. It provides:

- **Simulated users**: LLM-powered users that follow conversation plans
- **Deterministic checks**: verify tool calls, terminal states, and forbidden actions
- **LLM-as-judge**: qualitative evaluation with majority voting
- **Cypress commands**: drive simulations in real browsers
- **HTML reports**: view conversation traces and check results

No Python required. Everything runs in Node.js.
## Quick Start

### 1. Install

```bash
npm install @gojiplus/mimiq --save-dev
```
### 2. Configure API Key

```bash
export OPENAI_API_KEY=your-key

# Optional: use a different model
export SIMULATOR_MODEL=gpt-4o  # default
```
### 3. Configure Cypress

`cypress.config.ts`:

```ts
import { defineConfig } from "cypress";
import { setupMimiqTasks, createLocalRuntime } from "@gojiplus/mimiq/node";

export default defineConfig({
  e2e: {
    baseUrl: "http://localhost:5173",
    setupNodeEvents(on, config) {
      const runtime = createLocalRuntime({
        scenesDir: "./scenes",
      });
      setupMimiqTasks(on, { runtime });
      return config;
    },
  },
});
```
`cypress/support/e2e.ts`:

```ts
import { createDefaultChatAdapter, registerMimiqCommands } from "@gojiplus/mimiq";

registerMimiqCommands({
  browserAdapter: createDefaultChatAdapter({
    transcript: '[data-test="transcript"]',
    messageRow: '[data-test="message-row"]',
    messageRoleAttr: "data-role",
    messageText: '[data-test="message-text"]',
    input: '[data-test="chat-input"]',
    send: '[data-test="send-button"]',
    idleMarker: '[data-test="agent-idle"]',
  }),
});
```
### 4. Write a Scene

`scenes/return_backpack.yaml`:

```yaml
id: return_backpack
description: Customer returns a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
  Goal: Return the hiking backpack from order ORD-10031.
  - Provide order ID when asked.
  - Cooperate with all steps.
persona: cooperative
max_turns: 15
expectations:
  required_tools:
    - lookup_order
    - create_return
  forbidden_tools:
    - issue_refund
  allowed_terminal_states:
    - return_created
  judges:
    - name: empathy
      rubric: "The agent maintained a professional and empathetic tone."
      samples: 3
```
### 5. Write the Test

```ts
describe("return flow", () => {
  afterEach(() => cy.mimiqCleanupRun());

  it("processes valid return", () => {
    cy.visit("/");
    cy.mimiqStartRun({ sceneId: "return_backpack" });
    cy.mimiqRunToCompletion();
    cy.mimiqEvaluate().then((report) => {
      expect(report.passed).to.eq(true);
    });
  });
});
```
## Scene Schema

```yaml
id: string                    # Unique identifier
description: string           # Human-readable description
starting_prompt: string       # First message from simulated user
conversation_plan: string     # Instructions for user behavior
persona: string               # Preset: cooperative, frustrated_but_cooperative,
                              #   adversarial, vague, impatient
max_turns: number             # Maximum turns (default: 15)
context:                      # World state (optional)
  customer: { ... }
  orders: { ... }
expectations:
  required_tools: [string]              # Must be called
  forbidden_tools: [string]             # Must NOT be called
  allowed_terminal_states: [string]     # Valid end states
  forbidden_terminal_states: [string]
  required_agents: [string]             # For multi-agent systems
  forbidden_agents: [string]
  required_agent_tools:                 # Agent-specific tool requirements
    agent_name: [tool1, tool2]
  judges:                               # LLM-as-judge evaluations
    - name: string
      rubric: string
      samples: number                   # Number of samples (default: 5)
      model: string                     # Model to use (default: gpt-4o)
```
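The optional `context` block seeds world state for the run. As a sketch, a scene might pair a vague persona with a seeded order so the agent has to ask follow-up questions (the customer and order values below are illustrative, not real fixtures):

```yaml
id: return_backpack_vague
description: Return flow with seeded world state and a vague customer
starting_prompt: "Hi, I need to return something."
conversation_plan: |
  Goal: Return the backpack from order ORD-10031.
  - Only reveal the order ID if explicitly asked.
persona: vague
max_turns: 10
context:
  customer:
    id: CUST-001        # hypothetical example value
    name: Alex Doe
  orders:
    ORD-10031:
      items: [hiking backpack]
      status: delivered
expectations:
  required_tools:
    - lookup_order
  allowed_terminal_states:
    - return_created
```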
## Persona Presets

| Preset | Description |
|---|---|
| `cooperative` | Helpful, provides information directly |
| `frustrated_but_cooperative` | Mildly frustrated but ultimately cooperative |
| `adversarial` | Tries to push boundaries and social-engineer exceptions |
| `vague` | Gives incomplete information, needs follow-up |
| `impatient` | Wants fast resolution, short answers |
## LLM-as-Judge

Add qualitative evaluation with LLM judges:

```yaml
expectations:
  judges:
    - name: empathy
      rubric: "The agent maintained an empathetic tone throughout."
      samples: 5
    - name: accuracy
      rubric: "All factual claims were grounded in tool results."
```
Judges use majority voting across multiple samples for reliability.
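Concretely, each rubric is sampled several times and the judge passes when a strict majority of verdicts pass. A minimal sketch of the voting rule (illustrative only, not mimiq's internal code):

```typescript
// Illustrative sketch of majority voting over judge verdicts.
// Each sample is one independent pass/fail verdict from the judge model.
function majorityPasses(verdicts: boolean[]): boolean {
  const passes = verdicts.filter(Boolean).length;
  return passes * 2 > verdicts.length; // strict majority; ties fail
}

// With samples: 5, three or more passing verdicts mean the judge passes.
console.log(majorityPasses([true, true, false, true, false])); // true
console.log(majorityPasses([true, false, false]));             // false
```

Strict-majority voting is why odd sample counts (3, 5) are the natural choice: they can never tie.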
### Built-in Rubrics

```ts
import { BUILTIN_RUBRICS } from "@gojiplus/mimiq";

// Available rubrics:
BUILTIN_RUBRICS.TASK_COMPLETION
BUILTIN_RUBRICS.INSTRUCTION_FOLLOWING
BUILTIN_RUBRICS.TONE_EMPATHY
BUILTIN_RUBRICS.POLICY_COMPLIANCE
BUILTIN_RUBRICS.FACTUAL_GROUNDING
BUILTIN_RUBRICS.TOOL_USAGE_CORRECTNESS
BUILTIN_RUBRICS.ADVERSARIAL_ROBUSTNESS
```
## Cypress Commands

| Command | Description |
|---|---|
| `cy.mimiqStartRun({ sceneId })` | Start a simulation |
| `cy.mimiqRunToCompletion()` | Run until done or max turns |
| `cy.mimiqRunTurn()` | Execute one turn |
| `cy.mimiqEvaluate()` | Run all checks and judges |
| `cy.mimiqGetTrace()` | Get conversation trace |
| `cy.mimiqCleanupRun()` | Clean up |
## Environment Variables

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | API key for simulation and judges |
| `SIMULATOR_MODEL` | Model for simulation (default: `gpt-4o`) |
| `JUDGE_MODEL` | Model for judges (default: `gpt-4o`) |
| `OPENAI_BASE_URL` | Base URL for an OpenAI-compatible API |
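Because only a base URL and key are consumed, any OpenAI-compatible server should be able to back the simulator and judges. A sketch assuming a local Ollama instance (the URL, model name, and placeholder key are examples, not mimiq defaults):

```shell
# Point mimiq at a local OpenAI-compatible server (example endpoint: Ollama).
export OPENAI_BASE_URL=http://localhost:11434/v1
# Many OpenAI clients require a non-empty key even when the server ignores it.
export OPENAI_API_KEY=placeholder
export SIMULATOR_MODEL=llama3.1   # example local model name
export JUDGE_MODEL=llama3.1
```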
## HTML Reports

mimiq generates rich, interactive HTML reports. See examples:

- Aggregate Report: stats dashboard with all runs
- Run Detail (Return Flow): conversation timeline with tool calls

Generate reports after tests:

```bash
npm run test:report   # Runs tests and opens report
```

## Architecture
```
┌──────────────────────────────────────────────────────────┐
│                          mimiq                           │
│                                                          │
│  Browser Layer (Cypress):                                │
│    - Captures UI state via data-test selectors           │
│    - Executes actions (type, click, send)                │
│                                                          │
│  Node Layer (Cypress tasks):                             │
│    - Simulator: LLM generates user messages              │
│    - Trace: records conversation + tool calls            │
│    - Check: validates against expectations               │
│    - Judge: LLM-as-judge evaluation                      │
│    - Reports: generates HTML summaries                   │
└──────────────────────────────────────────────────────────┘
```
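Conceptually, each turn crosses both layers: the Node-side simulator produces the next user message, the browser adapter delivers it and waits for the agent's reply, and the trace records both sides until the goal is met or `max_turns` is hit. A self-contained sketch of that loop (the function stubs are illustrative, not mimiq's API):

```typescript
// Conceptual sketch of the simulation turn loop (stubs, not mimiq's API).
interface TraceEntry { role: "user" | "agent"; text: string }

function runSimulation(
  nextUserMessage: (trace: TraceEntry[]) => string | null, // simulator (an LLM in mimiq)
  agentReply: (userText: string) => string,                // browser adapter in mimiq
  maxTurns: number,
): TraceEntry[] {
  const trace: TraceEntry[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const userText = nextUserMessage(trace);
    if (userText === null) break; // simulated user decided the goal is met
    trace.push({ role: "user", text: userText });
    trace.push({ role: "agent", text: agentReply(userText) });
  }
  return trace;
}

// Toy run: a scripted "user" that stops after two messages.
const script = ["I'd like to return an item.", "Order ORD-10031."];
const trace = runSimulation(
  (t) => script[t.filter((e) => e.role === "user").length] ?? null,
  (msg) => `ack: ${msg}`,
  15,
);
console.log(trace.length); // 4 entries: two user turns, two agent replies
```

The checks and judges then run over the finished trace, which is why deterministic assertions (tools, terminal states) and qualitative ones (rubrics) can share one run.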
## License

MIT