# agentcheck

**Trace → Replay → Test your AI agents like real software.**
AgentCheck is a minimal but complete toolkit for tracing, replaying, diffing, and testing AI agent executions. Think of it as version control and testing for your AI agents.
## 🚀 Install
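Assuming the package is published to PyPI under the same name as its CLI, a standard pip install should work:

```bash
# Assumes agentcheck is available on PyPI under this name
pip install agentcheck
```

For a development install from source, see the Development section below.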
## ⚡ Quickstart Demo
```bash
export OPENAI_API_KEY=sk-...

# 1️⃣ Capture a baseline trace
python demo/demo_agent.py --output baseline.json

# 2️⃣ Modify the prompt inside demo_agent.py (e.g. change the tone)

# 3️⃣ Replay with the new code/model
agentcheck replay baseline.json --output new.json

# 4️⃣ See what changed
agentcheck diff baseline.json new.json

# 5️⃣ Assert the new output still mentions the user's name
agentcheck assert new.json --contains "John Doe"

# 🆕 6️⃣ Test deterministic behavior
python demo/demo_deterministic.py
```
Or run the complete demo script in the `demo/` directory.
## 🎯 Features
| Feature | Description | CLI Command | Python API |
|---|---|---|---|
| Trace | Capture agent execution (prompts, outputs, costs, timing) | `agentcheck trace <command>` | `@agentcheck.trace()` |
| Replay | Re-run a trace against current code/model | `agentcheck replay trace.json` | `agentcheck.replay_trace()` |
| Diff | Compare traces and highlight changes | `agentcheck diff trace_a.json trace_b.json` | `agentcheck.diff_traces()` |
| Assert | Test trace contents (CI-friendly) | `agentcheck assert trace.json --contains "foo"` | `agentcheck.assert_trace()` |
| 🆕 Deterministic Testing | Test behavioral consistency of non-deterministic agents | (Python API only) | `@agentcheck.deterministic_replay()` |
| 📊 Analytics Dashboard | Web GUI for trace analysis and testing insights | `python launch_dashboard.py` | Web interface |
## 📖 Usage
### Tracing with a Decorator
```python
import agentcheck
import openai

@agentcheck.trace(output="my_trace.json")
def my_agent(user_input: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

# Automatically traces execution and saves to my_trace.json
result = my_agent("Hello, world!")
```
### 🆕 Deterministic Replay Testing
**The Problem:** AI agents are non-deterministic: identical inputs can produce different outputs, so traditional exact-match testing breaks down.

**The Solution:** AgentCheck's deterministic replay testing learns your agent's behavioral patterns and detects when behavior changes unexpectedly.
```python
import agentcheck
import openai

@agentcheck.deterministic_replay(
    consistency_threshold=0.8,  # 80% behavioral consistency required
    baseline_runs=5,            # Run 5 times to establish the baseline
    baseline_name="my_agent",   # Name for this baseline
)
def my_agent(user_input: str) -> str:
    with agentcheck.trace() as trace:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_input}],
            temperature=0.7,  # Non-deterministic!
        )
        # Record the LLM call
        trace.add_llm_call(
            messages=[{"role": "user", "content": user_input}],
            response={"content": response.choices[0].message.content},
            model="gpt-4o-mini",
        )
        return response.choices[0].message.content

# Step 1: Establish a behavioral baseline
replayer = my_agent._deterministic_replayer
test_inputs = ["What is Python?", "How do I install packages?"]
replayer.establish_baseline(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent",
)

# Step 2: Test the current agent against the baseline
failures = replayer.test_consistency(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent",
)

if failures:
    print(f"❌ {len(failures)} tests failed - agent behavior changed!")
    for failure in failures:
        print(f"Input: {failure.input_data}")
        print(f"Consistency Score: {failure.consistency_score:.3f}")
else:
    print("✅ All tests passed - agent behavior is consistent!")
```
**What it detects** (see the toy scoring sketch after this list):
- Changes in reasoning patterns
- Different tool usage sequences
- Altered response structures
- Performance regressions
- Error rate changes
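The consistency score itself is computed inside AgentCheck; the library's actual algorithm isn't shown here. As rough intuition only, here is a hypothetical toy score that compares the step-type sequences of two runs position by position:

```python
# Toy illustration only - NOT AgentCheck's actual scoring algorithm.
# Compares two runs' step-type sequences position by position and
# returns the fraction of positions that agree.
def toy_consistency(steps_a: list[str], steps_b: list[str]) -> float:
    if not steps_a and not steps_b:
        return 1.0  # two empty runs are trivially consistent
    matches = sum(a == b for a, b in zip(steps_a, steps_b))
    return matches / max(len(steps_a), len(steps_b))

# A changed tool-usage sequence lowers the score:
print(toy_consistency(["llm_call", "tool_call"], ["llm_call", "llm_call"]))  # 0.5
```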
**Perfect for:**
- Regression testing after prompt changes
- Model version upgrades
- Code refactoring validation
- CI/CD pipeline integration (see the pytest sketch below)
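For example, the consistency check can run under pytest in CI. This is a minimal sketch reusing the `test_consistency` API from the example above; the `my_agent` module path and a previously established `"my_agent"` baseline are assumed:

```python
# Hedged sketch: run AgentCheck's consistency check as a pytest test.
# Assumes my_agent is importable and a "my_agent" baseline was already
# established (see the example above).
from my_agent import my_agent

def test_agent_behavioral_consistency():
    replayer = my_agent._deterministic_replayer
    failures = replayer.test_consistency(
        agent_func=my_agent,
        test_inputs=["What is Python?", "How do I install packages?"],
        baseline_name="my_agent",
    )
    assert not failures, f"{len(failures)} behavioral consistency tests failed"
```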
### 📊 Analytics Dashboard
Get insights into your agents' performance with the built-in web dashboard:
```bash
# Launch the dashboard
python launch_dashboard.py

# Or manually with Streamlit
pip install streamlit plotly pandas numpy
streamlit run agentcheck_dashboard.py
```
**Dashboard Features:**
- 📊 **Overview**: Key metrics, traces over time, model usage distribution
- 🔍 **Trace Analysis**: Detailed step-by-step execution analysis
- 🧪 **Deterministic Testing**: Baseline management and consistency trends
- 💰 **Cost Analysis**: Cost breakdowns by model and time period
**What you can track:**
- Total traces and execution costs
- Error rates and failure patterns
- LLM model usage and performance
- Behavioral consistency trends
- Cost optimization opportunities
The dashboard automatically loads data from your `traces/` and `baselines/` directories and provides real-time analytics as you develop and test your agents.
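To make new runs show up there, one option is to point the tracer's `output` path (the same parameter used in the decorator example above) at the `traces/` directory; the agent body here is a placeholder:

```python
import agentcheck

# Write each run's trace into traces/ so the dashboard picks it up.
# Assumes the traces/ directory already exists.
@agentcheck.trace(output="traces/support_agent.json")
def support_agent(user_input: str) -> str:
    return f"echo: {user_input}"  # placeholder agent body
```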
### Tracing with a Context Manager
```python
import agentcheck
import openai

with agentcheck.Trace(output="trace.json") as t:
    # Your agent code here
    messages = [{"role": "user", "content": "Hello"}]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    # Manually add LLM calls to the trace
    t.add_llm_call(
        messages=messages,
        response={
            "content": response.choices[0].message.content,
            "usage": response.usage,
        },
        model="gpt-4o-mini",
    )
```
### CLI Commands
```bash
# Trace a Python script
agentcheck trace "python my_agent.py" --output trace.json

# Replay a trace with a different model
agentcheck replay trace.json --model gpt-4 --output new_trace.json

# Compare two traces
agentcheck diff baseline.json new_trace.json

# Assert trace contains expected content
agentcheck assert trace.json --contains "expected output"

# Assert with JSONPath
agentcheck assert trace.json --jsonpath "$.steps[-1].output.content" --contains "John"

# Assert cost and step constraints
agentcheck assert trace.json --max-cost 0.05 --min-steps 1 --max-steps 10

# Pretty-print a trace
agentcheck show trace.json
```
## 🏗️ Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Your Agent    │────▶│  agentcheck  │────▶│   trace.json    │
│                 │     │    tracer    │     │                 │
└─────────────────┘     └──────────────┘     └─────────────────┘
                                                      │
                      ┌───────────────────────────────┼─────────────────────┐
                      ▼                               ▼                     ▼
             ┌─────────────────┐             ┌─────────────────┐   ┌─────────────────┐
             │     replay      │             │      diff       │   │     assert      │
             │  (re-execute)   │             │   (compare)     │   │     (test)      │
             └─────────────────┘             └─────────────────┘   └─────────────────┘
                      │
          ┌───────────┴──────────┐
          │  🆕 deterministic    │
          │  behavioral testing  │
          └──────────────────────┘
```
## 📋 Trace Format
AgentCheck uses a standardized JSON schema for traces:
```json
{
"trace_id": "uuid",
"version": "1.0",
"start_time": "2024-01-01T12:00:00Z",
"end_time": "2024-01-01T12:00:05Z",
"metadata": {
"total_cost": 0.0023,
"function_name": "my_agent"
},
"steps": [
{
"step_id": "uuid",
"type": "llm_call",
"start_time": "2024-01-01T12:00:01Z",
"end_time": "2024-01-01T12:00:04Z",
"input": {
"messages": [...],
"model": "gpt-4o-mini"
},
"output": {
"content": "Agent response...",
"usage": {"prompt_tokens": 10, "completion_tokens": 20},
"cost": 0.0023
}
}
]
}
```
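Because traces are plain JSON, you can post-process them with standard tooling. A minimal sketch, using only the field names shown in the schema example above:

```python
import json

# Load a saved trace and total the per-step LLM costs.
# Field names follow the schema example above; nothing else is assumed.
with open("trace.json") as f:
    trace = json.load(f)

total_cost = sum(
    step.get("output", {}).get("cost", 0.0)
    for step in trace.get("steps", [])
    if step.get("type") == "llm_call"
)
print(f"{trace['trace_id']}: {len(trace['steps'])} steps, ${total_cost:.4f}")
```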
## 🧪 Testing & CI Integration

AgentCheck is designed for CI/CD pipelines:
```bash
# Traditional trace testing
agentcheck replay baseline_trace.json --output ci_trace.json
agentcheck assert ci_trace.json --contains "expected behavior" --max-cost 0.10

# 🆕 Deterministic behavioral testing
python -c "
import agentcheck
from my_agent import my_agent

replayer = my_agent._deterministic_replayer
test_inputs = ['test1', 'test2', 'test3']

failures = replayer.test_consistency(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name='production'
)

if failures:
    print(f'❌ {len(failures)} behavioral consistency tests failed')
    exit(1)
else:
    print('✅ All behavioral tests passed')
    exit(0)
"

# Exit codes:
# 0 = success
# 1 = assertion failed or error
```
## 🛠️ Development
```bash
# Install in development mode
git clone https://github.com/agentcheck/agentcheck
cd agentcheck
pip install -e ".[dev]"

# Run tests
pytest

# Format code
ruff format .

# Type check
mypy agentcheck/
```
## 🗺️ Roadmap

### 🔧 Core Framework Improvements

#### Enhanced Tracing & Observability

- **Multi-Agent Tracing**: Support for complex agent orchestrations and conversations
- **Real-time Streaming**: Live trace streaming for long-running agents
- **Custom Metrics**: User-defined KPIs and business metrics tracking
- **Performance Profiling**: Detailed timing analysis and bottleneck detection
- **Memory Usage Tracking**: Monitor agent memory consumption and optimization

#### Advanced Testing Capabilities

- **Property-Based Testing**: Generate test cases automatically based on agent specifications
- **Mutation Testing**: Automatically modify prompts/code to test robustness
- **Load Testing**: Concurrent agent execution testing with performance metrics
- **A/B Testing Framework**: Built-in support for comparing agent variants
- **Regression Test Suite**: Automated detection of performance and quality regressions

#### Enterprise Integration

- **CI/CD Plugins**: Native GitHub Actions, GitLab CI, and Jenkins integrations
- **Database Backends**: PostgreSQL and MongoDB support for large-scale trace storage
- **SSO & RBAC**: Enterprise authentication and role-based access control
- **Audit Logging**: Comprehensive audit trails for compliance requirements
- **API Gateway**: REST/GraphQL APIs for enterprise system integration

### 🎯 Specialized Features

#### Multi-Modal Agent Support

- **Vision Agent Testing**: Image/video input tracing and consistency testing
- **Audio Agent Testing**: Speech-to-text and text-to-speech agent validation
- **Document Processing**: PDF, Word, and Excel agent testing capabilities
- **Code Generation**: Specialized testing for code-generating agents

#### Advanced Analytics & Insights

- **Predictive Analytics**: ML-powered prediction of agent behavior changes
- **Anomaly Detection**: Automatic detection of unusual agent behaviors
- **Cost Optimization**: AI-powered recommendations for cost reduction
- **Quality Scoring**: Automated quality assessment and improvement suggestions
- **Behavioral Clustering**: Group similar agent behaviors for pattern analysis

#### Developer Experience

- **IDE Extensions**: VS Code and PyCharm plugins for inline testing
- **Interactive Debugging**: Step-through debugging for agent executions
- **Visual Flow Builder**: Drag-and-drop agent testing pipeline creation
- **Template Library**: Pre-built testing templates for common agent patterns
- **Auto-Documentation**: Generate testing documentation from traces

### 🌐 Platform & Infrastructure

#### Cloud & Deployment

- **AgentCheck Cloud**: Hosted platform for teams and enterprises
- **Kubernetes Operator**: Native Kubernetes deployment and scaling
- **Docker Compose**: One-click local development environment
- **Serverless Support**: AWS Lambda, Azure Functions, Google Cloud Functions
- **Edge Computing**: Testing for edge-deployed agents

#### Ecosystem Integration

- **LangChain Integration**: Native support for LangChain agents and chains
- **AutoGen Integration**: Multi-agent conversation testing
- **CrewAI Integration**: Specialized crew-based agent testing
- **Custom Framework Support**: Plugin system for any agent framework
## 🏢 Enterprise Testing Standards

### How AgentCheck Achieves Enterprise-Grade Testing

#### 1. Compliance & Governance
```python
# Regulatory compliance testing
@agentcheck.compliance_test(
    standards=["SOX", "GDPR", "HIPAA"],
    audit_trail=True,
    data_retention_days=2555,  # 7 years
)
def financial_advisor_agent(query: str) -> str:
    # Agent implementation
    pass

# Test for compliance violations
failures = agentcheck.test_compliance(
    agent_func=financial_advisor_agent,
    test_cases=load_compliance_test_cases(),
    regulations=["financial_advice_disclosure", "data_privacy"],
)
```
#### 2. Quality Assurance Framework
```python
# Multi-dimensional quality testing
quality_metrics = agentcheck.QualityFramework([
    agentcheck.AccuracyMetric(threshold=0.95),
    agentcheck.SafetyMetric(harmful_content_threshold=0.0),
    agentcheck.BiasMetric(demographic_fairness=True),
    agentcheck.LatencyMetric(max_response_time_ms=2000),
    agentcheck.CostMetric(max_cost_per_request=0.10),
    agentcheck.ConsistencyMetric(behavioral_threshold=0.85),
])

# Enterprise-grade testing pipeline
test_results = quality_metrics.evaluate(
    agent_func=my_agent,
    test_dataset=enterprise_test_dataset,
    environments=["staging", "production"],
)
```
#### 3. Security & Safety Testing
```python
# Comprehensive security testing
security_tests = agentcheck.SecurityTestSuite([
    agentcheck.PromptInjectionTest(),
    agentcheck.DataLeakageTest(),
    agentcheck.AdversarialInputTest(),
    agentcheck.AuthorizationTest(),
    agentcheck.PIIDetectionTest(),
])

# Red-team testing
red_team_results = security_tests.run_red_team_scenarios(
    agent_func=my_agent,
    attack_vectors=["jailbreaking", "data_extraction", "privilege_escalation"],
)
```
#### 4. Performance & Scalability Testing
```python
# Load testing with realistic scenarios
load_test = agentcheck.LoadTest(
    concurrent_users=1000,
    ramp_up_time=300,    # 5 minutes
    test_duration=3600,  # 1 hour
    realistic_user_behavior=True,
)

performance_results = load_test.run(
    agent_func=my_agent,
    user_scenarios=enterprise_user_scenarios,
)

# SLA validation
sla_compliance = agentcheck.validate_sla(
    results=performance_results,
    requirements={
        "p95_latency_ms": 1500,
        "availability_percent": 99.9,
        "error_rate_percent": 0.1,
        "throughput_rps": 100,
    },
)
```
#### 5. Continuous Monitoring & Alerting
```python
# Production monitoring
monitor = agentcheck.ProductionMonitor(
    alert_channels=["slack", "email", "pagerduty"],
    thresholds={
        "error_rate": 0.01,       # 1% error rate
        "latency_p99": 3000,      # 3-second P99 latency
        "cost_per_hour": 50.0,    # $50/hour cost limit
        "behavioral_drift": 0.2,  # 20% behavior change
    },
)

# Real-time alerts
monitor.start_monitoring(
    agent_func=my_agent,
    baseline_name="production_v1.0",
)
```
### Enterprise Implementation Checklist

#### 📋 Testing Standards
- **Behavioral Consistency**: ≥85% consistency across test runs
- **Performance SLAs**: P95 latency <2s, 99.9% availability
- **Cost Controls**: Automated cost monitoring and alerts
- **Security Validation**: Regular red-team testing and vulnerability scans
- **Compliance Testing**: Automated regulatory compliance validation
- **Quality Gates**: Multi-stage testing pipeline with approval gates
#### 📈 Monitoring & Observability
- **Real-time Dashboards**: Executive and operational dashboards
- **Automated Alerting**: PagerDuty/Slack integration for critical issues
- **Audit Trails**: Complete audit logs for all agent interactions
- **Performance Baselines**: Established performance benchmarks
- **Business Metrics**: Custom KPIs aligned with business objectives
#### 🔒 Security & Governance
- **Access Controls**: Role-based access to testing and monitoring
- **Data Protection**: Encryption at rest and in transit
- **Incident Response**: Automated incident detection and response
- **Change Management**: Controlled deployment with rollback capabilities
- **Documentation**: Comprehensive testing and operational documentation
### ROI Metrics for Enterprise Adoption

**Risk Reduction:**
- 90% reduction in production agent failures
- 75% faster incident detection and resolution
- 60% reduction in compliance violations
**Cost Optimization:**
- 40% reduction in LLM API costs through optimization
- 50% reduction in manual testing effort
- 30% faster time-to-market for new agent features
**Quality Improvement:**
- 95% improvement in agent response consistency
- 80% reduction in customer complaints
- 99.9% uptime achievement for critical agent services
## 📄 License

MIT License - see the LICENSE file.
## 🤝 Contributing

Contributions are welcome! Please see `CONTRIBUTING.md` for guidelines.
Built for the era of AI agents 🤖✨