GitHub - hvardhan878/agentcheck

9 min read Original article β†—

agentcheck

agentcheck: Trace β‹… Replay β‹… Test your AI agents like real software.

PyPI version Python 3.9+ License: MIT

AgentCheck is a minimal but complete toolkit for tracing, replaying, diffing, and testing AI agent executions. Think of it as version control and testing for your AI agents.

πŸš€ Install

⚑ Quickstart Demo

export OPENAI_API_KEY=sk-...

# 1️⃣ Capture baseline trace
python demo/demo_agent.py --output baseline.json

# 2️⃣ Modify the prompt inside demo_agent.py (e.g. change tone)
# 3️⃣ Replay with new code/model  
agentcheck replay baseline.json --output new.json

# 4️⃣ See what changed
agentcheck diff baseline.json new.json

# 5️⃣ Assert the new output still mentions the user's name
agentcheck assert new.json --contains "John Doe"

# πŸ†• 6️⃣ Test deterministic behavior
python demo/demo_deterministic.py

Or run the complete demo:

🎯 Features

Feature Description CLI Command Python API
Trace Capture agent execution (prompts, outputs, costs, timing) agentcheck trace <command> @agentcheck.trace()
Replay Re-run trace against current code/model agentcheck replay trace.json agentcheck.replay_trace()
Diff Compare traces and highlight changes agentcheck diff trace_a.json trace_b.json agentcheck.diff_traces()
Assert Test trace contents (CI-friendly) agentcheck assert trace.json --contains "foo" agentcheck.assert_trace()
πŸ†• Deterministic Testing Test behavioral consistency of non-deterministic agents (Python API only) @agentcheck.deterministic_replay()
πŸ†• Analytics Dashboard Beautiful web GUI for trace analysis and testing insights python launch_dashboard.py Web interface

πŸ“– Usage

Tracing with Decorator

import agentcheck
import openai

@agentcheck.trace(output="my_trace.json")
def my_agent(user_input: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content

# Automatically traces execution and saves to my_trace.json
result = my_agent("Hello, world!")

πŸ†• Deterministic Replay Testing

The Problem: AI agents are non-deterministic - they produce different outputs for identical inputs, making traditional testing impossible.

The Solution: AgentCheck's deterministic replay testing learns your agent's behavioral patterns and detects when behavior changes unexpectedly.

import agentcheck
import openai

@agentcheck.deterministic_replay(
    consistency_threshold=0.8,  # 80% behavioral consistency required
    baseline_runs=5,           # Run 5 times to establish baseline
    baseline_name="my_agent"   # Name for this baseline
)
def my_agent(user_input: str) -> str:
    with agentcheck.trace() as trace:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_input}],
            temperature=0.7  # Non-deterministic!
        )
        
        # Record the LLM call
        trace.add_llm_call(
            messages=[{"role": "user", "content": user_input}],
            response={"content": response.choices[0].message.content},
            model="gpt-4o-mini"
        )
        
        return response.choices[0].message.content

# Step 1: Establish behavioral baseline
replayer = my_agent._deterministic_replayer
test_inputs = ["What is Python?", "How do I install packages?"]

replayer.establish_baseline(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent"
)

# Step 2: Test current agent against baseline
failures = replayer.test_consistency(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent"
)

if failures:
    print(f"❌ {len(failures)} tests failed - agent behavior changed!")
    for failure in failures:
        print(f"Input: {failure.input_data}")
        print(f"Consistency Score: {failure.consistency_score:.3f}")
else:
    print("βœ… All tests passed - agent behavior is consistent!")

What it detects:

  • Changes in reasoning patterns
  • Different tool usage sequences
  • Altered response structures
  • Performance regressions
  • Error rate changes

Perfect for:

  • Regression testing after prompt changes
  • Model version upgrades
  • Code refactoring validation
  • CI/CD pipeline integration

πŸ†• Analytics Dashboard

Get beautiful insights into your agent performance with the built-in web dashboard:

# Launch the dashboard
python launch_dashboard.py

# Or manually with streamlit
pip install streamlit plotly pandas numpy
streamlit run agentcheck_dashboard.py

Dashboard Features:

  • πŸ“Š Overview: Key metrics, traces over time, model usage distribution
  • πŸ” Trace Analysis: Detailed step-by-step execution analysis
  • πŸ§ͺ Deterministic Testing: Baseline management and consistency trends
  • πŸ’° Cost Analysis: Cost breakdowns by model and time periods

What you can track:

  • Total traces and execution costs
  • Error rates and failure patterns
  • LLM model usage and performance
  • Behavioral consistency trends
  • Cost optimization opportunities

The dashboard automatically loads data from your traces/ and baselines/ directories and provides real-time analytics as you develop and test your agents.

Tracing with Context Manager

import agentcheck

with agentcheck.Trace(output="trace.json") as t:
    # Your agent code here
    messages = [{"role": "user", "content": "Hello"}]
    
    # Manually add LLM calls to trace
    response = openai.chat.completions.create(
        model="gpt-4o-mini", messages=messages
    )
    
    t.add_llm_call(
        messages=messages,
        response={"content": response.choices[0].message.content, "usage": response.usage},
        model="gpt-4o-mini"
    )

CLI Commands

# Trace a Python script
agentcheck trace "python my_agent.py" --output trace.json

# Replay a trace with a different model
agentcheck replay trace.json --model gpt-4 --output new_trace.json

# Compare two traces
agentcheck diff baseline.json new_trace.json

# Assert trace contains expected content
agentcheck assert trace.json --contains "expected output"

# Assert with JSONPath
agentcheck assert trace.json --jsonpath "$.steps[-1].output.content" --contains "John"

# Assert cost and step constraints
agentcheck assert trace.json --max-cost 0.05 --min-steps 1 --max-steps 10

# Pretty-print a trace
agentcheck show trace.json

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Your Agent    │───▢│ agentcheck   │───▢│  trace.json     β”‚
β”‚                 β”‚    β”‚   tracer     β”‚    β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                    β”‚
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β–Ό                     β–Ό                     β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚     replay      β”‚   β”‚      diff       β”‚   β”‚     assert      β”‚
                    β”‚   (re-execute)  β”‚   β”‚   (compare)     β”‚   β”‚    (test)       β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                    β”‚
                                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                          β”‚ πŸ†• deterministic    β”‚
                                          β”‚ behavioral testing  β”‚
                                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ Trace Format

AgentCheck uses a standardized JSON schema for traces:

{
  "trace_id": "uuid",
  "version": "1.0", 
  "start_time": "2024-01-01T12:00:00Z",
  "end_time": "2024-01-01T12:00:05Z",
  "metadata": {
    "total_cost": 0.0023,
    "function_name": "my_agent"
  },
  "steps": [
    {
      "step_id": "uuid",
      "type": "llm_call",
      "start_time": "2024-01-01T12:00:01Z", 
      "end_time": "2024-01-01T12:00:04Z",
      "input": {
        "messages": [...],
        "model": "gpt-4o-mini"
      },
      "output": {
        "content": "Agent response...",
        "usage": {"prompt_tokens": 10, "completion_tokens": 20},
        "cost": 0.0023
      }
    }
  ]
}

πŸ§ͺ Testing & CI Integration

AgentCheck is designed for CI/CD pipelines:

# Traditional trace testing
agentcheck replay baseline_trace.json --output ci_trace.json
agentcheck assert ci_trace.json --contains "expected behavior" --max-cost 0.10

# πŸ†• Deterministic behavioral testing
python -c "
import agentcheck
from my_agent import my_agent

replayer = my_agent._deterministic_replayer
test_inputs = ['test1', 'test2', 'test3']

failures = replayer.test_consistency(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name='production'
)

if failures:
    print(f'❌ {len(failures)} behavioral consistency tests failed')
    exit(1)
else:
    print('βœ… All behavioral tests passed')
    exit(0)
"

# Exit codes
# 0 = success
# 1 = assertion failed or error

πŸ› οΈ Development

# Install in development mode
git clone https://github.com/agentcheck/agentcheck
cd agentcheck
pip install -e ".[dev]"

# Run tests
pytest

# Format code  
ruff format .

# Type check
mypy agentcheck/

πŸ—ΊοΈ Roadmap

πŸ”§ Core Framework Improvements

Enhanced Tracing & Observability

  • Multi-Agent Tracing: Support for complex agent orchestrations and conversations
  • Real-time Streaming: Live trace streaming for long-running agents
  • Custom Metrics: User-defined KPIs and business metrics tracking
  • Performance Profiling: Detailed timing analysis and bottleneck detection
  • Memory Usage Tracking: Monitor agent memory consumption and optimization

Advanced Testing Capabilities

  • Property-Based Testing: Generate test cases automatically based on agent specifications
  • Mutation Testing: Automatically modify prompts/code to test robustness
  • Load Testing: Concurrent agent execution testing with performance metrics
  • A/B Testing Framework: Built-in support for comparing agent variants
  • Regression Test Suite: Automated detection of performance and quality regressions

Enterprise Integration

  • CI/CD Plugins: Native GitHub Actions, GitLab CI, Jenkins integrations
  • Database Backends: PostgreSQL, MongoDB support for large-scale trace storage
  • SSO & RBAC: Enterprise authentication and role-based access control
  • Audit Logging: Comprehensive audit trails for compliance requirements
  • API Gateway: REST/GraphQL APIs for enterprise system integration

🎯 Specialized Features

Multi-Modal Agent Support

  • Vision Agent Testing: Image/video input tracing and consistency testing
  • Audio Agent Testing: Speech-to-text and text-to-speech agent validation
  • Document Processing: PDF, Word, Excel agent testing capabilities
  • Code Generation: Specialized testing for code-generating agents

Advanced Analytics & Insights

  • Predictive Analytics: ML-powered prediction of agent behavior changes
  • Anomaly Detection: Automatic detection of unusual agent behaviors
  • Cost Optimization: AI-powered recommendations for cost reduction
  • Quality Scoring: Automated quality assessment and improvement suggestions
  • Behavioral Clustering: Group similar agent behaviors for pattern analysis

Developer Experience

  • IDE Extensions: VS Code, PyCharm plugins for inline testing
  • Interactive Debugging: Step-through debugging for agent executions
  • Visual Flow Builder: Drag-and-drop agent testing pipeline creation
  • Template Library: Pre-built testing templates for common agent patterns
  • Auto-Documentation: Generate testing documentation from traces

πŸš€ Platform & Infrastructure

Cloud & Deployment

  • AgentCheck Cloud: Hosted platform for teams and enterprises
  • Kubernetes Operator: Native Kubernetes deployment and scaling
  • Docker Compose: One-click local development environment
  • Serverless Support: AWS Lambda, Azure Functions, Google Cloud Functions
  • Edge Computing: Testing for edge-deployed agents

Ecosystem Integration

  • LangChain Integration: Native support for LangChain agents and chains
  • AutoGen Integration: Multi-agent conversation testing
  • CrewAI Integration: Specialized crew-based agent testing
  • Custom Framework Support: Plugin system for any agent framework

🏒 Enterprise Testing Standards

How AgentCheck Achieves Enterprise-Grade Testing

1. Compliance & Governance

# Regulatory compliance testing
@agentcheck.compliance_test(
    standards=["SOX", "GDPR", "HIPAA"],
    audit_trail=True,
    data_retention_days=2555  # 7 years
)
def financial_advisor_agent(query: str) -> str:
    # Agent implementation
    pass

# Test for compliance violations
failures = agentcheck.test_compliance(
    agent_func=financial_advisor_agent,
    test_cases=load_compliance_test_cases(),
    regulations=["financial_advice_disclosure", "data_privacy"]
)

2. Quality Assurance Framework

# Multi-dimensional quality testing
quality_metrics = agentcheck.QualityFramework([
    agentcheck.AccuracyMetric(threshold=0.95),
    agentcheck.SafetyMetric(harmful_content_threshold=0.0),
    agentcheck.BiasMetric(demographic_fairness=True),
    agentcheck.LatencyMetric(max_response_time_ms=2000),
    agentcheck.CostMetric(max_cost_per_request=0.10),
    agentcheck.ConsistencyMetric(behavioral_threshold=0.85)
])

# Enterprise-grade testing pipeline
test_results = quality_metrics.evaluate(
    agent_func=my_agent,
    test_dataset=enterprise_test_dataset,
    environments=["staging", "production"]
)

3. Security & Safety Testing

# Comprehensive security testing
security_tests = agentcheck.SecurityTestSuite([
    agentcheck.PromptInjectionTest(),
    agentcheck.DataLeakageTest(),
    agentcheck.AdversarialInputTest(),
    agentcheck.AuthorizationTest(),
    agentcheck.PIIDetectionTest()
])

# Red team testing
red_team_results = security_tests.run_red_team_scenarios(
    agent_func=my_agent,
    attack_vectors=["jailbreaking", "data_extraction", "privilege_escalation"]
)

4. Performance & Scalability Testing

# Load testing with realistic scenarios
load_test = agentcheck.LoadTest(
    concurrent_users=1000,
    ramp_up_time=300,  # 5 minutes
    test_duration=3600,  # 1 hour
    realistic_user_behavior=True
)

performance_results = load_test.run(
    agent_func=my_agent,
    user_scenarios=enterprise_user_scenarios
)

# SLA validation
sla_compliance = agentcheck.validate_sla(
    results=performance_results,
    requirements={
        "p95_latency_ms": 1500,
        "availability_percent": 99.9,
        "error_rate_percent": 0.1,
        "throughput_rps": 100
    }
)

5. Continuous Monitoring & Alerting

# Production monitoring
monitor = agentcheck.ProductionMonitor(
    alert_channels=["slack", "email", "pagerduty"],
    thresholds={
        "error_rate": 0.01,  # 1% error rate
        "latency_p99": 3000,  # 3 second P99 latency
        "cost_per_hour": 50.0,  # $50/hour cost limit
        "behavioral_drift": 0.2  # 20% behavior change
    }
)

# Real-time alerts
monitor.start_monitoring(
    agent_func=my_agent,
    baseline_name="production_v1.0"
)

Enterprise Implementation Checklist

πŸ“‹ Testing Standards

  • Behavioral Consistency: β‰₯85% consistency across test runs
  • Performance SLAs: P95 latency <2s, 99.9% availability
  • Cost Controls: Automated cost monitoring and alerts
  • Security Validation: Regular red team testing and vulnerability scans
  • Compliance Testing: Automated regulatory compliance validation
  • Quality Gates: Multi-stage testing pipeline with approval gates

πŸ“Š Monitoring & Observability

  • Real-time Dashboards: Executive and operational dashboards
  • Automated Alerting: PagerDuty/Slack integration for critical issues
  • Audit Trails: Complete audit logs for all agent interactions
  • Performance Baselines: Established performance benchmarks
  • Business Metrics: Custom KPIs aligned with business objectives

πŸ”’ Security & Governance

  • Access Controls: Role-based access to testing and monitoring
  • Data Protection: Encryption at rest and in transit
  • Incident Response: Automated incident detection and response
  • Change Management: Controlled deployment with rollback capabilities
  • Documentation: Comprehensive testing and operational documentation

ROI Metrics for Enterprise Adoption

Risk Reduction:

  • 90% reduction in production agent failures
  • 75% faster incident detection and resolution
  • 60% reduction in compliance violations

Cost Optimization:

  • 40% reduction in LLM API costs through optimization
  • 50% reduction in manual testing effort
  • 30% faster time-to-market for new agent features

Quality Improvement:

  • 95% improvement in agent response consistency
  • 80% reduction in customer complaints
  • 99.9% uptime achievement for critical agent services

πŸ“„ License

MIT License - see LICENSE file.

🀝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.


Built for the era of AI agents πŸ€–βœ¨