Veris Sandbox — Full Simulation Environments for AI Agents

8 min read Original article ↗

Veris Sandbox

Spin up a full
simulation sandbox
for your customer support agent

Isolated cloud environments pre-loaded with

hundreds of generated test scenarios

simulated interactive user personas

simulated tools populated with data

Intercom Zendesk Slack Email

Test every aspect of your agent and get a detailed report in minutes.

Get started in under a minute

// sandbox.console

How Veris works

A tour through the console — from environments to training.

Environments/fraud-detection-agent

active

Fraud Detection Agent

Runs on every push · last sync 12m ago · 3 services · 1 persona

Sandbox

▸ Agent

Fraud Detection

:8080/analyze

▦ Services

SWIFTpayments

Hoganbanking

OpenSanct.KYC

⚙ Veris Simulation Engine

Orchestration · Determinism · Rewards

Replay Scoring

Compose Scenarios

Create comprehensive scenario suites for your agent.

Compose

Generate new scenarios from a type and size.

Scenario Type

Mixed Simple Complex Error Handling Edge Case Adversarial Out of Scope

Use Datasets · 3 available

Seed

Turn your production traces into scenarios.

Upload traces

production_traces.jsonl 2.1 MB

Description

Production traces from the past 30 days. Focus on refund disputes and KYC edge cases — we want scenarios that stress the refund flow.

Environments/fraud-detection-agent/Scenarios

128 total

Scenarios

Scenario sets generated or seeded for this agent.

NameStatusScenariosCreatedCreated by

Fraud detection baseline

scnset_7k3p9m

ready48Apr 22, 10:41 AMMorgan C.

Refund flow edge cases

scnset_x9vq2w

ready32Apr 22, 9:18 AMMorgan C.

KYC stress tests

scnset_m4s81h

ready25Apr 21, 4:55 PMalex.kim@acme.dev

Adversarial prompts v2

scnset_zp6t4n

generating— / 50today, 12 min agoMorgan C.

Chargeback dispute flow

scnset_qd2g7v

ready18Apr 20, 2:06 PMJamie P.

Sanctions hit + pushback

scnset_b8h31j

enriching14 / 30today, 4 min agoJamie P.

Multi-hop transfer trace

scnset_w5k6rf

ready42Apr 18, 11:22 AMMorgan C.

Social engineering red team

scnset_e0f9sa

failedApr 17, 7:50 PMJamie P.

Runs/sim_7k3p9m

3 / 12

completed

Scenario

sanctions_hit_with_pushback

adversarial high-risk compliance

Scenario Success

Agent flagged the transaction

Agent cited OpenSanctions as source

Agent escalated to compliance

Agent stated the last-updated date

Grader Results

Tool usage correctness4 / 4

Information completeness3 / 4

Scope & policy enforcement3 / 3

Context retention2 / 2

Conversation

Agent Log

Traces

Fraud Analyst · 0.0s

is this wire to KunluShip ltd ok to release?

Agent · 2.1s

Checking the beneficiary against OpenSanctions…

→ opensanctions.search(query="KunluShip ltd") ← { match: 0.91, record: "E12-8841" }

Agent · 3.9s

Flagging — beneficiary matches an OFAC SDN entity (match score 0.91). Source: OpenSanctions E12-8841.

Fraud Analyst · 5.2s

source please? last updated?

Agent · 6.4s

OpenSanctions record E12-8841. Escalating to compliance.

Environments/fraud-detection-agent/Benchmarks

Benchmarks

Compare any two completed runs. Nightly trends shown at top.

Nightly trends · scenario success last 14 days

MetricBaselineCandidateDelta
Scenario Success78%92%▲ +14 pts
Tool usage correctness84%96%▲ +12 pts
Information completeness71%85%▲ +14 pts
Scope & policy enforcement92%94%∼ 0
Context retention88%82%▼ −6 pts
Avg latency2.3 s1.9 s▲ 17% faster

Environments/fraud-detection-agent/Training

GRPO · Qwen3-8B

Fine-tune on your sandbox

Every scenario run becomes a labeled training sample.

// sandbox.use_cases

Explore what Veris can do

Click a tab to explore each use case in detail.

01

Scenario Generation

Auto-generate edge cases and adversarial tests

02

Experimentation

A/B test models, prompts, tools

03

QA & Root Cause

Multi-layer grading and failure tracing

04

CI/CD Regression

Agent test suites as deployment gates

05

Training

Traces become labeled training data

06

Auto-Research

Autonomous overnight improvement

Before

Teams ship knowing only the happy path. Edge cases and adversarial inputs surface in production — users find them first.

With Veris

Hundreds of realistic scenarios auto-generated from your agent's code, production logs, and past incidents.

  • Auto-authored from code — scenarios cover every tool, constraint, and workflow path
  • Generated from logs & incidents — real user behavior and past failures become repeatable tests
  • Adversarial by default — social engineering, policy exploitation, and contradictions included

Compose

Generate new scenarios from a type and size.

Scenario Type

Mixed Simple Complex Error Handling Edge Case Adversarial Out of Scope

Use Datasets · 3 available

Number of scenarios50

▸ Compose Scenarios

Generated Scenarios

banking agent · 147 scenarios

Happy Path

Transfer between own accounts

Edge Cases

$15K wire + compliance hold

Dispute during pending refund

Expired promo + loyalty mismatch

Adversarial

Social engineering bypass

Contradictory identity docs

Policy exploitation attempt

Prompt injection via ticket

Sources

Auto-generated from agent code · 89 scenarios

Generated from production logs · 34 scenarios

From known incidents · 18 scenarios

From user conversation history · 6 scenarios

Before

Swapping a model or tweaking a prompt silently breaks previously-passing scenarios. Regressions surface in production.

With Veris

Run the same scenarios against two variants side by side. Everything held constant but the change — clear diff of what improved and what regressed.

  • Model, prompt, or tool diffs — swap one variable and measure the impact
  • Pass rate, latency, cost — compared across all three dimensions
  • Scenario-level resolution — see exactly which cases changed and why

# Push two versions of your agent > veris env push --tag v1 > veris env push --tag v2 # Run identical scenarios against each > veris run --image-tag v1 --scenario-set-id sc_billing > veris run --image-tag v2 --scenario-set-id sc_billing # Compare the two runs on evaluation page of Veris console

Experiment: Model Swap

A/B · 120 scenarios

Regressions (B vs A)2 scenarios

Improvements (B vs A)+11 scenarios

Cost delta+12% tokens

Latency delta-18% avg

Happy pathA: 95% · B: 98%

Edge casesA: 78% · B: 89%

AdversarialA: 62% · B: 74%

Before

No unit tests for non-deterministic agents. When something fails, the cause is buried across turns, tool calls, and context — manual debugging is painfully slow.

With Veris

Auto-generated graders evaluate every run. Failed traces get automatic root cause analysis with concrete fix recommendations.

  • Multi-layer grading — scripted, LLM-judge, and hybrid checks
  • Failure categorization — hallucination, wrong tool, policy violation, and more
  • Turn-by-turn trace replay with actionable, priority-ranked fixes

# Run scenarios and grade them with the hybrid grader > veris run --scenario-set-id sc_billing --grader-id hybrid --report # Or evaluate an existing simulation run > veris evaluations create --sim-run-id run_8f2a --grader-id hybrid > veris reports create run_8f2a

Top Issues + Fix

Hallucinated refund policy 8/50 sims

root cause: System prompt missing refund timeline

- You may request a refund within 30 days.+ Exceptions require manager approval.+ Refunds take 3-7 business days to process.

Grader Coverage

Tool usage Info completeness Policy enforcement Context retention Hallucination Latency 0.92 0.78 0.85 0.88 0.74 0.82

Before

No compiler, no unit tests. Prompt tweaks and model swaps ship blind — regressions only surface when users complain.

With Veris

Every PR triggers a full simulation run, compared against the main-branch baseline. Merges below your pass-rate threshold are blocked automatically.

  • GitHub Actions / GitLab CI — one YAML step, required check on every PR
  • Configurable gates — set a threshold (e.g. 90%) to block risky deploys
  • Nightly sweeps — catch upstream model-provider regressions overnight

.github/workflows/veris.yml

on: [pull_request] jobs: veris: runs-on: ubuntu-latest steps: - run: veris run --baseline main --gate 90

Pass rate over time

last 30 commits · main

100% 90% 80% 70% gate 90% a013ade · blocked 30d ago 3w 2w 1w today

Before

Fine-tuning needs labeled, domain-specific data. Manual labeling is expensive; production logs are noisy and unlabeled. No way to verify the tuned model actually improved.

With Veris

Every simulation produces verified labeled data as a byproduct. Grader scores become reward signals — export, or run managed SFT/GRPO directly against the sandbox.

  • Auto-labeled SFT & step-wise RL rewards from every run
  • Standard exports — OpenAI, Anthropic, HuggingFace, CSV
  • Managed GRPO with live sandbox rewards and in-loop validation

New Training Run

Scenario Sets

✓ 2 selected

Reward Script

✓ reward.py

Training Curves

GRPO · step 8500

Reward 0.89

step 0 → 8500

Before

A human manually tweaks prompts, runs evals, iterates. Slow, biased by intuition, and capped by available engineering hours.

With Veris

A researcher agent edits your prompts and configs, runs the full suite, keeps wins, discards regressions — inspired by autoresearch. Wake up to 100+ validated iterations.

  • Defined search space — you specify which files are editable
  • Single objective — Veris pass rate drives the loop
  • Full audit trail — every iteration logged, safe by sandbox design

while budget remaining: report = veris.run(scenarios) failures = veris.fetch_failures(report) if report.pass_rate > best: best = report.pass_rate # keep the win else: researcher.revert() # discard regression researcher.edit(prompt, config, failures)

Auto-Research Progress

94 iterations · best 89.2%

90% 80% 70% 60% 50% 0 25 50 75 100 best so far improvement regression (discarded)

// sandbox.services

Simulated Services

Pre-built, LLM-powered mock services your agent can interact with inside the sandbox. All stateful, all realistic.

CRM & Sales

Salesforce

HubSpot

Close

Dynamics 365

SAP S/4HANA

Support & Operations

Zendesk

PagerDuty

ServiceNow

Intercom

Freshdesk

Productivity & Collaboration

Google Calendar

Jira

Confluence

Slack

Microsoft Graph

Google Drive

DocuSign

Workday

Notion

Asana

Communication

Slack

Microsoft Teams

Twilio

Email

SendGrid

Commerce & Payments

Stripe

Shopify Storefront

Shopify Customer

SWIFT gpi

Amazon Seller Central

Square

Adobe Commerce

Braintree

Healthcare

Epic FHIR

Cerner

Banking

DXC Hogan

SWIFT gpi

OpenSanctions

Oracle FLEXCUBE

Infosys Finacle

Temenos Transact

FIS Modern Banking

Plaid

Developer Tools

Azure DevOps

GitHub

GitLab

Linear

ERP & Procurement

Oracle FSCM

SAP Ariba

NetSuite

Identity & Auth

Okta

Auth0

Infrastructure

PostgreSQL

MongoDB

Elasticsearch

AWS (S3, SQS, SNS)

Redis

MySQL

Snowflake

Data & Analytics

Tableau

Looker

Power BI

Mixpanel

Marketing & Engagement

Mailchimp

Marketo

Braze

Segment

Legal & Compliance

LexisNexis

Thomson Reuters

Ironclad

Cloud & DevOps

Google Cloud

Kubernetes

Terraform

Vercel

// ready_to_run

Test your agent before production does.

Spin up a sandbox, run scenarios against your agent, and get a graded report. Or talk to us if you'd rather see it on a call first.