Veris Sandbox
Spin up a full simulation sandbox for your customer support agent
Isolated cloud environments pre-loaded with:
- hundreds of generated test scenarios
- simulated interactive user personas
- simulated tools populated with data
Intercom
Zendesk
Slack
Email
Test every aspect of your agent and get a detailed report in minutes.
Get started in under a minute
// sandbox.console
How Veris works
A tour through the console — from environments to training.
Environments/fraud-detection-agent
active
Fraud Detection Agent
Runs on every push · last sync 12m ago · 3 services · 1 persona
Sandbox
▸ Agent
Fraud Detection
:8080/analyze
▦ Services
SWIFT · payments
Hogan · banking
OpenSanct. · KYC
⚙ Veris Simulation Engine
Orchestration · Determinism · Rewards
Replay Scoring
Compose Scenarios
Create comprehensive scenario suites for your agent.
Compose
Generate new scenarios from a type and size.
Scenario Type
Mixed · Simple · Complex · Error Handling · Edge Case · Adversarial · Out of Scope
Use Datasets · 3 available
Seed
Turn your production traces into scenarios.
Upload traces
production_traces.jsonl 2.1 MB
Description
Production traces from the past 30 days. Focus on refund disputes and KYC edge cases — we want scenarios that stress the refund flow.
Environments/fraud-detection-agent/Scenarios
128 total
Scenarios
Scenario sets generated or seeded for this agent.
| Name | ID | Status | Scenarios | Created | Created by |
|---|---|---|---|---|---|
| Fraud detection baseline | scnset_7k3p9m | ready | 48 | Apr 22, 10:41 AM | Morgan C. |
| Refund flow edge cases | scnset_x9vq2w | ready | 32 | Apr 22, 9:18 AM | Morgan C. |
| KYC stress tests | scnset_m4s81h | ready | 25 | Apr 21, 4:55 PM | alex.kim@acme.dev |
| Adversarial prompts v2 | scnset_zp6t4n | generating | — / 50 | today, 12 min ago | Morgan C. |
| Chargeback dispute flow | scnset_qd2g7v | ready | 18 | Apr 20, 2:06 PM | Jamie P. |
| Sanctions hit + pushback | scnset_b8h31j | enriching | 14 / 30 | today, 4 min ago | Jamie P. |
| Multi-hop transfer trace | scnset_w5k6rf | ready | 42 | Apr 18, 11:22 AM | Morgan C. |
| Social engineering red team | scnset_e0f9sa | failed | — | Apr 17, 7:50 PM | Jamie P. |
Runs/sim_7k3p9m
3 / 12
completed
Scenario
sanctions_hit_with_pushback
adversarial · high-risk · compliance
Scenario Success
✓ Agent flagged the transaction
✓ Agent cited OpenSanctions as source
✓ Agent escalated to compliance
✕ Agent stated the last-updated date
Grader Results
Tool usage correctness: 4 / 4
Information completeness: 3 / 4
Scope & policy enforcement: 3 / 3
Context retention: 2 / 2
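The grader rows above can be rolled up into one number per run. This is a minimal sketch that assumes each grader reports an (earned, possible) pair; the dictionary layout and both helper names are illustrative and not part of any Veris API.

```python
from fractions import Fraction

# Grader results copied from the run above, as (points earned, points possible).
grader_results = {
    "Tool usage correctness": (4, 4),
    "Information completeness": (3, 4),
    "Scope & policy enforcement": (3, 3),
    "Context retention": (2, 2),
}


def overall_score(results):
    """Return earned / possible across all graders as an exact fraction."""
    earned = sum(e for e, _ in results.values())
    possible = sum(p for _, p in results.values())
    return Fraction(earned, possible)


def weakest_grader(results):
    """Return the grader name with the lowest earned / possible ratio."""
    return min(results, key=lambda name: Fraction(*results[name]))


score = overall_score(grader_results)
print(f"{score.numerator} / {score.denominator} = {float(score):.1%}")
print("weakest:", weakest_grader(grader_results))
```

For this run the roll-up is 12 of 13 points, and the weakest grader is Information completeness, which lines up with the one failed success check.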
Conversation
Agent Log
Traces
Fraud Analyst · 0.0s
is this wire to KunluShip ltd ok to release?
Agent · 2.1s
Checking the beneficiary against OpenSanctions…
→ opensanctions.search(query="KunluShip ltd")
← { match: 0.91, record: "E12-8841" }
Agent · 3.9s
Flagging — beneficiary matches an OFAC SDN entity (match score 0.91). Source: OpenSanctions E12-8841.
Fraud Analyst · 5.2s
source please? last updated?
Agent · 6.4s
OpenSanctions record E12-8841. Escalating to compliance.
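The four success checks for this scenario could be approximated with scripted rules over the agent's messages. The sketch below is a crude keyword-and-regex stand-in, not how Veris graders are implemented; the check names, the matching rules, and the ISO date pattern are all assumptions, and the agent messages are lightly abridged from the transcript above.

```python
import re

# Agent turns from the conversation above (punctuation lightly abridged).
agent_messages = [
    "Checking the beneficiary against OpenSanctions…",
    "Flagging: beneficiary matches an OFAC SDN entity (match score 0.91). "
    "Source: OpenSanctions E12-8841.",
    "OpenSanctions record E12-8841. Escalating to compliance.",
]

# Hypothetical pattern: treat an ISO date (YYYY-MM-DD) as a last-updated date.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def run_checks(messages):
    """Evaluate each scripted check against the concatenated agent text."""
    text = " ".join(messages).lower()
    return {
        "flagged_transaction": "flagging" in text,
        "cited_opensanctions": "source: opensanctions" in text,
        "escalated_to_compliance": "escalating to compliance" in text,
        "stated_last_updated_date": bool(ISO_DATE.search(text)),
    }


for name, passed in run_checks(agent_messages).items():
    print("✓" if passed else "✕", name)
```

Under these rules the first three checks pass and the date check fails, because the agent never answers the analyst's "last updated?" question.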
Environments/fraud-detection-agent/Benchmarks
Benchmarks
Compare any two completed runs. Nightly trends shown at top.
Nightly trends · scenario success last 14 days
| Metric | Baseline | Candidate | Delta |
|---|---|---|---|
| Scenario Success | 78% | 92% | ▲ +14 pts |
| Tool usage correctness | 84% | 96% | ▲ +12 pts |
| Information completeness | 71% | 85% | ▲ +14 pts |
| Scope & policy enforcement | 92% | 94% | ∼ 0 |
| Context retention | 88% | 82% | ▼ −6 pts |
| Avg latency | 2.3 s | 1.9 s | ▲ 17% faster |
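The Delta column can be reproduced from the Baseline and Candidate columns. In this sketch the 2-point tolerance is an assumption, chosen only so that the 92% to 94% change reads as flat the way the table shows it; it is not a documented Veris rule.

```python
# Values copied from the benchmark table above: metric -> (baseline %, candidate %).
metrics = {
    "Scenario Success": (78, 92),
    "Tool usage correctness": (84, 96),
    "Information completeness": (71, 85),
    "Scope & policy enforcement": (92, 94),
    "Context retention": (88, 82),
}


def classify(baseline, candidate, tolerance=2):
    """Return (delta in points, label), treating small changes as flat.

    The tolerance is an assumed noise threshold, not a Veris setting.
    """
    delta = candidate - baseline
    if abs(delta) <= tolerance:
        return delta, "flat"
    return delta, "up" if delta > 0 else "down"


def latency_speedup_pct(baseline_s, candidate_s):
    """Latency reduction as a percentage of the baseline latency."""
    return 100 * (baseline_s - candidate_s) / baseline_s


for name, (b, c) in metrics.items():
    delta, label = classify(b, c)
    print(f"{name}: {delta:+d} pts ({label})")
print(f"Avg latency: {latency_speedup_pct(2.3, 1.9):.0f}% faster")
```

Context retention is the only metric labeled down, which makes it the one to inspect before promoting the candidate.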
Environments/fraud-detection-agent/Training
GRPO · Qwen3-8B
Fine-tune on your sandbox
Every scenario run becomes a labeled training sample.
// sandbox.use_cases
Explore what Veris can do
01 · Scenario Generation · Auto-generate edge cases and adversarial tests
02 · Experimentation · A/B test models, prompts, tools
03 · QA & Root Cause · Multi-layer grading and failure tracing
04 · CI/CD Regression · Agent test suites as deployment gates
05 · Training · Traces become labeled training data
06 · Auto-Research · Autonomous overnight improvement
Before
Teams ship knowing only the happy path. Edge cases and adversarial inputs surface in production — users find them first.
With Veris
Hundreds of realistic scenarios auto-generated from your agent's code, production logs, and past incidents.
- Auto-authored from code — scenarios cover every tool, constraint, and workflow path
- Generated from logs & incidents — real user behavior and past failures become repeatable tests
- Adversarial by default — social engineering, policy exploitation, and contradictions included
Compose
Generate new scenarios from a type and size.
Scenario Type
Mixed · Simple · Complex · Error Handling · Edge Case · Adversarial · Out of Scope
Use Datasets · 3 available
Number of scenarios: 50
▸ Compose Scenarios
Generated Scenarios
banking agent · 147 scenarios
Happy Path
Transfer between own accounts
Edge Cases
$15K wire + compliance hold
Dispute during pending refund
Expired promo + loyalty mismatch
Adversarial
Social engineering bypass
Contradictory identity docs
Policy exploitation attempt
Prompt injection via ticket
Sources
Auto-generated from agent code · 89 scenarios
Generated from production logs · 34 scenarios
From known incidents · 18 scenarios
From user conversation history · 6 scenarios
Before
Swapping a model or tweaking a prompt silently breaks previously-passing scenarios. Regressions surface in production.
With Veris
Run the same scenarios against two variants side by side. Everything held constant but the change — clear diff of what improved and what regressed.
- Model, prompt, or tool diffs — swap one variable and measure the impact
- Pass rate, latency, cost — compared across all three dimensions
- Scenario-level resolution — see exactly which cases changed and why
# Push two versions of your agent
> veris env push --tag v1
> veris env push --tag v2

# Run identical scenarios against each
> veris run --image-tag v1 --scenario-set-id sc_billing
> veris run --image-tag v2 --scenario-set-id sc_billing

# Compare the two runs on the evaluation page of the Veris console
Experiment: Model Swap
A/B · 120 scenarios
Regressions (B vs A): 2 scenarios
Improvements (B vs A): +11 scenarios
Cost delta: +12% tokens
Latency delta: -18% avg
Happy path · A: 95% · B: 98%
Edge cases · A: 78% · B: 89%
Adversarial · A: 62% · B: 74%
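Scenario-level resolution comes down to comparing the two runs scenario by scenario. This is a sketch under the assumption that each run can be reduced to a mapping from scenario ID to pass/fail; the scenario IDs and results below are made up for illustration and do not come from a real Veris export.

```python
def diff_runs(run_a, run_b):
    """Compare two {scenario_id: passed} mappings over their shared scenarios.

    Returns (regressions, improvements): scenarios that passed in A but fail
    in B, and scenarios that failed in A but pass in B.
    """
    shared = run_a.keys() & run_b.keys()
    regressions = sorted(s for s in shared if run_a[s] and not run_b[s])
    improvements = sorted(s for s in shared if not run_a[s] and run_b[s])
    return regressions, improvements


def pass_rate(run):
    """Fraction of scenarios that passed in one run."""
    return sum(run.values()) / len(run)


# Hypothetical results for variants A and B on the same four scenarios.
run_a = {
    "transfer_own_accounts": True,
    "wire_compliance_hold": False,
    "social_engineering_bypass": False,
    "prompt_injection_ticket": True,
}
run_b = {
    "transfer_own_accounts": True,
    "wire_compliance_hold": True,
    "social_engineering_bypass": True,
    "prompt_injection_ticket": False,
}

regressions, improvements = diff_runs(run_a, run_b)
print("regressions:", regressions)
print("improvements:", improvements)
print(f"pass rate A: {pass_rate(run_a):.0%} · B: {pass_rate(run_b):.0%}")
```

Reporting regressions separately from the aggregate pass rate matters because B can score higher overall while still breaking a scenario that A handled.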
Before
No unit tests for non-deterministic agents. When something fails, the cause is buried across turns, tool calls, and context — manual debugging is painfully slow.
With Veris
Auto-generated graders evaluate every run. Failed traces get automatic root cause analysis with concrete fix recommendations.
- Multi-layer grading — scripted, LLM-judge, and hybrid checks
- Failure categorization — hallucination, wrong tool, policy violation, and more
- Turn-by-turn trace replay with actionable, priority-ranked fixes
# Run scenarios and grade them with the hybrid grader
> veris run --scenario-set-id sc_billing --grader-id hybrid --report

# Or evaluate an existing simulation run
> veris evaluations create --sim-run-id run_8f2a --grader-id hybrid
> veris reports create run_8f2a
Top Issues + Fix
Hallucinated refund policy 8/50 sims
root cause: System prompt missing refund timeline
- You may request a refund within 30 days.
+ Exceptions require manager approval.
+ Refunds take 3-7 business days to process.
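A "top issues" view like the one above can be built by tallying failed simulations by category and ranking them by frequency. This sketch assumes each failure arrives as a record with a `category` field; the record shape and every count except the 8 of 50 hallucination figure are hypothetical.

```python
from collections import Counter


def top_issues(failures, total_sims):
    """Rank failure categories by count and report each as a share of all sims."""
    counts = Counter(f["category"] for f in failures)
    return [
        (category, count, count / total_sims)
        for category, count in counts.most_common()
    ]


# Hypothetical failure records; only the hallucination count mirrors the page.
failures = (
    [{"category": "hallucination"}] * 8
    + [{"category": "wrong_tool"}] * 3
    + [{"category": "policy_violation"}] * 1
)

for category, count, share in top_issues(failures, total_sims=50):
    print(f"{category}: {count}/50 sims ({share:.0%})")
```

Ranking by frequency is the simplest priority order; a real report might also weight categories by severity.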
Grader Coverage
Before
No compiler, no unit tests. Prompt tweaks and model swaps ship blind — regressions only surface when users complain.
With Veris
Every PR triggers a full simulation run, compared against the main-branch baseline. Merges below your pass-rate threshold are blocked automatically.
- GitHub Actions / GitLab CI — one YAML step, required check on every PR
- Configurable gates — set a threshold (e.g. 90%) to block risky deploys
- Nightly sweeps — catch upstream model-provider regressions overnight
.github/workflows/veris.yml
on: [pull_request]
jobs:
  veris:
    runs-on: ubuntu-latest
    steps:
      - run: veris run --baseline main --gate 90
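A pass-rate gate such as `--gate 90` presumably reduces to a threshold comparison that decides the step's exit status. The sketch below illustrates that logic only; the function names and the fail-closed behavior on empty runs are assumptions, and the source does not describe how the Veris CLI implements the gate.

```python
def gate(passed, total, threshold_pct):
    """Return True when the pass rate meets or exceeds the threshold."""
    if total == 0:
        return False  # assumption: a run with no scenarios fails closed
    return 100 * passed / total >= threshold_pct


def exit_code(passed, total, threshold_pct):
    """Map the gate result to a CI exit status: 0 allows the merge, 1 blocks it."""
    return 0 if gate(passed, total, threshold_pct) else 1


print(exit_code(46, 50, 90))  # 92% pass rate
print(exit_code(44, 50, 90))  # 88% pass rate
```

A CI system treats a non-zero exit status as a failed step, which is what turns the check into a merge blocker once it is marked as required.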
Pass rate over time
last 30 commits · main
Before
Fine-tuning needs labeled, domain-specific data. Manual labeling is expensive; production logs are noisy and unlabeled. No way to verify the tuned model actually improved.
With Veris
Every simulation produces verified labeled data as a byproduct. Grader scores become reward signals — export, or run managed SFT/GRPO directly against the sandbox.
- Auto-labeled SFT & step-wise RL rewards from every run
- Standard exports — OpenAI, Anthropic, HuggingFace, CSV
- Managed GRPO with live sandbox rewards and in-loop validation
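To make "grader scores become reward signals" concrete, here is a sketch of one way a reward script might map grader results to a scalar, followed by the group-relative normalization that GRPO-style training applies across several rollouts of the same scenario. The function names, the unweighted reward, and the use of population standard deviation are assumptions; the contents of Veris's `reward.py` are not shown in the source.

```python
from statistics import mean, pstdev


def reward_from_graders(results):
    """Map {grader: (earned, possible)} to a reward in [0, 1]."""
    earned = sum(e for e, _ in results.values())
    possible = sum(p for _, p in results.values())
    return earned / possible if possible else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of the same scenario with different scores.
group = [
    {"tools": (4, 4), "completeness": (4, 4)},
    {"tools": (4, 4), "completeness": (3, 4)},
    {"tools": (3, 4), "completeness": (3, 4)},
    {"tools": (2, 4), "completeness": (3, 4)},
]
rewards = [reward_from_graders(r) for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are relative to the group mean, a rollout is rewarded for beating its siblings on the same scenario rather than for its absolute score.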
New Training Run
Scenario Sets
✓ 2 selected
Reward Script
✓ reward.py
Training Curves
GRPO · step 8500
Reward 0.89
step 0 → 8500
Before
A human manually tweaks prompts, runs evals, iterates. Slow, biased by intuition, and capped by available engineering hours.
With Veris
A researcher agent edits your prompts and configs, runs the full suite, keeps wins, discards regressions — inspired by autoresearch. Wake up to 100+ validated iterations.
- Defined search space — you specify which files are editable
- Single objective — Veris pass rate drives the loop
- Full audit trail — every iteration logged, safe by sandbox design
best = 0.0
while budget_remaining:  # pseudocode: loop until the iteration budget is spent
    report = veris.run(scenarios)
    failures = veris.fetch_failures(report)
    if report.pass_rate > best:
        best = report.pass_rate  # keep the win
    else:
        researcher.revert()  # discard regression
    researcher.edit(prompt, config, failures)
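The keep-or-revert loop can be run end to end against a toy objective. In this sketch everything is a stand-in: `evaluate` replaces a full suite run, `propose` replaces the researcher agent, and the single `temperature` setting is a hypothetical search space. It is also structured as propose-then-evaluate rather than edit-in-place-then-revert, which has the same effect of never keeping a regression.

```python
import random


def evaluate(config):
    """Stand-in for a suite run: a made-up pass rate that peaks at 0.3."""
    return max(0.0, 1.0 - abs(config["temperature"] - 0.3))


def propose(config, rng):
    """Stand-in for the researcher: nudge the one editable setting by 0.1."""
    step = rng.choice([-0.1, 0.1])
    value = min(1.0, max(0.0, config["temperature"] + step))
    return {**config, "temperature": round(value, 2)}


def auto_research(config, budget, seed=0):
    """Hill-climb on pass rate, keeping wins and dropping regressions."""
    rng = random.Random(seed)
    best_config, best = config, evaluate(config)
    history = [best]  # audit trail: best pass rate after each iteration
    for _ in range(budget):
        candidate = propose(best_config, rng)
        score = evaluate(candidate)
        if score > best:
            best_config, best = candidate, score  # keep the win
        # otherwise the candidate is discarded, which acts as the revert
        history.append(best)
    return best_config, best, history


start = {"temperature": 0.9}
final_config, final_score, history = auto_research(start, budget=50)
print(final_config, round(final_score, 3), len(history))
```

The history list plays the role of the audit trail: because only improvements are kept, the recorded best score never decreases.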
Auto-Research Progress
94 iterations · best 89.2%
// sandbox.services
Simulated Services
Pre-built, LLM-powered mock services your agent can interact with inside the sandbox. All stateful, all realistic.
CRM & Sales
Salesforce
HubSpot
Close
Dynamics 365
SAP S/4HANA
Support & Operations
Zendesk
PagerDuty
ServiceNow
Intercom
Freshdesk
Productivity & Collaboration
Google Calendar
Jira
Confluence
Slack
Microsoft Graph
Google Drive
DocuSign
Workday
Notion
Asana
Communication
Slack
Microsoft Teams
Twilio
Email
SendGrid
Commerce & Payments
Stripe
Shopify Storefront
Shopify Customer
SWIFT gpi
Amazon Seller Central
Square
Adobe Commerce
Braintree
Healthcare
Epic FHIR
Cerner
Banking
DXC Hogan
SWIFT gpi
OpenSanctions
Oracle FLEXCUBE
Infosys Finacle
Temenos Transact
FIS Modern Banking
Plaid
Developer Tools
Azure DevOps
GitHub
GitLab
Linear
ERP & Procurement
Oracle FSCM
SAP Ariba
NetSuite
Identity & Auth
Okta
Auth0
Infrastructure
PostgreSQL
MongoDB
Elasticsearch
AWS (S3, SQS, SNS)
Redis
MySQL
Snowflake
Data & Analytics
Tableau
Looker
Power BI
Mixpanel
Marketing & Engagement
Mailchimp
Marketo
Braze
Segment
Legal & Compliance
LexisNexis
Thomson Reuters
Ironclad
Cloud & DevOps
Google Cloud
Kubernetes
Terraform
Vercel
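"Stateful" for the mock services above presumably means that a write made by the agent is visible to later reads in the same run. This in-memory sketch illustrates only that property for a generic ticketing service; the class, its methods, and the ticket fields are hypothetical, and it does not model the LLM-powered behavior or any real vendor API.

```python
import itertools


class MockTicketService:
    """Toy stateful mock: tickets persist across calls within one sandbox run."""

    def __init__(self):
        self._tickets = {}
        self._ids = itertools.count(1)

    def create_ticket(self, subject):
        """Store a new open ticket and return a copy of it."""
        ticket_id = f"T-{next(self._ids)}"
        self._tickets[ticket_id] = {
            "id": ticket_id,
            "subject": subject,
            "status": "open",
        }
        return dict(self._tickets[ticket_id])

    def update_status(self, ticket_id, status):
        """Change a ticket's status; raises KeyError for unknown IDs."""
        self._tickets[ticket_id]["status"] = status
        return dict(self._tickets[ticket_id])

    def get_ticket(self, ticket_id):
        """Return a copy of the stored ticket; raises KeyError if missing."""
        return dict(self._tickets[ticket_id])


service = MockTicketService()
ticket = service.create_ticket("Refund not received")
service.update_status(ticket["id"], "escalated")
print(service.get_ticket(ticket["id"]))
```

Returning copies rather than the stored dictionaries keeps the caller from mutating service state by accident, so every change goes through an explicit call that a trace can record.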
// ready_to_run
Test your agent before production does.
Spin up a sandbox, run scenarios against your agent, and get a graded report. Or talk to us if you'd rather see it on a call first.