Show HN: VR.dev – Open-source verifiers for what AI agents did
Hey HN,
Quick origin story: vr.dev started as a virtual reality project. The domain fit perfectly. The developer adoption did not. Rather than let a good domain go to waste, I pivoted to the other kind of VR: verification and rewards for AI agents.
The problem I kept running into: agents report success but system state tells a different story. The database row is still active. The IMAP sent folder is empty. The tests pass because the agent modified the tests. Real benchmarks put agent success at 12-30%, and even among reported successes a large fraction are procedurally wrong in ways that are hard to catch without actually checking state.
So I built a library of verifiers that check real system state rather than trusting agent self-reports. There are 38 of them across 19 domains right now, organized into three tiers: HARD (deterministic probes against databases, files, APIs, git), SOFT (LLM rubric scoring for things like tone or coherence that don't have a deterministic test), and AGENTIC (verifiers that actively probe the environment via headless browser, IMAP, or shell).
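To make the HARD tier concrete, here's a minimal sketch of what a deterministic probe for the "database row is still active" failure mode could look like. This is illustrative only: the function name, table, and column are hypothetical, not vr.dev's actual API.

```python
import sqlite3

def verify_row_deactivated(db_path: str, row_id: int) -> float:
    """HARD verifier sketch: probe real database state instead of
    trusting the agent's self-report. Returns 1.0 on pass, 0.0 on fail.
    (Hypothetical schema: accounts(id, status).)"""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT status FROM accounts WHERE id = ?", (row_id,)
        ).fetchone()
    finally:
        conn.close()
    # Fail closed: a missing row or unexpected status scores 0.0.
    if row is None:
        return 0.0
    return 1.0 if row[0] == "inactive" else 0.0
```

The point is that the score comes from the database itself, so an agent that merely claims "I deactivated the account" scores 0.0 unless the state actually changed.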
The design decision I'd most like feedback on is the composition model. SOFT scores are gated behind HARD checks, so if the deterministic check fails, the composed score is 0.0 regardless of what the LLM judge says. The idea is to make reward hacking structurally harder rather than just hoping the judge catches it.
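The gating rule itself is tiny; a sketch of the composition described above (my own function names, not the library's API):

```python
def compose(hard_score: float, soft_score: float) -> float:
    """Gate the SOFT (LLM-judge) score behind the HARD check: if the
    deterministic probe fails, the composed reward is 0.0 regardless of
    how favorable the judge's rubric score is."""
    if hard_score < 1.0:  # HARD checks are pass/fail
        return 0.0
    return soft_score
```

So a polished-sounding but state-incorrect result (`compose(0.0, 0.95)`) scores 0.0, while a state-correct result keeps its rubric score (`compose(1.0, 0.8)` gives 0.8). The judge can only modulate reward within the region the deterministic checks have already admitted.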
MIT licensed, runs locally via pip install vrdev, with no dependency on the hosted API, which matters if you're using it in a training loop. Full verifier list at https://vr.dev/registry.
Curious whether the HARD/SOFT/AGENTIC taxonomy makes sense to people, whether fail_closed is the right default, and whether anyone has built something similar and run into problems I haven't hit yet.
https://vr.dev
https://github.com/vrDotDev/vr-dev
https://pypi.org/project/vrdev/

This is an interesting space. We've gotten to a point where agents can do most tasks, but they'll get lazy or skip steps if you're not precise in the requirements. We need validation that extends beyond software tests. This is a good direction, but a few thoughts:
1. From what I can tell, the agent that does the task is also running the validation. Keeping the validation agent in a separate context limits the validator's view to what the software is supposed to do versus what it actually does.
2. There's a lot of prior art around org structures for validating things, built out over the last ~100 years, that we can apply in this space. E.g., look at the way blind trials are run.

Totally agree, and fwiw, nothing in this implementation requires that the agent verify its own work. The hope is that this ultimately exists as a verification mechanism on one side of an agent-to-agent interaction or delegation. Because it is true that, even though we've built some adversarial aspects into the verification, it isn't truly blind to the actor unless you explicitly design your use of these verifiers that way, which is what I've considered the better design.