Settings

Theme

Show HN: Nyx – multi-turn, adaptive, offensive testing harness for AI agents

fabraix.com

20 points by zachdotai a month ago · 12 comments · 1 min read

Reader

We built Nyx to solve a problem we kept hitting while building agents: AI agents break in ways traditional software doesn't. Logic bugs, reasoning failures, edge cases that manual testing and static benchmarks never explore.

Nyx is an autonomous testing harness that probes your AI agents to find failure modes before users do. It’s used to find logic bugs, instruction following failures, edge cases in agent behavior, and for red-team security testing (jailbreaks, prompt injection, tool hijacking)

Technical approach: * Pure blackbox (no special access needed - test like your users interact) * Multi-turn adaptive conversations * Multi-modal testing (voice, text, images, documents, browser interactions) * Massively parallel by default

Instead of spending time writing static evals for the key failure modes of your AI agents, point Nyx at any system and it autonomously discovers failure modes that matter. We typically find issues in under 10 minutes that manual audits take hours to surface.

This is early work and we know the methodology is still going to evolve. We would love nothing more than feedback from the community as we iterate on this.

ibrahim-fab a month ago

Nice. Definitely true that evaluating agents behavior is by far the toughest part of building them. Also most eval cases are added without thought and not maintained when agent behaviour updates. Interesting approach.

  • zachdotaiOP a month ago

    We wrote some thoughts on static vs. dynamic evals and how it relates to understanding the security posture of an AI system. Static security evals no longer carry the signal they used to. A one-shot pass/fail tells you almost nothing about real-world risk.

    Would love your thoughts on this: https://fabraix.com/blog/adversarial-cost-to-exploit

azhassan1 a month ago

Where do you draw the line between this and coverage-guided fuzzing? A lot of what you describe (parallel, adaptive, finds edge cases in unbounded input spaces) maps cleanly onto the fuzzing playbook, which has decades of theory behind it - corpus management, mutation scheduling, minimization of found crashes.

Are you borrowing from that literature or treating agent testing as a distinct problem? Feels like there's real transfer available if you're not already pulling from it.

aacudad a month ago

I am not sure this will work, seems like added complexity to something simple

ljhasdr a month ago

i need to try this before mythos comes to attack our service. thanks!

AmineAfia a month ago

Can I integrate this in my CI/CD pipeline?

adam_rida a month ago

Very cool!

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection