Inside the Top 1%: Engineering Tenzai’s AI Hacker to Compete with Elite Humans

7 min read Original article ↗

An evaluation of Tenzai’s autonomous hacking agent across six major Capture-the-Flag competitions designed for human security researchers.

Over the past several weeks, we evaluated Tenzai’s autonomous hacking agent across a set of challenging Capture-the-Flag (CTF) competitions designed for human security researchers. Across six platforms, our AI hacker achieved scores placing it within the top 1% of participants, outperforming more than 125,000 human competitors.

Our goal in publishing these results is not “just” to report a milestone, but to push for clear evaluation standards for autonomous offensive security systems. Still, this is indeed a milestone, as it shows that systematic, AI-driven offensive security is not theoretical or buzz worthy - but works at scale.

The six platforms - websec.fr, dreamhack.io, websec.co.il, hack.arrrg.de, pwnable.tw, and Lakera’s Agent Breaker - are popular competitions, attracting some of the brightest hacker minds in the world.

Tenzai's AI hacker reached the top 1% in all of them (for an average cost of just $12.92!) and a runtime average of a hair under two hours. 

We’re confident that this is just the beginning, and this technology will get faster and cheaper.

The practical effect is that elite offensive security expertise is available on demand and at a much larger scale than previously possible. We hope this achievement stands as a milestone in demonstrating offensive cyber abilities of AI agents.

Why CTFs? 

To reach this level of performance, we followed a deliberately different validation approach. Rather than focusing on bug bounty programs or crawling open-source software in search of easily discoverable vulnerabilities, the evaluation prioritized environments that reward deeper offensive reasoning. Bug bounty programs and large-scale CVE discovery often incentivize finding many simple issues across a wide surface area. While valuable, this dynamic can favor breadth and automation over depth.

In Capture the Flag (CTFs) challenges, participants must discover and exploit vulnerabilities in unfamiliar systems without prior knowledge of the challenge implementation. CTFs are a well known mechanism and many are more difficult than the hardest certifications in the penetration testing industry.

Our reasoning for focusing on CTFs is two-fold:

  1. Working with a standard: To measure an agent, you need a known difficulty curve. Bug bounties and public applications are too inconsistent to serve as a rigorous evaluation standard. The majority of CTFs have normalized difficulty levels with clear categories and consistent execution environments. 
  2. No noise: Evaluations cannot be moving targets. Bug bounties and production software rapidly change and encourage shallow scalable findings rather than complex techniques and reasoning.

While enterprise applications are very different from capture the flag challenges, experienced penetration testers use CTFs to train and challenge themselves. In the CTFs we chose, many challenges required combining several weaknesses within the same system. In practice, this resembles the exploitation patterns seen in real attacks more closely than isolated vulnerability detection.

There are hundreds of CTFs out there and we wanted to pick those that are useful as evaluations. Therefore, we selected competitions with the following characteristics:

  • Large participant pools, often numbering in the tens of thousands
  • Clear difficulty bands, where higher rankings depend on solving the hardest problems
  • Competitions with gated writeups or unpublished solutions, reducing the likelihood that answers appear in model training data

The Challenges

The six platforms - dreamhack.io, pwnable.tw, Lakera’s Agent Breaker, websec.fr, websec.co.il, and hack.arrrg.de - are popular competitions attracting some of the brightest hacker minds in the world.

Popular competitions attracting some of the brightest hacker minds in the world - Tenzai's AI hacker ranks in top 1%

An example run of a successful flag achievement in Dreamhack

Dreamhack Challenge 1678 (Difficulty level 8 out of 10)- A challenge with 17 solvers and no public writeups. To solve this, the agent needed to chain three distinct attacks together, preventing the first from interfering with the second.

The agent starts by reading the source code and understanding the application “as normal.” As part of that, it finds the vulnerable endpoint - POST /api/http - and analyzes it.

These security concerns are then converted to an internal Lead that will be used to follow up and track this work

New Lead: SSRF via /api/http with partial IP checks bypass potential

The agent then creates itself a long TO-DO list with the top leads to further investigate in it, including the SSRF lead it identified through source code:

The agent turns to doing recon on the application and understanding how authentication and business logic work. While the agent traced the authentication flow, it realized that the class-transformer library used by the server to handle client requests is potentially vulnerable to a prototype pollution vulnerability. After trying out an attack and seeing a successful pollution, the agent can now escalate its privileges to Administrator in this application. The agent then realizes this is a useful part of a vulnerability chain with the potential SSRF lead, which is transformed to:

Update Lead: Redis arbitrary command execution via SSRF + Prototype Pollution chain

We end with a call to a sub-agent that will do the nitty gritty technical work of exploiting this chain.

The agent now iteratively exploits each step in the chain. This included successfully exploiting the prototype pollution to escalate privileges. Exploiting the SSRF, incrementally proving each part, including fetching a URL from g.co (to prove the SSRF itself), understanding subtle behavior of DNS resolution in NodeJS to bypass checks and finding a method to convert an IP address to a 16-character domain name. The agent independently tested a series of public services before settling on nip.io. Once the agent achieved communication with the Redis instance, it searched the web how to exploit an old Redis instance, found CVE-2022-0543 and read the flag.

The Lead is finalized into a Finding with the following description:

Agent Harness

These results are not solely a consequence of improvements in foundation models.

While modern language models have strong reasoning and coding abilities, they are not trained for uncertain processes such as vulnerability discovery without additional system support.

Exploitation often involves exploring multiple hypotheses, maintaining structured knowledge about the target system, and revisiting assumptions made when an approach fails.

To support this process, the Tenzai system uses an agent harness that orchestrates model reasoning and execution. The harness manages state, tracks discovered information, and coordinates exploration of multiple attack paths.

Implications

Despite these results, the system does not yet match the capabilities of the strongest human researchers. A small group of highly skilled hackers are still stronger than the agent in many competitions. Closing that gap remains an open challenge for our team.

We view these results as an indicator of capability, not a complete evaluation of autonomous offensive systems.

Why? Because CTFs ≠ full penetration testing; top 1% ≠ superhuman.

CTFs ≠ Full Penetration Testing: Capture-the-Flag environments are a proxy for skill but do not represent anything near the complexity of enterprise application security.

Top 1% ≠ Superhuman: Ranking in the top percentile is an achievement, but it does not equate to "superhuman" capability.

Production Complexity: Real-world complexity such as cloud and on prem networks and interactions between dozens of applications are not captured in controlled challenges.

Historically, working with most capable penetration testers has been limited to specific firms and contracts. Organizations would settle for generic services and often test only a small portion of their systems.

Autonomous systems capable of performing at the level of the top percentile of human competitors introduce a different model. Rather than replacing human expertise, they allow organizations to scale offensive security capability across many more systems and testing cycles.

This is the direction we are pursuing at Tenzai.