BinaryAudit: Can AI find backdoors in raw machine code?


Can AI agents detect malicious backdoors hidden in compiled binaries? We tested leading models on reverse-engineering tasks in which they use tools like Ghidra and Radare2 to identify backdoors, timebombs, and other security threats in real software.
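As a rough illustration of the tool-driven workflow the agents follow, the sketch below drives Radare2 through its r2pipe Python bindings to triage a binary: run auto-analysis, list recovered functions, grep strings for suspicious markers, and disassemble `main`. The binary path and the marker list are assumptions for illustration, not the benchmark's actual harness.

```python
# Minimal sketch (not the benchmark harness): drive Radare2 via r2pipe
# to triage a binary for suspicious indicators.
import r2pipe

BINARY = "./httpd-under-test"  # hypothetical target binary

r2 = r2pipe.open(BINARY)
r2.cmd("aaa")  # run Radare2's full auto-analysis

# List recovered functions for a first overview of the binary.
for fn in r2.cmdj("aflj") or []:
    print(f'{fn["offset"]:#x} {fn["name"]} ({fn["size"]} bytes)')

# Dump strings from the whole binary and flag backdoor-ish markers.
SUSPICIOUS = ("passwd", "magic", "debug_key", "/bin/sh")  # illustrative heuristics
for s in r2.cmdj("izzj") or []:
    text = s.get("string", "")
    if any(marker in text for marker in SUSPICIOUS):
        print(f'suspicious string at {s["vaddr"]:#x}: {text!r}')

# Disassemble main for closer manual inspection.
print(r2.cmd("pdf @ main"))
r2.quit()
```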

Read our blog post introducing BinaryAudit: We hid backdoors in binaries — Opus 4.6 found 49% of them

Models ranked by their success rate in each task category. The benchmark tests three distinct capabilities: detecting malicious code in binaries, using reverse engineering tools, and avoiding false positives on clean code.

Detect backdoors in compiled binaries

Tasks cover three categories: binary analysis for backdoor detection, tooling usage for decompilation and analysis, and false positive tests, in which no backdoor is present and models should report a clean binary rather than raise a false alarm. Target software includes real-world network infrastructure into which we artificially added backdoors: web servers, DNS servers, SSH servers, proxies, and load balancers.

Task categories: Binary Analysis, Tooling, and Verification.

View all tasks →
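To make concrete what "backdoor" and "timebomb" mean here, the schematic below shows two classic patterns: a hardcoded-credential bypass and a date-triggered payload. It is written in Python for brevity and every name in it is hypothetical; the benchmark plants equivalent logic inside compiled binaries.

```python
# Schematic only: two classic patterns of the kind a backdoored server
# might contain. Names, dates, and logic are purely illustrative.
import datetime
import hmac


def check_password(supplied: str, stored_hash: bytes, hash_fn) -> bool:
    # Backdoor: a hardcoded "maintenance" credential bypasses the real check.
    if supplied == "s3cr3t-maint-2024":  # hidden master password
        return True
    return hmac.compare_digest(hash_fn(supplied), stored_hash)


def handle_request(payload: bytes) -> bytes:
    # Timebomb: benign behaviour until a trigger date, then malicious logic.
    if datetime.date.today() >= datetime.date(2026, 1, 1):
        return b"<malicious payload would run here>"  # dormant until triggered
    return payload.upper()  # normal behaviour
```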

A detailed view of which tasks each model solved or failed. This helps identify models that handle specific security analysis patterns well, even if their overall score is lower.

Pass rate (identifying a backdoor and pointing to its location in the binary) plotted against false positive rate (how often a model incorrectly flags clean code). Models in the upper left detect more backdoors while raising fewer false alarms.
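To clarify how the two axes are derived, here is a small sketch of the metric computation over per-task results. The record fields and sample values are hypothetical, not BinaryAudit's actual schema.

```python
# Sketch of the two axes, assuming a simple list of per-task results.
# Field names and values are hypothetical, not the benchmark's schema.
results = [
    {"has_backdoor": True,  "model_flagged": True,  "location_correct": True},
    {"has_backdoor": True,  "model_flagged": False, "location_correct": False},
    {"has_backdoor": False, "model_flagged": True,  "location_correct": False},
    {"has_backdoor": False, "model_flagged": False, "location_correct": False},
]

backdoored = [r for r in results if r["has_backdoor"]]
clean = [r for r in results if not r["has_backdoor"]]

# Pass rate: the backdoor was found *and* its location was identified.
pass_rate = sum(r["model_flagged"] and r["location_correct"] for r in backdoored) / len(backdoored)

# False positive rate: clean binaries the model nonetheless flagged.
false_positive_rate = sum(r["model_flagged"] for r in clean) / len(clean)

print(f"pass rate: {pass_rate:.0%}, false positive rate: {false_positive_rate:.0%}")
```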

We map total API cost against success rate for Binary Analysis tasks. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
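As a sketch of how such a frontier is computed: a model sits on the cost/performance Pareto frontier when no other model is both cheaper and at least as successful. The code below shows one way to compute it; the model names and numbers are made up for illustration.

```python
# Sketch: find the cost/success Pareto frontier by sweeping models from
# cheapest to most expensive and keeping those that improve on the best
# success rate seen so far. All numbers here are made up.
models = {
    "model-a": {"cost_usd": 42.0, "success": 0.49},
    "model-b": {"cost_usd": 15.0, "success": 0.31},
    "model-c": {"cost_usd": 60.0, "success": 0.45},  # dominated by model-a
}


def pareto_frontier(points: dict) -> list[str]:
    frontier = []
    best_success = -1.0
    for name, p in sorted(points.items(), key=lambda kv: kv[1]["cost_usd"]):
        if p["success"] > best_success:
            frontier.append(name)
            best_success = p["success"]
    return frontier


print(pareto_frontier(models))  # ['model-b', 'model-a']
```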

This chart compares accuracy against average generation time for Binary Analysis tasks, helping identify models that balance solution quality with response latency.

We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on Binary Analysis tasks compares across model generations.

Run it yourself

For reproducibility, we open-sourced the full benchmark at QuesmaOrg/BinaryAudit. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals.

We welcome contributions of new tasks. See the repository for details.


All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.