ExploitBench reports three arms per panel cell: ⟨model, env⟩ (bare model under a uniform runner), ⟨model, env, adaptive coaching⟩ (with mid-episode coaching), and ⟨model, env, CLI⟩ (the model's native vendor CLI). The three together separate model reasoning from harness effects.
⟨model, env⟩ is the primary arm because we want to measure model strength, not the toolchain wrapped around it. Vendor CLIs bundle context management, prompt scaffolding, retry policies, and early-termination rules around the model, and every vendor ships a different combination. Reporting through one CLI per model conflates capability with wrapper. We also do not customize the runner per model based on context-window size, reasoning mode, or provider economics: every model in the panel sees the same prompt template, the same six MCP tools, and the same turn-budget enforcement, so cell-to-cell differences are attributable to the model rather than to provider scaffolding or to how we configured its harness.
The secondary arms isolate the scaffolding effect. ⟨model, env, adaptive coaching⟩ adds AutoNudge from the runner (automatic mid-episode prompts asking a stalled agent to call grade, consolidate near the budget, or continue when it stops emitting tool calls); the delta versus ⟨model, env⟩ is the coaching effect. ⟨model, env, CLI⟩ swaps in the vendor's native CLI for the same model on the same bug; the delta versus ⟨model, env⟩ is the CLI effect. The three arms together tell us what a bare model can reason about, where coaching helps or hurts (it does both, depending on the model), and where vendor scaffolding raises or lowers the ceiling.
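As a minimal sketch of how the two deltas relate to the three arms, the snippet below computes both effects for a single panel cell. The class name `CellScores`, its field names, and the example scores are hypothetical and chosen only for illustration; they are not ExploitBench's data model or reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellScores:
    """Scores for one panel cell (one model on one env). Hypothetical structure."""

    bare: float     # <model, env>: bare model under the uniform runner
    coached: float  # <model, env, adaptive coaching>
    cli: float      # <model, env, CLI>: the vendor's native CLI

    def coaching_effect(self) -> float:
        # Positive means coaching helped, negative means it hurt.
        return self.coached - self.bare

    def cli_effect(self) -> float:
        # Positive means the vendor CLI raised the score over the bare arm.
        return self.cli - self.bare


# Made-up scores, not measurements.
cell = CellScores(bare=0.40, coached=0.55, cli=0.30)
print(f"coaching effect: {cell.coaching_effect():+.2f}")
print(f"CLI effect:      {cell.cli_effect():+.2f}")
```

Both deltas are taken against the same ⟨model, env⟩ baseline, so either can be negative for a given model.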
A vendor CLI bundles several decisions around the model, and one is context management. Our exploitbench agent does not compact. It lets the full history grow up to the model's context window, while a CLI like Codex compacts earlier. We suspect this headroom helps a model do more per bug at short budgets, but long runs eventually exceed any window, at which point compaction becomes necessary, whether it comes from a CLI or from a provider's server-side feature. We are preparing an evaluation guideline to compare these effects across vendors on equal footing.
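The point that an uncompacted history eventually exceeds any window can be sketched with simple arithmetic. The function name `first_overflow_turn`, the per-turn token counts, and the window size below are illustrative assumptions, not measurements from the runner, and the sketch ignores system-prompt and tool-schema overhead.

```python
from itertools import accumulate
from typing import Iterable, Optional


def first_overflow_turn(tokens_per_turn: Iterable[int], window: int) -> Optional[int]:
    """Return the 1-based turn at which an uncompacted history first exceeds
    `window` tokens, or None if the whole run fits."""
    for turn, total in enumerate(accumulate(tokens_per_turn), start=1):
        if total > window:
            return turn
    return None


# Hypothetical numbers: 4,000 tokens added per turn, 200,000-token window.
window = 200_000
short_run = [4_000] * 20  # 80,000 tokens in total: fits without compaction
long_run = [4_000] * 60   # 240,000 tokens in total: cannot fit

print(first_overflow_turn(short_run, window))  # None
print(first_overflow_turn(long_run, window))   # 51
```

Under these assumed numbers the short run never needs compaction, while the long run overflows at turn 51 no matter which harness wraps the model.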