# ToolMisuseBench
ToolMisuseBench is a deterministic, offline benchmark dataset for evaluating tool-using agents under realistic failure conditions, including schema misuse, execution failures, interface drift, and recovery under budget constraints.
This dataset is intended for reproducible evaluation of agent tool-use behavior, not for training a general-purpose language model.
## Dataset Summary
ToolMisuseBench evaluates whether an agent can:
- make valid tool calls under schema constraints
- recover after failures (timeouts, rate limits, authz, drift, adversarial errors)
- satisfy task goals under bounded tool-call/step/retry budgets
- minimize policy violations and invalid tool invocations
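As a sketch of what the first criterion involves, a minimal schema-validity check for a single tool call might look like the following. The tool schema shown is hypothetical; the dataset's actual `tool_schemas` format may differ.

```python
# Hypothetical tool schema (JSON-Schema-like; the dataset's real
# tool_schemas layout is an assumption here, used only for illustration).
schema = {"name": "create_record",
          "parameters": {"required": ["title"],
                         "types": {"title": str, "priority": int}}}

def is_valid_call(call: dict, schema: dict) -> bool:
    """Minimal check: correct tool name, required args present,
    and known args carry the expected type."""
    if call.get("name") != schema["name"]:
        return False
    params = schema["parameters"]
    args = call.get("arguments", {})
    if any(r not in args for r in params["required"]):
        return False
    return all(isinstance(v, params["types"][k])
               for k, v in args.items() if k in params["types"])

print(is_valid_call({"name": "create_record", "arguments": {"title": "x"}}, schema))  # True
print(is_valid_call({"name": "create_record", "arguments": {}}, schema))              # False
```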
All tasks are synthetic and generated with deterministic seeds to ensure reproducibility.
## Repository and Evaluator
- Project repository (code + evaluator + baselines): https://github.com/akgitrepos/toolmisusebench
- Recommended evaluation flow uses the project CLI and harness.
## Supported Evaluation Use Cases
- baseline benchmarking for tool-using agents
- robustness testing under controlled tool failures
- recovery-quality analysis after failure injection
- budgeted success tradeoff analysis (success vs tool-call cap)
## Data Structure

Dataset layout:

- `train/tasks.jsonl`
- `dev/tasks.jsonl`
- `test_public/tasks.jsonl`
- `manifest.json`
- `v0_1_freeze.json`

Each row in `tasks.jsonl` is a single benchmark task containing:

- `task_id`
- `split` (`train` | `dev` | `test_public`)
- `difficulty` (`easy` | `medium` | `hard`)
- `domain` (`crud` | `retrieval` | `files` | `scheduling` | `mixed`)
- `instruction`
- `toolset_id`
- `tool_schemas`
- `initial_state`
- `success_criteria`
- `budget` (`max_steps`, `max_tool_calls`, `max_retries`, `timeout_ms`)
- `fault_plan`
- `gold_summary` (optional)
- `seed`
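As a sketch, a single row could be parsed and its budget checked like this. The field values and exact JSON layout below are illustrative assumptions, not taken from the release.

```python
import json

# One hypothetical tasks.jsonl row (illustrative values only).
row = json.loads("""
{"task_id": "crud-0001", "split": "dev", "difficulty": "easy",
 "domain": "crud", "instruction": "Create a record, then read it back.",
 "toolset_id": "crud_v1", "tool_schemas": [], "initial_state": {},
 "success_criteria": {"record_exists": true},
 "budget": {"max_steps": 12, "max_tool_calls": 8,
            "max_retries": 2, "timeout_ms": 30000},
 "fault_plan": [], "seed": 42}
""")

def within_budget(steps: int, tool_calls: int, budget: dict) -> bool:
    """Check an episode's step and tool-call counts against the per-task budget."""
    return steps <= budget["max_steps"] and tool_calls <= budget["max_tool_calls"]

print(within_budget(5, 3, row["budget"]))   # True
print(within_budget(13, 3, row["budget"]))  # False
```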
## Dataset Size (v0.1 Release)
- Train: 5000
- Dev: 800
- Test Public: 1000
- Total: 6800
## Domains
- CRUD
- Retrieval
- Files
- Scheduling
## Fault Model

Supported fault types:

- `schema_drift`
- `rate_limit`
- `timeout`
- `authz`
- `adversarial_error`
Faults are declaratively specified per task and replayed deterministically.
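As an illustration of deterministic replay, a hypothetical fault-plan entry and a replay loop could look like this. The `trigger`/`call_index` structure is an assumption made for illustration, not the dataset's exact schema.

```python
# Hypothetical fault plan: inject a rate_limit error on the 3rd tool call.
fault_plan = [{"type": "rate_limit", "trigger": {"call_index": 3}}]

def replay_call(call_index: int, plan: list) -> str:
    """Deterministic fault injection: the same plan always yields
    the same outcome for the same call index, with no randomness."""
    for fault in plan:
        if fault["trigger"].get("call_index") == call_index:
            return f"error:{fault['type']}"
    return "ok"

print([replay_call(i, fault_plan) for i in range(1, 5)])
# ['ok', 'ok', 'error:rate_limit', 'ok']
```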
## Viewer Note on Null Values

In the Hugging Face table viewer, nested fields inside `fault_plan.trigger` and `fault_plan.payload` may appear as `null` for some rows. This is expected: different fault types use different subsets of fields, and the viewer displays a unified schema across all rows. A `null` value in this context typically means "not applicable for this fault type," not missing or corrupted data.
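When working with rows programmatically, the not-applicable `null` fields can simply be dropped. A minimal sketch (the fault object shown is hypothetical):

```python
# Hypothetical row as the unified viewer schema might surface it:
# fields unused by this fault type come back as None.
fault = {"type": "timeout",
         "trigger": {"call_index": None, "tool_name": "search"},
         "payload": {"retry_after_ms": None, "timeout_ms": 5000}}

def applicable_fields(obj: dict) -> dict:
    """Recursively drop None-valued (i.e. not-applicable) fields."""
    out = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            out[key] = applicable_fields(value)
        elif value is not None:
            out[key] = value
    return out

print(applicable_fields(fault))
# {'type': 'timeout', 'trigger': {'tool_name': 'search'}, 'payload': {'timeout_ms': 5000}}
```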
## Data Generation

Tasks are generated synthetically using deterministic templates, seeded randomization, and task-level coherence checks.

Generation reference command:

```shell
toolmisusebench generate \
  --version v0.1 \
  --out data/toolmisusebench_v0_1 \
  --seed 42 \
  --size-profile large
```

Coherence and quality audit reference command:

```shell
python -m generator.quality_report \
  --dataset data/toolmisusebench_v0_1 \
  --splits train,dev,test_public
```
## Scoring and Evaluation

Use the official evaluator in the project repo.

Example:

```shell
toolmisusebench eval \
  --dataset data/toolmisusebench_v0_1 \
  --split test_public \
  --agent heuristic \
  --report out/report.json
```

For detailed metric definitions, see `SCORING.md` in this dataset repository.
## Reproducibility Notes

- deterministic generation and replay under fixed seeds
- per-task fault plans are deterministic
- checksums included in `manifest.json`
- freeze metadata included in `v0_1_freeze.json`
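As a sketch of checksum verification, assuming a hypothetical manifest layout of `{"files": {"<path>": "<sha256 hex>"}}` (the actual `manifest.json` format may differ):

```python
import hashlib
import json

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest_path: str, root: str = ".") -> list:
    """Return the paths whose on-disk checksum does not match the manifest.
    NOTE: the {"files": {path: digest}} layout is an assumption."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return [path for path, digest in manifest["files"].items()
            if sha256_of(f"{root}/{path}") != digest]
```

An empty return value means every listed file matched its recorded digest.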
## Limitations
- synthetic tasks do not capture all real-world API/tool semantics
- benchmark is focused on controlled robustness comparisons, not full production realism
## Ethics and Privacy
- no personal data
- no proprietary user logs
- no sensitive external data sources used
## License

- Dataset: CC-BY-4.0
- Code/evaluator: MIT (see project repository)
## Citation

If you use ToolMisuseBench, please cite the project.

```bibtex
@misc{toolmisusebench2026,
  title={ToolMisuseBench: A Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems},
  author={ToolMisuseBench Authors},
  year={2026},
  howpublished={\url{https://github.com/akgitrepos/toolmisusebench}}
}
```