# ToolMisuseBench
ToolMisuseBench is a deterministic, offline benchmark dataset for evaluating tool-using agents under realistic failure conditions, including schema misuse, execution failures, interface drift, and recovery under budget constraints.
This dataset is intended for reproducible evaluation of agent tool-use behavior, not for training a general-purpose language model.
## Dataset Summary
ToolMisuseBench evaluates whether an agent can:
- make valid tool calls under schema constraints
- recover after failures (timeouts, rate limits, authz, drift, adversarial errors)
- satisfy task goals under bounded tool-call/step/retry budgets
- minimize policy violations and invalid tool invocations
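As a sketch of what the first criterion involves, a minimal schema-validity check for a single tool call might look like the following. The tool schema shown is hypothetical; the dataset's actual `tool_schemas` format may differ.

```python
# Hypothetical tool schema (JSON-Schema-like; the dataset's real
# tool_schemas layout is an assumption here, used only for illustration).
schema = {"name": "create_record",
          "parameters": {"required": ["title"],
                         "types": {"title": str, "priority": int}}}

def is_valid_call(call: dict, schema: dict) -> bool:
    """Minimal check: correct tool name, required args present,
    and known args carry the expected type."""
    if call.get("name") != schema["name"]:
        return False
    params = schema["parameters"]
    args = call.get("arguments", {})
    if any(r not in args for r in params["required"]):
        return False
    return all(isinstance(v, params["types"][k])
               for k, v in args.items() if k in params["types"])

print(is_valid_call({"name": "create_record", "arguments": {"title": "x"}}, schema))  # True
print(is_valid_call({"name": "create_record", "arguments": {}}, schema))              # False
```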
All tasks are synthetic and generated with deterministic seeds to ensure reproducibility.
## Repository and Evaluator
- Project repository (code + evaluator + baselines): https://github.com/akgitrepos/toolmisusebench
- Recommended evaluation flow uses the project CLI and harness.
## Supported Evaluation Use Cases
- baseline benchmarking for tool-using agents
- robustness testing under controlled tool failures
- recovery-quality analysis after failure injection
- budgeted success tradeoff analysis (success vs tool-call cap)
## Data Structure

Dataset layout:

- `train/tasks.jsonl`
- `dev/tasks.jsonl`
- `test_public/tasks.jsonl`
- `manifest.json`
- `v0_1_freeze.json`

Each row in `tasks.jsonl` is a single benchmark task containing:

- `task_id`
- `split` (`train` | `dev` | `test_public`)
- `difficulty` (`easy` | `medium` | `hard`)
- `domain` (`crud` | `retrieval` | `files` | `scheduling` | `mixed`)
- `instruction`
- `toolset_id`
- `tool_schemas`
- `initial_state`
- `success_criteria`
- `budget` (`max_steps`, `max_tool_calls`, `max_retries`, `timeout_ms`)
- `fault_plan`
- `gold_summary` (optional)
- `seed`
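As a sketch, a single row could be parsed and its budget checked like this. The field values and exact JSON layout below are illustrative assumptions, not taken from the release.

```python
import json

# One hypothetical tasks.jsonl row (illustrative values only).
row = json.loads("""
{"task_id": "crud-0001", "split": "dev", "difficulty": "easy",
 "domain": "crud", "instruction": "Create a record, then read it back.",
 "toolset_id": "crud_v1", "tool_schemas": [], "initial_state": {},
 "success_criteria": {"record_exists": true},
 "budget": {"max_steps": 12, "max_tool_calls": 8,
            "max_retries": 2, "timeout_ms": 30000},
 "fault_plan": [], "seed": 42}
""")

def within_budget(steps: int, tool_calls: int, budget: dict) -> bool:
    """Check an episode's step and tool-call counts against the per-task budget."""
    return steps <= budget["max_steps"] and tool_calls <= budget["max_tool_calls"]

print(within_budget(5, 3, row["budget"]))   # True
print(within_budget(13, 3, row["budget"]))  # False
```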
## Dataset Size (v0.1 Release)
- Train: 5000
- Dev: 800
- Test Public: 1000
- Total: 6800
## Domains
- CRUD
- Retrieval
- Files
- Scheduling
## Fault Model

Supported fault types:

- `schema_drift`
- `rate_limit`
- `timeout`
- `authz`
- `adversarial_error`
Faults are declaratively specified per task and replayed deterministically.
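As an illustration of deterministic replay, a hypothetical fault-plan entry and a replay loop could look like this. The `trigger`/`call_index` structure is an assumption made for illustration, not the dataset's exact schema.

```python
# Hypothetical fault plan: inject a rate_limit error on the 3rd tool call.
fault_plan = [{"type": "rate_limit", "trigger": {"call_index": 3}}]

def replay_call(call_index: int, plan: list) -> str:
    """Deterministic fault injection: the same plan always yields
    the same outcome for the same call index, with no randomness."""
    for fault in plan:
        if fault["trigger"].get("call_index") == call_index:
            return f"error:{fault['type']}"
    return "ok"

print([replay_call(i, fault_plan) for i in range(1, 5)])
# ['ok', 'ok', 'error:rate_limit', 'ok']
```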
## Viewer Note on Null Values

In the Hugging Face table viewer, nested fields inside `fault_plan.trigger` and `fault_plan.payload` may appear as `null` for some rows. This is expected: different fault types use different subsets of fields, and the viewer displays a unified schema across all rows. A `null` value in this context typically means "not applicable for this fault type," not missing or corrupted data.
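When working with rows programmatically, the not-applicable `null` fields can simply be dropped. A minimal sketch (the fault object shown is hypothetical):

```python
# Hypothetical row as the unified viewer schema might surface it:
# fields unused by this fault type come back as None.
fault = {"type": "timeout",
         "trigger": {"call_index": None, "tool_name": "search"},
         "payload": {"retry_after_ms": None, "timeout_ms": 5000}}

def applicable_fields(obj: dict) -> dict:
    """Recursively drop None-valued (i.e. not-applicable) fields."""
    out = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            out[key] = applicable_fields(value)
        elif value is not None:
            out[key] = value
    return out

print(applicable_fields(fault))
# {'type': 'timeout', 'trigger': {'tool_name': 'search'}, 'payload': {'timeout_ms': 5000}}
```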
## Data Generation

Tasks are generated synthetically using deterministic templates, seeded randomization, and task-level coherence checks.

Generation reference command:

```shell
toolmisusebench generate \
  --version v0.1 \
  --out data/toolmisusebench_v0_1 \
  --seed 42 \
  --size-profile large
```

Coherence and quality audit reference command:

```shell
python -m generator.quality_report \
  --dataset data/toolmisusebench_v0_1 \
  --splits train,dev,test_public
```
## Scoring and Evaluation

Use the official evaluator in the project repo.

Example:

```shell
toolmisusebench eval \
  --dataset data/toolmisusebench_v0_1 \
  --split test_public \
  --agent heuristic \
  --report out/report.json
```

For detailed metric definitions, see `SCORING.md` in this dataset repository.
## Reproducibility Notes

- deterministic generation and replay under fixed seeds
- per-task fault plans are deterministic
- checksums included in `manifest.json`
- freeze metadata included in `v0_1_freeze.json`
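As a sketch of checksum verification, assuming a hypothetical manifest layout of `{"files": {"<path>": "<sha256 hex>"}}` (the actual `manifest.json` format may differ):

```python
import hashlib
import json

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest_path: str, root: str = ".") -> list:
    """Return the paths whose on-disk checksum does not match the manifest.
    NOTE: the {"files": {path: digest}} layout is an assumption."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return [path for path, digest in manifest["files"].items()
            if sha256_of(f"{root}/{path}") != digest]
```

An empty return value means every listed file matched its recorded digest.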
## Limitations
- synthetic tasks do not capture all real-world API/tool semantics
- benchmark is focused on controlled robustness comparisons, not full production realism
## Ethics and Privacy
- no personal data
- no proprietary user logs
- no sensitive external data sources used
## License

- Dataset: CC-BY-4.0
- Code/evaluator: MIT (see project repository)
## Citation

If you use ToolMisuseBench, please cite the project.

```bibtex
@misc{toolmisusebench2026,
  title={ToolMisuseBench: A Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems},
  author={ToolMisuseBench Authors},
  year={2026},
  howpublished={\url{https://github.com/akgitrepos/toolmisusebench}}
}
```