We gamed an Agentic Benchmark — and ranked first on the leaderboard.
This was possible because all submission files are publicly readable, along with their scored datasets. That made it easy to build an agent that learns from other teams' answers and likely scores high even on the hidden test set. Every breakthrough in LLMs is measured against certain benchmarks, leaderboards, and standardized tests. These models provide a yardstick with which we measure the world, but first we must trust the test that measures it. Scientific benchmarks are built for transparency and reproducibility, so that researchers and practitioners can reproduce the results. But this same trait introduces a vulnerability: once the test dataset is released, it is easy to train an AI model directly on it.
from huggingface_hub import hf_hub_download, list_repo_files

# List all files in the dataset repository
files = list_repo_files(repo_id="adyen/DABstep", repo_type="dataset")

print("Files found in repo:")
for f in files:
    print(f)

# Download everything except repo housekeeping files
for filename in files:
    if filename in ['.gitattributes', '.gitignore', 'LICENSE']:
        continue
    hf_hub_download(
        repo_id="adyen/DABstep",
        repo_type="dataset",
        filename=filename,
        local_dir=".",
        force_download=False,
    )
The code is self-explanatory, and you can use it to extract the reasoning traces of your competitors and train a smaller language model to imitate them.
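As a rough sketch of that idea, you could collect the downloaded submissions into a fine-tuning corpus. The `question` and `reasoning_trace` field names below are hypothetical, as is the `.jsonl` format; inspect the actual files before relying on any schema:

```python
import json
from pathlib import Path


def collect_traces(submission_dir="."):
    """Gather (question, reasoning_trace) pairs from downloaded files.

    Assumes JSONL files with hypothetical 'question' and
    'reasoning_trace' fields -- the real schema may differ.
    """
    pairs = []
    for path in Path(submission_dir).glob("**/*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if "question" in record and "reasoning_trace" in record:
                pairs.append((record["question"], record["reasoning_trace"]))
    return pairs
```

The resulting pairs can be fed to any standard supervised fine-tuning pipeline.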
Another issue is that the test set is static, which makes it easier to game once it is exposed through an API: every query has been seen before, so answers can simply be cached and replayed.
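To make the static-set weakness concrete, here is a minimal sketch of a cache-and-replay attack: hash each (lightly normalized) query and return the stored answer on a hit. The class and method names are illustrative, not part of any benchmark API:

```python
import hashlib


class AnswerCache:
    """Replay stored answers for a static test set (illustrative sketch)."""

    def __init__(self):
        self._cache = {}

    def _key(self, query: str) -> str:
        # Normalize whitespace and case so trivial variations still hit
        normalized = " ".join(query.split()).lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    def store(self, query: str, answer: str) -> None:
        self._cache[self._key(query)] = answer

    def lookup(self, query: str):
        # Returns the cached answer, or None on a miss
        return self._cache.get(self._key(query))
```

Once the full query set has been seen once, a "perfect" score requires no reasoning at all.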
Could benchmark builders use a stronger LLM to generate dynamic test sets? That would make gaming the system much harder, since each user would receive a non-repeating set of queries. The queries only need to be statistically similar, not identical.
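A minimal sketch of this idea, using parameterized templates with random values instead of an LLM (a real implementation would have an LLM paraphrase and vary the questions; the templates and names below are hypothetical):

```python
import random

# Hypothetical question templates and parameter pools
TEMPLATES = [
    "What was the total fee for merchant {merchant} in {month} 2023?",
    "How many transactions did merchant {merchant} process in {month} 2023?",
]
MERCHANTS = ["Acme_Corp", "Globex", "Initech"]
MONTHS = ["January", "February", "March", "April"]


def generate_queries(n, seed=None):
    """Sample n distinct queries from the template space.

    Each caller (or each run) gets a different but statistically
    similar query set when given a different seed.
    """
    rng = random.Random(seed)
    queries = set()
    while len(queries) < n:
        template = rng.choice(TEMPLATES)
        queries.add(template.format(
            merchant=rng.choice(MERCHANTS),
            month=rng.choice(MONTHS),
        ))
    return sorted(queries)
```

Because the ground-truth answers are computed per generated query, a cached-answer agent gains nothing: the exact strings it memorized never reappear.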