GitHub - aleksafar/clockbench: ClockBench - Visual Reasoning AI Benchmark

ClockBench

ClockBench - Visual Reasoning AI Benchmark.

This is a public dataset that includes 10 clocks, out of 180 clocks available in the private dataset. Full dataset if intentionally kept private in order to avoid data leaking into models training.

Setup Instructions

Install dependencies declared in pyproject.toml

The script depends on requests. If you prefer not to use pip install -e ., you can alternatively run pip install requests directly.

Run Instructions

Benchmark includes two scripts designed to be run consecutively:

python3 clockbench_evaluate.py

This script runs an evaluation of a chosen model via OpenRouter API. Please add your API key and specify a model to evaluate in the leading section of the script. Script outputs the results in a JSON file.

python3 clockbench_grade.py

This script grades the results. Script outputs the results in a JSON file.

Contributing

Pull requests are welcome.