# SKILL.md Review Schema Benchmark
Which review framework best evaluates AI agent skill files?
We designed 5 review schemas for evaluating SKILL.md files, ran them against real-world skills using multiple LLMs, and collected the results. Now we want your judgment.
This benchmark feeds into Skills Manager, a management layer for AI agent skills.
## Prompts
| Schema | Dimensions | Max score |
|---|---|---|
| RADAR | 6 | 30 |
| PRISM | 8 | 40 |
| A2E | 10 | 30 |
| TDC-VRG | 6 | 18 |
| Logic-Linter | 6 | 30 |
## Results
Each schema folder contains a subfolder for each skill reviewed. Result files are named `<model>-<score>.md`.
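For orientation, here is a sketch of that layout, assuming schema folders live under `prompts/` as described in Contributing. The skill and result filenames below are illustrative examples of the naming convention, not a listing of the actual repository:

```
prompts/
└── RADAR/
    ├── prompt.md            # the review schema itself
    └── cloudflare/          # one subfolder per reviewed skill
        ├── claude-21.md     # <model>-<score>.md
        ├── chatgpt-20.md
        └── gemini-20.md
```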
Best comparison point: the `cloudflare` skill, reviewed across 3 schemas × 3 models:
| Schema | Claude | ChatGPT | Gemini |
|---|---|---|---|
| RADAR | 21/30 | 20/30 | 20/30 |
| PRISM | 29/40 | 21/40 | 23/40 |
| TDC-VRG | 9/18 | 6/18 | 9/18 |
## We want your feedback
Read a few reviews, then open an issue with your take:
- Which review would you actually use to improve a skill?
- Which found real problems vs. generated noise?
- Were the suggested fixes specific enough to act on?
- Did any schema over- or under-score?
## Contributing
- Run a review: Pick a schema + a skill + an LLM → submit the result as a PR
- Propose a schema: Add `prompts/<NAME>/prompt.md` with at least one review
- Suggest skills to review: Open an issue
See CONTRIBUTING.md for details.
## License
MIT