razbakov/skills-benchmark

SKILL.md Review Schema Benchmark

Which review framework best evaluates AI agent skill files?

We designed 5 review schemas for evaluating SKILL.md files, ran them against real-world skills using multiple LLMs, and collected the results. Now we want your judgment.

This benchmark feeds into Skills Manager, a management layer for AI agent skills.

Prompts

Schema        Dimensions  Max score
RADAR         6           30
PRISM         8           40
A2E           10          30
TDC-VRG       6           18
Logic-Linter  6           30

Results

Each schema folder contains one subfolder per reviewed skill. Result files are named <model>-<score>.md.
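
As an illustration, here is a minimal Python sketch that walks that layout and tabulates scores from the filenames. The root path and the exact <schema>/<skill>/<model>-<score>.md nesting are assumptions based on the description above, not a documented interface:

```python
from pathlib import Path

# Assumption: schema folders sit at the repo root; adjust ROOT if
# the results actually live in a subdirectory.
ROOT = Path(".")

# Assumed layout, per the description above:
#   <schema>/<skill>/<model>-<score>.md
for path in sorted(ROOT.glob("*/*/*.md")):
    model, sep, score = path.stem.rpartition("-")
    if not (sep and score.isdigit()):
        continue  # skip files that don't follow the naming convention
    schema, skill = path.parts[-3], path.parts[-2]
    print(f"{schema:12} {skill:15} {model:10} {score}")
```

Run from the repo root, this prints one row per result file, which is roughly how the comparison table below can be assembled.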

Best comparison point: the cloudflare skill reviewed across 3 schemas × 3 models:

Schema    Claude  ChatGPT  Gemini
RADAR     21/30   20/30    20/30
PRISM     29/40   21/40    23/40
TDC-VRG   9/18    6/18     9/18

We want your feedback

Read a few reviews, then open an issue with your take:

  • Which review would you actually use to improve a skill?
  • Which found real problems vs. generated noise?
  • Were the suggested fixes specific enough to act on?
  • Did any schema over- or under-score?

Contributing

  • Run a review: Pick a schema + a skill + an LLM → submit the result as a PR
  • Propose a schema: Add prompts/<NAME>/prompt.md with at least one review
  • Suggest skills to review: Open an issue

See CONTRIBUTING.md for details.

License

MIT