# SKILL.md Review Schema Benchmark
Which review framework best evaluates AI agent skill files?
We designed 5 review schemas for evaluating SKILL.md files, ran them against real-world skills using multiple LLMs, and collected the results. Now we want your judgment.
This benchmark feeds into Skills Manager, a management layer for AI agent skills.
## Prompts
| Schema | Dimensions | Max score |
|---|---|---|
| RADAR | 6 | 30 |
| PRISM | 8 | 40 |
| A2E | 10 | 30 |
| TDC-VRG | 6 | 18 |
| Logic-Linter | 6 | 30 |
## Results
Each schema folder contains a subfolder for each skill reviewed. Result files are named `<model>-<score>.md`.
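For orientation, here is a sketch of that layout, assuming schema folders live under `prompts/` as described in Contributing. The skill and result filenames below are illustrative examples of the naming convention, not a listing of the actual repository:

```
prompts/
└── RADAR/
    ├── prompt.md            # the review schema itself
    └── cloudflare/          # one subfolder per reviewed skill
        ├── claude-21.md     # <model>-<score>.md
        ├── chatgpt-20.md
        └── gemini-20.md
```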
Best comparison point: the `cloudflare` skill, reviewed across 3 schemas × 3 models:
| Schema | Claude | ChatGPT | Gemini |
|---|---|---|---|
| RADAR | 21/30 | 20/30 | 20/30 |
| PRISM | 29/40 | 21/40 | 23/40 |
| TDC-VRG | 9/18 | 6/18 | 9/18 |
## We want your feedback
Read a few reviews, then open an issue with your take:
- Which review would you actually use to improve a skill?
- Which found real problems vs. generated noise?
- Were the suggested fixes specific enough to act on?
- Did any schema over- or under-score?
## Contributing
- Run a review: Pick a schema + a skill + an LLM → submit the result as a PR
- Propose a schema: Add `prompts/<NAME>/prompt.md` with at least one review
- Suggest skills to review: Open an issue
See CONTRIBUTING.md for details.
## License
MIT