A MatterSec Labs Benchmark
The best model depends on who is asking
SecLens evaluates LLMs on real-world vulnerability detection through five stakeholder lenses. Decision Scores diverge by up to 31 points for the same model.
// Key Finding
Qwen3-Coder earns an A for Head of Engineering but a D for CISO. Claude Haiku 4.5, ranked 8th on the leaderboard, scores 2nd for CISO. No single model dominates — six different models lead at least one of 8 vulnerability categories.
01 // Leaderboard
Choose your lens
Select a stakeholder role to see how model rankings shift. Same evaluation data, different priorities.
Aggregate score across all 35 dimensions with equal weighting.
| # | Model | Score | Grade | Leaderboard % | vs Leaderboard |
|---|---|---|---|---|---|
02 // Categories
No single model dominates
F1 scores by model and OWASP-aligned vulnerability category. Six different models lead at least one category.
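For reference, category F1 is the standard harmonic mean of precision and recall over a model's vulnerable/not-vulnerable verdicts. A minimal sketch (how counts are tallied per category is our assumption):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```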
03 // Analysis
Divergence and cost
Models with conservative strategies earn top grades for Engineering but fail for CISO. Spending more does not guarantee better results.
Role Divergence Index
Maximum minus minimum Decision Score across the five roles. Higher = more stakeholder-dependent.
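A minimal sketch of the index; the role names and scores below are illustrative only, not benchmark data:

```python
def role_divergence(role_scores: dict[str, float]) -> float:
    """Role Divergence Index: max minus min Decision Score across roles."""
    return max(role_scores.values()) - min(role_scores.values())

# Illustrative values: a model scoring 89 for one role and 58 for another
# has an index of 31, the widest spread cited above.
print(role_divergence({"engineering": 89, "ciso": 58, "role_c": 74,
                       "role_d": 70, "role_e": 66}))  # 31
```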
Cost vs. Quality
Cost per task vs. leaderboard score. 8 models with cost tracking.
04 // Weight Profiles
Different priorities, different outcomes
Each role weights 7 dimension categories differently. The same 35 dimensions, filtered through distinct organizational needs.
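To make this concrete, here is a hypothetical profile in the shape the benchmark describes. Every dimension name and weight below is invented for illustration; real profiles select 12-16 of the 35 dimensions under their own names:

```python
import yaml  # PyYAML

# Hypothetical CISO profile: keys and weights are illustrative only.
ciso_profile = yaml.safe_load("""
role: ciso
weights:
  detection.critical_recall: 3.0
  risk.false_negative_rate: 2.5
  robustness.post_patch_precision: 2.0
  reasoning.cwe_accuracy: 1.5
  coverage.category_breadth: 1.0
""")

print(sum(ciso_profile["weights"].values()))  # 10.0
```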
05 // Dataset
Real-world CVE tasks
Sourced from confirmed CVEs in open-source projects across 10 languages and 8 OWASP-aligned categories.
| Metric | Value | Detail |
|---|---|---|
| Total Tasks | 406 | 203 true-positive + 203 post-patch |
| Source Projects | 93 | Open-source repos with confirmed CVEs |
| Languages | 10 | Python, JS, Go, Ruby, Rust, Java, PHP, C, C++, C# |
| Categories | 8 | OWASP Top 10:2021 aligned |
| Severity Levels | 4 | Critical (25), High (74), Medium (83), Low (21) |
| Dimensions | 35 | Detection, Coverage, Reasoning, Efficiency, Tool-Use, Risk, Robustness |
06 // Methodology
From CVE tasks to role-specific grades
Step 01
Evaluate
Models tested on 406 CVE tasks in two layers: Code-in-Prompt (single-turn reasoning) and Tool-Use (sandboxed codebase navigation).
Step 02
Measure
Each task scored on verdict (1 pt), CWE classification (+1 pt), and location accuracy (up to +1 pt via IoU). 35 aggregate dimensions computed across 7 categories.
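One plausible reading of the per-task rubric, as a sketch. Treating location as a line range and gating the follow-on points on a correct verdict are our assumptions:

```python
def line_iou(pred: tuple[int, int], truth: tuple[int, int]) -> float:
    """IoU of two inclusive line ranges (start, end)."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - inter
    return inter / union

def task_score(verdict_ok: bool, cwe_ok: bool, pred, truth) -> float:
    """Verdict 1 pt, CWE +1 pt, location up to +1 pt via IoU (max 3)."""
    score = 1.0 if verdict_ok else 0.0
    if verdict_ok and cwe_ok:
        score += 1.0
    if verdict_ok and pred and truth:
        score += line_iou(pred, truth)
    return score
```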
Step 03
Normalize
Dimensions normalized to [0,1] using four strategies: ratio, MCC, lower-is-better, higher-is-better. Fixed reference caps eliminate cohort artifacts.
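A minimal sketch of the four strategies, assuming each higher- or lower-is-better dimension carries a fixed reference cap (the per-dimension caps themselves are not published here):

```python
def normalize(value: float, strategy: str, cap: float = 1.0) -> float:
    """Map a raw dimension value into [0, 1] against a fixed cap."""
    if strategy == "ratio":            # already a proportion in [0, 1]
        return min(max(value, 0.0), 1.0)
    if strategy == "mcc":              # Matthews correlation, [-1, 1] -> [0, 1]
        return (value + 1.0) / 2.0
    if strategy == "higher_is_better": # e.g. a recall-like raw value
        return min(value / cap, 1.0)
    if strategy == "lower_is_better":  # e.g. latency or cost
        return max(1.0 - value / cap, 0.0)
    raise ValueError(f"unknown strategy: {strategy}")
```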
Step 04
Score per Role
Five YAML weight profiles select 12-16 dimensions each. Decision Score = weighted sum / available weight × 100, yielding grades A through F.
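A sketch of the scoring step. Renormalizing by available weight handles models that lack a dimension (e.g. no cost tracking); the grade cutoffs below are hypothetical, since the benchmark's thresholds are not stated here:

```python
def decision_score(dims: dict[str, float], weights: dict[str, float]) -> float:
    """dims: dimension -> normalized [0, 1] value (absent if unavailable).
    weights: the role profile. Score = weighted sum / available weight * 100."""
    avail = {d: w for d, w in weights.items() if d in dims}
    total = sum(avail.values())
    return 100.0 * sum(dims[d] * w for d, w in avail.items()) / total if total else 0.0

def grade(score: float) -> str:
    # Hypothetical A-F cutoffs for illustration.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"
```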
07 // Vulnerability Categories
OWASP-aligned coverage
| Category | Tasks | OWASP | Leader (F1) | Worst (F1) |
|---|---|---|---|---|
| Broken Access Control | 82 | A01:2021 | Kimi K2.5 (0.667) | Qwen3-Coder (0.128) |
| Cryptographic Failures | 64 | A02:2021 | Gemini 3 Flash (0.676) | Qwen3-Coder (0.118) |
| Injection | 62 | A03:2021 | Gemini 3.1 Pro (0.632) | Qwen3-Coder (0.062) |
| Improper Input Validation | 58 | Extended | Haiku 4.5 (0.675) | Qwen3-Coder (0.125) |
| SSRF | 46 | A10:2021 | Sonnet 4.6 (0.690) | Qwen3-Coder (0.512) |
| Authentication Failures | 38 | A07:2021 | Kimi K2.5 (0.585) | Opus 4.6 (0.000) |
| Data Integrity Failures | 36 | A08:2021 | Gemini 3 Flash (0.680) | Qwen3-Coder (0.200) |
| Memory Safety | 20 | Extended | Haiku 4.5 (0.690) | Qwen3-Coder (0.308) |