SecLens — Role-Specific Evaluation of LLMs for Security Vulnerability Detection


A MatterSec Labs Benchmark

The best model depends on who is asking

SecLens evaluates LLMs on real-world vulnerability detection through five stakeholder lenses. Decision Scores diverge by up to 31 points for the same model.

arXiv cs.CR · License: MIT · Paper: CC BY 4.0

// Key Finding

Qwen3-Coder earns an A for Head of Engineering but a D for CISO. Claude Haiku 4.5, ranked 8th on the leaderboard, scores 2nd for CISO. No single model dominates — six different models lead at least one of 8 vulnerability categories.

01 // Leaderboard

Choose your lens

Select a stakeholder role to see how model rankings shift. Same evaluation data, different priorities.

The default leaderboard (LB) score aggregates all 35 dimensions with equal weighting.
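
For intuition, that equal-weighted aggregate can be read as a plain mean of the normalized dimension values. The sketch below is illustrative only; the function and dimension names are not the benchmark's code.

```python
# Illustrative sketch: the leaderboard (LB) score as the equal-weighted
# mean of the 35 normalized dimensions, scaled to 0-100.
from statistics import mean

def leaderboard_score(dimensions: dict[str, float]) -> float:
    """`dimensions` maps each dimension name to its normalized value in [0, 1]."""
    return 100.0 * mean(dimensions.values())

# Hypothetical dimension values for brevity (a real run has 35 of them).
print(leaderboard_score({"detection_f1": 0.62, "cwe_accuracy": 0.48, "tool_use": 0.71}))
```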

Columns: # (rank), Model, Score, Grade, LB, % vs LB.

02 // Categories

No single model dominates

F1 scores by model and OWASP-aligned vulnerability category. Six different models lead at least one category.

03 // Analysis

Divergence and cost

Models with conservative strategies earn top grades for Engineering but fail for CISO. Spending more does not guarantee better results.

Role Divergence Index

Maximum minus minimum Decision Score across the five roles. Higher means more stakeholder-dependent.
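
A minimal sketch of that index, assuming it is computed directly from the five per-role Decision Scores:

```python
# Role Divergence Index: spread of one model's Decision Scores across roles.
def role_divergence_index(decision_scores: list[float]) -> float:
    """`decision_scores` holds one model's five per-role scores (0-100)."""
    return max(decision_scores) - min(decision_scores)

# Hypothetical per-role scores showing a 31-point spread, the largest reported.
print(role_divergence_index([89.0, 74.0, 70.0, 66.0, 58.0]))  # -> 31.0
```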

Cost vs. Quality

Cost per task vs. leaderboard score. 8 models with cost tracking.

04 // Weight Profiles

Different priorities, different outcomes

Each role weights 7 dimension categories differently. The same 35 dimensions, filtered through distinct organizational needs.

05 // Dataset

Real-world CVE tasks

Sourced from confirmed CVEs in open-source projects across 10 languages and 8 OWASP-aligned categories.

406 Total Tasks: 203 true positive + 203 post-patch

93 Source Projects: open-source repos with confirmed CVEs

10 Languages: Python, JS, Go, Ruby, Rust, Java, PHP, C, C++, C#

8 Categories: aligned with OWASP Top 10:2021

4 Severity Levels: Critical (25), High (74), Medium (83), Low (21)

35 Dimensions: Detection, Coverage, Reasoning, Efficiency, Tool-Use, Risk, Robustness

06 // Methodology

From CVE tasks to role-specific grades

Step 01 // Evaluate

Models tested on 406 CVE tasks in two layers: Code-in-Prompt (single-turn reasoning) and Tool-Use (sandboxed codebase navigation).

Step 02 // Measure

Each task scored on verdict (1pt), CWE classification (+1pt), and location accuracy (+1pt IoU). 35 aggregate dimensions computed across 7 categories.
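
A rough sketch of that per-task rubric follows; the gating between credits and the IoU details are assumptions, not the published scoring code.

```python
# Sketch of per-task credit (up to 3 points): 1 for the right verdict,
# +1 for the right CWE class, +1 scaled by line-range IoU for location.
# Whether CWE/location credit requires a correct verdict is an assumption.
def iou(pred: tuple[int, int], gold: tuple[int, int]) -> float:
    """Intersection-over-union of two inclusive line ranges."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gold[1] - gold[0] + 1) - inter
    return inter / union if union else 0.0

def task_score(verdict_ok: bool, cwe_ok: bool,
               pred_lines: tuple[int, int] | None,
               gold_lines: tuple[int, int] | None) -> float:
    score = 1.0 if verdict_ok else 0.0
    if verdict_ok and cwe_ok:
        score += 1.0
    if verdict_ok and pred_lines and gold_lines:
        score += iou(pred_lines, gold_lines)
    return score

print(task_score(True, True, (120, 135), (118, 140)))  # ~2.7
```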

Step 03 // Normalize

Dimensions normalized to [0,1] using four strategies: ratio, MCC, lower-is-better, higher-is-better. Fixed reference caps eliminate cohort artifacts.
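
One way those four strategies might look is sketched below; the cap values, clipping, and MCC rescaling are illustrative assumptions.

```python
# Illustrative normalization into [0, 1] against fixed reference caps,
# so scores do not shift when the evaluated model cohort changes.
def normalize(value: float, strategy: str, cap: float = 1.0) -> float:
    if strategy == "ratio":             # already a fraction of a known total
        return min(max(value, 0.0), 1.0)
    if strategy == "mcc":               # Matthews corr. coef. in [-1, 1] -> [0, 1]
        return (value + 1.0) / 2.0
    if strategy == "lower_is_better":   # e.g. cost or latency, capped then inverted
        return 1.0 - min(value, cap) / cap
    if strategy == "higher_is_better":  # raw score capped at a fixed reference
        return min(value, cap) / cap
    raise ValueError(f"unknown strategy: {strategy}")

print(normalize(0.42, "mcc"), normalize(0.08, "lower_is_better", cap=0.50))
```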

Step 04 // Score per Role

Five YAML weight profiles select 12-16 dimensions each. Decision Score = weighted sum / available weight × 100, yielding grades A through F.
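
The role-scoring step could look roughly like the sketch below; the profile layout and dimension names are hypothetical stand-ins for the published YAML profiles.

```python
# Sketch of Decision Score = weighted sum / available weight * 100.
# The profile structure and dimension names here are hypothetical.
import yaml

PROFILE = yaml.safe_load("""
role: ciso
weights:
  risk_calibration: 3.0
  false_negative_rate: 2.5
  detection_f1: 2.0
""")

def decision_score(dimensions: dict[str, float], profile: dict) -> float:
    """Weights whose dimension was not measured are excluded from the
    available weight, so a missing dimension does not zero out the score."""
    weights = profile["weights"]
    available = {k: w for k, w in weights.items() if k in dimensions}
    total = sum(available.values())
    weighted = sum(dimensions[k] * w for k, w in available.items())
    return 100.0 * weighted / total if total else 0.0

# Hypothetical normalized dimension values for one model.
print(decision_score({"risk_calibration": 0.55, "detection_f1": 0.61}, PROFILE))
```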

07 // Vulnerability Categories

OWASP-aligned coverage

| Category | Tasks | OWASP | Leader (F1) | Worst (F1) |
| --- | --- | --- | --- | --- |
| Broken Access Control | 82 | A01:2021 | Kimi K2.5 (0.667) | Qwen3-Coder (0.128) |
| Cryptographic Failures | 64 | A02:2021 | Gemini 3 Flash (0.676) | Qwen3-Coder (0.118) |
| Injection | 62 | A03:2021 | Gemini 3.1 Pro (0.632) | Qwen3-Coder (0.062) |
| Improper Input Validation | 58 | Extended | Haiku 4.5 (0.675) | Qwen3-Coder (0.125) |
| SSRF | 46 | A10:2021 | Sonnet 4.6 (0.690) | Qwen3-Coder (0.512) |
| Authentication Failures | 38 | A07:2021 | Kimi K2.5 (0.585) | Opus 4.6 (0.000) |
| Data Integrity Failures | 36 | A08:2021 | Gemini 3 Flash (0.680) | Qwen3-Coder (0.200) |
| Memory Safety | 20 | Extended | Haiku 4.5 (0.690) | Qwen3-Coder (0.308) |

Open Source

Star the repo to follow updates and new model evaluations.
