A MatterSec Labs Benchmark
The best model depends on who is asking
SecLens evaluates LLMs on real-world vulnerability detection through five stakeholder lenses. Decision Scores diverge by up to 31 points for the same model.
// Key Finding
Qwen3-Coder earns an A for Head of Engineering but a D for CISO. Claude Haiku 4.5, ranked 8th on the leaderboard, scores 2nd for CISO. No single model dominates — six different models lead at least one of 8 vulnerability categories.
01 // Leaderboard
Choose your lens
Select a stakeholder role to see how model rankings shift. Same evaluation data, different priorities.
Aggregate score across all 35 dimensions with equal weighting.
| # | Model | Score | Grade | Leaderboard % | vs Leaderboard |
|---|---|---|---|---|---|
02 // Categories
No single model dominates
F1 scores by model and OWASP-aligned vulnerability category. Six different models lead at least one category.
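For reference, category F1 is the standard harmonic mean of precision and recall over a model's vulnerable/not-vulnerable verdicts. A minimal sketch (how counts are tallied per category is our assumption):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```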
03 // Analysis
Divergence and cost
Models with conservative strategies earn top grades for Engineering but fail for CISO. Spending more does not guarantee better results.
Role Divergence Index
Maximum minus minimum Decision Score across the five roles. Higher = more stakeholder-dependent.
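A minimal sketch of the index; the role names and scores below are illustrative only, not benchmark data:

```python
def role_divergence(role_scores: dict[str, float]) -> float:
    """Role Divergence Index: max minus min Decision Score across roles."""
    return max(role_scores.values()) - min(role_scores.values())

# Illustrative values: a model scoring 89 for one role and 58 for another
# has an index of 31, the widest spread cited above.
print(role_divergence({"engineering": 89, "ciso": 58, "role_c": 74,
                       "role_d": 70, "role_e": 66}))  # 31
```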
Cost vs. Quality
Cost per task vs. leaderboard score. 8 models with cost tracking.
04 // Weight Profiles
Different priorities, different outcomes
Each role weights 7 dimension categories differently. The same 35 dimensions, filtered through distinct organizational needs.
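To make this concrete, here is a hypothetical profile in the shape the benchmark describes. Every dimension name and weight below is invented for illustration; real profiles select 12-16 of the 35 dimensions under their own names:

```python
import yaml  # PyYAML

# Hypothetical CISO profile: keys and weights are illustrative only.
ciso_profile = yaml.safe_load("""
role: ciso
weights:
  detection.critical_recall: 3.0
  risk.false_negative_rate: 2.5
  robustness.post_patch_precision: 2.0
  reasoning.cwe_accuracy: 1.5
  coverage.category_breadth: 1.0
""")

print(sum(ciso_profile["weights"].values()))  # 10.0
```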
05 // Dataset
Real-world CVE tasks
Sourced from confirmed CVEs in open-source projects across 10 languages and 8 OWASP-aligned categories.
| Metric | Value | Detail |
|---|---|---|
| Total Tasks | 406 | 203 true-positive + 203 post-patch |
| Source Projects | 93 | Open-source repos with confirmed CVEs |
| Languages | 10 | Python, JS, Go, Ruby, Rust, Java, PHP, C, C++, C# |
| Categories | 8 | OWASP Top 10:2021 aligned |
| Severity Levels | 4 | Critical (25), High (74), Medium (83), Low (21) |
| Dimensions | 35 | Detection, Coverage, Reasoning, Efficiency, Tool-Use, Risk, Robustness |
06 // Methodology
From CVE tasks to role-specific grades
Step 01
Evaluate
Models tested on 406 CVE tasks in two layers: Code-in-Prompt (single-turn reasoning) and Tool-Use (sandboxed codebase navigation).
Step 02
Measure
Each task scored on verdict (1 pt), CWE classification (+1 pt), and location accuracy (up to +1 pt via IoU). 35 aggregate dimensions computed across 7 categories.
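One plausible reading of the per-task rubric, as a sketch. Treating location as a line range and gating the follow-on points on a correct verdict are our assumptions:

```python
def line_iou(pred: tuple[int, int], truth: tuple[int, int]) -> float:
    """IoU of two inclusive line ranges (start, end)."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - inter
    return inter / union

def task_score(verdict_ok: bool, cwe_ok: bool, pred, truth) -> float:
    """Verdict 1 pt, CWE +1 pt, location up to +1 pt via IoU (max 3)."""
    score = 1.0 if verdict_ok else 0.0
    if verdict_ok and cwe_ok:
        score += 1.0
    if verdict_ok and pred and truth:
        score += line_iou(pred, truth)
    return score
```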
Step 03
Normalize
Dimensions normalized to [0,1] using four strategies: ratio, MCC, lower-is-better, higher-is-better. Fixed reference caps eliminate cohort artifacts.
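A minimal sketch of the four strategies, assuming each higher- or lower-is-better dimension carries a fixed reference cap (the per-dimension caps themselves are not published here):

```python
def normalize(value: float, strategy: str, cap: float = 1.0) -> float:
    """Map a raw dimension value into [0, 1] against a fixed cap."""
    if strategy == "ratio":            # already a proportion in [0, 1]
        return min(max(value, 0.0), 1.0)
    if strategy == "mcc":              # Matthews correlation, [-1, 1] -> [0, 1]
        return (value + 1.0) / 2.0
    if strategy == "higher_is_better": # e.g. a recall-like raw value
        return min(value / cap, 1.0)
    if strategy == "lower_is_better":  # e.g. latency or cost
        return max(1.0 - value / cap, 0.0)
    raise ValueError(f"unknown strategy: {strategy}")
```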
Step 04
Score per Role
Five YAML weight profiles select 12-16 dimensions each. Decision Score = weighted sum / available weight × 100, yielding grades A through F.
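A sketch of the scoring step. Renormalizing by available weight handles models that lack a dimension (e.g. no cost tracking); the grade cutoffs below are hypothetical, since the benchmark's thresholds are not stated here:

```python
def decision_score(dims: dict[str, float], weights: dict[str, float]) -> float:
    """dims: dimension -> normalized [0, 1] value (absent if unavailable).
    weights: the role profile. Score = weighted sum / available weight * 100."""
    avail = {d: w for d, w in weights.items() if d in dims}
    total = sum(avail.values())
    return 100.0 * sum(dims[d] * w for d, w in avail.items()) / total if total else 0.0

def grade(score: float) -> str:
    # Hypothetical A-F cutoffs for illustration.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"
```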
07 // Vulnerability Categories
OWASP-aligned coverage
| Category | Tasks | OWASP | Leader (F1) | Worst (F1) |
|---|---|---|---|---|
| Broken Access Control | 82 | A01:2021 | Kimi K2.5 (0.667) | Qwen3-Coder (0.128) |
| Cryptographic Failures | 64 | A02:2021 | Gemini 3 Flash (0.676) | Qwen3-Coder (0.118) |
| Injection | 62 | A03:2021 | Gemini 3.1 Pro (0.632) | Qwen3-Coder (0.062) |
| Improper Input Validation | 58 | Extended | Haiku 4.5 (0.675) | Qwen3-Coder (0.125) |
| SSRF | 46 | A10:2021 | Sonnet 4.6 (0.690) | Qwen3-Coder (0.512) |
| Authentication Failures | 38 | A07:2021 | Kimi K2.5 (0.585) | Opus 4.6 (0.000) |
| Data Integrity Failures | 36 | A08:2021 | Gemini 3 Flash (0.680) | Qwen3-Coder (0.200) |
| Memory Safety | 20 | Extended | Haiku 4.5 (0.690) | Qwen3-Coder (0.308) |