RSA Conference 2026 / Ghost Security

Can LLMs Fix Real-World Security Vulnerabilities?

We tested 16 frontier language models on real CVEs from the GitHub Advisory Database. The best model fixes 22.5% of the curated VulnBench-200 subset; the median model patches only 3.4% of the full 1,650-instance benchmark.

Models Tested: 16
Full Benchmark CVEs: 1,650
Curated Subset: 200
CWE Types: 55
Ecosystems: 7

Key Findings

What We Learned

With description-only file hints and best-of-3 variance reduction on the curated subset, even the best model patches fewer than 1 in 4 real vulnerabilities. On the full 1,650-instance benchmark, run as a single pass, the best model reaches 16.6% and the median model 3.4%.

16.6%

Best on Full 1,650

Claude Opus 4.6 leads the full benchmark at 16.6% (273/1650), narrowly ahead of GPT-5.3 Codex at 16.4% (271/1650) — a ranking reversal from the curated subset.

22.5%

Best on Curated 200

GPT-5.3 Codex leads VulnBench-200 (best-of-3) at 22.5%, followed by GPT-5.4 at 18.5% and Claude Opus 4.6 at 16.0%.

3.4%

Full Benchmark Median

The median model fixes only 3.4% of the full 1,650-instance benchmark — roughly 1 in 30 real vulnerabilities.

0.417

Highest Full Benchmark Score

Claude Opus 4.6 achieves the highest mean judge score on the full benchmark, indicating consistently higher patch quality at scale.

Rankings Shift

GPT-5.4 drops from 2nd on the curated subset (18.5%) to a tie for 10th on the full benchmark (3.4%). Claude Sonnet 4.6 and Gemini 3 Flash scale better, moving from 5th to 3rd and from 6th to 4th respectively.

9.7%

Best Value Model

Gemini 3 Flash achieves 9.7% on the full benchmark (160/1650). On the curated leaderboard it is the cheapest of the top ten models ($3.13) and the fastest overall (5.1s average), a strong result for a lightweight model.

Results

VulnBench-200 Leaderboard

All models evaluated on the same 200 CVE instances using best-of-3 runs with description-only file hints. Judge: Claude Opus 4.6.

VulnBench-200 leaderboard: curated 200-instance subset, best-of-3 runs
#	Model	Vendor	Pass Rate	Score	Passed	Avg Time	Cost
1	GPT-5.3 Codex	OpenAI	22.5%	0.468	45/200	43.4s	$8.74
2	GPT-5.4	OpenAI	18.5%	0.407	37/200	7.3s	$4.81
3	Claude Opus 4.6	Anthropic	16.0%	0.404	32/200	19.6s	$10.17
4	GPT-5.2	OpenAI	15.0%	0.322	30/200	75.2s	$11.30
5	Claude Sonnet 4.6	Anthropic	10.5%	0.322	21/200	16.0s	$6.87
6	Gemini 3 Flash	Google	7.5%	0.318	15/200	5.1s	$3.13
7	GLM-5	Zhipu AI	7.0%	0.249	14/200	91.9s	$4.26
8	Kimi K2.5	Moonshot AI	6.5%	0.228	13/200	65.7s	$3.47
9	Grok 4.1 Fast	xAI	5.5%	0.273	11/200	61.1s	$3.46
10	GPT-5 Mini	OpenAI	5.0%	0.275	10/200	25.4s	$3.63
11	DeepSeek V3.2	DeepSeek	4.5%	0.253	9/200	78.7s	$3.25
12	Claude Haiku 4.5	Anthropic	3.5%	0.263	7/200	7.0s	$3.95
13	Gemini 3.1 Pro	Google	2.5%	0.093	5/200	44.4s	$9.60
14	MiniMax M2.5	MiniMax	1.5%	0.181	3/200	45.0s	$3.25
14	MiniMax M2.7	MiniMax	1.5%	0.099	3/200	22.8s	$1.74
16	Step 3.5 Flash	StepFun	0.0%	0.000	0/200	44.0s	$0.00
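The leaderboard above reports best-of-3 results, but the page does not spell out how three runs collapse into one result per instance. The sketch below assumes an instance counts as passed if any of its runs passed, and that the kept score is the one from the best run. `RunResult` and `best_of_n` are illustrative names, not part of VulnBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One judged attempt at one instance (hypothetical record shape)."""
    instance_id: str
    passed: bool
    score: float  # judge score in [0.0, 1.0]


def best_of_n(runs: list[RunResult]) -> dict[str, RunResult]:
    """Keep the best run per instance: a pass beats a fail, then higher score wins."""
    best: dict[str, RunResult] = {}
    for run in runs:
        current = best.get(run.instance_id)
        if current is None or (run.passed, run.score) > (current.passed, current.score):
            best[run.instance_id] = run
    return best


def pass_rate(best: dict[str, RunResult]) -> float:
    """Fraction of instances whose best run passed."""
    return sum(r.passed for r in best.values()) / len(best)


# Two made-up instances with three runs each.
runs = [
    RunResult("inst-a", False, 0.2),
    RunResult("inst-a", True, 0.9),
    RunResult("inst-a", False, 0.4),
    RunResult("inst-b", False, 0.1),
    RunResult("inst-b", False, 0.3),
    RunResult("inst-b", False, 0.0),
]
best = best_of_n(runs)
print(pass_rate(best))
```

Under this assumption a single lucky run is enough to pass an instance, which is one reason a best-of-3 pass rate is not directly comparable to a single-pass rate.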

Results

VulnBench-1650 Leaderboard

The full benchmark: every model run once against all 1,650 instances. Rankings shift at scale — Claude Opus 4.6 takes the lead. Judge: Claude Opus 4.6.

VulnBench-1650 leaderboard: full 1,650-instance benchmark, single pass
#	Model	Vendor	Pass Rate	Score	Passed	Avg Time
1	Claude Opus 4.6	Anthropic	16.6%	0.417	273/1650	19.3s
2	GPT-5.3 Codex	OpenAI	16.4%	0.301	271/1650	23.4s
3	Claude Sonnet 4.6	Anthropic	12.2%	0.335	201/1650	16.5s
4	Gemini 3 Flash	Google	9.7%	0.311	160/1650	8.8s
5	GPT-5.2	OpenAI	6.7%	0.129	111/1650	26.1s
6	GLM-5	Zhipu AI	5.9%	0.219	98/1650	90.6s
7	GPT-5 Mini	OpenAI	5.1%	0.246	85/1650	18.6s
8	Claude Haiku 4.5	Anthropic	4.9%	0.259	80/1650	6.5s
9	Gemini 3.1 Pro	Google	3.5%	0.097	58/1650	56.1s
10	GPT-5.4	OpenAI	3.4%	0.092	56/1650	4.5s
10	Kimi K2.5	Moonshot AI	3.4%	0.102	56/1650	25.7s
12	MiniMax M2.7	MiniMax	2.2%	0.114	36/1650	24.2s
13	MiniMax M2.5	MiniMax	2.1%	0.092	35/1650	23.7s
14	Grok 4.1 Fast	xAI	1.9%	0.108	31/1650	25.6s
15	DeepSeek V3.2	DeepSeek	1.7%	0.094	28/1650	25.2s
16	Step 3.5 Flash	StepFun	0.1%	0.001	2/1650	42.3s
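To make the ranking shift between the two leaderboards concrete, the following sketch recomputes each model's rank from the published pass rates and reports how far it moved. The pass rates are copied from the two tables. The helper name `competition_rank` and the tie handling (tied models share a rank, as in the tables) are choices made for this illustration.

```python
# Pass rates (%) copied from the two leaderboards.
curated = {
    "GPT-5.3 Codex": 22.5, "GPT-5.4": 18.5, "Claude Opus 4.6": 16.0,
    "GPT-5.2": 15.0, "Claude Sonnet 4.6": 10.5, "Gemini 3 Flash": 7.5,
    "GLM-5": 7.0, "Kimi K2.5": 6.5, "Grok 4.1 Fast": 5.5,
    "GPT-5 Mini": 5.0, "DeepSeek V3.2": 4.5, "Claude Haiku 4.5": 3.5,
    "Gemini 3.1 Pro": 2.5, "MiniMax M2.5": 1.5, "MiniMax M2.7": 1.5,
    "Step 3.5 Flash": 0.0,
}
full = {
    "Claude Opus 4.6": 16.6, "GPT-5.3 Codex": 16.4, "Claude Sonnet 4.6": 12.2,
    "Gemini 3 Flash": 9.7, "GPT-5.2": 6.7, "GLM-5": 5.9,
    "GPT-5 Mini": 5.1, "Claude Haiku 4.5": 4.9, "Gemini 3.1 Pro": 3.5,
    "GPT-5.4": 3.4, "Kimi K2.5": 3.4, "MiniMax M2.7": 2.2,
    "MiniMax M2.5": 2.1, "Grok 4.1 Fast": 1.9, "DeepSeek V3.2": 1.7,
    "Step 3.5 Flash": 0.1,
}


def competition_rank(rates: dict[str, float]) -> dict[str, int]:
    """Rank 1 is the highest rate; tied models share the better rank."""
    ordered = sorted(rates.values(), reverse=True)
    return {model: ordered.index(rate) + 1 for model, rate in rates.items()}


curated_rank = competition_rank(curated)
full_rank = competition_rank(full)

# Positive shift means the model ranks higher on the full benchmark.
shift = {m: curated_rank[m] - full_rank[m] for m in curated}

for model, delta in sorted(shift.items(), key=lambda item: item[1]):
    print(f"{model:<18} {curated_rank[model]:>2} -> {full_rank[model]:>2} ({delta:+d})")
```

Keep in mind that the two leaderboards differ in protocol as well as size (best-of-3 versus single pass), so a rank change reflects both.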

Difficulty Tiers

Vulnerability Difficulty Classification

VulnBench instances are classified into three difficulty tiers (67 / 67 / 66 in the curated subset).

Tier 1 — Pattern

67 instances

XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.

Tier 2 — Logic

67 instances

Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.

Tier 3 — Deep

66 instances

Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
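For a sense of what the Tier 1 fix patterns look like in practice (escape output, parameterize queries, sanitize paths), here is a minimal Python sketch using only the standard library. The function names, the `users` table, and the upload directory are hypothetical examples, not taken from any benchmark instance.

```python
import html
import sqlite3
from pathlib import Path


def render_comment(comment: str) -> str:
    """CWE-79: escape untrusted text before placing it in HTML."""
    return f"<p>{html.escape(comment)}</p>"


def find_user(conn: sqlite3.Connection, name: str):
    """CWE-89: bind the value as a parameter instead of formatting it into SQL."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchone()


def resolve_upload(base_dir: Path, filename: str) -> Path:
    """CWE-22: resolve the final path and reject anything outside base_dir."""
    base = base_dir.resolve()
    target = (base / filename).resolve()
    if not target.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes upload directory: {filename}")
    return target


# Small demonstration against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

print(render_comment("<script>alert(1)</script>"))
print(find_user(conn, "alice"))
print(find_user(conn, "alice' OR '1'='1"))  # treated as a literal name, no match
```

Tier 2 and Tier 3 fixes rarely reduce to a one-line substitution like these, which is consistent with the tier descriptions above.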

Methodology

How VulnBench Works

Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.
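As a rough illustration of how those inputs might be combined into a single prompt, consider the sketch below. VulnBench's actual prompt template is not shown on this page, so the `Instance` fields, the section labels, and the placeholder CVE ID are all assumptions. The ground-truth fix is deliberately absent from the record, since it is reserved for the judge.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical shape of one benchmark instance as seen by the model."""
    cve_id: str
    description: str
    cwe_id: str
    cwe_guidance: str
    files: dict[str, str] = field(default_factory=dict)  # path -> vulnerable source


def build_prompt(inst: Instance) -> str:
    """Join the description, CWE guidance, and affected files into one prompt."""
    parts = [
        f"Vulnerability: {inst.cve_id} ({inst.cwe_id})",
        f"Description:\n{inst.description}",
        f"Guidance:\n{inst.cwe_guidance}",
    ]
    for path, source in inst.files.items():
        parts.append(f"File: {path}\n{source}")
    parts.append("Respond with a unified diff that fixes the vulnerability.")
    return "\n\n".join(parts)


# Placeholder values only; this is not a real advisory.
example = Instance(
    cve_id="CVE-0000-0000",
    description="User-supplied comments are rendered without escaping.",
    cwe_id="CWE-79",
    cwe_guidance="Escape untrusted data before inserting it into HTML.",
    files={"app/views.py": "def render(c):\n    return '<p>' + c + '</p>'\n"},
)
prompt = build_prompt(example)
print(prompt)
```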

Collect Real CVEs

10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.

Build Instances

Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.

Prompt Models

Each model receives the CVE description, CWE guidance, affected files, and vulnerable source, and must generate a unified diff patch.

Judge Patches

Claude Opus 4.6 compares the candidate against the reference on three criteria: root cause addressed, no new vulnerabilities, correct scope. It assigns a score from 0.0 to 1.0.
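Models are asked to answer with a unified diff. For readers unfamiliar with the format, this sketch produces one from a before/after pair using Python's standard `difflib` module. The file path and both code snippets are invented for illustration, and the benchmark's own ground-truth diffs are extracted from fix commits rather than generated this way.

```python
import difflib

# Hypothetical vulnerable and fixed versions of the same file.
vulnerable = (
    "def render_comment(comment):\n"
    "    return '<p>' + comment + '</p>'\n"
)
fixed = (
    "def render_comment(comment):\n"
    "    return '<p>' + html.escape(comment) + '</p>'\n"
)


def make_patch(before: str, after: str, path: str) -> str:
    """Return a unified diff that turns `before` into `after`."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


patch = make_patch(vulnerable, fixed, "app/views.py")
print(patch)
```

The output begins with `---` and `+++` file headers, followed by an `@@` hunk header and the changed lines prefixed with `-` and `+`.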

Dataset

VulnBench at a Glance

Curated instances are drawn from unique GitHub repositories and balanced across difficulty tiers.

Full Benchmark: 1,650
Curated Subset: 200
CWE Types: 55
Ecosystems: 7
Mean Lines: 36
Mean Files: 1.9

Severity Distribution

Critical 21

High 42

Medium 137

Top CWEs

CWE-79 (XSS) 38

CWE-22 (Path Trav) 25

CWE-400 (DoS) 25

CWE-20 (Input Val) 23

CWE-94 (Code Inj) 19

Ecosystems

npm 134

pip 54

Maven 5

RubyGems 3

Composer 2

Rust 1

Swift 1

CVE Year Range

2026 16

2025 55

2024 48

2023 26

2022 24

Earlier 31
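As a quick consistency check on the breakdowns above, the sketch below confirms that the severity, ecosystem, year, and tier counts each account for all 200 curated instances, and converts them to percentage shares. The counts are copied from this page; the `shares` helper is just for illustration. The Top CWEs list is left out because it covers only the five most common types.

```python
CURATED_TOTAL = 200

breakdowns = {
    "severity": {"Critical": 21, "High": 42, "Medium": 137},
    "ecosystem": {
        "npm": 134, "pip": 54, "Maven": 5, "RubyGems": 3,
        "Composer": 2, "Rust": 1, "Swift": 1,
    },
    "year": {
        "2026": 16, "2025": 55, "2024": 48,
        "2023": 26, "2022": 24, "Earlier": 31,
    },
    "tier": {"Tier 1": 67, "Tier 2": 67, "Tier 3": 66},
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts to percentage shares, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for name, counts in breakdowns.items():
    # Each breakdown should partition the full curated subset.
    assert sum(counts.values()) == CURATED_TOTAL, name
    print(name, shares(counts))
```

The shares make the skew easy to see: about two thirds of the curated instances are npm packages and about two thirds are Medium severity.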

Run the Benchmark Yourself

VulnBench is fully open source and reproducible. Evaluate any LLM with a single command; the example below runs the full 1,650-instance benchmark.

$ vulnbench run \
    --model claude-opus-4.6 \
    --subset vulnbench-full \
    --judge claude-opus-4.6

  [████████████████████░░░░] 1354/1650

  Model      Claude Opus 4.6
  Pass Rate  16.6%
  Score      0.417
  Passed     273/1650
  Time       19.3s

Built with data from the GitHub Advisory Database and NIST NVD