RSA Conference 2026 / Ghost Security

Can LLMs Fix Real-World Security Vulnerabilities?

We tested 16 frontier language models on real CVEs from the GitHub Advisory Database. The best model fixes 22.5% of the curated VulnBench-200 subset; on the full 1,650-instance benchmark, the median model patches only about 4.2%.

Models Tested: 16
Full Benchmark CVEs: 1,650
Curated Subset: 200
CWE Types: 55
Ecosystems: 7

Key Findings

What We Learned

With description-only file hints and best-of-3 variance reduction on the curated subset, even the best model patches fewer than 1 in 4 real vulnerabilities — and performance collapses across the full 1,650-instance benchmark.

16.6%
Best on Full 1,650

Claude Opus 4.6 leads the full benchmark at 16.6% (273/1650), narrowly ahead of GPT-5.3 Codex at 16.4% (271/1650) — a ranking reversal from the curated subset.

22.5%
Best on Curated 200

GPT-5.3 Codex leads VulnBench-200 (best-of-3) at 22.5%, followed by GPT-5.4 at 18.5% and Claude Opus 4.6 at 16.0%.

4.2%
Full Benchmark Median

The median model fixes only about 4.2% of the full 1,650-instance benchmark (midway between the 8th- and 9th-ranked models at 4.9% and 3.5%), roughly 1 in 24 real vulnerabilities.

0.417
Highest Full Benchmark Score

Claude Opus 4.6 achieves the highest mean judge score on the full benchmark, indicating consistently higher patch quality at scale.

Rankings Shift

GPT-5.4 drops from 2nd on the curated subset (18.5%) to a tie for 10th on the full benchmark (3.4%). Claude Sonnet 4.6 (10.5% to 12.2%) and Gemini 3 Flash (7.5% to 9.7%) hold up better, each moving up two places.
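
The ranking movement between the two leaderboards can be tabulated directly. The sketch below is a minimal illustration: the ranks are transcribed from the leaderboards on this page for the six models that lead the curated subset, and `rank_shift` is a hypothetical helper, not part of VulnBench.

```python
# Ranks transcribed from the two leaderboards on this page.
# Curated = VulnBench-200 (best-of-3); full = VulnBench-1650 (single pass).
CURATED_RANK = {
    "GPT-5.3 Codex": 1,
    "GPT-5.4": 2,
    "Claude Opus 4.6": 3,
    "GPT-5.2": 4,
    "Claude Sonnet 4.6": 5,
    "Gemini 3 Flash": 6,
}
FULL_RANK = {
    "Claude Opus 4.6": 1,
    "GPT-5.3 Codex": 2,
    "Claude Sonnet 4.6": 3,
    "Gemini 3 Flash": 4,
    "GPT-5.2": 5,
    "GPT-5.4": 10,  # tied with Kimi K2.5
}


def rank_shift(model: str) -> int:
    """Places gained on the full benchmark (negative means places lost)."""
    return CURATED_RANK[model] - FULL_RANK[model]


for name in CURATED_RANK:
    print(f"{name:18s} {CURATED_RANK[name]:>2} -> {FULL_RANK[name]:>2}  ({rank_shift(name):+d})")
```

Because the curated results use best-of-3 and the full results use a single pass, the shifts mix a change of dataset with a change of protocol and should be read with that in mind.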

9.7%
Best Value Model

Gemini 3 Flash achieves 9.7% on the full benchmark (160/1650) at the lowest cost per instance — strong performance for a lightweight model.

Results

VulnBench-200 Leaderboard

All models evaluated on the same 200 CVE instances using best-of-3 runs with description-only file hints. Judge: Claude Opus 4.6.

VulnBench-200 leaderboard: curated 200-instance subset, best-of-3 runs
#   Model              Provider     Pass Rate  Score  Passed  Avg Time  Cost
1   GPT-5.3 Codex      OpenAI       22.5%      0.468  45/200  43.4s     $8.74
2   GPT-5.4            OpenAI       18.5%      0.407  37/200  7.3s      $4.81
3   Claude Opus 4.6    Anthropic    16.0%      0.404  32/200  19.6s     $10.17
4   GPT-5.2            OpenAI       15.0%      0.322  30/200  75.2s     $11.30
5   Claude Sonnet 4.6  Anthropic    10.5%      0.322  21/200  16.0s     $6.87
6   Gemini 3 Flash     Google       7.5%       0.318  15/200  5.1s      $3.13
7   GLM-5              Zhipu AI     7.0%       0.249  14/200  91.9s     $4.26
8   Kimi K2.5          Moonshot AI  6.5%       0.228  13/200  65.7s     $3.47
9   Grok 4.1 Fast      xAI          5.5%       0.273  11/200  61.1s     $3.46
10  GPT-5 Mini         OpenAI       5.0%       0.275  10/200  25.4s     $3.63
11  DeepSeek V3.2      DeepSeek     4.5%       0.253  9/200   78.7s     $3.25
12  Claude Haiku 4.5   Anthropic    3.5%       0.263  7/200   7.0s      $3.95
13  Gemini 3.1 Pro     Google       2.5%       0.093  5/200   44.4s     $9.60
14  MiniMax M2.5       MiniMax      1.5%       0.181  3/200   45.0s     $3.25
14  MiniMax M2.7       MiniMax      1.5%       0.099  3/200   22.8s     $1.74
16  Step 3.5 Flash     StepFun      0.0%       0.000  0/200   44.0s     $0.00

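
One way to read the Cost and Passed columns together is cost per passing patch. The sketch below is illustrative only: it assumes the Cost figure is the total USD spend for a model's VulnBench-200 run (the page does not say whether that covers all three attempts), and `cost_per_pass` is a hypothetical helper rather than a VulnBench metric.

```python
from typing import Optional

# (total cost in USD, instances passed) transcribed from the leaderboard above.
# ASSUMPTION: Cost is the total spend for the whole 200-instance run.
CURATED_RESULTS = {
    "GPT-5.3 Codex": (8.74, 45),
    "GPT-5.4": (4.81, 37),
    "Claude Opus 4.6": (10.17, 32),
    "Gemini 3 Flash": (3.13, 15),
    "Step 3.5 Flash": (0.00, 0),
}


def cost_per_pass(cost_usd: float, passed: int) -> Optional[float]:
    """USD spent per passing patch, or None when nothing passed."""
    if passed == 0:
        return None  # the ratio is undefined without at least one pass
    return cost_usd / passed


for model, (cost, passed) in CURATED_RESULTS.items():
    value = cost_per_pass(cost, passed)
    label = "n/a" if value is None else f"${value:.3f}"
    print(f"{model:16s} {label}")
```

Under that assumption, a cheaper run is not automatically better value: the ratio also depends on how many instances the model actually fixes.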
Results

VulnBench-1650 Leaderboard

The full benchmark: every model run once against all 1,650 instances. Rankings shift at scale — Claude Opus 4.6 takes the lead. Judge: Claude Opus 4.6.

VulnBench-1650 leaderboard: full 1,650-instance benchmark, single pass
#   Model              Provider     Pass Rate  Score  Passed    Avg Time
1   Claude Opus 4.6    Anthropic    16.6%      0.417  273/1650  19.3s
2   GPT-5.3 Codex      OpenAI       16.4%      0.301  271/1650  23.4s
3   Claude Sonnet 4.6  Anthropic    12.2%      0.335  201/1650  16.5s
4   Gemini 3 Flash     Google       9.7%       0.311  160/1650  8.8s
5   GPT-5.2            OpenAI       6.7%       0.129  111/1650  26.1s
6   GLM-5              Zhipu AI     5.9%       0.219  98/1650   90.6s
7   GPT-5 Mini         OpenAI       5.1%       0.246  85/1650   18.6s
8   Claude Haiku 4.5   Anthropic    4.9%       0.259  80/1650   6.5s
9   Gemini 3.1 Pro     Google       3.5%       0.097  58/1650   56.1s
10  GPT-5.4            OpenAI       3.4%       0.092  56/1650   4.5s
10  Kimi K2.5          Moonshot AI  3.4%       0.102  56/1650   25.7s
12  MiniMax M2.7       MiniMax      2.2%       0.114  36/1650   24.2s
13  MiniMax M2.5       MiniMax      2.1%       0.092  35/1650   23.7s
14  Grok 4.1 Fast      xAI          1.9%       0.108  31/1650   25.6s
15  DeepSeek V3.2      DeepSeek     1.7%       0.094  28/1650   25.2s
16  Step 3.5 Flash     StepFun      0.1%       0.001  2/1650    42.3s

Difficulty Tiers

Vulnerability Difficulty Classification

VulnBench instances are classified into three difficulty tiers (67 / 67 / 66 in the curated subset).

Tier 1 — Pattern
67 instances

XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.

Tier 2 — Logic
67 instances

Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.

Tier 3 — Deep
66 instances

Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
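
The example CWEs named in the three tier descriptions can be captured as a small lookup table. This is a sketch under stated assumptions: it covers only the ten CWEs listed above (the benchmark spans 55 CWE types), and the real tier classifier may use more than the CWE ID. `Tier` and `example_tier` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    PATTERN = 1  # fixes follow well-known patterns
    LOGIC = 2    # requires understanding application logic
    DEEP = 3     # requires reasoning about execution flow


# Only the example CWEs mentioned in the tier descriptions.
EXAMPLE_TIER_BY_CWE = {
    "CWE-79": Tier.PATTERN,   # XSS
    "CWE-89": Tier.PATTERN,   # SQL injection
    "CWE-22": Tier.PATTERN,   # path traversal
    "CWE-862": Tier.LOGIC,    # authorization
    "CWE-863": Tier.LOGIC,    # authorization
    "CWE-352": Tier.LOGIC,    # CSRF
    "CWE-200": Tier.LOGIC,    # information disclosure
    "CWE-94": Tier.DEEP,      # code injection
    "CWE-400": Tier.DEEP,     # resource exhaustion
    "CWE-20": Tier.DEEP,      # input validation
}


def example_tier(cwe_id: str) -> Optional[Tier]:
    """Return the tier for a listed example CWE, or None if it is not listed."""
    return EXAMPLE_TIER_BY_CWE.get(cwe_id.strip().upper())


print(example_tier("CWE-89"))   # a listed Tier 1 example
print(example_tier("CWE-502"))  # not among the listed examples
```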

Methodology

How VulnBench Works

Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.

  1. Collect Real CVEs

    10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.

  2. Build Instances

    Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.

  3. Prompt Models

    Each model receives the CVE description, CWE guidance, the affected file paths, and the vulnerable source, and must return its fix as a unified diff.

  4. Judge Patches

    Claude Opus 4.6 compares candidate against reference: root cause addressed, no new vulnerabilities, correct scope. Score 0.0–1.0.
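
Steps 3 and 4 can be pictured with a small sketch of the data involved. Everything here is an assumption for illustration: the `Instance` fields mirror the four prompt inputs named above, the prompt wording is invented, and the SQL injection snippet is a made-up Tier 1 example rather than a real benchmark instance. Only `difflib.unified_diff` is a real standard-library call; it shows the unified diff format a candidate patch would take.

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical container for the four prompt inputs named in step 3."""
    cve_description: str
    cwe_guidance: str
    affected_files: List[str]
    vulnerable_source: str


def build_prompt(inst: Instance) -> str:
    """Assemble one prompt string; the section labels are illustrative."""
    return "\n\n".join([
        "CVE description:\n" + inst.cve_description,
        "CWE guidance:\n" + inst.cwe_guidance,
        "Affected files:\n" + "\n".join(inst.affected_files),
        "Vulnerable source:\n" + inst.vulnerable_source,
        "Respond with a unified diff that fixes the vulnerability.",
    ])


def to_unified_diff(path: str, before: str, after: str) -> str:
    """Render a before/after pair as a unified diff via the standard library."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    ))


# Made-up example (SQL injection, CWE-89); not taken from the benchmark.
BEFORE = (
    "def get_user(db, name):\n"
    "    return db.execute(\"SELECT * FROM users WHERE name = '\" + name + \"'\")\n"
)
AFTER = (
    "def get_user(db, name):\n"
    "    return db.execute(\"SELECT * FROM users WHERE name = ?\", (name,))\n"
)

instance = Instance(
    cve_description="User lookup builds SQL by string concatenation.",
    cwe_guidance="CWE-89: use parameterized queries.",
    affected_files=["app/db.py"],
    vulnerable_source=BEFORE,
)
prompt = build_prompt(instance)
patch = to_unified_diff("app/db.py", BEFORE, AFTER)
print(patch)
```

A judge would then compare a patch like this against the reference fix commit; the scoring rubric and any pass threshold are not specified on this page, so they are left out of the sketch.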

Dataset

VulnBench at a Glance

Curated instances drawn from unique GitHub repositories, balanced across difficulty tiers.

Full Benchmark: 1,650
Curated Subset: 200
CWE Types: 55
Ecosystems: 7
Mean Lines: 36
Mean Files: 1.9

Severity Distribution
Critical 21
High 42
Medium 137
Top CWEs
CWE-79 (XSS) 38
CWE-22 (Path Trav) 25
CWE-400 (DoS) 25
CWE-20 (Input Val) 23
CWE-94 (Code Inj) 19
Ecosystems
npm 134
pip 54
Maven 5
RubyGems 3
Composer 2
Rust 1
Swift 1
CVE Year Range
2026 16
2025 55
2024 48
2023 26
2022 24
Earlier 31
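
The breakdowns above can be sanity-checked with a few lines of arithmetic. The sketch below transcribes the severity, ecosystem, and CVE-year counts, confirms that each sums to the 200 curated instances, and converts them to percentage shares. The Top CWEs list is left out because it shows only the five most common types. `shares` is a hypothetical helper.

```python
from typing import Dict

# Counts transcribed from the curated-subset breakdowns above.
SEVERITY = {"Critical": 21, "High": 42, "Medium": 137}
ECOSYSTEMS = {
    "npm": 134, "pip": 54, "Maven": 5, "RubyGems": 3,
    "Composer": 2, "Rust": 1, "Swift": 1,
}
CVE_YEARS = {
    "2026": 16, "2025": 55, "2024": 48,
    "2023": 26, "2022": 24, "Earlier": 31,
}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw counts to percentage shares, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for label, counts in [("Severity", SEVERITY),
                      ("Ecosystems", ECOSYSTEMS),
                      ("CVE years", CVE_YEARS)]:
    print(label, sum(counts.values()), shares(counts))
```

The shares make the skew explicit: npm and pip together account for 188 of the 200 curated instances, and more than two thirds of instances are rated Medium.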

Run the Benchmark Yourself

VulnBench is fully open source and reproducible. Evaluate any LLM with a single command; the example run below covers the full 1,650-instance benchmark.

$ vulnbench run \
    --model claude-opus-4.6 \
    --subset vulnbench-full \
    --judge claude-opus-4.6

  [████████████████████░░░░] 1354/1650

  Model      Claude Opus 4.6
  Pass Rate  16.6%
  Score      0.417
  Passed     273/1650
  Time       19.3s
Built with data from the GitHub Advisory Database and NIST NVD