Bob Tournament Runner — AI Code Generation Benchmarks

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.

Tournament Results

Each tournament gives the same task to all models. Multiple rounds with test feedback. Best composite score wins.

complex

issue_classifier.py

GitHub issue triage classifier. Multi-label, priority scoring, team routing.

Winner

gpt-5.4

Score

79.2%

Builders

15

Rounds

2

complex

session_metrics.py

Session telemetry aggregator. Event streams, time-windowed stats, percentile calc.

Winner

gpt-5.4

Score

82.6%

Builders

15

Rounds

3

complex

quality_pipeline.py

EMA penalty calculator + health monitor + pre-filter. 131 lines.

Winner

NVIDIA-Nemotron-3-Super-120B-A12B

Score

95.9%

Builders

15

Rounds

4

complex

ops_report.py

Operational report generator. System health aggregation, alert correlation, markdown output.

Winner

NVIDIA-Nemotron-3-Super-120B-A12B

Score

82.8%

Builders

15

Rounds

4

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

NVIDIA-Nemotron-3-Super-120B-A12B

Score

99.9%

Builders

14

Rounds

1

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

Qwen3-Coder-480B-A35B-Instruct

Score

99.9%

Builders

14

Rounds

1

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

Qwen3-Coder-480B-A35B-Instruct

Score

99.9%

Builders

14

Rounds

4

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

Qwen3-Coder-480B-A35B-Instruct

Score

99.9%

Builders

14

Rounds

1

complex

issue_classifier.py

GitHub issue triage classifier. Multi-label, priority scoring, team routing.

Winner

gpt-5.4

Score

79.7%

Builders

14

Rounds

5

complex

issue_classifier.py

GitHub issue triage classifier. Multi-label, priority scoring, team routing.

Winner

gemini-3-flash-preview

Score

71.5%

Builders

14

Rounds

5

complex

ops_report.py

Operational report generator. System health aggregation, alert correlation, markdown output.

Winner

NVIDIA-Nemotron-3-Super-120B-A12B

Score

0.0%

Builders

11

Rounds

3

complex

issue_classifier.py

GitHub issue triage classifier. Multi-label, priority scoring, team routing.

Winner

Qwen3-Coder-480B-A35B-Instruct

Score

0.0%

Builders

11

Rounds

3

complex

quality_pipeline.py

EMA penalty calculator + health monitor + pre-filter. 131 lines.

Winner

Llama-3.3-70B-Instruct-Turbo

Score

0.0%

Builders

11

Rounds

3

complex

skill_quality.py

Supabase telemetry. 210 lines. 5 RPCs, JSONL fallback, fcntl locking.

Winner

Qwen3-Coder-480B-A35B-Instruct

Score

0.0%

Builders

11

Rounds

3

medium

skill_tracker.py

Breadcrumb file reader/parser. 60 lines. JSON parsing, timestamps.

Winner

DeepSeek-V3.2

Score

0.0%

Builders

11

Rounds

3

medium

skill_tracker.py

Breadcrumb file reader/parser. 60 lines. JSON parsing, timestamps.

Winner

NVIDIA-Nemotron-3-Super-120B-A12B

Score

86.5%

Builders

11

Rounds

5

complex

session_metrics.py

Session telemetry aggregator. Event streams, time-windowed stats, percentile calc.

Winner

gpt-5.4

Score

78.9%

Builders

11

Rounds

5

complex

quality_pipeline.py

EMA penalty calculator + health monitor + pre-filter. 131 lines.

Winner

gemini-3-flash-preview

Score

86.2%

Builders

11

Rounds

5

complex

skill_quality.py

Supabase telemetry. 210 lines. 5 RPCs, JSONL fallback, fcntl locking.

Winner

Mistral-Small-3.2-24B-Instruct-2506

Score

85.4%

Builders

11

Rounds

5

complex

ops_report.py

Operational report generator. System health aggregation, alert correlation, markdown output.

Winner

gpt-5.4

Score

72.0%

Builders

11

Rounds

5

medium

skill_tracker.py

Breadcrumb file reader/parser. 60 lines. JSON parsing, timestamps.

Winner

gemini-3-flash-preview

Score

85.9%

Builders

4

Rounds

2

complex

issue_classifier.py

GitHub issue triage classifier. Multi-label, priority scoring, team routing.

Winner

Mistral-Small-3.2-24B-Instruct-2506

Score

71.5%

Builders

6

Rounds

2

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

trivial

greeting.py

Simple greeting module. ~20 lines.

Winner

gpt-5.4

Score

90.0%

Builders

11

Rounds

5

Latest Leaderboard

Best score per builder in the most recent tournament.

# ▴	Builder ▴	Model ▴	Provider ▴	Params ▴	Arch ▴	Tier ▴	Self-Host ▴	Score ▴	Tests ▴	Avg Tests ▴	Latency ▴	$/round ▴
1	gpt-1	gpt-5.4	OpenAI	—	Proprietary	frontier	Cloud only	90.0%	35/35	100%	2s	$0.0517
2	gemini3-flash	gemini-3-flash-preview	Google	—	Proprietary	budget	Cloud only	90.0%	35/35	100%	2s	$0.0085
3	sonnet-1	claude-sonnet-4-6	Anthropic	—	Proprietary	premium	Cloud only	90.0%	35/35	100%	2s	$0.0540
4	llama-1	Llama-3.3-70B-Instruct-Turbo	DeepInfra	—	Unknown	unknown	Unknown	90.0%	35/35	100%	3s	$0.0013
5	mistral-sm	Mistral-Small-3.2-24B-Instruct-2506	DeepInfra	24B	Dense	budget	Mac Mini M4 Pro 24GB	90.0%	35/35	100%	2s	$0.0010
6	qwen-1	Qwen3-Coder-480B-A35B-Instruct	DeepInfra	480B	MoE	premium	2× Mac Ultra 192GB cluster	90.0%	35/35	100%	3s	$0.0052
7	nemo-1	NVIDIA-Nemotron-3-Super-120B-A12B	DeepInfra	120B	MoE	mid	Mac Studio M4 Ultra 128GB	90.0%	35/35	100%	3s	$0.0032
8	minimax-1	MiniMax-M2.5	DeepInfra	—	Proprietary	mid	Unknown	90.0%	35/35	100%	3s	$0.0019
9	qwen-turbo	Qwen3-Coder-480B-A35B-Instruct-Turb...	DeepInfra	480B	MoE (Turbo)	mid	2× Mac Ultra 192GB cluster	90.0%	35/35	100%	3s	$0.0029
10	dsv3-1	DeepSeek-V3.2	DeepInfra	685B	MoE	budget	3× Mac Ultra 192GB cluster	89.9%	35/35	100%	7s	$0.0004
11	phi4-1	phi-4	DeepInfra	14B	Dense	budget	Mac Mini M4 16GB	75.0%	35/35	100%	2s	$0.0006

Lifetime Model Performance

Models with 6+ tournaments. Sorted by value score. Confidence: HIGH (10+T), MED (6+T), LOW (3+T).

Model ▴	Provider ▴	Params ▴	Self-Host ▴	T / Rounds ▴	Confidence ▴	Pass Rate ▴	Avg Score ▴	Lifetime $ ▴	Score/$ ▴	Value Score ▴
gpt-5.4	OpenAI	—	Cloud only	17T / 64r	HIGH	76%	61.9%	—	—	79
phi-4	DeepInfra	14B	Mac Mini M4 16GB	17T / 64r	HIGH	69%	47.7%	$0.06	8	71
Mistral-Small-3.2-24B-Instruct-2506	DeepInfra	24B	Mac Mini M4 Pro 24GB	17T / 64r	HIGH	70%	58.4%	$0.08	8	77
DeepSeek-V3.2	DeepInfra	685B	3× Mac Ultra 192GB cluster	16T / 62r	HIGH	42%	32.9%	$0.05	6	53
Llama-3.3-70B-Instruct-Turbo	DeepInfra	—	Unknown	14T / 57r	HIGH	68%	50.5%	$0.10	5	70
NVIDIA-Nemotron-3-Super-120B-A12B	DeepInfra	120B	Mac Studio M4 Ultra 128GB	16T / 74r	HIGH	59%	50.4%	$0.23	2	64
Qwen3-Coder-480B-A35B-Instruct-Turb...	DeepInfra	480B	2× Mac Ultra 192GB cluster	16T / 62r	HIGH	65%	54.2%	$0.31	2	67
Qwen3-Coder-480B-A35B-Instruct	DeepInfra	480B	2× Mac Ultra 192GB cluster	16T / 62r	HIGH	66%	51.4%	$0.44	1	64
MiniMax-M2.5	DeepInfra	—	Unknown	16T / 62r	HIGH	39%	30.6%	$0.32	1	43
gemini-3-flash-preview	Google	—	Cloud only	17T / 64r	HIGH	75%	61.0%	$0.95	1	69
claude-sonnet-4-6	Anthropic	—	Cloud only	16T / 62r	HIGH	70%	53.6%	$4.13	0	57
Total Spend								$7.48

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

Model ▴	Provider ▴	Params ▴	Self-Host ▴	T / Rounds ▴	Confidence ▴	Pass Rate ▴	Avg Score ▴	Lifetime $ ▴	Score/$ ▴	Value Score ▴
gemini-3-pro-preview	Google	—	Cloud only	4T / 12r	LOW	59%	50.0%	$0.64	—	—
gemini-2.5-flash	Google	—	Cloud only	4T / 12r	LOW	55%	49.0%	$0.19	—	—
Llama-3.3-70B-Instruct	DeepInfra	70B	Mac Studio M4 Max 64GB	3T / 7r	LOW	55%	45.8%	—	—	—
mistral-small	DeepInfra	—	Unknown	2T / 4r	VERY LOW	69%	45.0%	—	—	—
phi4	DeepInfra	—	Unknown	2T / 4r	VERY LOW	67%	33.8%	—	—	—

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

#	Model	Provider	ELO	Confidence	T / Rounds
1	gpt-5.4	OpenAI	1910	HIGH	17T / 64r
2	gemini-3-flash-preview	Google	1909	HIGH	17T / 64r
3	claude-sonnet-4-6	Anthropic	1741	HIGH	16T / 62r
4	Mistral-Small-3.2-24B-Instruct-2506	DeepInfra	1713	HIGH	17T / 64r
5	Qwen3-Coder-480B-A35B-Instruct	DeepInfra	1653	HIGH	16T / 62r
6	NVIDIA-Nemotron-3-Super-120B-A12B	DeepInfra	1574	HIGH	16T / 74r
7	gemini-3-pro-preview	Google	1530	LOW	4T / 12r
8	Llama-3.3-70B-Instruct-Turbo	DeepInfra	1524	HIGH	14T / 57r
10	Llama-3.3-70B-Instruct	DeepInfra	1452	LOW	3T / 7r
11	phi4	DeepInfra	1419	VERY LOW	2T / 4r
12	mistral-small	DeepInfra	1414	VERY LOW	2T / 4r
13	Qwen3-Coder-480B-A35B-Instruct-Turb...	DeepInfra	1382	HIGH	16T / 62r
14	MiniMax-M2.5	DeepInfra	1303	HIGH	16T / 62r
15	gemini-2.5-flash	Google	1302	LOW	4T / 12r
16	DeepSeek-V3.2	DeepInfra	1109	HIGH	16T / 62r
17	phi-4	DeepInfra	1064	HIGH	17T / 64r

Pass@1 vs Self-Repair

How often does a model pass all tests on first attempt vs needing iteration? Green = Pass@1, Yellow = Repaired, Red = Failed.

Model	Provider	Pass@1	Repaired	Failed
gpt-5.4	OpenAI	16%	41%	44%
gemini-3-flash-preview	Google	12%	31%	56%
Mistral-Small-3.2-24B-Instruct-2506	DeepInfra	11%	31%	58%
Qwen3-Coder-480B-A35B-Instruct-Turb...	DeepInfra	11%	32%	56%
claude-sonnet-4-6	Anthropic	13%	37%	50%
Qwen3-Coder-480B-A35B-Instruct	DeepInfra	11%	32%	56%
Llama-3.3-70B-Instruct-Turbo	DeepInfra	9%	33%	58%
NVIDIA-Nemotron-3-Super-120B-A12B	DeepInfra	9%	20%	70%
gemini-3-pro-preview	Google	17%	0%	83%
gemini-2.5-flash	Google	17%	0%	83%
phi-4	DeepInfra	11%	31%	58%
Llama-3.3-70B-Instruct	DeepInfra	29%	0%	71%
mistral-small	DeepInfra	25%	25%	50%
phi4	DeepInfra	25%	0%	75%
DeepSeek-V3.2	DeepInfra	10%	29%	61%
MiniMax-M2.5	DeepInfra	6%	21%	73%

Performance by Complexity

Average score broken down by task complexity tier.

Model	Provider	Trivial	Medium	Complex
gpt-5.4	OpenAI	—	—	—
gemini-3-flash-preview	Google	—	—	—
Mistral-Small-3.2-24B-Instruct-2506	DeepInfra	—	—	—
Qwen3-Coder-480B-A35B-Instruct-Turb...	DeepInfra	—	—	—
claude-sonnet-4-6	Anthropic	—	—	—
Qwen3-Coder-480B-A35B-Instruct	DeepInfra	—	—	—
Llama-3.3-70B-Instruct-Turbo	DeepInfra	—	—	—
NVIDIA-Nemotron-3-Super-120B-A12B	DeepInfra	—	—	—
gemini-3-pro-preview	Google	—	—	—
gemini-2.5-flash	Google	—	—	—
phi-4	DeepInfra	—	—	—
Llama-3.3-70B-Instruct	DeepInfra	—	—	—
mistral-small	DeepInfra	—	—	—
phi4	DeepInfra	—	—	—
DeepSeek-V3.2	DeepInfra	—	—	—
MiniMax-M2.5	DeepInfra	—	—	—

Head-to-Head Comparison

Select two models to compare side-by-side.

vs

VS

Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying cloud APIs? Q4 quantisation estimates. Single-Mac, multi-Mac cluster, or Nvidia DGX options.

Single Mac

4/17

Mac/DGX Cluster

3/17

Cloud Only

10/17

All Open $/yr

$2

Proprietary $/yr

$10

Buy or rent?

Top self-hostable model scores 99.9% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$13/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.

Cloud only (proprietary)

gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5, Llama-3.3-70B-Instruct-Turbo, mistral-small, phi4, llama3.3:70b-instruct-q4_K_M. These are closed-source — no self-hosting option.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.8%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok. Opus costs $0.00 lifetime for rank #11.

DeepSeek-V3.2 is a value star

$0.0008/round — cheapest in the pool. 100% tests on first entry. Immediately competitive.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $7.48

Across 31 tournaments, 66 rounds, 17 models. Opus is 0% of total spend.

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.

Scoring

Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.

Fair Comparison

All models run in isolated git worktrees with identical inputs. No prompt engineering per model. Same temperature, same token limits (adjusted for model capability). Real production tasks, not synthetic benchmarks.

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.