Bob Tournament Runner

LLMs compete to write production Python. Best code wins.

Auto-updated after each tournament · Download raw data (JSONL)

Tournaments: 5 · Total Rounds: 1 · Models: 13 · Total Spend: $0.14

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.
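For reference, a minimal sketch of how such a bubble chart can be drawn from the raw JSONL export. The field names (model, avg_score, cost_per_round, tournaments) are illustrative assumptions, not the runner's actual schema.

```python
# Illustrative Score vs Cost bubble chart from the JSONL export.
# Field names are assumptions, not the runner's actual schema.
import json
import matplotlib.pyplot as plt

with open("results.jsonl") as fh:
    rows = [json.loads(line) for line in fh]

fig, ax = plt.subplots()
for r in rows:
    ax.scatter(r["cost_per_round"], r["avg_score"],
               s=60 * r["tournaments"],        # bubble size = number of tournaments
               alpha=0.6)
    ax.annotate(r["model"], (r["cost_per_round"], r["avg_score"]), fontsize=7)

ax.set_xscale("log")                           # costs span ~$0.0005 to ~$0.05 per round
ax.set_xlabel("Cost per round (USD)")
ax.set_ylabel("Composite score (%)")           # top-left = cheap and good
plt.show()
```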

Tournament Results

Each tournament gives the same task to every model, over multiple rounds with test feedback. The best composite score wins.

| Complexity | Task | Description | Winner | Score | Builders | Rounds |
|------------|------|-------------|--------|-------|----------|--------|
| complex | issue_classifier.py | GitHub issue triage classifier. Multi-label, priority scoring, team routing. | gpt-5.4 | 79.2% | 15 | 2 |
| complex | session_metrics.py | Session telemetry aggregator. Event streams, time-windowed stats, percentile calc. | gpt-5.4 | 82.6% | 15 | 3 |
| complex | quality_pipeline.py | EMA penalty calculator + health monitor + pre-filter. 131 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 95.9% | 15 | 4 |
| complex | ops_report.py | Operational report generator. System health aggregation, alert correlation, markdown output. | NVIDIA-Nemotron-3-Super-120B-A12B | 82.8% | 15 | 4 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 99.9% | 14 | 1 |

Latest Leaderboard

Best score per builder in the most recent tournament.

| # | Builder | Model | Provider | Params | Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---------|-------|----------|--------|------|------|-----------|-------|-------|-----------|---------|---------|
| 1 | nemo-2 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.9% | 35/35 | 100% | 2s | $0.0017 |
| 2 | llama-1 | Llama-3.3-70B-Instruct | DeepInfra | 70B | Dense | mid | Mac Studio M4 Max 64GB | 99.9% | 35/35 | 100% | 2s | $0.0014 |
| 3 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Dense | budget | Mac Mini M4 Pro 24GB | 99.9% | 35/35 | 100% | 2s | $0.0011 |
| 4 | gemini25-flash | gemini-2.5-flash | Google | Proprietary | | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 5 | gemini3-flash | gemini-3-flash-preview | Google | Proprietary | | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 6 | gpt-1 | gpt-5.4 | OpenAI | Proprietary | | frontier | Cloud only | 99.9% | 35/35 | 100% | 4s | $0.0350 |
| 7 | sonnet-1 | claude-sonnet-4-6 | Anthropic | Proprietary | | premium | Cloud only | 99.8% | 35/35 | 100% | 3s | $0.0510 |
| 8 | gemini3-pro | gemini-3-pro-preview | Google | Proprietary | | premium | Cloud only | 99.8% | 35/35 | 100% | 5s | $0.0325 |
| 9 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 5s | $0.0019 |
| 10 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | MoE | premium | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 9s | $0.0045 |
| 11 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.7% | 35/35 | 100% | 10s | $0.0017 |
| 12 | phi4-1 | phi-4 | DeepInfra | 14B | Dense | budget | Mac Mini M4 16GB | 79.9% | 35/35 | 100% | 2s | $0.0005 |
| 13 | dsv3-1 | DeepSeek-V3.2 | DeepInfra | 685B | MoE | budget | 3× Mac Ultra 192GB cluster | 79.7% | 35/35 | 100% | 10s | $0.0017 |
| 14 | minimax-1 | MiniMax-M2.5 | DeepInfra | Proprietary | | mid | Unknown | 79.5% | 35/35 | 100% | 16s | $0.0034 |

Lifetime Model Performance

Models with 6+ tournaments. Sorted by value score. Confidence: HIGH (10+T), MED (6+T), LOW (3+T).

No models have reached the 6-tournament threshold yet; every current entry appears under Insufficient Data below.

Total Spend: $0.14

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ |
|-------|----------|--------|-----------|------------|------------|-----------|-----------|------------|
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gemini-2.5-flash | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gemini-3-flash-preview | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gpt-5.4 | OpenAI | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.03 |
| claude-sonnet-4-6 | Anthropic | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | $0.05 |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 1T / 2r | VERY LOW | 100% | 99.8% | $0.00 |
| gemini-3-pro-preview | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | $0.03 |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | $0.00 |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | $0.00 |
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 1T / 1r | VERY LOW | 100% | 79.9% | $0.00 |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 79.7% | $0.00 |
| MiniMax-M2.5 | DeepInfra | | Unknown | 1T / 1r | VERY LOW | 100% | 79.5% | $0.00 |

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|-------|----------|-----|------------|------------|
| 1 | Llama-3.3-70B-Instruct | DeepInfra | 1639 | VERY LOW | 1T / 1r |
| 2 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1625 | VERY LOW | 1T / 1r |
| 3 | gemini-2.5-flash | Google | 1595 | VERY LOW | 1T / 1r |
| 4 | gemini-3-flash-preview | Google | 1569 | VERY LOW | 1T / 1r |
| 5 | gpt-5.4 | OpenAI | 1542 | VERY LOW | 1T / 1r |
| 6 | claude-sonnet-4-6 | Anthropic | 1519 | VERY LOW | 1T / 1r |
| 7 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1507 | VERY LOW | 1T / 2r |
| 8 | gemini-3-pro-preview | Google | 1491 | VERY LOW | 1T / 1r |
| 9 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1465 | VERY LOW | 1T / 1r |
| 10 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1438 | VERY LOW | 1T / 1r |
| 11 | phi-4 | DeepInfra | 1403 | VERY LOW | 1T / 1r |
| 12 | DeepSeek-V3.2 | DeepInfra | 1368 | VERY LOW | 1T / 1r |
| 13 | MiniMax-M2.5 | DeepInfra | 1340 | VERY LOW | 1T / 1r |
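As a rough illustration of the method, a minimal Bradley-Terry fit over pairwise round outcomes, mapped onto an ELO-like scale. The MM update loop and the 1500-centred scaling are assumptions for the sketch, not the runner's exact implementation.

```python
# Minimal Bradley-Terry fit over pairwise round outcomes (illustrative only).
# wins[(a, b)] = rounds in which builder a outscored builder b.
import math

def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, int]:
    models = {m for pair in wins for m in pair}
    strength = {m: 1.0 for m in models}
    for _ in range(iters):                      # MM (minorise-maximise) updates
        for m in models:
            won = sum(w for (a, _), w in wins.items() if a == m)
            den = sum(w / (strength[a] + strength[b])
                      for (a, b), w in wins.items() if m in (a, b))
            if den > 0:
                strength[m] = max(won, 1e-6) / den
    # Map strengths onto an ELO-like scale centred at 1500 (assumed convention).
    mean_log = sum(math.log(s) for s in strength.values()) / len(strength)
    return {m: round(1500 + 400 / math.log(10) * (math.log(s) - mean_log))
            for m, s in strength.items()}

print(bradley_terry({("llama-1", "phi4-1"): 3, ("phi4-1", "llama-1"): 1}))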

Pass@1 vs Self-Repair

How often does a model pass all tests on the first attempt (Pass@1) versus needing iteration (Repaired) or never passing (Failed)?

| Model | Provider | Pass@1 | Repaired | Failed |
|-------|----------|--------|----------|--------|
| Llama-3.3-70B-Instruct | DeepInfra | 100% | 0% | 0% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 100% | 0% | 0% |
| gemini-2.5-flash | Google | 100% | 0% | 0% |
| gemini-3-flash-preview | Google | 100% | 0% | 0% |
| gpt-5.4 | OpenAI | 100% | 0% | 0% |
| claude-sonnet-4-6 | Anthropic | 100% | 0% | 0% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 100% | 0% | 0% |
| gemini-3-pro-preview | Google | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 100% | 0% | 0% |
| phi-4 | DeepInfra | 100% | 0% | 0% |
| DeepSeek-V3.2 | DeepInfra | 100% | 0% | 0% |
| MiniMax-M2.5 | DeepInfra | 100% | 0% | 0% |
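A sketch of how each entry can be bucketed from its per-round results. The `rounds_passed` list of booleans is an assumed representation, not the runner's actual schema.

```python
# Classify one tournament entry as Pass@1, Repaired, or Failed from its
# per-round pass flags (assumed schema: one boolean per round, in order).
def classify(rounds_passed: list[bool]) -> str:
    if not rounds_passed:
        return "failed"
    if rounds_passed[0]:
        return "pass@1"       # all tests green on the first attempt
    if any(rounds_passed):
        return "repaired"     # failed round 1, fixed in a later round
    return "failed"           # never passed the full suite

assert classify([True]) == "pass@1"
assert classify([False, True]) == "repaired"
assert classify([False, False]) == "failed"
```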

Performance by Complexity

Average score broken down by task complexity tier.

| Model | Provider | Trivial | Medium | Complex |
|-------|----------|---------|--------|---------|
| Llama-3.3-70B-Instruct | DeepInfra | | | |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | | | |
| gemini-2.5-flash | Google | | | |
| gemini-3-flash-preview | Google | | | |
| gpt-5.4 | OpenAI | | | |
| claude-sonnet-4-6 | Anthropic | | | |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | | | |
| gemini-3-pro-preview | Google | | | |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | | | |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | | | |
| phi-4 | DeepInfra | | | |
| DeepSeek-V3.2 | DeepInfra | | | |
| MiniMax-M2.5 | DeepInfra | | | |

Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying for cloud APIs? Estimates assume Q4 quantisation, with single-Mac, multi-Mac cluster, or Nvidia DGX options.

Single Mac: 4/13 · Mac/DGX Cluster: 3/13 · Cloud Only: 6/13 · All Open $/yr: $0 · Proprietary $/yr: $1

Buy or rent?

Top self-hostable model scores 99.9% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$1/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.

Cloud only (proprietary)

gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5. These are closed-source — no self-hosting option.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 79.7%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)
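The Q4 sizes quoted above track roughly 0.58 bytes per parameter. A small sketch of that estimate and the hardware mapping it implies; the 0.58 constant is reverse-engineered from the figures in this section and the tier thresholds are illustrative assumptions.

```python
# Rough Q4 memory estimate used to map models onto Apple Silicon tiers.
# 0.58 bytes/param is inferred from the sizes above (e.g. 120B -> ~70GB);
# the thresholds below are illustrative, not the runner's exact rules.
def q4_size_gb(params_billion: float) -> float:
    return params_billion * 0.58   # ~4.6 bits/param incl. quantisation overhead

def suggest_host(params_billion: float) -> str:
    gb = q4_size_gb(params_billion)
    if gb <= 10:
        return "Mac Mini M4 16GB"
    if gb <= 16:
        return "Mac Mini M4 Pro 24GB"
    if gb <= 48:
        return "Mac Studio M4 Max 64GB"
    if gb <= 96:
        return "Mac Studio M4 Ultra 128GB"
    return "multi-Mac cluster or DGX"

print(q4_size_gb(120), suggest_host(120))   # ~70 GB -> Mac Studio M4 Ultra 128GB
```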

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok. Opus costs $0.00 lifetime for rank #11.

DeepSeek-V3.2 is a value star

$0.0008/round — cheapest in the pool. 100% tests on first entry. Immediately competitive.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $0.14

Across 5 tournaments, 1 round, 13 models. Opus is 0% of total spend.

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
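A minimal sketch of that loop, with the generation and test-execution steps passed in as callables. The function names, schema, and round structure are illustrative, not the runner's actual internals.

```python
# Sketch of the tournament loop: same spec to every model, run the suite,
# feed failures back each round. Callables stand in for the real internals.
from typing import Callable

def run_tournament(
    task_spec: str,
    tests: str,
    rubric: str,
    models: list[str],
    generate: Callable[[str, str, str, str, str], str],    # model, spec, tests, rubric, feedback -> code
    run_tests: Callable[[str, str], tuple[int, int, str]],  # code, tests -> (passed, failed, failure log)
    max_rounds: int = 3,
) -> dict[str, list[dict]]:
    history: dict[str, list[dict]] = {m: [] for m in models}
    for round_no in range(1, max_rounds + 1):
        for model in models:
            feedback = history[model][-1]["failures"] if history[model] else ""
            code = generate(model, task_spec, tests, rubric, feedback)
            passed, failed, failures = run_tests(code, tests)
            history[model].append(
                {"round": round_no, "passed": passed, "failed": failed, "failures": failures}
            )
    return history
```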

Scoring

Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.
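For illustration, one way such a composite with an iteration penalty can be computed. The 60/25/15 weights and the 2-point-per-extra-round penalty are assumptions; the document does not publish the exact formula.

```python
# Illustrative composite score; weights and penalty size are assumptions.
def composite_score(pass_rate: float, checklist: float, quality: float,
                    rounds_used: int, penalty_per_round: float = 0.02) -> float:
    base = 0.60 * pass_rate + 0.25 * checklist + 0.15 * quality
    penalty = penalty_per_round * max(rounds_used - 1, 0)
    return max(base - penalty, 0.0)

# 35/35 tests, 90% checklist compliance, 0.8 quality, fixed in round 2:
print(round(composite_score(1.0, 0.90, 0.80, rounds_used=2), 3))  # ~0.925
```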

Fair Comparison

All models run in isolated git worktrees with identical inputs. No per-model prompt engineering. Same temperature; token limits adjusted only for model capability. Real production tasks, not synthetic benchmarks.
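A sketch of the isolation step, assuming one detached worktree per builder and shared sampling settings. The branch layout, directory names, and parameter values are assumptions.

```python
# Per-builder isolation: one detached git worktree per model so builders
# cannot see each other's edits. Names and settings below are illustrative.
import subprocess
from pathlib import Path

COMMON_PARAMS = {"temperature": 0.2, "max_tokens": 8192}   # assumed shared settings

def make_worktree(repo: Path, builder: str, base_ref: str = "main") -> Path:
    wt = repo / ".tournament" / builder
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(wt), base_ref],
        check=True,
    )
    return wt
```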

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
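A sketch of that per-round estimate from per-million-token prices. The prices and model names below are placeholders, not any provider's actual schedule.

```python
# Per-round cost estimate from per-million-token pricing (placeholder prices).
PRICING_PER_MTOK = {                       # (input $, output $) per 1M tokens
    "example-open-model": (0.10, 0.50),
    "example-frontier-model": (1.25, 10.00),
}

def round_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# e.g. 12k prompt tokens + 1.6k completion tokens on the open model:
print(f"${round_cost('example-open-model', 12_000, 1_600):.4f}")  # $0.0020
```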