Bob Tournament Runner

LLMs compete to write production Python. Best code wins.

Auto-updated after each tournament.

Tournaments: 9 · Total Rounds: 7 · Models: 13 · Total Spend: $1.96

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.

Tournament Results

Each tournament gives the same task to all models. Multiple rounds with test feedback. Best composite score wins.

| Tier | Task | Description | Winner | Score | Builders | Rounds |
|---|---|---|---|---|---|---|
| complex | issue_classifier.py | GitHub issue triage classifier. Multi-label, priority scoring, team routing. | gpt-5.4 | 79.2% | 15 | 2 |
| complex | session_metrics.py | Session telemetry aggregator. Event streams, time-windowed stats, percentile calc. | gpt-5.4 | 82.6% | 15 | 3 |
| complex | quality_pipeline.py | EMA penalty calculator + health monitor + pre-filter. 131 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 95.9% | 15 | 4 |
| complex | ops_report.py | Operational report generator. System health aggregation, alert correlation, markdown output. | NVIDIA-Nemotron-3-Super-120B-A12B | 82.8% | 15 | 4 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 99.9% | 14 | 1 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 1 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 4 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 1 |
| complex | issue_classifier.py | GitHub issue triage classifier. Multi-label, priority scoring, team routing. | gpt-5.4 | 79.7% | 14 | 5 |

Latest Leaderboard

Best score per builder in the most recent tournament.

| # | Builder | Model | Provider | Params / Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-1 | gpt-5.4 | OpenAI | Proprietary | frontier | Cloud only | 79.7% | 33/35 | 95% | 76s | $0.0549 |
| 2 | gemini3-flash | gemini-3-flash-preview | Google | Proprietary | budget | Cloud only | 78.9% | 33/35 | 95% | 28s | $0.0035 |
| 3 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B MoE | premium | 2× Mac Ultra 192GB cluster | 78.0% | 33/35 | 38% | 59s | $0.0047 |
| 4 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B Dense | budget | Mac Mini M4 Pro 24GB | 76.8% | 32/35 | 85% | 63s | $0.0023 |
| 5 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 76.3% | 33/35 | 37% | 45s | $0.0016 |
| 6 | gemini25-flash | gemini-2.5-flash | Google | Proprietary | budget | Cloud only | 76.1% | 32/35 | 37% | 39s | $0.0013 |
| 7 | gemini3-pro | gemini-3-pro-preview | Google | Proprietary | premium | Cloud only | 74.4% | 33/35 | 38% | 37s | $0.0069 |
| 8 | nemo-2 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B MoE | mid | Mac Studio M4 Ultra 128GB | 73.2% | 29/35 | 57% | 59s | $0.0048 |
| 9 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B MoE | mid | Mac Studio M4 Ultra 128GB | 72.0% | 28/35 | 55% | 56s | $0.0048 |
| 10 | phi4-1 | phi-4 | DeepInfra | 14B Dense | budget | Mac Mini M4 16GB | 64.1% | 33/35 | 90% | 58s | $0.0013 |
| 11 | sonnet-1 | claude-sonnet-4-6 | Anthropic | Proprietary | premium | Cloud only | 62.8% | 34/35 | 38% | 40s | $0.0403 |
| 12 | llama-1 | Llama-3.3-70B-Instruct | DeepInfra | 70B Dense | mid | Mac Studio M4 Max 64GB | 60.3% | 32/35 | 38% | 72s | $0.0018 |
| 13 | minimax-1 | MiniMax-M2.5 | DeepInfra | Proprietary | mid | Unknown | 57.0% | 33/35 | 19% | 139s | $0.0028 |

Lifetime Model Performance

Models with 6+ tournaments. Sorted by value score. Confidence: HIGH (10+T), MED (6+T), LOW (3+T).

No model has reached the 6-tournament threshold yet (all are at 3 tournaments), so this table is empty; see Insufficient Data below.

Total Spend: $1.96

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ |
|---|---|---|---|---|---|---|---|---|
| gpt-5.4 | OpenAI |  | Cloud only | 3T / 7r | LOW | 96% | 85.3% | $0.85 |
| gemini-3-flash-preview | Google |  | Cloud only | 3T / 7r | LOW | 96% | 84.5% | $0.17 |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 3T / 7r | LOW | 90% | 81.1% | $0.02 |
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 3T / 7r | LOW | 93% | 66.2% | $0.01 |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 3T / 14r | LOW | 68% | 63.9% | $0.06 |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 3T / 7r | LOW | 55% | 53.9% | $0.06 |
| gemini-2.5-flash | Google |  | Cloud only | 3T / 7r | LOW | 55% | 50.1% | $0.06 |
| gemini-3-pro-preview | Google |  | Cloud only | 3T / 7r | LOW | 56% | 49.7% | $0.24 |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 3T / 7r | LOW | 56% | 48.5% | $0.04 |
| claude-sonnet-4-6 | Anthropic |  | Cloud only | 3T / 7r | LOW | 56% | 46.3% | $0.41 |
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 3T / 7r | LOW | 55% | 45.8% |  |
| MiniMax-M2.5 | DeepInfra |  | Unknown | 3T / 7r | LOW | 42% | 33.7% | $0.03 |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 3T / 7r | LOW | 29% | 25.6% | $0.01 |

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|---|---|---|---|---|
| 1 | gpt-5.4 | OpenAI | 1853 | LOW | 3T / 7r |
| 2 | gemini-3-flash-preview | Google | 1724 | LOW | 3T / 7r |
| 3 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1685 | LOW | 3T / 7r |
| 4 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1646 | LOW | 3T / 7r |
| 5 | gemini-2.5-flash | Google | 1628 | LOW | 3T / 7r |
| 6 | gemini-3-pro-preview | Google | 1546 | LOW | 3T / 7r |
| 7 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1474 | LOW | 3T / 7r |
| 8 | Llama-3.3-70B-Instruct | DeepInfra | 1452 | LOW | 3T / 7r |
| 9 | claude-sonnet-4-6 | Anthropic | 1435 | LOW | 3T / 7r |
| 10 | DeepSeek-V3.2 | DeepInfra | 1333 | LOW | 3T / 7r |
| 11 | phi-4 | DeepInfra | 1279 | LOW | 3T / 7r |
| 12 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1275 | LOW | 3T / 14r |
| 13 | MiniMax-M2.5 | DeepInfra | 1170 | LOW | 3T / 7r |
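The Bradley-Terry fit behind these rankings can be sketched with the standard minorise-maximise update; the win counts, the 1500-centred ELO mapping, and the function name below are illustrative assumptions, not the runner's actual implementation.

```python
import math

def bradley_terry_elo(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts and map them
    onto an ELO-like scale (hypothetical sketch).
    wins[(a, b)] = number of rounds where a beat b."""
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            # MM update: total wins / sum over opponents of n_ij / (s_i + s_j)
            num = sum(w for (a, b), w in wins.items() if a == p)
            den = 0.0
            for (a, b), w in wins.items():
                if p in (a, b):
                    other = b if a == p else a
                    den += w / (strength[p] + strength[other])
            new[p] = num / den if den else strength[p]
        # normalise so the geometric mean of strengths stays at 1
        g = math.exp(sum(math.log(s) for s in new.values()) / len(new))
        strength = {p: s / g for p, s in new.items()}
    # 400 points per factor-of-10 strength, around a 1500 baseline
    return {p: round(1500 + 400 * math.log10(s)) for p, s in strength.items()}

# made-up head-to-head counts for two models
ratings = bradley_terry_elo({("gpt-5.4", "phi-4"): 5, ("phi-4", "gpt-5.4"): 1})
```

A model that wins more head-to-heads ends up with a higher rating, which is exactly the ordering shown in the table.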

Pass@1 vs Self-Repair

How often does a model pass all tests on first attempt vs needing iteration? Green = Pass@1, Yellow = Repaired, Red = Failed.

| Model | Provider | Pass@1 | Repaired | Failed |
|---|---|---|---|---|
| gpt-5.4 | OpenAI | 29% | 0% | 71% |
| gemini-3-flash-preview | Google | 29% | 0% | 71% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 29% | 0% | 71% |
| phi-4 | DeepInfra | 29% | 0% | 71% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 29% | 0% | 71% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 29% | 0% | 71% |
| gemini-2.5-flash | Google | 29% | 0% | 71% |
| gemini-3-pro-preview | Google | 29% | 0% | 71% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 29% | 0% | 71% |
| claude-sonnet-4-6 | Anthropic | 29% | 0% | 71% |
| Llama-3.3-70B-Instruct | DeepInfra | 29% | 0% | 71% |
| MiniMax-M2.5 | DeepInfra | 29% | 0% | 71% |
| DeepSeek-V3.2 | DeepInfra | 29% | 0% | 71% |
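The three buckets can be expressed as a small classifier over per-round results; the list-of-booleans input shape is an assumption for illustration, not the runner's real schema.

```python
def classify_outcome(round_results):
    """Bucket one builder's tournament into Pass@1 / Repaired / Failed.
    round_results: one boolean per round, True = all tests passed."""
    if not round_results:
        return "Failed"
    if round_results[0]:
        return "Pass@1"      # green: clean first attempt
    if any(round_results[1:]):
        return "Repaired"    # yellow: fixed after seeing test failures
    return "Failed"          # red: never passed the full suite

print(classify_outcome([False, True]))  # → Repaired
```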

Performance by Complexity

Average score broken down by task complexity tier.



Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying for cloud APIs? Estimates assume Q4 quantisation. Options: single Mac, multi-Mac cluster, or Nvidia DGX.

Single Mac: 4/13 · Mac/DGX Cluster: 3/13 · Cloud Only: 6/13 · All Open $/yr: $1 · Proprietary $/yr: $10

Buy or rent?

Top self-hostable model scores 99.9% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$11/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.

Cloud only (proprietary)

gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5. These are closed-source — no self-hosting option.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.8%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)
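The Q4 sizes quoted above are consistent with a simple rule of thumb: roughly 0.5 bytes per parameter plus ~15% for KV cache and runtime buffers. The overhead factor is our assumption, but it reproduces the report's figures.

```python
def q4_footprint_gb(params_billions, overhead=1.15):
    """Rough Q4 memory estimate: ~0.5 bytes per weight, plus an
    assumed ~15% overhead for KV cache and runtime buffers."""
    return params_billions * 0.5 * overhead

# matches the report: phi-4 ~8GB, Llama-70B ~40GB, Qwen-480B ~280GB
for name, b in [("phi-4", 14), ("Llama-3.3-70B-Instruct", 70),
                ("Qwen3-Coder-480B-A35B-Instruct", 480)]:
    print(f"{name}: ~{q4_footprint_gb(b):.0f} GB")
```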

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok. claude-sonnet-4-6 costs $0.41 lifetime for rank #11.

DeepSeek-V3.2 is a value star

$0.0008/round — cheapest in the pool, and it passed 100% of tests on its first entry.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

claude-sonnet-4-6: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $1.96

Across 9 tournaments, 7 rounds, 13 models. claude-sonnet-4-6 accounts for ~21% of total spend ($0.41).

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
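A minimal sketch of this loop, with illustrative callables (generate_code, run_tests) standing in for the runner's real API:

```python
def run_tournament(task, models, generate_code, run_tests, max_rounds=3):
    """Sketch of the protocol above (illustrative, not the real API):
    every model gets the same task; later rounds feed the previous
    round's test failures back to the model as feedback."""
    history = {m: [] for m in models}
    for rnd in range(1, max_rounds + 1):
        for m in models:
            if history[m] and not history[m][-1]["failures"]:
                continue  # this builder already passes every test
            feedback = history[m][-1]["failures"] if history[m] else []
            code = generate_code(m, task, feedback)
            history[m].append({"round": rnd, "code": code,
                               "failures": run_tests(code)})
    return history

# toy run: one model that writes correct code on the first attempt
demo = run_tournament(
    "spec: add(a, b) returns a + b", ["model-a"],
    generate_code=lambda m, task, feedback: "def add(a, b):\n    return a + b",
    run_tests=lambda code: [],   # pretend the full suite passes
    max_rounds=3,
)
```

Because the toy model goes green in round 1, no further rounds are generated for it.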

Scoring

Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.
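As a rough sketch of such a composite, assuming illustrative weights (60/20/20) and a 2-point penalty per extra round — the actual rubric weights are not published in this report:

```python
def composite_score(pass_rate, checklist, quality, rounds_used,
                    weights=(0.6, 0.2, 0.2), penalty_per_extra_round=0.02):
    """Illustrative composite: weighted blend of test pass rate,
    checklist compliance and code quality, minus an iteration penalty.
    Weights and penalty rate are assumptions, not the runner's rubric."""
    w_tests, w_check, w_qual = weights
    base = w_tests * pass_rate + w_check * checklist + w_qual * quality
    penalty = penalty_per_extra_round * max(0, rounds_used - 1)
    return max(0.0, base - penalty)

# e.g. 33/35 tests, 90% checklist, 85% quality, solved in 3 rounds
score = composite_score(33 / 35, 0.90, 0.85, rounds_used=3)  # ≈ 0.876
```

The penalty term is what makes a clean round-1 solve outrank an identical solve that needed several retries.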

Fair Comparison

All models run in isolated git worktrees with identical inputs. No prompt engineering per model. Same temperature, same token limits (adjusted for model capability). Real production tasks, not synthetic benchmarks.

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
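That estimate reduces to a simple per-token formula. The token counts below are placeholders, and the $0.10/$0.50 rates echo the Nemotron pricing quoted in Key Insights but are still illustrative:

```python
def round_cost(input_tokens, output_tokens,
               in_price_per_mtok, out_price_per_mtok):
    """Per-round cost estimate from token counts and per-million-token
    provider prices (all inputs here are placeholder values)."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# e.g. 20k prompt tokens + 4k completion tokens at $0.10 / $0.50 per MTok
cost = round_cost(20_000, 4_000, 0.10, 0.50)  # ≈ $0.004
```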