Bob Tournament Runner

LLMs compete to write production Python. Best code wins.

Auto-updated after each tournament · Download raw data (JSONL)

Tournaments: 5 · Total Rounds: 1 · Models: 13 · Total Spend: $0.14

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.
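For reference, a minimal sketch of how such a bubble chart can be drawn from the raw JSONL export. The field names (model, avg_score, cost_per_round, tournaments) are illustrative assumptions, not the runner's actual schema.

```python
# Illustrative Score vs Cost bubble chart from the JSONL export.
# Field names are assumptions, not the runner's actual schema.
import json
import matplotlib.pyplot as plt

with open("results.jsonl") as fh:
    rows = [json.loads(line) for line in fh]

fig, ax = plt.subplots()
for r in rows:
    ax.scatter(r["cost_per_round"], r["avg_score"],
               s=60 * r["tournaments"],        # bubble size = number of tournaments
               alpha=0.6)
    ax.annotate(r["model"], (r["cost_per_round"], r["avg_score"]), fontsize=7)

ax.set_xscale("log")                           # costs span ~$0.0005 to ~$0.05 per round
ax.set_xlabel("Cost per round (USD)")
ax.set_ylabel("Composite score (%)")           # top-left = cheap and good
plt.show()
```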

Tournament Results

Each tournament gives the same task to every model, over multiple rounds with test feedback. The best composite score wins.

| Complexity | Task | Description | Winner | Score | Builders | Rounds |
|------------|------|-------------|--------|-------|----------|--------|
| complex | issue_classifier.py | GitHub issue triage classifier. Multi-label, priority scoring, team routing. | gpt-5.4 | 79.2% | 15 | 2 |
| complex | session_metrics.py | Session telemetry aggregator. Event streams, time-windowed stats, percentile calc. | gpt-5.4 | 82.6% | 15 | 3 |
| complex | quality_pipeline.py | EMA penalty calculator + health monitor + pre-filter. 131 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 95.9% | 15 | 4 |
| complex | ops_report.py | Operational report generator. System health aggregation, alert correlation, markdown output. | NVIDIA-Nemotron-3-Super-120B-A12B | 82.8% | 15 | 4 |
| trivial | greeting.py | Simple greeting module. ~20 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 99.9% | 14 | 1 |

Latest Leaderboard

Best score per builder in the most recent tournament.

| # | Builder | Model | Provider | Params | Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---------|-------|----------|--------|------|------|-----------|-------|-------|-----------|---------|---------|
| 1 | nemo-2 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.9% | 35/35 | 100% | 2s | $0.0017 |
| 2 | llama-1 | Llama-3.3-70B-Instruct | DeepInfra | 70B | Dense | mid | Mac Studio M4 Max 64GB | 99.9% | 35/35 | 100% | 2s | $0.0014 |
| 3 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Dense | budget | Mac Mini M4 Pro 24GB | 99.9% | 35/35 | 100% | 2s | $0.0011 |
| 4 | gemini25-flash | gemini-2.5-flash | Google | Proprietary | | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 5 | gemini3-flash | gemini-3-flash-preview | Google | Proprietary | | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 6 | gpt-1 | gpt-5.4 | OpenAI | Proprietary | | frontier | Cloud only | 99.9% | 35/35 | 100% | 4s | $0.0350 |
| 7 | sonnet-1 | claude-sonnet-4-6 | Anthropic | Proprietary | | premium | Cloud only | 99.8% | 35/35 | 100% | 3s | $0.0510 |
| 8 | gemini3-pro | gemini-3-pro-preview | Google | Proprietary | | premium | Cloud only | 99.8% | 35/35 | 100% | 5s | $0.0325 |
| 9 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 5s | $0.0019 |
| 10 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | MoE | premium | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 9s | $0.0045 |
| 11 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.7% | 35/35 | 100% | 10s | $0.0017 |
| 12 | phi4-1 | phi-4 | DeepInfra | 14B | Dense | budget | Mac Mini M4 16GB | 79.9% | 35/35 | 100% | 2s | $0.0005 |
| 13 | dsv3-1 | DeepSeek-V3.2 | DeepInfra | 685B | MoE | budget | 3× Mac Ultra 192GB cluster | 79.7% | 35/35 | 100% | 10s | $0.0017 |
| 14 | minimax-1 | MiniMax-M2.5 | DeepInfra | Proprietary | | mid | Unknown | 79.5% | 35/35 | 100% | 16s | $0.0034 |

Lifetime Model Performance

Models with 6+ tournaments. Sorted by value score. Confidence: HIGH (10+T), MED (6+T), LOW (3+T).

No models have reached the 6-tournament threshold yet; every current entry appears under Insufficient Data below.

Total Spend: $0.14

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ |
|-------|----------|--------|-----------|------------|------------|-----------|-----------|------------|
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gemini-2.5-flash | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gemini-3-flash-preview | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.00 |
| gpt-5.4 | OpenAI | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | $0.03 |
| claude-sonnet-4-6 | Anthropic | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | $0.05 |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 1T / 2r | VERY LOW | 100% | 99.8% | $0.00 |
| gemini-3-pro-preview | Google | | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | $0.03 |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | $0.00 |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | $0.00 |
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 1T / 1r | VERY LOW | 100% | 79.9% | $0.00 |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 79.7% | $0.00 |
| MiniMax-M2.5 | DeepInfra | | Unknown | 1T / 1r | VERY LOW | 100% | 79.5% | $0.00 |

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|-------|----------|-----|------------|------------|
| 1 | Llama-3.3-70B-Instruct | DeepInfra | 1639 | VERY LOW | 1T / 1r |
| 2 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1625 | VERY LOW | 1T / 1r |
| 3 | gemini-2.5-flash | Google | 1595 | VERY LOW | 1T / 1r |
| 4 | gemini-3-flash-preview | Google | 1569 | VERY LOW | 1T / 1r |
| 5 | gpt-5.4 | OpenAI | 1542 | VERY LOW | 1T / 1r |
| 6 | claude-sonnet-4-6 | Anthropic | 1519 | VERY LOW | 1T / 1r |
| 7 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1507 | VERY LOW | 1T / 2r |
| 8 | gemini-3-pro-preview | Google | 1491 | VERY LOW | 1T / 1r |
| 9 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1465 | VERY LOW | 1T / 1r |
| 10 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1438 | VERY LOW | 1T / 1r |
| 11 | phi-4 | DeepInfra | 1403 | VERY LOW | 1T / 1r |
| 12 | DeepSeek-V3.2 | DeepInfra | 1368 | VERY LOW | 1T / 1r |
| 13 | MiniMax-M2.5 | DeepInfra | 1340 | VERY LOW | 1T / 1r |
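As a rough illustration of the method, a minimal Bradley-Terry fit over pairwise round outcomes, mapped onto an ELO-like scale. The MM update loop and the 1500-centred scaling are assumptions for the sketch, not the runner's exact implementation.

```python
# Minimal Bradley-Terry fit over pairwise round outcomes (illustrative only).
# wins[(a, b)] = rounds in which builder a outscored builder b.
import math

def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, int]:
    models = {m for pair in wins for m in pair}
    strength = {m: 1.0 for m in models}
    for _ in range(iters):                      # MM (minorise-maximise) updates
        for m in models:
            won = sum(w for (a, _), w in wins.items() if a == m)
            den = sum(w / (strength[a] + strength[b])
                      for (a, b), w in wins.items() if m in (a, b))
            if den > 0:
                strength[m] = max(won, 1e-6) / den
    # Map strengths onto an ELO-like scale centred at 1500 (assumed convention).
    mean_log = sum(math.log(s) for s in strength.values()) / len(strength)
    return {m: round(1500 + 400 / math.log(10) * (math.log(s) - mean_log))
            for m, s in strength.items()}

print(bradley_terry({("llama-1", "phi4-1"): 3, ("phi4-1", "llama-1"): 1}))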

Pass@1 vs Self-Repair

How often does a model pass all tests on the first attempt (Pass@1) versus needing iteration (Repaired) or never passing (Failed)?

| Model | Provider | Pass@1 | Repaired | Failed |
|-------|----------|--------|----------|--------|
| Llama-3.3-70B-Instruct | DeepInfra | 100% | 0% | 0% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 100% | 0% | 0% |
| gemini-2.5-flash | Google | 100% | 0% | 0% |
| gemini-3-flash-preview | Google | 100% | 0% | 0% |
| gpt-5.4 | OpenAI | 100% | 0% | 0% |
| claude-sonnet-4-6 | Anthropic | 100% | 0% | 0% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 100% | 0% | 0% |
| gemini-3-pro-preview | Google | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 100% | 0% | 0% |
| phi-4 | DeepInfra | 100% | 0% | 0% |
| DeepSeek-V3.2 | DeepInfra | 100% | 0% | 0% |
| MiniMax-M2.5 | DeepInfra | 100% | 0% | 0% |
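A sketch of how each entry can be bucketed from its per-round results. The `rounds_passed` list of booleans is an assumed representation, not the runner's actual schema.

```python
# Classify one tournament entry as Pass@1, Repaired, or Failed from its
# per-round pass flags (assumed schema: one boolean per round, in order).
def classify(rounds_passed: list[bool]) -> str:
    if not rounds_passed:
        return "failed"
    if rounds_passed[0]:
        return "pass@1"       # all tests green on the first attempt
    if any(rounds_passed):
        return "repaired"     # failed round 1, fixed in a later round
    return "failed"           # never passed the full suite

assert classify([True]) == "pass@1"
assert classify([False, True]) == "repaired"
assert classify([False, False]) == "failed"
```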

Performance by Complexity

Average score broken down by task complexity tier.

| Model | Provider | Trivial | Medium | Complex |
|-------|----------|---------|--------|---------|
| Llama-3.3-70B-Instruct | DeepInfra | | | |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | | | |
| gemini-2.5-flash | Google | | | |
| gemini-3-flash-preview | Google | | | |
| gpt-5.4 | OpenAI | | | |
| claude-sonnet-4-6 | Anthropic | | | |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | | | |
| gemini-3-pro-preview | Google | | | |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | | | |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | | | |
| phi-4 | DeepInfra | | | |
| DeepSeek-V3.2 | DeepInfra | | | |
| MiniMax-M2.5 | DeepInfra | | | |

Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying for cloud APIs? Estimates assume Q4 quantisation, with single-Mac, multi-Mac cluster, or Nvidia DGX options.

Single Mac: 4/13 · Mac/DGX Cluster: 3/13 · Cloud Only: 6/13 · All Open $/yr: $0 · Proprietary $/yr: $1

Buy or rent?

Top self-hostable model scores 99.9% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$1/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.

Cloud only (proprietary)

gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5. These are closed-source — no self-hosting option.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 79.7%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)
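The Q4 sizes quoted above track roughly 0.58 bytes per parameter. A small sketch of that estimate and the hardware mapping it implies; the 0.58 constant is reverse-engineered from the figures in this section and the tier thresholds are illustrative assumptions.

```python
# Rough Q4 memory estimate used to map models onto Apple Silicon tiers.
# 0.58 bytes/param is inferred from the sizes above (e.g. 120B -> ~70GB);
# the thresholds below are illustrative, not the runner's exact rules.
def q4_size_gb(params_billion: float) -> float:
    return params_billion * 0.58   # ~4.6 bits/param incl. quantisation overhead

def suggest_host(params_billion: float) -> str:
    gb = q4_size_gb(params_billion)
    if gb <= 10:
        return "Mac Mini M4 16GB"
    if gb <= 16:
        return "Mac Mini M4 Pro 24GB"
    if gb <= 48:
        return "Mac Studio M4 Max 64GB"
    if gb <= 96:
        return "Mac Studio M4 Ultra 128GB"
    return "multi-Mac cluster or DGX"

print(q4_size_gb(120), suggest_host(120))   # ~70 GB -> Mac Studio M4 Ultra 128GB
```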

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok. Opus costs $0.00 lifetime for rank #11.

DeepSeek-V3.2 is a value star

$0.0008/round — cheapest in the pool. 100% tests on first entry. Immediately competitive.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $0.14

Across 5 tournaments, 1 round, 13 models. Opus is 0% of total spend.

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
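A minimal sketch of that loop, with the generation and test-execution steps passed in as callables. The function names, schema, and round structure are illustrative, not the runner's actual internals.

```python
# Sketch of the tournament loop: same spec to every model, run the suite,
# feed failures back each round. Callables stand in for the real internals.
from typing import Callable

def run_tournament(
    task_spec: str,
    tests: str,
    rubric: str,
    models: list[str],
    generate: Callable[[str, str, str, str, str], str],    # model, spec, tests, rubric, feedback -> code
    run_tests: Callable[[str, str], tuple[int, int, str]],  # code, tests -> (passed, failed, failure log)
    max_rounds: int = 3,
) -> dict[str, list[dict]]:
    history: dict[str, list[dict]] = {m: [] for m in models}
    for round_no in range(1, max_rounds + 1):
        for model in models:
            feedback = history[model][-1]["failures"] if history[model] else ""
            code = generate(model, task_spec, tests, rubric, feedback)
            passed, failed, failures = run_tests(code, tests)
            history[model].append(
                {"round": round_no, "passed": passed, "failed": failed, "failures": failures}
            )
    return history
```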

Scoring

Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.
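For illustration, one way such a composite with an iteration penalty can be computed. The 60/25/15 weights and the 2-point-per-extra-round penalty are assumptions; the document does not publish the exact formula.

```python
# Illustrative composite score; weights and penalty size are assumptions.
def composite_score(pass_rate: float, checklist: float, quality: float,
                    rounds_used: int, penalty_per_round: float = 0.02) -> float:
    base = 0.60 * pass_rate + 0.25 * checklist + 0.15 * quality
    penalty = penalty_per_round * max(rounds_used - 1, 0)
    return max(base - penalty, 0.0)

# 35/35 tests, 90% checklist compliance, 0.8 quality, fixed in round 2:
print(round(composite_score(1.0, 0.90, 0.80, rounds_used=2), 3))  # ~0.925
```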

Fair Comparison

All models run in isolated git worktrees with identical inputs. No per-model prompt engineering. Same temperature; token limits adjusted only for model capability. Real production tasks, not synthetic benchmarks.
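A sketch of the isolation step, assuming one detached worktree per builder and shared sampling settings. The branch layout, directory names, and parameter values are assumptions.

```python
# Per-builder isolation: one detached git worktree per model so builders
# cannot see each other's edits. Names and settings below are illustrative.
import subprocess
from pathlib import Path

COMMON_PARAMS = {"temperature": 0.2, "max_tokens": 8192}   # assumed shared settings

def make_worktree(repo: Path, builder: str, base_ref: str = "main") -> Path:
    wt = repo / ".tournament" / builder
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(wt), base_ref],
        check=True,
    )
    return wt
```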

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
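A sketch of that per-round estimate from per-million-token prices. The prices and model names below are placeholders, not any provider's actual schedule.

```python
# Per-round cost estimate from per-million-token pricing (placeholder prices).
PRICING_PER_MTOK = {                       # (input $, output $) per 1M tokens
    "example-open-model": (0.10, 0.50),
    "example-frontier-model": (1.25, 10.00),
}

def round_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# e.g. 12k prompt tokens + 1.6k completion tokens on the open model:
print(f"${round_cost('example-open-model', 12_000, 1_600):.4f}")  # $0.0020
```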