Bob Tournament Runner

LLMs compete to write production Python. Best code wins.

Auto-updated after each tournament · raw data available as JSONL

Tournaments: 7 · Total Rounds: 14 · Models: 19 · Total Spend: $6.57

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.
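The chart can be rebuilt from the raw JSONL export. A minimal sketch, assuming each line carries `model`, `score`, `cost_per_round`, and `tournaments` fields (the field and file names are assumptions, not the actual export schema):

```python
# Sketch: score-vs-cost bubble chart from the JSONL export.
# Field names and the file name are assumptions -- match them to the
# real export schema before running.
import json
import matplotlib.pyplot as plt

with open("tournaments.jsonl") as f:
    rows = [json.loads(line) for line in f]

xs = [r["cost_per_round"] for r in rows]
ys = [r["score"] for r in rows]
sizes = [40 * r["tournaments"] for r in rows]   # bubble size = tournament count

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=sizes, alpha=0.6)
for r, x, y in zip(rows, xs, ys):
    ax.annotate(r["model"], (x, y), fontsize=7)
ax.set_xscale("log")                            # $/round spans ~2 orders of magnitude
ax.set_xlabel("$/round")
ax.set_ylabel("Composite score (%)")
plt.show()
```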

Tournament Results

Each tournament gives the same task to all models. Multiple rounds with test feedback. Best composite score wins.

| Tier | Task | Description | Winner | Score | Builders | Rounds |
|------|------|-------------|--------|-------|----------|--------|
| trivial | greeting.py | Simple greeting module, ~20 lines. | gpt-4.1 | 85.0% | 12 | 3 |
| medium | skill_tracker.py | Breadcrumb file reader/parser, 60 lines. JSON parsing, timestamps. | gpt-4.1 | 85.0% | 12 | 3 |
| complex | skill_quality.py | Supabase telemetry, 210 lines. 5 RPCs, JSONL fallback, fcntl locking. | gpt-4.1 | 85.0% | 12 | 5 |
| complex | issue_classifier.py | GitHub issue triage classifier. Multi-label, priority scoring, team routing. | Mistral-Small-3.2-24B-Instruct-2506 | 54.8% | 15 | 2 |
| complex | session_metrics.py | Session telemetry aggregator. Event streams, time-windowed stats, percentile calc. | gpt-5.4 | 55.5% | 15 | 3 |
| complex | quality_pipeline.py | EMA penalty calculator + health monitor + pre-filter, 131 lines. | NVIDIA-Nemotron-3-Super-120B-A12B | 73.8% | 15 | 4 |
| ? | 3A-ops-report | — | claude-sonnet-4-6 | 58.4% | 15 | 4 |

Latest Leaderboard

Best score per builder in the most recent tournament.

| # | Builder | Model | Provider | Params / Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---------|-------|----------|---------------|------|-----------|-------|-------|-----------|---------|---------|
| 1 | gpt-1 | gpt-5.4 | OpenAI | Proprietary | frontier | Cloud only | 55.5% | 35/35 | 100% | 73s | $0.1050 |
| 2 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B Dense | budget | Mac Mini M4 Pro 24GB | 53.2% | 33/35 | 96% | 76s | $0.0043 |
| 3 | gemini25-flash | gemini-2.5-flash | Google | Proprietary | budget | Cloud only | 52.4% | 33/35 | 96% | 105s | $0.0063 |
| 4 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B MoE | premium | 2× Mac Ultra 192GB cluster | 50.5% | 32/35 | 92% | 173s | $0.0135 |
| 5 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 49.3% | 32/35 | 92% | 239s | $0.0056 |
| 6 | dsv3-1 | DeepSeek-V3.2 | DeepInfra | 685B MoE | budget | 3× Mac Ultra 192GB cluster | 48.7% | 32/35 | 92% | 772s | $0.0050 |
| 7 | nemo-2 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B MoE | mid | Mac Studio M4 Ultra 128GB | 40.4% | 21/35 | 30% | 62s | $0.0051 |
| 8 | sonnet-1 | claude-sonnet-4-6 | Anthropic | Proprietary | premium | Cloud only | 40.3% | 33/35 | 96% | 112s | $0.1530 |
| 9 | minimax-1 | MiniMax-M2.5 | DeepInfra | Proprietary | mid | Unknown | 40.0% | 35/35 | 100% | 472s | $0.0102 |
| 10 | phi4-1 | phi-4 | DeepInfra | 14B Dense | budget | Mac Mini M4 16GB | 39.9% | 32/35 | 92% | 69s | $0.0020 |
| 11 | llama-1 | Llama-3.3-70B-Instruct | DeepInfra | 70B Dense | mid | Mac Studio M4 Max 64GB | 39.4% | 33/35 | 96% | 241s | $0.0054 |
| 12 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B MoE | mid | Mac Studio M4 Ultra 128GB | 37.2% | 29/35 | 84% | 82s | $0.0051 |

Lifetime Model Performance

Models with 6+ tournaments. Sorted by value score. Confidence: HIGH (10+T), MED (6+T), LOW (3+T).

No model has reached 6 tournaments yet, so this table is empty; all models currently sit under Insufficient Data below. Total spend: $6.57.

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ |
|-------|----------|--------|-----------|------------|------------|-----------|-----------|------------|
| claude-opus-4-6 | Anthropic | — | Cloud only | 2T / 9r | VERY LOW | 67% | 58.6% | $3.06 |
| gpt-5.4 | OpenAI | — | Cloud only | 5T / 13r | LOW | 91% | 55.3% | $0.94 |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 4T / 16r | LOW | 80% | 54.7% | $0.12 |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 3T / 4r | LOW | 95% | 51.7% | $0.02 |
| Qwen2.5-72B-Instruct | DeepInfra | 72B | Mac Studio M4 Ultra 128GB | 1T / 15r | VERY LOW | 40% | 49.1% | $0.06 |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 3T / 4r | LOW | 94% | 48.9% | $0.02 |
| claude-sonnet-4-6 | Anthropic | — | Cloud only | 5T / 13r | LOW | 68% | 46.4% | $1.48 |
| gemini-2.5-flash | Google | — | Cloud only | 3T / 4r | LOW | 72% | 44.9% | $0.03 |
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 5T / 31r | LOW | 61% | 43.7% | $0.10 |
| gpt-4.1 | OpenAI | — | Cloud only | 1T / 5r | VERY LOW | 40% | 41.0% | $0.31 |
| Llama-3.1-Nemotron-70B-Instruct | DeepInfra | — | Unknown | 1T / 15r | VERY LOW | 40% | 40.9% | — |
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 3T / 4r | LOW | 91% | 39.7% | $0.01 |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 4T / 8r | LOW | 49% | 32.1% | $0.01 |
| MiniMax-M2.5 | DeepInfra | — | Unknown | 3T / 4r | LOW | 74% | 32.0% | $0.03 |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 4T / 20r | LOW | 55% | 31.9% | $0.09 |
| DeepSeek-R1-Distill-Llama-70B | DeepInfra | 70B | Mac Studio M4 Max 64GB | 1T / 2r | VERY LOW | 50% | 30.9% | $0.00 |
| GLM-5.1 | DeepInfra | — | Unknown | 3T / 4r | LOW | 24% | 13.6% | $0.01 |
| gemini-3-pro-preview | Google | — | Cloud only | 3T / 4r | LOW | 4% | 5.6% | $0.26 |
| gemini-3-flash-preview | Google | — | Cloud only | 3T / 4r | LOW | 0% | 4.2% | $0.03 |

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|-------|----------|-----|------------|------------|
| 1 | Qwen2.5-72B-Instruct | DeepInfra | 1765 | VERY LOW | 1T / 15r |
| 2 | claude-opus-4-6 | Anthropic | 1752 | VERY LOW | 2T / 9r |
| 3 | gpt-5.4 | OpenAI | 1740 | LOW | 5T / 13r |
| 4 | gpt-4.1 | OpenAI | 1652 | VERY LOW | 1T / 5r |
| 5 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1628 | LOW | 4T / 16r |
| 6 | GLM-5.1 | DeepInfra | 1618 | LOW | 3T / 4r |
| 7 | gemini-2.5-flash | Google | 1594 | LOW | 3T / 4r |
| 8 | claude-sonnet-4-6 | Anthropic | 1591 | LOW | 5T / 13r |
| 9 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1550 | LOW | 3T / 4r |
| 10 | Llama-3.1-Nemotron-70B-Instruct | DeepInfra | 1547 | VERY LOW | 1T / 15r |
| 11 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1510 | LOW | 3T / 4r |
| 12 | MiniMax-M2.5 | DeepInfra | 1432 | LOW | 3T / 4r |
| 13 | gemini-3-pro-preview | Google | 1354 | LOW | 3T / 4r |
| 14 | DeepSeek-R1-Distill-Llama-70B | DeepInfra | 1330 | VERY LOW | 1T / 2r |
| 15 | gemini-3-flash-preview | Google | 1329 | LOW | 3T / 4r |
| 16 | phi-4 | DeepInfra | 1327 | LOW | 3T / 4r |
| 17 | DeepSeek-V3.2 | DeepInfra | 1319 | LOW | 4T / 8r |
| 18 | Llama-3.3-70B-Instruct | DeepInfra | 1289 | LOW | 5T / 31r |
| 19 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1171 | LOW | 4T / 20r |
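A minimal sketch of the fitting step, assuming a nested `wins` mapping (`wins[a][b]` = rounds where a outscored b); the half-win prior and the 1500 + 400·log10 display scale are illustrative choices, not the runner's exact parameters:

```python
# Sketch: Bradley-Terry strengths via the standard MM (Zermelo) iteration,
# then mapped onto an Elo-like display scale.
import math

def bradley_terry(wins: dict, iters: int = 200) -> dict:
    models = sorted(set(wins) | {b for a in wins for b in wins[a]})
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            # half a virtual win keeps winless models' strengths finite
            w_i = 0.5 + sum(wins.get(i, {}).values())
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new[i] = w_i / denom if denom else p[i]
        mean = sum(new.values()) / len(new)
        p = {m: v / mean for m, v in new.items()}  # pin the overall scale
    return {m: round(1500 + 400 * math.log10(v)) for m, v in p.items()}

print(bradley_terry({"gpt-5.4": {"phi-4": 3}, "phi-4": {"gpt-5.4": 1}}))
```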

Pass@1 vs Self-Repair

How often does a model pass all tests on the first attempt versus needing iteration? Pass@1 = the full suite passes on the first attempt; Repaired = it passes only after test feedback; Failed = it never fully passes.

| Model | Provider | Pass@1 | Repaired | Failed |
|-------|----------|--------|----------|--------|
| claude-opus-4-6 | Anthropic | 11% | 56% | 33% |
| gpt-5.4 | OpenAI | 15% | 46% | 38% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 19% | 38% | 44% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 25% | 0% | 75% |
| Qwen2.5-72B-Instruct | DeepInfra | 0% | 40% | 60% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 0% | 0% | 100% |
| claude-sonnet-4-6 | Anthropic | 8% | 31% | 62% |
| gemini-2.5-flash | Google | 0% | 0% | 100% |
| Llama-3.3-70B-Instruct | DeepInfra | 10% | 39% | 52% |
| gpt-4.1 | OpenAI | 0% | 40% | 60% |
| Llama-3.1-Nemotron-70B-Instruct | DeepInfra | 0% | 40% | 60% |
| phi-4 | DeepInfra | 0% | 0% | 100% |
| DeepSeek-V3.2 | DeepInfra | 0% | 38% | 62% |
| MiniMax-M2.5 | DeepInfra | 25% | 25% | 50% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 5% | 20% | 75% |
| DeepSeek-R1-Distill-Llama-70B | DeepInfra | 0% | 50% | 50% |
| GLM-5.1 | DeepInfra | 0% | 0% | 100% |
| gemini-3-pro-preview | Google | 0% | 0% | 100% |
| gemini-3-flash-preview | Google | 0% | 0% | 100% |
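These buckets follow directly from per-round test counts; a minimal sketch, assuming each model's tournament history is an ordered list of (passed, total) pairs:

```python
# Sketch: bucket one model's tournament into Pass@1 / Repaired / Failed.
def classify(rounds: list[tuple[int, int]]) -> str:
    """rounds: ordered (passed, total) per round for one model."""
    if not rounds:
        return "failed"
    if rounds[0][0] == rounds[0][1]:
        return "pass@1"    # full suite green on the first attempt
    if any(p == t for p, t in rounds):
        return "repaired"  # green only after seeing test feedback
    return "failed"        # never reached a fully green suite

assert classify([(35, 35)]) == "pass@1"
assert classify([(21, 35), (35, 35)]) == "repaired"
assert classify([(21, 35), (29, 35)]) == "failed"
```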

Performance by Complexity

Average score broken down by task complexity tier.

| Model | Provider | Trivial | Medium | Complex |
|-------|----------|---------|--------|---------|
| claude-opus-4-6 | Anthropic | — | — | 73.7% |
| gpt-5.4 | OpenAI | — | — | 54.9% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | — | — | 54.9% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | — | — | 52.9% |
| Qwen2.5-72B-Instruct | DeepInfra | — | — | — |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | — | — | 49.0% |
| claude-sonnet-4-6 | Anthropic | — | — | 48.9% |
| gemini-2.5-flash | Google | — | — | 42.4% |
| Llama-3.3-70B-Instruct | DeepInfra | — | — | 45.5% |
| gpt-4.1 | OpenAI | — | — | — |
| Llama-3.1-Nemotron-70B-Instruct | DeepInfra | — | — | — |
| phi-4 | DeepInfra | — | — | 39.1% |
| DeepSeek-V3.2 | DeepInfra | — | — | 35.8% |
| MiniMax-M2.5 | DeepInfra | — | — | 26.0% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | — | — | 31.0% |
| DeepSeek-R1-Distill-Llama-70B | DeepInfra | — | — | 30.9% |
| GLM-5.1 | DeepInfra | — | — | 0.0% |
| gemini-3-pro-preview | Google | — | — | 7.4% |
| gemini-3-flash-preview | Google | — | — | 5.6% |


Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying for cloud APIs? Estimates assume Q4 quantisation, with single-Mac, multi-Mac cluster, or Nvidia DGX options.

Single Mac: 6/19 · Mac/DGX Cluster: 3/19 · Cloud Only: 10/19 · All Open: $3/yr · Proprietary: $45/yr

Buy or rent?

Top self-hostable model scores 85.0% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$49/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.
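Rough break-even arithmetic behind that claim, using the figures above; the utilisation multiplier is a hypothetical knob, not measured data:

```python
# Sketch: buy-vs-rent break-even. Hardware and cloud figures are from this
# page; the 50x utilisation factor below is purely hypothetical.
HARDWARE_COST = 5_500        # Mac Studio M4 Ultra 128GB
CLOUD_SPEND_PER_YEAR = 49    # API spend at current tournament volume

print(f"Tournament volume alone: "
      f"{HARDWARE_COST / CLOUD_SPEND_PER_YEAR:.0f} years to break even")

# The "pays for itself" argument needs the box to also absorb the other
# local AI workloads (council, audit, generation). If those offload, say,
# 50x the tournament's cloud spend:
utilisation = 50
print(f"At {utilisation}x volume: "
      f"{HARDWARE_COST / (CLOUD_SPEND_PER_YEAR * utilisation):.1f} years")
```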

Cloud only (proprietary)

claude-sonnet-4-6, Llama-3.1-Nemotron-70B-Instruct, claude-opus-4-6, gpt-4.1, gpt-5.4, gemini-2.5-flash, gemini-3-pro-preview, gemini-3-flash-preview, GLM-5.1, MiniMax-M2.5. These are closed-source or have no self-host sizing estimate, so no local option is listed.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 73.7%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 72.1%
Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 49.3%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 41.6% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 54.8% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 85.0% best)
DeepSeek-R1-Distill-Llama-70B → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 59.1% best)

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok (input/output). Opus has cost $3.06 lifetime for rank #11.

DeepSeek-V3.2 is a value star

At $0.0008/round it is the cheapest model in the pool, yet it passed 100% of tests on its first entry. Immediately competitive.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $6.57

Across 7 tournaments, 14 rounds, 19 models. Opus is 47% of total spend.

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
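A minimal sketch of the test step, assuming each submission is exercised with pytest in its own working directory; parsing pytest's `-q` summary line is an illustrative choice, not necessarily the runner's actual mechanism:

```python
# Sketch: run a submission's test suite and count results.
# Assumes pytest is installed in the target environment; a hung
# submission raises subprocess.TimeoutExpired after 5 minutes.
import re
import subprocess

def run_suite(workdir: str) -> tuple[int, int]:
    """Return (passed, total) for the suite in `workdir`."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q", "--tb=no"],
        cwd=workdir, capture_output=True, text=True, timeout=300,
    )
    # -q ends with a summary like "33 passed, 2 failed in 4.12s"
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", proc.stdout))
    failed = sum(int(n) for n in re.findall(r"(\d+) failed", proc.stdout))
    return passed, passed + failed
```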

Scoring

Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.
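A sketch of how such a composite could combine; the 60/25/15 weights and the 5%-per-extra-round penalty are loud assumptions, since the page only names the components and says later rounds are penalised:

```python
# Sketch: composite score. The 60/25/15 weights and the 5%-per-extra-round
# penalty are illustrative assumptions, not the runner's actual constants.
def composite_score(pass_rate: float, checklist: float, quality: float,
                    rounds_used: int, penalty: float = 0.05) -> float:
    """Component inputs in [0, 1]; returns a percentage."""
    base = 0.60 * pass_rate + 0.25 * checklist + 0.15 * quality
    return 100 * base * (1 - penalty) ** (rounds_used - 1)

# Full tests, strong checklist, decent quality, fixed on round 2:
print(round(composite_score(1.0, 0.9, 0.8, rounds_used=2), 1))  # 89.8
```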

Fair Comparison

All models run in isolated git worktrees with identical inputs. No prompt engineering per model. Same temperature, same token limits (adjusted for model capability). Real production tasks, not synthetic benchmarks.
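The worktree isolation can be done with standard `git worktree` commands; a minimal sketch (branch and path naming are illustrative):

```python
# Sketch: one isolated worktree per builder, so submissions never collide.
import subprocess
from pathlib import Path

def make_worktree(repo: Path, builder: str, base: str = "main") -> Path:
    safe = builder.lower().replace("/", "-").replace(".", "-")
    path = repo.parent / f"wt-{safe}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add",
         "-b", f"tournament/{safe}", str(path), base],
        check=True,
    )
    return path  # each builder writes only under its own tree
```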

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
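A sketch of the per-round estimate; the prices below are placeholders per million tokens, not any provider's actual schedule:

```python
# Sketch: cost estimate from token counts. (input $/MTok, output $/MTok)
# values are illustrative placeholders -- read them from the provider's
# pricing schedule in practice.
PRICING = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "phi-4": (0.07, 0.14),
}

def round_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    price_in, price_out = PRICING[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# A verbose model shifts cost into tokens_out:
print(f"${round_cost('phi-4', 12_000, 4_000):.4f}")
```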