Bob Tournament Runner

LLMs compete to write production Python. Best code wins.

Auto-updated after each tournament.

Tournaments: 20 · Total Rounds: 52 · Models: 14 · Total Spend: $10.45

Score vs Cost

Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.

Tournament Results

Each tournament gives the same task to all models. Multiple rounds with test feedback. Best composite score wins.

| Tier | Task | Winner | Score | Builders | Rounds |
|---|---|---|---|---|---|
| complex | issue_classifier.py | gpt-5.4 | 79.2% | 15 | 2 |
| complex | session_metrics.py | gpt-5.4 | 82.6% | 15 | 3 |
| complex | quality_pipeline.py | NVIDIA-Nemotron-3-Super-120B-A12B | 95.9% | 15 | 4 |
| complex | ops_report.py | NVIDIA-Nemotron-3-Super-120B-A12B | 82.8% | 15 | 4 |
| trivial | greeting.py | NVIDIA-Nemotron-3-Super-120B-A12B | 99.9% | 14 | 1 |
| trivial | greeting.py | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 1 |
| trivial | greeting.py | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 4 |
| trivial | greeting.py | Qwen3-Coder-480B-A35B-Instruct | 99.9% | 14 | 1 |
| complex | issue_classifier.py | gpt-5.4 | 79.7% | 14 | 5 |
| complex | issue_classifier.py | gemini-3-flash-preview | 71.5% | 14 | 5 |
| complex | ops_report.py | NVIDIA-Nemotron-3-Super-120B-A12B | 0.0% | 11 | 3 |
| complex | issue_classifier.py | Qwen3-Coder-480B-A35B-Instruct | 0.0% | 11 | 3 |
| complex | quality_pipeline.py | Llama-3.3-70B-Instruct-Turbo | 0.0% | 11 | 3 |
| complex | skill_quality.py | Qwen3-Coder-480B-A35B-Instruct | 0.0% | 11 | 3 |
| medium | skill_tracker.py | DeepSeek-V3.2 | 0.0% | 11 | 3 |
| medium | skill_tracker.py | NVIDIA-Nemotron-3-Super-120B-A12B | 86.5% | 11 | 5 |
| complex | session_metrics.py | gpt-5.4 | 78.9% | 11 | 5 |
| complex | quality_pipeline.py | gemini-3-flash-preview | 86.2% | 11 | 5 |
| complex | skill_quality.py | Mistral-Small-3.2-24B-Instruct-2506 | 85.4% | 11 | 5 |
| complex | ops_report.py | gpt-5.4 | 72.0% | 11 | 5 |

Task descriptions:

- greeting.py: Simple greeting module. ~20 lines.
- skill_tracker.py: Breadcrumb file reader/parser. 60 lines. JSON parsing, timestamps.
- issue_classifier.py: GitHub issue triage classifier. Multi-label, priority scoring, team routing.
- session_metrics.py: Session telemetry aggregator. Event streams, time-windowed stats, percentile calc.
- quality_pipeline.py: EMA penalty calculator + health monitor + pre-filter. 131 lines.
- ops_report.py: Operational report generator. System health aggregation, alert correlation, markdown output.
- skill_quality.py: Supabase telemetry. 210 lines. 5 RPCs, JSONL fallback, fcntl locking.

Latest Leaderboard

Best score per builder in the most recent tournament.

| # | Builder | Model | Provider | Params / Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B MoE | mid | Mac Studio M4 Ultra 128GB | 86.5% | 35/35 | 95% | 9s | $0.0037 |
| 2 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 86.4% | 35/35 | 100% | 8s | $0.0035 |
| 3 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B Dense | budget | Mac Mini M4 Pro 24GB | 86.4% | 35/35 | 100% | 5s | $0.0012 |
| 4 | sonnet-1 | claude-sonnet-4-6 | Anthropic | Proprietary | premium | Cloud only | 86.3% | 35/35 | 100% | 6s | $0.0640 |
| 5 | gpt-1 | gpt-5.4 | OpenAI | Proprietary | frontier | Cloud only | 86.2% | 35/35 | 100% | 6s | $0.0636 |
| 6 | gemini3-flash | gemini-3-flash-preview | Google | Proprietary | budget | Cloud only | 86.1% | 35/35 | 100% | 5s | $0.0104 |
| 7 | llama-1 | Llama-3.3-70B-Instruct-Turbo | DeepInfra | Unknown | unknown | Unknown | 86.1% | 35/35 | 100% | 17s | $0.0016 |
| 8 | dsv3-1 | DeepSeek-V3.2 | DeepInfra | 685B MoE | budget | 3× Mac Ultra 192GB cluster | 83.7% | 35/35 | 100% | 78s | $0.0005 |
| 9 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B MoE | premium | 2× Mac Ultra 192GB cluster | 74.9% | 35/35 | 100% | 8s | $0.0061 |
| 10 | phi4-1 | phi-4 | DeepInfra | 14B Dense | budget | Mac Mini M4 16GB | 74.8% | 35/35 | 100% | 5s | $0.0006 |

Lifetime Model Performance

Models with 6+ tournaments, sorted by score per dollar. Confidence: HIGH (10+ tournaments), MED (6+), LOW (3+).

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ | Score/$ | Value Score |
|---|---|---|---|---|---|---|---|---|---|---|
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 14T / 52r | HIGH | 63% | 41.9% | $0.05 | 9 | 66 |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 14T / 52r | HIGH | 64% | 51.8% | $0.07 | 8 | 72 |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 14T / 52r | HIGH | 31% | 23.1% | $0.04 | 6 | 45 |
| Llama-3.3-70B-Instruct-Turbo | DeepInfra | Unknown | | 11T / 45r | HIGH | 60% | 42.1% | $0.08 | 5 | 63 |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 14T / 64r | HIGH | 53% | 44.3% | $0.21 | 2 | 58 |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 14T / 52r | HIGH | 58% | 47.5% | $0.26 | 2 | 61 |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 14T / 52r | HIGH | 60% | 44.0% | $0.38 | 1 | 57 |
| MiniMax-M2.5 | DeepInfra | Unknown | | 14T / 52r | HIGH | 28% | 19.6% | $0.28 | 1 | 33 |
| gemini-3-flash-preview | Google | | Cloud only | 14T / 52r | HIGH | 69% | 54.5% | $0.83 | 1 | 64 |
| gpt-5.4 | OpenAI | | Cloud only | 14T / 52r | HIGH | 70% | 55.6% | $3.80 | 0 | 58 |
| claude-sonnet-4-6 | Anthropic | | Cloud only | 14T / 52r | HIGH | 64% | 46.7% | $3.62 | 0 | 51 |

Total Spend: $10.45
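The confidence labels are a direct function of tournament count. A minimal sketch of the mapping stated above (how models below 3 tournaments are labelled is not stated, so the "N/A" fallback is an assumption):

```python
def confidence(tournaments: int) -> str:
    """Tier thresholds as defined above: HIGH (10+T), MED (6+T), LOW (3+T)."""
    if tournaments >= 10:
        return "HIGH"
    if tournaments >= 6:
        return "MED"
    return "LOW" if tournaments >= 3 else "N/A"  # <3T: unrated (assumption)
```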

Insufficient Data

Models with fewer than 6 tournaments. Scores are preliminary.

| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Lifetime $ |
|---|---|---|---|---|---|---|---|---|
| gemini-3-pro-preview | Google | | Cloud only | 4T / 12r | LOW | 59% | 50.0% | $0.64 |
| gemini-2.5-flash | Google | | Cloud only | 4T / 12r | LOW | 55% | 49.0% | $0.19 |
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 3T / 7r | LOW | 55% | 45.8% | |

ELO Ratings

Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.

| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|---|---|---|---|---|
| 1 | gpt-5.4 | OpenAI | 1911 | HIGH | 14T / 52r |
| 2 | gemini-3-flash-preview | Google | 1797 | HIGH | 14T / 52r |
| 3 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1741 | HIGH | 14T / 52r |
| 4 | claude-sonnet-4-6 | Anthropic | 1683 | HIGH | 14T / 52r |
| 5 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1547 | HIGH | 14T / 52r |
| 6 | gemini-3-pro-preview | Google | 1519 | LOW | 4T / 12r |
| 7 | Llama-3.3-70B-Instruct | DeepInfra | 1452 | LOW | 3T / 7r |
| 8 | MiniMax-M2.5 | DeepInfra | 1398 | HIGH | 14T / 52r |
| 9 | phi-4 | DeepInfra | 1397 | HIGH | 14T / 52r |
| 10 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1393 | HIGH | 14T / 52r |
| 11 | Llama-3.3-70B-Instruct-Turbo | DeepInfra | 1313 | HIGH | 11T / 45r |
| 12 | DeepSeek-V3.2 | DeepInfra | 1305 | HIGH | 14T / 52r |
| 13 | gemini-2.5-flash | Google | 1290 | LOW | 4T / 12r |
| 14 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1255 | HIGH | 14T / 64r |
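For reference, a Bradley-Terry fit can be sketched in a few lines. The MM (minorization-maximization) update below is the textbook algorithm; the 0.5 pseudo-win and the Elo-style display mapping (1500 base, 400 points per 10× strength ratio) are assumptions, since the dashboard does not publish its exact fitting code:

```python
import math
from collections import defaultdict

def bradley_terry_elo(pairs, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) round pairs,
    then map them onto an Elo-style display scale."""
    models = {m for pair in pairs for m in pair}
    wins = {m: 0.5 for m in models}      # pseudo-win keeps winless models > 0
    games = defaultdict(float)           # head-to-head counts per ordered pair
    for w, l in pairs:
        wins[w] += 1.0
        games[(w, l)] += 1.0
        games[(l, w)] += 1.0
    p = {m: 1.0 for m in models}         # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games.get((i, j), 0.0) / (p[i] + p[j])
                        for j in models if games.get((i, j), 0.0) > 0)
            new_p[i] = wins[i] / denom if denom else p[i]
        # normalise so the geometric mean of strengths is 1
        gmean = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / gmean for m, v in new_p.items()}
    # 400 rating points per 10x strength ratio (assumed display scale)
    return {m: round(1500 + 400 * math.log10(v)) for m, v in p.items()}

# ratings = bradley_terry_elo([("gpt-5.4", "phi-4"), ("phi-4", "DeepSeek-V3.2")])
```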

Pass@1 vs Self-Repair

How often does a model pass all tests on the first attempt versus needing iteration? Pass@1 = all tests green in round 1; Repaired = passed after seeing test failures; Failed = never passed the full suite.

| Model | Provider | Pass@1 | Repaired | Failed |
|---|---|---|---|---|
| gpt-5.4 | OpenAI | 13% | 33% | 54% |
| gemini-3-flash-preview | Google | 10% | 21% | 69% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 10% | 23% | 67% |
| gemini-3-pro-preview | Google | 17% | 0% | 83% |
| gemini-2.5-flash | Google | 17% | 0% | 83% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 10% | 23% | 67% |
| claude-sonnet-4-6 | Anthropic | 12% | 29% | 60% |
| Llama-3.3-70B-Instruct | DeepInfra | 29% | 0% | 71% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 8% | 11% | 81% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 10% | 23% | 67% |
| Llama-3.3-70B-Instruct-Turbo | DeepInfra | 7% | 24% | 69% |
| phi-4 | DeepInfra | 10% | 23% | 67% |
| DeepSeek-V3.2 | DeepInfra | 8% | 19% | 73% |
| MiniMax-M2.5 | DeepInfra | 4% | 10% | 87% |
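The three buckets fall out of a simple classification over per-round test results. A sketch, assuming each tournament is recorded as a list of "all tests green" booleans per round (the runner's real record format is not shown):

```python
def pass_repair_stats(history):
    """Bucket each model's tournaments into Pass@1 / Repaired / Failed.
    `history` maps model -> list of tournaments, each a list of booleans
    marking whether all tests passed in that round (assumed format)."""
    out = {}
    for model, tournaments in history.items():
        buckets = {"pass@1": 0, "repaired": 0, "failed": 0}
        for rounds in tournaments:
            if rounds and rounds[0]:
                buckets["pass@1"] += 1    # green on the first attempt
            elif any(rounds):
                buckets["repaired"] += 1  # fixed after test feedback
            else:
                buckets["failed"] += 1    # never passed the full suite
        total = sum(buckets.values()) or 1
        out[model] = {k: f"{100 * v / total:.0f}%" for k, v in buckets.items()}
    return out
```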

Performance by Complexity

Average score broken down by task complexity tier.

[Chart: per-model average scores for the trivial, medium, and complex tiers, covering all 14 builders; values are rendered graphically and not reproduced here.]

Head-to-Head Comparison

Select two models to compare side-by-side.


Own Hardware vs Cloud

Could we run the winning models on our own hardware instead of paying for cloud APIs? Estimates assume Q4 quantisation, with single-Mac, multi-Mac cluster, and Nvidia DGX options.

Single Mac: 4/14 · Mac/DGX Cluster: 3/14 · Cloud Only: 7/14 · All Open $/yr: $3 · Proprietary $/yr: $24

Buy or rent?

The top self-hostable model scores 99.9%, competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$27/yr at current volume, so the hardware only pays for itself if it is shared across all local AI workloads (council, audit, generation) around the clock.

Cloud only (proprietary)

gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5, Llama-3.3-70B-Instruct-Turbo. These are either closed-source or lack a published self-host configuration in this pool, so cloud APIs are the only option.

Cluster options (large MoE models)

Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.9%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.8%

Single-Mac starter kit

phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)
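The Q4 footprints above follow the usual back-of-envelope rule: roughly half a byte per parameter, padded for KV cache and runtime overhead. A sketch, where the 20% overhead factor is an assumption rather than the dashboard's exact rule:

```python
def q4_memory_gb(params_billions: float, overhead: float = 1.2) -> float:
    """~0.5 bytes per parameter at 4-bit quantisation, padded by an
    assumed 20% for KV cache and runtime overhead."""
    return params_billions * 0.5 * overhead

# Roughly reproduces the listed footprints:
#   q4_memory_gb(480) -> 288.0  (listed: 280GB for Qwen3-Coder-480B)
#   q4_memory_gb(120) -> 72.0   (listed: 70GB for Nemotron-3-Super-120B)
#   q4_memory_gb(24)  -> 14.4   (listed: 14GB for Mistral-Small-24B)
```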

Key Insights

Open models beat premium

Nemotron wins at $0.10/$0.50 per MTok (input/output). Opus costs $0.00 lifetime for rank #11.

DeepSeek-V3.2 is a value star

At $0.0008/round, the cheapest model in the pool. Passed 100% of tests on its first entry and was immediately competitive.

Precise specs commoditise models

14/15 builders passed all tests. When the spec is clear, model choice barely matters.

Iteration hurts more than helps

Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.

Premium = consistency, not quality

Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.

Total spend: $10.45

Across 20 tournaments, 52 rounds, 14 models. Opus is 0% of total spend.

How We Test

Tournament Protocol

Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
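In outline, the loop looks something like the sketch below. `generate_code` is a hypothetical stand-in for the per-model client, and the pytest harness is one plausible way to run the suite, not necessarily the runner's own:

```python
import subprocess
import tempfile
from pathlib import Path

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Drop candidate code and its test file into a temp dir, run pytest,
    and return (all_passed, failure_tail) as feedback for the next round."""
    with tempfile.TemporaryDirectory() as d:
        Path(d, "solution.py").write_text(code)
        Path(d, "test_solution.py").write_text(tests)
        proc = subprocess.run(["python", "-m", "pytest", "-q", d],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout[-2000:]

def run_tournament(spec: str, tests: str, models: list[str], max_rounds: int = 5):
    """One tournament: identical spec and tests for every model, with
    test-failure feedback carried between rounds."""
    results = {}
    for model in models:
        feedback = ""
        for rnd in range(1, max_rounds + 1):
            code = generate_code(model, spec, tests, feedback)  # hypothetical client
            passed, feedback = run_tests(code, tests)
            results[model] = {"rounds": rnd, "passed": passed}
            if passed:
                break
    return results
```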

Scoring

Composite score combines test pass rate, checklist compliance, and code quality. Multiple rounds allow iteration: models see test failures and can fix their code, but a score penalty applies for each additional iteration.
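As a sketch with made-up weights (the dashboard does not publish the exact formula or penalty size):

```python
def composite_score(pass_rate: float, checklist: float, quality: float,
                    rounds_used: int, weights=(0.6, 0.2, 0.2),
                    round_penalty: float = 0.05) -> float:
    """Blend of test pass rate, checklist compliance, and code quality,
    minus a flat penalty per extra iteration. The 60/20/20 weights and
    5% penalty are illustrative assumptions, not the runner's values."""
    base = (weights[0] * pass_rate + weights[1] * checklist
            + weights[2] * quality)
    return max(0.0, base - round_penalty * (rounds_used - 1))

# composite_score(1.0, 0.9, 0.8, rounds_used=3) -> 0.84
```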

Fair Comparison

All models run in isolated git worktrees with identical inputs. No prompt engineering per model. Same temperature, same token limits (adjusted for model capability). Real production tasks, not synthetic benchmarks.
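Worktree isolation can be as simple as one `git worktree add` per builder, all cut from the same base commit; the `bob/<builder>` branch naming below is illustrative:

```python
import subprocess
from pathlib import Path

def builder_worktree(repo: Path, builder: str, base_ref: str = "HEAD") -> Path:
    """Check out the same base commit into a private worktree per builder
    so runs cannot see each other's edits."""
    path = repo.parent / f"bob-{builder}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add",
         "-b", f"bob/{builder}", str(path), base_ref],
        check=True,
    )
    return path
```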

Cost Tracking

Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
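The per-round estimate itself is straightforward; the token counts below are illustrative:

```python
def round_cost(tokens_in: int, tokens_out: int,
               usd_in_per_mtok: float, usd_out_per_mtok: float) -> float:
    """Estimated round cost from a provider's per-million-token prices."""
    return (tokens_in * usd_in_per_mtok + tokens_out * usd_out_per_mtok) / 1e6

# At the $0.10/$0.50 per-MTok prices quoted for Nemotron above, a
# 12k-token prompt with a 3k-token completion costs:
#   round_cost(12_000, 3_000, 0.10, 0.50) -> 0.0027  (~$0.003)
```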