LLMs compete to write production Python. Best code wins.
Auto-updated after each tournament · Download raw data (JSONL)
Quality plotted against cost per round. Top-left = best value. Bubble size = number of tournaments.
Each tournament gives the same task to all models. Multiple rounds with test feedback. Best composite score wins.
GitHub issue triage classifier. Multi-label, priority scoring, team routing.
Session telemetry aggregator. Event streams, time-windowed stats, percentile calc.
EMA penalty calculator + health monitor + pre-filter. 131 lines.
Operational report generator. System health aggregation, alert correlation, markdown output.
Simple greeting module. ~20 lines.
Best score per builder in the most recent tournament.
| # | Builder | Model | Provider | Params | Arch | Tier | Self-Host | Score | Tests | Avg Tests | Latency | $/round |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | nemo-2 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.9% | 35/35 | 100% | 2s | $0.0017 | |
| 2 | llama-1 | Llama-3.3-70B-Instruct | DeepInfra | 70B | Dense | mid | Mac Studio M4 Max 64GB | 99.9% | 35/35 | 100% | 2s | $0.0014 | |
| 3 | mistral-sm | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Dense | budget | Mac Mini M4 Pro 24GB | 99.9% | 35/35 | 100% | 2s | $0.0011 | |
| 4 | gemini25-flash | gemini-2.5-flash | Google | — | Proprietary | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 5 | gemini3-flash | gemini-3-flash-preview | Google | — | Proprietary | budget | Cloud only | 99.9% | 35/35 | 100% | 2s | $0.0021 |
| 6 | gpt-1 | gpt-5.4 | OpenAI | — | Proprietary | frontier | Cloud only | 99.9% | 35/35 | 100% | 4s | $0.0350 | |
| 7 | sonnet-1 | claude-sonnet-4-6 | Anthropic | — | Proprietary | premium | Cloud only | 99.8% | 35/35 | 100% | 3s | $0.0510 | |
| 8 | gemini3-pro | gemini-3-pro-preview | Google | — | Proprietary | premium | Cloud only | 99.8% | 35/35 | 100% | 5s | $0.0325 |
| 9 | qwen-turbo | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | MoE (Turbo) | mid | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 5s | $0.0019 | |
| 10 | qwen-1 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | MoE | premium | 2× Mac Ultra 192GB cluster | 99.7% | 35/35 | 100% | 9s | $0.0045 | |
| 11 | nemo-1 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | MoE | mid | Mac Studio M4 Ultra 128GB | 99.7% | 35/35 | 100% | 10s | $0.0017 | |
| 12 | phi4-1 | phi-4 | DeepInfra | 14B | Dense | budget | Mac Mini M4 16GB | 79.9% | 35/35 | 100% | 2s | $0.0005 | |
| 13 | dsv3-1 | DeepSeek-V3.2 | DeepInfra | 685B | MoE | budget | 3× Mac Ultra 192GB cluster | 79.7% | 35/35 | 100% | 10s | $0.0017 | |
| 14 | minimax-1 | MiniMax-M2.5 | DeepInfra | — | Proprietary | mid | Unknown | 79.5% | 35/35 | 100% | 16s | $0.0034 |
Models with 6+ tournaments, sorted by value score. Confidence: HIGH (10+ T), MED (6+ T), LOW (3+ T). No model has reached 6 tournaments yet, so only the running total is shown.
| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Trend | Lifetime $ | Score/$ | Value Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Total Spend | | | | | | | | | $0.14 | | |
Models with fewer than 6 tournaments. Scores are preliminary.
| Model | Provider | Params | Self-Host | T / Rounds | Confidence | Pass Rate | Avg Score | Trend | Lifetime $ | Score/$ | Value Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.3-70B-Instruct | DeepInfra | 70B | Mac Studio M4 Max 64GB | 1T / 1r | VERY LOW | 100% | 99.9% | — | $0.00 | — | — |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 24B | Mac Mini M4 Pro 24GB | 1T / 1r | VERY LOW | 100% | 99.9% | — | $0.00 | — | — |
| gemini-2.5-flash | Google | — | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | — | $0.00 | — | — |
| gemini-3-flash-preview | Google | — | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | — | $0.00 | — | — |
| gpt-5.4 | OpenAI | — | Cloud only | 1T / 1r | VERY LOW | 100% | 99.9% | — | $0.03 | — | — |
| claude-sonnet-4-6 | Anthropic | — | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | — | $0.05 | — | — |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 120B | Mac Studio M4 Ultra 128GB | 1T / 2r | VERY LOW | 100% | 99.8% | — | $0.00 | — | — |
| gemini-3-pro-preview | Google | — | Cloud only | 1T / 1r | VERY LOW | 100% | 99.8% | — | $0.03 | — | — |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | — | $0.00 | — | — |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 480B | 2× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 99.7% | — | $0.00 | — | — |
| phi-4 | DeepInfra | 14B | Mac Mini M4 16GB | 1T / 1r | VERY LOW | 100% | 79.9% | — | $0.00 | — | — |
| DeepSeek-V3.2 | DeepInfra | 685B | 3× Mac Ultra 192GB cluster | 1T / 1r | VERY LOW | 100% | 79.7% | — | $0.00 | — | — |
| MiniMax-M2.5 | DeepInfra | — | Unknown | 1T / 1r | VERY LOW | 100% | 79.5% | — | $0.00 | — | — |
Bradley-Terry rankings from pairwise round comparisons. Higher = wins more head-to-heads.
| # | Model | Provider | ELO | Confidence | T / Rounds |
|---|---|---|---|---|---|
| 1 | Llama-3.3-70B-Instruct | DeepInfra | 1639 | VERY LOW | 1T / 1r |
| 2 | Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 1625 | VERY LOW | 1T / 1r |
| 3 | gemini-2.5-flash | Google | 1595 | VERY LOW | 1T / 1r |
| 4 | gemini-3-flash-preview | Google | 1569 | VERY LOW | 1T / 1r |
| 5 | gpt-5.4 | OpenAI | 1542 | VERY LOW | 1T / 1r |
| 6 | claude-sonnet-4-6 | Anthropic | 1519 | VERY LOW | 1T / 1r |
| 7 | NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 1507 | VERY LOW | 1T / 2r |
| 8 | gemini-3-pro-preview | Google | 1491 | VERY LOW | 1T / 1r |
| 9 | Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 1465 | VERY LOW | 1T / 1r |
| 10 | Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 1438 | VERY LOW | 1T / 1r |
| 11 | phi-4 | DeepInfra | 1403 | VERY LOW | 1T / 1r |
| 12 | DeepSeek-V3.2 | DeepInfra | 1368 | VERY LOW | 1T / 1r |
| 13 | MiniMax-M2.5 | DeepInfra | 1340 | VERY LOW | 1T / 1r |
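For readers who want to reproduce a ranking like this one, a Bradley-Terry model can be fit directly from (winner, loser) pairs and then mapped onto an Elo-like scale. The sketch below is illustrative only: the update scheme, the pseudo-count regularisation, and the scale constants (1500 base, 400 spread) are assumptions, not the leaderboard's actual fitting code.

```python
import math
from collections import defaultdict

def bradley_terry_elo(pairs, iters=200, base=1500, spread=400):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via simple MM
    updates, then display them on an Elo-like scale. Illustrative sketch only;
    constants and regularisation are assumptions."""
    wins = defaultdict(float)    # total (regularised) wins per model
    games = defaultdict(float)   # games played between each unordered pair
    models = set()
    for winner, loser in pairs:
        models.update((winner, loser))
        wins[winner] += 1.0
        # tiny "virtual draw" keeps zero-win models from collapsing to zero strength
        wins[winner] += 0.05
        wins[loser] += 0.05
        games[frozenset((winner, loser))] += 1.1
    if not models:
        return {}

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for o in models:
                if o == m:
                    continue
                n = games.get(frozenset((m, o)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[o])
            new[m] = wins[m] / denom if denom else strength[m]
        # renormalise so the geometric mean of strengths stays at 1
        gmean = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {m: v / gmean for m, v in new.items()}

    return {m: round(base + spread * math.log10(s)) for m, s in strength.items()}
```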
How often does a model pass all tests on the first attempt versus needing iteration?
| Model | Provider | Pass@1 | Repaired | Failed |
|---|---|---|---|---|
| Llama-3.3-70B-Instruct | DeepInfra | 100% | 0% | 0% |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | 100% | 0% | 0% |
| gemini-2.5-flash | Google | 100% | 0% | 0% |
| gemini-3-flash-preview | Google | 100% | 0% | 0% |
| gpt-5.4 | OpenAI | 100% | 0% | 0% |
| claude-sonnet-4-6 | Anthropic | 100% | 0% | 0% |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 100% | 0% | 0% |
| gemini-3-pro-preview | Google | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | 100% | 0% | 0% |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | 100% | 0% | 0% |
| phi-4 | DeepInfra | 100% | 0% | 0% |
| DeepSeek-V3.2 | DeepInfra | 100% | 0% | 0% |
| MiniMax-M2.5 | DeepInfra | 100% | 0% | 0% |
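The three buckets above reduce to a simple rule per model per tournament: all tests green in round 1, green only after a feedback round, or never green. A minimal sketch, assuming results are recorded as a list of per-round booleans (the real harness schema is likely richer):

```python
from collections import Counter

def classify_outcome(rounds_passed):
    """Classify one tournament run. rounds_passed is an ordered list of booleans,
    True meaning the full test suite passed that round. (Assumed shape.)"""
    if not rounds_passed:
        return "Failed"
    if rounds_passed[0]:
        return "Pass@1"    # green on the first attempt
    if any(rounds_passed):
        return "Repaired"  # needed at least one test-feedback round
    return "Failed"        # never reached a fully passing state

def outcome_distribution(runs):
    """Aggregate many runs into the Pass@1 / Repaired / Failed percentages."""
    counts = Counter(classify_outcome(r) for r in runs)
    total = sum(counts.values()) or 1
    return {k: 100 * counts.get(k, 0) / total for k in ("Pass@1", "Repaired", "Failed")}
```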
Average score broken down by task complexity tier.
| Model | Provider | Trivial | Medium | Complex |
|---|---|---|---|---|
| Llama-3.3-70B-Instruct | DeepInfra | — | — | — |
| Mistral-Small-3.2-24B-Instruct-2506 | DeepInfra | — | — | — |
| gemini-2.5-flash | Google | — | — | — |
| gemini-3-flash-preview | Google | — | — | — |
| gpt-5.4 | OpenAI | — | — | — |
| claude-sonnet-4-6 | Anthropic | — | — | — |
| NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | — | — | — |
| gemini-3-pro-preview | Google | — | — | — |
| Qwen3-Coder-480B-A35B-Instruct-Turb... | DeepInfra | — | — | — |
| Qwen3-Coder-480B-A35B-Instruct | DeepInfra | — | — | — |
| phi-4 | DeepInfra | — | — | — |
| DeepSeek-V3.2 | DeepInfra | — | — | — |
| MiniMax-M2.5 | DeepInfra | — | — | — |
Could we run the winning models on our own hardware instead of paying cloud APIs? Q4 quantisation estimates. Single-Mac, multi-Mac cluster, or Nvidia DGX options.
Top self-hostable model scores 99.9% — competitive with cloud. Single-Mac option: Mac Studio M4 Ultra 128GB (~$5,500). Cloud API spend is ~$1/yr at current volume — hardware pays for itself when used across all local AI workloads (council, audit, generation) 24/7.
Closed-source, cloud-only (no self-hosting option): gemini-2.5-flash, gemini-3-flash-preview, claude-sonnet-4-6, gpt-5.4, gemini-3-pro-preview, MiniMax-M2.5.
Qwen3-Coder-480B-A35B-Instruct-Turb... (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
Qwen3-Coder-480B-A35B-Instruct (280GB Q4) — 2× Mac Ultra 192GB cluster (~$14,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 99.7%
DeepSeek-V3.2 (400GB Q4) — 3× Mac Ultra 192GB cluster (~$21,000) or 1× DGX H100 (640GB) (~$30K used). Best score: 79.7%
phi-4 → Mac Mini M4 16GB (~$800, 8GB Q4, 79.9% best)
Mistral-Small-3.2-24B-Instruct-2506 → Mac Mini M4 Pro 24GB (~$1,400, 14GB Q4, 99.9% best)
Llama-3.3-70B-Instruct → Mac Studio M4 Max 64GB (~$3,000, 40GB Q4, 99.9% best)
NVIDIA-Nemotron-3-Super-120B-A12B → Mac Studio M4 Ultra 128GB (~$5,500, 70GB Q4, 99.9% best)
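The Q4 footprints quoted above follow from a simple rule of thumb: a 4-bit weight is half a byte per parameter, plus headroom for the KV cache and runtime. A rough sketch, assuming a ~15% overhead factor (an assumption, not the exact factor behind these estimates):

```python
def q4_memory_gb(params_billion, overhead=1.15):
    """Rough unified-memory requirement for a Q4-quantised model.

    4-bit weights are 0.5 bytes per parameter; `overhead` is an assumed
    multiplier covering KV cache, activations, and runtime buffers.
    """
    weight_gb = params_billion * 0.5   # 0.5 GB per billion params at 4 bits/weight
    return weight_gb * overhead

# Roughly matching the figures above:
#   q4_memory_gb(480) -> ~276 GB  (Qwen3-Coder-480B, listed at 280 GB Q4)
#   q4_memory_gb(685) -> ~394 GB  (DeepSeek-V3.2, listed at 400 GB Q4)
#   q4_memory_gb(120) -> ~69 GB   (Nemotron-120B, listed at 70 GB Q4)
#   q4_memory_gb(70)  -> ~40 GB   (Llama-3.3-70B, listed at 40 GB Q4)
```

At that rate a single 192GB Mac Studio fits models up to roughly 300B parameters at Q4, which is why the 480B and 685B models need a multi-Mac cluster or a DGX.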
Nemotron wins at $0.10/$0.50 per MTok. Opus costs $0.00 lifetime for rank #11.
$0.0008/round — cheapest in the pool. 100% tests on first entry. Immediately competitive.
14/15 builders passed all tests. When the spec is clear, model choice barely matters.
Round 2: only 3/12 passed vs 10/12 in Round 1. Models break working code on retry.
Opus: 4/4 consistency but rank #11. You pay for reliability, not output quality.
Across 5 tournaments, 1 round, 13 models. Opus is 0% of total spend.
Each tournament gives the same production Python task to all models simultaneously. Models get the task spec, existing tests, and a scoring rubric. They generate code, which is automatically tested against the full test suite.
Composite score: test pass rate + checklist compliance + code quality. Multiple rounds allow iteration — models see test failures and can fix their code. Score penalties apply for each additional iteration.
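As a sketch of how such a composite might be combined, assuming illustrative weights and a flat per-retry penalty (the rubric's real weights and penalty are not published on this page):

```python
def composite_score(test_pass_rate, checklist_compliance, code_quality,
                    iterations_used, *, weights=(0.6, 0.2, 0.2), penalty_per_retry=0.05):
    """Blend the three rubric components and penalise extra iterations.

    All inputs are fractions in [0, 1]; iterations_used counts rounds beyond
    the first. Weights and penalty are illustrative assumptions only.
    """
    w_tests, w_checklist, w_quality = weights
    raw = (w_tests * test_pass_rate
           + w_checklist * checklist_compliance
           + w_quality * code_quality)
    penalised = raw - penalty_per_retry * max(iterations_used, 0)
    return max(0.0, min(1.0, penalised))

# e.g. 35/35 tests, full checklist, strong quality, fixed on the 2nd round:
# composite_score(1.0, 1.0, 0.95, iterations_used=1) -> 0.94
```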
All models run in isolated git worktrees with identical inputs. No prompt engineering per model. Same temperature, same token limits (adjusted for model capability). Real production tasks, not synthetic benchmarks.
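A minimal sketch of that isolation step, assuming a plain local git checkout and pytest as the test runner (directory layout and branch names are illustrative):

```python
import subprocess
from pathlib import Path

def make_worktree(repo: Path, builder: str, base_ref: str = "main") -> Path:
    """Create an isolated git worktree for one builder so every model starts
    from an identical snapshot of the task repo. Paths and refs are illustrative."""
    workdir = repo / ".tournament" / builder
    workdir.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(workdir), base_ref],
        check=True,
    )
    return workdir

def run_tests(workdir: Path) -> bool:
    """Run the task's full test suite inside the worktree; exit code 0 = all green."""
    result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=workdir)
    return result.returncode == 0
```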
Costs estimated from provider pricing schedules (input/output tokens per million). Actual token counts vary by model verbosity. Self-host costs based on Q4 quantisation memory requirements for Apple Silicon.
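The per-round dollar figures come straight from token counts and $/MTok pricing. A sketch of the arithmetic, with illustrative token counts (the $0.10/$0.50 rates are the Nemotron prices quoted above):

```python
def round_cost(prompt_tokens: int, completion_tokens: int,
               input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimate the cost of one round from token usage and $/MTok pricing.

    Prices should come from the provider's current schedule; token counts
    here are illustrative, not measured values.
    """
    return (prompt_tokens * input_price_per_mtok
            + completion_tokens * output_price_per_mtok) / 1_000_000

# e.g. at the Nemotron rates quoted above ($0.10 in / $0.50 out per MTok):
# round_cost(9_000, 1_600, 0.10, 0.50) -> ~$0.0017, in line with the leaderboard figure
```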