The v1.1.0 post had a single "one model wins all" PROD line; v1.8 has per-axis baselines from three live cohorts (May 16-17 '26).
/consultants coder: per-language routing (mlang v1.0.1, suite hash ddef8095, 6 langs × 13 questions).
| Lang | Primary | Fallback |
|---|---|---|
| c | glm-5.1:cloud | deepseek-v4-pro:cloud |
| cpp | deepseek-v4-flash:cloud | kimi-k2.6:cloud |
| csharp | deepseek-v4-pro:cloud | kimi-k2.6:cloud |
| go | kimi-k2.6:cloud | deepseek-v4-pro:cloud |
| python | glm-5.1:cloud | kimi-k2.6:cloud |
| rust | deepseek-v4-flash:cloud | deepseek-v4-pro:cloud |
No single model wins across languages. Out-of-cohort langs (ts/java/ruby/swift/shell) fall to the global default glm-5.1:cloud → kimi-k2.6:cloud.
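The routing above can be sketched as a plain lookup with a global default. This is a minimal illustration, not the skill's actual code: `ROUTES`, `GLOBAL_DEFAULT`, and `route` are hypothetical names, and reading the global default as a (primary, fallback) pair is an assumption.

```python
# Routing table copied from the post: language -> (primary, fallback).
ROUTES = {
    "c": ("glm-5.1:cloud", "deepseek-v4-pro:cloud"),
    "cpp": ("deepseek-v4-flash:cloud", "kimi-k2.6:cloud"),
    "csharp": ("deepseek-v4-pro:cloud", "kimi-k2.6:cloud"),
    "go": ("kimi-k2.6:cloud", "deepseek-v4-pro:cloud"),
    "python": ("glm-5.1:cloud", "kimi-k2.6:cloud"),
    "rust": ("deepseek-v4-flash:cloud", "deepseek-v4-pro:cloud"),
}

# Assumption: the global default is also a (primary, fallback) pair.
GLOBAL_DEFAULT = ("glm-5.1:cloud", "kimi-k2.6:cloud")


def route(lang: str) -> tuple[str, str]:
    """Return (primary, fallback) for a language (hypothetical helper)."""
    # Normalising case and whitespace is an illustrative choice, not from the post.
    return ROUTES.get(lang.strip().lower(), GLOBAL_DEFAULT)
```

For example, `route("go")` returns the go row, while `route("ts")` is out of cohort and returns `GLOBAL_DEFAULT`.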
tool_executor: 48 trials, suite 7921555c:
• gemma4:31b-cloud - 87.5% / Q=5.00 - winner (tiebreak: quality → wall → call count)
• kimi-k2.6 / deepseek-v4-pro / glm-5.1 - all 87.5% but lost tiebreakers
• gemini-3-flash-preview - 75% (qualifies)
• qwen3-coder-next - 62.5%, DISQUALIFIED. The Python coder winner is NOT a good tool_executor: "best at writing code" ≠ "best at mechanical tool chains for reading code".
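The tiebreak order (pass rate first, then quality → wall → call count) maps naturally onto a tuple sort key. A rough sketch, assuming higher quality wins and lower wall time and call count win. The `Result` fields and `pick_winner` are hypothetical names, and the models in the usage note are placeholders, not cohort data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    pass_rate: float    # fraction of trials passed, 0.0 to 1.0
    avg_quality: float  # mean quality score (higher is better)
    wall_s: float       # wall-clock seconds (assumed: lower is better)
    calls: int          # tool calls made (assumed: lower is better)


def pick_winner(results: list[Result]) -> Result:
    """Pick the best result: pass rate, then quality, then wall, then calls."""
    # Negate the "higher is better" fields so min() orders every field correctly.
    return min(
        results,
        key=lambda r: (-r.pass_rate, -r.avg_quality, r.wall_s, r.calls),
    )
```

With three placeholder models all at 0.875, the one with the highest quality wins, and equal quality falls through to the lower wall time. `pick_winner` raises `ValueError` on an empty list, which seems a reasonable failure mode for a bench with no results.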
⏱ Stall thresholds (M11a, suite c8306c62):
• kimi-k2.6:cloud cold TTFT = 150s → needs stall=390s. The global 300s default would falsely STARTUP_STALL it.
• qwen3-coder-next fastest startup (390ms).
• deepseek-v4-flash slowest p99 wall (255s) → hard_cap raised to 780s.
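Per-model overrides on top of a global default could look roughly like this. The numbers come from the bullets above. The dict names, the helper functions, and the rule "STARTUP_STALL fires once the wait for the first token exceeds the stall threshold" are assumptions for illustration, not the skill's real config format.

```python
from typing import Optional

GLOBAL_STALL_S = 300  # global default from the post

# Per-model overrides from the post; the dict names are hypothetical.
STALL_OVERRIDES_S = {"kimi-k2.6:cloud": 390}
HARD_CAP_OVERRIDES_S = {"deepseek-v4-flash": 780}


def stall_threshold_s(model: str) -> int:
    """Stall threshold in seconds: per-model override, else the global default."""
    return STALL_OVERRIDES_S.get(model, GLOBAL_STALL_S)


def hard_cap_s(model: str) -> Optional[int]:
    """Raised hard cap in seconds, or None (the post gives no global value)."""
    return HARD_CAP_OVERRIDES_S.get(model)


def is_startup_stall(model: str, waited_s: float) -> bool:
    """Assumed rule: stall once the wait for a first token exceeds the threshold."""
    return waited_s > stall_threshold_s(model)
```

Under these assumptions, a 350s wait counts as a stall for a model on the 300s default but not for kimi-k2.6:cloud with its 390s override.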
🧮 Rubric: pass_rate ≥ 70% AND avg_quality ≥ 3.5. Baselines append-only at docs/consultants-skill-eval-baselines.md. Re-bench any new model with claude-consultants skill-eval <suite> --live --accept-cost.
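The rubric above is a simple two-condition gate. A minimal sketch, with `qualifies` as a hypothetical helper name and pass rate expressed as a fraction rather than a percentage:

```python
MIN_PASS_RATE = 0.70    # rubric: pass_rate >= 70%
MIN_AVG_QUALITY = 3.5   # rubric: avg_quality >= 3.5


def qualifies(pass_rate: float, avg_quality: float) -> bool:
    """True only when both rubric thresholds are met (hypothetical helper)."""
    return pass_rate >= MIN_PASS_RATE and avg_quality >= MIN_AVG_QUALITY
```

With the numbers from the post, `qualifies(0.875, 5.00)` holds for gemma4:31b-cloud, while a 62.5% pass rate fails the gate whatever the quality score.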
Bench dirs: benchmarks/consultants/results/2026-05-17/