new dataset: which local LLM best *drives an agent*? benchmarked 4 models for pairing with Hermes Agent (@NousResearch), a CodeAct agent that writes python to call its tools. RTX 5090, llama.cpp. two phases, hybrid:
>>> phase A (synthetic): scored 4 axes — code-as-action, long-context, instruction-following under Hermes' real ~3.5K-token prompt, multi-step loops. top was a near-tie (within noise): an 18B frankenmerge (Qwopus) edged Qwen3.6-27B, and Hermes' own 36B came LAST.
>>> phase B (real harness): installed Hermes, ran the top 3 through 14 multi-step tasks x3 repeats. the tie broke — and an efficiency gap appeared:
Qwen3.6-27B 100% | 3.0 turns | 364 tok
Qwopus-18B 85.7% | 3.6 turns | 870 tok
Nemotron-30B 85.7% | 4.4 turns | 1334 tok
Qwen is perfect AND 2.4-3.7x more token-efficient, something a synthetic test can't see (only the real agent loop can). verdict: Qwen3.6-27B for local Hermes.
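
quick sanity check on the 2.4-3.7x figure: a minimal sketch that just divides each model's tok value by Qwen's, using the three phase B rows above. the dict layout and the `token_ratio` helper are made up for illustration; they are not the dataset's schema or anything from the Hermes harness, and the sketch treats `tok` as whatever per-run token count the table reports.

```python
# phase B numbers copied from the table above (success %, turns, tok)
results = {
    "Qwen3.6-27B": {"success_pct": 100.0, "turns": 3.0, "tokens": 364},
    "Qwopus-18B": {"success_pct": 85.7, "turns": 3.6, "tokens": 870},
    "Nemotron-30B": {"success_pct": 85.7, "turns": 4.4, "tokens": 1334},
}

BASELINE = "Qwen3.6-27B"


def token_ratio(model: str, baseline: str = BASELINE) -> float:
    """How many times more tokens `model` used than `baseline` (hypothetical helper)."""
    return results[model]["tokens"] / results[baseline]["tokens"]


# ratio of each other model's token count to the baseline, rounded to 1 decimal
ratios = {m: round(token_ratio(m), 1) for m in results if m != BASELINE}
print(ratios)  # {'Qwopus-18B': 2.4, 'Nemotron-30B': 3.7}
```

so the low end of the range is Qwopus (870 / 364) and the high end is Nemotron (1334 / 364).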
dataset: witcheer/hermes-pairing-bench
collection: witcheer/rtx-5090-benchmark-rig-6a17e365b534abb474250e11