witcheer PRO
witcheer
AI & ML interests
Local AI Maxxing.
Recent Activity
updated a dataset about 23 hours ago
witcheer/hermes-pairing-bench
posted an update 2 days ago
new dataset: which local LLM best *drives an agent*? benchmarked 4 models for pairing with Hermes Agent (@NousResearch) - a CodeAct agent that writes python to call its tools. RTX 5090, llama.cpp. two phases, hybrid:
>>> phase A (synthetic): scored 4 axes — code-as-action, long-context, instruction-following under Hermes' real ~3.5K-token prompt, multi-step loops. top was a near-tie (within noise): an 18B frankenmerge (Qwopus) edged Qwen3.6-27B, and Hermes' own 36B came LAST.
>>> phase B (real harness): installed Hermes, ran the top 3 through 14 multi-step tasks x3 repeats. the tie broke — and an efficiency gap appeared:
```
Qwen3.6-27B    100%  | 3.0 turns |  364 tok
Qwopus-18B     85.7% | 3.6 turns |  870 tok
Nemotron-30B   85.7% | 4.4 turns | 1334 tok
```
Qwen3.6-27B hit 100% AND used 2.4-3.7x fewer tokens than the other two (870/364 and 1334/364), something a synthetic test can't see; only the real agent loop can. verdict: Qwen3.6-27B for local Hermes.
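a quick sanity check of the efficiency claim, as a minimal sketch: the numbers are copied from the phase B table above, while the dict layout and variable names are illustrative assumptions and not the dataset's actual schema.

```python
# phase B numbers copied from the table in the post.
# key names (success_pct, turns, tokens) are hypothetical, not the dataset's columns.
results = {
    "Qwen3.6-27B": {"success_pct": 100.0, "turns": 3.0, "tokens": 364},
    "Qwopus-18B": {"success_pct": 85.7, "turns": 3.6, "tokens": 870},
    "Nemotron-30B": {"success_pct": 85.7, "turns": 4.4, "tokens": 1334},
}

# token usage of each model relative to the most efficient one (Qwen3.6-27B)
baseline = results["Qwen3.6-27B"]["tokens"]
token_ratio = {name: row["tokens"] / baseline for name, row in results.items()}

for name, ratio in token_ratio.items():
    print(f"{name}: {ratio:.1f}x the tokens of Qwen3.6-27B")
```

the ratios come out at roughly 2.4x for Qwopus-18B and 3.7x for Nemotron-30B, matching the range quoted above. 85.7% is also consistent with 36 of 42 runs (14 tasks x 3 repeats), though the post doesn't say how success was counted.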
dataset: https://huggingface.co/datasets/witcheer/hermes-pairing-bench
collection: https://hf.co/collections/witcheer/rtx-5090-benchmark-rig-6a17e365b534abb474250e11
updated a collection 4 days ago
RTX 5090 Benchmark Rig
Organizations
None yet