AI evaluation, LLM benchmarking, agent evaluation, reproducible eval workflows, model comparison, regression testing, failure analysis, eval datasets, and open-source developer tooling