Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams Paper • 2606.01770 • Published 3 days ago • 12
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation Paper • 2605.23271 • Published 14 days ago • 79
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining Paper • 2605.14747 • Published 22 days ago • 145
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools Paper • 2605.20682 • Published 16 days ago • 83
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection Paper • 2602.07892 • Published 24 days ago • 2
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence Paper • 2605.12882 • Published 23 days ago • 270
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation Paper • 2605.11739 • Published 23 days ago • 59
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon Paper • 2605.09708 • Published 26 days ago • 5
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents Paper • 2605.05185 • Published 30 days ago • 102
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver Paper • 2604.08377 • Published Apr 9 • 291
Adam's Law: Textual Frequency Law on Large Language Models Paper • 2604.02176 • Published Apr 2 • 504
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding Paper • 2603.19235 • Published Mar 19 • 95
InCoder-32B: Code Foundation Model for Industrial Scenarios Paper • 2603.16790 • Published Mar 17 • 311
Believe Your Model: Distribution-Guided Confidence Calibration Paper • 2603.03872 • Published Mar 4 • 40
From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models Paper • 2602.22859 • Published Feb 26 • 150
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training Paper • 2602.10693 • Published Feb 11 • 221
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise Paper • 2602.12783 • Published Feb 13 • 246
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents Paper • 2602.07274 • Published Feb 6 • 210
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs Paper • 2602.10388 • Published Feb 11 • 245