The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
Xiaobo Wang
Yofuria
AI & ML interests
Reward Modeling, Agent Memory, LLM Alignment
Recent Activity
updated a collection 1 day ago: UAPO