Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Abstract
Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific …