Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Abstract
Reinforcement learning with verifiable rewards (RLVR), owing to its deterministic verification, has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy…