You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extr…