Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Abstract
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the “zero-advantage problem”: when all sampled roll…