A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
Abstract
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment either depend…