Paper | Hugging Face | Published: 2026-05-03 | HF ↑2

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Authors: Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, and 5 others

Abstract

Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads …

#rl #llm #agent #benchmark
