Paper · Hugging Face · Published: 2026-05-11 · HF ↑11

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, and 3 others

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio …

#agent #llm #rl
