論文 arXiv 発表: 2026-05-12

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

著者: Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

要約

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model’s intrinsic potential to…

#multimodal#llm#diffusion#agent#alignment

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

要約

同じカテゴリの記事

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

World-R1: テキストから動画生成における3D制約の強化学習による整合

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents