SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
Abstract
As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking arises naturally in this setup, because the agent optimizes for passing tests while deviating from the user's true goal. We study this reward h…