Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Abstract
Self-distillation has emerged as a powerful framework for post-training LLMs: a teacher conditioned on extra information guides a student that lacks it, with both instantiated from the same model. This guidance is useful when the student has failed, but on successful rollouts the same mechanism instead overwrit…