GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Introduction

Reinforcement learning (RL) for post-training diffusion language models (dLLMs) is hindered by the intractability of the policy likelihood. ELBO-based methods are efficient and well aligned with pre-training, but introduce bias through training--inference mismatch, which can degrade performance.

$$\mathcal{L}_{\text{PG}}=\mathbb{E}_{ x_0 \sim \textcolor{blue}{\pi^\text{rm}_{\text{old}}}} \big[ \frac{\hat{\pi}_\theta(x_0)}{\textcolor{red}{\hat{\pi}_\text{old}(x_0)}} A \big],\ \mathcal{L}_{\text{PPO}}=\mathbb{E}_{ x_0 \sim \textcolor{blue}{\pi^\text{rm}_{\text{old}}}} \bigg[ \sum_{n=1}^N \min\big(\frac{\hat{\pi}_\theta(x_0^{(n)})}{\textcolor{red}{\hat{\pi}_\text{old}(x_0^{(n)})}} A, \text{clip}(\frac{\hat{\pi}_\theta(x_0^{(n)})}{\textcolor{red}{\hat{\pi}_\text{old}(x_0^{(n)})}}, 1\pm \epsilon) A \big)\bigg], $$ where the sampling distribution induced by re-masking, $\textcolor{blue}{\pi^{\rm rm}_{\rm old}}$, differs from the approximate likelihood $\textcolor{red}{\hat{\pi}_{\rm old}(x_0)}$, such as the exponential of the ELBO, $\exp L(x_0)$. This mismatch, $\textcolor{blue}{\pi^{\rm rm}_{\rm old}} \neq \textcolor{red}{\hat{\pi}_{\rm old}(x_0)}$, introduces bias into RL training and can degrade performance.

In this work, we propose Guided Denoiser Self-Distillation (GDSD) avoid this issue by directly distilling the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. We evaluate on planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, where GDSD matches or outperforms prior state-of-the-art ELBO-based methods with a more stable training reward dynamics.

Method: Guided Denoiser Self-Distillation (GDSD)

GDSD follows ELBO-based methods: randomly masking sequences and compute the denoising probabilitites $p(x_0|x_t)$. Unlike ELBO-based methods, GDSD directly matches the denoising logits of dLLMs to the target denoiser $p^*$, an advantage-guided self-teacher:

GDSD training procedure

GDSD is a principled self-distillation procedure equivalent to reverse-KL regularized RL $$\max_{\pi} \; \mathbb{E}_{x_0 \sim \pi} [ \psi A(x_0) - D_{{\text{KL}}}(\pi \| \pi_{\text{old}} ) - \beta D_{\text{KL}} ( \pi \,\|\, \pi_{{\text{ref}}} ) ]$$, which has the closed-form optimum $\pi^{*}(x_0) \propto \pi_{\text{old}}(x_0)^{{(1 - \beta)}} \cdot \pi_{{\text{ref}}}(x_0)^{{\beta}} \cdot \exp\big( \psi A(x_0) \big)$, and the optimal denoiser: $$ p^*(x_0|x_t) = p_\text{old}^\text{ref}(x_0|x_t) \cdot \exp(\psi A(x_0) - A_t(x_t)),$$ where the mixture denoiser $p_\text{old}^\text{ref}(x_0|x_t)\propto p_{\text{old}}(x_0|x_t)^{1-\beta} p_{\text{ref}}(x_0|x_t) ^{\beta}$. GDSD employs normalization-free logit matching to approximate $p^*$, successfully avoiding estimating the intractable likelihood and normalization constant $A_t$. Thus, GDSD is efficient and stable.

Results

Training

Our method GDSD achieves more stable training than ELBO-based methods.

Test Accuracy

Our method GDSD obtains better test accuracy than ELBO-based methods.

Method comparison on Dream-7B — Dream-7B

Our methods GDSD outperforms ELBO-based methods, achiving up to 91.7% on Sudoku, and 84.8% on Countdown using Dream-7B.
GDSD on LLaDA-8B matches or outperforms other ELBO-based methods, confirming that ELBO is not necessary for RL on dLLMs.

Method comparison on LLaDA-8B — LLaDA-8B

BibTeX

@misc{tang2026gdsdreinforcementlearningguided,
      title={GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models}, 
      author={Xiaohang Tang and Keyue Jiang and Che Liu and Qifang Zhao and Xiaoxiao Xu and Sangwoong Yoon and Ilija Bogunovic},
      year={2026},
      eprint={2605.29398},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.29398}, 
}