Understanding Discrete Diffusion Models

Published on: 2025-04-01

Description: An exploration of discrete diffusion models, their mathematical foundations, and applications in machine learning

Written by: Fadi Atieh

#machine-learning

#diffusion-models

#mathematics


Motivation

Discrete diffusion is where we’re headed, but the road runs through continuous diffusion first. Two points routinely confuse newcomers:

  1. Direction of the story. Tutorials start with “add Gaussian noise, then train a network to denoise.” Mathematically it’s the reverse: we assume a generative latent‑variable model that maps pure noise to data via a neural network, and we design a forward noising process that mirrors those reverse dynamics. In DDPMs we pick a fixed Gaussian Markov chain because it keeps the algebra clean, but in principle that forward kernel $q_\phi(z_{1:T} \mid x)$ could be learned alongside the decoder. The community freezes $\phi$ only for tractability (a minimal sketch of this fixed forward kernel follows the list).

  2. What we actually optimize. We can’t hit the exact log‑likelihood, so we minimize the negative Evidence Lower Bound (ELBO) instead. That ELBO decomposes into KL terms that are analytic only because the forward process is Gaussian and fixed—another reason the schedule is usually hard‑coded.
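To make the first point concrete, here is a minimal sketch of a fixed Gaussian forward (noising) chain of the kind DDPMs use. The linear noise schedule, the number of steps, and the toy data point are illustrative assumptions, not choices made anywhere in this post; note that nothing in the sketch has learnable parameters, precisely because $\phi$ is frozen.

```python
import numpy as np

# Minimal sketch of a fixed (not learned) Gaussian forward kernel, DDPM-style.
# The linear beta schedule, T, and the toy data point are illustrative assumptions.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)       # hard-coded noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_step(z_prev, t, rng):
    """One step of q(z_t | z_{t-1}) = N(sqrt(1 - beta_t) z_{t-1}, beta_t I)."""
    eps = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - betas[t]) * z_prev + np.sqrt(betas[t]) * eps

def q_marginal(x, t, rng):
    """Closed form q(z_t | x) = N(sqrt(abar_t) x, (1 - abar_t) I)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bars[t]) * x + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal(8)               # a toy "data point"
z = x
for t in range(T):                       # run the chain one step at a time
    z = q_step(z, t, rng)
# For large T, both the step-by-step chain and the closed-form marginal
# produce samples that are essentially pure standard-normal noise.
print(z)
print(q_marginal(x, T - 1, rng))
```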

If these VI concepts feel hazy, a quick refresher will pay dividends before we dive into diffusion proper.


Variational Inference in a Nutshell

What worked for me is to begin with the foundational paper Kingma & Welling (2013) – Auto‑Encoding Variational Bayes (AEVB). We’ll explain where all of the terms in that title come from. But first, let’s describe the problem at hand.

Setup

We observe an i.i.d. dataset $\{x_i\}_{i=1}^N$. Each data point is generated from a latent variable $z_i$; the $z_i$ are called latent because they are hidden and unobservable. We posit a generative model

$$p_\theta(x_i, z_i) = p_\theta(z_i)\, p_\theta(x_i \mid z_i).$$

We call it generative because it describes a probability distribution from which to generate the data. Under these modeling assumptions, $p_\theta(z_i)$ and $p_\theta(x_i \mid z_i)$ (and thus $p_\theta(x_i, z_i)$) are given and usually have analytic expressions. In many applications, our goal is to maximize the log-likelihood of the data,

$$\log p_\theta(x_i) = \log\Big(\int p_\theta(x_i, z_i)\, dz_i\Big).$$

Unfortunately, the integral is usually intractable.
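To make the setup concrete, here is a deliberately tiny instance of it: a standard-normal prior over a scalar latent and a small nonlinear Gaussian decoder. The tanh decoder and the brute-force grid integration are assumptions made for illustration only; with a one-dimensional latent the integral can still be approximated on a grid, which is exactly what stops working once $z_i$ is high-dimensional.

```python
import numpy as np

# A toy instance of p_theta(x, z) = p_theta(z) p_theta(x | z).
# The tanh decoder and the grid integration are illustrative assumptions.
rng = np.random.default_rng(0)
W, b = rng.standard_normal(2), rng.standard_normal(2)   # "theta" of the decoder

def log_prior(z):
    """log p_theta(z) for a standard-normal prior on a scalar latent."""
    return -0.5 * (z**2 + np.log(2 * np.pi))

def log_decoder(x, z):
    """log p_theta(x | z) = log N(x; tanh(W z + b), I)."""
    mean = np.tanh(W * z + b)
    return -0.5 * np.sum((x - mean) ** 2 + np.log(2 * np.pi))

# log p_theta(x) = log \int p_theta(x, z) dz has no closed form here.
# With a 1-D latent we can brute-force the integral on a grid; in high
# dimensions this is hopeless, which is the intractability in the text.
x = np.array([0.5, -1.0])
zs = np.linspace(-8.0, 8.0, 4001)
log_joint = np.array([log_prior(z) + log_decoder(x, z) for z in zs])
log_marginal = np.log(np.trapz(np.exp(log_joint), zs))
print(log_marginal)
```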

From Marginal Likelihood to ELBO

Bayes’ rule tells us that

$$p_\theta(z_i \mid x_i) = \frac{p_\theta(x_i \mid z_i)\, p_\theta(z_i)}{p_\theta(x_i)},$$

so an intractable marginal $p_\theta(x_i)$ is equivalent to an intractable posterior $p_\theta(z_i \mid x_i)$. This equivalence implies that approximating the posterior amounts to approximating the marginal.

This is where variational inference comes in. It introduces a family of tractable distributions $q_\phi(z_i \mid x_i)$, parameterized by $\phi$, to approximate the true posterior. The approach is pragmatic: since we can’t compute $p_\theta(z_i \mid x_i)$ exactly, we’ll find the closest approximation within a tractable family of our choosing.

The term “variational” refers to how we vary $\phi$ to find the optimal approximation within this family. By selecting a distribution $q_\phi(z_i \mid x_i)$ that balances expressiveness with computational feasibility, we can make progress where direct computation fails.

The derivation is a short manipulation of logarithms and expectations: write $\log p_\theta(x_i)$ as an expectation over $q_\phi(z_i \mid x_i)$ (it does not depend on $z_i$), expand $p_\theta(x_i) = p_\theta(x_i, z_i)/p_\theta(z_i \mid x_i)$, and multiply and divide by $q_\phi(z_i \mid x_i)$ inside the logarithm. The result is

$$\log p_\theta(x_i) = \text{ELBO}(\theta, \phi; x_i) + \text{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i \mid x_i)\big),$$

where the ELBO is given by

$$\text{ELBO}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i, z_i) - \log q_\phi(z_i \mid x_i)\big].$$

The KL term is non‑negative, so the first term, the Evidence Lower Bound, is indeed a lower bound on the true log‑likelihood. Maximizing the ELBO therefore gives us “evidence” that the log‑likelihood is at least that large. And the punch line is that all the terms in the ELBO are tractable.
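In a fully conjugate toy model all three quantities in this identity have closed forms, so we can check the decomposition numerically. The linear-Gaussian prior, likelihood, and Gaussian variational family below are assumptions chosen only to make the arithmetic explicit.

```python
import numpy as np

# Numerical check of  log p(x) = ELBO + KL(q || p(z|x))  in a 1-D linear-Gaussian model.
# Illustrative model: p(z) = N(0, 1), p(x|z) = N(a z, s2), q(z|x) = N(m, v).
a, s2 = 1.5, 0.5          # "theta"
m, v = 0.3, 0.4           # "phi", chosen freely
x = 1.0

# Exact marginal and posterior (available only because the model is conjugate).
log_px = -0.5 * (x**2 / (a**2 + s2) + np.log(2 * np.pi * (a**2 + s2)))
v_post = s2 / (s2 + a**2)
m_post = a * x / (s2 + a**2)

# ELBO(theta, phi; x) = E_q[log p(z)] + E_q[log p(x|z)] + H[q]; every term is analytic for Gaussians.
E_log_prior = -0.5 * (m**2 + v + np.log(2 * np.pi))
E_log_lik = -0.5 * (((x - a * m)**2 + a**2 * v) / s2 + np.log(2 * np.pi * s2))
entropy_q = 0.5 * np.log(2 * np.pi * np.e * v)
elbo = E_log_prior + E_log_lik + entropy_q

# KL between two univariate Gaussians, q = N(m, v) and p(z|x) = N(m_post, v_post).
kl = 0.5 * (v / v_post + (m - m_post)**2 / v_post - 1 + np.log(v_post / v))

print(np.isclose(log_px, elbo + kl))   # True: the identity holds to machine precision
```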

The ELBO’s Challenge

I need to confess a slight oversimplification. While we can evaluate both $q_\phi(z_i \mid x_i)$ and $p_\theta(x_i, z_i)$, computing the expectation in the ELBO:

$$\mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i, z_i) - \log q_\phi(z_i \mid x_i)\big]$$

is still an integral that might not have a closed form. A natural solution would be to approximate this expectation using Monte Carlo sampling:

$$\frac{1}{L}\sum_{l=1}^{L} \big[\log p_\theta(x_i, z_i^{(l)}) - \log q_\phi(z_i^{(l)} \mid x_i)\big], \qquad z_i^{(l)} \sim q_\phi(z_i \mid x_i),$$

where $L$ is the number of samples we draw.
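Here is a sketch of this estimator, reusing the same illustrative linear-Gaussian model and Gaussian $q_\phi$ as in the check above:

```python
import numpy as np

# Monte Carlo estimate of the ELBO from L samples of q_phi(z|x).
# Same illustrative toy model as before: p(z) = N(0, 1), p(x|z) = N(a z, s2), q(z|x) = N(m, v).
rng = np.random.default_rng(0)
a, s2 = 1.5, 0.5          # "theta"
m, v = 0.3, 0.4           # "phi"
x = 1.0

def log_joint(z):
    """log p_theta(x, z) = log p(z) + log p(x|z), vectorized over samples z."""
    return (-0.5 * (z**2 + np.log(2 * np.pi))
            - 0.5 * ((x - a * z)**2 / s2 + np.log(2 * np.pi * s2)))

def log_q(z):
    """log q_phi(z|x)."""
    return -0.5 * ((z - m)**2 / v + np.log(2 * np.pi * v))

L = 10_000
z = rng.normal(m, np.sqrt(v), size=L)          # z^(l) ~ q_phi(z|x)
elbo_mc = np.mean(log_joint(z) - log_q(z))
print(elbo_mc)   # hovers around the analytic ELBO of the toy model (about -1.98)
```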

But here’s the catch: our goal isn’t just to evaluate the ELBO; we need to optimize it. This means we need access to its gradients with respect to both $\theta$ and $\phi$. The gradient with respect to $\theta$ is straightforward since $\theta$ appears inside the expectation. However, $\phi$ presents a challenge because it determines the distribution we’re taking the expectation over.

To compute gradients with respect to $\phi$, we need to be able to move the gradient operator inside the expectation. This is only possible if the distribution we’re taking the expectation over doesn’t depend on the parameters we’re differentiating with respect to. This is exactly the constraint that the reparametrization trick circumvents.
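To preview where this is heading, here is a sketch of the reparametrization trick on the same illustrative toy model: writing $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ moves the randomness to a fixed, parameter-free distribution, so automatic differentiation can push gradients through the samples to $\phi = (\mu, \log\sigma)$. The use of PyTorch and the specific parameter values are assumptions made only for the demonstration.

```python
import math
import torch

# Reparametrized Monte Carlo ELBO: the expectation is over a fixed eps ~ N(0, 1),
# so gradients with respect to phi = (mu, log_sigma) flow through the samples.
# Same illustrative toy model as before: p(z) = N(0, 1), p(x|z) = N(a z, s2).
a, s2, x = 1.5, 0.5, 1.0
mu = torch.tensor(0.3, requires_grad=True)          # phi
log_sigma = torch.tensor(-0.5, requires_grad=True)  # phi (log of the std, for positivity)
sigma = log_sigma.exp()

eps = torch.randn(10_000)                           # noise from a parameter-free distribution
z = mu + sigma * eps                                # differentiable sampling path

c = math.log(2 * math.pi)
log_joint = -0.5 * (z**2 + c) - 0.5 * ((x - a * z)**2 / s2 + math.log(2 * math.pi * s2))
log_q = -0.5 * ((z - mu)**2 / sigma**2 + c) - log_sigma

elbo = (log_joint - log_q).mean()
elbo.backward()                                     # populates mu.grad and log_sigma.grad
print(mu.grad, log_sigma.grad)
```

Had we instead sampled $z$ directly from a distribution parameterized by $\phi$, there would be no gradient path from the samples back to $\phi$, which is exactly the obstacle described above.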