Published on: 2025-04-01
Description: An exploration of discrete diffusion models, their mathematical foundations, and applications in machine learning
Written by: Fadi Atieh
Discrete diffusion is where we’re headed, but the road runs through continuous diffusion first. Two points routinely confuse newcomers:
Direction of the story. Tutorials start with “add Gaussian noise, then train a network to denoise.” Mathematically it’s the reverse: we assume a generative latent‑variable model that maps pure noise to data via a neural network, and we design a forward noising process that mirrors those reverse dynamics. In DDPMs we pick a fixed Gaussian Markov chain because it keeps the algebra clean, but in principle that forward kernel could be learned alongside the decoder. The community freezes it only for tractability; a minimal sketch of this fixed kernel appears after these two points.
What we actually optimize. We can’t hit the exact log‑likelihood, so we minimize the negative Evidence Lower Bound (ELBO) instead. That ELBO decomposes into KL terms that are analytic only because the forward process is Gaussian and fixed—another reason the schedule is usually hard‑coded.
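To make the first point concrete, here is a minimal sketch of such a fixed Gaussian forward kernel, assuming PyTorch and the common linear noise schedule; the constants and names are illustrative, not a definitive implementation:

```python
import torch

# Hard-coded (frozen) noise schedule; T and the endpoints are illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # abar_t = product of alphas up to step t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).

    Because every forward step is Gaussian with fixed coefficients, the
    t-step marginal collapses to this single closed-form Gaussian; that
    is exactly the tractability that motivates freezing the schedule.
    """
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise
```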
If these VI concepts feel hazy, a quick refresher will pay dividends before we dive into diffusion proper.
What worked for me is to begin with the foundational paper Kingma & Welling (2013) – Auto‑Encoding Variational Bayes (AEVB). We’ll explain where all of the terms in the title come from. But first, let’s describe the problem at hand.
We observe an i.i.d. dataset $X = \{x^{(i)}\}_{i=1}^N$. Each data point $x^{(i)}$ is generated from a latent variable $z^{(i)}$, where the $z^{(i)}$ are called latent because they are hidden and unobservable. We posit a generative model

$$p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z).$$
We call it generative because it describes a probability distribution from which to generate the data. Under these modeling assumptions, $p_\theta(z)$ and $p_\theta(x \mid z)$ (and thus $p_\theta(x, z)$) are given and usually have analytic expressions. In many applications, our goal is to maximize the log-likelihood of the data,

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p_\theta(z)\, dz.$$
Unfortunately, the integral is usually intractable: once $p_\theta(x \mid z)$ involves a neural network, it admits no closed form, and numerical integration is hopeless when $z$ is even moderately high-dimensional.
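Each factor on its own is still easy to evaluate, though. As one concrete instance, here is a sketch assuming PyTorch; the choice of a standard-normal prior with a Bernoulli likelihood matches what AEVB uses for binary data, but the class name and layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class BernoulliDecoder(nn.Module):
    """One concrete generative model: a standard-normal prior p(z) and a
    Bernoulli likelihood p_theta(x | z) whose probabilities come from a
    small network. Layer sizes are illustrative."""

    def __init__(self, z_dim: int = 20, x_dim: int = 784, hidden: int = 400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, x_dim), nn.Sigmoid(),
        )

    def log_p_xz(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        """log p_theta(x, z) = log p_theta(x | z) + log p(z), both analytic."""
        probs = self.net(z)
        log_px_given_z = torch.distributions.Bernoulli(probs=probs).log_prob(x).sum(-1)
        log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        return log_px_given_z + log_pz
```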
Bayes’ rule tells us that

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)},$$

so an intractable marginal $p_\theta(x)$ is equivalent to an intractable posterior $p_\theta(z \mid x)$. This equivalence implies that approximating the posterior amounts to approximating the marginal.
This is where variational inference comes in. It introduces a family of tractable distributions $q_\phi(z \mid x)$, parameterized by $\phi$, to approximate the true posterior. The approach is pragmatic: since we can’t compute $p_\theta(z \mid x)$ exactly, we’ll find the closest approximation within a tractable family of our choosing.

The term “variational” refers to how we vary $\phi$ to find the optimal approximation within this family. By selecting a family that balances expressiveness with computational feasibility, we can make progress where direct computation fails.
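For instance, the family AEVB itself uses is a diagonal Gaussian whose mean and variance are produced by a neural network; a minimal sketch, again assuming PyTorch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Diagonal-Gaussian variational family q_phi(z | x).

    A small network maps x to the mean and log-variance of a Gaussian
    over z; phi is simply the network's weights.
    """

    def __init__(self, x_dim: int = 784, z_dim: int = 20, hidden: int = 400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)  # predict log sigma^2 for numerical stability

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu(h), self.logvar(h)
```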
Through a simple application of logarithms and expectations, we can derive

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\big\|\,p_\theta(z \mid x)\right),$$

where the ELBO is given by

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right].$$
The KL term is non‑negative, so the first term, the Evidence Lower Bound, is indeed a lower bound on the true log‑likelihood. Maximizing the ELBO therefore raises a floor under the “evidence” $\log p_\theta(x)$, which is where the name comes from. And the punch line is that all the terms in the ELBO are tractable.
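Since it is short, here is a sketch of that “simple application of logarithms and expectations”; every step uses only definitions already on the table:

$$
\begin{aligned}
\log p_\theta(x)
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x)\right]
&&\text{(constant in } z\text{)}\\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right]
&&\text{(Bayes' rule)}\\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
+ \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]
&&\text{(multiply and divide by } q_\phi\text{)}\\
&= \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\big\|\,p_\theta(z \mid x)\right).
\end{aligned}
$$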
I need to confess a slight oversimplification. While we can evaluate both $q_\phi(z \mid x)$ and $p_\theta(x, z)$ pointwise, computing the expectation in the ELBO,

$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right],$$

is still an integral that might not have a closed form. A natural solution would be to approximate this expectation using Monte Carlo sampling:

$$\mathcal{L}(\theta, \phi; x) \approx \frac{1}{L} \sum_{l=1}^{L} \left[\log p_\theta\!\left(x, z^{(l)}\right) - \log q_\phi\!\left(z^{(l)} \mid x\right)\right], \qquad z^{(l)} \sim q_\phi(z \mid x),$$

where $L$ is the number of samples we draw.
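In code, the estimator looks like the following sketch, assuming the `GaussianEncoder` from before and some callable `log_p_xz`, such as the decoder sketched earlier; both names are ours, not the paper’s:

```python
import torch

def elbo_mc(x, encoder, log_p_xz, L: int = 1) -> torch.Tensor:
    """Monte Carlo estimate of the ELBO.

    `encoder` maps x to (mu, logvar) of q_phi(z | x), e.g. GaussianEncoder;
    `log_p_xz(x, z)` evaluates log p_theta(x, z), e.g. BernoulliDecoder.log_p_xz.
    """
    mu, logvar = encoder(x)
    q = torch.distributions.Normal(mu, (0.5 * logvar).exp())
    total = 0.0
    for _ in range(L):
        z = q.sample()   # z^(l) ~ q_phi(z | x); note: no gradient path to phi
        total = total + log_p_xz(x, z) - q.log_prob(z).sum(-1)
    return total / L
```

As the comment warns, `q.sample()` blocks gradients, so this sketch estimates the ELBO’s value (and its $\theta$-gradient) but not its $\phi$-gradient; that is exactly the problem discussed next.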
But here’s the catch: our goal isn’t just to evaluate the ELBO; we need to optimize it. This means we need access to its gradients with respect to both $\theta$ and $\phi$. The gradient with respect to $\theta$ is straightforward, since $\theta$ appears only inside the expectation. However, $\phi$ presents a challenge, because it determines the very distribution we’re taking the expectation over.
To compute gradients with respect to $\phi$, we need to be able to move the gradient operator inside the expectation. This is only possible if the distribution we’re taking the expectation over doesn’t depend on the parameters we’re differentiating with respect to. Circumventing this constraint is exactly what the reparametrization trick accomplishes.
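As a preview, here is the trick in the diagonal-Gaussian case: sample $\epsilon \sim \mathcal{N}(0, I)$, whose distribution depends on nothing we are optimizing, and transform it deterministically. A minimal sketch:

```python
import torch

def reparametrize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I).

    The expectation is now over eps, whose distribution is independent of
    phi, so the gradient operator can move inside it and autograd can
    differentiate through mu and sigma.
    """
    eps = torch.randn_like(mu)   # parameter-free source of randomness
    return mu + (0.5 * logvar).exp() * eps
```

Replacing `q.sample()` in the `elbo_mc` sketch above with `reparametrize(mu, logvar)` is all it takes to make the estimator differentiable with respect to $\phi$.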