Diffusion Models

Original paper: Denoising Diffusion Probabilistic Models
Survey: Understanding Diffusion Models: A Unified Perspective

Intuition

A diffusion model is trained through a diffusion process that progressively adds noise to the original data. The model then learns how to reconstruct the original data from this noisy input.

graph LR;
A[Original Data] -->|Forward Diffusion| B[Noisy Data];
B -->|Reverse Denoising| A

Once the model has learned to reconstruct the data distribution from a (typically Gaussian) noise distribution, it can generate novel data by denoising samples of pure noise.

Forward Process

At each time step $t$, the transition probability from input $\mathbf{x}_{t-1}$ to output $\mathbf{x}_t$ is defined as a Gaussian:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right),$$

where $\beta_t \in (0, 1)$.

Reparameterization

We can generate $\mathbf{x}_t$ using the reparameterization trick:

$$\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol\epsilon_t, \quad \boldsymbol\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

Furthermore, we can generate $\mathbf{x}_t$ from $\mathbf{x}_0$ in a single reparameterization step:

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,$$

where

$$\bar\alpha_t := \prod_{i=1}^t \alpha_i, \quad \alpha_i = 1 - \beta_i, \quad \boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$
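
As a concrete illustration, here is a minimal NumPy sketch of this closed-form forward step, assuming a linear $\beta_t$ schedule (a common choice; the text does not fix one) and 0-indexed schedule arrays:

```python
import numpy as np

# Assumed linear beta schedule with T = 1000 steps (illustrative values only).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1, ..., beta_T (stored 0-indexed)
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # bar{alpha}_t = prod_{i <= t} alpha_i

def q_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) in a single reparameterization step.

    `t` is a 0-indexed time step into the schedule arrays above.
    Returns both x_t and the noise eps, since training uses eps as the target.
    """
    eps = rng.standard_normal(x0.shape)                 # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```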

Asymptotics

Since $\alpha_i \in (0, 1)$, as $t \to \infty$,

$$\bar\alpha_t \to 0.$$

Therefore, regardless of $\mathbf{x}_0$,

$$\mathbf{x}_\infty \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$
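
A quick numeric sanity check of this limit, using the assumed schedule from the sketch above:

```python
# bar{alpha}_t shrinks monotonically; for the linear schedule above,
# bar{alpha}_T is on the order of 1e-5, so x_T is essentially a standard
# Gaussian sample regardless of x_0.
print(alpha_bars[0], alpha_bars[T // 2], alpha_bars[-1])
```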

Reverse Process

The posterior distribution, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, is generally intractable:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)},$$

where

$$q(\mathbf{x}_t) = \int_{\mathbf{x}_0 \in \mathcal{X}} q(\mathbf{x}_t \mid \mathbf{x}_0)\, q(\mathbf{x}_0)\, d\mathbf{x}_0.$$

Therefore, we learn a network parameterized by $\theta$ to approximate $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ with a Gaussian distribution:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\boldsymbol\mu_\theta(\mathbf{x}_t, t),\, \boldsymbol\Sigma_\theta(\mathbf{x}_t, t)\right).$$

We approximate the log-likelihood using the ELBO:

$$\mathbb{E}_{q(\mathbf{x}_0)}\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] = \mathbb{E}_q\bigg[\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{1:T-1}} \underbrace{-\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}\bigg].$$

The $L_T$ Term

Since $\mathbf{x}_T$ is directly sampled from a standard Gaussian distribution and $q$ has no learnable parameters for fixed $\beta_{1:T}$, $L_T$ is a constant and is therefore dropped from the training objective.

The $L_{1:T-1}$ Term

The posterior conditioned on $\mathbf{x}_0$, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, is a tractable Gaussian:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde\beta_t \mathbf{I}\right),$$

where

$$\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon\right); \quad \tilde\beta_t := \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t.$$

For simplicity, $\boldsymbol\Sigma_\theta(\mathbf{x}_t, t)$ is set to untrained time-dependent constants $\sigma_t^2 \mathbf{I}$, where $\sigma_t^2 = \tilde\beta_t$. Since $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ are both tractable Gaussians, the KL divergence has a closed-form solution (see KL divergence between Gaussians):

$$L_{t-1} = \frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol\mu_\theta(\mathbf{x}_t, t)\right\|^2 + C.$$
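
The following sketch (reusing the schedule arrays from the forward-process sketch) computes the posterior parameters $\tilde{\boldsymbol\mu}_t$, $\tilde\beta_t$ and the resulting mean-matching term; `mu_theta` stands in for the network's predicted mean:

```python
def posterior_params(x_t, eps, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), with x_0 expressed through eps.

    `t` is 0-indexed into the schedule arrays; for the first step,
    bar{alpha}_{t-1} is taken to be 1.
    """
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]   # tilde{beta}_t
    return mean, var

def l_t_minus_1(mu_theta, x_t, eps, t):
    """L_{t-1} up to the additive constant C, with sigma_t^2 = tilde{beta}_t."""
    mean, var = posterior_params(x_t, eps, t)
    return np.sum((mean - mu_theta) ** 2) / (2.0 * var)
```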

The $L_0$ Term

To obtain a discrete log-likelihood, this term is set to an independent discrete decoder derived from the Gaussian $\mathcal{N}(\mathbf{x}_0; \boldsymbol\mu_\theta(\mathbf{x}_1, 1), \sigma_1^2 \mathbf{I})$. For discrete $\mathbf{x}_t$ of dimension $D$,

$$p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) = \prod_{i=1}^D \int_{F(x) = (\mathbf{x}_0)_i} \mathcal{N}\left(x; (\boldsymbol\mu_\theta)_i, \sigma_1^2\right) dx,$$

where $F(x)$ is a discretization function that maps $x \in \mathbb{R}$ to the discrete value $(\mathbf{x}_0)_i$.
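
As an illustration, here is a minimal sketch of this per-dimension decoder likelihood, assuming (as is common for images, though not stated above) 8-bit data rescaled to $[-1, 1]$, so that $F$ maps each real value to the nearest of 256 evenly spaced levels, with the outermost bins extended to $\pm\infty$:

```python
import math
import numpy as np

def gaussian_cdf(x, mu, sigma):
    """Gaussian CDF with mean mu and standard deviation sigma, evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discrete_log_likelihood(x0, mu, sigma, bin_width=2.0 / 255.0):
    """log p_theta(x_0 | x_1) for 8-bit data rescaled to [-1, 1] (assumed convention).

    Each dimension i contributes the Gaussian probability mass of the bin that F
    maps to (x_0)_i; the edge bins are extended to -inf and +inf.
    """
    total = 0.0
    for xi, mi in zip(np.ravel(x0), np.ravel(mu)):
        lo = -math.inf if xi <= -1.0 + 1e-8 else xi - bin_width / 2.0
        hi = math.inf if xi >= 1.0 - 1e-8 else xi + bin_width / 2.0
        p_lo = 0.0 if lo == -math.inf else gaussian_cdf(lo, mi, sigma)
        p_hi = 1.0 if hi == math.inf else gaussian_cdf(hi, mi, sigma)
        total += math.log(max(p_hi - p_lo, 1e-12))
    return total
```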

Training

The formulation of $\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0)$ indicates that instead of directly predicting the mean $\boldsymbol\mu_\theta(\mathbf{x}_t, t)$, it is better to predict the noise $\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)$, such that

$$L_{t-1} = C\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right\|^2.$$

Combining this with the reparameterization step, we have the training objective:

$$\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\, t\right)\right\|^2.$$

We thus have the training algorithm:

\begin{algorithm}
\caption{Training}
\begin{algorithmic}
\Repeat
    \State $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
    \State $t \sim \text{Uniform}(\{1, \ldots, T\})$
    \State $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
    \State Take gradient descent step on $\nabla_\theta \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}, t) \|^2$
\Until{converged}
\end{algorithmic}
\end{algorithm}
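
Under the same assumptions as the earlier sketches, one iteration of this algorithm can be written as a loss computation; `eps_theta(x_t, t)` is a placeholder for the noise-prediction network, and the gradient step on $\theta$ is left to whatever autodiff framework is used:

```python
def training_loss(eps_theta, x0_batch, rng=np.random.default_rng()):
    """Monte Carlo estimate of the simple training objective for one batch.

    Follows the training algorithm above: sample t uniformly, forward-diffuse
    x_0 to x_t in one step, and regress the network output onto the noise.
    """
    t = int(rng.integers(1, T + 1))              # t ~ Uniform({1, ..., T})
    x_t, eps = q_sample(x0_batch, t - 1, rng)    # x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)
```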

Sampling

The sampling algorithm also uses the reparameterization trick to sample $\mathbf{x}_{t-1}$ given $\mathbf{x}_t$:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z},$$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, and $\mathbf{z}$ is set to zero when $t = 1$ so that the final denoising step is deterministic.


\begin{algorithm}
\caption{Sampling}
\begin{algorithmic}
\State $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
\For{$t = T, \cdots, 1$}
    \State $\mathbf{z} \sim \begin{cases} 
    \mathcal{N}(\mathbf{0}, \mathbf{I}) & \text{if } t > 1 \\
    \mathbf{0} & \text{otherwise}
    \end{cases}$
    \State $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}} \boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$
\EndFor
\State \Return $\mathbf{x}_0$
\end{algorithmic}
\end{algorithm}
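
A corresponding NumPy sketch of the sampling loop, again with `eps_theta` as a placeholder for the trained network and $\sigma_t^2 = \tilde\beta_t$ as above:

```python
def sample(eps_theta, shape, rng=np.random.default_rng()):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise down to x_0."""
    x = rng.standard_normal(shape)                           # x_T ~ N(0, I)
    for t in range(T, 0, -1):                                # t = T, ..., 1
        a, a_bar = alphas[t - 1], alpha_bars[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        sigma = np.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar) * betas[t - 1])  # sqrt(tilde{beta}_t)
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = (x - (1.0 - a) / np.sqrt(1.0 - a_bar) * eps_theta(x, t)) / np.sqrt(a) + sigma * z
    return x
```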

Network Architecture

The network for estimating $\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)$ is typically a U-Net architecture, where parameters are shared across time steps and the time step $t$ is taken as an additional input.
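
One common way to feed $t$ into such a shared network is a sinusoidal time-step embedding that is added to (or concatenated with) intermediate features; this particular choice is an assumption here, following common practice rather than anything stated above:

```python
def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of the integer time step t (assumed conditioning scheme)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)   # geometric frequency ladder
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])     # vector of length `dim`
```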

Extensions and Applications

Conditional Generation

Guidance Methods

Latent Diffusion Models